GPT-3. Believe the Hype? An overview
GPT-3, or to give it its full title, the third-generation Generative Pre-trained Transformer, is a neural network machine learning model trained to generate any type of text. Natural language generation, the branch of natural language processing that focuses on producing natural-sounding human language, has long been a challenge for machines, which do not truly grasp the complexities and nuances of language. Anyone who has interacted with a chatbot online will appreciate this difficulty, as chatbots struggle to provide a realistic approximation of human conversation.
GPT-3 promises to be different. Developed by OpenAI, it requires only a small amount of input text to generate large volumes of relevant and sophisticated machine-generated text using pre-trained algorithms. The idea is that the model has already been trained on all the data it needs to carry out its task: 45 terabytes of unlabelled data gathered by crawling the internet (a publicly available dataset known as Common Crawl), along with other texts selected by OpenAI. During training, words or phrases are randomly removed from the text, and the model must learn to fill them in using only the surrounding words as context. It's a simple training task that results in a powerful and generalizable model.
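The fill-in training task described above can be sketched in a few lines. This is only a toy illustration of the idea, not OpenAI's actual data pipeline; the function name and the `[BLANK]` token are made up for the example.

```python
import random

def make_fill_in_examples(text, n_examples=3, seed=0):
    """Create (context, target) training pairs by removing a random word
    and asking the model to fill it in from the surrounding words.
    A toy sketch of the training task, not OpenAI's implementation."""
    rng = random.Random(seed)
    words = text.split()
    pairs = []
    for _ in range(n_examples):
        i = rng.randrange(len(words))           # pick a word to blank out
        context = words[:i] + ["[BLANK]"] + words[i + 1:]
        pairs.append((" ".join(context), words[i]))
    return pairs

for context, target in make_fill_in_examples("the cat sat on the mat"):
    print(f"{context!r} -> {target!r}")
```

Each pair teaches the model to predict the missing word from its context alone; at GPT-3's scale, this simple objective is what yields the general-purpose behaviour described below.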
Unlike most AI systems, which are designed for a single use case, OpenAI's API today provides a general-purpose 'text in, text out' interface, allowing users to try it on virtually any English-language task. GPT-3's deep learning neural network has 175 billion machine learning parameters, which appears to be the main reason it is so impressively "smart" and human-sounding. To put this into context, the previous record holder, Microsoft's Turing-NLG model, had 17 billion parameters. GPT-3 is currently the largest neural network ever produced, and as a result OpenAI claims it is the best model yet for producing text convincing enough to seem like a human could have written it.
When OpenAI introduced GPT-3 last year, it was met with much enthusiasm.
GPT-3 purports to do what no other model can do well: perform specific tasks without any special tuning. To adapt other language models to a specific task (translation, summarisation, spam detection, and so on), you have to gather a large training dataset, on the order of thousands or tens of thousands of examples, which can be almost impossible depending on the task, and then run an elaborate fine-tuning step. With GPT-3, you skip that fine-tuning step entirely. This is the heart of it. This is what gets people excited about GPT-3: custom language tasks without training data.
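The 'text in, text out' idea can be made concrete: the task itself is specified in the input text, so the same interface handles translation, summarisation, or anything else. The `complete` function below is a hypothetical stand-in for a call to the API, not a real client.

```python
def complete(prompt: str) -> str:
    """Hypothetical placeholder for a call to a large language model."""
    raise NotImplementedError("stand-in for the real API call")

# Two very different tasks, expressed purely as input text:
translation_prompt = (
    "Translate English to French:\n"
    "cheese =>"
)

summarisation_prompt = (
    "Summarise the following article in one sentence:\n"
    "<article text here>\n"
    "Summary:"
)

# Both would go through the identical interface -- no per-task
# fine-tuning step, just different prompts:
# complete(translation_prompt)
# complete(summarisation_prompt)
```

The point is that switching tasks means editing a string, not collecting a new labelled dataset.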
GPT-3 is built using the same model and architecture as GPT-2. To study how performance depends on model size, the researchers trained eight models ranging over three orders of magnitude, from 125 million parameters to 175 billion parameters, the largest being the one labelled GPT-3.
How it works
At its core, GPT-3 is a transformer model. Transformers were developed to solve the problem of sequence transduction, or neural machine translation: any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech transformation, question answering, text summarisation, and machine translation.
The Transformer performs only a small, constant number of steps, at each of which it applies a self-attention mechanism that directly models relationships between all words in a sentence, regardless of their respective positions. This architecture was first proposed in the seminal Google paper "Attention Is All You Need" in mid-2017.
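The self-attention step can be sketched in NumPy. This is a minimal single-head version, assuming random projection matrices for illustration; real transformers use many heads plus layer normalisation and feed-forward layers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention. Every position
    attends directly to every other position, regardless of distance."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # word-to-word affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                # weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.standard_normal((seq_len, d_model))           # one row per word
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one output vector per input position
```

Because every pair of positions is compared directly, no recurrence is needed, which is what makes the constant number of steps possible.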
In many machine learning applications, the amount of available labelled data is a barrier to producing a high-performing model. The researchers at OpenAI set out to demonstrate how this limitation can be overcome through a technique known as few-shot learning.
Few-shot learning refers to the practice of feeding a machine learning model a very small number of examples to guide its predictions, as opposed to standard fine-tuning techniques, which require a relatively large amount of training data for the pre-trained model to adapt accurately to the desired task. OpenAI showed in the GPT-3 paper that few-shot prompting ability improves with the number of language model parameters.
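In practice, a few-shot prompt is just the task description followed by k worked examples and the new query. A minimal sketch of building such a prompt (the helper name and `=>` formatting are illustrative choices, not an official format):

```python
def few_shot_prompt(task, examples, query):
    """Build a k-shot prompt: a task description, k worked examples,
    then the new query. k=0 gives a 'zero-shot' prompt."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")          # model continues from here
    return "\n".join(lines)

examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
prompt = few_shot_prompt("Translate English to French:", examples, "mint")
print(prompt)
```

The examples never update the model's weights; they are consumed purely as context at inference time, which is what distinguishes few-shot prompting from fine-tuning.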
In training their model, they were able to leverage the large datasets available for language modelling, namely Common Crawl, which contains almost a trillion words. This is sufficient to train GPT-3 without ever updating on the same sequence twice. The data did have quality issues, so they:
1) filtered it for quality by comparing documents to high-quality reference texts such as curated datasets like WebText and Wikipedia,
2) performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of the validation set as an accurate measure of overfitting, and
3) added known high-quality reference texts, including the WebText dataset and Wikipedia, to the training mix to augment Common Crawl and increase its diversity.
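Step 2, fuzzy deduplication, can be illustrated with word-shingle Jaccard similarity: two documents that share most of their n-grams are treated as near-duplicates. This is a simplified sketch of the general technique, not OpenAI's actual deduplication code; the threshold and shingle size are arbitrary choices for the example.

```python
def shingles(doc, n=5):
    """Word n-gram 'shingles' of a document."""
    words = doc.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0 (disjoint) to 1 (equal)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(docs, n=5, threshold=0.5):
    """Keep a document only if it is not too similar to any kept so far."""
    kept = []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, shingles(k, n)) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank tonight",
    "completely different text about training large language models at scale",
]
kept = dedup(docs, n=3)   # drops the near-duplicate second document
```

Because the match is fuzzy rather than exact, documents that differ by only a word or two still collapse to one copy.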
Issues with training
As with all difficult engineering efforts, the OpenAI team encountered a few issues along the way. A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorise vast amounts of content, is that downstream tasks can be contaminated: their test or development sets may have been inadvertently seen during pre-training.
While the team believed they had removed any such overlap between the datasets, a bug in the filtering caused some duplicates to be missed. Although the error was caught, it was not feasible to rerun the training because of its cost, one of the downsides of working with such large models.
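The contamination check itself amounts to looking for long exact n-gram matches between evaluation examples and the training corpus (the GPT-3 paper reportedly used 13-grams). A simplified sketch of that idea, with the function names made up for illustration:

```python
def ngrams(text, n=13):
    """All word n-grams of a text, as a set for fast overlap checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_example, training_corpus, n=13):
    """Flag a test example if any of its n-grams also appears
    verbatim in the training corpus."""
    return bool(ngrams(test_example, n) & ngrams(training_corpus, n))

# Tiny demo with n=3 so the texts can stay short:
corpus = "x a b c d y"
print(is_contaminated("a b c d", corpus, n=3))   # overlapping trigrams
print(is_contaminated("p q r s", corpus, n=3))   # no overlap
```

A long n (such as 13) makes accidental matches vanishingly rare, so any hit is strong evidence the test text really was seen during pre-training.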
They evaluated GPT-3 against a variety of tasks, including language modelling, translation, reading comprehension, learning and using novel words, and news article generation. The team at OpenAI found some limitations with GPT-3, which included:
Functional weaknesses in text synthesis. Although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. Funnily enough, it still struggles with basic questions such as "If I put cheese into the fridge, will it melt?".
Known structural and algorithmic limitations, which could account for some of these issues. The algorithm could be improved through training objectives such as denoising, or by exploring bi-directional architectures.