State of GPT - Andrej Karpathy [Summary]
Dive into Andrej Karpathy's insights on how GPT assistants are trained, why base models learn powerful representations, mode collapse in RLHF models, tool use, and key recommendations. #AI #GPT #OpenAI
Original Video Link: State of GPT | BRK216HFS
Summary
GPT Assistant training pipeline
Andrej Karpathy, AI researcher and founding member of OpenAI, discussed the training of large language models like GPT in a two-part presentation. In the first part, he described how GPT assistants are trained, highlighting a four-stage pipeline: pretraining, supervised finetuning, reward modeling, and reinforcement learning.
Most of the computational work happens during pretraining, which involves training on Internet-scale datasets with thousands of GPUs over a period of months. The process begins by collecting a large amount of data from diverse sources such as web scrapes, GitHub, Wikipedia, and more. This data is tokenized, i.e. turned into sequences of integers, which are the native representation GPTs operate on.
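To make the tokenization step concrete, here is a minimal sketch (not from the talk) using the tiktoken library; the specific encoding and the resulting token ids vary from model to model:

```python
# Tokenization: text in, a sequence of integers out.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # BPE encoding used by GPT-3.5/GPT-4-era models
tokens = enc.encode("Large language models predict the next token.")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # decodes back to the original text
```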
He elaborated on the pretraining phase with example hyperparameters from GPT-3 and Meta's LLaMA models. These models use vocabularies of tens of thousands of tokens, context lengths that today range from around 2,000 up to 100,000 tokens, and parameter counts in the tens to hundreds of billions.
The data batches formed during pretraining are fed into a transformer neural network, which is trained to predict the next token in each sequence. With iterative training, the transformer makes increasingly coherent and consistent predictions, generating more and more sophisticated text.
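As a rough illustration of the objective being optimized, here is a minimal PyTorch sketch of next-token prediction with a cross-entropy loss, assuming a model that maps token ids to logits; it is not code from the talk:

```python
# A minimal sketch of the pretraining objective, assuming a PyTorch model that
# maps token ids of shape (batch, time) to logits of shape (batch, time, vocab_size).
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, time + 1) integer ids drawn from the training batches
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift targets by one position
    logits = model(inputs)                            # (batch, time, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten batch and time dims
        targets.reshape(-1),                          # each position predicts its next token
    )
```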
After pretraining, the model undergoes a fine-tuning process. Karpathy notes that these base models learn powerful general representations that can be efficiently fine-tuned for a variety of downstream tasks. These tasks can include anything from sentiment classification to question-answering systems, leveraging the versatile multitasking ability of the transformer model.
The talk also traced how the understanding of prompting has evolved, showing that models can be guided to perform specific tasks effectively without any additional fine-tuning. Overall, the first part served as an in-depth exploration of how large language models are trained, setting the stage for the second half of the talk on how to use these models effectively.
Base Models Learn Powerful, General Representations
The GPT-4 model available over API is an assistant model, not a base model, with the best available base model currently being the LLaMA series from Meta. Base models are primarily designed for document completion and don't typically answer questions directly. However, they can be coaxed into providing more assistant-like responses by crafting prompts that frame the question as part of a document the model can naturally continue, as in the sketch below.
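As an illustration of that trick, a question can be wrapped in a prompt that reads like a transcript the base model will plausibly continue; the wording below is a made-up example, not one from the talk:

```python
# Coaxing an assistant-like reply out of a pure document-completion model by
# making the desired answer look like the natural continuation of a document.
prompt = """Below is a conversation between a helpful AI assistant and a human.

Human: What are the main stages in training a GPT assistant?
Assistant:"""

# completion = base_model.generate(prompt)  # hypothetical generate() call on a base model
```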
Supervised finetuning is one approach to creating assistant models: it involves gathering high-quality data in the form of prompts and ideal responses written by human labelers, and continuing training on it. The resulting models are trained to be helpful, truthful, and harmless. While they work to some extent, they can be further refined using reinforcement learning from human feedback (RLHF).
The RLHF process consists of creating multiple completions from a model, having human contractors rank these completions, and then training the model to align with these rankings. This produces a reward model that can score the quality of any arbitrary completion for any given prompt.
This reward model can then be used for reinforcement learning: completions sampled from the model are scored by the reward model, and that score is used to weight the language modeling objective. Over time, the model shifts toward generating responses that score highly according to the reward model.
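A common way to train such a reward model is a pairwise (Bradley-Terry style) comparison loss over human rankings. The sketch below illustrates the idea, assuming a reward_model that maps a (prompt, completion) pair to a scalar score; it does not claim to match OpenAI's exact setup:

```python
# Pairwise ranking loss for a reward model: push the score of the human-preferred
# completion above the score of the rejected one.
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # scalar score for the preferred completion
    r_rejected = reward_model(prompt, rejected)  # scalar score for the other completion
    # Maximize the margin between the preferred and the rejected completion.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```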
RLHF models like ChatGPT are typically preferred by users over base and SFT models. One trade-off, however, is that RLHF models produce more 'peaky', lower-variance outputs, losing some of the diversity (entropy) that base models retain. For that reason, Karpathy still prefers the base model for tasks that demand a high level of creativity or diversity, giving the example of generating names for new Pokémon. He also notes that GPT-4 is currently the best model according to an ELO-style leaderboard of assistant models maintained by a team at Berkeley.
Karpathy then looks at how these models approach problem-solving, using the example of comparing the populations of California and Alaska. He contrasts the human process of thinking, checking, and reaching for tools with an AI model's single pass over a sequence of tokens. He points out the models' limitations: they cannot reflect, self-correct, or perform sanity checks on their own. At the same time, they have real advantages, such as a vast store of fact-based knowledge and a large, perfect working memory in the form of the context window.
Prompt Engineering
Karpathy asserts that 'prompting' can bridge this cognitive gap between humans and AI models. He advocates spreading the reasoning process across more tokens and demonstrates a few methods that encourage better reasoning, including chain-of-thought prompting ('let's think step by step') and self-consistency (sampling several reasoning paths and taking a majority vote). He also notes that a transformer cannot recover on its own once it commits to a bad reasoning path, and discusses techniques that help the model rectify these situations.
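A minimal sketch of self-consistency, assuming hypothetical generate() and extract_answer() helpers in place of a real model API and answer parser:

```python
# Self-consistency: sample several chain-of-thought completions at a nonzero
# temperature and take a majority vote over the final answers.
from collections import Counter

def self_consistent_answer(question, n_samples=5):
    prompt = f"{question}\nLet's think step by step."
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=0.8)  # hypothetical model call
        answers.append(extract_answer(completion))      # hypothetical answer parser
    return Counter(answers).most_common(1)[0][0]        # majority-vote answer
```

Because each sample can follow a different reasoning path, the vote tends to wash out the occasional bad path that the model cannot recover from on its own.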
The talk then turns to the usage and engineering of prompts to get the most out of models like GPT-4. Because these models do not spontaneously reassess their own output, specific prompts that ask the model to self-evaluate and improve can compensate. Drawing a parallel to System 1 (fast, automatic) and System 2 (slower, deliberate) thinking in human cognition, Karpathy explores how this dichotomy can be recreated around language models through prompt engineering and additional Python 'glue' code. He cites the recent 'Tree of Thought' technique, which maintains multiple candidate completions for a given prompt and scores them along the way, and ReAct-style prompting, in which the model answers queries through an interleaved sequence of thought, action, and observation steps that let it use external tools.
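The thought-action-observation pattern amounts to a small control loop in Python 'glue' code around the model. The sketch below is illustrative only, with generate(), parse_action(), and the tools dictionary standing in for real components:

```python
# A thought-action-observation (ReAct-style) loop: the model reasons, requests a
# tool, sees the result in its context, and continues until it decides to finish.
def react_loop(question, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(transcript + "Thought:")       # hypothetical model call
        transcript += "Thought:" + step + "\n"
        action, arg = parse_action(step)               # hypothetical parser, e.g. ("calculator", "39000000 / 740000")
        if action == "finish":
            return arg                                 # the model decided it has the answer
        observation = tools[action](arg)               # run the requested tool
        transcript += f"Observation: {observation}\n"  # feed the result back into the context
    return None
```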
However, Karpathy notes that language models are trained to imitate their training data, not to succeed at a task. Specific prompts are therefore needed to coax high-quality responses out of the model, for example by instructing it to act as an expert in a given field. He also advises leaning on the computational strengths of language models and treating them as tools that aid problem-solving.
Caution is advised when setting the "IQ" or competence level of the model: asking for too much may push the model into sci-fi or fantastical responses, so the goal is to find the right balance for the model's "IQ".
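An illustrative prompt of this kind might look as follows; the wording is an assumption for illustration, not a prompt shown in the talk:

```python
# A system prompt that conditions the model toward careful, expert-level answers
# without overshooting into implausible claims of ability.
messages = [
    {"role": "system",
     "content": "You are a leading expert on this topic. Work through the problem "
                "carefully and make sure the answer is correct and well supported."},
    {"role": "user",
     "content": "Explain the trade-offs between supervised finetuning and RLHF."},
]
```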
Returning to these advanced techniques, Karpathy highlights 'Reflection', where the model is asked to reassess whether it has actually met the given assignment, which helps catch errors it would otherwise never notice, alongside approaches like Tree of Thought that keep and score multiple candidate completions. He also returns to the distinction between imitation and success: by default, LLMs imitate the language patterns in their training data rather than striving for a successful answer, but prompts can be engineered to push them toward high-quality, successful responses.
Tool use / Plugins
Tool use and plugins are another important topic. LLMs can be improved by giving them access to tools like calculators or code interpreters, which handle tasks the model is not inherently good at, such as arithmetic on large numbers. Karpathy also mentions retrieval-augmented models, which embed chunks of external documents, retrieve the chunks most relevant to a query, and place them in the context window to extend the transformer's 'working memory'.
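A minimal sketch of the retrieval step, assuming a hypothetical embed() call that returns a vector for a piece of text:

```python
# Retrieval augmentation: embed document chunks, find the ones most similar to the
# query by cosine similarity, and paste them into the prompt as context.
import numpy as np

def retrieve(query, chunks, top_k=3):
    chunk_vecs = np.array([embed(c) for c in chunks])   # hypothetical embed() -> 1-D vector
    query_vec = np.array(embed(query))
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )                                                    # cosine similarity per chunk
    best = np.argsort(sims)[-top_k:][::-1]               # indices of the closest chunks
    return [chunks[i] for i in best]

def build_prompt(query, chunks):
    context = "\n\n".join(retrieve(query, chunks))
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {query}"
```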
The discussion also covers 'constrained prompting', a technique for enforcing specific templates or formats (such as JSON) in the output of LLMs; Microsoft's 'Guidance' library is cited as an example.
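The idea can be sketched as a template whose literal characters are emitted verbatim while the model only fills the marked holes. The syntax below is a simplified stand-in, not Guidance's actual templating language, so consult that library's documentation for real usage:

```python
# Constrained prompting: the template text is fixed, the model only generates at the
# marked holes, so the output is guaranteed to be structurally valid JSON.
template = """{
  "name": "[GEN name]",
  "age": [GEN age],
  "occupation": "[GEN occupation]"
}"""

# A hypothetical fill() helper would emit the literal characters verbatim and only
# ask the model for tokens inside each [GEN ...] hole, stopping at the next literal
# character (a quote or comma).
# profile = fill(template, model)
```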
Finally, Karpathy talks about fine-tuning, the more technically demanding process of actually changing the weights of the model. He highlights techniques like LoRA (low-rank adaptation), which train only small, low-rank pieces of the model while keeping the rest frozen, making fine-tuning far cheaper. While fine-tuning offers potential benefits, he warns that it is technically involved and can slow down the iteration loop.
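The core idea of LoRA can be sketched in a few lines: the pretrained weight matrix stays frozen and only a low-rank update is trained. The snippet below follows the formulation in the LoRA paper and is illustrative rather than a training script:

```python
# LoRA: keep the pretrained weight W frozen and learn only a low-rank update B @ A.
import numpy as np

d, k, r = 4096, 4096, 8            # original dimensions and a small rank r
W = np.random.randn(d, k)          # pretrained weight, kept frozen
A = np.random.randn(r, k) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # initialized to zero so training starts exactly at W
alpha = 16                         # scaling hyperparameter

def forward(x):
    # Effective weight is W + (alpha / r) * B @ A, but only A and B receive gradients.
    return x @ (W + (alpha / r) * (B @ A)).T
```

Because only A and B are trained, the number of trainable parameters drops from d*k to r*(d+k), which is what makes the approach so much cheaper than full fine-tuning.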
Recommendations
Karpathy closes with recommendations for using large language models like GPT-4. To achieve top performance, he suggests using GPT-4, given its superior capabilities, and crafting detailed prompts filled with task-specific context, relevant information, and instructions.
Karpathy emphasizes the importance of understanding the 'psychology' of LLMs: they lack human qualities such as an inner monologue and innate cleverness, and prompts should compensate for that difference. He also suggests exploring prompt engineering techniques, i.e., the various ways of structuring and presenting information that lead the model to generate the desired responses.
Karpathy encourages experimenting with few-shot examples, where the model is shown samples of the desired output, and with tools or plugins that offload tasks that are hard for LLMs. He also recommends considering prompt chains and reflection steps in responses, and, once prompt engineering has been pushed as far as it will go, looking into fine-tuning the model for the specific application.
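A few-shot prompt simply shows the model the desired input-output pattern before the real query; the reviews below are made up for illustration:

```python
# Few-shot prompting: a handful of worked examples establish the format and task
# before the final, unanswered case.
prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: positive

Review: "It stopped working after a week and support never replied."
Sentiment: negative

Review: "Setup was painless and it just works."
Sentiment:"""
```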
However, he cautions that reinforcement learning from human feedback (RLHF), although potentially more effective than supervised fine-tuning (SFT), is a challenging and intricate process. He also mentions exploring lower-capacity models or shorter prompts to optimize costs.
Regarding suitable use cases, Karpathy recommends limiting LLMs to low-stakes applications for now, given limitations like bias, fabrication, reasoning errors, and susceptibility to attacks such as prompt injection and jailbreaks. He stresses combining LLMs with human oversight and using them as a source of inspiration or a 'co-pilot'. The talk concludes with an inspiring message generated by GPT-4 for the audience of Microsoft Build 2023, showcasing the model's capabilities.