Latent Space Podcast 4/6/23 [Summary] - AI Fundamentals: Benchmarks 101
Explore AI Fundamentals in 'Benchmarks 101.' Dive into the history of LLM Benchmarking and uncover the intricacies that shape machine learning's progress.
Link to Original: https://www.latent.space/p/benchmarks-101#details
Summary
Hosts: Alessio from Decibel Partners and swyx, writer and editor of Latent Space.
Key Points:
Benchmark Fun with Emojis & Physics:
Alessio quizzes co-host swyx on various benchmarks including emoji-based movie questions and physics-based ones, revealing the humorous human errors and underscoring the difference between human cognition and AI processing.
Importance of AI Benchmarks:
GPT-4's recent launch prompts a discussion on AI benchmarks. Every AI model release is usually accompanied by claims of improved benchmark performance.
The progression of benchmarks from the 1990s to today shows a marked increase in difficulty.
Benchmarks not only assess the AI's capabilities but also influence the direction of research.
Benchmark performance is a crucial marketing tool, and some model releases omit results on certain benchmarks, which creates problems when trying to reproduce their claims.
Benchmark Metrics Introduced:
The primary benchmark metrics are Accuracy, Precision, and Recall.
Precision and Recall are often at odds: increasing one tends to decrease the other.
The F1 score combines Precision and Recall and is widely used. Stanford's HELM project also introduced metrics such as calibration, robustness, fairness, bias, and toxicity.
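To make these definitions concrete, here is a minimal sketch of computing accuracy, precision, recall, and F1 from scratch on a toy set of hypothetical binary predictions:

```python
# Minimal sketch: accuracy, precision, recall, and F1 for a toy
# binary classification task (the labels below are made up).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of everything flagged positive, how much was right
recall = tp / (tp + fn)      # of all true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```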
Benchmarking Methodologies:
Zero Shot: AI is tested without any examples to see its generalizing ability.
Few Shot: A small number of examples is given (e.g., five examples, denoted K=5) to guide the AI.
Fine Tune: The AI is provided with ample data specific to a task and then tested. This method requires more data and compute time.
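As a rough illustration of how zero-shot and few-shot prompts differ, here is a sketch of prompt assembly for a made-up sentiment task; the examples and format are hypothetical, not taken from any particular benchmark:

```python
# Sketch of zero-shot vs. few-shot (K=2) prompt construction for a
# made-up sentiment task.
TASK = "Classify the sentiment of the review as Positive or Negative."

few_shot_examples = [
    ("The plot was gripping from start to finish.", "Positive"),
    ("I walked out halfway through.", "Negative"),
]

def build_prompt(review, k=0):
    """Return a zero-shot prompt when k=0, otherwise prepend k worked examples."""
    lines = [TASK]
    for text, label in few_shot_examples[:k]:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {review}\nSentiment:")
    return "\n\n".join(lines)

print(build_prompt("A forgettable sequel.", k=0))  # zero-shot
print(build_prompt("A forgettable sequel.", k=2))  # few-shot, K=2
```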
Historical Perspective on Benchmarking:
Tracing the history of benchmarking leads back to studies as early as 1985.
1985-1989: WordNet and Entailment
WordNet is a lexical database of English created at Princeton University by George Miller and Christiane Fellbaum. Miller was also known for "The Magical Number Seven, Plus or Minus Two," an observation about human short-term memory capacity.
Created before the web era, it manually organized 155,000 words into 175,000 synsets (groupings) to show relationships between words. Notable relationships include hypernyms, holonyms, and entailments: for instance, "snore" entails "sleep," since one can't snore without sleeping (illustrated in the code sketch below).
The database turned out to be instrumental in understanding semantic similarity, sentiment analysis, and machine translation.
Mention of the Penn Treebank in 1989, which had 4.5 million words of text.
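WordNet is still easy to explore today through NLTK; a small sketch (assuming nltk is installed and the WordNet corpus is downloaded) that reproduces the snore/sleep entailment along with a couple of other relations:

```python
# Exploring WordNet relations with NLTK (assumes `nltk` is installed
# and the WordNet corpus has been downloaded).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

snore = wn.synset("snore.v.01")
print(snore.entailments())      # e.g. [Synset('sleep.v.01')]: snoring entails sleeping

dog = wn.synset("dog.n.01")
print(dog.hypernyms())          # more general concepts, e.g. canine, domestic animal
print(dog.member_holonyms())    # groups a dog is a member of, e.g. pack
```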
1998-2004: Enron Emails and MNIST
The Enron dataset, released after the company's collapse in 2004, consists of 600,000 emails from 150 senior Enron employees. It is useful for email classification, summarization, entity recognition, and language modeling.
MNIST is a dataset of 60,000 training images (plus 10,000 test images) of handwritten digits, foundational in computer vision and easy to train on with modern machine learning tools.
It's noted that domain-specific datasets like Enron can introduce biases; MNIST is similarly narrow, covering only handwritten digits.
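MNIST is small enough to train on with off-the-shelf tools; a minimal sketch using scikit-learn, which fetches the digits from OpenML (a simple linear baseline, not a state-of-the-art model):

```python
# Minimal MNIST sketch with scikit-learn: fetch the digits, train a
# simple classifier, and report held-out accuracy.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel values to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=60_000, test_size=10_000, random_state=0
)

clf = LogisticRegression(max_iter=200)  # plain linear baseline
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```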
2009-2014: ImageNet, CIFAR, and the AlexNet Moment for Deep Learning
ImageNet, introduced by Fei-Fei Li, marked a significant turn in deep learning. She collaborated with Christiane Fellbaum of WordNet to develop it, building its categories on WordNet's synsets and using Amazon Mechanical Turk to label the images at scale.
A pivotal point came in 2012, when a deep learning model called AlexNet significantly outperformed all other entries on ImageNet with a roughly 15% (top-5) error rate.
CIFAR-10 and CIFAR-100 are image datasets introduced in 2009 and 2014 respectively. CIFAR-10 has 60,000 32x32 color images across 10 categories, while CIFAR-100 increased the classes to 100.
The trend observed is that datasets get solved and outgrown, leading to the creation of newer, more challenging datasets.
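CIFAR-10 is just as easy to pull down with torchvision; a short sketch of loading and inspecting it (assumes torch and torchvision are installed):

```python
# Loading CIFAR-10 with torchvision (downloads the dataset on first run).
import torchvision
import torchvision.transforms as transforms

transform = transforms.ToTensor()  # 32x32 RGB images -> tensors in [0, 1]
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

print(len(train_set), len(test_set))          # 50000 10000
print(train_set.classes)                      # ['airplane', 'automobile', ..., 'truck']
image, label = train_set[0]
print(image.shape, train_set.classes[label])  # torch.Size([3, 32, 32]) and its class name
```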
Development and Evolution of Language Model Benchmarks
2018-19: GLUE and SuperGLUE
GLUE (General Language Understanding Evaluation) was an influential benchmark introduced in 2018, focusing on tasks such as single-sentence tasks, similarity & paraphrase tasks, and inference tasks.
Example: The Stanford Sentiment Treebank (SST) involves rating the sentiment of movie reviews.
Challenge: Predicting whether a pair of sentences is semantically equivalent.
Note: Imbalanced datasets are handled by reporting both accuracy and F1 scores.
Inference Example: Using the Stanford Question Answering Dataset (SQuAD), determining whether a given paragraph contains the answer to a question.
SuperGLUE was introduced in 2019 as an enhancement to GLUE. It moved beyond single-sentence evaluations and introduced multi-sentence/context-driven evaluations.
Challenge: Questions may require understanding beyond the most recent context, for instance determining whether a drink mentioned is a Pepsi product when the passage has only said it's owned by the Coca-Cola company.
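Both GLUE and SuperGLUE are available through the Hugging Face datasets library, which makes it easy to inspect individual tasks; a small sketch (dataset and config names as published on the Hub):

```python
# Peeking at GLUE / SuperGLUE tasks via the Hugging Face `datasets` library.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")          # Stanford Sentiment Treebank (single-sentence task)
print(sst2["train"][0])                      # {'sentence': ..., 'label': 0 or 1, 'idx': ...}

boolq = load_dataset("super_glue", "boolq")  # yes/no questions grounded in a passage
print(boolq["train"][0].keys())              # question, passage, idx, label
```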
2018-2019: SWAG and HellaSwag - Common Sense Inference
SWAG (Situations with Adversarial Generations) was introduced in 2018, emphasizing common sense inference. It used multiple-choice questions requiring models to predict the next likely action in a given scenario.
Example: If a woman takes a seat at a piano, what does she most likely do next?
HellaSwag expanded on SWAG in 2019 with more questions and real-world datasets, pushing the models further.
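One common way to evaluate SWAG-style multiple choice is to score each candidate ending with a language model and pick the lowest-loss one. Below is a rough sketch using GPT-2 via Hugging Face transformers; it is an illustrative approximation rather than either benchmark's official evaluation harness, and the context and endings are invented:

```python
# Rough sketch: scoring multiple-choice endings with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "A woman sits down at a piano."
endings = [
    "She begins to play a piece of music.",
    "She jumps into a swimming pool.",
    "She starts frying an egg on the keys.",
]

def ending_loss(context, ending):
    """Average per-token loss of the full sequence; lower means more likely."""
    ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

scores = [ending_loss(context, e) for e in endings]
print("predicted ending:", endings[scores.index(min(scores))])
```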
2021 - MMLU: Human-level Professional Knowledge
MMLU (Massive Multitask Language Understanding) emerged in 2021 as one of the most comprehensive benchmarks.
Coverage: 57 tasks spanning elementary math, US history, computer science, law, GRE practice questions, US medical exams, and undergraduate courses from Oxford.
Complexity: Questions range from elementary to professional levels, even encompassing questions about thyroid cancer diagnosis and microeconomics.
Note: Benchmarks have been evolving rapidly, reflecting the advancements in language model capabilities. The focus has been on pushing models towards more human-like understanding and problem-solving capabilities.
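MMLU is typically evaluated few-shot by formatting each question as lettered multiple choice and reading off the model's predicted letter; here is a sketch of that prompt format, with invented sample questions rather than real benchmark items:

```python
# Sketch of the multiple-choice prompt format commonly used for MMLU-style
# evaluation. The sample questions below are invented for illustration.
def format_question(question, choices, answer=None):
    letters = "ABCD"
    block = question + "\n"
    block += "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    block += "\nAnswer:" + (f" {answer}" if answer else "")
    return block

few_shot = [
    format_question("What is the derivative of x^2?",
                    ["2x", "x", "x^2", "2"], answer="A"),
]
test_q = format_question("Which organ secretes insulin?",
                         ["Liver", "Pancreas", "Kidney", "Spleen"])

prompt = "The following are multiple choice questions (with answers).\n\n"
prompt += "\n\n".join(few_shot + [test_q])
print(prompt)  # the model's next token is read off as its answer letter
```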
In 2021, a significant benchmark named HumanEval was introduced for code generation. It gained attention because it was released in tandem with OpenAI's Codex, the model that powered GitHub Copilot. The benchmark highlighted the progress of language models in coding: GPT-3.5 scored 48% on it, while GPT-4 achieved 67%, suggesting that automation in coding might be closer than anticipated. However, a gap remains, as models still struggle with newer frameworks and technologies.
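HumanEval is scored with pass@k: generate n candidate solutions per problem, count how many pass the unit tests, and estimate the probability that at least one of k samples would pass. A sketch of the standard unbiased estimator:

```python
# Unbiased pass@k estimator used for HumanEval-style code benchmarks:
# given n generated samples of which c pass the unit tests, estimate the
# probability that at least one of k random samples would pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 30 of which pass the tests
print(pass_at_k(n=200, c=30, k=1))   # 0.15
print(pass_at_k(n=200, c=30, k=10))  # much higher with 10 tries
```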
In 2020, XTREME introduced a multilingual benchmark covering tasks that range from part-of-speech tagging to cross-lingual question answering and retrieval, across roughly 40 languages. XTREME was praised for its effort to cover less common languages and dialects, offering a broader perspective on linguistic abilities and cultural insights.
2022 saw the introduction of BIG-Bench, the largest benchmark to date, with 204 tasks contributed by numerous institutions. It encompasses diverse fields such as linguistics, software development, and biology, and some tasks were designed to push models beyond memorizing internet data, requiring deeper reasoning and understanding. Interestingly, despite its significance, BIG-Bench results were absent from GPT-4's report, sparking speculation.
Finally, a test anomaly was discussed concerning GPT-4's results on AMC10 and AMC12, American math tests for 10th and 12th graders respectively. GPT-4 scored lower on the supposedly easier AMC10 than on AMC12, a puzzling outcome that indicates possible inconsistencies in model evaluations.
Data Contamination:
Language models scrape the majority of the internet for training data.
Over time, previously published test data becomes part of the new training corpus, encouraging memorization over reasoning.
This mimics overfitting in traditional machine learning.
Example: GPT-4 memorized Codeforces problems from before 2021 but failed on problems from 2022 onwards.
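A crude way to screen for this kind of contamination is to look for long n-gram overlaps between benchmark items and the training corpus; below is a toy sketch (real decontamination pipelines are more elaborate, and the 13-gram threshold is just one common choice):

```python
# Toy contamination check: flag benchmark examples whose long n-grams
# also appear in the training corpus. Real decontamination pipelines add
# normalization, hashing, and fuzzy matching.
def ngrams(text, n=13):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example, training_corpus_ngrams, n=13):
    return bool(ngrams(test_example, n) & training_corpus_ngrams)

# Hypothetical usage:
training_text = "..."  # a shard of the training corpus
corpus_ngrams = ngrams(training_text)
print(is_contaminated("some benchmark question text ...", corpus_ngrams))
```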
Benchmarking Issues:
Bias: The data has inherent biases, e.g., how certain races or genders are labeled differently.
Data Quality: Some datasets, like the Iris dataset, contain labeling issues which can lead to models that are inaccurate or even biologically impossible.
Task Specificity: Overfitting to specific tasks without real-world applicability.
Reproducibility: Differences in data pre-processing and post-processing can lead to inconsistent results.
Resource Requirements: Larger models are more expensive to run, and not all are openly accessible.
Confidence Calibration: Models like GPT-4 sometimes exhibit overconfidence in their answers, leading to hallucination.
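Calibration is usually quantified by comparing a model's stated confidence with its actual accuracy, for example via expected calibration error (ECE); a minimal sketch with made-up confidences:

```python
# Minimal sketch of expected calibration error (ECE): bucket predictions
# by confidence and compare average confidence with accuracy per bucket.
# The confidences and correctness flags below are made up.
def expected_calibration_error(confidences, correct, n_bins=10):
    ece, total = 0.0, len(confidences)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not bucket:
            continue
        avg_conf = sum(confidences[i] for i in bucket) / len(bucket)
        accuracy = sum(correct[i] for i in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

confidences = [0.95, 0.9, 0.85, 0.7, 0.6, 0.55]
correct =     [1,    0,   1,    1,   0,   1]
print(expected_calibration_error(confidences, correct))  # 0 = perfectly calibrated
```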
Production Benchmarking:
Emphasizes latency, inference cost, and throughput.
Different use cases will prioritize different benchmarks.
Models should match the needs of the task: some use cases need quick responses, while others can tolerate waiting for a more accurate answer.
The software development lifecycle suggests starting with large models and refining them for specific needs.
There is an interest in AI agents that can utilize large models to save users time, even if they take longer to produce results.
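The production-side quantities are straightforward to measure directly; here is a sketch that times a hypothetical generate call to report latency and rough throughput (the generate function is a placeholder, not a specific library's API):

```python
# Sketch of measuring latency and throughput for an LLM endpoint.
# `generate` stands in for whatever client call your stack exposes;
# it is a hypothetical placeholder, not a specific library's API.
import time

def benchmark(generate, prompts):
    latencies, tokens_out = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        output = generate(prompt)                 # returns generated text
        latencies.append(time.perf_counter() - t0)
        tokens_out += len(output.split())         # rough proxy for token count
    wall = time.perf_counter() - start
    return {
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
        "throughput_tok_per_s": tokens_out / wall,
    }

# Usage: print(benchmark(my_generate_fn, ["prompt 1", "prompt 2"]))
```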
Conclusion:
The podcast episode explored the evolution and intricacies of benchmarking in AI and language models.
The hosts are seeking feedback and ideas for future episodes on foundational topics in the field.