Latent Space Podcast 5/3/23 [Summary] - Training a SOTA Code LLM in 1 week and Quantifying the Vibes — with Reza Shabani of Replit
Ep. 10 with Reza Shabani: Dive deep into the rapid training of a state-of-the-art Code LLM, explore Replit Ghostwriter's future, and journey from Finance to AI. Discover the transition from Kaplan to Chinchilla and more!
Original Link: Training a SOTA Code LLM in 1 week and Quantifying the Vibes — with Reza Shabani of Replit
Summary
From Quantitative Trading to AI Leadership: Reza Shabani’s Journey and Predictions
Alessio Fanelli, partner and CTO in residence at Decibel Partners, and co-host swyx, a writer and editor of the Latent Space podcast, invite Reza Shabani, the Head of AI at Replit, for a chat. Reza details his surprising background, beginning with a PhD in economics from Berkeley, moving on to startup founding, followed by a stint in systematic equity trading at BlackRock and Wellington. A common assumption is that Reza doesn't know how to code given his econ background, but he clarifies that coding and data analysis were indeed part of his wheelhouse.
The conversation takes a deep dive into quantitative finance and data engineering. Reza describes his grad school work extracting and analyzing data from financial news channels to gauge the market's response to specific companies. He touches on his time at BlackRock, where he applied emerging technologies like NLP and machine learning to trading. They further discuss how a company's early adoption of emerging technologies can signal its potential success in the stock market, citing Walmart's early focus on mobile technology versus Sears' neglect of it. The conversation also covers the challenge of signals being drowned out by noise in finance. Toward the end, Reza raises an intriguing question about AI's potential to excel at quantitative finance.
From Data Foundations to Cutting-Edge AI: Reza Shabani's Work at Replit
Reza Shabani, during his tenure at Replit, has played an instrumental role in transforming the company's data infrastructure. When he first came on board about a year and a half ago, the company was grappling with scalability issues. The primary challenge was the inability to query vast amounts of data effectively. For instance, a seemingly simple question, such as identifying the most forked repository, could not be answered due to system limitations.
Shabani's initial efforts centered around building and modernizing Replit's data infrastructure. By streamlining processes, they were able to extract data insights within minutes rather than the earlier timelines of days, weeks, or even months. This robust foundation was pivotal for the next steps - venturing into artificial intelligence and model training, particularly using Replit's data.
As time progressed, Replit expanded its AI and data team, working on a range of AI-driven features. Notably, the team developed Ghostwriter, a suite of tools for tasks like code explanation, code generation, code transformation, and in-context chat within the IDE. Ghostwriter was initially built on open-source models, like Salesforce's CodeGen, optimized for Replit's user base.
Alessio Fanelli pointed out the evolving nature of Shabani's role - transitioning from analytical work focused on data insights to production-oriented work built around large language models (LLMs). Shabani highlighted a noticeable trend - the shift from traditional machine learning approaches to techniques based on natural language processing. While the hype around language models has overshadowed other areas of ML, Shabani emphasized the continuing value of broader ML expertise.
Adding to the discussion, swyx underscored the pivotal moment most startups experience as they mature: the realization of the need for a robust data team. This is especially pertinent as companies grow, and data-driven decisions become crucial. Interestingly, many finance professionals, like Shabani, are well-equipped for this transition given their knack for building reliable and scalable systems in fast-paced environments.
Ending the conversation on a high note, Shabani teased the imminent release of Replit's first open-source code model, signifying another milestone in their AI journey.
Evaluating Code Generative Models
In a lively discussion between Alessio Fanelli, Reza Shabani, and swyx, the trio delves into the intricacies of benchmarking and evaluating code-generating AI models. They use two primary benchmarks: HumanEval, where a model is given a function definition and then tested on its completion of that function, and the "Amjad eval", an informal vibe test named after Replit CEO Amjad Masad and his knack for quickly gauging a model's performance.
Interestingly, models might ace HumanEval but flunk the "Amjad eval". This highlights the disparity between quantitative benchmarks and qualitative user experience. The conversation illustrates that while some models excel at traditional tasks, they may perform poorly on nuanced, context-heavy challenges or even straightforward instructions. Conversely, certain lesser-known models may outperform their high-end counterparts in specific scenarios.
The "vibe test", as elaborated, doesn’t solely rely on the correctness of the model’s output but also factors in the latency, productivity enhancements, and user experience. The discussion closes with an acknowledgment of the challenges in benchmarking, stressing the importance of holistic model evaluation beyond just performance metrics.
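The HumanEval setup described above can be illustrated with a toy task. This is a hypothetical example in the benchmark's style, not an actual HumanEval item: the model sees the signature and docstring, produces the body, and the completion is scored by running unit tests against it.

```python
# Illustrative HumanEval-style task (hypothetical, not a real benchmark item).
# The model receives the signature and docstring; the body below stands in
# for a model-generated completion.

def running_max(nums):
    """Return a list where element i is the maximum of nums[:i+1].

    >>> running_max([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    # --- model-generated completion below ---
    out, best = [], float("-inf")
    for n in nums:
        best = max(best, n)
        out.append(best)
    return out

# The benchmark scores the completion by executing hidden unit tests:
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_max([]) == []
```

A completion passes only if every test succeeds, which is exactly why a model can "ace" this kind of functional check while still failing the more qualitative vibe test.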
Exploring the Nuances of AI Model Vibes and Advanced Coding Tools
In a thought-provoking discussion, Alessio Fanelli and Reza Shabani delve deep into the challenges and nuances of training AI models for optimal "vibes." They highlight the intricate balance between training data and resulting model outputs. Shabani emphasizes the inherent difficulty of refining certain vibe elements in models like Bard. The optimal strategy hinges on feeding the model the right type of data and hoping the generated output aligns with the desired outcome. As he puts it, you can't merely add vibes after the fact; they're either present in a model or they aren't.
The conversation pivots to the evolution of coding-assistance tools. A space initially dominated by GitHub Copilot has since filled with a myriad of new tools, raising the question of differentiation. Ghostwriter promises not just to complete code but to offer more holistic support across the software development process. The vision for Ghostwriter is to generate software scaffolding, assist in backend database creation, and even automate tasks like setting up new service accounts. The true ambition is to help generate entire software applications, not just isolated sections of code.
Introducing the concept of the Ghostwriter Autonomous Agent, Shabani envisions an autonomous system that can drive the IDE (Integrated Development Environment). Such an agent can predict sequences of actions, extending beyond just predicting the next line of code. The goal is to create software, fully incorporating the steps of cloning repos, editing, adding files, and deploying.
As the talk concludes, attention is directed towards the release of Replit-code-v1-3b, a 2.7 billion parameter model trained on a massive 525 billion tokens of code. The uniqueness of this model lies in its tailor-made vocabulary specifically for coding, leading to faster inference and more relevant content generation.
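The payoff of a code-specific vocabulary can be seen with a toy sketch: if frequent code idioms are single tokens, the same snippet tokenizes into fewer pieces, and an autoregressive model needs fewer decoding steps to emit it. This is a simplified greedy tokenizer for illustration only, not Replit's actual trained BPE vocabulary; the vocabularies and token counts here are invented for the example.

```python
# Toy illustration: a vocabulary containing common code idioms as single
# tokens yields shorter sequences than a generic split, so generation
# needs fewer steps. (Hypothetical sketch, not Replit's real tokenizer.)

def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

code = "def main():"
generic_vocab = {"def", "main", "(", ")", ":", " "}
code_vocab = generic_vocab | {"def ", "():"}  # idioms merged into one token

print(tokenize(code, generic_vocab))  # 6 tokens
print(tokenize(code, code_vocab))     # 3 tokens: 'def ', 'main', '():'
```

Since each generated token costs one forward pass at inference time, halving the token count of typical code roughly halves generation latency for the same output.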
The discussion provides an exciting glimpse into the advancements of AI in coding, painting a future where AI does not just assist but actively participates in the software development process.
The Adventurous Journey to the YOLO Training Run
During the discussion, Reza Shabani recalled the events leading up to a major developer day. The team at Replit had been working tirelessly for months on infrastructure for training their own models, which required building an extensive data infrastructure to handle vast amounts of data and content.
By the end of the previous year, they had successfully built a system capable of parsing vast datasets in record time. As they approached the developer day, they had built pipelines, started training models, and were deploying them into production. However, they were somewhat limited in their approach, focusing on single-language models and not fully leveraging the potential of their data.
A pivotal moment came when Amjad proposed simply "YOLOing" the process: instead of meticulously planning, he suggested training on all the data they had. This was a risky move, given the cost, time, and potential for error in such a massive data-processing task. Driven by that adventurous spirit, they went ahead and even resampled their data across multiple epochs, a practice generally viewed as risky because it can lead to overfitting. Still, the results were surprisingly good.
An ongoing debate emerged regarding the most efficient way to train the models, reflecting on the "scaling laws" of model training. They debated whether they should strictly adhere to accepted scaling laws like Chinchilla's or venture into the unknown. The overarching sentiment was that perhaps the community has been undertraining models and that there's room for pushing boundaries.
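The Chinchilla result is commonly summarized as a rule of thumb of roughly 20 training tokens per model parameter for compute-optimal training. A quick back-of-the-envelope check (the 20:1 ratio is the standard approximation, not an exact law) shows how far past that point Replit's run went:

```python
# Chinchilla rule of thumb: compute-optimal training uses roughly
# 20 tokens per parameter (a common approximation of Hoffmann et al., 2022).

def chinchilla_optimal_tokens(params: float, ratio: float = 20.0) -> float:
    """Approximate compute-optimal token count for a given parameter count."""
    return params * ratio

params = 2.7e9          # replit-code-v1-3b parameter count
tokens_trained = 525e9  # tokens actually used, including repeated epochs

optimal = chinchilla_optimal_tokens(params)
print(f"Chinchilla-optimal: {optimal / 1e9:.0f}B tokens")        # 54B
print(f"Actually trained:   {tokens_trained / 1e9:.0f}B tokens")  # 525B
print(f"Ratio beyond optimal: {tokens_trained / optimal:.1f}x")
```

Training nearly 10x past the Chinchilla-optimal token budget is exactly the "maybe the community has been undertraining" bet described above: spend more compute per parameter than the scaling laws prescribe and see whether quality keeps improving.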
The conversation also touched on other significant figures in the field, like Jonathan Frankle of MosaicML, who is working on massive language models - highlighting that while code models may be constrained by data shortages, there is vast potential in the broader language-model arena.
Replit & MosaicML: Advancing AI Infrastructure and Embracing the Future
In a conversation between swyx and Reza Shabani, the two discussed the success and advantages of MosaicML. Shabani highlighted that Mosaic provides a beneficial separation between GPU offerings and cloud providers, offering a versatile training infrastructure. One of Mosaic's significant advantages is sourcing GPUs from various providers, which makes the training infrastructure more fault-tolerant. They also bring expertise in training models, providing pre-configured setups that optimize GPU utilization and ensure efficient model training.
Despite Google's claims about the efficiency of its TPUs, Reza emphasized a preference for the hardware that the majority of the community uses, noting TPUs' comparatively limited adoption.
Furthermore, Reza delved into the future plans for Replit, mentioning current hiring needs. Positions include an Applied AI/ML Engineer focused on data pipelines and an Applied AI Full Stack Engineer that combines model training with user-focused application integration. Notably, Replit's team comprises skilled individuals, like Bradley, an early YouTube employee who contributes significantly to Replit's inference stack.
The conversation underlines the complexity and potential of modern AI infrastructure, the importance of strategic hardware choices, and the dynamic future that Replit envisions for its team.
Embracing the Future: Understanding AI's Rapid Evolution and Societal Impact
In a "Lightning Round" discussion with Alessio Fanelli and swyx, Reza Shabani touches on the rapid evolution of AI, especially in replicating human communication, as seen in popular culture like Black Mirror. Shabani highlights societal concerns over AI's potential to replace both blue- and white-collar jobs. He stresses the importance of harnessing AI to assist rather than displace human workers and touches on the unforeseen applications of advanced AI in industries beyond chat. Discussing prompt engineering, Shabani expects it will diminish for more algorithmic use cases but remain vital for more human-like interactions. As a final takeaway, Shabani encourages embracing AI by learning its benefits and potential, comparing its societal impact to the internet's transformative role.