Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)

1h 44m video | Published Aug 27, 2024 | Transcribed Jun 15, 2026 | Stanford Online

Intermediate 25 min read For: Students and professionals with basic machine learning knowledge interested in understanding how LLMs are built.

AI Trust Score 85/100

✅ Highly Legit

"Title accurately reflects the lecture's content on building LLMs, though it's more of an overview than a deep dive."

AI Summary

This lecture provides a comprehensive overview of building large language models (LLMs), covering key components like architecture, training loss, data, evaluation, and systems. The speaker emphasizes that while academia focuses on architecture and algorithms, industry success hinges on data, evaluation, and systems. The talk is divided into pre-training and post-training phases, explaining how LLMs are trained on internet data and then fine-tuned to become AI assistants.

Chapters

1 Introduction and Key Components 00:05
2 Pre-training: Task, Loss, and Tokenization 02:56
3 Evaluation: Perplexity and Benchmarks 19:06
4 Data: Collection, Filtering, and Scaling 28:41
5 Scaling Laws and Optimal Allocation 40:39
6 Post-training: SFT and RLHF 59:56
7 Evaluation of Aligned Models and Systems 87:41

[00:05]

Introduction to LLMs

LLMs are large language models like ChatGPT, Claude, Gemini, and LLaMA. The lecture will cover how they work, focusing on five key components: architecture, training loss, data, evaluation, and systems.

[01:00]

Key Components for Training LLMs

The five key components are architecture, training loss/algorithm, data, evaluation, and systems. Industry focuses more on data, evaluation, and systems, while academia emphasizes architecture and algorithms.

[02:56]

Pre-training vs. Post-training

Pre-training trains the model on internet-scale text to model language. Post-training then fine-tunes that model into an AI assistant such as ChatGPT.

[03:44]

Language Modeling Task

Language models model probability distributions over sequences of tokens. Autoregressive models decompose this into predicting the next token given previous tokens.
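The chain-rule decomposition can be made concrete with a toy example. The sketch below is only illustrative: `COND` is a hypothetical hand-written table of next-token probabilities that conditions on the previous token alone, whereas a real LLM conditions on the whole prefix. It shows that the probability of a sequence is the product of its per-step conditional probabilities.

```python
import math

# Hypothetical toy model: P(next token | previous token).
# A real LLM conditions on the entire prefix, not just one token.
COND = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def sequence_log_prob(tokens):
    """Sum of log P(x_t | x_{t-1}) over the sequence, starting from <s>."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(COND[prev][tok])
        prev = tok
    return log_p

# P("the cat </s>") = 0.6 * 0.5 * 1.0
p = math.exp(sequence_log_prob(["the", "cat", "</s>"]))
```

Working in log space, as here, avoids numerical underflow when sequences are long.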

[06:36]

Autoregressive Language Models

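The task is to predict the next token (roughly, the next word). During training, the model outputs a probability distribution over the vocabulary at each position, and cross-entropy loss penalizes it for assigning low probability to the actual next token.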
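As a rough illustration of the training loss, the sketch below computes the average cross-entropy for next-token prediction in plain Python. The logits and target ids are made-up values over a hypothetical 3-token vocabulary; real training does the same computation on batched tensors.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy: -log p(actual next token) at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical logits over a 3-token vocabulary at two positions.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]  # ids of the tokens that actually came next
loss = next_token_loss(logits, targets)
```

The second position has uniform logits, so its loss is log(3); the first position puts most of its mass on the correct token and therefore contributes a smaller loss.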

[10:45]

Tokenization

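Tokenizers convert text into tokens, trading off generality (characters or bytes can represent any string) against sequence length (larger units yield shorter sequences). Byte Pair Encoding (BPE) is a common method: it starts from individual characters and repeatedly merges the most frequent adjacent pair of tokens into a new token.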
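The merge loop behind BPE can be sketched in a few lines. This is a simplified toy version, not any production tokenizer: it starts from characters, treats each word separately, and the word list and `num_merges` value are arbitrary choices for illustration.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(s, pair) for s in sequences]
    return merges, sequences

merges, tokenized = train_bpe(["low", "lower", "lowest", "low"], num_merges=2)
```

After two merges the frequent stem "low" has become a single token, while the rarer suffixes remain split into characters.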

[19:06]

Evaluation: Perplexity

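Perplexity is the exponentiated average per-token loss. For a model no worse than uniform guessing, it ranges from 1 (every token predicted with certainty) to the vocabulary size (uniform guessing), and it can be read as the number of tokens the model is 'hesitating' between.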
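A short sketch makes the two endpoints of that range concrete. The function takes the probabilities a model assigned to the tokens that actually occurred; the vocabulary size used here is a hypothetical value.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_loss)

vocab_size = 50_000  # hypothetical vocabulary size

# A model that is always certain and always right has perplexity 1.
certain = perplexity([1.0, 1.0, 1.0])

# A model that guesses uniformly has perplexity equal to the vocabulary size.
uniform = perplexity([1 / vocab_size] * 3)
```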

[21:28]

Evaluation: Benchmarks

Academic benchmarks like MMLU evaluate LLMs on multiple-choice questions. Evaluation methods vary, leading to inconsistencies.
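One way such inconsistencies can arise is in how a multiple-choice answer is read off the model. The sketch below is an illustrative assumption, not a description of any specific benchmark harness: it scores each option by the log-probability the model assigns to its tokens, once by total and once by per-token average, and the made-up numbers are chosen so that the two rules disagree.

```python
# Hypothetical per-token log-probabilities a model assigns to each answer option.
option_token_logprobs = {
    "A": [-1.0, -1.2],                     # short answer
    "B": [-0.5, -0.6, -0.5, -0.7, -0.4],   # longer answer, likelier per token
}

def pick_by_total(options):
    """Choose the option with the highest total log-probability."""
    return max(options, key=lambda k: sum(options[k]))

def pick_by_mean(options):
    """Choose the option with the highest average log-probability per token."""
    return max(options, key=lambda k: sum(options[k]) / len(options[k]))

total_choice = pick_by_total(option_token_logprobs)
mean_choice = pick_by_mean(option_token_logprobs)
```

Total log-probability favors the shorter option, while the per-token average favors the longer one, so the same model would be graded differently depending on the rule.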

[26:04]

Evaluation Challenges

Challenges include test set contamination and inconsistent evaluation methods. For example, LLaMA 65B scored 63.7% on MMLU when evaluated with HELM but 48.8% under a different evaluation implementation of the same benchmark.

[28:41]

Data Collection and Filtering

Data is collected from Common Crawl (250 billion pages). Steps include text extraction, filtering undesirable content, deduplication, heuristic filtering, and model-based filtering.
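Two of these steps, deduplication and heuristic filtering, can be sketched as follows. This is a deliberately small illustration: the thresholds `min_words` and `max_symbol_ratio` are hypothetical, exact-match hashing stands in for the fuzzier deduplication used at scale, and text extraction and model-based filtering are not shown.

```python
import hashlib

def heuristic_ok(text, min_words=5, max_symbol_ratio=0.3):
    """Reject very short documents and documents that are mostly symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def deduplicate(docs):
    """Keep the first copy of each document, ignoring case and extra spaces."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def clean(docs):
    """Deduplicate first, then apply the heuristic filter."""
    return [d for d in deduplicate(docs) if heuristic_ok(d)]

docs = [
    "Language models are trained on large text corpora.",
    "language  models are trained on large text corpora.",  # near-identical copy
    "$$$ ### !!! @@@ %%% ^^^ &&&",                           # mostly symbols
    "Too short.",
]
kept = clean(docs)
```

Only the first document survives: the second is a duplicate after normalization, the third fails the symbol-ratio check, and the fourth is too short.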

[35:28]

Data Scaling

Academic datasets grew from 150 billion tokens (800 GB) to 15 trillion tokens. LLaMA 3 was trained on 15 trillion tokens.

[40:39]

Scaling Laws

Scaling laws show that performance improves predictably with more compute, data, and parameters. They allow predicting optimal resource allocation.
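In practice, "predictably" usually means the relationship looks like a straight line on a log-log plot, so it can be fitted on small runs and extrapolated. The sketch below fits a power law with ordinary least squares in log space. The data points are synthetic, generated from a made-up law, and do not come from the lecture.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line in log-log space: log y = log a + b * log x."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope_num = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope_den = sum((u - mx) ** 2 for u in lx)
    b = slope_num / slope_den
    a = math.exp(my - b * mx)
    return a, b

# Synthetic points generated from the made-up law loss = 10 * compute^-0.05.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate the fitted law to a larger, not-yet-run compute budget.
predicted = a * 1e23 ** b
```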

[49:27]

Chinchilla Optimal Allocation

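The Chinchilla paper found that, for a fixed training compute budget, the optimal allocation is roughly 20 training tokens per model parameter. When inference cost is also taken into account, a smaller model trained on more data is preferable, at a ratio of around 150 tokens per parameter.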
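These ratios can be turned into a rough model size and token count for a given budget. The sketch below assumes the common approximation that training compute is about 6 x parameters x tokens; that formula and the example budget are assumptions introduced here, not figures from this summary.

```python
import math

def allocate(compute_flops, tokens_per_param):
    """Split a compute budget, assuming FLOPs ~= 6 * params * tokens."""
    # With tokens = r * params: compute = 6 * r * params^2.
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = 1e24  # hypothetical training budget in FLOPs

n20, d20 = allocate(budget, 20)      # training-optimal ratio
n150, d150 = allocate(budget, 150)   # ratio that also accounts for inference
```

For the same budget, the higher ratio gives a smaller model trained on more tokens, which is cheaper to serve.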

[55:16]

Cost of Training LLaMA 3 400B

Training LLaMA 3 400B cost approximately $75 million, using 30 million GPU hours on 16,000 H100s over 70 days.
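The GPU-hour figure can be sanity-checked from the GPU count and duration. In the sketch below, the hourly price is an assumed placeholder rather than a number from this summary, so the resulting cost is only indicative.

```python
num_gpus = 16_000
days = 70

# GPU-hours if every GPU runs for the whole period.
gpu_hours = num_gpus * days * 24

# Assumed rental price in USD per GPU-hour (an assumption, not from the summary).
assumed_price_per_gpu_hour = 2.0
compute_cost = gpu_hours * assumed_price_per_gpu_hour
```

This gives about 26.9 million GPU-hours, the same order as the quoted 30 million. At the assumed price the compute alone comes to roughly $54 million, so the quoted $75 million presumably includes costs this sketch does not model.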

[59:56]

Post-training: Alignment

Post-training aligns LLMs to follow instructions and be helpful, honest, and harmless. It uses supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

[62:50]

Supervised Fine-Tuning (SFT)

SFT fine-tunes the pre-trained model on human-written question-answer pairs. Surprisingly, a few thousand examples are often enough.

[69:59]

RLHF and DPO

RLHF uses a reward model trained on human preferences and then optimizes the policy via PPO. DPO simplifies this by directly maximizing preference likelihood.
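For a single preference pair, the DPO objective can be sketched as below. The inputs are the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The function name, argument names, and the `beta` default are illustrative choices, and a real implementation would operate on batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response: loss is lower.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Minimizing this loss raises the likelihood of the preferred response relative to the rejected one, without training a separate reward model or running PPO.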

[83:41]

Challenges with Human Data

Human labeling is slow, expensive, and inconsistent across annotators. LLMs can replace humans for preference labeling, agreeing with human preferences more often than human annotators agree with each other, at much lower cost.

[87:41]

Evaluation of Post-trained Models

Evaluating aligned models is challenging due to open-ended outputs. Chatbot Arena uses human voting, while AlpacaEval uses LLM judges.
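Both approaches reduce open-ended outputs to pairwise judgments that can be aggregated. A minimal sketch of one such aggregate, a win rate against a baseline model, is shown below. The labels and the choice to count ties as half a win are assumptions for illustration, not the exact procedure of either benchmark.

```python
def win_rate(judgments):
    """Fraction of comparisons won by the model; a tie counts as half a win."""
    score = 0.0
    for j in judgments:
        if j == "model":
            score += 1.0
        elif j == "tie":
            score += 0.5
    return score / len(judgments)

# Hypothetical verdicts from a human voter or an LLM judge, one per prompt.
votes = ["model", "baseline", "model", "tie"]
rate = win_rate(votes)  # (1 + 0 + 1 + 0.5) / 4
```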

[97:05]

Systems: GPU Optimization

GPUs are optimized for throughput and matrix multiplication. Key techniques include low-precision training (mixed precision) and operator fusion (e.g., torch.compile).
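One reason low-precision training is usually "mixed" rather than purely 16-bit can be shown without a GPU. The sketch below uses only the Python standard library to round values to IEEE half precision. It assumes a setup in which small updates are accumulated into a higher-precision master copy of the weights; that setup is a common practice assumed here, not a detail from this summary. It does not demonstrate torch.compile or operator fusion.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

update = 1e-4           # a small weight update, below fp16 resolution near 1.0
w_half = to_fp16(1.0)   # weight stored only in half precision
w_master = 1.0          # higher-precision master copy of the same weight

for _ in range(1000):
    # In fp16 the sum rounds back to 1.0, so the update is lost every step.
    w_half = to_fp16(w_half + update)
    # The master copy accumulates the updates as intended.
    w_master += update
```

After 1000 steps the half-precision weight has not moved, while the master copy has grown by about 0.1. Keeping the weights in higher precision while doing the heavy matrix multiplications in low precision avoids this loss.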

Building LLMs involves a complex pipeline from pre-training on massive internet data to post-training alignment. While scaling laws guide resource allocation, practical success depends heavily on data quality, evaluation, and systems optimization.

Mentioned in this Video

Common Crawl (tool)
HELM (tool)
Hugging Face Open LLM Leaderboard (tool)
Chatbot Arena (tool)
AlpacaEval (tool)
torch.compile (tool)
Richard Sutton (person)
John Schulman (person)
The Bitter Lesson (link)

Study Flashcards (11)

What are the five key components for training LLMs?

Architecture, training loss/algorithm, data, evaluation, and systems.