Let's build GPT: from scratch, in code, spelled out.

1h 56m video | Published Jan 17, 2023 | Transcribed Jun 15, 2026 | Andrej Karpathy

Intermediate | 58 min read | For: Software developers and machine learning practitioners with basic Python and PyTorch experience.

AI Trust Score 95/100

✅ Highly Legit

"Title accurately describes the content: building GPT from scratch with code explanations."

AI Summary

This video provides a step-by-step tutorial on building a GPT-like Transformer language model from scratch using Python and PyTorch. The presenter explains the core concepts of the Transformer architecture, including self-attention, multi-head attention, and feed-forward networks, and implements them in code. The tutorial culminates in training a character-level language model on the Tiny Shakespeare dataset to generate Shakespeare-like text.

Chapters

1 Introduction and Motivation 00:00

2 Data Preparation and Tokenization 03:53

3 Bigram Language Model 14:28

4 Self-Attention Mechanism 36:00

5 Multi-Head Attention and Feed-Forward Networks 58:25

6 Residual Connections and Layer Normalization 85:00

7 Scaling Up and Results 99:34

8 From Pre-training to ChatGPT 108:55

[00:00]

Introduction to ChatGPT and Language Models

ChatGPT is a probabilistic system: the same prompt can produce different outputs on different runs. Underneath, it is a language model, which models sequences of tokens and generates text by predicting what comes next.

[02:05]

Transformer Architecture Origin

The Transformer architecture was introduced in the 2017 paper 'Attention Is All You Need'. GPT stands for 'Generative Pre-trained Transformer'.

[03:53]

Tiny Shakespeare Dataset

The tutorial trains a character-level language model on the Tiny Shakespeare dataset, a single text file of about 1 MB that concatenates Shakespeare's works.

[05:43]

nanoGPT Repository

The code is available in the nanoGPT repository on GitHub, consisting of two files (model.py and train.py) of about 300 lines each.

[08:00]

Tokenization

Character-level tokenization is used: each character is mapped to an integer. The vocabulary size is 65 characters.
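
The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the video's exact code: the short `text` string stands in for the Tiny Shakespeare file, and the names `stoi`, `itos`, `encode`, and `decode` are chosen here for readability.

```python
# Minimal character-level tokenizer sketch (plain Python, no dependencies).
text = "hello world"  # stand-in for the full Tiny Shakespeare text

chars = sorted(set(text))   # unique characters in a stable, sorted order
vocab_size = len(chars)     # 65 for Tiny Shakespeare according to the notes above

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Turn a string into a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hello")
print(ids, decode(ids))
```

Encoding followed by decoding returns the original string, which is a quick sanity check for any tokenizer of this kind.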

[14:28]

Data Batching

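Training data is sampled in chunks governed by block_size, the maximum context length. A chunk of block_size + 1 characters packs block_size training examples, one for each context length from 1 to block_size, each predicting the next character from the context before it.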
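
A sketch of how such batches might be sampled with PyTorch is shown below. The `get_batch` name, the batch size of 4, and the `torch.arange` stand-in for the encoded dataset are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

data = torch.arange(100)  # stand-in for the encoded text (assumption)
block_size = 8            # maximum context length
batch_size = 4            # independent sequences per batch (assumption)


def get_batch(data, block_size, batch_size):
    """Sample random chunks and build (input, target) pairs shifted by one."""
    # Each start offset must leave room for block_size + 1 tokens.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y


xb, yb = get_batch(data, block_size, batch_size)

# Each row holds block_size examples: context xb[b, :t+1] -> target yb[b, t].
for t in range(block_size):
    context, target = xb[0, :t + 1], yb[0, t]
    print(context.tolist(), "->", target.item())
```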

[22:19]

Bigram Language Model

A simple bigram model is implemented first: it predicts the next character based solely on the current character using an embedding table.
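
A bigram model of this kind could look roughly like the following sketch. The class and attribute names are illustrative; the key idea from the text is that the embedding table has shape (vocab_size, vocab_size), so looking up a token directly yields the logits for the next token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramLanguageModel(nn.Module):
    """Predicts the next token from the current token only."""

    def __init__(self, vocab_size):
        super().__init__()
        # Row i of the table holds the next-token logits for token i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        # cross_entropy expects (N, C) logits and (N,) integer targets.
        loss = F.cross_entropy(logits.reshape(B * T, C), targets.reshape(B * T))
        return logits, loss


torch.manual_seed(0)
model = BigramLanguageModel(vocab_size=65)
idx = torch.randint(65, (4, 8))      # random stand-in batch (assumption)
targets = torch.randint(65, (4, 8))
logits, loss = model(idx, targets)
print(logits.shape, loss.item())
```

For an untrained model over 65 characters, a loss in the neighborhood of ln(65) ≈ 4.17 would correspond to a uniform guess; random initialization typically starts somewhat above that.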

[36:00]

Self-Attention Mechanism

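Self-attention allows tokens to communicate with each other. Each token emits a query, a key, and a value; query-key dot products give the attention weights, and the output at each position is a weighted average of the values of that token and the tokens before it.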
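
A single masked self-attention head might be sketched as follows. The sizes (n_embd=32, head_size=16, block_size=8) are illustrative assumptions, and scaling the scores by the inverse square root of the head size follows the standard scaled dot-product formulation rather than anything stated in these notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One head of causal (masked) self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may look at positions <= t only.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)   # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                           # rows sum to 1
        return wei @ v                                         # weighted average of values


torch.manual_seed(0)
head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(2, 8, 32)
out = head(x)
print(out.shape)
```

Because of the mask, changing the last token of the input leaves the outputs at all earlier positions unchanged, which is the property that makes this a decoder-style head.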

[58:25]

Multi-Head Attention

Multiple self-attention heads run in parallel and their outputs are concatenated, allowing the model to attend to different types of information.
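
One way to sketch multi-head attention is shown below. Note the swap: instead of keeping a list of separate head modules, this version computes all heads in one batched tensor operation and then reshapes, which is mathematically equivalent to running the heads side by side and concatenating their outputs. The fused `qkv` layer, the output projection `proj`, and the sizes are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Causal multi-head self-attention, all heads computed in one batch."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)                 # mixes the concatenated heads
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.n_head  # size of each head
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, hs): every head gets its own slice of channels.
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        wei = (q @ k.transpose(-2, -1)) * hs ** -0.5           # (B, n_head, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v                                          # (B, n_head, T, hs)
        # Concatenate the heads back along the channel dimension.
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


torch.manual_seed(0)
mha = MultiHeadAttention(n_embd=32, n_head=4, block_size=8)
x = torch.randn(2, 8, 32)
y = mha(x)
print(y.shape)
```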

[85:00]

Feed-Forward Networks and Residual Connections

After self-attention, a feed-forward network (MLP) is applied per token. Residual connections and layer normalization help with training deep networks.
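
The sketch below combines these pieces into one Transformer block. Several choices here are assumptions for the sake of a short, self-contained example: PyTorch's built-in `nn.MultiheadAttention` stands in for a hand-written attention module, layer normalization is applied before each sub-layer (the "pre-norm" arrangement), and the feed-forward hidden layer is four times the embedding width with a ReLU.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Transformer block: attention (communication) then MLP (computation)."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        # Built-in attention used as a stand-in for a custom implementation.
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.shape[1]
        # Boolean mask: True marks positions a token is NOT allowed to attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ffwd(self.ln2(x))    # residual connection around the MLP
        return x


torch.manual_seed(0)
block = Block(n_embd=32, n_head=4)
x = torch.randn(2, 8, 32)
y = block(x)
print(y.shape)
```

The residual additions give gradients a direct path from the output back to the input, which is the main reason stacks of such blocks remain trainable as depth grows.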

[99:34]

Scaling Up the Model

Increasing the model size (embedding dimension 384, 6 heads, 6 layers, block size 256) lowers the validation loss to 1.48 and produces more coherent Shakespeare-like text.
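
To get a feel for how large this configuration is, the following back-of-the-envelope count tallies trainable parameters. The layer layout is an assumption (bias-free query/key/value projections, a biased output projection, a 4x feed-forward layer, two layer norms per block, a final layer norm, and an untied output head), so the figure is an estimate for that layout only. The number of heads does not change the count, because the heads split the same embedding width between them.

```python
def estimate_params(vocab_size, block_size, n_embd, n_layer):
    """Rough trainable-parameter count for a small decoder-only Transformer."""
    C = n_embd
    attn = 3 * C * C + (C * C + C)                # q, k, v (no bias) + output projection
    ffwd = (C * 4 * C + 4 * C) + (4 * C * C + C)  # two linear layers with biases
    norms = 2 * (2 * C)                           # two layer norms (scale + shift)
    per_block = attn + ffwd + norms

    embeddings = vocab_size * C + block_size * C  # token + position tables
    final_norm = 2 * C
    lm_head = C * vocab_size + vocab_size         # projection to vocabulary logits
    return n_layer * per_block + embeddings + final_norm + lm_head


total = estimate_params(vocab_size=65, block_size=256, n_embd=384, n_layer=6)
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the total comes out to roughly 10.8 million parameters, almost all of it in the six blocks rather than in the embedding tables.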

[102:24]

Decoder-Only vs Encoder-Decoder

The implemented model is a decoder-only Transformer (like GPT), suitable for unconditional text generation. The original Transformer paper used an encoder-decoder for translation.

[108:55]

From Pre-training to ChatGPT

Pre-training trains a language model on internet text. Fine-tuning (e.g., with reinforcement learning from human feedback) aligns the model to be an assistant like ChatGPT.

This tutorial successfully builds a decoder-only Transformer from scratch, demonstrating the core components of GPT. The final model, trained on Tiny Shakespeare, generates plausible Shakespeare-like text, illustrating the power of the Transformer architecture.

Mentioned in this Video

Attention Is All You Need (paper)

PyTorch (tool)

nanoGPT GitHub Repository (tool)

Tiny Shakespeare (dataset)

Google Colab (tool)

Tutorial Checklist

1 07:53 Set up a Google Colab notebook and download the Tiny Shakespeare dataset.

2 08:37 Create a character-level tokenizer: build vocabulary of unique characters, create encoder/decoder mappings.

3 13:41 Split the dataset into training (90%) and validation (10%) sets.

4 14:28 Implement data batching: sample random chunks of size block_size from the training set, create input-target pairs.

5 22:19 Implement a bigram language model using nn.Embedding: map tokens to logits directly.

6 28:53 Add generation function: sample from the model iteratively to produce new text.

7 34:53 Train the bigram model using AdamW optimizer and cross-entropy loss.

8 42:13 Implement self-attention: compute queries, keys, values; apply masked softmax and weighted aggregation.

9 58:25 Implement multi-head attention: run multiple self-attention heads in parallel and concatenate outputs.

10 85:00 Add feed-forward network (MLP) with residual connections and layer normalization.

11 99:34 Scale up the model: increase embedding dimension, number of heads, layers, and block size; train on GPU.
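
Steps 6 and 7 of the checklist (training with AdamW and cross-entropy, then sampling) can be sketched end to end as follows. To stay short, a bare `nn.Embedding` plays the role of the bigram model, random integers stand in for the encoded dataset, and the learning rate, batch size, and step count are arbitrary illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 65, 8, 32
data = torch.randint(vocab_size, (1000,))  # stand-in for encoded text (assumption)

# Bigram-style model: row i of the table holds next-token logits for token i.
model = nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    # Sample a batch of random chunks and their one-step-shifted targets.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    xb = torch.stack([data[i:i + block_size] for i in ix])
    yb = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

    logits = model(xb)  # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


@torch.no_grad()
def generate(model, idx, max_new_tokens):
    """Sample tokens one at a time, appending each to the running context."""
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]                      # logits at the last position
        probs = F.softmax(logits, dim=-1)                  # next-token distribution
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token id
        idx = torch.cat([idx, next_id], dim=1)
    return idx


start = torch.zeros((1, 1), dtype=torch.long)  # begin from token id 0
out = generate(model, start, max_new_tokens=20)
print(out[0].tolist())
```

With a real tokenizer, the generated ids would be passed through the decoder mapping to obtain text. For a full Transformer, the context fed to the model would also need to be cropped to the last block_size tokens before each step.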

Study Flashcards (10)

What does GPT stand for?

Generative Pre-trained Transformer