I read every major CS paper of the last 100 years...

0h 10m video. Transcribed Jun 28, 2026.

Full Transcript

[00:00] The year was 1936. Alan Turing asked a

[00:03] simple question. Can machines think?

[00:05] Actually, no, that's not right. What he

[00:07] really asked was something way more

[00:08] boring. Can every mathematical

[00:10] problem be solved by an algorithm?

[00:12] Surprisingly, he proved the answer is

[00:14] no. But in the process, he accidentally

[00:16] invented the computer. Then 12 years

[00:18] later, in 1948, another legend shows up

[00:21] named Claude Shannon, and he reduced all

[00:23] human communication down to ones and

[00:25] zeros, casually inventing the bit like

[00:27] it was no big deal. One thing led to

[00:29] another, and now in 2026, we have

[00:31] 18-year-olds in hoodies typing import

[00:34] torch into Python files and cashing

[00:36] billion-dollar checks from venture

[00:37] capitalist boomers. But reaching this

[00:39] point has been underpinned by a century

[00:41] long chain reaction of computer science

[00:43] papers written mostly by dead people

[00:45] much smarter than us. In today's video,

[00:47] we'll look at 10 of the most important

[00:49] scientific papers in the history of

[00:51] computer science and how they changed

[00:53] the world for better or worse. Our story

[00:55] begins nearly a century ago when

[00:57] mathematician David Hilbert asked the

[00:59] field's biggest flex of a question. Is

[01:01] there a universal algorithm that can

[01:03] decide whether any mathematical

[01:05] statement is true? Or in other words,

[01:07] can we automate math itself? He called

[01:09] this the Entscheidungsproblem, which is

[01:10] German for decision problem. By 1936,

[01:13] Alan Turing comes around and gives a

[01:15] brutal answer to this question. No. But

[01:18] in order to prove it, he wrote this

[01:19] paper on computable numbers that had to

[01:22] define what an algorithm even is. And so

[01:24] he imagined a hypothetical machine with

[01:27] an infinite tape, a read write head, and

[01:29] a tiny table of rules. This Turing

[01:31] machine is the abstract blueprint for

[01:33] every computing device you've ever

[01:35] owned. Once created, he asked it to

[01:37] solve the halting problem. Can you write

[01:39] a program that looks at any other

[01:41] program and tells you if it'll finish

[01:42] running or loop forever? Turing proved

[01:45] that it's impossible for a program like

[01:47] this to exist. It simply leads to a

[01:49] logical contradiction, which means math

[01:51] has problems that no algorithm can

[01:53] solve. That's annoying. But 12 years

[01:55] later, a guy named Claude Shannon would

[01:57] ask his own annoying question. What is

[01:59] information as a thing you can measure?

[02:01] In his paper, A Mathematical Theory of

[02:03] Communication, he rips out the meaning

[02:05] from normal words entirely. "I love you"

[02:08] and "the cat is on fire" carry the same

[02:10] information if they're equally

[02:11] surprising. And he measures that

[02:13] surprise in a unit called the bit. He

[02:15] proved that all information could

[02:17] ultimately be boiled down to a stream of

[02:19] ones and zeros. But here's the crazy

[02:20] part. To estimate how much information

[02:22] was needed to transmit a message, he

[02:24] borrowed a word from thermodynamics

[02:26] nobody understands. Entropy. To estimate

[02:28] entropy of English, Shannon made people

[02:31] guess the next letter in a sentence.

[02:32] When a letter is easy to guess, it has

[02:34] low entropy. When a letter is hard to

[02:36] guess, it has high entropy. But wait a

[02:38] minute. Having humans guess the next

[02:40] token is exactly what AI does today,

[02:43] just on a much bigger scale. Shannon

[02:45] wasn't trying to build artificial

[02:46] intelligence, but he gave us the math

[02:48] for uncertainty, prediction, and

[02:50] compression and accidentally wrote the

[02:52] spiritual ancestor to the loss function.
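
To make the "easy to guess means low entropy" idea concrete, here is a minimal Python sketch of Shannon's entropy formula, H = -sum(p * log2(p)). The function name and the two example distributions are illustrative assumptions, not something taken from the video or from Shannon's paper.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)) of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A next letter that is almost certain (easy to guess) carries little information.
easy_to_guess = [0.97, 0.01, 0.01, 0.01]
# Four equally likely letters (hard to guess) carry the maximum: log2(4) = 2 bits.
hard_to_guess = [0.25, 0.25, 0.25, 0.25]

print(entropy_bits(easy_to_guess))
print(entropy_bits(hard_to_guess))
```

The cross-entropy loss used to train language models is built from the same `p * log2(p)` style of term, which is the sense in which this math is an ancestor of the loss function.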

[02:54] And that's exactly why Anthropic named

[02:56] their AI model Claude. Then 10 years

[02:58] later at Cornell, a psychologist, not a

[03:01] computer scientist, builds the first

[03:03] machine that actually learns. He gets

[03:05] inspired by the way neurons work in the

[03:06] brain. So he designs a thing called a

[03:08] perceptron that takes inputs, weighs

[03:11] them, and then adjusts those weights

[03:12] when it's wrong until it can classify

[03:14] patterns on its own. It's the building

[03:16] block for modern neural networks, and

[03:18] the hype is immediate and unhinged. The

[03:20] Navy funds it, and the New York Times

[03:22] reports that the computer will soon be

[03:23] conscious, but 11 years later, the hype

[03:25] would die out completely, thanks to two

[03:27] haters at MIT, who published another

[03:30] paper with a completely different vibe.
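
The perceptron's "weigh the inputs, adjust when wrong" loop can be sketched in a few lines of Python. This is a simplified illustration of the classic perceptron learning rule, trained on the logical AND function; the function names, learning rate, and epoch count are assumptions picked for the example, not details from the original paper.

```python
def predict(weights, bias, inputs):
    """Output 1 if the weighted sum of the inputs is above zero, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, lr=1):
    """Perceptron rule: the weights only change when the prediction is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0, or +1
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so a single perceptron can learn it.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND)
print([predict(weights, bias, x) for x, _ in AND])  # [0, 0, 0, 1]
```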

[03:32] With basic math, they prove that a

[03:34] single layer perceptron can't even learn

[03:36] exclusive or, which is just trivial

[03:38] logic that means this or that, but not

[03:40] both. This paper, or technically a book,

[03:43] was essentially a death certificate for

[03:45] AI at the time. Funding evaporated, and

[03:47] deep neural networks entered their first

[03:49] AI winter, but there was a twist buried

[03:51] in the fine print. They actually figured

[03:53] out that stacking layers of perceptrons

[03:55] fixes everything. The only problem is

[03:57] that back then, nobody knew how to train

[03:59] a stack of perceptrons. It would take

[04:01] another 17 years to figure it out. But

[04:03] first, we need to talk about Time,

[04:04] Clocks, and the Ordering of Events in a

[04:06] Distributed System by Leslie Lamport.

[04:09] Because neural networks are useless

[04:10] unless you can run them on a massive

[04:12] scale. This paper realized that separate

[04:14] computers with no shared clock can't

[04:16] really have a universal now time. And

[04:18] that's a big problem when you have

[04:20] multiple computers in a distributed

[04:21] system trying to do things in order.
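
Lamport's logical clocks are the standard answer to this ordering problem. As a minimal sketch, each process keeps a plain counter instead of trusting wall-clock time. The class and method names here are hypothetical and chosen for illustration; only the two update rules (tick on every event, jump past the sender's timestamp on receive) reflect the usual formulation of a Lamport clock.

```python
class Process:
    """A process with a Lamport logical clock (a plain counter, not wall-clock time)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        # Rule 1: tick the counter on every local event.
        self.clock += 1
        return self.clock

    def send(self):
        # Sending is an event too; the returned timestamp travels with the message.
        self.clock += 1
        return self.clock

    def receive(self, message_timestamp):
        # Rule 2: jump past the sender's timestamp so the receive is ordered after the send.
        self.clock = max(self.clock, message_timestamp) + 1
        return self.clock

a, b = Process("A"), Process("B")
a.local_event()                   # A's clock: 1
a.local_event()                   # A's clock: 2
sent_at = a.send()                # A's clock: 3
received_at = b.receive(sent_at)  # B's clock: max(0, 3) + 1 = 4
print(sent_at, received_at)
```

Because a receive always gets a larger timestamp than the matching send, any event that could have caused another is guaranteed to carry the smaller number.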

[04:23] Well, he figured out a way to fix this

[04:25] with the happened-before relation. You

[04:27] stop trusting the wall clock time and

[04:29] order events by causality instead. If A

[04:32] could have caused B, A comes first. From

[04:34] that, he builds logical clocks which

[04:36] allow an unlimited number of machines to

[04:38] stay in agreement without ever looking

[04:40] at a real clock. Eventually, this paper

[04:42] would become the bedrock for every

[04:44] database, blockchain, and every massive

[04:46] AI training run because you need

[04:48] thousands of GPUs that constantly stay

[04:50] in sync and agree on state without

[04:52] dissolving into chaos. That was a

[04:54] game changer. But 17 years after neural

[04:56] networks were left for dead, three

[04:58] researchers, including the godfather

[05:00] Geoffrey Hinton, answered the question

[05:02] that everyone gave up on. How do you

[05:04] train a stack of layers? But before we

[05:06] answer that, we need to quickly talk

[05:08] about Coder, who was cool enough to

[05:09] sponsor this 10-minute video on esoteric

[05:12] computer science papers. They provide

[05:14] self-hosted development environments

[05:16] that let you work with multiple agents

[05:17] in parallel and with enterprise level

[05:20] security. And they just launched Coder

[05:21] agents, a chat interface and API for

[05:24] delegating coding jobs to agents running

[05:26] on your own infrastructure. It's the

[05:28] only architecture that lets

[05:29] organizations self-host both the agent

[05:31] workflow and the development

[05:33] environments where the code is actually

[05:34] executed. This gives teams greater

[05:36] control over source code access, agent

[05:39] execution, governance, and security

[05:41] boundaries. It's also model agnostic, so

[05:43] you can connect any LLM you want and

[05:46] switch between them with just a config

[05:47] change. Coder agents are designed for

[05:49] teams in regulated industries who need

[05:51] to self-host their AI workflows with

[05:53] complete control. They're already

[05:55] used by dozens of financial institutions

[05:57] and government organizations. And you

[05:59] can check it out at the link below. Now,

[06:01] back to the question, how do you train a

[06:03] stack of layers? The answer is

[06:05] backpropagation. Run your data forward,

[06:07] measure how wrong the output is, and

[06:09] then push that error backward through

[06:10] every layer using the chain rule from

[06:12] calculus to nudge each weight in the

[06:14] direction that's a little less wrong. Do

[06:16] that a few million times and the network

[06:18] teaches itself. The crazy discovery

[06:20] though is that the middle hidden layers

[06:22] started inventing their own features.

[06:24] Edges, shapes, and concepts that nobody

[06:26] programmed in. That exclusive or problem

[06:29] that was impossible 17 years ago? It

[06:31] just became trivial. Backpropagation is

[06:33] still essential to neural networks

[06:35] today, but back then they sucked because

[06:37] we didn't have enough data or compute.
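
The forward pass, error measurement, and backward pass described above can be sketched in plain Python for a tiny network with one hidden layer. This is an illustrative toy under stated assumptions: the network size, sigmoid activation, squared-error loss, learning rate, and the single training example are all choices made for this sketch, not details from the original paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Run the data forward through one hidden layer and one output neuron."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def loss(params, x, target):
    """Measure how wrong the output is (squared error)."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2

def backward(params, x, target):
    """Push the error backward with the chain rule; gradients are shaped like params."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden, output = forward(params, x)
    # d(loss)/d(output pre-activation) = (output - target) * sigmoid'(z)
    delta_out = (output - target) * output * (1.0 - output)
    grad_w_out = [delta_out * h for h in hidden]
    # Chain rule again: each hidden unit gets the error scaled by its outgoing weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_w_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_w_hidden, delta_hidden, grad_w_out, delta_out

def train_step(params, x, target, lr=0.5):
    """Nudge every weight a little in the direction that reduces the error."""
    w_hidden, b_hidden, w_out, b_out = params
    g_wh, g_bh, g_wo, g_bo = backward(params, x, target)
    return (
        [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(w_hidden, g_wh)],
        [b - lr * g for b, g in zip(b_hidden, g_bh)],
        [w - lr * g for w, g in zip(w_out, g_wo)],
        b_out - lr * g_bo,
    )

random.seed(0)
params = (
    [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)],  # hidden weights
    [0.0, 0.0],                                                     # hidden biases
    [random.uniform(-1, 1) for _ in range(2)],                      # output weights
    0.0,                                                            # output bias
)
x, target = [1.0, 0.0], 1.0
before = loss(params, x, target)
for _ in range(100):
    params = train_step(params, x, target)
after = loss(params, x, target)
print(before, after)  # the loss after training should be smaller than before
```

Real frameworks compute these gradients automatically, but the arithmetic inside `backward` is the same chain-rule bookkeeping.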

[06:39] Well, that was about to change in 1998

[06:41] with the rise of the internet and this

[06:43] famous paper from Larry and Sergey about

[06:45] the anatomy of a large-scale web search

[06:47] engine. The paper describes the

[06:49] PageRank algorithm where instead of ranking

[06:51] a web page by how often a word appears,

[06:53] it treats every link as a vote and each

[06:56] vote is weighted by how trustworthy the

[06:58] voter is. They built a prototype in

[06:59] their dorm room which eventually became

[07:01] a company called Google that you may

[07:03] have heard of. Most importantly though,

[07:04] this algorithm helped assemble the

[07:06] largest structured pile of human text

[07:08] ever created. And that massive pile of

[07:10] text would eventually become the

[07:11] training data or feedstock for future

[07:14] AI models. We'd finally see this in

[07:15] action in 2012 with a legendary ImageNet

[07:18] paper, created by a dream team of

[07:20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey

[07:23] Hinton. Remember when I said

[07:25] backpropagation needs data and compute?

[07:27] Well, finally the stars aligned. The

[07:29] data set is called ImageNet and it's a

[07:31] monster data set of millions of

[07:32] hand-labeled photos, while the compute is

[07:34] a couple of Nvidia consumer-grade gaming

[07:37] GPUs. A grad student named Alex wires up

[07:40] a deep convolutional neural network,

[07:42] names it AlexNet, and trains it in his

[07:44] bedroom. Then he walks it into the

[07:45] annual ImageNet contest and humiliates

[07:48] everyone. This is a contest where AI

[07:50] models try to classify objects in an

[07:52] image like hot dog or not hot dog. And

[07:55] while everyone was fighting over a

[07:56] fraction of a percent, AlexNet walked

[07:58] in and dropped the error rate by 10

[08:00] points in a single year. And this

[08:02] freaked everyone out because it was

[08:03] suddenly clear that deep learning

[08:05] actually works. It just needs more data,

[08:07] more compute, and the right

[08:08] architecture. Luckily, we would get that

[08:10] architecture a few years later thanks to

[08:12] Ashish Vaswani and Google in the paper

[08:15] Attention Is All You Need. Around this

[08:17] time, large language models had a huge

[08:19] problem. They would start a sentence and

[08:21] by the end they would forget what they

[08:22] were even talking about. That's because

[08:24] they would read and predict tokens

[08:25] sequentially one after the other. This

[08:27] paper fixed that by introducing a new

[08:29] architecture called the transformer that

[08:31] throws out sequential reading entirely.
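
As a rough picture of what replaces sequential reading, here is a minimal pure-Python sketch of scaled dot-product self-attention. It is heavily simplified: queries, keys, and values are all just the input vectors, whereas a real transformer first applies learned projections and uses multiple heads. The function names and the toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = the inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word at once, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Blend the value vectors according to how relevant each word is.
        blended = [sum(w * value[i] for w, value in zip(weights, vectors))
                   for i in range(dim)]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "word" vectors (made-up numbers).
words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(words)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one word attends to every word. Nothing in the loop over queries depends on a previous iteration, which is why this computation parallelizes so well.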

[08:34] Instead, it lets every word look at

[08:36] every other word at once and decide

[08:38] what's relevant. Not only does this make

[08:40] large language models feel more

[08:41] intelligent, but transformers also scale

[08:43] better as well. Google made the big

[08:45] mistake of giving this architecture away

[08:47] for free, and now every AI lab uses it,

[08:49] and that's where you get the T in

[08:51] ChatGPT. Speaking of which, that brings us

[08:54] to a paper released by OpenAI in 2020.

[08:56] Language Models are Few-Shot Learners.

[08:59] Basically, OpenAI takes the transformer

[09:01] and then asks the dumbest question

[09:03] possible. What if we just make it

[09:04] enormous? Not two times bigger, but

[09:07] scale it to 175 billion parameters and

[09:10] feed it the entire internet as a data

[09:11] set. They made a crazy bet that

[09:13] intelligence isn't some secret algorithm

[09:15] we're missing, but rather it simply

[09:17] emerges once you cross a threshold of

[09:19] scale. The end result was GPT-3, the

[09:22] model that ignited the current AI bubble

[09:24] that we're living through right now.

[09:25] What's crazy is that all of a sudden,

[09:27] this model could translate, summarize,

[09:29] and write code without ever being

[09:31] specifically told how to do these things.

[09:33] At such a large scale, it learned how to

[09:35] generalize these things on the fly. 2

[09:37] years later, this paper would evolve

[09:38] into ChatGPT, which today is now a

[09:41] trillion dollar product. But when you

[09:43] think about it, what is ChatGPT even

[09:45] doing? Well, it's just predicting the

[09:46] next word or token just like Claude

[09:48] Shannon was doing in 1948. So, here's

[09:51] the TL;DR for the last 100 years. Alan

[09:53] Turing defined the machine. Claude

[09:55] Shannon gave it currency. Rosenblatt

[09:57] gave it a neuron. Geoffrey Hinton taught

[09:59] it how to learn. Google gave it data and

[10:01] an architecture. And OpenAI just turned

[10:03] the dial to the maximum. This has been

[10:05] the history of artificial intelligence

[10:07] in 10 scientific papers. Thanks for

[10:09] watching and I will see you in the next