---
title: 'I read every major CS paper of the last 100 years...'
source: 'https://youtube.com/watch?v=ML3q7Ok4hJg'
video_id: 'ML3q7Ok4hJg'
date: 2026-06-28
duration_sec: 611
---

# I read every major CS paper of the last 100 years...

> Source: [I read every major CS paper of the last 100 years...](https://youtube.com/watch?v=ML3q7Ok4hJg)

## Summary



## Transcript

The year was 1936. Alan Turing asked a
simple question. Can machines think?
Actually, no, that's not right. What he
really asked was something way more
boring. Can every mathematical
problem be solved by an algorithm?
Surprisingly, he proved the answer is
no. But in the process, he accidentally
invented the computer. Then 12 years
later, in 1948, another legend shows up
named Claude Shannon, and he reduced all
human communication down to ones and
zeros, casually inventing the bit like
it was no big deal. One thing led to
another, and now in 2026, we have
18-year-olds in hoodies typing import
torch into Python files and cashing
billion-dollar checks from venture
capitalist boomers. But reaching this
point has been underpinned by a century
long chain reaction of computer science
papers written mostly by dead people
much smarter than us. In today's video,
we'll look at 10 of the most important
scientific papers in the history of
computer science and how they changed
the world for better or worse. Our story
begins nearly a century ago when
mathematician David Hilbert asked the
field's biggest flex of a question. Is
there a universal algorithm that can
decide whether any mathematical
statement is true? Or in other words,
can we automate math itself? He called
this the Entscheidungsproblem, which is
German for decision problem. By 1936,
Alan Turing comes around and gives a
brutal answer to this question. No. But
in order to prove it, he wrote this
paper, "On Computable Numbers," that had to
define what an algorithm even is. And so
he imagined a hypothetical machine with
an infinite tape, a read/write head, and
a tiny table of rules. This Turing
machine is the abstract blueprint for
every computing device you've ever
owned. Once created, he asked it to
solve the halting problem. Can you write
a program that looks at any other
program and tells you if it'll finish
running or loop forever? Turing proved
that it's impossible for a program like
this to exist. It simply leads to a
logical contradiction, which means math
has problems that no algorithm can
solve. That's annoying. But 12 years
later, a guy named Claude Shannon would
ask his own annoying question. What is
information as a thing you can measure?
In his paper, "A Mathematical Theory of
Communication," he rips out the meaning
from normal words entirely. "I love you"
and "the cat is on fire" carry the same
information if they're equally
surprising. And he measures that
surprise in a unit called the bit. He
proved that all information could
ultimately be boiled down to a stream of
ones and zeros. But here's the crazy
part. To estimate how much information
was needed to transmit a message, he
borrowed a word from thermodynamics
nobody understands. Entropy. To estimate
entropy of English, Shannon made people
guess the next letter in a sentence.
When a letter is easy to guess, it has
low entropy. When a letter is hard to
guess, it has high entropy. But wait a
minute. Having humans guess the next
token is exactly what AI does today,
just on a much bigger scale. Shannon
wasn't trying to build artificial
intelligence, but he gave us the math
for uncertainty, prediction, and
compression and accidentally wrote the
spiritual ancestor to the loss function.
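
To make the entropy idea concrete, here is a minimal Python sketch. It is not Shannon's letter-guessing experiment: it assumes the simpler textbook formula, H = sum of p * log2(1/p), applied to the character frequencies observed in a string. The function name `entropy_bits` is made up for illustration.

```python
import math
from collections import Counter

def entropy_bits(text: str) -> float:
    """Shannon entropy in bits per character, from observed character frequencies."""
    counts = Counter(text)
    total = len(text)
    # Each symbol contributes p * log2(1/p), where p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(entropy_bits("aaaaaaaa"))  # 0.0 bits: the next letter is never a surprise
print(entropy_bits("abababab"))  # 1.0 bit: two equally likely letters
print(entropy_bits("abcdefgh"))  # 3.0 bits: eight equally likely letters
```

Easy-to-guess text scores low and hard-to-guess text scores high, which is the same quantity a language model's loss is built around.
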
And that's exactly why Anthropic named
their AI model Claude. Then 10 years
later at Cornell, a psychologist, not a
computer scientist, builds the first
machine that actually learns. He gets
inspired by the way neurons work in the
brain. So he designs a thing called a
perceptron that takes inputs, weighs
them, and then adjusts those weights
when it's wrong until it can classify
patterns on its own. It's the building
block for modern neural networks, and
the hype is immediate and unhinged. The
Navy funds it, and the New York Times
reports that the computer will soon be
conscious, but 11 years later, the hype
would die out completely, thanks to two
haters at MIT, who published another
paper with a completely different vibe.
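
The perceptron described above (take inputs, weigh them, adjust the weights when the guess is wrong) can be sketched in a few lines of Python. This is a toy version of the classic perceptron learning rule, not Rosenblatt's original hardware; the names `train_perceptron` and `predict`, the zero starting weights, and the 20 training passes are assumptions for illustration. It learns AND, but a single unit like this cannot get all four exclusive or cases right, because no straight line separates them.

```python
def train_perceptron(samples, epochs=20, lr=1):
    """Perceptron rule: nudge the weights only when the guess is wrong."""
    w = [0, 0]
    b = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            guess = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - guess  # 0 when right, +1 or -1 when wrong
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

def predict(w, b, x1, x2):
    """Fire (1) if the weighted sum crosses the threshold, else stay silent (0)."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = list(zip(INPUTS, [0, 0, 0, 1]))
XOR = list(zip(INPUTS, [0, 1, 1, 0]))

for name, samples in [("AND", AND), ("XOR", XOR)]:
    w, b = train_perceptron(samples)
    outputs = [predict(w, b, x1, x2) for x1, x2 in INPUTS]
    targets = [target for _, target in samples]
    print(name, "outputs:", outputs, "targets:", targets)
```
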
With basic math, they prove that a
single-layer perceptron can't even learn
exclusive or (XOR), which is just trivial
logic that means this or that, but not
both. This paper, or technically a book,
was essentially a death certificate for
AI at the time. Funding evaporated, and
deep neural networks entered their first
AI winter, but there was a twist buried
in the fine print. They actually figured
out that stacking layers of perceptrons
fixes everything. The only problem is
that back then, nobody knew how to train
a stack of perceptrons. It would take
another 17 years to figure it out. But
first, we need to talk about "Time,
Clocks, and the Ordering of Events in a
Distributed System" by Leslie Lamport,
because neural networks are useless
unless you can run them on a massive
scale. This paper realized that separate
computers with no shared clock can't
really have a universal "now" time. And
that's a big problem when you have
multiple computers in a distributed
system trying to do things in order.
Well, he figured out a way to fix this
with the happens-before relation. You
stop trusting the wall clock time and
order events by causality instead. If A
could have caused B, A comes first. From
that, he builds logical clocks which
allow an unlimited number of machines to
stay in agreement without ever looking
at a real clock. Eventually, this paper
would become the bedrock for every
database, blockchain, and every massive
AI training run because you need
thousands of GPUs that constantly stay
in sync and agree on state without
dissolving into chaos. That was a
game changer. But 17 years after neural
networks were left for dead, three
researchers, including the godfather
Geoffrey Hinton, answered the question
that everyone gave up on. How do you
train a stack of layers? But before we
answer that, we need to quickly talk
about Coder, who was cool enough to
sponsor this 10-minute video on esoteric
computer science papers. They provide
self-hosted development environments
that let you work with multiple agents
in parallel and with enterprise level
security. And they just launched Coder
agents, a chat interface and API for
delegating coding jobs to agents running
on your own infrastructure. It's the
only architecture that lets
organizations self-host both the agent
workflow and the development
environments where the code is actually
executed. This gives teams greater
control over source code access, agent
execution, governance, and security
boundaries. It's also model agnostic, so
you can connect any LLM you want and
switch between them with just a config
change. Coder agents are designed for
teams in regulated industries who need
to self-host their AI workflows with
complete control, and they're already
used by dozens of financial institutions
and government organizations. And you
can check it out at the link below. Now,
back to the question, how do you train a
stack of layers? The answer is
backpropagation. Run your data forward,
measure how wrong the output is, and
then push that error backward through
every layer using the chain rule from
calculus to nudge each weight in the
direction that's a little less wrong. Do
that a few million times and the network
teaches itself. The crazy discovery
though is that the middle hidden layers
started inventing their own features:
edges, shapes, and concepts that nobody
programmed in. That exclusive or problem
that was impossible 17 years ago? It
just became trivial. Backpropagation is
still essential to neural networks
today, but back then they sucked because
we didn't have enough data or compute.
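
The forward-then-backward loop described above can be sketched in plain Python on the exclusive or problem. This is a toy under stated assumptions, not the setup from the original backpropagation paper: one hidden layer of 4 sigmoid units, squared error, a learning rate of 0.5, and 5000 passes are arbitrary choices, and the names `init_params`, `forward`, `backprop`, and `train` are hypothetical.

```python
import math
import random

# XOR truth table: ((input1, input2), target)
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(hidden=4, seed=0):
    """Small random weights for a 2 -> hidden -> 1 network (sizes are assumptions)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Run the data forward: inputs -> hidden layer -> single output."""
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(p["w1"], p["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["w2"], h)) + p["b2"])
    return h, y

def backprop(p, x, t):
    """Gradients of the squared error (y - t)**2 for one example, via the chain rule."""
    h, y = forward(p, x)
    d_out = 2.0 * (y - t) * y * (1.0 - y)  # error signal at the output unit
    # Push the error backward through w2 and each hidden unit's sigmoid.
    d_hid = [d_out * w * hj * (1.0 - hj) for w, hj in zip(p["w2"], h)]
    return {
        "w1": [[d * x[0], d * x[1]] for d in d_hid],
        "b1": d_hid,
        "w2": [d_out * hj for hj in h],
        "b2": d_out,
    }

def loss(p):
    """Mean squared error over the four XOR cases."""
    return sum((forward(p, x)[1] - t) ** 2 for x, t in DATA) / len(DATA)

def train(p, epochs=5000, lr=0.5):
    """Nudge every weight a little in the less-wrong direction, over and over."""
    for _ in range(epochs):
        for x, t in DATA:
            g = backprop(p, x, t)
            for j in range(len(p["w2"])):
                p["w1"][j][0] -= lr * g["w1"][j][0]
                p["w1"][j][1] -= lr * g["w1"][j][1]
                p["b1"][j] -= lr * g["b1"][j]
                p["w2"][j] -= lr * g["w2"][j]
            p["b2"] -= lr * g["b2"]
    return p

params = init_params()
loss_before = loss(params)
train(params)
loss_after = loss(params)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
for x, t in DATA:
    print(x, "target", t, "output", round(forward(params, x)[1], 3))
```

How close the outputs get to the targets depends on the random starting weights and the number of passes, so treat the printed numbers as illustrative rather than guaranteed.
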
Well, that was about to change in 1998
with the rise of the internet and this
famous paper from Larry and Sergey about
the anatomy of a large-scale web search
engine. The paper describes the PageRank
algorithm, where instead of ranking
a web page by how often a word appears,
it treats every link as a vote and each
vote is weighted by how trustworthy the
voter is. They built a prototype in
their dorm room which eventually became
a company called Google that you may
have heard of. Most importantly though,
this algorithm helped assemble the
largest structured pile of human text
ever created. And that massive pile of
text would eventually become the
training data or feedstock for future
AI models. We'd finally see this in
action in 2012 with a legendary ImageNet
paper, created by a dream team of
Alex Krizhevsky, Ilya Sutskever, and Geoffrey
Hinton. Remember when I said
backpropagation needs data and compute?
Well, finally the stars aligned. The
dataset is called ImageNet, and it's a
monster dataset of millions of
hand-labeled photos, while the compute is
a couple of Nvidia consumer-grade gaming
GPUs. A grad student named Alex wires up
a deep convolutional neural network,
names it AlexNet, and trains it in his
bedroom. Then he walks it into the
annual ImageNet contest and humiliates
everyone. This is a contest where AI
models try to classify objects in an
image like hot dog or not hot dog. And
while everyone was fighting over a
fraction of a percent, AlexNet walked
in and dropped the error rate by 10
points in a single year. And this
freaked everyone out because it was
suddenly clear that deep learning
actually works. It just needs more data,
more compute, and the right
architecture. Luckily, we would get that
architecture a few years later thanks to
Ashish Vaswani and Google in the paper
"Attention Is All You Need." Around this
time, large language models had a huge
problem. They would start a sentence and
by the end they would forget what they
were even talking about. That's because
they would read and predict tokens
sequentially one after the other. This
paper fixed that by introducing a new
architecture called the transformer that
throws out sequential reading entirely.
Instead, it lets every word look at
every other word at once and decide
what's relevant. Not only does this make
large language models feel more
intelligent, but transformers also scale
better as well. Google made the big
mistake of giving this architecture away
for free, and now every AI lab uses it,
and that's where you get the T in
ChatGPT. Speaking of which, that brings us
to a paper released by OpenAI in 2020,
"Language Models are Few-Shot Learners."
Basically, OpenAI takes the transformer
and then asks the dumbest question
possible. What if we just make it
enormous? Not two times bigger, but
scale it to 175 billion parameters and
feed it the entire internet as a data
set. They made a crazy bet that
intelligence isn't some secret algorithm
we're missing, but rather it simply
emerges once you cross a threshold of
scale. The end result was GPT-3, the
model that ignited the current AI bubble
that we're living through right now.
What's crazy is that all of a sudden,
this model could translate, summarize,
and write code without ever being
specifically told how to do these things.
At such a large scale, it learned how to
generalize these things on the fly. Two
years later, this paper would evolve
into ChatGPT, which today is now a
trillion-dollar product. But when you
think about it, what is ChatGPT even
doing? Well, it's just predicting the
next word or token just like Claude
Shannon was doing in 1948. So, here's
the TL;DR for the last 100 years. Alan
Turing defined the machine. Claude
Shannon gave it currency. Rosenblatt
gave it a neuron. Geoffrey Hinton taught
it how to learn. Google gave it data and
an architecture. And OpenAI just turned
the dial to the maximum. This has been
the history of artificial intelligence
in 10 scientific papers. Thanks for
watching and I will see you in the next
