[0:00] The year was 1936. Alan Turing asked a
[0:03] simple question. Can machines think?
[0:05] Actually, no, that's not right. What he
[0:07] really asked was something way more
[0:08] boring. Can every mathematical
[0:10] problem be solved by an algorithm?
[0:12] Surprisingly, he proved the answer is
[0:14] no. But in the process, he accidentally
[0:16] invented the computer. Then 12 years
[0:18] later, in 1948, another legend shows up
[0:21] named Claude Shannon, and he reduced all
[0:23] human communication down to ones and
[0:25] zeros, casually inventing the bit like
[0:27] it was no big deal. One thing led to
[0:29] another, and now in 2026, we have
[0:31] 18-year-olds in hoodies typing import
[0:34] torch into Python files and cashing
[0:36] billion-dollar checks from venture
[0:37] capitalist boomers. But reaching this
[0:39] point has been underpinned by a
[0:41] century-long chain reaction of computer science
[0:43] papers written mostly by dead people
[0:45] much smarter than us. In today's video,
[0:47] we'll look at 10 of the most important
[0:49] scientific papers in the history of
[0:51] computer science and how they changed
[0:53] the world for better or worse. Our story
[0:55] begins nearly a century ago when
[0:57] mathematician David Hilbert asked the
[0:59] field's biggest flex of a question. Is
[1:01] there a universal algorithm that can
[1:03] decide whether any mathematical
[1:05] statement is true? Or in other words,
[1:07] can we automate math itself? He called
[1:09] this the Entscheidungsproblem, which is
[1:10] German for decision problem. By 1936,
[1:13] Alan Turing comes around and gives a
[1:15] brutal answer to this question. No. But
[1:18] in order to prove it, he wrote this
[1:19] paper on computable numbers that had to
[1:22] define what an algorithm even is. And so
[1:24] he imagined a hypothetical machine with
[1:27] an infinite tape, a read/write head, and
[1:29] a tiny table of rules. This Turing
[1:31] machine is the abstract blueprint for
[1:33] every computing device you've ever
[1:35] owned. Once created, he asked it to
[1:37] solve the halting problem. Can you write
[1:39] a program that looks at any other
[1:41] program and tells you if it'll finish
[1:42] running or loop forever? Turing proved
[1:45] that it's impossible for a program like
[1:47] this to exist. It simply leads to a
[1:49] logical contradiction, which means math
[1:51] has problems that no algorithm can
[1:53] solve. That's annoying. But 12 years
[1:55] later, a guy named Claude Shannon would
[1:57] ask his own annoying question. What is
[1:59] information as a thing you can measure?
[2:01] In his paper, A Mathematical Theory of
[2:03] Communication, he rips out the meaning
[2:05] from normal words entirely. "I love you"
[2:08] and "the cat is on fire" carry the same
[2:10] information if they're equally
[2:11] surprising. And he measures that
[2:13] surprise in a unit called the bit. He
[2:15] proved that all information could
[2:17] ultimately be boiled down to a stream of
[2:19] ones and zeros. But here's the crazy
[2:20] part. To estimate how much information
[2:22] was needed to transmit a message, he
[2:24] borrowed a word from thermodynamics
[2:26] nobody understands. Entropy. To estimate
[2:28] entropy of English, Shannon made people
[2:31] guess the next letter in a sentence.
[2:32] When a letter is easy to guess, it has
[2:34] low entropy. When a letter is hard to
[2:36] guess, it has high entropy. But wait a
[2:38] minute. Guessing the next
[2:40] token is exactly what AI does today,
[2:43] just on a much bigger scale. Shannon
[2:45] wasn't trying to build artificial
[2:46] intelligence, but he gave us the math
[2:48] for uncertainty, prediction, and
[2:50] compression and accidentally wrote the
[2:52] spiritual ancestor to the loss function.
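The link between "surprise" and bits can be sketched in a few lines of Python. This is a minimal illustration, not Shannon's guessing experiment: the function names are made up for this example, and `letter_entropy` only counts single-letter frequencies, which ignores the context a human guesser would use. The surprisal of the token that actually occurred is the quantity a next-token cross-entropy loss averages.

```python
import math
from collections import Counter


def surprisal_bits(p):
    """Surprise of one outcome with probability p, in bits: -log2(p)."""
    return -math.log2(p)


def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), the average surprise in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def letter_entropy(text):
    """Rough per-letter entropy from single-letter frequencies (assumption:
    letters are treated as independent, which real English is not)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return entropy_bits(c / total for c in counts.values())


if __name__ == "__main__":
    # A fair coin flip carries exactly one bit.
    print(entropy_bits([0.5, 0.5]))
    # A rarer outcome is more surprising, so it carries more bits.
    print(surprisal_bits(0.25))
    print(letter_entropy("the cat is on fire"))
```

An easy-to-guess letter has a probability near 1 and a surprisal near 0 bits; a hard-to-guess letter has a small probability and a large surprisal.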
[2:54] And that's exactly why Anthropic named
[2:56] their AI model Claude. Then 10 years
[2:58] later at Cornell, a psychologist, not a
[3:01] computer scientist, builds the first
[3:03] machine that actually learns. He gets
[3:05] inspired by the way neurons work in the
[3:06] brain. So he designs a thing called a
[3:08] perceptron that takes inputs, weighs
[3:11] them, and then adjusts those weights
[3:12] when it's wrong until it can classify
[3:14] patterns on its own. It's the building
[3:16] block for modern neural networks, and
[3:18] the hype is immediate and unhinged. The
[3:20] Navy funds it, and the New York Times
[3:22] reports that the computer will soon be
[3:23] conscious, but 11 years later, the hype
[3:25] would die out completely, thanks to two
[3:27] haters at MIT, who published another
[3:30] paper with a completely different vibe.
[3:32] With basic math, they prove that a
[3:34] single-layer perceptron can't even learn
[3:36] exclusive or, which is just trivial
[3:38] logic that means this or that, but not
[3:40] both. This paper, or technically a book,
[3:43] was essentially a death certificate for
[3:45] AI at the time. Funding evaporated, and
[3:47] deep neural networks entered their first
[3:49] AI winter, but there was a twist buried
[3:51] in the fine print. They actually figured
[3:53] out that stacking layers of perceptrons
[3:55] fixes everything. The only problem is
[3:57] that back then, nobody knew how to train
[3:59] a stack of perceptrons. It would take
[4:01] another 17 years to figure it out. But
[4:03] first, we need to talk about Time,
[4:04] Clocks, and the Ordering of Events in a
[4:06] Distributed System by Leslie Lamport,
[4:09] because neural networks are useless
[4:10] unless you can run them on a massive
[4:12] scale. This paper realized that separate
[4:14] computers with no shared clock can't
[4:16] really have a universal "now." And
[4:18] that's a big problem when you have
[4:20] multiple computers in a distributed
[4:21] system trying to do things in order.
[4:23] Well, he figured out a way to fix this
[4:25] with the happened-before relation. You
[4:27] stop trusting the wall clock time and
[4:29] order events by causality instead. If A
[4:32] could have caused B, A comes first. From
[4:34] that, he builds logical clocks which
[4:36] allow an unlimited number of machines to
[4:38] stay in agreement without ever looking
[4:40] at a real clock. Eventually, this paper
[4:42] would become the bedrock for every
[4:44] database, blockchain, and every massive
[4:46] AI training run because you need
[4:48] thousands of GPUs that constantly stay
[4:50] in sync and agree on state without
[4:52] dissolving into chaos. That was a
[4:54] game changer. But 17 years after neural
[4:56] networks were left for dead, three
[4:58] researchers, including the godfather
[5:00] Geoffrey Hinton, answered the question
[5:02] that everyone gave up on. How do you
[5:04] train a stack of layers?
[6:03] The answer is back
[6:05] propagation. Run your data forward,
[6:07] measure how wrong the output is, and
[6:09] then push that error backward through
[6:10] every layer using the chain rule from
[6:12] calculus to nudge each weight in the
[6:14] direction that's a little less wrong. Do
[6:16] that a few million times and the network
[6:18] teaches itself. The crazy discovery
[6:20] though is that the middle hidden layers
[6:22] started inventing their own features.
[6:24] Edges, shapes, and concepts that nobody
[6:26] programmed in. That exclusive or problem
[6:29] that was impossible 17 years ago
[6:31] just became trivial. Back propagation is
[6:33] still essential to neural networks
[6:35] today, but back then they sucked because
[6:37] we didn't have enough data or compute.
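The forward pass, error measurement, and backward pass described above can be sketched in plain Python on the exclusive or problem. This is a toy illustration under stated assumptions, not the method from the 1986 paper: the network size (2 inputs, 3 hidden units, 1 output), the sigmoid activations, the squared-error loss, the learning rate, and all function names are choices made for this example. Whether it fully solves exclusive or depends on the seed, learning rate, and number of epochs.

```python
import math
import random

# The four exclusive or cases: (inputs, target).
XOR_DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(hidden=3, seed=0):
    """Small random weights for a 2 -> hidden -> 1 network (sizes are illustrative)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Run the data forward: returns the hidden activations and the output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["w1"], params["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(params["w2"], h)) + params["b2"])
    return h, y


def mean_loss(params):
    """Measure how wrong the output is: mean of 0.5 * (y - target)^2."""
    return sum(0.5 * (forward(params, x)[1] - t) ** 2
               for x, t in XOR_DATA) / len(XOR_DATA)


def train_step(params, lr=0.1):
    """One full-batch gradient descent step, with gradients from the chain rule."""
    hidden = len(params["w2"])
    n = len(XOR_DATA)
    g_w1 = [[0.0, 0.0] for _ in range(hidden)]
    g_b1 = [0.0] * hidden
    g_w2 = [0.0] * hidden
    g_b2 = 0.0
    for x, t in XOR_DATA:
        h, y = forward(params, x)
        # Error at the output, pushed back through the output sigmoid.
        d_out = (y - t) * y * (1.0 - y) / n
        g_b2 += d_out
        for j in range(hidden):
            g_w2[j] += d_out * h[j]
            # Error pushed one layer further back, through hidden unit j.
            d_hid = d_out * params["w2"][j] * h[j] * (1.0 - h[j])
            g_b1[j] += d_hid
            for i in range(2):
                g_w1[j][i] += d_hid * x[i]
    # Nudge each weight a little in the direction that lowers the loss.
    params["b2"] -= lr * g_b2
    for j in range(hidden):
        params["w2"][j] -= lr * g_w2[j]
        params["b1"][j] -= lr * g_b1[j]
        for i in range(2):
            params["w1"][j][i] -= lr * g_w1[j][i]


def train(params, epochs=500, lr=0.1):
    for _ in range(epochs):
        train_step(params, lr)
    return params


if __name__ == "__main__":
    params = init_params()
    print("loss before:", mean_loss(params))
    train(params, epochs=5000, lr=0.5)
    print("loss after: ", mean_loss(params))
```

The hidden-layer gradient `d_hid` is the output error multiplied by each derivative on the path back to that unit, which is the chain rule applied one layer at a time.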
[6:39] Well, that was about to change in 1998
[6:41] with the rise of the internet and this
[6:43] famous paper from Larry and Sergey about
[6:45] the anatomy of a large-scale web search
[6:47] engine. The paper describes the
[6:49] PageRank algorithm, where instead of ranking
[6:51] a web page by how often a word appears,
[6:53] it treats every link as a vote and each
[6:56] vote is weighted by how trustworthy the
[6:58] voter is. They built a prototype in
[6:59] their dorm room which eventually became
[7:01] a company called Google that you may
[7:03] have heard of. Most importantly though,
[7:04] this algorithm helped assemble the
[7:06] largest structured pile of human text
[7:08] ever created. And that massive pile of
[7:10] text would eventually become the
[7:11] training data or feedstock for future
[7:14] AI models. We'd finally see this in
[7:15] action in 2012 with a legendary ImageNet
[7:18] paper, created by a dream team of
[7:20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey
[7:23] Hinton. Remember when I said back
[7:25] propagation needs data and compute?
[7:27] Well, finally the stars aligned. The
[7:29] data set is called ImageNet and it's a
[7:31] monster data set of millions of
[7:32] hand-labeled photos, while the compute is
[7:34] a couple of Nvidia consumer-grade gaming
[7:37] GPUs. A grad student named Alex wires up
[7:40] a deep convolutional neural network,
[7:42] names it AlexNet, and trains it in his
[7:44] bedroom. Then he walks it into the
[7:45] annual ImageNet contest and humiliates
[7:48] everyone. This is a contest where AI
[7:50] models try to classify objects in an
[7:52] image like hot dog or not hot dog. And
[7:55] while everyone was fighting over a
[7:56] fraction of a percent, AlexNet walked
[7:58] in and dropped the error rate by 10
[8:00] points in a single year. And this
[8:02] freaked everyone out because it was
[8:03] suddenly clear that deep learning
[8:05] actually works. It just needs more data,
[8:07] more compute, and the right
[8:08] architecture. Luckily, we would get that
[8:10] architecture a few years later thanks to
[8:12] Ashish Vaswani and Google in the paper
[8:15] Attention Is All You Need. Around this
[8:17] time, large language models had a huge
[8:19] problem. They would start a sentence and
[8:21] by the end they would forget what they
[8:22] were even talking about. That's because
[8:24] they would read and predict tokens
[8:25] sequentially one after the other. This
[8:27] paper fixed that by introducing a new
[8:29] architecture called the transformer that
[8:31] throws out sequential reading entirely.
[8:34] Instead, it lets every word look at
[8:36] every other word at once and decide
[8:38] what's relevant. Not only does this make
[8:40] large language models feel more
[8:41] intelligent, but transformers also scale
[8:43] better as well. Google made the big
[8:45] mistake of giving this architecture away
[8:47] for free, and now every AI lab uses it,
[8:49] and that's where you get the T in
[8:51] ChatGPT. Speaking of which, that brings us
[8:54] to a paper released by OpenAI in 2020:
[8:56] Language Models are Few-Shot Learners.
[8:59] Basically, OpenAI takes the transformer
[9:01] and then asks the dumbest question
[9:03] possible. What if we just make it
[9:04] enormous? Not two times bigger, but
[9:07] scale it to 175 billion parameters and
[9:10] feed it the entire internet as a data
[9:11] set. They made a crazy bet that
[9:13] intelligence isn't some secret algorithm
[9:15] we're missing, but rather it simply
[9:17] emerges once you cross a threshold of
[9:19] scale. The end result was GPT-3, the
[9:22] model that ignited the current AI bubble
[9:24] that we're living through right now.
[9:25] What's crazy is that all of a sudden,
[9:27] this model could translate, summarize,
[9:29] and write code without ever being
[9:31] specifically told how to do these things.
[9:33] At such a large scale, it learned how to
[9:35] generalize these things on the fly. Two
[9:37] years later, this paper would evolve
[9:38] into ChatGPT, which is now a
[9:41] trillion-dollar product. But when you
[9:43] think about it, what is ChatGPT even
[9:45] doing? Well, it's just predicting the
[9:46] next word or token just like Claude
[9:48] Shannon was doing in 1948. So, here's
[9:51] the TL;DR for the last 100 years. Alan
[9:53] Turing defined the machine. Claude
[9:55] Shannon gave it currency. Rosenblatt
[9:57] gave it a neuron. Geoffrey Hinton taught
[9:59] it how to learn. Google gave it data and
[10:01] an architecture. And OpenAI just turned
[10:03] the dial to the maximum. This has been
[10:05] the history of artificial intelligence
[10:07] in 10 scientific papers. Thanks for
[10:09] watching and I will see you in the next