The paper that invented the computer by accident
[00:00] The year was 1936. Alan Turing asked a
[00:03] simple question: can machines think?
[00:05] Actually, no, that's not right. What he
[00:07] really asked was something way more
[00:08] boring: can every mathematical
[00:10] problem be solved by an algorithm?
[00:12] Surprisingly, he proved the answer is
[00:14] no. But in the process, he accidentally
[00:16] invented the computer. Then 12 years
[00:18] later, in 1948, another legend shows up
[00:21] named Claude Shannon, and he reduced all
[00:23] human communication down to ones and
[00:25] zeros, casually inventing the bit like
[00:27] it was no big deal. One thing led to
[00:29] another, and now in 2026, we have
[00:31] 18-year-olds in hoodies typing import
[00:34] torch into Python files and cashing
[00:36] billion-dollar checks from venture
[00:37] capitalist boomers. But reaching this
[00:39] point took a century-long
[00:41] chain reaction of computer science
[00:43] papers written mostly by dead people
[00:45] much smarter than us. In today's video,
[00:47] we'll look at 10 of the most important
[00:49] scientific papers in the history of
[00:51] computer science and how they changed
[00:53] the world for better or worse. Our story
[00:55] begins nearly a century ago when
[00:57] mathematician David Hilbert asked the
[00:59] field's biggest flex of a question. Is
[01:01] there a universal algorithm that can
[01:03] decide whether any mathematical
[01:05] statement is true? Or in other words,
[01:07] can we automate math itself? He called
[01:09] this the Entscheidungsproblem, which is
[01:10] German for decision problem. By 1936,
[01:13] Alan Turing comes around and gives a
[01:15] brutal answer to this question. No. But
[01:18] in order to prove it, he wrote this
[01:19] paper, On Computable Numbers, that had to
[01:22] define what an algorithm even is. And so
[01:24] he imagined a hypothetical machine with
[01:27] an infinite tape, a read/write head, and
[01:29] a tiny table of rules. This Turing
[01:31] machine is the abstract blueprint for
[01:33] every computing device you've ever
[01:35] owned. Once he had it, he asked it to
[01:37] solve the halting problem. Can you write
[01:39] a program that looks at any other
[01:41] program and tells you if it'll finish
[01:42] running or loop forever? Turing proved
[01:45] that it's impossible for a program like
[01:47] this to exist. It simply leads to a
[01:49] logical contradiction, which means math
[01:51] has problems that no algorithm can
[01:53] solve. That's annoying. But 12 years
[01:55] later, a guy named Claude Shannon would
[01:57] ask his own annoying question. What is
[01:59] information as a thing you can measure?
[02:01] In his paper, A Mathematical Theory of
[02:03] Communication, he rips out the meaning
[02:05] from normal words entirely. "I love you"
[02:08] and "the cat is on fire" carry the same
[02:10] information if they're equally
[02:11] surprising. And he measures that
[02:13] surprise in a unit called the bit. He
[02:15] proved that all information could
[02:17] ultimately be boiled down to a stream of
[02:19] ones and zeros. But here's the crazy
[02:20] part. To estimate how much information
[02:22] was needed to transmit a message, he
[02:24] borrowed a word from thermodynamics
[02:26] nobody understands: entropy. To estimate
[02:28] the entropy of English, Shannon made people
[02:31] guess the next letter in a sentence.
[02:32] When a letter is easy to guess, it has
[02:34] low entropy. When a letter is hard to
[02:36] guess, it has high entropy. But wait a
[02:38] minute. Guessing the next
[02:40] token is exactly what AI does today,
[02:43] just on a much bigger scale. Shannon
[02:45] wasn't trying to build artificial
[02:46] intelligence, but he gave us the math
[02:48] for uncertainty, prediction, and
[02:50] compression and accidentally wrote the
[02:52] spiritual ancestor to the loss function.
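The link between surprise, entropy, and the loss function can be sketched in a few lines. This is a minimal illustration, not Shannon's own procedure: the function names are made up for this example, and the next-letter probabilities are invented rather than measured from English.

```python
import math

def surprisal_bits(p):
    """Information, in bits, carried by an outcome that had probability p."""
    return -math.log2(p)

def entropy_bits(probs):
    """Average surprisal of a distribution: Shannon entropy in bits."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)

def cross_entropy_bits(true_probs, predicted_probs):
    """Average bits paid when outcomes follow true_probs but we bet on predicted_probs."""
    return sum(
        t * surprisal_bits(q)
        for t, q in zip(true_probs, predicted_probs)
        if t > 0
    )

# Hypothetical next-letter distributions (illustrative numbers only).
easy_guess = [0.97, 0.01, 0.01, 0.01]  # one letter is almost certain
hard_guess = [0.25, 0.25, 0.25, 0.25]  # four letters are equally likely

print("easy letter entropy:", entropy_bits(easy_guess))
print("hard letter entropy:", entropy_bits(hard_guess))

# Suppose the first letter is the one that actually comes next.
actual_next = [1.0, 0.0, 0.0, 0.0]
print("loss with confident guess:", cross_entropy_bits(actual_next, easy_guess))
print("loss with uniform guess:", cross_entropy_bits(actual_next, hard_guess))
```

The last two lines are the same quantity a language model is trained to minimize for next-token prediction, except that training code usually measures it with the natural logarithm instead of base 2.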
[02:54] And that's reportedly why Anthropic named
[02:56] their AI model Claude. Then 10 years
[02:58] later at Cornell, a psychologist, not a
[03:01] computer scientist, builds the first
[03:03] machine that actually learns. He gets
[03:05] inspired by the way neurons work in the
[03:06] brain. So he designs a thing called a
[03:08] perceptron that takes inputs, weighs
[03:11] them, and then adjusts those weights
[03:12] when it's wrong until it can classify
[03:14] patterns on its own. It's the building
[03:16] block for modern neural networks, and
[03:18] the hype is immediate and unhinged. The
[03:20] Navy funds it, and the New York Times
[03:22] reports that the computer will soon be
[03:23] conscious. But 11 years later, the hype
[03:25] would die out completely, thanks to two
[03:27] haters at MIT who published another
[03:30] paper with a completely different vibe.
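The perceptron's learning loop (weigh the inputs, then adjust the weights when the answer is wrong) can be sketched as follows. This is a simplified illustration, not the original hardware or paper: the function names, the step threshold at zero, and the learning rate of 1 are assumptions made for the example. It is run on AND, which a single straight line can separate, and on XOR, which no single line can.

```python
def predict(weights, bias, x):
    """Weigh the inputs and fire (1) only if the weighted sum is positive."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=25, lr=1):
    """Classic perceptron rule: nudge the weights only when the guess is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

def accuracy(samples, weights, bias):
    """Fraction of samples classified correctly."""
    hits = sum(predict(weights, bias, x) == t for x, t in samples)
    return hits / len(samples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for name, data in (("AND", AND), ("XOR", XOR)):
    w, b = train_perceptron(data)
    print(name, "weights:", w, "bias:", b, "accuracy:", accuracy(data, w, b))
```

However many epochs are used, the XOR run cannot reach full accuracy, because a single weighted sum draws only one straight boundary between the two classes.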
[03:32] With basic math, they prove that a
[03:34] single-layer perceptron can't even learn
[03:36] exclusive or (XOR), which is just trivial
[03:38] logic that means this or that, but not
[03:40] both. This paper, or technically a book,
[03:43] was essentially a death certificate for
[03:45] AI at the time. Funding evaporated, and
[03:47] neural networks entered their first
[03:49] AI winter. But there was a twist buried
[03:51] in the fine print. They actually figured
[03:53] out that stacking layers of perceptrons
[03:55] fixes everything. The only problem is
[03:57] that back then, nobody knew how to train
[03:59] a stack of perceptrons. It would take
[04:01] another 17 years to figure it out. But
[04:03] first, we need to talk about Time,
[04:04] Clocks, and the Ordering of Events in a
[04:06] Distributed System by Leslie Lamport,
[04:09] because neural networks are useless
[04:10] unless you can run them on a massive
[04:12] scale. This paper realized that separate
[04:14] computers with no shared clock can't
[04:16] really have a universal "now." And
[04:18] that's a big problem when you have
[04:20] multiple computers in a distributed
[04:21] system trying to do things in order.
[04:23] Well, he figured out a way to fix this
[04:25] with the happened-before relation. You
[04:27] stop trusting the wall clock time and
[04:29] order events by causality instead. If A
[04:32] could have caused B, A comes first. From
[04:34] that, he builds logical clocks which
[04:36] allow any number of machines to
[04:38] agree on the order of events without ever looking
[04:40] at a real clock. Eventually, this paper
[04:42] would become the bedrock for every
[04:44] distributed database, blockchain, and every massive
[04:46] AI training run because you need
[04:48] thousands of GPUs that constantly stay
[04:50] in sync and agree on state without
[04:52] dissolving into chaos. That was a
[04:54] game changer. But 17 years after neural
[04:56] networks were left for dead, three
[04:58] researchers, including the godfather
[05:00] Geoffrey Hinton, answered the question
[05:02] that everyone gave up on. How do you
[05:04] train a stack of layers? But before we
[05:06] answer that, we need to quickly talk
[05:08] about Coder, who was cool enough to
[05:09] sponsor this 10-minute video on esoteric
[05:12] computer science papers. They provide
[05:14] self-hosted development environments
[05:16] that let you work with multiple agents
[05:17] in parallel and with enterprise level
[05:20] security. And they just launched Coder
[05:21] Agents, a chat interface and API for
[05:24] delegating coding jobs to agents running
[05:26] on your own infrastructure. It's the
[05:28] only architecture that lets
[05:29] organizations self-host both the agent
[05:31] workflow and the development
[05:33] environments where the code is actually
[05:34] executed. This gives teams greater
[05:36] control over source code access, agent
[05:39] execution, governance, and security
[05:41] boundaries. It's also model agnostic, so
[05:43] you can connect any LLM you want and
[05:46] switch between them with just a config
[05:47] change. Coder Agents are designed for
[05:49] teams in regulated industries who need
[05:51] to self-host their AI workflows with
[05:53] complete control, and they're already
[05:55] used by dozens of financial institutions
[05:57] and government organizations. And you
[05:59] can check it out at the link below. Now,
[06:01] back to the question, how do you train a
[06:03] stack of layers? The answer is
[06:05] backpropagation. Run your data forward,
[06:07] measure how wrong the output is, and
[06:09] then push that error backward through
[06:10] every layer using the chain rule from
[06:12] calculus to nudge each weight in the
[06:14] direction that's a little less wrong. Do
[06:16] that a few million times and the network
[06:18] teaches itself. The crazy discovery
[06:20] though is that the middle hidden layers
[06:22] started inventing their own features.
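The forward pass, error measurement, and backward pass described above can be sketched on the exclusive or problem. This is a toy illustration under stated assumptions, not the setup from the original paper: a 2-3-1 network with sigmoid units, a squared-error loss, plain gradient descent, and helper names invented for this example. Whether it fully learns XOR depends on the random seed and the number of steps.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(hidden=3, seed=0):
    """Small random weights for one hidden layer and one output unit."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Run the data forward: inputs -> hidden activations -> output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hi for w, hi in zip(p["W2"], h)) + p["b2"])
    return h, y

def loss(p, data=XOR):
    """Measure how wrong the output is: mean of half squared errors."""
    return sum(0.5 * (forward(p, x)[1] - t) ** 2 for x, t in data) / len(data)

def gradients(p, data=XOR):
    """Push the error backward through each layer with the chain rule."""
    hidden = len(p["W2"])
    g = {"W1": [[0.0, 0.0] for _ in range(hidden)], "b1": [0.0] * hidden,
         "W2": [0.0] * hidden, "b2": 0.0}
    for x, t in data:
        h, y = forward(p, x)
        # Output unit: dLoss/dz = (y - t) * sigmoid'(z), averaged over the data.
        d_out = (y - t) * y * (1.0 - y) / len(data)
        g["b2"] += d_out
        for j in range(hidden):
            g["W2"][j] += d_out * h[j]
            # Hidden unit j: error flows back through W2[j] and its own sigmoid.
            d_hid = d_out * p["W2"][j] * h[j] * (1.0 - h[j])
            g["b1"][j] += d_hid
            for i in range(2):
                g["W1"][j][i] += d_hid * x[i]
    return g

def train(p, steps=5000, lr=0.5):
    """Nudge every weight a little in the less-wrong direction, many times."""
    before = loss(p)
    for _ in range(steps):
        g = gradients(p)
        for j in range(len(p["W2"])):
            p["W2"][j] -= lr * g["W2"][j]
            p["b1"][j] -= lr * g["b1"][j]
            for i in range(2):
                p["W1"][j][i] -= lr * g["W1"][j][i]
        p["b2"] -= lr * g["b2"]
    return before, loss(p)

params = init_params()
print("loss before and after training:", train(params))
for x, t in XOR:
    print(x, "target", t, "output", round(forward(params, x)[1], 3))
```

The hidden layer is what a single perceptron lacks: each hidden unit learns its own intermediate feature, and the output unit combines them.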
[06:24] Edges, shapes, and concepts that nobody
[06:26] programmed in. That exclusive or problem
[06:29] that was impossible 17 years ago
[06:31] just became trivial. Backpropagation is
[06:33] still essential to neural networks
[06:35] today, but back then they sucked because
[06:37] we didn't have enough data or compute.
[06:39] Well, that was about to change in 1998
[06:41] with the rise of the internet and this
[06:43] famous paper from Larry and Sergey,
[06:45] The Anatomy of a Large-Scale Hypertextual Web
[06:47] Search Engine. The paper describes the
[06:49] PageRank algorithm where instead of ranking
[06:51] a web page by how often a word appears,
[06:53] it treats every link as a vote and each
[06:56] vote is weighted by how trustworthy the
[06:58] voter is. They built a prototype in
[06:59] their dorm room which eventually became
[07:01] a company called Google that you may
[07:03] have heard of. Most importantly though,
[07:04] this algorithm helped assemble the
[07:06] largest structured pile of human text
[07:08] ever created. And that massive pile of
[07:10] text would eventually become the
[07:11] training data or feedstock for future
[07:14] AI models. We'd finally see this in
[07:15] action in 2012 with a legendary ImageNet
[07:18] paper created by a dream team of
[07:20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey
[07:23] Hinton. Remember when I said
[07:25] backpropagation needs data and compute?
[07:27] Well, finally the stars aligned. The
[07:29] dataset is called ImageNet, and it's a
[07:31] monster dataset of millions of
[07:32] hand-labeled photos. The compute is
[07:34] a couple of Nvidia consumer-grade gaming
[07:37] GPUs. A grad student named Alex wires up
[07:40] a deep convolutional neural network,
[07:42] names it AlexNet, and trains it in his
[07:44] bedroom. Then he walks it into the
[07:45] annual ImageNet contest and humiliates
[07:48] everyone. This is a contest where AI
[07:50] models try to classify objects in an
[07:52] image like hot dog or not hot dog. And
[07:55] while everyone was fighting over a
[07:56] fraction of a percent, AlexNet walked
[07:58] in and dropped the error rate by about 10
[08:00] points in a single year. And this
[08:02] freaked everyone out because it was
[08:03] suddenly clear that deep learning
[08:05] actually works. It just needs more data,
[08:07] more compute, and the right
[08:08] architecture. Luckily, we would get that
[08:10] architecture a few years later thanks to
[08:12] Ashish Vaswani and Google in the paper
[08:15] Attention Is All You Need. Around this
[08:17] time, language models had a huge
[08:19] problem. They would start a sentence and
[08:21] by the end they would forget what they
[08:22] were even talking about. That's because
[08:24] they would read and predict tokens
[08:25] sequentially, one after the other. This
[08:27] paper fixed that by introducing a new
[08:29] architecture called the transformer that
[08:31] throws out sequential reading entirely.
[08:34] Instead, it lets every word look at
[08:36] every other word at once and decide
[08:38] what's relevant. Not only does this make
[08:40] large language models feel more
[08:41] intelligent, but transformers also scale
[08:43] better. Google made the big
[08:45] mistake of giving this architecture away
[08:47] for free, and now every AI lab uses it,
[08:49] and that's where you get the T in
[08:51] ChatGPT. Speaking of which, that brings us
[08:54] to a paper released by OpenAI in 2020:
[08:56] Language Models are Few-Shot Learners.
[08:59] Basically, OpenAI takes the transformer
[09:01] and then asks the dumbest question
[09:03] possible. What if we just make it
[09:04] enormous? Not two times bigger, but
[09:07] scale it to 175 billion parameters and
[09:10] feed it a huge chunk of the internet as a
[09:11] dataset. They made a crazy bet that
[09:13] intelligence isn't some secret algorithm
[09:15] we're missing, but rather it simply
[09:17] emerges once you cross a threshold of
[09:19] scale. The end result was GPT-3, the
[09:22] model that ignited the current AI bubble
[09:24] that we're living through right now.
[09:25] What's crazy is that all of a sudden,
[09:27] this model could translate, summarize,
[09:29] and write code without ever being
[09:31] specifically told how to do these things.
[09:33] At such a large scale, it learned how to
[09:35] generalize on the fly. Two
[09:37] years later, this paper would evolve
[09:38] into ChatGPT, which is now a
[09:41] trillion-dollar product. But when you
[09:43] think about it, what is ChatGPT even
[09:45] doing? Well, it's just predicting the
[09:46] next word or token just like Claude
[09:48] Shannon was doing in 1948. So, here's
[09:51] the TL;DR for the last 100 years. Alan
[09:53] Turing defined the machine. Claude
[09:55] Shannon gave it currency. Rosenblatt
[09:57] gave it a neuron. Geoffrey Hinton taught
[09:59] it how to learn. Google gave it data and
[10:01] an architecture. And OpenAI just turned
[10:03] the dial to the maximum. This has been
[10:05] the history of artificial intelligence
[10:07] in 10 scientific papers. Thanks for
[10:09] watching and I will see you in the next