[0:00] The year was 1936. Alan Turing asked a [0:03] simple question. Can machines think? [0:05] Actually, no, that's not right. What he [0:07] really asked was something way more [0:08] boring. Can every mathematical [0:10] problem be solved by an algorithm? [0:12] Surprisingly, he proved the answer is [0:14] no. But in the process, he accidentally [0:16] invented the computer. Then 12 years [0:18] later, in 1948, another legend shows up [0:21] named Claude Shannon, and he reduced all [0:23] human communication down to ones and [0:25] zeros, casually inventing the bit like [0:27] it was no big deal. One thing led to [0:29] another, and now in 2026, we have [0:31] 18-year-olds in hoodies typing import [0:34] torch into Python files and cashing [0:36] billion-dollar checks from venture [0:37] capitalist boomers. But reaching this [0:39] point has been underpinned by a century [0:41] long chain reaction of computer science [0:43] papers written mostly by dead people [0:45] much smarter than us. In today's video, [0:47] we'll look at 10 of the most important [0:49] scientific papers in the history of [0:51] computer science and how they changed [0:53] the world for better or worse. Our story [0:55] begins nearly a century ago when [0:57] mathematician David Hilbert asked the [0:59] field's biggest flex of a question. Is [1:01] there a universal algorithm that can [1:03] decide whether any mathematical [1:05] statement is true? Or in other words, [1:07] can we automate math itself? He called [1:09] this the Entscheidungsproblem, which is [1:10] German for decision problem. By 1936, [1:13] Alan Turing comes around and gives a [1:15] brutal answer to this question. No. But [1:18] in order to prove it, he wrote this [1:19] paper, On Computable Numbers, that had to [1:22] define what an algorithm even is. And so [1:24] he imagined a hypothetical machine with [1:27] an infinite tape, a read/write head, and [1:29] a tiny table of rules.
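To make the tape, head, and rule table concrete, here is a minimal sketch of such a machine in Python. It is an illustration, not Turing's own notation: the function name `run_turing_machine`, the state names `start` and `halt`, the `_` blank symbol, and the bit-flipping rule table `FLIP_RULES` are all assumptions made up for this example. The `max_steps` cap is also an assumption, added so that a rule table that never halts cannot hang the simulator.

```python
from collections import defaultdict


def run_turing_machine(rules, tape, state="start", blank="_", max_steps=1000):
    """Simulate a one-tape machine.

    rules maps (state, symbol) -> (symbol_to_write, head_move, next_state),
    where head_move is -1 (left) or +1 (right). The machine stops when it
    reaches the state "halt" or after max_steps steps.
    """
    # The "infinite" tape: any cell that was never written reads as blank.
    cells = defaultdict(lambda: blank, enumerate(tape))
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells[head]                       # read
        new_symbol, move, state = rules[(state, symbol)]
        cells[head] = new_symbol                   # write
        head += move                               # move the head
    if not cells:
        return "", state
    lo, hi = min(cells), max(cells)
    contents = "".join(cells[i] for i in range(lo, hi + 1))
    return contents.strip(blank), state


# Hypothetical rule table: flip every bit, halt at the first blank cell.
FLIP_RULES = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
    ("start", "_"): ("_", +1, "halt"),
}

print(run_turing_machine(FLIP_RULES, "1011"))  # ('0100', 'halt')
```

The whole "program" lives in the rule table; the simulator loop never changes, which is the sense in which one simple machine design can run any algorithm you can write down as rules.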
This Turing [1:31] machine is the abstract blueprint for [1:33] every computing device you've ever [1:35] owned. Once created, he asked it to [1:37] solve the halting problem. Can you write [1:39] a program that looks at any other [1:41] program and tells you if it'll finish [1:42] running or loop forever? Turing proved [1:45] that it's impossible for a program like [1:47] this to exist. It simply leads to a [1:49] logical contradiction, which means math [1:51] has problems that no algorithm can [1:53] solve. That's annoying. But 12 years [1:55] later, a guy named Claude Shannon would [1:57] ask his own annoying question. What is [1:59] information as a thing you can measure? [2:01] In his paper, A Mathematical Theory of [2:03] Communication, he rips out the meaning [2:05] from normal words entirely. "I love you" [2:08] and "the cat is on fire" carry the same [2:10] information if they're equally [2:11] surprising. And he measures that [2:13] surprise in a unit called the bit. He [2:15] proved that all information could [2:17] ultimately be boiled down to a stream of [2:19] ones and zeros. But here's the crazy [2:20] part. To measure how much information [2:22] was needed to transmit a message, he [2:24] borrowed a word from thermodynamics [2:26] nobody understands. Entropy. To estimate [2:28] the entropy of English, Shannon made people [2:31] guess the next letter in a sentence. [2:32] When a letter is easy to guess, it has [2:34] low entropy. When a letter is hard to [2:36] guess, it has high entropy. But wait a [2:38] minute. Guessing the next [2:40] token is exactly what AI does today, [2:43] just on a much bigger scale. Shannon [2:45] wasn't trying to build artificial [2:46] intelligence, but he gave us the math [2:48] for uncertainty, prediction, and [2:50] compression and accidentally wrote the [2:52] spiritual ancestor to the loss function. [2:54] And that's reportedly why Anthropic named [2:56] their AI model Claude.
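Shannon's entropy formula is H = -Σ p·log2(p), summed over the possible symbols, where p is each symbol's probability. As a rough sketch, it can be computed from plain letter frequencies. Note the simplification: this version ignores context entirely, so it is not the guessing experiment described above, which let people use the preceding letters. The function name `entropy_bits` is made up for this example.

```python
import math
from collections import Counter


def entropy_bits(text):
    """Average surprise per symbol, in bits, from symbol frequencies alone."""
    counts = Counter(text)
    total = len(text)
    # Each symbol contributes -p * log2(p): rare symbols are more surprising.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


print(entropy_bits("aaaa"))  # 0.0  (always the same letter: no surprise)
print(entropy_bits("abab"))  # 1.0  (two equally likely letters: one bit each)
print(entropy_bits("abcd"))  # 2.0  (four equally likely letters: two bits each)
```

The average of that same -log2(p) quantity, taken over a model's predicted probability for the true next token, is the cross-entropy loss used to train language models, which is the connection to the loss function mentioned above.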
Then 10 years [2:58] later at Cornell, a psychologist, not a [3:01] computer scientist, builds the first [3:03] machine that actually learns. He gets [3:05] inspired by the way neurons work in the [3:06] brain. So he designs a thing called a [3:08] perceptron that takes inputs, weighs [3:11] them, and then adjusts those weights [3:12] when it's wrong until it can classify [3:14] patterns on its own. It's the building [3:16] block for modern neural networks, and [3:18] the hype is immediate and unhinged. The [3:20] Navy funds it, and the New York Times [3:22] reports that the computer will soon be [3:23] conscious, but 11 years later, the hype [3:25] would die out completely, thanks to two [3:27] haters at MIT, who published another [3:30] paper with a completely different vibe. [3:32] With basic math, they prove that a [3:34] single-layer perceptron can't even learn [3:36] exclusive or, which is just trivial [3:38] logic that means this or that, but not [3:40] both. This paper, or technically a book, [3:43] was essentially a death certificate for [3:45] AI at the time. Funding evaporated, and [3:47] deep neural networks entered their first [3:49] AI winter, but there was a twist buried [3:51] in the fine print. They actually figured [3:53] out that stacking layers of perceptrons [3:55] fixes everything. The only problem is [3:57] that back then, nobody knew how to train [3:59] a stack of perceptrons. It would take [4:01] another 17 years to figure it out. But [4:03] first, we need to talk about Time, [4:04] Clocks, and the Ordering of Events in a [4:06] Distributed System by Leslie Lamport. [4:09] Because neural networks are useless [4:10] unless you can run them on a massive [4:12] scale. This paper realized that separate [4:14] computers with no shared clock can't [4:16] really have a universal "now". And [4:18] that's a big problem when you have [4:20] multiple computers in a distributed [4:21] system trying to do things in order.
[4:23] Well, he figured out a way to fix this [4:25] with the happened-before relation. You [4:27] stop trusting the wall clock time and [4:29] order events by causality instead. If A [4:32] could have caused B, A comes first. From [4:34] that, he builds logical clocks which [4:36] allow an unlimited number of machines to [4:38] stay in agreement without ever looking [4:40] at a real clock. Eventually, this paper [4:42] would become the bedrock for every [4:44] database, blockchain, and every massive [4:46] AI training run because you need [4:48] thousands of GPUs that constantly stay [4:50] in sync and agree on state without [4:52] dissolving into chaos. That was a [4:54] game changer. But 17 years after neural [4:56] networks were left for dead, three [4:58] researchers, including the godfather [5:00] Geoffrey Hinton, answered the question [5:02] that everyone gave up on. How do you [5:04] train a stack of layers? But before we [5:06] answer that, we need to quickly talk [5:08] about Coder, who was cool enough to [5:09] sponsor this 10-minute video on esoteric [5:12] computer science papers. They provide [5:14] self-hosted development environments [5:16] that let you work with multiple agents [5:17] in parallel and with enterprise-level [5:20] security. And they just launched Coder [5:21] Agents, a chat interface and API for [5:24] delegating coding jobs to agents running [5:26] on your own infrastructure. It's the [5:28] only architecture that lets [5:29] organizations self-host both the agent [5:31] workflow and the development [5:33] environments where the code is actually [5:34] executed. This gives teams greater [5:36] control over source code access, agent [5:39] execution, governance, and security [5:41] boundaries. It's also model agnostic, so [5:43] you can connect any LLM you want and [5:46] switch between them with just a config [5:47] change.
Coder Agents are designed for [5:49] teams in regulated industries who need [5:51] to self-host their AI workflows with [5:53] complete control, and they're already [5:55] used by dozens of financial institutions [5:57] and government organizations. And you [5:59] can check it out at the link below. Now, [6:01] back to the question, how do you train a [6:03] stack of layers? The answer is [6:05] backpropagation. Run your data forward, [6:07] measure how wrong the output is, and [6:09] then push that error backward through [6:10] every layer using the chain rule from [6:12] calculus to nudge each weight in the [6:14] direction that's a little less wrong. Do [6:16] that a few million times and the network [6:18] teaches itself. The crazy discovery [6:20] though is that the middle hidden layers [6:22] started inventing their own features. [6:24] Edges, shapes, and concepts that nobody [6:26] programmed in. That exclusive or problem [6:29] that was impossible 17 years ago? It [6:31] just became trivial. Backpropagation is [6:33] still essential to neural networks [6:35] today, but back then they sucked because [6:37] we didn't have enough data or compute. [6:39] Well, that was about to change in 1998 [6:41] with the rise of the internet and this [6:43] famous paper from Larry and Sergey about [6:45] the anatomy of a large-scale web search [6:47] engine. The paper describes the PageRank [6:49] algorithm where instead of ranking [6:51] a web page by how often a word appears, [6:53] it treats every link as a vote and each [6:56] vote is weighted by how trustworthy the [6:58] voter is. They built a prototype in [6:59] their dorm room which eventually became [7:01] a company called Google that you may [7:03] have heard of. Most importantly though, [7:04] this algorithm helped assemble the [7:06] largest structured pile of human text [7:08] ever created.
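The "links as weighted votes" idea can be sketched in a few lines of Python. This is a simplified, normalized variant computed by repeated iteration, not the production system: the function name `pagerank`, the four-page toy graph, and the fixed 50 iterations are assumptions for illustration. The damping factor of 0.85 follows the value the original paper suggests, and the sketch assumes every linked page also appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}        # start with equal trust
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a high-ranked page is worth more.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its vote over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", nothing links to "d".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))  # c
```

Page "a" ends up ranked second even though only one page links to it, because that one link comes from "c", the most trusted voter in the graph.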
And that massive pile of [7:10] text would eventually become the [7:11] training data or feedstock for future [7:14] AI models. We'd finally see this in [7:15] action in 2012 with a legendary ImageNet [7:18] paper, created by a dream team of [7:20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey [7:23] Hinton. Remember when I said [7:25] backpropagation needs data and compute? [7:27] Well, finally the stars aligned. The [7:29] data set is called ImageNet and it's a [7:31] monster data set of millions of [7:32] hand-labeled photos. While the compute is [7:34] a couple of Nvidia consumer-grade gaming [7:37] GPUs, a grad student named Alex wires up [7:40] a deep convolutional neural network, [7:42] names it AlexNet, and trains it in his [7:44] bedroom. Then he walks it into the [7:45] annual ImageNet contest and humiliates [7:48] everyone. This is a contest where AI [7:50] models try to classify objects in an [7:52] image like hot dog or not hot dog. And [7:55] while everyone was fighting over a [7:56] fraction of a percent, AlexNet walked [7:58] in and dropped the error rate by 10 [8:00] points in a single year. And this [8:02] freaked everyone out because it was [8:03] suddenly clear that deep learning [8:05] actually works. It just needs more data, [8:07] more compute, and the right [8:08] architecture. Luckily, we would get that [8:10] architecture a few years later thanks to [8:12] Ashish Vaswani and Google in the paper [8:15] Attention Is All You Need. Around this [8:17] time, large language models had a huge [8:19] problem. They would start a sentence and [8:21] by the end they would forget what they [8:22] were even talking about. That's because [8:24] they would read and predict tokens [8:25] sequentially one after the other. This [8:27] paper fixed that by introducing a new [8:29] architecture called the transformer that [8:31] throws out sequential reading entirely.
[8:34] Instead, it lets every word look at [8:36] every other word at once and decide [8:38] what's relevant. Not only does this make [8:40] large language models feel more [8:41] intelligent, but transformers also scale [8:43] better. Google made the big [8:45] mistake of giving this architecture away [8:47] for free, and now every AI lab uses it, [8:49] and that's where you get the T in [8:51] ChatGPT. Speaking of which, that brings us [8:54] to a paper released by OpenAI in 2020. [8:56] Language Models Are Few-Shot Learners. [8:59] Basically, OpenAI takes the transformer [9:01] and then asks the dumbest question [9:03] possible. What if we just make it [9:04] enormous? Not two times bigger, but [9:07] scale it to 175 billion parameters and [9:10] feed it the entire internet as a data [9:11] set. They made a crazy bet that [9:13] intelligence isn't some secret algorithm [9:15] we're missing, but rather it simply [9:17] emerges once you cross a threshold of [9:19] scale. The end result was GPT-3, the [9:22] model that ignited the current AI bubble [9:24] that we're living through right now. [9:25] What's crazy is that all of a sudden, [9:27] this model could translate, summarize, [9:29] and write code without ever being [9:31] specifically told how to do these things. [9:33] At such a large scale, it learned how to [9:35] generalize these things on the fly. Two [9:37] years later, this paper would evolve [9:38] into ChatGPT, which today is now a [9:41] trillion-dollar product. But when you [9:43] think about it, what is ChatGPT even [9:45] doing? Well, it's just predicting the [9:46] next word or token just like Claude [9:48] Shannon was doing in 1948. So, here's [9:51] the TL;DR for the last 100 years. Alan [9:53] Turing defined the machine. Claude [9:55] Shannon gave it currency. Rosenblatt [9:57] gave it a neuron. Geoffrey Hinton taught [9:59] it how to learn. Google gave it data and [10:01] an architecture.
And OpenAI just turned [10:03] the dial to the maximum. This has been [10:05] the history of artificial intelligence [10:07] in 10 scientific papers. Thanks for [10:09] watching and I will see you in the next