Data Centers on Fire: Jeff Dean's Wild Stories
The shocking and humorous revelation of a chat group called 'Data Centers on Fire' and the reality of cosmic rays flipping bits in memory is highly engaging and shareable.
In this interview, Jeff Dean, Google's Chief Scientist, discusses the challenges and future of AI, including the shift towards inference-heavy workloads, the potential of extremely low-precision hardware, and the importance of continual learning. He also shares insights on data center reliability, the role of distillation in open-source models, and his excitement about multi-agent systems and infinite context windows.
Jeff Dean is the chief scientist of Google, co-creator of MapReduce and TensorFlow, and led Google Brain. The interview also covers data center failures and cosmic ray bit flips.
Training data is not running out; there is still much video data and synthetic data potential, along with algorithmic improvements to extract more from existing data.
Inference now dominates data center compute (80%+), driving the need for specialized hardware like Google's TPU v8i and v8T chips with low precision (e.g., FP4).
Interleaving observation (pre-training) and action (learning from consequences) could lead to more capable models, though safety (red teaming) remains a challenge.
With a millionfold compute increase over 10 years, multi-agent systems could design complex artifacts (e.g., airplanes) in days instead of years, as shown by AI autonomously building an OS running Doom.
Open-source models rely heavily on distillation from larger frontier models; without new frontier models, open progress would slow.
Continual learning remains an unsolved problem, and efficient context windows for billion-token inputs would enable 'lifetime AI' systems.
"The title is accurate: the conversation does cover the implications of a millionfold compute leap, though it's not the central focus of the entire interview."
Can cosmic rays actually flip memory bits in data centers?
High-energy particles from cosmic ray showers can flip bits in DRAM, causing single-bit errors that ECC memory can correct.
23:57
How can models effectively handle a context window of billions of tokens?
They can be choreographed using cascaded retrieval mechanisms: first retrieve 30,000 relevant documents from billions, then use a lightweight model to select 117 relevant items for the full attention window.
20:07
What is the proposed approach to merge pre-training and post-training into a continuous process?
Alternate between passive observation of data (pre-training) and active learning from actions (similar to DQN experience replay).
09:12
What lower-than-FP4 precision formats might be possible in the future?
Two-bit or one-bit integers, with a shared higher-precision scaling factor every 64–256 weights. Dean notes he has not heard anyone propose a two-bit float.
09:02
What unsolved problem does Jeff Dean identify that he has tried to crack multiple times?
Achieving proper continual learning—a way for models to interleave observation, action, and learning safely.
27:29
What percentage of modern data center compute is inference, according to Bill Dally?
80% of ML compute in modern data centers is inference, not training.
06:22
What are two features of Google's AlloyDB Omni architecture for PostgreSQL that improve performance?
Flexible sharding (stripe, sharded, replicated) and location-awareness to minimize hops, plus own auto-scaling, failover, and repair.
N/A
Shift to Inference Dominance
Contradicts the assumption that data centers are still training-heavy; only 20% of ML compute is for training, driving hardware specialization toward inference workloads.
08:07
Cosmic Ray Bit Flips
Real cosmic rays can cause single-bit errors in DRAM, and Google monitors this via ECC error logs, demonstrating real-world hardware reliability challenges.
23:00
Software Checksums for Unreliable Hardware
Google built a software-based checksum system for large datasets when using consumer-grade hardware without ECC, a key lesson in building reliable systems from unreliable components.
25:00
Open Models Depend on Distillation
Jeff Dean confirms that open-source model progress relies heavily on distillation from frontier models, implying that a lack of new frontier models would slow open progress.
26:00
Millionfold Compute Leap Possibilities
If compute grows 1 millionx in 10 years, we could design complex systems (e.g., airplanes) in days rather than years, highlighting exponential potential in engineering automation.
13:10
[00:00] There used to be a chat group internally
[00:01] called data centers on fire that would
[00:03] have like exciting uh exciting events
[00:06] happening.
[00:06] >> A distant supernova goes off, a cosmic
[00:09] ray hits a memory cell and a zero flips
[00:11] to a one. Does that really happen?
[00:14] >> Oh yeah.
[00:14] >> So my question is do you enjoy these
[00:16] Chuck Norris style jokes about you?
[00:19] >> It could be true. um one problem that
[00:22] you solved tried to solve many times but
[00:24] have never been able to crack.
[00:30] I cannot believe that this is happening
[00:32] but I got to talk to a legendary
[00:35] engineer the chief scientist of Google
[00:38] Jeff Dean. He led Google Brain, one of
[00:40] the most legendary AI labs in history.
[00:43] He co-created MapReduce which taught
[00:46] thousands of computers to work together
[00:49] as one. He co-built TensorFlow, the
[00:52] engine behind a huge chunk of AI
[00:54] research. And for all this, they call
[00:57] him the Chuck Norris of computer
[00:59] science. Yes, I will tell him a joke
[01:02] about that too. Now, when I see
[01:04] interviews with these executives,
[01:06] everyone is asking about China and taxes
[01:09] and all that. Look, I know nothing about
[01:11] that. I am just a student who loves to
[01:15] talk about research. So, my goal was to
[01:17] try to go a bit deeper and ask him
[01:20] questions that maybe only he knows the
[01:23] answer to, which is incredible. I'll
[01:26] also ask him about problems that even he
[01:28] couldn't solve yet. And I will ask him
[01:30] about some of the secret sauce at Google
[01:33] and see if we get something and more.
[01:36] And I am so happy to share it with you
[01:38] fellow scholars so we can learn
[01:41] together. I am not sure if I saw Jeff
[01:43] smile and laugh this much before. So, I
[01:46] hope he enjoyed it too. And once again,
[01:48] this is an incredible honor. I cannot
[01:51] believe that I was sitting there. There
[01:53] were some production issues with the
[01:55] video part. I apologize for those. Also,
[01:58] I was super nervous. I could barely hold
[02:01] on to my papers. Now, fellow scholars,
[02:04] let's learn together with Jeff Dean.
[02:07] Thank you so much for doing this, Jeff.
[02:08] We talked a bit last year and I learned
[02:11] so much from you. It was incredible. And
[02:14] then I got a message that we we get to
[02:16] do this and I was so happy.
[02:18] >> So thank you so much for this and and we
[02:20] get to share your knowledge.
[02:22] >> A small part of your knowledge with with
[02:24] the fellow scholars. So that's that's
[02:26] absolutely
[02:26] >> it was great chatting with you last
[02:27] year. I'm looking forward to this.
[02:29] >> Thank you. Thank you. So everyone says
[02:32] that we are running out of training data
[02:33] for LLMs, but you you said that there is
[02:36] still plenty of data out there.
[02:39] >> What did you mean?
[02:40] >> Yeah, I mean I think everyone has this
[02:42] view that uh we're running out of
[02:44] training data and um it's true we've
[02:47] like used quite a lot of of the public
[02:49] text data in the world. Um but I think
[02:52] there's lots of interesting video data
[02:54] that we're not really training on yet.
[02:56] uh there's lots of interesting kind of
[02:58] um ways to generate synthetic data and
[03:01] then use that for training
[03:03] >> and then I also think we can start doing
[03:05] things like uh making more passes over
[03:08] the data that we do have to make more
[03:10] and more capable models and also come up
[03:12] with algorithmic techniques that enable
[03:14] us to get a lot more information from
[03:17] every piece of data that we do have. So
[03:19] I'm not too worried about that as like
[03:21] an impediment to making progress. It
[03:23] seems like there's lots and lots of
[03:24] things we can do. People also say that
[03:26] with so much simulation data as as you
[03:28] mentioned sooner or later most of the
[03:30] data will be AI generated which is then
[03:33] used to train a different AI and then
[03:36] suddenly everyone starts to you know
[03:37] learn on the same thing but you said
[03:40] wait it still helps I think the argument
[03:43] was that uh if you have enough compute
[03:45] you can crunch through a lot of data and
[03:48] if there is just a little needle in the
[03:50] haystack that's useful the system is
[03:52] able to learn from it. Is that true?
[03:54] because my previous crappy little
[03:57] experiment uh it it was not true at all.
[04:00] So you had to be very careful with the
[04:02] data.
[04:02] >> Yeah. I mean I think it is true in
[04:04] general. I mean there's a lot of details
[04:06] to get right to make this a reality.
[04:08] Think about for example doing RL
[04:12] training and rollouts to uh you know
[04:15] figure out how to solve some fairly
[04:17] high-level phrased uh coding question right.
[04:21] So you might explore a hundred or a
[04:24] thousand different ways of generating
[04:26] solutions to these problems and you
[04:28] might have some, you know, some filters
[04:30] that you apply to these things like does
[04:32] the code even compile? Well, you can
[04:33] throw out 800 of them right off the bat.
[04:36] >> Uh does it pass the unit tests? Does it
[04:39] like perform well? And so you can really
[04:42] start to hone in on like which of these
[04:44] you know potentially many solutions to
[04:46] the problem is the one that actually
[04:49] sort of generates the highest you know
[04:51] characteristics that you're looking for
[04:52] the reward in some sense
[04:54] >> and that I think is is definitely true
[04:56] like more compute will generate you more
[04:59] interesting solutions and then those can
[05:01] then be put into the training data they
[05:03] can be enriched with like data
[05:05] augmentation techniques you know I
[05:07] generated the solution in Python
[05:09] >> now I could generate a solution in Go,
[05:10] and have more Go programming language
[05:13] training data.
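The filtering funnel described here (does the code compile, does it pass the unit tests, which survivor scores highest) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's pipeline: the names `compiles`, `passes_tests`, and `select_best` are hypothetical, the "shorter is better" reward is a stand-in for a real performance measure, and running `exec` on generated code would need a sandbox in practice.

```python
from typing import Callable, Dict, List, Optional

Test = Callable[[Dict], bool]

def compiles(source: str) -> bool:
    """Filter 1: discard candidates that do not even compile."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def passes_tests(source: str, tests: List[Test]) -> bool:
    """Filter 2: execute the candidate and run every unit test against it."""
    namespace: Dict = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code; sandbox in practice
        return all(test(namespace) for test in tests)
    except Exception:
        return False

def select_best(candidates: List[str], tests: List[Test],
                reward: Callable[[str], float]) -> Optional[str]:
    """Keep candidates that pass both filters, then pick the highest reward."""
    survivors = [c for c in candidates if compiles(c) and passes_tests(c, tests)]
    return max(survivors, key=reward, default=None)

# Four hypothetical rollouts for the task "write add(a, b)".
candidates = [
    "def add(a, b) return a + b",                                   # syntax error
    "def add(a, b):\n    return a - b",                             # fails tests
    "def add(a, b):\n    total = a\n    total += b\n    return total",
    "def add(a, b):\n    return a + b",
]
tests = [
    lambda ns: ns["add"](2, 3) == 5,
    lambda ns: ns["add"](-1, 1) == 0,
]
reward = lambda source: -len(source)  # assumed stand-in: shorter is better

best = select_best(candidates, tests, reward)
print(best)
```

The surviving, highest-reward candidate is the kind of sample that could then be added back into the training data or translated into another language.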
[05:14] >> That's like an incredible kind of
[05:16] augmentation like augmentation before
[05:18] with convolutional neural networks, you
[05:20] know, it was just just shift the image
[05:22] by a couple pixels and whatnot and here
[05:24] the augmentation can be like completely
[05:26] different programming language and and
[05:28] whatnot.
[05:28] >> Yeah, I mean I think you know a lot of
[05:30] times we think about coding based
[05:31] problems as you go from natural language
[05:34] which is
[05:35] >> often very underspecified. It's like you
[05:37] know make me a cool space invader game
[05:39] or something. Um, but actually if you
[05:42] have a program that already works that
[05:44] does what you want and you want to
[05:45] translate it, that's awesome because in
[05:47] effect your prompt is the fully
[05:50] specified behavior of the system you
[05:52] want and you just want it in a different
[05:54] language for whatever reason. Maybe
[05:56] better performance or better safety
[05:58] characteristics or whatever. So that
[06:00] we've seen internally with some tools
[06:02] that have been written in Python and
[06:03] people have been able to sort of just
[06:05] say
[06:06] >> please use all the tests for this code
[06:08] and the actual Python codebase and make
[06:10] different versions of it and found you
[06:12] know much faster solutions.
[06:14] >> So you can you can suddenly get so much
[06:16] more out of the same amount of data
[06:18] basically.
[06:18] >> Yeah. So that's that's why you're not
[06:20] worried about the data. Okay. Nice. Now
[06:22] Bill Dally has said that something like
[06:25] 90% of what happens in modern data
[06:27] centers is not training anymore which I
[06:30] I found really surprising. It's
[06:32] inference like there's more less
[06:34] training and more using like relatively
[06:36] speaking.
[06:37] >> Um how does that shift the way you
[06:39] design hardware at Google? Yeah, I mean
[06:41] I first there's a lot of other things
[06:43] that are not either inference or
[06:45] training that happen in data centers
[06:46] like all the applications we run and
[06:48] search and Gmail and so on. But of the
[06:51] sort of machine learning workloads you
[06:53] know I it is the case that training
[06:57] uh is becoming you know less proportion
[07:00] of the overall compute that we want to
[07:02] do because there's so much you know
[07:04] inference workload you want to do and
[07:06] the inference workload includes both
[07:08] like offline inference u sort of RL
[07:12] rollouts during RL training uh and then
[07:15] also online inference for handling user
[07:17] requests or agent-based behavior.
[07:20] Because of that shift and the different
[07:23] characteristics of those two kinds of
[07:24] computations, it makes a ton more sense
[07:27] to now specialize much more for
[07:30] inference workloads in hardware for
[07:32] example. Um because the characteristics
[07:34] are quite different. You need lower
[07:36] precision. You
[07:37] >> you know are handling a very large
[07:39] volume of requests on this particular
[07:41] model. The model weights don't
[07:43] necessarily change uh at inference time.
[07:46] Um all these things lead to very
[07:48] different solutions for hardware and
[07:49] much more energy efficiency can be
[07:51] gained by specializing and so I think
[07:53] you'll see a lot more in this area uh
[07:56] you know now and in the future. We've
[07:58] already done this with our TPU uh 8i and
[08:01] 8T chips that we announced a couple um
[08:03] maybe a month ago.
[08:04] >> Um but you'll see even more
[08:05] specialization I think.
[08:07] >> And that's pretty crazy that you said
[08:09] that even FP4 kind of works. And I when
[08:12] I first heard it I was like it cannot
[08:14] possibly work. can do anything useful
[08:16] and and it does.
[08:17] >> Yeah. If you told that to a computer
[08:19] scientist from 15 years ago, they'd be
[08:21] like,
[08:22] >> that's that's not not enough numbers.
[08:24] >> Yeah. Yeah. Exactly.
[08:25] >> And I look at every now and then at
[08:27] these papers and you you you have these
[08:28] these different transforms that are the
[08:30] the the distance preserving transforms,
[08:33] rotations between the points and all
[08:35] kinds of compression. But still FP4,
[08:37] that's unbelievable. It's not many bits
[08:39] for exponent or mantissa or sign
[08:41] >> and it and it and and it's high quality,
[08:43] you know, intelligence that comes out of
[08:45] it. So, it's just
[08:46] >> it's a good sign that it works.
[08:48] >> Yeah. Yeah. But I I I don't know if we
[08:50] can get lower. Uh what what do you think
[08:52] like even lower?
[08:53] >> Possible. I mean I think um you know
[08:55] people are seeing and experimenting with
[08:57] things where you have some even lower
[08:59] precision and then it every so many
[09:02] weights of that you know lower precision
[09:04] you have a scaling factor and that seems
[09:06] like you get a little bit of a higher
[09:07] precision thing that's kind of shared
[09:09] across all the other lower lower bit
[09:12] precision u formats whatever they might
[09:14] be two bit integer one bit integer you
[09:17] know I haven't heard anyone say two bit
[09:19] float because I'm not sure what that
[09:20] would mean
[09:21] >> but um yeah I
[09:23] that plus a scaling factor seems to be
[09:25] able to get you pretty far.
[09:27] >> And the question is like how often do
[09:29] you need the scaling factor? Is it every
[09:30] 64 or 128 or 256 weights?
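The idea of very low-bit weights plus a shared scaling factor per group can be made concrete with a small pure-Python sketch. Everything here is an assumption for illustration: the function names are hypothetical, the codes are 2-bit signed integers (-2 to 1), the scale is taken to be a 16-bit value, and the rounding scheme is the simplest one that works, not any particular hardware format.

```python
from typing import List, Tuple

Block = Tuple[float, List[int]]  # (shared scale, low-bit integer codes)

def quantize_blocks(weights: List[float], block_size: int = 64,
                    bits: int = 2) -> List[Block]:
    """Quantize weights to `bits`-bit signed ints with one scale per block."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # 2-bit: -2..1
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude onto |qmin|; large positive values clip to qmax.
        scale = max_abs / abs(qmin) if max_abs > 0 else 1.0
        codes = [min(qmax, max(qmin, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks

def dequantize_blocks(blocks: List[Block]) -> List[float]:
    """Reconstruct approximate weights from the codes and per-block scales."""
    return [scale * code for scale, codes in blocks for code in codes]

def bits_per_weight(block_size: int, bits: int = 2, scale_bits: int = 16) -> float:
    """Average storage cost once the shared scale is amortized over a block."""
    return bits + scale_bits / block_size

blocks = quantize_blocks([0.5, -1.0, 0.3, 0.0], block_size=4)
restored = dequantize_blocks(blocks)
for size in (64, 128, 256):
    print(size, bits_per_weight(size))
```

Under these assumptions the overhead of the scale is small: 2.25 bits per weight with a block of 64, 2.125 with 128, and 2.0625 with 256. That is the trade-off behind the question of how often a scaling factor is needed: smaller blocks track local weight ranges better but cost more bits.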
[09:34] >> Pre and post training are typically
[09:36] separate steps today. Do you see that
[09:39] split holding or do you expect the two
[09:41] to merge as as capabilities increase?
[09:43] >> Yeah, I mean I feel like it's a little
[09:45] intellectually dissatisfying that they
[09:47] are these distinct phases and you do one
[09:49] and then you do the other. it like
[09:52] conceptually the right uh thing to do is
[09:56] to have interleaved periods where you're
[09:59] sort of observing data and then periods
[10:02] where you're trying to use that new
[10:05] knowledge you've gotten from the data
[10:06] you
[10:06] >> like like with DQN this experience
[10:09] replay kind of thing
[10:10] >> yeah and then you want to now take
[10:12] actions in some environment maybe it's a
[10:14] simulated environment maybe it's the
[10:16] world with a robot or whatever it is and
[10:19] then you know learn from those actions
[10:21] because I think you get a lot more
[10:23] benefit from actually um taking actions
[10:28] and observing the consequences or trying
[10:29] to write code and seeing does the code
[10:32] work
[10:32] >> than you do from just passively sitting
[10:35] there and seeing tokens streamed by you
[10:37] which is really what most of
[10:38] pre-training is these days. It's really
[10:40] interesting that you say that that in an
[10:42] interleaved manner because when I when I
[10:44] hear merging the two what what in my
[10:47] mind is continuous like continuous
[10:49] learning
[10:50] >> but at the same time people have to test
[10:51] models you cannot just chuck it out
[10:53] there you know you finish training you
[10:55] finish the post and and then maybe the
[10:57] red teaming steps and and and you know
[10:59] safety and everything and then you
[11:01] package it up and you say okay this is
[11:02] good to go but if there's continuous
[11:04] learning then then there's new challenges
[11:06] because how do you know that this
[11:09] >> intermediate state is actually safe.
[11:11] Maybe some more research there too.
[11:13] >> Yeah, I mean I think uh first like a
[11:16] bunch of discrete steps where maybe you
[11:18] do this a 100 times or a thousand times
[11:20] starts to look more like an integral
[11:21] than a summation.
[11:23] >> Um and so um
[11:26] >> I do think interleaving in that way will make
[11:28] sense
[11:30] >> but you're right
[11:31] >> like you have a bunch of things you need
[11:33] to do for a live model that is serving
[11:36] user requests. You need to make sure
[11:37] that it's safe. Um so it may be that the
[11:40] continual learning happens and then
[11:42] there's some uh application of uh you
[11:45] know safety protocols and red teaming as
[11:47] you say uh and then you release a new
[11:49] version of that but then that model
[11:51] still continues to learn kind of behind
[11:53] the scenes and then before the newest
[11:56] version of it is is provided to users
[11:58] you redo the sort of final safety
[12:00] testing and and red teaming. Jensen likes to
[12:03] say that compute capabilities advanced 1
[12:06] millionx over the last 10 years. So if
[12:10] in the next 10 years, assuming we get
[12:12] another 1 millionx, what would we be
[12:14] able to do that we cannot do now?
[12:17] >> Yeah. I mean it's like imagining the
[12:19] future is always a hard thing because
[12:21] this field is moving quickly.
[12:23] >> I mean I think if you think back, you
[12:25] know, 10, it was 10 years.
[12:26] >> 10 years.
[12:27] >> 10 years. If you think back 10 years,
[12:29] you know, we were kind of just starting
[12:31] to have language models that were the
[12:33] sequence to sequence paper had appeared.
[12:36] You know, it was just before the
[12:37] transformer.
[12:38] >> LSTMs, maybe
[12:38] >> LSTMs were were popular.
[12:41] >> Um, and now those models sort of look uh
[12:46] >> ancient and not nearly as
[12:48] capable as the models we have today. So,
[12:50] I think if you project forward that
[12:51] level of advancement, you're going to
[12:53] see
[12:54] >> huge investments in both like new kinds
[12:56] of hardware
[12:57] um you know new kinds of research
[12:59] techniques uh there's just a lot more
[13:01] attention being paid to the field. So I
[13:03] I see that progress rate not slowing
[13:06] down um over the next 10 years. And so
[13:09] that's going to be incredible like the
[13:10] multi-agent workflows we're now able to
[13:13] start to
[13:14] >> kind of get to work on very complicated
[13:16] tasks like you saw in the IO uh keynote
[13:19] >> being able to write an operating system
[13:22] >> autonomously with a relatively simple
[13:24] prompt.
[13:25] >> Crazy. uh you know obviously there's a
[13:26] lot of operating systemy like things in
[13:29] the training data so it's not completely
[13:30] out of distribution but you know the
[13:33] fact that it's able to build an OS that
[13:34] can run Doom uh successfully is is
[13:37] pretty amazing
[13:38] >> I couldn't couldn't believe it I mean
[13:40] last year I heard a talk from Stephen
[13:42] Balaban the Lambda CEO
[13:44] >> and he had this neural OS like hey you
[13:47] know it it does more and more like like
[13:49] forget the UI forget forget the maybe
[13:52] the drivers I don't know but but just
[13:54] let's let's have a neural OS and I was
[13:55] like, "Yeah, that that sounds like an
[13:57] amazing science fiction idea. I would
[13:59] love to see it, but I don't know. I
[14:01] mean, it sounds far off." A year later
[14:03] and we got you, you know, not exactly
[14:06] like that. I know but but if if you look
[14:09] at the derivatives over time
[14:11] >> I mean I would say one thing I'm
[14:12] particularly excited about is
[14:15] you know can we with these tools
[14:18] accomplish so much more in you know
[14:20] science Demis was mentioning in the
[14:21] keynote or in you know complicated
[14:24] engineering tasks that often would take
[14:26] you know lots and lots of people
[14:28] multiple years to accomplish. Could you
[14:30] actually have a system that with the
[14:33] correct access to the right kinds of
[14:34] simulation environments and a learning
[14:37] set of agents that are trying to
[14:38] accomplish the task and break it down
[14:39] into smaller tasks,
[14:41] >> could you design an airplane in, you
[14:43] know, five days instead of, you know,
[14:46] many many years? That would be amazing.
[14:48] >> 1 millionx and we we can we can try
[14:50] again.
[14:51] >> Yeah. I mean, we're not there yet, but
[14:52] that would be a pretty pretty amazing
[14:54] capability. Or designing new new
[14:56] computer chips or computer systems, new
[14:58] hardware. Um, you know, I'm pretty
[15:00] excited about that.
[15:01] >> Yeah, incredible times. Are open models
[15:04] standing on the shoulders of giants? And
[15:07] by that I mean if if Frontier models
[15:09] suddenly stopped being released, would
[15:11] open models improve as quickly as they
[15:13] do now or is their progress mostly
[15:16] driven by distillation?
[15:17] >> Yeah, I mean I think certainly a bunch
[15:19] of the progress is driven by
[15:20] distillation. For example, our own Gemma
[15:22] models are definitely distilled from
[15:24] higher quality larger scale models. Um
[15:27] and I think a lot of other open source
[15:28] models are getting benefit from
[15:30] distillation data. Uh distillation has
[15:32] always been a you know amazing way to
[15:34] get really capable models into a smaller
[15:38] footprint thing and you know uh that's
[15:40] how our flash models are quite capable
[15:42] for their size relative to the pro
[15:44] models is we're able to use the pro
[15:46] model to
[15:47] >> to teach the the flash models. So I mean
[15:50] I think really the the question is
[15:54] uh not so much one of closed versus
[15:57] open. It's you know if we want small
[16:01] incredibly capable models we have to
[16:03] keep building larger scale models that
[16:06] are maybe less inference efficient but
[16:08] are more capable and then use
[16:10] distillation
[16:11] >> to uh you know transfer the knowledge
[16:14] into into the smaller models whether
[16:15] they are open or closed. Now I'm I'm
[16:17] wondering you might be the only one who
[16:19] can answer that. So I I really want to
[16:21] ask this. Everyone has their their
[16:23] flagship models and yes the distilled
[16:25] models like pretty much every company
[16:26] does this tiered level thing. the
[16:29] quicker faster models always were
[16:31] well below the the frontier models and
[16:33] at some point I think 3.1 where there
[16:36] was one version where where the the
[16:40] quick one was suddenly so so close to
[16:43] the frontier one there was like a 3%
[16:45] difference
[16:46] >> in in in tough benchmarks and and I just
[16:50] heard someone saying I don't even know
[16:51] who that was that that yeah it's not
[16:53] like just distillation there is some
[16:55] magic sauce in there that's been in the
[16:57] works for years. So, can I hear a bit
[16:59] about that?
[17:00] >> Sure. Well, not too much. I mean, there
[17:02] is always some magic sauce that we don't
[17:04] reveal, but distillation is definitely
[17:06] one of the key things that makes those,
[17:08] you know, much smaller models much
[17:10] cheaper, much faster, much more
[17:12] affordable um models be, you know,
[17:15] nearly as good as those frontier models.
[17:17] And then we push ahead and build an even
[17:20] better frontier model. And then we have
[17:22] to then do the process again where we
[17:24] now transfer the the knowledge in the
[17:27] really capable frontier model back
[17:29] into a a lighter weight one. And I think
[17:31] um you know this is this is really
[17:34] important because the flash models are
[17:37] really the workhorse of what people
[17:39] generally want to use because they're
[17:41] you know they're almost as capable. We
[17:43] saw it. Yeah.
[17:43] >> Yeah. And uh
[17:45] >> and they're they're quite good.
[17:47] >> Yeah. It's unbelievable how close they
[17:48] can get like this. This didn't used to
[17:50] be like that at all. All right. What
[17:53] trends in machine learning are you most
[17:54] excited about right now? You you have a
[17:56] separate talk about like exciting trends
[17:58] in machine learning or something like
[18:00] that.
[18:00] >> Yeah. I mean
[18:01] >> what's what's the newer version of that?
[18:02] >> Yeah, the newer version I guess I mean
[18:04] there's a few different trends that I
[18:06] think are really exciting. The one is um
[18:09] uh
[18:11] so first I think continual learning is
[18:14] still a little bit nascent but I think
[18:16] looking at ways to make models that are
[18:19] more interleaved in their use of so
[18:22] sort of seeing data passively and taking
[18:24] action and learning from that seems like
[18:26] a really important thing. Uh you know
[18:28] agents and multi-agent use of uh these
[18:31] systems is really really important. Um,
[18:35] as one trend of that though, I think as
[18:37] you see, uh, you know, we're going to
[18:39] need a lot more inference hardware and
[18:41] capability for that because those
[18:44] systems that are working autonomously in
[18:46] the background actually consume lots of
[18:47] tokens in order to sort of
[18:49] >> do the the kind of important work
[18:51] they've been asked to do. Um, you know,
[18:54] I think, uh, being able to build really
[18:57] efficient inference hardware will enable
[18:59] a lot of of things. So looking at you
[19:02] know co-design of model architectures
[19:05] and hardware architectures to make sort
[19:08] of the best use of um things and have
[19:11] really good properties in terms of very
[19:12] low latency you know much higher
[19:15] performance per watt performance per
[19:16] dollar are things we we really care
[19:18] about.
[19:19] um you know I think looking at how do
[19:22] you you know the context window of these
[19:25] models is an important characteristic
[19:27] but
[19:28] uh I think there's a lot we could do if
[19:31] we come up with mechanisms that are sort
[19:33] of cascaded series of things that kind
[19:36] of give you the illusion that you have
[19:38] all information in the context window
[19:40] >> like you'd like to have the whole
[19:42] internet at your model's fingertips
[19:45] >> or on a personal level if you've opted
[19:47] in you know all of your email and your
[19:49] photos and your the videos you've
[19:51] watched and things like that. Um, but
[19:54] you can't really do it with the sort of
[19:56] quadratic attention mechanism. But I
[19:57] think you can build a series of kind of
[20:00] retrieval and lighter weight mechanisms
[20:03] and then ways of cascading from you know
[20:07] here are the 30,000 documents out of 10
[20:09] billion that seem most relevant and then
[20:11] you know have a lighter weight model
[20:14] that looks at those and decides these
[20:16] 117 things seem really relevant to what
[20:19] you're trying to do and puts those in
[20:21] the sort of more expensive context
[20:23] window of a a bigger model perhaps. Uh
[20:26] that's going to be kind of exciting. And
[20:27] how do you orchestrate and interleave all
[20:29] that stuff so it gives you the illusion
[20:32] uh without you having to even think
[20:33] about it?
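The cascade described here (a cheap pass over everything, a lighter-weight model over the shortlist, and only the final handful placed in the expensive context window) can be sketched as a two-stage top-k selection. This is a toy illustration under assumptions: `cascade` and both scoring functions are hypothetical, with word overlap standing in for the retrieval stage and Jaccard similarity standing in for the lighter-weight model. A real system would use an index and learned scorers.

```python
import heapq
from typing import Callable, List

Scorer = Callable[[str, str], float]

def word_overlap(query: str, doc: str) -> float:
    """Stage 1 stand-in: very cheap score, the number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))

def jaccard(query: str, doc: str) -> float:
    """Stage 2 stand-in for a lighter-weight model: shared words / all words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0

def cascade(query: str, corpus: List[str], cheap: Scorer, rerank: Scorer,
            k_coarse: int, k_fine: int) -> List[str]:
    """Narrow the corpus to k_coarse with `cheap`, then to k_fine with `rerank`."""
    coarse = heapq.nlargest(k_coarse, corpus, key=lambda doc: cheap(query, doc))
    return heapq.nlargest(k_fine, coarse, key=lambda doc: rerank(query, doc))

corpus = [
    "cosmic rays flip bits in dram",
    "ecc memory corrects single bit errors",
    "how to bake sourdough bread",
    "cosmic ray bit flips and ecc memory in data centers",
    "airplane design with multi agent systems",
]
query = "cosmic ray bit flips ecc memory"

# In the interview's example the two stages are 10 billion -> 30,000 -> 117.
context = cascade(query, corpus, word_overlap, jaccard, k_coarse=3, k_fine=2)
print(context)
```

Only the documents in `context` would be handed to the full attention window of the bigger model, so the quadratic cost applies to a tiny fraction of the corpus.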
[20:34] >> Interesting. So it's very advanced games
[20:36] to be played with the context window
[20:37] because obviously very expensive. So the
[20:39] attention mechanism you get you get big O of n
[20:41] squared.
[20:42] >> Uh are we still there or are do we have
[20:45] some I mean I've heard some n log n
[20:47] things. Can we go lower? There's like a
[20:49] whole series.
[20:49] >> Obviously we can go lower but the
[20:51] question is what what the trade-offs are
[20:52] right like what do you have to pay for
[20:54] that? Yep. um where are we in that?
[20:56] >> Yeah, I mean I think there's actually
[20:58] quite a large body of work there
[21:00] probably, you know, hundred papers on
[21:02] more efficient context uh uh algorithms
[21:05] than than the than N squared one.
[21:07] >> I mean the N squared one works really
[21:09] well. uh so it has a pretty high bar but
[21:12] I do think there is traction in finding
[21:14] things that are you know much lower cost
[21:17] whether it's you know reducing
[21:18] algorithmic factors or very large
[21:21] constant factors on the the base n squared
[21:23] algorithm I think all of these are
[21:25] pretty exciting you can actually combine
[21:26] many of these these approaches
[21:29] >> um and and get uh you know much cheaper
[21:31] attention over many more tokens
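To see why the n squared cost sets the bar, it helps to count token-pair scores at a few context lengths. The sketch below is plain arithmetic, not a model of any specific attention variant: the function names are hypothetical, the n log n curve is just the scaling mentioned in the question, and constant factors are ignored entirely.

```python
import math

def full_attention_scores(n_tokens: int) -> int:
    """Standard attention scores every token against every token: n * n."""
    return n_tokens * n_tokens

def n_log_n_scores(n_tokens: int) -> int:
    """Hypothetical sub-quadratic scheme that scales as n * log2(n)."""
    return round(n_tokens * math.log2(n_tokens))

for n in (1_000, 1_000_000, 1_000_000_000):
    ratio = full_attention_scores(n) / n_log_n_scores(n)
    print(f"n={n:>13,}  n^2={full_attention_scores(n):.2e}  "
          f"n log n={n_log_n_scores(n):.2e}  ratio={ratio:.1e}")
```

The gap widens with context length, which is why both algorithmic reductions and large constant-factor savings matter more as the token count grows.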
[21:34] >> yeah I think that's one of the most
[21:35] important things because if it was
[21:37] cheaper in some sense and and and and
[21:39] you could still find the the needles in
[21:41] the in the haststack over very long
[21:44] contexts. Then you could you could have
[21:45] some sort of lifetime AI thing.
[21:48] >> Yeah, totally. Like I'd like my whole
[21:49] life of all the digital things I've seen
[21:52] uh in there. Uh as a say internal Google
[21:56] developer, I'd love for the entire
[21:57] Google codebase to be in there, which is
[21:59] you know probably 10 billion lines of
[22:01] codes, probably you know big you know
[22:04] 100 billion tokens.
[22:04] >> I just want my wine list.
[22:06] >> I just want 100 billion. All I want is a
[22:08] 100 billion tokens of attention. It's
[22:10] all I need.
[22:11] >> Amazing. I think we got to do this one.
[22:13] So, Google's data centers run an
[22:15] enormous number of machines. And at that
[22:17] scale, anything that can go wrong will
[22:19] go wrong. Like I hear that wires wear
[22:22] down,
[22:23] >> hard drives fall apart, motherboards
[22:25] overheat. Um, is that something that
[22:27] actually happens day by day? And do you
[22:29] have any good stories?
[22:30] >> Absolutely. I mean, I don't have that
[22:33] many personal stories, but there used to
[22:34] be a chat group internally called Data
[22:36] Centers on Fire that would have like
[22:38] exciting uh exciting events happening
[22:40] and sometimes exciting videos. Um yeah,
[22:43] I mean I think
[22:45] >> at scale lots of things that are very
[22:47] very unexpected happen and usually those
[22:49] are the combination of one thing fails
[22:52] and something else fails simultaneously
[22:54] or in cascade. Yeah, you
[22:56] have a cascaded failure of some sort.
[22:58] You know, sometimes that means some
[23:00] software system stops working. Sometimes
[23:02] it means like the bus bar overheats
[23:05] and you get too much power to the
[23:07] rack and like it catches on fire. I mean
[23:10] that's a much rarer thing. But um you
[23:12] know you have to be prepared for this
[23:13] and I think one of the things even from
[23:15] the very earliest days of Google is we
[23:18] have really focused on how do you build
[23:21] reliable systems out of unreliable
[23:23] parts. Yes.
[23:24] >> Right. Like in the earliest Google days,
[23:25] we were buying consumer machines without
[23:28] uh ECC memory. Not only not
[23:31] ECC, not even parity.
[23:33] >> we were buying consumer motherboards
[23:35] that didn't have like redundant power
[23:37] supplies and you can do that if you can
[23:41] handle things at a higher level and
[23:42] that's generally what we try to do in
[23:44] all cases is
[23:45] >> I actually wanted to ask you about that
[23:47] the ECC thing because here's one of
[23:49] my favorite failure modes, if that's
[23:52] true, but you tell me: a distant
[23:54] supernova goes off, a cosmic ray hits a
[23:57] memory cell and a zero flips to a one.
[24:00] Does that really happen?
[24:01] >> Oh yeah. Yeah, absolutely. I mean, alpha
[24:04] particles definitely can flip uh you
[24:06] know DRAM state. We've actually observed
[24:08] this because we have monitoring data of
[24:10] how many ECC uh errors and like single
[24:14] bit errors that are corrected and
[24:15] two-bit errors that are not corrected
[24:17] are happening in all of our machines.
[24:19] And you can actually see this where some
[24:21] clusters that are pointing in a
[24:23] particular direction in the earth have a
[24:25] much higher rate for a you know a brief
[24:27] period like 10-minute period or
[24:28] something and then the other ones in the
[24:29] other side of the earth do not have
[24:30] that. So it's definitely something that
[24:32] happens.
[24:33] >> How worried should I be? Because MacBook
[24:35] Pros don't have ECC memory as far as I
[24:37] know like for for one machine is it so
[24:39] vanishingly you know unlikely that you
[24:42] shouldn't care but for data center or
[24:44] >> I mean for one machine it's generally
[24:45] not too bad. I mean I think they have
[24:47] parity so at least they detect it typically
[24:49] if it's a single bit error
[24:50] >> so detection but not fixing
[24:52] >> right but ECC usually gives you single
[24:54] bit error correction and double-bit
[24:57] error detection. Yeah.
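The "single-bit correction, double-bit detection" behaviour described here can be sketched with an extended Hamming(8,4) code. This is a toy illustration in Python, not the code used in real memory modules: ECC DRAM typically protects 64 data bits with 8 check bits, and the function names below are hypothetical.

```python
from functools import reduce
from operator import xor

# Positions 1, 2 and 4 hold Hamming parity bits, position 0 holds an overall parity bit,
# and the four data bits live at positions 3, 5, 6 and 7.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(data):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    code = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos] = bit
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = reduce(xor, (code[i] for i in range(1, 8) if i & p and i != p), 0)
    code[0] = reduce(xor, code[1:], 0)  # overall parity over positions 1..7
    return code


def decode(received):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # for a single flip, the syndrome equals the flipped position
    overall = reduce(xor, code, 0)  # 0 for any valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped, cannot repair.
        return "uncorrectable", None
    return status, [code[pos] for pos in DATA_POSITIONS]


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

Plain parity corresponds to keeping only the overall parity bit: it notices a single flip but cannot say which bit flipped, which is the "detection but not fixing" case mentioned just before.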
[24:59] >> So with that you don't have to worry
[25:00] about it too much
[25:02] >> um at a single machine level but even at
[25:05] you know tens of thousands of machines
[25:07] you'd have to start thinking about that.
[25:08] So you know one of the things we did
[25:10] when we were using machines without even
[25:12] parity is we built an entire
[25:15] software-based checksumming system for
[25:17] large amounts of our data. So
[25:18] >> doing it by hand
[25:19] >> doing it by hand essentially and like we
[25:21] would you know for crawling web pages
[25:24] and putting them in the index
[25:25] >> you know if you detect that this
[25:28] particular record is corrupted it's
[25:29] usually generally okay to just you know
[25:32] ignore that record.
[25:34] >> Now I have something interesting for
[25:35] you. I call it lightning round. So,
[25:37] please try to answer in one sentence.
[25:39] One word is okay. One one sentence.
[25:41] >> Can I make run-on sentences?
[25:44] >> We'll see. We'll see. So, I I read that
[25:46] Jeff Dean's pin code is the last four
[25:49] digits of pi. I I give this one an eight
[25:52] out of 10. So, my question is, do you
[25:54] enjoy these Chuck Norris style jokes
[25:57] about you?
[25:58] >> It could be true. Um uh I I do enjoy
[26:02] them. I mean, it's an April Fools' joke
[26:04] gone awry by my colleagues in 2009, but
[26:07] it's very both flattering and kind of
[26:08] embarrassing.
[26:10] >> I think I think he felt the same way
[26:12] about them, too. But he he he enjoyed
[26:14] them, too. Legend. All right. One big
[26:16] thing that you were wrong about and came
[26:18] around.
[26:20] I think AI is going to influence health
[26:23] care quite dramatically, but I think it
[26:27] is harder not necessarily for technical
[26:30] reasons, but for you know, how do you
[26:32] actually get things in regulated
[26:35] industries that are super important and
[26:36] have all kinds of privacy constraints
[26:38] and safety concerns,
[26:39] >> but I think ultimately that will happen.
[26:42] It's just taking longer than than I I
[26:44] hoped. Yes. Because I think there's
[26:46] tremendous world benefit to do it. Um,
[26:49] but we need to do it carefully and
[26:50] safely.
[26:50] >> Vim or Emacs or something else? Hint,
[26:54] there's only one good answer.
[26:55] >> Emacs. Was that it? Oh, no.
[27:00] >> Look, I I'm a Vim person, but but I'm
[27:02] I'm not
[27:04] >> Maybe I'm I'm an embarrassment of a Vim
[27:06] person because I I I looked at Emacs,
[27:08] too, and I was like, that's pretty cool,
[27:10] too, but I I don't want to learn both.
[27:11] It's it's just so much time. So,
[27:13] >> yeah, it's true. One can spend a lot of
[27:14] time customizing Emacs. The vimrc I wrote
[27:18] up and then it never ends.
[27:20] Yeah. One problem that you tried
[27:22] to solve many times but have never been
[27:24] able to crack.
[27:29] >> I mean I think in some sense we still
[27:31] don't have an answer to how do you do
[27:33] continual learning appropriately? That's
[27:35] something I've thought about a little.
[27:36] I've dabbled a little bit with
[27:38] some techniques along with colleagues.
[27:40] >> But I think uh you know if we're able to
[27:43] crack that it's going to be amazing. Um,
[27:45] but it's not there yet.
[27:46] >> Last one. Favorite Two-Minute Papers
[27:49] episode.
[27:51] >> Oh,
[27:53] yeah. I mean, I assume the the
[27:55] Transformer one was a good one.
[27:56] >> All right. All right. Well, that's
[27:58] that's a good one. Okay, Jeff, I
[28:00] learned a lot today. Thank you so much.
[28:02] This chatting with you again.
[28:03] >> Thank you so much.
[28:04] >> Thank you.
[28:04] >> Here you see me running the full
[28:06] DeepSeek AI model through Lambda GPU
[28:09] cloud. 671
[28:12] billion parameters running super fast
[28:15] and super reliably. This is insane. I
[28:18] love it and I use it on a regular basis.
[28:22] Lambda provides you with powerful NVIDIA
[28:24] GPUs to run your own chatbots and
[28:27] experiments. Seriously, try it out now
[28:30] at lambda.ai/papers
[28:33] or click the link in the description.