[0:00] There used to be a chat group internally
[0:01] called data centers on fire that would
[0:03] have like exciting uh exciting events
[0:06] happening.
[0:06] >> A distant supernova goes off, a cosmic
[0:09] ray hits a memory cell and a zero flips
[0:11] to a one. Does that really happen?
[0:14] >> Oh yeah.
[0:14] >> So my question is do you enjoy these
[0:16] Chuck Norris style jokes about you?
[0:19] >> It could be true. um one problem that
[0:22] you tried to solve many times but
[0:24] have never been able to crack.
[0:30] I cannot believe that this is happening
[0:32] but I got to talk to a legendary
[0:35] engineer the chief scientist of Google
[0:38] Jeff Dean. He led Google Brain, one of
[0:40] the most legendary AI labs in history.
[0:43] He co-created MapReduce, which taught
[0:46] thousands of computers to work together
[0:49] as one. He co-built TensorFlow, the
[0:52] engine behind a huge chunk of AI
[0:54] research. And for all this, they call
[0:57] him the Chuck Norris of computer
[0:59] science. Yes, I will tell him a joke
[1:02] about that too. Now, when I see
[1:04] interviews with these executives,
[1:06] everyone is asking about China and taxes
[1:09] and all that. Look, I know nothing about
[1:11] that. I am just a student who loves to
[1:15] talk about research. So, my goal was to
[1:17] try to go a bit deeper and ask him
[1:20] questions that maybe only he knows the
[1:23] answer to, which is incredible. I'll
[1:26] also ask him about problems that even he
[1:28] couldn't solve yet. And I will ask him
[1:30] about some of the secret sauce at Google
[1:33] and see if we get something and more.
[1:36] And I am so happy to share it with you
[1:38] fellow scholars so we can learn
[1:41] together. I am not sure if I saw Jeff
[1:43] smile and laugh this much before. So, I
[1:46] hope he enjoyed it too. And once again,
[1:48] this is an incredible honor. I cannot
[1:51] believe that I was sitting there. There
[1:53] were some production issues with the
[1:55] video part. I apologize for those. Also,
[1:58] I was super nervous. I could barely hold
[2:01] on to my papers. Now, fellow scholars,
[2:04] let's learn together with Jeff Dean.
[2:07] Thank you so much for doing this, Jeff.
[2:08] We talked a bit last year and I learned
[2:11] so much from you. It was incredible. And
[2:14] then I got a message that we we get to
[2:16] do this and I was so happy.
[2:18] >> So thank you so much for this and and we
[2:20] get to share your knowledge.
[2:22] >> A small part of your knowledge with with
[2:24] the fellow scholars. So that's that's
[2:26] absolutely
[2:26] >> it was great chatting with you last
[2:27] year. I'm looking forward to this.
[2:29] >> Thank you. Thank you. So everyone says
[2:32] that we are running out of training data
[2:33] for LLMs, but you you said that there is
[2:36] still plenty of data out there.
[2:39] >> What did you mean?
[2:40] >> Yeah, I mean I think everyone has this
[2:42] view that uh we're running out of
[2:44] training data and um it's true we've
[2:47] like used quite a lot of of the public
[2:49] text data in the world. Um but I think
[2:52] there's lots of interesting video data
[2:54] that we're not really training on yet.
[2:56] uh there's lots of interesting kind of
[2:58] um ways to generate synthetic data and
[3:01] then use that for training
[3:03] >> and then I also think we can start doing
[3:05] things like uh making more passes over
[3:08] the data that we do have to make more
[3:10] and more capable models and also come up
[3:12] with algorithmic techniques that enable
[3:14] us to get a lot more information from
[3:17] every piece of data that we do have. So
[3:19] I'm not too worried about that as like
[3:21] an impediment to making progress. It
[3:23] seems like there's lots and lots of
[3:24] things we can do. People also say that
[3:26] with so much simulation data as as you
[3:28] mentioned sooner or later most of the
[3:30] data will be AI generated which is then
[3:33] used to train a different AI and then
[3:36] suddenly everyone starts to you know
[3:37] learn on the same thing but you said
[3:40] wait it still helps I think the argument
[3:43] was that uh if you have enough compute
[3:45] you can crunch through a lot of data and
[3:48] if there is just a little needle in the
[3:50] haystack that's useful, the system is
[3:52] able to learn from it. Is that true?
[3:54] because my previous crappy little
[3:57] experiment uh it it was not true at all.
[4:00] So you had to be very careful with the
[4:02] data.
[4:02] >> Yeah. I mean I think it is true in
[4:04] general. I mean there's a lot of details
[4:06] to get right to make this a reality.
[4:08] Think about for example doing RL
[4:12] training and rollouts to uh you know
[4:15] figure out how to solve some fairly
[4:17] high-level phrased coding question, right?
[4:21] So you might explore a hundred or a
[4:24] thousand different ways of generating
[4:26] solutions to these problems and you
[4:28] might have some, you know, some filters
[4:30] that you apply to these things like does
[4:32] the code even compile? Well, you can
[4:33] throw out 800 of them right off the bat.
[4:36] >> Uh does it pass the unit tests? Does it
[4:39] like perform well? And so you can really
[4:42] start to hone in on like which of these
[4:44] you know potentially many solutions to
[4:46] the problem is the one that actually
[4:49] sort of generates the highest you know
[4:51] characteristics that you're looking for
[4:52] the reward in some sense
[4:54] >> and that I think is is definitely true
[4:56] like more compute will generate you more
[4:59] interesting solutions and then those can
[5:01] then be put into the training data they
[5:03] can be enriched with like data
[5:05] augmentation techniques you know I
[5:07] generated the solution in Python
[5:09] >> now I could generate a solution in Go,
[5:10] and have more Go programming language
[5:13] training data.
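The filtering of rollouts described above (does it compile, does it pass the unit tests, which survivor scores best) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used at Google: the names `compiles`, `passes_tests`, and `select_best` are hypothetical, the candidates are plain Python source strings, and the reward (shorter source wins) is an assumed stand-in for whatever characteristics a real system would score.

```python
from typing import Callable, Dict, List, Optional


def compiles(source: str) -> bool:
    """First filter: discard candidates that are not even valid Python."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def passes_tests(source: str, tests: List[Callable[[Dict], bool]]) -> bool:
    """Second filter: run the candidate and check every unit test.

    A real system would execute untrusted code in a sandbox; plain exec()
    is used here only to keep the sketch short.
    """
    namespace: Dict = {}
    try:
        exec(source, namespace)
        return all(test(namespace) for test in tests)
    except Exception:
        return False


def select_best(
    candidates: List[str],
    tests: List[Callable[[Dict], bool]],
    reward: Callable[[str], float],
) -> Optional[str]:
    """Apply the filters in order, then keep the highest-reward survivor."""
    survivors = [c for c in candidates if compiles(c)]
    survivors = [c for c in survivors if passes_tests(c, tests)]
    return max(survivors, key=reward) if survivors else None


# Four hypothetical rollouts for the task "write add(a, b)".
candidates = [
    "def add(a, b) return a + b",                            # does not compile
    "def add(a, b):\n    return a - b",                      # fails the tests
    "def add(a, b):\n    result = a + b\n    return result",  # passes
    "def add(a, b):\n    return a + b",                      # passes, shorter
]
tests = [
    lambda ns: ns["add"](2, 3) == 5,
    lambda ns: ns["add"](-1, 1) == 0,
]
best = select_best(candidates, tests, reward=lambda src: -len(src))
print(best)
```

The surviving solution could then be added to the training data, or used as a fully specified prompt for translation into another language, as discussed above.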
[5:14] >> That's like an incredible kind of
[5:16] augmentation like augmentation before
[5:18] with convolutional neural networks, you
[5:20] know, it was just just shift the image
[5:22] by a couple pixels and whatnot and here
[5:24] the augmentation can be like completely
[5:26] different programming language and and
[5:28] whatnot.
[5:28] >> Yeah, I mean I think you know a lot of
[5:30] times we think about coding based
[5:31] problems as you go from natural language
[5:34] which is
[5:35] >> often very underspecified. It's like you
[5:37] know, make me a cool Space Invaders game
[5:39] or something. Um, but actually if you
[5:42] have a program that already works that
[5:44] does what you want and you want to
[5:45] translate it, that's awesome because in
[5:47] effect your prompt is the fully
[5:50] specified behavior of the system you
[5:52] want and you just want it in a different
[5:54] language for whatever reason. Maybe
[5:56] better performance or better safety
[5:58] characteristics or whatever. So that
[6:00] we've seen internally with some tools
[6:02] that have been written in Python and
[6:03] people have been able to sort of just
[6:05] say
[6:06] >> please use all the tests for this code
[6:08] and the actual Python codebase and make
[6:10] different versions of it and found you
[6:12] know much faster solutions.
[6:14] >> So you can you can suddenly get so much
[6:16] more out of the same amount of data
[6:18] basically.
[6:18] >> Yeah. So that's that's why you're not
[6:20] worried about the data. Okay. Nice. Now
[6:22] Bill Dally has said that something like
[6:25] 90% of what happens in modern data
[6:27] centers is not training anymore which I
[6:30] I found really surprising. It's
[6:32] inference, like there's less
[6:34] training and more using like relatively
[6:36] speaking.
[6:37] >> Um how does that shift the way you
[6:39] design hardware at Google? Yeah, I mean
[6:41] I first there's a lot of other things
[6:43] that are not either inference or
[6:45] training that happen in data centers
[6:46] like all the applications we run and
[6:48] search and Gmail and so on. But of the
[6:51] sort of machine learning workloads you
[6:53] know I it is the case that training
[6:57] is becoming, you know, a smaller proportion
[7:00] of the overall compute that we want to
[7:02] do because there's so much you know
[7:04] inference workload you want to do and
[7:06] the inference workload includes both
[7:08] like offline inference, sort of RL
[7:12] rollouts during RL training uh and then
[7:15] also online inference for handling user
[7:17] requests or agent-based behavior.
[7:20] Because of that shift and the different
[7:23] characteristics of those two kinds of
[7:24] computations, it makes a ton more sense
[7:27] to now specialize much more for
[7:30] inference workloads in hardware for
[7:32] example. Um because the characteristics
[7:34] are quite different. You need lower
[7:36] precision. You
[7:37] >> you know are handling a very large
[7:39] volume of requests on this particular
[7:41] model. The model weights don't
[7:43] necessarily change uh at inference time.
[7:46] Um all these things lead to very
[7:48] different solutions for hardware and
[7:49] much more energy efficiency can be
[7:51] gained by specializing and so I think
[7:53] you'll see a lot more in this area uh
[7:56] you know now and in the future. We've
[7:58] already done this with our TPU uh 8i and
[8:01] 8T chips that we announced a couple um
[8:03] maybe a month ago.
[8:04] >> Um but you'll see even more
[8:05] specialization I think.
[8:07] >> And that's pretty crazy that you said
[8:09] that even FP4 kind of works. And I when
[8:12] I first heard it I was like it cannot
[8:14] possibly work, it cannot do anything useful
[8:16] and and it does.
[8:17] >> Yeah. If you told that to a computer
[8:19] scientist from 15 years ago, they'd be
[8:21] like,
[8:22] >> that's not enough numbers.
[8:24] >> Yeah. Yeah. Exactly.
[8:25] >> And I look at every now and then at
[8:27] these papers and you you you have these
[8:28] these different transforms that are the
[8:30] the the distance preserving transforms,
[8:33] rotations between the points and all
[8:35] kinds of compression. But still FP4,
[8:37] that's unbelievable. It's not many bits
[8:39] for exponent or mantissa or sign
[8:41] >> and it and it and and it's high quality,
[8:43] you know, intelligence that comes out of
[8:45] it. So, it's just
[8:46] >> it's a good sign that it works.
[8:48] >> Yeah. Yeah. But I I I don't know if we
[8:50] can get lower. Uh what what do you think
[8:52] like even lower?
[8:53] >> Possible. I mean I think um you know
[8:55] people are seeing and experimenting with
[8:57] things where you have some even lower
[8:59] precision and then every so many
[9:02] weights of that you know lower precision
[9:04] you have a scaling factor and that seems
[9:06] like you get a little bit of a higher
[9:07] precision thing that's kind of shared
[9:09] across all the other lower-bit
[9:12] precision formats, whatever they might
[9:14] be two bit integer one bit integer you
[9:17] know I haven't heard anyone say two bit
[9:19] float because I'm not sure what that
[9:20] would mean
[9:21] >> but um yeah I
[9:23] that plus a scaling factor seems to be
[9:25] able to get you pretty far.
[9:27] >> And the question is like how often do
[9:29] you need the scaling factor? Is it every
[9:30] 64 or 128 or 256 weights?
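The idea of very low-precision weights plus a higher-precision scaling factor shared by every block of 64, 128, or 256 weights can be sketched as below. This is a minimal illustration assuming symmetric signed-integer codes and a per-block scale equal to the largest absolute weight divided by the largest code; the function names are hypothetical, and real formats (including the FP4 variants mentioned above) differ in their details.

```python
import math
from typing import List, Tuple

Block = Tuple[float, List[int]]  # (shared scale, low-precision integer codes)


def quantize_blockwise(weights: List[float], block_size: int = 64, bits: int = 4) -> List[Block]:
    """Quantize weights to `bits`-bit signed integers, one scale per block."""
    if bits < 2:
        raise ValueError("this symmetric sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit codes in [-7, 7]
    blocks: List[Block] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # The scale is the one higher-precision number the block shares.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks: List[Block]) -> List[float]:
    """Recover approximate weights: code times the block's shared scale."""
    return [scale * code for scale, codes in blocks for code in codes]


weights = [math.sin(i) for i in range(256)]
blocks = quantize_blockwise(weights, block_size=64, bits=4)
restored = dequantize_blockwise(blocks)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(blocks)} blocks, worst absolute error {worst:.4f}")
```

Smaller blocks mean more scaling factors to store but a tighter fit to each block's range, which is the trade-off behind the question of how often the scaling factor is needed.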
[9:34] >> Pre and post training are typically
[9:36] separate steps today. Do you see that
[9:39] split holding or do you expect the two
[9:41] to merge as as capabilities increase?
[9:43] >> Yeah, I mean I feel like it's a little
[9:45] intellectually dissatisfying that they
[9:47] are these distinct phases and you do one
[9:49] and then you do the other. it like
[9:52] conceptually the right uh thing to do is
[9:56] to have interleaved periods where you're
[9:59] sort of observing data and then periods
[10:02] where you're trying to use that new
[10:05] knowledge you've gotten from the data
[10:06] you
[10:06] >> like like with DQN this experience
[10:09] replay kind of thing
[10:10] >> yeah and then you want to now take
[10:12] actions in some environment maybe it's a
[10:14] simulated environment maybe it's the
[10:16] world with a robot or whatever it is and
[10:19] then you know learn from those actions
[10:21] because I think you get a lot more
[10:23] benefit from actually um taking actions
[10:28] and observing the consequences or trying
[10:29] to write code and seeing does the code
[10:32] work
[10:32] >> than you do from just passively sitting
[10:35] there and seeing tokens streamed by you
[10:37] which is really what most of
[10:38] pre-training is these days. It's really
[10:40] interesting that you say that that in an
[10:42] interleaved manner, because when I
[10:44] hear merging the two, what comes to my
[10:47] mind is continuous like continuous
[10:49] learning
[10:50] >> but at the same time people have to test
[10:51] models you cannot just chuck it out
[10:53] there you know you finish training you
[10:55] finish the post and and then maybe the
[10:57] red teaming steps and and and you know
[10:59] safety and everything and then you
[11:01] package it up and you say okay this is
[11:02] good to go but if there's continuous
[11:04] learning, then there are new challenges
[11:06] because how do you know that this
[11:09] >> intermediate state is actually safe.
[11:11] Maybe some more research there too.
[11:13] >> Yeah, I mean I think uh first like a
[11:16] bunch of discrete steps where maybe you
[11:18] do this a 100 times or a thousand times
[11:20] starts to look more like an integral
[11:21] than a summation.
[11:23] >> Um and so um
[11:26] >> I do think interleaving in that way will make
[11:28] sense
[11:30] >> but you're right
[11:31] >> like you have a bunch of things you need
[11:33] to do for a live model that is serving
[11:36] user requests. You need to make sure
[11:37] that it's safe. Um so it may be that the
[11:40] continual learning happens and then
[11:42] there's some uh application of uh you
[11:45] know safety protocols and red teaming as
[11:47] you say uh and then you release a new
[11:49] version of that but then that model
[11:51] still continues to learn kind of behind
[11:53] the scenes and then before the newest
[11:56] version of it is is provided to users
[11:58] you redo the sort of final safety
[12:00] testing and red teaming. Jensen likes to
[12:03] say that compute capabilities advanced 1
[12:06] millionx over the last 10 years. So if
[12:10] in the next 10 years, assuming we get
[12:12] another 1 millionx, what would we be
[12:14] able to do that we cannot do now?
[12:17] >> Yeah. I mean it's like imagining the
[12:19] future is always a hard thing because
[12:21] this field is moving quickly.
[12:23] >> I mean I think if you think back, you
[12:25] know, 10, it was 10 years.
[12:26] >> 10 years.
[12:27] >> 10 years. If you think back 10 years,
[12:29] you know, we were kind of just starting
[12:31] to have language models that were the
[12:33] sequence to sequence paper had appeared.
[12:36] You know, it was just before the
[12:37] transformer.
[12:38] >> LSTMs, maybe
[12:38] >> LSTMs were were popular.
[12:41] >> Um, and now those models sort of look uh
[12:46] >> ancient and not nearly as
[12:48] capable as the models we have today. So,
[12:50] I think if you project forward that
[12:51] level of advancement, you're going to
[12:53] see
[12:54] >> huge investments in both like new kinds
[12:56] of hardware
[12:57] um you know new kinds of research
[12:59] techniques uh there's just a lot more
[13:01] attention being paid to the field. So I
[13:03] I see that progress rate not slowing
[13:06] down um over the next 10 years. And so
[13:09] that's going to be incredible like the
[13:10] multi-agent workflows we're now able to
[13:13] start to
[13:14] >> kind of get to work on very complicated
[13:16] tasks like you saw in the IO uh keynote
[13:19] >> being able to write an operating system
[13:22] >> autonomously with a relatively simple
[13:24] prompt.
[13:25] >> Crazy. uh you know obviously there's a
[13:26] lot of operating-system-like things in
[13:29] the training data so it's not completely
[13:30] out of distribution but you know the
[13:33] fact that it's able to build an OS that
[13:34] can run Doom uh successfully is is
[13:37] pretty amazing
[13:38] >> I couldn't couldn't believe it I mean
[13:40] last year I heard a talk from Stephen
[13:42] Balaban the Lambda CEO
[13:44] >> and he had this neural OS like hey you
[13:47] know it it does more and more like like
[13:49] forget the UI forget forget the maybe
[13:52] the drivers I don't know but but just
[13:54] let's let's have a neural OS and I was
[13:55] like, "Yeah, that that sounds like an
[13:57] amazing science fiction idea. I would
[13:59] love to see it, but I don't know. I
[14:01] mean, it sounds far off." A year later
[14:03] and we got, you know, not exactly
[14:06] like that. I know but but if if you look
[14:09] at the derivatives over time
[14:11] >> I mean I would say one thing I'm
[14:12] particularly excited about is
[14:15] you know can we with these tools
[14:18] accomplish so much more in you know
[14:20] science Demis was mentioning in the
[14:21] keynote or in you know complicated
[14:24] engineering tasks that often would take
[14:26] you know lots and lots of people
[14:28] multiple years to accomplish. Could you
[14:30] actually have a system that with the
[14:33] correct access to the right kinds of
[14:34] simulation environments and a learning
[14:37] set of agents that are trying to
[14:38] accomplish the task and break it down
[14:39] into smaller tasks,
[14:41] >> could you design an airplane in, you
[14:43] know, five days instead of, you know,
[14:46] many many years? That would be amazing.
[14:48] >> 1 millionx and we we can we can try
[14:50] again.
[14:51] >> Yeah. I mean, we're not there yet, but
[14:52] that would be a pretty pretty amazing
[14:54] capability. Or designing new new
[14:56] computer chips or computer systems, new
[14:58] hardware. Um, you know, I'm pretty
[15:00] excited about that.
[15:01] >> Yeah, incredible times. Are open models
[15:04] standing on the shoulders of giants? And
[15:07] by that I mean, if frontier models
[15:09] suddenly stopped being released, would
[15:11] open models improve as quickly as they
[15:13] do now or is their progress mostly
[15:16] driven by distillation?
[15:17] >> Yeah, I mean I think certainly a bunch
[15:19] of the progress is driven by
[15:20] distillation. For example, our own Gemma
[15:22] models are definitely distilled from
[15:24] higher quality larger scale models. Um
[15:27] and I think a lot of other open source
[15:28] models are getting benefit from
[15:30] distillation data. Uh distillation has
[15:32] always been a you know amazing way to
[15:34] get really capable models into a smaller
[15:38] footprint thing and you know uh that's
[15:40] how our flash models are quite capable
[15:42] for their size relative to the pro
[15:44] models is we're able to use the pro
[15:46] model to
[15:47] >> to teach the the flash models. So I mean
[15:50] I think really the the question is
[15:54] uh not so much one of closed versus
[15:57] open. It's you know if we want small
[16:01] incredibly capable models we have to
[16:03] keep building larger scale models that
[16:06] are maybe less inference efficient but
[16:08] are more capable and then use
[16:10] distillation
[16:11] >> to uh you know transfer the knowledge
[16:14] into into the smaller models whether
[16:15] they are open or closed. Now I'm I'm
[16:17] wondering you might be the only one who
[16:19] can answer that. So I I really want to
[16:21] ask this. Everyone has their their
[16:23] flagship models and yes the distilled
[16:25] models like pretty much every company
[16:26] does this tiered level thing. the
[16:29] quicker, faster models were always
[16:31] well below the frontier models, and
[16:33] at some point I think 3.1 where there
[16:36] was one version where where the the
[16:40] quick one was suddenly so so close to
[16:43] the frontier one there was like a 3%
[16:45] difference
[16:46] >> in in in tough benchmarks and and I just
[16:50] heard someone saying I don't even know
[16:51] who that was that that yeah it's not
[16:53] like just distillation there is some
[16:55] magic sauce in there that's been in the
[16:57] works for years. So, can I hear a bit
[16:59] about that?
[17:00] >> Sure. Well, not too much. I mean, there
[17:02] is always some magic sauce that we don't
[17:04] reveal, but distillation is definitely
[17:06] one of the key things that makes those,
[17:08] you know, much smaller models much
[17:10] cheaper, much faster, much more
[17:12] affordable um models be, you know,
[17:15] nearly as good as those frontier models.
[17:17] And then we push ahead and build an even
[17:20] better frontier model. And then we have
[17:22] to then do the process again where we
[17:24] now transfer the knowledge in the
[17:27] really capable frontier model back
[17:29] into a lighter-weight one. And I think
[17:31] um you know this is this is really
[17:34] important because the flash models are
[17:37] really the workhorse of what people
[17:39] generally want to use because they're
[17:41] you know they're almost as capable. We
[17:43] saw it. Yeah.
[17:43] >> Yeah. And uh
[17:45] >> and they're they're quite good.
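Distillation, as discussed above, trains a smaller student model to match the output distribution of a larger teacher. The sketch below shows the textbook soft-target formulation (a KL divergence between temperature-softened distributions) for a single prediction. It is an assumption-laden illustration, not a description of how the Flash or Gemma models are actually trained: the function names, the temperature value, and the example logits are all hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) over softened distributions.

    A higher temperature flattens both distributions, so the student also
    learns from the teacher's relative preferences among unlikely outputs.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different output

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

Minimizing this loss over many examples pushes the student toward the teacher's behavior, which is why each new frontier model makes another round of distillation possible.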
[17:47] >> Yeah. It's unbelievable how close they
[17:48] can get like this. This didn't used to
[17:50] be like that at all. All right. What
[17:53] trends in machine learning are you most
[17:54] excited about right now? You you have a
[17:56] separate talk about like exciting trends
[17:58] in machine learning or something like
[18:00] that.
[18:00] >> Yeah. I mean
[18:01] >> what's what's the newer version of that?
[18:02] >> Yeah, the newer version I guess I mean
[18:04] there's a few different trends that I
[18:06] think are really exciting. The one is um
[18:09] uh
[18:11] so first I think continual learning is
[18:14] still a little bit nascent, but I think
[18:16] looking at ways to make models that are
[18:19] more interleaved in their use of,
[18:22] sort of seeing data passively and taking
[18:24] action and learning from that seems like
[18:26] a really important thing. Uh you know
[18:28] agents and multi-agent use of these
[18:31] systems is really really important. Um,
[18:35] as one trend of that though, I think as
[18:37] you see, uh, you know, we're going to
[18:39] need a lot more inference hardware and
[18:41] capability for that because those
[18:44] systems that are working autonomously in
[18:46] the background actually consume lots of
[18:47] tokens in order to sort of
[18:49] >> do the the kind of important work
[18:51] they've been asked to do. Um, you know,
[18:54] I think, uh, being able to build really
[18:57] efficient inference hardware will enable
[18:59] a lot of of things. So looking at you
[19:02] know co-design of model architectures
[19:05] and hardware architectures to make sort
[19:08] of the best use of um things and have
[19:11] really good properties in terms of very
[19:12] low latency you know much higher
[19:15] performance per watt performance per
[19:16] dollar are things we we really care
[19:18] about.
[19:19] um you know I think looking at how do
[19:22] you you know the context window of these
[19:25] models is an important characteristic
[19:27] but
[19:28] uh I think there's a lot we could do if
[19:31] we come up with mechanisms that are sort
[19:33] of cascaded series of things that kind
[19:36] of give you the illusion that you have
[19:38] all information in the context window
[19:40] >> like you'd like to have the whole
[19:42] internet at your model's fingertips
[19:45] >> or on a personal level if you've opted
[19:47] in you know all of your email and your
[19:49] photos and your the videos you've
[19:51] watched and things like that. Um, but
[19:54] you can't really do it with the sort of
[19:56] quadratic attention mechanism. But I
[19:57] think you can build a series of kind of
[20:00] retrieval and lighter weight mechanisms
[20:03] and then ways of cascading from you know
[20:07] here are the 30,000 documents out of 10
[20:09] billion that seem most relevant and then
[20:11] you know have a lighter weight model
[20:14] that looks at those and decides these
[20:16] 117 things seem really relevant to what
[20:19] you're trying to do and puts those in
[20:21] the sort of more expensive context
[20:23] window of a a bigger model perhaps. Uh
[20:26] that's going to be kind of exciting. And
[20:27] how do you orchestrate and interleave all
[20:29] that stuff so it gives you the illusion
[20:32] uh without you having to even think
[20:33] about it?
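The cascade described above (a cheap pass narrows a huge corpus to a shortlist, a lighter-weight model narrows the shortlist further, and only the final handful goes into the expensive context window) can be sketched as follows. This is a toy illustration: `cheap_score` and `rerank_score` are hypothetical word-overlap stand-ins for a real retrieval index and a real lighter-weight model, and the corpus is five strings rather than billions of documents.

```python
import heapq
from typing import List


def cheap_score(query: str, doc: str) -> int:
    """Stage 1 stand-in: number of distinct query words found in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def rerank_score(query: str, doc: str) -> float:
    """Stage 2 stand-in for a lighter-weight model: share of the document's
    words that are query words, which favors focused documents."""
    query_words = set(query.lower().split())
    words = doc.lower().split()
    return sum(w in query_words for w in words) / len(words) if words else 0.0


def cascade(query: str, corpus: List[str], k_retrieve: int, k_rerank: int) -> List[str]:
    """Narrow the corpus in two steps, each costlier per document than the last."""
    shortlist = heapq.nlargest(k_retrieve, corpus, key=lambda d: cheap_score(query, d))
    return heapq.nlargest(k_rerank, shortlist, key=lambda d: rerank_score(query, d))


def build_context(query: str, docs: List[str]) -> str:
    """Only the final few documents reach the big model's context window."""
    return "\n\n".join(docs + [f"Question: {query}"])


corpus = [
    "tpu inference chips use low precision",
    "a recipe for tomato soup",
    "inference chips",
    "low precision inference on tpu chips is energy efficient "
    "and also many other unrelated words here",
    "gardening tips",
]
query = "tpu inference precision"
selected = cascade(query, corpus, k_retrieve=3, k_rerank=2)
print(build_context(query, selected))
```

In the numbers mentioned above, `k_retrieve` would correspond to something like 30,000 out of 10 billion documents and `k_rerank` to the 117 that finally enter the context window.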
[20:34] >> Interesting. So it's very advanced games
[20:36] to be played with the context window
[20:37] because obviously very expensive. So the
[20:39] attention mechanism, you get big O of n
[20:41] squared.
[20:42] >> Are we still there, or do we have
[20:45] some, I mean, I've heard some n log n
[20:47] things. Can we go lower? There's like a
[20:49] whole series.
[20:49] >> Obviously we can go lower but the
[20:51] question is what what the trade-offs are
[20:52] right like what do you have to pay for
[20:54] that? Yep. um where are we in that?
[20:56] >> Yeah, I mean I think there's actually
[20:58] quite a large body of work there
[21:00] probably, you know, a hundred papers on
[21:02] more efficient context algorithms
[21:05] than the N squared one.
[21:07] >> I mean the N squared one works really
[21:09] well. uh so it has a pretty high bar but
[21:12] I do think there is traction in finding
[21:14] things that are you know much lower cost
[21:17] whether it's you know reducing
[21:18] algorithmic factors or very large
[21:21] constant factors on the base n squared
[21:23] algorithm I think all of these are
[21:25] pretty exciting you can actually combine
[21:26] many of these these approaches
[21:29] >> um and and get uh you know much cheaper
[21:31] attention over many more tokens
[21:34] >> yeah I think that's one of the most
[21:35] important things because if it was
[21:37] cheaper in some sense and and and and
[21:39] you could still find the the needles in
[21:41] the in the haststack over very long
[21:44] contexts. Then you could you could have
[21:45] some sort of lifetime AI thing.
[21:48] >> Yeah, totally. Like I'd like my whole
[21:49] life of all the digital things I've seen
[21:52] uh in there. Uh as a say internal Google
[21:56] developer, I'd love for the entire
[21:57] Google codebase to be in there, which is
[21:59] you know probably 10 billion lines of
[22:01] code, probably, you know,
[22:04] 100 billion tokens.
[22:04] >> I just want my wine list.
[22:06] >> I just want 100 billion. All I want is a
[22:08] 100 billion tokens of attention. It's
[22:10] all I need.
[22:11] >> Amazing. I think we got to do this one.
[22:13] So, Google's data centers run an
[22:15] enormous number of machines. And at that
[22:17] scale, anything that can go wrong will
[22:19] go wrong. Like I hear that wires wear
[22:22] down,
[22:23] >> hard drives fall apart, motherboards
[22:25] overheat. Um, is that something that
[22:27] actually happens day by day? And do you
[22:29] have any good stories?
[22:30] >> Absolutely. I mean, I don't have that
[22:33] many personal stories, but there used to
[22:34] be a chat group internally called Data
[22:36] Centers on Fire that would have like
[22:38] exciting uh exciting events happening
[22:40] and sometimes exciting videos. Um yeah,
[22:43] I mean I think
[22:45] >> at scale lots of things that are very
[22:47] very unexpected happen and usually those
[22:49] are the combination of one thing fails
[22:52] and something else fails simultaneously
[22:54] or in cascade, yeah, you
[22:56] have a cascaded failure of some sort.
[22:58] You know, sometimes that means some
[23:00] software system stops working. Sometimes
[23:02] it means like the the bus bar overheats
[23:05] and you get too much power to the to the
[23:07] rack and like it catches on fire. I mean
[23:10] that's a much rarer thing. But um you
[23:12] know you have to be prepared for this
[23:13] and I think one of the things even from
[23:15] the very earliest days of Google is we
[23:18] have really focused on how do you build
[23:21] reliable systems out of unreliable
[23:23] parts. Yes.
[23:24] >> Right. Like in the earliest Google days,
[23:25] we were buying consumer machines without
[23:28] ECC memory. Not only not
[23:31] ECC, not even parity.
[23:33] >> we were buying consumer motherboards
[23:35] that didn't have like redundant power
[23:37] supplies and you can do that if you can
[23:41] handle things at a higher level and
[23:42] that's generally what we try to do in
[23:44] all cases is
[23:45] >> I actually wanted to ask you about that
[23:47] the ECC thing because here here's one of
[23:49] my favorite failure modes if if that's
[23:52] true but you you tell me the distant
[23:54] supernova goes off, a cosmic ray hits a
[23:57] memory cell and a zero flips to a one.
[24:00] Does that really happen?
[24:01] >> Oh yeah. Yeah, absolutely. I mean, alpha
[24:04] particles definitely can flip uh you
[24:06] know DRAM state. We've actually observed
[24:08] this because we have monitoring data of
[24:10] how many ECC uh errors and like single
[24:14] bit errors that are corrected and
[24:15] two-bit errors that are not corrected
[24:17] are happening in all of our machines.
[24:19] And you can actually see this where some
[24:21] clusters that are pointing in a
[24:23] particular direction in the earth have a
[24:25] much higher rate for a you know a brief
[24:27] period like 10-minute period or
[24:28] something and then the other ones in the
[24:29] other side of the earth do not have
[24:30] that. So it's definitely something that
[24:32] happens.
[24:33] >> How worried should I be? Because MacBook
[24:35] Pros don't have ECC memory as far as I
[24:37] know like for for one machine is it so
[24:39] vanishingly you know unlikely that you
[24:42] shouldn't care but for data center or
[24:44] >> I mean for one machine it's generally
[24:45] not too bad. I mean I I think they have
[24:47] parity, so at least they detect it, typically,
[24:49] if it's a single bit error
[24:50] >> so detection but not fixing
[24:52] >> right but ECC usually gives you single
[24:54] bit error correction and double-bit
[24:57] error detection. Yeah.
[24:59] >> So with that you don't have to worry
[25:00] about it too much
[25:02] >> um at a single machine level but even at
[25:05] you know tens of thousands of machines
[25:07] you'd have to start thinking about that.
[25:08] So you know one of the things we did
[25:10] when we were using machines without even
[25:12] parity is we built an entire
[25:15] software-based checksumming system for
[25:17] large amounts of our data. So
[25:18] >> doing it by hand
[25:19] >> doing it by hand essentially and like we
[25:21] would you know for crawling web pages
[25:24] and putting them in the index
[25:25] >> you know if you detect that this
[25:28] particular record is corrupted it's
[25:29] generally okay to just, you know,
[25:32] ignore that record.
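A software-based checksum of the kind described above can be sketched in a few lines: store a checksum next to each record when it is written, recompute it when the record is read, and skip any record where the two disagree. This is only an illustration under assumed details, not Google's actual system: the record layout (a 4-byte CRC-32 followed by the payload) and the function names are hypothetical.

```python
import zlib
from typing import Iterable, Iterator, Optional


def pack_record(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 so corruption can be detected later."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack_record(record: bytes) -> Optional[bytes]:
    """Return the payload, or None if the stored and recomputed CRCs differ."""
    if len(record) < 4:
        return None
    stored = int.from_bytes(record[:4], "big")
    payload = record[4:]
    return payload if zlib.crc32(payload) == stored else None


def read_valid(records: Iterable[bytes]) -> Iterator[bytes]:
    """Yield only intact payloads, silently ignoring corrupted records."""
    for record in records:
        payload = unpack_record(record)
        if payload is not None:
            yield payload


good = pack_record(b"crawled page one")
damaged = bytearray(pack_record(b"crawled page two"))
damaged[6] ^= 0x01  # simulate a single flipped bit in memory or on disk

print(list(read_valid([good, bytes(damaged)])))
```

Unlike ECC, this detects corruption without repairing it, which is acceptable when dropping one crawled page out of many does little harm.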
[25:34] >> Now I have something interesting for
[25:35] you. I call it lightning round. So,
[25:37] please try to answer in one sentence.
[25:39] One word is okay. One one sentence.
[25:41] >> Can I make run-on sentences?
[25:44] >> We'll see. We'll see. So, I I read that
[25:46] Jeff Dean's pin code is the last four
[25:49] digits of pi. I I give this one an eight
[25:52] out of 10. So, my question is, do you
[25:54] enjoy these Chuck Norris style jokes
[25:57] about you?
[25:58] >> It could be true. Um uh I I do enjoy
[26:02] them. I mean, it's a April Fool's joke
[26:04] gone awry by my colleagues in 2009, but
[26:07] it's both very flattering and kind of
[26:08] embarrassing.
[26:10] >> I think I think he felt the same way
[26:12] about them, too. But he he he enjoyed
[26:14] them, too. Legend. All right. One big
[26:16] thing that you were wrong about and came
[26:18] around.
[26:20] I think AI is going to influence health
[26:23] care quite dramatically, but I think it
[26:27] is harder not necessarily for technical
[26:30] reasons, but for you know, how do you
[26:32] actually get things in regulated
[26:35] industries that are super important and
[26:36] have all kinds of privacy constraints
[26:38] and safety concerns,
[26:39] >> but I think ultimately that will happen.
[26:42] It's just taking longer than than I I
[26:44] hoped. Yes. Because I think there's
[26:46] tremendous world benefit to do it. Um,
[26:49] but we need to do it carefully and
[26:50] safely.
[26:50] >> Vim or Emacs or something else? Hint,
[26:54] there's only one good answer.
[26:55] >> Emacs. Was that it? Oh, no.
[27:00] >> Look, I I'm a Vim person, but but I'm
[27:02] I'm not
[27:04] >> Maybe I'm I'm an embarrassment of a Vim
[27:06] person because I I I looked at Emacs,
[27:08] too, and I was like, that's pretty cool,
[27:10] too, but I I don't want to learn both.
[27:11] It's it's just so much time. So,
[27:13] >> yeah, it's true. One can spend a lot of
[27:14] time customizing Emacs. >> The vimrc I wrote
[27:18] up and then and then it never ends.
[27:20] Yeah. One problem that you tried
[27:22] to solve many times but have never been
[27:24] able to crack.
[27:29] >> I mean I think in some sense we still
[27:31] don't have an answer to how do you do
[27:33] continual learning appropriately? That's
[27:35] something I've thought about a little.
[27:36] I've dabbled a little bit with some
[27:38] some techniques along with colleagues.
[27:40] >> But I think uh you know if we're able to
[27:43] crack that it's going to be amazing. Um,
[27:45] but it's not there yet.
[27:46] >> Last one. Favorite Two-Minute Papers
[27:49] episode.
[27:51] >> Oh,
[27:53] yeah. I mean, I assume the the
[27:55] Transformer one was a good one.
[27:56] >> All right. All right. Well, that's
[27:58] that's a good one. Okay, Jeff, I
[28:00] learned a lot today. Thank you so much.
[28:02] Nice chatting with you again.
[28:03] >> Thank you so much.
[28:04] >> Thank you.
[28:04] >> Here you see me running the full
[28:06] Deepseek AI model through Lambda GPU
[28:09] cloud. 671
[28:12] billion parameters running super fast
[28:15] and super reliably. This is insane. I
[28:18] love it and I use it on a regular basis.
[28:22] Lambda provides you with powerful NVIDIA
[28:24] GPUs to run your own chatbots and
[28:27] experiments. Seriously, try it out now
[28:30] at lambda.ai/papers
[28:33] or click the link in the description.