[0:00] There used to be a chat group internally [0:01] called Data Centers on Fire that would [0:03] have like exciting uh exciting events [0:06] happening. [0:06] >> A distant supernova goes off, a cosmic [0:09] ray hits a memory cell and a zero flips [0:11] to a one. Does that really happen? [0:14] >> Oh yeah. [0:14] >> So my question is do you enjoy these [0:16] Chuck Norris style jokes about you? [0:19] >> It could be true. >> Um, one problem that [0:22] you solved, tried to solve many times but [0:24] have never been able to crack. [0:30] I cannot believe that this is happening, [0:32] but I got to talk to a legendary [0:35] engineer, the chief scientist of Google, [0:38] Jeff Dean. He led Google Brain, one of [0:40] the most legendary AI labs in history. [0:43] He co-created MapReduce, which taught [0:46] thousands of computers to work together [0:49] as one. He co-built TensorFlow, the [0:52] engine behind a huge chunk of AI [0:54] research. And for all this, they call [0:57] him the Chuck Norris of computer [0:59] science. Yes, I will tell him a joke [1:02] about that too. Now, when I see [1:04] interviews with these executives, [1:06] everyone is asking about China and taxes [1:09] and all that. Look, I know nothing about [1:11] that. I am just a student who loves to [1:15] talk about research. So, my goal was to [1:17] try to go a bit deeper and ask him [1:20] questions that maybe only he knows the [1:23] answer to, which is incredible. I'll [1:26] also ask him about problems that even he [1:28] couldn't solve yet. And I will ask him [1:30] about some of the secret sauce at Google [1:33] and see if we get something and more. [1:36] And I am so happy to share it with you, [1:38] fellow scholars, so we can learn [1:41] together. I am not sure if I saw Jeff [1:43] smile and laugh this much before. So, I [1:46] hope he enjoyed it too. And once again, [1:48] this is an incredible honor. I cannot [1:51] believe that I was sitting there. 
There [1:53] were some production issues with the [1:55] video part. I apologize for those. Also, [1:58] I was super nervous. I could barely hold [2:01] on to my papers. Now, fellow scholars, [2:04] let's learn together with Jeff Dean. [2:07] Thank you so much for doing this, Jeff. [2:08] We talked a bit last year and I learned [2:11] so much from you. It was incredible. And [2:14] then I got a message that we we get to [2:16] do this and I was so happy. [2:18] >> So thank you so much for this and and we [2:20] get to share your knowledge. [2:22] >> A small part of your knowledge with with [2:24] the fellow scholars. So that's that's [2:26] absolutely [2:26] >> it was great chatting with you last [2:27] year. I'm looking forward to this. [2:29] >> Thank you. Thank you. So everyone says [2:32] that we are running out of training data [2:33] for LLMs, but you you said that there is [2:36] still plenty of data out there. [2:39] >> What did you mean? [2:40] >> Yeah, I mean I think everyone has this [2:42] view that uh we're running out of [2:44] training data and um it's true we've [2:47] like used quite a lot of of the public [2:49] text data in the world. Um but I think [2:52] there's lots of interesting video data [2:54] that we're not really training on yet. [2:56] uh there's lots of interesting kind of [2:58] um ways to generate synthetic data and [3:01] then use that for training [3:03] >> and then I also think we can start doing [3:05] things like uh making more passes over [3:08] the data that we do have to make more [3:10] and more capable models and also come up [3:12] with algorithmic techniques that enable [3:14] us to get a lot more information from [3:17] every piece of data that we do have. So [3:19] I'm not too worried about that as like [3:21] an impediment to making progress. It [3:23] seems like there's lots and lots of [3:24] things we can do. 
People also say that [3:26] with so much simulation data, as you [3:28] mentioned, sooner or later most of the [3:30] data will be AI generated, which is then [3:33] used to train a different AI, and then [3:36] suddenly everyone starts to, you know, [3:37] learn on the same thing. But you said, [3:40] wait, it still helps. I think the argument [3:43] was that uh if you have enough compute [3:45] you can crunch through a lot of data and [3:48] if there is just a little needle in the [3:50] haystack that's useful, the system is [3:52] able to learn from it. Is that true? [3:54] Because in my previous crappy little [3:57] experiment uh it was not true at all. [4:00] So you had to be very careful with the [4:02] data. [4:02] >> Yeah. I mean I think it is true in [4:04] general. I mean there's a lot of details [4:06] to get right to make this a reality. [4:08] Think about for example doing RL [4:12] training and rollouts to uh you know [4:15] figure out how to solve some fairly [4:17] high-level phrased uh coding question, right? [4:21] So you might explore a hundred or a [4:24] thousand different ways of generating [4:26] solutions to these problems and you [4:28] might have some, you know, some filters [4:30] that you apply to these things like does [4:32] the code even compile? Well, you can [4:33] throw out 800 of them right off the bat. [4:36] >> Uh does it pass the unit tests? Does it [4:39] like perform well? 
And so you can really [4:42] start to hone in on like which of these [4:44] you know potentially many solutions to [4:46] the problem is the one that actually [4:49] sort of generates the highest you know [4:51] characteristics that you're looking for, [4:52] the reward in some sense. [4:54] >> And that I think is definitely true, [4:56] like more compute will generate you more [4:59] interesting solutions and then those can [5:01] then be put into the training data. They [5:03] can be enriched with like data [5:05] augmentation techniques. You know, I [5:07] generated the solution in Python, [5:09] >> now I could generate a solution in Go [5:10] and have more Go programming language [5:13] training data. [5:14] >> That's like an incredible kind of [5:16] augmentation. Like augmentation before [5:18] with convolutional neural networks, you [5:20] know, it was just shift the image [5:22] by a couple pixels and whatnot, and here [5:24] the augmentation can be like a completely [5:26] different programming language and [5:28] whatnot. [5:28] >> Yeah, I mean I think you know a lot of [5:30] times we think about coding based [5:31] problems as you go from natural language [5:34] which is [5:35] >> often very underspecified. It's like you [5:37] know make me a cool Space Invaders game [5:39] or something. Um, but actually if you [5:42] have a program that already works that [5:44] does what you want and you want to [5:45] translate it, that's awesome because in [5:47] effect your prompt is the fully [5:50] specified behavior of the system you [5:52] want and you just want it in a different [5:54] language for whatever reason. Maybe [5:56] better performance or better safety [5:58] characteristics or whatever. 
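The rollout filtering described here (drop candidates that do not compile, drop those that fail the unit tests, then keep the one with the best reward) might be sketched roughly as follows. This is only an illustration: the function names, the tiny candidate list, and the "shorter is better" reward are all assumptions made up for the example, not anything described in the interview.

```python
from typing import Callable, Dict, List, Optional

def compiles(source: str) -> bool:
    """First filter: does the candidate even parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def passes_tests(source: str, tests: List[Callable[[Dict], bool]]) -> bool:
    """Second filter: run the candidate and check every unit test.
    Note: exec on untrusted code needs a sandbox in any real system."""
    namespace: Dict = {}
    try:
        exec(source, namespace)
        return all(test(namespace) for test in tests)
    except Exception:
        return False

def reward(source: str) -> float:
    """Hypothetical reward: prefer shorter programs (a stand-in for speed, quality, etc.)."""
    return -float(len(source))

def best_candidate(candidates: List[str],
                   tests: List[Callable[[Dict], bool]]) -> Optional[str]:
    survivors = [c for c in candidates if compiles(c)]
    survivors = [c for c in survivors if passes_tests(c, tests)]
    return max(survivors, key=reward, default=None)

candidates = [
    "def add(a, b): return a - b",                       # compiles, fails the tests
    "def add(a, b) return a + b",                        # does not compile
    "def add(a, b):\n    result = a + b\n    return result",  # correct but longer
    "def add(a, b): return a + b",                       # correct and shortest
]
tests = [
    lambda ns: ns["add"](2, 3) == 5,
    lambda ns: ns["add"](-1, 1) == 0,
]
best = best_candidate(candidates, tests)
print(best)
```

The surviving candidate could then be added to a training set, which is the step the conversation turns to next.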
So that [6:00] we've seen internally with some tools [6:02] that have been written in Python and [6:03] people have been able to sort of just [6:05] say [6:06] >> please use all the tests for this code [6:08] and the actual Python codebase and make [6:10] different versions of it and found you [6:12] know much faster solutions. [6:14] >> So you can suddenly get so much [6:16] more out of the same amount of data [6:18] basically. [6:18] >> Yeah. So that's why you're not [6:20] worried about the data. Okay. Nice. Now [6:22] Bill Dally has said that something like [6:25] 90% of what happens in modern data [6:27] centers is not training anymore, which [6:30] I found really surprising. It's [6:32] inference, like there's less [6:34] training and more using, like relatively [6:36] speaking. [6:37] >> Um how does that shift the way you [6:39] design hardware at Google? Yeah, I mean [6:41] first there's a lot of other things [6:43] that are not either inference or [6:45] training that happen in data centers [6:46] like all the applications we run and [6:48] search and Gmail and so on. But of the [6:51] sort of machine learning workloads you [6:53] know it is the case that training [6:57] uh is becoming you know a smaller proportion [7:00] of the overall compute that we want to [7:02] do because there's so much you know [7:04] inference workload you want to do and [7:06] the inference workload includes both [7:08] like offline inference, uh sort of RL [7:12] rollouts during RL training, uh and then [7:15] also online inference for handling user [7:17] requests or agent-based behavior. [7:20] Because of that shift and the different [7:23] characteristics of those two kinds of [7:24] computations, it makes a ton more sense [7:27] to now specialize much more for [7:30] inference workloads in hardware for [7:32] example. Um because the characteristics [7:34] are quite different. You need lower [7:36] precision. 
You [7:37] >> you know are handling a very large [7:39] volume of requests on this particular [7:41] model. The model weights don't [7:43] necessarily change uh at inference time. [7:46] Um all these things lead to very [7:48] different solutions for hardware and [7:49] much more energy efficiency can be [7:51] gained by specializing and so I think [7:53] you'll see a lot more in this area uh [7:56] you know now and in the future. We've [7:58] already done this with our TPU uh 8i and [8:01] 8T chips that we announced a couple um [8:03] maybe a month ago. [8:04] >> Um but you'll see even more [8:05] specialization I think. [8:07] >> And that's pretty crazy that you said [8:09] that even FP4 kind of works. And when [8:12] I first heard it I was like, it cannot [8:14] possibly work. It can't do anything useful. [8:16] And it does. [8:17] >> Yeah. If you told that to a computer [8:19] scientist from 15 years ago, they'd be [8:21] like, [8:22] >> that's not enough numbers. [8:24] >> Yeah. Yeah. Exactly. [8:25] >> And I look every now and then at [8:27] these papers and you have these [8:28] different transforms that are [8:30] the distance preserving transforms, [8:33] rotations between the points and all [8:35] kinds of compression. But still FP4, [8:37] that's unbelievable. It's not many bits [8:39] for exponent or mantissa or sign [8:41] >> and it's high quality, [8:43] you know, intelligence that comes out of [8:45] it. So, it's just [8:46] >> it's a good sign that it works. [8:48] >> Yeah. Yeah. But I don't know if we [8:50] can get lower. Uh what do you think, [8:52] like even lower? [8:53] >> Possible. 
I mean I think um you know [8:55] people are seeing and experimenting with [8:57] things where you have some even lower [8:59] precision and then it every so many [9:02] weights of that you know lower precision [9:04] you have a scaling factor and that seems [9:06] like you get a little bit of a higher [9:07] precision thing that's kind of shared [9:09] across all the other lower lower bit [9:12] precision u formats whatever they might [9:14] be two bit integer one bit integer you [9:17] know I haven't heard anyone say two bit [9:19] float because I'm not sure what that [9:20] would mean [9:21] >> but um yeah I [9:23] that plus a scaling factor seems to be [9:25] able to get you pretty far. [9:27] >> And the question is like how often do [9:29] you need the scaling factor? Is it every [9:30] 64 or 128 or 256 weights? [9:34] >> Pre and post training are typically [9:36] separate steps today. Do you see that [9:39] split holding or do you expect the two [9:41] to merge as as capabilities increase? [9:43] >> Yeah, I mean I feel like it's a little [9:45] intellectually dissatisfying that they [9:47] are these distinct phases and you do one [9:49] and then you do the other. 
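The "low precision plus a shared scaling factor every so many weights" idea from the quantization exchange above can be sketched in a few lines. This is a minimal illustration under stated assumptions: symmetric integer quantization, one scale per block, and a scale stored in 16 bits. The function names and the choice of Gaussian test weights are hypothetical, and real hardware formats differ in their details.

```python
import random
from typing import List, Tuple

def quantize_blocks(weights: List[float], block_size: int = 64,
                    bits: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize weights to signed integers, with one shared scale per block."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4 bits, 1 for 2 bits
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        largest = max(abs(w) for w in block)
        scale = largest / qmax if largest > 0 else 1.0
        scales.append(scale)
        for w in block:
            q = round(w / scale)
            quantized.append(max(-qmax, min(qmax, q)))
    return quantized, scales

def dequantize_blocks(quantized: List[int], scales: List[float],
                      block_size: int = 64) -> List[float]:
    """Recover approximate weights: integer code times its block's scale."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]

def bits_per_weight(bits: int, block_size: int, scale_bits: int = 16) -> float:
    """Average storage cost once the shared scale is amortized over the block."""
    return bits + scale_bits / block_size

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
codes, scales = quantize_blocks(weights, block_size=64, bits=4)
restored = dequantize_blocks(codes, scales, block_size=64)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"worst absolute error: {worst:.6f}")
print(f"bits per weight: {bits_per_weight(4, 64)}")
```

The trade-off raised in the conversation shows up in `bits_per_weight`: a scale every 64 weights costs more storage than one every 256, but each block's scale fits its own weights more tightly.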
it like [9:52] conceptually the right uh thing to do is [9:56] to have interleaved periods where you're [9:59] sort of observing data and then periods [10:02] where you're trying to use that new [10:05] knowledge you've gotten from the data [10:06] you [10:06] >> like with DQN, this experience [10:09] replay kind of thing [10:10] >> yeah and then you want to now take [10:12] actions in some environment, maybe it's a [10:14] simulated environment, maybe it's the [10:16] world with a robot or whatever it is, and [10:19] then you know learn from those actions [10:21] because I think you get a lot more [10:23] benefit from actually um taking actions [10:28] and observing the consequences or trying [10:29] to write code and seeing does the code [10:32] work [10:32] >> than you do from just passively sitting [10:35] there and seeing tokens streamed by you, [10:37] which is really what most of [10:38] pre-training is these days. It's really [10:40] interesting that you say that, in an [10:42] interleaved manner, because when I [10:44] hear merging the two, what's in my [10:47] mind is continuous, like continuous [10:49] learning [10:50] >> but at the same time people have to test [10:51] models, you cannot just chuck it out [10:53] there. You know, you finish training, you [10:55] finish the post and then maybe the [10:57] red teaming steps and you know [10:59] safety and everything and then you [11:01] package it up and you say okay this is [11:02] good to go. But if there's continuous [11:04] learning then there's new challenges [11:06] because how do you know that this [11:09] >> intermediate state is actually safe? [11:11] Maybe some more research there too. [11:13] >> Yeah, I mean I think uh first like a [11:16] bunch of discrete steps where maybe you [11:18] do this a 100 times or a thousand times [11:20] starts to look more like an integral [11:21] than a summation. 
[11:23] >> Um and so um [11:26] >> I do think interleaving in that way will make [11:28] sense [11:30] >> but you're right [11:31] >> like you have a bunch of things you need [11:33] to do for a live model that is serving [11:36] user requests. You need to make sure [11:37] that it's safe. Um so it may be that the [11:40] continual learning happens and then [11:42] there's some uh application of uh you [11:45] know safety protocols and red teaming as [11:47] you say uh and then you release a new [11:49] version of that but then that model [11:51] still continues to learn kind of behind [11:53] the scenes and then before the newest [11:56] version of it is provided to users [11:58] you redo the sort of final safety [12:00] testing and red teaming. Jensen likes to [12:03] say that compute capabilities advanced 1 [12:06] millionx over the last 10 years. So if [12:10] in the next 10 years, assuming we get [12:12] another 1 millionx, what would we be [12:14] able to do that we cannot do now? [12:17] >> Yeah. I mean it's like imagining the [12:19] future is always a hard thing because [12:21] this field is moving quickly. [12:23] >> I mean I think if you think back, you [12:25] know, 10, it was 10 years. [12:26] >> 10 years. [12:27] >> 10 years. If you think back 10 years, [12:29] you know, we were kind of just starting [12:31] to have language models that were the [12:33] sequence to sequence paper had appeared. [12:36] You know, it was just before the [12:37] transformer. [12:38] >> LSTMs, maybe [12:38] >> LSTMs were popular. [12:41] >> Um, and now those models sort of look uh [12:46] >> not nearly as ancient and not nearly as [12:48] capable as the models we have today. 
So, [12:50] I think if you project forward that [12:51] level of advancement, you're going to [12:53] see [12:54] >> huge investments in both like new kinds [12:56] of hardware [12:57] um you know new kinds of research [12:59] techniques uh there's just a lot more [13:01] attention being paid to the field. So [13:03] I see that progress rate not slowing [13:06] down um over the next 10 years. And so [13:09] that's going to be incredible, like the [13:10] multi-agent workflows we're now able to [13:13] start to [13:14] >> kind of get to work on very complicated [13:16] tasks like you saw in the I/O keynote [13:19] >> being able to write an operating system [13:22] >> autonomously with a relatively simple [13:24] prompt. [13:25] >> Crazy. >> Uh you know obviously there's a [13:26] lot of operating-system-like things in [13:29] the training data so it's not completely [13:30] out of distribution but you know the [13:33] fact that it's able to build an OS that [13:34] can run Doom uh successfully is [13:37] pretty amazing [13:38] >> I couldn't believe it. I mean [13:40] last year I heard a talk from Stephen [13:42] Balaban, the Lambda CEO, [13:44] >> and he had this neural OS, like hey you [13:47] know it does more and more, like [13:49] forget the UI, forget maybe [13:52] the drivers, I don't know, but just [13:54] let's have a neural OS and I was [13:55] like, "Yeah, that sounds like an [13:57] amazing science fiction idea. I would [13:59] love to see it, but I don't know. I [14:01] mean, it sounds far off." A year later [14:03] and we got you, you know, not exactly [14:06] like that. 
I know but but if if you look [14:09] at the derivatives over time [14:11] >> I mean I would say one thing I'm [14:12] particularly excited about is [14:15] you know can we with these tools [14:18] accomplish so much more in you know [14:20] science Demis was mentioning in the [14:21] keynote or in you know complicated [14:24] engineering tasks that often would take [14:26] you know lots and lots of people [14:28] multiple years to accomplish. Could you [14:30] actually have a system that with the [14:33] correct access to the right kinds of [14:34] simulation environments and a learning [14:37] set of agents that are trying to [14:38] accomplish the task and break it down [14:39] into smaller tasks, [14:41] >> could you design an airplane in, you [14:43] know, five days instead of, you know, [14:46] many many years? That would be amazing. [14:48] >> 1 millionx and we we can we can try [14:50] again. [14:51] >> Yeah. I mean, we're not there yet, but [14:52] that would be a pretty pretty amazing [14:54] capability. Or designing new new [14:56] computer chips or computer systems, new [14:58] hardware. Um, you know, I'm pretty [15:00] excited about that. [15:01] >> Yeah, incredible times. Are open models [15:04] standing on the shoulders of giants? And [15:07] by that I mean if if Frontier models [15:09] suddenly stopped being released, would [15:11] open models improve as quickly as they [15:13] do now or is their progress mostly [15:16] driven by distillation? [15:17] >> Yeah, I mean I think certainly a bunch [15:19] of the progress is driven by [15:20] distillation. For example, our own Gemma [15:22] models are definitely distilled from [15:24] higher quality larger scale models. Um [15:27] and I think a lot of other open source [15:28] models are getting benefit from [15:30] distillation data. 
Uh distillation has [15:32] always been a you know amazing way to [15:34] get really capable models into a smaller [15:38] footprint thing and you know uh that's [15:40] how our flash models are quite capable [15:42] for their size relative to the pro [15:44] models is we're able to use the pro [15:46] model to [15:47] >> to teach the the flash models. So I mean [15:50] I think really the the question is [15:54] uh not so much one of closed versus [15:57] open. It's you know if we want small [16:01] incredibly capable models we have to [16:03] keep building larger scale models that [16:06] are maybe less inference efficient but [16:08] are more capable and then use [16:10] distillation [16:11] >> to uh you know transfer the knowledge [16:14] into into the smaller models whether [16:15] they are open or closed. Now I'm I'm [16:17] wondering you might be the only one who [16:19] can answer that. So I I really want to [16:21] ask this. Everyone has their their [16:23] flagship models and yes the distilled [16:25] models like pretty much every company [16:26] does this tiered level thing. the [16:29] quicker faster models are always were [16:31] well below the the frontier models and [16:33] at some point I think 3.1 where there [16:36] was one version where where the the [16:40] quick one was suddenly so so close to [16:43] the frontier one there was like a 3% [16:45] difference [16:46] >> in in in tough benchmarks and and I just [16:50] heard someone saying I don't even know [16:51] who that was that that yeah it's not [16:53] like just distillation there is some [16:55] magic sauce in there that's been in the [16:57] works for years. So, can I hear a bit [16:59] about that? [17:00] >> Sure. Well, not too much. 
I mean, there [17:02] is always some magic sauce that we don't [17:04] reveal, but distillation is definitely [17:06] one of the key things that makes those, [17:08] you know, much smaller models, much [17:10] cheaper, much faster, much more [17:12] affordable um models be, you know, [17:15] nearly as good as those frontier models. [17:17] And then we push ahead and build an even [17:20] better frontier model. And then we have [17:22] to then do the process again where we [17:24] now transfer the knowledge in the [17:27] really capable frontier model back [17:29] into a lighter weight one. And I think [17:31] um you know this is really [17:34] important because the flash models are [17:37] really the workhorse of what people [17:39] generally want to use because they're [17:41] you know they're almost as capable. We [17:43] saw it. Yeah. [17:43] >> Yeah. And uh [17:45] >> and they're quite good. [17:47] >> Yeah. It's unbelievable how close they [17:48] can get. Like this didn't used to [17:50] be like that at all. All right. What [17:53] trends in machine learning are you most [17:54] excited about right now? You have a [17:56] separate talk about like exciting trends [17:58] in machine learning or something like [18:00] that. [18:00] >> Yeah. I mean [18:01] >> what's the newer version of that? [18:02] >> Yeah, the newer version I guess I mean [18:04] there's a few different trends that I [18:06] think are really exciting. The one is um [18:09] uh [18:11] so first I think continual learning is [18:14] still a little bit nascent but I think [18:16] looking at ways to make models that are [18:19] more interleaved in their use of [18:22] sort of seeing data passively and taking [18:24] action and learning from that seems like [18:26] a really important thing. Uh you know [18:28] agents and multi-agent use of uh these [18:31] systems is really really important. 
Um, [18:35] as one trend of that though, I think as [18:37] you see, uh, you know, we're going to [18:39] need a lot more inference hardware and [18:41] capability for that because those [18:44] systems that are working autonomously in [18:46] the background actually consume lots of [18:47] tokens in order to sort of [18:49] >> do the the kind of important work [18:51] they've been asked to do. Um, you know, [18:54] I think, uh, being able to build really [18:57] efficient inference hardware will enable [18:59] a lot of of things. So looking at you [19:02] know co-design of model architectures [19:05] and hardware architectures to make sort [19:08] of the best use of um things and have [19:11] really good properties in terms of very [19:12] low latency you know much higher [19:15] performance per watt performance per [19:16] dollar are things we we really care [19:18] about. [19:19] um you know I think looking at how do [19:22] you you know the context window of these [19:25] models is an important characteristic [19:27] but [19:28] uh I think there's a lot we could do if [19:31] we come up with mechanisms that are sort [19:33] of cascaded series of things that kind [19:36] of give you the illusion that you have [19:38] all information in the context window [19:40] >> like you'd like to have the whole [19:42] internet at your model's fingertips [19:45] >> or on a personal level if you've opted [19:47] in you know all of your email and your [19:49] photos and your the videos you've [19:51] watched and things like that. Um, but [19:54] you can't really do it with the sort of [19:56] quadratic attention mechanism. 
But I [19:57] think you can build a series of kind of [20:00] retrieval and lighter weight mechanisms [20:03] and then ways of cascading from you know [20:07] here are the 30,000 documents out of 10 [20:09] billion that seem most relevant and then [20:11] you know have a lighter weight model [20:14] that looks at those and decides these [20:16] 117 things seem really relevant to what [20:19] you're trying to do and puts those in [20:21] the sort of more expensive context [20:23] window of a bigger model perhaps. Uh [20:26] that's going to be kind of exciting. And [20:27] how do you orchestrate and interleave all [20:29] that stuff so it gives you the illusion [20:32] uh without you having to even think [20:33] about it? [20:34] >> Interesting. So it's very advanced games [20:36] to be played with the context window [20:37] because obviously it's very expensive. So with the [20:39] attention mechanism you get big O of n [20:41] squared. [20:42] >> Uh are we still there or do we have [20:45] some I mean I've heard some n log n [20:47] things. Can we go lower? There's like a [20:49] whole series. [20:49] >> Obviously we can go lower but the [20:51] question is what the trade-offs are, [20:52] right? Like what do you have to pay for [20:54] that? Yep. Um where are we in that? [20:56] >> Yeah, I mean I think there's actually [20:58] quite a large body of work there, [21:00] probably, you know, a hundred papers on [21:02] more efficient context uh algorithms [21:05] than the N squared one. [21:07] >> I mean the N squared one works really [21:09] well. 
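The cascade described a moment earlier (a cheap pass over everything, a more careful pass over the shortlist, and only the final handful placed in the expensive context window) might look roughly like this in miniature. Everything here is a hypothetical stand-in: word overlap plays the role of the first-stage retriever, a Jaccard score plays the role of the "lighter weight model", and the five-document corpus replaces the billions of documents mentioned in the conversation.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]

def word_overlap(query: str, doc: str) -> float:
    """Stage 1 stand-in: very cheap score, count of shared words."""
    return float(len(set(query.split()) & set(doc.split())))

def jaccard(query: str, doc: str) -> float:
    """Stage 2 stand-in for a lighter weight model: overlap relative to total vocabulary."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0

def cascade(query: str, corpus: List[str], cheap: Scorer, careful: Scorer,
            k_coarse: int, k_fine: int) -> List[str]:
    """Narrow the corpus in two steps; only the final list reaches the big model."""
    coarse = sorted(corpus, key=lambda d: cheap(query, d), reverse=True)[:k_coarse]
    return sorted(coarse, key=lambda d: careful(query, d), reverse=True)[:k_fine]

corpus = [
    "tpu inference hardware uses low precision arithmetic",
    "recipe for sourdough bread",
    "attention cost grows quadratically with context length",
    "long context attention with retrieval and cheaper attention variants",
    "hardware notes",
]
query = "cheaper attention for long context"
context_docs = cascade(query, corpus, word_overlap, jaccard, k_coarse=3, k_fine=2)
for doc in context_docs:
    print(doc)
```

The point of the design is that the expensive scorer only ever sees `k_coarse` documents, so its cost does not grow with the size of the corpus.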
uh so it has a pretty high bar but [21:12] I do think there is traction in finding [21:14] things that are you know much lower cost [21:17] whether it's you know reducing [21:18] algorithmic factors or very large [21:21] constant factors on the base n squared [21:23] algorithm. I think all of these are [21:25] pretty exciting. You can actually combine [21:26] many of these approaches [21:29] >> um and get uh you know much cheaper [21:31] attention over many more tokens [21:34] >> yeah I think that's one of the most [21:35] important things because if it was [21:37] cheaper in some sense and [21:39] you could still find the needles in [21:41] the haystack over very long [21:44] contexts, then you could have [21:45] some sort of lifetime AI thing. [21:48] >> Yeah, totally. Like I'd like my whole [21:49] life of all the digital things I've seen [21:52] uh in there. Uh as a say internal Google [21:56] developer, I'd love for the entire [21:57] Google codebase to be in there, which is [21:59] you know probably 10 billion lines of [22:01] code, probably you know big, you know, [22:04] 100 billion tokens. [22:04] >> I just want my wine list. [22:06] >> I just want 100 billion. All I want is a [22:08] 100 billion tokens of attention. It's [22:10] all I need. [22:11] >> Amazing. I think we got to do this one. [22:13] So, Google's data centers run an [22:15] enormous number of machines. And at that [22:17] scale, anything that can go wrong will [22:19] go wrong. Like I hear that wires wear [22:22] down, [22:23] >> hard drives fall apart, motherboards [22:25] overheat. Um, is that something that [22:27] actually happens day by day? And do you [22:29] have any good stories? [22:30] >> Absolutely. 
I mean, I don't have that [22:33] many personal stories, but there used to [22:34] be a chat group internally called Data [22:36] Centers on Fire that would have like [22:38] exciting uh exciting events happening [22:40] and sometimes exciting videos. Um yeah, [22:43] I mean I think [22:45] >> at scale lots of things that are very [22:47] very unexpected happen and usually those [22:49] are the combination of one thing fails [22:52] and something else fails simultaneously [22:54] or in cascade of during the yeah you [22:56] have a cascaded failure of some sort. [22:58] You know, sometimes that means some [23:00] software system stops working. Sometimes [23:02] it means like the bus bar overheats [23:05] and you get too much power to the [23:07] rack and like it catches on fire. I mean [23:10] that's a much rarer thing. But um you [23:12] know you have to be prepared for this [23:13] and I think one of the things even from [23:15] the very earliest days of Google is we [23:18] have really focused on how do you build [23:21] reliable systems out of unreliable [23:23] parts. Yes. [23:24] >> Right. Like in the earliest Google days, [23:25] we were buying consumer machines without [23:28] uh ECC memory. Not only not [23:31] ECC, not even parity. [23:33] >> We were buying consumer motherboards [23:35] that didn't have like redundant power [23:37] supplies and you can do that if you can [23:41] handle things at a higher level and [23:42] that's generally what we try to do in [23:44] all cases is [23:45] >> I actually wanted to ask you about that, [23:47] the ECC thing, because here's one of [23:49] my favorite failure modes, if that's [23:52] true, but you tell me. A distant [23:54] supernova goes off, a cosmic ray hits a [23:57] memory cell and a zero flips to a one. [24:00] Does that really happen? [24:01] >> Oh yeah. Yeah, absolutely. I mean, alpha [24:04] particles definitely can flip uh you [24:06] know DRAM state. 
We've actually observed [24:08] this because we have monitoring data of [24:10] how many ECC uh errors and like single [24:14] bit errors that are corrected and [24:15] two-bit errors that are not corrected [24:17] are happening in all of our machines. [24:19] And you can actually see this where some [24:21] clusters that are pointing in a [24:23] particular direction on the earth have a [24:25] much higher rate for a you know a brief [24:27] period like 10-minute period or [24:28] something and then the other ones on the [24:29] other side of the earth do not have [24:30] that. So it's definitely something that [24:32] happens. [24:33] >> How worried should I be? Because MacBook [24:35] Pros don't have ECC memory as far as I [24:37] know. Like for one machine is it so [24:39] vanishingly you know unlikely that you [24:42] shouldn't care but for data center or [24:44] >> I mean for one machine it's generally [24:45] not too bad. I mean I think they have [24:47] parity so at least they detect it typically [24:49] if it's a single bit error [24:50] >> so detection but not fixing [24:52] >> right but ECC usually gives you single [24:54] bit error correction and dual-bit [24:57] error detection. Yeah. [24:59] >> So with that you don't have to worry [25:00] about it too much [25:02] >> um at a single machine level but even at [25:05] you know tens of thousands of machines [25:07] you'd have to start thinking about that. [25:08] So you know one of the things we did [25:10] when we were using machines without even [25:12] parity is we built an entire [25:15] software-based checksumming system for [25:17] large amounts of our data. So [25:18] >> doing it by hand [25:19] >> doing it by hand essentially and like we [25:21] would you know for crawling web pages [25:24] and putting them in the index [25:25] >> you know if you detect that this [25:28] particular record is corrupted it's [25:29] usually generally okay to just you know [25:32] ignore that record. 
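The software-based checksumming idea just described (store a checksum next to each record, and skip any record whose data no longer matches) can be illustrated with a short sketch. This is an assumption-laden toy, not a description of Google's system: it uses CRC-32 from Python's standard `zlib` module, and the helper names and the three fake "crawled pages" are invented for the example.

```python
import zlib
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[bytes, int]  # (payload, checksum stored at write time)

def seal(payload: bytes) -> Record:
    """Compute the checksum once, when the record is first written."""
    return payload, zlib.crc32(payload)

def valid_records(records: Iterable[Record]) -> Iterator[bytes]:
    """Yield only payloads whose checksum still matches; silently skip the rest."""
    for payload, checksum in records:
        if zlib.crc32(payload) == checksum:
            yield payload

def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a memory error: a single bit changing from 0 to 1 or 1 to 0."""
    data = bytearray(payload)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)

pages: List[bytes] = [
    b"<html>page one</html>",
    b"<html>page two</html>",
    b"<html>page three</html>",
]
stored = [seal(p) for p in pages]

# Corrupt the second record after its checksum was computed.
payload, checksum = stored[1]
stored[1] = (flip_bit(payload, 3), checksum)

indexed = list(valid_records(stored))
print(len(indexed), "of", len(stored), "records passed the check")
```

Dropping the bad record is acceptable in this setting for the reason given in the conversation: one missing web page out of a large crawl matters far less than indexing corrupted data.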
[25:34] >> Now I have something interesting for [25:35] you. I call it lightning round. So, [25:37] please try to answer in one sentence. [25:39] One word is okay. One sentence. [25:41] >> Can I make run-on sentences? [25:44] >> We'll see. We'll see. So, I read that [25:46] Jeff Dean's PIN code is the last four [25:49] digits of pi. I give this one an eight [25:52] out of 10. So, my question is, do you [25:54] enjoy these Chuck Norris style jokes [25:57] about you? [25:58] >> It could be true. Um uh I do enjoy [26:02] them. I mean, it's an April Fool's joke [26:04] gone awry by my colleagues in 2009, but [26:07] it's both very flattering and kind of [26:08] embarrassing. [26:10] >> I think he felt the same way [26:12] about them, too. But he enjoyed [26:14] them, too. Legend. All right. One big [26:16] thing that you were wrong about and came [26:18] around. [26:20] I think AI is going to influence health [26:23] care quite dramatically, but I think it [26:27] is harder not necessarily for technical [26:30] reasons, but for you know, how do you [26:32] actually get things in regulated [26:35] industries that are super important and [26:36] have all kinds of privacy constraints [26:38] and safety concerns, [26:39] >> but I think ultimately that will happen. [26:42] It's just taking longer than I [26:44] hoped. Yes. Because I think there's [26:46] tremendous world benefit to do it. Um, [26:49] but we need to do it carefully and [26:50] safely. [26:50] >> Vim or Emacs or something else? Hint, [26:54] there's only one good answer. [26:55] >> Emacs. Was that it? Oh, no. [27:00] >> Look, I'm a Vim person, but [27:02] I'm not [27:04] >> Maybe I'm an embarrassment of a Vim [27:06] person because I looked at Emacs, [27:08] too, and I was like, that's pretty cool, [27:10] too, but I don't want to learn both. [27:11] It's just so much time. So, [27:13] >> yeah, it's true. 
One can spend a lot of [27:14] time customizing Emacs. >> The .vimrc I wrote [27:18] up, and then it never ends. [27:20] Yeah. One problem that you solved, tried [27:22] to solve many times but have never been [27:24] able to crack. [27:29] >> I mean I think in some sense we still [27:31] don't have an answer to how do you do [27:33] continual learning appropriately? That's [27:35] something I've thought about a little. [27:36] I've dabbled a little bit with some [27:38] techniques along with colleagues. [27:40] >> But I think uh you know if we're able to [27:43] crack that it's going to be amazing. Um, [27:45] but it's not there yet. [27:46] >> Last one. Favorite Two-Minute Papers [27:49] episode. [27:51] >> Oh, [27:53] yeah. I mean, I assume the [27:55] Transformer one was a good one. [27:56] >> All right. All right. Well, [27:58] that's a good one. Okay, Jeff, I [28:00] learned a lot today. Thank you so much. [28:02] This chatting with you again. [28:03] >> Thank you so much. [28:04] >> Thank you. [28:04] >> Here you see me running the full [28:06] DeepSeek AI model through Lambda GPU [28:09] cloud. 671 [28:12] billion parameters running super fast [28:15] and super reliably. This is insane. I [28:18] love it and I use it on a regular basis. [28:22] Lambda provides you with powerful NVIDIA [28:24] GPUs to run your own chatbots and [28:27] experiments. Seriously, try it out now [28:30] at lambda.ai/papers [28:33] or click the link in the description.