[0:05] Hi, welcome to another video. So, [0:07] DeepSeek has launched DeepSeek V4 Flash [0:10] and V4 Pro. But they are not the only [0:13] ones who have released a new model. [0:15] OpenAI has also launched GPT 5.5, which [0:18] aims to be better in all aspects and [0:20] also fix the gripes with the old model, [0:22] which are mostly just front-end issues, [0:24] but there might be some more, which we'll [0:26] talk about later as well. Apart from [0:28] this, I have also been working on [0:29] Kingbench 2.0, which is a refresh of my [0:32] old benchmark. The aim is to test a [0:34] model on all aspects of coding. You [0:36] might see some general questions as well [0:38] in the benchmark, but those are trivial. [0:40] Some questions in here require the model [0:42] to be used with an agentic contraption, [0:44] while the others don't. Basically, as I [0:46] said before, the benchmark is not supposed [0:48] to test only [0:50] agentic behavior. It also [0:52] tests other aspects, which makes it a [0:54] better benchmark all around. Currently, [0:56] I don't have a UI to show all of this. [0:57] I'm still working on it. So, you guessed [0:59] it: we'll use an Excel sheet. By the [1:01] way, how many of you have been following [1:03] this benchmark since the Excel sheet [1:04] days? Comment below. Anyway, I'll be [1:06] testing three models here, which are GPT [1:08] 5.5, DeepSeek V4, and Opus 4.7. These [1:12] models give you a good look at [1:13] where each of these companies [1:15] currently stands. Before [1:17] jumping into the bench, I do want to [1:18] talk a bit about GPT 5.5 and DeepSeek. [1:21] DeepSeek has launched two models, [1:23] which are DeepSeek V4 Pro and DeepSeek [1:25] V4 Flash. The Pro version is a 1.6 [1:28] trillion parameter mixture-of-experts model [1:30] with only 49 billion [1:33] parameters activated in a pass.
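To put that Pro figure in perspective, here is a minimal sketch of the arithmetic. The function name `active_fraction` is just an illustrative helper, not something from DeepSeek, and the parameter counts are the ones quoted in the video.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a mixture-of-experts model's parameters used in one pass."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


# Figures as quoted in the video for DeepSeek V4 Pro.
pro_total = 1.6e12   # 1.6 trillion total parameters
pro_active = 49e9    # 49 billion parameters activated per pass

pro_share = active_fraction(pro_total, pro_active)
print(f"V4 Pro activates about {pro_share:.1%} of its weights per pass")
```

So only around 3% of the weights do work on any given pass, which is the whole point of a mixture-of-experts design.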
The V4 [1:36] Flash, however, is a relatively smaller [1:38] model at 284 billion parameters with [1:40] only 13B activated in a pass. I don't [1:43] want to talk a lot about the model's [1:44] architecture as it can be boring, but as [1:47] an ML enthusiast, I can't help but [1:49] notice that it uses the Muon optimizer. I [1:51] believe Kimi also uses it. Moonshot [1:53] even has a paper on this where they talk [1:55] about how it is actually scalable, [1:57] contrary to some of the prior beliefs [1:58] that it is just good for small LLMs or [2:01] something. I believe V4 Pro might be one of [2:02] the biggest models using Muon, if I am [2:04] not wrong. GPT 5.5, on the other hand, is [2:07] one of the most anticipated models. It [2:10] apparently beats Opus 4.7 across the [2:12] board, and even on front-end tasks now. I [2:15] mean, if the card-design issue with GPT models [2:17] is fixed, then I can easily recommend it [2:19] to anyone. It seems that GPT 5.5 also [2:22] now works better with lower token [2:23] consumption, which looks very [2:25] interesting. You'd know that I have [2:27] already tested the alleged DeepSeek V4 [2:29] when it kind of became available on [2:30] their chat platform, but that is not [2:32] something I can fully trust. So, a new [2:34] test is kind of mandatory. All of these [2:36] models are available everywhere [2:37] now, including OpenRouter, Kilo Gateway, [2:40] etc. I mostly use my models with Kilo [2:42] CLI, and they are available there as well, [2:44] which is quite cool to see. So let's [2:46] look at the results directly. To start, I [2:48] have a question where I ask the models [2:49] to build me an elevator simulator. I [2:52] want a simulator kind of thing [2:53] where there are floors, and on each floor [2:55] we can spawn a person, and the elevator [2:57] should take the simulated person to [2:59] their floor and then keep doing that [3:01] until no one is left.
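Just to make the core loop of that prompt concrete, here is a rough sketch of the logic only, with no UI. It assumes a single elevator that carries one passenger per trip and serves people first come, first served; `Person` and `run_elevator` are hypothetical names, and this is not what any of the models generated.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Person:
    origin: int       # floor where the person is spawned
    destination: int  # floor the person wants to reach


def run_elevator(people, start_floor=0):
    """Serve people one at a time, first come, first served.

    Returns the ordered list of floors the elevator stops at and the
    total number of floors travelled.
    """
    queue = deque(people)
    floor = start_floor
    stops = []
    travelled = 0
    while queue:  # keep going until no one is left
        person = queue.popleft()
        # Go to the pickup floor first, then to the drop-off floor.
        for target in (person.origin, person.destination):
            travelled += abs(target - floor)
            floor = target
            stops.append(floor)
    return stops, travelled


waiting = [Person(origin=3, destination=0), Person(origin=1, destination=4)]
stops, travelled = run_elevator(waiting)
print(stops, travelled)  # [3, 0, 1, 4] 10
```

The hard part for the models is everything around this loop: animating the car, positioning it on the right floor, and keeping the UI in sync with the state.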
Each elevator is [3:03] only allowed to take one person at a [3:05] time. So this one allows me to test how [3:07] good a model is at front-end and complex [3:10] back-end logic at the same time. To start, if I [3:12] show you the generation from DeepSeek V4 [3:14] Pro, then it is not good. The elevator [3:17] positioning is not correct at all, and it [3:19] all seems very random. So yeah, this is [3:21] not great. Then we've got the generation [3:23] from GPT 5.5, and this is also not good. [3:27] I mean, it kind of works, but it just [3:29] flickers too much. And if you see here, [3:31] then it looks really bad. In terms of [3:32] UI, it still does the same bad card [3:35] designs and stuff. So, yeah, not great. [3:37] Next up, we've got the generation from [3:39] Opus 4.7. In here, you can see that this [3:41] one actually looks awesome, works really [3:43] well, and it is just really good. This [3:45] is what I imagine when I ask a human to [3:47] do it. So, yeah, Opus 4.7 kind of nails [3:50] it. Next up, I ask them to make me a Three.js [3:52] contact lens case that also opens up [3:54] when clicked. This is also quite a good test [3:56] because it is something that is not very [3:58] common in the training data of the models, [3:59] and 3D is also tricky for models. So if [4:02] we look at the generation from DeepSeek, [4:03] then it is not great either. I mean, it [4:05] is fine, but it looks like a brick with [4:07] two holes. So this ain't good at all. [4:09] Then we have the one from GPT 5.5, and [4:12] this one is also not very good. I mean, [4:13] it is fine, but it isn't the best. [4:15] When you click it, it does open, but it [4:17] opens the cap on the left for some odd [4:19] reason. So that isn't the best for sure. [4:22] The one from Opus 4.7 does indeed look [4:24] like one of the best, but the L and R labels are [4:26] kind of flipped, and the cap opens on the [4:28] bottom for some reason. So there's that. [4:31] Not the best from anyone.
Next up, I [4:33] have a question where I ask them to build [4:34] me a Three.js folding table. It should have [4:37] a slider that allows the user to [4:39] fold or unfold it accordingly. If we [4:41] look at the generation from DeepSeek, [4:43] then it is pretty good. I mean, it's not [4:45] the best, but it is good nonetheless. I [4:47] can't complain much. It does indeed [4:49] work. Next, we've got the one from [4:51] GPT 5.5, and it is not good. It [4:54] does look good when unfolded, but it is [4:56] not good when folded. It looks out of [4:59] place. Both partitions overlap each [5:00] other, and it ain't good at all. Next up, [5:04] we've got the one from Opus 4.7, and it [5:06] is kind of fine, but not very good [5:08] either. So, yeah, there's that. After [5:11] that, we have a comeback question where [5:12] I ask the models to make me an SVG of a [5:14] panda eating a burger. Well, all the [5:16] models are wonky in this. [5:19] The panda with a burger from DeepSeek is [5:21] as good as a rock, so yeah, this is not [5:24] good. Similarly, the one from GPT 5.5 is [5:27] also very weird. However, the one from [5:29] Claude is actually kind of good. It's [5:30] not the best, but good nonetheless. The [5:32] next question is to make me a bow and [5:34] arrow simulator game. And well, this one [5:37] is really interesting. So, the one from [5:39] DeepSeek basically doesn't even work. [5:41] It is quite buggy. However, the one from [5:44] GPT 5.5 is actually quite good. You can [5:47] aim and shoot and everything. It still [5:49] uses that dirty card thing, but keeping [5:51] that aside, it is quite fine. Now, the [5:54] next one is from Claude, and this one is [5:55] just too good. I mean, it looks really [5:57] professional and good-looking. [5:59] There's nothing more [6:01] that I'd want from it, and it is one of [6:02] the best for sure.
Then there's a new [6:04] mathematics question, and none of them [6:06] pass it either. I also ask them to [6:08] fine-tune a Gemma 4 model for me with a [6:10] generated data set for Pandaax, and [6:12] well, none of them are able to do this [6:14] yet. So, this is it. I think Opus is [6:16] still just a better overall model for [6:18] most people, as it is just good overall, [6:20] while GPT 5.5 is good at some things, [6:23] but it isn't that great overall. [6:24] DeepSeek is not good either. DeepSeek is [6:27] just a model: not good, not [6:28] the best, and not bad. But this is [6:31] not what I would have expected when [6:32] someone said DeepSeek V4. I also want to [6:34] talk a bit about the pricing. [6:37] DeepSeek V4 Pro is a 1 million token context [6:39] window model that comes at about $1.74 [6:42] and $3.78 for input and output per [6:45] million tokens respectively. It is [6:47] actually extremely cheap. DeepSeek V4 [6:50] Flash is also a million token context [6:53] window model that costs about 14 cents and [6:55] 28 cents per million tokens for [6:57] input and output [6:59] respectively. [7:10] With GPT 5.5, I mean, yes, it consumes [7:13] fewer tokens, but the token costs are [7:15] now high. So, the end user will end up [7:17] paying more in the long term, especially [7:19] the API users. All those Codex rate [7:22] limits will also probably go into the [7:23] trash, similar to Claude Code, and there [7:25] is no denying that. GPT 5.5 needs to be [7:28] a 16 trillion parameter model to justify this [7:30] price based on how much DeepSeek costs. [7:32] So, yeah, that is a bummer. I do get [7:34] that research and training cost more, [7:36] but I don't like it nonetheless.
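To see what per-million-token pricing means for an actual bill, here is a minimal sketch. It assumes the V4 Flash prices are roughly 14 and 28 cents per million input and output tokens (my reading of the figures quoted above, so check the official pricing page), and the workload numbers are made up for illustration.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars, with both prices given per one million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed prices in dollars per million tokens; illustrative, not official.
V4_FLASH = {"input_price": 0.14, "output_price": 0.28}

# Hypothetical workload: 2M input tokens and 500k output tokens.
cost = request_cost(2_000_000, 500_000, **V4_FLASH)
print(f"${cost:.2f}")  # $0.42
```

The same function works for any model: a model that uses fewer tokens but charges more per token only comes out cheaper if the token savings outweigh the price difference.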
I think [7:38] Opus 4.7 is the best model, but it is [7:41] not the best experience anymore due to [7:42] all the limits and stuff in the Claude [7:44] Code plan. Codex might be better for [7:45] that, but that will also only last for a [7:47] bit. So, that is about it. [7:51] I think DeepSeek isn't a good model, and I [7:53] can't recommend it either. Overall, it's [7:55] pretty cool. Anyway, let me know your [7:57] thoughts in the comments. If you like [7:58] this video, consider donating through [8:00] the Super Thanks option or becoming a [8:02] member by clicking the join button. [8:04] Also, give this video a thumbs up and [8:06] subscribe to my channel. I'll see you in [8:08] the next one. Until then, bye.