[0:05] Hi, welcome to another video. So,
[0:07] DeepSeek has launched DeepSeek V4 Flash
[0:10] and V4 Pro. But they are not the only
[0:13] ones who have released a new model.
[0:15] OpenAI has also launched GPT 5.5 which
[0:18] aims to be better in all aspects and
[0:20] also fix the gripes of the old model
[0:22] which are mostly just front-end issues
[0:24] but there might be some more which we'll
[0:26] talk about later as well. Apart from
[0:28] this, I have also been working on
[0:29] Kingbench 2.0, which is a refresh of my
[0:32] old benchmark. The aim was to test a
[0:34] model on all aspects of coding. You
[0:36] might see some general questions as well
[0:38] in the benchmark, but those are a minor part.
[0:40] Some questions in here require the model
[0:42] to be used with an agentic contraption,
[0:44] while the others don't. Basically, as I
[0:46] said before, the benchmark isn't meant
[0:48] to test only agentic behavior. It also
[0:50] tests other aspects of coding, which
[0:52] makes it a better benchmark all
[0:54] around. Currently,
[0:56] I don't have a UI to show all of this.
[0:57] I'm still working on it. So, you guessed
[0:59] it. We'll use an Excel sheet. By the
[1:01] way, how many of you have been following
[1:03] this benchmark since the Excel sheet
[1:04] days? Comment below. Anyway, I'll be
[1:06] testing three models here, which are GPT
[1:08] 5.5, Deepseek V4, and Opus 4.7. These
[1:12] models give you a good look at the
[1:13] state of models today and at the direction
[1:15] each of these companies is taking. Before
[1:17] jumping into the bench, I do want to
[1:18] talk a bit about GPT 5.5 and Deepseek.
[1:21] So, DeepSeek has launched two models,
[1:23] which are DeepSeek V4 Pro and DeepSeek
[1:25] V4 Flash. The Pro version is a 1.6
[1:28] trillion parameter mixture-of-experts
[1:30] model with only 49 billion
[1:33] parameters activated per pass. The V4
[1:36] Flash, however, is a relatively smaller
[1:38] model at 284 billion parameters with
[1:40] only 13 billion activated per pass. I don't
[1:43] want to talk a lot about the model's
[1:44] architecture as it can be boring, but as
[1:47] an ML enthusiast, I can't help but
[1:49] notice that it uses the Muon optimizer. I
[1:51] believe Kimi also uses it. Moonshot
[1:53] even has a paper on this where they talk
[1:55] about how it is actually scalable,
[1:57] contrary to some of the prior beliefs
[1:58] that it is just good for small LLMs or
[2:01] something. I believe it might be one of
[2:02] the biggest models using Muon, if I am
[2:04] not wrong. GPT 5.5, on the other hand, is
[2:07] one of the most anticipated models. It
[2:10] apparently beats Opus 4.7 across the
[2:12] board and even on front-end tasks now. I
[2:15] mean, if the card issue with GPT models
[2:17] is fixed, then I can easily recommend it
[2:19] to anyone. It seems that GPT 5.5 also
[2:22] now works better with lower token
[2:23] consumption, which looks very
[2:25] interesting. You'd know that I have
[2:27] indeed tested the alleged DeepSeek V4
[2:29] when it kind of became available on
[2:30] their chat platform, but that is not
[2:32] something I can fully trust. So, a new
[2:34] test is kind of mandatory. All of these
[2:36] models are indeed available everywhere
[2:37] now, including OpenRouter, Kilo Gateway,
[2:40] etc. I mostly use my models with Kilo
[2:42] CLI, and they are available there as well,
[2:44] which is quite cool to see. So let's
[2:46] look at the results directly. To start, I
[2:48] have a question where I ask the models
[2:49] to build me an elevator simulator. I
[2:52] want to make a simulator kind of thing
[2:53] where there are floors and on each floor
[2:55] we can spawn a person and the elevator
[2:57] should take the simulated person to
[2:59] their floor and then keep doing that
[3:01] until no one is left. Each elevator is
[3:03] only allowed to take one person at a
[3:05] time. So this one allows me to test how
[3:07] good a model is at front end and complex
[3:10] backend at the same time. To start, if I
[3:12] show you the generation from DeepSeek V4
[3:14] Pro, then it is not good. The elevator
[3:17] positioning is not correct at all and it
[3:19] all seems very random. So yeah, this is
[3:21] not great. Then we've got the generation
[3:23] from GPT 5.5 and this is also not good.
[3:27] I mean, it kind of works, but it just
[3:29] flickers too much. And if you see here,
[3:31] then it looks really bad. In terms of
[3:32] UI, it still does the same bad card
[3:35] designs and stuff. So, yeah, not great.
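To make this prompt concrete, here is a minimal sketch of the logic it asks for: people spawn on floors, each elevator carries one person at a time, and the loop runs until no one is left. This is only an illustration, not what any of the models produced. The names `Person`, `Elevator`, `step_toward`, and `simulate`, the one-floor-per-tick movement, and the "nearest waiting person first" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    origin: int        # floor where the person is spawned
    destination: int   # floor the person wants to reach


@dataclass
class Elevator:
    floor: int = 0
    passenger: Optional[Person] = None   # at most one rider at a time
    target: Optional[Person] = None      # waiting person claimed for pickup


def step_toward(current: int, goal: int) -> int:
    """Move one floor per tick toward the goal floor."""
    if current < goal:
        return current + 1
    if current > goal:
        return current - 1
    return current


def simulate(elevators: List[Elevator], waiting: List[Person],
             max_ticks: int = 10_000) -> int:
    """Run until nobody is waiting or riding; return the number of ticks used."""
    waiting = list(waiting)  # work on a copy so the caller's list is untouched
    ticks = 0
    while waiting or any(e.passenger or e.target for e in elevators):
        if ticks >= max_ticks:
            raise RuntimeError("simulation did not finish")
        for e in elevators:
            # An idle elevator claims the nearest unclaimed waiting person.
            if e.passenger is None and e.target is None and waiting:
                e.target = min(waiting, key=lambda p: abs(p.origin - e.floor))
                waiting.remove(e.target)
            if e.passenger is not None:
                if e.floor == e.passenger.destination:
                    e.passenger = None                      # drop off
                else:
                    e.floor = step_toward(e.floor, e.passenger.destination)
            elif e.target is not None:
                if e.floor == e.target.origin:
                    e.passenger, e.target = e.target, None  # pick up
                else:
                    e.floor = step_toward(e.floor, e.target.origin)
        ticks += 1
    return ticks


if __name__ == "__main__":
    cars = [Elevator(), Elevator(floor=5)]
    queue = [Person(2, 0), Person(4, 1), Person(1, 3)]
    print("ticks needed:", simulate(cars, queue))
```

A front end would then only need to draw each elevator's `floor` and `passenger` on every tick, which is the part where the generations above differ the most.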
[3:37] Next up, we've got the generation from
[3:39] Opus 4.7. In here, you can see that this
[3:41] one actually looks awesome, works really
[3:43] well, and it is just really good. This
[3:45] is what I imagine when I ask a human to
[3:47] do it. So, yeah, Opus 4.7 kind of nails
[3:50] it. Next up, I ask it to make me a Three.js
[3:52] contact lens case that also opens up
[3:54] when clicked. This is also quite good
[3:56] because it is something that is not very
[3:58] much in the training data of the models
[3:59] and 3D is also tricky for models. So if
[4:02] we look at the generation from DeepSeek,
[4:03] then it is not great either. I mean it
[4:05] is fine but it looks like a brick with
[4:07] two holes. So this ain't good at all.
[4:09] Then we have the one from GPT 5.5 and
[4:12] this one is also not very good. I mean
[4:13] it is fine but it isn't the best.
[4:15] Clicking it, it does get opened but it
[4:17] opens the cap on the left for some odd
[4:19] reason. So that isn't the best for sure.
[4:22] The one from Opus 4.7 does indeed look
[4:24] like one of the best, but the L and R are
[4:26] kind of flipped and the cap opens on the
[4:28] bottom for some reason. So there's that.
[4:31] Not the best from anyone. Next up, I
[4:33] have a question where I ask it to build
[4:34] me a Three.js folding table. It should have
[4:37] a slider that should allow the user to
[4:39] fold it or unfold it accordingly. If we
[4:41] look at the generation from Deepseek,
[4:43] then it is pretty good. I mean, it's not
[4:45] the best, but it is good nonetheless. I
[4:47] can't complain much. It does indeed
[4:49] work. Next, we've got the one from
[4:51] GPT 5.5, and it is not good. It
[4:54] does look good when unfolded, but it is
[4:56] not good when folded. It looks out of
[4:59] place. Both partitions overlap each
[5:00] other and it ain't good at all. Next up,
[5:04] we've got the one from Opus 4.7 and it
[5:06] is kind of fine, but not very good
[5:08] either. So, yeah, there's that. After
[5:11] that, we have a comeback question where
[5:12] I ask the models to make me an SVG of a
[5:14] panda eating a burger. Well, all the
[5:16] models are wonky in this.
[5:19] The panda with a burger from Deepseek is
[5:21] as good as a rock, so yeah, this is not
[5:24] good. Similarly, the one from GPT 5.5 is
[5:27] also very weird. However, the one from
[5:29] Claude is actually kind of good. It's
[5:30] not the best, but good nonetheless. The
[5:32] next question is to make me a bow and
[5:34] arrow simulator game. And well, this one
[5:37] is really interesting. So, the one from
[5:39] DeepSeek basically doesn't even work.
[5:41] It is quite buggy. However, the one from
[5:44] GPT 5.5 is actually quite good. You can
[5:47] aim and shoot and everything. It still
[5:49] uses that dirty card thing, but keeping
[5:51] that aside, it is quite fine. Now, the
[5:54] next one is from Claude, and this one is
[5:55] just too good. I mean, it looks really
[5:57] professional, good-looking, and it is
[5:59] just good. I mean, there's nothing more
[6:01] that I'd want from it, and it is one of
[6:02] the best for sure. Then there's a new
[6:04] mathematics question, and none of them
[6:06] pass it either. I also ask it to
[6:08] fine-tune a Gemma 4 model for me with a
[6:10] generated data set for Pandaax, and
[6:12] well, none of them are able to do this
[6:14] yet. So, this is it. I think Opus is
[6:16] still just a better overall model for
[6:18] most people, as it is just good overall,
[6:20] while GPT 5.5 is good at some things,
[6:23] but it isn't that great overall.
[6:24] Deepseek is not good either. Deepseek is
[6:27] just a model. It's not good. It's not
[6:28] the best and it's not bad. But this is
[6:31] not what I would have thought when
[6:32] someone said DeepSeek V4. I also want to
[6:34] talk a bit about the pricing.
[6:37] DeepSeek V4 Pro is a 1 million token context
[6:39] window model that comes at about $1.74
[6:42] and $3.78 per million tokens for input
[6:45] and output respectively. It is
[6:47] actually extremely cheap. DeepSeek V4
[6:50] Flash is also a million token context
[6:53] window model that costs about 14 cents
[6:55] and 28 cents per million tokens for
[6:57] input and output respectively.
[7:10] As for GPT 5.5, yes, it consumes
[7:13] fewer tokens, but its token costs are
[7:15] now high. So, the end user will end up
[7:17] paying more in the long term, especially
[7:19] the API users. All those Codex rate
[7:22] limits will also probably go into the
[7:23] trash, similar to Claude Code, and there
[7:25] is no denying that. GPT 5.5 needs to be
[7:28] a 16 trillion model to justify this
[7:30] price based on how much DeepSeek costs.
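The arithmetic behind that remark can be sketched roughly like this. The per-million-token prices are the DeepSeek figures as I read them from this video (the decimal placement and the 14 cent input price are my reading of the quoted numbers, not verified values), and the idea that price should scale linearly with parameter count is only the rough heuristic the "16 trillion" comment seems to rely on. The dictionary keys and function names are made up for the example.

```python
# USD per one million tokens, as read from the figures quoted in this video.
# These are assumptions for illustration, not verified list prices.
PRICES = {
    "deepseek-v4-pro": {"input": 1.74, "output": 3.78},
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, given its input and output token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


def implied_price_ratio(claimed_params: float, reference_params: float) -> float:
    """If price scaled linearly with parameter count (a heuristic, not a rule),
    this is the price multiple that a given size claim corresponds to."""
    return claimed_params / reference_params


if __name__ == "__main__":
    # Example workload: 200k input tokens and 20k output tokens.
    for name in PRICES:
        print(name, round(request_cost(name, 200_000, 20_000), 4))
    # "A 16 trillion model" measured against the 1.6 trillion parameter V4 Pro.
    print(implied_price_ratio(16e12, 1.6e12))
```

Read this way, the "16 trillion" line amounts to saying GPT 5.5 feels roughly ten times as expensive as DeepSeek V4 Pro. Fewer tokens per task only makes up for that if the token savings are of a similar size.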
[7:32] So, yeah, that is a bummer. I do get
[7:34] that research and training costs more,
[7:36] but I don't like it nonetheless. I think
[7:38] Opus 4.7 is the best model, but it is
[7:41] not the best experience anymore due to
[7:42] all the limits and stuff in the Claude
[7:44] Code plan. Codex might be better for
[7:45] that, but it will also only last for a
[7:47] bit. So, that is about it.
[7:51] I think DeepSeek isn't a good model and I
[7:53] can't recommend it either. Overall, it's
[7:55] pretty cool. Anyway, let me know your
[7:57] thoughts in the comments. If you like
[7:58] this video, consider donating through
[8:00] the super thanks option or becoming a
[8:02] member by clicking the join button.
[8:04] Also, give this video a thumbs up and
[8:06] subscribe to my channel. I'll see you in
[8:08] the next one. Until then, bye.