DeepSeek V4 vs GPT-5.5 vs Opus 4.7: First Look
45sDirect comparison of three highly anticipated AI models creates immediate curiosity and engagement.
▶ Play ClipThe video tests GPT-5.5, Deepseek V4 Pro, and Opus 4.7 on the creator's KingBench 2.0 benchmark, which evaluates coding, front-end, and 3D tasks. Opus 4.7 generally outperforms the others, while GPT-5.5 shows improvements but still has UI issues. Deepseek V4 Pro is cheap but underperforms.
Deepseek launched V4 Lite and V4 Pro; OpenAI launched GPT-5.5.
A benchmark testing coding, front-end, and agentic capabilities, refreshed from the original.
GPT-5.5, Deepseek V4 (Pro and Flash), and Opus 4.7.
V4 Pro: 1.6T parameters, MoE with 49B activated. V4 Flash: 284B parameters, 13B activated. Uses Muon optimizer.
Beats Opus 4.7 across the board, fixes front-end card issues, and uses fewer tokens.
Opus 4.7 produced a great UI and functionality; Deepseek and GPT-5.5 had issues.
None were perfect; Opus 4.7 was best but had flipped L&R and cap opening at bottom.
Deepseek was decent; GPT-5.5 had overlapping parts; Opus 4.7 was okay.
All models produced wonky results; Opus 4.7 was relatively better.
Deepseek buggy; GPT-5.5 good but with card UI; Opus 4.7 excellent.
None passed the math question or could fine-tune Gemma 4.
Deepseek V4: $1.74/$3.78 per million tokens (input/output). V4 Flash: $0.04/$0.28. GPT-5.5 is expensive.
Opus 4.7 remains the best overall model for most tasks, but its user experience is hampered by limits. Deepseek V4 is cheap but underperforms, and GPT-5.5 shows promise but still has UI flaws.
"Title accurately describes the comparison, though KingBench 2.0 is a personal benchmark."
What are the parameter counts for Deepseek V4 Pro and V4 Flash?
V4 Pro: 1.6 trillion total, 49 billion activated. V4 Flash: 284 billion total, 13 billion activated.
1:28
Which model performed best on the elevator simulator test?
Opus 4.7 produced a great UI and functionality.
3:39
What is the pricing for Deepseek V4 Pro per million tokens?
$1.74 for input, $3.78 for output.
6:37
What optimizer does Deepseek V4 use?
Muon optimizer.
1:49
Which model was best at the bow and arrow simulator?
Opus 4.7.
5:54
Deepseek V4 Architecture Details
Provides specific parameter counts and optimizer choice, important for ML enthusiasts.
1:28Elevator Simulator Test Results
Demonstrates clear performance differences in front-end and backend tasks.
2:48Pricing Comparison
Highlights cost differences that affect API users' decisions.
6:37Opus 4.7 as Best Model
Summarizes the overall winner despite limitations.
7:38[00:05] Hi, welcome to another video. So,
[00:07] Deepseek has launched Deepseek V4 Lite
[00:10] and V4 Pro. But they are not the only
[00:13] ones who have released a new model.
[00:15] OpenAI has also launched GPT 5.5 which
[00:18] aims to be better in all aspects and
[00:20] also fix the gripes of the old model
[00:22] which are mostly just front-end issues
[00:24] but there might be some more which we'll
[00:26] talk about later as well. Apart from
[00:28] this, I have also been working on
[00:29] Kingbench 2.0, which is a refresh of my
[00:32] old benchmark. The aim was to test a
[00:34] model on all aspects of coding. You
[00:36] might see some general questions as well
[00:38] in the benchmark, but that is trivial.
[00:40] Some questions in here require the model
[00:42] to be used with an agentic contraption,
[00:44] while the others don't. Basically, as I
[00:46] said before, the benchmark is supposed
[00:48] to be a benchmark that doesn't just test
[00:50] agentic behavior or something. It also
[00:52] tests other aspects and makes it a
[00:54] better benchmark all around. Currently,
[00:56] I don't have a UI to show all of this.
[00:57] I'm still working on it. So, you guessed
[00:59] it. We'll use an Excel sheet. By the
[01:01] way, how many of you have been following
[01:03] this benchmark since the Excel sheet
[01:04] days? Comment below. Anyway, I'll be
[01:06] testing three models here, which are GPT
[01:08] 5.5, Deepseek V4, and Opus 4.7. These
[01:12] models give you a good look into the
[01:13] worldview of models in respect to the
[01:15] resolve of these companies. Before
[01:17] jumping into the bench, I do want to
[01:18] talk a bit about GPT 5.5 and Deepseek.
[01:21] Though Deepseek has launched two models
[01:23] which are Deepseek V4 Pro and Deepseek
[01:25] V4 Flash. The Pro version is a 1.6
[01:28] trillion parameter model that is a
[01:30] mixture of experts with only 49 billion
[01:33] parameters activated in a pass. The V4
[01:36] flash, however, is a relatively smaller
[01:38] model at 284 billion parameters with
[01:40] only 13B activated in a pass. I don't
[01:43] want to talk a lot about the model's
[01:44] architecture as it can be boring, but as
[01:47] an ML enthusiast, I can't help but
[01:49] notice that it uses a muon optimizer. I
[01:51] believe Kimmy also uses it. Moonshot
[01:53] even has a paper on this where they talk
[01:55] about how it is actually scalable.
[01:57] Contrary to some of the prior beliefs
[01:58] that it is just good for small LLMs or
[02:01] something, I believe it might be one of
[02:02] the biggest models using Muon if I am
[02:04] not wrong. GPT 5.5 on the other hand is
[02:07] one of the most anticipated models. It
[02:10] apparently beats Opus 4.7 across the
[02:12] board and even on front-end tasks now. I
[02:15] mean, if the card issue with GPT models
[02:17] is fixed, then I can easily recommend it
[02:19] to anyone. It seems that GPT 5.5 also
[02:22] now works better with lower token
[02:23] consumption, which looks very
[02:25] interesting. You'd know that I have
[02:27] indeed tested the alleged DeepSseek V4
[02:29] when it kind of became available on
[02:30] their chat platform, but that is not
[02:32] something I can fully trust. So, a new
[02:34] test is kind of mandatory. All of these
[02:36] models are indeed available everywhere
[02:37] now including open router, kilo gateway
[02:40] etc. I mostly use my models with kilo
[02:42] CLI and it is available there as well
[02:44] which is quite cool to see. So let's
[02:46] look at the results direct to start. I
[02:48] have a question where I ask the models
[02:49] to build me an elevator simulator. I
[02:52] want to make a simulator kind of thing
[02:53] where there are floors and on each floor
[02:55] we can spawn a person and the elevator
[02:57] should take the simulated person to
[02:59] their floor and then keep doing that
[03:01] until no one is left. Each elevator is
[03:03] only allowed to take one person at a
[03:05] time. So this one allows me to test how
[03:07] good a model is at front end and complex
[03:10] backend at the same time. To start, if I
[03:12] show you the generation from DeepSeek V4
[03:14] Pro, then it is not good. The elevator
[03:17] positioning is not correct at all and it
[03:19] all seems very random. So yeah, this is
[03:21] not great. Then we've got the generation
[03:23] from GPT 545 and this is also not good.
[03:27] I mean, it kind of works, but it just
[03:29] flickers too much. And if you see here,
[03:31] then it looks really bad. In terms of
[03:32] UI, it still does the same bad card
[03:35] designs and stuff. So, yeah, not great.
[03:37] Next up, we've got the generation from
[03:39] Opus 4.7. In here, you can see that this
[03:41] one actually looks awesome, works really
[03:43] well, and it is just really good. This
[03:45] is what I imagine when I ask a human to
[03:47] do it. So, yeah, Opus 4.7 kind of nails
[03:50] it. Next up, I ask it to make me a 3J
[03:52] contact lens case that also opens up
[03:54] when clicked. This is also quite good
[03:56] because it is something that is not very
[03:58] much in the training data of the models
[03:59] and 3D is also tricky for models. So if
[04:02] we look at the generation from DeepSeek,
[04:03] then it is not great either. I mean it
[04:05] is fine but it looks like a brick with
[04:07] two holes. So this ain't good at all.
[04:09] Then we have the one from GPT545 and
[04:12] this one is also not very good. I mean
[04:13] it is fine but it isn't the best.
[04:15] Clicking it, it does get opened but it
[04:17] opens the cap on the left for some odd
[04:19] reason. So that isn't the best for sure.
[04:22] The one from Opus 4.7 does indeed look
[04:24] like one of the best, but the L&R are
[04:26] kind of flipped and the cap opens on the
[04:28] bottom for some reason. So there's that.
[04:31] not the best from anyone. Next up, I
[04:33] have a question where I ask it to build
[04:34] me a 3J's folding table. It should have
[04:37] a slider that should allow the user to
[04:39] fold it or unfold it accordingly. If we
[04:41] look at the generation from Deepseek,
[04:43] then it is pretty good. I mean, it's not
[04:45] the best, but it is good nonetheless. I
[04:47] can't complain much. It does indeed
[04:49] work. Next, we've got the one from
[04:51] GPT545, and it is not good either. It
[04:54] does look good when unfolded, but it is
[04:56] not good when folded. It looks out of
[04:59] place. Both partitions overlap each
[05:00] other and it ain't good at all. Next up,
[05:04] we've got the one from Opus 4.7 and it
[05:06] is kind of fine, but not very good
[05:08] either. So, yeah, there's that. After
[05:11] that, we have a comeback question where
[05:12] I ask the models to make me an SVG of a
[05:14] panda eating a burger. Well, all the
[05:16] models are wonky in this.
[05:19] The panda with a burger from Deepseek is
[05:21] as good as a rock, so yeah, this is not
[05:24] good. Similarly, the one from GPT545 is
[05:27] also very weird. However, the one from
[05:29] Claude is actually kind of good. It's
[05:30] not the best, but good nonetheless. The
[05:32] next question is to make me a bow and
[05:34] arrow simulator game. And well, this one
[05:37] is really interesting. So, the one from
[05:39] Deep Seek basically doesn't even work.
[05:41] It is quite buggy. However, the one from
[05:44] GPT 5.5 is actually quite good. You can
[05:47] aim and shoot and everything. It still
[05:49] uses that dirty card thing, but keeping
[05:51] that aside, it is quite fine. Now, the
[05:54] next one is from Claude, and this one is
[05:55] just too good. I mean, it looks really
[05:57] professional, good-looking, and it is
[05:59] just good. I mean, there's nothing more
[06:01] that I'd want from it, and it is one of
[06:02] the best for sure. Then there's a new
[06:04] mathematics question, and none of them
[06:06] pass it either. I also ask it to
[06:08] fine-tune a Gemma 4 model for me with a
[06:10] generated data set for Pandaax, and
[06:12] well, none of them are able to do this
[06:14] yet. So, this is it. I think Opus is
[06:16] still just a better overall model for
[06:18] most people, as it is just good overall,
[06:20] while GPT 5.5 is good at some things,
[06:23] but it isn't that great overall.
[06:24] Deepseek is not good either. Deepseek is
[06:27] just a model. It's not good. It's not
[06:28] the best and it's not bad. But this is
[06:31] not what I would have thought when
[06:32] someone said Deepseek V4. I also want to
[06:34] talk a bit about the pricing as well.
[06:37] Deepseek V4 is a 1 million context
[06:39] window model that comes at about 1.74
[06:42] and $378 for input and output per
[06:45] million tokens respectively. It is
[06:47] actually extremely cheap. Deepseek v4
[06:50] flash is also a million token context
[06:53] window model that costs about4 cents and
[06:55] your 28 cent per million tokens for
[06:57] input and output per million tokens
[06:59] respectively. It is actually extremely
[07:01] cheap. Deepseek v4 flash is also a
[07:04] million token context window model that
[07:06] costs about4 cents and euro 28.4 million
[07:09] tokens for input and output
[07:10] respectively. I mean, yes, it consumes
[07:13] fewer tokens, but those token costs are
[07:15] now high. So, the end user will end up
[07:17] paying more in the long term, especially
[07:19] the API users. All that codeex rate
[07:22] limiting will also probably go into the
[07:23] trash, similar to cloud code, and there
[07:25] is no denying that. GPT 5.5 needs to be
[07:28] a 16 trillion model to justify this
[07:30] price based on how much Deep Seek costs.
[07:32] So, yeah, that is a bummer. I do get
[07:34] that research and training costs more,
[07:36] but I don't like it nonetheless. I think
[07:38] Opus 4.7 is the best model, but it is
[07:41] not the best experience anymore due to
[07:42] all the limits and stuff in the clawed
[07:44] code plan. Codeex might be better for
[07:45] that, but it will also only last for a
[07:47] bit. So, that is about it.
[07:51] I think DeepS isn't a good model and I
[07:53] can't recommend it either. Overall, it's
[07:55] pretty cool. Anyway, let me know your
[07:57] thoughts in the comments. If you like
[07:58] this video, consider donating through
[08:00] the super thanks option or becoming a
[08:02] member by clicking the join button.
[08:04] Also, give this video a thumbs up and
[08:06] subscribe to my channel. I'll see you in
[08:08] the next one. Until then, bye.
⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.