TubeSum ← Transcribe a video

GPT-5.5 VS Deepseek V4 Pro VS Opus 4.7: I tested THEM on My KingBench 2.0 Questions!

Transcribed Jun 15, 2026 Watch on YouTube ↗
Intermediate 4 min read For: AI enthusiasts and developers interested in comparing latest LLMs on practical tasks.
21.5K
Views
365
Likes
54
Comments
54
Dislikes
2.0%
📊 Average

AI Summary

The video tests GPT-5.5, Deepseek V4 Pro, and Opus 4.7 on the creator's KingBench 2.0 benchmark, which evaluates coding, front-end, and 3D tasks. Opus 4.7 generally outperforms the others, while GPT-5.5 shows improvements but still has UI issues. Deepseek V4 Pro is cheap but underperforms.

[0:07]
New Models Released

Deepseek launched V4 Lite and V4 Pro; OpenAI launched GPT-5.5.

[0:29]
KingBench 2.0 Overview

A benchmark testing coding, front-end, and agentic capabilities, refreshed from the original.

[1:06]
Models Tested

GPT-5.5, Deepseek V4 (Pro and Flash), and Opus 4.7.

[1:21]
Deepseek V4 Architecture

V4 Pro: 1.6T parameters, MoE with 49B activated. V4 Flash: 284B parameters, 13B activated. Uses Muon optimizer.

[2:04]
GPT-5.5 Claims

Beats Opus 4.7 across the board, fixes front-end card issues, and uses fewer tokens.

[2:48]
Elevator Simulator Test

Opus 4.7 produced a great UI and functionality; Deepseek and GPT-5.5 had issues.

[3:52]
3D Contact Lens Case Test

None were perfect; Opus 4.7 was best but had flipped L&R and cap opening at bottom.

[4:33]
3D Folding Table Test

Deepseek was decent; GPT-5.5 had overlapping parts; Opus 4.7 was okay.

[5:11]
SVG Panda Eating Burger

All models produced wonky results; Opus 4.7 was relatively better.

[5:32]
Bow and Arrow Simulator

Deepseek buggy; GPT-5.5 good but with card UI; Opus 4.7 excellent.

[6:04]
Math and Fine-Tuning Tests

None passed the math question or could fine-tune Gemma 4.

[6:37]
Pricing Discussion

Deepseek V4: $1.74/$3.78 per million tokens (input/output). V4 Flash: $0.04/$0.28. GPT-5.5 is expensive.

Opus 4.7 remains the best overall model for most tasks, but its user experience is hampered by limits. Deepseek V4 is cheap but underperforms, and GPT-5.5 shows promise but still has UI flaws.

Clickbait Check

85% Legit

"Title accurately describes the comparison, though KingBench 2.0 is a personal benchmark."

Mentioned in this Video

Study Flashcards (5)

What are the parameter counts for Deepseek V4 Pro and V4 Flash?

medium Click to reveal answer

V4 Pro: 1.6 trillion total, 49 billion activated. V4 Flash: 284 billion total, 13 billion activated.

1:28

Which model performed best on the elevator simulator test?

easy Click to reveal answer

Opus 4.7 produced a great UI and functionality.

3:39

What is the pricing for Deepseek V4 Pro per million tokens?

medium Click to reveal answer

$1.74 for input, $3.78 for output.

6:37

What optimizer does Deepseek V4 use?

hard Click to reveal answer

Muon optimizer.

1:49

Which model was best at the bow and arrow simulator?

easy Click to reveal answer

Opus 4.7.

5:54

💡 Key Takeaways

📊

Deepseek V4 Architecture Details

Provides specific parameter counts and optimizer choice, important for ML enthusiasts.

1:28
💡

Elevator Simulator Test Results

Demonstrates clear performance differences in front-end and backend tasks.

2:48
📊

Pricing Comparison

Highlights cost differences that affect API users' decisions.

6:37
💡

Opus 4.7 as Best Model

Summarizes the overall winner despite limitations.

7:38

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

DeepSeek V4 vs GPT-5.5 vs Opus 4.7: First Look

45s

Direct comparison of three highly anticipated AI models creates immediate curiosity and engagement.

▶ Play Clip

Elevator Simulator Test: Opus 4.7 Crushes It

60s

Visual demo of a complex coding task where one model clearly outperforms others, sparking debate.

▶ Play Clip

3D Contact Lens Case: All Models Fail

60s

Unexpected failure of all models on a creative 3D task highlights current AI limitations.

▶ Play Clip

Bow & Arrow Game: Opus 4.7 Shines Again

60s

Impressive game simulation demo showcases model's coding and design capabilities.

▶ Play Clip

Pricing Shock: DeepSeek Cheap vs GPT Expensive

60s

Controversial pricing comparison reveals huge cost differences, fueling discussion on value.

▶ Play Clip

[00:05] Hi, welcome to another video. So,

[00:07] Deepseek has launched Deepseek V4 Lite

[00:10] and V4 Pro. But they are not the only

[00:13] ones who have released a new model.

[00:15] OpenAI has also launched GPT 5.5 which

[00:18] aims to be better in all aspects and

[00:20] also fix the gripes of the old model

[00:22] which are mostly just front-end issues

[00:24] but there might be some more which we'll

[00:26] talk about later as well. Apart from

[00:28] this, I have also been working on

[00:29] Kingbench 2.0, which is a refresh of my

[00:32] old benchmark. The aim was to test a

[00:34] model on all aspects of coding. You

[00:36] might see some general questions as well

[00:38] in the benchmark, but that is trivial.

[00:40] Some questions in here require the model

[00:42] to be used with an agentic contraption,

[00:44] while the others don't. Basically, as I

[00:46] said before, the benchmark is supposed

[00:48] to be a benchmark that doesn't just test

[00:50] agentic behavior or something. It also

[00:52] tests other aspects and makes it a

[00:54] better benchmark all around. Currently,

[00:56] I don't have a UI to show all of this.

[00:57] I'm still working on it. So, you guessed

[00:59] it. We'll use an Excel sheet. By the

[01:01] way, how many of you have been following

[01:03] this benchmark since the Excel sheet

[01:04] days? Comment below. Anyway, I'll be

[01:06] testing three models here, which are GPT

[01:08] 5.5, Deepseek V4, and Opus 4.7. These

[01:12] models give you a good look into the

[01:13] worldview of models in respect to the

[01:15] resolve of these companies. Before

[01:17] jumping into the bench, I do want to

[01:18] talk a bit about GPT 5.5 and Deepseek.

[01:21] Though Deepseek has launched two models

[01:23] which are Deepseek V4 Pro and Deepseek

[01:25] V4 Flash. The Pro version is a 1.6

[01:28] trillion parameter model that is a

[01:30] mixture of experts with only 49 billion

[01:33] parameters activated in a pass. The V4

[01:36] flash, however, is a relatively smaller

[01:38] model at 284 billion parameters with

[01:40] only 13B activated in a pass. I don't

[01:43] want to talk a lot about the model's

[01:44] architecture as it can be boring, but as

[01:47] an ML enthusiast, I can't help but

[01:49] notice that it uses a muon optimizer. I

[01:51] believe Kimmy also uses it. Moonshot

[01:53] even has a paper on this where they talk

[01:55] about how it is actually scalable.

[01:57] Contrary to some of the prior beliefs

[01:58] that it is just good for small LLMs or

[02:01] something, I believe it might be one of

[02:02] the biggest models using Muon if I am

[02:04] not wrong. GPT 5.5 on the other hand is

[02:07] one of the most anticipated models. It

[02:10] apparently beats Opus 4.7 across the

[02:12] board and even on front-end tasks now. I

[02:15] mean, if the card issue with GPT models

[02:17] is fixed, then I can easily recommend it

[02:19] to anyone. It seems that GPT 5.5 also

[02:22] now works better with lower token

[02:23] consumption, which looks very

[02:25] interesting. You'd know that I have

[02:27] indeed tested the alleged DeepSseek V4

[02:29] when it kind of became available on

[02:30] their chat platform, but that is not

[02:32] something I can fully trust. So, a new

[02:34] test is kind of mandatory. All of these

[02:36] models are indeed available everywhere

[02:37] now including open router, kilo gateway

[02:40] etc. I mostly use my models with kilo

[02:42] CLI and it is available there as well

[02:44] which is quite cool to see. So let's

[02:46] look at the results direct to start. I

[02:48] have a question where I ask the models

[02:49] to build me an elevator simulator. I

[02:52] want to make a simulator kind of thing

[02:53] where there are floors and on each floor

[02:55] we can spawn a person and the elevator

[02:57] should take the simulated person to

[02:59] their floor and then keep doing that

[03:01] until no one is left. Each elevator is

[03:03] only allowed to take one person at a

[03:05] time. So this one allows me to test how

[03:07] good a model is at front end and complex

[03:10] backend at the same time. To start, if I

[03:12] show you the generation from DeepSeek V4

[03:14] Pro, then it is not good. The elevator

[03:17] positioning is not correct at all and it

[03:19] all seems very random. So yeah, this is

[03:21] not great. Then we've got the generation

[03:23] from GPT 545 and this is also not good.

[03:27] I mean, it kind of works, but it just

[03:29] flickers too much. And if you see here,

[03:31] then it looks really bad. In terms of

[03:32] UI, it still does the same bad card

[03:35] designs and stuff. So, yeah, not great.

[03:37] Next up, we've got the generation from

[03:39] Opus 4.7. In here, you can see that this

[03:41] one actually looks awesome, works really

[03:43] well, and it is just really good. This

[03:45] is what I imagine when I ask a human to

[03:47] do it. So, yeah, Opus 4.7 kind of nails

[03:50] it. Next up, I ask it to make me a 3J

[03:52] contact lens case that also opens up

[03:54] when clicked. This is also quite good

[03:56] because it is something that is not very

[03:58] much in the training data of the models

[03:59] and 3D is also tricky for models. So if

[04:02] we look at the generation from DeepSeek,

[04:03] then it is not great either. I mean it

[04:05] is fine but it looks like a brick with

[04:07] two holes. So this ain't good at all.

[04:09] Then we have the one from GPT545 and

[04:12] this one is also not very good. I mean

[04:13] it is fine but it isn't the best.

[04:15] Clicking it, it does get opened but it

[04:17] opens the cap on the left for some odd

[04:19] reason. So that isn't the best for sure.

[04:22] The one from Opus 4.7 does indeed look

[04:24] like one of the best, but the L&R are

[04:26] kind of flipped and the cap opens on the

[04:28] bottom for some reason. So there's that.

[04:31] not the best from anyone. Next up, I

[04:33] have a question where I ask it to build

[04:34] me a 3J's folding table. It should have

[04:37] a slider that should allow the user to

[04:39] fold it or unfold it accordingly. If we

[04:41] look at the generation from Deepseek,

[04:43] then it is pretty good. I mean, it's not

[04:45] the best, but it is good nonetheless. I

[04:47] can't complain much. It does indeed

[04:49] work. Next, we've got the one from

[04:51] GPT545, and it is not good either. It

[04:54] does look good when unfolded, but it is

[04:56] not good when folded. It looks out of

[04:59] place. Both partitions overlap each

[05:00] other and it ain't good at all. Next up,

[05:04] we've got the one from Opus 4.7 and it

[05:06] is kind of fine, but not very good

[05:08] either. So, yeah, there's that. After

[05:11] that, we have a comeback question where

[05:12] I ask the models to make me an SVG of a

[05:14] panda eating a burger. Well, all the

[05:16] models are wonky in this.

[05:19] The panda with a burger from Deepseek is

[05:21] as good as a rock, so yeah, this is not

[05:24] good. Similarly, the one from GPT545 is

[05:27] also very weird. However, the one from

[05:29] Claude is actually kind of good. It's

[05:30] not the best, but good nonetheless. The

[05:32] next question is to make me a bow and

[05:34] arrow simulator game. And well, this one

[05:37] is really interesting. So, the one from

[05:39] Deep Seek basically doesn't even work.

[05:41] It is quite buggy. However, the one from

[05:44] GPT 5.5 is actually quite good. You can

[05:47] aim and shoot and everything. It still

[05:49] uses that dirty card thing, but keeping

[05:51] that aside, it is quite fine. Now, the

[05:54] next one is from Claude, and this one is

[05:55] just too good. I mean, it looks really

[05:57] professional, good-looking, and it is

[05:59] just good. I mean, there's nothing more

[06:01] that I'd want from it, and it is one of

[06:02] the best for sure. Then there's a new

[06:04] mathematics question, and none of them

[06:06] pass it either. I also ask it to

[06:08] fine-tune a Gemma 4 model for me with a

[06:10] generated data set for Pandaax, and

[06:12] well, none of them are able to do this

[06:14] yet. So, this is it. I think Opus is

[06:16] still just a better overall model for

[06:18] most people, as it is just good overall,

[06:20] while GPT 5.5 is good at some things,

[06:23] but it isn't that great overall.

[06:24] Deepseek is not good either. Deepseek is

[06:27] just a model. It's not good. It's not

[06:28] the best and it's not bad. But this is

[06:31] not what I would have thought when

[06:32] someone said Deepseek V4. I also want to

[06:34] talk a bit about the pricing as well.

[06:37] Deepseek V4 is a 1 million context

[06:39] window model that comes at about 1.74

[06:42] and $378 for input and output per

[06:45] million tokens respectively. It is

[06:47] actually extremely cheap. Deepseek v4

[06:50] flash is also a million token context

[06:53] window model that costs about4 cents and

[06:55] your 28 cent per million tokens for

[06:57] input and output per million tokens

[06:59] respectively. It is actually extremely

[07:01] cheap. Deepseek v4 flash is also a

[07:04] million token context window model that

[07:06] costs about4 cents and euro 28.4 million

[07:09] tokens for input and output

[07:10] respectively. I mean, yes, it consumes

[07:13] fewer tokens, but those token costs are

[07:15] now high. So, the end user will end up

[07:17] paying more in the long term, especially

[07:19] the API users. All that codeex rate

[07:22] limiting will also probably go into the

[07:23] trash, similar to cloud code, and there

[07:25] is no denying that. GPT 5.5 needs to be

[07:28] a 16 trillion model to justify this

[07:30] price based on how much Deep Seek costs.

[07:32] So, yeah, that is a bummer. I do get

[07:34] that research and training costs more,

[07:36] but I don't like it nonetheless. I think

[07:38] Opus 4.7 is the best model, but it is

[07:41] not the best experience anymore due to

[07:42] all the limits and stuff in the clawed

[07:44] code plan. Codeex might be better for

[07:45] that, but it will also only last for a

[07:47] bit. So, that is about it.

[07:51] I think DeepS isn't a good model and I

[07:53] can't recommend it either. Overall, it's

[07:55] pretty cool. Anyway, let me know your

[07:57] thoughts in the comments. If you like

[07:58] this video, consider donating through

[08:00] the super thanks option or becoming a

[08:02] member by clicking the join button.

[08:04] Also, give this video a thumbs up and

[08:06] subscribe to my channel. I'll see you in

[08:08] the next one. Until then, bye.

⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.