TubeSum ← Transcribe a video

GPT-5.5 VS Deepseek V4 Pro VS Opus 4.7: I tested THEM on My KingBench 2.0 Questions! thumbnail

GPT-5.5 VS Deepseek V4 Pro VS Opus 4.7: I tested THEM on My KingBench 2.0 Questions!

0h 08m video Published Apr 24, 2026 Transcribed Jun 15, 2026 AICodeKing

AICodeKing

Intermediate 4 min read For: AI enthusiasts and developers interested in comparing latest LLMs on practical tasks.

AI Trust Score 85/100

✅ Highly Legit

"Title accurately describes the comparison, though KingBench 2.0 is a personal benchmark."

AI Summary

The video tests GPT-5.5, Deepseek V4 Pro, and Opus 4.7 on the creator's KingBench 2.0 benchmark, which evaluates coding, front-end, and 3D tasks. Opus 4.7 generally outperforms the others, while GPT-5.5 shows improvements but still has UI issues. Deepseek V4 Pro is cheap but underperforms.

Chapters

1 Introduction and Model Announcements 0:05 2 Model Architecture and Pricing 1:21 3 KingBench 2.0 Tests: Elevator, 3D, SVG, Game 2:48 4 Additional Tests and Final Verdict 6:04

[0:07]

New Models Released

Deepseek launched V4 Lite and V4 Pro; OpenAI launched GPT-5.5.

[0:29]

KingBench 2.0 Overview

A benchmark testing coding, front-end, and agentic capabilities, refreshed from the original.

[1:06]

Models Tested

GPT-5.5, Deepseek V4 (Pro and Flash), and Opus 4.7.

[1:21]

Deepseek V4 Architecture

V4 Pro: 1.6T parameters, MoE with 49B activated. V4 Flash: 284B parameters, 13B activated. Uses Muon optimizer.

[2:04]

GPT-5.5 Claims

Beats Opus 4.7 across the board, fixes front-end card issues, and uses fewer tokens.

[2:48]

Elevator Simulator Test

Opus 4.7 produced a great UI and functionality; Deepseek and GPT-5.5 had issues.

[3:52]

3D Contact Lens Case Test

None were perfect; Opus 4.7 was best but had flipped L&R and cap opening at bottom.

[4:33]

3D Folding Table Test

Deepseek was decent; GPT-5.5 had overlapping parts; Opus 4.7 was okay.

[5:11]

SVG Panda Eating Burger

All models produced wonky results; Opus 4.7 was relatively better.

[5:32]

Bow and Arrow Simulator

Deepseek buggy; GPT-5.5 good but with card UI; Opus 4.7 excellent.

[6:04]

Math and Fine-Tuning Tests

None passed the math question or could fine-tune Gemma 4.

[6:37]

Pricing Discussion

Deepseek V4: $1.74/$3.78 per million tokens (input/output). V4 Flash: $0.04/$0.28. GPT-5.5 is expensive.

Opus 4.7 remains the best overall model for most tasks, but its user experience is hampered by limits. Deepseek V4 is cheap but underperforms, and GPT-5.5 shows promise but still has UI flaws.

Mentioned in this Video

KingBench 2.0

tool

OpenRouter

tool

Kilo Gateway

tool

Kilo CLI

tool

Deepseek V4 Pro

model

Deepseek V4 Flash

model

GPT-5.5

model

Opus 4.7

model

Gemma 4

model

Study Flashcards (5)

What are the parameter counts for Deepseek V4 Pro and V4 Flash?

medium Click to reveal answer

V4 Pro: 1.6 trillion total, 49 billion activated. V4 Flash: 284 billion total, 13 billion activated.

1:28

Which model performed best on the elevator simulator test?

easy Click to reveal answer

Opus 4.7 produced a great UI and functionality.

3:39

What is the pricing for Deepseek V4 Pro per million tokens?

medium Click to reveal answer

$1.74 for input, $3.78 for output.

6:37

What optimizer does Deepseek V4 use?

hard Click to reveal answer

Muon optimizer.

1:49

Which model was best at the bow and arrow simulator?

easy Click to reveal answer

Opus 4.7.

5:54

💡 Key Takeaways

📊

Deepseek V4 Architecture Details

Provides specific parameter counts and optimizer choice, important for ML enthusiasts.

1:28

💡

Elevator Simulator Test Results

Demonstrates clear performance differences in front-end and backend tasks.

2:48

📊

Pricing Comparison

Highlights cost differences that affect API users' decisions.

6:37

💡

Opus 4.7 as Best Model

Summarizes the overall winner despite limitations.

7:38

Full Transcript

Download .txt Download .md

[00:05] Hi, welcome to another video. So,

[00:07] Deepseek has launched Deepseek V4 Lite

[00:10] and V4 Pro. But they are not the only

[00:13] ones who have released a new model.

[00:15] OpenAI has also launched GPT 5.5 which

[00:18] aims to be better in all aspects and

[00:20] also fix the gripes of the old model

[00:22] which are mostly just front-end issues

[00:24] but there might be some more which we'll

[00:26] talk about later as well. Apart from

[00:28] this, I have also been working on

[00:29] Kingbench 2.0, which is a refresh of my

[00:32] old benchmark. The aim was to test a

[00:34] model on all aspects of coding. You

[00:36] might see some general questions as well

[00:38] in the benchmark, but that is trivial.

[00:40] Some questions in here require the model

[00:42] to be used with an agentic contraption,

[00:44] while the others don't. Basically, as I

[00:46] said before, the benchmark is supposed

[00:48] to be a benchmark that doesn't just test

[00:50] agentic behavior or something. It also

[00:52] tests other aspects and makes it a

[00:54] better benchmark all around. Currently,

[00:56] I don't have a UI to show all of this.

[00:57] I'm still working on it. So, you guessed

[00:59] it. We'll use an Excel sheet. By the

[01:01] way, how many of you have been following

[01:03] this benchmark since the Excel sheet

[01:04] days? Comment below. Anyway, I'll be

[01:06] testing three models here, which are GPT

[01:08] 5.5, Deepseek V4, and Opus 4.7. These

[01:12] models give you a good look into the

[01:13] worldview of models in respect to the

[01:15] resolve of these companies. Before

[01:17] jumping into the bench, I do want to

[01:18] talk a bit about GPT 5.5 and Deepseek.

[01:21] Though Deepseek has launched two models

[01:23] which are Deepseek V4 Pro and Deepseek

[01:25] V4 Flash. The Pro version is a 1.6

[01:28] trillion parameter model that is a

[01:30] mixture of experts with only 49 billion

[01:33] parameters activated in a pass. The V4

[01:36] flash, however, is a relatively smaller

[01:38] model at 284 billion parameters with

[01:40] only 13B activated in a pass. I don't

[01:43] want to talk a lot about the model's

[01:44] architecture as it can be boring, but as

[01:47] an ML enthusiast, I can't help but

[01:49] notice that it uses a muon optimizer. I

[01:51] believe Kimmy also uses it. Moonshot

[01:53] even has a paper on this where they talk

[01:55] about how it is actually scalable.

[01:57] Contrary to some of the prior beliefs

[01:58] that it is just good for small LLMs or

[02:01] something, I believe it might be one of

[02:02] the biggest models using Muon if I am

[02:04] not wrong. GPT 5.5 on the other hand is

[02:07] one of the most anticipated models. It

[02:10] apparently beats Opus 4.7 across the

[02:12] board and even on front-end tasks now. I

[02:15] mean, if the card issue with GPT models

[02:17] is fixed, then I can easily recommend it

[02:19] to anyone. It seems that GPT 5.5 also

[02:22] now works better with lower token

[02:23] consumption, which looks very

[02:25] interesting. You'd know that I have

[02:27] indeed tested the alleged DeepSseek V4

[02:29] when it kind of became available on

[02:30] their chat platform, but that is not

[02:32] something I can fully trust. So, a new

[02:34] test is kind of mandatory. All of these

[02:36] models are indeed available everywhere

[02:37] now including open router, kilo gateway

[02:40] etc. I mostly use my models with kilo

[02:42] CLI and it is available there as well

[02:44] which is quite cool to see. So let's

[02:46] look at the results direct to start. I

[02:48] have a question where I ask the models

[02:49] to build me an elevator simulator. I

[02:52] want to make a simulator kind of thing

[02:53] where there are floors and on each floor

[02:55] we can spawn a person and the elevator

[02:57] should take the simulated person to

[02:59] their floor and then keep doing that

[03:01] until no one is left. Each elevator is

[03:03] only allowed to take one person at a

[03:05] time. So this one allows me to test how

[03:07] good a model is at front end and complex

[03:10] backend at the same time. To start, if I

[03:12] show you the generation from DeepSeek V4

[03:14] Pro, then it is not good. The elevator

[03:17] positioning is not correct at all and it

[03:19] all seems very random. So yeah, this is

[03:21] not great. Then we've got the generation

[03:23] from GPT 545 and this is also not good.

[03:27] I mean, it kind of works, but it just

[03:29] flickers too much. And if you see here,

[03:31] then it looks really bad. In terms of

[03:32] UI, it still does the same bad card

[03:35] designs and stuff. So, yeah, not great.

[03:37] Next up, we've got the generation from

[03:39] Opus 4.7. In here, you can see that this

[03:41] one actually looks awesome, works really

[03:43] well, and it is just really good. This

[03:45] is what I imagine when I ask a human to

[03:47] do it. So, yeah, Opus 4.7 kind of nails

[03:50] it. Next up, I ask it to make me a 3J

[03:52] contact lens case that also opens up

[03:54] when clicked. This is also quite good

[03:56] because it is something that is not very

[03:58] much in the training data of the models

[03:59] and 3D is also tricky for models. So if

[04:02] we look at the generation from DeepSeek,

[04:03] then it is not great either. I mean it

[04:05] is fine but it looks like a brick with

[04:07] two holes. So this ain't good at all.

[04:09] Then we have the one from GPT545 and

[04:12] this one is also not very good. I mean

[04:13] it is fine but it isn't the best.

[04:15] Clicking it, it does get opened but it

[04:17] opens the cap on the left for some odd

[04:19] reason. So that isn't the best for sure.

[04:22] The one from Opus 4.7 does indeed look

[04:24] like one of the best, but the L&R are

[04:26] kind of flipped and the cap opens on the

[04:28] bottom for some reason. So there's that.

[04:31] not the best from anyone. Next up, I

[04:33] have a question where I ask it to build

[04:34] me a 3J's folding table. It should have

[04:37] a slider that should allow the user to

[04:39] fold it or unfold it accordingly. If we

[04:41] look at the generation from Deepseek,

[04:43] then it is pretty good. I mean, it's not

[04:45] the best, but it is good nonetheless. I

[04:47] can't complain much. It does indeed

[04:49] work. Next, we've got the one from

[04:51] GPT545, and it is not good either. It

[04:54] does look good when unfolded, but it is

[04:56] not good when folded. It looks out of

[04:59] place. Both partitions overlap each

[05:00] other and it ain't good at all. Next up,

[05:04] we've got the one from Opus 4.7 and it

[05:06] is kind of fine, but not very good

[05:08] either. So, yeah, there's that. After

[05:11] that, we have a comeback question where

[05:12] I ask the models to make me an SVG of a

[05:14] panda eating a burger. Well, all the

[05:16] models are wonky in this.

[05:19] The panda with a burger from Deepseek is

[05:21] as good as a rock, so yeah, this is not

[05:24] good. Similarly, the one from GPT545 is

[05:27] also very weird. However, the one from

[05:29] Claude is actually kind of good. It's

[05:30] not the best, but good nonetheless. The

[05:32] next question is to make me a bow and

[05:34] arrow simulator game. And well, this one

[05:37] is really interesting. So, the one from

[05:39] Deep Seek basically doesn't even work.

[05:41] It is quite buggy. However, the one from

[05:44] GPT 5.5 is actually quite good. You can

[05:47] aim and shoot and everything. It still

[05:49] uses that dirty card thing, but keeping

[05:51] that aside, it is quite fine. Now, the

[05:54] next one is from Claude, and this one is

[05:55] just too good. I mean, it looks really

[05:57] professional, good-looking, and it is

[05:59] just good. I mean, there's nothing more

[06:01] that I'd want from it, and it is one of

[06:02] the best for sure. Then there's a new

[06:04] mathematics question, and none of them

[06:06] pass it either. I also ask it to

[06:08] fine-tune a Gemma 4 model for me with a

[06:10] generated data set for Pandaax, and

[06:12] well, none of them are able to do this

[06:14] yet. So, this is it. I think Opus is

[06:16] still just a better overall model for

[06:18] most people, as it is just good overall,

[06:20] while GPT 5.5 is good at some things,

[06:23] but it isn't that great overall.

[06:24] Deepseek is not good either. Deepseek is

[06:27] just a model. It's not good. It's not

[06:28] the best and it's not bad. But this is

[06:31] not what I would have thought when

[06:32] someone said Deepseek V4. I also want to

[06:34] talk a bit about the pricing as well.

[06:37] Deepseek V4 is a 1 million context

[06:39] window model that comes at about 1.74

[06:42] and $378 for input and output per

[06:45] million tokens respectively. It is

[06:47] actually extremely cheap. Deepseek v4

[06:50] flash is also a million token context

[06:53] window model that costs about4 cents and

[06:55] your 28 cent per million tokens for

[06:57] input and output per million tokens

[06:59] respectively. It is actually extremely

[07:01] cheap. Deepseek v4 flash is also a

[07:04] million token context window model that

[07:06] costs about4 cents and euro 28.4 million

[07:09] tokens for input and output

[07:10] respectively. I mean, yes, it consumes

[07:13] fewer tokens, but those token costs are

[07:15] now high. So, the end user will end up

[07:17] paying more in the long term, especially

[07:19] the API users. All that codeex rate

[07:22] limiting will also probably go into the

[07:23] trash, similar to cloud code, and there

[07:25] is no denying that. GPT 5.5 needs to be

[07:28] a 16 trillion model to justify this

[07:30] price based on how much Deep Seek costs.

[07:32] So, yeah, that is a bummer. I do get

[07:34] that research and training costs more,

[07:36] but I don't like it nonetheless. I think

[07:38] Opus 4.7 is the best model, but it is

[07:41] not the best experience anymore due to

[07:42] all the limits and stuff in the clawed

[07:44] code plan. Codeex might be better for

[07:45] that, but it will also only last for a

[07:47] bit. So, that is about it.

[07:51] I think DeepS isn't a good model and I

[07:53] can't recommend it either. Overall, it's

[07:55] pretty cool. Anyway, let me know your

[07:57] thoughts in the comments. If you like

[07:58] this video, consider donating through

[08:00] the super thanks option or becoming a

[08:02] member by clicking the join button.

[08:04] Also, give this video a thumbs up and

[08:06] subscribe to my channel. I'll see you in

[08:08] the next one. Until then, bye.

AICodeKing

AICodeKing

View channel analytics →

Topics #ai models #llm comparison #coding benchmark #deepseek #gpt-5.5