---
title: 'GPT-5.5 VS Deepseek V4 Pro VS Opus 4.7: I tested THEM on My KingBench 2.0 Questions!'
source: 'https://youtube.com/watch?v=exS-Y6XGk6s'
video_id: 'exS-Y6XGk6s'
date: 2026-06-15
duration_sec: 0
---

# GPT-5.5 VS Deepseek V4 Pro VS Opus 4.7: I tested THEM on My KingBench 2.0 Questions!

> Source: [GPT-5.5 VS Deepseek V4 Pro VS Opus 4.7: I tested THEM on My KingBench 2.0 Questions!](https://youtube.com/watch?v=exS-Y6XGk6s)

## Summary

The video tests GPT-5.5, Deepseek V4 Pro, and Opus 4.7 on the creator's KingBench 2.0 benchmark, which evaluates coding, front-end, and 3D tasks. Opus 4.7 generally outperforms the others, while GPT-5.5 shows improvements but still has UI issues. Deepseek V4 Pro is cheap but underperforms.

### Key Points

- **New Models Released** [0:07] — Deepseek launched V4 Pro and V4 Flash (called "V4 Lite" in the video's opening); OpenAI launched GPT-5.5.
- **KingBench 2.0 Overview** [0:29] — A benchmark testing coding, front-end, and agentic capabilities, refreshed from the original.
- **Models Tested** [1:06] — GPT-5.5, Deepseek V4 (Pro and Flash), and Opus 4.7.
- **Deepseek V4 Architecture** [1:21] — V4 Pro: 1.6T parameters, MoE with 49B activated. V4 Flash: 284B parameters, 13B activated. Uses Muon optimizer.
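- **GPT-5.5 Claims** [2:04] — Reportedly beats Opus 4.7 across the board, including on front-end tasks, and uses fewer tokens; whether the front-end card issue is fixed is left to the tests.
- **Elevator Simulator Test** [2:48] — Opus 4.7 produced a great UI and functionality; Deepseek V4 Pro mispositioned the elevator, and GPT-5.5 worked but flickered and kept its card-style UI.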
- **3D Contact Lens Case Test** [3:52] — None were perfect; Opus 4.7 was best but had flipped L&R and cap opening at bottom.
- **3D Folding Table Test** [4:33] — Deepseek was decent; GPT-5.5 had overlapping parts; Opus 4.7 was okay.
- **SVG Panda Eating Burger** [5:11] — All models produced wonky results; Opus 4.7 was relatively better.
- **Bow and Arrow Simulator** [5:32] — Deepseek buggy; GPT-5.5 good but with card UI; Opus 4.7 excellent.
- **Math and Fine-Tuning Tests** [6:04] — None passed the math question or could fine-tune Gemma 4.
- **Pricing Discussion** [6:37] — Deepseek V4: $1.74/$3.78 per million tokens (input/output). V4 Flash: $0.04/$0.28. GPT-5.5 is expensive.
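
The architecture and pricing figures above lend themselves to quick arithmetic. The following is a minimal Python sketch, not anything shown in the video: the `ModelSpec` class, its method names, and the example workload of 200,000 input and 20,000 output tokens are all hypothetical, and the numbers are simply the ones quoted in the Key Points (treating the "Deepseek V4" price as the Pro price, which is an assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Figures as quoted in the Key Points; not independently verified."""
    name: str
    total_params_b: float    # total parameters, in billions
    active_params_b: float   # parameters activated per pass, in billions
    input_usd_per_m: float   # USD per million input tokens
    output_usd_per_m: float  # USD per million output tokens

    def active_fraction(self) -> float:
        # Share of the MoE model's parameters used in a single pass.
        return self.active_params_b / self.total_params_b

    def request_cost(self, input_tokens: int, output_tokens: int) -> float:
        # Prices are per million tokens, so scale the total down by 1e6.
        total = (input_tokens * self.input_usd_per_m
                 + output_tokens * self.output_usd_per_m)
        return total / 1_000_000


# Assumption: the "Deepseek V4" price in the summary refers to V4 Pro.
V4_PRO = ModelSpec("Deepseek V4 Pro", 1600, 49, 1.74, 3.78)
V4_FLASH = ModelSpec("Deepseek V4 Flash", 284, 13, 0.04, 0.28)

if __name__ == "__main__":
    for model in (V4_PRO, V4_FLASH):
        # Hypothetical workload: 200k input tokens, 20k output tokens.
        cost = model.request_cost(200_000, 20_000)
        print(f"{model.name}: {model.active_fraction():.1%} active, "
              f"${cost:.4f} per request")
```

With these inputs, roughly 3% of V4 Pro's parameters and about 4.6% of V4 Flash's are active per pass. Swapping in another model's published prices gives a like-for-like cost comparison for the same workload.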

### Conclusion

Opus 4.7 remains the best overall model for most tasks, but its user experience is hampered by limits. Deepseek V4 is cheap but underperforms, and GPT-5.5 shows promise but still has UI flaws.

## Transcript

Hi, welcome to another video. So,
Deepseek has launched Deepseek V4 Lite
and V4 Pro. But they are not the only
ones who have released a new model.
OpenAI has also launched GPT 5.5 which
aims to be better in all aspects and
also fix the gripes of the old model
which are mostly just front-end issues
but there might be some more which we'll
talk about later as well. Apart from
this, I have also been working on
KingBench 2.0, which is a refresh of my
old benchmark. The aim was to test a
model on all aspects of coding. You
might see some general questions as well
in the benchmark, but that is trivial.
Some questions in here require the model
to be used with an agentic contraption,
while the others don't. Basically, as I
said before, the benchmark is supposed
to be a benchmark that doesn't just test
agentic behavior or something. It also
tests other aspects and makes it a
better benchmark all around. Currently,
I don't have a UI to show all of this.
I'm still working on it. So, you guessed
it. We'll use an Excel sheet. By the
way, how many of you have been following
this benchmark since the Excel sheet
days? Comment below. Anyway, I'll be
testing three models here, which are GPT
5.5, Deepseek V4, and Opus 4.7. These
models give you a good look into the
worldview of models in respect to the
resolve of these companies. Before
jumping into the bench, I do want to
talk a bit about GPT 5.5 and Deepseek.
Though Deepseek has launched two models
which are Deepseek V4 Pro and Deepseek
V4 Flash. The Pro version is a 1.6
trillion parameter model that is a
mixture of experts with only 49 billion
parameters activated in a pass. The V4
Flash, however, is a relatively smaller
model at 284 billion parameters with
only 13B activated in a pass. I don't
want to talk a lot about the model's
architecture as it can be boring, but as
an ML enthusiast, I can't help but
notice that it uses the Muon optimizer. I
believe Kimi also uses it. Moonshot
even has a paper on this where they talk
about how it is actually scalable,
contrary to some of the prior beliefs
that it is just good for small LLMs or
something. I believe it might be one of
the biggest models using Muon if I am
not wrong. GPT 5.5 on the other hand is
one of the most anticipated models. It
apparently beats Opus 4.7 across the
board and even on front-end tasks now. I
mean, if the card issue with GPT models
is fixed, then I can easily recommend it
to anyone. It seems that GPT 5.5 also
now works better with lower token
consumption, which looks very
interesting. You'd know that I have
indeed tested the alleged Deepseek V4
when it kind of became available on
their chat platform, but that is not
something I can fully trust. So, a new
test is kind of mandatory. All of these
models are indeed available everywhere
now, including OpenRouter, Kilo Gateway,
etc. I mostly use my models with Kilo
CLI, and they are available there as well,
which is quite cool to see. So let's
look at the results directly. To start, I
have a question where I ask the models
to build me an elevator simulator. I
want to make a simulator kind of thing
where there are floors and on each floor
we can spawn a person and the elevator
should take the simulated person to
their floor and then keep doing that
until no one is left. Each elevator is
only allowed to take one person at a
time. So this one allows me to test how
good a model is at front end and complex
backend at the same time. To start, if I
show you the generation from Deepseek V4
Pro, then it is not good. The elevator
positioning is not correct at all and it
all seems very random. So yeah, this is
not great. Then we've got the generation
from GPT 5.5 and this is also not good.
I mean, it kind of works, but it just
flickers too much. And if you see here,
then it looks really bad. In terms of
UI, it still does the same bad card
designs and stuff. So, yeah, not great.
Next up, we've got the generation from
Opus 4.7. In here, you can see that this
one actually looks awesome, works really
well, and it is just really good. This
is what I imagine when I ask a human to
do it. So, yeah, Opus 4.7 kind of nails
it. Next up, I ask it to make me a Three.js
contact lens case that also opens up
when clicked. This is also quite good
because it is something that is not very
much in the training data of the models
and 3D is also tricky for models. So if
we look at the generation from Deepseek,
then it is not great either. I mean it
is fine but it looks like a brick with
two holes. So this ain't good at all.
Then we have the one from GPT 5.5 and
this one is also not very good. I mean
it is fine but it isn't the best.
Clicking it, it does get opened but it
opens the cap on the left for some odd
reason. So that isn't the best for sure.
The one from Opus 4.7 does indeed look
like one of the best, but the L&R are
kind of flipped and the cap opens on the
bottom for some reason. So there's that.
Not the best from anyone. Next up, I
have a question where I ask it to build
me a Three.js folding table. It should have
a slider that should allow the user to
fold it or unfold it accordingly. If we
look at the generation from Deepseek,
then it is pretty good. I mean, it's not
the best, but it is good nonetheless. I
can't complain much. It does indeed
work. Next, we've got the one from
GPT 5.5, and it is not good either. It
does look good when unfolded, but it is
not good when folded. It looks out of
place. Both partitions overlap each
other and it ain't good at all. Next up,
we've got the one from Opus 4.7 and it
is kind of fine, but not very good
either. So, yeah, there's that. After
that, we have a comeback question where
I ask the models to make me an SVG of a
panda eating a burger. Well, all the
models are wonky in this.
The panda with a burger from Deepseek is
as good as a rock, so yeah, this is not
good. Similarly, the one from GPT 5.5 is
also very weird. However, the one from
Claude is actually kind of good. It's
not the best, but good nonetheless. The
next question is to make me a bow and
arrow simulator game. And well, this one
is really interesting. So, the one from
Deepseek basically doesn't even work.
It is quite buggy. However, the one from
GPT 5.5 is actually quite good. You can
aim and shoot and everything. It still
uses that dirty card thing, but keeping
that aside, it is quite fine. Now, the
next one is from Claude, and this one is
just too good. I mean, it looks really
professional, good-looking, and it is
just good. I mean, there's nothing more
that I'd want from it, and it is one of
the best for sure. Then there's a new
mathematics question, and none of them
pass it either. I also ask it to
fine-tune a Gemma 4 model for me with a
generated data set for Pandaax, and
well, none of them are able to do this
yet. So, this is it. I think Opus is
still just a better overall model for
most people, as it is just good overall,
while GPT 5.5 is good at some things,
but it isn't that great overall.
Deepseek is not good either. Deepseek is
just a model. It's not good. It's not
the best and it's not bad. But this is
not what I would have thought when
someone said Deepseek V4. I also want to
talk a bit about the pricing as well.
Deepseek V4 is a 1 million context
window model that comes at about $1.74
and $3.78 for input and output per
million tokens respectively. It is
actually extremely cheap. Deepseek V4
Flash is also a million token context
window model that costs about 4 cents
and 28 cents per million tokens for
input and output respectively. I mean,
yes, GPT 5.5 consumes
fewer tokens, but those token costs are
now high. So, the end user will end up
paying more in the long term, especially
the API users. All that Codex rate
limiting will also probably go into the
trash, similar to Claude Code, and there
is no denying that. GPT 5.5 needs to be
a 16 trillion model to justify this
price based on how much Deepseek costs.
So, yeah, that is a bummer. I do get
that research and training costs more,
but I don't like it nonetheless. I think
Opus 4.7 is the best model, but it is
not the best experience anymore due to
all the limits and stuff in the Claude
Code plan. Codex might be better for
that, but it will also only last for a
bit. So, that is about it.
I think Deepseek isn't a good model and I
can't recommend it either. Overall, it's
pretty cool. Anyway, let me know your
thoughts in the comments. If you like
this video, consider donating through
the super thanks option or becoming a
member by clicking the join button.
Also, give this video a thumbs up and
subscribe to my channel. I'll see you in
the next one. Until then, bye.
