
I Tested DeepSeek V4 vs Opus 4.7 vs GPT 5.5

Transcribed Jun 15, 2026
Intermediate · 10 min read · For: Developers and tech enthusiasts interested in AI coding tools and model comparisons.
Views: 83.9K | Likes: 1.4K | Comments: 182 | Dislikes: 210 | 1.9% (📊 Average)

AI Summary

The video compares three AI models on coding tasks: GPT 5.5 and DeepSeek V4, both released within the same 24 hours, and Opus 4.7. The host evaluates cost, benchmarks, and real-world performance in building a flight simulator and a WebGPU landing page to help viewers decide which model suits them.

[00:58]
Cost Comparison

DeepSeek V4 is roughly 8x cheaper than its competitors on output tokens: $30 per million for GPT 5.5, $25 for Opus 4.7, and $3.48 for DeepSeek V4. Input tokens cost $5 per million for both GPT 5.5 and Opus 4.7, and about $1.70 for DeepSeek V4.
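As a quick sanity check on the "8x cheaper" figure, the quoted prices can be turned into ratios. This is a minimal sketch: the `PRICES` table only restates the per-million-token numbers quoted in the video (they are not verified against any vendor price list), and the function names `price_ratio` and `request_cost` are made up for illustration.

```python
# USD per 1M tokens, as quoted in the video (assumption: not checked
# against any official price list).
PRICES = {
    "GPT 5.5": {"input": 5.00, "output": 30.00},
    "Opus 4.7": {"input": 5.00, "output": 25.00},
    "DeepSeek V4": {"input": 1.70, "output": 3.48},
}


def price_ratio(model, baseline="DeepSeek V4", kind="output"):
    """How many times more expensive `model` is than `baseline`."""
    return PRICES[model][kind] / PRICES[baseline][kind]


def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at the quoted list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


if __name__ == "__main__":
    for name in ("GPT 5.5", "Opus 4.7"):
        out = price_ratio(name, kind="output")
        inp = price_ratio(name, kind="input")
        print(f"{name}: {out:.1f}x on output, {inp:.1f}x on input")
```

With these numbers the output-price ratio comes out near 8.6x against GPT 5.5 and about 7.2x against Opus 4.7, while the input-price ratio is closer to 3x, so "8x" is best read as an output-token figure.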

[02:40]
Benchmark Results

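On SWE-bench Verified and SWE-bench Pro, Opus 4.7 wins. On Terminal Bench 2.0, GPT 5.5 leads with 87.2, higher than the score Anthropic reported for Mythos. DeepSeek V4 places third on all three, but trails Opus 4.7 by less than two points on Terminal Bench 2.0 and sits within about five points of the model ahead of it on SWE-bench Verified, while being about 8x cheaper.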

[03:49]
Long Context Performance

Opus 4.7 performs poorly on long context (500K-1M tokens), worse than DeepSeek and GPT 5.5. GPT 5.5 and DeepSeek handle long context better.

[05:10]
Flight Simulator Test

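GPT 5.5 (Codex) produced the best flight simulator after 3 iterations (66K tokens, ~15 min). Opus 4.7 (Claude Code) was second: it was slower and costlier (about 200K tokens in total, with ~20 min for planning plus the first pass alone) and was still hard to control after three passes. DeepSeek V4 (Open Code) was unusable after two passes.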
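The token totals above do not translate directly into dollars, because the video reports totals without an input/output split. One hedged way to compare the runs is to bracket each one: price every token at the cheaper rate for a floor and at the dearer rate for a ceiling. The prices and token totals below are the ones quoted in the video; `cost_bounds` is a hypothetical helper, and real bills also depend on factors such as caching that are not modelled here.

```python
# USD per 1M tokens and total tokens per run, as quoted in the video.
PRICES = {
    "GPT 5.5": {"input": 5.00, "output": 30.00},
    "Opus 4.7": {"input": 5.00, "output": 25.00},
}
FLIGHT_SIM_TOKENS = {"GPT 5.5": 66_000, "Opus 4.7": 200_000}


def cost_bounds(model, total_tokens):
    """Return (floor, ceiling) in USD for a run with an unknown input/output mix."""
    rates = PRICES[model].values()
    low = total_tokens * min(rates) / 1_000_000   # every token billed as input
    high = total_tokens * max(rates) / 1_000_000  # every token billed as output
    return low, high


if __name__ == "__main__":
    for name, tokens in FLIGHT_SIM_TOKENS.items():
        low, high = cost_bounds(name, tokens)
        print(f"{name}: {tokens:,} tokens, ${low:.2f} to ${high:.2f}")
    ratio = FLIGHT_SIM_TOKENS["GPT 5.5"] / FLIGHT_SIM_TOKENS["Opus 4.7"]
    print(f"GPT 5.5 used {ratio:.0%} of the tokens Opus 4.7 used")
```

Under these assumptions the GPT 5.5 run lands between $0.33 and $1.98 and the Opus 4.7 run between $1.00 and $5.00. The token ratio is about a third (66K versus 200K), slightly more than the "quarter" the host mentions.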

[17:07]
WebGPU Landing Page Test

Opus 4.7 created a subtle, tasteful design (175K tokens). GPT 5.5 was flashier but overly bright (107K tokens). DeepSeek V4 produced a seizure-inducing, low-quality result (130K tokens).

[23:12]
Final Verdict

GPT 5.5 wins for complex tasks (flight sim). Opus 4.7 wins for design taste (WebGPU page). DeepSeek V4 is only suitable for simple, budget-conscious tasks.

GPT 5.5 and Opus 4.7 are both strong choices for agentic coding, with personal preference playing a big role. DeepSeek V4 is a budget option for simpler tasks but lags significantly in quality.

Clickbait Check

95% Legit

"Title accurately describes the head-to-head test of three AI models; video delivers exactly what it promises."

Study Flashcards (9)

What is the cost per million output tokens for GPT 5.5?

easy

$30

01:31

Which model won on SWE-bench Verified and SWE-bench Pro?

easy

Opus 4.7

02:49

What score did GPT 5.5 achieve on Terminal Bench 2.0?

medium

87.2

02:55

How many tokens did GPT 5.5 use for the flight simulator?

medium

66,000 tokens

16:47

Which model performed worst on long context (500K-1M tokens)?

hard

Opus 4.7

03:55

How much cheaper is DeepSeek V4 compared to GPT 5.5?

easy

8 times cheaper

01:15

Which model won the flight simulator test?

easy

GPT 5.5

16:04

Which model won the WebGPU landing page test according to the host?

medium

Opus 4.7

24:24

What was the total token usage for Opus 4.7 in the flight simulator?

hard

200,000 tokens

16:52

💡 Key Takeaways

📊

DeepSeek V4 is 8x cheaper

Highlights the massive cost advantage of open-source models, making AI accessible for budget-constrained users.

01:15
💡

GPT 5.5 beats Claude's Mythos on Terminal Bench

Demonstrates GPT 5.5's surprising strength in a key benchmark, outperforming a supposedly superior model.

02:55
📊

Opus 4.7 regresses on long context

Shows that even top models can have weaknesses, especially in handling very long documents.

03:55
🔧

GPT 5.5 wins flight simulator decisively

Proves GPT 5.5's superior ability to handle complex, multi-step coding tasks with minimal guidance.

16:04
⚖️

No vendor lock-in between Claude Code and Codex

Emphasizes that learning AI fundamentals transfers across tools, reducing risk for developers.

25:11

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

GPT 5.5 vs DeepSeek V4 vs Opus 4.7: Cost Breakdown

37s

Viewers love comparing AI model costs, and the surprising price differences spark debate.


DeepSeek V4: 8x Cheaper but Close to Opus?

60s

The claim that a cheap open-source model nearly matches premium ones challenges assumptions and drives engagement.


Flight Sim Test: GPT 5.5 Crushes DeepSeek

60s

Seeing a live coding test with clear winners and losers is highly engaging and shareable.


DeepSeek V4 Fails: Flight Sim Disaster

60s

Watching a hyped model fail spectacularly is entertaining and sparks discussion about open-source limitations.


Final Verdict: Which AI Model Wins?

60s

Viewers want a clear winner and practical advice, making this a satisfying conclusion to the comparison.


[00:00] In the last 24 hours, we have had huge

[00:02] updates to two of the biggest AI models

[00:04] on the planet. First, we got the release

[00:05] of GPT 5.5, which is boasting certain

[00:09] benchmark scores that beat out Claude's

[00:11] Mythos. Secondly, we got the release of

[00:14] DeepSeek V4, which is an open-source

[00:17] open-weight model that has benchmarks

[00:19] that rival these frontier big players.

[00:22] So, with all these new models to choose

[00:24] from, what are you, the average user,

[00:26] supposed to do? Well, today I'm going to

[00:28] help you answer that question as I pit

[00:30] Opus 4.7, GPT 5.5, and DeepSeek V4

[00:34] against one another so you can see which

[00:37] one actually makes sense for you. Now,

[00:39] before we kick off this head-to-head

[00:41] test between GPT 5.5 inside of Codex,

[00:45] DeepSeek V4 inside of Open Code, and

[00:48] Opus 4.7 inside of Claude Code. Let's

[00:51] first take a quick look at the

[00:52] benchmarks, especially these two latest

[00:54] models that dropped in the last 24

[00:56] hours. Now, let's first talk about cost.

[00:58] Now, DeepSeek V4, as you know, is an

[01:00] open-source open-weight model, but that

[01:02] does not mean you can run this on your

[01:03] computer because this thing is huge. I'm

[01:05] talking 1.6 trillion parameters. You

[01:08] need some serious hardware to run this.

[01:10] So, we still got to pay for it. We're

[01:11] still going to have to use the API, but

[01:13] it is infinitely cheaper than the

[01:15] competition, about 8 times cheaper. And

[01:18] of the three models, the brand new GPT

[01:20] 5.5 is actually the most expensive,

[01:22] which is kind of surprising because by

[01:24] and large, OpenAI has been cheaper than

[01:26] its Anthropic competition. In terms of

[01:29] what it will cost you per 1 million

[01:31] tokens of output for GPT 5.5, it's going

[01:34] to be $30. For Anthropic, it's going to

[01:37] be $25. And for DeepSeek, it's going to be

[01:39] $3.48.

[01:41] Now, if we're talking about input

[01:43] tokens, which is a smaller part of the

[01:46] whole, GPT 5.5 and Opus 4.7 are the same,

[01:49] it's going to be $5 per 1 million input.

[01:53] And for DeepSeek, it's about like $1.70.

[01:57] So, way cheaper on the input and way

[02:00] cheaper on the output. That being said,

[02:01] when it comes to 5.5, this is like twice

[02:04] as expensive as 5.4. However, OpenAI

[02:08] claims that it actually uses way less

[02:10] tokens due to its power. So, while it's

[02:12] double the price of 5.4, they say in

[02:14] terms of actual token spend and actual

[02:16] cost for the same task, it ends up only

[02:19] being like 20% more expensive when it's

[02:21] all said and done. So, just have that in

[02:23] the back of your mind. So, we've talked

[02:24] about the cost. Now, let's talk about

[02:25] the benchmarks. How good are these

[02:27] models on paper? I know we're all kind

[02:28] of numb to benchmarks in general. We

[02:31] need to take them with a grain of salt,

[02:32] but it's still worth taking a look,

[02:33] especially when we're looking at the

[02:35] numbers that are reported by each player

[02:37] on the same benchmark. So, there were

[02:40] three in the coding category that all

[02:42] three reported numbers on: SWE-bench

[02:44] Verified, SWE-bench Pro, and Terminal

[02:46] Bench 2.0. Now, for SWE-bench Verified

[02:49] and SWE-bench Pro, Opus was the winner

[02:51] there. On Terminal Bench 2.0, GPT was

[02:55] the winner by far at 87.2, which by

[02:58] the way is a higher number than what

[03:00] Anthropic reported for Mythos, Mythos,

[03:03] sorry, which is kind of crazy. You know,

[03:05] the super secret model they can't

[03:06] release apparently does worse on

[03:08] Terminal Bench 2 than GPT 5.5. Now, the

[03:11] Terminal Bench 2.0 is the biggest

[03:12] outlier here. Opus 4.7 and V4 Pro are

[03:15] way behind, but take a look at Opus 4.7

[03:18] versus V4 Pro. It's less than two points

[03:21] while being eight times cheaper. And you

[03:23] see the same sort of story here with

[03:24] SWE-bench Verified and SWE-bench Pro.

[03:26] Yeah, Opus wins, but when we compare the

[03:30] second place with the third place, and

[03:31] V4 is always third place, there isn't

[03:33] the huge gap you would expect. I mean,

[03:36] five points isn't nothing, you know, on

[03:39] SWE-bench Verified 85 to 86, but again,

[03:42] eight times cheaper, open-source, you

[03:45] know, there there's some actual

[03:46] trade-offs here that we can make if we

[03:48] don't need the most power. Another thing

[03:49] that's interesting to talk about is long

[03:51] context where oddly Opus 4.7 is really

[03:55] bad by the numbers like significantly

[03:57] worse than 4.6 which kind of blows my

[03:59] mind. And when we're talking about long

[04:01] context where we're trying to retrieve

[04:02] things between 500,000 tokens and 1

[04:05] million tokens 4.7 is actually terrible

[04:08] and does way worse than DeepSeek and GPT

[04:11] 5.5. Now you can have a whole discussion

[04:13] about why are you even in the 500,000 to

[04:16] 1 million token range to begin with? how

[04:18] many people are actually operating there

[04:20] because we are hitting context rot no

[04:21] matter what at that place no matter what

[04:23] model you're using but it is interesting

[04:25] that for whatever reason we've seen some

[04:27] regression when it comes to the

[04:28] Anthropic models but big picture I think

[04:30] the takeaway is 5.5 is really strong it

[04:33] beats Opus 4.7 in certain metrics loses

[04:36] in certain metrics but it's an extremely

[04:38] robust model and on top of that while V4

[04:41] Pro is kind of you know lagging behind

[04:44] by and large it's within striking

[04:47] distance while being infinitely cheaper,

[04:48] which again is a great option for your

[04:51] average customer because right now it

[04:53] feels like you don't have a lot of

[04:54] options on the open source side that

[04:56] actually can compete. Now, let's jump

[04:57] into the actual head-to-head test with

[04:59] all three of these models. And we're

[05:00] using a harness for each of these

[05:02] models. With 5.5, it's going to be

[05:04] Codex. With Opus 4.7, it's going to be

[05:06] Claude Code. And with DeepSeek V4 Pro, I

[05:08] am using Open Code. And for the first

[05:10] test, what we're going to do is we're

[05:12] going to have them create a flight

[05:13] simulator for us in Three.js that runs in the

[05:16] browser. You can see the prompt right

[05:17] here. I'm saying I want it to feel good

[05:19] to fly. I want it to have some weight to

[05:21] it. I want some strong visuals, and I

[05:24] want it to use whatever structure and

[05:25] tooling it thinks is correct. So, it's

[05:29] straightforward enough that they know

[05:30] what to do, yet there's enough leeway so

[05:32] we can see some divergence between the

[05:33] models. And while we are going to look

[05:35] at what they're able to one-shot, we are

[05:37] going to go through multiple iterations

[05:38] of this and have follow-on prompts

[05:40] because as cool as it is to see how well

[05:42] it does on one shot, that isn't how we

[05:45] really work in real life, is it? Right.

[05:46] I want to see how it does when I give it

[05:48] follow-on prompts and how quickly it

[05:50] takes to get it to something I like. And

[05:52] when we compare these three models,

[05:54] there's really four things I'm going to

[05:55] look at. It's going to be time. How long

[05:57] does it take to build this? Cost, how

[05:59] many tokens are we using? Quality, how

[06:01] good is it? And then four is sort of

[06:04] vibes, and that sort of relates to

[06:05] quality. It's very subjective. Which one

[06:07] do I actually like more? And also of

[06:09] note, all three models, all three

[06:11] harnesses are also using the exact same

[06:13] skills. So, let's begin with DeepSeek in

[06:15] the questions. It's asking us, it's

[06:16] asking what sort of flight model we

[06:18] want. Let's go with full sim. It's

[06:20] recommending oceans and islands for the

[06:22] terrain. We'll go with that. Let's see

[06:24] how. And then it's asking camera

[06:25] preference. Let's do both. Let's see if

[06:26] it's able to give us a toggle for both

[06:28] first person and third person. We'll go

[06:30] with its recommended tooling preference.

[06:32] And we'll just go with a low poly model

[06:33] for the aircraft and visuals itself.

[06:35] Now, moving over to Codex. Same sort of

[06:37] questions, although it's only asking us

[06:39] three. Saying what kind of flight should

[06:41] this plane optimize for? Let's go with

[06:43] hard simulation. Which playable

[06:45] experience matters most for the browser?

[06:48] Let's do island takeoff loop. It is kind

[06:50] of interesting how they all have the

[06:51] same one. And what camera and aircraft

[06:53] presentation? I'm going to do toggle for

[06:55] this as well. And for Claude Code, we'll

[06:57] do study sim learning. For the field,

[07:00] ocean, and islands input, we will do

[07:02] keyboard and mouse. It will let it go to

[07:04] work. So plan mode by and large very

[07:07] similar across all three. pretty much

[07:09] the same questions of like what do you

[07:11] want the physics to be? What do you want

[07:13] the terrain to be? What do you want the

[07:14] camera angle to be? So, no huge

[07:16] difference there. And let's see what

[07:17] they come back with in terms of a plan.

[07:19] All right, so all three plans are

[07:20] complete. So, let's go through each of

[07:21] them pretty quickly and see some of the

[07:23] differences. First one we're looking at

[07:25] here is DeepSeek and it's pretty bare

[07:27] bones in terms of the plan it lays out.

[07:29] So, gives us the project structure and

[07:31] then talks very quickly about flight

[07:33] physics, environment, camera, and HUD

[07:35] overlay and really just a few bullet

[07:36] points. On the other hand, when we're

[07:38] looking at 5.5 inside of Codex, it

[07:40] gives us a summary, key changes, goes

[07:43] into implementation details, the test

[07:45] plan, and as well as the assumptions. It

[07:47] spells all that out for us. And then we

[07:49] have Claude Code's plan, which took the

[07:50] longest, took it about 5 minutes, but by

[07:52] far is the most thorough because it's

[07:53] the context, the stack layout, talks

[07:56] about the flight model. It's going into

[07:58] like the actual different moments,

[08:00] talking about stalls, like the stall

[08:01] buzzer, like it's it's going very very

[08:03] detailed. goes into the controls, the

[08:05] world, the mod, the actual aircraft

[08:06] we're going to be using, performance,

[08:08] and just keeps going on and on. So, very

[08:11] detailed. So, now we're going to have

[08:12] all three implement their plan, and

[08:14] we'll see what the final result looks

[08:15] like. So, GPT 5.5 inside of Codex was

[08:18] the first to finish. So, let's see what

[08:20] it looks like. So, here's the flight

[08:21] simulator it got us. We have, you know,

[08:24] some clouds in the sky. We have what

[08:28] looks like an AOA indicator up there. We

[08:31] have our speed down below. And let's see

[08:34] if we can actually get this thing off

[08:36] the ground. I will note there's no more

[08:37] like runway. It's just like straight

[08:39] grass and it said it was going to be

[08:40] like an island thing. Although when the

[08:44] camera kind of spazzes out, you can see

[08:45] the runway down below there for a

[08:47] second. All right, we're stalling out

[08:49] and we just we can't even get off the

[08:50] ground. All right, so this this one's

[08:53] actually just a little is actually kind

[08:54] of difficult. Um, so what I'm going to

[08:58] do is I'm going to give it a second

[09:01] prompt asking it to make it a little bit

[09:02] easier to fly because it has a lot going

[09:04] on here, but this is this is tough. So I

[09:07] wrote, "It is really hard to fly. Can we

[09:09] make this easier to use?" Aka a little

[09:11] bit more arcadey and also the graphics

[09:14] could use some work. So let's see how

[09:16] that does. Now, of note, um it took 5.5

[09:20] about 7 minutes to create that first

[09:22] pass for us, and it took 63,000 tokens.

[09:26] All right, it said it made it a little

[09:27] bit easier to fly and updated the

[09:29] graphics. So, let's see what the second

[09:31] pass looks like. So, here's what we got.

[09:33] Graphics definitely look better, but

[09:34] let's see if we can actually get off the

[09:36] runway this time. So, all right,

[09:38] throttle's at 100%.

[09:41] 50 60. What's the rotation speed on a

[09:44] Cessna? All right. 70 80 90. We got to

[09:50] be able to get off the ground now. Okay.

[09:52] Wrong way. Let's go. Get off the ground.

[09:55] Get off the ground. Nope. This is

[09:57] probably going to stall me out, isn't

[09:58] it? Yeah. Stall. Okay.

[10:01] This still needs this still needs some

[10:02] work. So, let's let's give Codex one

[10:04] more shot. Let's give 5.5 one more

[10:06] chance to make this actually playable.

[10:08] So, I told it I can't even get the

[10:09] aircraft off the ground and enter

[10:11] flight. We definitely need to make it

[10:12] easier to take off and actually fly the

[10:13] thing. Okay, so it says it fixed the

[10:15] takeoff problem. Apparently the brakes

[10:17] started locked on before. I don't know

[10:19] if that's why we weren't able to do it.

[10:21] Oh, we it didn't automatically set it to

[10:24] takeoff flaps. Yeah, this was we had

[10:26] this on like super super simulator mode.

[10:29] Here is attempt number three at our

[10:31] flight simulator. Let's see how we do.

[10:34] So, can we get off the ground? Oh, we're

[10:36] bouncing on the runway with this time.

[10:38] That's something. All right, cool. We're

[10:40] off the ground. We're actually moving.

[10:43] Let's see if we can get on one of these

[10:45] rings. I mean, the graphics aren't that

[10:48] bad, you know, for something just

[10:50] generated in less than 10 minutes. Um,

[10:53] it seems to be pretty accurate in terms

[10:55] of, you know, it's giving me like my

[10:57] vertical, you know, feet per minute down

[10:59] at the bottom, my actual altitude, the

[11:02] knots, heading, AGL. So, like it's

[11:05] relatively sophisticated in terms of

[11:07] tracking everything. I mean, this little

[11:09] indicator in the front, I mean, looks to

[11:11] be like an angle of attack, you know,

[11:13] indicator, which is kind of cool. So, it

[11:15] has some good stuff going on. The the

[11:19] actual like controls are a little janky.

[11:21] As you can see, I can't control this for

[11:22] anything, but by and large, not bad. You

[11:25] know, we can kind of like kamikaze this

[11:27] and see what happens at, you know,

[11:28] 18,000 ft per minute.

[11:31] But yeah, you know, for 66,000 tokens,

[11:36] about 10 minutes, 15 minutes or so, give

[11:39] or take, you know, with the back and

[11:41] forth. I don't think that's bad at all.

[11:42] So, now let's take a look at DeepSeek. It

[11:44] took about 10 minutes to do this. And in

[11:46] terms of tokens, 63,000

[11:50] 44. So, 44 cents, 10 minutes. And here

[11:53] is what DeepSeek came up with for us. I

[11:57] have no idea

[12:00] what I'm looking at.

[12:04] This is supposed to be third person.

[12:06] This is supposed to be the cockpit. And

[12:08] obviously,

[12:09] our first pass with DeepSeek was an

[12:12] utter disaster. So, I'm telling DeepSeek

[12:14] the simulator is a complete mess. The

[12:16] graphics are completely buggy and I

[12:18] cannot fly anything. Please fix. And

[12:21] here's what our second pass looks like.

[12:24] I still have no idea. Absolutely no clue

[12:29] what the heck DeepSeek is. Oh, hey,

[12:31] there's a plane. You know, there's

[12:33] something.

[12:35] I Yeah, this is This is brutal. And to

[12:38] be honest, I feel like even giving it

[12:40] another prompt to do this, I would need

[12:43] to start getting very, very specific

[12:44] about what we're trying to do, which

[12:47] again like falls pretty short of what we

[12:48] did with Codex. Like it was very, you

[12:50] know, kind of bland prompts. was able to

[12:51] get something at least close even on the

[12:53] first pass. Like this clearly it's

[12:56] completely struggling with the graphics.

[12:58] We are just I I don't even know how to

[13:00] describe this. But hey, it was super

[13:03] cheap. So now let's take a look at what

[13:06] Claude Code was able to give us. For

[13:08] reference, it took 13 minutes to

[13:11] actually execute the plan. The plan

[13:12] itself took 5 minutes. So let's call it

[13:15] 20 minutes to come up with the first

[13:17] pass. And then for total tokens, this

[13:19] run took about 15% plus the 5% before

[13:21] for plan. So we're looking at well sorry

[13:24] we are looking at 11% context plus 5%

[13:28] before. So call it 20 minutes 150,000

[13:31] tokens for Claude Code which is

[13:33] definitely the most expensive and

[13:34] slowest out of all of them. And here is

[13:37] Claude Code's attempt at this. Um for

[13:41] whatever reason we are instantly in the

[13:42] air. We are stalling. We are in IFR. I

[13:46] don't know what's happening. We are

[13:48] about to crash something. Let's Can we

[13:50] save this? Can we pull this out of a

[13:53] dive? No, we're stalling. No, we're

[13:54] dead. Okay, that's interesting. Um,

[13:57] again, it instantly slingshots us into

[13:59] the air. We are in the clouds. We are

[14:02] stalling. I don't know what is

[14:04] happening. We need We need a second

[14:08] pass. So, I wrote, "Upon loading, I'm

[14:10] instantly thrown into the air. It's hard

[14:12] to control. I want to start on the

[14:13] runway, and I want it easier to fly. Oh,

[14:15] and by the way, improve those graphics,

[14:16] too." So, it took about four minutes, but

[14:18] it made some changes. We're going to

[14:20] spawn on the runway. It changed the

[14:22] gear. So, now it's tricycle gear and a

[14:24] few other stuff. So, let's see what it

[14:25] looks like. All right, so here it is

[14:27] again. We are thrown immediately into a

[14:28] fog bank. I'm trying to control this

[14:30] thing. And I just Yeah, there's there's

[14:32] no controlling this at all. All right,

[14:34] we are going to give we're going to give

[14:35] Claude Code one more chance here. So, I

[14:37] told it it's still instantly slingshotting

[14:39] me into the sky. I said, let's go with a

[14:41] much more arcade type feel with the

[14:42] controls. I think we probably should

[14:43] have done that with the initial prompt

[14:45] for all three. I think going for a more

[14:48] realistic sim type thing, it it really

[14:51] struggles to

[14:53] I think do that in a way where it's

[14:55] still user friendly. I think it's

[14:57] probably doing a good job under the hood

[14:59] in terms like okay like angle of attack.

[15:01] All right, you're stalling at this you

[15:02] know angle versus the speed and all that

[15:04] but actually manipulating this from the

[15:07] computer is basically impossible.

[15:09] Although I I think the fog stuff is

[15:11] really strange. So, let's see if after

[15:14] the second round of prompts, it's able

[15:15] to do a little bit better because right

[15:17] now GPT 5.5 did much much better. So,

[15:20] Claude Code made some more changes, made

[15:22] it more user friendly, and let's see if

[15:23] I'm still going for my instrument rating

[15:25] this time. So, yep, we're still going

[15:28] we're still going for instrument rating.

[15:30] RIP men's here, but you know, I can kind

[15:32] of see it. You know, I can I can check

[15:34] my instrument panel. All right, we're

[15:36] coming off the runway.

[15:38] Um,

[15:40] yeah. Okay. Can I Why is there a tree in

[15:43] the runway? Trying to trying to go up.

[15:46] Can I go up? Can I pitch? Click canvas

[15:50] to lock mouse. I What? Oh, we're in the

[15:54] We're in the air.

[15:56] Nope. Nope. We died. So, yeah, I think

[15:59] this one is pretty clear. Uh, GPT 5.5

[16:04] easily the winner. I think Claude Code

[16:07] was second place. I would give it second

[16:09] place. You know, it definitely struggled

[16:13] even with the prompts we gave it. We

[16:14] didn't give it great prompts. Let's be

[16:15] totally honest. I think given more time,

[16:18] better prompts, a few more back and

[16:19] forth, we could have got it to where we

[16:21] wanted to go. Like, it was at least it

[16:22] had an aircraft, it had a runway, it had

[16:25] trees in the runway, but it had the the

[16:28] actual things we needed versus DeepSeek

[16:31] with Open Code. I have no idea what was

[16:33] going on there. That was a complete

[16:34] mess. I feel like I would have had to

[16:36] start over from the beginning, like give

[16:37] it a very specific prompt. like it

[16:38] wasn't even close to being messed with,

[16:40] but GPT 5.5 right off the rip, you know,

[16:42] with pretty vague prompts. I thought it

[16:44] did really good. 5.5 also used a total

[16:47] of 66k tokens. We're looking at over

[16:50] here with Opus altogether about 200,000

[16:53] tokens. So, a quarter of the tokens

[16:55] essentially quarter the cost and it was

[16:57] a bit faster. I mean, at this point, I

[16:59] don't even care about how open code

[17:01] actually took longer than GPT 5.5 as

[17:03] well. And it just sucked. Let's just be

[17:06] honest, it just sucked. Now let's move

[17:07] on to test number two. This time we are

[17:11] going to be asking them to create a

[17:13] landing page that shows off WebGPU

[17:15] shader work using Three.js. Now, WebGPU

[17:19] shader work is the kind of stuff you see

[17:21] on awards websites. I'm talking websites

[17:24] like Igloo. This kind of thing like very

[17:27] high-end graphics. It looks like a video

[17:29] game. It's essentially using your

[17:31] computer's graphics card to render all

[17:33] this stuff. Now, I don't expect any of

[17:35] these to get anything even close to what

[17:37] we see here, but I want to see what they

[17:39] can do using essentially the shaders

[17:42] technology. This is definitely a step

[17:43] above your basic SaaS templated landing

[17:46] page. I want to see what they can do and

[17:48] push them to the limits in the world of

[17:50] web design. Now, I've given all of them

[17:51] a skill that actually breaks down how to

[17:54] do this sort of thing. So, it's not like

[17:56] they're completely in the dark and one

[17:58] also doesn't have an advantage over the

[18:00] other. The only thing I've told them is

[18:01] I want it to feel modern and visually

[18:03] striking. something you would see on

[18:05] awards and to make smart use of GPU

[18:07] compute. So they can pick whatever stack

[18:09] and project structure they like and use

[18:11] good judgment on hero concept UI and

[18:14] interactions. And just like the first

[18:15] test, they're all on plan mode. So let's

[18:18] get started. Okay, so they all finished

[18:19] their plan and funny enough, none of

[18:21] them asked me any questions even though

[18:23] we put them in plan mode. So let's take

[18:24] a look at GPT 5.5 first. So it's telling

[18:28] us it's going to do a full-bleed

[18:30] interactive GPU driven hero. The concept

[18:32] will be a living signal field with some

[18:35] like dense particle thing it's going to

[18:36] do. We'll see what that ends up looking

[18:38] like. And overall, it's a minimal awards

[18:40] style landing copy. Fully interactive

[18:42] WebGPU scene with pointer-reactive

[18:44] compute simulation.

[18:46] All right, for DeepSeek, it's a pretty

[18:48] short and sweet plan just like we saw

[18:51] with the flight simulator. Hopefully, we

[18:53] get a better output this time. But a

[18:55] hero section with 75,000 GPU compute

[18:57] particles. I am kind of guessing that

[19:00] all of them are going to go for some

[19:01] sort of like particle theme on the on

[19:04] the hero.

[19:05] So, it's gonna have mouse interaction

[19:07] integration. It'll have a one-time

[19:09] initialization. And then we should see

[19:11] stuff like bloom, chromatic aberration,

[19:14] a custom vignette, and some film grain.

[19:16] So, we'll see what that actually ends up

[19:18] looking like. And then we have Opus

[19:20] 4.7's plan, again going for this

[19:22] particle thing with bloom, and it's

[19:24] going to be interactive with the mouse.

[19:25] So, we'll see if any of these actually

[19:26] look different because on the surface

[19:28] all their plans sound very similar. So,

[19:30] the first one done was 5.5. It took

[19:33] about 6 minutes. And in terms of tokens,

[19:35] we've used 107K. So, let's see what it

[19:39] built us. And here's what it created for

[19:42] us. Now, this is very bright. Um, so

[19:45] it's hard to even see the actual

[19:47] particles, but you know, as we scroll up

[19:49] and down, it does have an animation

[19:52] going on in the background as well as,

[19:54] you know, some subtle color changes.

[19:56] Looks like right now our mouse is

[19:59] supposed to attract the particles and we

[20:01] have I'll move this over here. It gave

[20:04] some options for like repelling it

[20:06] versus drift, but again, it's kind of

[20:09] tough to to see it um due to how bright

[20:11] it is. So, I told it that it's hard to

[20:13] actually see the particles due to the

[20:14] brightness. It also takes over a lot of

[20:16] the hero text. So, can we turn down the

[20:17] brightness a bit and also push it to the

[20:18] right a bit more because right now it is

[20:21] kind of overpowering. You can't even

[20:23] really read the text over here on the

[20:25] left due to just how freaking bright

[20:26] these particles are. And here's the

[20:28] update after the second run. It's a

[20:30] little bit better. It isn't as

[20:31] overpowering and leaves some room for

[20:34] the text. Um, although I will say it

[20:37] it's kind of blurry almost, but you

[20:40] know, it's not bad. Like it it set out

[20:43] to do what we told it to do given the

[20:46] somewhat vague prompt. So, I'm not blown

[20:47] away by sort of the design it came up

[20:49] with, but I'm not like upset about it.

[20:51] Now, let's take a look at Claude Code

[20:53] because as we've been doing all this,

[20:55] Deep Seek is still over here in the

[20:57] trenches trying to figure this out. And

[20:59] here's what Claude Code gave us.

[21:03] So,

[21:05] kind of nothing.

[21:07] I'm not sure if it's saying the

[21:10] background. I guess the entire

[21:11] background is supposed to be

[21:16] the WebGL. I'm assuming it's very

[21:19] understated,

[21:21] which I guess is something you could

[21:23] totally do. I mean, like on screen it

[21:25] doesn't look like it looks kind of cool,

[21:26] but I'll be honest, I was looking for

[21:28] something a little more flashy. So on

[21:31] the second pass when I told it to make

[21:32] it a bit more flashy, there wasn't a

[21:34] huge difference. Although, like, it's

[21:37] really subtle. There's kind of like this

[21:39] film grain almost like this blur that

[21:42] goes from bottom to top. So it's a

[21:44] pretty subtle thing. And you can see

[21:46] here on the bottom it tracks like the

[21:48] frames per second. It's using 250,000

[21:50] particles. So I mean honestly it looks

[21:54] cool. It's just not super flashy. So

[21:56] it's definitely like a taste thing. Now,

[21:58] total tokens on the Claude Code side was

[21:59] about 175,000 and it took just slightly

[22:02] longer than 5.5 inside of Codex. Now,

[22:06] let's take a look at DeepSeek, which has

[22:08] taken 116,000 tokens at this point. It

[22:11] took the longest um as well, but total

[22:13] cost, we're talking again under a

[22:14] dollar. And here's what it gave us. So,

[22:18] it's

[22:20] kind of this particle field thing that

[22:22] uh somewhat follows my mouse.

[22:26] Interesting. I think it might give you

[22:28] like an epileptic seizure,

[22:31] honestly. Um, beyond that, it's pretty

[22:33] bland. Um, the flux, you know, text

[22:37] right here kind of changes colors, but

[22:40] yeah, pretty much just created this

[22:42] thing. And after telling DeepSeek to do

[22:44] another pass, it then came back with

[22:46] this where now it kind of has like some

[22:48] weird parallax thing. It's got some like

[22:51] blue stuff going on in the background.

[22:53] And now this thing looks like a UFO,

[22:55] which kind of responds to your mouse,

[22:58] but

[22:59] yeah, it's something. And

[23:02] overall, the token count from DeepSeek

[23:04] was 130K tokens, coming in at a $143.

[23:08] So, after all those tests, where does

[23:12] that really leave us? So, now let's talk

[23:13] about the final results. When it comes

[23:15] to test number one, which was the flight

[23:17] simulator, clear winner. That was GPT

[23:19] 5.5 inside of Codex. It was quicker

[23:22] than Opus 4.7 inside of Claude Code. It

[23:25] was also faster and the end result was

[23:27] by far the best. DeepSeek did terribly

[23:31] in the flight simulator. It wasn't even

[23:33] close to what we were trying to do. I

[23:34] would have had to continue to prompt it,

[23:35] prompt it, prompt it to even get it to

[23:37] like close to the first pass from 5.5.

[23:41] And Opus 4.7 in Claude Code was like,

[23:43] eh,

[23:45] it wasn't awful. Like, it really didn't

[23:47] work at the beginning, but after a

[23:49] couple prompts, you could tell we could

[23:50] get it to a place where it was

[23:51] equivalent to what GPT 5.5 was doing,

[23:54] but that would have taken more prompts.

[23:55] It would have taken more time, and

[23:57] ultimately would be more expensive. So,

[23:59] clear winner for 5.5. In terms of the

[24:02] WebGPU landing page, again, DeepSeek

[24:04] struggled here. I was not a fan of this.

[24:06] I don't really know what this is

[24:08] supposed to be. Sure, I didn't give it a

[24:09] super great prompt, but like is this

[24:12] what we're going to be getting as a

[24:14] baseline median outcome if I don't like

[24:17] grab DeepSeek by the reins and really

[24:20] force it to do something? I guess so.

[24:22] Now, when we compare Opus and 5.5, I

[24:24] would have gone with Opus 4.7 and Claude

[24:27] Code with how it handled the WebGPU

[24:29] thing. I think that has to do with sort

[24:30] of a taste kind of deal. Yeah, you could

[24:32] argue the 5.5 was flashier, but I

[24:35] thought it was kind of ugly. Um, again,

[24:39] in all these tests, we kept the prompts

[24:40] rather vague to see what sort of path it

[24:42] would go down. So, I would definitely

[24:44] give Opus the lead here, although it was

[24:48] more expensive and it also took slightly

[24:50] longer. So, if they were given a more,

[24:53] you know, hands-on prompt that was very

[24:55] specific about what you wanted to do

[24:57] because 5.5 did what we wanted it to do,

[24:59] like it did create a WebGPU landing

[25:01] page. I just thought it was ugly. So, it

[25:04] still completed the task. It just didn't

[25:06] complete it as well, I think, as Opus.

[25:08] Now, big picture, what does it mean if

[25:10] we take all that together? Well, I think

[25:11] it means great news for anybody who's

[25:15] using agent coders. We have options,

[25:18] right? You can use Opus and Claude Code

[25:20] or you can use GPT 5.5 and Codex.

[25:23] You're not wrong with either. I think

[25:26] it's totally a personal preference at

[25:27] this point. And the best part is if you

[25:30] go down the Claude Code route, it pretty

[25:32] much all applies to Codex. If you go

[25:34] down the Codex route, it pretty much

[25:35] all applies to Claude Code. So, I don't

[25:38] really think there's vendor lock-in in the

[25:39] sense like, oh, I've only learned about

[25:41] Claude Code, like I can't go to Codex

[25:43] or vice versa. That's not the case at

[25:45] all. If you're doing this the right way,

[25:46] what you're really learning is AI

[25:47] fundamentals and how to build things.

[25:49] And that applies to both of these guys.

[25:51] And the more competition, the better it

[25:53] is for us, the consumer. Now, as for

[25:55] DeepSeek, I don't know. I wasn't very

[25:59] impressed. This might be a situation

[26:01] where we're like, okay, like DeepSeek makes

[26:02] sense if we're doing simpler tasks where

[26:04] we just don't need the power of

[26:05] something like Opus or we just don't

[26:07] need the power of something like GPT 5.5

[26:10] because remember we're talking about

[26:11] something that is eight times cheaper.

[26:13] Sure, I didn't like the WebGPU landing

[26:16] page this thing came up with, but was it

[26:18] eight times worse?

[26:20] Maybe, maybe not. Kind of hard to

[26:22] actually, you know, articulate that and

[26:24] quantify that, but obviously that's

[26:26] something we need to take into account.
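
To put rough numbers on the "eight times cheaper" point, here is a minimal Python sketch. It uses the per-million output prices quoted in the cost comparison ($30 for GPT 5.5, $25 for Opus 4.7, $3.48 for DeepSeek V4) and the token totals reported for the WebGPU test. The video does not split those totals into input and output tokens, so billing every token at the output rate is an assumption that gives an upper bound, not an actual bill. The function and variable names are made up for illustration.

```python
# Prices per million output tokens, as quoted in the video's cost comparison.
OUTPUT_PRICE_PER_MILLION = {
    "GPT 5.5": 30.00,
    "Opus 4.7": 25.00,
    "DeepSeek V4": 3.48,
}

# Total tokens reported for the WebGPU landing page test.
WEBGPU_TOKENS = {
    "GPT 5.5": 107_000,
    "Opus 4.7": 175_000,
    "DeepSeek V4": 130_000,
}


def upper_bound_cost(model: str, tokens: int) -> float:
    """Dollar cost if every token were billed at the output rate (assumption)."""
    return tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION[model]


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times higher one model's output price is than another's."""
    return OUTPUT_PRICE_PER_MILLION[expensive] / OUTPUT_PRICE_PER_MILLION[cheap]


if __name__ == "__main__":
    for model, tokens in WEBGPU_TOKENS.items():
        print(f"{model}: at most ${upper_bound_cost(model, tokens):.2f}")
    for model in ("GPT 5.5", "Opus 4.7"):
        print(f"{model} vs DeepSeek V4: {price_ratio(model, 'DeepSeek V4'):.1f}x")
```

Under that assumption, DeepSeek's 130K-token run stays well under a dollar, while the other two runs land in the low single digits. The output-price ratio works out to roughly 8.6x against GPT 5.5 and roughly 7.2x against Opus 4.7.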

[26:27] So, you know, I don't think it's really

[26:29] competition to be frank with 4.7 or 5.5.

[26:33] I think though if you're doing simpler

[26:34] tasks and you're, like, very token

[26:36] conscious, very cash conscious, then

[26:38] hey, maybe DeepSeek makes sense for

[26:40] you. So, that's all I got for you guys

[26:42] today. I hope that sheds some light on

[26:44] these three models and how they kind of

[26:46] stack up to one another. I think it's a

[26:47] great time to be in the space. More

[26:49] competition is better for everyone. So,

[26:52] as always, if you want to get your hands

[26:53] on the Claude Code Masterclass, make sure

[26:55] to check out Chase AF Plus. There's a

[26:57] link to that in the description and I'll

[26:59] see you
