---
title: 'CodeLlama 70B'
source: 'https://youtube.com/watch?v=CgVvorNgGqA'
video_id: 'CgVvorNgGqA'
date: 2026-06-18
duration_sec: 0
---

# CodeLlama 70B

> Source: [CodeLlama 70B](https://youtube.com/watch?v=CgVvorNgGqA)

## Summary

Meta released CodeLama 70B, a 70 billion parameter code-focused language model. This video reviews its new prompt format, sensitivity to prompts, and compares its performance with DeepSeq Coder 33B across multiple tasks including string reversal, code generation, and passkey retrieval.

### Key Points

- **CodeLama 70B Release** [0:00] — Meta released CodeLama 70B, a 70B parameter model highly sensitive to prompts. Performance comparison with DeepSeq Coder 33B is presented.
- **New Chat Format** [0:22] — The prompt format changed radically from Llama 2: uses source system/user/assistant and a special <step> token between messages.
- **Use Tokenizer Apply Chat Template** [1:36] — Recommended to load tokenizer and use tokenizer.apply_chat_template to format prompts correctly instead of manual formatting.
- **Stop Token Step** [2:03] — The model generates a special <step> token at the end of each message. This token must be added as a stop token to generation.
- **VRAM and Storage Requirements** [3:04] — Full 16-bit model requires ~150 GB VRAM. 4-bit quantization reduces to ~35-40 GB allowing run on A6000. Storage needs 150 GB for weights.
- **Speed Up Downloads** [4:19] — Use hf_transfer library to achieve >500 MB/s download speeds for large weights.
- **Guardrails Observed** [6:04] — Instruct models often refuse non-coding questions. DeepSeq Coder refused planetary question; CodeLama included Pluto. Base models perform better.
- **String Reversal Test** [6:49] — Both models struggled with reversing letter sequences. CodeLama instruct refused; base model succeeded but limited to short strings. DeepSeq also failed for longer sequences.
- **Code Generation Test** [10:22] — CodeLama generated correct program for first 10 prime Fibonacci numbers. DeepSeq generated less efficient code that only returned 5 numbers.
- **Passkey Retrieval** [12:29] — Instruct models failed to retrieve injected string due to guardrails. Base models succeeded. DeepSeq base is a strong retriever.
- **Quantization and Hardware** [14:36] — Full precision requires 150 GB VRAM. With NF4 4-bit quantization, runs on single A6000 (48 GB). AWQ quantization expected soon for faster and smaller model.
- **Function Calling Performance** [16:55] — Fine-tuned function calling model on CodeLama performs well on validation tests, including tricky questions like greetings and random words.
- **Recommendation** [18:50] — Start with smaller models like OpenChat 3.5 for testing, then use CodeLama 70B for production only if high quality is needed.

### Conclusion

CodeLama 70B offers strong coding capabilities but is heavily guardrailed in instruct mode. For best results, use base models or expect refusals on non-coding tasks. Quantization makes it accessible on consumer-grade GPUs.

## Transcript

CodeLama 70B has just been released by Meta, and my biggest takeaway is that it's very sensitive to the prompt.
I'll be going through a performance comparison with DeepSeq Coder, the 33 billion parameter model,
and I'll talk you through a few of the changes, the tweaks, that have been made in the 70B CodeLama
versus the previous CodeLama releases for smaller models.
And for agenda, I want to talk through the chat format, which has changed significantly since CodeLama,
or compared to Lama 2.
Then I'll talk a few little notes here about the stop tokens.
Again, this is related to the prompt format.
Then I'll take you through performance comparison between DeepSeq Coder and CodeLama 70B.
And I'll show you some one-click templates you can use to get started with CodeLama 70B.
It does require quite a bit of VRAM. This is a big model.
And then I'll briefly show you some function calling performance.
So here we are on the CodeLama repo. It's just been launched, and it's a 70 billion parameter model.
It's been uploaded in safe tensors and PyTorch format, and it's very roughly 150, so 30 times 5.
That's roughly, it's actually 29 times 5, so 145 gigabytes in size.
Now, the key difference with this model, apart from its size and hopefully performance,
is that the prompt format has changed fairly radically.
Instead of using inst at the start and then a slash inst before the assistant,
there's a completely new format that is most easily applied by using tokenizer-apply-chat template.
So I'd highly recommend Rodin trying to put together a prompt manually,
load the tokenizer and use tokenize-apply-chat-tip template to an array of messages,
and that will format the token, that will format the prompt as you need.
Now just to give you a little insight, every message now has got this source,
and it's got either source system, source user, or source assistant for the very last message.
So this is quite a radical change, and also there's this special token step
that is put in between the system and the user, and then between the user and the assistant messages.
So actually when the model is generating, it will generate a sentence and end with that token step,
and you'll often need to add that as a stop token.
I'll show you that in the code that we go through, but you'll need to put that as a stop token
because it won't generate slash s as with the other Lama models.
So let's jump right into a comparison.
So here I've gone to run pod, pods, new pod, and I'm going to want to run here on...
You could run just the code Lama 70B on an A6000, but I want to compare it to DeepSeq,
so I selected an A1000, and the template I'm looking for here is fine-tuning, notebook by Trellis.
You can find the template on the one-click LLMs repo, that's a public repo I'll link below.
Now, before you deploy, make sure to put a lot of space.
You need at least 150 gigabytes to download one copy of the code Lama 70B model.
So I have that model up and running, and then I've opened a Jupyter notebook,
and within the notebook I've uploaded this file here, which is LLM comparison,
and I'll provide a link to that in the install guides repo, a public repo on Trellis Research GitHub.
So we don't need to log in to Huggingface because these are all public models.
I've set the cache directory, and I've selected two models here.
Let me just increase my screen for size.
I have the code Lama 70B model and the DeepSeq coder model.
Note that they're both in struct models, so these have been guardrailed pretty heavily in both cases,
and we're going to see that during performance.
This supervised fine-tuning that has been done by Meta and DeepSeq,
it does improve the performance, but it also means that the models are quite limited in what they can respond to.
I've installed a series of packages here, including flash attention to improve speed,
imported the packages I need.
Now, I want to show you this one little trick that I've discovered recently,
which is to pip install hf underscore transfer and then set an environment variable of hf hub enable hf transfer,
and this really speeds up downloading your weights and also uploading them again to hub.
Now, you'll see that down below a little, but first I'm going to set up each of my models and load the models.
I'm going to load them in bits and bytes NF4 format, that's a 4-bit format.
It's a quite good quality 4-bit format, but keep that in mind because that will degrade quality a little bit.
But allow me to fit both of these models on an A100.
And next, the shards are all downloaded, but look at how fast the download speed is.
That's more than half a gigabyte per second of speed, so you can achieve really high download speeds if you're using that hf transfer.
And this is a massive time saver because these weights are really, really big.
I've then set up the tokenizers and remember that we're going to use the chat template within the tokenizer in order to format the prompt.
So I run a series of checks that there is a chat template for each of these models,
and indeed they both do have chat templates within the tokenizer.
And here you can see the tokenizer, this is Lama tokenizer fast.
And just notice that it has this added token of step, that's 32015.
And I'm actually going to set that as the end of sequence token because that during generation will get the program to stop generating
once it hits this end of sequence token step.
So next I'm going to set a generate function, which will take in a prompt and it will tokenize,
apply the template, then tokenize, then submit to the model for generation.
And I'm going to run the very first question here, which is list the planets in our solar system on both of these models.
And you can see that Lama marks Pluto as debatable, but gets it right.
And deep seek tells us that as an AI programming assistant, not equipped.
In fact, if you run the deep seek base model, you are able to get an answer to that question.
So this is an example of the model being guard railed.
Now we'll move on to the three evaluations, which are actually in slightly different order,
returning a sequence of letters in reverse.
This is a difficult task for LMS, perhaps particularly because tokens are combinations of letters,
and that makes it hard to just reverse letters within tokens.
PASCII retrieval, where we inject PASCII within text and then code generation, which I'll actually do second.
So the first test here on both models is asking the model to reverse the sequence, like A2B,
and then add on another letter to make it harder and harder and harder.
So here with code LAMA, I've given an example, so I've even done one prompt, one shot prompting,
and I give an example reversing that and then ask LAMA to reverse the string.
And instead it's telling me that 2A is not a valid string, it's not a valid number, it's not a valid.
So the answer here doesn't make sense, and this again unfortunately is due to guard railing,
even though I've left out the system prompt, which can help quite a bit in reducing the effects of guard railing.
Now I've run that example as well using the base model a little bit earlier,
so I'll show you that answer here.
And you can see that actually the base model is able to reverse the string.
Now after reversing it, it then follows with a long gap here.
So this is the problem when you don't have an instruction fine tune model.
And next I get it to try and reverse this string, which is A2BD,
and in this case it does not correctly reverse the string.
So again this illustrates that the chat fine tuned model, it's very heavily guard railed,
which stops answering quite a lot of questions, and probably more representative of performance here of the base model.
Although unfortunately the base model never has the benefit then of the additional fine tuning.
The base model here seems to be cutting off at just a few, maybe around four,
maybe it's between three and five characters in a row, so you can see it's failed here.
Now by contrast deep seek, I've got it to do the same test,
and deep seek in this case has failed for a length of four, well that's actually a length of five characters,
but either way it doesn't get that exactly in reverse, it mixes the A and the two.
But let me just run that test again, because there's variability,
depending on exactly what that sequence is to reverse, you won't get success or failure necessarily.
So sometimes the model might get a very long sequence, sometimes it might get a shorter one,
and you have to run it a few times to get a sense for the performance.
So here again you can see the model is rejecting our request in the case of code lambda 70B,
and in the case of deep seek, you can see it's progressing on fairly nicely,
and it's regressing, it's giving a Python example here,
giving Python that example here again, because it's aligned towards returning code.
So broadly speaking I've run this quite a few number of times, and deep seek coder is up able to get up higher,
so my sense is that deep seek is a bit stronger on this task.
I've rerun the task as well with GPT 3.5 and 4, and here's an example of where GPT 3.5 gets this sequence,
which has got 9 characters within it, and it gets it wrong, it doesn't capitalise the T,
and GPT 4 is able to go quite a bit further, it can probably go out towards 15 plus tokens.
You can see it's successfully reversing this same sequence right here.
So overall in terms of reversal of tokens, I would probably give the edge to deep seek coder,
but certainly it's a hard test to run in CodeLama because it just refuses to respond unless you use the base model.
The next test here is code generation, so the question I ask is to give a Python code snippet that prints the first N,
where I set N to 10, so the first 10 prime numbers into Fibonacci series.
So to do that, the model needs to first figure out how to calculate the Fibonacci series,
and then filter out the prime ones, of which like there aren't that many, well there's probably a very large amount,
but what I'm saying is the frequency is not that great, and then print those out.
So here we have a program, which is from CodeLama, so at least on the surface looks good, and deep seek looks good as well.
So we can take those programs, and here I'm running the CodeLama program,
and indeed it does give the first 10 prime numbers, deep seek, I run the program, it only gives up to 89,
but this is because in fact I've only run it for 5.
Now you can run it for 10 as well, but the program is less efficient than the CodeLama program,
which means that it takes really long to run for the first 10.
So in this case that would give the slide edge, it's just one example though,
but it would give the slide edge here to CodeLama in terms of performance.
Chat GPT 3.5, it is able to get the first 10, it actually sets a limit on which Fibonacci numbers to check,
so it sets a limit of 30, so that limit is not quite high enough to grab this top number here,
but it's got the calculation method as efficient for calculating these prime Fibonacci numbers.
And then last of all you've got GPT 4, which is able to get all of the requested numbers.
Okay, so on this one here, I think the coding capability is certainly very strong,
it's not obvious, it's possibly better than, well, it's not obviously worse than GPT 3.5,
and it's doing well compared to GPT 4 as well, you need a harder challenge in order to distinguish them.
Now the last question here is pass key retrieval.
I've actually renamed it from pass key retrieval to random string retrieval
because calling it a pass key makes Lama tell you that this is sensitive information that it's not permitted to return.
So I've instead called it a random string, and I inject the random string halfway through a piece of text,
which is as usual the Berkshire Hathaway transcript from 23.
And I'm just testing about 16,000 characters, actually exactly 16,000 characters, which is about 4,000 tokens.
These models are very good in long context, so I expect this would work up to 16,000.
And if you extend the context, you can look at the long context fine tuning video from previously,
but it's very realistic to extend the context on these models and get to 100k, or at least, yeah,
probably 100k worth of tokens and succeed on pass key retrieval.
So here I've run the two models, and I'm asking for the pass key and collab, not collab.
Code Lama says, I see you're a fan of the Berkshire Hathaway meaning this is true, but it's not giving me the pass key.
So I'm not sure why it can't give this response, it's not saying explicitly, it's a safety thing.
When I ask it for a pass key, it does refuse on the grounds of user privacy.
So I'm not quite sure, but it's clearly giving issues again with guard railing.
And for deep seek, at the end here, let's see what happens.
It's not able to find the random string.
Now I know from running the base models, and I'll show you again with the base model just for the proof.
Let me show you the model I loaded.
So where are we here?
So this is the base model, you can see there's no instruct mentioned in the model title.
And when I go down to the bottom, Code Lama does actually get the pass key.
So the base model is capable of getting the pass key, and this is true of deep seek as well.
Deep seek is a very good pass key retriever.
So it's the effective, the instruction fine tuning, the guard railing, that's affecting the performance of the model right here.
The Code Lama model is quite large.
In 16 bit precision, you would need to have a VRAM availability of probably about 150 gigabytes.
So you would need either three A6000s or two A100s to have plenty of room.
So this makes it difficult as well to run on consumer hardware.
Now, in the meantime, if you do want to run it on a rented server, you can use something like RunPod.
There's a one click template that I've set up right here.
It's the Code Lama 70B instruct.
Let's just take a quick look at the template when it's opened up.
So I'm going to choose to run it on an A6000.
And what's going to allow me to do that is using bits and bytes quantization.
So here with four bit quantization, which is a good quality quantization with the NF4 data type,
it will take down the required VRAM to about a quarter of 150.
So somewhere around 35 to 40 gigabytes of VRAM at 48 gigabytes with an A6000.
That does the trick and will run quite well.
Now, just an extra trick.
If you want to improve the speed a little bit, you can type in Speculate 3.
This will use little sequences from the prompt to guess tokens ahead.
And this can speed up generation.
Now, keep in mind, even though we're going to be quantizing,
the quantization is done after the weights are downloaded,
so you still need to have about 150 or 60 gigabytes of VRAM,
gigabytes of hard disks, sorry, not VRAM, on your pod.
Now, the bloke, I'm sure, will come out with an AWQ model, a quantized form.
That will allow you to download it in a smaller format, probably about 35 to 40 gigabytes in size.
It will also allow you to run this with AWQ quantization.
So that's a template that I'll put up.
You can keep an eye out on the one-click LLMs repo,
or you can take a look at the newsletter, that's trellis.substack.com.
I'll send out an email once it's ready.
AWQ is nice because it cuts the size of the storage required,
so you get a faster download, and it's a faster inference as well,
probably even two to four times the speed of doing bits and bytes in F4.
The last thing I want to mention is around function calling.
I have fine-tuned a function calling model that is available for purchase.
I'll put a link below.
Let me just very quickly show the performance of this model.
As with many code-ing models, they tend to be quite strong at function calling
because function calling requires a structured response.
What I'll do is I'll just very quickly show you performance on a validation set.
This is a set of data that was not used for training.
You can see the format here, source user, then there's a list of the functions
that are available, and then there's a question,
get the names of the five largest stocks by market cap,
and then I compare the generated response in this validation test
with the correct response, and you can see they're the same.
There are more questions like get the names of the five largest stocks,
and again, these are the same.
The structured responses that are supposed to be returned
are all consistent across the validation,
and even then when you test it with some more tricky questions,
like just a short question like greetings,
which doesn't require any function calls,
the model responds with greetings to you too,
which is perfectly reasonable,
and then when you give a random word,
does it respond in a way that makes sense to that word,
and the generator response here is just describing a shop,
which is a pretty reasonable response.
Then just testing again on a normal question,
what are the planets in our solar system?
It lists out the planets here, minus Pluto, which is correct as of now.
So as with many code models, this function calling model is quite strong.
It does well.
It's probably amongst the strongest of the function calling models
among DeepSeq 33B and also OpenChat, the 7B model,
somewhat surprisingly.
And that's it for the review of CodeLama70B.
Typically, I'd recommend any testing or building you want to do,
start off with a smaller model first
and make sure you've got everything working,
maybe with OpenChat 3.5, which is a 7B model,
and one of my favourites, a very high-performing model.
Only then in the later stages of deploying to production,
if you really care about getting the highest possible quality,
do you want to consider something like CodeLama70B?
I've left a script, a link to a script below
if you want to try out the LLM comparison,
or you can check out the one-click templates.
As I mentioned, there'll be a new one coming out once the AWQ is up,
so keep an eye out for that.
And if you like, you can get notified with the newsletter,
which is trellis.substack.com.
Cheers.
