[0:00] CodeLama 70B has just been released by Meta, and my biggest takeaway is that it's very sensitive to the prompt.
[0:06] I'll be going through a performance comparison with DeepSeq Coder, the 33 billion parameter model,
[0:12] and I'll talk you through a few of the changes, the tweaks, that have been made in the 70B CodeLama
[0:17] versus the previous CodeLama releases for smaller models.
[0:22] And for agenda, I want to talk through the chat format, which has changed significantly since CodeLama,
[0:28] or compared to Lama 2.
[0:30] Then I'll talk a few little notes here about the stop tokens.
[0:34] Again, this is related to the prompt format.
[0:37] Then I'll take you through performance comparison between DeepSeq Coder and CodeLama 70B.
[0:43] And I'll show you some one-click templates you can use to get started with CodeLama 70B.
[0:48] It does require quite a bit of VRAM. This is a big model.
[0:51] And then I'll briefly show you some function calling performance.
[0:55] So here we are on the CodeLama repo. It's just been launched, and it's a 70 billion parameter model.
[1:01] It's been uploaded in safe tensors and PyTorch format, and it's very roughly 150, so 30 times 5.
[1:10] That's roughly, it's actually 29 times 5, so 145 gigabytes in size.
[1:16] Now, the key difference with this model, apart from its size and hopefully performance,
[1:21] is that the prompt format has changed fairly radically.
[1:25] Instead of using inst at the start and then a slash inst before the assistant,
[1:30] there's a completely new format that is most easily applied by using tokenizer-apply-chat template.
[1:36] So I'd highly recommend Rodin trying to put together a prompt manually,
[1:40] load the tokenizer and use tokenize-apply-chat-tip template to an array of messages,
[1:46] and that will format the token, that will format the prompt as you need.
[1:51] Now just to give you a little insight, every message now has got this source,
[1:57] and it's got either source system, source user, or source assistant for the very last message.
[2:03] So this is quite a radical change, and also there's this special token step
[2:08] that is put in between the system and the user, and then between the user and the assistant messages.
[2:16] So actually when the model is generating, it will generate a sentence and end with that token step,
[2:22] and you'll often need to add that as a stop token.
[2:26] I'll show you that in the code that we go through, but you'll need to put that as a stop token
[2:30] because it won't generate slash s as with the other Lama models.
[2:35] So let's jump right into a comparison.
[2:38] So here I've gone to run pod, pods, new pod, and I'm going to want to run here on...
[2:46] You could run just the code Lama 70B on an A6000, but I want to compare it to DeepSeq,
[2:52] so I selected an A1000, and the template I'm looking for here is fine-tuning, notebook by Trellis.
[2:59] You can find the template on the one-click LLMs repo, that's a public repo I'll link below.
[3:04] Now, before you deploy, make sure to put a lot of space.
[3:07] You need at least 150 gigabytes to download one copy of the code Lama 70B model.
[3:14] So I have that model up and running, and then I've opened a Jupyter notebook,
[3:18] and within the notebook I've uploaded this file here, which is LLM comparison,
[3:23] and I'll provide a link to that in the install guides repo, a public repo on Trellis Research GitHub.
[3:31] So we don't need to log in to Huggingface because these are all public models.
[3:35] I've set the cache directory, and I've selected two models here.
[3:38] Let me just increase my screen for size.
[3:41] I have the code Lama 70B model and the DeepSeq coder model.
[3:46] Note that they're both in struct models, so these have been guardrailed pretty heavily in both cases,
[3:52] and we're going to see that during performance.
[3:55] This supervised fine-tuning that has been done by Meta and DeepSeq,
[4:00] it does improve the performance, but it also means that the models are quite limited in what they can respond to.
[4:07] I've installed a series of packages here, including flash attention to improve speed,
[4:13] imported the packages I need.
[4:15] Now, I want to show you this one little trick that I've discovered recently,
[4:19] which is to pip install hf underscore transfer and then set an environment variable of hf hub enable hf transfer,
[4:28] and this really speeds up downloading your weights and also uploading them again to hub.
[4:33] Now, you'll see that down below a little, but first I'm going to set up each of my models and load the models.
[4:39] I'm going to load them in bits and bytes NF4 format, that's a 4-bit format.
[4:44] It's a quite good quality 4-bit format, but keep that in mind because that will degrade quality a little bit.
[4:49] But allow me to fit both of these models on an A100.
[4:53] And next, the shards are all downloaded, but look at how fast the download speed is.
[4:59] That's more than half a gigabyte per second of speed, so you can achieve really high download speeds if you're using that hf transfer.
[5:07] And this is a massive time saver because these weights are really, really big.
[5:11] I've then set up the tokenizers and remember that we're going to use the chat template within the tokenizer in order to format the prompt.
[5:18] So I run a series of checks that there is a chat template for each of these models,
[5:22] and indeed they both do have chat templates within the tokenizer.
[5:26] And here you can see the tokenizer, this is Lama tokenizer fast.
[5:31] And just notice that it has this added token of step, that's 32015.
[5:37] And I'm actually going to set that as the end of sequence token because that during generation will get the program to stop generating
[5:45] once it hits this end of sequence token step.
[5:49] So next I'm going to set a generate function, which will take in a prompt and it will tokenize,
[5:54] apply the template, then tokenize, then submit to the model for generation.
[5:58] And I'm going to run the very first question here, which is list the planets in our solar system on both of these models.
[6:04] And you can see that Lama marks Pluto as debatable, but gets it right.
[6:10] And deep seek tells us that as an AI programming assistant, not equipped.
[6:16] In fact, if you run the deep seek base model, you are able to get an answer to that question.
[6:21] So this is an example of the model being guard railed.
[6:25] Now we'll move on to the three evaluations, which are actually in slightly different order,
[6:29] returning a sequence of letters in reverse.
[6:31] This is a difficult task for LMS, perhaps particularly because tokens are combinations of letters,
[6:38] and that makes it hard to just reverse letters within tokens.
[6:42] PASCII retrieval, where we inject PASCII within text and then code generation, which I'll actually do second.
[6:49] So the first test here on both models is asking the model to reverse the sequence, like A2B,
[6:55] and then add on another letter to make it harder and harder and harder.
[7:00] So here with code LAMA, I've given an example, so I've even done one prompt, one shot prompting,
[7:07] and I give an example reversing that and then ask LAMA to reverse the string.
[7:12] And instead it's telling me that 2A is not a valid string, it's not a valid number, it's not a valid.
[7:18] So the answer here doesn't make sense, and this again unfortunately is due to guard railing,
[7:24] even though I've left out the system prompt, which can help quite a bit in reducing the effects of guard railing.
[7:30] Now I've run that example as well using the base model a little bit earlier,
[7:35] so I'll show you that answer here.
[7:39] And you can see that actually the base model is able to reverse the string.
[7:44] Now after reversing it, it then follows with a long gap here.
[7:50] So this is the problem when you don't have an instruction fine tune model.
[7:56] And next I get it to try and reverse this string, which is A2BD,
[8:01] and in this case it does not correctly reverse the string.
[8:05] So again this illustrates that the chat fine tuned model, it's very heavily guard railed,
[8:11] which stops answering quite a lot of questions, and probably more representative of performance here of the base model.
[8:17] Although unfortunately the base model never has the benefit then of the additional fine tuning.
[8:22] The base model here seems to be cutting off at just a few, maybe around four,
[8:28] maybe it's between three and five characters in a row, so you can see it's failed here.
[8:35] Now by contrast deep seek, I've got it to do the same test,
[8:40] and deep seek in this case has failed for a length of four, well that's actually a length of five characters,
[8:51] but either way it doesn't get that exactly in reverse, it mixes the A and the two.
[8:56] But let me just run that test again, because there's variability,
[9:01] depending on exactly what that sequence is to reverse, you won't get success or failure necessarily.
[9:07] So sometimes the model might get a very long sequence, sometimes it might get a shorter one,
[9:11] and you have to run it a few times to get a sense for the performance.
[9:15] So here again you can see the model is rejecting our request in the case of code lambda 70B,
[9:22] and in the case of deep seek, you can see it's progressing on fairly nicely,
[9:28] and it's regressing, it's giving a Python example here,
[9:33] giving Python that example here again, because it's aligned towards returning code.
[9:38] So broadly speaking I've run this quite a few number of times, and deep seek coder is up able to get up higher,
[9:44] so my sense is that deep seek is a bit stronger on this task.
[9:47] I've rerun the task as well with GPT 3.5 and 4, and here's an example of where GPT 3.5 gets this sequence,
[9:55] which has got 9 characters within it, and it gets it wrong, it doesn't capitalise the T,
[10:00] and GPT 4 is able to go quite a bit further, it can probably go out towards 15 plus tokens.
[10:05] You can see it's successfully reversing this same sequence right here.
[10:10] So overall in terms of reversal of tokens, I would probably give the edge to deep seek coder,
[10:15] but certainly it's a hard test to run in CodeLama because it just refuses to respond unless you use the base model.
[10:22] The next test here is code generation, so the question I ask is to give a Python code snippet that prints the first N,
[10:29] where I set N to 10, so the first 10 prime numbers into Fibonacci series.
[10:33] So to do that, the model needs to first figure out how to calculate the Fibonacci series,
[10:38] and then filter out the prime ones, of which like there aren't that many, well there's probably a very large amount,
[10:44] but what I'm saying is the frequency is not that great, and then print those out.
[10:51] So here we have a program, which is from CodeLama, so at least on the surface looks good, and deep seek looks good as well.
[10:59] So we can take those programs, and here I'm running the CodeLama program,
[11:06] and indeed it does give the first 10 prime numbers, deep seek, I run the program, it only gives up to 89,
[11:13] but this is because in fact I've only run it for 5.
[11:18] Now you can run it for 10 as well, but the program is less efficient than the CodeLama program,
[11:23] which means that it takes really long to run for the first 10.
[11:28] So in this case that would give the slide edge, it's just one example though,
[11:32] but it would give the slide edge here to CodeLama in terms of performance.
[11:36] Chat GPT 3.5, it is able to get the first 10, it actually sets a limit on which Fibonacci numbers to check,
[11:48] so it sets a limit of 30, so that limit is not quite high enough to grab this top number here,
[11:54] but it's got the calculation method as efficient for calculating these prime Fibonacci numbers.
[12:01] And then last of all you've got GPT 4, which is able to get all of the requested numbers.
[12:07] Okay, so on this one here, I think the coding capability is certainly very strong,
[12:12] it's not obvious, it's possibly better than, well, it's not obviously worse than GPT 3.5,
[12:19] and it's doing well compared to GPT 4 as well, you need a harder challenge in order to distinguish them.
[12:25] Now the last question here is pass key retrieval.
[12:29] I've actually renamed it from pass key retrieval to random string retrieval
[12:33] because calling it a pass key makes Lama tell you that this is sensitive information that it's not permitted to return.
[12:40] So I've instead called it a random string, and I inject the random string halfway through a piece of text,
[12:46] which is as usual the Berkshire Hathaway transcript from 23.
[12:52] And I'm just testing about 16,000 characters, actually exactly 16,000 characters, which is about 4,000 tokens.
[12:59] These models are very good in long context, so I expect this would work up to 16,000.
[13:04] And if you extend the context, you can look at the long context fine tuning video from previously,
[13:10] but it's very realistic to extend the context on these models and get to 100k, or at least, yeah,
[13:17] probably 100k worth of tokens and succeed on pass key retrieval.
[13:21] So here I've run the two models, and I'm asking for the pass key and collab, not collab.
[13:29] Code Lama says, I see you're a fan of the Berkshire Hathaway meaning this is true, but it's not giving me the pass key.
[13:36] So I'm not sure why it can't give this response, it's not saying explicitly, it's a safety thing.
[13:42] When I ask it for a pass key, it does refuse on the grounds of user privacy.
[13:47] So I'm not quite sure, but it's clearly giving issues again with guard railing.
[13:52] And for deep seek, at the end here, let's see what happens.
[13:56] It's not able to find the random string.
[13:59] Now I know from running the base models, and I'll show you again with the base model just for the proof.
[14:05] Let me show you the model I loaded.
[14:07] So where are we here?
[14:09] So this is the base model, you can see there's no instruct mentioned in the model title.
[14:13] And when I go down to the bottom, Code Lama does actually get the pass key.
[14:18] So the base model is capable of getting the pass key, and this is true of deep seek as well.
[14:24] Deep seek is a very good pass key retriever.
[14:28] So it's the effective, the instruction fine tuning, the guard railing, that's affecting the performance of the model right here.
[14:36] The Code Lama model is quite large.
[14:38] In 16 bit precision, you would need to have a VRAM availability of probably about 150 gigabytes.
[14:47] So you would need either three A6000s or two A100s to have plenty of room.
[14:54] So this makes it difficult as well to run on consumer hardware.
[14:58] Now, in the meantime, if you do want to run it on a rented server, you can use something like RunPod.
[15:04] There's a one click template that I've set up right here.
[15:07] It's the Code Lama 70B instruct.
[15:10] Let's just take a quick look at the template when it's opened up.
[15:14] So I'm going to choose to run it on an A6000.
[15:18] And what's going to allow me to do that is using bits and bytes quantization.
[15:22] So here with four bit quantization, which is a good quality quantization with the NF4 data type,
[15:28] it will take down the required VRAM to about a quarter of 150.
[15:32] So somewhere around 35 to 40 gigabytes of VRAM at 48 gigabytes with an A6000.
[15:40] That does the trick and will run quite well.
[15:44] Now, just an extra trick.
[15:46] If you want to improve the speed a little bit, you can type in Speculate 3.
[15:50] This will use little sequences from the prompt to guess tokens ahead.
[15:56] And this can speed up generation.
[15:58] Now, keep in mind, even though we're going to be quantizing,
[16:01] the quantization is done after the weights are downloaded,
[16:04] so you still need to have about 150 or 60 gigabytes of VRAM,
[16:08] gigabytes of hard disks, sorry, not VRAM, on your pod.
[16:13] Now, the bloke, I'm sure, will come out with an AWQ model, a quantized form.
[16:19] That will allow you to download it in a smaller format, probably about 35 to 40 gigabytes in size.
[16:26] It will also allow you to run this with AWQ quantization.
[16:30] So that's a template that I'll put up.
[16:32] You can keep an eye out on the one-click LLMs repo,
[16:35] or you can take a look at the newsletter, that's trellis.substack.com.
[16:39] I'll send out an email once it's ready.
[16:42] AWQ is nice because it cuts the size of the storage required,
[16:46] so you get a faster download, and it's a faster inference as well,
[16:50] probably even two to four times the speed of doing bits and bytes in F4.
[16:55] The last thing I want to mention is around function calling.
[16:58] I have fine-tuned a function calling model that is available for purchase.
[17:02] I'll put a link below.
[17:03] Let me just very quickly show the performance of this model.
[17:07] As with many code-ing models, they tend to be quite strong at function calling
[17:14] because function calling requires a structured response.
[17:17] What I'll do is I'll just very quickly show you performance on a validation set.
[17:21] This is a set of data that was not used for training.
[17:24] You can see the format here, source user, then there's a list of the functions
[17:29] that are available, and then there's a question,
[17:32] get the names of the five largest stocks by market cap,
[17:35] and then I compare the generated response in this validation test
[17:39] with the correct response, and you can see they're the same.
[17:42] There are more questions like get the names of the five largest stocks,
[17:46] and again, these are the same.
[17:49] The structured responses that are supposed to be returned
[17:53] are all consistent across the validation,
[17:56] and even then when you test it with some more tricky questions,
[18:00] like just a short question like greetings,
[18:03] which doesn't require any function calls,
[18:05] the model responds with greetings to you too,
[18:07] which is perfectly reasonable,
[18:09] and then when you give a random word,
[18:11] does it respond in a way that makes sense to that word,
[18:14] and the generator response here is just describing a shop,
[18:18] which is a pretty reasonable response.
[18:22] Then just testing again on a normal question,
[18:25] what are the planets in our solar system?
[18:27] It lists out the planets here, minus Pluto, which is correct as of now.
[18:33] So as with many code models, this function calling model is quite strong.
[18:37] It does well.
[18:38] It's probably amongst the strongest of the function calling models
[18:41] among DeepSeq 33B and also OpenChat, the 7B model,
[18:46] somewhat surprisingly.
[18:48] And that's it for the review of CodeLama70B.
[18:51] Typically, I'd recommend any testing or building you want to do,
[18:54] start off with a smaller model first
[18:56] and make sure you've got everything working,
[18:58] maybe with OpenChat 3.5, which is a 7B model,
[19:01] and one of my favourites, a very high-performing model.
[19:04] Only then in the later stages of deploying to production,
[19:07] if you really care about getting the highest possible quality,
[19:10] do you want to consider something like CodeLama70B?
[19:14] I've left a script, a link to a script below
[19:16] if you want to try out the LLM comparison,
[19:18] or you can check out the one-click templates.
[19:20] As I mentioned, there'll be a new one coming out once the AWQ is up,
[19:24] so keep an eye out for that.
[19:26] And if you like, you can get notified with the newsletter,
[19:29] which is trellis.substack.com.
[19:31] Cheers.