[0:00] CodeLama 70B has just been released by Meta, and my biggest takeaway is that it's very sensitive to the prompt. [0:06] I'll be going through a performance comparison with DeepSeq Coder, the 33 billion parameter model, [0:12] and I'll talk you through a few of the changes, the tweaks, that have been made in the 70B CodeLama [0:17] versus the previous CodeLama releases for smaller models. [0:22] And for agenda, I want to talk through the chat format, which has changed significantly since CodeLama, [0:28] or compared to Lama 2. [0:30] Then I'll talk a few little notes here about the stop tokens. [0:34] Again, this is related to the prompt format. [0:37] Then I'll take you through performance comparison between DeepSeq Coder and CodeLama 70B. [0:43] And I'll show you some one-click templates you can use to get started with CodeLama 70B. [0:48] It does require quite a bit of VRAM. This is a big model. [0:51] And then I'll briefly show you some function calling performance. [0:55] So here we are on the CodeLama repo. It's just been launched, and it's a 70 billion parameter model. [1:01] It's been uploaded in safe tensors and PyTorch format, and it's very roughly 150, so 30 times 5. [1:10] That's roughly, it's actually 29 times 5, so 145 gigabytes in size. [1:16] Now, the key difference with this model, apart from its size and hopefully performance, [1:21] is that the prompt format has changed fairly radically. [1:25] Instead of using inst at the start and then a slash inst before the assistant, [1:30] there's a completely new format that is most easily applied by using tokenizer-apply-chat template. [1:36] So I'd highly recommend Rodin trying to put together a prompt manually, [1:40] load the tokenizer and use tokenize-apply-chat-tip template to an array of messages, [1:46] and that will format the token, that will format the prompt as you need. [1:51] Now just to give you a little insight, every message now has got this source, [1:57] and it's got either source system, source user, or source assistant for the very last message. [2:03] So this is quite a radical change, and also there's this special token step [2:08] that is put in between the system and the user, and then between the user and the assistant messages. [2:16] So actually when the model is generating, it will generate a sentence and end with that token step, [2:22] and you'll often need to add that as a stop token. [2:26] I'll show you that in the code that we go through, but you'll need to put that as a stop token [2:30] because it won't generate slash s as with the other Lama models. [2:35] So let's jump right into a comparison. [2:38] So here I've gone to run pod, pods, new pod, and I'm going to want to run here on... [2:46] You could run just the code Lama 70B on an A6000, but I want to compare it to DeepSeq, [2:52] so I selected an A1000, and the template I'm looking for here is fine-tuning, notebook by Trellis. [2:59] You can find the template on the one-click LLMs repo, that's a public repo I'll link below. [3:04] Now, before you deploy, make sure to put a lot of space. [3:07] You need at least 150 gigabytes to download one copy of the code Lama 70B model. [3:14] So I have that model up and running, and then I've opened a Jupyter notebook, [3:18] and within the notebook I've uploaded this file here, which is LLM comparison, [3:23] and I'll provide a link to that in the install guides repo, a public repo on Trellis Research GitHub. [3:31] So we don't need to log in to Huggingface because these are all public models. [3:35] I've set the cache directory, and I've selected two models here. [3:38] Let me just increase my screen for size. [3:41] I have the code Lama 70B model and the DeepSeq coder model. [3:46] Note that they're both in struct models, so these have been guardrailed pretty heavily in both cases, [3:52] and we're going to see that during performance. [3:55] This supervised fine-tuning that has been done by Meta and DeepSeq, [4:00] it does improve the performance, but it also means that the models are quite limited in what they can respond to. [4:07] I've installed a series of packages here, including flash attention to improve speed, [4:13] imported the packages I need. [4:15] Now, I want to show you this one little trick that I've discovered recently, [4:19] which is to pip install hf underscore transfer and then set an environment variable of hf hub enable hf transfer, [4:28] and this really speeds up downloading your weights and also uploading them again to hub. [4:33] Now, you'll see that down below a little, but first I'm going to set up each of my models and load the models. [4:39] I'm going to load them in bits and bytes NF4 format, that's a 4-bit format. [4:44] It's a quite good quality 4-bit format, but keep that in mind because that will degrade quality a little bit. [4:49] But allow me to fit both of these models on an A100. [4:53] And next, the shards are all downloaded, but look at how fast the download speed is. [4:59] That's more than half a gigabyte per second of speed, so you can achieve really high download speeds if you're using that hf transfer. [5:07] And this is a massive time saver because these weights are really, really big. [5:11] I've then set up the tokenizers and remember that we're going to use the chat template within the tokenizer in order to format the prompt. [5:18] So I run a series of checks that there is a chat template for each of these models, [5:22] and indeed they both do have chat templates within the tokenizer. [5:26] And here you can see the tokenizer, this is Lama tokenizer fast. [5:31] And just notice that it has this added token of step, that's 32015. [5:37] And I'm actually going to set that as the end of sequence token because that during generation will get the program to stop generating [5:45] once it hits this end of sequence token step. [5:49] So next I'm going to set a generate function, which will take in a prompt and it will tokenize, [5:54] apply the template, then tokenize, then submit to the model for generation. [5:58] And I'm going to run the very first question here, which is list the planets in our solar system on both of these models. [6:04] And you can see that Lama marks Pluto as debatable, but gets it right. [6:10] And deep seek tells us that as an AI programming assistant, not equipped. [6:16] In fact, if you run the deep seek base model, you are able to get an answer to that question. [6:21] So this is an example of the model being guard railed. [6:25] Now we'll move on to the three evaluations, which are actually in slightly different order, [6:29] returning a sequence of letters in reverse. [6:31] This is a difficult task for LMS, perhaps particularly because tokens are combinations of letters, [6:38] and that makes it hard to just reverse letters within tokens. [6:42] PASCII retrieval, where we inject PASCII within text and then code generation, which I'll actually do second. [6:49] So the first test here on both models is asking the model to reverse the sequence, like A2B, [6:55] and then add on another letter to make it harder and harder and harder. [7:00] So here with code LAMA, I've given an example, so I've even done one prompt, one shot prompting, [7:07] and I give an example reversing that and then ask LAMA to reverse the string. [7:12] And instead it's telling me that 2A is not a valid string, it's not a valid number, it's not a valid. [7:18] So the answer here doesn't make sense, and this again unfortunately is due to guard railing, [7:24] even though I've left out the system prompt, which can help quite a bit in reducing the effects of guard railing. [7:30] Now I've run that example as well using the base model a little bit earlier, [7:35] so I'll show you that answer here. [7:39] And you can see that actually the base model is able to reverse the string. [7:44] Now after reversing it, it then follows with a long gap here. [7:50] So this is the problem when you don't have an instruction fine tune model. [7:56] And next I get it to try and reverse this string, which is A2BD, [8:01] and in this case it does not correctly reverse the string. [8:05] So again this illustrates that the chat fine tuned model, it's very heavily guard railed, [8:11] which stops answering quite a lot of questions, and probably more representative of performance here of the base model. [8:17] Although unfortunately the base model never has the benefit then of the additional fine tuning. [8:22] The base model here seems to be cutting off at just a few, maybe around four, [8:28] maybe it's between three and five characters in a row, so you can see it's failed here. [8:35] Now by contrast deep seek, I've got it to do the same test, [8:40] and deep seek in this case has failed for a length of four, well that's actually a length of five characters, [8:51] but either way it doesn't get that exactly in reverse, it mixes the A and the two. [8:56] But let me just run that test again, because there's variability, [9:01] depending on exactly what that sequence is to reverse, you won't get success or failure necessarily. [9:07] So sometimes the model might get a very long sequence, sometimes it might get a shorter one, [9:11] and you have to run it a few times to get a sense for the performance. [9:15] So here again you can see the model is rejecting our request in the case of code lambda 70B, [9:22] and in the case of deep seek, you can see it's progressing on fairly nicely, [9:28] and it's regressing, it's giving a Python example here, [9:33] giving Python that example here again, because it's aligned towards returning code. [9:38] So broadly speaking I've run this quite a few number of times, and deep seek coder is up able to get up higher, [9:44] so my sense is that deep seek is a bit stronger on this task. [9:47] I've rerun the task as well with GPT 3.5 and 4, and here's an example of where GPT 3.5 gets this sequence, [9:55] which has got 9 characters within it, and it gets it wrong, it doesn't capitalise the T, [10:00] and GPT 4 is able to go quite a bit further, it can probably go out towards 15 plus tokens. [10:05] You can see it's successfully reversing this same sequence right here. [10:10] So overall in terms of reversal of tokens, I would probably give the edge to deep seek coder, [10:15] but certainly it's a hard test to run in CodeLama because it just refuses to respond unless you use the base model. [10:22] The next test here is code generation, so the question I ask is to give a Python code snippet that prints the first N, [10:29] where I set N to 10, so the first 10 prime numbers into Fibonacci series. [10:33] So to do that, the model needs to first figure out how to calculate the Fibonacci series, [10:38] and then filter out the prime ones, of which like there aren't that many, well there's probably a very large amount, [10:44] but what I'm saying is the frequency is not that great, and then print those out. [10:51] So here we have a program, which is from CodeLama, so at least on the surface looks good, and deep seek looks good as well. [10:59] So we can take those programs, and here I'm running the CodeLama program, [11:06] and indeed it does give the first 10 prime numbers, deep seek, I run the program, it only gives up to 89, [11:13] but this is because in fact I've only run it for 5. [11:18] Now you can run it for 10 as well, but the program is less efficient than the CodeLama program, [11:23] which means that it takes really long to run for the first 10. [11:28] So in this case that would give the slide edge, it's just one example though, [11:32] but it would give the slide edge here to CodeLama in terms of performance. [11:36] Chat GPT 3.5, it is able to get the first 10, it actually sets a limit on which Fibonacci numbers to check, [11:48] so it sets a limit of 30, so that limit is not quite high enough to grab this top number here, [11:54] but it's got the calculation method as efficient for calculating these prime Fibonacci numbers. [12:01] And then last of all you've got GPT 4, which is able to get all of the requested numbers. [12:07] Okay, so on this one here, I think the coding capability is certainly very strong, [12:12] it's not obvious, it's possibly better than, well, it's not obviously worse than GPT 3.5, [12:19] and it's doing well compared to GPT 4 as well, you need a harder challenge in order to distinguish them. [12:25] Now the last question here is pass key retrieval. [12:29] I've actually renamed it from pass key retrieval to random string retrieval [12:33] because calling it a pass key makes Lama tell you that this is sensitive information that it's not permitted to return. [12:40] So I've instead called it a random string, and I inject the random string halfway through a piece of text, [12:46] which is as usual the Berkshire Hathaway transcript from 23. [12:52] And I'm just testing about 16,000 characters, actually exactly 16,000 characters, which is about 4,000 tokens. [12:59] These models are very good in long context, so I expect this would work up to 16,000. [13:04] And if you extend the context, you can look at the long context fine tuning video from previously, [13:10] but it's very realistic to extend the context on these models and get to 100k, or at least, yeah, [13:17] probably 100k worth of tokens and succeed on pass key retrieval. [13:21] So here I've run the two models, and I'm asking for the pass key and collab, not collab. [13:29] Code Lama says, I see you're a fan of the Berkshire Hathaway meaning this is true, but it's not giving me the pass key. [13:36] So I'm not sure why it can't give this response, it's not saying explicitly, it's a safety thing. [13:42] When I ask it for a pass key, it does refuse on the grounds of user privacy. [13:47] So I'm not quite sure, but it's clearly giving issues again with guard railing. [13:52] And for deep seek, at the end here, let's see what happens. [13:56] It's not able to find the random string. [13:59] Now I know from running the base models, and I'll show you again with the base model just for the proof. [14:05] Let me show you the model I loaded. [14:07] So where are we here? [14:09] So this is the base model, you can see there's no instruct mentioned in the model title. [14:13] And when I go down to the bottom, Code Lama does actually get the pass key. [14:18] So the base model is capable of getting the pass key, and this is true of deep seek as well. [14:24] Deep seek is a very good pass key retriever. [14:28] So it's the effective, the instruction fine tuning, the guard railing, that's affecting the performance of the model right here. [14:36] The Code Lama model is quite large. [14:38] In 16 bit precision, you would need to have a VRAM availability of probably about 150 gigabytes. [14:47] So you would need either three A6000s or two A100s to have plenty of room. [14:54] So this makes it difficult as well to run on consumer hardware. [14:58] Now, in the meantime, if you do want to run it on a rented server, you can use something like RunPod. [15:04] There's a one click template that I've set up right here. [15:07] It's the Code Lama 70B instruct. [15:10] Let's just take a quick look at the template when it's opened up. [15:14] So I'm going to choose to run it on an A6000. [15:18] And what's going to allow me to do that is using bits and bytes quantization. [15:22] So here with four bit quantization, which is a good quality quantization with the NF4 data type, [15:28] it will take down the required VRAM to about a quarter of 150. [15:32] So somewhere around 35 to 40 gigabytes of VRAM at 48 gigabytes with an A6000. [15:40] That does the trick and will run quite well. [15:44] Now, just an extra trick. [15:46] If you want to improve the speed a little bit, you can type in Speculate 3. [15:50] This will use little sequences from the prompt to guess tokens ahead. [15:56] And this can speed up generation. [15:58] Now, keep in mind, even though we're going to be quantizing, [16:01] the quantization is done after the weights are downloaded, [16:04] so you still need to have about 150 or 60 gigabytes of VRAM, [16:08] gigabytes of hard disks, sorry, not VRAM, on your pod. [16:13] Now, the bloke, I'm sure, will come out with an AWQ model, a quantized form. [16:19] That will allow you to download it in a smaller format, probably about 35 to 40 gigabytes in size. [16:26] It will also allow you to run this with AWQ quantization. [16:30] So that's a template that I'll put up. [16:32] You can keep an eye out on the one-click LLMs repo, [16:35] or you can take a look at the newsletter, that's trellis.substack.com. [16:39] I'll send out an email once it's ready. [16:42] AWQ is nice because it cuts the size of the storage required, [16:46] so you get a faster download, and it's a faster inference as well, [16:50] probably even two to four times the speed of doing bits and bytes in F4. [16:55] The last thing I want to mention is around function calling. [16:58] I have fine-tuned a function calling model that is available for purchase. [17:02] I'll put a link below. [17:03] Let me just very quickly show the performance of this model. [17:07] As with many code-ing models, they tend to be quite strong at function calling [17:14] because function calling requires a structured response. [17:17] What I'll do is I'll just very quickly show you performance on a validation set. [17:21] This is a set of data that was not used for training. [17:24] You can see the format here, source user, then there's a list of the functions [17:29] that are available, and then there's a question, [17:32] get the names of the five largest stocks by market cap, [17:35] and then I compare the generated response in this validation test [17:39] with the correct response, and you can see they're the same. [17:42] There are more questions like get the names of the five largest stocks, [17:46] and again, these are the same. [17:49] The structured responses that are supposed to be returned [17:53] are all consistent across the validation, [17:56] and even then when you test it with some more tricky questions, [18:00] like just a short question like greetings, [18:03] which doesn't require any function calls, [18:05] the model responds with greetings to you too, [18:07] which is perfectly reasonable, [18:09] and then when you give a random word, [18:11] does it respond in a way that makes sense to that word, [18:14] and the generator response here is just describing a shop, [18:18] which is a pretty reasonable response. [18:22] Then just testing again on a normal question, [18:25] what are the planets in our solar system? [18:27] It lists out the planets here, minus Pluto, which is correct as of now. [18:33] So as with many code models, this function calling model is quite strong. [18:37] It does well. [18:38] It's probably amongst the strongest of the function calling models [18:41] among DeepSeq 33B and also OpenChat, the 7B model, [18:46] somewhat surprisingly. [18:48] And that's it for the review of CodeLama70B. [18:51] Typically, I'd recommend any testing or building you want to do, [18:54] start off with a smaller model first [18:56] and make sure you've got everything working, [18:58] maybe with OpenChat 3.5, which is a 7B model, [19:01] and one of my favourites, a very high-performing model. [19:04] Only then in the later stages of deploying to production, [19:07] if you really care about getting the highest possible quality, [19:10] do you want to consider something like CodeLama70B? [19:14] I've left a script, a link to a script below [19:16] if you want to try out the LLM comparison, [19:18] or you can check out the one-click templates. [19:20] As I mentioned, there'll be a new one coming out once the AWQ is up, [19:24] so keep an eye out for that. [19:26] And if you like, you can get notified with the newsletter, [19:29] which is trellis.substack.com. [19:31] Cheers.