[0:00] Hi everyone, I'm Patrick, and in today's video we are going to learn how to get started with Hugging Face and the Transformers library. The Hugging Face Transformers library is probably the most popular NLP library in Python right now, and it can be combined directly with PyTorch or TensorFlow. It provides state-of-the-art natural language processing models and has a very clean API that makes it extremely simple to build powerful NLP pipelines. So today we have a first look at the library and build a sentiment classification algorithm. I show you some basic functions, then we have a look at the Model Hub, and then I also show you how you can fine-tune your own model. So let's get started.

[0:38] All right, so to get started you should install either PyTorch or TensorFlow first. Then, in order to install the Transformers library, you just have to say pip install transformers, or there's also a conda installation command that you can find on the installation page. I already did this, so we can start using it.

[1:06] We say from transformers import pipeline as the first thing and have a look at this. Then we also import some utilities that we need from the PyTorch library, so we import torch and we import torch.nn.functional as F. We're going to use this later.

[1:29] Now we can start using this pipeline. Let's say classifier equals, and then we create a pipeline, and we need to specify the task that we want. In this case we want to do sentiment analysis, so we have to call it "sentiment-analysis". You will find the different available tasks on the website.
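The pipeline setup described so far can be sketched as follows. This is a minimal sketch, not the exact code from the video: the helper name make_classifier is hypothetical, and placing the import inside the function is an illustrative choice so that defining the helper does not yet require the library. Calling it needs transformers plus PyTorch or TensorFlow installed, and network access the first time, because the task's default model is downloaded.

```python
# Shell, run once:  pip install transformers   (plus PyTorch or TensorFlow)


def make_classifier(task="sentiment-analysis"):
    """Build a Transformers pipeline for the given task.

    make_classifier is a hypothetical helper name. The import sits inside
    the function (an assumption made for this sketch) so that the library
    is only needed when the helper is actually called.
    """
    from transformers import pipeline

    # pipeline(task) loads a default model and tokenizer for that task.
    return pipeline(task)


# Usage (downloads the default model on first call):
# classifier = make_classifier()
```

Passing a different task string, for example "text-classification", selects a different pipeline; the list of available tasks is on the website mentioned above.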
[1:58] So here we can see, for example, we have this sentiment analysis, which is just an alias of text classification, but we also have a question answering pipeline, or a text generation or a conversational pipeline. So yeah, this is how we can define a pipeline. What a pipeline does is that it gives you a great and easy way to use a model for inference, and it abstracts a lot of things away for you. You will see what I mean in a moment.

[2:33] Now we can just use this classifier and classify some text by saying res, for result, equals, and then we call this classifier with an example text. Let me copy and paste some example text for you: "We are very happy to show you the 🤗 Transformers library." Then let's print the result and see how this looks.

[3:05] So let's run the code. All right, and as you can see, the label is POSITIVE and the score is 0.99, so it's very confident that this is a positive sentence. As you can see, it only takes two lines of code with this pipeline to do sentiment analysis. This is exactly what we need: we see the label of the text, whether it's negative or positive, and we also get the score. So yeah, this is really nice.

[3:36] Now let's have a look at some more things that we can do with this pipeline. First of all, we can put in more texts at once, so we are not limited to just one; we can give it a list. So let's use a list and add another example text. Let me copy and paste this one in here as well, so we also want to classify "We hope you don't hate it." And then we get multiple
results back, so let's call this results, and then we can iterate over it: we say for result in results, and then we print the result. Now let's run this code and have a look at how this looks.

[4:24] All right, and as you can see, for the second text we get another result back. Here the label is NEGATIVE, and the score is maybe not that confident in this case, so this text, "We hope you don't hate it", might be a little bit confusing. But basically this is how you can pass in multiple texts at once.

[4:44] Right now we only use the default pipeline with the default model, but now let's have a look at how we can use a concrete model and then also a concrete tokenizer. What we can do is specify the model name and say model_name equals, and in this case I use "distilbert-base-uncased-finetuned-sst-2-english". I will show you where I got this string, or this name, in a moment. For now, this is basically just a DistilBERT model, which is a smaller and faster version of BERT that was pre-trained on the same corpus. Then you see that it was also fine-tuned, and this is just the name of the dataset; in this case it's an English dataset from the Stanford Sentiment Treebank, version 2.

[5:46] Now that we have the model name, we can give it to our pipeline with the model argument, so we say model equals model_name. In this case I can tell you that the default model for this sentiment analysis task is already this model, so this should do exactly the same. But later we will switch this and then have a look at how
we can use different models. So first of all, let's run this again and see that this is still the same. All right, we see this is still the same result, so this worked.

[6:25] So far we just used this string to define our model, but now let's take a different approach to define a model and then also a tokenizer. This will give us a little bit more flexibility later. In order to do this, we say from transformers import AutoTokenizer and AutoModelForSequenceClassification. AutoTokenizer is just a generic class for a tokenizer, and AutoModelForSequenceClassification is also a generic class but a little bit more specific: in this case I want to have it for sequence classification, and then it gives me a little bit more functionality specifically for this task. Don't worry about this right now. You can also find all the available model classes in the documentation, so if you're interested, have a look at this. Also, if you use TensorFlow, then here you have to say TF and then the name of this class, so TFAutoModelForSequenceClassification, but the rest is actually the same. So yeah, this is how you use TensorFlow.

[7:40] Now, after importing this, we can create two instances. We say model equals, and then we use this class, so AutoModelForSequenceClassification, and then we use a function that is called .from_pretrained, and it needs the model name. We do the same with the tokenizer, so we say tokenizer equals AutoTokenizer.from_pretrained, and then it needs the model name. This .from_pretrained function is a very important function in Hugging Face that you will
see a lot, so you will see this a few more times later.

[8:33] Now that we created these, we can also give the actual model, and not just the string, to the pipeline. So we say model equals model and tokenizer equals tokenizer. If we run this, we should still get the same results because these are the default versions, and yeah, as we see, we still get the same result. But later, if you want to use a different model or tokenizer, you know how you can switch this: just by using a different model and tokenizer here for the pipeline.

[9:13] Now, instead of using this pipeline, let's see how we can use this model and tokenizer directly and do some of the steps manually. This will give you a little bit more flexibility. Down here, let's first use the tokenizer and see what it does. First, let's call the tokenizer.tokenize function: we say tokens equals tokenizer.tokenize, and then the string or the sentence we want to tokenize, so let's copy and paste this in here. Once we have the tokens, we can use them to get the token IDs: we say token_ids equals, and then we again use the tokenizer and the function called convert_tokens_to_ids, and it needs the tokens. So this is one way to do it. Or we can do this directly by saying token_ids equals, and then we call the tokenizer like a function, and again we give it the same string.

[10:41] Now let's print all three of these variables to see where the difference is. First we print the tokens, then we print the token
IDs, and then here let's actually give this a different name, so let's call this input_ids. Now let's run this and see how this looks.

[11:07] All right, so here is the result. As you can see, when we call tokenizer.tokenize, we get a list of strings, or the list of the words, back. So now each word is a separate token, and for example this one is our smiley face, our emoji. So yeah, this is what the tokenize function does. Then, once we call convert_tokens_to_ids, we get this one back. It converted each token to an ID, so each word has a unique ID, and this is basically the mathematical, or numerical, representation that our model can understand. So this is what we get after this function.

[12:05] If we call the tokenizer directly, then we get a dictionary back. Here we have the key input_ids, and we also have the attention_mask; for now you don't really have to worry about this. But let's have a look at the input_ids. If we compare the token_ids with the input_ids, then we see we have the exact same sequence of token IDs, but we also have this 101 and 102 token. These are just the beginning-of-string and end-of-string tokens (for BERT, [CLS] and [SEP]), but basically it's the same. So yeah, this is the difference between these three functions. These input_ids are what we can pass to our model later to do the predictions manually.

[13:01] Now, like before, we can of course also use multiple sentences for our tokenizer. For example, usually in your code you have your
training data, so let's say X_train, and in this example let's just use these two sentences. So this is our X_train, and then we can pass this to our tokenizer. Let's call the result batch, so this is the batch that we put into our model later. We say batch equals tokenizer, and then we call the tokenizer directly with our training data. I also want to show you some useful arguments: we say padding equals True, we also say truncation equals True, then we say max_length equals 512, and we say return_tensors equals, as a string, "pt" for PyTorch. This will ensure that all of the samples in our batch have the same length, so it will apply padding and truncation if necessary. And this is also important: in this case we want to have a PyTorch tensor returned directly. I will show you later what the difference is if you don't use this, but for now let's just use it.

[14:36] First of all, let's print this batch and see how this looks. We see we get a dictionary, and again it has the key input_ids and the key attention_mask. Here it has two tensors, the first one for the first sentence and the second one for the second sentence, and the same for the attention mask, so two tensors. As I said, these input_ids are the unique IDs that our model can understand.

[15:10] So now we have this batch, and we can pass it to our model. Let's do this manually and see how we can call our model. In PyTorch, when we do inference, we also want to say with torch.no_grad(), so this
will disable the gradient tracking. I explained this in a lot of my tutorials, so you can just have a look at them if you want to learn more about this. Then we can call our model by saying outputs equals, and then we call the model, and here we use two asterisks to unpack this batch. If you remember, this is a dictionary, and with the two asterisks we just unpack the values in our dictionary. For TensorFlow you don't do this, you just pass in the batch as it is, but for PyTorch you have to unpack it.

[16:14] Now we get the outputs of our model, so let's print the outputs. As you might know, these are just the raw values, so to get the actual probabilities and the predictions we can apply the softmax. Let's say predictions equals torch.softmax, or we also have this in F.softmax, and then here we say outputs.logits, and we want to do this along dimension equals 1. Let's also print the predictions.

[16:56] Then let's do one more thing and also get the labels. We get these by taking the index with the highest probability, so we say labels equals torch.argmax. We can either put in the predictions or we can put in the raw outputs; we actually don't need the softmax for this, but just for demonstration let's use the predictions, and again dimension equals 1. Then let's print the labels as well.

[17:36] Now let's actually do one more thing and convert the labels by saying labels equals, and then we use a list comprehension and call model.config.id2label, and then it needs the actual label ID, and then we iterate, so we say for label_id in labels.tolist(). What this does you will see when we print it, so we print the labels. Now let's actually run this and see if it works.

[18:22] All right, so this worked. As you can see, here we print the outputs: this is a SequenceClassifierOutput, and as you see it has the logits attribute, so that's why we used outputs.logits. Then we get the actual probabilities, and to get the labels we used argmax, so this is a tensor with the label 1 and the label 0. Then we converted each label to the actual class name, and we get POSITIVE and NEGATIVE. By the way, this function, I think, is only dedicated to AutoModelForSequenceClassification. For example, if we just used AutoModel, then I think it won't be available. That's what these more concrete classes do for you: they give you a little bit more functionality for the dedicated task.

[19:31] We see that the loss is None in this case. If you also want to have a loss that you can inspect, then you can give the labels argument to the model so that it knows how to compute the loss. We say labels equals torch.tensor, and as a list we give it the labels 1 and 0. Now let's run this again, and then we should see a loss here. And yeah, now here we see the loss. Again, this labels argument is, I think, special to AutoModelForSequenceClassification.

[20:22] So yeah, this worked. Now, if we have a careful look at the probabilities: first of all, we see we get the labels POSITIVE and NEGATIVE. Here for the first one this is the highest probability, so 0.9997, and for the second one this is the largest number, so it took this one, and this is 0.530. If we compare them with the results that we got from our pipeline, then we see these are exactly the same numbers.

[21:01] So now you might see what the difference is between a pipeline and using the tokenizer and model directly. With the pipeline we only need two lines of code, and then we get what we want: the label and the score we are interested in. So this might be just fine. But if you want to do it manually, you can do it like I showed you, and you will get the same results that you can then use. So yeah, that's how you can use a model and a tokenizer.

[21:32] Using the model and the tokenizer directly will be important when you, for example, want to fine-tune your model. I will show you roughly how to do this later. Let's just assume we did fine-tune our model. Then what we can do is say save_directory equals and specify a directory, so let's call the folder "saved". Then we can call tokenizer.save_pretrained with the save directory, and the same with our model, so we say model.save_pretrained, and again the save directory. Then we can load them in another
application. For example, tokenizer equals, and then again we use this AutoTokenizer class and .from_pretrained, and here we give it the directory. So to .from_pretrained we can either give a model name or we can give this directory. And again the same for the model: model equals AutoModelForSequenceClassification.from_pretrained with the save directory. This should work, and then you should get the exact same model and tokenizer back. As you might see, these .from_pretrained functions are very important, and you will use them a lot.

[23:22] All right, so I think these are the basic functions you need to build a pipeline or to apply the model and tokenizer manually. Now let's have a look at how we can use a different model. Like here, you can load it from your disk if you already have a pre-trained model somewhere on your computer. But what you can also do is go to the Hugging Face Model Hub, which you can find at huggingface.co/models. Here we have the Model Hub, and you can search through different models. For example, you could filter by task. In this case we want to do text classification, which is the same as sentiment analysis, and then it applies this filter. You can see the most popular model is already this one, and we can click on it and get some more information. As you can see, this is the exact same model name that we used in our code.

[24:34] So once you've decided on a
model, you can click here and copy this name and then paste it into your code. Let's say in this case we want to use a different model: I want to do sentiment classification with German sentences, so of course I need one that is trained on German. You can filter or search here, so I can either again search for DistilBERT and see what different versions are available, or let me search for "german". Here let's take the most popular one, by oliverguhr, and we see this is a German sentiment BERT. We get more information, and sometimes we also see some example code, which is helpful. So yeah, this is nice.

[25:27] Now we click here to copy; this will just copy the name. Then, in our application, let me comment this out and again say model_name equals, and now I hit paste, so it pasted this string here. Now we have this, and we can give our model and tokenizer the model name, so model_name and model_name.

[26:00] Now let's do this for some example texts in German, so let me copy and paste these in here. Let me quickly translate them for you: this says "not a good result", "this was unfair", "this was not good", "not as bad as expected", "this was good", and "she drives a green car". So basically these three texts are negative, this one is rather positive, and this is neutral. Let's see if our model can detect this correctly.

[26:38] Now, like above, we do the same steps, so we can copy and paste this. Then, the same as above, we say with
torch.no_grad(), and then we call the model, so we say outputs equals model, and here we unpack our batch. Then we want to have the label IDs, so let's say label_ids equals, and then we use the torch.argmax function with the outputs and along dimension equals 1. Let me remove this one, and then we print the label_ids. Then we do the same as we did here and convert them to the actual label names by calling model.config.id2label[label_id] for label_id in label_ids.tolist(), and then we print the labels. Let's also print the batch in this case and have a look at how this looks.

[28:05] So let's run this, and I get an error: here I forgot to say outputs.logits like we did before. Let's try it again, and this gives only two results, because of course in our tokenizer we want to use the German texts. So let's call this X_train_german, and then let's use X_train_german here and run it again.

[28:34] All right, and as we can see, we get the labels 1, 1, 1, 0, 0 and 2, and this is equal to negative, negative, negative, then two times positive, and then neutral. So yeah, this is exactly what I told you: the first three sentences are rather negative, then two positive ones, and this one is neutral. So now our German model works as well, and this is how we can use different models. We simply search the Model Hub, and hopefully there is an already pre-trained version for the task we want, and then we can just use this
here as our model name, and then we are good to go. Or, if there is no pre-trained version yet, then we have to do this ourselves and fine-tune our own model. I will show you how you do this in a moment.

[29:30] But now one more thing I want to mention: I want to talk about this return_tensors equals "pt". Here we print the batch, and if we look at the input_ids, then we see this is a tensor, so right now it's already in the PyTorch format. We could use TensorFlow here, or we can just omit this argument. If we omit it, then we don't get the tensor format, so now it is just a Python list, I think. But then what you could do is convert it: we say batch equals torch.tensor, and since batch is a dictionary, we access the key input_ids, so batch["input_ids"], like we see here. Now we created an actual tensor out of this, and then we don't have to unpack it like we did here, so we remove the two asterisks. If we run it again, this should work as well. And yeah, this worked too, so we get the same result. Here we printed our batch, and now we see this is a tensor directly.

[30:57] So yeah, be careful here to specify what you want. If you use PyTorch, then it's just simpler to use this argument, so return_tensors equals "pt", but if you don't use it, then you know what you can do instead.

[31:18] All right, so now we know how we can use different models. Try this out for other models in your language and see if this works. And now let's
have a look at how we can fine-tune our own models. This is very important. I already prepared some code here, and I will go over it only very roughly, but there is also very good documentation about this. We can go to this documentation page here, and you can also open it in Colab, with either PyTorch or TensorFlow code, so this is really helpful, and I encourage you to check it out.

[32:01] But now let's go over this briefly. Basically there are five steps you have to do; this example is for PyTorch. First, we have to prepare our dataset, for example load it from a CSV file or whatever. Second, we have to load a pre-trained tokenizer and call it with our dataset, so then we get the encodings, or the token IDs. Third, we have to build a PyTorch Dataset out of these encodings; if you don't know what a PyTorch Dataset is, I will have a link for you here where I explain it. Fourth, we load a pre-trained model. And fifth, we can either load a Hugging Face Trainer and train with it, which abstracts away a lot of things, or we can just use a native, normal PyTorch training pipeline like in our other PyTorch code. So yeah, this is what we have to do.

[33:04] Let's go over this very quickly. In this case we define our base model name: we want to start with a DistilBERT base uncased version (distilbert-base-uncased), but in this case not the fine-tuned one, just this one. Then, in step one, we prepare the dataset. We write a helper function to create texts and labels out of the actual text, and here we downloaded a dataset and put it in
our folder. I already did this here, and this is available at this website. It contains movie reviews, so we want to fine-tune our model on movie reviews for sentiment classification. Here we create the training texts and the training labels with our helper function, and we also do a train-test split to get validation texts and labels.

[34:04] As a next step, we define a PyTorch Dataset. It inherits from the PyTorch Dataset class, so from torch.utils.data we import Dataset, and then we define it here. Again, I have a tutorial where I explain how this works, but basically it needs the encodings and the labels, and it stores them in here.

[34:33] For the encodings we need a tokenizer, so again we use this from_pretrained function with the model name. In this case, since we know we use the DistilBERT one, we can use its specific class. Remember, before we used a generic tokenizer, this AutoTokenizer class, and here we use a more concrete one, DistilBertTokenizerFast. Then we apply it to the training, validation and test sets and get the encodings. Then we put them in our dataset class and create the PyTorch datasets.

[35:17] Then we import Trainer and TrainingArguments, which are available in the Transformers library, and we can set this up. We create the arguments, so here, for example, we specify the number of training epochs, the output directory, the learning rate and other parameters we want. Then we create our model, again from a concrete model class and again with this
from_pretrained function. Then we set up this Trainer and give it the model and the training arguments, and then the training set and the validation set. Then we simply have to call trainer.train(), and this will do all the training for us. Afterwards you can test it on your test dataset, and then you have a fine-tuned model. So yeah, this is basically all you need.

[36:14] Then I also want to show you that, instead of using this Trainer, if you want to do it manually and have even more flexibility, you can just use a normal PyTorch training loop. For this we use a DataLoader, and we need an optimizer; in this case we use an optimizer from the Transformers library. Here we specify our device, then again we create the model, push it to the device, and set it to training mode. Then we create a data loader and the optimizer.

[36:52] Then we do the typical training loop: we say for epoch in range(num_epochs), and for batch in our training loader, and then we do the stuff we always do. We say optimizer.zero_grad(), we also push the batch to the device if necessary, then we call the model and get the loss. In this case the loss is already contained in the outputs, so we can just access it directly. Then we call loss.backward() and optimizer.step() and iterate. Afterwards we can set our model to evaluation mode again. And yeah, this is how we do it in native PyTorch code.

[37:34] So this is basically how we do fine-tuning and can fine-tune our own models. Afterwards you can also upload them to the Hugging Face
Model Hub if you want. So yeah, I think that's pretty cool, and that's all that I wanted to show you for now. I think that's enough to get started with Hugging Face. I hope you enjoyed this tutorial, and I hope to see you in the next video. Bye!