---
title: 'HuggingFace Crash Course - Sentiment Analysis, Model Hub, Fine Tuning'
source: 'https://youtube.com/watch?v=GSt00_-0ncQ'
video_id: 'GSt00_-0ncQ'
date: 2026-06-30
duration_sec: 2292
---

# HuggingFace Crash Course - Sentiment Analysis, Model Hub, Fine Tuning

> Source: [HuggingFace Crash Course - Sentiment Analysis, Model Hub, Fine Tuning](https://youtube.com/watch?v=GSt00_-0ncQ)

## Summary

Patrick introduces the Hugging Face Transformers library, a popular NLP library in Python that integrates with PyTorch or TensorFlow. The tutorial covers building a sentiment analysis pipeline, exploring the model hub, and fine-tuning a custom model.

### Key Points

- **Introduction to Hugging Face Transformers** [0:00] — The library provides state-of-the-art NLP models with a clean API, making it easy to build powerful pipelines.
- **Installation and Setup** [0:41] — Install PyTorch or TensorFlow first, then run 'pip install transformers' or use conda.
- **Using the Pipeline for Sentiment Analysis** [1:06] — Import 'pipeline' from transformers, create a classifier with 'pipeline('sentiment-analysis')', and classify text with two lines of code.
- **Processing Multiple Texts** [3:41] — Pass a list of texts to the pipeline to get multiple results at once.
- **Specifying a Model and Tokenizer** [4:54] — Use 'model_name' to load a specific pre-trained model (e.g., 'distilbert-base-uncased-finetuned-sst-2-english') and pass it to the pipeline.
- **Manual Tokenization and Model Inference** [6:45] — Import 'AutoTokenizer' and 'AutoModelForSequenceClassification', use 'from_pretrained' to load them, then tokenize text and get predictions manually.
- **Batch Processing with Padding and Truncation** [13:01] — Use tokenizer with arguments like 'padding=True', 'truncation=True', 'max_length=512', and 'return_tensors='pt'' to prepare batches for PyTorch.
- **Manual Inference with PyTorch** [15:14] — Disable gradient tracking with 'torch.no_grad()', pass the batch to the model, apply softmax to logits, and get predictions using 'torch.argmax'.
- **Saving and Loading Models** [21:50] — Save tokenizer and model with 'save_pretrained(directory)' and load them back with 'from_pretrained(directory)'.
- **Exploring the Model Hub** [23:36] — Visit huggingface.co/models to search for pre-trained models by task (e.g., text classification) and language (e.g., German).
- **Using a German Sentiment Model** [25:18] — Load a German sentiment model (e.g., 'oliverguhr/german-sentiment-bert') and test it on German sentences.
- **Fine-Tuning a Model Overview** [29:30] — Five steps: prepare the dataset, load a pre-trained tokenizer and encode the data, build a PyTorch dataset, load a pre-trained model, and train using Trainer or a custom loop.
- **Fine-Tuning with Trainer** [31:28] — Use 'Trainer' and 'TrainingArguments' from transformers to simplify training; specify epochs, output directory, learning rate, etc.
- **Custom PyTorch Training Loop** [36:22] — For more flexibility, use a native PyTorch loop with DataLoader, optimizer (e.g., AdamW), and manual forward/backward passes.

### Conclusion

The Hugging Face Transformers library simplifies NLP tasks like sentiment analysis through high-level pipelines and also allows manual control for fine-tuning. With the model hub, you can leverage pre-trained models for multiple languages and tasks.

## Transcript

hi everyone i'm patrick and in today's
video we are going to learn how to get
started with hugging face and the
transformers library
the hugging face transformers library is
probably the most popular nlp library in
python right now
and it can be combined directly with
pytorch or tensorflow
it provides state-of-the-art natural
language processing models and has a
very clean api that makes it extremely
simple to build powerful
nlp pipelines so today we have a first
look at the library and build a
sentiment
classification algorithm i show you some
basic functions
and then we have a look at the model hub
and then i also show you how you can
fine-tune your own model
so let's get started all right so to get
started you should
either install pytorch or tensorflow
first
and then in order to install the
transformers library you just have to
say
pip install transformers
or there's also a conda installation
command that you can find on the
installation page so let's
install it like this so i already did
this and then we can start using this so
we can say
from transformers and then we import
a pipeline as first thing and have a
look at this
and then we also import some utilities
that we need from the
pytorch library so we import torch
and we import torch dot nn
dot functional as f so we're going to use
this
later and now we can start using this
pipeline so let's say classifier
equals and then we create a
pipeline and we need to specify the
task that we want so in this case we
want to do
sentiment analysis so we have to call it
like
this and you will find the different
available tasks on the website
so here we can see for example we have
this
sentiment analysis which is just an
alias of text classification but for
example we also have a
question answering pipeline or a text
generation or a conversational pipeline
so yeah this is how we can define a
pipeline
and what a pipeline does is that it
gives you a great and easy way to use
a model for inference and it abstracts a
lot of the things for you
so you will see what i mean in a moment
so now we can just use this classifier
and classify some text by saying
res for results equals
and then we call this classifier and we
want to classify a example text
so let me copy and paste some example
text for you
so we want to classify we are very happy
to show you
the smiley face transformers library and
then let's print
the result and see how this looks like
so let's run the code all right and as
you can see we get the label
is positive and the score is 0.99 so
it's very confident that this is
a positive sentence and as you can see
it only takes
two lines of code with this pipeline to
create a
sentiment analysis code so
yeah this is exactly what we need so we
need to see the
label of the text if it's negative or
positive
and we also get the score so yeah this
is really nice
and now let's have a look at some more
things that we can do with this pipeline
so first of all we can put in
more texts at once so we can not just
use
one so we can give it a list so let's
for example use a list
and then let's use another example text
so let me
copy and paste this one in here as well
so we also want to classify this we hope
you don't
hate it and then we get multiple
results back so let's call this results
and then we can iterate over this so we
can say for
result in results
and then we want to print the result
and now let's run this code and have a
look at how this looks like
all right and as you can see for the
second text we get
another result back so here the label is
negative and the score is maybe not that
confident in this case
so this text might be a little bit
confusing we hope
you don't hate it but basically this is
how you can pass in multiple texts at
once
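
A minimal sketch of the pipeline usage described so far, assuming the `transformers` library and a backend such as PyTorch are installed. The helper names `load_sentiment_pipeline` and `classify` are illustrative and do not appear in the video. The import sits inside the loader so that nothing is downloaded until it is actually called.

```python
def load_sentiment_pipeline():
    """Create the default sentiment-analysis pipeline.

    The first call downloads the default model, so it needs network access.
    """
    from transformers import pipeline  # heavy import, deferred until needed

    return pipeline("sentiment-analysis")


def classify(classifier, texts):
    """Run a pipeline-like callable on a list of texts.

    Returns one (text, label, score) tuple per input text.
    """
    results = classifier(texts)  # a list input yields a list of result dicts
    return [(text, res["label"], res["score"]) for text, res in zip(texts, results)]
```

Calling `classify(load_sentiment_pipeline(), texts)` with the two example sentences from the video should return one labeled result per sentence.
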
and now so right now we only use
the default pipeline with the default
model but now let's have a look at how
we can use a
concrete model and then also how you can
use a concrete
tokenizer so what we can do is
we can specify the model name
and say model name equals and in this
case i use
distilbert base uncased and then
fine-tuned sst-2 english so i will show
you where i got this
string or this name in a moment
but for now yeah this is basically just
a distilbert model
which is a smaller and faster version of
bert but it was pre-trained on the same
corpus
and then you see that it also was
fine-tuned and this is just the name of
the data set so in this case
it's an english data set from the
stanford sentiment treebank version two
and yeah so now if we have the model
name we can
give this to our pipeline with the model
argument so we can say model equals and
then we use this model name
so now in this case i can tell you that
the
default model for this sentiment
analysis task
is already this model name so this
should do
exactly the same but later we will
switch this and then have a look at how
we can use different models
so first of all let's run this again and
see that this is still the same
all right so we see this is still the
same result so this worked
so now we um just use
this string to define our model but now
let's have a different
approach to define a model and then also
a
tokenizer so this will give us a little
bit more flexibility
later so in order to do this we want to
say
from transformers and then here i
import a auto tokenizer class
and auto model for
sequence classification and this is
just a generic class for a tokenizer
and this is also a generic class but a
little bit more specific so in this case
i want to have it for sequence
classification
and then it will give me a little bit
more functionality
specifically for this task so don't
worry about this right now you can
also find all the model classes
available
in the documentation so if you're
interested then have a look at this
and also if you use tensorflow then
here you have to say tf and then
the name of this class but the rest is
actually
the same so yeah this is how you use
tensorflow
and now after importing this
we can create um two instances of this
so we can do we can say model
equals and then we use this class
so auto model for sequence
classification
and then we use a function that is
called so let's say
dot from pre-trained
and then it also needs the model name
and we do the same with the tokenizer so
we say
tokenizer equals the auto tokenizer
dot from pre-trained and then it needs
the
model name so this dot from
pre-trained function is a very important
function in hugging face that you will
see a lot
so you will see this later a few more
times so
now that we created this we can also
just give the actual model and not just
the string
to the classifier or to the pipeline
so we can say model equals
our model and tokenizer
equals our tokenizer so
now if we run this we should still get
the same results because these are the
default versions and yeah as we see we
still get the same result
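
The two ways of selecting a model described above can be sketched as follows. This assumes `transformers` is installed. The function name `build_pipeline` is made up for illustration, and the model name is the one given in the summary of this video.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def build_pipeline(model_name=MODEL_NAME):
    """Load model and tokenizer explicitly and hand both objects to the pipeline."""
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The shorter variant shown first passes only the string: model=model_name.
    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
```

Swapping in a different model later only requires a different `model_name`.
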
but then later um if you want to use a
different
model or tokenizer then you know how you
can switch this
so just by using a different model and
tokenizer here for the pipeline so now
instead of using this
pipeline let's see how we can use this
model and tokenizer directly and do some
of the steps manually
and this will give you a little bit more
flexibility
so down here um let's first
use the tokenizer and see what this
does so first let's
um call the tokenizer.tokenize function
so we say let's call this tokens and
then
equals tokenizer dot tokenize
and then the string or the sentence we
want to tokenize
so let's copy and paste this in here
and then once we get the tokens we can
use them and get the
token ids out of it so we can say
token ids equals and then we
again use the tokenizer and the function
convert tokens to
ids and then it needs
the tokens so this is one way how to do
this
or we can um do this directly by saying
token ids equals and then we
call this tokenizer like a function
and then again we give it the same
string here so now let's
print all these three variables to see
where is the difference
so first we print the tokens then we
print the token ids
and then here let's actually
give this a different name so let's call
this
input ids so
now let's run this and see how this
looks like all right so here is the
result so as you can see when we call
tokenizer.tokenize then we get
a
list of strings or the list of the words
back so now
each word is a oh sorry
each word is a separate token
and for example this one is our smiley
face or our emoji
so yeah this is what the tokenize
function
does and then once we call this
convert tokens to ids we get
this one back so now it converted
each token to an id so
each word has a very unique
id and this is basically the
mathematical
representation or the numerical
representation that our model then can
understand
so this is what we get after this
function and if we call this tokenizer
directly then we get a dictionary back
and here we have the key input ids
and we also have the attention mask so
for now you don't really have to worry
about this
but let's have a look at the input ids
so if we compare the token ids with the
input ids then we see we have the exact
same
sequence of token ids but we also have
this 101
and 102 token and this is
just the beginning of string and the end
of string
token but basically it's the same
so yeah this is the difference between
these three
functions and then these input ids
this is what we can pass to our model
later to do the predictions manually
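
The three calls compared above are `tokenizer.tokenize(text)`, `tokenizer.convert_tokens_to_ids(tokens)` and `tokenizer(text)`. The toy sketch below mimics how they relate, in plain Python so that it runs without downloading anything. It is not the library's subword algorithm: the whitespace split and all word ids are invented for illustration. Only the ids 100, 101 and 102 follow the BERT uncased vocabulary, where 101 and 102 are the `[CLS]` and `[SEP]` markers that the video describes as begin and end tokens.

```python
# Toy vocabulary: special-token ids match bert-base-uncased, word ids are made up.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 2001, "are": 2002, "very": 2003, "happy": 2004,
}


def tokenize(text):
    """Split text into tokens (real tokenizers apply subword rules instead)."""
    return text.lower().split()


def convert_tokens_to_ids(tokens):
    """Map each token to its id, falling back to the unknown-token id."""
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]


def encode(text):
    """Mimic calling the tokenizer directly: add special tokens and a mask."""
    ids = convert_tokens_to_ids(tokenize(text))
    input_ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}
```

The `input_ids` from `encode` are the plain token ids wrapped in 101 and 102, which is the difference pointed out in the video.
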
so now like before we can also use
multiple
um sentences of course to for our
tokenizers so
um for example usually in your code you
have your
training data so let's say x train
and in this example let's just use these
two
sentences so this is our x train
and then we can um and then we can pass
this to our
tokenizer and let's call this batch so
this is
our batch that we put into our model
later
so we say batch equals tokenizer and
then we call this
tokenizer directly with our training
data
and then i also want to show you some
useful arguments so we say
padding equals true and we also say
truncation
equals true and then we say
max length equals 512
and we say return tensors
equals and then as a string pt
for pytorch so this will ensure that
all of our samples in our batch have the
same
length so it will apply padding and
truncation if necessary
and this is also important so in this
case we want to have a
pytorch tensor returned directly
so i will show you later what's the
difference if you don't use this
but for now let's just use this and then
um first of all let's print this
batch and see how this looks like and
then
we see we get a dictionary
and again it has the key input ids
and the key attention mask and then here
it has
two tensors so the first one
for the first sentence and the second
one for the
second sentence and the same for the
attention mask so two tensors
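
To make the effect of `padding=True` and `truncation=True` concrete, here is a plain-Python sketch of what happens to a batch of id sequences. It is a simplification under stated assumptions: the function name `pad_and_truncate` is hypothetical, the pad id 0 matches BERT-style vocabularies, and truncation simply cuts the tail, whereas a real tokenizer keeps the closing special token.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut sequences to max_length, then pad to the longest one in the batch.

    Returns a dict with input_ids and a matching attention_mask
    (1 for real tokens, 0 for padding).
    """
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

All rows of `input_ids` end up with the same length, which is what allows them to be stacked into a single tensor. The attention mask tells the model which positions are padding.
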
so yeah as i said these input ids are
these unique ids that our
model can understand so yeah now we have
this batch
and now we can pass this to our
model so and let's do this manually and
see how we can call our model
so in pytorch when we do inference we
also want to say
with torch dot no grad
so this will disable the gradient
tracking i explained this in
a lot of my tutorials so you can just
have a look at them if you want to learn
more about this
and then we can call our model by saying
outputs equals and then we call
the model and then here we use
two asterisks and then we
unpack this batch so if you remember
here this is
a dictionary and here basically
with this we just unpack these
values in our dictionary so for
tensorflow you don't do this so
you just pass in the batch like this but
for pytorch you
have to unpack this and now we get the
outputs of our model
so let's print the outputs and as you
might know this
these are just the raw values so
to get the actual probabilities and the
predictions
we can apply a the softmax so let's say
predictions equals torch or
we also have this in f dot softmax
and then here we say
outputs dot logits and we want to do
this along dimension
equals one and let's also
print the um predictions
and then let's do one more thing so
let's also get the
labels labels equals and we just get
this by
taking the prediction with the or the
index with the highest probability so we
get this by saying
torch dot argmax
and we can either put in the predictions
or we can put in the outputs and
actually
don't need this but just for
demonstration
uh let's use the predictions and then
again
dimension equals one and then let's
print the labels as well
and now let's actually do one more thing
so let's convert the labels
by saying labels equals and then we use
list comprehension
and call model dot config
dot id2label
and then it needs the
actual label id
and then we iterate so we say for
label id in labels
dot tolist and now what this does you will
see this when we print this so we print
the labels and now
let's actually run this and see if this
works
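
Put together, the manual steps dictated above might look like the sketch below. It assumes `torch` and `transformers` are installed. The function name `predict` and the returned `(label, score)` pairs are illustrative choices, not the code shown on screen.

```python
def predict(texts, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Tokenize a batch, run the model without gradients, return (label, score) pairs."""
    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():  # inference only, so no gradient tracking
        outputs = model(**batch)  # unpack input_ids and attention_mask
        probabilities = F.softmax(outputs.logits, dim=1)
        label_ids = torch.argmax(probabilities, dim=1)

    labels = [model.config.id2label[i] for i in label_ids.tolist()]
    scores = probabilities.max(dim=1).values.tolist()
    return list(zip(labels, scores))
```

Since softmax preserves the ordering of the logits, `torch.argmax(outputs.logits, dim=1)` would select the same labels. The softmax is only needed to read the scores as probabilities.
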
all right so this worked so as you can
see
um here we print the output
so these are our output this is a
sequence classifier output and as you
see
it has the logits argument so that's why
we used
outputs.logits and then we get the
actual probabilities and
then to get the labels we used argmax so
this is a tensor with the label
one and the label zero and then we
converted each
label to the actual class name and then
we get
positive and negative so by the way this
function i think is only dedicated
to a auto model for sequence
classification
for example if we just used an auto
model then i
think it won't be available so that's
what
these more um concrete classes will do
for you it gives you
a little bit more functionality for the
dedicated task
so we see that the loss is
none in this case so if you also want to
have
a loss that we want to inspect then we
can
give the loss or the
not the loss but the labels arguments
to our model that um it knows how to
compute the loss
so we say labels and then we
create a torch dot tensor by saying
torch dot tensor and then as a list we
give it the labels
one and zero and now let's run this
again
and then you should see that we should
see a loss here
and yeah now here we see the loss and
again
this labels argument is i think
special to this auto model for sequence
classification
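
The loss that appears once `labels` are passed can be reproduced by hand. The sketch below assumes the model uses a mean cross-entropy over the batch, which is the usual choice for single-label classification. That is an assumption about the model internals and is not stated in the video. The logits in the usage note are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the correct class over the batch."""
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(batch_logits, labels)
    ]
    return sum(losses) / len(losses)
```

For example, `cross_entropy([[-4.0, 4.0], [2.0, -2.0]], [1, 0])` is small because both predictions agree with the labels, and flipping the labels makes it much larger.
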
so yeah this worked and now if we have a
careful look at the probabilities
so first of all we see we get label
positive
and negative and here for the first one
this is the highest probability so 0.9997
and here for the second one this is
the largest number so it took this one
and this
is 0.530 so if we compare them
with the results that we got from our
pipeline
then we see these are exactly the same
numbers so now you might see
what's the difference between a pipeline
and
using tokenizer and model directly
so with the pipeline we only need two
lines of code and then we actually
get what we want so we get the label and
we get the score we are interested in
so this might be just fine but then yeah
if you want to do it manually
you can do it like i showed you and you
will get the same results that you can
then
use so yeah that's how you can use a
model and a
tokenizer and yeah so using the model
and the tokenizer will be important when
you for example want to
fine-tune your model so i will show you
roughly how to do this later but
yeah so this is how you use model and
tokenizer
and let's just assume we did
fine tune our model then what we can do
and we can say save directory and
specify
a directory so let's call the folder
saved and then we can call tokenizer
and then we can call dot save
pre-trained
and then the location just the save
directory
and the same with our model so we can
say model
dot save pre-trained save
underscore pre-trained and then again
the
save directory and then we can load them
in another application for example
tokenizer
equals and then again here we use this
auto tokenizer class
and then the from pre-trained and then
here
we can give it a directory so
this from pre-trained we can either give
it a
model name or we can give it this
directory
and again the same for the model so
model
equals and then we use this auto model
for
sequence classification dot from
pre-trained and then the save directory
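
A short sketch of the save and reload round trip, assuming `transformers` is installed and that `model` and `tokenizer` were created with `from_pretrained` as earlier in the video. The function name `save_and_reload` is made up, and the folder name `"saved"` follows the video.

```python
def save_and_reload(model, tokenizer, save_directory="saved"):
    """Write model and tokenizer to disk, then load fresh copies from that folder."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer.save_pretrained(save_directory)
    model.save_pretrained(save_directory)

    # from_pretrained accepts a local directory as well as a hub model name.
    reloaded_tokenizer = AutoTokenizer.from_pretrained(save_directory)
    reloaded_model = AutoModelForSequenceClassification.from_pretrained(save_directory)
    return reloaded_model, reloaded_tokenizer
```
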
so this should work and then you should
get the exact same
model and tokenizer back and yeah as
you might see
these um model these dot
from pre-trained functions are very
important
and you will use them a lot of time all
right so i think these are the basic
functions you need to build a pipeline
or to apply the model and tokenizer
manually
and now let's have a look at how we can
use a different
model so like here you can either
load this from your disk if you already
have a pre-trained model somewhere on
your computer
but what you can also do is you can go
to
the hugging face model hub so you can
find this at hugging face dot
co slash models and here we have the
model hub and you can search
through different models so for example
you
could filter for the tasks so
in this case we want to do text
classification
which is the same as sentiment analysis
and then it applies this
filter so
you can see the most popular model
is already this one and then we can
click on this and get some more
information
and as you could see so this is the
exact same
model name that we used in our code
so once you've decided for a model you
can click here and copy this
name and then paste into your code
so let's say in this case we want to use
a different model so in this case
i want to do sentiment classification
with
german sentences so then of course i
need one that is trained on
german so you can filter here so you can
search so i can either again
search for distilbert and see what
different versions there are available
or let me search for german
and then here let's take the most
popular one so
by oliver guhr and then we see this is a
german sentiment bert and then we get
more information and sometimes we also
see
some example code which is helpful so
yeah this is nice and now what we have
to do is we want to click here and
copy this will just copy the name and
then in our application let me
comment this out and then let's again
say
model name equals and now i hit
paste so now it pasted this
string here so now we have this
and now here we can give our model and
tokenizer
the model name so model name
and model name and now let's do this for
some
example texts in german so let me copy
and paste this in here so basically let
me
quickly translate this for you so this
says not a good result
this was unfair this was not good
um not as bad as expected this
was good and she drives a green car
so basically these three texts are
negative this one is rather positive and
this
is neutral so let's see if our model can
detect this correctly
so now again like above we do the same
steps so
we could copy and paste this so let's
copy
and paste this and then the same as
above we say with torch
dot no grad and then we call
the model so we say
outputs equals model and then here we
unpack our batch then we have the model
then we want to have the label id so
let's say
label ids equals and then we
use the torch.argmax function
with the outputs and along dimension
equals
one and let me remove this one
and then we print the label id so print
the label ids
and then we do the same as we do here so
we want to
convert them to the actual label names
by calling model.config
dot id2label label id for
label id in and here we call this label
ids dot tolist and then print the labels
and now let's run this and actually
let's
also print the batch in this
case and uh let's have a look at how
this looks like
so let's run this and i get an error so
here i forgot to say
outputs dot logits like we did before
so let's try it again and this is only
two results so
of course here in our tokenizer we want
to use
these texts so let's call this
x train underscore
german and then let's use x train
underscore german here and let's
run it again all right and as we can see
we get the
labels one one one zero zero and
two and this is equal to negative
negative negative then two times
positive and then neutral
so yeah this is exactly what i told you
the first three sentences are rather
negative
then two positive ones and this one is
neutral
so yeah now our german model works as
well and this
is how we can use different models
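
The German example can be sketched like this, assuming `torch` and `transformers` are installed. The model name comes from the summary of this video. The mapping `GERMAN_ID2LABEL` is only inferred from the printed output described above (ids 1, 1, 1, 0, 0, 2) and the authoritative mapping is `model.config.id2label`. The helper names are illustrative.

```python
MODEL_NAME = "oliverguhr/german-sentiment-bert"

# Inferred from the output in the video, not read from the model config.
GERMAN_ID2LABEL = {0: "positive", 1: "negative", 2: "neutral"}


def ids_to_labels(label_ids, id2label):
    """Translate numeric class ids into readable label names."""
    return [id2label[label_id] for label_id in label_ids]


def predict_labels(texts, model_name=MODEL_NAME):
    """Classify texts with a model from the hub and return label names."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**batch)
    label_ids = torch.argmax(outputs.logits, dim=1)
    return ids_to_labels(label_ids.tolist(), model.config.id2label)
```
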
so we simply search the model hub and
hopefully there is an already
pre-trained version for the task we want
and then we can just use this here as
our model name and then we are good to
go
or if there is not a already pre-trained
version then we have to do this
ourselves and fine-tune our own model so
i will show you how you do this in a
moment
but now one more thing i want to mention
so
um i want to talk about this return
tensors equals pt so
um if we here we print the batch and
here the input ids and then we see
this is a tensor so right now it's
already
in the pytorch format so we could
use tensorflow here or we just um
omit this and if we omit this
then we don't have this in the tensor
format
so now it is just a python list i think
but then what you could do is you could
convert this so we can say
batch equals and then we convert this to
a tensor by saying
torch dot tensor and then we
give it the we call this batch
and this is a dictionary so we can say
batch and then access the key input
ids like we see here and now
we created a actual tensor out of this
and then we don't have to
unpack it like this here so now we
remove this
and then if we run it again then this
should work as well
and yeah this worked too so we get the
same result
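
A sketch of the conversion just described, for the case where `return_tensors` is omitted. It assumes `torch` is installed and that `tokenizer` is an already loaded Hugging Face tokenizer. The function name is hypothetical. Padding stays enabled so that the id lists are rectangular and can be turned into one tensor.

```python
def encode_as_tensors(tokenizer, texts):
    """Tokenize without return_tensors, then build the tensors by hand."""
    import torch

    batch = tokenizer(texts, padding=True, truncation=True, max_length=512)
    # Without return_tensors the dictionary values are plain Python lists.
    input_ids = torch.tensor(batch["input_ids"])
    attention_mask = torch.tensor(batch["attention_mask"])
    return input_ids, attention_mask
```

The video passes only the `input_ids` tensor to the model. Calling `model(input_ids, attention_mask=attention_mask)` additionally tells the model which positions are padding, which can matter when the sentences in a batch differ in length.
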
and here we printed our batch and now we
see this is a
tensor directly so yeah be careful here
to specify
what you want so it's actually if you
use pytorch then it's just simpler to
use this as a return argument so return
tensors equals pt but if you don't
use this then you know what you can do
otherwise all right so now we know how
we can use different
models so yeah try this out for
other models in your language and see if
this works
and now let's have another look at how
we can fine
tune our own models so this is very
important
and i already prepared some code here
and i will
go over this very roughly
but there's also a very great
documentation
about this so we can go to this
documentation page here and you can also
open this in colab so either with
pytorch or tensorflow code so this is
really helpful
so i encourage you to check this out
um but now let's go over this briefly
so basically there are five steps you
have to do
um so in this example it's for pytorch
so we have to prepare our data set for
example
loaded from a csv file or whatever
then we have to load a pre-trained
tokenizer
and then call it with our data set so
then we get the
encodings or the token ids then
we have to build a pytorch data set
out of this with these encodings so if
you don't know
what the pytorch data set is then i
will have a link for you here where i
explain this then we also load a
pre-trained
model and then we can either load
a hugging face trainer and train it so
this abstracts away a lot of things or
we can just use
a native or normal pytorch training
pipeline like in our other pytorch code
so yeah this is what we have to do so
let's go
over this very quickly so in this case
we define our base model name so we want
to start with
a distilbert base uncased version
but in this case for example not the
fine-tuned one so
just this one then step one we prepare
the data set so we write a helper
function
to create texts and
labels out of the actual text
and here we downloaded some
data set and put it in our folder so i
already did this here and
yeah this is available at this website
and this contains
movie reviews so we want to fine-tune
our models on movie reviews for
sentiment classification
so here we create training texts and the
training
labels with our helper function and we
also do
a train test split to get validation
texts and labels
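
The split into training and validation data can be illustrated without extra dependencies. The helper below is a plain-Python stand-in for a library train/test split function, not the call used in the video. The name `train_val_split` and its parameters are assumptions made for this sketch.

```python
import random


def train_val_split(texts, labels, val_fraction=0.2, seed=0):
    """Shuffle indices reproducibly and split texts and labels in parallel.

    Returns (train_texts, val_texts, train_labels, val_labels).
    """
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (
        pick(texts, train_idx),
        pick(texts, val_idx),
        pick(labels, train_idx),
        pick(labels, val_idx),
    )
```

Splitting by shared indices keeps every text paired with its own label.
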
and yeah then as a next step
we create or we define a
pytorch data set so this inherits from
pytorch data set so torch utils data we
import data set and then we define this
here so again i have a tutorial where i
explain how this works
but basically it needs the encodings
and the labels and it stores them in
here
so yeah this needs the encoding so for
the
encodings we need a tokenizer
so again we use this from pre-trained
function
with the model name and in this case
since we know
we use the distilbert one we can
use this class so remember before we
used a generic
tokenizer this auto tokenizer class
and here we use a more concrete one so
we use the
distilbert tokenizer fast then we apply
it
to our training validation and test set
and get the
encodings then we put them in our data
set
and create the pytorch data set
and then we import a trainer
and the training arguments so this is
available in the transformers library and
then we can
set this up so we can create the
arguments so here for example we specify
the number of training epochs the output
directory
the learning rate and other parameters
we want and then we
create our model again from a
concrete model class and then with this
dot
from pre-trained function and then we
set up this
trainer and give it the model and the
training
arguments and then the training set and
the validation set
and then we simply have to call
trainer.train
and this will do all the training for us
and afterwards you can test it on your
test data set
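
The Trainer-based steps above can be condensed into one sketch, assuming `torch` and `transformers` are installed. The names `fine_tune` and `ReviewDataset`, the output directory and all hyperparameter values are placeholders chosen for illustration, not the settings used in the video.

```python
def fine_tune(train_texts, train_labels, val_texts, val_labels,
              model_name="distilbert-base-uncased", output_dir="./results"):
    """Fine-tune a DistilBERT classifier with the Trainer API (sketch)."""
    import torch
    from torch.utils.data import Dataset
    from transformers import (
        DistilBertForSequenceClassification,
        DistilBertTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    class ReviewDataset(Dataset):
        """Wraps tokenizer encodings and labels as a map-style PyTorch dataset."""

        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    # Load the tokenizer, encode the texts, and build the datasets.
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
    train_encodings = tokenizer(train_texts, truncation=True, padding=True)
    val_encodings = tokenizer(val_texts, truncation=True, padding=True)
    train_dataset = ReviewDataset(train_encodings, train_labels)
    val_dataset = ReviewDataset(val_encodings, val_labels)

    # Load the pre-trained base model (not the already fine-tuned one).
    model = DistilBertForSequenceClassification.from_pretrained(model_name)

    # Configure and run the Trainer; every value here is a placeholder.
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        learning_rate=5e-5,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )
    trainer.train()
    return trainer
```
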
and then you have a fine-tuned model so
yeah this is basically
all you need and then i also want to
show you that instead of using this
trainer if you want to do it manually
and have
even more flexibility you can just use a
normal pytorch training loop so
for this we use a data loader
and we need an optimizer so in this
case
we use a optimizer from the transformers
library
and then here we specify our device then
again we create this
model we push it to the device and set
it to training mode
then we create a data loader and the
optimizer
and then we do the typical training loop
so we say
for epoch in num epochs and for batch in
our training loader
and then we do the stuff we always do we
say optimizer dot zero grad
we also push it to the device if
necessary
then we call the model and we calculate
the
loss with this and in this case um
this is already contained in the output
so we can just
access the loss like this then we call
loss.backward
and optimizer step and iterate and
afterwards we can set our model to
evaluation mode again and yeah this is
how we do it in native pytorch code
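
A sketch of the native training loop just described, assuming `torch` is installed, `model` is a sequence classification model, and `train_dataset` yields dictionaries with `input_ids`, `attention_mask` and `labels`. The video takes its optimizer from the transformers library. This sketch swaps in `torch.optim.AdamW` instead, because the transformers version has since been deprecated. The function name and default values are placeholders.

```python
def train_loop(model, train_dataset, num_epochs=2, batch_size=16, lr=5e-5):
    """Fine-tune a model with a plain PyTorch loop and return it in eval mode."""
    import torch
    from torch.utils.data import DataLoader

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.train()

    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        for batch in loader:
            optimizer.zero_grad()
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            # Passing labels makes the model return the loss with its outputs.
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()

    model.eval()
    return model
```
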
and yeah so this is basically how we do
a
fine tuning and then can fine-tune our
own models and then afterwards you can
also
upload them to the hugging face model
hub if you want so
yeah i think that's pretty cool and yeah
that's all that i wanted to
show you for now i think that's enough
to get started with hugging face
and i hope you enjoyed this tutorial and
then i hope to see you in the next video
bye