---
title: 'Web Scraping with Python - Beautiful Soup Crash Course'
source: 'https://youtube.com/watch?v=XVv6mJpFOb0'
video_id: 'XVv6mJpFOb0'
date: 2026-06-30
duration_sec: 4103
---

# Web Scraping with Python - Beautiful Soup Crash Course

> Source: [Web Scraping with Python - Beautiful Soup Crash Course](https://youtube.com/watch?v=XVv6mJpFOb0)

## Summary

This tutorial teaches web scraping with Python using the Beautiful Soup library. It covers basic HTML parsing, then progresses to scraping a real job listing website (TimesJobs) with requests, filtering results, and saving data to files. The course is structured for beginners and includes hands-on coding examples.

### Key Points

- **Goal of Tutorial** [00:33]: The tutorial teaches web scraping with Beautiful Soup, starting with a basic HTML page, then moving to a real website, and finally storing the scraped data in files.
- **Installing Libraries** [05:36]: Install `beautifulsoup4` and the `lxml` parser with `pip install`.
- **Reading HTML File** [07:20]: Open `home.html` in a `with open(...)` block and read its content into a variable.
- **Finding Tags with Beautiful Soup** [12:50]: Use `find()` to get the first match and `find_all()` to get all matches. Access a tag's text via `.text`.
- **Inspecting Real Websites** [16:35]: Use the browser's Inspect option (right-click) to view the HTML structure and identify elements to scrape.
- **Scraping TimesJobs with Requests** [24:47]: Use `requests.get(url).text` to fetch the page HTML, then parse it with Beautiful Soup.
- **Extracting Job Details** [33:10]: Find the company name (`h3.job-list-comp-name`), skills (`span.srp-skills`), and posted date (`span.sim-posted`) using `find` with the `class_` parameter.
- **Filtering Jobs by Date** [42:00]: Only include jobs with 'few' in the posted date text to keep recent postings.
- **Saving Results to Files** [56:00]: Write each job's info to a separate text file in the 'posts' directory using `with open` and `f.write`.
- **Conclusion** [68:00]: Final wrap-up, with viewers encouraged to subscribe and check the channel for more content.
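
The last three steps in the list (extracting fields, keeping only recent postings, and writing one file per job) could look roughly like the sketch below. The HTML snippet, the `job-card` class, and the `save_recent_jobs` helper are invented for illustration; the other class names are copied from the list above and should be checked against the live page in the browser inspector. The standard-library `html.parser` is used here so the sketch runs without `lxml`, which is what the video installs.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Made-up stand-in for a job results page; the real markup will differ.
SAMPLE_HTML = """
<ul>
  <li class="job-card">
    <h3 class="job-list-comp-name"> Acme Corp </h3>
    <span class="srp-skills"> python , django </span>
    <span class="sim-posted"><span>Posted few days ago</span></span>
  </li>
  <li class="job-card">
    <h3 class="job-list-comp-name"> Old Co </h3>
    <span class="srp-skills"> python , sql </span>
    <span class="sim-posted"><span>Posted 4 days ago</span></span>
  </li>
</ul>
"""


def save_recent_jobs(html, out_dir):
    """Write one text file per job whose posted date mentions 'few'."""
    soup = BeautifulSoup(html, "html.parser")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, job in enumerate(soup.find_all("li", class_="job-card")):
        posted = job.find("span", class_="sim-posted").text.strip()
        if "few" not in posted:
            continue  # skip postings that are not marked as recent
        company = job.find("h3", class_="job-list-comp-name").text.strip()
        skills = job.find("span", class_="srp-skills").text.strip()
        path = out_dir / f"{index}.txt"
        with open(path, "w") as f:
            f.write(f"Company Name: {company}\n")
            f.write(f"Required Skills: {skills}\n")
        written.append(path)
    return written


if __name__ == "__main__":
    for saved in save_recent_jobs(SAMPLE_HTML, "posts"):
        print(f"File saved: {saved}")
```

Returning the list of written paths is a design choice of this sketch, made so the result is easy to inspect; the tutorial itself only describes writing the files.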

### Conclusion

This tutorial provides a comprehensive introduction to web scraping with Python, covering local HTML parsing, real website scraping with requests, filtering, and saving data to files.

## Transcript

hi everyone and welcome to a special
python tutorial where we are going to
learn how to perform web scraping so
first of all thanks to freeCodeCamp for
giving me this opportunity of being a
guest on their channel and i have a
youtube channel as well that is named
JimShapedCoding and you can find there
any tech related topic such as
programming language web development and
more content that i am uploading once or
twice a week so you can just go ahead
and find the link from the description
okay so in this video i'm going to do my
best to teach you anything that is
related to web scraping and i'm going
to do that with the beautiful soup
library and that is a special library
that will allow you to gather any
information you want from any website
you want okay so this website could be
your bank account could be a job post
website like linkedin this could be
wikipedia or a sports website and really
anything that you can think about so we
will start by scraping a basic html page
first just to understand the concepts
and then we will move on to scraping a
real website and by the last 15 to 20
minutes of this tutorial i'm going to
show you how you can store the
information that we have just pulled
from this website so let's begin
great so this is the webpage that we are
going to
start web scraping and i'm going to
explain what is going on here so you can
see that we are having a basic title and
then we are having a kind of three
paragraphs so you can see that we have a
title of python and then we have a kind
of secondary title and then there is a
basic explanation about the course
itself and then we are having a button
that says start that will probably lead
us to a different page if we click on it
and then you can see that it has the
price here as well now we are kind of
repeating ourselves three times here and
this is what is responsible to that web
development paragraph and then also for
that machine learning paragraph now what
we are currently looking at it is
basically the behind the scenes of that
page so this is the html code that is
defined in order to show you that hello
start learning page and you can see that
inside our html documents all of the
code is being created with tags now
those tags are what are responsible to
display different information for you
and you can see that we have a big tag
that is called html and then inside of
that html tag we are having a head tag
and then a body tag now you can see that
we are defining a closure for each of
our tags with the forward slash here and
then you are probably going to see that
for the different tags as well now let's
expand the head tag here and then inside
of it we are seeing some meta
information that is not quite relevant
for us but we see that link tag which is
responsible to import some styling for
our page and then we can see that title
tag which is responsible to customize
our tab name and that is why you'll see
my courses over here now i will close
back the head and then i will expand the
body so the body is responsible to
display what is going to be on the page
itself it is the page's body and you can
see that we already have the h1 tag that
is created here and then between the
closure which is the area that you can
write the text for that tag we see the
hello comma start learning and then we
are having some div tags here and when
you see the tag of div this is the very
basic tag that will create some
tags in different styling so you'll see
here the class equals card what this
attribute assigning does here it is
importing the card styling and that is
why you see the kind of carding style
for each of our paragraphs over that
page and you can see that we are having
one more div inside that card class
which is called card header so this is
the styling for card header this is why
it is called that way and then the text
is python and then we have the card body
and we have the h5 tag which is a kind
of smaller header that you can display
and if i scroll right here you can see
that python for beginners text and then
the closure for the h5 tag and we are
having a paragraph and then the a tag
which is allowing us to lead to another
page so when you see the a tag it is
basically a reference to another page
that you can visit
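
The page structure described above can be approximated with a small script that generates a similar `home.html`. This is only a reconstruction from the spoken description: the exact markup, the description paragraph, and the prices of the second and third course are placeholders, and `build_home_html` is a hypothetical helper name.

```python
from pathlib import Path

# (card header, course title, price) for each card.
# Only the first price (20$) is mentioned in the video; the others are placeholders.
COURSES = [
    ("Python", "Python for beginners", "20$"),
    ("Web Development", "Python Web Development", "50$"),
    ("Machine Learning", "Python Machine Learning", "100$"),
]

CARD_TEMPLATE = """\
    <div class="card">
      <div class="card-header">{header}</div>
      <div class="card-body">
        <h5>{title}</h5>
        <p>Short description of the course.</p>
        <a href="#">Start for {price}</a>
      </div>
    </div>
"""


def build_home_html():
    """Return an HTML page with one card per course, repeated three times."""
    cards = "".join(
        CARD_TEMPLATE.format(header=header, title=title, price=price)
        for header, title, price in COURSES
    )
    return (
        "<!DOCTYPE html>\n<html>\n  <head>\n    <title>My Courses</title>\n  </head>\n"
        "  <body>\n    <h1>Hello, Start Learning!</h1>\n"
        f"{cards}  </body>\n</html>\n"
    )


if __name__ == "__main__":
    # Writes the file next to the script so later steps can open it by name.
    Path("home.html").write_text(build_home_html())
```
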
now this entire code that i'm currently
marking let's actually make our page a bit
bigger here
this entire code that i just marked is
kind of repeated three times and that is
why we see the page that we saw
previously okay so it is quite important
to understand and we are going to scrape
that page and pull some information with
the beautiful soup library now if you
are confused with the script tags here
don't because those tags are responsible
to import some javascript libraries and
that is something not relevant for us
right now okay so we are going to switch
to python now in order to apply some
basic scraping for that page so i will
go and start working on my main.py file
and you can see that nothing is here now
before we actually start we have to
install some libraries and one of them
will be the beautiful soup so i will
open my terminal and since i'm working
with my system global interpreter i will
allow myself to install it over here and
i will go here and write pip install and
then we will write here beautifulsoup4
so make sure that everything is
not spaced or split with dashes
and then i'm going to hit enter and then
you can see that it is installed
successfully and then the next thing
that i want to install will be something
that is going to be used from the
beautiful soup library and that is the
parser method so when you work with
beautiful soup you have to specify the
method that you are going to parse html
files into python objects okay so there
are going to be different methods to
parse your html code and i heard that
the best of them could be the lxml
parser since if you work with the
default html parser it is not going to
deal well with broken html code so just
go ahead and install the lxml parser
library and you can also do that with
pip install and then we are going to use
that when we work with the beautiful
soup so i will go here and then write
pip install lxml and then once i do that
let's wait until it's finished great so
we are ready now to go back to python
and start working with the beautiful
soup library now we have to go here and
import that beautiful soup library so it
could be a little bit confusing because
the library's folder is created as bs4
so that is why we are going to write
here from bs4 import BeautifulSoup
like this and once i have done that i
have to figure out how i'm going to
access the content inside the home.html
file that is right there inside my web
scraping directory so in order to do
that we have to work with file objects
now if you don't know how to work with
files in python that is totally fine
because we are going to go over it and
it also might be worth to check my
channel out if i have already uploaded
how to work with files in python so i'm
going to write here
with open so this is basically a
statement that will allow me to open a
file and then read the content of that
specific file so as you can see from the
auto completion i have to specify as my
first argument the file's name so i'm
going to close the parenthesis here and
then inside here i'm going to write my
html files name now since the python
file and then the home.html file are in
the same exact directory it will be okay
just to write its name so it will be
home.html
and the second argument will be the
method that you want to apply when you
open that file in that python's memory
so you have couple of options when you
work with python files you can read them
you can write them or you can do both
and if we only want to read the content
then we somehow want to specify that we
only want to read this file so we will
open here a new string
and we will write here r so what this
tells to python is basically that i'm
going to read that file only and once i
have done that i have to write here a
variable that is going to be used inside
that code block that i just created
which is the with open so i'm going to
use the as keyword and then i'm going to
create here a variable name that is
going to be used throughout the block of
the open so it will be html_file
and that will be basically my
variable name and then once i do that i
will go inside the
open block and then i will write here
content
equals to html_file.read() and once i
apply the read method i'm basically
reading the html file content and in
order to show you how this works let's
first print the content itself so i will
go here and print the content and then i
will run the main.py and then
you can see that the information that is
printed is exactly what we saw in the
home.html okay so
we kind of did a great job reading this
file now in my future episodes we are
going to read html files from real
websites but i just want to give you an
idea of how web scraping works in a very
basic way because when you work with
actual websites the scraping and the
information pulling is going to be quite
harder than the html file that i just
have written in order to explain the
idea of web scraping okay so i'm going
to continue on here and i'm going to use
the beautiful soup library in order to
prettify my html and work with its tags
like python objects so the way you can
accomplish that will be by creating an
instance of beautiful soup and i will go
here and create a new variable let's
call it soup and that is going to be
equal to a new instance of the beautiful
soup library now the arguments that i'm
going to specify here will be the html
file that i want to scrape so the
content of that will be the content
variable that is created up above and
then the second argument will be the
parser method that we want to use so we
will pass the parser method as a string
and that will be the lxml that we have
just installed previously now once i go
ahead and try to print what is inside
that soup instance it will be something
like the following so we will create
here a print statement and then we will
go with soup.prettify() so that will
allow you to see the html code in a more
pretty way and if i go ahead and run
this you can see that we see the html
content that is exactly the same like
what we saw in the home.html so we have
done a great job until now so let's
minimize back our terminal and now we
are going to get more familiar with the
special methods that are created inside
the beautiful soup library so we are
going to delete the print from here and
we are going to start working how we can
grab some specific information that we
want to grab so let's assume that we
want to grab all the html tags that are
created as h5 tags which is a kind of
header tag so we will go here and create
a new variable let's call it tags for
example and then we will go with soup
dot find and then once i go with find it
is going to search for the specific html
tag that i'm going to specify here as a
string so if i go here and write h5 and
then down below i go ahead and print the
tags the results of that will be
something like the following now you can
see that we have the entire html tag for
the h5 tag as you can see that its text
is python for beginners but if you
remember we have more than one h5 tags
that are created inside our home.html
file so if you remember from the home
file there is one here there is the
second one over there and there is the
third one over there and what that means
it means that the find method searches
for the first element and then it stops
the execution of searching for the html
tag that you are looking for now if you
want to change this behavior and not
only grab the first element then
basically you have to change your method
into find_all okay so that
will search for all the h5 tags inside
the content and now if i go ahead and
run that out then you can see that the
result here is quite different as we
have here a list and then you can see
that it has
python for beginners and then also
python web development and then also the
python machine learning now that could
be a great logic to bring you back all
the courses names from that webpage so
you can go here and change this into
courses_html_tags okay so this is what the h5
tags are actually responsible for and
now i can write here some different code
that will allow me to see all the
courses that are defined on our page so
we have python for beginners and then we
have python web development and then we
also have
python machine learning so we can work
with this courses_html_tags that stores
all the h5 html tags and write a nice
program that is going to display all the
courses so we can actually create here
an iteration over the courses_html_tags
because it is a list so we will go
here with for course in courses_html_tags
and then inside of that course tag
that we are iterating we can bring only
the text attribute which is going to
display the course text itself so it
will be here course.text
and now if i go ahead and run
our program then you can see that we
have a nice output regarding all of the
courses that are available from that
page so this could be a nice starter to
understand how you can scrape a web page
to grab some specific information you
want all right so we were able to
understand how we can apply some basic
scraping to a web page but when you are
going to deal with real websites the
html code is not going to be quite
friendly and simple like we had here so
in order to be able to access the html
code behind the scenes of some page we
have to use the inspect of any browser
so let's say that you want to grab the
price for each of the courses so it
makes sense to go with your mouse and
hover to that button and then right
click on it and then you want to look
for that inspect option and once you
open that out you will have a new pane
that is going to be opened and then here
we can see all the html code that is
responsible to display what is going on
on the left pane so you can see that we
have here let's make it a little bit
bigger
so that will be enough and then you can
see that we have here div class card
three times which is displaying all the
different courses now when you go over
different html tags with your mouse you
can see that it is going to mark for you
the html tag that is related to it so it
is a quite important behavior that we
should understand
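
The steps covered so far (reading the file with `with open`, building the soup, and the difference between `find` and `find_all`) might be put together as in the sketch below. The inline HTML is a trimmed stand-in for `home.html`, `load_soup` is a hypothetical helper, and `html.parser` replaces the `lxml` parser from the video so that the sketch needs no extra install.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Trimmed stand-in for the home.html page used in the video.
SAMPLE_HTML = """
<html><body>
  <h1>Hello, Start Learning!</h1>
  <div class="card"><h5>Python for beginners</h5></div>
  <div class="card"><h5>Python Web Development</h5></div>
  <div class="card"><h5>Python Machine Learning</h5></div>
</body></html>
"""


def load_soup(path):
    """Read an HTML file and parse it into a BeautifulSoup object."""
    with open(path, "r") as html_file:
        content = html_file.read()
    # The video passes "lxml" here; "html.parser" ships with Python.
    return BeautifulSoup(content, "html.parser")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "home.html"
        path.write_text(SAMPLE_HTML)
        soup = load_soup(path)

    # find() stops at the first matching tag.
    print(soup.find("h5").text)

    # find_all() returns every matching tag, so it can be iterated.
    for course in soup.find_all("h5"):
        print(course.text)
```
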
now let's say that we want to grab the
price for that python for beginners so
it makes sense to expand this tag and
see what is inside so i will go here and
search for that button and you can see
that this a tag is actually responsible
for that
button itself and then you can see that
its text is start for twenty dollars so
the price information is right there and
let's actually write a program that is
going to search for that python for
beginners and then we will grab the
price for that course and then we will
be able to write a nice program that is
going to include a list of all of the
courses and then the prices for each one
of them so let's go back to pycharm and
write this program so we will go here
and delete everything from here and the
first step that we probably want to do
is to be able to grab all the course
cards so it will be course_cards
equals to soup.find_all because we
probably are looking to bring us back
all the cards so this is why you have to
use find_all and not find and i'm
going to search for the div tags now it
could be much nicer if we could filter
the div tags that we actually want to
grab and store it inside our course
cards so if you noticed let's go back to
our courses page and here if i just
expand back there all the div tags you
can see that there is something that is
common for all the div tags their class
is equal to card so i can filter my div
tags by this expression right there so i
go back to pycharm and i will write here
class equals to card but now you can see
that there is an error and it is quite
important behavior to understand you
have to apply here the underscore
because the class is a built-in keyword
in python where you create python
classes so that is why you have to add
the underscore over here and then the
beautiful soup will understand that you
are relating to the class of the html
attribute okay so it is important now
since we have all the course cards
stored right in this variable then we
probably want to iterate over this list
and then search for the course name and
then the course price so let's see how
we can do that for each of our course
cards so we will start with
a for loop here and that will be for
course in course_cards and before we go
ahead and write some more code inside
our for loop let's actually remind you
what is inside each of our courses and
then you can see that we have h5 tags on
each of our course cards and it makes
sense to access this specific h5 tags so
we can accomplish that by going here and
then use the h5 tag as an attribute so
if i go ahead and press here dot h5 and
re run my program then you can see that
we were able to grab each of our h5 tags
that are inside the course card so it is
a quite great thing and now
if i revert this back to course again
and run that out you can also see that
inside our a tags we have the text for
start for 20 dollars and that is
repeated for all of our cards as well so
first of all it makes sense to delete
this again and write here something like
course_name
equals to course.h5
and then here we probably look
for the text attribute of that h5 tag so
i will write here .text and then this
course name will be responsible to store
the text
on each iteration so it is great and now
i can go here and
write course price and then this time i
will search for course dot a because the
a tag stores the information about the
course price so until now if i go ahead
and print the course name and then i
also go ahead and print the course price
then we will see the results like the
following so you can see that we have
python for beginners and then we have
the a tag itself but in this case we
look for the text of that a tag as well
so i will
minimize my terminal out and excuse me
for that i will delete that from here
and then search for the text attribute
over here as well and now i will run my
program and then you can see that we
have python for beginners and then we
have the text for each of our a tags and
now since we reached this stage it might
be a great idea to print a sentence
like python for beginners costs 20 okay
so the way we can do that
is basically using the split method to
access that last element of that text
because the price is located as the last
word so it makes sense to go here with
split and then we will split it by the
blank so we don't have to specify
anything here and we want to grab that
last element so we are looking for -1
index over here and now if i run it you
can see that we have the price
for each of our courses and now it might
be much nicer if we go ahead and use an
f-string to print a dynamic sentence for
each of our courses so we will go here
with print and then we will open an f
string and then we will access the
course name so it will be course_name
and then we will write
costs and then we want to display the
course price so it will be course_price
now if i run our
program then you can see that it
displays a nice information about each
one of the courses
now if you think about it that is a
quite nice behavior that we have applied
here because if you scrape a real
website like udemy that keeps updating
courses then it might be a great idea to
launch this program every certain amount
of time for example each week and then
you have the ability to be aware about
each of the courses that udemy has
updated on the webpage that you scrape
on so this is a quite nice behavior that
we were able to reach here
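
The course-card loop built up in this part can be sketched as follows. The inline HTML is a shortened stand-in for the page (the second price is a placeholder), `course_prices` is a hypothetical helper, and `html.parser` is used in place of `lxml`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for home.html; the second price is a placeholder.
SAMPLE_HTML = """
<div class="card">
  <div class="card-body">
    <h5>Python for beginners</h5>
    <a href="#">Start for 20$</a>
  </div>
</div>
<div class="card">
  <div class="card-body">
    <h5>Python Web Development</h5>
    <a href="#">Start for 50$</a>
  </div>
</div>
"""


def course_prices(html):
    """Return (course name, price) pairs, one per card."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # class_ carries a trailing underscore because `class` is a Python keyword.
    for course in soup.find_all("div", class_="card"):
        course_name = course.h5.text
        # The price is the last whitespace-separated word of the link text.
        course_price = course.a.text.split()[-1]
        results.append((course_name, course_price))
    return results


if __name__ == "__main__":
    for name, price in course_prices(SAMPLE_HTML):
        print(f"{name} costs {price}")
```

Note that `class_="card"` matches only tags whose class list contains exactly `card`, so the inner `card-body` divs are not returned.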
on this one we are going to scrape real
websites with the requests library so i'm
going to simulate this against a website
that is going to search for job
advertisements and i'm going to bring
all the jobs from a specific website
that their main skill requirement is
python programming language and i'm
going to write a program that is going
to pull the latest published job
advertisements from a specific website
so it is going to be very interesting so
let's get started all right so one of
the first things that we must do is to
ensure that we have the requests library
installed so i'm going to go down to my
terminal right in pycharm and i'm going
to write here pip install requests just
to make sure that i have the requests
library installed now the output for
myself could be different than yours
because you may not have the requests
library but since i already have that
you can see outputs like requirement
already satisfied okay so it is quite
important now i'm going to minimize the
terminal and write here import requests
so you want to make sure that you do
that after the installation of this
library and the first thing that i'm
going to do here is to use the get
method of the requests library now what
the requests library is doing behind the
scenes it is just requesting information
from a specific website so it is like a
real person
going to a website and requesting some
information okay so you can go with
something like the following when it
comes to the requests library so it will be
requests.get so you want to get
specific information from a website and
here we are going to provide an empty
string for now but later on we are going
to complete this string with the url
that we are going to web scrape against
it and i'm going to assign this to a new
variable and i will call it html text so
i'm going to make that to be equal to
this entire statement now let's go to a
web browser and look up for the website
that is going to include some job ads
okay so this is timesjobs.com and this
website includes job posts about almost
everything so you can simply go down
here and search for some skill that you
own and then this will search for you
jobs that are requiring this specific
skill in that position now this video is
recorded a couple days before when i
uploaded it so if you watch this video
after a couple of months or even a year
or two since the publish date then there
is a great chance that the html elements
are going to be quite different but the
main point of this video is to teach you
all the tools to pull information from a
website just as you want and then you
can apply your own customizations and
kind of doing a reverse engineering to
the code that i'm going to write
throughout this tutorial great so let's
go here and write python so i will
receive only job posts about this
programming language and you can see
that we have this job found over there
and we have a lot of jobs that are
published so my goal here in this
tutorial would be to
let's get this closed so my goal in this
tutorial will be to bring all the jobs
that are posted a few days ago so if i
am zooming here in then you can see that
we have posted a few days ago for a
couple of posts but after i reach down
here we have posted four days ago so
this might mean that this job post is
not the most updated so i'm going to
bring all the jobs and i'm going to
condition my program to bring those
elements with the posted few days ago
text only so let's go back to here now
i'm going to bring this url from here
and i'm going to paste that in in the
empty string that we created inside the
requests.get and once i have done that
what is going on inside this variable
right now is simply the request status
code okay so if i'm going to
print the i mean if i'm going to run
this program then we are going to see
the results like the following so 200 is
the convention number in web that the
request is done successfully but in
order to avoid the status code we are
going to go to here and i'm going to
accept the text only so i'm going to go
here and then write dot text okay so
this is what we have to apply here in
order to bring the html text of that
specific page and now it makes sense to
leave this variable name as it is
because it is storing the html text and
i'm going to re run this program and we
will probably receive a large
information of html so right now it is
not quite relevant but i'm just i just
wanted to show you the results so let's
continue from here okay so as you know
we are going to
create a beautiful soup instance like we
did in the previous episode and i'm
going to provide the html as the html
text variable so it will be soup equals
to
an instance of a beautiful soup and then
i'm going to write here html text as my
information that i want to scrape and we
are going to use the same parser again
like the previous episode so it will be
lxml now once i have done that it makes
sense to go back to our page and see how
we can grab only this each paragraph
from this website so the white boxes are
kind of a list of elements that this
page has provided here and i want to
look for a method that is going to bring
me all the job posts so it makes sense
to catch a certain element inside that
post and right click on it and then
click on inspect and once i have done
that you can see here so i'm going to
zoom in things a little bit
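
The fetch-and-parse step described above might look like the sketch below. The helper names `fetch_html` and `make_soup`, the `timeout` value, and the `raise_for_status()` call are additions of this sketch and do not come from the video; the URL is left as a parameter because the search address is not spelled out in the transcript. `html.parser` stands in for `lxml`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Request a page and return its HTML as text."""
    # Printing the response object itself shows the status, e.g. <Response [200]>;
    # the .text attribute holds the HTML body.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # addition: fail loudly on 4xx/5xx responses
    return response.text


def make_soup(html_text):
    """Parse HTML text into a BeautifulSoup object."""
    return BeautifulSoup(html_text, "html.parser")
```

A call such as `make_soup(fetch_html(url))`, with `url` set to the search-results address copied from the browser, would then give the `soup` object used in the following steps.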
so we can see that the h3 class is
pointing to that
text over here i know that the text is a
little bit small here but just you can
see that it has a gray mark and i'm
going to go up here and then you can see
that those elements are opened up as
well so if i hover my mouse here then
you can see a green background wrapped
in the article over here i mean the
paragraph and then if i close that up
you can see that we have a lot of clear
fix job dash px and something like that
that its name is the class and our html
element here is called li so li stands
for list item and then you can see that it is
inside a ul tag so this is standing for
unordered list and it is containing a
lot of
list tags inside that ul so you can see
once i close that then the entire
list of all the posts are marked with a
blue
background so i'm going to search the
element of li with that name of class so
i'm going to copy the name of the class
here and i'm going to go back to my
pycharm and i'm going to write here jobs
equals to soup dot find
underscore all and i'm going to search
for all the li's and as the second
argument it makes sense to pass here
class underscore equals to and then
inside that string i'm going to paste
that in the class name that we have
copied from the page itself so once i
have done that then we will probably see
the results of all the jobs in that page
now this doesn't mean that it is going
to bring back all the
16 000 jobs because you can see that
this page is being paginated so that
means that it is going to bring the
results only for the first page so this
is not going to take extremely long now
if i go back to here and print the jobs
then let's see the results before we
continue on just to make sure that
everything is okay so we can see that we
receive the results and then we see that
we have some company names and i think
that everything is quite great here now
in order to work with this
scraping project it makes sense to only
work with only one job element so i'm
going to
delete the underscore all from here and
what this means it means that it is
going to bring the first match that sees
the li tag and then the class name as
this string over here so let's change
this variable name just to job for now
okay just in order to develop our
program slower
relying on only one job post okay so
once we've done that we probably want to
search for the company name of that
specific job post so i'm going to go
back to here and i'm going to make
things bigger over here and now let's
actually go here and try to inspect what
is going on
here again so let me
zoom that out great now i'm going to
try to inspect this text over here again
and then we can see that it is inside
the li tag for sure but we can also see
that it is inside an h3 tag and it has
the class name of job list comp name so
i'm going to search for that class in
the entire page as well but speaking
about the entire page so let's go to our
pycharm you want to search for that
specific element only inside the job
itself so you see it doesn't make sense
to search for an h3 tag in the entire
page again so you can basically go with
job dot find besides soup dot find
because we want to search for that h3
tag only inside our job so if i go ahead
and print the job here then we can see
that it only includes an html code about
only one job and i'm going to search for
this h3 tag
so let's create here a new variable and
i'm going to call that company
underscore name and we are going to use
job.find
and we are going to accept here as an
argument the h3 and then this time the
class underscore is going to be equal to
whatever this h3 tag includes as the
class name which is the job list comp
name now to debug this out and to ensure
that the results are great we are going
to print the company name
and then you can see that we receive
this
this element back and i'm going to use
here the dot text method just to bring
back the text itself now once i do that
we are going to see a weird result here
now you can see that we have some white
spaces so we kind of want to replace our
white spaces with nothing so in order to
do this one i'm going to go here and i'm
going to use the replace method and this
trick is going to avoid having this not
necessary white spaces so i'm going to
replace the spaces with nothing so i'm
going to just write here double quotes
twice i mean single quotes twice and
once i have done that and rerun our
program then you can see that the result
is going to be quite different as you
can see this text is fully aligned to
left now let's minimize back and
continue from here now we're going to
zoom out the code a little bit here just
so we can see the important points like the
replacement
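The company-name lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the sample HTML is invented, the hyphenation of the class name (`joblist-comp-name`) is an assumption because the transcript only gives it in spoken form, and `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical job card standing in for one <li> from the results page.
# The class name spelling is an assumption; check it in the browser inspector.
SAMPLE_JOB_HTML = """
<li>
  <h3 class="joblist-comp-name">
      Example Corp
  </h3>
</li>
"""

# "html.parser" ships with Python; the tutorial installs lxml instead.
job = BeautifulSoup(SAMPLE_JOB_HTML, "html.parser").find("li")

# Search inside this one job only, not the whole page.
company_tag = job.find("h3", class_="joblist-comp-name")

# .text returns the raw text, including the whitespace around it.
raw_name = company_tag.text

# replace(" ", "") deletes every space character, even those between words.
company_name = raw_name.replace(" ", "")

print(repr(raw_name))
print(repr(company_name))
```

Note that removing every space also joins multi-word names together, so this only tidies the output visually.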
and let's continue from here now it also
makes sense to bring the skill
requirements other than the python
programming language because we know
that this job is only for people who are
good with the python programming
language so i'm going to go here and i'm
going to repeat myself in the same
process again and i'm going to write
here job.find and we are probably
looking for an element that includes
text about the skill requirements so
let's search for that okay so let's go
back to our website again and i'm going
to
go here and check out what html element
is including the skills so we are
talking about this one so i'm going to
inspect inside here and we can see here
that this text is inside a span tag
with the class name of srp-skills so i'm
going to copy this class name again and
this time i'm going to search for the
span elements inside my job post so i'm
going to go back to pycharm again and
i'm going to write here span so this is
the html tag that we are searching for
and again i'm going to write class
underscore equals to that srp skills now
i want to ensure the results over here
once again so you always want to
quickly print the results of whatever
html element that you want to pull to
see what other methods you have to apply
to prettify your result okay so let's
run our program again and it makes sense
to delete the print company name so
let's re-execute our program
and then you can see here that we have
a span tag and then here we have a
strong tag which is basically used to
make text bold when we want to
emphasize something so i'm just going to
guess here that i'm going to
only write here dot text and then i
expect for the results to be fine so
let's check out for that and then you
can see here that the results are quite
great so we have the python scripting
and then we have some more requirements
that are separated with commas and a lot
of white spaces again so i'm going to
apply the same method of that replace
once again like we did with the company
name so let's write here dot replace and
i'm going to replace white spaces
with nothing so let's re-execute that
and then we can see that the result
is just what we want and now we were
also able to grab the skills as well so
this is quite nice now if we want to
display the information we have about the job
so far then we want to go with a nice
print message here so let's try to
create a nice message so we will use an
f-string here and we will also use
triple quotes just to allow us to
write some text on separate lines as
well and i'm going to write here company
name
like this and then i'm going to write
here company name so i'm inserting the
company name value by writing it inside
curly brackets and i'm going to repeat
the same process for required skills so
it will be required skills and then i'm
going to make that to be equal to skills
variable and now if i go and execute our
program let's see if the results are
quite nice
yes so we are receiving nice
information about the job okay so
this is quite great
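Putting the two lookups and the print message together, a sketch under the same assumptions (invented sample HTML, assumed class-name spelling for the company heading, `html.parser` instead of lxml) might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical job card; real pages contain far more markup.
SAMPLE_JOB_HTML = """
<li>
  <h3 class="joblist-comp-name">Example Corp</h3>
  <span class="srp-skills">
      <strong>python</strong> scripting  ,  linux  ,  django
  </span>
</li>
"""

job = BeautifulSoup(SAMPLE_JOB_HTML, "html.parser").find("li")

company_name = job.find("h3", class_="joblist-comp-name").text.replace(" ", "")

# .text flattens nested tags such as <strong>, so no extra lookup is needed.
skills = job.find("span", class_="srp-skills").text.replace(" ", "")

# A triple-quoted f-string can span several lines; names in {} are filled in.
message = f"""
Company Name: {company_name}
Required Skills: {skills}
"""
print(message)
```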
now if we go back to here then we want
to search for one more element so
you remember that i told you that we
only want to grab the
job post with the text of posted few
days ago so we for sure want to write
some extra code to apply this
functionality so i'm going to go here
and i'm going to inspect for that
element again
and then we can see that it is inside a
span once again but i can also see that
this
job post includes some more span
tags so i have to filter out the results
again with the class name itself so i'm
going to search for that sim posted
class name and i'm going to go back to
here so we will write this time
job
published date
so it makes sense to delete the job
excuse me so it is just going to be
published date and i'm going to go here
again with job.find
and we will search for the span and then
this time the class underscore is going
to be equal to the text that i just
copied and i'm going to repeat myself
with printing the published date but
that time let's just avoid printing this
print line so i'm just going to comment
out those lines and let's see what the
published
date text looks like and you can
see that we have here something
a little bit weird so we have the span
here and we also have one more span
nested inside it so what that
means is that we have to take a
different action than what we did
previously so this time i want to use
the .span attribute just to get
inside that nested tag over here and then right
after it i want to look for the text of
that span tag so this will give me the
published date of this specific job but
i'm not going to include the published
date inside my print message because we
only want the published date for the
functionality that skips a job if
the published date text does not include
the word few and i'm going to code
this functionality just in a second so
you will see what i mean by what i said
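The nested-span lookup just described can be sketched as below. The sample HTML is invented for illustration; only the `sim-posted` class and the span-inside-span shape come from the walkthrough.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the posted date sits in a span nested inside
# the span that carries the sim-posted class.
SAMPLE_JOB_HTML = """
<li>
  <span class="sim-posted">
    <span>Posted few days ago</span>
  </span>
</li>
"""

job = BeautifulSoup(SAMPLE_JOB_HTML, "html.parser").find("li")

# find() returns the outer span; .span steps into the first span inside it,
# and .text reads that inner span's text.
published_date = job.find("span", class_="sim-posted").span.text
print(published_date)
```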
all right so what i'm going to do here
is take a tricky action that is going to
bring me all the jobs from the first
page so if we pay attention then all
the job posts include this class name
so what i can do instead of find is
change that back to find_all and
change this variable name to jobs and i
know that right now this raises an
error here so i'm going to use here a
for loop that is going to iterate over
each element and i'm going to write here
for
job in jobs and then i'm going to create
an indentation of the entire code that
is right there so the results will be
applied for all the jobs that are posted
in the first page of the
web page that we scrape so once i hit
here the colon sign then i'm going to
create an indentation for each of our
lines like this and then the results are
going to be quite the same so let's test
that out okay i'm going to
uncomment our print line over here and
just for comfort reasons i'm also going
to print here an empty line so we can
kind of see a division between the
different jobs and then i'm going to
delete the published date for now so if
we execute our program
this time then we are going to see
nicer results and this is going to
contain all the job posts
from the page that we scrape so
you can see that we have a nice
paragraph for that job post and then we
have also another one here and if i keep
scrolling up we can see a lot of them in
that output so this is quite great so if
you remember we wanted to filter out the
job posts that do not include the
word few inside the published date
because what that means it means that
this job could be outdated so if i go to
our page again then we can see that as i
keep scrolling down we have some
text like posted six days ago and i
wanted to keep only the jobs that
contain the text posted few
days ago so in order to apply this i'm
going to change the orders here a little
bit okay so i'm going to
cut this searching here and i'm going to
paste that in as the first line inside
my for loop now the reason i'm doing
this it is basically because i don't
want to continue on scraping for that
post if the publish date is not matching
my condition so it makes a lot of sense
to place this code as the first line
inside my for loop and then right here
i'm going to
write a condition that is going to check
if the word few is inside that text
so it will be if
few in
published date and again i'm going to
create an indentation for the entire
code here so you can do that with the
shift alt combined and then you can just
press tab and all the lines here are
being indented so right now if i go
ahead and execute our program then we
should see the results again like almost
the same but we also see here that the f
string is not quite nice
but i can live with that okay so it is
great that we were able to receive the
posts only that have been published few
days ago now there is no limit for what
you can do when it comes to web scraping
and what you can filter in or filter out
but basically this program deals with
how to grab some job posts with the
filters that you want to apply that
sometimes may not be available
on the website itself so you can write
your own filters in your python code
while you scrape some information from a
specific website
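The loop and the date filter built up in this part can be sketched as follows. The page HTML is invented, `job-card` is a hypothetical placeholder for whatever class the real job `li` elements share, and the company class spelling is an assumption as before.

```python
from bs4 import BeautifulSoup

# Hypothetical results page with two job cards.
SAMPLE_PAGE_HTML = """
<ul>
  <li class="job-card">
    <h3 class="joblist-comp-name">Alpha Ltd</h3>
    <span class="srp-skills">python, django</span>
    <span class="sim-posted"><span>Posted few days ago</span></span>
  </li>
  <li class="job-card">
    <h3 class="joblist-comp-name">Beta Inc</h3>
    <span class="srp-skills">python, linux</span>
    <span class="sim-posted"><span>Posted 6 days ago</span></span>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_PAGE_HTML, "html.parser")

# find_all returns every matching element instead of only the first one.
jobs = soup.find_all("li", class_="job-card")

recent_companies = []
for job in jobs:
    # Check the date first so older posts are skipped before any other work.
    published_date = job.find("span", class_="sim-posted").span.text
    if "few" in published_date:
        company_name = job.find("h3", class_="joblist-comp-name").text.replace(" ", "")
        skills = job.find("span", class_="srp-skills").text.replace(" ", "")
        print(f"Company Name: {company_name}")
        print(f"Required Skills: {skills}")
        print("")  # blank line between jobs
        recent_companies.append(company_name)
```

Only the first card passes the filter here, because the second one reads "Posted 6 days ago".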
so i'm going to do whatever it takes to
turn this program into a very useful one
and i'm going to do that by applying
some special functionalities such as
wrapping this entire program in a while
loop and executing this project every
certain amount of time and also apply
some filters to filter out the job
posts that do not match the skills
that i have and also i'm going to throw
the results of the different job posts
into a new blank file so i can be aware
of the posts that are being posted every
certain amount of time so let's get
started all right then so let's start
with a kind reminder of the results that
we got until that point so
we run our program now and if we show it
right here you can see that those lines
are not aligned well so i'm going to
change that and i'm also going to
provide some extra information that will
show us the exact
link of the specific job that we are
iterating on so that way i will have the
ability to just click on the link and
then see more information about that job
so to begin with i will get rid of the
formatted string in that case because
doing a formatted string with a triple
quote might not be a great idea when you
execute it with a for loop because as
you can see that it also includes the
indentations right here so i'm going to
delete this entire code here and i'm
going to write two more new formatted
strings and we will start with company
name
make that to be equal to company name so
make sure to add a colon here so it
will be more friendly and then i will
write here required skills as well and
then we will write here the skills
variable now there was one more issue
with the result that we showed a minute
ago and that was the blank spaces that
are being shown as well so we can get
rid of the spaces by a special method
that is called strip and it is a
method that you are allowed to call
on strings and since the company
name and the skills are strings by
default i don't have to convert them to
a string so i can just call that method
like this okay and now i will show the
results of
something like the following and in a
few seconds we will see that
this is aligned way better than what it
was and i'm also going to add here one more
information line that will show the link
of the job post so let's do that okay
let's go here and write this
functionality okay so we had an
unordered list that inside of that we
had some different html tags that are
called li and that stands for list item and
they are actually different job posts
that are divided into different elements
inside an unordered list and then if we
hover our mouse you can see that there
are different jobs now if i go inside
one of them and i go inside a header tag
that is actually the first child of the
li tag and then i will
go inside the h2 here and then you can
see that we have a link that could lead
us to a page that provides some extra
information about that specific job so
if i actually
go here and click on here you can see
that we receive the job description
right here so what we have to do in
order to access this link in each job
post that we are iterating on in the python
code is actually going inside the header
and then going inside one more tag
the h2 as you saw me doing
and then accessing that a
tag so let's do that okay i'm going to
go back to pycharm and apply this
functionality so we will go under the
skills and then we will write here more
info and that will be equal to job
dot header because this was the first
tag that we want to go inside of it and
then we want to go inside the h2 and
then inside that h2 we want to go inside
the a tag now before we go further let's
check that we have done a great
job so let's print the more info in the
following way so it will be more info
and then we will call the variable in a
formatted string now let's execute our
program
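A sketch of the two changes made in this part: `strip()` for the stray whitespace and tag-by-tag navigation to the link. The sample HTML and the example.com URL are invented, and the company class spelling is an assumption. The last line reads the `href` attribute with dictionary-style square brackets, which is how Beautiful Soup exposes tag attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical job card with a header > h2 > a link, as seen in the inspector.
SAMPLE_JOB_HTML = """
<li>
  <header>
    <h2><a href="https://example.com/jobs/123">Python Developer</a></h2>
  </header>
  <h3 class="joblist-comp-name">
      Example Corp
  </h3>
</li>
"""

job = BeautifulSoup(SAMPLE_JOB_HTML, "html.parser").find("li")

company_name = job.find("h3", class_="joblist-comp-name").text

# Attribute-style navigation: each name returns the first child tag of that type.
link_tag = job.header.h2.a

# strip() trims whitespace at both ends but keeps the space between words.
print(f"Company Name: {company_name.strip()}")
print(f"More Info: {link_tag}")          # the whole <a> tag
print(f"More Info: {link_tag['href']}")  # only the link target
```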
and then you can see that inside the
more info we have the a href which gives
us the link about the
specific job that we are iterating on so
all i have to do here is going back to
my more info and then call that href
attribute so this time i'm going to do
that with a square bracket like in
dictionaries and then i'm going to write
here href so i will receive the value of
that attribute so if i run that one more
time
then i should see the link only and that
is exactly what is happening so the result
is quite great and then you can see that
this is already better than what we did
in the last episode and we will continue
from here okay so what i want to do now
is giving the opportunity for the user
that executes this program to filter out
some skill requirement that they do not
have so we will use the input function
for that and then whatever the input is
equal to we will filter out the results
from the jobs that we are finding
right here okay so let's write this
functionality so to apply this i'm going
to create a new variable over here and
i'm going to call it unfamiliar skill
and i'm going to make that to be equal
to an input and then i'm going to
write here something like this okay so
the user could understand that he has to
provide some information in order to
execute this program and actually it
might be a great idea to print some
extra information before that input
function so it will be print
put some skill that you are not
familiar with and then right after the
unfamiliar skill input i will write here
filtering out
and we will actually make that a
formatted string and then we will write
here filtering out and then whatever the
unfamiliar skill is equal to
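The prompt described above could be sketched like this. The function name, the `read_input` parameter, and the `">"` prompt text are illustrative assumptions; wrapping the prompt in a function simply makes it easy to try without typing.

```python
def ask_unfamiliar_skill(read_input=input):
    """Ask the user for one skill to exclude and echo the choice back."""
    print("Put some skill that you are not familiar with")
    unfamiliar_skill = read_input(">")
    print(f"Filtering out {unfamiliar_skill}")
    return unfamiliar_skill
```

Calling `ask_unfamiliar_skill()` with no arguments reads from the keyboard through the built-in `input`.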
now what are we going to do with this
unfamiliar skill variable so that is
quite easy right we have to search for a
condition that will filter out the job
post that is including that word that we
are going to provide here as an
unfamiliar skill and what we can
actually do is search for the unfamiliar
skill word inside the skills string so
if you remember the skills is a long
string that is divided with commas so we
can go with a condition like the
following so it will be if unfamiliar
skill
not in the skills that we are
grabbing from each job post that we
are iterating over and now all we have
to do here is indent
the different print lines okay so
now i should see the job posts that are
not including the unfamiliar skill that
i'm going to provide so just to test
that out let's
run our program twice okay so in the
first we are going to write here linux
as a skill that i'm not familiar with
and you can see that we don't see
anything that is including the keyword
of linux over here but let's actually
take that to a next level and test that
out so we see here a specific job post
that is including django so let's say
that i am not familiar with django and
see next time if i see that job post
with this company so let's re-execute
our program and that time i will write
django
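The condition being tested here is a plain substring check on the skills string. A small sketch with invented job data (the helper name `keep_job` is hypothetical):

```python
def keep_job(skills, unfamiliar_skill):
    """Return True when the unfamiliar skill does not appear in the skills string."""
    return unfamiliar_skill not in skills


# Invented (company, skills) pairs standing in for scraped jobs.
scraped_jobs = [
    ("Alpha Ltd", "python,django,rest"),
    ("Beta Inc", "python,linux,bash"),
]

without_django = [name for name, skills in scraped_jobs if keep_job(skills, "django")]
without_linux = [name for name, skills in scraped_jobs if keep_job(skills, "linux")]

print(without_django)
print(without_linux)
```

Because `in` on strings matches substrings and is case-sensitive, typing `Django` would not exclude a job that lists `django`.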
and let's see the results so we can see
that we don't have any job with django
but we do have linux that time so
this condition works well and we will
continue on to next step from here now
what could be an exciting challenge for
you guys is to write an algorithm that
will accept more than one unfamiliar
skill so you want to accept multiple
inputs from a user and it might be more
challenging but i think you should try
to spend some time on something like
this because i think this could be an
amazing challenge for everyone who is
watching this video all right so now we
are going to save each job post in a
different file so instead of printing this
in the terminal we are going to
write all of this information in a
separate file and then i will also
allow this program to run every 15
minutes or every 10 minutes up to you
and i will show this logic as well so
first of all it makes sense to wrap
our entire program in a function and i'm
going to do that by collecting
everything that is kind of pulling the
information from the website and i'm
going to indent everything one step
aside and then i'm going to write here
def find
jobs okay so that way we have one
function that executes our main program
and then what i'm going to do here is
using the logic of if double underscore
name is double underscore main so that
way if you want to extend this program
only if this file is run directly then
this function will be executed now if
you don't know what i said about if
double underscore name equals double
underscore main then i have a video that
explains this condition so you can check
that out by the suggested link above so
let's write here if double underscore
name
equals to double underscore main inside
a string and then right here while true
so i want to run this program forever
and then i will call the find jobs and
right after it since i don't want this
program to be executed like every
millisecond then i'm going to write here
time dot sleep so time dot sleep allows
your program to wait a certain amount of
time that you decide and you can provide
its argument in seconds so i'm going to
write here 600 just to make that program
to run every 10 minutes but you can
notice how we did not import the time
library so let's do that by
import time okay and then this program
should be okay now to make this more
dynamic i actually prefer to
make a variable here called time_wait that will be
equal to 10 and then i will just pass
time.sleep the value of time_wait
multiplied by 60 and right after it we
can provide some extra information
excuse me this should be over here and
we can write here waiting
let's make it formatted
then we can write here waiting
time_wait seconds
and let's write three dots here great so
this is great so if i'm executing this
program i expect to see this program
running every 10 minutes so i'm
inside my command line interface and you
can see that my directory has been
already set to the directory where we
worked so i can go with python and then
execute the name of the file by calling
it so it will be main.py and then
once i run that you can see that we
receive
this
output and then i have to provide some
information that is going to be filtered
out and then let's write here django
again and
you can see that we receive the results
successfully but more important we see
that waiting 10 seconds which is not
great we have to change that to waiting
10 minutes because
we are waiting 10 minutes right but the
program works great it was just my
mistake by writing here seconds so it
should be minutes for sure but i'm not
going to wait 10 minutes until this
program is running one more time and so
i will allow myself to move on to
writing this information inside a file so
it makes sense to write this kind of
information in a separate directory so
i will go inside my web scraping tree
file i mean folder and then i'm going to
create here new directory which is going
to be named as posts and then i'm going
to
write here some extra functionality that
will create
files i mean text files then
and then inside each text file i'm going
to write this exact information so you
can do that with with open i already showed
you how you can do that in the first
episode now i know that i don't have any
separate tutorial about working with
files in python but you may want to
check out my channel maybe i will upload one
very soon so
you can go here and
that time i want to put here information
and i will call my post directory and
then inside here i have to provide my
file name that i'm going to create now
before i move on here i talked about
changing my for loop here and use the
enumerate function now enumerate
function is going to allow us to iterate
over the index of the jobs list and also
the job content itself and so i have to
provide here one more variable like
index so the index is going to be a kind
of counter for the job that i'm
iterating on and then the job variable
will refer to the job
beautiful soup object itself and so it
makes sense to name our files with the
index of the job that i'm iterating on
so i will change this into a formatted
string and then i will write here index
dot txt so it will be something like the
following and i expect each of my text files
to be named like
0.txt or 1.txt and so on now the second
argument will be the mode
that you want to use when you create or
open a new file and this time i'm going
to write here w and that stands for
writing inside the file and then i have
to use the as statement and i'm going to
use the f variable so inside that block
i can write to a file with the f
variable and i'm going to go inside my
with open and i'm going to create
indentation of the prints and i'm going
to delete this print line here and it
makes sense to remove this blank space
as well and all i have to do here is
changing this print statement to f dot
write and then that time i'm not going
to print the results in the command line
interface instead i'm going to write the
information in a new file so i'm going
to use the combination of alt shift here
and i'm going to change those entire
three prints to f dot write okay and
then i'm going to open the parentheses
so it will be closed by those and then i
expect each job to be
written inside a file and once i do that
it might be a great idea to print a
sentence like file
saved and then you can provide the name
of the file as an extra information so i
will create one more time formatted
string and then i will relate to that
index
variable and now our program is complete
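The scheduling wrapper described earlier (one function for the scraping, called in a loop with `time.sleep` in between) can be sketched as below. `find_jobs` is only a placeholder body here, and the `run_repeatedly` helper with its `max_runs` and `sleep` parameters is a hypothetical addition so the loop can be tried without waiting; the walkthrough itself uses a bare `while True` under `if __name__ == '__main__':`.

```python
import time


def find_jobs():
    """Placeholder for the scraping code built up in the walkthrough."""
    print("Scraping job posts...")


def run_repeatedly(task, time_wait, max_runs=None, sleep=time.sleep):
    """Call task, then pause time_wait minutes, forever or max_runs times."""
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        # The message says minutes because time_wait is given in minutes.
        print(f"Waiting {time_wait} minutes...")
        sleep(time_wait * 60)  # time.sleep expects seconds
    return runs
```

In a script, `run_repeatedly(find_jobs, time_wait=10)` placed under `if __name__ == '__main__':` would run the scraper every 10 minutes until interrupted.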
so let's check it okay let's go back to
our command line interface and let's
actually control break this program and
let's write cls to clear our terminal
and then i'm going to re-execute my
program so it will be python
main.py
and then i'm going to
execute it so let's see this time i'm
going to write django as well
and that time i don't expect to see
output for the information instead i
expect to see
this okay so let's see
what is inside each of our files so
let's see what is inside that post
directory okay so i'm going to go inside
my c python put
web scraping tree and then the posts
directory that we created a few minutes
ago and you can see that inside of that
we have our text files but if i go here
inside let's see if the results are okay
okay so
i'm not quite satisfied with that
because it might be a better idea to
see that like
i mean like this okay so
you might want to split that
information into separate lines but that
is not going to be complex so
we just have to go inside our python
again and then whenever we write to the
file we have to use that convention
where you can just jump a line and that
will be backslash in so when you provide
backslash n inside a string it is just a
convention that is going to jump to the
next line right after it so it will be
backslash n for the first line and then
also here and let's run this program one
more time so i'm just going to break the
program and re-execute it so that time i
will write linux and then let's test our
results one more time so let's go inside
our
19.txt and then you can see that the
information is right there just like we
expected okay so this is quite great
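The file-writing step with `enumerate` and the newline fix can be sketched as follows. The job data is invented, the `save_posts` helper is hypothetical, and a temporary directory stands in for the `posts` folder so the snippet runs anywhere.

```python
import os
import tempfile


def save_posts(posts, directory):
    """Write each post to <index>.txt inside directory, one field per line."""
    saved_paths = []
    # enumerate yields (index, item) pairs, so the index can name the file.
    for index, post in enumerate(posts):
        path = os.path.join(directory, f"{index}.txt")
        # Mode "w" creates the file, or overwrites it if it already exists.
        with open(path, "w") as f:
            # "\n" ends each line; without it the fields run together.
            f.write(f"Company Name: {post['company']}\n")
            f.write(f"Required Skills: {post['skills']}\n")
            f.write(f"More Info: {post['link']}\n")
        print(f"File saved: {index}")
        saved_paths.append(path)
    return saved_paths


# Invented sample data standing in for scraped jobs.
sample_posts = [
    {"company": "Alpha Ltd", "skills": "python,django", "link": "https://example.com/jobs/1"},
    {"company": "Beta Inc", "skills": "python,linux", "link": "https://example.com/jobs/2"},
]

posts_dir = tempfile.mkdtemp()  # stand-in for the posts directory
saved = save_posts(sample_posts, posts_dir)
```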
alright guys so i hope you enjoyed this
entire series and you can find
everything that we have done here by the
links in the description of course i
will provide extra information in my
website about this series so if you like
this video consider subscribing and also
hit the like button i will see you in my
future uploads
