Web Scraping with Python - Beautiful Soup Crash Course

1h 08m video · Transcribed Jun 30, 2026 · freeCodeCamp.org
Beginner · 25 min read · For: Beginner Python developers interested in learning web scraping with Beautiful Soup.
1.8M views · 39.2K likes · 1.1K comments · 526 dislikes · 2.2% (📈 Moderate)

AI Summary

This tutorial teaches web scraping with Python using the Beautiful Soup library. It covers basic HTML parsing, then progresses to scraping a real job listing website (TimesJobs) with requests, filtering results, and saving data to files. The course is structured for beginners and includes hands-on coding examples.

[00:33]
Goal of Tutorial

Teaches web scraping with Beautiful Soup, starting with a basic HTML page, then moving to a real website, and finally storing the data in files.

[05:36]
Installing Libraries

Install beautifulsoup4 and the lxml parser with pip install.

[07:20]
Reading HTML File

Open home.html in a with open(...) block and read its content into a variable.
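A minimal sketch of the file-reading step, using only the standard library. The helper name read_html and the one-line stand-in page are assumptions made for illustration; the video reads a real home.html that sits next to the script.

```python
from pathlib import Path
import tempfile


def read_html(path):
    """Return the full text of an HTML file opened in read mode."""
    with open(path, "r", encoding="utf-8") as html_file:
        return html_file.read()


# Demo setup (hypothetical): write a tiny stand-in page to a temp directory
# so the example runs without the tutorial's own home.html.
demo_path = Path(tempfile.mkdtemp()) / "home.html"
demo_path.write_text("<h1>Hello, Start Learning!</h1>", encoding="utf-8")

content = read_html(demo_path)
print(content)
```

The with block closes the file automatically once the indented code finishes, which is why no explicit close() call appears.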

[12:50]
Finding Tags with Beautiful Soup

Use find() to get first match and find_all() to get all matches. Access tag text via .text.
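The difference between find() and find_all() can be sketched as follows. The sample markup is a made-up stand-in for the tutorial's home.html, and the built-in "html.parser" is swapped in here so the sketch runs without lxml; the video itself passes 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the course page used in the video.
SAMPLE_HTML = """
<div class="card"><h5>Python for beginners</h5></div>
<div class="card"><h5>Python Web Development</h5></div>
<div class="card"><h5>Python Machine Learning</h5></div>
"""

# Assumption: "html.parser" instead of the tutorial's "lxml" parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_tag = soup.find("h5")      # only the first matching tag
all_tags = soup.find_all("h5")   # list-like result with every match

course_names = [tag.text for tag in all_tags]
print(first_tag.text)
for name in course_names:
    print(name)
```

find() stops at the first match, so it suits elements that appear once; find_all() fits repeated elements such as the three course headers.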

[16:35]
Inspecting Real Websites

Use browser inspect (right-click) to view HTML structure and identify elements to scrape.

[24:47]
Scraping TimesJobs with Requests

Use requests.get(url).text to fetch page HTML, then parse with Beautiful Soup.
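A hedged sketch of the fetch-then-parse step. The helper names fetch_html and parse_html, the timeout value, and the raise_for_status() check are additions for illustration and do not come from the video; "html.parser" again stands in for lxml.

```python
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (needs network access)."""
    import requests  # third-party; imported here so parse_html works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text


def parse_html(html_text):
    """Turn raw HTML text into a BeautifulSoup object."""
    # Assumption: built-in parser instead of the tutorial's "lxml".
    return BeautifulSoup(html_text, "html.parser")


# Typical use (not executed here because it needs a live site):
# soup = parse_html(fetch_html("https://example.com/"))
```

Keeping the download and the parsing in separate functions makes it possible to test the parsing logic against a saved HTML string without touching the network.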

[33:10]
Extracting Job Details

Find company name (h3.job-list-comp-name), skills (span.srp-skills), and posted date (span.sim-posted) using find with class_ parameter.
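The extraction step might look roughly like this. The class names are copied from the summary line above and may not match the live site; the sample list item, the "job" class on it, and the extract_job helper are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical job card; only the three class names come from the summary.
SAMPLE_JOB_HTML = """
<li class="job">
  <h3 class="job-list-comp-name"> Example Corp </h3>
  <span class="srp-skills"> python, django </span>
  <span class="sim-posted">Posted few days ago</span>
</li>
"""


def extract_job(job):
    """Pull company, skills, and posted date out of one job element."""
    company = job.find("h3", class_="job-list-comp-name").text.strip()
    skills = job.find("span", class_="srp-skills").text.strip()
    posted = job.find("span", class_="sim-posted").text.strip()
    return {"company": company, "skills": skills, "posted": posted}


# Assumption: built-in parser instead of the tutorial's "lxml".
soup = BeautifulSoup(SAMPLE_JOB_HTML, "html.parser")
job = extract_job(soup.find("li", class_="job"))
print(job)
```

Calling find() on the job element rather than on the whole soup keeps each lookup scoped to a single listing.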

[42:00]
Filtering Jobs by Date

Only include jobs with 'few' in the posted date text to get recent postings.
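The date filter is a plain substring check, sketched here on invented data. The is_recent helper and the two sample records are assumptions for illustration.

```python
def is_recent(posted_text):
    """Treat a posting as recent when its date text contains 'few'."""
    return "few" in posted_text


# Hypothetical records shaped like the scraped job details.
jobs = [
    {"company": "Example Corp", "posted": "Posted few days ago"},
    {"company": "Sample Ltd", "posted": "Posted 6 days ago"},
]

recent_jobs = [job for job in jobs if is_recent(job["posted"])]
for job in recent_jobs:
    print(job["company"])
```

The check depends entirely on the site's wording, so it would need adjusting if the date label changes.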

[56:00]
Saving Results to Files

Write each job's info to a separate text file in the 'posts' directory using with open and f.write.
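One possible shape for the saving step, assuming each job is a small dictionary. The save_jobs helper, the field labels written to the file, and the temp-directory demo are illustrative choices, not taken from the video.

```python
import os
import tempfile


def save_jobs(jobs, directory="posts"):
    """Write each job to its own numbered text file and return the paths."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    paths = []
    for index, job in enumerate(jobs):
        path = os.path.join(directory, f"{index}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"Company name: {job['company']}\n")
            f.write(f"Required skills: {job['skills']}\n")
        paths.append(path)
    return paths


# Demo with made-up data, written under a temp directory to avoid clutter.
demo_jobs = [{"company": "Example Corp", "skills": "python, django"}]
out_dir = os.path.join(tempfile.mkdtemp(), "posts")
saved = save_jobs(demo_jobs, out_dir)
print(saved)
```

enumerate() supplies the index used in the file name, so every posting lands in a separate file.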

[68:00]
Conclusion

Final wrap-up: viewers are encouraged to subscribe and check the channel for more content.

This tutorial provides a comprehensive introduction to web scraping with Python, covering local HTML parsing, real website scraping with requests, filtering, and saving data to files.

Clickbait Check

90% Legit

"Delivers exactly what the title promises: a full crash course on web scraping with Beautiful Soup, from basics to advanced filtering and saving."

Tutorial Checklist

1 00:33 Understand the goal: learn web scraping with Beautiful Soup.
2 05:36 Install beautifulsoup4 and lxml using 'pip install beautifulsoup4 lxml'.
3 07:20 Open a local HTML file using 'with open('home.html', 'r') as html_file:' and read its content.
4 11:10 Create a Beautiful Soup object: soup = BeautifulSoup(content, 'lxml').
5 12:50 Use soup.find('tag') to get the first element or soup.find_all('tag') to get all.
6 15:57 Access text inside a tag using .text attribute.
7 20:02 Filter by class: use class_='class-name' in find/find_all.
8 24:47 For a real website, use requests.get(url).text to get HTML.
9 30:10 Parse the fetched HTML with Beautiful Soup and find list items (li) with a specific class.
10 33:10 Within each job element, find company name (h3), skills (span), and posted date (span) using find with class_.
11 42:00 Filter jobs by checking if 'few' is in the posted date text.
12 56:00 Save each job's info to a separate text file using 'with open(f'posts/{index}.txt', 'w') as f: f.write(...)'.

Study Flashcards (11)

What library is used for web scraping in this tutorial?

Difficulty: easy

Beautiful Soup (bs4).

00:33

What parser is recommended for Beautiful Soup?

Difficulty: easy

lxml.

06:46

How do you open and read an HTML file in Python?

Difficulty: medium

Use 'with open('file.html', 'r') as file: content = file.read()'.

08:17

What method returns the first matching HTML tag?

Difficulty: easy

find().

13:14

What method returns all matching HTML tags?

Difficulty: easy

find_all().

14:18

How do you access the text content of a tag?

Difficulty: easy

Use the .text attribute.

15:57

What parameter in find() filters by HTML class?

Difficulty: medium

class_ (with underscore).

20:02

How do you get the href attribute of an anchor tag?

Difficulty: medium

Use tag['href'] with square brackets.

53:38

What is the purpose of time.sleep() in this script?

Difficulty: easy

To pause execution for a specified number of seconds.

59:37

How do you write text to a file in Python?

Difficulty: medium

Use f.write() inside a 'with open' block.

64:48

What does the strip() method do?

Difficulty: easy

Removes leading and trailing whitespace from a string.

50:37

💡 Key Takeaways

💡

Learning web scraping with Beautiful Soup

Sets the foundation for the entire tutorial, clearly stating the goal.

00:33
🔧

Using lxml parser for broken HTML

Practical tip that helps avoid parsing errors with real-world HTML.

06:46
🔧

Inspecting real website HTML

Essential skill for any web scraper – shows how to locate elements to extract.

16:35
⚖️

Filtering jobs with condition

Demonstrates applying logic to scraped data to refine results.

42:00
🔧

Saving scraped data to files

Shows persistence of data, a crucial step in any scraping project.

56:00

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Scrape ANY Website? (Legit?)

46s

Promises ability to scrape sensitive sites like bank accounts, sparking curiosity and ethical debate.

Find All HTML Tags in Python

45s

Shows the fundamental find_all method, essential for beginners to grasp web scraping.

How to Scrape Job Listings with Python

50s

Demonstrates real-world scraping using requests and BeautifulSoup, highly engaging for job seekers.

Filter Web Scraping Results Like a Pro

60s

Teaches conditional filtering by date and skill, a key skill for effective automation.

[00:00] hi everyone and welcome to a special

[00:02] python tutorial where we are going to

[00:04] learn how to perform web scraping so

[00:07] first of all thanks to freecodecamp for

[00:09] giving me this opportunity of being a

[00:12] guest on their channel and i have a

[00:14] youtube channel as well that is named

[00:16] jimshapedcoding and you can find there

[00:18] any tech related topic such as

[00:20] programming language web development and

[00:23] more content that i am uploading once or

[00:26] twice a week so you can just go ahead

[00:28] and find the link from the description

[00:30] okay so in this video i'm going to do my

[00:33] best to teach you anything that is

[00:35] related to web scraping and i'm going

[00:37] to do that with the beautiful soup

[00:39] library and that is a special library

[00:42] that will allow you to gather any

[00:44] information you want from any website

[00:46] you want okay so this website could be

[00:49] your bank account could be a job post

[00:52] website like linkedin this could be

[00:54] wikipedia or a sports website and really

[00:57] anything that you can think about so we

[01:00] will start by scraping a basic html page

[01:03] first just to understand the concepts

[01:06] and then we will move on to scraping a

[01:08] real website and by the last 15 to 20

[01:11] minutes of this tutorial i'm going to

[01:13] show you how you can store the

[01:15] information that we have just pulled

[01:17] from this website so let's begin

[01:23] great so this is the webpage that we are

[01:25] going to

[01:26] start web scraping and i'm going to

[01:29] explain what is going on here so you can

[01:31] see that we are having a basic title and

[01:33] then we are having a kind of three

[01:36] paragraphs so you can see that we have a

[01:38] title of python and then we have a kind

[01:41] of secondary title and then there is a

[01:43] basic explanation about the course

[01:45] itself and then we are having a button

[01:48] that says start that will probably lead

[01:51] us to a different page if we click on it

[01:53] and then you can see that it has the

[01:55] price here as well now we are kind of

[01:58] repeating ourselves three times here and

[02:01] this is what is responsible to that web

[02:04] development paragraph and then also for

[02:07] that machine learning paragraph now what

[02:10] we are currently looking at it is

[02:11] basically the behind the scenes of that

[02:14] page so this is the html code that is

[02:17] defined in order to show you that hello

[02:20] start learning page and you can see that

[02:23] inside our html documents all of the

[02:25] code is being created with tags now

[02:29] those tags are what are responsible to

[02:31] display different information for you

[02:33] and you can see that we have a big tag

[02:35] that is called html and then inside of

[02:38] that html tag we are having a head tag

[02:41] and then a body tag now you can see that

[02:44] we are defining a closure for each of

[02:46] our tags with the forward slash here and

[02:50] then you are probably going to see that

[02:52] for the different tags as well now let's

[02:55] expand the head tag here and then inside

[02:58] of it we are seeing some meta

[02:59] information that is not quite relevant

[03:02] for us but we see that link tag which is

[03:05] responsible to import some styling for

[03:07] our page and then we can see that title

[03:10] tag which is responsible to customize

[03:13] our tab name and that is why you'll see

[03:16] my courses over here now i will close

[03:20] back the head and then i will expand the

[03:23] body so the body is responsible to

[03:26] display what is going to be on the page

[03:28] itself it is the page's body and you can

[03:31] see that we already have the h1 tag that

[03:35] is created here and then between the

[03:37] closure which is the area that you can

[03:40] write the text for that tag we see the

[03:42] hello comma start learning and then we

[03:45] are having some div tags here and when

[03:48] you see the tag of div this is the very

[03:50] basic tag that will create some

[03:54] tags in different styling so you'll see

[03:57] here the class equals card what this

[04:00] attribute assigning does here it is

[04:03] importing the card styling and that is

[04:06] why you see the kind of carding style

[04:10] for each of our paragraphs over that

[04:13] page and you can see that we are having

[04:16] one more div inside that card class

[04:18] which is called card header so this is

[04:20] the styling for card header this is why

[04:23] it is called that way and then the text

[04:26] is python and then we have the card body

[04:29] and we have the h5 tag which is a kind

[04:32] of smaller header that you can display

[04:35] and if i scroll right here you can see

[04:38] that python for beginners text and then

[04:40] the closure for the h5 tag and we are

[04:44] having a paragraph and then the a tag

[04:47] which is allowing us to lead to another

[04:50] page so when you see the a tag it is

[04:53] basically a reference to another page

[04:55] that you can visit

[04:56] now this entire code that i'm currently

[05:00] marking let's actually make our page a

[05:02] bigger here

[05:03] this entire code that i just marked is

[05:06] kind of repeated three times and that is

[05:09] why we see the page that we saw

[05:12] previously okay so it is quite important

[05:14] to understand and we are going to scrape

[05:18] that page and pull some information with

[05:21] the beautiful soup library now if you

[05:24] are confused with the script tags here

[05:26] don't because those tags are responsible

[05:29] to import some javascript libraries and

[05:32] that is something not relevant for us

[05:34] right now okay so we are going to switch

[05:36] to python now in order to apply some

[05:38] basic scraping for that page so i will

[05:41] go and start working on my main.py file

[05:45] and you can see that nothing is here now

[05:47] before we actually start we have to

[05:49] install some libraries and one of them

[05:51] will be the beautiful soup so i will

[05:53] open my terminal and since i'm working

[05:55] with my system global interpreter i will

[05:58] allow myself to install it over here and

[06:02] i will go here and write pip install and

[06:05] then we will write here beautifulsoup4

[06:08] so make sure that everything is

[06:12] not spaced and not split with dashes

[06:14] and then i'm going to hit enter and then

[06:16] you can see that it is installed

[06:18] successfully and then the next thing

[06:20] that i want to install will be something

[06:23] that is going to be used from the

[06:25] beautiful soup library and that is the

[06:27] parser method so when you work with

[06:29] beautiful soup you have to specify the

[06:31] method that you are going to parse html

[06:34] files into python objects okay so there

[06:38] are going to be different methods to

[06:40] parse your html code and i heard that

[06:44] the best of them could be the lxml

[06:46] parser since if you work with the

[06:49] default html parser it is not going to

[06:51] deal well with broken html code so just

[06:55] go ahead and install the lxml parser

[06:58] library and you can also do that with

[06:59] pip install and then we are going to use

[07:02] that when we work with the beautiful

[07:05] soup so i will go here and then write

[07:08] pip install lxml and then once i do that

[07:12] let's wait until it's finished great so

[07:15] we are ready now to go back to python

[07:18] and start working with the beautiful

[07:20] soup library now we have to go here and

[07:25] import that beautiful soup library so it

[07:27] could be a little bit confusing because

[07:29] the libraries folder is created as bs4

[07:34] so that is why we are going to write

[07:36] here from bs4 import BeautifulSoup

[07:41] like this and once i have done that i

[07:44] have to figure out how i'm going to

[07:47] access the content inside the home.html

[07:50] file that is right there inside my web

[07:54] scraping directory so in order to do

[07:57] that we have to work with file objects

[08:00] now if you don't know how to work with

[08:02] files in python that is totally fine

[08:04] because we are going to go over it and

[08:07] it also might be worth to check my

[08:09] channel out if i have already uploaded

[08:12] how to work with files in python so i'm

[08:15] going to write here

[08:17] with open so this is basically a

[08:20] statement that will allow me to open a

[08:23] file and then read the content of that

[08:26] specific file so as you can see from the

[08:29] auto completion i have to specify as my

[08:32] first argument the file's name so i'm

[08:35] going to close the parenthesis here and

[08:38] then inside here i'm going to write my

[08:41] html files name now since the python

[08:45] file and then the home.html file are in

[08:48] the same exact directory it will be okay

[08:51] just to write its name so it will be

[08:53] home.html

[08:55] and the second argument will be the

[08:57] method that you want to apply when you

[09:01] open that file in that python's memory

[09:03] so you have couple of options when you

[09:05] work with python files you can read them

[09:08] you can write them or you can do both

[09:11] and if we only want to read the content

[09:14] then we somehow want to specify that we

[09:17] only want to read this file so we will

[09:20] open here a new string

[09:22] and we will write here r so what this

[09:25] tells to python is basically that i'm

[09:28] going to read that file only and once i

[09:31] have done that i have to write here a

[09:34] variable that is going to be used inside

[09:37] that code block that i just created

[09:40] which is the with open so i'm going to

[09:42] use the as keyword and then i'm going to

[09:45] create here a variable name that is

[09:47] going to be used throughout the block of

[09:49] the open so it will be html underscore

[09:53] file and that will be basically my

[09:56] variable name and then once i do that i

[09:59] will go inside the

[10:01] open block and then i will write here

[10:04] content

[10:05] equals to html file dot read and once i

[10:10] apply the read method i'm basically

[10:13] reading the html file content and in

[10:17] order to show you how this works let's

[10:20] first print the content itself so i will

[10:22] go here and print the content and then i

[10:25] will run the main dot py and then

[10:28] you can see that the information that is

[10:30] printed is exactly what we saw in the

[10:34] home dot html okay so

[10:36] we kind of did a great job reading this

[10:39] file now in my future episodes we are

[10:43] going to read html files from real

[10:46] websites but i just want to give you an

[10:48] idea of how web scraping works in a very

[10:52] basic way because when you work with

[10:54] actual websites the scraping and the

[10:57] information pulling is going to be quite

[10:59] harder than the html file that i just

[11:02] have written in order to explain the

[11:05] idea of web scraping okay so i'm going

[11:07] to continue on here and i'm going to use

[11:11] the beautiful soup library in order to

[11:14] prettify my html and work with its tags

[11:18] like python objects so the way you can

[11:22] accomplish that will be by creating an

[11:25] instance of beautiful soup and i will go

[11:28] here and create a new variable let's

[11:30] call it soup and that is going to be

[11:33] equal to a new instance of the beautiful

[11:36] soup library now the arguments that i'm

[11:39] going to specify here will be the html

[11:42] file that i want to scrape so the

[11:45] content of that will be the content

[11:47] variable that is created up above and

[11:49] then the second argument will be the

[11:52] parser method that we want to use so we

[11:55] will pass the parser method as a string

[11:58] and that will be the lxml that we have

[12:01] just installed previously now once i go

[12:05] ahead and try to print what is inside

[12:08] that soup instance it will be something

[12:11] like the following so we will create

[12:14] here a print statement and then we will

[12:16] go with soup dot prettify so that will

[12:20] allow you to see the html code in a more

[12:23] pretty way and if i go ahead and run

[12:27] this you can see that we see the html

[12:29] content that is exactly the same like

[12:33] what we saw in the home.html so we have

[12:37] done a great job until now so let's

[12:40] minimize back our terminal and now we

[12:43] are going to get more familiar with the

[12:45] special methods that are created inside

[12:47] the beautiful soup library so we are

[12:50] going to delete the print from here and

[12:52] we are going to start working how we can

[12:54] grab some specific information that we

[12:57] want to grab so let's assume that we

[13:00] want to grab all the html tags that are

[13:03] created as h5 tags which is a kind of

[13:06] header tag so we will go here and create

[13:08] a new variable let's call it tags for

[13:11] example and then we will go with soup

[13:14] dot find and then once i go with find it

[13:17] is going to search for the specific html

[13:20] tag that i'm going to specify here as a

[13:23] string so if i go here and write h5 and

[13:27] then down below i go ahead and print the

[13:31] tags the results of that will be

[13:34] something like the following now you can

[13:36] see that we have the entire html tag for

[13:39] the h5 tag as you can see that its text

[13:43] is python for beginners but if you

[13:46] remember we have more than one h5 tags

[13:50] that are created inside our home html

[13:53] tag so if you remember from the home

[13:55] file there is one here there is the

[13:58] second one over there and there is the

[14:00] third one over there and what that means

[14:04] it means that the find method searches

[14:07] for the first element and then it stops

[14:10] the execution of searching for the html

[14:13] tag that you are looking for now if you

[14:16] want to change this behavior and not

[14:18] only grab the first element then

[14:21] basically you have to change your method

[14:23] into find underscore all okay so that

[14:27] will search for all the h5 tags inside

[14:30] the content and now if i go ahead and

[14:33] run that out then you can see that the

[14:35] result here is quite different as we

[14:38] have here a list and then you can see

[14:40] that it has

[14:42] python for beginners and then also

[14:44] python web development and then also the

[14:46] python machine learning now that could

[14:49] be a great logic to bring you back all

[14:53] the courses names from that webpage so

[14:57] you can go here and change this into

[15:00] courses

[15:02] html tags okay so this is what the h5

[15:05] tags are actually responsible for and

[15:08] now i can write here some different code

[15:10] that will allow me to see all the

[15:13] courses that are defined on our page so

[15:17] we have python for beginners and then we

[15:20] have python web development and then we

[15:22] also have

[15:23] python machine learning so we can work

[15:26] with these courses html tags that stores

[15:29] all the h5 html tags and write a next

[15:33] program that is going to display all the

[15:35] courses so we can actually create here

[15:38] an iteration over the courses html

[15:40] tags because it has a list so we will go

[15:44] here with for course in courses html

[15:48] tags and then inside of that course tag

[15:51] that we are iterating we can bring only

[15:54] the text attribute which is going to

[15:57] display the course text itself so it

[15:59] will be here course

[16:01] dot text and now if i go ahead and run

[16:05] our program then you can see that we

[16:07] have a nice output regarding all of the

[16:10] courses that are available from that

[16:12] page so this could be a nice starter to

[16:15] understand how you can scrape a web page

[16:18] to grab some specific information you

[16:20] want all right so we were able to

[16:22] understand how we can apply some basic

[16:24] scraping to a web page but when you are

[16:27] going to deal with real websites the

[16:29] html code is not going to be quite

[16:31] friendly and simple like we had here so

[16:35] in order to be able to access the html

[16:39] code behind the scenes of some page we

[16:42] have to use the inspect of any browser

[16:45] so let's say that you want to grab the

[16:48] price for each of the courses so it

[16:51] makes sense to go with your mouse and

[16:53] hover to that button and then right

[16:56] click on it and then you want to look

[16:58] for that inspect option and once you

[17:01] open that out you will have a new pane

[17:05] that is going to be opened and then here

[17:07] we can see all the html code that is

[17:10] responsible to display what is going on

[17:13] on the left pane so you can see that we

[17:16] have here let's make it a little bit

[17:18] bigger

[17:19] so that will be enough and then you can

[17:22] see that we have here div class card

[17:25] three times which is displaying all the

[17:27] different courses now when you go over

[17:30] different html tags with your mouse you

[17:32] can see that it is going to mark for you

[17:35] the html tag that is related to it so it

[17:38] is a quite important behavior that we

[17:40] should understand

[17:42] now let's say that we want to grab the

[17:44] price for that python for beginners so

[17:46] it makes sense to expand this tag and

[17:49] see what is inside so i will go here and

[17:53] search for that button and you can see

[17:56] that this a tag is actually responsible

[17:59] for that

[18:00] button itself and then you can see that

[18:02] its text is start for twenty dollars so

[18:06] the price information is right there and

[18:10] let's actually write a program that is

[18:12] going to search for that python for

[18:14] beginners and then we will grab the

[18:17] price for that course and then we will

[18:20] be able to write a nice program that is

[18:22] going to include a list of all of the

[18:25] courses and then the prices for each one

[18:28] of them so let's go back to pycharm and

[18:31] write this program so we will go here

[18:33] and delete everything from here and the

[18:37] first step that we probably want to do

[18:40] is to be able to grab all the course

[18:43] cards so it will be course

[18:46] underscore cards equals to soup

[18:50] dot find underscore all because we

[18:53] probably are looking to bring us back

[18:56] all the cards so this is why you have to

[18:58] use find all and not find and i'm

[19:01] going to search for the div tags now it

[19:04] could be much nicer if we could filter

[19:08] the div tags that we actually want to

[19:11] grab and store it inside our course

[19:13] cards so if you noticed let's go back to

[19:17] our courses page and here if i just

[19:21] expand back there all the div tags you

[19:23] can see that there is something that is

[19:26] common for all the div tags their class

[19:30] is equal to card so i can filter my div

[19:33] tags by this expression right there so i

[19:37] go back to pycharm and i will write here

[19:41] class equals to card but now you can see

[19:44] that there is an error and it is quite

[19:47] important behavior to understand you

[19:49] have to apply here the underscore

[19:51] because the class is a built-in keyword

[19:54] in python where you create python

[19:57] classes so that is why you have to add

[19:59] the underscore over here and then the

[20:02] beautiful soup will understand that you

[20:04] are relating to the class of the html

[20:08] attribute okay so it is important now

[20:10] since we have all the course cards

[20:12] stored right in this variable then we

[20:15] probably want to iterate over this list

[20:18] and then search for the course name and

[20:20] then the course price so let's see how

[20:23] we can do that for each of our course

[20:26] cards so we will start with

[20:29] a for loop here and that will be for

[20:31] course in course cards and before we go

[20:34] ahead and write some more code inside

[20:36] our for loop let's actually remind you

[20:39] what is inside each of our courses and

[20:42] then you can see that we have h5 tags on

[20:46] each of our course cards and it makes

[20:48] sense to access this specific h5 tags so

[20:52] we can accomplish that by going here and

[20:55] then use the h5 tag as an attribute so

[21:00] if i go ahead and press here dot h5 and

[21:04] re run my program then you can see that

[21:06] we were able to grab each of our h5 tags

[21:10] that are inside the course card so it is

[21:13] a quite great thing and now

[21:16] if i revert this back to course again

[21:19] and run that out you can also see that

[21:22] inside our a tags we have the text for

[21:26] start for 20 dollars and that is

[21:29] repeated for all of our cards as well so

[21:32] first of all it makes sense to delete

[21:35] this again and write here something like

[21:38] course name

[21:40] equals to course

[21:42] dot h5 and then here we probably look

[21:46] for the text attribute of that h5 tag so

[21:50] i will write here dot text and then this

[21:53] course name will be responsible to store

[21:56] the text

[21:57] on each iteration so it is great and now

[22:01] i can go here and

[22:03] write course price and then this time i

[22:06] will search for course dot a because the

[22:09] a tag stores the information about the

[22:12] course price so until now if i go ahead

[22:16] and print the course name and then i

[22:19] also go ahead and print the course price

[22:23] then we will see the results like the

[22:26] following so you can see that we have

[22:28] python for beginners and then we have

[22:31] the a tag itself but in this case we

[22:34] look for the text of that a tag as well

[22:37] so i will

[22:39] minimize my terminal out and excuse me

[22:41] for that i will delete that from here

[22:44] and then search for the text attribute

[22:46] over here as well and now i will run my

[22:49] program and then you can see that we

[22:51] have python for beginners and then we

[22:53] have the text for each of our a tags and

[22:56] now since we reached this stage it might

[22:59] be a great idea to print a sentence

[23:02] like python for beginners costs 20 okay

[23:05] so the way we can do that

[23:08] is basically using the split method to

[23:11] access that last element of that text

[23:14] because the price is located as the last

[23:18] word so it makes sense to go here with

[23:21] split and then we will split it by the

[23:24] blank so we don't have to specify

[23:26] anything here and we want to grab that

[23:28] last element so we are looking for -1

[23:31] index over here and now if i run it you

[23:35] can see that we have the price

[23:37] for each of our courses and now it might

[23:40] be much nicer if we go ahead and use an

[23:43] f-string to print a dynamic sentence for

[23:47] each of our courses so we will go here

[23:50] with print and then we will open an f

[23:52] string and then we will access the

[23:54] course name so it will be course

[23:57] underscore name and then we will write

[24:00] costs and then we want to display the

[24:03] course price so it will be course

[24:06] underscore price now if i run our

[24:09] program then you can see that it

[24:11] displays a nice information about each

[24:13] one of the courses

[24:15] now if you think about it that is a

[24:17] quite nice behavior that we have applied

[24:19] here because if you scrape a real

[24:21] website like udemy that keeps updating

[24:24] courses then it might be a great idea to

[24:27] launch this program every certain amount

[24:30] of time for example each week and then

[24:32] you have the ability to be aware about

[24:35] each of the courses that udemy has

[24:37] updated on the webpage that you scrape

[24:40] on so this is a quite nice behavior that

[24:43] we were able to reach here
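The course-and-price program built up in this first part can be sketched roughly as follows. The sample markup and the three prices are made up to resemble the page described in the video, and the built-in "html.parser" replaces the lxml parser so the sketch has one less dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for home.html; names and prices are illustrative.
SAMPLE_HTML = """
<div class="card">
  <div class="card-header">Python</div>
  <div class="card-body">
    <h5>Python for beginners</h5>
    <a href="#">Start for 20$</a>
  </div>
</div>
<div class="card">
  <div class="card-header">Web Development</div>
  <div class="card-body">
    <h5>Python Web Development</h5>
    <a href="#">Start for 50$</a>
  </div>
</div>
<div class="card">
  <div class="card-header">Machine Learning</div>
  <div class="card-body">
    <h5>Python Machine Learning</h5>
    <a href="#">Start for 100$</a>
  </div>
</div>
"""

# Assumption: built-in parser instead of the tutorial's "lxml".
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ (with the underscore) filters on the HTML class attribute,
# because plain "class" is a reserved word in Python.
course_cards = soup.find_all("div", class_="card")

lines = []
for course in course_cards:
    course_name = course.h5.text                # first h5 inside the card
    course_price = course.a.text.split()[-1]    # last word of the button text
    lines.append(f"{course_name} costs {course_price}")

for line in lines:
    print(line)
```

split() with no argument breaks the button text on whitespace, so indexing with -1 picks the price as long as it stays the last word.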

[24:47] on this one we are going to scrape real

[24:49] websites with the requests library so i'm

[24:52] going to simulate this against a website

[24:55] that is going to search for job

[24:57] advertisements and i'm going to bring

[24:59] all the jobs from a specific website

[25:02] that their main skill requirement is

[25:06] python programming language and i'm

[25:07] going to write a program that is going

[25:10] to pull the latest published job

[25:12] advertisements from a specific website

[25:15] so it is going to be very interesting so

[25:17] let's get started all right so one of

[25:19] the first things that we must do is to

[25:22] ensure that we have the requests library

[25:25] installed so i'm going to go down to my

[25:29] terminal right in pycharm and i'm going

[25:32] to write here pip install requests just

[25:35] to make sure that i have the requests

[25:36] library installed now the output for

[25:40] myself could be different than yours

[25:41] because you may not have the requests

[25:43] library but since i already have that

[25:46] you can see outputs like requirement

[25:49] already satisfied okay so it is quite

[25:51] important now i'm going to minimize the

[25:54] terminal and write here import requests

[25:58] so you want to make sure that you do

[25:59] that after the installation of this

[26:01] library and the first thing that i'm

[26:03] going to do here is to use the get

[26:07] method of the requests library now what

[26:11] the requests library is doing behind the

[26:13] scenes it is just requesting information

[26:15] from a specific website so it is like a

[26:18] real person

[26:19] going to a website and requesting some

[26:22] information okay so you can go with

[26:24] something like the following when it

[26:26] comes to the requests library so it will be

[26:28] requests.get so you want to get

[26:32] specific information from a website and

[26:34] here we are going to provide an empty

[26:36] string for now but later on we are going

[26:39] to complete this string with the url

[26:41] that we are going to web scrape against

[26:44] it and i'm going to assign this to a new

[26:46] variable and i will call it html_text so

[26:50] i'm going to make that to be equal to

[26:52] this entire statement now let's go to a

[26:55] web browser and look up for the website

[26:57] that is going to include some job ads

[27:00] okay so this is timesjobs.com and this

[27:04] website includes job posts about almost

[27:07] everything so you can simply go down

[27:09] here and search for some skill that you

[27:12] own and then this will search for you

[27:15] jobs that are requiring this specific

[27:17] skill in that position now this video is

[27:20] recorded a couple days before when i

[27:23] uploaded it so if you watch this video

[27:25] after a couple of months or even a year

[27:27] or two since the publish date then there

[27:30] is a great chance that the html elements

[27:32] are going to be quite different but the

[27:34] main point of this video is to teach you

[27:37] all the tools to pull information from a

[27:40] website just as you want and then you

[27:43] can apply your own customizations and

[27:45] kind of doing a reverse engineering to

[27:48] the code that i'm going to write

[27:50] throughout this tutorial great so let's

[27:52] go here and write python so i will

[27:55] receive only job posts about this

[27:58] programming language and you can see

[28:00] that we have this job found over there

[28:04] and we have a lot of jobs that are

[28:06] published so my goal here in this

[28:09] tutorial would be to

[28:12] let's get this closed so my goal in this

[28:15] tutorial will be to bring all the jobs

[28:19] that are posted a few days ago so if i

[28:22] am zooming here in then you can see that

[28:26] we have posted a few days ago for a

[28:29] couple of posts but after i reach down

[28:32] here we have posted four days ago so

[28:36] this might mean that this job post is

[28:39] not the most updated so i'm going to

[28:42] bring all the jobs and i'm going to

[28:45] condition my program to bring those

[28:48] elements with the posted few days ago

[28:51] text only so let's go back to here now

[28:55] i'm going to bring this url from here

[28:58] and i'm going to paste that in in the

[29:01] empty string that we created inside the

[29:04] requests.get and once i have done that

[29:07] what is stored inside this variable

[29:09] right now is simply the response with its

[29:13] status code okay so if i'm going to

[29:16] print the i mean if i'm going to run

[29:18] this program then we are going to see

[29:21] the results like the following so 200 is

[29:24] the conventional http status code meaning that the

[29:28] request completed successfully but in

[29:31] order to avoid the status code we are

[29:34] going to go to here and i'm going to

[29:37] accept the text only so i'm going to go

[29:40] here and then write dot text okay so

[29:43] this is what we have to apply here in

[29:46] order to bring the html text of that

[29:49] specific page and now it makes sense to

[29:52] leave this variable name as it is

[29:54] because it is storing the html text and

[29:56] i'm going to re run this program and we

[29:59] will probably receive a large

[30:02] amount of html so right now it is

[30:05] not quite relevant but i just

[30:07] wanted to show you the results so let's

[30:10] continue from here okay so as you know

[30:13] we are going to

[30:14] create a beautiful soup instance like we

[30:17] did in the previous episode and i'm

[30:19] going to provide the html as the html

[30:22] text variable so it will be soup equals

[30:26] to

[30:26] an instance of BeautifulSoup and then

[30:29] i'm going to write here html text as my

[30:32] information that i want to scrape and we

[30:35] are going to use the same parser again

[30:37] like the previous episode so it will be

[30:40] lxml now once i have done that it makes

[30:43] sense to go back to our page and see how

[30:47] we can grab each paragraph

[30:50] from this website so the white boxes are

[30:54] kind of a list of elements that this

[30:57] page has provided here and i want to

[31:00] look for a method that is going to bring

[31:03] me all the job posts so it makes sense

[31:06] to catch a certain element inside that

[31:10] post and right click on it and then

[31:13] click on inspect and once i have done

[31:16] that you can see here so i'm going to

[31:19] zoom in things a little bit
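
The fetch-and-parse steps described so far might be sketched as below. The helper names `fetch_html` and `make_soup`, the `timeout` value, and the `get` parameter are assumptions made for illustration; the video calls `requests.get(url).text` directly and passes `lxml` as the parser. This sketch defaults to Python's built-in `html.parser` so it runs without lxml installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, get=requests.get, timeout=10):
    """Return the HTML text of a page, raising on an error status."""
    response = get(url, timeout=timeout)  # a Response object, printed as <Response [200]> on success
    response.raise_for_status()           # raise instead of parsing a 4xx/5xx error page
    return response.text                  # .text is the decoded HTML as a string


def make_soup(html_text, parser="html.parser"):
    """Parse HTML text; pass parser="lxml" to match the video if lxml is installed."""
    return BeautifulSoup(html_text, parser)


# Usage (needs network access; paste the search URL copied from the browser):
# html_text = fetch_html("<search results url>")
# soup = make_soup(html_text, "lxml")
```

Keeping the download separate from the parsing makes it possible to try the parsing code on a saved or hand-written HTML string without hitting the website each time.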

[31:21] so we can see that the h3 class is

[31:26] pointing to that

[31:28] text over here i know that the text is a

[31:31] little bit small here but just you can

[31:33] see that it has a gray mark and i'm

[31:36] going to go up here and then you can see

[31:40] that those elements are opened up as

[31:42] well so if i hover my mouse here then

[31:45] you can see a green background wrapped

[31:47] in the article over here i mean the

[31:49] paragraph and then if i close that up

[31:52] you can see that we have a lot of clearfix

[31:55] job-bx and something like that

[31:59] which is the class name and our html

[32:04] element here is called li so li stands

[32:07] for list item and then you can see that it is

[32:09] inside a ul tag so this stands for

[32:14] unordered list and it contains a

[32:16] lot of

[32:17] li tags inside that ul so you can see

[32:20] once i close that then the entire

[32:24] list of all the posts are marked with a

[32:27] blue

[32:28] background so i'm going to search the

[32:31] element of li with that name of class so

[32:34] i'm going to copy the name of the class

[32:37] here and i'm going to go back to my

[32:39] pycharm and i'm going to write here jobs

[32:43] equals to soup dot find

[32:46] underscore all and i'm going to search

[32:49] for all the li's and as the second

[32:53] argument it makes sense to pass here

[32:55] class underscore equals to and then

[32:58] inside that string i'm going to paste

[33:01] that in the class name that we have

[33:04] copied from the page itself so once i

[33:07] have done that then we will probably see

[33:10] the results of all the jobs in that page

[33:14] now this doesn't mean that it is going

[33:16] to bring back all the

[33:19] 16 000 jobs because you can see that

[33:23] this page is being paginated so that

[33:27] means that it is going to bring the

[33:29] results only for the first page so this

[33:32] is not going to take extremely long now

[33:35] if i go back to here and print the jobs

[33:39] then let's see the results before we

[33:40] continue on just to make sure that

[33:42] everything is okay so we can see that we

[33:45] receive the results and then we see that

[33:47] we have some company names and i think

[33:50] that everything is quite great here now

[33:52] in order to work with this

[33:55] scraping project it makes sense to

[33:57] work with only one job element so i'm

[34:01] going to

[34:02] delete the underscore all from here and

[34:05] what this means it means that it is

[34:07] going to bring the first match that has

[34:11] the li tag and then the class name as

[34:14] this string over here so let's change

[34:17] this variable name just to job for now

[34:20] okay just in order to develop our

[34:22] program slowly

[34:23] relying on only one job post okay so

[34:28] once we've done that we probably want to

[34:30] search for the company name of that

[34:32] specific job post so i'm going to go

[34:35] back to here and i'm going to make

[34:38] things bigger over here and now let's

[34:41] actually go here and try to inspect what

[34:44] is going on

[34:45] here again so let me

[34:48] zoom that out great now i'm going to

[34:52] try to inspect this text over here again

[34:55] and then we can see that it is inside

[34:58] the li tag for sure but we can also see

[35:01] that it is inside an h3 tag and it has

[35:04] the class name of job list comp name so

[35:08] i'm going to search for that class in

[35:12] the entire page as well but speaking

[35:15] about the entire page so let's go to our

[35:18] pycharm you want to search for that

[35:22] specific element only inside the job

[35:25] itself so you see it doesn't make sense

[35:28] to search for an h3 tag in the entire

[35:31] page again so you can basically go with

[35:35] job.find instead of soup.find

[35:38] because we want to search for that h3

[35:41] tag only inside our job so if i go ahead

[35:44] and print the job here then we can see

[35:47] that it only includes an html code about

[35:50] only one job and i'm going to search for

[35:53] this h3 tag

[35:55] so let's create here a new variable and

[35:58] i'm going to call that company_name

[36:00] and we are going to use

[36:02] job.find

[36:04] and we are going to accept here as an

[36:07] argument the h3 and then this time the

[36:10] class underscore is going to be equal to

[36:13] whatever this h3 tag includes as the

[36:16] class name which is the job list comp

[36:19] name now to debug this out and to ensure

[36:24] that the results are great we are going

[36:26] to print the company name

[36:28] and then you can see that we receive

[36:30] this

[36:31] this element back and i'm going to use

[36:33] here the .text attribute just to bring

[36:36] back the text itself now once i do that

[36:40] we are going to see a weird result here

[36:43] now you can see that we have some white

[36:46] spaces so we kind of want to replace our

[36:50] white spaces with nothing so in order to

[36:54] do this one i'm going to go here and i'm

[36:57] going to use the replace method and this

[37:01] trick is going to avoid having these

[37:04] unnecessary white spaces so i'm going to

[37:06] replace the spaces with nothing so i'm

[37:10] going to just write here double quotes

[37:12] twice i mean single quotes twice and

[37:15] once i have done that and rerun our

[37:18] program then you can see that the result

[37:20] is going to be quite different as you

[37:22] can see this text is fully aligned to

[37:25] left now let's minimize back and

[37:28] continue from here now we're going to

[37:30] zoom out the code a little bit here just so

[37:33] we can see the important points like the

[37:35] replacement
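
The search steps walked through above (all cards, then one card, then the company name inside that card) could look roughly like this. The sample markup is hypothetical and only mimics the structure described; the class strings `clearfix job-bx wht-shd-bx` and `joblist-comp-name` are assumptions, so copy the real ones from the browser inspector. The built-in `html.parser` is used here instead of `lxml` so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an unordered list whose li items are job cards.
SAMPLE_HTML = """
<ul>
  <li class="clearfix job-bx wht-shd-bx">
    <h3 class="joblist-comp-name">
        Acme Data Labs
    </h3>
  </li>
  <li class="clearfix job-bx wht-shd-bx">
    <h3 class="joblist-comp-name">
        Initech
    </h3>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching li; find returns only the first match.
jobs = soup.find_all("li", class_="clearfix job-bx wht-shd-bx")
job = soup.find("li", class_="clearfix job-bx wht-shd-bx")

# Search inside the single job card rather than the whole page.
company_tag = job.find("h3", class_="joblist-comp-name")

# As in the video: replace(' ', '') removes every space, including those inside the name.
company_name_video = company_tag.text.replace(" ", "")

# Alternative: strip() trims only the leading and trailing whitespace.
company_name = company_tag.text.strip()

print(len(jobs))
print(repr(company_name_video))
print(repr(company_name))
```

Printing with `repr` makes leftover newlines visible, which helps when deciding between `replace` and `strip`.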

[37:36] and let's continue from here now it also

[37:39] makes sense to bring the skill

[37:42] requirements other than the python

[37:44] programming language because we know

[37:46] that this job is only for people who are

[37:50] good with the python programming

[37:52] language so i'm going to go here and i'm

[37:55] going to repeat myself in the same

[37:57] process again and i'm going to write

[37:59] here job.find and we are probably

[38:02] looking for an element that is including

[38:05] a text about the skill requirements so

[38:09] let's search for that okay so let's go

[38:11] back to our website again and i'm going

[38:13] to

[38:14] go here and check out what html element

[38:17] is including the skills so we are

[38:20] talking about this one so i'm going to

[38:22] inspect inside here and we can see here

[38:26] that this text is inside a span tag

[38:30] with the class name of srp-skills so i'm

[38:34] going to copy this class name again and

[38:38] this time i'm going to search for the

[38:40] span elements inside my job post so i'm

[38:44] going to go back to pycharm again and

[38:47] i'm going to write here span so this is

[38:49] the html tag that we are searching for

[38:52] and again i'm going to write class

[38:54] underscore equals to that srp-skills now

[38:58] i want to ensure the results over here

[39:02] once again so you always want to

[39:06] quickly print the results of whatever

[39:08] html element that you want to pull to

[39:10] see what other methods you have to apply

[39:14] to prettify your result okay so let's

[39:18] run our program again and it makes sense

[39:20] to delete the print company name so

[39:23] let's re-execute our program

[39:27] and then you can see here that we have

[39:30] some span tag and then here we have a

[39:33] strong tag which is basically created to

[39:36] make our text bold when we want to type

[39:39] in something so i'm just going to

[39:41] guess here that i'm going to

[39:44] only write here dot text and then i

[39:46] expect for the results to be fine so

[39:49] let's check out for that and then you

[39:50] can see here that the results are quite

[39:52] great so we have the python scripting

[39:55] and then we have some more requirements

[39:58] that are divided with commas and a lot

[40:01] of white spaces again so i'm going to

[40:04] apply the same method of that replace

[40:06] once again like we did with the company

[40:09] name so let's write here dot replace and

[40:11] i'm going to replace white spaces

[40:14] with nothing so let's re-execute that

[40:18] out and then we can see that the result

[40:20] is quite like we want and now we were

[40:23] also able to grab the skills as well so

[40:26] this is quite nice now if we want to

[40:30] display nice information about the job

[40:33] until now then we want to go with a nice

[40:36] print message here so let's try to

[40:38] create a nice message so we will use an

[40:40] f-string here and we will also use the

[40:43] triple-quote syntax just to allow us to

[40:46] write some text on separate lines as

[40:49] well and i'm going to write here company

[40:52] name

[40:52] like this and then i'm going to write

[40:56] here company name so i'm calling the

[40:59] company name value by writing it inside

[41:02] curly brackets and i'm going to repeat

[41:05] the same process for required skills so

[41:09] it will be required skills and then i'm

[41:12] going to make that to be equal to skills

[41:16] variable and now if i go and execute our

[41:19] program let's see if the results are

[41:22] quite nice

[41:23] yes so we are kind of receiving nice

[41:25] information about the job info okay so

[41:29] this is quite great
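
A minimal sketch of the print message just described, combining an f-string with triple quotes. The function name `format_job` and the sample values are made up for illustration; in the program the two values come from the scraped company name and skills.

```python
def format_job(company_name, skills):
    """Build the multi-line message with a triple-quoted f-string."""
    # Everything between the triple quotes is kept verbatim, including line
    # breaks and any leading spaces, so the text is written flush left here.
    return f"""
Company Name: {company_name}
Required Skills: {skills}
"""


message = format_job("AcmeDataLabs", "python,django,rest")
print(message)
```

The names inside the curly brackets are replaced with the values of the variables when the string is built.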

[41:31] now if we go back to here then we want

[41:35] to search for one more element so

[41:38] you remember that i told you that we

[41:41] only want to grab the

[41:44] job post with the text of posted few

[41:47] days ago so we for sure want to write

[41:51] some extra code to apply this

[41:53] functionality so i'm going to go here

[41:56] and i'm going to inspect for that

[41:58] element again

[42:00] and then we can see that it is inside a

[42:02] span once again but i can also see that

[42:06] this

[42:07] job post includes some more span

[42:10] tags so i have to filter out the results

[42:14] again with the class name itself so i'm

[42:17] going to search for that sim-posted

[42:21] class name and i'm going to go back to

[42:24] here so we will write this time

[42:27] job

[42:29] published date

[42:31] so it makes sense to delete the job

[42:32] excuse me so it is just going to be

[42:34] published date and i'm going to go here

[42:37] again with job.find

[42:39] and we will search for the span and then

[42:42] this time the class underscore is going

[42:45] to be equal to the text that i just

[42:47] copied and i'm going to repeat myself

[42:51] with printing the published date but

[42:55] this time let's just avoid printing this

[42:58] print line so i'm just going to comment

[43:00] out those lines and let's see what the

[43:03] published

[43:04] date text looks like and you can

[43:06] see that we have here something

[43:09] a little bit weird so we have the span

[43:13] here and we have also one more span

[43:16] inside of the text of it so what that

[43:19] means it means that we have to take some

[43:21] different action than what we did

[43:23] previously so this time i want to search

[43:26] for the attribute of span just to get

[43:29] inside that tag over here and then right

[43:32] after it i want to look for the text of

[43:34] that span tag so this will give me the

[43:38] published date of this specific job but

[43:43] i'm not going to include the publish

[43:44] date inside my print message because we

[43:47] only want the publish date for the

[43:49] functionality to stop our execution if

[43:52] the published date text is not including

[43:56] the word few and i'm going to code

[43:59] this functionality just in a second so

[44:01] you will see what i mean by what i said

[44:03] all right so what i'm going to do here

[44:06] is take a tricky action that is going to

[44:08] bring me all the jobs from the first

[44:11] page so if we paid attention then all

[44:14] the job posts include this class name

[44:18] so what i can do instead of the find is

[44:21] change that back to underscore all and

[44:25] change this variable name to jobs and i

[44:28] know that just now it just raised an

[44:31] error here and i'm going to use here a

[44:34] for loop that is going to iterate over

[44:37] each element and i'm going to write here

[44:40] for

[44:40] job in jobs and then i'm going to create

[44:44] an indentation of the entire code that

[44:48] is right there so the results will be

[44:51] applied for all the jobs that are posted

[44:55] in the first page of the

[44:57] web page that we scrape so once i hit

[45:00] here the colon sign then i'm going to

[45:03] create an indentation for each of our

[45:06] lines like this and then the results are

[45:09] going to be quite the same so let's test

[45:12] that out okay i'm going to

[45:14] uncomment our print line over here and

[45:17] just for comfort reasons i'm also going

[45:19] to print here an empty line so we can

[45:22] kind of see a division between the

[45:24] different jobs and then i'm going to

[45:26] delete the published date for now so if

[45:30] we execute our program

[45:32] this time then we are going to see

[45:35] nicer results and this is going to

[45:37] contain all the job posts

[45:39] from the page that we scrape against so

[45:42] you can see that we have a nice

[45:43] paragraph for that job post and then we

[45:46] have also another one here and if i keep

[45:48] scrolling up we can see a lot of them in

[45:51] that output so this is quite great so if

[45:54] you remember we wanted to filter out the

[45:58] job posts that are not including the

[46:02] word few inside the published date

[46:06] because what that means it means that

[46:08] this job could be outdated so if i go to

[46:12] our page again then we can see that as i

[46:15] keep scrolling down we have some

[46:18] text like posted six days ago and i

[46:21] wanted to filter out only the jobs that

[46:23] are containing the text of posted few

[46:26] days ago so in order to apply this i'm

[46:29] going to change the orders here a little

[46:32] bit okay so i'm going to

[46:34] cut this searching here and i'm going to

[46:37] paste that in as the first line inside

[46:41] my for loop now the reason i'm doing

[46:43] this it is basically because i don't

[46:45] want to continue on scraping for that

[46:48] post if the publish date is not matching

[46:51] my condition so it makes a lot of sense

[46:54] to place this code as the first line

[46:58] inside my for loop and then right here

[47:01] i'm going to

[47:03] write a condition that is going to check

[47:05] if the word few is inside that text

[47:10] so it will be if

[47:12] few in

[47:14] published date and again i'm going to

[47:17] create an indentation for the entire

[47:20] code here so you can do that with the

[47:22] shift alt combined and then you can just

[47:25] press tab and all the lines here are

[47:29] being indented so right now if i go

[47:32] ahead and execute our program then we

[47:36] should see the results again like almost

[47:39] the same but we also see here that the f

[47:43] string is not quite nice

[47:46] but i can live with that okay so it is

[47:49] great that we were able to receive the

[47:53] posts only that have been published few

[47:56] days ago now there is no limit for what

[47:59] you can do when it comes to web scraping

[48:02] and what you can filter in or filter out

[48:05] but basically this program deals with

[48:08] how to grab some job posts with the

[48:11] filters that you want to apply that

[48:14] sometimes may not be available

[48:16] from the website itself so you can write

[48:19] your own filters in your python code

[48:22] while you scrape some information from a

[48:25] specific website
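
Putting the loop and the date condition together might look like the sketch below. It runs against hypothetical sample markup rather than the live site; the class strings `clearfix job-bx wht-shd-bx` and `joblist-comp-name` are assumptions, and the function name `recent_jobs` is invented for this example. The built-in `html.parser` stands in for `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics two job cards, one recent and one older.
SAMPLE_HTML = """
<ul>
  <li class="clearfix job-bx wht-shd-bx">
    <h3 class="joblist-comp-name"> Acme Data Labs </h3>
    <span class="srp-skills"> python , django , <strong>rest</strong> </span>
    <span class="sim-posted"><span>Posted few days ago</span></span>
  </li>
  <li class="clearfix job-bx wht-shd-bx">
    <h3 class="joblist-comp-name"> Initech </h3>
    <span class="srp-skills"> python , linux </span>
    <span class="sim-posted"><span>Posted 6 days ago</span></span>
  </li>
</ul>
"""


def recent_jobs(html_text):
    """Return (company_name, skills) pairs for cards whose date text contains 'few'."""
    soup = BeautifulSoup(html_text, "html.parser")
    results = []
    for job in soup.find_all("li", class_="clearfix job-bx wht-shd-bx"):
        # Read the date first so older posts are skipped before any other work.
        # The date text sits in a span nested inside the sim-posted span.
        published_date = job.find("span", class_="sim-posted").span.text
        if "few" in published_date:
            company_name = job.find("h3", class_="joblist-comp-name").text.replace(" ", "")
            skills = job.find("span", class_="srp-skills").text.replace(" ", "")
            results.append((company_name, skills))
    return results


for company_name, skills in recent_jobs(SAMPLE_HTML):
    print(f"Company Name: {company_name}")
    print(f"Required Skills: {skills}")
    print("")
```

The check is a plain substring test, so it depends on the site's exact wording for recent posts and would need adjusting if that wording changes.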

[48:29] so i'm going to do whatever it takes to

[48:31] turn this program into a very useful one

[48:34] and i'm going to do that by applying

[48:36] some special functionalities such as

[48:39] wrapping this entire program in a while

[48:41] loop and executing this project every

[48:45] certain amount of time and also apply

[48:47] some filters to filter out the job

[48:51] posts that do not match the skills

[48:54] that i own and also i'm going to throw

[48:56] the results of the different job posts

[48:59] into a new blank file so i can be aware

[49:02] of the posts that are being posted every

[49:05] certain amount of time so let's get

[49:07] started all right then so let's start

[49:09] with a kind reminder of the results that

[49:11] we got until that point so

[49:14] we run our program now and if we show it

[49:18] right here you can see that those lines

[49:21] are not aligned well so i'm going to

[49:24] change that and i'm also going to

[49:26] provide some extra information that will

[49:28] show us the exact

[49:30] link of the specific job that we are

[49:33] iterating on so that way i will have the

[49:35] ability to just click on the link and

[49:38] then see more information about that job

[49:41] so as a beginner i will get rid of the

[49:43] formatted string in that case because

[49:46] doing a formatted string with a triple

[49:48] quote might not be a great idea when you

[49:51] execute it with a for loop because as

[49:53] you can see that it also includes the

[49:55] indentations right here so i'm going to

[49:58] delete this entire code here and i'm

[50:01] going to write two more new formatted

[50:04] strings and we will start with company

[50:07] name

[50:09] make that to be equal to company name so

[50:12] make sure to add a colon here so it

[50:14] will be more friendly and then i will

[50:16] write here required skills as well and

[50:19] then we will write here the skills

[50:22] variable now there was one more issue

[50:25] with the result that we showed a minute

[50:27] ago and that was the blank spaces that

[50:31] are being shown as well so we can get

[50:35] rid of the spaces by a special method

[50:37] that is called strip and it is a special

[50:40] method that you are allowed to use

[50:43] inside strings and since the company

[50:45] name and the skills are strings by

[50:48] default i don't have to convert them to

[50:50] a string so i can just call that method

[50:54] like this okay and now i will show the

[50:57] results of

[50:58] something like the following and in a

[51:01] few seconds we will see that

[51:03] this is aligned way better than what it

[51:06] was and i'm also going to add here one more

[51:09] information line that will show the link

[51:13] of the job post so let's do that okay

[51:16] let's go here and write this

[51:19] functionality okay so we had an

[51:21] unordered list that inside of that we

[51:24] had some different html tags that are

[51:27] called li and that stands for list item and

[51:30] they are actually different job posts

[51:33] that are divided into different elements

[51:36] inside an unordered list and then if we

[51:39] hover our mouse you can see that there

[51:41] are different jobs now if i go inside

[51:44] one of them and i go inside a header tag

[51:47] that is actually the first child of the

[51:50] li tag and then i will

[51:53] go inside the h2 here and then you can

[51:56] see that we have a link that could lead

[52:00] us to a link that provides some extra

[52:02] information about that specific job so

[52:06] if i actually

[52:07] go here and click on here you can see

[52:10] that we receive the job description

[52:13] right here so what we have to do in

[52:15] order to access this link in each job

[52:18] post that we are iterating on the python

[52:21] code is actually going inside the header

[52:24] and then going inside one more tag

[52:26] which is an h2 as you saw me doing that

[52:29] and then access that a

[52:31] tag so let's do that okay i'm going to

[52:34] go back to pycharm and apply this

[52:36] functionality so we will go under the

[52:40] skills and then we will write here more

[52:43] info and that will be equal to job

[52:47] dot header because this was the first

[52:50] tag that we want to go inside of it and

[52:53] then we want to go inside the h2 and

[52:56] then inside that h2 we want to go inside

[52:59] the a tag now before we go further let's

[53:03] test ourselves that we have done a great

[53:05] job so let's print the more info in the

[53:09] following way so it will be more info

[53:12] and then we will call the variable in a

[53:15] formatted string now let's execute our

[53:18] program
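
The tag-by-tag navigation described here (header, then h2, then a) could be sketched as follows, again on hypothetical sample markup with an assumed class string and a placeholder link. The last lines also read the link target from the tag's `href` attribute using dictionary-style indexing, which Beautiful Soup supports for tag attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical job card containing a header > h2 > a chain.
SAMPLE_HTML = """
<li class="clearfix job-bx wht-shd-bx">
  <header>
    <h2><a href="https://example.com/job-detail/python-developer">Python Developer</a></h2>
  </header>
</li>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
job = soup.find("li", class_="clearfix job-bx wht-shd-bx")

# Dotted access returns the first tag with that name found inside the current tag.
link_tag = job.header.h2.a
print(link_tag)  # the whole <a ...> element, including its href attribute

# Tag attributes are read like dictionary keys.
more_info = link_tag["href"]
print(f"More Info: {more_info}")
```

Indexing with `["href"]` raises a `KeyError` if the attribute is missing; `link_tag.get("href")` returns `None` in that case.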

[53:20] and then you can see that inside the

[53:22] more info we have the a href which gives

[53:26] us the link about the

[53:28] specific job that we are iterating on so

[53:31] all i have to do here is going back to

[53:34] my more info and then call that href

[53:38] attribute so this time i'm going to do

[53:40] that with a square bracket like in

[53:42] dictionaries and then i'm going to write

[53:44] here href so i will receive the value of

[53:47] that attribute so if i run that one more

[53:50] time

[53:51] then i should see the link only and that

[53:54] is exactly what is happening so the result

[53:57] is quite great and then you can see that

[53:59] this is already better than what we did

[54:01] in the last episode and we will continue

[54:04] from here okay so what i want to do now

[54:06] is giving the opportunity for the user

[54:09] that executes this program to filter out

[54:12] some skill requirement that he does not

[54:15] own so we will use the input function

[54:17] for that and then whatever the input is

[54:19] equal to we will filter out the results

[54:22] from the jobs that we are finding

[54:25] right here okay so let's write this

[54:28] functionality so to apply this i'm going

[54:30] to create a new variable over here and

[54:33] i'm going to call it unfamiliar_skill

[54:36] and i'm going to make that to be equal

[54:38] to an input and then i'm going to

[54:41] write here something like this okay so

[54:43] the user could understand that he has to

[54:47] provide some information in order to

[54:49] execute this program and actually it

[54:51] might be a great idea to print some

[54:53] extra information before that input

[54:56] function so it will be print

[54:59] put some skill that you are not

[55:03] familiar with and then right after the

[55:06] unfamiliar skill input i will write here

[55:10] filtering out

[55:12] and we will actually make that a

[55:14] formatted string and then we will write

[55:16] here filtering out and then whatever the

[55:20] unfamiliar skill is equal to

[55:23] now what are we going to do with this

[55:25] unfamiliar skill variable so that is

[55:29] quite easy right we have to search for a

[55:32] condition that will filter out the job

[55:35] post that is including that word that we

[55:39] are going to provide here as an

[55:41] unfamiliar skill and what we can

[55:43] actually do is search for the unfamiliar

[55:46] skill word inside the skills string so

[55:50] if you remember the skills is a long

[55:53] string that is divided with commas so we

[55:56] can go with a condition like the

[55:59] following so it will be if unfamiliar

[56:02] skill

[56:03] not inside the skills that we are

[56:07] grabbing in each job post that we

[56:10] are iterating over and now all we have

[56:12] to do here is creating the indentation

[56:15] for the different print lines okay so

[56:17] now i should see the job posts that are

[56:21] not including the unfamiliar skill that

[56:24] i'm going to provide so just to test

[56:27] that out let's

[56:29] run our program twice okay so in the

[56:32] first run we are going to write here linux

[56:35] as a skill that i'm not familiar with

[56:38] and you can see that we don't see

[56:40] anything that is including the keyword

[56:43] of linux over here but let's actually

[56:46] take that to a next level and test that

[56:48] out so we see here a specific job post

[56:52] that is including django so let's say

[56:55] that i am not familiar with django and

[56:57] see next time if i see that job post

[57:00] with this company so let's re-execute

[57:04] our program and that time i will write

[57:07] django

[57:08] and let's see the results so we can see

[57:11] that we don't have any job with django

[57:15] but we do have linux that time so

[57:18] this condition works well and we will

[57:21] continue on to the next step from here
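
The unfamiliar-skill condition tested above boils down to a `not in` check on the skills string. Below is a small sketch with made-up data; the helper name `is_suitable` and the `jobs` dictionary are assumptions for illustration, and the skill is hard-coded where the program would call `input`.

```python
def is_suitable(skills, unfamiliar_skill):
    """True when the unfamiliar skill does not appear in the skills string."""
    return unfamiliar_skill not in skills


# Hypothetical, already cleaned skills strings keyed by company name.
jobs = {
    "AcmeDataLabs": "python,django,rest",
    "Initech": "python,linux,sql",
}

# In the interactive program this value would come from input('>').
unfamiliar_skill = "django"
print(f"Filtering out {unfamiliar_skill}")

kept = [name for name, skills in jobs.items() if is_suitable(skills, unfamiliar_skill)]
print(kept)
```

Because this is a substring test, it is case-sensitive and also matches partial words: typing `java` would filter out posts that only mention `javascript`.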

[57:24] what could be an exciting challenge for

[57:25] you guys is to write an algorithm that

[57:28] will accept more than one unfamiliar

[57:31] skill so you want to accept multiple

[57:33] inputs from a user and it might be more

[57:36] challenging but i think you should try

[57:39] to spend some time on something like

[57:40] this because i think this could be an

[57:42] amazing challenge for everyone who is

[57:44] watching this video all right so now we

[57:46] are going to save each job post in a

[57:49] different file so besides printing this

[57:52] in the terminal then we are going to

[57:55] write this entire information in a

[57:58] separated file and then i will also

[58:00] allow this program to run every 15

[58:04] minutes or every 10 minutes up to you

[58:06] and i will show this logic as well so

[58:09] first of all it makes sense to wrap

[58:13] our entire program in a function and i'm

[58:16] going to do that by collecting

[58:19] everything that is kind of pulling the

[58:22] information from the website and i'm

[58:25] going to indent everything one step

[58:27] aside and then i'm going to write here

[58:30] def find_jobs

[58:32] okay so that way we have one

[58:36] function that executes our main program

[58:39] and then what i'm going to do here is

[58:42] using the logic of if double underscore

[58:45] name is double underscore main so that

[58:48] way if you want to extend this program

[58:51] only if this file is run directly then

[58:54] this function will be executed now if

[58:57] you don't know what i said about if

[58:59] double underscore name equals double

[59:01] underscore main then i have a video that

[59:04] explains this condition so you can check

[59:06] that out by the suggested link above so

[59:10] let's write here if double underscore

[59:13] name

[59:14] equals to double underscore main inside

[59:17] a string and then right here while true

[59:21] so i want to run this program forever

[59:24] and then i will call the find jobs and

[59:28] right after it since i don't want this

[59:31] program to be executed like every

[59:33] millisecond then i'm going to write here

[59:37] time dot sleep so time dot sleep allows

[59:41] your program to wait a certain amount of

[59:44] time that you decide and you can provide

[59:47] its argument in seconds so i'm going to

[59:50] write here 600 just to make that program

[59:53] to run every 10 minutes but you can

[59:56] notice how we did not import the time

[59:59] library so let's do that by

[1:00:02] import time okay and then this program

[1:00:05] should be okay now to make this more

[1:00:07] dynamic i actually prefer to

[1:00:10] make some variable here that will be

[1:00:12] equal to 10 and then i will just make

[1:00:15] that to be equal to time_wait

[1:00:17] multiplied by 60 and right after it we

[1:00:20] can provide some extra information

[1:00:23] excuse me this should be over here and

[1:00:25] we can write here waiting

[1:00:30] let's make it formatted

[1:00:31] then we can write here waiting

[1:00:34] time_wait seconds

[1:00:37] and let's write three dots here great so
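The run-and-wait loop described above could look roughly like the sketch below. `find_jobs` here is only a placeholder for the scraping function from the video, and the `max_runs` and `sleep` parameters are additions for this sketch so that the loop can be stopped and tried out quickly; the video simply uses `while True` with `time.sleep`.

```python
import time


def find_jobs():
    """Placeholder for the scraping function built earlier in the video."""
    print("Looking for jobs...")


def run_forever(task, time_wait=10, max_runs=None, sleep=time.sleep):
    """Call task, wait time_wait minutes, and repeat.

    With max_runs=None the loop never ends, which matches `while True`.
    Returns the number of completed runs when max_runs is set.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        print(f"Waiting {time_wait} minutes...")
        sleep(time_wait * 60)  # time.sleep expects seconds
    return runs


if __name__ == "__main__":
    # Short demo values; use time_wait=10 and leave max_runs unset
    # to get the behaviour described in the video.
    run_forever(find_jobs, time_wait=0, max_runs=2)
```

Because `time.sleep` is given `time_wait * 60`, the printed message reports minutes rather than seconds.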

[1:00:40] this is great so if i'm executing this

[1:00:42] program i expect to see this program

[1:00:44] running every 10 minutes so i'm

[1:00:47] inside my command line interface and you

[1:00:50] can see that my directory has been

[1:00:52] already set to the directory where we

[1:00:56] worked so i can go with python and then

[1:00:59] execute the name of the file by calling

[1:01:02] it so it will be main.py and then

[1:01:05] once i run that you can see that we

[1:01:08] receive

[1:01:09] this

[1:01:09] output and then i have to provide some

[1:01:12] information that is going to be filtered

[1:01:14] out and then let's write here django

[1:01:17] again and

[1:01:18] you can see that we receive the results

[1:01:22] successfully but more important we see

[1:01:24] that waiting 10 seconds which is not

[1:01:27] great we have to change that to waiting

[1:01:29] 10 minutes because

[1:01:31] we are waiting 10 minutes right but the

[1:01:34] program works great it was just my

[1:01:36] mistake by writing here seconds so it

[1:01:38] should be minutes for sure but i'm not

[1:01:41] going to wait 10 minutes until this

[1:01:43] program is running one more time and so

[1:01:45] i will allow myself to move on to

[1:01:48] writing this information inside a file so

[1:01:52] it makes sense to write this kind of

[1:01:55] information in a separate directory so

[1:01:58] i will go inside my web scraping tree

[1:02:01] file i mean folder and then i'm going to

[1:02:04] create here new directory which is going

[1:02:07] to be named as posts and then i'm going

[1:02:11] to

[1:02:11] write here some extra functionality that

[1:02:15] will create

[1:02:16] files i mean text files then

[1:02:19] and then inside each text file i'm going

[1:02:22] to write this exact information so you

[1:02:25] can do that by with open i already showed

[1:02:29] you how you can do that in the first

[1:02:31] episode now i know that i don't have any

[1:02:33] separate tutorial about working with

[1:02:35] files in python but you may want to

[1:02:38] check out my channel maybe i will upload one

[1:02:40] very soon so

[1:02:41] you can go here and

[1:02:43] that time i want to put here information

[1:02:46] and i will call my posts directory and

[1:02:49] then inside here i have to provide my

[1:02:53] file name that i'm going to create now

[1:02:57] before i move on here i talked about

[1:03:00] changing my for loop here and use the

[1:03:03] enumerate function now enumerate

[1:03:06] function is going to allow us to iterate

[1:03:10] over the index of the jobs list and also

[1:03:15] the job content itself and so i have to

[1:03:19] provide here one more variable like

[1:03:22] index so the index is going to be a kind

[1:03:25] of counter for the job that i'm

[1:03:28] iterating on and then the job variable

[1:03:31] will relate to the job

[1:03:33] beautiful soup object itself and so it

[1:03:36] makes sense to name our files with the

[1:03:39] index of the job that i'm iterating on

[1:03:42] so i will change this into a formatted

[1:03:44] string and then i will write here index

[1:03:47] dot txt so it will be something like the

[1:03:50] following and i expect each of my text files

[1:03:52] to be named like

[1:03:54] 0.txt or 1.txt and so on now the second

[1:03:59] argument will be the file mode

[1:04:02] that you want to use when you create or

[1:04:05] open a new file and this time i'm going

[1:04:08] to write here w and that stands for

[1:04:12] writing inside the file and then i have

[1:04:15] to use the as statement and i'm going to

[1:04:18] use the f variable so inside that block

[1:04:22] i can write to a file with the f

[1:04:26] variable and i'm going to go inside my

[1:04:29] with open and i'm going to create

[1:04:31] indentation of the prints and i'm going

[1:04:35] to delete this print line here and it

[1:04:37] makes sense to remove this blank space

[1:04:40] as well and all i have to do here is

[1:04:43] changing this print statement to f dot

[1:04:48] write and then that time i'm not going

[1:04:51] to print the results in the command line

[1:04:54] interface instead i'm going to write the

[1:04:57] information in a new file so i'm going

[1:05:00] to use the combination of alt shift here

[1:05:03] and i'm going to change all

[1:05:06] three prints to f dot write okay and

[1:05:10] then i'm going to open the parentheses

[1:05:13] so it will be closed by those and then i

[1:05:17] expect each job to be

[1:05:21] written inside a file and once i do that

[1:05:24] it might be a great idea to print a

[1:05:27] sentence like file

[1:05:30] saved and then you can provide the name

[1:05:33] of the file as an extra information so i

[1:05:36] will create one more time formatted

[1:05:37] string and then i will relate to that

[1:05:41] index

[1:05:42] variable and now our program is complete
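The `enumerate` and `with open` steps described above can be sketched as follows. To keep the example self-contained, each job is a plain string instead of a Beautiful Soup object, the sample texts are invented, and the helper name `save_posts` is hypothetical. The sketch also creates the directory with `os.makedirs`, whereas in the video the folder is created by hand.

```python
import os
import tempfile


def save_posts(jobs, directory="posts"):
    """Write each job to <directory>/<index>.txt and return the saved paths."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    # enumerate yields (index, job) pairs: a counter plus the item itself.
    for index, job in enumerate(jobs):
        path = os.path.join(directory, f"{index}.txt")
        with open(path, "w") as f:  # "w" opens the file for writing
            f.write(job)
        print(f"File saved: {index}")
        saved.append(path)
    return saved


if __name__ == "__main__":
    # A temporary directory keeps the demo from touching real files.
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_posts(["first job", "second job"], os.path.join(tmp, "posts"))
        print([os.path.basename(p) for p in paths])  # ['0.txt', '1.txt']
```

Note that mode `"w"` overwrites an existing file of the same name, so each run of the program replaces the previous `0.txt`, `1.txt`, and so on.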

[1:05:46] so let's check it okay let's go back to

[1:05:49] our command line interface and let's

[1:05:52] actually control break this program and

[1:05:55] let's write cls to clear our terminal

[1:05:58] and then i'm going to re-execute my

[1:06:01] program so it will be python

[1:06:04] main.py

[1:06:06] and then i'm going to

[1:06:09] execute it so let's see this time i'm

[1:06:12] going to write django as well

[1:06:15] and that time i don't expect to see

[1:06:18] output for the information instead i

[1:06:20] expect to see

[1:06:21] this okay so let's see

[1:06:24] what is inside each of our files so

[1:06:27] let's see what is inside that posts

[1:06:29] directory okay so i'm going to go inside

[1:06:32] my c python put

[1:06:34] web scraping tree and then the posts

[1:06:36] directory that we created a few minutes

[1:06:38] ago and you can see that inside of that

[1:06:41] we have our text files but if i go here

[1:06:44] inside let's see if the results are okay

[1:06:47] okay so

[1:06:49] i'm not quite satisfied with that

[1:06:51] because it might be a better idea to

[1:06:54] see that like

[1:06:56] i mean like this okay so

[1:06:59] you might want to divide that

[1:07:01] information into separate lines but that

[1:07:04] is not going to be complex so

[1:07:06] we just have to go inside our python

[1:07:09] again and then whenever we write to the

[1:07:14] file we have to use that convention

[1:07:17] where you can just jump a line and that

[1:07:20] will be backslash n so when you provide

[1:07:24] backslash n inside a string it is just a

[1:07:28] convention that is going to jump to the

[1:07:31] next line right after it so it will be

[1:07:34] backslash n for the first line and then

[1:07:37] also here and let's run this program one

[1:07:40] more time so i'm just going to break the

[1:07:42] program and re-execute it so that time i

[1:07:45] will write linux and then let's test our

[1:07:48] results one more time so let's go inside

[1:07:51] our

[1:07:52] 19.txt and then you can see that the

[1:07:54] information is right there just like we

[1:07:57] expected okay so this is quite great
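The newline fix can be illustrated with a small sketch. `f.write` does not add a line break on its own, so each field needs an explicit `"\n"`. The helper name `write_post` and the sample field values are invented for this example.

```python
import os
import tempfile


def write_post(path, fields):
    """Write each field on its own line by appending a newline character."""
    with open(path, "w") as f:
        for field in fields:
            f.write(field + "\n")  # without "\n" the fields run together


if __name__ == "__main__":
    # Hypothetical values; in the video these come from the scraped page.
    fields = [
        "Company Name: Example Corp",
        "Required Skills: python, linux",
        "More Info: https://example.com/job",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "0.txt")
        write_post(path, fields)
        with open(path) as f:
            print(f.read())
```

Reading the file back prints the three fields on three separate lines.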

[1:08:00] alright guys so i hope you enjoyed this

[1:08:02] entire series and you can find

[1:08:05] everything that we have done here by the

[1:08:08] links in the description of course i

[1:08:10] will provide extra information in my

[1:08:12] website about this series so if you like

[1:08:16] this video consider subscribing and also

[1:08:18] hit the like button i will see you in my

[1:08:21] future uploads
