[0:00] hi everyone and welcome to a special
[0:02] python tutorial where we are going to
[0:04] learn how to perform web scraping so
[0:07] first of all thanks to freeCodeCamp for
[0:09] giving me this opportunity of being a
[0:12] guest on their channel and i have a
[0:14] youtube channel as well that is named
[0:16] JimShapedCoding and you can find there
[0:18] any tech related topic such as
[0:20] programming languages web development and
[0:23] more content that i am uploading once or
[0:26] twice a week so you can just go ahead
[0:28] and find the link from the description
[0:30] okay so in this video i'm going to do my
[0:33] best to teach you anything that is
[0:35] related to web scraping and i'm going
[0:37] to do that with the beautiful soup
[0:39] library and that is a special library
[0:42] that will allow you to gather any
[0:44] information you want from any website
[0:46] you want okay so this website could be
[0:49] your bank account could be a job post
[0:52] website like linkedin this could be
[0:54] wikipedia or a sports website and really
[0:57] anything that you can think about so we
[1:00] will start by scraping a basic html page
[1:03] first just to understand the concepts
[1:06] and then we will move on to scraping a
[1:08] real website and by the last 15 to 20
[1:11] minutes of this tutorial i'm going to
[1:13] show you how you can store the
[1:15] information that we have just pulled
[1:17] from this website so let's begin
[1:23] great so this is the webpage that we are
[1:25] going to
[1:26] start web scraping and i'm going to
[1:29] explain what is going on here so you can
[1:31] see that we are having a basic title and
[1:33] then we are having a kind of three
[1:36] paragraphs so you can see that we have a
[1:38] title of python and then we have a kind
[1:41] of secondary title and then there is a
[1:43] basic explanation about the course
[1:45] itself and then we are having a button
[1:48] that says start that will probably lead
[1:51] us to a different page if we click on it
[1:53] and then you can see that it has the
[1:55] price here as well now we are kind of
[1:58] repeating ourselves three times here and
[2:01] that repetition is what produces the web
[2:04] development paragraph and then also
[2:07] the machine learning paragraph now what
[2:10] we are currently looking at is
[2:11] basically the behind the scenes of that
[2:14] page so this is the html code that is
[2:17] defined in order to show you that hello
[2:20] start learning page and you can see that
[2:23] inside our html documents all of the
[2:25] code is being created with tags now
[2:29] those tags are what are responsible to
[2:31] display different information for you
[2:33] and you can see that we have a big tag
[2:35] that is called html and then inside of
[2:38] that html tag we are having a head tag
[2:41] and then a body tag now you can see that
[2:44] we are defining a closing tag for each of
[2:46] our tags with the forward slash here and
[2:50] then you are probably going to see that
[2:52] for the different tags as well now let's
[2:55] expand the head tag here and then inside
[2:58] of it we are seeing some meta
[2:59] information that is not quite relevant
[3:02] for us but we see that link tag which is
[3:05] responsible to import some styling for
[3:07] our page and then we can see that title
[3:10] tag which is responsible to customize
[3:13] our tab name and that is why you'll see
[3:16] my courses over here now i will close
[3:20] back the head and then i will expand the
[3:23] body so the body is responsible to
[3:26] display what is going to be on the page
[3:28] itself it is the page's body and you can
[3:31] see that we already have the h1 tag that
[3:35] is created here and then between the
[3:37] closure which is the area that you can
[3:40] write the text for that tag we see the
[3:42] hello comma start learning and then we
[3:45] are having some div tags here and when
[3:48] you see a div tag this is a very
[3:50] basic container tag that groups other
[3:54] tags so they can be styled together so you'll see
[3:57] here the class equals card what this
[4:00] attribute assignment does here is
[4:03] applying the card styling and that is
[4:06] why you see the kind of card style
[4:10] for each of our paragraphs over that
[4:13] page and you can see that we are having
[4:16] one more div inside that card class
[4:18] which is called card header so this is
[4:20] the styling for card header this is why
[4:23] it is called that way and then the text
[4:26] is python and then we have the card body
[4:29] and we have the h5 tag which is a kind
[4:32] of smaller header that you can display
[4:35] and if i scroll right here you can see
[4:38] that python for beginners text and then
[4:40] the closure for the h5 tag and we are
[4:44] having a paragraph and then the a tag
[4:47] which is allowing us to lead to another
[4:50] page so when you see the a tag it is
[4:53] basically a reference to another page
[4:55] that you can visit
[4:56] now this entire code that i'm currently
[5:00] marking let's actually make our page a
[5:02] bit bigger here
[5:03] this entire code that i just marked is
[5:06] kind of repeated three times and that is
[5:09] why we see the page that we saw
[5:12] previously okay so it is quite important
[5:14] to understand and we are going to scrape
[5:18] that page and pull some information with
[5:21] the beautiful soup library now if you
[5:24] are confused with the script tags here
[5:26] don't because those tags are responsible
[5:29] to import some javascript libraries and
[5:32] that is something not relevant for us
[5:34] right now okay so we are going to switch
[5:36] to python now in order to apply some
[5:38] basic scraping for that page so i will
[5:41] go and start working on my main.py file
[5:45] and you can see that nothing is here now
[5:47] before we actually start we have to
[5:49] install some libraries and one of them
[5:51] will be the beautiful soup so i will
[5:53] open my terminal and since i'm working
[5:55] with my system global interpreter i will
[5:58] allow myself to install it over here and
[6:02] i will go here and write pip install and
[6:05] then we will write here beautifulsoup4
[6:08] so make sure that everything is written as one word
[6:12] not spaced and not split with dashes
[6:14] and then i'm going to hit enter and then
[6:16] you can see that it is installed
[6:18] successfully and then the next thing
[6:20] that i want to install will be something
[6:23] that is going to be used from the
[6:25] beautiful soup library and that is the
[6:27] parser method so when you work with
[6:29] beautiful soup you have to specify the
[6:31] method that you are going to use to parse html
[6:34] files into python objects okay so there
[6:38] are going to be different methods to
[6:40] parse your html code and i heard that
[6:44] the best of them could be the lxml
[6:46] parser since if you work with the
[6:49] default html parser it is not going to
[6:51] deal well with broken html code so just
[6:55] go ahead and install the lxml parser
[6:58] library and you can also do that with
[6:59] pip install and then we are going to use
[7:02] that when we work with the beautiful
[7:05] soup so i will go here and then write
[7:08] pip install lxml and then once i do that
[7:12] let's wait until it's finished great so
[7:15] we are ready now to go back to python
[7:18] and start working with the beautiful
[7:20] soup library now we have to go here and
[7:25] import that beautiful soup library so it
[7:27] could be a little bit confusing because
[7:29] the library's package is named bs4
[7:34] so that is why we are going to write
[7:36] here from bs4 import BeautifulSoup
[7:41] like this and once i have done that i
[7:44] have to figure out how i'm going to
[7:47] access the content inside the home.html
[7:50] file that is right there inside my web
[7:54] scraping directory so in order to do
[7:57] that we have to work with file objects
[8:00] now if you don't know how to work with
[8:02] files in python that is totally fine
[8:04] because we are going to go over it and
[8:07] it also might be worth to check my
[8:09] channel out if i have already uploaded
[8:12] how to work with files in python so i'm
[8:15] going to write here
[8:17] with open so this is basically a
[8:20] statement that will allow me to open a
[8:23] file and then read the content of that
[8:26] specific file so as you can see from the
[8:29] auto completion i have to specify as my
[8:32] first argument the file's name so i'm
[8:35] going to close the parenthesis here and
[8:38] then inside here i'm going to write my
[8:41] html files name now since the python
[8:45] file and then the home.html file are in
[8:48] the same exact directory it will be okay
[8:51] just to write its name so it will be
[8:53] home.html
[8:55] and the second argument will be the
[8:57] mode that you want to apply when you
[9:01] open that file in python
[9:03] so you have a couple of options when you
[9:05] work with python files you can read them
[9:08] you can write them or you can do both
[9:11] and if we only want to read the content
[9:14] then we somehow want to specify that we
[9:17] only want to read this file so we will
[9:20] open here a new string
[9:22] and we will write here r so what this
[9:25] tells to python is basically that i'm
[9:28] going to read that file only and once i
[9:31] have done that i have to write here a
[9:34] variable that is going to be used inside
[9:37] that code block that i just created
[9:40] which is the with open so i'm going to
[9:42] use the as keyword and then i'm going to
[9:45] create here a variable name that is
[9:47] going to be used throughout the block of
[9:49] the open so it will be
[9:53] html_file and that will be basically my
[9:56] variable name and then once i do that i
[9:59] will go inside the
[10:01] open block and then i will write here
[10:04] content
[10:05] equals html_file.read() and once i
[10:10] apply the read method i'm basically
[10:13] reading the html file content and in
[10:17] order to show you how this works let's
[10:20] first print the content itself so i will
[10:22] go here and print the content and then i
[10:25] will run main.py and then
[10:28] you can see that the information that is
[10:30] printed is exactly what we saw in
[10:34] home.html okay so
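The file-reading step described above can be sketched as follows. This is a minimal sketch, not the exact file from the video: SAMPLE_HTML is an assumed one-card stand-in for home.html, and it is written to a temporary directory only so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Assumed stand-in for the tutorial's home.html (one course card only).
SAMPLE_HTML = """<html>
  <head><title>My Courses</title></head>
  <body>
    <h1>Hello, Start Learning!</h1>
    <div class="card">
      <div class="card-header">Python</div>
      <div class="card-body">
        <h5>Python for beginners</h5>
        <a href="#">Start for 20$</a>
      </div>
    </div>
  </body>
</html>
"""

# Write the sample next to nothing important: a fresh temporary directory.
html_path = Path(tempfile.mkdtemp()) / "home.html"
html_path.write_text(SAMPLE_HTML, encoding="utf-8")

# "r" opens the file read-only; the with block closes it automatically.
with open(html_path, "r", encoding="utf-8") as html_file:
    content = html_file.read()

print(content)
```

If main.py and home.html sit in the same directory, as in the video, passing just "home.html" to open is enough.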
[10:36] we kind of did a great job reading this
[10:39] file now in my future episodes we are
[10:43] going to read html files from real
[10:46] websites but i just want to give you an
[10:48] idea of how web scraping works in a very
[10:52] basic way because when you work with
[10:54] actual websites the scraping and the
[10:57] information pulling is going to be quite
[10:59] a bit harder than with the html file that i have
[11:02] just written in order to explain the
[11:05] idea of web scraping okay so i'm going
[11:07] to continue on here and i'm going to use
[11:11] the beautiful soup library in order to
[11:14] prettify my html and work with its tags
[11:18] like python objects so the way you can
[11:22] accomplish that will be by creating an
[11:25] instance of beautiful soup and i will go
[11:28] here and create a new variable let's
[11:30] call it soup and that is going to be
[11:33] equal to a new instance of the beautiful
[11:36] soup library now the arguments that i'm
[11:39] going to specify here will be the html
[11:42] file that i want to scrape so the
[11:45] content of that will be the content
[11:47] variable that is created up above and
[11:49] then the second argument will be the
[11:52] parser method that we want to use so we
[11:55] will pass the parser method as a string
[11:58] and that will be the lxml that we have
[12:01] just installed previously now once i go
[12:05] ahead and try to print what is inside
[12:08] that soup instance it will be something
[12:11] like the following so we will create
[12:14] here a print statement and then we will
[12:16] go with soup.prettify() so that will
[12:20] allow you to see the html code in a more
[12:23] pretty way and if i go ahead and run
[12:27] this you can see that we see the html
[12:29] content that is exactly the same as
[12:33] what we saw in the home.html so we have
[12:37] done a great job until now so let's
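Creating the soup instance and printing it with prettify might look like the sketch below. The content string is an assumed stand-in for the text read from home.html, and the sketch swaps in Python's built-in "html.parser" so it runs without extra installs; the video passes "lxml" as the second argument instead.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the text read from home.html.
content = (
    "<html><head><title>My Courses</title></head>"
    "<body><h1>Hello, Start Learning!</h1></body></html>"
)

# First argument: the HTML text. Second argument: the parser name.
# The video uses "lxml"; "html.parser" ships with Python.
soup = BeautifulSoup(content, "html.parser")

# prettify() returns the document as an indented string.
pretty = soup.prettify()
print(pretty)
```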
[12:40] minimize back our terminal and now we
[12:43] are going to get more familiar with the
[12:45] special methods that are created inside
[12:47] the beautiful soup library so we are
[12:50] going to delete the print from here and
[12:52] we are going to start working how we can
[12:54] grab some specific information that we
[12:57] want to grab so let's assume that we
[13:00] want to grab all the html tags that are
[13:03] created as h5 tags which is a kind of
[13:06] header tag so we will go here and create
[13:08] a new variable let's call it tags for
[13:11] example and then we will go with soup
[13:14] dot find and then once i go with find it
[13:17] is going to search for the specific html
[13:20] tag that i'm going to specify here as a
[13:23] string so if i go here and write h5 and
[13:27] then down below i go ahead and print the
[13:31] tags the results of that will be
[13:34] something like the following now you can
[13:36] see that we have the entire html tag for
[13:39] the h5 tag as you can see that its text
[13:43] is python for beginners but if you
[13:46] remember we have more than one h5 tag
[13:50] created inside our home.html
[13:53] file so if you remember from the home
[13:55] file there is one here there is the
[13:58] second one over there and there is the
[14:00] third one over there and what that means
[14:04] it means that the find method searches
[14:07] for the first element and then it stops
[14:10] the execution of searching for the html
[14:13] tag that you are looking for now if you
[14:16] want to change this behavior and not
[14:18] only grab the first element then
[14:21] basically you have to change your method
[14:23] into find_all okay so that
[14:27] will search for all the h5 tags inside
[14:30] the content and now if i go ahead and
[14:33] run that out then you can see that the
[14:35] result here is quite different as we
[14:38] have here a list and then you can see
[14:40] that it has
[14:42] python for beginners and then also
[14:44] python web development and then also the
[14:46] python machine learning now that could
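The difference between find and find_all can be seen side by side in a small sketch. The three-card content string is an assumed, trimmed-down version of the page, and "html.parser" is used here in place of the video's "lxml" so nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Assumed, trimmed-down version of the three course cards.
content = """
<div class="card"><h5>Python for beginners</h5></div>
<div class="card"><h5>Python Web Development</h5></div>
<div class="card"><h5>Python Machine Learning</h5></div>
"""
soup = BeautifulSoup(content, "html.parser")

first_tag = soup.find("h5")     # stops at the first match (None if no match)
all_tags = soup.find_all("h5")  # list-like collection of every match

print(first_tag)
print(all_tags)
```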
[14:49] be a good way to get back all
[14:53] the course names from that webpage so
[14:57] you can go here and change this into
[15:00] courses_html_tags
[15:02] okay so this is what the h5
[15:05] tags are actually responsible for and
[15:08] now i can write here some different code
[15:10] that will allow me to see all the
[15:13] courses that are defined on our page so
[15:17] we have python for beginners and then we
[15:20] have python web development and then we
[15:22] also have
[15:23] python machine learning so we can work
[15:26] with this courses_html_tags that stores
[15:29] all the h5 html tags and write a
[15:33] program that is going to display all the
[15:35] courses so we can actually create here
[15:38] an iteration over courses_html_tags
[15:40] because it is a list so we will go
[15:44] here with for course in courses_html_tags
[15:48] and then inside of that course tag
[15:51] that we are iterating we can bring only
[15:54] the text attribute which is going to
[15:57] display the course text itself so it
[15:59] will be here course.text
[16:01] and now if i go ahead and run
[16:05] our program then you can see that we
[16:07] have a nice output regarding all of the
[16:10] courses that are available from that
[16:12] page so this could be a nice starter to
[16:15] understand how you can scrape a web page
[16:18] to grab some specific information you
[16:20] want all right so we were able to
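The loop over the h5 tags might be written like this. It is a sketch on an assumed three-card snippet, parsed with "html.parser" rather than the video's "lxml"; the course_names list is an extra added only to make the result easy to check.

```python
from bs4 import BeautifulSoup

# Assumed, trimmed-down version of the three course cards.
content = """
<div class="card"><h5>Python for beginners</h5></div>
<div class="card"><h5>Python Web Development</h5></div>
<div class="card"><h5>Python Machine Learning</h5></div>
"""
soup = BeautifulSoup(content, "html.parser")

courses_html_tags = soup.find_all("h5")

course_names = []
for course in courses_html_tags:
    # .text is the text between the opening and closing tag.
    print(course.text)
    course_names.append(course.text)
```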
[16:22] understand how we can apply some basic
[16:24] scraping to a web page but when you are
[16:27] going to deal with real websites the
[16:29] html code is not going to be quite
[16:31] friendly and simple like we had here so
[16:35] in order to be able to access the html
[16:39] code behind the scenes of some page we
[16:42] have to use the inspect tool of any browser
[16:45] so let's say that you want to grab the
[16:48] price for each of the courses so it
[16:51] makes sense to go with your mouse and
[16:53] hover to that button and then right
[16:56] click on it and then you want to look
[16:58] for that inspect option and once you
[17:01] open that out you will have a new pane
[17:05] that is going to be opened and then here
[17:07] we can see all the html code that is
[17:10] responsible to display what is going on
[17:13] on the left pane so you can see that we
[17:16] have here let's make it a little bit
[17:18] bigger
[17:19] so that will be enough and then you can
[17:22] see that we have here div class card
[17:25] three times which is displaying all the
[17:27] different courses now when you go over
[17:30] different html tags with your mouse you
[17:32] can see that it is going to mark for you
[17:35] the html tag that is related to it so it
[17:38] is a quite important behavior that we
[17:40] should understand
[17:42] now let's say that we want to grab the
[17:44] price for that python for beginners so
[17:46] it makes sense to expand this tag and
[17:49] see what is inside so i will go here and
[17:53] search for that button and you can see
[17:56] that this a tag is actually responsible
[17:59] for that
[18:00] button itself and then you can see that
[18:02] its text is start for twenty dollars so
[18:06] the price information is right there and
[18:10] let's actually write a program that is
[18:12] going to search for that python for
[18:14] beginners and then we will grab the
[18:17] price for that course and then we will
[18:20] be able to write a nice program that is
[18:22] going to include a list of all of the
[18:25] courses and then the prices for each one
[18:28] of them so let's go back to pycharm and
[18:31] write this program so we will go here
[18:33] and delete everything from here and the
[18:37] first step that we probably want to do
[18:40] is to be able to grab all the course
[18:43] cards so it will be
[18:46] course_cards equals soup
[18:50] dot find_all because we
[18:53] are probably looking to bring back
[18:56] all the cards so this is why you have to
[18:58] use find_all and not find and i'm
[19:01] going to search for the div tags now it
[19:04] could be much nicer if we could filter
[19:08] the div tags that we actually want to
[19:11] grab and store it inside our course
[19:13] cards so if you noticed let's go back to
[19:17] our courses page and here if i just
[19:21] expand back there all the div tags you
[19:23] can see that there is something that is
[19:26] common for all the div tags their class
[19:30] is equal to card so i can filter my div
[19:33] tags by this expression right there so i
[19:37] go back to pycharm and i will write here
[19:41] class equals to card but now you can see
[19:44] that there is an error and it is quite
[19:47] important behavior to understand you
[19:49] have to add a trailing underscore here
[19:51] because class is a reserved keyword
[19:54] in python that you use to create python
[19:57] classes so that is why you have to add
[19:59] the underscore over here and then
[20:02] beautiful soup will understand that you
[20:04] are referring to the class html
[20:08] attribute okay so it is important now
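Filtering div tags by their class with the class_ keyword can be sketched as below. The markup is an assumed approximation of the page (the container div and the card-header and card-body names follow the description in the video), and "html.parser" stands in for the video's "lxml".

```python
from bs4 import BeautifulSoup

# Assumed approximation of the page: one wrapper div and three cards.
content = """
<div class="container">
  <div class="card">
    <div class="card-header">Python</div>
    <div class="card-body"><h5>Python for beginners</h5></div>
  </div>
  <div class="card">
    <div class="card-header">Web Development</div>
    <div class="card-body"><h5>Python Web Development</h5></div>
  </div>
  <div class="card">
    <div class="card-header">Machine Learning</div>
    <div class="card-body"><h5>Python Machine Learning</h5></div>
  </div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")

# Without a filter every div comes back, including wrappers and headers.
all_divs = soup.find_all("div")

# class_ (with the underscore) filters on the HTML class attribute;
# plain "class" would be a syntax error because it is a Python keyword.
course_cards = soup.find_all("div", class_="card")

print(len(all_divs), len(course_cards))
```

Note that "card-header" and "card-body" are different class names from "card", so those inner divs are not matched by the filter.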
[20:10] since we have all the course cards
[20:12] stored right in this variable then we
[20:15] probably want to iterate over this list
[20:18] and then search for the course name and
[20:20] then the course price so let's see how
[20:23] we can do that for each of our course
[20:26] cards so we will start with
[20:29] a for loop here and that will be for
[20:31] course in course_cards and before we go
[20:34] ahead and write some more code inside
[20:36] our for loop let's actually remind you
[20:39] what is inside each of our courses and
[20:42] then you can see that we have h5 tags on
[20:46] each of our course cards and it makes
[20:48] sense to access this specific h5 tags so
[20:52] we can accomplish that by going here and
[20:55] then use the h5 tag as an attribute so
[21:00] if i go ahead and press here dot h5 and
[21:04] re run my program then you can see that
[21:06] we were able to grab each of our h5 tags
[21:10] that are inside the course card so it is
[21:13] a quite great thing and now
[21:16] if i revert this back to course again
[21:19] and run that out you can also see that
[21:22] inside our a tags we have the text for
[21:26] start for 20 dollars and that is
[21:29] repeated for all of our cards as well so
[21:32] first of all it makes sense to delete
[21:35] this again and write here something like
[21:38] course_name
[21:40] equals to course
[21:42] dot h5 and then here we probably look
[21:46] for the text attribute of that h5 tag so
[21:50] i will write here dot text and then this
[21:53] course name will be responsible to store
[21:56] the text
[21:57] on each iteration so it is great and now
[22:01] i can go here and
[22:03] write course price and then this time i
[22:06] will search for course dot a because the
[22:09] a tag stores the information about the
[22:12] course price so until now if i go ahead
[22:16] and print the course name and then i
[22:19] also go ahead and print the course price
[22:23] then we will see the results like the
[22:26] following so you can see that we have
[22:28] python for beginners and then we have
[22:31] the a tag itself but in this case we
[22:34] look for the text of that a tag as well
[22:37] so i will
[22:39] minimize my terminal out and excuse me
[22:41] for that i will delete that from here
[22:44] and then search for the text attribute
[22:46] over here as well and now i will run my
[22:49] program and then you can see that we
[22:51] have python for beginners and then we
[22:53] have the text for each of our a tags and
[22:56] now since we reached this stage it might
[22:59] be a great idea to print a sentence
[23:02] like python for beginners costs 20 okay
[23:05] so the way we can do that
[23:08] is basically using the split method to
[23:11] access that last element of that text
[23:14] because the price is located as the last
[23:18] word so it makes sense to go here with
[23:21] split and then we will split it on
[23:24] whitespace so we don't have to specify
[23:26] anything here and we want to grab that
[23:28] last element so we are looking for -1
[23:31] index over here and now if i run it you
[23:35] can see that we have the price
[23:37] for each of our courses and now it might
[23:40] be much nicer if we go ahead and use an
[23:43] f-string to print a dynamic sentence for
[23:47] each of our courses so we will go here
[23:50] with print and then we will open an f
[23:52] string and then we will access the
[23:54] course name so it will be
[23:57] course_name and then we will write
[24:00] costs and then we want to display the
[24:03] course price so it will be
[24:06] course_price now if i run our
[24:09] program then you can see that it
[24:11] displays nice information about each
[24:13] one of the courses
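Putting the steps of this part together, the whole program could look roughly like the sketch below. The markup is an assumed approximation of home.html: only the first price (20) is mentioned in the video, so the other two prices and the exact button text are made up for illustration. The parser is "html.parser" instead of the video's "lxml", and the sentences list is added only so the result can be checked.

```python
from bs4 import BeautifulSoup

# Assumed approximation of home.html; the 50$ and 100$ prices are invented.
content = """
<div class="card">
  <div class="card-body">
    <h5>Python for beginners</h5>
    <a href="#">Start for 20$</a>
  </div>
</div>
<div class="card">
  <div class="card-body">
    <h5>Python Web Development</h5>
    <a href="#">Start for 50$</a>
  </div>
</div>
<div class="card">
  <div class="card-body">
    <h5>Python Machine Learning</h5>
    <a href="#">Start for 100$</a>
  </div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")
course_cards = soup.find_all("div", class_="card")

sentences = []
for course in course_cards:
    # Attribute access (course.h5, course.a) returns the first such tag
    # found inside the card.
    course_name = course.h5.text
    # split() with no argument splits on whitespace; [-1] is the last word.
    course_price = course.a.text.split()[-1]
    sentence = f"{course_name} costs {course_price}"
    print(sentence)
    sentences.append(sentence)
```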
[24:15] now if you think about it that is a
[24:17] quite nice behavior that we have applied
[24:19] here because if you scrape a real
[24:21] website like udemy that keeps updating
[24:24] courses then it might be a great idea to
[24:27] launch this program every certain amount
[24:30] of time for example each week and then
[24:32] you have the ability to be aware about
[24:35] each of the courses that udemy has
[24:37] updated on the webpage that you scrape
[24:40] on so this is a quite nice behavior that
[24:43] we were able to reach here
[24:47] on this one we are going to scrape real
[24:49] websites with the requests library so i'm
[24:52] going to simulate this against a website
[24:55] that is going to search for job
[24:57] advertisements and i'm going to bring
[24:59] all the jobs from a specific website
[25:02] whose main skill requirement is the
[25:06] python programming language and i'm
[25:07] going to write a program that is going
[25:10] to pull the latest published job
[25:12] advertisements from a specific website
[25:15] so it is going to be very interesting so
[25:17] let's get started all right so one of
[25:19] the first things that we must do is to
[25:22] ensure that we have the requests library
[25:25] installed so i'm going to go down to my
[25:29] terminal right in pycharm and i'm going
[25:32] to write here pip install requests just
[25:35] to make sure that i have the requests
[25:36] library installed now the output for
[25:40] me could be different from yours
[25:41] because you may not have the requests
[25:43] library but since i already have that
[25:46] you can see outputs like requirement
[25:49] already satisfied okay so it is quite
[25:51] important now i'm going to minimize the
[25:54] terminal and write here import requests
[25:58] so you want to make sure that you do
[25:59] that after the installation of this
[26:01] library and the first thing that i'm
[26:03] going to do here is to use the get
[26:07] method of the requests library now what
[26:11] the requests library is doing behind the
[26:13] scenes it is just requesting information
[26:15] from a specific website so it is like a
[26:18] real person
[26:19] going to a website and requesting some
[26:22] information okay so you can go with
[26:24] something like the following when it
[26:26] comes to the requests library so it will be
[26:28] requests.get so you want to get
[26:32] specific information from a website and
[26:34] here we are going to provide an empty
[26:36] string for now but later on we are going
[26:39] to complete this string with the url
[26:41] that we are going to scrape
[26:44] and i'm going to assign this to a new
[26:46] variable and i will call it html text so
[26:50] i'm going to make that to be equal to
[26:52] this entire statement now let's go to a
[26:55] web browser and look up for the website
[26:57] that is going to include some job ads
[27:00] okay so this is timesjobs.com and this
[27:04] website includes job posts about almost
[27:07] everything so you can simply go down
[27:09] here and search for some skill that you
[27:12] own and then this will search for you
[27:15] jobs that are requiring this specific
[27:17] skill in that position now this video is
[27:20] recorded a couple of days before i
[27:23] uploaded it so if you watch this video
[27:25] after a couple of months or even a year
[27:27] or two since the publish date then there
[27:30] is a great chance that the html elements
[27:32] are going to be quite different but the
[27:34] main point of this video is to teach you
[27:37] all the tools to pull information from a
[27:40] website just as you want and then you
[27:43] can apply your own customizations and
[27:45] kind of doing a reverse engineering to
[27:48] the code that i'm going to write
[27:50] throughout this tutorial great so let's
[27:52] go here and write python so i will
[27:55] receive only job posts about this
[27:58] programming language and you can see
[28:00] that we have this job found over there
[28:04] and we have a lot of jobs that are
[28:06] published so my goal here in this
[28:09] tutorial would be to
[28:12] let's get this closed so my goal in this
[28:15] tutorial will be to bring all the jobs
[28:19] that are posted a few days ago so if i
[28:22] am zooming here in then you can see that
[28:26] we have posted a few days ago for a
[28:29] couple of posts but after i reach down
[28:32] here we have posted four days ago so
[28:36] this might mean that this job post is
[28:39] not the most updated so i'm going to
[28:42] bring all the jobs and i'm going to
[28:45] condition my program to bring those
[28:48] elements with the posted few days ago
[28:51] text only so let's go back to here now
[28:55] i'm going to bring this url from here
[28:58] and i'm going to paste that in in the
[29:01] empty string that we created inside the
[29:04] requests.get and once i have done that
[29:07] what is stored inside this variable
[29:09] right now is simply the response with its
[29:13] status code okay so if i'm going to
[29:16] print the i mean if i'm going to run
[29:18] this program then we are going to see
[29:21] the results like the following so 200 is
[29:24] the standard http status code meaning that the
[29:28] request succeeded but in
[29:31] order to avoid the status code we are
[29:34] going to go to here and i'm going to
[29:37] accept the text only so i'm going to go
[29:40] here and then write dot text okay so
[29:43] this is what we have to apply here in
[29:46] order to bring the html text of that
[29:49] specific page and now it makes sense to
[29:52] leave this variable name as it is
[29:54] because it is storing the html text and
[29:56] i'm going to re run this program and we
[29:59] will probably receive a large
[30:02] amount of html so right now it is
[30:05] not quite relevant but i just
[30:07] wanted to show you the results so let's
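The request step can be sketched as a small helper. The fetch_html name, the timeout value, and the raise_for_status call are additions for this sketch and are not part of the video, which calls requests.get(...).text directly. The URL in the usage comment is a placeholder, not the search URL pasted in the video.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of a page, or raise if the request failed."""
    # requests.get returns a Response object; printed on its own it
    # shows up as <Response [200]> when the request succeeded.
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx status codes instead of silently
    # continuing with an error page.
    response.raise_for_status()
    # .text is the body of the response decoded to a string.
    return response.text


# Usage with a placeholder URL (replace with the page you want to scrape):
# html_text = fetch_html("https://example.com/")
# print(html_text)
```

Keeping the network call inside a function makes it easy to reuse once the scraper needs to fetch more than one page.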
[30:10] continue from here okay so as you know
[30:13] we are going to
[30:14] create a beautiful soup instance like we
[30:17] did in the previous episode and i'm
[30:19] going to provide the html as the html
[30:22] text variable so it will be soup equals
[30:26] to
[30:26] an instance of a beautiful soup and then
[30:29] i'm going to write here html text as my
[30:32] information that i want to scrape and we
[30:35] are going to use the same parser again
[30:37] like the previous episode so it will be
[30:40] lxml now once i have done that it makes
[30:43] sense to go back to our page and see how
[30:47] we can grab each of these paragraphs
[30:50] from this website so the white boxes are
[30:54] kind of a list of elements that this
[30:57] page has provided here and i want to
[31:00] look for a method that is going to bring
[31:03] me all the job posts so it makes sense
[31:06] to catch a certain element inside that
[31:10] post and right click on it and then
[31:13] click on inspect and once i have done
[31:16] that you can see here so i'm going to
[31:19] zoom in things a little bit
[31:21] so we can see that the h3 tag is
[31:26] pointing to that
[31:28] text over here i know that the text is a
[31:31] little bit small here but just you can
[31:33] see that it has a gray mark and i'm
[31:36] going to go up here and then you can see
[31:40] that those elements are opened up as
[31:42] well so if i hover my mouse here then
[31:45] you can see a green background wrapped
[31:47] in the article over here i mean the
[31:49] paragraph and then if i close that up
[31:52] you can see that we have a lot of clear
[31:55] fix job dash px and something like that
[31:59] as the class name and our html
[32:04] element here is called li so li stands
[32:07] for list item and then you can see that it is
[32:09] inside a ul tag which stands for
[32:14] unordered list and it contains a
[32:16] lot of
[32:17] li tags inside that ul so you can see
[32:20] once i close that then the entire
[32:24] list of all the posts are marked with a
[32:27] blue
[32:28] background so i'm going to search the
[32:31] element of li with that name of class so
[32:34] i'm going to copy the name of the class
[32:37] here and i'm going to go back to my
[32:39] pycharm and i'm going to write here jobs
[32:43] equals to soup dot find
[32:46] underscore all and i'm going to search
[32:49] for all the li's and as the second
[32:53] argument it makes sense to pass here
[32:55] class_ equals and then
[32:58] inside that string i'm going to paste
[33:01] that in the class name that we have
[33:04] copied from the page itself so once i
[33:07] have done that then we will probably see
[33:10] the results of all the jobs in that page
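The search described above can be sketched as follows. This is a minimal illustration and not the exact code from the video: the inline HTML stands in for the downloaded page, the class names are assumed spellings of what is read out loud, and the standard-library "html.parser" is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page. The markup and the class names
# ("clearfix job-bx wht-shd-bx", "joblist-comp-name") are assumptions.
html_text = """
<ul>
  <li class="clearfix job-bx wht-shd-bx"><h3 class="joblist-comp-name">Acme Ltd</h3></li>
  <li class="clearfix job-bx wht-shd-bx"><h3 class="joblist-comp-name">Globex</h3></li>
  <li class="other">not a job post</li>
</ul>
"""

soup = BeautifulSoup(html_text, "html.parser")

# find_all returns every matching <li>. The keyword is class_ (trailing
# underscore) because "class" is a reserved word in Python.
jobs = soup.find_all("li", class_="clearfix job-bx wht-shd-bx")

# find (without _all) returns only the first match.
first_job = soup.find("li", class_="clearfix job-bx wht-shd-bx")

print(len(jobs))
print(first_job.h3.text)
```

When the pasted string contains several class names separated by spaces, Beautiful Soup compares it with the whole class attribute, so the names have to appear in the same order as in the page.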
[33:14] now this doesn't mean that it is going
[33:16] to bring back all the
[33:19] 16 000 jobs because you can see that
[33:23] this page is being paginated so that
[33:27] means that it is going to bring the
[33:29] results only for the first page so this
[33:32] is not going to take extremely long now
[33:35] if i go back to here and print the jobs
[33:39] then let's see the results before we
[33:40] continue on just to make sure that
[33:42] everything is okay so we can see that we
[33:45] receive the results and then we see that
[33:47] we have some company names and i think
[33:50] that everything is quite great here now
[33:52] in order to work with this
[33:55] scraping project it makes sense to
[33:57] work with only one job element so i'm
[34:01] going to
[34:02] delete the underscore all from here and
[34:05] what this means is that it is
[34:07] going to bring back the first match that has
[34:11] the li tag and the class name equal to
[34:14] this string over here so let's change
[34:17] this variable name to just job for now
[34:20] okay just in order to develop our
[34:22] program step by step
[34:23] relying on only one job post okay so
[34:28] once we've done that we probably want to
[34:30] search for the company name of that
[34:32] specific job post so i'm going to go
[34:35] back to here and i'm going to make
[34:38] things bigger over here and now let's
[34:41] actually go here and try to inspect what
[34:44] is going on
[34:45] here again so let me
[34:48] zoom that out great now i'm going to
[34:52] try to inspect this text over here again
[34:55] and then we can see that it is inside
[34:58] the li tag for sure but we can also see
[35:01] that it is inside an h3 tag and it has
[35:04] the class name of job list comp name so
[35:08] i'm going to search for that class but
[35:12] not in the entire page and speaking
[35:15] about the entire page let's go to our
[35:18] pycharm you want to search for that
[35:22] specific element only inside the job
[35:25] itself so you see it doesn't make sense
[35:28] to search for an h3 tag in the entire
[35:31] page again so you can basically go with
[35:35] job dot find instead of soup dot find
[35:38] because we want to search for that h3
[35:41] tag only inside our job so if i go ahead
[35:44] and print the job here then we can see
[35:47] that it only includes an html code about
[35:50] only one job and i'm going to search for
[35:53] this h3 tag
[35:55] so let's create here a new variable and
[35:58] i'm going to call that company
[36:00] underscore name and we are going to use
[36:02] job.find
[36:04] and we are going to pass here as an
[36:07] argument the h3 and then this time the
[36:10] class underscore is going to be equal to
[36:13] whatever this h3 tag includes as the
[36:16] class name which is the job list comp
[36:19] name now to debug this out and to ensure
[36:24] that the results are great we are going
[36:26] to print the company name
[36:28] and then you can see that we receive
[36:30] this
[36:31] element back and i'm going to use
[36:33] here the dot text attribute just to bring
[36:36] back the text itself now once i do that
[36:40] we are going to see a weird result here
[36:43] now you can see that we have some white
[36:46] spaces so we kind of want to replace our
[36:50] white spaces with nothing so in order to
[36:54] do this one i'm going to go here and i'm
[36:57] going to use the replace method and this
[37:01] trick is going to avoid having these
[37:04] unnecessary white spaces so i'm going to
[37:06] replace the spaces with nothing so i'm
[37:10] going to pass a space in quotes and then
[37:12] an empty pair of single quotes and
[37:15] once i have done that and rerun our
[37:18] program then you can see that the result
[37:20] is going to be quite different as you
[37:22] can see this text is fully aligned to
[37:25] left now let's minimize back and
[37:28] continue from here now we're going to
[37:30] zoom out a little bit the code here just
[37:33] we can see the important points like the
[37:35] replacement
[37:36] and let's continue from here now it also
[37:39] makes sense to bring the skill
[37:42] requirements other than the python
[37:44] programming language because we know
[37:46] that this job is only for people who are
[37:50] good with the python programming
[37:52] language so i'm going to go here and i'm
[37:55] going to repeat myself in the same
[37:57] process again and i'm going to write
[37:59] here job.find and we are probably
[38:02] looking for an element that is including
[38:05] a text about the skill requirements so
[38:09] let's search for that okay so let's go
[38:11] back to our website again and i'm going
[38:13] to
[38:14] go here and check out what html element
[38:17] is including the skills so we are
[38:20] talking about this one so i'm going to
[38:22] inspect inside here and we can see here
[38:26] that this text is inside a span tag
[38:30] with the class name of srp skills so i'm
[38:34] going to copy this class name again and
[38:38] this time i'm going to search for the
[38:40] span elements inside my job post so i'm
[38:44] going to go back to pycharm again and
[38:47] i'm going to write here span so this is
[38:49] the html tag that we are searching for
[38:52] and again i'm going to write class
[38:54] underscore equals to that srp skills now
[38:58] i want to ensure the results over here
[39:02] once again so you always want to
[39:06] quickly print the results of whatever
[39:08] html element that you want to pull to
[39:10] see what other methods you have to apply
[39:14] to prettify your result okay so let's
[39:18] run our program again and it makes sense
[39:20] to delete the print company name so
[39:23] let's re-execute our program
[39:27] and then you can see here that we have
[39:30] a span tag and then here we have a
[39:33] strong tag which is basically created to
[39:36] make our text bold when we want to type
[39:39] in something so i'm just going to
[39:41] guess here that i'm going to
[39:44] only write here dot text and then i
[39:46] expect for the results to be fine so
[39:49] let's check out for that and then you
[39:50] can see here that the results are quite
[39:52] great so we have the python scripting
[39:55] and then we have some more requirements
[39:58] that are divided with commas and a lot
[40:01] of white spaces again so i'm going to
[40:04] apply the same method of that replace
[40:06] once again like we did with the company
[40:09] name so let's write here dot replace and
[40:11] i'm going to replace white spaces
[40:14] with nothing so let's re-execute that
[40:18] out and then we can see that the result
[40:20] is quite like we want and now we were
[40:23] also able to grab the skills as well so
[40:26] this is quite nice now if we want to
[40:30] display a nice information about the job
[40:33] until now then we want to go with a nice
[40:36] print message here so let's try to
[40:38] create a nice message so we will use an
[40:40] f string here and we will also use
[40:43] triple quotes just to allow us to
[40:46] write some text in separate lines as
[40:49] well and i'm going to write here company
[40:52] name
[40:52] like this and then i'm going to write
[40:56] here company name so i'm calling the
[40:59] company name value by writing it inside
[41:02] curly brackets and i'm going to repeat
[41:05] the same process for required skills so
[41:09] it will be required skills and then i'm
[41:12] going to make that to be equal to skills
[41:16] variable and now if i go and execute our
[41:19] program let's see if the results are
[41:22] quite nice
[41:23] yes so we are receiving nice
[41:25] information about the job okay so
[41:29] this is quite great
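As a rough sketch of the steps so far (one job post, company name, skills, and the triple-quoted f-string), the code might look like the block below. The HTML snippet and the class names "joblist-comp-name" and "srp-skills" are assumptions based on what is read out in the video, and "html.parser" is used in place of whichever parser the project relies on.

```python
from bs4 import BeautifulSoup

# Stand-in markup for a single job post; the structure and class names are assumed.
html_text = """
<li class="clearfix job-bx wht-shd-bx">
  <h3 class="joblist-comp-name">
      Acme Ltd
  </h3>
  <span class="srp-skills">
      python  ,  <strong>django</strong>  ,  linux
  </span>
</li>
"""

soup = BeautifulSoup(html_text, "html.parser")
job = soup.find("li", class_="clearfix job-bx wht-shd-bx")

# Search inside the job element, not the whole page.
# .text drops the tags; replace(" ", "") then removes every space character.
company_name = job.find("h3", class_="joblist-comp-name").text.replace(" ", "")
skills = job.find("span", class_="srp-skills").text.replace(" ", "")

# Triple quotes let the f-string span several lines.
print(f"""
Company Name: {company_name}
Required Skills: {skills}
""")
```

Note that replace(" ", "") also removes the spaces inside a name (here "Acme Ltd" becomes "AcmeLtd") and leaves the newline characters in place, which is one reason the printed lines do not line up perfectly.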
[41:31] now if we go back to here then we want
[41:35] to search for one more element so
[41:38] you remember that i told you that we
[41:41] only want to grab the
[41:44] job post with the text of posted few
[41:47] days ago so we for sure want to write
[41:51] some extra code to apply this
[41:53] functionality so i'm going to go here
[41:56] and i'm going to inspect for that
[41:58] element again
[42:00] and then we can see that it is inside a
[42:02] span once again but i can also see that
[42:06] this
[42:07] job post includes some more span
[42:10] tags so i have to filter the results
[42:14] again with the class name itself so i'm
[42:17] going to search for that sim posted
[42:21] class name and i'm going to go back to
[42:24] here so we will write this time
[42:27] job
[42:29] published date
[42:31] so it makes sense to delete the job
[42:32] excuse me so it is just going to be
[42:34] published date and i'm going to go here
[42:37] again with job.find
[42:39] and we will search for the span and then
[42:42] this time the class underscore is going
[42:45] to be equal to the text that i just
[42:47] copied and i'm going to repeat myself
[42:51] with printing the published date but
[42:55] that time let's just avoid printing this
[42:58] print line so i'm just going to comment
[43:00] out those lines and let's see what the
[43:03] published
[43:04] date text is looking like and you can
[43:06] see that we have here something
[43:09] a little bit weird so we have the span
[43:13] here and we have also one more span
[43:16] inside of the text of it so what that
[43:19] means it means that we have to take some
[43:21] different action than what we did
[43:23] previously so this time i want to use
[43:26] the span attribute just to get
[43:29] inside that inner tag over here and then right
[43:32] after it i want to look for the text of
[43:34] that span tag so this will give me the
[43:38] published date of this specific job but
[43:43] i'm not going to include the publish
[43:44] date inside my print message because we
[43:47] only want the publish date for the
[43:49] functionality to stop our execution if
[43:52] the published date text does not include
[43:56] the word few and i'm going to code
[43:59] this functionality just in a second so
[44:01] you will see what i mean by what i said
[44:03] all right so what i'm going to do here
[44:06] is take a tricky action that is going to
[44:08] bring me all the jobs from the first
[44:11] page so if we paid attention then all
[44:14] the job posts include this class name
[44:18] so what i can do instead of find is
[44:21] change that back to find underscore all and
[44:25] change this variable name to jobs and i
[44:28] know that right now this raises an
[44:31] error here so i'm going to use here a
[44:34] for loop that is going to iterate over
[44:37] each element and i'm going to write here
[44:40] for
[44:40] job in jobs and then i'm going to create
[44:44] an indentation of the entire code that
[44:48] is right there so the results will be
[44:51] applied for all the jobs that are posted
[44:55] in the first page of the
[44:57] web page that we scrape so once i hit
[45:00] here the colon sign then i'm going to
[45:03] create an indentation for each of our
[45:06] lines like this and then the results are
[45:09] going to be quite the same so let's test
[45:12] that out okay i'm going to
[45:14] uncomment our print line over here and
[45:17] just for comfort reasons i'm also going
[45:19] to print here and empty lines so we can
[45:22] kind of see a division between the
[45:24] different jobs and then i'm going to
[45:26] delete the published date for now so if
[45:30] we execute our program
[45:32] this time then we are going to see
[45:35] nicer results and this is going to
[45:37] contain all the job posts
[45:39] from the page that we scrape so
[45:42] you can see that we have a nice
[45:43] paragraph for that job post and then we
[45:46] have also another one here and if i keep
[45:48] scrolling up we can see a lot of them in
[45:51] that output so this is quite great so if
[45:54] you remember we wanted to filter out the
[45:58] job posts that are not including the
[46:02] word of few inside the published date
[46:06] because what that means it means that
[46:08] this job could be outdated so if i go to
[46:12] our page again then we can see that as i
[46:15] keep scrolling down we have some
[46:18] text like posted six days ago and i
[46:21] wanted to filter out only the jobs that
[46:23] are containing the text of posted few
[46:26] days ago so in order to apply this i'm
[46:29] going to change the orders here a little
[46:32] bit okay so i'm going to
[46:34] cut this searching here and i'm going to
[46:37] paste that in as the first line inside
[46:41] my for loop now the reason i'm doing
[46:43] this it is basically because i don't
[46:45] want to continue on scraping for that
[46:48] post if the publish date is not matching
[46:51] my condition so it makes a lot of sense
[46:54] to place this code as the first line
[46:58] inside my for loop and then right here
[47:01] i'm going to
[47:03] write a condition that is going to check
[47:05] if the word few is inside that text
[47:10] so it will be if
[47:12] few in
[47:14] published date and again i'm going to
[47:17] create an indentation for the entire
[47:20] code here so you can do that with the
[47:22] shift alt combined and then you can just
[47:25] press tab and all the lines here are
[47:29] being indented so right now if i go
[47:32] ahead and execute our program then we
[47:36] should see the results again like almost
[47:39] the same but we also see here that the f
[47:43] string is not quite nice
[47:46] but i can live with that okay so it is
[47:49] great that we were able to receive
[47:53] only the posts that were published a few
[47:56] days ago now there is no limit to what
[47:59] you can do when it comes to web scraping
[48:02] and what you can filter in or filter out
[48:05] but basically this program deals with
[48:08] how to grab some job posts with the
[48:11] filters that you want to apply that
[48:14] sometimes may not be available
[48:16] on the website itself so you can write
[48:19] your own filters in your python code
[48:22] while you scrape some information from a
[48:25] specific website
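The loop and the date filter described in this part could be sketched like this. It is an illustration under assumptions: the sample HTML, the class name "sim-posted" and the other class names are guesses at what the page uses, and str.strip() is used here instead of the replace call so that spaces inside names survive.

```python
from bs4 import BeautifulSoup

# Two made-up job posts; only the first one says "few days ago".
html_text = """
<ul>
  <li class="clearfix job-bx wht-shd-bx">
    <h3 class="joblist-comp-name"> Acme Ltd </h3>
    <span class="srp-skills"> python, django </span>
    <span class="sim-posted"><span>Posted few days ago</span></span>
  </li>
  <li class="clearfix job-bx wht-shd-bx">
    <h3 class="joblist-comp-name"> Globex </h3>
    <span class="srp-skills"> python, linux </span>
    <span class="sim-posted"><span>Posted 6 days ago</span></span>
  </li>
</ul>
"""

soup = BeautifulSoup(html_text, "html.parser")
jobs = soup.find_all("li", class_="clearfix job-bx wht-shd-bx")

recent = []
for job in jobs:
    # Read the date first so older posts are skipped before any other lookup.
    # .span steps into the inner <span> that holds the text.
    published_date = job.find("span", class_="sim-posted").span.text
    if "few" in published_date:
        company_name = job.find("h3", class_="joblist-comp-name").text.strip()
        skills = job.find("span", class_="srp-skills").text.strip()
        recent.append((company_name, skills))
        print(f"Company Name: {company_name}")
        print(f"Required Skills: {skills}")
        print("")  # empty line between posts
```

Checking the date at the top of the loop body keeps the remaining lookups from running for posts that will be discarded anyway.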
[48:29] so i'm going to do whatever it takes to
[48:31] turn this program into a very useful one
[48:34] and i'm going to do that by applying
[48:36] some special functionalities such as
[48:39] wrapping this entire program in a while
[48:41] loop and executing this project at a
[48:45] fixed interval and also applying
[48:47] some filters to leave out the job
[48:51] posts that do not match the skills
[48:54] that i have and also i'm going to write
[48:56] the results of the different job posts
[48:59] into a new blank file so i can be aware
[49:02] of the posts that are being posted every
[49:05] certain amount of time so let's get
[49:07] started all right then so let's start
[49:09] with a kind reminder of the results that
[49:11] we got until that point so
[49:14] we run our program now and if we show it
[49:18] right here you can see that those lines
[49:21] are not aligned well so i'm going to
[49:24] change that and i'm also going to
[49:26] provide some extra information that will
[49:28] show us the exact
[49:30] link of the specific job that we are
[49:33] iterating on so that way i will have the
[49:35] ability to just click on the link and
[49:38] then see more information about that job
[49:41] so to begin with i will get rid of the
[49:43] formatted string in that case because
[49:46] doing a formatted string with a triple
[49:48] quote might not be a great idea when you
[49:51] execute it with a for loop because as
[49:53] you can see that it also includes the
[49:55] indentations right here so i'm going to
[49:58] delete this entire code here and i'm
[50:01] going to write two more new formatted
[50:04] strings and we will start with company
[50:07] name
[50:09] make that to be equal to company name so
[50:12] make sure to add a colon here so it
[50:14] will be more friendly and then i will
[50:16] write here required skills as well and
[50:19] then we will write here the skills
[50:22] variable now there was one more issue
[50:25] with the result that we showed a minute
[50:27] ago and that was the blank spaces that
[50:31] are being shown as well so we can get
[50:35] rid of the spaces by a special method
[50:37] that is called strip and it is a special
[50:40] method that you are allowed to call
[50:43] on strings and since the company
[50:45] name and the skills are strings by
[50:48] default i don't have to convert them to
[50:50] a string so i can just call that method
[50:54] like this okay and now i will show the
[50:57] results of
[50:58] something like the following and in a
[51:01] few seconds we will see that
[51:03] this is aligned way better than what it
[51:06] was and i'm also going to add here more
[51:09] information line that will show the link
[51:13] of the job post so let's do that okay
[51:16] let's go here and write this
[51:19] functionality okay so we had an
[51:21] unordered list that inside of that we
[51:24] had some different html tags that are
[51:27] called li and that stands for list item and
[51:30] they are actually different job posts
[51:33] that are divided into different elements
[51:36] inside an unordered list and then if we
[51:39] hover our mouse you can see that there
[51:41] are different jobs now if i go inside
[51:44] one of them and i go inside a header tag
[51:47] that is actually the first child of the
[51:50] li tag and then i will
[51:53] go inside the h2 here and then you can
[51:56] see that we have a link that could lead
[52:00] us to a page that provides some extra
[52:02] information about that specific job so
[52:06] if i actually
[52:07] go here and click on here you can see
[52:10] that we receive the job description
[52:13] right here so what we have to do in
[52:15] order to access this link in each job
[52:18] post that we are iterating on the python
[52:21] code is actually going inside the header
[52:24] and then going inside one more tag which
[52:26] is the h2 as you saw me doing
[52:29] and then accessing the a
[52:31] tag so let's do that okay i'm going to
[52:34] go back to pycharm and apply this
[52:36] functionality so we will go under the
[52:40] skills and then we will write here more
[52:43] info and that will be equal to job
[52:47] dot header because this was the first
[52:50] tag that we want to go inside of it and
[52:53] then we want to go inside the h2 and
[52:56] then inside that h2 we want to go inside
[52:59] the a tag now before we go further let's
[53:03] test ourselves that we have done great
[53:05] job so let's print the more info in the
[53:09] following way so it will be more info
[53:12] and then we will call the variable in a
[53:15] formatted string now let's execute our
[53:18] program
[53:20] and then you can see that inside the
[53:22] more info we have the a href which gives
[53:26] us the link about the
[53:28] specific job that we are iterating on so
[53:31] all i have to do here is going back to
[53:34] my more info and then call that href
[53:38] attribute so this time i'm going to do
[53:40] that with a square bracket like in
[53:42] dictionaries and then i'm going to write
[53:44] here href so i will receive the value of
[53:47] that attribute so if i run that one more
[53:50] time
[53:51] then i should see the link only and that
[53:54] is what exactly happening so the result
[53:57] is quite great and then you can see that
[53:59] this is already better than what we did
[54:01] in the last episode and we will continue
[54:04] from here okay so what i want to do now
[54:06] is giving the opportunity for the user
[54:09] that executes this program to filter out
[54:12] some skill requirement that he does not
[54:15] own so we will use the input function
[54:17] for that and then whatever the input is
[54:19] equal to we will filter out the results
[54:22] from the jobs that we are finding
[54:25] right here okay so let's write this
[54:28] functionality so to apply this i'm going
[54:30] to create a new variable over here and
[54:33] i'm going to call it unfamiliar skill
[54:36] and i'm going to make that to be equal
[54:38] to an input and then i'm going to
[54:41] write here something like this okay so
[54:43] the user could understand that he has to
[54:47] provide some information in order to
[54:49] execute this program and actually it
[54:51] might be a great idea to print some
[54:53] extra information before that input
[54:56] function so it will be print
[54:59] put some skill that you are not
[55:03] familiar with and then right after the
[55:06] unfamiliar skill input i will write here
[55:10] filtering out
[55:12] and we will actually make that a
[55:14] formatted string and then we will write
[55:16] here filtering out and then whatever the
[55:20] unfamiliar skill is equal to
[55:23] now what are we going to do with this
[55:25] unfamiliar skill variable so that is
[55:29] quite easy right we have to search for a
[55:32] condition that will filter out the job
[55:35] post that is including that word that we
[55:39] are going to provide here as an
[55:41] unfamiliar skill and what we can
[55:43] actually do is search for the unfamiliar
[55:46] skill word inside the skills string so
[55:50] if you remember the skills is a long
[55:53] string that is divided with commas so we
[55:56] can go with a condition like the
[55:59] following so it will be if unfamiliar
[56:02] skill
[56:03] not in the skills that we are
[56:07] grabbing from each job post that we
[56:10] are iterating over and now all we have
[56:12] to do here is create the indentation
[56:15] for the different print lines okay so
[56:17] now i should see the job posts that are
[56:21] not including the unfamiliar skill that
[56:24] i'm going to provide so just to test
[56:27] that out let's
[56:29] run our program twice okay so in the
[56:32] first we are going to write here linux
[56:35] as a skill that i'm not familiar with
[56:38] and you can see that we don't see
[56:40] anything that is including the keyword
[56:43] of linux over here but let's actually
[56:46] take that to a next level and test that
[56:48] out so we see here a specific job post
[56:52] that is including django so let's say
[56:55] that i am not familiar with django and
[56:57] see next time if i see that job post
[57:00] with this company so let's re-execute
[57:04] our program and that time i will write
[57:07] django
[57:08] and let's see the results so we can see
[57:11] that we don't have any job with django
[57:15] but we do have linux that time so
[57:18] this condition works well and we will
[57:21] continue on to the next step from here
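The unfamiliar-skill filter, together with the link lookup through header, h2 and a, might be sketched as below. The function name find_matching_jobs, the sample HTML, the example.com links and the class names are all hypothetical. In an interactive run the skill would come from input(); here it is passed as an argument so the sketch runs without typing anything.

```python
from bs4 import BeautifulSoup

# Made-up posts; the header > h2 > a nesting follows the description in the video.
html_text = """
<ul>
  <li class="clearfix job-bx wht-shd-bx">
    <header><h2><a href="https://example.com/jobs/1">Python Developer</a></h2></header>
    <h3 class="joblist-comp-name"> Acme Ltd </h3>
    <span class="srp-skills"> python, django </span>
  </li>
  <li class="clearfix job-bx wht-shd-bx">
    <header><h2><a href="https://example.com/jobs/2">Backend Engineer</a></h2></header>
    <h3 class="joblist-comp-name"> Globex </h3>
    <span class="srp-skills"> python, linux </span>
  </li>
</ul>
"""


def find_matching_jobs(page_html, unfamiliar_skill):
    """Return (company, skills, link) for posts that do not mention the skill."""
    soup = BeautifulSoup(page_html, "html.parser")
    matches = []
    for job in soup.find_all("li", class_="clearfix job-bx wht-shd-bx"):
        skills = job.find("span", class_="srp-skills").text.strip()
        # Plain substring check; it is case-sensitive ("Django" != "django").
        if unfamiliar_skill not in skills:
            company_name = job.find("h3", class_="joblist-comp-name").text.strip()
            # Tag attributes are read with square brackets, like dictionary keys.
            more_info = job.header.h2.a["href"]
            matches.append((company_name, skills, more_info))
    return matches


# Interactive variant: unfamiliar_skill = input(">")
for company_name, skills, more_info in find_matching_jobs(html_text, "django"):
    print(f"Company Name: {company_name}")
    print(f"Required Skills: {skills}")
    print(f"More Info: {more_info}")
```

Because the check is a substring test on the whole skills string, a short input such as "java" would also exclude posts that list "javascript".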
[57:24] what could be an exciting challenge for
[57:25] you guys is to write an algorithm that
[57:28] will accept more than one unfamiliar
[57:31] skill so you want to accept multiple
[57:33] inputs from a user and it might be more
[57:36] challenging but i think you should try
[57:39] to spend some time on something like
[57:40] this because i think this could be an
[57:42] amazing challenge for everyone who is
[57:44] watching this video all right so now we
[57:46] are going to save each job post in a
[57:49] different file so instead of printing this
[57:52] in the terminal we are going to
[57:55] write this entire information in a
[57:58] separate file and then i will also
[58:00] allow this program to run every 15
[58:04] minutes or every 10 minutes up to you
[58:06] and i will show this logic as well so
[58:09] first of all it makes sense to wrap
[58:13] our entire program in a function and i'm
[58:16] going to do that by collecting
[58:19] everything that is kind of pulling the
[58:22] information from the website and i'm
[58:25] going to indent everything one step
[58:27] aside and then i'm going to write here
[58:30] def find
[58:32] jobs okay so that way we have one
[58:36] function that executes our main program
[58:39] and then what i'm going to do here is
[58:42] using the logic of if double underscore
[58:45] name equals double underscore main so that
[58:48] way if you want to extend this program later
[58:51] then only if this file is run directly
[58:54] this function will be executed now if
[58:57] you don't know what i said about if
[58:59] double underscore name equals double
[59:01] underscore main then i have a video that
[59:04] explains this condition so you can check
[59:06] that out by the suggested link above so
[59:10] let's write here if double underscore
[59:13] name
[59:14] equals to double underscore main inside
[59:17] a string and then right here while true
[59:21] so i want to run this program forever
[59:24] and then i will call the find jobs and
[59:28] right after it since i don't want this
[59:31] program to be executed like every
[59:33] millisecond then i'm going to write here
[59:37] time dot sleep so time dot sleep allows
[59:41] your program to wait a certain amount of
[59:44] time that you decide and you provide
[59:47] its argument in seconds so i'm going to
[59:50] write here 600 just to make that program
[59:53] to run every 10 minutes but you can
[59:56] notice how we did not import the time
[59:59] library so let's do that by
[60:02] import time okay and then this program
[60:05] should be okay now to make this more
[60:07] dynamic i actually prefer to
[60:10] make a variable here named time wait that will be
[60:12] equal to 10 and then i will just make
[60:15] the sleep argument equal to time wait
[60:17] multiplied by 60 and right after it we
[60:20] can provide some extra information
[60:23] excuse me this should be over here and
[60:25] we can write here waiting
[60:30] let's make it formatted
[60:31] then we can write here waiting
[60:34] time wait seconds
[60:37] and let's write three dots here great
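The repeat-and-wait logic can be sketched as follows. The helper run_every and its max_runs and sleep parameters are hypothetical additions that let the loop be stopped and tried out quickly; the video uses a plain while True loop with time.sleep. The message prints minutes, since the wait value is given in minutes and only converted to seconds for time.sleep.

```python
import time


def run_every(task, wait_minutes, max_runs=None, sleep=time.sleep):
    """Call task(), wait, and repeat; runs forever when max_runs is None."""
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        print(f"Waiting {wait_minutes} minutes...")
        sleep(wait_minutes * 60)  # time.sleep expects seconds
    return runs


def find_jobs():
    # Placeholder for the scraping function built in this tutorial.
    print("Scraping job posts...")


if __name__ == "__main__":
    # max_runs=1 and a no-op sleep keep this demo from blocking;
    # drop both arguments to run every 10 minutes until interrupted.
    run_every(find_jobs, wait_minutes=10, max_runs=1, sleep=lambda seconds: None)
```

Keeping the wait in one variable and multiplying by 60 at the call site means the printed message and the actual delay cannot drift apart.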
[60:40] this is great so if i'm executing this
[60:42] program i expect to see this program
[60:44] running every 10 minutes so i'm
[60:47] inside my command line interface and you
[60:50] can see that my directory has been
[60:52] already set to the directory where we
[60:56] worked so i can go with python and then
[60:59] execute the file by passing its name
[61:02] so it will be main dot py and then
[61:05] once i run that you can see that we
[61:08] receive
[61:09] this
[61:09] output and then i have to provide some
[61:12] information that is going to be filtered
[61:14] out and then let's write here django
[61:17] again and
[61:18] you can see that we receive the results
[61:22] successfully but more important we see
[61:24] that waiting 10 seconds which is not
[61:27] great we have to change that to waiting
[61:29] 10 minutes because
[61:31] we are waiting 10 minutes right but the
[61:34] program works great it was just my
[61:36] mistake by writing here seconds so it
[61:38] should be minutes for sure but i'm not
[61:41] going to wait 10 minutes until this
[61:43] program is running one more time and so
[61:45] i will allow myself to move on to
[61:48] writing this information inside a file so
[61:52] it makes sense to write this kind of
[61:55] information in a separate directory so
[61:58] i will go inside my web scraping tree
[62:01] file i mean folder and then i'm going to
[62:04] create here new directory which is going
[62:07] to be named as posts and then i'm going
[62:11] to
[62:11] write here some extra functionality that
[62:15] will create
[62:16] files i mean text files
[62:19] and then inside each text file i'm going
[62:22] to write this exact information so you
[62:25] can do that with with open i already showed
[62:29] you how you can do that in the first
[62:31] episode now i know that i don't have any
[62:33] separate tutorial about working with
[62:35] files in python but you may want to
[62:38] check out my channel maybe i will upload one
[62:40] very soon so
[62:41] you can go here and
[62:43] that time i want to put here information
[62:46] and i will call my post directory and
[62:49] then inside here i have to provide my
[62:53] file name that i'm going to create now
[62:57] before i move on here i talked about
[63:00] changing my for loop here and use the
[63:03] enumerate function now enumerate
[63:06] function is going to allow us to iterate
[63:10] over the index of the jobs list and also
[63:15] the job content itself and so i have to
[63:19] provide here one more variable like
[63:22] index so the index is going to be a kind
[63:25] of counter for the job that i'm
[63:28] iterating on and then the job variable
[63:31] will refer to the job
[63:33] beautiful soup object itself and so it
[63:36] makes sense to name our files with the
[63:39] index of the job that i'm iterating on
[63:42] so i will change this into a formatted
[63:44] string and then i will write here index
[63:47] dot txt so it will be something like the
[63:50] following and i expect each of my text files
[63:52] to be named like
[63:54] 0.txt or 1.txt and so on now the second
[63:59] argument will be the mode
[64:02] that you want to use when you create or
[64:05] open a file and this time i'm going
[64:08] to write here w and that stands for
[64:12] writing to the file and then i have
[64:15] to use the as keyword and i'm going to
[64:18] use the f variable so inside that block
[64:22] i can write to a file with the f
[64:26] variable and i'm going to go inside my
[64:29] with open and i'm going to create
[64:31] indentation of the prints and i'm going
[64:35] to delete this print line here and it
[64:37] makes sense to remove this blank space
[64:40] as well and all i have to do here is
[64:43] changing this print statement to f dot
[64:48] write and then that time i'm not going
[64:51] to print the results in the command line
[64:54] interface instead i'm going to write the
[64:57] information in a new file so i'm going
[65:00] to use the combination of alt shift here
[65:03] and i'm going to change all
[65:06] three prints to f dot write okay and
[65:10] then i'm going to open the parentheses
[65:13] so it will be closed by those and then i
[65:17] expect each job to be
[65:21] written inside a file and once i do that
[65:24] it might be a great idea to print a
[65:27] sentence like file
[65:30] saved and then you can provide the name
[65:33] of the file as an extra information so i
[65:36] will create one more time formatted
[65:37] string and then i will relate to that
[65:41] index
[65:42] variable and now our program is complete
[65:46] so let's check it okay let's go back to
[65:49] our command line interface and let's
[65:52] actually control break this program and
[65:55] let's write cls to clear our terminal
[65:58] and then i'm going to re-execute my
[66:01] program so it will be python
[66:04] main dot py
[66:06] and then i'm going to
[66:09] execute it so let's see this time i'm
[66:12] going to write django as well
[66:15] and that time i don't expect to see
[66:18] output for the information instead i
[66:20] expect to see
[66:21] this okay so let's see
[66:24] what is inside each of our files so
[66:27] let's see what is inside that post
[66:29] directory okay so i'm going to go inside
[66:32] my c python put
[66:34] web scripting tree and then the post
[66:36] directory that we created a few minutes
[66:38] ago and you can see that inside of that
[66:41] we have our text files but if i go here
[66:44] inside let's see if the results are okay
[66:47] okay so
[66:49] i'm not quite satisfied with that
[66:51] because it might be a better idea to
[66:54] see that like
[66:56] i mean like this okay so
[66:59] you might want to divide this
[67:01] information into separate lines but that
[67:04] is not going to be complex so
[67:06] we just have to go inside our python
[67:09] again and then whenever we write to the
[67:14] file we have to use that convention
[67:17] where you can just jump a line and that
[67:20] will be backslash n so when you provide
[67:24] backslash n inside a string it is just a
[67:28] convention that is going to jump to the
[67:31] next line right after it so it will be
[67:34] backslash n for the first line and then
[67:37] also here and let's run this program one
[67:40] more time so i'm just going to break the
[67:42] program and re-execute it so that time i
[67:45] will write linux and then let's test our
[67:48] results one more time so let's go inside
[67:51] our
[67:52] 19.txt and then you can see that the
[67:54] information is right there just like we
[67:57] expected okay so this is quite great
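The file-writing step can be sketched as below. The function save_posts and the sample data are hypothetical, and a temporary directory is used here so the example does not touch a real posts folder. In the video enumerate runs over every job on the page before the filters are applied, which is why the file numbers there can have gaps; in this sketch the list is already filtered, so the numbers are consecutive.

```python
import os
import tempfile


def save_posts(posts, directory):
    """Write each (company, skills, link) tuple to <directory>/<index>.txt."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    # enumerate yields a running index together with each item.
    for index, (company_name, skills, more_info) in enumerate(posts):
        path = os.path.join(directory, f"{index}.txt")
        # "w" opens the file for writing and creates it if it does not exist.
        with open(path, "w") as f:
            # write() adds no line break on its own, hence the explicit \n.
            f.write(f"Company Name: {company_name}\n")
            f.write(f"Required Skills: {skills}\n")
            f.write(f"More Info: {more_info}\n")
        print(f"File saved: {index}")
        saved.append(path)
    return saved


# Made-up data standing in for the scraped job posts.
sample_posts = [
    ("Acme Ltd", "python, linux", "https://example.com/jobs/1"),
    ("Globex", "python, sql", "https://example.com/jobs/2"),
]
posts_dir = os.path.join(tempfile.mkdtemp(), "posts")
saved_paths = save_posts(sample_posts, posts_dir)
```

Opening a file in "w" mode overwrites any existing file with the same name, so each scheduled run replaces the files written by the previous one.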
[68:00] alright guys so i hope you enjoyed this
[68:02] entire series and you can find
[68:05] everything that we have done here by the
[68:08] links in the description of course i
[68:10] will provide extra information in my
[68:12] website about this series so if you like
[68:16] this video consider subscribing and also
[68:18] hit the like button i will see you in my
[68:21] future uploads