[0:00] hey what's up everyone welcome back to
[0:01] another video in this video we are going
[0:03] to do some Advanced web scraping with
[0:06] python if you haven't already seen it I
[0:08] recommend checking out my overview on
[0:10] the python beautiful soup Library I'll
[0:12] pop it up right here but this is going
[0:14] to kind of take that video and really
[0:16] build onto more advanced things that you
[0:18] might see in the wild web scraping is an
[0:20] incredibly useful skill I do a lot of
[0:23] work on upwork and I feel like the most
[0:25] common thing that I ever see is web
[0:27] scraping jobs so if you can Master web
[0:29] scraping you have tons of opportunities
[0:32] when it comes to freelance when it comes
[0:33] to work etc before we get into the video
[0:36] though I want to give a shout out to
[0:38] this video sponsor and that is bright
[0:40] data when it comes to really Advanced
[0:43] web scraping stuff it's hard to do it on
[0:45] your own one of the things that I like
[0:46] most about bright data our sponsor is
[0:48] they offer all sorts of proxy tools
[0:51] which basically allows you to send
[0:53] requests from many different locations
[0:55] and it bypasses it allows you to
[0:57] ultimately bypass a lot of the
[0:58] restrictions that sites try to implement
[1:01] to prevent you from scraping them uh if
[1:03] you don't want to deal with learning how
[1:05] to web scrape and watching this whole
[1:07] tutorial bright data also offers a bunch
[1:10] of data sets that are available for
[1:12] purchase such as Walmart which we'll be
[1:14] scraping today but also Amazon
[1:17] Airbnb Instagram LinkedIn Etc so tons of
[1:23] data sets in the data set Marketplace
[1:24] that you could also just get started
[1:26] super quickly but without further Ado
[1:27] let's get into this tutorial all right
[1:30] to demonstrate these Advanced scraping
[1:32] techniques we're going to be doing some
[1:33] scraping on walmart.com and if this ever
[1:36] feels too quick for you I recommend
[1:38] checking out the original video I did on
[1:40] uh beautiful soup which will be linked
[1:42] in the description and also I'll pop it
[1:44] up right here the first thing we'll want
[1:45] to do with any sort of web scraping
[1:48] project is kind of identify the HTML of
[1:51] what we're trying to scrape so let's say
[1:53] I was looking for a new computer monitor
[1:56] because of course I'm a programmer I
[1:57] need 10,000 monitors at all times
[2:00] so I look up monitor here and let's see
[2:03] what looks good let's click on this
[2:06] first Samsung one how about this Odyssey
[2:08] G6 we'll check that out so if I was
[2:11] trying to web scrape and like collect
[2:13] information because I want to pay the
[2:14] best price for these monitors uh I'm
[2:17] going to be looking at the HTML on the
[2:19] page so I might be trying to get this
[2:21] information I might be you know
[2:23] trying to get the you save information
[2:26] probably be trying to get this
[2:27] information maybe the reviews
[2:29] information bunch of things that I want
[2:31] to get but the way that we do this in
[2:34] web scraping land is for most
[2:36] internet browsers you can right click
[2:39] and click on inspect which opens up the
[2:42] kind of HTML source code so if we look
[2:46] in here we can specifically click
[2:48] inspect on this tag and kind of find
[2:51] where the price is one thing though that
[2:53] I also notice with this specific Walmart
[2:55] page is there's this concept of item
[2:58] props so that kind of leads me to to
[3:00] think that maybe there's another kind of
[3:02] technique going on here that's getting
[3:04] some of this information if we look at
[3:07] all of the information on this page
[3:09] trying to see if there's anything that
[3:11] sticks out here yep there we go um
[3:14] there's this next data field let's see
[3:16] what's in that uh and we see in here
[3:19] there's this massive massive Json object
[3:23] that has props so this next data tag is
[3:26] telling me this is a specific type of
[3:28] JavaScript project (Next.js)
[3:30] that leverages these props and passes
[3:33] them in on the web page so instead of
[3:35] actually scraping from the HTML tags
[3:37] themselves we can actually usually find
[3:39] the information we're looking for within
[3:41] this props
[3:43] field um so I think kind of our
[3:46] technique for scraping information from
[3:48] Walmart will be do a search like we just
[3:51] did so we had you know the search for
[3:53] monitor this is a pretty easy format I
[3:55] could very easily replace monitor here
[3:58] with other words and search for them
[4:00] and then kind of programmatically go
[4:02] through this page maybe click on items
[4:05] and then once we're clicking on an
[4:08] item we'll see for this monitor too that
[4:11] it also has these page props and then in
[4:15] this page props
[4:17] information basically we'll want to look
[4:19] for and find where the actual price is
[4:24] within this nested Json so that's the
[4:27] general technique Let's uh open up a
[4:29] code code editor and kind of just write
[4:31] some template code out to help us do
[4:34] this so I'm in this Advanced scraping
[4:36] repo and I'll link this in the
[4:39] description uh I'll link my GitHub repo
[4:42] with all the code that we'll see in this
[4:43] video but I'll just call this like
[4:46] Walmart
[4:47] scraper
[4:49] .py and probably the first thing we'll
[4:51] want to do is install beautiful soup and
[4:54] I recommend if um you know how to set
[4:57] up a virtual environment to do this I'll
[5:00] link a video on how to do that right
[5:01] here but we can go pip3 install
[5:03] beautifulsoup4 we see we already have
[5:06] it and we're also going to want to use
[5:09] the requests Library so I'll just kind
[5:11] of template out some code and then kind
[5:12] of explain it more in depth in a
[5:17] sec often like when I forget how to
[5:20] import things feel like beautiful soup
[5:22] is always a weird thing I like using
[5:23] Copilot
[5:30] and then I also want to import the
[5:31] requests library and Walmart URL equals
[5:36] how about
[5:38] https
[5:39] walmart.com slash maybe we grab a specific
[5:43] one of those pages that we had
[5:45] open so I'm just going to steal this
[5:48] page real quick that URL is this so just
[5:51] take any product URL from Walmart we got
[5:54] the curved Ultra wide Monitor and then
[5:58] we'll want to get that so
[6:00] response equals requests.get pass in a
[6:04] URL and we'll print out that
[6:09] HTML okay so let's go ahead and do that
[6:12] cool we get the
[6:15] HTML and the next thing we'll want to do
[6:19] is get specifically that next data tag
[6:23] so we can start looking at what's in
[6:24] that so if we look at our web page again
[6:27] I inspect this
[6:29] this next data basically we'll want to
[6:32] find the script with the ID next data so
[6:34] how do we do that with beautiful soup so
[6:36] we'll say soup
[6:39] equals BeautifulSoup response.text
[6:43] we'll use the HTML
[6:46] parser and we'll specifically want to
[6:48] find that script tag with the ID
[6:53] so soup.find script tag with id
[7:01] next data I think it was like this we
[7:05] can change the ID if we have to that
[7:09] looks good basically we want to just
[7:11] print out the HTML again and it says
[7:13] None that is strange why does it say
[7:15] none so let's make sure that the next
[7:18] data is correct next two
[7:21] underscores so that looks correct and so
[7:24] this is the first I think like Advanced
[7:27] feature that you'll have to know about
[7:29] if you ever have a situation where you
[7:31] retrieve a page it's usually one or two
[7:33] and like the tag you're looking for is
[7:35] not there when you retrieve it with
[7:36] python it's either one of two things
[7:39] one is that this is a dynamic page that
[7:42] like some of the stuff loads a little
[7:43] bit later or two they're purposely
[7:46] hiding stuff from requests that
[7:49] don't look Human by default there's
[7:51] certain headers when you use the
[7:53] requests library in Python and so to get
[7:56] around and make ourselves look more
[7:57] human we want to make sure that that we
[8:00] mock whatever headers we use when we're
[8:02] actually making this request in the
[8:05] python land what I would probably do
[8:07] here to figure out what are good headers
[8:09] to use look at our Network and any
[8:12] really one of these probably will be
[8:13] fine and basically we want to just copy
[8:17] the headers that we use in our actual
[8:19] web browser so we have the request
[8:21] headers down here we copy some of these
[8:24] parameters we'll probably find we have
[8:27] more success so we probably want to copy
[8:29] the accept languages and the user agent
[8:33] tags good and sometimes you might even
[8:36] want to like have this cycle between
[8:39] several different options so you always
[8:41] are changing up what you look like We'll
[8:43] add
[8:47] commas that looks good cool and so now
[8:52] what we can do with our requests library
[8:54] and I guess I have to change the order
[8:56] of this slightly
[9:01] we can pass in the
[9:04] headers equals the new headers that
[9:06] we've added and now the question is can
[9:08] we find this script tag so we're going
[9:11] to go ahead and run all of this
[9:16] code and look at that we do get um this
[9:20] next data field that we are looking for
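Put together, the header fix described above might look like the following sketch. The header values are placeholders standing in for whatever your own browser sends (copy the real ones from the Network tab), and `fetch_html` and `find_next_data` are hypothetical helper names, not code taken from the video's repo.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder values: replace them with the request headers your browser sends.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url, headers=HEADERS, timeout=30):
    """Download a page while sending browser-like headers."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def find_next_data(html):
    """Return the <script id="__NEXT_DATA__"> tag, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("script", id="__NEXT_DATA__")
```

If `find_next_data(fetch_html(product_url))` still returns `None`, the page is probably either loading that data later or still treating the request as non-human.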
[9:24] awesome all right now that we have this
[9:27] data here we want to be able to grab
[9:30] specifically the price from this so what
[9:32] I would recommend is this is
[9:34] crazy uh complicated to look at I feel
[9:38] like probably it's purposely like just
[9:39] super obfuscated to make it hard to
[9:43] grab everything what I might do is go
[9:46] back to the web page inspect grab the
[9:50] next data I might copy the element and
[9:54] then what I might recommend doing
[9:56] is copy this JSON go to like an online
[10:00] token counter probably a bunch of these
[10:02] will work I'm just going to use this
[10:04] Streamlit app I see and paste in your
[10:08] Json and then see how many tokens so we
[10:10] see we have like 59,000 tokens not super
[10:13] friendly to honestly the like GPT 3.5 I
[10:16] think that's above any limits it can
[10:19] process you could use GPT 4 if you have
[10:21] access to full
[10:23] 128k context basically the goal is can
[10:26] we look at what's in this without having
[10:29] to to dig through all of this Json
[10:31] couple different ways to do it um what
[10:33] we might do real quick and we'll want to
[10:35] make this Json so we can do that by
[10:39] importing the Json Library as
[10:41] well and then we can go ahead and we
[10:43] should be able to do something like um
[10:46] data
[10:47] equals HTML or script
[10:51] tag.string that'll get all the text
[10:56] within it we can do json.loads
[11:00] and now we should be able to print out
[11:02] data.keys cool look at that
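As a small sketch of that step, here is `json.loads` applied to the script text, plus a helper for listing keys one level at a time. The sample string is a tiny stand-in for the real Walmart payload, the camelCase spelling `pageProps` is an assumption about how the spoken "page props" is written, and `keys_at` is a hypothetical helper.

```python
import json

# Stand-in for script_tag.string; the real payload is far larger.
raw = '{"props": {"pageProps": {"initialData": {}}}, "page": "/ip/example"}'


def keys_at(data, *path):
    """Follow `path` into nested dicts and return the keys found there."""
    node = data
    for key in path:
        node = node[key]
    return list(node.keys())


data = json.loads(raw)
print(keys_at(data))                        # top-level keys
print(keys_at(data, "props"))               # one level down
print(keys_at(data, "props", "pageProps"))  # two levels down
```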
[11:06] so basically we can kind of keep
[11:08] repeating this process we could get the
[11:11] props and then see what the keys of that
[11:14] are it's going to be kind of a slow
[11:17] process but I'm just kind of showing how
[11:19] you could do it
[11:21] manually so what are the props of the
[11:23] keys or the keys of the props page props
[11:27] so we know we have to go into page props
[11:30] and I'm thinking they're in props just
[11:31] because that's what you know the
[11:33] JavaScript framework will
[11:36] leverage uh initial data sounds pretty
[11:47] good I would say probably
[11:49] data not
[11:54] headers um probably product
[11:59] but maybe also review information would
[12:01] be helpful
[12:03] too oh wow now there's a lot of things
[12:06] now I'd look at if there's a price in
[12:08] here somewhere
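Instead of scanning the printed keys by eye, one option is a small recursive search. This is a different technique from the manual digging shown in the video, offered only as a sketch: `find_key_paths` is a hypothetical helper, and the sample dictionary merely imitates the shape of the real data (its key names are assumptions).

```python
def find_key_paths(node, needle, path=()):
    """Yield the path (tuple of keys/indexes) to every dict key containing `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            current = path + (key,)
            if needle.lower() in str(key).lower():
                yield current
            yield from find_key_paths(value, needle, current)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key_paths(item, needle, path + (index,))


# Made-up sample that imitates a nested product payload.
sample = {
    "product": {
        "name": "Example monitor",
        "priceInfo": {"currentPrice": {"price": 199.0}},
        "reviews": [{"rating": 5}],
    }
}

for found in find_key_paths(sample, "price"):
    print(" -> ".join(str(part) for part in found))
```

Each printed path can then be turned into ordinary indexing, one key per bracket, to read the value.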
[12:09] price do we see any
[12:13] price price info I like
[12:20] that now what's in the price info
[12:23] one current price I like that this might
[12:27] be the end of our
[12:31] dictionary we'll
[12:32] see and then I think probably if we get
[12:35] price from
[12:39] this see what we get look at that that
[12:42] gives us $199 and if we look at the
[12:45] product that we were looking at it is
[12:49] 100% $199 so that was getting
[12:52] information from this page not super
[12:54] easy but uh the reason I was asking
[12:57] saying go to this token counter is
[12:59] if you have a large language model that
[13:02] has the ability to parse this number of
[13:04] tokens what I might recommend is copying
[13:08] your code like for example Claude if you
[13:11] have the pro
[13:12] version um Can parse a lot so I might
[13:18] say what are the most important
[13:22] fields from the following
[13:26] Json to get the price info and review
[13:32] info for the current
[13:36] product paste this in which is crazy
[13:40] very long um I'd say limit to just
[13:45] 10 uh
[13:50] items share full path to get to that
[13:55] field using python syntax or something
[13:58] like that
[14:02] this looks like pretty
[14:04] good um I noticed that it's just going
[14:07] like four levels deep so I feel like
[14:09] it's skipped over the stuff to get to
[14:12] product um I might say hey wait to get
[14:16] to product you need to use the following
[14:23] path here's the code I found that works
[14:28] for
[14:30] price please adjust Solutions
[14:38] accordingly copy this paste it
[14:43] in perfect this looks useful so take
[14:46] this as an example but you can use like
[14:48] large language models to help you parse
[14:50] you know massive amounts of Json like
[14:52] this but to simplify the process if you
[14:56] go to the GitHub link I'll share a
[14:59] code snippet that will help us get the
[15:01] information we're looking for so I'll go
[15:03] ahead and paste that in here we go so
[15:05] here's a nice little product info
[15:08] dictionary and I can go ahead and print
[15:12] product
[15:14] info for our current product look at
[15:17] that review count 239 let's check to see
[15:20] if that's also valid look at that 239
[15:24] reviews so this looks great we're
[15:25] getting some information I think the big
[15:27] thing though is if we wanted to collect
[15:29] a lot of information on products such as
[15:33] you know these monitors and collect it
[15:34] over time and you know every day maybe
[15:36] run something that gets all this
[15:38] information and it stores it in a
[15:39] database or something we need to modify
[15:42] this a bit so what might we do well we
[15:45] can separate this into maybe two
[15:47] functions one function will be called
[15:50] extract product
[15:54] info that will take in some sort of
[15:56] product URL and we can basically just
[16:00] paste in what we already
[16:03] have so this is product URL now and
[16:08] instead of printing product info we'll
[16:10] return product info so that's one of our
[16:13] functions and then the other function we
[16:15] might want to have is like get uh
[16:19] product
[16:20] pages or get product links or something
[16:23] like
[16:23] that and what that might look like is
[16:26] maybe that takes in a a query if you
[16:29] remember how we got to this monitor page
[16:32] the first thing we did was we did a
[16:34] search that looks like this so I'm going
[16:37] to go ahead and copy this call this like
[16:39] base URL or like search URL how about
[16:43] search URL equals this and then we can
[16:46] use an f-string to swap out the search
[16:50] term and make that whatever our query is
[16:53] I think the only other thing I might add
[16:54] to this is let's say we wanted to scrape
[16:57] a bunch of information
[16:59] on
[17:00] monitors well this is all the monitors
[17:04] but you know what if we wanted more than
[17:06] just the first page of monitors we might
[17:09] go to the second page and we see that
[17:12] you can also leverage Page information
[17:14] here so I'm going to go ahead and copy
[17:16] that and put this also into my URL so
[17:20] how about we also pass in a page
[17:24] number and by default it can start at
[17:27] one you want to do a similar type thing
[17:30] get all of the pages or get get this
[17:33] search URL then we would probably want
[17:36] to get all of the links from this so if
[17:39] I look at this page right click
[17:42] inspect we get this href here looks like
[17:47] this um these are sponsored ones
[17:51] scroll out these first ones are all
[17:53] sponsored so I might just see what it
[17:54] looks like if it looks any different on
[17:56] non-sponsored ones okay here's another
[17:59] link and this one look at that it
[18:03] doesn't have the full URL just
[18:07] has the kind of this part of it so we
[18:11] want to be able to handle both cases so
[18:14] grabbing all of those links might look
[18:16] something
[18:18] like basically we want to find all the
[18:20] links in this so
[18:30] find all a
[18:35] tags I want to make sure that they have
[18:37] an
[18:38] href as part of it because that's how
[18:40] we're going to access this
[18:42] information and then we can do something
[18:44] like
[18:45] for for
[18:49] Link in links and then we know that the
[18:53] product URLs I think one little trick
[18:55] always have this /ip/ in them so what I
[19:00] might do to grab all the links and not
[19:02] get anything that's like a link to some
[19:04] other random
[19:06] page um that's not super useful is I
[19:10] might check to see if in the text so if
[19:16] this specific term is in the link href
[19:22] href is the actual link stuff so that's
[19:24] why we're looking at that specifically
[19:26] and we basically want to add that link
[19:27] to some sort of queue
[19:29] so I might add a
[19:33] list and we saw two different types of
[19:35] URLs if https is in the URL then
[19:40] basically our URL is already a full
[19:48] URL full URL
[19:58] however if it's just the /ip/ stuff then
[20:01] our full URL would be equal
[20:06] to and then basically we will want
[20:16] to append this link to our product
[20:23] links this is just basically things
[20:25] we'll search
[20:27] for and then we'll want to return our
[20:30] product
[20:36] links
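The link-collecting logic just described could be sketched roughly like this. The `q` and `page` query parameter names are assumptions (the transcript never spells out the search URL), and `build_search_url` and `extract_product_links` are hypothetical names that split the video's single function into a URL builder and a parser so the parsing can be tried on any HTML string.

```python
from urllib.parse import quote_plus

from bs4 import BeautifulSoup

BASE_URL = "https://www.walmart.com"


def build_search_url(query, page_number=1):
    """Build a search URL for `query` (parameter names are assumed)."""
    return f"{BASE_URL}/search?q={quote_plus(query)}&page={page_number}"


def extract_product_links(html):
    """Collect full product URLs from every <a href=...> containing "/ip/"."""
    soup = BeautifulSoup(html, "html.parser")
    product_links = []
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if "/ip/" not in href:
            continue  # not a product page, skip it
        if "https" in href:
            full_url = href             # already a full URL
        else:
            full_url = BASE_URL + href  # relative path such as /ip/...
        product_links.append(full_url)
    return product_links
```

Fetching `build_search_url("monitor")` with the same browser-like headers as before and passing the HTML to `extract_product_links` gives the list that the function in the video returns.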
[20:37] cool so now let's test out to see if
[20:40] this works so let's create a main
[20:52] function how we get product links for a
[20:55] search term like computers
[21:02] run
[21:03] that and we do see we get a bunch of
[21:07] links cool this looks pretty good so now
[21:10] if we wanted to create some sort of kind
[21:12] of like you could either do a
[21:14] database use something like MongoDB or
[21:17] SQLite or you could even save some of
[21:19] this stuff locally it depends on how
[21:21] complex you want to get but basically
[21:23] what we could do is do a loop over get
[21:27] product links and and uh extract product
[21:30] info and basically combine the two
[21:32] things and uh save the
[21:37] results so what that might look like I'm
[21:40] going to save this as a Json lines file
[21:43] so I'll call this output file
[21:51] equals and then we'll write to our
[21:54] output file
[22:12] we can Loop over a search query so how
[22:20] about links equals product or get
[22:23] product links
[22:26] computers how we make this a while
[22:31] true while true links equals get product
[22:35] Links Page
[22:39] number and we can start popping off
[22:44] links not product
[22:47] links if not
[22:50] links how about or if we get you know
[22:54] once we get past the first 100 Pages
[22:56] we're probably fine also breaking or
[23:00] page
[23:01] number greater than
[23:04] 99 we break out of the loop iterate over
[23:08] the
[23:13] links and we will go ahead and product
[23:18] info equals extract product info on this
[23:23] link and then if we do get a result or
[23:26] something like that we might
[23:29] right to our
[23:31] file and we should also add a new line
[23:35] to this because this is a Json lines
[23:37] file so basically each dictionary will
[23:40] be its own line and it's probably good
[23:43] practice for us to surround this in a
[23:45] try except
[23:59] then we'll want to increase the page
[24:03] number by one and we
[24:13] might show that we're going to a next
[24:15] page cool this is a basic way but this
[24:18] should scrape all of our computer
[24:19] information I believe and we'll see if
[24:21] this
[24:24] works there we go oops messed up the
[24:27] syntax there
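A minimal sketch of the loop described above, under the assumption that the two functions from the video are passed in as callables. `scrape_to_jsonl` and its parameter names are hypothetical; injecting `get_links` and `extract_info` just makes the loop easy to try without hitting the site.

```python
import json


def scrape_to_jsonl(get_links, extract_info, out, max_pages=99):
    """Page through search results and write one JSON object per line to `out`."""
    page_number = 1
    written = 0
    while True:
        links = get_links(page_number)
        # Stop when a page has no links or we are past the page limit.
        if not links or page_number > max_pages:
            break
        for link in links:
            try:
                product_info = extract_info(link)
                if product_info:
                    # JSON Lines: each dictionary goes on its own line.
                    out.write(json.dumps(product_info) + "\n")
                    written += 1
            except Exception as exc:  # keep going if one product page fails
                print(f"Failed to process {link}: {exc}")
        page_number += 1
        print(f"Moving on to page {page_number}")
    return written
```

In the video's setup this would be called with an open output file, a function such as `lambda page: get_product_links("computers", page)`, and `extract_product_info`.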
[24:30] and while this runs we can kind of check
[24:32] to see if it's working by looking at the
[24:34] file and see if lines of Json are being
[24:38] added and we do see items being
[24:44] added again all of the code is linked in
[24:47] the description and I'll kind of break
[24:49] it down by where we're at in the
[24:51] video so that finished scraping and if
[24:54] we look at our product info uh we can
[24:56] see we have a bunch of products scraped I
[24:59] think there's a couple ways that I would
[25:00] improve this though I think one of the
[25:02] big ones is that this wasn't quite smart
[25:05] on what products we scraped so if you
[25:08] look you'll see like some items pop up
[25:11] all the time like this Acer Chromebook
[25:13] so there's two different ways you could
[25:15] kind of solve this problem is you could
[25:16] check the item ID and only add items to
[25:19] your kind of queue of links to search if
[25:22] you hadn't added it already and
[25:24] you could do that by the item ID or you
[25:26] could kind of the simpler solution might
[25:28] be just look at the link that you're
[25:33] scraping um here and if you've already
[25:35] seen that link before don't scrape the
[25:37] info another thing we might want to do
[25:39] too is that this ran successfully but we
[25:41] only had one search term computers what
[25:43] if you wanted to look up a bunch of
[25:44] things what would that look like and
[25:46] would this same system be able to
[25:49] operate there I'm going to paste in some
[25:51] code with these changes uh again both
[25:55] the code you see currently and this new
[25:57] code all available in the description on
[26:01] GitHub but we'll go ahead and paste this
[26:05] in couple changes now some of the
[26:08] variable names have changed but we have
[26:10] now multiple search queries so computers
[26:12] is still there but we have other items
[26:14] as well that we're going to be looking
[26:15] at and we're going to use a queue to kind
[26:18] of keep track of what products to look
[26:20] at as well as we'll keep track of URLs
[26:23] we've already seen in the seen URLs but
[26:25] the process is just about the same also
[26:27] have some print statements to kind of
[26:29] show what we're doing run a query get
[26:32] all the links same as before add it to
[26:35] the product info but now we can just
[26:37] scrape more info and what does that look
[26:39] like we can go ahead and run
[26:48] this and we see all the URLs processing
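The queue plus seen-URLs idea can be sketched in a few lines. This is not the code pasted in the video: `enqueue_new_links` is a hypothetical helper and the example links are made up.

```python
from collections import deque


def enqueue_new_links(links, queue, seen_urls):
    """Add only links we have not seen before; return how many were added."""
    added = 0
    for link in links:
        if link not in seen_urls:
            seen_urls.add(link)
            queue.append(link)
            added += 1
    return added


product_queue = deque()
seen_urls = set()

# The same Chromebook link showing up under two searches is queued only once.
enqueue_new_links(["/ip/acer-chromebook", "/ip/monitor"], product_queue, seen_urls)
enqueue_new_links(["/ip/acer-chromebook", "/ip/laptop"], product_queue, seen_urls)

while product_queue:
    url = product_queue.popleft()
    print("would scrape", url)
```

A set makes the "have we seen this link" check cheap, and the deque keeps products in the order the searches found them.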
[26:51] one issue is that we just overwrote the
[26:54] original file we made so this is why you
[26:56] might want to go to like an online MongoDB
[26:58] where you're always just adding items to
[27:00] something instead of doing the
[27:02] risky way of writing local files uh
[27:05] because this video is focused on web
[27:07] scraping not necessarily how to Output
[27:09] things I'm going to leave it as is for
[27:12] now but that's kind of one way you can
[27:14] improve this project on your own and so
[27:17] I'll fast forward and let this run and
[27:19] ultimately so what you'll see is that it
[27:22] ran a bunch of these products but as we
[27:25] got through different search queries
[27:28] eventually Walmart blocked our
[27:30] IP address and basically said no more
[27:33] scraping it kind of saw that what we
[27:34] were doing because we kept making
[27:36] requests with the same exact IP address
[27:40] and it does not like that so what do you
[27:43] do if you're in this situation where you
[27:44] need to keep scraping but you're
[27:46] physically getting blocked by a site
[27:49] like walmart.com That's not liking what
[27:51] you're doing this is exactly where
[27:54] bright
[27:55] data comes in so if you use the link in
[27:59] the description you'll get 15 free dollars
[28:01] for bright data but basically our IP got
[28:04] blocked and you still want to scrape so
[28:07] this is where some bright data tools can
[28:09] come in so I already created an account
[28:11] but if you haven't I recommend you know
[28:13] starting a free trial or signing up with
[28:14] Google again Link in the description
[28:17] will give you 15 free dollars so
[28:19] definitely sign up using that link I'm
[28:20] going to go to my user dashboard
[28:23] and what we'll do to start I'll show
[28:26] some other stuff but we're going to to
[28:28] use what's known as a proxy so basically
[28:31] a proxy Network allows us to send
[28:34] requests from different spots basically
[28:37] so if you think of the way we normally
[28:38] do it we have one single computer which
[28:41] has an IP address associated with it and
[28:43] every request has that IP when you use
[28:46] uh bright data proxy server our request
[28:49] goes through bright data and then bright
[28:51] data allows us to make requests from many
[28:53] different IPs and many different locations
[28:55] and thus makes it much much more
[28:58] difficult to block the requests we're
[29:01] making basically we can set our proxy
[29:03] server up in whatever way so that we
[29:06] make sure that we get our requests
[29:07] through so we're going to click on a few
[29:10] proxy products and we can go ahead and
[29:12] add a new one I think that I would
[29:14] recommend probably a good starting point
[29:16] is Data Center proxies and there's
[29:19] different ways you can approach this
[29:21] let's say you needed to use the same IP
[29:23] address for every one of your requests
[29:26] when you come back and run this again
[29:28] maybe you had some sort of white listing
[29:30] on a server that it's communicating with
[29:31] and you wanted to make sure it was set
[29:33] number of ips you could do dedicated you
[29:36] could also do premium but I think for
[29:37] most use cases the shared pool of data
[29:41] center IPS is probably going to be
[29:43] exactly what you need this basically is
[29:45] just a massive pool of different IP
[29:48] addresses and different people at
[29:49] different times can kind of use
[29:51] from this pool here we see number of ips
[29:54] so we're not going to play around with
[29:55] here what we can do is actually go into
[29:58] Advanced options and I'm going to switch
[30:00] to pay for usage basically gives us
[30:03] access to 10,000 plus IPS it really
[30:07] depends on how much you expect to use
[30:09] what I'd recommend is you can kind of as
[30:11] you run your jobs you can kind of see
[30:12] the costs and decide whether or not
[30:15] paying per IP or paying per usage makes
[30:18] more sense I'll give this a name called
[30:21] like
[30:22] scraping
[30:24] proxy so we're going to go ahead and add
[30:26] this
[30:29] yes there's some Advanced options here
[30:32] but most important thing we will want to
[30:35] do is basically keep track of our host
[30:38] username and password I'll blur these
[30:40] out I'll show how we can actually
[30:43] leverage this information in our
[30:45] Beautiful Soup requests um we also can
[30:48] see statistics as we actually start
[30:49] using our proxy but um we see some
[30:54] example python code here if we go into
[30:57] the documentation I believe we can see
[30:58] some more um python
[31:04] code I like this a lot this already has
[31:07] some requests Library stuff I think we
[31:09] can go ahead basically and
[31:12] copy this all into our code so I'm going
[31:17] to paste this
[31:21] in um our username and password will
[31:24] have to change what I recommend for
[31:27] username and password is you can
[31:29] create a .env file in your repo you can
[31:34] set a I'm going to call this a bright
[31:36] data
[31:38] username equals and a bright data
[31:42] password equals and we'll just do this
[31:44] for example test and test password and
[31:49] then basically on our code side what we
[31:51] can
[31:52] do okay so in our code um we could just
[31:56] run this but we can just modify our
[31:58] existing code to leverage this we need
[32:00] to fix our username and password so what
[32:03] we can do is import what's known
[32:06] as dotenv and actually what we'll want to do
[32:08] is from dotenv import load_dotenv and if we
[32:14] run load_dotenv basically this then creates
[32:18] environment variables for us to use
[32:19] based on the environment we set here in
[32:23] our .env file that's within the same folder
[32:26] as our scraper code also want to import
[32:31] OS and now basically what we can do is
[32:35] if I went ahead and set this to os.
[32:39] environ BRD username and we set this to
[32:45] os.environ BRD password what we'll see
[32:50] is if I
[32:52] print username and
[32:56] print password we'll see those test
[32:59] variables that we put in I'm going to
[33:00] temporarily comment out all this bottom
[33:04] code right run the file test test
[33:07] password cool so now I'm going to
[33:09] actually paste in my password in my
[33:12] username from The Zone we created next
[33:15] time I run we'll have access to that and
[33:17] now we see that in our requests.get we
[33:19] can use proxies there's also this cool
[33:21] thing where we can do this my ip. Json
[33:25] and make our request there and actually
[33:26] see where the
[33:28] um IP is that we're sending our request
[33:30] from so let's go ahead
[33:33] and uncomment all of
[33:37] this again I'm going to delete this
[33:40] example code real quick and I'll show
[33:43] how we can incorporate it in okay so and
[33:46] I think do we need to change the host at
[33:47] all let's just check nope the host looks
[33:50] good it's what it has in the docs docs
[33:54] are also linked in the description but
[33:56] we can go ahead now and as a little
[34:01] sample I going to just use this
[34:04] URL instead of the search
[34:14] URL and I'll do the same thing with the
[34:16] product
[34:24] URL and I just want to show what happens
[34:26] if I do print response.json
[34:30] here so we just temporarily put in this
[34:33] URL we run this uh and
[34:37] basically we see always the same IP
[34:41] address but if we then decided to
[34:47] insert proxies equal proxies so we set
[34:51] our
[34:52] proxies here based on this proxy URL
[34:56] which contains our our username and
[34:58] password and the host
[35:02] info then what we get after saving that
[35:06] and including the proxies
[35:08] in both of our requests when we start
[35:12] printing out
[35:13] things we see different IP addresses
[35:17] every time we send so this little URL is
[35:20] cool for showing where these requests
[35:23] are coming from we see it's different
[35:24] countries and everything uh very cool it
[35:28] proves that our proxies are working by
[35:30] passing in proxies like this we can go
[35:32] ahead and keep the URL as we expected it
[35:36] to be move this print statement okay our
[35:40] proxies are passed
[35:42] in going to delete this stuff kind of
[35:45] clean up the code a little bit but now
[35:47] we have access to the bright data proxy
[35:50] which when we got blocked last time now
[35:55] that we have
[35:59] um the bright data proxy set that for
[36:02] all these search queries we can run them
[36:05] and basically get around the block that
[36:07] will happen when we use only a single IP
[36:10] so that's pretty
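For anyone following along in text, here is a minimal sketch of the proxy setup described above, assuming the requests library. The credentials, the host, and the IP-echo URL are all placeholders (the transcript doesn't name the test URL used on screen), so treat them as assumptions and swap in the values from your own proxy dashboard.

```python
from urllib.parse import quote


def build_proxies(username, password, host):
    """Return a requests-style proxies dict for an authenticated HTTP proxy."""
    # Percent-encode credentials so characters like "@" or ":" don't break the URL.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}"
    # The same proxy endpoint is used for both http:// and https:// targets.
    return {"http": proxy_url, "https": proxy_url}


def fetch_json(url, proxies=None, timeout=10):
    """GET a URL (optionally through the proxy) and return the decoded JSON body."""
    import requests  # third-party; imported here so build_proxies works without it

    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.json()


def demo():
    # Placeholder credentials and host (assumptions, not real values).
    proxies = build_proxies("YOUR_USERNAME", "YOUR_PASSWORD", "proxy.example.com:8080")
    # Stand-in IP-echo endpoint; any URL that returns the caller's IP as JSON works.
    ip_check_url = "https://httpbin.org/ip"
    for _ in range(3):
        # With a rotating proxy, each request should report a different IP.
        print(fetch_json(ip_check_url, proxies=proxies))
```

Calling demo() without the proxies argument should print the same IP every time, while passing proxies=proxies through a rotating proxy should show it changing between requests.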
[36:12] cool if you haven't if you're running
[36:14] into any errors with um accessing things
[36:17] you might have to do a pip3 install
[36:19] of any new library such
[36:21] ASN or we also included prettyprint here
[36:28] so make sure to uh pip3
[36:32] install these items if you
[36:36] didn't now by using proxies we kind of
[36:39] open ourselves up to some other new kind
[36:41] of potential issues that we want to be
[36:43] aware of one is maybe the proxy is not
[36:47] available for a second and a request
[36:50] you know trips up or maybe the specific
[36:53] country that we used in our proxy
[36:55] request um can't access the website
[36:59] we're trying to scrape so we want to
[37:00] make sure we can like fail gracefully if
[37:03] those types of things happen many
[37:05] different ways to do this but basically
[37:06] I feel like it comes down to just being
[37:10] cognizant in like a try/except to handle
[37:14] failures properly so I'm going
[37:17] to um add in some additional error
[37:20] handling and paste in kind of a more um
[37:23] error proof way to do this but within
[37:26] the bright data ecosystem I think one
[37:29] thing that I would recommend is when you
[37:32] are looking at the proxies and the
[37:34] different things um kind of in the DAT
[37:38] like you could add specific geolocation
[37:41] targeting in either the shared or
[37:43] dedicated uh and you see like maybe you
[37:46] only wanted it to be United
[37:49] States and you know maybe and you can
[37:51] even add cities that's pretty cool uh
[37:54] Canada so this would limit where your
[37:56] IPs are coming from uh so if you
[37:59] find yourself getting blocked based on
[38:00] the location you know look at things
[38:02] like this with geolocation targeting or
[38:05] specific cities uh you can go very
[38:08] sophisticated pretty cool stuff so
[38:10] that's good to keep in mind all right
[38:12] but I'm going to go ahead and paste in a
[38:15] version of this that handles some
[38:18] failures in a little bit more sophisticated
[38:20] manner and also I'm going to use more
[38:22] Search terms because one of the perks of
[38:24] using the proxy networks is we could
[38:27] just keep scraping things uh quite nice
[38:31] so basically
[38:33] just retries things the basically the
[38:36] difference here with this code is it
[38:38] will retry um getting product links or
[38:41] retry extracting product info with a max
[38:44] number of retries for each URL and if it
[38:47] fails ever it will sleep a certain
[38:50] amount of time um that's dependent on
[38:54] what we defined as a backoff factor and
[38:56] the attempt number and just basically
[38:59] let us make sure that if anything kind
[39:02] of hiccups that we can process it um
[39:06] another way you might improve this
[39:07] further is you could run things in
[39:10] parallel uh it really depends on what's
[39:12] most important to you whether it's just
[39:14] getting the info or getting the info
[39:15] quickly uh there's different ways you
[39:17] could play around with this and improve
[39:18] it further but this is a good starting
[39:21] point where you might get blocked
[39:24] in you know a couple thousand requests on
[39:27] Walmart normally you can do tens and
[39:29] tens of thousands when using a bright
[39:31] data proxy Network which is super nice
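The idea of running things in parallel mentioned a moment ago might look something like this with the standard library's thread pool. The function name scrape_all and the worker count are assumptions for illustration, and scrape_one stands in for whatever per-URL function you already have.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, scrape_one, max_workers=8):
    """Run scrape_one(url) for every URL on a thread pool; return {url: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported per URL.
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One bad URL should not stop the rest of the batch.
                print(f"{url} failed: {exc}")
                results[url] = None
    return results
```

Threads fit here because scraping spends most of its time waiting on the network. Keep max_workers modest so you stay within whatever request rate your proxy plan and the target site tolerate.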
[39:35] so I think this covers a lot of the kind
[39:37] of what you'll need to know when it
[39:38] comes to Advanced scraping techniques
[39:41] but I think the next logical step to
[39:43] take your skills even further is uh
[39:45] being able to actually automate actions
[39:47] on a page using a library such as
[39:49] selenium in Python so if you look up
[39:53] something like
[39:54] selenium python bright data you can
[39:57] find some good information on how to
[40:00] kind of get started using
[40:03] selenium and really yeah selenium is
[40:06] going to come in handy when let's say a
[40:09] page loads but there's this table that
[40:12] loads a little bit slower and maybe
[40:14] after like 5 Seconds it loads in and you
[40:17] need to access the information in
[40:18] the page you can use selenium to like
[40:21] load the page wait 5 seconds and then
[40:24] access the information that is slow to
[40:27] load by using things such as implicitly_wait
[40:31] here through selenium
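As a rough sketch of that slow-loading table case, assuming Selenium 4 and a local Chrome install: the URL and the "table" selector are hypothetical. Note that implicitly_wait doesn't pause for a fixed 5 seconds; it lets element lookups keep polling for up to that long before giving up.

```python
def read_slow_table(driver, url, wait_seconds=5):
    """Open url and return the text of the first <table>, waiting for it to appear."""
    # Implicit wait: element lookups keep polling for up to wait_seconds
    # before raising NoSuchElementException.
    driver.implicitly_wait(wait_seconds)
    driver.get(url)
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    table = driver.find_element("css selector", "table")
    return table.text


def demo():
    from selenium import webdriver  # third-party: pip3 install selenium

    driver = webdriver.Chrome()
    try:
        # Hypothetical URL; replace with a page whose table loads late.
        print(read_slow_table(driver, "https://example.com/slow-table"))
    finally:
        driver.quit()  # always close the browser, even if the lookup fails
```

For waiting on one specific element rather than every lookup, Selenium's explicit waits (WebDriverWait with expected conditions) are usually the more precise option.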
[40:34] additionally one thing that's super
[40:36] useful within the bright data ecosystem
[40:38] is this scraping browser so if you look
[40:41] at documentation scraping browser
[40:44] scraping browser configuration python
[40:49] selenium uh you can find some
[40:50] information on how to use it but uh one
[40:52] thing that's really nice about the
[40:53] scraping browser is if let's say you're
[40:55] using selenium and you run into some uh
[40:58] CAPTCHAs and stuff that are particularly
[41:00] hard to get by uh the scraping
[41:03] browser can kind of help you
[41:04] automatically bypass those so you can
[41:07] really access that tricky to grab data
[41:12] programmatically I think because this
[41:13] video is already a good length we're
[41:15] going to hold off on all of that if you
[41:17] want to see selenium python bright data
[41:20] stuff in more depth let me know in the
[41:22] comments and I definitely will make
[41:23] another follow-up video happen but I
[41:26] think this was a good launching off point
[41:28] that builds off kind of the basics
[41:30] that we learned in the original tutorial
[41:32] hopefully you now have a little bit more
[41:34] Tools in your tool case especially the
[41:36] bright data proxy stuff that can help
[41:38] you kind of take your scraping skills to
[41:40] the next level with that I think we'll
[41:42] call the video I hope you learned
[41:44] something I hope you enjoyed this video
[41:46] if you did it means a lot if you throw it a
[41:48] thumbs up and click the Subscribe button
[41:50] if you haven't
[41:51] already um yeah got a bunch of tutorials
[41:53] on the way thank you to Bright data
[41:57] again for sponsoring this video link to
[42:00] get your first $15 free on bright data
[42:03] down in the description I think that is
[42:05] it until next time everyone
[42:09] peace
[42:12] out one quick question before we go I'm
[42:16] just super curious how all of you plan
[42:18] on taking the knowledge from this video
[42:20] on accessing web data via web scraping
[42:22] and applying it to your own projects so
[42:24] I'd be super curious to hear what types
[42:26] of projects you all want to work on with
[42:29] this knowledge let me know in the
[42:31] comments it's always fun to read through
[42:34] what you all are working on all
[42:36] right now I'm really out
[42:40] peace now