[0:00] hey what's up everyone welcome back to [0:01] another video in this video we are going [0:03] to do some advanced web scraping with [0:06] Python if you haven't already seen it I [0:08] recommend checking out my overview on [0:10] the Python Beautiful Soup library I'll [0:12] pop it up right here but this is going [0:14] to kind of take that video and really [0:16] build onto more advanced things that you [0:18] might see in the wild web scraping is an [0:20] incredibly useful skill I do a lot of [0:23] work on Upwork and I feel like the most [0:25] common thing that I ever see is web [0:27] scraping jobs so if you can master web [0:29] scraping you have tons of opportunities [0:32] when it comes to freelance when it comes [0:33] to work etc before we get into the video [0:36] though I want to give a shout out to [0:38] this video's sponsor and that is Bright [0:40] Data when it comes to really advanced [0:43] web scraping stuff it's hard to do it on [0:45] your own one of the things that I like [0:46] most about Bright Data our sponsor is [0:48] they offer all sorts of proxy tools [0:51] which basically allow you to send [0:53] requests from many different locations [0:55] and that allows you to [0:57] ultimately bypass a lot of the [0:58] restrictions that sites try to implement [1:01] to prevent you from scraping them uh if [1:03] you don't want to deal with learning how [1:05] to web scrape and watching this whole [1:07] tutorial Bright Data also offers a bunch [1:10] of data sets that are available for [1:12] purchase such as Walmart which we'll be [1:14] scraping today but also Amazon [1:17] Airbnb Instagram LinkedIn etc so tons of [1:23] data sets in the data set marketplace [1:24] that you could also use to just get started [1:26] super quickly but without further ado [1:27] let's get into this tutorial all right [1:30] to demonstrate these advanced scraping [1:32] techniques we're going to be doing some [1:33] scraping on walmart.com and if this 
ever [1:36] feels too quick for you I recommend [1:38] checking out the original video I did on [1:40] uh Beautiful Soup which will be linked [1:42] in the description and also I'll pop it [1:44] up right here the first thing we'll want [1:45] to do with any sort of web scraping [1:48] project is kind of identify the HTML of [1:51] what we're trying to scrape so let's say [1:53] I was looking for a new computer monitor [1:56] because of course I'm a programmer I [1:57] need 10,000 monitors at all times [2:00] so I look up monitor here and let's see [2:03] what looks good let's click on this [2:06] first Samsung one how about this Odyssey [2:08] G6 we'll check that out so if I was [2:11] trying to web scrape and like collect [2:13] information because I want to pay the [2:14] best price for these monitors uh I'm [2:17] going to be looking at the HTML on the [2:19] page so I might be trying to get this [2:21] information I might be you know [2:23] trying to get the you save information [2:26] probably be trying to get this [2:27] information maybe the reviews [2:29] information bunch of things that I want [2:31] to get but the way that we do this in [2:34] web scraping land is for most [2:36] internet browsers you can right click [2:39] and click on inspect which opens up the [2:42] kind of HTML source code so if we look [2:46] in here we can specifically click [2:48] inspect on this tag and kind of find [2:51] where the price is one thing though that [2:53] I also notice with this specific Walmart [2:55] page is there's this concept of itemprop [2:58] attributes so that kind of leads me to [3:00] think that maybe there's another kind of [3:02] technique going on here that's getting [3:04] some of this information if we look at [3:07] all of the information on this page [3:09] trying to see if there's anything that [3:11] sticks out here yep there we go um [3:14] there's this __NEXT_DATA__ field let's see [3:16] what's in that uh and we see in here [3:19] 
there's this massive massive JSON object [3:23] that has props so this __NEXT_DATA__ tag is [3:26] telling me this is a specific type of [3:28] JavaScript project [3:30] that leverages these props and passes [3:33] them in on the web page so instead of [3:35] actually scraping from the HTML tags [3:37] themselves we can actually usually find [3:39] the information we're looking for within [3:41] this props [3:43] field um so I think kind of our [3:46] technique for scraping information from [3:48] Walmart will be do a search like we just [3:51] did so we had you know the search for [3:53] monitor this is a pretty easy format I [3:55] could very easily replace monitor here [3:58] with other words and search for them [4:00] and then kind of programmatically go [4:02] through this page maybe click on items [4:05] and then once we're clicking on an [4:08] item we'll see for this monitor too that [4:11] it also has these page props and then in [4:15] this page props [4:17] information basically we'll want to look [4:19] for and find where the actual price is [4:24] within this nested JSON so that's the [4:27] general technique let's uh open up a [4:29] code editor and kind of just write [4:31] some template code out to help us do [4:34] this so I'm in this advanced scraping [4:36] repo and I'll link this in the [4:39] description uh I'll link my GitHub repo [4:42] with all the code that we'll see in this [4:43] video but I'll just call this like [4:46] walmart [4:47] scraper [4:49] dot py and probably the first thing we'll [4:51] want to do is install Beautiful Soup and [4:54] I recommend if um you know how to set [4:57] up a virtual environment to do this I'll [5:00] link a video on how to do that right [5:01] here but we can go pip3 install [5:03] beautifulsoup4 we see we already have [5:06] it and we're also going to want to use [5:09] the requests library so I'll just kind [5:11] of template out some code and then kind [5:12] of explain it more in depth in a 
[5:17] sec often like when I forget how to [5:20] import things I feel like Beautiful Soup [5:22] is always a weird thing I like using [5:23] Copilot [5:30] and then I also want to import the [5:31] requests library and walmart URL equals [5:36] how about [5:38] https:// [5:39] walmart.com/ maybe we grab a specific [5:43] one of those pages that we had [5:45] open so I'm just going to steal this [5:48] page real quick that URL is this so just [5:51] take any product URL from Walmart we got [5:54] the curved ultrawide monitor and then [5:58] we'll want to get that so [6:00] response equals requests.get pass in a [6:04] URL and we'll print out that [6:09] HTML okay so let's go ahead and do that [6:12] cool we get the [6:15] HTML and the next thing we'll want to do [6:19] is get specifically that __NEXT_DATA__ tag [6:23] so we can start looking at what's in [6:24] that so if we look at our web page again [6:27] I inspect [6:29] this __NEXT_DATA__ basically we'll want to [6:32] find the script with the ID __NEXT_DATA__ so [6:34] how do we do that with Beautiful Soup so [6:36] we'll say soup [6:39] equals BeautifulSoup response. 
text [6:43] we'll use the HTML [6:46] parser and we'll specifically want to [6:48] find that script tag with the ID [6:53] so soup.find script tag with id [7:01] __NEXT_DATA__ I think it was like this we [7:05] can change the ID if we have to that [7:09] looks good basically we want to just [7:11] print out the HTML again and it says [7:13] None that is strange why does it say [7:15] None so let's make sure that the [7:18] __NEXT_DATA__ ID is correct NEXT with two [7:21] underscores on each side so that looks correct and so [7:24] this is the first I think like advanced [7:27] feature that you'll have to know about [7:29] if you ever have a situation where you [7:31] retrieve a page [7:33] and like the tag you're looking for is [7:35] not there when you retrieve it with [7:36] Python it's either one of two things [7:39] one is that this is a dynamic page [7:42] where some of the stuff loads a little [7:43] bit later or two they're purposely [7:46] hiding stuff from requests that [7:49] don't look human by default there's [7:51] certain headers when you use the [7:53] requests library in Python and so to get [7:56] around that and make ourselves look more [7:57] human we want to make sure that we [8:00] mock whatever headers we use when we're [8:02] actually making this request in [8:05] Python land what I would probably do [8:07] here to figure out what are good headers [8:09] to use is look at our Network tab and [8:12] really any one of these probably will be [8:13] fine and basically we want to just copy [8:17] the headers that we use in our actual [8:19] web browser so we have the request [8:21] headers down here if we copy some of these [8:24] parameters we'll probably find we have [8:27] more success so we probably want to copy [8:29] the accept-language and the user-agent [8:33] headers good and sometimes you might even [8:36] want to like have this cycle between [8:39] several different options so you always [8:41] are changing up what you look like we'll [8:43] 
add [8:47] commas that looks good cool and so now [8:52] what we can do with our requests library [8:54] and I guess I have to change the order [8:56] of this slightly [9:01] we can pass in [9:04] headers equals the new headers that [9:06] we've added and now the question is can [9:08] we find this script tag so we're going [9:11] to go ahead and run all of this [9:16] code and look at that we do get um this [9:20] __NEXT_DATA__ field that we are looking for [9:24] awesome all right now that we have this [9:27] data here we want to be able to grab [9:30] specifically the price from this so what [9:32] I would recommend is this is [9:34] crazy uh complicated to look at I feel [9:38] like probably it's purposely like just [9:39] super obfuscated to make it hard to [9:43] grab everything what I might do is go [9:46] back to the web page inspect grab the [9:50] __NEXT_DATA__ I might copy the element and [9:54] then what I might recommend doing [9:56] is copy this JSON go to like an online [10:00] token counter probably a bunch of these [10:02] will work I'm just going to use this [10:04] Streamlit app I see and paste in your [10:08] JSON and then see how many tokens so we [10:10] see we have like 59,000 tokens not super [10:13] friendly to honestly like GPT-3.5 I [10:16] think that's above any limits it can [10:19] process you could use GPT-4 if you have [10:21] access to the full [10:23] 128k context basically the goal is can [10:26] we look at what's in this without having [10:29] to dig through all of this JSON [10:31] couple different ways to do it um what [10:33] we might do real quick and we'll want to [10:35] make this JSON so we can do that by [10:39] importing the json library as [10:41] well and then we can go ahead and we [10:43] should be able to do something like um [10:46] data [10:47] equals HTML or script [10:51] tag.string that'll get all the text [10:56] within it we can do json. 
loads [11:00] and now we should be able to print out [11:02] data.keys cool look at that [11:06] so basically we can kind of keep [11:08] repeating this process we could get the [11:11] props and then see what the keys of that [11:14] are it's going to be kind of a slow [11:17] process but I'm just kind of showing how [11:19] you could do it [11:21] manually so what are [11:23] the keys of the props page props [11:27] so we know we have to go into page props [11:30] and I'm thinking they're in props just [11:31] because that's what you know the [11:33] JavaScript framework will [11:36] leverage uh initial data sounds pretty [11:47] good I would say probably [11:49] data not [11:54] headers um probably product [11:59] but maybe also review information would [12:01] be helpful [12:03] too oh wow now there's a lot of things [12:06] now I'd look at if there's a price in [12:08] here somewhere [12:09] price do we see any [12:13] price price info I like [12:20] that now what's in the price info [12:23] one current price I like that this might [12:27] be the end of our [12:31] dictionary we'll [12:32] see and then I think probably if we get [12:35] price from [12:39] this see what we get look at that that [12:42] gives us $199 and if we look at the [12:45] product that we were looking at it is [12:49] 100% $199 so that was getting [12:52] information from this page not super [12:54] easy but uh the reason I was [12:57] saying go to this token counter is [12:59] if you have a large language model that [13:02] has the ability to parse this number of [13:04] tokens what I might recommend is copying [13:08] your code like for example Claude if you [13:11] have the Pro [13:12] version um can parse a lot so I might [13:18] say what are the most important [13:22] fields from the following [13:26] JSON to get the price info and review [13:32] info for the current [13:36] product paste this in which is crazy [13:40] very long um I'd say limit to 
just [13:45] 10 uh [13:50] items share full path to get to that [13:55] field using Python syntax or something [13:58] like that [14:02] this looks like pretty [14:04] good um I noticed that it's just going [14:07] like four levels deep so I feel like [14:09] it's skipped over the stuff to get to [14:12] product um I might say hey wait to get [14:16] to product you need to use the following [14:23] path here's the code I found that works [14:28] for [14:30] price please adjust solutions [14:38] accordingly copy this paste it [14:43] in perfect this looks useful so take [14:46] this as an example but you can use like [14:48] large language models to help you parse [14:50] you know massive amounts of JSON like [14:52] this but to simplify the process if you [14:56] go to the GitHub link I'll share a [14:59] code snippet that will help us get the [15:01] information we're looking for so I'll go [15:03] ahead and paste that in here we go so [15:05] here's a nice little product info [15:08] dictionary and I can go ahead and print [15:12] product [15:14] info for our current product look at [15:17] that review count 239 let's check to see [15:20] if that's also valid look at that 239 [15:24] reviews so this looks great we're [15:25] getting some information I think the big [15:27] thing though is if we wanted to collect [15:29] a lot of information on products such as [15:33] you know these monitors and collect it [15:34] over time and you know every day maybe [15:36] run something that gets all this [15:38] information and it stores it in a [15:39] database or something we need to modify [15:42] this a bit so what might we do well we [15:45] can separate this into maybe two [15:47] functions one function will be called [15:50] extract product [15:54] info that will take in some sort of [15:56] product URL and we can basically just [16:00] paste in what we already [16:03] have so this is product URL now and [16:08] instead of printing product info we'll [16:10] 
return product info so that's one of our [16:13] functions and then the other function we [16:15] might want to have is like get uh [16:19] product [16:20] pages or get product links or something [16:23] like [16:23] that and what that might look like is [16:26] maybe that takes in a query if you [16:29] remember how we got to this monitor page [16:32] the first thing we did was we did a [16:34] search that looks like this so I'm going [16:37] to go ahead and copy this call this like [16:39] base URL or like search URL how about [16:43] search URL equals this and then we can [16:46] use an f-string to swap out the search [16:50] term and make that whatever our query is [16:53] I think the only other thing I might add [16:54] to this is let's say we wanted to scrape [16:57] a bunch of information [16:59] on [17:00] monitors well this is all the monitors [17:04] but you know what if we wanted more than [17:06] just the first page of monitors we might [17:09] go to the second page and we see that [17:12] you can also leverage page information [17:14] here so I'm going to go ahead and copy [17:16] that and put this also into my URL so [17:20] how about we also pass in a page [17:24] number and by default it can start at [17:27] one you want to do a similar type thing [17:30] get all of the pages or get this [17:33] search URL then we would probably want [17:36] to get all of the links from this so if [17:39] I look at this page right click [17:42] inspect we get this href here looks like [17:47] this um these are sponsored ones [17:51] scroll down these first ones are all [17:53] sponsored so I might just see what it [17:54] looks like if it looks any different on [17:56] non-sponsored ones okay here's another [17:59] link and this look at that it's [18:03] just a it doesn't have the full URL just [18:07] has kind of this part of it so we [18:11] want to be able to handle both cases so [18:14] grabbing all of those links might look [18:16] something 
[18:18] like basically we want to find all the [18:20] links in this so [18:30] find all a [18:35] tags I want to make sure that they have [18:37] an [18:38] href as part of it because that's how [18:40] we're going to access this [18:42] information and then we can do something [18:44] like [18:45] for [18:49] link in links and then we know that the [18:53] product URLs I think one little trick [18:55] always have this /ip/ in them so what I [19:00] might do to grab all the links and not [19:02] get anything that's like a link to some [19:04] other random [19:06] page um that's not super useful is I [19:10] might check to see if in the text so if [19:16] this specific term is in link href [19:22] href is the actual link stuff so that's [19:24] why we're looking at that specifically [19:26] and we basically want to add that link [19:27] to some sort of queue [19:29] so I might add a [19:33] list and we saw two different types of [19:35] URLs if https is in the URL then [19:40] basically our URL is already a full [19:48] URL full URL [19:58] however if it's just the /ip/ stuff then [20:01] our full URL would be equal [20:06] to and then basically we will want [20:16] to append this link to our product [20:23] links this is just basically things [20:25] we'll search [20:27] for and then we'll want to return our [20:30] product [20:36] links [20:37] cool so now let's test out to see if [20:40] this works so let's create a main [20:52] function how about we get product links for a [20:55] search term like computers [21:02] run [21:03] that and we do see we get a bunch of [21:07] links cool this looks pretty good so now [21:10] if we wanted to create some sort of kind [21:12] of like you could either do a [21:14] database use something like MongoDB or [21:17] SQLite or you could even save some of [21:19] this stuff locally it depends on how [21:21] complex you want to get but basically [21:23] what we could do is do a loop over get [21:27] product links and 
uh extract product [21:30] info and basically combine the two [21:32] things and uh save the [21:37] results so what that might look like I'm [21:40] going to save this as a JSON Lines file [21:43] so I'll call this output file [21:51] equals and then we'll write to our [21:54] output file [22:12] we can loop over a search query so how [22:20] about links equals get [22:23] product links [22:26] computers how about we make this a while [22:31] True while True links equals get product [22:35] links page [22:39] number and we can start popping off [22:44] links not product [22:47] links if not [22:50] links how about or if we get you know [22:54] once we get past the first 100 pages [22:56] we're probably fine also breaking or [23:00] page [23:01] number greater than [23:04] 99 we break out of the loop iterate over [23:08] the [23:13] links and we will go ahead and product [23:18] info equals extract product info on this [23:23] link and then if we do get a result or [23:26] something like that we might [23:29] write to our [23:31] file and we should also add a new line [23:35] to this because this is a JSON Lines [23:37] file so basically each dictionary will [23:40] be its own line and it's probably good [23:43] practice for us to surround this in a [23:45] try except [23:59] then we'll want to increase the page [24:03] number by one and we [24:13] might show that we're going to a next [24:15] page cool this is a basic way but this [24:18] should scrape all of our computer [24:19] information I believe and we'll see if [24:21] this [24:24] works there we go oops messed up the [24:27] syntax there [24:30] and while this runs we can kind of check [24:32] to see if it's working by looking at the [24:34] file and see if lines of JSON are being [24:38] added and we do see items being [24:44] added again all of the code is linked in [24:47] the description and I'll kind of break [24:49] it down by where we're at in the [24:51] video so that finished scraping and if 
[24:54] we look at our product info uh we can [24:56] see we have a bunch of products scraped I [24:59] think there's a couple ways that I would [25:00] improve this though I think one of the [25:02] big ones is that this wasn't quite smart [25:05] on what products we scraped so if you [25:08] look you'll see like some items pop up [25:11] all the time like this Acer Chromebook [25:13] so there's two different ways you could [25:15] kind of solve this problem you could [25:16] check the item ID and only add items to [25:19] your kind of queue of links to search if [25:22] you hadn't added it already and [25:24] you could do that by the item ID or [25:26] kind of the simpler solution might [25:28] be just look at the link that you're [25:33] scraping um here and if you've already [25:35] seen that link before don't scrape the [25:37] info another thing we might want to do [25:39] too is that this ran successfully but we [25:41] only had one search term computers what [25:43] if you wanted to look up a bunch of [25:44] things what would that look like and [25:46] would this same system be able to [25:49] operate there I'm going to paste in some [25:51] code with these changes uh again both [25:55] the code you see currently and this new [25:57] code are all available in the description on [26:01] GitHub but we'll go ahead and paste this [26:05] in couple changes now some of the [26:08] variable names have changed but we have [26:10] now multiple search queries so computers [26:12] is still there but we have other items [26:14] as well that we're going to be looking [26:15] at and we're going to use a queue to kind [26:18] of keep track of what products to look [26:20] at as well as we'll keep track of URLs [26:23] we've already seen in the seen URLs but [26:25] the process is just about the same also [26:27] have some print statements to kind of [26:29] show what we're doing run a query get [26:32] all the links same as before add it to [26:35] the 
product info but now we can just [26:37] scrape more info and what does that look [26:39] like we can go ahead and run [26:48] this and we see all the URLs processing [26:51] one issue is that we just overwrote the [26:54] original file we made so this is why you [26:56] might want to go to like an online MongoDB [26:58] where you're always just adding items to [27:00] something instead of doing the [27:02] risky way of writing local files uh [27:05] because this video is focused on web [27:07] scraping not necessarily how to output [27:09] things I'm going to leave it as is for [27:12] now but that's kind of one way you can [27:14] improve this project on your own and so [27:17] I'll fast forward and let this run and [27:19] ultimately so what you'll see is that it [27:22] ran a bunch of these products but as we [27:25] got through different search queries [27:28] eventually Walmart blocked our [27:30] IP address and basically said no more [27:33] scraping it kind of saw what we [27:34] were doing because we kept making [27:36] requests with the same exact IP address [27:40] and it does not like that so what do you [27:43] do if you're in this situation where you [27:44] need to keep scraping but you're [27:46] physically getting blocked by a site [27:49] like walmart.com that's not liking what [27:51] you're doing this is exactly where [27:54] Bright [27:55] Data comes in so if you use the link in [27:59] the description you'll get $15 free [28:01] for Bright Data but basically our IP got [28:04] blocked and you still want to scrape so [28:07] this is where some Bright Data tools can [28:09] come in so I already created an account [28:11] but if you haven't I recommend you know [28:13] starting a free trial or signing up with [28:14] Google again link in the description [28:17] will give you $15 free so [28:19] definitely sign up using that link I'm [28:20] going to go to my user dashboard [28:23] and what we'll do to start I'll 
show [28:26] some other stuff but we're going to [28:28] use what's known as a proxy so basically [28:31] a proxy network allows us to send [28:34] requests from different spots basically [28:37] so if you think of the way we normally [28:38] do it we have one single computer which [28:41] has an IP address associated with it and [28:43] every request has that IP when you use [28:46] uh a Bright Data proxy server our request [28:49] goes through Bright Data and then Bright [28:51] Data allows us to make requests from many [28:53] different IPs many different locations [28:55] and thus makes it much much more [28:58] difficult to block the requests we're [29:01] making basically we can set our proxy [29:03] server up in whatever way so that we [29:06] make sure that we get our requests [29:07] through so we're going to click on view [29:10] proxy products and we can go ahead and [29:12] add a new one I think that I would [29:14] recommend probably a good starting point [29:16] is data center proxies and there's [29:19] different ways you can approach this [29:21] let's say you needed to use the same IP [29:23] address for every one of your requests [29:26] when you come back and run this again [29:28] maybe you had some sort of whitelisting [29:30] on a server that it's communicating with [29:31] and you wanted to make sure it was a set [29:33] number of IPs you could do dedicated you [29:36] could also do premium but I think for [29:37] most use cases the shared pool of data [29:41] center IPs is probably going to be [29:43] exactly what you need this basically is [29:45] just a massive pool of different IP [29:48] addresses that different people at [29:49] different times can kind of use [29:51] from here we see number of IPs [29:54] which we're not going to play around with [29:55] here what we can do is actually go into [29:58] advanced options and I'm going to switch [30:00] to pay per usage which basically gives us [30:03] access to 10,000 plus IPs it 
really [30:07] depends on how much you expect to use [30:09] what I'd recommend is [30:11] as you run your jobs you can kind of see [30:12] the costs and decide whether or not [30:15] paying per IP or paying per usage makes [30:18] more sense I'll give this a name called [30:21] like [30:22] scraping [30:24] proxy so we're going to go ahead and add [30:26] this [30:29] yes there's some advanced options here [30:32] but most important thing we will want to [30:35] do is basically keep track of our host [30:38] username and password I'll blur these [30:40] out I'll show how we can actually [30:43] leverage this information in our [30:45] Beautiful Soup requests um we also can [30:48] see statistics as we actually start [30:49] using our proxy but um we see some [30:54] example Python code here if we go into [30:57] the documentation I believe we can see [30:58] some more um Python [31:04] code I like this a lot this already has [31:07] some requests library stuff I think we [31:09] can go ahead basically and [31:12] copy this all into our code so I'm going [31:17] to paste this [31:21] in um our username and password will [31:24] have to change what I recommend for [31:27] username and password is you can [31:29] create a .env file in your repo you can [31:34] set a I'm going to call this a Bright [31:36] Data [31:38] username equals and a Bright Data [31:42] password equals and we'll just do this [31:44] for example test and test password and [31:49] then basically on our code side what we [31:51] can [31:52] do okay so in our code um we could just [31:56] run this but we can just modify our [31:58] existing code to leverage this we need [32:00] to fix our username and password so what [32:03] we can do is import what's known [32:06] as dotenv and actually what we'll want to do [32:08] is from dotenv import load_dotenv and if we [32:14] run load_dotenv
basically this then creates [32:18] environment variables for us to use [32:19] based on the environment we set here in [32:23] our .env file that's within the same folder [32:26] as our scraper code also want to import [32:31] os and now basically what we can do is [32:35] if I went ahead and set this to [32:39] os.environ BRD username and we set this to [32:45] os.environ BRD password what we'll see [32:50] is if I [32:52] print username and [32:56] print password we'll see those test [32:59] variables that we put in I'm going to [33:00] temporarily comment out all this bottom [33:04] code right run the file test test [33:07] password cool so now I'm going to [33:09] actually paste in my password and my [33:12] username from the zone we created next [33:15] time I run we'll have access to that and [33:17] now we see that in our requests.get we [33:19] can use proxies there's also this cool [33:21] thing where we can do this myip.json [33:25] and make our request there and actually [33:26] see where the [33:28] um IP is that we're sending our request [33:30] from so let's go ahead [33:33] and uncomment all of [33:37] this again I'm going to delete this [33:40] example code real quick and I'll show [33:43] how we can incorporate it in okay so and [33:46] I think do we need to change the host at [33:47] all let's just check nope the host looks [33:50] good it's what it has in the docs docs [33:54] are also linked in the description but [33:56] we can go ahead now and as a little [34:01] sample I'm going to just use this [34:04] URL instead of the search [34:06] [Music] [34:14] URL and I'll do the same thing with the [34:16] product [34:24] URL and I just want to show what happens [34:26] if I do print response.json [34:30] here so we just temporarily put in this [34:33] URL we run this uh and [34:37] basically we see always the same IP [34:41] address but if we then decided to [34:47] insert proxies equals proxies so we set [34:51] our [34:52] proxies here based on 
this proxy URL [34:56] which contains our username and [34:58] password and the host [35:02] info then what we get after saving that [35:06] and including the proxies [35:08] in both of our requests when we start [35:12] printing out [35:13] things we see different IP addresses [35:17] every time we send so this little URL is [35:20] cool for showing where these requests [35:23] are coming from we see it's different [35:24] countries and everything uh very cool it [35:28] proves that our proxies are working by [35:30] passing in proxies like this we can go [35:32] ahead and keep the URL as we expected it [35:36] to be move this print statement okay our [35:40] proxies are passed [35:42] in going to delete this stuff kind of [35:45] clean up the code a little bit but now [35:47] we have access to the Bright Data proxy [35:50] so where we got blocked last time now [35:55] that we have [35:59] um the Bright Data proxy set for [36:02] all these search queries we can run them [36:05] and basically get around the block that [36:07] will happen when we use only a single IP [36:10] so that's pretty [36:12] cool if you're running [36:14] into any errors with um accessing things [36:17] you might have to do a pip3 install [36:19] of any new library such [36:21] as python-dotenv or we also included pretty print here [36:28] so make sure to uh pip3 [36:32] install these items if you [36:36] didn't now by using proxies we kind of [36:39] open ourselves up to some other new kind [36:41] of potential issues that we want to be [36:43] aware of one is maybe the proxy is not [36:47] available for a second and a request [36:50] you know trips up or maybe the specific [36:53] country that we used in our proxy [36:55] request um can't access the website [36:59] we're trying to scrape so we want to [37:00] make sure we can like fail gracefully if [37:03] those types of things happen many [37:05] different ways to do this but basically [37:06] I feel like it comes 
down to just being [37:10] cognizant in like a try except to handle [37:14] failures properly so I'm going [37:17] to um add in some additional error [37:20] handling and paste in kind of a more um [37:23] error proof way to do this but within [37:26] the Bright Data ecosystem I think one [37:29] thing that I would recommend is when you [37:32] are looking at the proxies and the [37:34] different things um kind of in the DAT [37:38] like you could add specific geolocation [37:41] targeting in either the shared or [37:43] dedicated uh and you see like maybe you [37:46] only wanted it to be United [37:49] States and you know maybe and you can [37:51] even add cities that's pretty cool uh [37:54] Canada so this would limit where your [37:56] IPs are coming from uh so if you [37:59] find yourself getting blocked based on [38:00] the location you know look at things [38:02] like this with geolocation targeting or [38:05] specific cities uh you can go very [38:08] sophisticated pretty cool stuff so [38:10] that's good to keep in mind all right [38:12] but I'm going to go ahead and paste in a [38:15] version of this that handles some [38:18] failures in a little bit more sophisticated [38:20] manner and also I'm going to use more [38:22] search terms because one of the perks of [38:24] using the proxy networks is we could [38:27] just keep scraping things uh quite nice [38:31] so basically [38:33] it just retries things basically the [38:36] difference here with this code is it [38:38] will retry um getting product links or [38:41] retry extracting product info with a max [38:44] number of retries for each URL and if it [38:47] ever fails it will sleep a certain [38:50] amount of time um that's dependent on [38:54] what we defined as a backoff factor and [38:56] the attempt number and just basically [38:59] lets us make sure that if anything kind [39:02] of hiccups that we can process it um [39:06] another way you might improve this [39:07] further is you could run
things in [39:10] parallel uh it really depends on what's [39:12] most important to you whether it's just [39:14] getting the info or getting the info [39:15] quickly uh there's different ways you [39:17] could play around with this and improve [39:18] it further but this is a good starting [39:21] point where you might get blocked [39:24] in you know a couple thousand requests on [39:27] Walmart normally you can do tens and [39:29] tens of thousands when using a Bright [39:31] Data proxy network which is super nice [39:35] so I think this covers a lot of the kind [39:37] of what you'll need to know when it [39:38] comes to advanced scraping techniques [39:41] but I think the next logical step to [39:43] take your skills even further is uh [39:45] being able to actually automate actions [39:47] on a page using a library such as [39:49] Selenium in Python so if you look up [39:53] something like [39:54] Selenium Python Bright Data you can [39:57] find some good information on how to [40:00] kind of get started using [40:03] Selenium and really yeah Selenium is [40:06] going to come in handy when let's say a [40:09] page loads but there's this table that [40:12] loads a little bit slower and maybe [40:14] after like 5 seconds it loads in and you [40:17] need to access the information in [40:18] the page you can use Selenium to like [40:21] load the page wait 5 seconds and then [40:24] access the information that is slow to [40:27] load by using things such as implicitly [40:31] wait here through Selenium and then [40:34] additionally one thing that's super [40:36] useful within the Bright Data ecosystem [40:38] is this scraping browser so if you look [40:41] at documentation scraping browser [40:44] scraping browser configuration Python [40:49] Selenium uh you can find some [40:50] information on how to use it but uh one [40:52] thing that's really nice about the [40:53] scraping browser is if let's say you're [40:55] using Selenium and you run into some uh
[40:58] CAPTCHAs and stuff that are particularly [41:00] hard to get by uh the scraping [41:03] browser can kind of help you [41:04] automatically bypass those so you can [41:07] really access that tricky to grab data [41:12] programmatically I think because this [41:13] video is already a good length we're [41:15] going to hold off on all of that if you [41:17] want to see Selenium Python Bright Data [41:20] stuff in more depth let me know in the [41:22] comments and I definitely will make [41:23] another follow-up video happen but I [41:26] think this was a good launching off point [41:28] that builds off kind of the basics [41:30] that we learned in the original tutorial [41:32] hopefully you now have a little bit more [41:34] tools in your tool case especially the [41:36] Bright Data proxy stuff that can help [41:38] you kind of take your scraping skills to [41:40] the next level with that I think we'll [41:42] call the video I hope you learned [41:44] something I hope you enjoyed this video [41:46] if you did it means a lot if you throw it a [41:48] thumbs up and click the subscribe button [41:50] if you haven't [41:51] already um yeah got a bunch of tutorials [41:53] on the way thank you to Bright Data [41:57] again for sponsoring this video link to [42:00] get your first $15 free on Bright Data [42:03] down in the description I think that is [42:05] it until next time everyone [42:09] peace [42:12] out one quick question before we go I'm [42:16] just super curious how all of you plan [42:18] on taking the knowledge from this video [42:20] on accessing web data via web scraping [42:22] and applying it to your own projects so [42:24] we'd be super curious to hear what types [42:26] of projects you all want to work on with [42:29] this knowledge let me know in the [42:31] comments it's always fun to read through [42:34] what you all are working on all [42:36] right now I'm really out [42:40] peace now
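
For reference, here is a minimal sketch of the proxy setup and the retry-with-backoff idea talked through in the video. It is not the code pasted on screen. The names build_proxies, backoff_delay, with_retries and fetch are hypothetical, the underscore spelling of the BRD_USERNAME and BRD_PASSWORD environment variables is assumed, the host is left as a parameter because the video only says to copy it from the docs, and the exponential sleep formula is one common choice rather than the one confirmed in the video, which only says the sleep depends on a backoff factor and the attempt number.

```python
import os
import time
from urllib.parse import quote


def build_proxies(host):
    """Build a requests-style proxies dict from environment variables.

    Assumes BRD_USERNAME and BRD_PASSWORD are already set, for example
    by loading a .env file before calling this function.
    """
    # Percent-encode the credentials so characters like "@" or ":" do not break the URL.
    username = quote(os.environ["BRD_USERNAME"], safe="")
    password = quote(os.environ["BRD_PASSWORD"], safe="")
    proxy_url = f"http://{username}:{password}@{host}"
    return {"http": proxy_url, "https": proxy_url}


def backoff_delay(attempt, backoff_factor):
    """Seconds to sleep after a failed attempt (attempt counts from 0).

    Exponential growth is an assumption made for this sketch.
    """
    return backoff_factor * (2 ** attempt)


def with_retries(func, max_retries=3, backoff_factor=1.0, sleep=time.sleep):
    """Call func() up to max_retries times; return None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
            print(f"Attempt {attempt + 1} of {max_retries} failed: {exc}")
            if attempt < max_retries - 1:
                sleep(backoff_delay(attempt, backoff_factor))
    return None


def fetch(url, proxies):
    """Fetch a URL through the proxy and raise on HTTP error statuses."""
    import requests  # third-party; imported here so the helpers above run without it

    response = requests.get(url, proxies=proxies, timeout=30)
    response.raise_for_status()
    return response
```

A call would then look something like with_retries(lambda: fetch(url, build_proxies(host)), max_retries=3, backoff_factor=2.0), with the host value taken from your own proxy zone settings. Passing the sleep function in as a parameter is just a convenience so the retry logic can be tried out without actually waiting.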