Why Web Scraping is a Goldmine for Freelancers
Highlights the lucrative freelance opportunities in web scraping, motivating viewers to learn.
This tutorial demonstrates advanced web scraping techniques using Python and Beautiful Soup, focusing on scraping product data from Walmart.com. The video covers extracting data from JavaScript-rendered pages, handling anti-scraping measures, and using proxy networks to avoid IP blocks.
The video covers advanced web scraping with Python. Sponsor Bright Data offers proxy tools and datasets.
Right-click and inspect the HTML to find the data. Walmart pages include a script tag with id '__NEXT_DATA__' containing a JSON object with page props.
Use requests library with custom headers (User-Agent, Accept-Language) to mimic a real browser and avoid being blocked.
Extract the script tag with id '__NEXT_DATA__', parse its JSON, and navigate the nested keys to find product price and review info.
Large language models like Claude can help identify important fields in large JSON objects.
Create two functions: extract_product_info (given a product URL) and get_product_links (given a search query and page number).
Loop over search result pages, collect product links, and extract info for each product, saving to a JSON Lines file.
Improve the scraper by tracking seen URLs to avoid duplicates and by supporting multiple search queries.
Walmart blocks an IP address after many requests. The Bright Data proxy network provides multiple IPs to avoid blocks.
Use environment variables for proxy credentials and pass the proxies parameter to requests.get to rotate IPs.
Add retry logic with backoff to handle proxy failures gracefully.
For dynamic content, use Selenium with the Bright Data scraping browser to bypass CAPTCHAs.
This tutorial provides a solid foundation for advanced web scraping, including handling JavaScript-rendered data, using proxies to avoid IP blocks, and scaling up with multiple search queries. The next step is to explore Selenium for dynamic content and CAPTCHA bypass.
"Title accurately describes advanced scraping with Beautiful Soup, delivering on its promise."
What is the ID of the script tag that contains the JSON data on Walmart product pages?
__NEXT_DATA__
07:30
What two headers are recommended to mimic a real browser when scraping?
User-Agent and Accept-Language
06:00
How do you parse a JSON string in Python?
Use json.loads()
10:00
What is the purpose of using a proxy network like Bright Data?
To rotate IP addresses and avoid being blocked by the target website.
27:00
What is the format of the proxies dictionary for Bright Data?
{'http': 'http://username:password@host:port', 'https': 'http://username:password@host:port'}
33:00
What library is used to load environment variables from a .env file?
python-dotenv
31:00
What is the key path to get the current price from the Walmart JSON data?
data['props']['pageProps']['initialData']['data']['product']['priceInfo']['currentPrice']['price']
11:00
What is the purpose of the 'backoff factor' in retry logic?
To increase the sleep time between retries exponentially.
37:00
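The backoff factor from the last quiz answer can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the function name retry_with_backoff, its parameters, and the doubling schedule are assumptions chosen for the example.

```python
import time


def retry_with_backoff(func, retries=3, backoff_factor=1.0, sleep=time.sleep):
    """Call func(); after each failure wait backoff_factor * 2**attempt, then retry.

    All names here are hypothetical; the sleep function is injectable so the
    waiting can be replaced in tests.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as error:  # a real scraper would catch narrower errors
            last_error = error
            if attempt < retries - 1:
                # The wait doubles each time: factor, 2*factor, 4*factor, ...
                sleep(backoff_factor * (2 ** attempt))
    raise last_error
```

A request through the proxy could then be wrapped as retry_with_backoff(lambda: fetch(url)), where fetch is whatever function performs the request.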
Mimicking Browser Headers
Essential technique to avoid being blocked by websites that check User-Agent.
06:00
Using LLMs to Parse JSON
Innovative use of AI to navigate complex nested JSON structures.
09:00
IP Blocking and Proxy Solution
Demonstrates real-world challenge of scraping at scale and how proxies solve it.
27:00
Retry Logic with Backoff
Important for building robust scrapers that handle transient failures.
37:00
[00:00] hey what's up everyone welcome back to
[00:01] another video in this video we are going
[00:03] to do some Advanced web scraping with
[00:06] python if you haven't already seen it I
[00:08] recommend checking out my overview on
[00:10] the python beautiful soup Library I'll
[00:12] pop it up right here but this is going
[00:14] to kind of take that video and really
[00:16] build onto more advanced things that you
[00:18] might see in the wild web scraping is an
[00:20] incredibly useful skill I do a lot of
[00:23] work on Upwork and I feel like the most
[00:25] common thing that I ever see is web
[00:27] scraping jobs so if you can Master web
[00:29] scraping you have tons of opportunities
[00:32] when it comes to freelance when it comes
[00:33] to work etc before we get into the video
[00:36] though I want to give a shout out to
[00:38] this video's sponsor and that is Bright
[00:40] Data when it comes to really advanced
[00:43] web scraping stuff it's hard to do it on
[00:45] your own one of the things that I like
[00:46] most about Bright Data our sponsor is
[00:48] they offer all sorts of proxy tools
[00:51] which basically allow you to send
[00:53] requests from many different locations
[00:55] and that allows you to
[00:57] ultimately bypass a lot of the
[00:58] restrictions that sites try to implement
[01:01] to prevent you from scraping them if
[01:03] you don't want to deal with learning how
[01:05] to web scrape and watching this whole
[01:07] tutorial Bright Data also offers a bunch
[01:10] of datasets that are available for
[01:12] purchase such as Walmart which we'll be
[01:14] scraping today but also Amazon
[01:17] Airbnb Instagram LinkedIn Etc so tons of
[01:23] data sets in the data set Marketplace
[01:24] that you could also just get started
[01:26] super quickly but without further Ado
[01:27] let's get into this tutorial all right
[01:30] to demonstrate these Advanced scraping
[01:32] techniques we're going to be doing some
[01:33] scraping on walmart.com and if this ever
[01:36] feels too quick for you I recommend
[01:38] checking out the original video I did on
[01:40] Beautiful Soup which will be linked
[01:42] in the description and also I'll pop it
[01:44] up right here the first thing we'll want
[01:45] to do with any sort of web scraping
[01:48] project is kind of identify the HTML of
[01:51] what we're trying to scrape so let's say
[01:53] I was looking for a new computer monitor
[01:56] because of course I'm a programmer I
[01:57] need 10,000 monitors at all times
[02:00] so I look up monitor here and let's see
[02:03] what looks good let's click on this
[02:06] first Samsung one how about this Odyssey
[02:08] G6 we'll check that out so if I was
[02:11] trying to web scrape and like collect
[02:13] information because I want to pay the
[02:14] best price for these monitors uh I'm
[02:17] going to be looking at the HTML on the
[02:19] page so I might be trying to get this
[02:21] information I might be you know
[02:23] trying to get the "You save" information
[02:26] probably be trying to get this
[02:27] information maybe the reviews
[02:29] information bunch of things that I want
[02:31] to get but the way that we do this in
[02:34] web scraping land is for most
[02:36] internet browsers you can right-click
[02:39] and click on inspect which opens up the
[02:42] kind of HTML source code so if we look
[02:46] in here we can specifically click
[02:48] inspect on this tag and kind of find
[02:51] where the price is one thing though that
[02:53] I also notice with this specific Walmart
[02:55] page is there's this concept of item
[02:58] props so that kind of leads me to
[03:00] think that maybe there's another kind of
[03:02] technique going on here that's getting
[03:04] some of this information if we look at
[03:07] all of the information on this page
[03:09] trying to see if there's anything that
[03:11] sticks out here yep there we go um
[03:14] there's this __NEXT_DATA__ field let's see
[03:16] what's in that and we see in here
[03:19] there's this massive massive JSON object
[03:23] that has props so this __NEXT_DATA__ tag is
[03:26] telling me this is a specific type of
[03:28] JavaScript project
[03:30] that leverages these props and passes
[03:33] them in on the web page so instead of
[03:35] actually scraping from the HTML tags
[03:37] themselves we can actually usually find
[03:39] the information we're looking for within
[03:41] this props
[03:43] field um so I think kind of our
[03:46] technique for scraping information from
[03:48] Walmart will be do a search like we just
[03:51] did so we had you know the search for
[03:53] monitor this is a pretty easy format I
[03:55] could very easily replace monitor here
[03:58] with other words and search for them
[04:00] and then kind of programmatically go
[04:02] through this page maybe click on items
[04:05] and then once we're clicking on an
[04:08] item we'll see for this monitor too that
[04:11] it also has these page props and then in
[04:15] this page props
[04:17] information basically we'll want to look
[04:19] for and find where the actual price is
[04:24] within this nested Json so that's the
[04:27] general technique let's open up a
[04:29] code editor and kind of just write
[04:31] some template code out to help us do
[04:34] this so I'm in this Advanced scraping
[04:36] repo and I'll link this in the
[04:39] description uh I'll link my GitHub repo
[04:42] with all the code that we'll see in this
[04:43] video but I'll just call this like
[04:46] Walmart
[04:47] scraper
[04:49] .py and probably the first thing we'll
[04:51] want to do is install Beautiful Soup and
[04:54] I recommend if you know how to set
[04:57] up a virtual environment to do this I'll
[05:00] link a video on how to do that right
[05:01] here but we can go pip3 install
[05:03] beautifulsoup4 we see we already have
[05:06] it and we're also going to want to use
[05:09] the requests Library so I'll just kind
[05:11] of template out some code and then kind
[05:12] of explain it more in depth in a
[05:17] sec often like when I forget how to
[05:20] import things feel like beautiful soup
[05:22] is always a weird thing I like using
[05:23] Copilot
[05:30] and then I also want to import the
[05:31] requests library and Walmart URL equals
[05:36] how about
[05:38] https://
[05:39] walmart.com/ maybe we grab a specific
[05:43] one of those pages that we had
[05:45] open so I'm just going to steal this
[05:48] page real quick that URL is this so just
[05:51] take any product URL from Walmart we got
[05:54] the curved Ultra wide Monitor and then
[05:58] we'll want to get that so
[06:00] response equals requests.get pass in a
[06:04] URL and we'll print out that
[06:09] HTML okay so let's go ahead and do that
[06:12] cool we get the
[06:15] HTML and the next thing we'll want to do
[06:19] is get specifically that __NEXT_DATA__ tag
[06:23] so we can start looking at what's in
[06:24] that so if we look at our web page again
[06:27] I inspect this
[06:29] this __NEXT_DATA__ basically we'll want to
[06:32] find the script with the ID __NEXT_DATA__ so
[06:34] how do we do that with Beautiful Soup so
[06:36] we'll say soup
[06:39] equals BeautifulSoup response.text
[06:43] we'll use the HTML
[06:46] parser and we'll specifically want to
[06:48] find that script tag with the ID
[06:53] so soup.find script tag with id
[07:01] __NEXT_DATA__ I think it was like this we
[07:05] can change the ID if we have to that
[07:09] looks good basically we want to just
[07:11] print out the HTML again and it says
[07:13] None that is strange why does it say
[07:15] none so let's make sure that the next
[07:18] data is correct next two
[07:21] underscores so that looks correct and so
[07:24] this is the first I think like Advanced
[07:27] feature that you'll have to know about
[07:29] if you ever have a situation where you
[07:31] retrieve a page it's usually one or two
[07:33] and like the tag you're looking for is
[07:35] not there when you retrieve it with
[07:36] python it's e either one of two things
[07:39] one is that this is a dynamic page that
[07:42] like some of the stuff loads a little
[07:43] bit later or two they're purposely
[07:46] hiding stuff from requests that
[07:49] don't look human by default there's
[07:51] certain headers set when you use the
[07:53] requests library in Python and so to get
[07:56] around that and make ourselves look more
[07:57] human we want to make sure that we
[08:00] mimic whatever headers our browser sends when we're
[08:02] actually making this request in
[08:05] Python land what I would probably do
[08:07] here to figure out what are good headers
[08:09] to use is look at our Network tab and
[08:12] really any one of these probably will be
[08:13] fine and basically we want to just copy
[08:17] the headers that we use in our actual
[08:19] web browser so we have the request
[08:21] headers down here we copy some of these
[08:24] parameters we'll probably find we have
[08:27] more success so we probably want to copy
[08:29] the Accept-Language and the User-Agent
[08:33] headers good and sometimes you might even
[08:36] want to like have this cycle between
[08:39] several different options so you always
[08:41] are changing up what you look like We'll
[08:43] add
[08:47] commas that looks good cool and so now
[08:52] what we can do with our requests library
[08:54] and I guess I have to change the order
[08:56] of this slightly
[09:01] we can pass in the
[09:04] headers equals the new headers that
[09:06] we've added and now the question is can
[09:08] we find this script tag so we're going
[09:11] to go ahead and run all of this
[09:16] code and look at that we do get this
[09:20] __NEXT_DATA__ field that we are looking for
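The two steps up to this point (sending browser-like headers, then locating the script tag) might look roughly like the sketch below. The header values are placeholders rather than the ones copied in the video, and the function names fetch_html and find_next_data_tag are made up for this example.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder values: copy the real ones from your own browser's Network tab.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url):
    """Fetch a page while sending browser-like headers (hypothetical helper)."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text


def find_next_data_tag(html):
    """Return the <script id="__NEXT_DATA__"> tag, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("script", id="__NEXT_DATA__")
```

If find_next_data_tag returns None for a page that shows the tag in the browser, that is the symptom described above: either the content loads later, or the request was served a different page.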
[09:24] awesome all right now that we have this
[09:27] data here we want to be able to grab
[09:30] specifically the price from this so what
[09:32] I would recommend is this is
[09:34] crazy complicated to look at I feel
[09:38] like probably it's purposely just
[09:39] super obfuscated to make it hard to
[09:43] grab everything what I might do is go
[09:46] back to the web page inspect grab the
[09:50] next data I might copy the element and
[09:54] then what I might recommend doing
[09:56] is copy this JSON go to like an online
[10:00] token counter probably a bunch of these
[10:02] will work I'm just going to use this
[10:04] Streamlit app I see and paste in your
[10:08] JSON and then see how many tokens so we
[10:10] see we have like 59,000 tokens not super
[10:13] friendly to honestly like GPT-3.5 I
[10:16] think that's above any limits it can
[10:19] process you could use GPT-4 if you have
[10:21] access to the full
[10:23] 128k context basically the goal is can
[10:26] we look at what's in this without having
[10:29] to dig through all of this JSON
[10:31] couple different ways to do it um what
[10:33] we might do real quick and we'll want to
[10:35] parse this as JSON so we can do that by
[10:39] importing the json library as
[10:41] well and then we can go ahead and we
[10:43] should be able to do something like
[10:46] data
[10:47] equals script
[10:51] tag.string that'll get all the text
[10:56] within it we can do json.loads
[11:00] and now we should be able to print out
[11:02] data.keys() cool look at that
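The manual exploration described here, printing the keys one level at a time, can be sketched with two small helpers. The names keys_at and dig are hypothetical. PRICE_PATH is the key path this walkthrough arrives at for the current price, and the sample JSON in the usage note is a tiny stand-in for the real payload.

```python
import json

# Key path to the current price, as found during the walkthrough.
PRICE_PATH = (
    "props", "pageProps", "initialData", "data",
    "product", "priceInfo", "currentPrice", "price",
)


def keys_at(data, path=()):
    """Follow path (a sequence of keys) into data and list the keys found there."""
    node = data
    for key in path:
        node = node[key]
    return sorted(node.keys())


def dig(data, path, default=None):
    """Follow path into nested dicts, returning default if any key is missing."""
    node = data
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node
```

With data = json.loads(script_tag.string), calling keys_at(data), then keys_at(data, ("props",)), and so on repeats the process from the video, and dig(data, PRICE_PATH) reads the price without raising if the page layout changes.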
[11:06] so basically we can kind of keep
[11:08] repeating this process we could get the
[11:11] props and then see what the keys of that
[11:14] are it's going to be kind of a slow
[11:17] process but I'm just kind of showing how
[11:19] you could do it
[11:21] manually so what are
[11:23] the keys of the props pageProps
[11:27] so we know we have to go into pageProps
[11:30] and I'm thinking they're in props just
[11:31] because that's what you know the
[11:33] JavaScript framework will
[11:36] leverage initialData sounds pretty
[11:47] good I would say probably
[11:49] data not
[11:54] headers um probably product
[11:59] but maybe also review information would
[12:01] be helpful
[12:03] too oh wow now there's a lot of things
[12:06] now I'd look at if there's a price in
[12:08] here somewhere
[12:09] price do we see any
[12:13] price price info I like
[12:20] that now what's in the price info
[12:23] one current price I like that this might
[12:27] be the end of our
[12:31] dictionary we'll
[12:32] see and then I think probably if we get
[12:35] price from
[12:39] this see what we get look at that that
[12:42] gives us $199 and if we look at the
[12:45] product that we were looking at it is
[12:49] 100% $199 so that was getting
[12:52] information from this page not super
[12:54] easy but the reason I was
[12:57] saying go to this token counter is
[12:59] if you have a large language model that
[13:02] has the ability to parse this number of
[13:04] tokens what I might recommend is copying
[13:08] your code like for example Claude if you
[13:11] have the pro
[13:12] version um Can parse a lot so I might
[13:18] say what are the most important
[13:22] fields from the following
[13:26] Json to get the price info and review
[13:32] info for the current
[13:36] product paste this in which is crazy
[13:40] very long um I'd say limit to just
[13:45] 10 uh
[13:50] items share full path to get to that
[13:55] field using python syntax or something
[13:58] like that
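The prompt dictated above could be assembled in code along these lines. This is only a sketch: build_field_prompt is a made-up name, and serializing the JSON without whitespace is an addition of this example (it trims the payload somewhat before it is pasted into a model), not something shown in the video.

```python
import json


def build_field_prompt(data, max_items=10):
    """Build a prompt asking an LLM for the key paths to price and review fields."""
    # Compact separators drop the spaces json.dumps would otherwise insert.
    compact = json.dumps(data, separators=(",", ":"))
    return (
        "What are the most important fields from the following JSON "
        "to get the price info and review info for the current product? "
        f"Limit to just {max_items} items. "
        "Share the full path to get to each field using Python syntax.\n\n"
        + compact
    )
```

Any paths the model returns still need to be checked against the parsed data, since, as happens next in the video, it may skip levels of nesting.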
[14:02] this looks pretty
[14:04] good um I noticed that it's just going
[14:07] like four levels deep so I feel like
[14:09] it's skipped over the stuff to get to
[14:12] product um I might say hey wait to get
[14:16] to product you need to use the following
[14:23] path here's the code I found that works
[14:28] for
[14:30] price please adjust Solutions
[14:38] accordingly copy this paste it
[14:43] in perfect this looks useful so take
[14:46] this as an example but you can use like
[14:48] large language models to help you parse
[14:50] you know massive amounts of Json like
[14:52] this but to simplify the process if you
[14:56] go to the GitHub link I'll share a
[14:59] code snippet that will help us get the
[15:01] information we're looking for so I'll go
[15:03] ahead and paste that in here we go so
[15:05] here's a nice little product info
[15:08] dictionary and I can go ahead and print
[15:12] product
[15:14] info for our current product look at
[15:17] that review count 239 let's check to see
[15:20] if that's also valid look at that 239
[15:24] reviews so this looks great we're
[15:25] getting some information I think the big
[15:27] thing though is if we wanted to collect
[15:29] a lot of information on products such as
[15:33] you know these monitors and collect it
[15:34] over time and you know every day maybe
[15:36] run something that gets all this
[15:38] information and it stores it in a
[15:39] database or something we need to modify
[15:42] this a bit so what might we do well we
[15:45] can separate this into maybe two
[15:47] functions one function will be called
[15:50] extract product
[15:54] info that will take in some sort of
[15:56] product URL and we can basically just
[16:00] paste in what we already
[16:03] have so this is product URL now and
[16:08] instead of printing product info we'll
[16:10] return product info so that's one of our
[16:13] functions and then the other function we
[16:15] might want to have is like get uh
[16:19] product
[16:20] pages or get product links or something
[16:23] like
[16:23] that and what that might look like is
[16:26] maybe that takes in a a query if you
[16:29] remember how we got to this monitor page
[16:32] the first thing we did was we did a
[16:34] search that looks like this so I'm going
[16:37] to go ahead and copy this call this like
[16:39] base URL or like search URL how about
[16:43] search URL equals this and then we can
[16:46] use an f-string to swap out the search
[16:50] term and make that whatever our query is
[16:53] I think the only other thing I might add
[16:54] to this is let's say we wanted to scrape
[16:57] a bunch of information
[16:59] on
[17:00] monitors well this is all the monitors
[17:04] but you know what if we wanted more than
[17:06] just the first page of monitors we might
[17:09] go to the second page and we see that
[17:12] you can also leverage Page information
[17:14] here so I'm going to go ahead and copy
[17:16] that and put this also into my URL so
[17:20] how about we also pass in a page
[17:24] number and by default it can start at
[17:27] one you want to do a similar type thing
[17:30] get all of the pages or get get this
[17:33] search URL then we would probably want
[17:36] to get all of the links from this so if
[17:39] I look at this page right click
[17:42] inspect we get this href here looks like
[17:47] this um these are sponsored ones
[17:51] scroll out these first ones are all
[17:53] sponsored so I might just see what it
[17:54] looks like if it looks any different on
[17:56] non-sponsored ones okay here's another
[17:59] link and look at that it's
[18:03] just a path it doesn't have the full URL it just
[18:07] has this part of it so we
[18:11] want to be able to handle both cases so
[18:14] grabbing all of those links might look
[18:16] something
[18:18] like basically we want to find all the
[18:20] links in this so
[18:30] find all a
[18:35] tags I want to make sure that they have
[18:37] an
[18:38] href as part of it because that's how
[18:40] we're going to access this
[18:42] information and then we can do something
[18:44] like
[18:45] for
[18:49] link in links and then we know that the
[18:53] product URLs I think one little trick
[18:55] always have this /ip/ in them so what I
[19:00] might do to grab all the links and not
[19:02] get anything that's like a link to some
[19:04] other random
[19:06] page that's not super useful is I
[19:10] might check to see
[19:16] if this specific term is in link href
[19:22] href is the actual link stuff so that's
[19:24] why we're looking at that specifically
[19:26] and we basically want to add that link
[19:27] to some sort of queue
[19:29] so I might add a
[19:33] list and we saw two different types of
[19:35] URLs if https is in the URL then
[19:40] basically our URL is already a full
[19:48] URL full URL
[19:58] however if it's just the /ip/ stuff then
[20:01] our full URL would be equal
[20:06] to and then basically we will want
[20:16] to append this link to our product
[20:23] links this is just basically things
[20:25] we'll search
[20:27] for and then we'll want to return our
[20:30] product
[20:36] links
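The link handling just described can be sketched as below. Several details are assumptions: the search URL shape with its q and page parameter names, the /ip/ marker for product pages, and the helper names. The hrefs passed in would come from the a tags found with Beautiful Soup, as in the video.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.walmart.com"  # assumed site root


def build_search_url(query, page_number=1):
    """Build a search URL; the 'q' and 'page' parameter names are assumptions."""
    return f"{BASE_URL}/search?" + urlencode({"q": query, "page": page_number})


def normalize_product_links(hrefs):
    """Keep hrefs that look like product pages and make them absolute."""
    product_links = []
    for href in hrefs:
        if "/ip/" not in href:
            continue  # skip navigation and other non-product links
        if href.startswith("https"):
            full_url = href  # already a full URL (the sponsored-link case)
        else:
            full_url = urljoin(BASE_URL, href)  # relative path only
        product_links.append(full_url)
    return product_links
```

Using urlencode rather than a bare f-string also takes care of queries containing spaces or other characters that need escaping.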
[20:37] cool so now let's test out to see if
[20:40] this works so let's create a main
[20:52] function how about we get product links for a
[20:55] search term like computers
[21:02] run
[21:03] that and we do see we get a bunch of
[21:07] links cool this looks pretty good so now
[21:10] if we wanted to create some sort of kind
[21:12] of like you could either use a
[21:14] database something like MongoDB or
[21:17] SQLite or you could even save some of
[21:19] this stuff locally it depends on how
[21:21] complex you want to get but basically
[21:23] what we could do is do a loop over get
[21:27] product links and and uh extract product
[21:30] info and basically combine the two
[21:32] things and uh save the
[21:37] results so what that might look like I'm
[21:40] going to save this as a JSON Lines file
[21:43] so I'll call this output file
[21:51] equals and then we'll write to our
[21:54] output file
[22:12] we can Loop over a search query so how
[22:20] about links equals get
[22:23] product links
[22:26] computers how about we make this a while
[22:31] true while true links equals get product
[22:35] links page
[22:39] number and we can start popping off
[22:44] links not product
[22:47] links if not
[22:50] links how about or if we get you know
[22:54] once we get past the first 100 Pages
[22:56] we're probably fine also breaking or
[23:00] page
[23:01] number greater than
[23:04] 99 we break out of the loop iterate over
[23:08] the
[23:13] links and we will go ahead and product
[23:18] info equals extract product info on this
[23:23] link and then if we do get a result or
[23:26] something like that we might
[23:29] write to our
[23:31] file and we should also add a new line
[23:35] to this because this is a JSON Lines
[23:37] file so basically each dictionary will
[23:40] be its own line and it's probably good
[23:43] practice for us to surround this in a
[23:45] try/except
[23:59] then we'll want to increase the page
[24:03] number by one and we
[24:13] might show that we're going to a next
[24:15] page cool this is a basic way but this
[24:18] should scrape all of our computer
[24:19] information I believe and we'll see if
[24:21] this
[24:24] works there we go oops messed up the
[24:27] syntax there
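The loop assembled over the last few minutes can be summarized in a sketch like this one. It is not the repository code: scrape_query is a hypothetical name, and the two callables get_links and extract_info stand in for the get product links and extract product info functions so the loop can be read (and tried) without network access.

```python
import json


def scrape_query(query, get_links, extract_info, output_path, max_pages=99):
    """Page through search results and write one JSON object per line."""
    page_number = 1
    written = 0
    with open(output_path, "w", encoding="utf-8") as output_file:
        while page_number <= max_pages:
            links = get_links(query, page_number)
            if not links:
                break  # no more results for this query
            for link in links:
                try:
                    product_info = extract_info(link)
                except Exception as error:
                    # One bad page should not stop the whole run.
                    print(f"Failed to process {link}: {error}")
                    continue
                if product_info:
                    # JSON Lines: each dictionary goes on its own line.
                    output_file.write(json.dumps(product_info) + "\n")
                    written += 1
            page_number += 1
            print(f"Moving to page {page_number}")
    return written
```

Opening the file in "w" mode replaces any earlier output, so a rerun starts from an empty file; "a" would append instead.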
[24:30] and while this runs we can kind of check
[24:32] to see if it's working by looking at the
[24:34] file and see if lines of Json are being
[24:38] added and we do see items being
[24:44] added again all of the code is linked in
[24:47] the description and I'll kind of break
[24:49] it down by where we're at in the
[24:51] video so that finished scraping and if
[24:54] we look at our product info uh we can
[24:56] see we have a bunch of products scraped I
[24:59] think there's a couple ways that I would
[25:00] improve this though I think one of the
[25:02] big ones is that this wasn't very smart
[25:05] about what products we scraped so if you
[25:08] look you'll see like some items pop up
[25:11] all the time like this Acer Chromebook
[25:13] so there's two different ways you could
[25:15] kind of solve this problem you could
[25:16] check the item ID and only add items to
[25:19] your queue of links to search if
[25:22] you hadn't added it already and
[25:24] you could do that by the item ID or
[25:26] the simpler solution might
[25:28] be just look at the link that you're
[25:33] scraping um here and if you've already
[25:35] seen that link before don't scrape the
[25:37] info another thing we might want to do
[25:39] too is that this ran successfully but we
[25:41] only had one search term computers what
[25:43] if you wanted to look up a bunch of
[25:44] things what would that look like and
[25:46] would this same system be able to
[25:49] operate there I'm going to paste in some
[25:51] code with these changes uh again both
[25:55] the code you see currently and this new
[25:57] code all available in the description on
[26:01] GitHub but we'll go ahead and paste this
[26:05] in couple changes now some of the
[26:08] variable names have changed but we have
[26:10] now multiple search queries so computers
[26:12] is still there but we have other items
[26:14] as well that we're going to be looking
[26:15] at and we're going to use a queue to kind
[26:18] of keep track of what products to look
[26:20] at as well as we'll keep track of URLs
[26:23] we've already seen in the seen URLs but
[26:25] the process is just about the same also
[26:27] have some print statements to kind of
[26:29] show what we're doing run a query get
[26:32] all the links same as before add it to
[26:35] the product info but now we can just
[26:37] scrape more info and what does that look
[26:39] like we can go ahead and run
[26:48] this and we see all the URLs processing
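The queue and seen-URL bookkeeping mentioned here might look roughly like this. The pasted code itself is not shown in the transcript, so this is a reconstruction under assumptions: collect_unique_links and its parameters are hypothetical names, and get_links stands in for the search-page function.

```python
from collections import deque


def collect_unique_links(search_queries, get_links, max_pages=99):
    """Queue each product URL once, even if several searches return it."""
    seen_urls = set()
    queue = deque()
    for query in search_queries:
        page_number = 1
        while page_number <= max_pages:
            links = get_links(query, page_number)
            if not links:
                break
            for link in links:
                if link not in seen_urls:
                    seen_urls.add(link)  # remember it so repeats are skipped
                    queue.append(link)
            page_number += 1
    return queue
```

A set gives constant-time membership checks, which matters once thousands of URLs have been seen; deduplicating on an item ID instead would also catch the same product reached through different URLs.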
[26:51] one issue is that we just overwrote the
[26:54] original file we made so this is why you
[26:56] might want to go to like an online MongoDB
[26:58] where you're always just adding items to
[27:00] something instead of doing the
[27:02] risky way of writing local files uh
[27:05] because this video is focused on web
[27:07] scraping not necessarily how to Output
[27:09] things I'm going to leave it as is for
[27:12] now but that's kind of one way you can
[27:14] improve this project on your own and so
[27:17] I'll fast forward and let this run and
[27:19] ultimately so what you'll see is that it
[27:22] ran a bunch of these products but as we
[27:25] got through different search queries
[27:28] eventually Walmart blocked our
[27:30] IP address and basically said no more
[27:33] scraping it kind of saw that what we
[27:34] were doing because we kept making
[27:36] requests with the same exact IP address
[27:40] and it does not like that so what do you
[27:43] do if you're in this situation where you
[27:44] need to keep scraping but you're
[27:46] physically getting blocked by a site
[27:49] like walmart.com That's not liking what
[27:51] you're doing this is exactly where
[27:54] Bright
[27:55] Data comes in so if you use the link in
[27:59] the description you'll get $15 free
[28:01] for Bright Data but basically our IP got
[28:04] blocked and you still want to scrape so
[28:07] this is where some bright data tools can
[28:09] come in so I already created an account
[28:11] but if you haven't I recommend you know
[28:13] starting a free trial or setting up with
[28:14] Google again link in the description
[28:17] will give you $15 free so
[28:19] definitely sign up using that link I'm
[28:20] going to go to my user dashboard
[28:23] and what we'll do to start I'll show
[28:26] some other stuff but we're going to to
[28:28] use what's known as a proxy so basically
[28:31] a proxy Network allows us to send
[28:34] requests from different spots basically
[28:37] so if you think of the way we normally
[28:38] do it we have one single computer which
[28:41] has an IP address associated with it and
[28:43] every request has that IP when you use
[28:46] a Bright Data proxy server our request
[28:49] goes through Bright Data and then Bright
[28:51] Data allows us to make requests from many
[28:53] different IPs many different locations
[28:55] and thus makes it much much more
[28:58] difficult to block the requests we're
[29:01] making basically we can set our proxy
[29:03] server up in whatever way so that we
[29:06] make sure that we get our requests
[29:07] through so we're going to click on a few
[29:10] proxy products and we can go ahead and
[29:12] add a new one I think that I would
[29:14] recommend probably a good starting point
[29:16] is datacenter proxies and there's
[29:19] different ways you can approach this
[29:21] let's say you needed to use the same IP
[29:23] address for every one of your requests
[29:26] when you come back and run this again
[29:28] maybe you had some sort of whitelisting
[29:30] on a server that it's communicating with
[29:31] and you wanted to make sure it was a set
[29:33] number of IPs you could do dedicated you
[29:36] could also do premium but I think for
[29:37] most use cases the shared pool of data
[29:41] center IPS is probably going to be
[29:43] exactly what you need this basically is
[29:45] just a massive pool of different IP
[29:48] addresses and different people at
[29:49] different times can draw
[29:51] from this pool here we see number of IPs
[29:54] which we're not going to play around with
[29:55] here what we can do is actually go into
[29:58] Advanced options and I'm going to switch
[30:00] to pay per usage basically gives us
[30:03] access to 10,000 plus IPs it really
[30:07] depends on how much you expect to use
[30:09] what I'd recommend is you can kind of as
[30:11] you run your jobs you can kind of see
[30:12] the costs and decide whether or not
[30:15] paying per IP or paying per usage makes
[30:18] more sense I'll give this a name called
[30:21] like
[30:22] scraping
[30:24] proxy so we're going to go ahead and add
[30:26] this
[30:29] yes there's some Advanced options here
[30:32] but most important thing we will want to
[30:35] do is basically keep track of our host
[30:38] username and password I'll blur these
[30:40] out I'll show how we can actually
[30:43] leverage this information in our
[30:45] Beautiful Soup requests um we also can
[30:48] see statistics as we actually start
[30:49] using our proxy but um we see some
[30:54] example python code here if we go into
[30:57] the documentation I believe we can see
[30:58] some more um python
[31:04] code I like this a lot this already has
[31:07] some requests Library stuff I think we
[31:09] can go ahead basically and
[31:12] copy this all into our code so I'm going
[31:17] to paste this
[31:21] in um our username and password will
[31:24] have to change what I recommend for
[31:27] username and password is you can
[31:29] create a .env file in your repo you can
[31:34] set a I'm going to call this a bright
[31:36] data
[31:38] username equals and a bright data
[31:42] password equals and we'll just do this
[31:44] for example test and test password and
[31:49] then basically on our code side what we
[31:51] can
[31:52] do okay so in our code um we could just
[31:56] run this but we can just modify our
[31:58] existing code to leverage this we need
[32:00] to fix our username and password so what
[32:03] we can do is import what's known
[32:06] as dotenv and actually what we'll want to do
[32:08] is from dotenv import load_dotenv and if we
[32:14] run load_dotenv basically this then creates
[32:18] environment variables for us to use
[32:19] based on the environment we set here in
[32:23] our .env file that's within the same folder
[32:26] as our scraper code also want to import
[32:31] OS and now basically what we can do is
[32:35] if I went ahead and set this to
[32:39] os.environ BRD username and we set this to
[32:45] os.environ BRD password what we'll see
[32:50] is if I
[32:52] print username and
[32:56] print password we'll see those test
[32:59] variables that we put in I'm going to
[33:00] temporarily comment out all this bottom
[33:04] code right run the file test test
[33:07] password cool so now I'm going to
[33:09] actually paste in my password and my
[33:12] username from The Zone we created next
[33:15] time I run we'll have access to that and
[33:17] now we see that in our requests.get we
[33:19] can use proxies there's also this cool
[33:21] thing where we can do this myip.json
[33:25] and make our request there and actually
[33:26] see where the
[33:28] um IP is that we're sending our request
[33:30] from so let's go ahead
[33:33] and uncomment all of
[33:37] this again I'm going to delete this
[33:40] example code real quick and I'll show
[33:43] how we can incorporate it in okay so and
[33:46] I think do we need to change the host at
[33:47] all let's just check nope the host looks
[33:50] good it's what it has in the docs docs
[33:54] are also linked in the description but
[33:56] we can go ahead now and as a little
[34:01] sample I'm going to just use this
[34:04] URL instead of the search
[34:06] [Music]
[34:14] URL and I'll do the same thing with the
[34:16] product
[34:24] URL and I just want to show what happens
[34:26] if I do print response.json()
[34:30] here so we just temporarily put in this
[34:33] URL we run this uh and
[34:37] basically we see always the same IP
[34:41] address but if we then decided to
[34:47] insert proxies=proxies so we set
[34:51] our
[34:52] proxies here based on this proxy URL
[34:56] which contains our username and
[34:58] password and the host
[35:02] info then what we get after saving that
[35:06] and including the proxies
[35:08] in both of our requests when we start
[35:12] printing out
[35:13] things we see different IP addresses
[35:17] every time we send so this little URL is
[35:20] cool for showing where these requests
[35:23] are coming from we see it's different
[35:24] countries and everything uh very cool it
[35:28] proves that our proxies are working by
[35:30] passing in proxies like this we can go
[35:32] ahead and keep the URL as we expected it
[35:36] to be move this print statement okay our
[35:40] proxies are passed
[35:42] in I'm going to delete this stuff kind of
[35:45] clean up the code a little bit but now
[35:47] we have access to the bright data proxy
[35:50] which when we got blocked last time now
[35:55] that we have
[35:59] um the bright data proxy set for
[36:02] all these search queries we can run them
[36:05] and basically get around the block that
[36:07] will happen when we use only a single IP
[36:10] so that's pretty
[36:12] cool if you're running
[36:14] into any errors with um accessing things
[36:17] you might have to do a pip3 install
[36:19] of any new library such
[36:21] as dotenv or we also included pretty print here
[36:28] so make sure to uh pip3
[36:32] install these items if you
[36:36] didn't now by using proxies we kind of
[36:39] open ourselves up to some other new kind
[36:41] of potential issues that we want to be
[36:43] aware of one is maybe the proxy is not
[36:47] available for a second and a request
[36:50] you know trips up or maybe the specific
[36:53] country that we used in our proxy
[36:55] request um can't access the website
[36:59] we're trying to scrape so we want to
[37:00] make sure we can like fail gracefully if
[37:03] those types of things happen many
[37:05] different ways to do this but basically
[37:06] I feel like it comes down to just being
[37:10] cognizant and using like a try/except to handle
[37:14] failures properly so I'm going
[37:17] to um add in some additional error
[37:20] handling and paste in kind of a more um
[37:23] error proof way to do this but within
[37:26] the bright data ecosystem I think one
[37:29] thing that I would recommend is when you
[37:32] are looking at the proxies and the
[37:34] different things um kind of in the DAT
[37:38] like you could add specific geolocation
[37:41] targeting in either the shared or
[37:43] dedicated uh and you see like maybe you
[37:46] only wanted it to be United
[37:49] States and you know maybe and you can
[37:51] even add cities that's pretty cool uh
[37:54] Canada so this would limit where your
[37:56] IPS are coming from from uh so if you
[37:59] find yourself getting blocked based on
[38:00] the location you know look at things
[38:02] like this with geolocation targeting or
[38:05] specific cities uh you can go very
[38:08] sophisticated pretty cool stuff so
[38:10] that's good to keep in mind all right
[38:12] but I'm going to go ahead and paste in a
[38:15] version of this that handles some
[38:18] failures in a little bit more sophisticated
[38:20] manner and also I'm going to use more
[38:22] Search terms because one of the perks of
[38:24] using the proxy networks is we could
[38:27] just keep scraping things uh quite nice
[38:31] so it basically
[38:33] just retries things basically the
[38:36] difference here with this code is it
[38:38] will retry um getting product links or
[38:41] retry extracting product info with a max
[38:44] number of retries for each URL and if it
[38:47] fails ever it will sleep a certain
[38:50] amount of time um that's dependent on
[38:54] what we defined as a backoff factor and
[38:56] the attempt number and just basically
[38:59] lets us make sure that if anything kind
[39:02] of hiccups that we can process it um
[39:06] another way you might improve this
[39:07] further is you could run things in
[39:10] parallel uh it really depends on what's
[39:12] most important to you whether it's just
[39:14] getting the info or getting the info
[39:15] quickly uh there's different ways you
[39:17] could play around with this and improve
[39:18] it further but this is a good starting
[39:21] point whereas you might get blocked
[39:24] in you know a couple thousand requests
[39:27] to Walmart normally you can do tens and
[39:29] tens of thousands when using a bright
[39:31] data proxy Network which is super nice
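A minimal sketch of the proxy setup and retry-with-backoff idea described above; it is not the code pasted in the video. The helper names `build_proxies` and `fetch_with_retries`, the linear delay `backoff_factor * attempt`, and the default values are all assumptions, since the video only says the sleep time depends on a backoff factor and the attempt number. The `fetch` argument stands in for a call such as `requests.get` with `proxies=proxies`, so the sketch itself needs no network access.

```python
import time
from urllib.parse import quote


def build_proxies(username, password, host):
    """Build the mapping that requests.get accepts through its proxies argument.

    host is expected as "hostname:port" (hypothetical value; use the one from your zone).
    """
    # Percent-encode credentials so special characters do not break the proxy URL.
    proxy_url = f"http://{quote(username, safe='')}:{quote(password, safe='')}@{host}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_retries(fetch, url, max_retries=3, backoff_factor=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception.

    After a failed attempt, wait backoff_factor * attempt seconds (assumed formula)
    before trying again. Returns the fetch result, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: proxy failures surface in many forms
            print(f"Attempt {attempt}/{max_retries} failed for {url}: {exc}")
            if attempt < max_retries:
                sleep(backoff_factor * attempt)
    return None
```

In the scraper this might be wired up as `fetch_with_retries(lambda u: requests.get(u, headers=headers, proxies=proxies, timeout=30), url)`, with `proxies` built from the username and password loaded from the .env file. Returning None instead of raising lets the outer loop skip a URL that keeps failing and move on to the next one.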
[39:35] so I think this covers a lot of the kind
[39:37] of what you'll need to know when it
[39:38] comes to Advanced scraping techniques
[39:41] but I think the next logical step to
[39:43] take your skills even further is uh
[39:45] being able to actually automate actions
[39:47] on a page using a library such as
[39:49] selenium in Python so if you look up
[39:53] something like
[39:54] selenium python bright data you can
[39:57] find some good information on how to
[40:00] kind of get started using
[40:03] selenium and really yeah selenium is
[40:06] going to come in handy when let's say a
[40:09] page loads but there's this table that
[40:12] loads a little bit slower and maybe
[40:14] after like 5 Seconds it loads in and you
[40:17] need to access the information in
[40:18] the page you can use selenium to like
[40:21] load the page wait 5 seconds and then
[40:24] access the information that is slow to
[40:27] load by using things such as implicitly_wait
[40:31] here through selenium and then
[40:34] additionally one thing that's super
[40:36] useful within the bright data ecosystem
[40:38] is this scraping browser so if you look
[40:41] at documentation scraping browser
[40:44] scraping browser configuration python
[40:49] selenium uh you can find some
[40:50] information on how to use it but uh one
[40:52] thing that's really nice about the
[40:53] scraping browser is if let's say you're
[40:55] using selenium and you run into some uh
[40:58] CAPTCHAs and stuff that are particularly
[41:00] hard to get by uh the scraping
[41:03] browser can kind of help you
[41:04] automatically bypass those so you can
[41:07] really access that tricky to grab data
[41:12] programmatically I think because this
[41:13] video is already a good length we're
[41:15] going to hold off on all of that if you
[41:17] want to see selenium python bright data
[41:20] stuff in more depth let me know in the
[41:22] comments and I definitely will make
[41:23] another follow-up video happen but I
[41:26] think this was a good launching off point
[41:28] that builds off kind of the basics
[41:30] that we learned in the original tutorial
[41:32] hopefully you now have a little bit more
[41:34] Tools in your tool case especially the
[41:36] bright data proxy stuff that can help
[41:38] you kind of take your scraping skills to
[41:40] the next level with that I think we'll
[41:42] call the video I hope you learned
[41:44] something I hope you enjoyed this video
[41:46] if you did it means a lot if you throw it a
[41:48] thumbs up and click the Subscribe button
[41:50] if you haven't
[41:51] already um yeah got a bunch of tutorials
[41:53] on the way thank you to Bright data
[41:57] again for sponsoring this video link to
[42:00] get your first $15 free on bright data
[42:03] down in the description I think that is
[42:05] it until next time everyone
[42:09] peace
[42:12] out one quick question before we go I'm
[42:16] just super curious how all of you plan
[42:18] on taking the knowledge from this video
[42:20] on accessing web data via web scraping
[42:22] and applying it to your own projects so
[42:24] we'd be super curious to hear what types
[42:26] of projects you all want to work on with
[42:29] this knowledge let me know in the
[42:31] comments it's always fun to read through
[42:34] what you all are working on all
[42:36] right now I'm really out
[42:40] peace now