[0:00] Have you ever been on a website, scrolling  through a huge list of quotes, names, or data,   [0:05] and thought – I wish I could just grab all of  this automatically instead of copying it one   [0:10] by one? Well, that's exactly what web scraping  lets you do. And today, we're going to learn how   [0:16] to use a Python library called BeautifulSoup. So what is BeautifulSoup? In plain English,   [0:25] it's a tool that reads the code behind a webpage  and lets you pull out exactly the pieces you   [0:30] want. Think of a webpage like a magazine.  BeautifulSoup is like a highlighter that   [0:35] lets you mark just the quotes, just the author  names, or just the tags – whatever you're after.  [0:42] The best part? You don't need to be an expert.  If you know basic Python – variables, loops,   [0:48] print statements – you're completely  ready for this. We're going to go slow,   [0:53] explain every single line, and by the end of  this video you'll be pulling real data off a   [0:58] real website and saving it to a file. Alright – let's get into it.  [1:04] Before we install anything, let's set up a  virtual environment. Think of it like a clean,   [1:09] separate workspace just for this project – so  the libraries we install here don't interfere   [1:14] with anything else on your computer. First, let’s install pipenv.  [1:21] Then navigate to your project  folder and run this command.  [1:25] That's it – one command. It creates the virtual  environment and activates it at the same time.   [1:32] You'll see your terminal change, which means  you're now inside the environment and ready to go.  [1:38] Now, let's get another thing  installed. Open your terminal and type:  [1:43] pip install requests What's requests? It's a   [1:48] Python library that lets your code fetch  a webpage – kind of like telling Python,   [1:53] 'go open this URL for me and bring  back whatever you find.' That's it.  [1:58] We're NOT installing BeautifulSoup yet. 
There's a reason for that – and you're about to see why.  [2:05] Alright, let's write a quick script. We're going to try fetching a website   [2:09] using just requests, with no extra setup. Simple enough – we're telling Python: go to this   [2:15] URL, grab the content, and print it out. Let's run it. [2:25] Okay – you'll see a massive wall of HTML printed out. The data is in there somewhere,   [2:31] but it's basically unreadable. You'd have to dig through hundreds of lines just to find   [2:35] one author's name. We need something better. When your browser visits a website, it sends   [2:41] a small label called a user-agent – basically saying 'I'm Google Chrome on a Mac.' A plain   [2:48] requests script announces itself as python-requests instead, so some websites block it immediately. Adding a browser-style label   [2:54] to our request and then using BeautifulSoup fixes both problems – the label makes us look like a real browser,   [3:00] and BeautifulSoup organizes that messy HTML into something we can actually search through. Let's set that up.  [3:07] Time to install BeautifulSoup in your terminal with this command.  [3:12] Note that it's beautifulsoup4 with a 4 at the end. That's just how the package is named.  [3:18] Now here are our imports. And this time we're adding a headers dictionary.  [3:32] Now you might be looking at Mozilla/5.0 and thinking – what is that?   [3:37] It's actually a browser signature. When a real browser like Chrome or Firefox visits a website,   [3:44] it sends this string to identify itself. Mozilla/5.0 is the base signature that   [3:50] almost every modern browser uses – Chrome, Firefox, Safari, they all start with it.  [3:57] So by adding this to our request, we're telling the website – 'hey,   [4:02] I'm a normal browser, not a Python script.' Many websites will see this and let us through.  [4:09] If you're curious, a full real user-agent string actually looks like this.  [4:14] But for our tutorial today, the short version works perfectly fine.  
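As a rough sketch of the setup described so far, the snippet below builds the headers dictionary with the short Mozilla/5.0 signature and wraps the request in a small helper. The URL is the practice site used later in the video. The helper name fetch_page and the timeout value are assumptions added for illustration, not the on-screen code.

```python
import requests

# Practice site used in this tutorial.
URL = "https://quotes.toscrape.com"

# Short browser signature from the video. It replaces the default
# "python-requests/<version>" user-agent that requests would send.
headers = {"User-Agent": "Mozilla/5.0"}


def fetch_page(url):
    """Hypothetical helper: fetch a page while sending our user-agent header."""
    # timeout=10 is an assumption so a stalled request cannot hang forever.
    return requests.get(url, headers=headers, timeout=10)


# Build the request without sending it, just to inspect what would go out.
prepared = requests.Request("GET", URL, headers=headers).prepare()
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```

Calling fetch_page(URL) and printing its status_code is the step that should show 200 when the site lets the request through.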
[4:19] Let’s continue the code and confirm that everything is working.  [4:39] If you see 200 printed – that means success. Status code 200 is the web's way of saying   [4:47] everything is fine and you're in. Alright – let's start scraping.  [4:52] This is the main part of the video. We're going to scrape real quotes, author names, and tags   [4:58] from quotes.toscrape.com. This site was built specifically for practicing web scraping – so   [5:04] it's completely safe and legal to use here. First, we pass the HTML into   [5:10] BeautifulSoup with this command. response.text is the raw HTML of   [5:15] the page – all the code that makes the website. And 'html.parser' is telling BeautifulSoup which   [5:21] tool to use to read that code. The good news – it's built into Python, no extra install needed.  [5:28] Now let's find the quotes. If you right-click and hit Inspect,   [5:32] you'll see each quote lives inside a div with a class of quote. So we grab them all like this.  [5:39] The find_all method searches through the entire HTML and returns every element that matches.  [5:51] Now quotes is a list, so let's loop through it. This just means – go through each quote   [5:56] one by one and do the following: We're looking inside each quote block   [6:01] for a span with the class text, and grabbing the text inside it. That's the quote itself.  [6:08] Same idea – find a small tag with the class author. We include the tag name small to be more   [6:15] specific, because a class name alone isn't always unique on a page. That gives us the author's name. [6:21] Here we create an empty list called tags, then loop through all the tag links inside each   [6:27] quote and add each one to that list. And finally we print everything out.  [6:40] The '\n' just adds a blank line between each quote so the output is easy to read.  [6:49] And join(tags) joins all the tags into one clean string separated by commas.  [6:55] Let's run it.  [6:58] Look at that. 
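The parsing steps just walked through can be sketched roughly as follows. To keep it runnable offline, a small made-up HTML sample stands in for response.text. The div, span, and small class names come from the walkthrough, while the class name tag on the tag links and the function name parse_quotes are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for response.text so the sketch runs without a network call.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/first/">first</a>
    <a class="tag" href="/tag/second/">second</a>
  </div>
</div>
"""


def parse_quotes(html):
    """Return a list of (text, author, tags) tuples found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Every quote block is a div with the class "quote".
    quotes = soup.find_all("div", class_="quote")

    results = []
    for quote in quotes:
        text = quote.find("span", class_="text").text
        author = quote.find("small", class_="author").text

        # Collect the text of each tag link (class name "tag" is assumed).
        tags = []
        for tag in quote.find_all("a", class_="tag"):
            tags.append(tag.text)

        results.append((text, author, tags))
    return results


for text, author, tags in parse_quotes(SAMPLE_HTML):
    print(text)
    print(author)
    print(", ".join(tags))
    print("\n")
```

Against the live site, the same function would be called with response.text instead of SAMPLE_HTML.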
Real quotes, real authors, real tags – all pulled automatically.  [7:04] Now let's save this data to a CSV file so you can open it in Excel or Google Sheets.  [7:10] First, import the csv module. Next, write this command.  [7:18] This line opens a new file called quotes.csv. The 'w' means we're writing to it – creating   [7:31] it fresh. newline='' prevents extra blank lines appearing between rows, and encoding='utf-8'   [7:39] makes sure special characters like apostrophes or accented letters don't break anything.  [7:45] The writer variable holds a CSV writer – think of it as the pen that writes into our file.  [7:51] writer.writerow writes our header row: the column names at the top of the spreadsheet for the text,   [7:58] author, and tags we already extracted above. The parsing code in the middle is exactly the   [8:04] same as before – we're just wrapping it inside the file writer now. No need to change anything there.  [8:14] Run it, and you'll see a quotes.csv file appear in your project folder. Open it,   [8:21] and all your quotes are right there. Before we wrap up, let's talk about the   [8:26] errors you will almost definitely run into. And I mean this happens to everyone,   [8:30] it's completely normal, and once you know what they are you'll fix them in seconds. [8:36] We will have a new errors.py file for that. Error one – NoneType error.  [8:43] Let me show you this live. Watch what happens when I use the wrong class name.  [8:58] See that? AttributeError: 'NoneType' object has no attribute 'text' – BeautifulSoup returned None   [9:05] because it found nothing, and then .text on None crashes. The fix is simple – let’s see. [9:31] Error two – Empty results. Similar idea but this time with find_all. Watch.  [9:48] It just returns an empty list [] – no crash, but no data either. This usually means a wrong class name or   [9:55] tag. When you inspect the page, copy the exact class name carefully. One typo is all it takes. 
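The CSV step and the two fixes above can be sketched together. This is an illustration under stated assumptions rather than the code shown on screen: the helper names safe_text and write_quotes are hypothetical, the HTML is a made-up sample whose second block deliberately lacks an author, and an in-memory buffer replaces quotes.csv so the sketch leaves no file behind.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up sample: the second block is missing its author on purpose.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First sample quote.</span>
  <small class="author">Sample Author</small>
  <a class="tag" href="#">first</a>
</div>
<div class="quote">
  <span class="text">Second sample quote.</span>
</div>
"""


def safe_text(parent, name, class_name):
    """Hypothetical helper: return the element's text, or "" if it is missing."""
    element = parent.find(name, class_=class_name)
    if element is None:  # find() returns None when nothing matches
        return ""
    return element.text


def write_quotes(html, file):
    """Write one CSV row per quote block and return how many were found."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = soup.find_all("div", class_="quote")
    if not quotes:  # find_all() returns an empty list when nothing matches
        print("No quotes found: check the tag and class name.")
        return 0

    writer = csv.writer(file)
    writer.writerow(["text", "author", "tags"])  # header row
    for quote in quotes:
        tags = [tag.text for tag in quote.find_all("a", class_="tag")]
        writer.writerow([
            safe_text(quote, "span", "text"),
            safe_text(quote, "small", "author"),
            ", ".join(tags),
        ])
    return len(quotes)


# With a real file this would be:
#   with open("quotes.csv", "w", newline="", encoding="utf-8") as file:
#       write_quotes(response.text, file)
buffer = io.StringIO(newline="")
count = write_quotes(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

Checking for None before touching .text avoids the AttributeError, and checking for an empty list turns silent missing data into a visible message.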
[10:02] Error three – Connection timeout. This one I can't show you live because   [10:08] our practice site is too fast and reliable – which is a good thing. But in the real world,   [10:15] when you're scraping slower or larger websites,   [10:18] timeouts will happen. Here's the code you'll need when they do – just keep it handy.  [10:41] That's it for errors. Three simple fixes that'll cover the   [10:45] vast majority of what you'll face as a beginner. [10:48] Before we finish, a few important things about scraping responsibly.  [10:53] First – always check robots.txt. Go to any website and add /robots.txt at the   [11:01] end of the URL. Let me show you – walmart.com/robots.txt  [11:03] See these lines? The Sitemap part isn't relevant to us.  [11:03] Disallow means don't scrape this section. Allow means this part is open. Always read   [11:03] this file before scraping any real website and respect what it says.  [11:03] Now our practice site quotes.toscrape.com doesn't even have a robots.txt – and that's actually   [11:03] intentional. It was built specifically to be scraped freely, no restrictions. [11:03] Second – let's talk about how to be polite   [11:06] when scraping multiple pages. We can do this in our multi-page   [11:08] example with time.sleep(1). You add it inside your loop, after each request. One second   [11:08] between requests is polite – you're not hammering the server all at once. [11:14] Third – always read the terms of service of any site before scraping it seriously.   [11:20] Some sites explicitly say no scraping. Better to check than to get blocked or in trouble.  [11:26] And that's a wrap! Let's do a quick recap of what we covered today.  [11:32] We started with a problem – a plain requests call with no setup, messy unreadable HTML,   [11:38] and no way to extract anything useful. Then we installed BeautifulSoup,   [11:44] added a user-agent header to look like a real browser, and suddenly everything worked cleanly.  
[11:50] We scraped real quotes, authors, and tags from a  live website, handled some common errors you'll   [11:56] run into, and talked about scraping responsibly  with robots.txt, delays, and terms of service.  [12:04] That's a lot for one video –  and you did it all from scratch.  [12:09] From here you can scrape  news headlines, job postings,   [12:12] sports stats, product listings – anything  publicly visible on the web. BeautifulSoup   [12:19] is your starting point for all of it. If this helped, hit subscribe and drop   [12:24] a comment telling me what you're planning  to scrape first. See you in the next one.
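For reference, here is a minimal sketch of the timeout handling, the robots.txt check, and the polite delay mentioned in the video. It is not the on-screen code: the function names fetch, allowed_by_robots, and scrape_pages, the 10-second timeout, and the printed messages are all assumptions chosen for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch(url, timeout=10):
    """Return the page HTML, or None if the request times out or fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        return response.text
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout} seconds: {url}")
    except requests.exceptions.RequestException as error:
        print(f"Request failed for {url}: {error}")
    return None


def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def scrape_pages(urls, delay=1):
    """Fetch each URL in turn, pausing between requests to stay polite."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay)  # one second between requests by default
    return pages


# Example rules (made up) showing how Disallow is interpreted.
rules = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(rules, "https://example.com/private/page"))  # False
print(allowed_by_robots(rules, "https://example.com/public"))  # True
```

Catching Timeout separately from the broader RequestException keeps the slow-server case easy to spot, and returning None lets the calling loop skip a failed page instead of crashing.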