[0:00] Have you ever been on a website, scrolling 
through a huge list of quotes, names, or data,  
[0:05] and thought – I wish I could just grab all of 
this automatically instead of copying it one  
[0:10] by one? Well, that's exactly what web scraping 
lets you do. And today, we're going to learn how  
[0:16] to use a Python library called BeautifulSoup.
So what is BeautifulSoup? In plain English,  
[0:25] it's a tool that reads the code behind a webpage 
and lets you pull out exactly the pieces you  
[0:30] want. Think of a webpage like a magazine. 
BeautifulSoup is like a highlighter that  
[0:35] lets you mark just the quotes, just the author 
names, or just the tags – whatever you're after. 
[0:42] The best part? You don't need to be an expert. 
If you know basic Python – variables, loops,  
[0:48] print statements – you're completely 
ready for this. We're going to go slow,  
[0:53] explain every single line, and by the end of 
this video you'll be pulling real data off a  
[0:58] real website and saving it to a file.
Alright – let's get into it. 
[1:04] Before we install anything, let's set up a 
virtual environment. Think of it like a clean,  
[1:09] separate workspace just for this project – so 
the libraries we install here don't interfere  
[1:14] with anything else on your computer.
First, let’s install pipenv. 
[1:21] Then navigate to your project 
folder and run this command. 
[1:25] That's it – one command. It creates the virtual 
environment and activates it at the same time.  
[1:32] You'll see your terminal change, which means 
you're now inside the environment and ready to go. 
[1:38] Now, let's get another thing 
installed. Open your terminal and type: 
[1:43] pip install requests
What's requests? It's a  
[1:48] Python library that lets your code fetch 
a webpage – kind of like telling Python,  
[1:53] 'go open this URL for me and bring 
back whatever you find.' That's it. 
[1:58] We're NOT installing BeautifulSoup yet. There's 
a reason for that – and you're about to see why. 
[2:05] Alright, let's write a quick script. 
We're going to try fetching a website  
[2:09] using just requests, with no extra setup.
Simple enough – we're telling Python: go to this  
[2:15] URL, grab the content, and print it out.
Let's run it.
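If you're following along without the screen, that first script could look roughly like this. It's a sketch, not the exact code from the video: the URL is an assumption (the practice site we use later), and fetch_html is a made-up helper name.

```python
import requests

# Assumption: the practice site used later in this tutorial.
URL = "https://quotes.toscrape.com"


def fetch_html(url):
    """Fetch a page with plain requests (no custom headers) and return the raw HTML."""
    response = requests.get(url)
    return response.text


# Usage (needs an internet connection):
# print(fetch_html(URL))
```

Calling print(fetch_html(URL)) is what produces the wall of HTML.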
[2:25] Okay - you'll see a massive wall of HTML 
printed out. The data is in there somewhere,  
[2:31] but it's basically unreadable. You'd have to 
dig through hundreds of lines just to find  
[2:35] one author's name. We need something better.
When your browser visits a website, it sends  
[2:41] a small label called a user-agent – basically 
saying 'I'm Google Chrome on a Mac.' A plain  
[2:48] Python script announces itself as a script instead, so some websites 
block it immediately. Two things fix both problems: adding a browser-style  
[2:54] label to our request makes us look like a real browser, 
and BeautifulSoup organizes that messy HTML into something  
[3:00] we can actually search through. 
Let's set that up. 
[3:07] Time to install BeautifulSoup in 
your terminal with this command. 
[3:12] Note that it's beautifulsoup4 with a 4 at the 
end. That's just how the package is named. 
[3:18] Now here are our imports.
And this time we're adding a headers dictionary. 
[3:32] Now you might be looking at Mozilla/5.0 
and thinking – what is that?  
[3:37] It's actually a browser signature. When a real 
browser like Chrome or Firefox visits a website,  
[3:44] it sends this string to identify itself. 
Mozilla/5.0 is the base signature that  
[3:50] almost every modern browser uses – Chrome, 
Firefox, Safari, they all start with it. 
[3:57] So by adding this to our request, 
we're telling the website – 'hey,  
[4:02] I'm a normal browser, not a Python script.' 
Most websites will see this and let us through. 
[4:09] If you're curious, a full real user-agent 
string actually looks like this. 
[4:14] But for our tutorial today, the 
short version works perfectly fine. 
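Here's roughly what the imports and the headers dictionary could look like. Treat it as a sketch: the long user-agent string is only an illustrative Chrome-style example (not necessarily the one shown on screen), and the last two lines are my own addition to peek at the header without actually sending anything.

```python
import requests
from bs4 import BeautifulSoup  # used once we start parsing

# The short browser signature used in this tutorial.
headers = {"User-Agent": "Mozilla/5.0"}

# Illustrative full Chrome-style string; exact versions differ per browser release.
full_user_agent = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

# Build the request WITHOUT sending it, just to see the header that would go out.
prepared = requests.Request(
    "GET", "https://quotes.toscrape.com", headers=headers
).prepare()
print(prepared.headers["User-Agent"])  # → Mozilla/5.0
```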
[4:19] Let’s continue the code and 
confirm if everything is working. 
[4:39] If you see 200 printed – that means success. 
Status code 200 is the web's way of saying  
[4:47] everything is fine and you're in. 
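A minimal sketch of that status check might look like this. The fetch helper name and the timeout=10 argument are my assumptions, not something confirmed by the video.

```python
import requests

URL = "https://quotes.toscrape.com"
headers = {"User-Agent": "Mozilla/5.0"}


def fetch(url):
    """Request the page with our browser-style header and print the status code."""
    # timeout=10 is an added safety net (an assumption, not from the video).
    response = requests.get(url, headers=headers, timeout=10)
    print(response.status_code)  # 200 means the request succeeded
    return response


# Usage (needs an internet connection):
# response = fetch(URL)
```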
Alright – let's start scraping. 
[4:52] This is the main part of the video. We're going 
to scrape real quotes, author names, and tags  
[4:58] from quotes.toscrape.com. This site was built 
specifically for practicing web scraping – so  
[5:04] it's completely safe and legal to use here.
First, we pass the HTML into  
[5:10] BeautifulSoup with this command.
response.text is the raw HTML of  
[5:15] the page – all the code that makes the website. 
And 'html.parser' is telling BeautifulSoup which  
[5:21] tool to use to read that code. The good news - 
it's built into Python, no extra install needed. 
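To make that concrete, here's a small sketch you can run offline. The html string is a made-up stand-in for response.text, so nothing here touches the network.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text so the sketch runs without a network connection.
html = (
    "<html><head><title>Quotes to Scrape</title></head>"
    "<body><p>Hello</p></body></html>"
)

# 'html.parser' is Python's built-in parser, so there is nothing extra to install.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)  # → Quotes to Scrape
print(soup.p.text)      # → Hello
```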
[5:28] Now let's find the quotes. If 
you right-click and hit Inspect,  
[5:32] you'll see each quote lives inside a div with a 
class of quote. So we grab them all like this. 
[5:39] The find_all method searches through the entire 
HTML and returns every element that matches. 
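Here's a tiny offline sketch of that search. The HTML is made up to mimic the div-with-class-quote structure described above; on the real page you would pass response.text instead.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure described in the tutorial.
html = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
<div class="footer">Not a quote</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
quotes = soup.find_all("div", class_="quote")
print(len(quotes))  # → 2
```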
[5:51] Now quotes is a list, so let's loop through it.
This just means – go through each quote  
[5:56] one by one and do the following:
We're looking inside each quote block  
[6:01] for a span with the class text, and grabbing 
the text inside it. That's the quote itself. 
[6:08] Same idea – find a small tag with the class 
author. We include the tag name small to be more  
[6:15] specific, because a class name alone isn't always 
unique on a page. That gives us the author's name.
[6:21] Here we create an empty list called tags, then 
loop through all the tag links inside each  
[6:27] quote and add each one to that list.
And finally we print everything out. 
[6:40] The '\n' just adds a blank line between 
each quote so the output is easy to read. 
[6:49] And join(tags) joins all the tags into 
one clean string separated by commas. 
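Putting those steps together, the loop might look something like this. It's a sketch under assumptions: the sample HTML is invented to match the structure described above (span.text, small.author, a.tag), and the exact print formatting in the video may differ.

```python
from bs4 import BeautifulSoup

# Invented sample shaped like the quote blocks on the practice site.
html = """
<div class="quote">
  <span class="text">Sample quote one.</span>
  <span>by <small class="author">Author A</small></span>
  <div class="tags">
    <a class="tag" href="#">life</a>
    <a class="tag" href="#">books</a>
  </div>
</div>
<div class="quote">
  <span class="text">Sample quote two.</span>
  <span>by <small class="author">Author B</small></span>
  <div class="tags">
    <a class="tag" href="#">humor</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
quotes = soup.find_all("div", class_="quote")

for quote in quotes:
    # The quote itself lives in a span with class "text".
    text = quote.find("span", class_="text").text
    # The author lives in a small tag with class "author".
    author = quote.find("small", class_="author").text
    # Collect every tag link into a list.
    tags = []
    for tag in quote.find_all("a", class_="tag"):
        tags.append(tag.text)

    print("Quote:", text)
    print("Author:", author)
    # The trailing "\n" leaves a blank line before the next quote.
    print("Tags:", ", ".join(tags), "\n")
```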
[6:55] Let's run it. 
[6:58] Look at that. Real quotes, real authors, 
real tags – all pulled automatically. 
[7:04] Now let's save this data to a CSV file so 
you can open it in Excel or Google Sheets. 
[7:10] First, import the csv module.
Next, write this line. 
[7:18] This line opens a new file called quotes.csv. 
The 'w' means we're writing to it – creating  
[7:31] it fresh. Setting newline='' prevents extra blank lines 
from appearing between rows, and encoding='utf-8'  
[7:39] makes sure special characters like apostrophes 
or accented letters don't break anything. 
[7:45] The writer variable holds a CSV writer – think 
of it as the pen that writes into our file. 
[7:51] writer.writerow writes our header row: the column 
names at the top of the spreadsheet, matching the text,  
[7:58] author, and tags we already extracted above.
The parsing code in the middle is exactly the  
[8:04] same as before – we're just wrapping it inside the 
file writer now. No need to change anything there. 
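As a rough sketch, the CSV version could look like this. The sample HTML is invented so it runs offline, and the header names "Text", "Author", "Tags" are my assumption about the column labels.

```python
import csv
from bs4 import BeautifulSoup

# Invented sample standing in for the real page's HTML.
html = """
<div class="quote">
  <span class="text">Sample quote one.</span>
  <small class="author">Author A</small>
  <a class="tag" href="#">life</a>
  <a class="tag" href="#">books</a>
</div>
<div class="quote">
  <span class="text">Sample quote two.</span>
  <small class="author">Author B</small>
  <a class="tag" href="#">humor</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
quotes = soup.find_all("div", class_="quote")

# "w" creates the file fresh; newline="" avoids blank rows; utf-8 keeps special characters intact.
with open("quotes.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Text", "Author", "Tags"])  # header row (assumed column names)

    # Same parsing as before, now writing a row instead of printing.
    for quote in quotes:
        text = quote.find("span", class_="text").text
        author = quote.find("small", class_="author").text
        tags = []
        for tag in quote.find_all("a", class_="tag"):
            tags.append(tag.text)
        writer.writerow([text, author, ", ".join(tags)])
```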
[8:14] Run it, and you'll see a quotes.csv file 
appear in your project folder. Open it, and  
[8:21] all your quotes are right there.
Before we wrap up, let's talk about the  
[8:26] errors you will almost definitely run 
into. And I mean this happens to everyone,  
[8:30] it's completely normal, and once you know 
what they are you'll fix them in seconds.
[8:36] We will have a new errors.py file for that.
Error one - NoneType error. 
[8:43] Let me show you this live. Watch what 
happens when I use the wrong class name. 
[8:58] See that? AttributeError: 'NoneType' object has 
no attribute 'text' – BeautifulSoup returned None  
[9:05] because it found nothing, and then calling .text on 
None crashes. The fix is simple – let’s see.
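One common way to handle this is to check for None before touching .text. This is a sketch, not necessarily the exact fix from the video; the HTML, the wrong class name "writer", and the "Unknown" fallback are all made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<div class="quote"><small class="author">Author A</small></div>'
soup = BeautifulSoup(html, "html.parser")

# Wrong class name on purpose: find() returns None instead of a tag.
author_tag = soup.find("small", class_="writer")

# Only read .text when something was actually found.
if author_tag is not None:
    author = author_tag.text
else:
    author = "Unknown"

print(author)  # → Unknown
```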
[9:31] Error two - Empty results.
Similar idea but this time with find_all. Watch. 
[9:48] It just returns an empty list [] - no crash, but 
no data either. This usually means a wrong class name or  
[9:55] tag. When you inspect the page, copy the exact 
class name carefully. One typo is all it takes.
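Here's a small sketch of that situation, with a made-up snippet of HTML and a deliberate typo ("quotes" instead of "quote"). Checking whether the list is empty at least tells you something went wrong.

```python
from bs4 import BeautifulSoup

html = '<div class="quote">A</div><div class="quote">B</div>'
soup = BeautifulSoup(html, "html.parser")

# Typo on purpose: the real class is "quote", not "quotes".
quotes = soup.find_all("div", class_="quotes")
print(len(quotes))  # → 0

# An empty result is "falsy", so this check catches it.
if not quotes:
    print("No quotes found. Double-check the tag and class name.")
```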
[10:02] Error three - Connection timeout.
This one I can't show you live because  
[10:08] our practice site is too fast and reliable – 
which is a good thing. But in the real world,  
[10:15] when you're scraping slower or larger websites,  
[10:18] timeouts will happen. Here's the code you'll 
need when they do – just keep it handy. 
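A sketch of what that timeout handling could look like is below. The helper name fetch_with_timeout and the 10-second default are assumptions, so pick whatever limit suits the site you're working with.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}


def fetch_with_timeout(url, seconds=10):
    """Return the response, or None if the server takes too long to answer."""
    try:
        # timeout= makes requests give up instead of waiting forever.
        return requests.get(url, headers=headers, timeout=seconds)
    except requests.exceptions.Timeout:
        print("The request timed out. Try again later or raise the limit.")
        return None


# Usage (needs an internet connection):
# response = fetch_with_timeout("https://quotes.toscrape.com")
```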
[10:41] That's it for errors. Three 
simple fixes that'll cover the  
[10:45] vast majority of what you'll face as a beginner.
[10:48] Before we finish, a few important 
things about scraping responsibly. 
[10:53] First – always check robots.txt. Go to 
any website and add /robots.txt at the  
[11:01] end of the URL. Let me show 
you – walmart.com/robots.txt 
[11:03] See these lines?
The Sitemap part isn't relevant to us. 
[11:03] Disallow means don't scrape this section. 
Allow means this part is open. Always read  
[11:03] this file before scraping any real 
website and respect what it says. 
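You can also check these rules from Python with the standard library's urllib.robotparser. The sketch below parses a few made-up rules (not Walmart's actual file) so it runs offline; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; a real robots.txt is usually much longer.
sample_rules = """
User-agent: *
Disallow: /account/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(user_agent, url) answers: may this user-agent request this URL?
print(parser.can_fetch("*", "https://example.com/account/settings"))  # → False
print(parser.can_fetch("*", "https://example.com/products"))          # → True
```

For a live site, you would call parser.set_url(...) with the site's robots.txt address and then parser.read() instead of parsing a string.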
[11:03] Now our practice site quotes.toscrape.com doesn't 
even have a robots.txt – and that's actually  
[11:03] intentional. It was built specifically 
to be scraped freely, no restrictions.
[11:03] Second – let's talk about how to be polite  
[11:06] when scraping multiple pages.
We can do this in our multi-page  
[11:08] example with time.sleep(1). You add it inside 
your loop, after each request. One second  
[11:08] between requests is polite – you're 
not hammering the server all at once.
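Here's a sketch of a polite multi-page loop. The fetch_pages helper and the timeout are my assumptions; the /page/N/ address pattern is how the practice site numbers its pages.

```python
import time

import requests

headers = {"User-Agent": "Mozilla/5.0"}
# The practice site serves its pages at /page/1/, /page/2/, and so on.
BASE_URL = "https://quotes.toscrape.com/page/{}/"


def fetch_pages(first, last, delay=1):
    """Fetch pages first..last, pausing `delay` seconds after each request."""
    pages = []
    for number in range(first, last + 1):
        response = requests.get(BASE_URL.format(number), headers=headers, timeout=10)
        pages.append(response.text)
        time.sleep(delay)  # be polite: wait before the next request
    return pages


# Usage (needs an internet connection):
# html_pages = fetch_pages(1, 3)
```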
[11:14] Third – always read the terms of service 
of any site before scraping it seriously.  
[11:20] Some sites explicitly say no scraping. Better 
to check than to get blocked or in trouble. 
[11:26] And that's a wrap! Let's do a quick 
recap of what we covered today. 
[11:32] We started with a problem – a plain requests 
call with no setup, messy unreadable HTML,  
[11:38] and no way to extract anything useful. 
Then we installed BeautifulSoup,  
[11:44] added a user-agent header to look like a real 
browser, and suddenly everything worked cleanly. 
[11:50] We scraped real quotes, authors, and tags from a 
live website, handled some common errors you'll  
[11:56] run into, and talked about scraping responsibly 
with robots.txt, delays, and terms of service. 
[12:04] That's a lot for one video – 
and you did it all from scratch. 
[12:09] From here you can scrape 
news headlines, job postings,  
[12:12] sports stats, product listings – anything 
publicly visible on the web. BeautifulSoup  
[12:19] is your starting point for all of it.
If this helped, hit subscribe and drop  
[12:24] a comment telling me what you're planning 
to scrape first. See you in the next one.