Web Scraping Basics
Download HTML pages with requests and extract structured data with BeautifulSoup — when there’s no API available.
"Web scraping is for when the data you need lives on a website but has no API. You download the HTML page, find the parts you want, and extract them — like reading a book and taking notes."
— Shurai

API vs Scraping — Which to Use?
| API | Scraping |
|---|---|
| Doesn't break when the site redesigns | Check robots.txt & ToS first |
| Usually faster to work with | Add delays between requests |
| Legal & permitted | May break if the HTML changes |
Before scraping, visit the site's robots.txt (e.g. https://example.com/robots.txt) to see what it allows scrapers to access. Never scrape at high speed: add time.sleep(1) between requests, and only collect publicly visible data.
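You can also check robots.txt rules programmatically with Python's standard-library `urllib.robotparser`. A minimal sketch, parsing a hypothetical robots.txt inline rather than fetching a real one over the network:

```python
from urllib import robotparser

# Hypothetical robots.txt content; real sites serve this at /robots.txt
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))      # False
print(rp.crawl_delay("*"))                                             # 1
```

In a real scraper you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` to fetch the live file, then consult `can_fetch()` before every request.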
The Scraping Workflow — 3 Steps
1. **Download:** `requests.get(url)` fetches the page source, exactly as a browser does.
2. **Parse:** feed the HTML to `BeautifulSoup`, which builds a navigable tree of HTML elements.
3. **Extract:** use `.find()`, `.find_all()`, or `.select()` to locate exactly the data you need.

Install both libraries first: `pip install requests beautifulsoup4`
Step-by-Step: Your First Scraper
```python
import requests
from bs4 import BeautifulSoup

# ── Step 1: Download ──────────────────────────────────────
url = "https://books.toscrape.com/"  # legal practice site
resp = requests.get(url, timeout=10)
resp.raise_for_status()

# ── Step 2: Parse ─────────────────────────────────────────
soup = BeautifulSoup(resp.text, "html.parser")

# ── Step 3: Extract ───────────────────────────────────────
# Each book is inside <article class="product_pod">
for book in soup.select("article.product_pod")[:5]:
    title = book.find("h3").find("a")["title"]
    price = book.find("p", class_="price_color").get_text(strip=True)
    print(f"{price:8} {title}")
```
Output:

```text
£51.77   A Light in the Attic
£53.74   Tipping the Velvet
£50.10   Soumission
£47.82   Sharp Objects
£54.23   Sapiens: A Brief History of Humankind
```
BeautifulSoup — Finding Elements
There are three main ways to find elements. Understanding when to use each saves a lot of time:
| Method | Returns | Best for |
|---|---|---|
| .find("tag") | First match | When you want one item (e.g. the page title) |
| .find_all("tag") | List of all matches | When you want every matching element |
| .select("css") | List of all matches | Complex selectors like div.card h2 a |
```python
# by tag
soup.find("h1")                          # first h1

# by tag + class (note: class_ with underscore — class is a Python keyword)
soup.find("p", class_="price_color")     # first p.price_color
soup.find_all("div", class_="card")      # ALL div.card

# by id
soup.find("section", id="results")       # <section id="results">

# CSS selector — most powerful
soup.select("article.product_pod h3 a")  # a inside h3 inside article.product_pod

# Getting text and attributes from a found element
el = soup.find("a")
print(el.get_text(strip=True))           # visible text, whitespace stripped
print(el["href"])                        # attribute — raises KeyError if missing
print(el.get("href", "#"))               # safe — returns "#" if no href
```
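One more pitfall worth knowing: `.find()` returns `None` when nothing matches, so chaining a call onto it raises `AttributeError`. A small sketch using a hypothetical HTML snippet:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document with no price element in it
soup = BeautifulSoup('<div class="card"><h2>Hello</h2></div>', "html.parser")

# .find() returns None when nothing matches, so calling .get_text()
# directly on the result would raise AttributeError.
missing = soup.find("p", class_="price_color")
print(missing)  # None

# Guard before using the result:
price = missing.get_text(strip=True) if missing else "n/a"
print(price)  # n/a
```

Guarding like this keeps a scraper running when one page out of hundreds is missing an element.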
Real Example — Collecting All Book Titles & Prices
```python
import requests, csv, time
from bs4 import BeautifulSoup

def scrape_page(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.find("h3").find("a")["title"],
            "price": article.find("p", class_="price_color").get_text(strip=True),
            "rating": article.find("p", class_="star-rating")["class"][1],
        })
    return books

# Scrape first 3 pages politely
all_books = []
for p in range(1, 4):
    url = f"https://books.toscrape.com/catalogue/page-{p}.html"
    all_books.extend(scrape_page(url))
    time.sleep(1)  # be polite — 1 second between requests
    print(f"Page {p} done ({len(all_books)} books so far)")

# Save to CSV
with open("books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
    writer.writeheader()
    writer.writerows(all_books)

print(f"Saved {len(all_books)} books to books.csv")
```
In Chrome or Firefox: right-click the element you want → "Inspect" → right-click the highlighted element in DevTools → "Copy" → "Copy selector". Paste it into .select(). This is faster than guessing.
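Be aware that "Copy selector" tends to produce long, position-based selectors that break on small layout changes; trimming them down to a class-based form is usually worth the minute. A sketch with a hypothetical HTML fragment and a hypothetical copied selector:

```python
from bs4 import BeautifulSoup

html = """
<div id="content"><div class="row">
  <article class="product_pod"><h3>
    <a title="A Light in the Attic">A Light...</a>
  </h3></article>
</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# What "Copy selector" typically gives you: long and position-based
copied = "#content > div > article:nth-child(1) > h3 > a"
# A hand-trimmed equivalent that survives layout changes better
trimmed = "article.product_pod h3 a"

print(soup.select_one(copied)["title"])   # A Light in the Attic
print(soup.select_one(trimmed)["title"])  # A Light in the Attic
```

Both selectors match the same element here, but only the trimmed one keeps working if the site wraps the article in another div.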
"Always add time.sleep(1) between page requests. It's the difference between a polite visitor and a DDoS attack. Even 1 second keeps you far under any server's rate limit."
— Shurai

🧠 Quiz — Q1
What are the three steps of web scraping in order?
🧠 Quiz — Q2
What is the difference between soup.find() and soup.find_all()?
🧠 Quiz — Q3
Why do we write class_= (with an underscore) in BeautifulSoup instead of class=?
🧠 Quiz — Q4
Why should you add time.sleep(1) between scraping requests?