🍎 Python Web & Databases Topic 90 / 100
⏳ 8 min read

Web Scraping Basics

Download HTML pages with requests and extract structured data with BeautifulSoup — when there’s no API available.

"Web scraping is for when the data you need lives on a website but has no API. You download the HTML page, find the parts you want, and extract them — like reading a book and taking notes."

— Shurai

API vs Scraping — Which to Use?

✓ Use an API (preferred)
- Structured, reliable data
- Doesn't break when the site redesigns
- Usually faster to work with
- Legal and permitted

⚡ Scrape when no API exists
- The data only lives as HTML
- Check robots.txt & ToS first
- Add delays between requests
- May break if the HTML changes
⚖️ Always check the site’s Terms of Service and robots.txt

Visit https://example.com/robots.txt to see what a site allows scrapers to access. Never scrape at high speed — add time.sleep(1) between requests. Only collect publicly visible data.
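The robots.txt check can even be automated with Python's standard-library urllib.robotparser. A minimal sketch, parsing an example robots.txt inline so no network request is needed (in real use you would call rp.set_url("https://example.com/robots.txt") and rp.read() instead):

```python
from urllib import robotparser

# An example robots.txt, parsed inline to show the logic without a network call.
RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))      # False
print(rp.crawl_delay("*"))                                             # 1
```

If can_fetch() returns False for a path, don't scrape it, and if the site declares a Crawl-delay, use at least that many seconds between requests.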

The Scraping Workflow — 3 Steps

1. Download the HTML: use requests.get(url) to fetch the page source, exactly as a browser does.
2. Parse it into a tree: pass the HTML to BeautifulSoup, which builds a navigable tree of HTML elements.
3. Find & extract elements: use .find(), .find_all(), or .select() to locate exactly the data you need.
terminal — install
pip install requests beautifulsoup4

Step-by-Step: Your First Scraper

python — complete working example
import requests
from bs4 import BeautifulSoup

# ── Step 1: Download ──────────────────────────────────────
url  = "https://books.toscrape.com/"   # legal practice site
resp = requests.get(url, timeout=10)
resp.raise_for_status()

# ── Step 2: Parse ─────────────────────────────────────────
soup = BeautifulSoup(resp.text, "html.parser")

# ── Step 3: Extract ───────────────────────────────────────
# Each book is inside <article class="product_pod">
for book in soup.select("article.product_pod")[:5]:
    title = book.find("h3").find("a")["title"]
    price = book.find("p", class_="price_color").get_text(strip=True)
    print(f"{price:8}  {title}")
output
  £51.77  A Light in the Attic
  £53.74  Tipping the Velvet
  £50.10  Soumission
  £47.82  Sharp Objects
  £54.23  Sapiens: A Brief History of Humankind

BeautifulSoup — Finding Elements

There are three main ways to find elements. Understanding when to use each saves a lot of time:

Method              Returns               Best for
.find("tag")        First match           One item (e.g. the page title)
.find_all("tag")    List of all matches   Every matching element
.select("css")      List of all matches   Complex selectors like div.card h2 a
python — the key selector patterns
# by tag
soup.find("h1")                        # first h1

# by tag + class  (note: class_ with underscore — class is a Python keyword)
soup.find("p", class_="price_color")  # first p.price_color
soup.find_all("div", class_="card")   # ALL div.card

# by id
soup.find("section", id="results")   # <section id="results">

# CSS selector — most powerful
soup.select("article.product_pod h3 a") # a inside h3 inside article.product_pod

# Getting text and attributes from a found element
el = soup.find("a")
print(el.get_text(strip=True))   # visible text, whitespace stripped
print(el["href"])                # attribute — raises KeyError if missing
print(el.get("href", "#"))       # safe — returns "#" if no href
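The same caution applies one level earlier: .find() returns None when nothing matches, so chaining .get_text() onto a miss raises AttributeError. A small sketch of the defensive pattern (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<div class='card'><h2>Hello</h2></div>"
soup = BeautifulSoup(html, "html.parser")

# .find() returns None on a miss — guard before calling methods on the result
h2    = soup.find("h2")
price = soup.find("p", class_="price_color")   # not present in this HTML

title = h2.get_text(strip=True) if h2 else "n/a"
cost  = price.get_text(strip=True) if price else "n/a"
print(title, cost)   # Hello n/a
```

This matters on real sites: one page with a missing element shouldn't crash a scraper that has already collected hundreds of rows.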

Real Example — Collecting All Book Titles & Prices

python — save results to CSV
import requests, csv, time
from bs4 import BeautifulSoup

def scrape_page(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()   # stop early on 404/500 instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.find("h3").find("a")["title"],
            "price": article.find("p", class_="price_color").get_text(strip=True),
            "rating": article.find("p", class_="star-rating")["class"][1],
        })
    return books

# Scrape first 3 pages politely
all_books = []
for p in range(1, 4):
    url = f"https://books.toscrape.com/catalogue/page-{p}.html"
    all_books.extend(scrape_page(url))
    time.sleep(1)   # be polite — 1 second between requests
    print(f"Page {p} done ({len(all_books)} books so far)")

# Save to CSV
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
    writer.writeheader()
    writer.writerows(all_books)
print(f"Saved {len(all_books)} books to books.csv")
🔎 How to find the right CSS selector

In Chrome or Firefox: right-click the element you want → "Inspect" → right-click the highlighted element in DevTools → "Copy" → "Copy selector". Paste it into .select(). This is faster than guessing.
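A pasted selector works with .select(), or with .select_one() when you only want the first match (it is the CSS counterpart of .find(): one element or None). A small sketch using a made-up HTML fragment shaped like the books site:

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="a.html">A Light...</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# .select_one() returns the first element matching a CSS selector, or None
link = soup.select_one("article.product_pod h3 a")
print(link["title"])   # A Light in the Attic
```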

"Always add time.sleep(1) between page requests. It's the difference between a polite visitor and a DDoS attack. Even 1 second keeps you far under any server's rate limit."

— Shurai
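One way to make sure the delay is never forgotten is to wrap it in a small helper. This is a hypothetical sketch, not part of the lesson's code: polite_get and the User-Agent string are invented names, but a requests.Session (which reuses the connection) and an identifying User-Agent header are standard practice.

```python
import time
import requests

# A Session reuses the TCP connection across requests, and a User-Agent
# header identifies your scraper to the site's operators.
session = requests.Session()
session.headers["User-Agent"] = "my-learning-scraper/0.1"

def polite_get(url, delay=1.0):
    """Fetch a URL, fail on HTTP errors, then pause before the next call."""
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    time.sleep(delay)    # wait AFTER each request, so every loop is throttled
    return resp
```

In the CSV example above, replacing requests.get(url, timeout=10) with polite_get(url) would make the explicit time.sleep(1) in the loop unnecessary.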

🧠 Quiz — Q1

What are the three steps of web scraping in order?

🧠 Quiz — Q2

What is the difference between soup.find() and soup.find_all()?

🧠 Quiz — Q3

Why do we write class_= (with an underscore) in BeautifulSoup instead of class=?

🧠 Quiz — Q4

Why should you add time.sleep(1) between scraping requests?