Beginner’s Guide: Your First Web Scraper with Python & ScrapingDuck

So, you want to extract data from a website? Maybe you need to monitor prices, gather research data, or archive old posts. Welcome to the world of web scraping.

If you’ve tried scraping before, you might have hit roadblocks: IP bans, complex JavaScript that hides data, or “Access Denied” errors. This is where a scraping API comes in handy.

In this post, we will build a basic scraper using Python and ScrapingDuck. We won’t parse out specific data just yet; instead, we will focus on the most important first step: reliably fetching the raw HTML from a real website.

Why ScrapingDuck?

When you write a basic Python script to visit a site, it often looks like a “bot” to the server. ScrapingDuck handles the hard work for you:

  • IP Rotation: It rotates IP addresses so you don’t get blocked.
  • JavaScript Rendering: It renders JavaScript (crucial for modern websites).
  • Free Tier: You can create a free account at app.scrapingduck.com and start scraping with a generous daily free limit.
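To see one reason a plain script stands out, you can inspect the headers the requests library attaches by default. This short sketch never sends a request; it just prints the default User-Agent, which openly announces the client as python-requests — a tell-tale sign of automation to many servers:

```python
import requests

# A plain requests session announces itself via its default headers.
session = requests.Session()

# Many servers flag this User-Agent as an automated client.
print(session.headers["User-Agent"])  # e.g. python-requests/2.x
```

A scraping API sends realistic browser headers (and rotates IPs) on your behalf, so you don't have to manage any of this yourself.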

The Target Website

For this tutorial, we will scrape Books to Scrape.

Why this site? It is a “sandbox” specifically designed for developers to practice web scraping. It mimics a real e-commerce store with products, prices, and stock information, but without the risk of getting banned.

Prerequisites

You need Python installed on your computer. We will use the requests library, which is the industry standard for making HTTP requests in Python.

If you don’t have it installed, open your terminal and run:

pip install requests

Step 1: Get Your API Key

  1. Go to app.scrapingduck.com.
  2. Sign up for a free account.
  3. Copy your personal API Key from the dashboard.

Step 2: The Code

We will write a simple script that asks ScrapingDuck to visit books.toscrape.com and send us back the HTML code.

Create a file named scraper.py and paste the following code:

import requests

# 1. Configuration
# Replace 'YOUR_API_KEY_HERE' with the key you copied from the dashboard
API_KEY = "YOUR_API_KEY_HERE"
TARGET_URL = "http://books.toscrape.com/"

# 2. Define the ScrapingDuck Endpoint
# We are using the /v1/scrape/source endpoint which returns raw HTML
endpoint = "https://api.scrapingduck.com/v1/scrape/source"

# 3. Set up the parameters
# We pass the target URL and our API key.
# We disable JavaScript here because 'Books to Scrape' is a static site.
# This makes the request faster. For dynamic sites (like single-page apps), set this to "false".
params = {
    "url": TARGET_URL,
    "apiKey": API_KEY,
    "disableJavaScript": "true" 
}

# 4. Make the Request
try:
    print(f"Scraping {TARGET_URL}...")
    response = requests.get(endpoint, params=params, timeout=60)  # timeout so the script never hangs forever

    # 5. Check if the request was successful
    if response.status_code == 200:
        print("Success! Data retrieved.")
        
        # Print the first 500 characters of the HTML to verify
        print("\n--- Start of HTML ---")
        print(response.text[:500]) 
        print("--- End of Snippet ---")
        
    else:
        print(f"Error: Failed to retrieve data. Status Code: {response.status_code}")
        print(f"Details: {response.text}")

except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
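Under the hood, requests percent-encodes the params dictionary into the query string of the endpoint URL. You can reproduce the final URL it builds with the standard library (API key placeholder left as-is):

```python
from urllib.parse import urlencode

params = {
    "url": "http://books.toscrape.com/",
    "apiKey": "YOUR_API_KEY_HERE",
    "disableJavaScript": "true",
}

# requests does this encoding for you when you pass params=...
full_url = "https://api.scrapingduck.com/v1/scrape/source?" + urlencode(params)
print(full_url)
```

Note how the target URL itself is escaped (`http%3A%2F%2F...`) so it can safely travel inside another URL.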

Understanding the Code

Here is what is happening behind the scenes:

  • The Endpoint: We send our request to https://api.scrapingduck.com/v1/scrape/source. This tells ScrapingDuck, “Go get the source code for me.”
  • url: The actual website address we want to visit.
  • apiKey: Your personal key that proves you are allowed to use the service.
  • disableJavaScript: We set this to "true" because Books to Scrape is a simple static site, which saves time. If you were scraping a complex site (like a Single Page Application), you would set this to "false".
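For reference, here is what the same parameters might look like for a JavaScript-heavy site. The target URL below is a placeholder, and the parameter name disableJavaScript is simply carried over from the script above:

```python
# Hypothetical parameters for a JavaScript-heavy site: same endpoint,
# but JavaScript rendering left enabled (slower, more complete HTML).
params_dynamic = {
    "url": "https://example.com/",  # placeholder, not a real target
    "apiKey": "YOUR_API_KEY_HERE",
    "disableJavaScript": "false",
}
```

Everything else in the script stays the same; only this dictionary changes.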

Step 3: Run It

Open your terminal or command prompt and run:

python scraper.py

What You Should See

If everything works, you will see “Success!” followed by a block of HTML code that looks like this:

<!DOCTYPE html>
<html lang="en-us" class="no-js">
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>
...

Congratulations! You have just successfully scraped a website using an enterprise-grade proxy and rendering engine, all with roughly 20 lines of Python.
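One practical tip before moving on: rather than only printing a snippet, you can save the full HTML to disk so the parsing step later doesn't need to re-scrape the page. In this sketch, html_text stands in for the response.text value from scraper.py:

```python
# `html_text` stands in for response.text from scraper.py.
html_text = "<!DOCTYPE html><html><head><title>All products</title></head></html>"

# Save the page so the next (parsing) step can reuse it without re-scraping.
with open("books_home.html", "w", encoding="utf-8") as f:
    f.write(html_text)

# Read it back to confirm the save worked.
with open("books_home.html", encoding="utf-8") as f:
    print(len(f.read()), "characters saved")
```

In scraper.py you would do this inside the `if response.status_code == 200:` branch, writing response.text instead of the placeholder string.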

Summary

You now have the raw data (HTML) from a website on your local machine. You didn’t have to worry about headers, user-agents, or IP rotation: ScrapingDuck handled the infrastructure while you wrote standard Python code.

What’s next?
Now that you have this giant block of HTML text, you need to make sense of it. You need to parse it to find specific book titles and prices. In the next post, we will introduce BeautifulSoup, a Python library that helps you search through this HTML to extract exactly the data you need.
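If you're impatient, Python's standard library can already pull simple values out of HTML. Here is a minimal sketch using the built-in html.parser module (not BeautifulSoup, which we'll cover properly next time) to extract the page title from a snippet like the one shown earlier:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = "<html><head><title>All products | Books to Scrape - Sandbox</title></head></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title.strip())  # All products | Books to Scrape - Sandbox
```

BeautifulSoup makes this kind of extraction far more convenient (one-liners instead of a parser class), which is exactly why it's the subject of the next post.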
