Respectful Scraping Practices

robots.txt compliance, user agent headers, request delays, and caching.

Why Respectful Scraping Matters

Web scraping can strain servers, violate terms of service, and get your agent's IP banned. Responsible scrapers respect rate limits, identify themselves, and honor the rules websites publish.

This lesson covers the tools and techniques for scraping ethically and sustainably.
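As a rough illustration of two of these habits, identifying yourself and spacing out requests, here is a minimal sketch using only the standard library. The Throttle class, the build_request helper, the two-second interval in the usage note, and the USER_AGENT string (including its contact URL) are all hypothetical names and values chosen for this example, not part of any library.

```python
import time
import urllib.request

# Hypothetical identifier; a real bot should point to a page describing it.
USER_AGENT = 'MyAgent/1.0 (+https://example.com/bot-info)'


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep if the last request was too recent; return seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


def build_request(url):
    """Build (but do not send) a request that carries our User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
```

A caller might create Throttle(2.0) once, then call wait() before passing each build_request(url) result to urllib.request.urlopen. The right interval depends on the site, so treat the number as a placeholder.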

Checking robots.txt

A website can publish a robots.txt file at its root that specifies which paths automated agents may or may not access. Python's standard library includes the urllib.robotparser module for parsing this file.

Always check robots.txt before scraping a site.

import urllib.robotparser

# Download and parse the site's robots.txt (read() makes a network request)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Check whether our bot may fetch a specific path
can_fetch = rp.can_fetch('MyAgent/1.0', 'https://example.com/public-data')
print(f'Can fetch: {can_fetch}')  # True or False

# Check a path that sites commonly disallow for all agents ('*')
admin_allowed = rp.can_fetch('*', 'https://example.com/private/admin')
print(f'Admin allowed: {admin_allowed}')  # Often False
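
To see how the rules are evaluated without touching the network, the parser can also be fed robots.txt text directly through its parse() method. The sketch below assumes a made-up robots.txt (the ROBOTS_TXT string and the build_parser helper are hypothetical) and also reads the optional Crawl-delay directive, which urllib.robotparser exposes through crawl_delay().

```python
import urllib.robotparser

# Hypothetical robots.txt content, used here instead of a live download.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from a string of robots.txt rules."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)

# The wildcard rules apply to any agent that has no section of its own.
allowed_public = rp.can_fetch('MyAgent/1.0', 'https://example.com/public-data')
allowed_admin = rp.can_fetch('MyAgent/1.0', 'https://example.com/private/admin')

# crawl_delay() returns None when the file sets no delay for this agent.
delay = rp.crawl_delay('MyAgent/1.0')

print(allowed_public, allowed_admin, delay)
```

With this sample file, the public path is allowed, the path under /private/ is not, and the requested delay is 5 seconds. A scraper could use that value as the minimum pause between requests to the site.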
