Understanding Robots.txt

Interpret and respect the `robots.txt` file to understand a website's scraping policies and restrictions.

What is Robots.txt?

When building a web scraper or bot, it's crucial to be a good internet citizen. The robots.txt file is a key part of this.

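It's a plain text file that websites use to communicate with web crawlers and other bots. It tells them which parts of the site they are allowed to access and which parts they should avoid. The rules are advisory: the file does not technically block access, so it is up to your bot to honor them.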
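
To make this concrete, here is a minimal sketch that checks two URLs against a small rule set using Python's standard `urllib.robotparser` module. The rules, the bot name `ExampleBot`, and the URLs are hypothetical and were made up for illustration; a real site's file will look different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, invented for illustration (not taken from a real site).
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


# A public page is permitted; anything under /private/ is not.
print(is_allowed(SAMPLE_RULES, "ExampleBot", "https://www.example.com/blog/post"))
print(is_allowed(SAMPLE_RULES, "ExampleBot", "https://www.example.com/private/data"))
```

Parsing the rules from a string keeps the sketch runnable offline. For a live site, the same parser can download the file itself through its `set_url()` and `read()` methods.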

Finding Robots.txt

A website that publishes a robots.txt file places it in a standard location: the root of its host.

This means you can look for it by appending /robots.txt to the site's scheme and host name. Each host, including each subdomain, has its own file, and a site without one typically responds with a 404 error. For example:

https://www.example.com/robots.txt
https://www.google.com/robots.txt

You can simply type this into your browser to view a site's rules.
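
A scraper usually starts from a page URL rather than the site root, so it needs to derive the robots.txt location itself. The sketch below shows one way to do that with the standard `urllib.parse` module; the helper name `robots_url` and the sample page URL are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/blog/post?id=7"))
# prints: https://www.example.com/robots.txt
```

Because the host is kept exactly as given, a page on a different subdomain resolves to that subdomain's own robots.txt rather than the main site's.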
