A survey of robots.txt - part one
Sep 19, 2017
After reading Collin Morris’s analysis of favicons of the top 1 million sites on the web, I thought it would be interesting to do the same for other common parts of websites that often get overlooked.
The robots.txt file is a plain text file found on most websites which tells web crawlers and spiders how to scan the site. For example, here’s an excerpt from robots.txt for google.com:
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
...
The above excerpt tells all web crawlers not to scan the /search path, but allows them to scan the /search/about and /search/howsearchworks paths. There are a few more supported keywords, but these are the most common. Following these instructions is not required, but it is considered good internet etiquette. If you want to read more about the standard, Wikipedia has a great page on it.
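To make the excerpt's behaviour concrete, here is a minimal sketch of how a crawler might evaluate those rules. It assumes longest-match precedence (the rule described in RFC 9309, where the most specific matching path wins and Allow wins a tie). The names parse_rules and is_allowed are hypothetical, and the sketch ignores User-agent grouping and wildcards.

```python
EXCERPT = """\
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
"""


def parse_rules(robots_txt):
    """Collect (directive, path) pairs, ignoring User-agent grouping."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        # An empty path (e.g. "Disallow:") restricts nothing, so skip it.
        if key.lower() in ("allow", "disallow") and value:
            rules.append((key.lower(), value))
    return rules


def is_allowed(rules, path):
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_directive, best_prefix = "allow", ""
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best_prefix)
        tie = len(prefix) == len(best_prefix) and directive == "allow"
        if longer or tie:
            best_directive, best_prefix = directive, prefix
    return best_directive == "allow"


rules = parse_rules(EXCERPT)
print(is_allowed(rules, "/search"))        # blocked by Disallow: /search
print(is_allowed(rules, "/search/about"))  # the longer Allow rule takes over
```

Note that not every parser uses this precedence; some apply the first matching rule in file order, which would give a different answer for /search/about.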
To analyse robots.txt files, I first need to crawl the web for them (ironic, I know).
Scraping
I wrote a scraper using scrapy to make a request for robots.txt for each of the domains in Alexa’s top 1 million websites. If the response code was 200, the Content-Type header contained text/plain, and the response body was not empty, I stored the response body in a file with the same name as the domain.
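The three storage conditions can be sketched as a single predicate. This is an illustration only, not the scraper's actual code: should_store is a hypothetical helper, and it takes plain Python values rather than a scrapy response object.

```python
def should_store(status, content_type, body):
    """Return True when a response looks like a usable robots.txt.

    status:       HTTP status code (int)
    content_type: value of the Content-Type header, or None if absent
    body:         response body (bytes or str)
    """
    is_ok = status == 200
    # Headers often carry a charset suffix, e.g. "text/plain; charset=utf-8",
    # so check for containment rather than equality.
    is_plain_text = "text/plain" in (content_type or "").lower()
    has_body = len(body) > 0
    return is_ok and is_plain_text and has_body


print(should_store(200, "text/plain; charset=utf-8", b"User-agent: *"))
print(should_store(200, "text/html", b"<html>Not found</html>"))
```

The Content-Type check matters because many sites answer a request for a missing robots.txt with a 200 status and an HTML error page.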
One complication I encountered was that not all domains respond on the same protocol or subdomain. For example, some websites respond on http://{domain_name} while others require http://www.{domain_name}. If a website doesn’t automatically redirect you to the correct protocol or subdomain, the only way to find the correct one is to try them all! So I wrote a small class, extending scrapy’s RetryMiddleware, to do this:
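from random import choice

from scrapy.downloadermiddlewares.retry import RetryMiddleware

COMMON_PREFIXES = {'http://', 'https://', 'http://www.', 'https://www.'}


class PrefixRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        # Prefixes already attempted for this domain, tracked in request.meta.
        prefixes_tried = request.meta['prefixes']
        if COMMON_PREFIXES == prefixes_tried:
            # Every prefix has failed. process_exception must return None,
            # a Response or a Request, so return None to give up.
            return None

        # Pick an untried prefix at random and retry the request with it.
        # update_request is not shown in this excerpt.
        new_prefix = choice(tuple(COMMON_PREFIXES - prefixes_tried))
        request = self.update_request(request, new_prefix)

        return self._retry(request, exception, spider)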
The rest of the scraper itself is quite simple, but you can read the full code on GitHub.
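The "try them all" idea does not depend on scrapy. As a rough sketch, the same fallback can be written as a plain loop. Here first_working_url and its fetch argument are hypothetical, and the prefixes are tried in a fixed order rather than at random as in the middleware.

```python
COMMON_PREFIXES = ('http://', 'https://', 'http://www.', 'https://www.')


def first_working_url(domain, fetch):
    """Try each prefix in turn and return (url, result) for the first success.

    fetch is any callable that takes a URL and either returns a result or
    raises OSError (which covers ConnectionError and TimeoutError).
    Returns None when every prefix fails.
    """
    for prefix in COMMON_PREFIXES:
        url = f"{prefix}{domain}/robots.txt"
        try:
            return url, fetch(url)
        except OSError:
            continue  # this prefix did not respond, move on to the next
    return None
```

In a real crawler fetch would wrap an HTTP client; passing it in as an argument keeps the fallback logic easy to test with a stub.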
Results
Scraping the full Alexa top 1 million websites list took around 24 hours. Once it was finished, I had just under 700k robots.txt files:
$ find -type f | wc -l
677686
totalling 493MB:
$ du -sh
493M .
The smallest robots.txt was 1 byte [1], but the largest was over 5MB.
$ find -type f -exec du -Sh {} + | sort -rh | head -n 1
5.6M ./haberborsa.com.tr

$ find -not -empty -type f -exec du -b {} + | sort -h | head -n 1
1 ./0434.cc
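For readers who would rather compute these figures in Python than with find and du, here is a minimal sketch using pathlib. The function name summarise is hypothetical, and it measures file length in bytes, which can differ slightly from the disk usage that du -sh reports.

```python
from pathlib import Path


def summarise(directory):
    """Count files, total their sizes, and find the size extremes."""
    sizes = {p: p.stat().st_size for p in Path(directory).rglob("*") if p.is_file()}
    # Mirror `find -not -empty`: ignore zero-byte files when looking for the smallest.
    non_empty = {p: size for p, size in sizes.items() if size > 0}
    return {
        "files": len(sizes),
        "total_bytes": sum(sizes.values()),
        "smallest_non_empty": min(non_empty, key=non_empty.get, default=None),
        "largest": max(sizes, key=sizes.get, default=None),
    }
```

Calling summarise(".") from the directory of downloaded files returns a dictionary with the file count, the total size in bytes, and the paths of the smallest non-empty and largest files.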
The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub.
What’s next?
In the next part of this blog series I’m going to analyse all the robots.txt files to see if I can find anything interesting. In particular, I’d like to know why exactly someone needs a robots.txt file over 5MB in size, which web crawler is listed most often (either allowed or disallowed), and whether any sites are practising security by obscurity by trying to keep links out of search engines!
[1] I needed to include -not -empty when looking for the smallest file, as there were errors when decoding the response body for some domains. I’ve included the empty files in the dataset for posterity, but I will exclude them from further analysis.