A survey of robots.txt - part two

In part one of this article, I collected the robots.txt file from each of the top 1 million sites on the web. In this article I’m going to do some analysis, and see what interesting things we can find in all the files I’ve collected.

First we’ll start with some setup.

%matplotlib inline

import pandas as pd
import numpy as np
import glob
import os
import matplotlib

Next I’m going to load the content of each file into a pandas DataFrame, calculate the file size, and store that for later.

l = [filename.split('/')[1] for filename in glob.glob('robots-txt/*')]
df = pd.DataFrame(l, columns=['domain'])
df['content'] = df.apply(lambda x: open('robots-txt/' + x['domain']).read(), axis=1)
df['size'] = df.apply(lambda x: os.path.getsize('robots-txt/' + x['domain']), axis=1)
        domain                   content                                            size
612419  veapple.com              User-agent: *\nAllow: /\n\nSitemap: http://www...  260
622296  buscadortransportes.com  User-agent: *\nDisallow: /out/                     29
147795  dailynews360.com         User-agent: *\nAllow: /\n\nDisallow: /search/\...  248
72823   newfoundlandpower.com    User-agent: *\nDisallow: /Search.aspx\nDisallo...  528
601408  xfwed.com                #\n# robots.txt for www.xfwed.com\n# Version 3...  201

File sizes

Now that we’ve done the setup, let’s see what the spread of file sizes in robots.txt is.

fig = df.plot.hist(title='robots.txt file size', bins=20)


It looks like the majority of robots.txt files are well under 250KB in size. This is really no surprise, since robots.txt supports wildcard patterns (a de facto extension to the original standard), so even complex rulesets can be expressed in a few lines.
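To put a rough number on that impression, percentiles are more robust than a histogram whose bins are stretched by megabyte-sized outliers. A minimal sketch, using a small stand-in Series (hypothetical values) since the real df['size'] has a million entries:

```python
import pandas as pd

# Stand-in for df['size'] from the real data set (hypothetical values).
sizes = pd.Series([29, 201, 248, 260, 528, 1024, 250_000, 5_820_350])

# Fraction of files under 250KB, plus a few percentiles. On the real data
# these numbers back up (or refute) what the histogram suggests.
under_250kb = (sizes < 250 * 1024).mean()
print(f"{under_250kb:.0%} of files are under 250KB")
print(sizes.quantile([0.5, 0.9, 0.99]))
```

Run against the real df['size'], the 99th percentile makes the outliers discussed below easy to spot.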

Let’s take a look at the files larger than 1MB. I can think of three possibilities: they’re automatically maintained; they’re some other file masquerading as robots.txt; or the site is doing something seriously wrong.

large = df[df['size'] > 10 ** 6].sort_values(by='size', ascending=False)
import re

def count_directives(value, row):
    content = row['content']
    # Note: a plain substring search means that counting 'Allow' also
    # matches the 'allow' inside every 'Disallow'.
    return len(re.findall(value, content, re.IGNORECASE))

large['disallow'] = large.apply(lambda x: count_directives('Disallow', x), axis=1)
large['user-agent'] = large.apply(lambda x: count_directives('User-agent', x), axis=1)
large['comments'] = large.apply(lambda x: count_directives('#', x), axis=1)

# The directives below are non-standard

large['crawl-delay'] = large.apply(lambda x: count_directives('Crawl-delay', x), axis=1)
large['allow'] = large.apply(lambda x: count_directives('Allow', x), axis=1)
large['sitemap'] = large.apply(lambda x: count_directives('Sitemap', x), axis=1)
large['host'] = large.apply(lambda x: count_directives('Host', x), axis=1)

        domain                   content                                            size
632170  haberborsa.com.tr        User-agent: *\nAllow: /\n\nDisallow: /?ref=\nD...  5820350
23216   miradavetiye.com         Sitemap: https://www.miradavetiye.com/sitemap_...  5028384
282904  americanrvcompany.com    Sitemap: http://www.americanrvcompany.com/site...  4904266
446326  exibart.com              User-Agent: *\nAllow: /\nDisallow: /notizia.as...  3275088
55309   vibralia.com             # robots.txt automaticaly generated by PrestaS...  2835552
124850  oftalmolog30.ru          User-Agent: *\nHost: chuzmsch.ru\nSitemap: htt...  2831975
677400  bigclozet.com            User-agent: *\nDisallow: /item/\n\nUser-agent:...  2708717
621834  tranzilla.ru             Host: tranzilla.ru\nSitemap: http://tranzilla....  2133091
428735  autobaraholka.com        User-Agent: *\nDisallow: /registration/\nDisal...  1756983
628591  megasmokers.ru           User-agent: *\nDisallow: /*route=account/\nDis...  1633963
647336  valencia-cityguide.com   # If the Joomla site is installed within a fol...  1559086
663372  vetality.fr              # robots.txt automaticaly generated by PrestaS...  1536758
105735  golden-bee.ru            User-agent: Yandex\nDisallow: /*_openstat\nDis...  1139308
454311  dreamitalive.com         user-agent: google\ndisallow: /memberprofileda...  1116416
245895  gobankingrates.com       User-agent: *\nDisallow: /wp-admin/\nAllow: /w...  1018109

It looks like all of these sites are misusing Disallow and Allow. In fact, looking at the raw files, it appears that they list every article on the site under an individual Disallow directive. I can only guess that when an article is published, a corresponding line is appended to robots.txt.

Now let’s take a look at the smallest robots.txt files.

small = df[df['size'] > 0].sort_values(by='size', ascending=True)


There’s not really anything interesting here, so let’s take a look at some slightly larger files.

small = df[df['size'] > 10].sort_values(by='size', ascending=True)

        domain         content       size
329775  klue.kr        Disallow: /   11
390064  chneic.sh.cn   Disallow: /   11
355604  hpi-mdf.com    Disallow: /   11

Disallow: / tells all webcrawlers not to crawl anything on the site, and should (hopefully) keep it out of any search engine's index. Not all webcrawlers follow robots.txt, though.
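Python's standard library can demonstrate this rule. urllib.robotparser implements the robots.txt semantics, and a blanket Disallow: / blocks every path for every agent:

```python
from urllib.robotparser import RobotFileParser

# Parse the one-line ruleset seen above: nothing on the site may be crawled.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("Googlebot", "https://klue.kr/"))          # False
print(rp.can_fetch("SomeRandomBot", "https://klue.kr/page"))  # False
```

Of course, this only tells a *well-behaved* crawler what to do; nothing stops a bot from ignoring the file entirely.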

User agents

User agents can be listed in robots.txt to either Allow or Disallow certain paths. Let’s take a look at the most common webcrawlers.

from collections import Counter

def find_user_agents(content):
    return re.findall('User-agent:? (.*)', content)

user_agent_list = [find_user_agents(x) for x in df['content']]
user_agent_count = Counter(x.strip() for xs in user_agent_list for x in set(xs))
[('*', 587729),
('Mediapartners-Google', 36654),
('Yandex', 29065),
('Googlebot', 25932),
('MJ12bot', 22250),
('Googlebot-Image', 16680),
('Baiduspider', 13646),
('ia_archiver', 13592),
('Nutch', 11204),
('AhrefsBot', 11108)]

It’s no surprise that the top result is the wildcard (*). Google takes spots 2, 4, and 6 with its AdSense, search, and image webcrawlers respectively; it does seem a little strange to see the AdSense bot listed more often than the usual search crawler. Two of the other large search engines' bots also make the top 10: Yandex and Baidu. MJ12bot is a crawler I had not heard of before, but according to their site it belongs to a UK-based SEO company, and according to some of the results about it, it doesn’t always behave very well. ia_archiver belongs to the Internet Archive, and (I assume) crawls pages for the Wayback Machine. Finally there is Apache Nutch, an open-source webcrawler that can be run by anyone.
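As a quick illustration of how these per-agent listings behave (a toy ruleset, not one of the collected files): a crawler-specific group takes precedence over the * group, which is how a site can grant one bot access that everyone else is denied.

```python
from urllib.robotparser import RobotFileParser

# Toy ruleset: everyone is barred from /images/ except Googlebot-Image.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images/",
    "",
    "User-agent: Googlebot-Image",
    "Allow: /images/",
])

# MJ12bot falls through to the * group; Googlebot-Image gets its own group.
print(rp.can_fetch("MJ12bot", "https://example.com/images/a.png"))          # False
print(rp.can_fetch("Googlebot-Image", "https://example.com/images/a.png"))  # True
```

This is also why the Allow directive, although absent from the original standard, is so widely used: without it there is no way to carve an exception out of a broader Disallow.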

Security by obscurity

There are certain paths that you might not want a webcrawler to know about. For example, a .git directory, htpasswd files, or parts of a site that are still in testing, and aren’t meant to be found by anyone on Google. Let’s see if there’s anything interesting.

sec_obs = [r'\.git', 'alpha', 'beta', 'secret', 'htpasswd', r'install\.php', r'setup\.php']
sec_obs_regex = re.compile('|'.join(sec_obs))

def find_security_by_obscurity(content):
    return sec_obs_regex.findall(content)

sec_obs_list = [find_security_by_obscurity(x) for x in df['content']]
sec_obs_count = Counter(x.strip() for xs in sec_obs_list for x in set(xs))
[('install.php', 28925),
('beta', 2834),
('secret', 753),
('alpha', 597),
('.git', 436),
('setup.php', 73),
('htpasswd', 45)]

Just because a file or directory is mentioned in robots.txt doesn’t mean that it can actually be accessed. However, if even 1% of WordPress installs leave their install.php open to the world, that’s still a lot of vulnerable sites, and any attacker could get the keys to the kingdom very easily. The same goes for a .git directory: even if it is read-only, people accidentally commit secrets to their git repositories all the time.
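Out of curiosity, the matching Disallow paths can be pulled out for a closer look. This is a hypothetical helper, not part of the analysis above (sensitive_paths and its regex are my own names), and actually requesting these paths should only ever be done against sites you own:

```python
import re

# Patterns from the analysis above that hint at sensitive files.
SENSITIVE = re.compile(r'\.git|htpasswd|install\.php|setup\.php|secret')

def sensitive_paths(content):
    """Return the Disallow'd paths in a robots.txt that look sensitive."""
    paths = re.findall(r'Disallow:\s*(\S+)', content, re.IGNORECASE)
    return [p for p in paths if SENSITIVE.search(p)]

# Hypothetical robots.txt content for illustration.
example = ("User-agent: *\n"
           "Disallow: /wp-admin/\n"
           "Disallow: /wp-admin/install.php\n"
           "Disallow: /.git/\n")
print(sensitive_paths(example))  # ['/wp-admin/install.php', '/.git/']
```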


Conclusion

robots.txt is a fairly innocuous part of the web. It’s been interesting to see how popular websites (ab)use it, and which webcrawlers are naughty or nice. Most of all, this has been a great exercise for me in collecting data and analysing it using pandas and Jupyter.

The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub.
