A survey of robots.txt - part one

After reading Collin Morris’s analysis of favicons of the top 1 million sites on the web, I thought it would be interesting to do the same for other common parts of websites that often get overlooked.

The robots.txt file is a plain text file found at the root of most websites, which communicates to web crawlers and spiders how they should scan the site. For example, here’s an excerpt from the robots.txt for google.com:

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
...

The above excerpt tells all web crawlers not to scan the /search path, but allows them to scan the /search/about and /search/howsearchworks paths. There are a few more supported keywords, but these are the most common. Following these instructions is not required, but it is considered good internet etiquette. If you want to read more about the standard, Wikipedia has a great page here.
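To make the Allow/Disallow semantics concrete, here’s a quick sketch using Python’s built-in urllib.robotparser to evaluate rules like the excerpt above (the crawler name is just a placeholder). One caveat: Python’s parser applies rules in file order, so an Allow exception like the one above only takes effect if it is listed before the Disallow it overrides, whereas Google’s own crawler uses the most specific (longest) matching rule.

# A minimal sketch of checking paths against robots.txt rules with the
# standard library; 'MyCrawler' is an arbitrary user agent name.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /search
Allow: /search/about
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch('MyCrawler', 'https://google.com/search'))  # False: disallowed
print(parser.can_fetch('MyCrawler', 'https://google.com/maps'))    # True: no rule matches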

In order to do an analysis of robots.txt, first I need to crawl the web for them – ironic, I know.

Scraping

I wrote a scraper using scrapy to request robots.txt from each of the domains in Alexa’s top 1 million websites. If the response code was 200, the Content-Type header contained text/plain, and the response body was not empty, I stored the response body in a file named after the domain.

One complication I encountered was that not all domains respond on the same protocol or subdomain. For example, some websites respond on http://{domain_name} while others require http://www.{domain_name}. If a website doesn’t automatically redirect you to the correct protocol or subdomain, the only way to find the correct one is to try them all! So I wrote a small class, extending scrapy’s RetryMiddleware, to do this:

from random import choice

from scrapy.downloadermiddlewares.retry import RetryMiddleware

COMMON_PREFIXES = {'http://', 'https://', 'http://www.', 'https://www.'}

class PrefixRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        # Prefixes already attempted for this request are tracked in its meta.
        prefixes_tried = request.meta.get('prefixes', set())
        if COMMON_PREFIXES == prefixes_tried:
            # Every prefix has been tried for this domain, so give up.
            return exception

        # Pick an untried prefix at random and retry the request with it.
        # update_request (omitted here) is expected to swap in the new prefix
        # and track it in request.meta['prefixes'].
        new_prefix = choice(tuple(COMMON_PREFIXES - prefixes_tried))
        request = self.update_request(request, new_prefix)

        return self._retry(request, exception, spider)

The rest of the scraper itself is quite simple, but you can read the full code on GitHub.
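For a rough idea of what that looks like, here’s a minimal sketch of the spider side. This is not the actual code - the file names, middleware module path and output directory are assumptions for illustration - but it shows the general shape: read the Alexa CSV, request each domain’s robots.txt, and save the valid responses.

# Illustrative sketch only - 'top-1m.csv', the middleware module path and the
# 'robots/' output directory are assumptions, not the real project layout.
import csv

import scrapy


class RobotsTxtSpider(scrapy.Spider):
    name = 'robotstxt'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Hypothetical import path for the middleware shown above.
            'robots_survey.middlewares.PrefixRetryMiddleware': 550,
        },
    }

    def start_requests(self):
        # Alexa's list is a CSV of rank,domain pairs.
        with open('top-1m.csv') as f:
            for _, domain in csv.reader(f):
                yield scrapy.Request(
                    'http://{}/robots.txt'.format(domain),
                    callback=self.parse,
                    meta={'prefixes': {'http://'}, 'domain': domain},
                )

    def parse(self, response):
        content_type = response.headers.get('Content-Type', b'')
        # Only keep 200 responses that are plain text and non-empty.
        if response.status == 200 and b'text/plain' in content_type and response.body:
            with open('robots/{}'.format(response.meta['domain']), 'wb') as f:
                f.write(response.body)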

Results

Scraping the full Alexa top 1 million websites list took around 24 hours. Once it was finished, I had just under 700k robots.txt files

$ find -type f | wc -l
677686

totalling 493MB

$ du -sh
493M	.

The smallest robots.txt was 1 byte1, but the largest was over 5MB.

$ find -type f -exec du -Sh {} + | sort -rh | head -n 1
5.6M	./haberborsa.com.tr

$ find -not -empty -type f -exec du -b {} + | sort -h | head -n 1
1	./0434.cc

The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub.

What’s next?

In the next part of this blog series I’m going to analyse all the robots.txt files to see if I can find anything interesting. In particular I’d like to know why exactly someone needs a robots.txt file over 5MB in size, which web crawler is most commonly listed (either allowed or disallowed), and whether any sites are practising security by obscurity by trying to keep links out of search engines!

  1. I needed to include -not -empty when looking for the smallest file, as there were errors when decoding the response body for some domains. I’ve included the empty files in the dataset for posterity, but I will exclude them from further analysis. 

Setting up nginx reverse proxy with Let's Encrypt on unRAID

Late last year I set about building a new NAS to replace my aging HP ProLiant MicroServer N36L (though that’s a story for a different post). I decided to go with unRAID as my OS over the FreeNAS setup I’d been running previously, mostly due to the simpler configuration, ease of expanding an array, and support for Docker and KVM.

Docker support makes it a lot easier to run some of the web apps that I rely on like Plex, Sonarr, CouchPotato and more, but accessing them securely outside my network is a different story. On FreeNAS I ran an nginx reverse proxy in a BSD jail, secured using basic auth, and SSL certificates from StartSSL. Thankfully there is already a Docker image, nginx-proxy by jwilder, which automatically configures nginx for you. As for SSL, Let’s Encrypt went into public beta in December, and recently issued their millionth certificate. There’s also a Docker image which will automatically manage your certificates, and configure nginx for you - letsencrypt-nginx-proxy-companion by jrcs.

Preparation

Firstly, you need to set up and configure Docker. There is a fantastic guide on how to configure Docker here on the Lime Technology website. Next, you need to install the Community Applications plugin. This allows you to install Docker containers directly from the Docker Hub.

In order to access everything from outside your LAN you’ll need to forward ports 80 and 443 to ports 8008 and 443, respectively, on your unRAID host. In addition, you’ll need to use some sort of Dynamic DNS service, though in my case I bought a domain name and use CloudFlare to handle my DNS.

nginx

From the Apps page on the unRAID web interface, search for nginx-proxy and click ‘get more results from Docker Hub’. Click ‘add’ under the listing for nginx-proxy by jwilder. Set the following container settings, changing your ‘host path’ to wherever you store Docker configuration files on your unRAID host

nginx-proxy settings

To add basic auth to any of the sites, you’ll need to create a file named after the site’s VIRTUAL_HOST and make it available to nginx in /etc/nginx/htpasswd. For example, I added a file in /mnt/user/docker/nginx/htpasswd/. You can create htpasswd files using apache2-utils, or there are sites available which can create them.

Let’s Encrypt

From the Apps page again, search for letsencrypt-nginx-proxy-companion, click ‘get more results from Docker Hub’, and then click ‘add’ under the listing for letsencrypt-nginx-proxy-companion by jrcs. Enter the following container settings, again changing your ‘host path’ to wherever you store Docker configuration files on your unRAID host

lets encrypt settings

Putting it all together

In order to configure nginx, you’ll need to add four environment variables to the Docker containers you wish to put behind the reverse proxy. They are VIRTUAL_HOST, VIRTUAL_PORT, LETSENCRYPT_HOST, and LETSENCRYPT_EMAIL. VIRTUAL_HOST and LETSENCRYPT_HOST most likely need to be the same, and will be something like subdomain.yourdomain.com. VIRTUAL_PORT should be the port your Docker container exposes. For example, Sonarr uses port 8989 by default. LETSENCRYPT_EMAIL should be a valid email address that Let’s Encrypt can use to email you about certificate expiries, etc.

Once nginx-proxy, letsencrypt-nginx-proxy-companion, and all your Docker containers are configured you should be able to access them all over SSL, with basic auth, from outside your LAN. You don’t even have to worry about certificate renewals as it’s all handled for you.

Do you really want "bank grade" security in your SSL? Danish edition

I recently saw an article on /r/programming called Do you really want “bank grade” security in your SSL? Here’s how Aussie banks fare. The author used the Qualys SSL Labs test to determine how good Aussie banks’ SSL implementations really are. I thought the article was great, and that it gave good, actionable feedback. At the time of writing, two of the banks listed have already improved their SSL scores.

It got me thinking: how well (or badly) do banks in Denmark fare? We put our trust - and our money - in these banks, but do they really deserve it? The banks I’ll be testing come from the list of systemically important banks, more commonly known as “Too big to fail.” The list consists of:

  • Danske Bank
  • Nykredit
  • Nordea
  • Jyske Bank
  • Sydbank
  • DLR Kredit

The Qualys SSL Labs test gives an overall grade, from A to F, but also points out any pressing issues with the SSL configuration. To score well a site must:

  • Disable SSL 3 protocol support as it is obsolete and insecure
  • Support TLS 1.2 as it is the current best protocol
  • Have no SHA1 certificates (excluding the root certificate) in the chain as modern browsers will show the site as insecure
  • Disable the RC4 cipher as it is a weak cipher
  • Support forward secrecy to prevent a compromise of a secure key affecting the confidentiality of past conversations
  • Mitigate POODLE attacks, to prevent attackers downgrading secure connections to insecure connections
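Some of these criteria are easy to spot-check yourself. As a rough illustration (this is not how SSL Labs works, and the hostname below is a placeholder), here’s a small Python sketch that checks whether a server will negotiate TLS 1.2:

# Rough spot-check for one criterion: will the server negotiate TLS 1.2?
# The hostname is a placeholder; SSL Labs tests far more than this.
import socket
import ssl

def supports_tls12(host, port=443):
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # We only care about protocol support here, so skip certificate checks.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    # Pin the handshake to TLS 1.2 so it fails if the server can't speak it.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_2
    try:
        with socket.create_connection((host, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() == 'TLSv1.2'
    except (ssl.SSLError, OSError):
        return False

print(supports_tls12('example-bank.dk'))  # placeholder hostname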

To make this as realistic as possible, I’ll be testing the login pages.

So I’ve got my list of sites, and my test all sorted. Let’s dive right in!

Bank        | Grade | SSL 3 | TLS 1.2 | SHA1  | RC4  | Forward Secrecy | POODLE
Danske Bank | A-    | Pass  | Pass    | Pass  | Pass | Fail            | Pass
Nordea      | B     | Pass  | Fail    | Fail1 | Fail | Fail            | Pass
DLR Kredit  | C     | Fail  | Fail    | Fail  | Fail | Fail            | Fail2
Jyske Bank  | F     | Pass  | Fail    | Pass  | Fail | Fail            | Fail
Sydbank     | F     | Pass  | Fail    | Pass  | Fail | Fail            | Fail
Nykredit    | F     | Fail  | Fail    | Fail3 | Fail | Fail            | Fail

1. Nordea’s SSL certificate is SHA256, but they use an SHA1 intermediate certificate.

2. DLR Kredit receives a C overall because only its SSL 3 implementation is vulnerable to POODLE, not its TLS implementation.

3. Nykredit would pass, but they provide an unnecessary certificate which is signed using SHA1.

Only one bank, Danske Bank, managed to get an A (though it is an A-). This is mostly due to the lack of forward secrecy support. If they fix this they can increase their rating to an A. They are also the only bank to enable TLS 1.2 support, and disable the RC4 cipher.

Nordea comes in second, managing a B, with some very odd results. They only support the TLS 1.0 protocol, and the server’s preferred cipher suite list starts with RC4 (which is insecure). The server also sent a certificate chain with an MD2 certificate in it!

DLR Kredit achieves a C, despite the fact they are vulnerable to POODLE, because only their SSL 3 implementation is vulnerable. Their certificate is signed using SHA1, but stranger still, their server didn’t send any intermediate certificates. DLR Kredit is also the only bank which is vulnerable to the CRIME attack, which allows an attacker to read secure cookies sent by the server. Disabling TLS compression is all that is required to mitigate this.

Jyske Bank and Sydbank appear to use the same server configuration, as their results are equally bad. They have both disabled SSL 3, and have a complete certificate chain signed using SHA256, but fail on all other tests. In addition, both banks are TLS version intolerant, meaning their websites may stop working when new TLS versions come out.

Finally we have Nykredit, who fares worst of all. Their server sent unnecessary certificates, signed using SHA1, which causes them to fail this test. They only support TLS 1.0 and SSL 3, and their server’s preferred cipher is RC4. Most worrying is the lack of support for secure renegotiation. This vulnerability is nearly 5 years old, and allows a man-in-the-middle to inject arbitrary content into an encrypted session.

Overall it’s not looking too good. A lot of Denmark’s biggest banks are not as secure as they would have you believe, and many are vulnerable to a number of different avenues of attack - the most worrying being POODLE. These results are similar to those in Troy Hunt’s original article, and go to show that just because something is “bank grade” doesn’t necessarily mean it’s actually good.

Continuously deploy Jekyll to Azure Web Apps

I’ve been thinking about writing a blog for a while now, but there are just so many blogging platforms out there to choose from. I finally settled on Jekyll as it’s really lightweight (compared to platforms like WordPress), it has an active development community, and you can write all your articles in Markdown.

Many Jekyll users host their Jekyll sites through GitHub Pages, and there are a lot of advantages to this:

  • Free web hosting
  • Built in version control
  • Continuous deployment

However, the main disadvantage is that GitHub Pages runs Jekyll in safe mode. This means that it’s not possible to extend Jekyll with plugins. There are some ways to avoid this restriction, but they’re all awkward workarounds.

A solution

I still wanted all the advantages GitHub Pages has, but with Jekyll plugins too. The solution was to use Travis to build the Jekyll site and host it on Azure Web Apps. Travis is a continuous integration service that’s free for open source projects, and Azure provides cheap, or even free, web hosting that I’m already familiar with.

Azure

First of all you need to create an Azure Web App.

  • Go to the Azure management site
  • Click New > Compute > Web App > Quick Create
  • Enter the URL you want and select an app service plan

Azure new web app

On the dashboard for your new web app, make a note of the FTP host name and your deployment credentials. If you’ve forgotten your deployment credentials you can reset them from here as well.

GitHub

I know I said earlier that I didn’t want to use GitHub Pages, but I’m still hosting the site’s source on GitHub - I’m just not using GitHub Pages to serve it. Travis depends on GitHub Webhooks in order to figure out when you push an update to your site and kick off a build. In addition to the standard Jekyll files you’ll need to add a .travis.yml configuration file as well as a build and deploy script. My Travis configuration looks like this:

language: ruby
rvm:
  - 2.2
script:
  - chmod +x script/build
  - ./script/build
after_success:
  - chmod +x script/deploy
  - ./script/deploy
env:
  global:
  - NOKOGIRI_USE_SYSTEM_LIBRARIES=true

Taking it section by section

language: ruby
rvm:
  - 2.2

This tells Travis that the project is Ruby based, and what version of Ruby to use - in this case Ruby 2.2.

script:
  - chmod +x script/build
  - ./script/build

This sets the execute flag on my build script in the script directory, then executes it.

after_success:
  - chmod +x script/deploy
  - ./script/deploy

If the build is successful, Travis will set the execute flag on the deploy script, then execute it.

env:
  global:
  - NOKOGIRI_USE_SYSTEM_LIBRARIES=true

I’m using html-proofer to check all the links and images on my site, and this environment variable speeds up the build by letting Nokogiri (a dependency of html-proofer) use pre-installed system libraries instead of compiling its own.

Now onto the build script. There’s nothing terribly exciting here, just build the site and run html-proofer on it.

bundle exec jekyll build
bundle exec htmlproof ./_site

The real magic happens in the deploy script.

sudo apt-get install -qq ncftp

ncftp -u "$USERNAME" -p "$PASSWORD" $HOST<<EOF
rm -rf site/wwwroot
mkdir site/wwwroot
quit
EOF

cd _site
ncftpput -R -v -u "$USERNAME" -p "$PASSWORD" $HOST /site/wwwroot .

I’m making use of a handy program called ncftp in order to deploy my site. Firstly Travis deletes the currently deployed site, then puts the generated Jekyll site on the FTP server.

Travis

To put it all together you need to configure Travis builds for your GitHub repository, and set the environment variables to allow Travis to deploy to Azure:

  • Go to your Travis profile
  • Click the slider next to your Jekyll repository
  • Go to your repositories and click on your Jekyll repository
  • Click Settings > Environment variables
  • Set environment variables for your Azure Web App where
    • USERNAME is azure-web-app-name\\azure-deployment-username
    • PASSWORD is azure-deployment-password
    • HOST is ftp-server-name.ftp.azurewebsites.windows.net

Travis environment variables

Remember that USERNAME requires a double backslash to escape the backslash character in the terminal.

Putting it all together

Now that everything is configured, all you need to do is push a commit to GitHub and wait. If everything is good you should see your Jekyll site deployed automatically to your Azure Web App - though if you’re anything like me, html-proofer will have picked up some broken links on your site!