Happy Monday! pic.twitter.com/DHAfxnEcSo— Twitter Singapore (@TwitterSG) February 19, 2018 If you’re a company, and you have enough money, you can buy your own hashflag as well! That’s exactly what Disney did for the release of Star Wars: The Last Jedi.
If you spend the money to buy a hashflag, it’s important that you launch it correctly—otherwise they can flop. #白白白白白白白白白白 is an example of what not to do. At time of writing, it has only 10 uses.
Hashflags aren’t exclusive to English, and they can help add context to a tweet in another language. I don’t speak any Russian, but I do know that this image is of BB-8!
Unfortunately hashflags are temporary, so any context they add to a tweet can sometimes be lost at a later date. Currently Twitter doesn’t provide an official API for hashflags, and there is no canonical list of currently active hashflags. @hashflaglist tracks hashflags, but it’s easy to miss one—this is where Azure Functions come in.
It turns out that on Twitter.com the list of currently active hashflags is sent as a JSON object in the HTML as initial data. All I need to do is fetch Twitter.com, and extract the JSON object from the HTML.
$ curl https://twitter.com -v --silent 2>&1 | grep -o -P '.{6}activeHashflags.{6}'
"activeHashflags"
I wrote some C# to parse and extract the activeHashflags
JSON object, and store it in an Azure blob. You can find it here. Using Azure Functions I can run this code on a timer, so the Azure blob is always up to date with the latest Twitter hashflags. This means the blob can be used as an unofficial Twitter hashflags API—but I didn’t want to stop there.
I wanted to solve some of the issues with hashflags around both discovery and durability. Azure Functions is the perfect platform for these small, single purpose pieces of code. I ended up writing five Azure Functions in total—all of which can be found on GitHub.

ActiveHashflags
fetches the active hashflags from Twitter, and stores them in a JSON object in an Azure Storage Blob. You can find the list of current hashflags here. UpdateHashflagState
reads the JSON, and updates the hashflag table with the current state of each hashflag. StoreHashflagImage
downloads the hashflag image, and stores it in a blob store. CreateHeroImage
creates a hero image of the hashtag and hashflag. TweetHashflag
tweets the hashtag and hero image.
Say hello to @HashflagArchive!

@HashflagArchive solves both the issues I have with hashflags: it tweets out new hashflags the same hour they are activated on twitter, which solves the issue of discovery; and it tweets an image of the hashtag and hashflag, which solves the issue of hashflags being temporary.
So this is great, but there’s still one issue—how to use hashflags outside of Twitter.com and the official Twitter apps. This is where the JSON blob comes in. I can build a wrapper library around that, and then using that library, build applications with Twitter hashflags. So that’s exactly what I did.

I wrote an npm package called hashflags
. It’s pretty simple to use, and integrates nicely with the official twitter-text
npm package.
import { Hashflags } from 'hashflags';
let hf: Hashflags;
Hashflags.FETCH().then((val: Map<string, string>) => {
hf = new Hashflags(val);
console.log(hf.activeHashflags);
});
I wrote it in TypeScript, but it can also be used from plain old JS as well.
const Hashflags = require('hashflags').Hashflags;
let hf;
Hashflags.FETCH().then(val => {
hf = new Hashflags(val);
console.log(hf.activeHashflags);
});
So there you have it, a quick introduction to Twitter hashflags via Azure Functions and an npm library. If you’ve got any questions please leave a comment below, or reach out to me on Twitter @Jamie_Magee.
22 Mar 2018 In part one of this article, I collected robots.txt
from the top 1 million sites on the web. In this article I’m going to do some analysis, and see if there’s anything interesting to find from all the files I’ve collected.
First we’ll start with some setup.
%matplotlib inline
import pandas as pd
import numpy as np
import glob
import os
import matplotlib
Next I’m going to load the content of each file into my pandas dataframe, calculate the file size, and store that for later.
l = [filename.split('/')[1] for filename in glob.glob('robots-txt/*')]
df = pd.DataFrame(l, columns=['domain'])
df['content'] = df.apply(lambda x: open('robots-txt/' + x['domain']).read(), axis=1)
df['size'] = df.apply(lambda x: os.path.getsize('robots-txt/' + x['domain']), axis=1)
df.sample(5)
| domain | content | size |
612419 | veapple.com | User-agent: *\nAllow: /\n\nSitemap: http://www... | 260 |
622296 | buscadortransportes.com | User-agent: *\nDisallow: /out/ | 29 |
147795 | dailynews360.com | User-agent: *\nAllow: /\n\nDisallow: /search/\... | 248 |
72823 | newfoundlandpower.com | User-agent: *\nDisallow: /Search.aspx\nDisallo... | 528 |
601408 | xfwed.com | #\n# robots.txt for www.xfwed.com\n# Version 3... | 201 |
File sizes
Now that we’ve done the setup, let’s see what the spread of file sizes in robots.txt
is.
fig = df.plot.hist(title='robots.txt file size', bins=20)
fig.set_yscale('log')

It looks like the majority of robots.txt
are under 250KB in size. This is really no surprise as robots.txt
supports regex, so complex rulesets can be built easily.
Let’s take a look at the files larger than 1MB. I can think of three possibilities: they’re automatically maintained; they’re some other file masquerading as robots.txt
; or the site is doing something seriously wrong.
large = df[df['size'] > 10 ** 6].sort_values(by='size', ascending=False)
import re
def count_directives(value, domain):
content = domain['content']
return len(re.findall(value, content, re.IGNORECASE))
large['disallow'] = large.apply(lambda x: count_directives('Disallow', x), axis=1)
large['user-agent'] = large.apply(lambda x: count_directives('User-agent', x), axis=1)
large['comments'] = large.apply(lambda x: count_directives('#', x), axis=1)
# The directives below are non-standard
large['crawl-delay'] = large.apply(lambda x: count_directives('Crawl-delay', x), axis=1)
large['allow'] = large.apply(lambda x: count_directives('Allow', x), axis=1)
large['sitemap'] = large.apply(lambda x: count_directives('Sitemap', x), axis=1)
large['host'] = large.apply(lambda x: count_directives('Host', x), axis=1)
large
| domain | content | size | disallow | user-agent | comments | crawl-delay | allow | sitemap | host |
632170 | haberborsa.com.tr | User-agent: *\nAllow: /\n\nDisallow: /?ref=\nD... | 5820350 | 71244 | 2 | 0 | 0 | 71245 | 5 | 10 |
23216 | miradavetiye.com | Sitemap: https://www.miradavetiye.com/sitemap_... | 5028384 | 47026 | 7 | 0 | 0 | 47026 | 2 | 0 |
282904 | americanrvcompany.com | Sitemap: http://www.americanrvcompany.com/site... | 4904266 | 56846 | 1 | 1 | 0 | 56852 | 2 | 0 |
446326 | exibart.com | User-Agent: *\nAllow: /\nDisallow: /notizia.as... | 3275088 | 61403 | 1 | 0 | 0 | 61404 | 0 | 0 |
579263 | sinospectroscopy.org.cn | http://www.sinospectroscopy.org.cn/readnews.ph... | 2979133 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
55309 | vibralia.com | # robots.txt automaticaly generated by PrestaS... | 2835552 | 39712 | 1 | 15 | 0 | 39736 | 0 | 0 |
124850 | oftalmolog30.ru | User-Agent: *\nHost: chuzmsch.ru\nSitemap: htt... | 2831975 | 87752 | 1 | 0 | 0 | 87752 | 2 | 2 |
557116 | la-viephoto.com | User-Agent:*\nDisallow:/aloha_blog/\nDisallow:... | 2768134 | 29782 | 2 | 0 | 0 | 29782 | 2 | 0 |
677400 | bigclozet.com | User-agent: *\nDisallow: /item/\n\nUser-agent:... | 2708717 | 51221 | 4 | 0 | 0 | 51221 | 0 | 0 |
621834 | tranzilla.ru | Host: tranzilla.ru\nSitemap: http://tranzilla.... | 2133091 | 27647 | 1 | 0 | 0 | 27648 | 2 | 1 |
428735 | autobaraholka.com | User-Agent: *\nDisallow: /registration/\nDisal... | 1756983 | 39330 | 1 | 0 | 0 | 39330 | 0 | 2 |
628591 | megasmokers.ru | User-agent: *\nDisallow: /*route=account/\nDis... | 1633963 | 92 | 2 | 0 | 0 | 92 | 2 | 1 |
647336 | valencia-cityguide.com | # If the Joomla site is installed within a fol... | 1559086 | 17719 | 1 | 12 | 0 | 17719 | 1 | 99 |
663372 | vetality.fr | # robots.txt automaticaly generated by PrestaS... | 1536758 | 27737 | 1 | 12 | 0 | 27737 | 0 | 0 |
105735 | golden-bee.ru | User-agent: Yandex\nDisallow: /*_openstat\nDis... | 1139308 | 24081 | 4 | 1 | 0 | 24081 | 0 | 1 |
454311 | dreamitalive.com | user-agent: google\ndisallow: /memberprofileda... | 1116416 | 34392 | 3 | 0 | 0 | 34401 | 0 | 9 |
245895 | gobankingrates.com | User-agent: *\nDisallow: /wp-admin/\nAllow: /w... | 1018109 | 7362 | 28 | 20 | 2 | 7363 | 0 | 0 |
It looks like all of these sites are misusing Disallow
and Allow
. In fact, looking at the raw files it appears as if they list all of the articles on the site under an individual Disallow
command. I can only guess that when publishing an article, a corresponding line in robots.txt
is added.
Now let’s take a look at the smallest robots.txt
small = df[df['size'] > 0].sort_values(by='size', ascending=True)
small.head(5)
| domain | content | size |
336828 | iforce2d.net | \n | 1 |
55335 | togetherabroad.nl | \n | 1 |
471397 | behchat.ir | \n | 1 |
257727 | docteurtamalou.fr | | 1 |
669247 | lastminute-cottages.co.uk | \n | 1 |
There’s not really anything interesting here, so let’s take a look at some larger files
small = df[df['size'] > 10].sort_values(by='size', ascending=True)
small.head(5)
| domain | content | size |
676951 | fortisbc.com | sitemap.xml | 11 |
369859 | aurora.com.cn | User-agent: | 11 |
329775 | klue.kr | Disallow: / | 11 |
390064 | chneic.sh.cn | Disallow: / | 11 |
355604 | hpi-mdf.com | Disallow: / | 11 |
Disallow: /
tells all webcrawlers not to crawl anything on this site, and should (hopefully) keep it out of any search engines, but not all webcrawlers follow robots.txt
.
User agents
User agents can be listed in robots.txt
to either Allow
or Disallow
certain paths. Let’s take a look at the most common webcrawlers.
from collections import Counter
def find_user_agents(content):
return re.findall('User-agent:? (.*)', content)
user_agent_list = [find_user_agents(x) for x in df['content']]
user_agent_count = Counter(x.strip() for xs in user_agent_list for x in set(xs))
user_agent_count.most_common(n=10)
[('*', 587729),
('Mediapartners-Google', 36654),
('Yandex', 29065),
('Googlebot', 25932),
('MJ12bot', 22250),
('Googlebot-Image', 16680),
('Baiduspider', 13646),
('ia_archiver', 13592),
('Nutch', 11204),
('AhrefsBot', 11108)]
It’s no surprise that the top result is a wildcard (*
). Google takes spots 2, 4, and 6 with their AdSense, search and image web crawlers respectively. It does seem a little strange to see the AdSense bot listed above the usual search web crawler. Some of the other large search engines’ bots are also found in the top 10: Yandex, Baidu, and Yahoo (Slurp
). MJ12bot
is a crawler I had not heard of before, but according to their site it belongs to a UK based SEO company—and according to some of the results about it, it doesn’t behave very well. ia_archiver
belongs to The Internet Archive, and (I assume) crawls pages for the Wayback Machine. Finally there is Apache Nutch, an open source webcrawler that can be run by anyone.
Security by obscurity
There are certain paths that you might not want a webcrawler to know about. For example, a .git
directory, htpasswd
files, or parts of a site that are still in testing, and aren’t meant to be found by anyone on Google. Let’s see if there’s anything interesting.
sec_obs = ['\.git', 'alpha', 'beta', 'secret', 'htpasswd', 'install\.php', 'setup\.php']
sec_obs_regex = re.compile('|'.join(sec_obs))
def find_security_by_obscurity(content):
return sec_obs_regex.findall(content)
sec_obs_list = [find_security_by_obscurity(x) for x in df['content']]
sec_obs_count = Counter(x.strip() for xs in sec_obs_list for x in set(xs))
sec_obs_count.most_common(10)
[('install.php', 28925),
('beta', 2834),
('secret', 753),
('alpha', 597),
('.git', 436),
('setup.php', 73),
('htpasswd', 45)]
Just because a file or directory is mentioned in robots.txt
, it doesn’t mean that it can actually be accessed. However, if even 1% of Wordpress installs leave their install.php
open to the world, that’s still a lot of vulnerable sites. Any attacker could get the keys to the kingdom very easily. The same goes for a .git
directory. Even if it is read-only, people accidentally commit secrets to their git repository all the time.
Conclusion
robots.txt
is a fairly innocuous part of the web. It’s been interesting to see how popular websites (ab)use it, and which web crawlers are naughty or nice. Most of all this has been a great exercise for myself in collecting data and analysing it using pandas and Jupyter.
The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub
19 Sep 2017 After reading CollinMorris’s analysis of favicons of the top 1 million sites on the web, I thought it would be interesting to do the same for other common parts of websites that often get overlooked.
The robots.txt
file is a plain text file found at on most websites which communicates information to web crawlers and spiders about how to scan a website.. For example, here’s an excerpt from robots.txt
for google.com
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
...
The above excerpt tells all web crawlers not to scan the /search
path, but allows them to scan /search/about
and /search/howsearchworks
paths. There are a few more supported keywords, but these are the most common. Following these instructions is not required, but it is considered good internet etiquette. If you want to read more about the standard, Wikipedia has a great page here.
In order to do an analysis of robots.txt
, first I need to crawl the web for them – ironic, I know.
Scraping
I wrote a scraper using scrapy to make a request for robots.txt
for each of the domains in Alexa’s top 1 million websites. If the response code was 200, the Content-Type
header contained text/plain
, and the response body was not empty, I stored the response body in a file, with the same name as the domain name.
One complication I encountered was that not all domains respond on the same protocol or subdomain. For example, some websites respond on http://{domain_name}
while others require http://www.{domain_name}
. If a website doesn’t automatically redirect you to the correct protocol or subdomain, the only way to find the correct one, is to try them all! So I wrote a small class, extending scrapy’s RetryMiddleware
, to do this:
COMMON_PREFIXES = {'http://', 'https://', 'http://www.', 'https://www.'}
class PrefixRetryMiddleware(RetryMiddleware):
def process_exception(self, request, exception, spider):
prefixes_tried = request.meta['prefixes']
if COMMON_PREFIXES == prefixes_tried:
return exception
new_prefix = choice(tuple(COMMON_PREFIXES - prefixes_tried))
request = self.update_request(request, new_prefix)
return self._retry(request, exception, spider)
The rest of the scraper itself is quite simple, but you can read the full code on GitHub.
Results
Scraping the full Alexa top 1 million websites list took around 24 hours. Once it was finished, I had just under 700k robots.txt files
$ find -type f | wc -l
677686
totalling 493MB
The smallest robots.txt
was 1 byte, but the largest was over 5MB.
$ find -type f -exec du -Sh {} + | sort -rh | head -n 1
5.6M ./haberborsa.com.tr
$ find -not -empty -type f -exec du -b {} + | sort -h | head -n 1
1 ./0434.cc
The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub
What’s next?
In the next part of this blog series I’m going to analyse all the robots.txt
to see if I can find anything interesting. In particular I’d like to know why exactly someone needs a robots.txt
file over 5MB in size, what is the most common web crawler listed (either allowed or disallowed), and are there any sites practising security by obscurity by trying to keep links out of search engines!
I needed to include -not -empty
when looking for the smallest file, as there were errors when decoding the response body for some domains. I’ve included the empty files in the dataset for posterity, but I will exclude them from further analysis.
28 Mar 2016 Late last year I set about building a new NAS to replace my aging HP ProLiant MicroServer N36L (though that’s a story for a different post). I decided to go with unRAID as my OS, over FreeNAS that I’d been running previously, mostly due to the simpler configuration, ease of expanding an array, and support for Docker and KVM.
Docker support makes it a lot easier to run some of the web apps that I rely on like Plex, Sonarr, CouchPotato and more, but accessing them securely outside my network is a different story. On FreeNAS I ran an nginx reverse proxy in a BSD jail, secured using basic auth, and SSL certificates from StartSSL. Thankfully there is already a Docker image, nginx-proxy by jwilder, which automatically configures nginx for you. As for SSL, Let’s Encrypt went into public beta in December, and recently issued their millionth certificate. There’s also a Docker image which will automatically manage your certificates, and configure nginx for you - letsencrypt-nginx-proxy-companion by jrcs.
Preparation
Firstly, you need to set up and configure Docker. There is a fantastic guide on how to configure Docker here on the Lime Technology website. Next, you need to install the Community Applications plugin. This allows us to install Docker containers directly from the Docker Hub.
In order to access everything from outside your LAN you’ll need to forward ports 80 and 443 to ports 8008 and 443, respectively, on your unRAID host. In addition, you’ll need to use some sort of Dynamic DNS service, though in my case I bought a domain name and use CloudFlare to handle my DNS.
nginx
From the Apps page on the unRAID web interface, search for nginx-proxy
and click ‘get more results from Docker Hub’. Click ‘add’ under the listing for nginx-proxy
by jwilder. Set the following container settings, changing your ‘host path’ to wherever you store Docker configuration files on your unRAID host

To add basic auth to any of the sites you’ll need to make a file with the VIRTUAL_HOST
of the site, available to nginx in /etc/nginx/htpasswd
. For example, I added a file in /mnt/user/docker/nginx/htpasswd/
. You can create htpasswd files using apache2-utils
, or there are sites available which can create them.
Let’s Encrypt
From the Apps page again, search for letsencrypt-nginx-proxy-companion
, click ‘get more results from Docker Hub’, and then click ‘add’ under the listing for letsencrypt-nginx-proxy-companion
by jrcs. Enter the following container settings, again changing your ‘host path’ to wherever you store Docker configuration files on your unRAID host

Putting it all together
In order to configure nginx, you’ll need to add four environment variables to the Docker containers you wish to put behind the reverse proxy. They are VIRTUAL_HOST
, VIRTUAL_PORT
, LETSENCRYPT_HOST
, and LETSENCRYPT_EMAIL
. VIRTUAL_HOST
and LETSENCRYPT_HOST
most likely need to be the same, and will be something like subdomain.yourdomain.com
. VIRTUAL_PORT
should be the port your Docker container exposes. For example, Sonarr uses port 8989 by default. LETSENCRYPT_EMAIL
should be a valid email address that Let’s Encrypt can use to email you about certificate expiries, etc.
Once nginx-proxy
, letsencrypt-nginx-proxy-companion
, and all your Docker containers are configured you should be able to access them all over SSL, with basic auth, from outside your LAN. You don’t even have to worry about certificate renewals as it’s all handled for you.
06 May 2015 I recently saw an article on /r/programming called Do you really want “bank grade” security in your SSL? Here’s how Aussie banks fare. The author used the Qualys SSL Labs test to determine how good Aussie banks’ SSL implementations really are. I thought the article was great, and gave good, actionable feedback. At the time of writing this two of the banks listed have already improved their SSL scores.
It got me thinking: how well (or badly) do banks in Denmark fare? We put our trust - and our money - in these banks, but do they really deserve it? The banks I’ll be testing come from the list of systemically important banks, more commonly known as “Too big to fail.” The list consists of:
- Danske Bank
- Nykredit
- Nordea
- Jyske Bank
- Sydbank
- DLR Kredit
The Qualys SSL Labs test gives an overall grade, from A to F, but also points out any pressing issues with the SSL configuration. To score well a site must:
- Disable SSL 3 protocol support as it is obselete and insecure
- Support TLS 1.2 as it is the current best protocol
- Have no SHA1 certificates (excluding the root certificate) in the chain as modern browsers will show the site as insecure
- Disable the RC4 cipher as it is a weak cipher
- Support forward secrecy to prevent a compromise of a secure key affecting the confidentiality of past conversations
- Mitigate POODLE attacks, to prevent attackers downgrading secure connections to insecure connections
To make this as realistic as possible, I’ll be testing the login pages.
So I’ve got my list of sites, and my test all sorted. Let’s dive right in!
Bank | Grade | SSL 3 | TLS 1.2 | SHA1 | RC4 | Forward Secrecy | POODLE |
Danske Bank | A- | Pass | Pass | Pass | Pass | Fail | Pass |
Nordea | B | Pass | Fail | Fail1 | Fail | Fail | Pass |
DLR Kredit | C | Fail | Fail | Fail | Fail | Fail | Fail2 |
Jyske Bank | F | Pass | Fail | Pass | Fail | Fail | Fail |
Sydbank | F | Pass | Fail | Pass | Fail | Fail | Fail |
Nykredit | F | Fail | Fail | Fail3 | Fail | Fail | Fail |
1Nordea’s SSL certificate is SHA256, but they use an SHA1 intermediate certificate
2DLR receives a C overall, as only its SSL implementations are vulnerable to POODLE, but not its TLS implementation
3Nykredit would pass, but they provide an unnecessary certificate which is signed using SHA1
Only one bank, Danske Bank, managed to get an A (though it is an A-). This is mostly due to the lack of forward secrecy support. If they fix this they can increase their rating to an A. They are also the only bank to enable TLS 1.2 support, and disable the RC4 cipher.
Nordea comes in second, managing a B, with some very odd results. They only support the TLS 1.0 protocol, but the list of server preferred cipher suites starts with RC4 (which is insecure). The server also sent a certificate chain with an MD2 certificate in it!
DLR Kredit achieves a C, despite the fact they are vulnerable to POODLE, because only their SSL 3 implementation is vulnerable. Their certificate is signed using SHA1, but stranger still their server didn’t send any intermediate certificates. DLR Kredit is also the only bank which is vulnerable to the CRIME attack. This allows an attacker to read secure cookies sent by the server. Disabling TLS compression is all that is required to mitigate this.
Jyske Bank and Sydbank appear to use the same server configuration as their results are equally bad. They both have disabled SSL 3, and have a complete certificate chain signed using SHA256, but fail on all other tests. In addition, both banks are intolerant to TLS versions, meaning if their websites are badly written they may stop working when new TLS versions comes out.
Finally we have Nykredit, who fares worst of all. Their server sent unnecessary certificates, signed using SHA1, which causes them to fail this test. They only support TLS 1.0 and SSL 3, but their server’s preferred cipher is RC4. Most worrying is the lack of support for secure renegotiation. This vulnerability is nearly 5 years old, and allows a man-in-the-middle to inject arbitrary content to an encrypted session.
Overall it’s not looking too good. A lot of Denmark’s biggest banks are not as secure as they would have you believe. Many are vulnerable to a lot of different avenues of attack - the most worrying being POODLE. These results are similar to those in Troy Hunt’s original article, and just go to show that just because something is “bank grade” doesn’t necessarily mean that it’s actually good.