Twitter Hashflags (Hash-what?)

Have you ever tweeted out a hashtag, and discovered a small image attached to the side of it? It could be for #StPatricksDay, #MarchForOurLives, or whatever #白白白白白白白白白白 is meant to be. These are hashflags.

A hashflag, sometimes called a Twitter emoji, is a small image that appears after a #hashtag for special events. Hashflags are not regular emoji, and you can only use them on the Twitter website or in the official Twitter apps.

If you’re a company, and you have enough money, you can buy your own hashflag as well! That’s exactly what Disney did for the release of Star Wars: The Last Jedi.

If you spend the money to buy a hashflag, it’s important that you launch it correctly—otherwise it can flop. #白白白白白白白白白白 is an example of what not to do. At the time of writing, it has only 10 uses.

Hashflags aren’t exclusive to English, and they can help add context to a tweet in another language. I don’t speak any Russian, but I do know that this image is of BB-8!

Unfortunately hashflags are temporary, so any context they add to a tweet can sometimes be lost at a later date. Currently Twitter doesn’t provide an official API for hashflags, and there is no canonical list of currently active hashflags. @hashflaglist tracks hashflags, but it’s easy to miss one—this is where Azure Functions come in.

It turns out that on Twitter.com the list of currently active hashflags is sent as a JSON object in the HTML as initial data. All I need to do is fetch Twitter.com, and extract the JSON object from the HTML.

$ curl https://twitter.com -v --silent 2>&1 | grep -o -P '.{6}activeHashflags.{6}'

"activeHashflags"

I wrote some C# to parse and extract the activeHashflags JSON object, and store it in an Azure blob. You can find it here. Using Azure Functions I can run this code on a timer, so the Azure blob is always up to date with the latest Twitter hashflags. This means the blob can be used as an unofficial Twitter hashflags API—but I didn’t want to stop there.
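
The function itself is written in C#, but the extraction step is simple enough to sketch in a few lines of Python. Treat this as an illustration only: the URL, headers, and page structure are assumptions, and Twitter can change the markup at any time.

import json
import re

import requests

def fetch_active_hashflags(url='https://twitter.com'):
    # Fetch the page and look for the "activeHashflags" key in the embedded initial data.
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
    match = re.search(r'"activeHashflags"\s*:', html)
    if not match:
        return {}
    # Decode the JSON object that immediately follows the key.
    rest = html[match.end():].lstrip()
    hashflags, _ = json.JSONDecoder().raw_decode(rest)
    return hashflags  # maps hashtag -> hashflag image URL

print(len(fetch_active_hashflags()))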

I wanted to solve some of the issues with hashflags around both discovery and durability. Azure Functions is the perfect platform for these small, single-purpose pieces of code. I ended up writing five Azure Functions in total—all of which can be found on GitHub.

Screenshot of hashflags-function GitHub page

  1. ActiveHashflags fetches the active hashflags from Twitter, and stores them in a JSON object in an Azure Storage Blob. You can find the list of current hashflags here.
  2. UpdateHashflagState reads the JSON, and updates the hashflag table with the current state of each hashflag.
  3. StoreHashflagImage downloads the hashflag image, and stores it in a blob store.
  4. CreateHeroImage creates a hero image of the hashtag and hashflag.
  5. TweetHashflag tweets the hashtag and hero image.

Say hello to @HashflagArchive!

Screenshot of HashflagArchive Twitter stream

@HashflagArchive solves both of the issues I have with hashflags: it tweets out new hashflags within the same hour they are activated on Twitter, which solves the issue of discovery; and it tweets an image of the hashtag and hashflag, which solves the issue of hashflags being temporary.

So this is great, but there’s still one issue—how to use hashflags outside of Twitter.com and the official Twitter apps. This is where the JSON blob comes in. I can build a wrapper library around it, and then use that library to build applications with Twitter hashflags. So that’s exactly what I did.

Screenshot of hashflags-node GitHub page

I wrote an npm package called hashflags. It’s pretty simple to use, and integrates nicely with the official twitter-text npm package.

import { Hashflags } from 'hashflags';

let hf: Hashflags;
Hashflags.FETCH().then((val: Map<string, string>) => {
  hf = new Hashflags(val);
  console.log(hf.activeHashflags);
});

I wrote it in TypeScript, but it can be used from plain old JS as well.

const Hashflags = require('hashflags').Hashflags;

let hf;
Hashflags.FETCH().then(val => {
  hf = new Hashflags(val);
  console.log(hf.activeHashflags);
});

So there you have it, a quick introduction to Twitter hashflags via Azure Functions and an npm library. If you’ve got any questions please leave a comment below, or reach out to me on Twitter @Jamie_Magee.

A survey of robots.txt - part two

In part one of this article, I collected robots.txt from the top 1 million sites on the web. In this article I’m going to do some analysis, and see if there’s anything interesting to find from all the files I’ve collected.

First we’ll start with some setup.

%matplotlib inline

import pandas as pd
import numpy as np
import glob
import os
import matplotlib

Next I’m going to load the content of each file into my pandas dataframe, calculate the file size, and store that for later.

l = [filename.split('/')[1] for filename in glob.glob('robots-txt/*')]
df = pd.DataFrame(l, columns=['domain'])
df['content'] = df.apply(lambda x: open('robots-txt/' + x['domain']).read(), axis=1)
df['size'] = df.apply(lambda x: os.path.getsize('robots-txt/' + x['domain']), axis=1)
df.sample(5)
domain content size
612419 veapple.com User-agent: *\nAllow: /\n\nSitemap: http://www... 260
622296 buscadortransportes.com User-agent: *\nDisallow: /out/ 29
147795 dailynews360.com User-agent: *\nAllow: /\n\nDisallow: /search/\... 248
72823 newfoundlandpower.com User-agent: *\nDisallow: /Search.aspx\nDisallo... 528
601408 xfwed.com #\n# robots.txt for www.xfwed.com\n# Version 3... 201

File sizes

Now that we’ve done the setup, let’s see what the spread of file sizes in robots.txt is.

fig = df.plot.hist(title='robots.txt file size', bins=20)
fig.set_yscale('log')

Histogram of robots.txt file sizes (log-scale y-axis)

It looks like the majority of robots.txt files are under 250KB in size. This is no real surprise, since most major crawlers support wildcard patterns in robots.txt, so even complex rulesets can be expressed in relatively few lines.
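
For a slightly more precise picture than the histogram gives, pandas can summarise the distribution directly. This wasn’t part of the original notebook, so no output is shown here:

df['size'].describe(percentiles=[0.5, 0.9, 0.99])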

Let’s take a look at the files larger than 1MB. I can think of three possibilities: they’re automatically maintained; they’re some other file masquerading as robots.txt; or the site is doing something seriously wrong.

large = df[df['size'] > 10 ** 6].sort_values(by='size', ascending=False)
import re

def count_directives(value, domain):
    # Simple case-insensitive substring count; note that 'Allow' also matches
    # every 'Disallow' line, and 'Host' can match inside URLs.
    content = domain['content']
    return len(re.findall(value, content, re.IGNORECASE))


large['disallow'] = large.apply(lambda x: count_directives('Disallow', x), axis=1)
large['user-agent'] = large.apply(lambda x: count_directives('User-agent', x), axis=1)
large['comments'] = large.apply(lambda x: count_directives('#', x), axis=1)
# The directives below are non-standard
large['crawl-delay'] = large.apply(lambda x: count_directives('Crawl-delay', x), axis=1)
large['allow'] = large.apply(lambda x: count_directives('Allow', x), axis=1)
large['sitemap'] = large.apply(lambda x: count_directives('Sitemap', x), axis=1)
large['host'] = large.apply(lambda x: count_directives('Host', x), axis=1)

large
domain content size disallow user-agent comments crawl-delay allow sitemap host
632170 haberborsa.com.tr User-agent: *\nAllow: /\n\nDisallow: /?ref=\nD... 5820350 71244 2 0 0 71245 5 10
23216 miradavetiye.com Sitemap: https://www.miradavetiye.com/sitemap_... 5028384 47026 7 0 0 47026 2 0
282904 americanrvcompany.com Sitemap: http://www.americanrvcompany.com/site... 4904266 56846 1 1 0 56852 2 0
446326 exibart.com User-Agent: *\nAllow: /\nDisallow: /notizia.as... 3275088 61403 1 0 0 61404 0 0
579263 sinospectroscopy.org.cn http://www.sinospectroscopy.org.cn/readnews.ph... 2979133 0 0 0 0 0 0 0
55309 vibralia.com # robots.txt automaticaly generated by PrestaS... 2835552 39712 1 15 0 39736 0 0
124850 oftalmolog30.ru User-Agent: *\nHost: chuzmsch.ru\nSitemap: htt... 2831975 87752 1 0 0 87752 2 2
557116 la-viephoto.com User-Agent:*\nDisallow:/aloha_blog/\nDisallow:... 2768134 29782 2 0 0 29782 2 0
677400 bigclozet.com User-agent: *\nDisallow: /item/\n\nUser-agent:... 2708717 51221 4 0 0 51221 0 0
621834 tranzilla.ru Host: tranzilla.ru\nSitemap: http://tranzilla.... 2133091 27647 1 0 0 27648 2 1
428735 autobaraholka.com User-Agent: *\nDisallow: /registration/\nDisal... 1756983 39330 1 0 0 39330 0 2
628591 megasmokers.ru User-agent: *\nDisallow: /*route=account/\nDis... 1633963 92 2 0 0 92 2 1
647336 valencia-cityguide.com # If the Joomla site is installed within a fol... 1559086 17719 1 12 0 17719 1 99
663372 vetality.fr # robots.txt automaticaly generated by PrestaS... 1536758 27737 1 12 0 27737 0 0
105735 golden-bee.ru User-agent: Yandex\nDisallow: /*_openstat\nDis... 1139308 24081 4 1 0 24081 0 1
454311 dreamitalive.com user-agent: google\ndisallow: /memberprofileda... 1116416 34392 3 0 0 34401 0 9
245895 gobankingrates.com User-agent: *\nDisallow: /wp-admin/\nAllow: /w... 1018109 7362 28 20 2 7363 0 0

It looks like all of these sites are misusing Disallow and Allow. In fact, looking at the raw files, it appears that they list every article on the site as an individual Disallow directive. I can only guess that a corresponding line is added to robots.txt each time an article is published.

Now let’s take a look at the smallest robots.txt files.

small = df[df['size'] > 0].sort_values(by='size', ascending=True)

small.head(5)
domain content size
336828 iforce2d.net \n 1
55335 togetherabroad.nl \n 1
471397 behchat.ir \n 1
257727 docteurtamalou.fr 1
669247 lastminute-cottages.co.uk \n 1

There’s not really anything interesting here, so let’s take a look at some slightly larger files.

small = df[df['size'] > 10].sort_values(by='size', ascending=True)

small.head(5)
domain content size
676951 fortisbc.com sitemap.xml 11
369859 aurora.com.cn User-agent: 11
329775 klue.kr Disallow: / 11
390064 chneic.sh.cn Disallow: / 11
355604 hpi-mdf.com Disallow: / 11

Disallow: / tells all web crawlers not to crawl anything on the site, and should (hopefully) keep it out of any search engines. Not all web crawlers follow robots.txt, though.
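
Python’s standard library makes it easy to see what a rule like this means in practice for a crawler that does follow the rules. A quick sketch using urllib.robotparser (not something used in this analysis, just an illustration):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /'])

print(rp.can_fetch('Googlebot', 'http://example.com/'))          # False
print(rp.can_fetch('MyCrawler', 'http://example.com/a/b.html'))  # False: everything is off limits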

User agents

User agents can be listed in robots.txt to either Allow or Disallow certain paths. Let’s take a look at the most common webcrawlers.

from collections import Counter

def find_user_agents(content):
    return re.findall('User-agent:? (.*)', content)

user_agent_list = [find_user_agents(x) for x in df['content']]
user_agent_count = Counter(x.strip() for xs in user_agent_list for x in set(xs))
user_agent_count.most_common(n=10)
[('*', 587729),
 ('Mediapartners-Google', 36654),
 ('Yandex', 29065),
 ('Googlebot', 25932),
 ('MJ12bot', 22250),
 ('Googlebot-Image', 16680),
 ('Baiduspider', 13646),
 ('ia_archiver', 13592),
 ('Nutch', 11204),
 ('AhrefsBot', 11108)]

It’s no surprise that the top result is a wildcard (*). Google takes spots 2, 4, and 6 with its AdSense, search, and image web crawlers respectively. It does seem a little strange to see the AdSense bot listed above the usual search crawler. Bots from the other large search engines, Yandex and Baidu, also make the top 10. MJ12bot is a crawler I had not heard of before, but according to their site it belongs to a UK-based SEO company—and according to some of the results about it, it doesn’t behave very well. ia_archiver belongs to The Internet Archive, and (I assume) crawls pages for the Wayback Machine. Finally there is Apache Nutch, an open-source web crawler that can be run by anyone.

Security by obscurity

There are certain paths that you might not want a webcrawler to know about. For example, a .git directory, htpasswd files, or parts of a site that are still in testing, and aren’t meant to be found by anyone on Google. Let’s see if there’s anything interesting.

sec_obs = [r'\.git', 'alpha', 'beta', 'secret', 'htpasswd', r'install\.php', r'setup\.php']
sec_obs_regex = re.compile('|'.join(sec_obs))

def find_security_by_obscurity(content):
    return sec_obs_regex.findall(content)

sec_obs_list = [find_security_by_obscurity(x) for x in df['content']]
sec_obs_count = Counter(x.strip() for xs in sec_obs_list for x in set(xs))
sec_obs_count.most_common(10)
[('install.php', 28925),
 ('beta', 2834),
 ('secret', 753),
 ('alpha', 597),
 ('.git', 436),
 ('setup.php', 73),
 ('htpasswd', 45)]

Just because a file or directory is mentioned in robots.txt, it doesn’t mean that it can actually be accessed. However, if even 1% of WordPress installs leave their install.php open to the world, that’s still a lot of vulnerable sites, and an attacker could get the keys to the kingdom very easily. The same goes for a .git directory: even if it is read-only, people accidentally commit secrets to their Git repositories all the time.

Conclusion

robots.txt is a fairly innocuous part of the web. It’s been interesting to see how popular websites (ab)use it, and which web crawlers are naughty or nice. Most of all, this has been a great exercise for me in collecting data and analysing it using pandas and Jupyter.

The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub.

A survey of robots.txt - part one

After reading Colin Morris’s analysis of favicons of the top 1 million sites on the web, I thought it would be interesting to do the same for other common parts of websites that often get overlooked.

The robots.txt file is a plain text file found at the root of most websites, which tells web crawlers and spiders which parts of the site they should and shouldn’t scan. For example, here’s an excerpt from robots.txt for google.com:

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
...

The above excerpt tells all web crawlers not to scan the /search path, but allows them to scan the /search/about and /search/howsearchworks paths. There are a few more supported keywords, but these are the most common. Following these instructions is not required, but it is considered good internet etiquette. If you want to read more about the standard, Wikipedia has a great page here.
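
If you want to check rules like these programmatically, Python ships with urllib.robotparser. The snippet below is just an illustration using the excerpt above; note that Python’s parser applies rules in the order they appear rather than by longest match, so it won’t reproduce Google’s handling of the Allow overrides exactly.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /search',
    'Allow: /search/about',
    'Allow: /search/howsearchworks',
])

print(rp.can_fetch('*', 'https://www.google.com/search'))  # False
print(rp.can_fetch('*', 'https://www.google.com/maps'))    # True (no rule matches)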

In order to do an analysis of robots.txt, first I need to crawl the web for them – ironic, I know.

Scraping

I wrote a scraper using scrapy to request robots.txt for each of the domains in Alexa’s top 1 million websites. If the response code was 200, the Content-Type header contained text/plain, and the response body was not empty, I stored the response body in a file named after the domain.
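
The heart of the spider looks roughly like the following. This is a condensed sketch rather than the real code (the input file name, output directory, and meta bookkeeping are placeholders of mine; the full version is on GitHub, linked below).

import scrapy

class RobotsTxtSpider(scrapy.Spider):
    name = 'robots-txt'

    def start_requests(self):
        # Hypothetical input: one domain per line from the Alexa top 1 million list.
        with open('top-1m.txt') as f:
            for domain in (line.strip() for line in f):
                yield scrapy.Request(
                    f'http://{domain}/robots.txt',
                    callback=self.parse,
                    # 'prefixes' is used by the retry middleware shown below.
                    meta={'domain': domain, 'prefixes': {'http://'}},
                )

    def parse(self, response):
        content_type = response.headers.get('Content-Type', b'').decode(errors='ignore')
        if response.status == 200 and 'text/plain' in content_type and response.body:
            # Assumes a robots-txt/ directory already exists.
            with open(f"robots-txt/{response.meta['domain']}", 'wb') as robots_file:
                robots_file.write(response.body)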

One complication I encountered was that not all domains respond on the same protocol or subdomain. For example, some websites respond on http://{domain_name} while others require http://www.{domain_name}. If a website doesn’t automatically redirect you to the correct protocol or subdomain, the only way to find the correct one is to try them all! So I wrote a small class, extending scrapy’s RetryMiddleware, to do this:

from random import choice

from scrapy.downloadermiddlewares.retry import RetryMiddleware

COMMON_PREFIXES = {'http://', 'https://', 'http://www.', 'https://www.'}


class PrefixRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        prefixes_tried = request.meta['prefixes']
        if COMMON_PREFIXES == prefixes_tried:
            # Every prefix has been tried; give up on this domain.
            return exception

        # Pick an untried prefix and rebuild the request with it.
        # (update_request is a small helper in the full scraper.)
        new_prefix = choice(tuple(COMMON_PREFIXES - prefixes_tried))
        request = self.update_request(request, new_prefix)

        return self._retry(request, exception, spider)

The rest of the scraper itself is quite simple, but you can read the full code on GitHub.

Results

Scraping the full Alexa top 1 million websites list took around 24 hours. Once it was finished, I had just under 700k robots.txt files

$ find -type f | wc -l
677686

totalling 493MB

$ du -sh
493M	.

The smallest robots.txt was 1 byte1, but the largest was over 5MB.

$ find -type f -exec du -Sh {} + | sort -rh | head -n 1
5.6M	./haberborsa.com.tr

$ find -not -empty -type f -exec du -b {} + | sort -h | head -n 1
1	./0434.cc

The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub.

What’s next?

In the next part of this blog series I’m going to analyse all the robots.txt files to see if I can find anything interesting. In particular, I’d like to know why exactly someone needs a robots.txt file over 5MB in size, which web crawlers are most commonly listed (either allowed or disallowed), and whether any sites are practising security by obscurity by trying to keep links out of search engines.

1 I needed to include -not -empty when looking for the smallest file, as there were errors when decoding the response body for some domains. I’ve included the empty files in the dataset for posterity, but I will exclude them from further analysis.

Setting up nginx reverse proxy with Let’s Encrypt on unRAID

Late last year I set about building a new NAS to replace my ageing HP ProLiant MicroServer N36L (though that’s a story for a different post). I decided to go with unRAID as my OS, over the FreeNAS I’d been running previously, mostly due to the simpler configuration, ease of expanding an array, and support for Docker and KVM.

Docker support makes it a lot easier to run some of the web apps that I rely on like Plex, Sonarr, CouchPotato and more, but accessing them securely outside my network is a different story. On FreeNAS I ran an nginx reverse proxy in a BSD jail, secured using basic auth, and SSL certificates from StartSSL. Thankfully there is already a Docker image, nginx-proxy by jwilder, which automatically configures nginx for you. As for SSL, Let’s Encrypt went into public beta in December, and recently issued their millionth certificate. There’s also a Docker image which will automatically manage your certificates, and configure nginx for you - letsencrypt-nginx-proxy-companion by jrcs.

Preparation

Firstly, you need to set up and configure Docker. There is a fantastic guide on how to configure Docker here on the Lime Technology website. Next, you need to install the Community Applications plugin. This allows us to install Docker containers directly from the Docker Hub.

In order to access everything from outside your LAN you’ll need to forward ports 80 and 443 to ports 8008 and 443, respectively, on your unRAID host. In addition, you’ll need to use some sort of Dynamic DNS service, though in my case I bought a domain name and use CloudFlare to handle my DNS.

nginx

From the Apps page on the unRAID web interface, search for nginx-proxy and click ‘get more results from Docker Hub’. Click ‘add’ under the listing for nginx-proxy by jwilder. Set the following container settings, changing the ‘host path’ to wherever you store Docker configuration files on your unRAID host:

nginx-proxy settings

To add basic auth to any of the sites, you’ll need to create a file named after the VIRTUAL_HOST of the site and make it available to nginx in /etc/nginx/htpasswd. For example, I added mine in /mnt/user/docker/nginx/htpasswd/. You can create htpasswd files using apache2-utils, or there are sites available which can create them for you.

Let’s Encrypt

From the Apps page again, search for letsencrypt-nginx-proxy-companion, click ‘get more results from Docker Hub’, and then click ‘add’ under the listing for letsencrypt-nginx-proxy-companion by jrcs. Enter the following container settings, again changing the ‘host path’ to wherever you store Docker configuration files on your unRAID host:

lets encrypt settings

Putting it all together

In order to configure nginx, you’ll need to add four environment variables to the Docker containers you wish to put behind the reverse proxy. They are VIRTUAL_HOST, VIRTUAL_PORT, LETSENCRYPT_HOST, and LETSENCRYPT_EMAIL. VIRTUAL_HOST and LETSENCRYPT_HOST most likely need to be the same, and will be something like subdomain.yourdomain.com. VIRTUAL_PORT should be the port your Docker container exposes. For example, Sonarr uses port 8989 by default. LETSENCRYPT_EMAIL should be a valid email address that Let’s Encrypt can use to email you about certificate expiries, etc.
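
On unRAID you’d normally add these through the container template in the web UI, but to make it concrete, here’s roughly what the equivalent looks like using the Docker SDK for Python. The image name, hostnames, and email address are placeholders, and this isn’t how unRAID actually launches containers.

import docker

client = docker.from_env()
client.containers.run(
    'linuxserver/sonarr',  # example image; any container behind the proxy works the same way
    detach=True,
    name='sonarr',
    environment={
        'VIRTUAL_HOST': 'sonarr.yourdomain.com',
        'VIRTUAL_PORT': '8989',
        'LETSENCRYPT_HOST': 'sonarr.yourdomain.com',
        'LETSENCRYPT_EMAIL': 'you@yourdomain.com',
    },
)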

Once nginx-proxy, letsencrypt-nginx-proxy-companion, and all your Docker containers are configured you should be able to access them all over SSL, with basic auth, from outside your LAN. You don’t even have to worry about certificate renewals as it’s all handled for you.

Do you really want “bank grade” security in your SSL? Danish edition

I recently saw an article on /r/programming called Do you really want “bank grade” security in your SSL? Here’s how Aussie banks fare. The author used the Qualys SSL Labs test to determine how good Aussie banks’ SSL implementations really are. I thought the article was great, and gave good, actionable feedback. At the time of writing, two of the banks listed have already improved their SSL scores.

It got me thinking: how well (or badly) do banks in Denmark fare? We put our trust - and our money - in these banks, but do they really deserve it? The banks I’ll be testing come from the list of systemically important banks, more commonly known as “Too big to fail.” The list consists of:

  • Danske Bank
  • Nykredit
  • Nordea
  • Jyske Bank
  • Sydbank
  • DLR Kredit

The Qualys SSL Labs test gives an overall grade, from A to F, but also points out any pressing issues with the SSL configuration. To score well a site must:

  • Disable SSL 3 protocol support as it is obsolete and insecure
  • Support TLS 1.2 as it is the current best protocol (a quick way to spot-check this yourself is sketched after this list)
  • Have no SHA1 certificates (excluding the root certificate) in the chain as modern browsers will show the site as insecure
  • Disable the RC4 cipher as it is a weak cipher
  • Support forward secrecy, so that a compromised private key doesn’t affect the confidentiality of past sessions
  • Mitigate POODLE attacks, to prevent attackers downgrading secure connections to insecure connections

To make this as realistic as possible, I’ll be testing the login pages.

So I’ve got my list of sites, and my test all sorted. Let’s dive right in!

Bank | Grade | SSL 3 | TLS 1.2 | SHA1 | RC4 | Forward Secrecy | POODLE
Danske Bank | A- | Pass | Pass | Pass | Pass | Fail | Pass
Nordea | B | Pass | Fail | Fail1 | Fail | Fail | Pass
DLR Kredit | C | Fail | Fail | Fail | Fail | Fail | Fail2
Jyske Bank | F | Pass | Fail | Pass | Fail | Fail | Fail
Sydbank | F | Pass | Fail | Pass | Fail | Fail | Fail
Nykredit | F | Fail | Fail | Fail3 | Fail | Fail | Fail

1Nordea’s SSL certificate is SHA256, but they use an SHA1 intermediate certificate

2DLR Kredit receives a C overall because only its SSL 3 implementation is vulnerable to POODLE, not its TLS implementation

3Nykredit would pass, but they provide an unnecessary certificate which is signed using SHA1

Only one bank, Danske Bank, managed to get an A (though it is an A-). This is mostly due to the lack of forward secrecy support. If they fix this they can increase their rating to an A. They are also the only bank to enable TLS 1.2 support, and disable the RC4 cipher.

Nordea comes in second, managing a B, with some very odd results. They only support the TLS 1.0 protocol, and the list of server-preferred cipher suites starts with RC4 (which is insecure). The server also sent a certificate chain with an MD2 certificate in it!

DLR Kredit achieves a C, despite the fact they are vulnerable to POODLE, because only their SSL 3 implementation is vulnerable. Their certificate is signed using SHA1, but stranger still their server didn’t send any intermediate certificates. DLR Kredit is also the only bank which is vulnerable to the CRIME attack. This allows an attacker to read secure cookies sent by the server. Disabling TLS compression is all that is required to mitigate this.

Jyske Bank and Sydbank appear to use the same server configuration, as their results are equally bad. They have both disabled SSL 3, and have a complete certificate chain signed using SHA256, but they fail on all other tests. In addition, both banks’ servers are TLS version intolerant, meaning their websites may stop working when new TLS versions come out.

Finally we have Nykredit, who fares worst of all. Their server sent unnecessary certificates, signed using SHA1, which causes them to fail this test. They only support TLS 1.0 and SSL 3, and their server’s preferred cipher is RC4. Most worrying is the lack of support for secure renegotiation. This vulnerability is nearly five years old, and allows a man-in-the-middle to inject arbitrary content into an encrypted session.

Overall it’s not looking too good. A lot of Denmark’s biggest banks are not as secure as they would have you believe, and many are vulnerable to several different avenues of attack - the most worrying being POODLE. These results are similar to those in Troy Hunt’s original article, and go to show that just because something is “bank grade” doesn’t mean it’s actually any good.