Honey, I shrunk the npm package

Have you ever wondered what lies beneath the surface of an npm package? At its heart, it’s nothing more than a gzipped tarball. In software development, source code and binary artifacts are nearly always shipped as .tar.gz or .tgz files. And gzip compression is supported by every HTTP server and web browser out there. caniuse.com doesn’t even give statistics for support, it just says “supported in effectively all browsers”. But here’s the kicker: gzip is starting to show its age, making way for newer compression algorithms like Brotli and ZStandard. Now, imagine a world where npm embraces one of these algorithms. In this blog post, I’ll dive into the realm of compression and explore the possibilities of modernising npm’s compression strategy.

What’s the competition?

The two major players in this space are Brotli and ZStandard (or zstd for short). Brotli was released by Google in 2013 and zstd was released by Facebook in 2016. They’ve since been standardised in RFC 7932 and RFC 8478 respectively, and have seen widespread use all over the software industry. It was actually the announcement by Arch Linux that they were going to start compressing their packages with zstd by default that made me think about this in the first place. Arch Linux was by no means the first project to do this, nor is it the only one. But to find out if it makes sense for the Node ecosystem, I need to do some benchmarks. And that means breaking out tar.

Benchmarking part 1

https://xkcd.com/1168/

I’m going to start with tar and see what sort of results I can get by switching between gzip, Brotli, and zstd. I’ll test with the npm package of npm itself, as it’s a pretty popular package, averaging over 4 million downloads a week, while also being quite large at around 11MB unpacked.

$ curl --remote-name https://registry.npmjs.org/npm/-/npm-9.7.1.tgz
$ ls -l --human npm-9.7.1.tgz
-rw-r--r-- 1 jamie users 2.6M Jun 16 20:30 npm-9.7.1.tgz
$ tar --extract --gzip --file npm-9.7.1.tgz
$ du --summarize --human --apparent-size package
11M	package

gzip is already giving good results, compressing 11MB to 2.6MB for a compression ratio of around 0.24. But what can the contenders do? I’m going to stick with the default options for now:

$ brotli --version
brotli 1.0.9
$ tar --use-compress-program brotli --create --file npm-9.7.1.tar.br package
$ zstd --version
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
$ tar --use-compress-program zstd --create --file npm-9.7.1.tar.zst package
$ ls -l --human npm-9.7.1.tgz npm-9.7.1.tar.br npm-9.7.1.tar.zst
-rw-r--r-- 1 jamie users 1.6M Jun 16 21:14 npm-9.7.1.tar.br
-rw-r--r-- 1 jamie users 2.3M Jun 16 21:14 npm-9.7.1.tar.zst
-rw-r--r-- 1 jamie users 2.6M Jun 16 20:30 npm-9.7.1.tgz

Wow! With no configuration both Brotli and zstd come out ahead of gzip, but Brotli is the clear winner here. It manages a compression ratio of 0.15 versus zstd’s 0.21. In real terms that means a saving of around 1MB. That doesn’t sound like much, but at 4 million weekly downloads, that would save 4TB of bandwidth per week.
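For the curious, the arithmetic behind those numbers is simple enough to script (sizes are approximate, taken from the listings above):

// Approximate sizes from the listings above
const unpacked = 11e6;  // package, unpacked
const gzip = 2.6e6;     // npm-9.7.1.tgz
const brotli = 1.6e6;   // npm-9.7.1.tar.br
const zstd = 2.3e6;     // npm-9.7.1.tar.zst

console.log((brotli / unpacked).toFixed(2)); // 0.15
console.log((zstd / unpacked).toFixed(2));   // 0.21

// ~1MB saved per download, at ~4 million downloads per week
console.log(`${((gzip - brotli) * 4e6) / 1e12} TB/week`); // 4 TB/week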

Benchmarking part 2: Electric boogaloo

The compression ratio is only telling half of the story. Actually, it’s a third of the story, but compression speed isn’t really a concern here: compression only happens once, when a package is published, while decompression happens every time you run npm install. So any time saved decompressing packages means quicker install or build steps.

To test this, I’m going to use hyperfine, a command-line benchmarking tool. Decompressing each of the packages I created earlier 100 times should give me a good idea of the relative decompression speed.

$ hyperfine --runs 100 --export-markdown hyperfine.md \
  'tar --use-compress-program brotli --extract --file npm-9.7.1.tar.br --overwrite' \
  'tar --use-compress-program zstd --extract --file npm-9.7.1.tar.zst --overwrite' \
  'tar --use-compress-program gzip --extract --file npm-9.7.1.tgz --overwrite'

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| tar --use-compress-program brotli --extract --file npm-9.7.1.tar.br --overwrite | 51.6 ± 3.0 | 47.9 | 57.3 | 1.31 ± 0.12 |
| tar --use-compress-program zstd --extract --file npm-9.7.1.tar.zst --overwrite | 39.5 ± 3.0 | 33.5 | 51.8 | 1.00 |
| tar --use-compress-program gzip --extract --file npm-9.7.1.tgz --overwrite | 47.0 ± 1.7 | 44.0 | 54.9 | 1.19 ± 0.10 |

This time zstd comes out in front, followed by gzip and Brotli. This makes sense, as “real-time compression” is one of the big features touted in zstd’s documentation. While Brotli is 31% slower than zstd, in real terms that’s only 12ms. And compared to gzip, it’s only 5ms slower. To put that into context, you’d need a connection faster than 1Gbps before the 5ms Brotli loses in decompression outweighs the 1MB it saves in package size.
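To sanity-check that claim with some quick back-of-the-envelope maths:

// Brotli saves ~1MB over gzip, but decompresses ~5ms slower.
// How fast would a connection have to be before gzip wins overall?
const savedBytes = 1e6;     // ~1MB smaller download with Brotli
const extraSeconds = 0.005; // ~5ms slower decompression
const breakEvenBps = (savedBytes * 8) / extraSeconds; // bits per second
console.log(`${breakEvenBps / 1e9} Gbps`); // 1.6 Gbps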

Benchmarking part 3: This time it’s serious

Up until now I’ve just been looking at Brotli and zstd’s default settings, but both have a lot of knobs and dials that you can adjust to change the compression ratio and compression or decompression speed. Thankfully, the industry standard lzbench has got me covered. It can run through all of the different quality levels for each compressor, and spit out a nice table with all the data at the end.

But before I dive in, there are a few caveats I should point out. The first is that lzbench isn’t able to compress an entire directory like tar can, so I opted to use lib/npm.js for this test. The second is that lzbench doesn’t include the gzip tool; instead it uses zlib, the underlying gzip library. The last is that the versions of each compressor aren’t quite current. The latest version of zstd is 1.5.5, released on April 4th 2023, whereas lzbench uses version 1.4.5, released on May 22nd 2020. The latest version of Brotli is 1.0.9, released on August 27th 2020, whereas lzbench uses a version released on October 1st 2019.

$ lzbench -o1 -ezlib/zstd/brotli package/lib/npm.js
| Compressor name | Compression | Decompress. | Compr. size | Ratio | Filename |
|:---|---:|---:|---:|---:|:---|
| memcpy | 117330 MB/s | 121675 MB/s | 13141 | 100.00 | package/lib/npm.js |
| zlib 1.2.11 -1 | 332 MB/s | 950 MB/s | 5000 | 38.05 | package/lib/npm.js |
| zlib 1.2.11 -2 | 382 MB/s | 965 MB/s | 4876 | 37.11 | package/lib/npm.js |
| zlib 1.2.11 -3 | 304 MB/s | 986 MB/s | 4774 | 36.33 | package/lib/npm.js |
| zlib 1.2.11 -4 | 270 MB/s | 1009 MB/s | 4539 | 34.54 | package/lib/npm.js |
| zlib 1.2.11 -5 | 204 MB/s | 982 MB/s | 4452 | 33.88 | package/lib/npm.js |
| zlib 1.2.11 -6 | 150 MB/s | 983 MB/s | 4425 | 33.67 | package/lib/npm.js |
| zlib 1.2.11 -7 | 125 MB/s | 983 MB/s | 4421 | 33.64 | package/lib/npm.js |
| zlib 1.2.11 -8 | 92 MB/s | 989 MB/s | 4419 | 33.63 | package/lib/npm.js |
| zlib 1.2.11 -9 | 95 MB/s | 986 MB/s | 4419 | 33.63 | package/lib/npm.js |
| zstd 1.4.5 -1 | 594 MB/s | 1619 MB/s | 4793 | 36.47 | package/lib/npm.js |
| zstd 1.4.5 -2 | 556 MB/s | 1423 MB/s | 4881 | 37.14 | package/lib/npm.js |
| zstd 1.4.5 -3 | 510 MB/s | 1560 MB/s | 4686 | 35.66 | package/lib/npm.js |
| zstd 1.4.5 -4 | 338 MB/s | 1584 MB/s | 4510 | 34.32 | package/lib/npm.js |
| zstd 1.4.5 -5 | 275 MB/s | 1647 MB/s | 4455 | 33.90 | package/lib/npm.js |
| zstd 1.4.5 -6 | 216 MB/s | 1656 MB/s | 4439 | 33.78 | package/lib/npm.js |
| zstd 1.4.5 -7 | 140 MB/s | 1665 MB/s | 4422 | 33.65 | package/lib/npm.js |
| zstd 1.4.5 -8 | 101 MB/s | 1714 MB/s | 4416 | 33.60 | package/lib/npm.js |
| zstd 1.4.5 -9 | 97 MB/s | 1673 MB/s | 4410 | 33.56 | package/lib/npm.js |
| zstd 1.4.5 -10 | 97 MB/s | 1672 MB/s | 4410 | 33.56 | package/lib/npm.js |
| zstd 1.4.5 -11 | 37 MB/s | 1665 MB/s | 4371 | 33.26 | package/lib/npm.js |
| zstd 1.4.5 -12 | 27 MB/s | 1637 MB/s | 4336 | 33.00 | package/lib/npm.js |
| zstd 1.4.5 -13 | 20 MB/s | 1601 MB/s | 4310 | 32.80 | package/lib/npm.js |
| zstd 1.4.5 -14 | 18 MB/s | 1582 MB/s | 4309 | 32.79 | package/lib/npm.js |
| zstd 1.4.5 -15 | 18 MB/s | 1582 MB/s | 4309 | 32.79 | package/lib/npm.js |
| zstd 1.4.5 -16 | 9.03 MB/s | 1556 MB/s | 4305 | 32.76 | package/lib/npm.js |
| zstd 1.4.5 -17 | 8.86 MB/s | 1559 MB/s | 4305 | 32.76 | package/lib/npm.js |
| zstd 1.4.5 -18 | 8.86 MB/s | 1558 MB/s | 4305 | 32.76 | package/lib/npm.js |
| zstd 1.4.5 -19 | 8.86 MB/s | 1559 MB/s | 4305 | 32.76 | package/lib/npm.js |
| zstd 1.4.5 -20 | 8.85 MB/s | 1558 MB/s | 4305 | 32.76 | package/lib/npm.js |
| zstd 1.4.5 -21 | 8.86 MB/s | 1559 MB/s | 4305 | 32.76 | package/lib/npm.js |
| zstd 1.4.5 -22 | 8.86 MB/s | 1589 MB/s | 4305 | 32.76 | package/lib/npm.js |
| brotli 2019-10-01 -0 | 604 MB/s | 813 MB/s | 5182 | 39.43 | package/lib/npm.js |
| brotli 2019-10-01 -1 | 445 MB/s | 775 MB/s | 5148 | 39.18 | package/lib/npm.js |
| brotli 2019-10-01 -2 | 347 MB/s | 947 MB/s | 4727 | 35.97 | package/lib/npm.js |
| brotli 2019-10-01 -3 | 266 MB/s | 936 MB/s | 4645 | 35.35 | package/lib/npm.js |
| brotli 2019-10-01 -4 | 164 MB/s | 930 MB/s | 4559 | 34.69 | package/lib/npm.js |
| brotli 2019-10-01 -5 | 135 MB/s | 944 MB/s | 4276 | 32.54 | package/lib/npm.js |
| brotli 2019-10-01 -6 | 129 MB/s | 949 MB/s | 4257 | 32.39 | package/lib/npm.js |
| brotli 2019-10-01 -7 | 103 MB/s | 953 MB/s | 4244 | 32.30 | package/lib/npm.js |
| brotli 2019-10-01 -8 | 84 MB/s | 919 MB/s | 4240 | 32.27 | package/lib/npm.js |
| brotli 2019-10-01 -9 | 7.74 MB/s | 958 MB/s | 4237 | 32.24 | package/lib/npm.js |
| brotli 2019-10-01 -10 | 4.35 MB/s | 690 MB/s | 3916 | 29.80 | package/lib/npm.js |
| brotli 2019-10-01 -11 | 1.59 MB/s | 761 MB/s | 3808 | 28.98 | package/lib/npm.js |

This pretty much confirms what I’ve shown up to now. zstd provides faster decompression than either gzip or Brotli, and slightly edges out gzip in compression ratio. Brotli, on the other hand, has decompression speeds and compression ratios comparable to gzip at lower quality levels, but at levels 10 and 11 it edges out both gzip and zstd in compression ratio.

Everything is derivative

Now that I’ve finished with benchmarking, I need to step back and look at my original idea of replacing gzip as npm’s compression standard. As it turns out, Evan Hahn had a similar idea in 2022 and proposed an npm RFC. He proposed using Zopfli, a backwards-compatible gzip compression library, and Brotli’s older (and cooler 😎) sibling. Zopfli is able to produce smaller artifacts with the trade-off of a much slower compression speed. In theory, an easy win for the npm ecosystem. And if you watch the RFC meeting recording or read the meeting notes, everyone seems hugely in favour of the proposal. However, the one big roadblock that prevented this RFC from being immediately accepted, and that ultimately resulted in it being abandoned, was the lack of a native JavaScript implementation.

Learning from this earlier RFC and my results from benchmarking Brotli and zstd, what would it take to build a strong RFC of my own?

Putting it all together

Both Brotli and zstd’s reference implementations are written in C. And while there are a lot of ports on the npm registry using Emscripten or WASM, Brotli has had an implementation in Node.js’s zlib module since Node.js 10.16.0, released in May 2019. I opened an issue in Node.js’s GitHub repo to add support for zstd, but it’ll take a long time to make its way into an LTS release, never mind the rest of npm’s dependency chain. I was already leaning towards Brotli, and this just seals the deal.
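As a quick illustration, here’s a minimal sketch of round-tripping a buffer through Brotli with nothing but Node.js’s built-in zlib module:

const zlib = require('zlib');

const input = Buffer.from('{"name":"npm","version":"9.7.1"}');

const compressed = zlib.brotliCompressSync(input, {
  params: {
    // Quality 11 is Brotli's maximum (and Node.js's default)
    [zlib.constants.BROTLI_PARAM_QUALITY]: 11,
    // Hinting the input size can improve compression
    [zlib.constants.BROTLI_PARAM_SIZE_HINT]: input.length,
  },
});

const output = zlib.brotliDecompressSync(compressed);
console.log(output.equals(input)); // true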

Deciding on an algorithm is one thing, but implementing it is another. npm’s current support for gzip compression ultimately comes from Node.js itself. But the dependency chain between npm and Node.js is long and slightly different depending on if you’re packing or unpacking a package.

The dependency chain for packing, as in npm pack or npm publish, is:

npm → libnpmpack → pacote → tar → minizlib → zlib (Node.js)

But the dependency chain for unpacking (or ‘reifying’ as npm calls it), as in npm install or npm ci is:

npm → @npmcli/arborist → pacote → tar → minizlib → zlib (Node.js)

That’s quite a few packages that need to be updated, but thankfully the first steps have already been taken. Support for Brotli was added to minizlib 1.3.0 back in September 2019. I built on top of that and contributed Brotli support to tar, which is now available in version 6.2.0. It may take a while, but I can see a clear path forward.
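Here’s a rough sketch of what that enables from JavaScript, packing and unpacking with tar 6.2.0’s Brotli support (the brotli option shown here is my reading of the new API; treat the exact shape as an assumption):

const tar = require('tar');

// Pack: roughly what npm pack would do, but Brotli-compressed
tar.create({ file: 'npm-9.7.1.tar.br', brotli: true, sync: true }, ['package']);

// Unpack: roughly what npm install would do during reification
tar.extract({ file: 'npm-9.7.1.tar.br', brotli: true, sync: true });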

The final issue is backwards compatibility. This wasn’t a concern with Evan Hahn’s RFC, as Zopfli generates backwards-compatible gzip files. However, Brotli is an entirely new compression format, so I’ll need to propose a very careful adoption plan. The process I can see is:

  1. Support for packing and unpacking is added in a minor release of the current version of npm
    1. Unpacking using Brotli is handled transparently
    2. Packing using Brotli is disabled by default and only enabled if one of the following is true:
      1. The engines field in package.json is set to a version of npm that supports Brotli
      2. The engines field in package.json is set to a version of node that bundles a version of npm that supports Brotli
      3. Brotli support is explicitly enabled in .npmrc (see the sketch after this list)
  2. Packing using Brotli is enabled by default in the next major release of npm after the LTS version of Node.js that bundles it goes out of support
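To make the opt-in concrete, here’s roughly what that could look like. The engines field is real package.json configuration, but the exact npm version and the brotli key for .npmrc are hypothetical placeholders for whatever an RFC would settle on:

// package.json: opt in via an npm version that supports Brotli (illustrative version number)
{
  "engines": {
    "npm": ">=11"
  }
}

# .npmrc: hypothetical explicit opt-in flag
brotli=true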

Let’s say that Node.js 22 comes with npm 10, which has Brotli support. Node.js 22 will stop getting LTS updates in April 2027. Then, the next major version of npm after that date should enable Brotli packing by default.

I admit that this is an incredibly long transition period. However, it will guarantee that if you’re using a version of Node.js that is still being supported, there will be no visible impact to you. And it still allows early adopters to opt-in to Brotli support. But if anyone has other ideas about how to do this transition, I am open to suggestions.

What’s next?

As I wrap up my exploration into npm compression, I must admit that my journey has only just begun. There are still quite a few steps ahead. First and foremost, I need to do some more extensive benchmarking with the top 250 most-downloaded npm packages, instead of focusing on a single package. Once that’s complete, I need to draft an npm RFC and seek feedback from the wider community. If you’re interested in helping out, or just want to see how it’s going, you can follow me on Mastodon at @[email protected], or on Twitter at @Jamie_Magee.

Container Plumbing Days 2023—Windows containers: The forgotten stepchild

When it comes to Linux containers, there are plenty of tools out there that can scan container images, generate Software Bills of Materials (SBOMs), or list vulnerabilities. However, Windows container images are more like the forgotten stepchild in the container ecosystem. And that means we’re forgetting the countless developers using Windows containers, too.

Instead of allowing this gap to widen further, container tool authors—especially SBOM tools and vulnerability scanners—need to add support for Windows container images.

In my presentation at Container Plumbing Days 2023 I showed how to extract version information from Windows container images that can be used to generate SBOMs, as well as how to integrate with the Microsoft Security Updates API, which can provide detailed vulnerability information.

Your Jest tests might be wrong

Is your Jest test suite failing you? You might not be using the testing framework’s full potential, especially when it comes to preventing state leakage between tests. The Jest settings clearMocks, resetMocks, restoreMocks, and resetModules are set to false by default. If you haven’t changed these defaults, your tests might be fragile, order-dependent, or just downright wrong. In this blog post, I’ll dig into what each setting does, and how you can fix your tests.

clearMocks

First up is clearMocks:

Automatically clear mock calls, instances, contexts and results before every test. Equivalent to calling jest.clearAllMocks() before each test. This does not remove any mock implementation that may have been provided.

Every Jest mock has some context associated with it. It’s how you’re able to call functions like mockReturnValueOnce instead of only mockReturnValue. And because clearMocks is false by default, that context can be carried between tests.

Take this example function:

export function randomNumber() {
  return Math.random();
}

And this simple test for it:

jest.mock('.');

const { randomNumber } = require('.');

describe('tests', () => {
    randomNumber.mockReturnValue(42);

    it('should return 42', () => {
        const random = randomNumber();

        expect(random).toBe(42);
        expect(randomNumber).toBeCalledTimes(1)
    });
});

The test passes and works as expected. However, if we add another test to our test suite:

jest.mock('.');

const { randomNumber } = require('.');

describe('tests', () => {
    randomNumber.mockReturnValue(42);

    it('should return 42', () => {
        const random = randomNumber();

        expect(random).toBe(42);
        expect(randomNumber).toBeCalledTimes(1)
    });

    it('should return same number', () => {
        const random1 = randomNumber();
        const random2 = randomNumber();

        expect(random1).toBe(42);
        expect(random2).toBe(42);

        expect(randomNumber).toBeCalledTimes(2)
    });
});

Our second test fails with the error:

Error: expect(jest.fn()).toBeCalledTimes(expected)

Expected number of calls: 2
Received number of calls: 3

And even worse, if we change the order of our tests:

jest.mock('.');

const { randomNumber } = require('.');

describe('tests', () => {
    randomNumber.mockReturnValue(42);

    it('should return same number', () => {
        const random1 = randomNumber();
        const random2 = randomNumber();

        expect(random1).toBe(42);
        expect(random2).toBe(42);

        expect(randomNumber).toBeCalledTimes(2)
    });

    it('should return 42', () => {
        const random = randomNumber();

        expect(random).toBe(42);
        expect(randomNumber).toBeCalledTimes(1)
    });
});

We get the same error as before, but this time for 'should return 42' instead of 'should return same number'.

Enabling clearMocks in your Jest configuration ensures that every mock’s context is reset between tests. You can achieve the same result by adding jest.clearAllMocks() to your beforeEach() functions. But this isn’t a great idea as it means you have to remember to add it to each test file to make your tests safe, instead of using clearMocks to make them all safe by default.
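If you use a jest.config.js file, it’s a one-line change:

// jest.config.js
module.exports = {
    clearMocks: true,
};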

resetMocks

Next up is resetMocks:

Automatically reset mock state before every test. Equivalent to calling jest.resetAllMocks() before each test. This will lead to any mocks having their fake implementations removed but does not restore their initial implementation.

resetMocks takes clearMocks a step further, clearing not just a mock’s state but also any fake implementation it was given.

Going back to my first example again, I’m going to move the mock setup (randomNumber.mockReturnValue(42);) inside the first test case.

jest.mock('.');

const { randomNumber } = require('.');

describe('tests', () => {
    it('should return 42', () => {
        randomNumber.mockReturnValue(42);
        const random = randomNumber();

        expect(random).toBe(42);
        expect(randomNumber).toBeCalledTimes(1)
    });

    it('should return 42 twice', () => {
        const random1 = randomNumber();
        const random2 = randomNumber();

        expect(random1).toBe(42);
        expect(random2).toBe(42);

        expect(randomNumber).toBeCalledTimes(2)
    });
});

Logically, you might expect this to fail, but it passes! Jest mocks are global to the file they’re in. It doesn’t matter what describe, it, or test scope you use. And if I change the order of tests again, they fail. This makes it very easy to write tests that leak state and are order-dependent.

Enabling resetMocks in your Jest configuration ensures that every mock implementation is reset between tests. Like before, you can also add jest.resetAllMocks() to beforeEach() in every test file. But it’s a much better idea to make your tests safe by default instead of having to opt in to safe tests.

restoreMocks

Next is restoreMocks:

Automatically restore mock state and implementation before every test. Equivalent to calling jest.restoreAllMocks() before each test. This will lead to any mocks having their fake implementations removed and restores their initial implementation.

restoreMocks takes test isolation and safety to the next level.

Let me rewrite my example a little bit. Instead of mocking the function directly, I’m going to spy on Math.random() and mock its return value.

const { randomNumber } = require('.');

const spy = jest.spyOn(Math, 'random');

describe('tests', () => {
    it('should return 42', () => {
        spy.mockReturnValue(42);
        const random = randomNumber();

        expect(random).toBe(42);
        expect(spy).toBeCalledTimes(1)
    });

    it('should return 42 twice', () => {
        spy.mockReturnValue(42);

        const random1 = randomNumber();
        const random2 = randomNumber();

        expect(random1).toBe(42);
        expect(random2).toBe(42);

        expect(spy).toBeCalledTimes(2)
    });
});

With clearMocks and resetMocks enabled, and restoreMocks disabled, my tests pass. But if I enable restoreMocks both tests fail with an error message like:

Error: expect(received).toBe(expected) // Object.is equality

Expected: 42
Received: 0.503533695686772

restoreMocks has restored the original implementation of Math.random() before each test, so now I’m getting an actual random number instead of my mocked return value of 42. This forces me to be explicit about not only the mocked return values I’m expecting, but the mocks themselves.

To fix my tests I can set up my Jest mocks in each individual test.

const { randomNumber } = require('.');

describe('tests', () => {
    it('should return 42', () => {
        const spy = jest.spyOn(Math, 'random').mockReturnValue(42);
        const random = randomNumber();

        expect(random).toBe(42);
        expect(spy).toBeCalledTimes(1)
    });

    it('should return 42 twice', () => {
        const spy = jest.spyOn(Math, 'random').mockReturnValue(42);

        const random1 = randomNumber();
        const random2 = randomNumber();

        expect(random1).toBe(42);
        expect(random2).toBe(42);

        expect(spy).toBeCalledTimes(2)
    });
});

resetModules

Finally, we have resetModules:

By default, each test file gets its own independent module registry. Enabling resetModules goes a step further and resets the module registry before running each individual test. This is useful to isolate modules for every test so that the local module state doesn’t conflict between tests. This can be done programmatically using jest.resetModules().

Again, this builds on top of clearMocks, resetMocks, and restoreMocks. I don’t think this level of isolation is required for most tests, but I’m a completionist.

Let’s take my example from above and expand it to include some initialization that needs to happen before I can call randomNumber. Maybe I need to make sure there’s enough entropy to generate random numbers? My module might look something like this:

let isInitialized = false;

export function initialize() {
    isInitialized = true;
}

export function randomNumber() {
    if (!isInitialized) {
        throw new Error();
    }

    return Math.random();
}

I also want to write some tests to make sure that this works as expected:

const random = require('.');

describe('tests', () => {
    it('does not throw when initialized', () => {
        expect(() => random.initialize()).not.toThrow();
    });

    it('throws when not initialized', () => {
        expect(() => random.randomNumber()).toThrow();
    });
});

initialize shouldn’t throw an error, but randomNumber should throw an error if initialize isn’t called first. Great! Except it doesn’t work. Instead I get:

Error: expect(received).toThrow()

Received function did not throw

That’s because without enabling resetModules, the module is shared between all tests in the file. So when I called random.initialize() in my first test, isInitialized is still true for my second test. But once again, if I were to switch the order of my tests in the file, they would both pass. So my tests are order-dependent again!

Enabling resetModules gives each test in the file a fresh version of the module. Though, this might actually be a case where you want to call jest.resetModules() in your beforeEach() instead of enabling it globally. This kind of isolation isn’t required for every test. And if you’re using import instead of require, the syntax can get very awkward very quickly if you’re trying to avoid an 'import' and 'export' may only appear at the top level error.
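Here’s a minimal sketch of that pattern, re-requiring the module in beforeEach() so every test gets a fresh, uninitialized copy:

describe('tests', () => {
    let random;

    beforeEach(() => {
        jest.resetModules();   // fresh module registry for this test...
        random = require('.'); // ...so this require gets a new, uninitialized copy
    });

    it('throws when not initialized', () => {
        expect(() => random.randomNumber()).toThrow();
    });

    it('does not throw when initialized', () => {
        random.initialize();
        expect(() => random.randomNumber()).not.toThrow();
    });
});

With this setup, the tests pass regardless of the order they run in.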

TL;DR reset all of the things

By default, Jest tests are only isolated at the file level. If you really want to make sure your tests are safe and isolated, add this to your Jest config:

{
  clearMocks: true,
  resetMocks: true,
  restoreMocks: true,
  resetModules: true // It depends
}

There is a suggestion to make this part of the default configuration. But until then, you’ll have to do it yourself.

Maintaining AUR packages with Renovate

One big advantage that Arch Linux has over other distributions, apart from being able to say “BTW I use Arch.”, is the Arch User Repository (AUR). It’s a community-driven repository with over 80,000 packages. If you’re looking for a package, chances are you’ll find it in the AUR.

Keeping all those packages up to date takes a lot of manual effort from a lot of volunteers. People have created and used tools, like urlwatch and aurpublish, to let them know when upstream releases are cut and to automate some parts of the process. I know I do. But I wanted to automate the entire process. I think Renovate can help here.

Updating versions with Renovate

Renovate is an automated dependency update tool. You might have seen it opening pull requests on GitHub and making updates for npm or other package managers, but it’s a lot more powerful than just that.

Renovate has a couple of concepts that I need to explain first: datasources and managers. Datasources define where to look for new versions of a dependency. Renovate comes with over 50 different datasources, but the one that is important for AUR packages is the git-tags datasource. Managers are the Renovate concept for package managers. There isn’t an AUR or PKGBUILD manager, but there is a regex manager that I can use.

I can create a renovate.json configuration with the following regex manager configuration:

{
  "regexManagers": [
    {
      "fileMatch": ["(^|/)PKGBUILD$"],
      "matchStrings": [
        "pkgver=(?<currentValue>.*) # renovate: datasource=(?<datasource>.*) depName=(?<depName>.*)"
      ],
      "extractVersionTemplate": "^v?(?<version>.*)$"
    }
  ]
}

Breaking that down:

  • The fileMatch setting tells Renovate to look for any PKGBUILD files in a repository
  • The matchStrings is the regex format to extract the version, datasource, and dependency name from the PKGBUILD
  • The extractVersionTemplate is to handle a “v” in front of any version number that is sometimes added to Git tags

And here’s an extract from the PKGBUILD for the bicep-bin AUR package that I maintain:

pkgver=0.15.31 # renovate: datasource=github-tags depName=Azure/bicep

Here I’m configuring Renovate to use the github-tags datasource, which means it’ll look at the list of tags in the Azure/bicep GitHub repository for new versions. If Renovate finds any, it’ll automatically update the PKGBUILD and open a pull request with the updated version.

So I’ve automated the PKGBUILD update, but that’s only half of the work. The checksums and .SRCINFO must be updated before pushing to the AUR. Unfortunately, Renovate can’t do that (yet, see Renovate issue #16923), but GitHub Actions can!

Updating checksums and .SRCINFO with GitHub Actions

Updating the checksums with updpkgsums is easy, and generating an updated .SRCINFO with makepkg --printsrcinfo > .SRCINFO is straightforward too. But doing that for a whole repository of AUR packages is going to be a little trickier. So let me build up the GitHub Actions workflow step by step.

First, I only want to run this workflow on pull requests targeting the main branch.

on:
  pull_request:
    types:
      - opened
      - synchronize
    branches:
      - main

Next, I’m going to need to check out the entire history of the repository, so I can compare the files changed in the latest commit with the Git history.

jobs:
  updpkgsums:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@ac593985615ec2ede58e132d2e21d2b1cbd6127c # v3.3.0
        with:
          fetch-depth: 0
          ref: ${{ github.ref }}

Getting the package that changed in a pull request requires a little bit of shell magic.

- name: Find updated package
  run: |
    #!/usr/bin/env bash
    set -euxo pipefail

    echo "pkgbuild=$(git diff --name-only origin/main origin/${GITHUB_HEAD_REF} "*PKGBUILD" | head -1 | xargs dirname)" >> $GITHUB_ENV

Now that I’ve found the package that changed in the Renovate pull request, I can update the files.

This step in the workflow uses a private GitHub Action that I have in my aur-packages repository. I’m not going to break it down here, but at its core it runs updpkgsums and makepkg --printsrcinfo > .SRCINFO with a little extra configuration required to run Arch Linux on GitHub Actions runners. You can check out the full code on GitHub.

- name: Validate package
  if: ${{ env.pkgbuild != '' }}
  uses: ./.github/actions/aur
  with:
    action: 'updpkgsums'
    pkgname: ${{ env.pkgbuild }}

Finally, once the PKGBUILD and .SRCINFO are updated I need to commit that change back to the pull request.

- name: Commit
  if: ${{ env.pkgbuild != '' }}
  uses: stefanzweifel/git-auto-commit-action@3ea6ae190baf489ba007f7c92608f33ce20ef04a # v4.16.0
  with:
    file_pattern: '*/PKGBUILD */.SRCINFO'

Check out this pull request for bicep-bin where Renovate opened a pull request, and my GitHub Actions workflow updated the b2sums in the PKGBUILD and updated the .SRCINFO.

But why stop there? Let’s talk about publishing.

Publishing to the AUR

Each AUR package is its own Git repository. So to update a package in the AUR, I only need to push a new commit with the updated PKGBUILD and .SRCINFO. Thankfully, KSXGitHub created the github-actions-deploy-aur GitHub Action to streamline the whole process.

If I create a new GitHub Actions workflow to publish to the AUR, I can reuse the first two steps from my previous workflow to check out the repository and find the updated package. Then all I need to do is to use the github-actions-deploy-aur GitHub Action:

- name: Publish package
  uses: KSXGitHub/github-actions-deploy-aur@065b6056b25bdd43830d5a3f01899d0ff7169819 # v2.6.0
  if: ${{ env.pkgbuild != '' }}
  with:
    pkgname: ${{ env.pkgbuild }}
    pkgbuild: ${{ env.pkgbuild }}/PKGBUILD
    commit_username: ${{ secrets.AUR_USERNAME }}
    commit_email: ${{ secrets.AUR_EMAIL }}
    ssh_private_key: ${{ secrets.AUR_SSH_PRIVATE_KEY }}

All together now

If you own any AUR packages and want to automate some of the maintenance burden, check out my AUR packages template GitHub repository. It contains all of the steps I showed in this blog post. And if you want to see how it works in practice, check out my AUR packages GitHub repository.

Scanning Windows container images is (surprisingly) easy!

When it comes to Linux containers, there are plenty of tools out there that can scan container images, generate Software Bills of Materials (SBOMs), or list vulnerabilities. However, Windows container images are more like the forgotten stepchild in the container ecosystem. And that means we’re forgetting the countless developers using Windows containers, too.

I wanted to see what I’d need to make scanning tools for Windows container images. Turns out it’s pretty easy. So easy, in fact, I think the existing container tools should add support for Windows container images.

What version of Windows am I running?

The first question I needed to answer was: what version of Windows was the container image based on? This tells me what date the container image is from, what updates are applicable, and what vulnerabilities it has.

Container images are really just tar files, and Windows container images are no different. So first I saved a Windows container image locally using skopeo:

$ skopeo --insecure-policy --override-os windows copy docker://mcr.microsoft.com/windows/nanoserver:ltsc2022 dir:///tmp/nanoserver
$ ls /tmp/nanoserver
0db1879370e5c72dae7bff5d013772cbbfb95f30bfe1660dcef99e0176752f1c  7d843aa7407d9a5b1678482851d2e81f78b08185b72c18ffb6dfabcfed383858 manifest.json version

Next, I inspected the manifest using jq to find the layer that had the Windows files.

$ jq . manifest.json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 638,
    "digest": "sha256:0db1879370e5c72dae7bff5d013772cbbfb95f30bfe1660dcef99e0176752f1c"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.foreign.diff.tar",
      "size": 304908800,
      "digest": "sha256:7d843aa7407d9a5b1678482851d2e81f78b08185b72c18ffb6dfabcfed383858"
    }
  ]
}

I then extracted the layer and fixed the permissions.

$ mkdir layer
$ tar -xf 7d843aa7407d9a5b1678482851d2e81f78b08185b72c18ffb6dfabcfed383858 -C ./layer/
$ sudo find ./layer -type f -exec chmod 0644 {} \;
$ sudo find ./layer -type d -exec chmod 0755 {} \;
$ ls -lah layer/
total 16K
drwxr-xr-x 4 jamie users 4.0K Dec 28 15:05 .
drwxr-xr-x 3 jamie users 4.0K Dec 28 15:00 ..
drwxr-xr-x 5 jamie users 4.0K Dec  9 01:18 Files
drwxr-xr-x 3 jamie users 4.0K Dec  9 01:22 UtilityVM
$ ls -lah layer/Files/
total 28K
drwxr-xr-x  5 jamie users 4.0K Dec  9 01:18 .
drwxr-xr-x  4 jamie users 4.0K Dec 28 15:05 ..
-rw-r--r--  1 jamie users 5.6K Dec  9 01:18 License.txt
drwxr-xr-x  4 jamie users 4.0K May  7  2021 ProgramData
drwxr-xr-x  6 jamie users 4.0K Dec  9 01:19 Users
drwxr-xr-x 20 jamie users 4.0K Dec  9 01:19 Windows

Inside the extracted layer there are two directories: Files and UtilityVM. Files had the filesystem of the Windows container image, while UtilityVM is used by Hyper-V behind the scenes. So I just needed to focus on Files.

How did I figure out the specific version of Windows the container is running? From the registry of course! The SOFTWARE registry hive contained information about installed software, including Windows itself, and was found at Files/Windows/System32/config/SOFTWARE.

Thankfully, there’s a great NuGet package called Registry that let me easily load and parse the registry, but there are also packages for Go, Rust, and even Node.js.

using Registry;

var registryHive = new RegistryHive("/tmp/nanoserver/layer/Files/Windows/System32/config/SOFTWARE");
registryHive.ParseHive();
var currentVersion = registryHive.GetKey(@"Microsoft\Windows NT\CurrentVersion");
var fullVersion =
    $"{currentVersion.GetValue("CurrentMajorVersionNumber")}.{currentVersion.GetValue("CurrentMinorVersionNumber")}.{currentVersion.GetValue("CurrentBuildNumber")}.{currentVersion.GetValue("UBR")}";
Console.WriteLine(fullVersion);

Running this code, I got version 10.0.20348.1366, which was apparently released on 13th December 2022.

What about Windows updates?

The version of Windows doesn’t tell the whole story. There are also updates that can be applied on top. You might have seen them referred to by their KB number, for example KB1234567. Information on what updates have been applied is also stored in the registry.

By extending my earlier code, I can find out what updates this container image has.

var packages = registryHive.GetKey(@"Microsoft\Windows\CurrentVersion\Component Based Servicing\Packages");
var updatePackageRegex = new Regex(@"^Package_\d+_for_(KB\d+)~\w{16}~\w+~~((?:\d+\.){3}\d+)$");

var updates = new Dictionary<string, string>();
foreach (var packageKey in packages.SubKeys)
{
    if (!updatePackageRegex.IsMatch(packageKey.KeyName))
    {
        continue;
    }

    var currentState = packageKey.Values.Find(v => v.ValueName == "CurrentState")?.ValueData;

    // Installed
    if (currentState == "112")
    {
        var groups = updatePackageRegex.Match(packageKey.KeyName).Groups;
        updates[groups[1].Value] = groups[2].Value;
    }
}

foreach (var update in updates)
{
    Console.WriteLine($"{update.Key}: {update.Value}");
}

Running this gave me a single update: KB5020373: 20348.1300.1.0. Searching online for KB5020373 led me to the documentation for the update. It’s the November 2022 security update for .NET Framework and has a fix for CVE-2022-41064.

Done! …Now what if we scaled this?

It turns out it’s not that difficult to find out info about Windows container images. It took me a couple of hours to figure out, but that’s only because no one seems to have done this before. The actual code is only about 30 lines.

Windows containers are widely used for legacy applications, like .NET Framework applications, that haven’t been rewritten but could benefit from the cloud. All of the big three cloud providers offer managed Kubernetes services that support Windows nodes out of the box (yes, Kubernetes supports Windows nodes). There is clearly a demand for Windows containers, but there is a gap in the kind of container tooling that has sprung up for Linux containers.

Instead of allowing this gap to widen further, I think that container tool authors—especially SBOM tools and vulnerability scanners—should add support for Windows container images. These tools should then correlate the extracted information with the Microsoft Security Research Center (MSRC) API. MSRC publishes information every month on security updates. Comparing the Windows version from a container image with the fixed versions provided by the MSRC API, you could easily see your container image’s security vulnerabilities.

As my proof-of-concept has shown, it’s low-hanging fruit. A small addition that would have a big impact for the many forgotten developers and the applications they work on.