06 Feb 2018, 20:40

Making Immutable Infrastructure simpler with LinuxKit

Config Management Camp

I gave this talk at Config Management Camp 2018, in Gent. This is a great event, and I recommend you go if you are interested in systems and how to make them work.

Did I mention that Gent is a lovely Belgian town?

The slides can be downloaded here.

The video should be available shortly.

Below is a brief summary.


Some history of the ideas behind immutability.

“The self-modifying behavior of both manual and automatic administration techniques helps explain the difficulty and expense of maintaining high availability and security in conventionally-administered infrastructures. A concise and reliable way to describe any arbitrary state of a disk is to describe the procedure for creating that state.”

Steve Traugott, Why Order Matters: Turing Equivalence in Automated Systems Administration, 2002

“In the cloud, we know exactly what we want a server to be, and if we want to change that we simply terminate it and launch a new server with a new AMI.”

Netflix Building with Legos, 2011

“As a system administrator, one of the scariest things I ever encounter is a server that’s been running for ages. If you absolutely know a system has been created via automation and never changed since the moment of creation, most of the problems disappear.”

Chad Fowler, Trash Your Servers and Burn Your Code, 2013

“Use container-specific OSes instead of general-purpose ones to reduce attack surfaces. When using a container-specific OS, attack surfaces are typically much smaller than they would be with a general-purpose OS, so there are fewer opportunities to attack and compromise a container-specific OS.”

NIST Application Container Security Guide, 2017


Updating software is a hard thing to do. Sometimes you can update a config file and send a SIGHUP; other times you have to kill the process. Updating a library may mean restarting everything that depends on it. Changing the Docker configuration potentially means restarting all the containers. Usually only Erlang programs self-update correctly. Our tooling has a domain-specific view of how to do all this, but it is difficult, and usually there is some downtime on a single machine. But in a distributed system we always allow for single-machine downtime, so why be so hardcore about updates? Just restart the machine with a new image: that is the immutable infrastructure idea. Not immutable, just disposable.
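The config-file-plus-SIGHUP convention is worth a tiny sketch. This is a hypothetical Python daemon fragment, not any particular tool; the config path and handler are made up for illustration:

```python
import os
import signal

CONFIG_PATH = "/etc/myapp.conf"  # hypothetical config file
config = {"reloads": 0}

def reload_config(signum, frame):
    # A real daemon would re-read CONFIG_PATH here; we just count reloads.
    config["reloads"] += 1

signal.signal(signal.SIGHUP, reload_config)

# An operator (or a process manager) would run: kill -HUP <pid>
os.kill(os.getpid(), signal.SIGHUP)
```

Note how much machinery even this trivial case needs; and it only covers config, not code or library updates.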


Immutability does not mean there is no state. Twelve factor apps are not that interesting; everything has data. But we have built Unix systems on the assumption that state is mixed up everywhere in the filesystem. We want to try to split immutable code from mutable application state.

Functional programming is a useful model. There is state in functional programs, but it is always explicit not implicit. Mutable global state is the thing that functional programming was a reaction against. Control and understand your state mutation.

Immutability was something that we made people do for containers; well, Docker did. LXC said to treat containers like VMs; Docker said to treat them as immutable. Docker had better usability, and somehow we managed to get people to think they couldn’t update container state dynamically and should just redeploy. Sometimes people invent tooling to update containers in place, with Puppet or Chef or whatever; those people are weird.

The hard problems are about distributed systems. Really hard. We can’t even know what the state is. These are the interesting configuration management problems. Focus on these. Make the individual machine as simple as possible, and just think about the distribution issues. Those are really hard. You don’t want configuration drift on machines messing up your system, there are plenty of ways to mess up distributed systems anyway.


Why are there no immutable system products? The sales model does not work well for something that operates only at build time and does not run on your infrastructure; the billing models for config management products don’t really fit. Immutable system tooling is likely to remain open source and community led for now. Cloud vendors may well be selling you products based on immutable infrastructure, though.


LinuxKit was originally built for Docker for Mac. We needed a simple embedded, maintainable, invisible Linux host system to run Docker. The first commit message said “not required: self update: treated as immutable”. This project became LinuxKit, open sourced in 2017. The only kind of related tooling is Packer, but that is much more complex. One of the goals for LinuxKit was that you should be able to build an AWS AMI from your laptop without actually booting a machine. Essentially LinuxKit is a filesystem manipulation tool, based on containers.

LinuxKit is based on a really simple model, the same as a Kubernetes pod. First a sequential series of containers runs, to set up the system state, then containerd runs the main services. This config corresponds to the yaml config file, which itself is used to build the filesystem. Additional tooling lets you build any kind of disk format, for EFI or BIOS, such as ISOs, disk images or initramfs. There are development tools to run images on cloud providers and locally, but you can use any tooling, such as Terraform for production workloads.
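A minimal config in this style looks roughly like the following; the image names and tags are illustrative, not exact, so check the LinuxKit examples for current ones:

```yaml
kernel:
  image: linuxkit/kernel:4.14.x        # illustrative tag
  cmdline: "console=ttyS0"
init:
  - linuxkit/init:<hash>               # sets up the base system
  - linuxkit/runc:<hash>
  - linuxkit/containerd:<hash>
onboot:                                # containers run sequentially at boot
  - name: dhcpcd
    image: linuxkit/dhcpcd:<hash>
    command: ["/sbin/dhcpcd", "--nobackground", "-1"]
services:                              # long-running services under containerd
  - name: sshd
    image: linuxkit/sshd:<hash>
```

The build tooling reads a file like this and assembles the corresponding filesystem and boot artefacts, without booting anything.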

Why are people not using immutable infrastructure?

Lack of tooling is one thing. Packer is really the only option other than LinuxKit, and it has a much more complex workflow involving booting a machine to install. This makes a CI pipeline much more complex. There are also nearly immutable distros like Container Linux, but this is very hard to customise compared to LinuxKit.


This is a very brief summary of the talk. Please check out LinuxKit; it is an easy, different and fun way to use Linux.

21 Jan 2018, 23:00

Using the Noise Protocol Framework to design a distributed capability system


In order to understand this blog you should know about capability-based security. Perhaps still the best introduction, especially if you are mainly familiar with role based access control, is the Capability Myths Demolished paper.

You will also need to be familiar with the Noise Protocol Framework. Noise is a fairly new crypto meta protocol, somewhat in the tradition of the NaCl Cryptobox: protocols you can use easily without error. It is used in modern secure applications like Wireguard. Before reading the specification this short (20m) talk from Real World Crypto 2018 by Trevor Perrin, the author, is an excellent introduction.


Our stacks have become increasingly complicated. One of the things I have been thinking about is protocols for lighter weight interactions. The smaller services get, and the more we want high performance from them, the worse protocols designed for large scale monoliths perform. We cannot replace larger scale systems with nanoservices, serverless and Edge services if they cannot perform. In addition to performance, we need scaleable security and identity for nanoservices. Currently nanoservices and serverless are not really competitive in performance with larger monolithic code, which can serve millions of requests a second. Current serverless stacks hack around this by persisting containers for minutes at a time to answer a single request. Edge devices need simpler protocols too; you don’t really want gRPC in microcontrollers. I will write more about this in future.

Noise is a great framework for simple secure crypto. In particular, we need understandable guarantees on the properties of the crypto between services. We also need a workable identity model, which is where capabilities come in.


Capability systems, and especially distributed capability systems, are not terribly widely used at present. Early designs included KeyKOS, and the E language, whose design has been taken up by the Cap’n Proto RPC protocol. Pony is also capability based, although its deny capabilities are somewhat different. Many systems include some capability-like pieces though; Unix file descriptors are capabilities, for example, which is why file descriptor based Unix APIs are so useful for security.

With a large number of small services, we want to give out fine grained capabilities. With dynamic services, this is much the most flexible way of identifying and authorizing services. Capabilities are inherently decentralised, with no CAs or other centralised infrastructure; services can create and distribute capabilities independently, and decide on their trust boundaries. Of course you can also use them just for systems you control and trust too.

While it has been recognised for quite a while that there is [an equivalence between public key cryptography and capabilities](http://www.cap-lore.com/CapTheory/Dist/PubKey.html), this has not been used much. I think part of the reason is that historically public key cryptography was slow, but of course computers are faster now, and encryption is much more important.

The correspondence works as follows. In order for Alice to send an encrypted message to Bob, she must have his public key. Usually, people just publish public keys so that anyone can send them messages, but if you do not necessarily do this, things get more interesting. Possession of Bob’s public key gives the capability of talking to Bob; without it you cannot construct an encrypted message that Bob can decode. Actually it is more useful to think of the keys in this case not as belonging to people but to roles or services. Having the service public key allows connecting to it; having the private key lets you serve it. Note you still need to find the service location to connect; a hash of the public key could be a suitable DNS name.
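The last idea, deriving a DNS name from a hash of the public key, is easy to sketch. This is only an illustration of the naming scheme, assuming nothing about the key beyond it being bytes; the function name and truncation length are choices of mine, not an established convention:

```python
import base64
import hashlib

def service_dns_label(public_key: bytes) -> str:
    # Hash the key, truncate, and base32-encode: the result fits DNS's
    # case-insensitive, letters-and-digits label rules (max 63 chars).
    digest = hashlib.sha256(public_key).digest()
    return base64.b32encode(digest[:15]).decode().lower()

label = service_dns_label(b"example public key bytes")
```

Anyone holding the capability can derive the same name, so lookup needs no registry beyond ordinary DNS.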

On single hosts, capabilities are usually managed by a privileged process, such as the operating system. This can give out secure references, such as small integers like file descriptors, or object pointers protected by the type system. These methods don’t really work in a distributed setup, and capabilities need a representation on the wire. One of the concerns in the literature is that if a (distributed) capability is just a string of (unguessable) bits that can be distributed, it might get distributed maliciously. There are two aspects to this. First, if a malicious agent has a capability at all, it can use it maliciously, including proxying for other malicious users, if it has network access; so being able to pass the capability on is no worse. Generally, only pass capabilities to trusted code, ideally code that is confined by (lack of) capabilities in where it can communicate and does not have access to other back channels. Don’t run untrusted code. As for keys being exfiltrated unintentionally, this is an issue we already have with private keys; with capabilities, all keys become things that, especially in these times of Spectre, we have to be very careful with. Mechanisms that avoid simply passing keys, and pass references instead, seem to me to be more complicated and likely to have their own security issues.

Using Noise to build a capability framework

The Noise spec says “The XX pattern is the most generically useful, since it supports mutual authentication and transmission of static public keys.” However, we will see that there are different options that make sense for our use case. The XX pattern allows two parties who do not know each other to communicate and exchange keys. The XX exchange still requires some sort of authentication, such as certificates, to decide whether the two parties should trust each other.

Note that Trevor Perrin pointed out that just using a public key is dangerous and using a pre-shared key (psk) in addition is a better design. So you should use psk+public key as the capability. This means that accidentally sharing the public key in a handshake is not a disastrous event.

When using keys as capabilities, though, we always know the public key (aka capability) of the service we want to connect to. In Noise spec notation, that is all the patterns with <- s in the pre-message pattern. This indicates that, prior to the start of the handshake phase, the responder (service) has sent their public key to the initiator (directly or indirectly); that is, in capability speak, the initiator of the communication possesses the capability required to connect to the service. So these patterns are the ones that correspond to capability systems; for the interactive patterns that is NK, KK and XK.

NK corresponds to a communication where the initiator does not provide any identification. This is the normal situation for many capability systems; once you have the capability you perform an action. If the capability is public, or widely distributed, this corresponds more or less to a public web API, although with encryption.

XK and IK are the equivalent, in web terms, of providing a (validated) login along with the connection. The initiator passes a public key (which could be a capability, or just used as a key) during the handshake. If you want to store some data attached to the identity used as the passed public key, this handshake makes sense. Note that the initiator can create any number of public keys, so the key is not a unique identifier, just one chosen identity. IK has the same semantics but a different, shorter handshake with slightly different security properties; it is the one used by WireGuard.

KK is an unusual handshake in traditional capability terms; it requires that both parties know in advance each other’s public key, ie that there is in a sense a mutual capability arrangement, or rendezvous. You could just connect with XK and then check the key, but having this in the handshake may make sense. An XK handshake could be a precursor to a KK relationship in future.

In addition to the more common two way handshakes, Noise supports unidirectional one way messages. It is not common at present to use public key encryption for offline messages, such as encrypting a file or database record; usually symmetric keys are used. The Noise one way patterns use public keys, and all three of N, X and K require the recipient’s public key (otherwise they would not be confidential), so they all correspond to capability based exchanges. Just like the interactive patterns, they can be anonymous, or pass or require keys. These patterns have the disadvantage that there is no replay protection, as the receiver cannot provide an ephemeral key, but for offline uses, such as store and forward, or file or database encryption, that does not apply. Unlike symmetric keys for this use case, there are separate sender and receiver roles, so the ability to read database records does not imply the ability to forge them, improving security. It also fits much better in a capabilities world, and is simpler as there is only one type of key, rather than two types and complex key management.


I don’t have an implementation right now. I was working on a prototype previously, but using Noise XX and still trying to work out what to do for authentication. I much prefer this design, which answers those questions. There are a bunch of practicalities that are needed to make this usable, and some conventions and guidelines around key usage.

We can see that we can use the Noise Protocol as a set of three interactive and three one way patterns for capability based exchanges. No additional certificates or central source of authorisation is needed other than public and private key pairs for Diffie Hellmann exchanges. Public keys can be used as capabilities; private keys give the ability to provide services. The system is decentralised, encrypted and simple. And there are interesting properties of mutual capabilities that can be used if required.

23 Dec 2017, 12:40

Eleven syscalls that rock the world

People ask me “Justin, what are your favourite system calls, that you look forward to their return codes?” Here I reveal all.

If you are in a bad mood you might prefer sucky syscalls that should never have existed.

0. read

You cannot go wrong with a read. You can barely EFAULT it! On Linux amd64 it is syscall zero. If all its arguments are zero it returns zero. Cool!
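You can check this from userspace. A sketch using ctypes to make the raw syscall on Linux amd64 (a platform assumption; other architectures number read differently): a zero-byte read never touches the buffer, so a NULL pointer is fine, and if fd 0 is open then syscall(0, 0, 0, 0) really does return 0.

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)

# read(fd, NULL, 0): a zero-byte read just validates the descriptor
# and returns 0, without ever dereferencing the buffer pointer.
fd = os.open("/dev/null", os.O_RDONLY)
result = libc.syscall(0, fd, 0, 0)  # syscall number 0 is read on amd64
os.close(fd)
```
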

1. pipe

The society for the preservation of historic calling conventions is very fond of pipe, as in many operating systems and architectures it preserves the fun feature of returning both of the file descriptors as return values. At least Linux MIPS does, and NetBSD does even on x86 and amd64. Multiple return values are making a comeback in languages like Lua and Go, and while C has always had a bit of a funny thing about them, they have long been supported in many calling conventions, so let us use them in syscalls! Well, one syscall.
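Python’s wrapper hides the calling convention, but the two descriptors still come back together as a tuple, which is the same spirit:

```python
import os

r, w = os.pipe()       # both file descriptors "returned" at once
os.write(w, b"hello")
data = os.read(r, 5)
os.close(r)
os.close(w)
```
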

2. kqueue

When the world went all C10K on our ass, and scaleable polling was a thing, Linux went epoll, the BSDs went kqueue and Solaris went /dev/poll. The nicest interface was kqueue, while epoll is some mix of edge and level triggered semantics and design errors, so bugs are still being found.
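The flavour of these APIs is easy to show. This sketch uses Linux’s epoll via Python (select.kqueue is the BSD equivalent in the same module); it assumes it is running on Linux:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)   # level-triggered by default

os.write(w, b"x")
events = ep.poll(timeout=1)      # list of (fd, eventmask) pairs

ep.close()
os.close(r)
os.close(w)
```
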

3. unshare

Sounds like a selfish syscall, but this generous syscall is the basis of Linux namespaces, allowing a process to isolate its resources. Containers are built from unshares.

4. setns

If you liked unshare, its younger but cooler friend takes file descriptors for namespaces. Pass it down a unix socket to another process, or stash it for later, and do that namespace switching. All the best system calls take file descriptors.
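Passing a file descriptor down a unix socket looks like this in Python 3.9+ (socket.send_fds wraps SCM_RIGHTS). A pipe fd stands in for a namespace fd here, since opening /proc/&lt;pid&gt;/ns/* needs privileges:

```python
import os
import socket

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
os.write(w, b"ns")

# SCM_RIGHTS under the hood: the kernel duplicates the fd into the receiver.
socket.send_fds(parent, [b"fd coming"], [r])
msg, fds, flags, addr = socket.recv_fds(child, 1024, 1)

data = os.read(fds[0], 2)   # the received descriptor is usable directly
parent.close()
child.close()
```

With a real namespace fd, the receiver would hand `fds[0]` to setns and switch over.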

5. execveat

Despite its somewhat confusing name (FreeBSD has the saner fexecve, but other BSDs do not have support last time I checked), this syscall finally lets you execute a program just given a file descriptor for the file. I say finally, as Linux only implemented this in 3.19, which means it is hard to rely on it (yeah, stop using those stupid old kernels folks). Before that Glibc had a terrible userspace implementation that is basically useless. Perfect for creating sandboxes, as you can sandbox a program into a filesystem with nothing at all in, or with a totally controlled tree, by opening the file to execute before chroot or changing the namespace.
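Python exposes this on Linux by letting os.execve take a file descriptor in place of a path. A sketch, forking first since exec replaces the process; the opened fd could equally have been obtained before a chroot or namespace change:

```python
import os
import sys

pid = os.fork()
if pid == 0:
    # Child: open the interpreter binary, then exec it by descriptor.
    try:
        fd = os.open(sys.executable, os.O_RDONLY)
        os.execve(fd, [sys.executable, "-c", "raise SystemExit(0)"],
                  dict(os.environ))
    finally:
        os._exit(1)          # only reached if the exec failed
_, status = os.waitpid(pid, 0)
exitcode = os.waitstatus_to_exitcode(status)
```
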

6. pdfork

Too cool for Linux, you have to head out to FreeBSD for this one. Like fork, but you get a file descriptor for the process not a pid. Then you can throw it in the kqueue or send it to another process. Once you have tried process descriptors you will never go back.

7. signalfd

You might detect a theme here, but if you have ever written traditional 1980s style signal handlers you know how much they suck. How about turning your signals into messages that you can read on, you guessed it, file descriptors. Like, usable.
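Python has no signalfd wrapper in the stdlib, but signal.set_wakeup_fd gives the same flavour: delivered signals turn into bytes you read from, you guessed it, a file descriptor. A sketch, assuming a Unix platform:

```python
import os
import signal

r, w = os.pipe()
os.set_blocking(w, False)            # the wakeup fd must be non-blocking
signal.set_wakeup_fd(w)
signal.signal(signal.SIGUSR1, lambda s, f: None)  # don't die on SIGUSR1

os.kill(os.getpid(), signal.SIGUSR1)
nums = os.read(r, 16)                # one byte per delivered signal number
signal.set_wakeup_fd(-1)
```

Now the signal is just data: you can poll it, queue it, and handle it in your main loop instead of in async-signal-safe purgatory.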

8. wstat

This one is from Plan 9. It does the opposite of stat and writes the same structure. Simples. Avoids having chmod, chown, rename, utime and so on, by the simple expedient of making the syscall symmetric. Why not?

9. clonefile

The only cool syscall on OSX, and only supported on the new APFS filesystem. Copies whole files or directories in a single syscall, using copy on write for all the data. Look on my works, copy_file_range, and despair.

10. pledge

The little sandbox that worked. OpenBSD only here; they managed to make a simple sandbox that was practical for real programs, like the base OpenBSD system. Capsicum from FreeBSD (and promised for Linux for years, but no sign) is a lovely design, and gave us pdfork, but it’s still kind of difficult and intrusive to implement. Linux has, well, seccomp, LSMs, and still nothing that usable for the average program.

23 Dec 2017, 12:40

Eleven syscalls that suck

People ask me “Justin, what system calls suck so much, suck donkeys for breakfast, like if Donald Trump were a syscall?” Here I reveal all.

If you don’t like sucky syscalls, try the eleven non sucky ones.

0. ioctl

It can’t decide if its arguments are integers, strings, or some struct that is lost in the mists of time. Make up your mind! Plan 9 was invented to get rid of this.

1. fcntl

Just like ioctl but for some different miscellaneous operations, because one miscellany is not enough.

2. tuxcall

Linux put a web server in the kernel! To win a benchmark contest with Microsoft! It had its own syscall! My enum tux_reactions are YUK! Don’t worry though, it was a distro patch (thanks Red Hat!) and never made it upstream, so only the man page and reserved number survive to taunt you and remind you that the path of the righteous is beset by premature optimization!

3. io_setup

The Linux asynchronous IO syscalls are almost entirely useless! Almost nothing works! You have to use O_DIRECT for a start. And then they still barely work! They have one use, benchmarking SSDs, to show what speed you could get if only there was a usable API. Want async IO in kernel? Use Windows!

4. stat, and its friends and relatives

Yes this one is useful, but can you find the data structure it uses? We have oldstat, oldfstat, ustat, oldlstat, statfs, fstatfs, stat, lstat, fstat, stat64, lstat64, fstat64, statfs64, fstatfs64, fstatat64 for stating files and links and filesystems in Linux. A new bunch will be along soon for Y2038. Simplify your life, use a BSD, where they cleaned up the mess as they did the cooking! Linux on 32 bit platforms is just sucky in comparison, and will get worse. And don’t even look at MIPS, where the padding is wrong.

5. Linux on MIPS

Not a syscall, a whole implementation of the Linux ABI. Unlike the lovely clean BSDs, Linux is different on each architecture: system calls randomly take arguments in different orders, constants have different values, and there are special syscalls. But MIPS takes the biscuit, the whole packet of biscuits. It was made to be binary compatible with old SGI machines that don’t even exist, and has more syscall ABIs than I have had hot dinners. Clean it up! Make a new sane MIPS ABI and deprecate the old ones; nothing like adding another variant. So annoying I think I threw out all my MIPS machines, each one different.

6. inotify, fanotify and friends

Linux has no fewer than three file system change notification protocols. The first, dnotify, hopped on ioctl’s sidekick fcntl, while the two later ones, inotify and fanotify, added a bunch more syscalls. You can use any of them, and they still will not provide the notification API you want for most applications. Most people use the second one, inotify, and curse it. Did you know kqueue can do this on the BSDs?

7. personality

Oozing in personality, but we just don’t get along. Basically obsolete, as the kernel can decide what kind of system emulation to do from binaries directly, it stays around with some use cases in persuading ./configure it is running on a 32 bit system. But it can turn off ASLR, and let the CVEs right into your system. We need less personality!

8. gettimeofday

Still has an obsolete timezone value from the old days when people thought timezones should go all the way to the kernel. Now we know that your computer should not know: set its clock to UTC, and do the timezones in the UI based on where the user is, not the computer. You should use clock_gettime now. Don’t even talk to me about locales. This syscall is fast though, as it’s in the VDSO, so don’t use it for benchmarking syscall overhead.
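In Python terms (assuming a Unix platform, where time.clock_gettime is available) the advice boils down to:

```python
import time
from datetime import datetime, timezone

# Monotonic clock for measuring intervals; never wall time for this.
start = time.clock_gettime(time.CLOCK_MONOTONIC)
elapsed = time.clock_gettime(time.CLOCK_MONOTONIC) - start

# Wall-clock time, kept in UTC; convert per-user in the UI, not the kernel.
now_utc = datetime.now(timezone.utc)
```
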

9. splice and tee

These were, back in 2005, quite a nice idea, although Linux said at the time “it is incomplete, the interfaces are ugly, and it will oops the system if anything goes wrong”. It won’t oops your system now, but usage has never taken off. The nice idea from Linus was that a pipe is just a ring buffer in the kernel that can have a more general API and use cases in performant code, but a decade on it hasn’t really worked out. It was also supposed to be a more general sendfile, which in many ways was the successor of that Tux web server, but I think sendfile is still more widely used.

10. userfaultfd

Yes, I like file descriptors. Yes CRIU is kind of cool. But userspace handling page faults? Is nothing sacred? I get that you can do this badly with a SIGSEGV handler, but talk about lipstick on a pig.

01 Dec 2017, 08:52

Thinking about hardware support

Why not use hardware mechanisms?

I made a passing comment on Twitter a while back suggesting that hardware virtualisation was not the right solution for some problem, and was asked if my views have changed. I thought about it and realised that they had been changing for a while. A bunch of things, which I go into here, affected that.

One of the earliest things was a discussion some years back in the Snabb community. They are developing software for handling 10Gb+ ethernet traffic; applications are things like routing, tunneling and so on. A common assumption is that you have to use the offload facilities that network cards provide to get good performance.

The issue came up around checksum offload. Pretty much all network hardware provides checksum offload, but it is only for the common cases, generally TCP and UDP with IPv4 and IPv6. If you are doing something different, such as a network tunnel, then support is very rare or only available for some tunnel protocols. So you need to write code for checksums in software anyway, and it will be your performance bottleneck in the case when hardware support does not work, so it needs to be optimised. So you optimise it with SSE and AVX2, and it is really fast. Especially as you have made sure that all the data is in cache at this point, because that is the only way you can do effective line rate processing at those speeds anyway. So you remove the code to use hardware offload, and are left with absolutely consistent performance in all cases, with no hardware dependency, simpler code as there is only a single code path regardless, and the ability to do complex nested sets of checksums where necessary.
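The software fallback in question is the Internet checksum (RFC 1071). A portable, unoptimised sketch, nothing like the SSE/AVX2 versions, but enough to show how simple the single code path is; the sample packet bytes are illustrative:

```python
def internet_checksum(data: bytes) -> int:
    # Pad to an even length, sum 16-bit big-endian words, fold the
    # carries back in (ones'-complement addition), then complement.
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

packet = b"\x45\x00\x00\x1c\x00\x01\x00\x00\x40\x11"  # sample header bytes
csum = internet_checksum(packet)
```

A handy property for testing: checksumming the data with its own checksum appended yields zero, which is exactly how receivers verify packets.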

The lessons here are

  1. You have to write the software fallback anyway, and write it efficiently. So hardware support does not remove the need to write code.
  2. If you always use the code, you will optimise (and debug) it better than if it is just a fallback that is often not tested.
  3. Hardware covers common cases, but people use things in more complex ways.

There are other lessons from the Snabb experience about hardware support too, such as the terrible performance of VF switching in much hardware. Is poor performance in hardware for certain features a bug, or not? Are we expecting too much of hardware? This was once an ethernet card, and now we expect it to route between VMs, so it is not surprising the first generation does not work very well. Again we get to the point where we have to implement functions purely in software and only sometimes use the hardware support, which adds extra complexity.

Is hardware even hardware?

It turns out that many recent TPM “chips” are emulated by the platform management firmware on recent Intel platforms. Given that this firmware has been shown to be totally insecure, do you have any reason to trust it? Most “hardware” features are largely firmware, closed source, impossible to debug or fix, and quite likely with security issues that are hard to find. Or backdoors. Hardware may just be software you can’t update.


Another factor is how the compute market makes hardware available. Cloud providers use a bunch of hardware mechanisms, especially virtualisation, which means that these mechanisms are no longer available to the users. The sizes you can purchase are specified by the vendor. This was one of the issues with unikernel adoption, as the sizes of VMs provided by cloud vendors were generally too large for compact unikernel applications, and the smaller ones are also pretty terrible in terms of noisy neighbour effects and so on. You cannot run your own VM in the cloud. Now some nested virtualisation is starting to appear on GCP and Azure, but its security properties are not clear, the performance is worse, and it is not always available and may be buggy. Obviously there are bare metal providers like Packet (and, as of this week, AWS), but they are currently a small part of the market.

Other hardware devices that you might like to use are similarly unavailable. If you want a hardware device to store security keys on AWS (attached over the network) it will cost you XXXX a month, as this is a premium feature.

Intel SGX

I was really optimistic about Intel SGX a year or so ago. It could be provided on clouds as well as on your own hardware (Azure has recently started to make it available), and could provide a similar sort of isolation to VMs, but with the advantage that the host cannot access the SGX contents at all. However there are both serious security concerns that data can be exfiltrated by side channel attacks, and Intel persists with a totally unusable model where they have to issue keys for every SGX application, which makes the whole thing so cumbersome as to be unrealistic for real applications. It is simpler to just rent a small dedicated machine for the sort of application in question, things like key management. A cynical view is that the main aim now is to generate lock-in for the Intel platform.

Hypervisor security lessons

Hypervisors are probably the most widely used hardware isolation mechanism, and what we have learned from them is that the vast majority of the security issues are in the surrounding software. Yes, QEMU, we are looking at you. Because communication is needed over the hypervisor boundary, and for compatibility and performance reasons that communication has been a pretty complex bit of code, it has had a very large number of bugs. In addition, the OS-as-hypervisor model has seemingly won over the microkernel-as-hypervisor model in the mass market at present, for convenience reasons. The large scale cloud providers do not use the open source hypervisors in any form that is close to what is shipped in open source. If you want to run a secure hypervisor you should be looking at CrosVM, Solo5, seL4 or Muen, not the generic KVM+QEMU model. A lot of the more complex enhancements, like memory de-duplication, have been shown to have security issues (eg side channel attacks). Defence in depth is very important: if you were to run KVM on Linux, for example, you should look at what CrosVM has done to isolate the implementation: coding in Rust, seccomp, no QEMU, no emulated hardware, privilege separation, etc.


Much of the most secure software does not use any hardware features at all. The Chromium and ChromeOS models are very much worth studying. Only recently has ChromeOS been considering VMs as a sandboxing device for untrusted code (using CrosVM, not shipping yet). The browser is a difficult thing to protect compared to most server software, as it executes arbitrary attacker supplied code, displays complex externally provided media, and implements complex specifications not designed with security in mind. It generally runs on off the shelf, non hardened operating systems. It is perhaps not surprising that many of the recent innovations in secure software sandboxing have come from this environment. Things like NaCl, Web Assembly, Rust and so on all originate here.

Linux itself has introduced a fairly limited runtime, eBPF, that is considered safe enough to run even in a kernel context, using compiler technology to check that the code is secure and terminates. This type of technology is getting much more mature, and could be used more widely in userspace too.

Arguably the crutch of “VMs are safe” has meant that hardening Linux has had a lower importance. Or indeed software protection in general, which outside the browser case has had little attention. Java was meant to be safe code when it was introduced, but the libraries in particular turned out to have lots of interesting bugs that meant escapes were possible. Native Client was the next main step, which saw some use outside the browser context, eg in ZeroVM. Web Assembly is the major target now. There needs to be more here, as for example the current state of the art for serverless platforms is to wrap the user process in a container in a VM, which is very high overhead, and tooling for building trusted safe languages is becoming more necessary for embedding applications. Of course this is not helped by the poor overall security of Linux, which is not strongly hardened and largely driven by features not security, and which has an enormous attack surface that is very hard to cut back.

If we didn’t have a (misplaced) trust in hardware security, we would be forced to build better software only security. We could then combine that with hardware security mechanisms for defence in depth. Right now we mostly have too few choices.
