27 Jan 2019, 19:00

Kubernetes as an API standard

There is now a rustyk8s mailing list to discuss implementations of the Kubernetes API in Rust.

There was a lot of interest in my tweet a couple of months about writing an implementation of the Kubernetes API in Rust. I had a good conversation at Kubecon with some people about it, and thought I should explain more why it is interesting.

Kubernetes is an excellent API for running code reliably. So much so that people want to run it everywhere. People have described it as the universal distributred systems API, and something that will eventually be embedded into hardware, or the kernel (or Linux) of distributed systems. Maybe some of these are ambitious, but nothing wrong with ambition, and hey it is a nice, simple API at its core. Essentially it just does reconciliation between the world and desired state for an extensible set of things, things that include a concept of a pod by default. That is pretty much it, a simple idea.

A simple idea, but not simply expressed. If you build a standalone Kubernetes system, somehow that simple idea amounts to a gigabyte of compiled code. Sure, there are some extraneous debug symbols, and a few extra versions of etcd for version upgrades, and maybe one day Go will produce less bloated code, but that is not going to cut it for embedded systems and other interesting potential use cases of Kubernetes. Nor is it easy to understand, find your way around the code and hack on it.

Another problem with Kubernetes is that it suffers from the problem that the implementation is the specification. Lots of projects start like that but as they mature the specification is often separated, and alternative implementations can thrive. Without an independent specification, alternative implementations often have to copy every accidental nuance of the original, and even replicate bugs. Kubernetes is in the right state where starting to move towards an independent specification would be productive. We know that there are some rough edges in the implementation that need to be cleared up, and some parts where the API is not yet the best it could be.

One approach is to try to cut back the current implementation to a more manageable size, by removing parts. This is what Darren Shepherd of Rancher has done with “k3s”, removing a million or so lines of code. But a second, complementary approach is to build a new simple implementation from the ground up without any baggage to start with. Then by looking at differences in behaviour, you can start to understand which parts are the core specification, and which parts are accidental. Given that the way the code for Kubernetes is written has been described as a “clusterfuck” by Kris Nova, this seems a productive route: “Unknown to most, Kubernetes was originally written in Java… If the anti patterns weren’t enough we also observe how Kubernetes has over 20 main() functions in a monolithic “build” directory… Kubernetes successfully made vendoring even more challenging than it already was, and discuss the pitfalls with this design. We look at what it would take to begin undoing the spaghetti code that is the various Kubernetes binaries.”

Of course we could write a new implementation in Go, but the temptation would then be to import bunches of existing code, and it might not end up that different. A different language makes sense to stop that. The aim should be to build the minimum needed to implement the code API. So what language? Rust makes the most sense it seems, although there are some other options.

There is a small but growing community of cloud native Rust projects. In the CNCF, there is TiKV from PingCAP and the Linkerd 2 data plane. Anther project that has recently been launched in the space is AWS Firecracker. The Rust ecosystem is especially strong in security, and control of memory usage, both of which are important for effective scalable systems. In the last year or so the core libraries needed in the cloud native space have really been filled in.

So are you interested in hacking on a greenfield implementation of Kubernetes in Rust? There is not yet a public codebase to hack on, but I know that there are some people hacking in private. The minimal viable project is something that you can talk to with kubectl and run pods, and API extensions. The conformance tests should help, although they are not complete enough to constitute a specification by any means, but starting to pass some tests would be a satisfying achievement. If you want to meet up with cloud native Rust community, a bunch of people will be at Fosdem in early February, and I will sort out a fringe even at KubeCon EU as well. Happy hacking!

01 Jan 2019, 18:00


You might have noticed me tweeting a bunch about RISC-V in recent months. It is actually something I have been following for several years now, since the formation of LowRISC in Cambridge quite some time ago, but this year has suddenly seen a huge maturing of the ecosystem.

In case you have been sitting under a rock hacking on something for some time, RISC-V is an open instruction set for CPUs. It is pronounced “risk five”. It looks a bit like MIPS, if you know your instruction sets, and yes it is very RISC, pretty minimal really. It is designed to be cleanly extended, and has 32, 64 and 128 bit implementations. So far the 32 bit version is for microcontrollers, the 64 bit for operating systems like Linux with MMUs, and the 128 bit version is for future dreams.

But an instruction set, even one without licensing and patent issues, is not that interesting on its own. There are some other options there after all, although they all have some issues. What is more interesting is that there are open and freely modifiable open source implementations. Lots of them. There are proprietary ones too, and hybrid ones with some closed IP and some open, but the community has been building open. Not just open cores, but new open toolchains (largely written in Scala) for design, test, simulation and so on.

SiFive core designer

The size of the community growth this year has been huge, starting with the launch by SiFive of the first commercially available RISC-V machine that could run Linux at Fosdem in January. Going to a RISC-V meetup (they are springing up in Silicon Valley, Cambridge, Bristol and Israel) you feel that this is hardware done by people who want to do hardware like open source software is done. People are building cores, running in silicon or on FPGA, tooling, secure enclaves, operating systems, VC funded business and revenue funded businesses. You meet people from Arm at these meetups, finding out what is going on, while Intel is funding RISC-V businesses, as if they want to make serious competition for Arm or something! Meanwhile MIPS has opened its ISA as a somewhat late reaction.

A few years ago RISC-V was replacing a few small microcontrollers and custom CPUs, now we see companies like Western Digital announcing they will switch all their cores to RISC-V, while opening their designs. There are lots of AI/TPU cores being built with RISC-V cores, and Esperanto is building chips with over a thousand 64 bit RISC-V cores on. The market for specialist AI chips came along at the same time as RISC-V was maturing, and it was a logical new market.

RISC-V is by no means mature; it is forecast it will ship 10-100 million cores in 2019, the majority of them 32 bit microcontrollers, but that adds to the interest, it is at the stage where you can now start building things, and lots of people are building things for fun or serious reasons, or porting code, or developing formal ISA models or whatever. Open source wins because a huge community just decides it is the future and rallies around every piece of the ecosystem. 2018 was the year that movement became really visible for RISC-V.

I haven’t started hacking on any RISC-V code yet, but I have an idea for a little side project, but I have joined the RISC-V Foundation as an individual member and hope to get to the RISC-V Workshop in Zurich and several meetups. See you there and happy hacking!

28 Dec 2018, 14:00

2018 Conferences

I gave quite a few talks this year, and also organized several conference tracks.

Config Mangement Camp

It was an excellent Config Management Camp this year, and fun to speak at.

QCon London 2018

I organized the Modern Computer Science in the Real World Track at this conference, it was a great set of talks.

I also spoke in the Modern Operating Systems track

KubeCon Cloud Native Europe


Registration required to watch videos.


All Things Open

I don’t think this was recorded.


I curated the Modern Operating Systems track, and spoke on it. The videos are coming out on 7 and 14 January

  • Thomas Graf, How to Make Linux Microservice-Aware With Cilium and eBPF
  • Alan Kasindorf, Caching Beyond RAM: The Case for NVMe
  • Justin Cormack, The Modern Operating System in 2018, a somewhat changed version of my QCon London talk
  • Adin Scannell, gVisor: Building and Battle Testing a Userspace OS in Go
  • Bryan Cantrill, Is It Time to Rewrite the Operating System in Rust? (Don’t miss this!)

DockerCon EU

Registration required to watch videos. I helped organize the Black Belt track which had some great talks:

I gave a joint talk on Open Policy Agent and a re-run of the earlier talk with Liz Rice

Kubecon Cloud Native US


Don’t miss the Modern Operating Systems track at QCon London which I am curating, should be excellent.

  • Jessie Frazelle on eBPF
  • Avi Deitcher on LinuxKit
  • Kenton Varda on Cloudflare Workers
  • others TBC

I am planning or hoping to attend in 2019 at least the events below, but also no dount several other ones.

27 Dec 2018, 19:00

Confused Deputies Strike Back

A few weeks back Kubernetes had its first really severe security issue, CVE-2018-1002105. For some background on this, and how it was discovered, I recommend Darren Shepherd’s blog post, he discovered it via some side effects and initially it did not appear to be a security issue just an error handling issue. Of course we know well that many error handling issues can be escalated, but why was this one so bad?

To summarize the problem, there is an API server proxy component, that clients can use to talk to other API endpoints. As the postmortem document says

  • Kubernetes API server proxy components still use http/1.1 upgrade-based connection tunneling, which does not distinguish between request data sent by the apiserver while establishing the backend connection, and data sent by the requesting user

  • High and low-privilege API requests to aggregated API servers are proxied via the same component with the same high-permission transport credentials

Well, this security issue is actually well known enough to have its own name, it is the confused deputy problem, originally written about by Norm Hardy in 1988 although referring to an original example from the 1970s. The essence of the problem is that there are three parties involved, a user, a proxy or deputy type component and an object or service that needs to be accessed, or a similar set of endpoints. The user connects to the deputy to perform an action on an object, but the deputy can be persuaded to act on an object that it has access to rather than one the end user has access to.

Imagine asking your accountant to fill in your tax return. Your accountant has access to your tax return, but also to those of other customers. If the accountant is buggy or can be confused she could fill in one of these tax returns instead of yours. The general problem is that in order to run a tax return filing service, you need the ability to fill in lots of different people’s tax returns. You become a very privileged node, a superuser of tax returns. The tax office has to respect your authority to fill in lots of tax returns, and read them, so the accountant’s credentials must be very privileged. We see similar designs in all sorts of places, like suid applications in Unix that can do operations on behalf of any user and must be very highly trusted, and are often the source of security bugs.

What is the solution? Well we can not have these deputies. Fill in your own tax return! But in effect this says do not use microservices. If every endpoint needs to have the code for filling in tax returns we lose the benefit of microservices, we have to update lots of endpoints together, we cannot have a team building better accountant services and so on. What we really want is that the accountant does not have to be a superuser, but instead she has no permissions on her own but we can pass credentials (maybe time limited) to update our tax return (but not to generally impersonate us) with our request. This access control model is called capability-based security: access is granted via unforgeable but transferable tokens that provide access to objects. You can imagine they are keys, like passing your car key to a valet service, rather than the valet service having a master key for all cars that they might need to park.

The standard access control list (ACL) models of authorization are all about making decisions based on identity, a concept that clearly must not be transferable. I never want my accountant to have to (or be able to) pretend to be me to fill in my tax return. The classic solution in this case would be for me to be able to add additional people to the ACL for my tax return; this is modeled in new ACL frameworks like NGAC from NIST (sorry no link right now the website is down due to the government shutdown). This does not immediately seem applicable to the Kubernetes issue though, and is much more complex than passing my API access credential to the API proxy server. At this point I highly recommend the excellent short paper ACLs don’t by Tyler Close, one of my favourite papers (I should do a papers we love session on it). His examples mainly come from the browser, another prevalent deputy with a lot of security issues, such as CSRF another confused deputy attack. Capabilities are actually very simple to understand and reason about.

ACL based security is fine for many situations, in particular where there are only two parties and you just want to mediate access to a set of resources. But microservices do not appear to be in that sweet spot, as Kubernetes found out with its API proxy microservice. Bugs can be fixed, but as the retrospective points out all changes will need to be examined for security issues. As Tyler Close says “the correct implementation of an access policy cannot be ascertained by an examination of the ACLs configured for an application, but must also include an examination of the program’s source code. To date, this technique has been error prone.” It was not even the only bug that week that was a confused deputy issue, the Zoom critical bug was the same issue, where UDP packets could confuse the deputy service. These are critical issues happening on a regular basis, and no doubt many more lurk.

The entire reason for microservices is to have third parties to delegate services to, and we need to shift away from ACL based models to capabilities for microservices. Of course this is non trivial, distributed capabilities (as opposed to local ones) have not been used much and we don’t have a good infrastructure for them yet. I will write more about practicalities in a further post, but we need to start shifting security to be microservice native too not just adopting things that worked for monoliths.

26 Dec 2018, 20:00

QUIC for Unikernels

I had until recently mostly been ignoring QUIC. In case you had, QUIC is a new-ish protocol developed to Google, that will probably be HTTP version 3. The interesting pieces are that it runs over UDP not TCP, but supports reliable delivery by implementing retransmission itself, that it supports multiple streams without head of line blocking, and that it is designed to support encryption natively. Another important benefit is that connections can migrate from one IP to another without being dropped, as there is a connection ID independent of source and destination addresses. It is also designed to not ossify, with as much as possible of the packet encrypted as possible, so that intermediaries cannot inspect what is inside the packet and make decisions, to avoid the great difficulty in changing TCP where adding new extensions has a small chance of working as middle boxes will often strip things they do not understand, or not let the packets through. If all you can see is UDP flows it is harder to do much. Another benefit is faster handshake for the encrypted case. Proposed extensions include different algorithms for congestion control, for example for different environments like in datacentre or high latency connections, and forward error correction.

I had partly been ignoring QUIC as it has not yet been finalised, and had some temporary encryption included that was going to be removed to be replaced by TLS 1.3, and also as it seemed to be very tied up with HTTP/2. I also had some idea that layering encryption over TCP in possibly slightly non spec compliant ways might make sense. But then a few weeks back, a paper on nQUIC or Noise on QUIC came out, and I decided to take another look. It turned out that there were other people interested in removing the strong tie to TLS, and also looking further it seems that the protocol is not that tied to HTTP, and it does provide a general transport with multiple streams. The IETF standard drafts split out the TLS implementation, and it looks like there is interest in pushing for a standard Noise based version. Quic is not significantly more complex than TCP, especially as you can in effect hard code the number of streams if you do not want to use that feature, for example on an embedded system. Noise over QUIC, without HTTP looks pretty reasonable for small systems that have enough performance to do encryption and a little memory, even down to microcontrollers. You could even customise it for some applications in closed environments.

So what has this got to do with unikernels? Well the interesting thing about QUIC is that it always runs in userspace, not in the kernel on conventional systems. So that puts unikernels on an equal footing, they can use the same implementations as other applications use. There are already implementations in C, C++, Go, Rust, TypeScript, Objective C, Python, and no doubt more. Interfacing QUIC to a transport stack is pretty simple as UDP is just a thin layer over ethernet. There is no reason why the implementation should be any less efficient, indeed it can probably be made more efficient as it csn bypass several abstraction layers.

There are some potential issues in that some firewalls block QUIC (which is typically on UDP port 443); browsers will switch to TCP in that case. A QUIC only unikernel might not have that luxury, especially in some embedded situation. Larger machines can still fall back to TCP, but that can be a less optimised version. The main use case for QUIC would initially be for traffic between dedicated unikernel or embedded services, especially if you are using Noise rather than TLS for a very small implementation, not public endpoints. There are some concerns that the CPU overhead of QUIC is higher, so it may not be suitable for embedded applications, and there are no benefits over TCP for those cases. But there is freedom to iterate in a way that there is much less with TCP, so I think it is definitely worth examining. Research in whether CPU overhead is a necessary part of the protocol, and how to measure efficiency in different environments is also productive.

/* removed Google analytics */