23 Dec 2017, 12:40

By Justin Cormack

People ask me “Justin, what are your favourite system calls, that you look forward to their return codes?” Here I reveal all.

If you are in a bad mood you might prefer sucky syscalls that should never have existed.

0. `read`

You cannot go wrong with a read. You can barely EFAULT it! On Linux amd64 it is syscall zero. If all its arguments are zero it returns zero. Cool!
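The zero-length case is easy to try. What follows is a minimal Python sketch, not the raw syscall: it asks for zero bytes from a throwaway pipe rather than from file descriptor 0, an assumption made so the result does not depend on what stdin happens to be. The helper name `zero_length_read` is hypothetical.

```python
import os


def zero_length_read() -> bytes:
    """Ask read for zero bytes and return whatever comes back."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, b"x")     # data is waiting, but we ask for none of it
        return os.read(read_fd, 0)   # a count of zero yields zero bytes, no error
    finally:
        os.close(read_fd)
        os.close(write_fd)
```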

1. `pipe`

The society for the preservation of historic calling conventions is very fond of pipe, as in many operating systems and architectures it preserves the fun feature of returning both of the file descriptors as return values. At least Linux MIPS does, and NetBSD does even on x86 and amd64. Multiple return values are making a comeback in languages like Lua and Go, and although C has always had a bit of a funny thing about them, they have long been supported in many calling conventions, so let us use them in syscalls! Well, one syscall.
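Languages with multiple return values can expose pipe the same way. As a small sketch, Python's `os.pipe` hands back both descriptors as a pair, whatever the underlying calling convention on the platform happens to be. The helper name `pipe_roundtrip` is hypothetical, and the payload is assumed to fit in the pipe buffer so the write does not block.

```python
import os


def pipe_roundtrip(payload: bytes) -> bytes:
    """Write payload into a pipe and read it back from the other end."""
    # Both descriptors arrive together, as two return values.
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, payload)   # assumes payload fits in the pipe buffer
        os.close(write_fd)            # closing the write end lets read see EOF
        write_fd = -1
        chunks = []
        while chunk := os.read(read_fd, 4096):
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(read_fd)
        if write_fd != -1:
            os.close(write_fd)
```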

2. `kqueue`

When the world went all C10K on our ass, and scalable polling was a thing, Linux went epoll, the BSDs went kqueue and Solaris went /dev/poll. The nicest interface was kqueue, while epoll is some mix of edge and level triggered semantics and design errors so bugs are still being found.

3. `unshare`

Sounds like a selfish syscall, but this generous syscall is the basis of Linux namespaces, allowing a process to isolate its resources. Containers are built from unshares.

4. `setns`

If you liked unshare, its younger but cooler friend takes file descriptors for namespaces. Pass it down a unix socket to another process, or stash it for later, and do that namespace switching. All the best system calls take file descriptors.

5. `execveat`

Despite its somewhat confusing name (FreeBSD has the saner fexecve, but other BSDs do not have support last time I checked), this syscall finally lets you execute a program just given a file descriptor for the file. I say finally, as Linux only implemented this in 3.19, which means it is hard to rely on it (yeah, stop using those stupid old kernels folks). Before that glibc had a terrible userspace implementation that is basically useless. Perfect for creating sandboxes, as you can sandbox a program into a filesystem with nothing at all in it, or with a totally controlled tree, by opening the file to execute before chroot or changing the namespace.
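The "open first, execute later" pattern can be sketched in Python, whose `os.execve` accepts an open file descriptor on platforms listed in `os.supports_fd`. Which underlying call ends up being used depends on the libc and kernel, so treat this as an illustration of the pattern rather than a direct execveat binding. The helper name `run_from_fd` is hypothetical, and a `true` binary is assumed to be on the PATH.

```python
import os
import shutil


def run_from_fd(program: str = "true") -> int:
    """Open a program, then execute it through the descriptor in a child."""
    path = shutil.which(program)
    if path is None:
        raise FileNotFoundError(program)
    fd = os.open(path, os.O_RDONLY)   # open while the file is still reachable
    try:
        pid = os.fork()
        if pid == 0:
            try:
                # A sandbox would chroot or change namespace here, after the
                # open and before the exec.
                os.execve(fd, [program], {})
            finally:
                os._exit(127)         # only reached if the exec failed
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    finally:
        os.close(fd)
```

The child never needs the path again, which is the point: once the descriptor is open, the filesystem view can be emptied out before the program starts.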

6. `pdfork`

Too cool for Linux, you have to head out to FreeBSD for this one. Like fork, but you get a file descriptor for the process not a pid. Then you can throw it in the kqueue or send it to another process. Once you have tried process descriptors you will never go back.

7. `signalfd`

You might detect a theme here, but if you have ever written traditional 1980s style signal handlers you know how much they suck. How about turning your signals into messages that you can read on, you guessed it, file descriptors. Like, usable.
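Python's standard library has no signalfd wrapper, so this sketch calls the libc function through `ctypes`. It assumes Linux, a 128 byte `sigset_t` (true of glibc and musl as far as I know), and a default disposition for `SIGUSR1`; the helper name `read_signal_as_message` is hypothetical. The signal is blocked, sent to the current thread, and then read back as an ordinary message.

```python
import ctypes
import os
import signal
import struct
import threading

_libc = ctypes.CDLL(None, use_errno=True)
_SIGSET_BYTES = 128   # assumption: sizeof(sigset_t) in the C library
_SIGINFO_BYTES = 128  # size of one struct signalfd_siginfo


def read_signal_as_message(signo: int = signal.SIGUSR1) -> int:
    """Deliver signo to this thread and read it back from a signalfd."""
    mask = ctypes.create_string_buffer(_SIGSET_BYTES)
    _libc.sigemptyset(mask)
    _libc.sigaddset(mask, int(signo))
    # Block the signal so it stays pending instead of running a handler.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signo})
    fd = _libc.signalfd(-1, mask, os.O_CLOEXEC)  # SFD_CLOEXEC equals O_CLOEXEC
    if fd < 0:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        signal.pthread_kill(threading.get_ident(), signo)
        info = os.read(fd, _SIGINFO_BYTES)       # the signal, as a message
        (ssi_signo,) = struct.unpack_from("=I", info, 0)  # first field
        return ssi_signo
    finally:
        os.close(fd)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Because it is a file descriptor, the same `fd` could be handed to a poll loop alongside sockets and pipes instead of being read synchronously as here.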

8. `wstat`

This one is from Plan 9. It does the opposite of stat and writes the same structure. Simples. Avoids having chmod, chown, rename, utime and so on, by the simple expedient of making the syscall symmetric. Why not?

9. `clonefile`

The only cool syscall on OSX, and only supported on the new APFS filesystem. Copies whole files or directories in a single syscall using copy on write for all the data. Look on my works, copy_file_range, and despair.

10. `pledge`

The little sandbox that worked. OpenBSD only here, they managed to make a simple sandbox that was practical for real programs, like the base OpenBSD system. Capsicum from FreeBSD (and promised for Linux for years but no sign) is a lovely design, and gave us pdfork, but it’s still kind of difficult and intrusive to implement. Linux has, well, seccomp, LSMs, and still nothing that usable for the average program.

23 Dec 2017, 12:40

Eleven syscalls that suck

By Justin Cormack

People ask me “Justin, what system calls suck so much, suck donkeys for breakfast, like if Donald Trump were a syscall?” Here I reveal all.

If you don’t like sucky syscalls, try the eleven non sucky ones.

0. `ioctl`

It can’t decide if its arguments are integers, strings, or some struct that is lost in the mists of time. Make up your mind! Plan 9 was invented to get rid of this.

1. `fcntl`

Just like ioctl but for some different miscellaneous operations, because one miscellany is not enough.

2. `tuxcall`

Linux put a web server in the kernel! To win a benchmark contest with Microsoft! It had its own syscall! My enum tux_reactions are YUK! Don’t worry though, it was a distro patch (thanks Red Hat!) and never made it upstream, so only the man page and reserved number survive to taunt you and remind you that the path of the righteous is beset by premature optimization!

3. `io_setup`

The Linux asynchronous IO syscalls are almost entirely useless! Almost nothing works! You have to use O_DIRECT for a start. And then they still barely work! They have one use, benchmarking SSDs, to show what speed you could get if only there was a usable API. Want async IO in kernel? Use Windows!

4. `stat`, and its friends and relatives

Yes this one is useful, but can you find the data structure it uses? We have oldstat, oldfstat, ustat, oldlstat, statfs, fstatfs, stat, lstat, fstat, stat64, lstat64, fstat64, statfs64, fstatfs64, fstatat64 for stating files and links and filesystems in Linux. A new bunch will be along soon for Y2038. Simplify your life, use a BSD, where they cleaned up the mess as they did the cooking! Linux on 32 bit platforms is just sucky in comparison, and will get worse. And don’t even look at MIPS, where the padding is wrong.

5. Linux on MIPS

Not a syscall, a whole implementation of the Linux ABI. Unlike the lovely clean BSDs, Linux is different on each architecture, system calls randomly take arguments in different orders, and constants have different values, and there are special syscalls. But MIPS takes the biscuit, the whole packet of biscuits. It was made to be binary compatible with old SGI machines that don’t even exist, and has more syscall ABIs than I have had hot dinners. Clean it up! Make a new sane MIPS ABI and deprecate the old ones, nothing like adding another variant. So annoying I think I threw out all my MIPS machines, each different.

6. `inotify`, `fanotify` and friends

Linux has no fewer than three file system change notification protocols. The first, dnotify, hopped on ioctl’s sidekick fcntl, while the two later ones, inotify and fanotify, added a bunch more syscalls. You can use any of them, and they still will not provide the notification API you want for most applications. Most people use the second one, inotify, and curse it. Did you know kqueue can do this on the BSDs?

7. `personality`

Oozing in personality, but we just don’t get along. Basically obsolete, as the kernel can decide what kind of system emulation to do from binaries directly, it stays around with some use cases in persuading ./configure it is running on a 32 bit system. But it can turn off ASLR, and let the CVEs right into your system. We need less personality!

8. `gettimeofday`

Still has an obsolete timezone value from old times when people thought timezones should go all the way to the kernel. Now we know that your computer should not know. Set its clock to UTC. Do the timezones in the UI based on where the user is, not the computer. You should use clock_gettime now. Don’t even talk to me about locales. This syscall is fast though, don’t use it for benchmarking, it’s in the vDSO.
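A rough sketch of that split in Python: the clock is read as plain UTC seconds with `time.clock_gettime`, and the timezone only appears when rendering for a particular user. The function names and the `user_offset_minutes` parameter are hypothetical; a real UI would more likely carry a named zone for the user than a fixed offset.

```python
import time
from datetime import datetime, timedelta, timezone


def utc_now() -> float:
    """Seconds since the epoch from the realtime clock, no timezone involved."""
    return time.clock_gettime(time.CLOCK_REALTIME)


def render_for_user(epoch_seconds: float, user_offset_minutes: int) -> str:
    """Format a UTC timestamp for display using the user's own offset."""
    user_zone = timezone(timedelta(minutes=user_offset_minutes))
    local = datetime.fromtimestamp(epoch_seconds, tz=user_zone)
    return local.strftime("%Y-%m-%d %H:%M")
```

Everything stored or compared stays in UTC; only `render_for_user` knows where the user is.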

9. `splice` and `tee`

These, back in 2005, were quite a nice idea, although Linux said then “it is incomplete, the interfaces are ugly, and it will oops the system if anything goes wrong”. It won’t oops your system now, but usage has not taken off. The nice idea from Linus was that a pipe is just a ring buffer in the kernel, that can have a more general API and use cases for performant code, but a decade on it hasn’t really worked out. It was also supposed to be a more general sendfile, which in many ways was the successor of that Tux web server, but I think sendfile is still more widely used.

10. `userfaultfd`

Yes, I like file descriptors. Yes CRIU is kind of cool. But userspace handling page faults? Is nothing sacred? I get that you can do this badly with a SIGSEGV handler, but talk about lipstick on a pig.

01 Dec 2017, 08:52

Thinking about hardware support

By Justin Cormack

Why not use hardware mechanisms?

I made a passing comment on twitter a while back suggesting that hardware virtualisation was not the right solution for some problem and was asked if my views have changed. I thought about it and realised that they had been changing for a while. There were a bunch of things that I go into here that affected that.

One of the earliest things was a discussion some years back in the Snabb community. They are developing software for handling 10Gb+ ethernet traffic. Applications are things like routing, tunneling and so on. One of the things that is a common assumption is that you have to use the offload facilities that network cards provide to get good performance. The issue came up around checksum offload. Pretty much all network hardware provides checksum offload, but it is only for the common cases, generally TCP, UDP with IPv4 and IPv6. If you are doing something different, such as a network tunnel, then support is very rare or only available for some tunnel protocols. So you need to write code for checksums in software anyway, and it will be your performance bottleneck in the case when hardware support does not work, so it needs to be optimised. So you optimise it with SSE and AVX2, and it is really fast. Especially as you have made sure that all the data is in cache at this point, because that is the only way you can do effective line rate processing at those speeds anyway. So you remove the code to use hardware offload, and are left with an absolutely consistent performance in all cases, with no hardware dependency, simpler code as there is only a single code path regardless, and the ability to do complex nested sets of checksums where necessary.
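For a sense of what the software path computes, here is a plain, unoptimised sketch of the 16 bit ones' complement Internet checksum described in RFC 1071, which is the checksum that IPv4, TCP and UDP use. It is not Snabb's code; the optimised versions discussed above do the same arithmetic with SSE or AVX2 over many words at once. The function name `internet_checksum` is hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16 bit ones' complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # big-endian 16 bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF
```

Because the same function runs on any byte range, nested checksums for tunnelled traffic are just repeated calls on the inner and outer payloads, with no dependence on what a particular card can offload.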

The lessons here are

You have to write the software fallback anyway, and write it efficiently. So hardware support does not remove the need to write code.
If you always use the code, you will optimise (and debug) it better than if it is just a fallback that is often not tested.
Hardware covers common cases, but people use things in more complex ways.

There are other lessons from the Snabb experience about hardware support too, such as the terrible performance of VF switching in much hardware. Is poor performance in hardware for certain features a bug, or not? Are we expecting too much of hardware? This was once an ethernet card and now we expect it to route between VMs, so it is not surprising the first generation does not work very well. Again we get to the point where we have to implement functions purely in software and only sometimes use the hardware support, which adds this extra complexity.

Is hardware even hardware?

It turns out that many recent TPM “chips” are emulated by the platform management firmware on recent Intel platforms. Given that this firmware has been shown to be totally insecure, do you have any reason to trust it? Most “hardware” features are largely firmware, closed source, impossible to debug or fix, and quite likely with security issues that are hard to find. Or backdoors. Hardware may just be software you can’t update.

Availability

Another factor is how the compute market makes hardware available. Cloud providers use a bunch of hardware mechanisms, especially virtualisation, which means that these mechanisms are no longer available to the users. The sort of size you can purchase is specified by the vendor. This was one of the issues with unikernel adoption, as the sizes of VMs provided by cloud vendors were generally too large for compact unikernel applications, and the smaller ones are also pretty terrible in terms of noisy neighbour effects and so on. You cannot run your own VM in the cloud. Now some nested virtualisation is starting to appear on GCP and Azure, but the security properties of this are not clear, the performance is worse, and it is not always available and may be buggy. Obviously there are bare metal providers like Packet (and as of this week AWS) but they are currently a small part of the market.

Other hardware devices that you might like to use are similarly unavailable. If you want a hardware device to store security keys on AWS (attached over the network) it will cost you XXXX a month, as this is a premium feature.

Intel SGX

I was really optimistic about Intel SGX a year or so ago. It could be provided on clouds as well as on your own hardware (Azure has recently started to make it available), and could provide a similar sort of isolation to VMs, but with the advantage that the host cannot access the SGX contents at all. However there are both serious security concerns that data can be exfiltrated by side channel attacks, and Intel persists with a totally unusable model where they have to issue keys for every SGX application, which makes the whole thing so cumbersome as to be unrealistic for real applications. It is simpler to just rent a small dedicated machine for the sort of application in question, things like key management. A cynical view is that the main aim now is to generate lock-in for the Intel platform.

Hypervisor security lessons

Hypervisors are probably the most widely used hardware isolation mechanism, and what we have learned from them is that the vast majority of the security issues are in the surrounding software. Yes, QEMU, we are looking at you. Because communication is needed over the hypervisor boundary, and for compatibility and performance reasons that communication has been a pretty complex bit of code, it has had a very large number of bugs. In addition the OS as hypervisor model has seemingly won over the microkernel as hypervisor type model in the mass market at present, for convenience reasons. The large scale cloud providers do not use the open source hypervisors in any form that is close to what is shipped in open source. If you want to run a secure hypervisor you should be looking at CrosVM, Solo5, seL4, Muen, not the generic KVM+QEMU type model. A lot of the more complex enhancements, like memory de-duplication, have been shown to have security issues (eg side channel attacks). Defence in depth is very important - if you were to run KVM on Linux for example you should look at what CrosVM has done for isolation of the implementation: coding in Rust, seccomp, no QEMU, no emulation hardware, privilege separation etc.

Alternatives

Much of the most secure software does not use any hardware features at all. The Chromium and ChromeOS models are very much worth studying. Only recently is ChromeOS considering VMs as a sandboxing device for untrusted code (using CrosVM, not shipping yet). The browser is a difficult thing to protect compared to most server software as it executes arbitrary attacker supplied code, displays complex externally provided media, and implements complex specifications not designed with security in mind. It runs on off the shelf non hardened operating systems in general. It is perhaps not surprising that many of the recent innovations in secure software sandboxing have come from this environment. Things like NaCl, Web Assembly, Rust and so on all originate here.

Linux itself has introduced a fairly limited runtime, eBPF, that is considered safe enough to run even in a kernel context, using compiler technology to check that the code is secure and terminates. This type of technology is getting much more mature, and could be used more widely in userspace too.

Arguably the crutch of “VMs are safe” has meant that hardening Linux has had a lower importance. Or indeed software protection in general, which outside the browser case has had little attention. Java was meant to be safe code when it was introduced, but the libraries in particular turned out to have lots of interesting bugs that meant escapes were possible. Native Client was the next main step, which saw some use outside the browser context, eg in ZeroVM. Web Assembly is the major target now. There needs to be more here, as for example the current state of the art for serverless platforms is to wrap the user process in a container in a VM, which is very high overhead, and tooling for building trusted safe languages is becoming more necessary for embedding applications. Of course this is not helped by the poor overall security of Linux, which is not strongly hardened and largely driven by features not security, and which has an enormous attack surface that is very hard to cut back.

If we didn’t have a (misplaced) trust in hardware security, we would be forced to build better software only security. We could then combine that with hardware security mechanisms for defence in depth. Right now we mostly have too few choices.

02 Jan 2016, 19:03

Interview with Gareth Rushgrove

By Justin Cormack

Puppet: Configuration Management in the age of containers

Interview with Gareth Rushgrove of Puppet Labs, 30 April 2015, lightly edited transcript

This interview was originally for Docker in Production but did not get into the final version

There are three areas of configuration management that you might want to apply tools to, the host that manages containers, the containers themselves and the overall system. Tools can be applied independently to each area.

For building containers the tools work but they are not optimised yet, and the Docker build API is horrendous. The Dockerfile is not a client, it is the wire format, so all tools that want to do a better job of building things must compile down to a Dockerfile, which is limiting. We would like more direct integration.

It is also important to split up what configuration management is. It is not just tools, it is a discipline for managing your configuration. Some configuration is upfront, it is modelling, a mental model to share with your team and you want to ensure it is correct. There is also emerging configuration, runtime configuration. I have been describing this as two speeds. The modelling stuff changes infrequently, weeks months or hours, but the runtime stuff like a new machine being added is very fast. So it is important to think of it as a discipline not just as tools.

Containers are different, so people are learning new stuff, and are thinking more about dynamic systems. Looking at something new allows people to think about things in a new way. And smaller services, and more things mean there are more changes across the system, having small things allows more changes with less risk. The culture is about changing things faster while minimising risks.

Such as service discovery, which is part of configuration management. Service discovery moves configuration from modelling to emergent properties of the system. This changes how configuration is distributed. Your mental model changes from five servers connected to a load balancer to a pool of hosts, away from a fixed property.

Where there are finite resources, you have been given two servers or 20 IP addresses, you want an explicit model of where they are all the time. Self service infrastructure removes those constraints. People have used tools to model emergent properties, but you end up with a lot of complexity.

Puppet sees things as a graph, understanding how instances relate to security groups is similar to how a group of files and packages relate on a host. There is an onion of infrastructure, physical machines, applications, services spanning multiple hosts, layers of the onion.

Defining the primitives is useful. Some of the work we are doing around Kubernetes and Mesos is about defining the high level primitives once you stop configuring individual hosts. The more hosts you have the less configuring files and packages on hosts scales. After 100 it gets progressively harder and the rate we add machines outstrips the ability. We need to give power tools to support that and fix emerging problems. Need to move from configuring hundreds or thousands of individual hosts, and the race is on now to decide what the primitives are for configuring these systems.

The public cloud has been interesting, as all the vendors resisted standardisation. They all have a machine or VM, but when you come to networking the primitives are really different, and to do real network you have to drop down to the native APIs and more than that the primitives and models are different. You can’t abstract over them without going to lowest common denominator because the primitives are different. Networking is still a killer where the primitives and APIs are still different.

Currently with Puppet there is support for networking. There is support for switches like F5 or Juniper, and support for the cloud providers, but with different models. This lets you build the layers of configuration you need. People are exposing existing devices to software configuration, and standardising interfaces.

Puppet comes with a bunch of existing providers for files, packages and users out of the box, but that is really an implementation detail, packaging. Puppet could have been shipped without any of them out of the box. The interesting question is whether, in twenty years’ time, declarative modelling languages are still useful; if no one is even bothered about individual users or files by then, Puppet won’t ship those in core, they will just be available as a module. Whereas today support for EC2 is shipped as a module, and there are modules to configure Chronos, which is not a host thing but sits across a bunch of machines. I think we will see way more resources like that.

The moment you are running a distributed protocol that is software and that needs to be configured and consistency is important and must be maintained. Even where there are new primitives like replicated applications the software and the usage of that needs configuration.

There is still a need for something that says these are the services I want to run and this is how they are configured. Moving these to containers does not change this massively. It makes a bunch of things easier. A container is a unit of a package and a service format, you need not think of those separately any more.

If you go to an organization that has been around for six months and has more than ten people, they will be running a huge amount of software, the messy real world. It is nice to think you can have one operating system and one version and one way of doing things. But the older organization has a bit of everything and this is not going to change and they are not going to get rid of stuff.

This has to do with financial and business constraints not operational constraints. If things are working, things are not going to move, and this will always be the case.

The messes with containers will be predictable, nothing novel or new. Docker is interesting as it has a really compelling developer story, but it does not yet have a really compelling operations story. That’s not to say it won’t, but it is new. There is loads of focus on the technical operations side, getting the tech right, what do the operational models look like, how do you segregate things. In a multi tenant environment now you wrap containers in virtualisation. The abstraction above the physical is the easy bit but sometimes you care about things not being on the same host or network. Cloud Foundry did some work but it was more or less are you on the inside or outside.

Google has one massive workload for search, but it is all about search, if you have access to one part you have access to the rest, so segregation does not matter so much. When you have persistent workloads that are very diverse I think that’s where there is a disconnect between a brand new startup and a larger organization. If you have multiple microservices serving one application, segregation is not so important. The workload is not that diverse, and your internal services like HR are all outsourced. Large organizations will have independent things that are running on single machines. Wrapping this in containers does not make sense now. The more diverse the workload, the more you end up with many security domains, and the tools for reasoning about that are not there yet. The one big flat network is a failing of a naive developer mindset; not that the developers are naive, but the environments they are in do not have those problems.

Security was largely about a trust in hardware. Once everything is software the guarantees go away, and this changes the model. You have layers, model once and apply twice. People are not yet doing this, but you could. Testing what goes across the wire for example, by having a proxy container that sits between services and makes sure they are doing the right thing.

For example if services are supposed to be talking JSON, check if it is JSON and it meets the schema, and have that implemented as a separate thing, potentially in a separate security domain. There are a lot of practices that don’t exist yet. Putting your tests into production: if it ever fails something has gone badly wrong and it should fail. Build tools so people do not have to trust things, but it is not common practice except at the high end. Containers make it worse as you have more moving parts, so the potential to screw up configuration is higher.
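A minimal sketch of the check such a proxy might run on each message, in Python. The schema format here (field name mapped to expected type), the `ORDER_SCHEMA` example and the `check_message` name are all hypothetical; a real deployment would more likely use a proper schema language and validator.

```python
import json

# Hypothetical schema: the fields a message must carry and their JSON types.
ORDER_SCHEMA = {"id": int, "item": str, "quantity": int}


def check_message(raw: bytes, schema: dict = ORDER_SCHEMA) -> list:
    """Return a list of problems; an empty list means the message passed."""
    try:
        message = json.loads(raw)
    except ValueError as exc:            # covers invalid JSON and bad encoding
        return [f"not JSON: {exc}"]
    if not isinstance(message, dict):
        return ["top level is not an object"]
    problems = []
    for field, expected in schema.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif type(message[field]) is not expected:
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems
```

The proxy would forward a message only when the list is empty, and a non-empty list in production is the loud failure described above.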

The intersection between configuration management and monitoring is also important, seeing that the defined configuration is working. The model of the configuration drives the model of the monitoring system, so you have feedback into the configuration. It is not quite a closed loop yet, and monitoring tools do not have much context, but in theory you could put all these things together. The widespread adoption of these tools is still happening, and integration is slow as there is not one winner in terms of tools. With databases there are a few, but there are so many tools that there is not the same level of integration as there are so many options.

The commercial vendors are putting these stacks together into more opinionated layers, for example Tectonic from CoreOS is a bundle of CoreOS and Kubernetes and so on and you cannot change the pieces, the way they fit together is defined and opinionated.

There is a tension between what paid customers want, highly opinionated ways to do things, and the open source community which wants many small independent things that do one thing well that they can tie together how they want. With lots of open source products like Docker being funded by companies this is going to be interesting. It is not possible to square that.

Puppet will end up shipping more opinionated infrastructure supporting containers to large enterprises, and what we are building is the start of an opinionated workflow. Only a small number of our clients are using containers so far; we will ship more opinionated things as more do. The best practices around containers have not been worked out yet. The high level tools are not finished yet, nor is the networking layer.

Commercial interests mean there will be multiple flavours with different models, we are not going to adopt the same thing, from a technology and user point of view it would be lovely. From a business point of view it will not happen, there will be multiple platforms; some of the reasons are good, we don’t know what all the answers are yet. The fact you can compose tools in different ways does not mean it is easy or cheap.

If you are moving fast being able to change the components fast is really important. Doing this on the service level is a massive advantage when you have small growing teams. There are other benefits too. Will there be a market for microservices? Will SAP ship as 139 individual components that you can in theory replace? I don’t think we know yet what it will look like. From a consumer point of view I want one thing for one price. Containers as a unit of shipping certain types of software make sense, especially when more widely adopted. Appliance-like containers are where enterprises will be interested. Both Docker and Red Hat are interested. Once there is a container runtime everywhere the installer will just find a bunch of containers, and this removes a bunch of work around packaging. But at this point it is an implementation detail. The container runtime is a given at some point, whether it is high level like Kubernetes or slightly lower level like Mesos or like Swarm or Windows containers; the whole multi host stuff is still not determined.

With Windows there is a certain amount of catchup. Server 2016 is shipping with containers out of the box, as is Hyper-V, with Docker as the interface. Containers on Windows will probably be technically better than on Linux, from what they have published so far. Containers are not going to be portable though between Windows and Linux. Windows has been somewhat irrelevant in this space but it is changing.

02 Jan 2016, 18:24

About this blog

By Justin Cormack

It has been some years since I wrote a blog, having spent some time helping write a book two years ago, Docker in Production, and having written in various other places. But I like writing so I am going to start again.

The title comes from Eduardo Paolozzi’s 1971 series of etchings about robots, spaceflight and kids’ toys.

I am mostly going to write about software, cloud, atomicity, but no doubt some other things too.

I work at Docker in Cambridge, UK. You can find me on twitter.

Cloud Atomic Laboratory
