Challenges in Cycle Counting

CPU Isolation in Apache Mesos and DC/OS

Jörg Schad and Johannes Unterstein discuss the challenges of CPU isolation in DC/OS.

TL;DR: DC/OS 1.10 enforces hard CPU limits with CFS isolation for both the Docker Engine and Universal Container runtimes. This gives more predictable performance across all tasks, but it can slow down tasks (and therefore deployments) that previously consumed more CPU cycles than they were allocated.

Isolating tasks is an essential function of container schedulers such as DC/OS. In this blog post we discuss the different options for isolating CPU resources and the rationale for a recent change in this behavior.

Container isolation has two goals:

  • Isolation should prevent one task from accessing critical information of another task. With containers this is accomplished using namespaces. Namespaces are a Linux kernel feature that gives each container its own isolated view of system resources, for example its own process ID space.
  • Isolation should provide fair and predictable access to resources. This is accomplished with Linux cgroups, a Linux kernel feature that allows you to specify a maximum of CPU cycles or a maximum memory usage per container.

This blog is about the different options for limiting access to CPU cycles in a fair and predictable fashion.

Overview of CPU isolation

As DC/OS uses Apache Mesos at its core, it also uses the Mesos isolation mechanisms, which are enforced by containerizers. Mesos offers two different containerizers: Mesos (referred to as UCR in DC/OS) and Docker.

For isolation, both containerizers rely on Linux cgroups.

For CPU isolation, you can use two different cgroups subsystems:

  • cgroups/cpu.shares

    CPU shares determine the relative weights of CPU cycles allocated to containers (recall that we use the term container here for different process groups). Imagine a system with two containers: container A has a CPU share of 500 whereas container B has a share of 250. That setup gives container A a relative share of ⅔ of the CPU cycles, whereas container B gets ⅓ of the CPU cycles. If container B is removed from the system, container A can use all of the CPU cycles on that machine. This flexible behavior makes it difficult to achieve predictable CPU performance as CPU performance depends on whether different tasks are co-located on the same CPU.

  • cgroups/cpuset

    CPU sets assign CPU cores to containers. E.g., you could assign cores 1 and 3 to container A and cores 0 and 2 to container B. While this has the disadvantage that you are partitioning your cluster by cores, it is useful in large NUMA systems where you might want to pin containers to cores that are directly attached to the same memory.
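The share arithmetic from the cpu.shares example can be sketched in a few lines of Python. This is only an illustration of the relative-weight calculation under full contention, not how the kernel scheduler is implemented; the function name relative_cpu_fractions is a hypothetical helper, not part of any Mesos or DC/OS API.

```python
def relative_cpu_fractions(shares):
    """Map each container name to its fraction of CPU time when every
    container is busy (hypothetical helper, for illustration only)."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one container needs a positive share")
    return {name: weight / total for name, weight in shares.items()}


# Containers A and B from the example above.
both = relative_cpu_fractions({"A": 500, "B": 250})
# After container B is removed, A is the only weight left.
alone = relative_cpu_fractions({"A": 500})

print(both)
print(alone)
```

With both containers present, A is entitled to two thirds of the cycles and B to one third. Once B is gone, A's fraction becomes 1.0, which is exactly the behavior that makes performance under shares hard to predict.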

Both solutions have some limitations when it comes to the goal of flexible but predictable performance. This is where the Completely Fair Scheduler (CFS) comes in. Its bandwidth control allows strict CPU limits (i.e., you specify the maximum CPU bandwidth available to a group or hierarchy).

It might seem to be a disadvantage that containers don’t receive idle CPU cycles in CFS. However, in production setups predictable performance is a more desirable characteristic.
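To make the hard limit concrete, here is a minimal sketch of the quota arithmetic behind CFS bandwidth control. In cgroups v1 the limit is expressed through the cpu.cfs_period_us and cpu.cfs_quota_us files: a group may run for at most quota microseconds in every period. The 100 ms period used here and the mapping "quota = cpus × period" are assumptions for illustration; the helper names are hypothetical.

```python
CFS_PERIOD_US = 100_000  # assumed period of 100 ms


def cfs_quota_us(cpus, period_us=CFS_PERIOD_US):
    """Return the CPU time (in microseconds) a group may use per period,
    assuming the quota is simply cpus * period."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return round(cpus * period_us)


def max_cpu_seconds(cpus, wall_seconds, period_us=CFS_PERIOD_US):
    """Upper bound on CPU time consumed over wall_seconds under the quota."""
    return wall_seconds * cfs_quota_us(cpus, period_us) / period_us


print(cfs_quota_us(0.5))         # quota for a task configured with 0.5 CPUs
print(max_cpu_seconds(0.5, 10))  # CPU seconds available in 10 s of wall time
```

Unlike the share-based calculation, the result does not depend on what else is running on the host: a task with 0.5 CPUs is capped at half a core's worth of time even when the machine is otherwise idle.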

How does it affect DC/OS users?

When using Apache Mesos you have the choice to use CPU shares or CFS strict CPU limitations. If you use CPU shares and your host system has free CPU cycles your task can consume more CPU cycles than initially configured in your Marathon app definition. If you use CFS strict CPU limitations, your task can only consume a maximum of CPU time based on your Marathon configuration.

The default configuration for Mesos is to use CPU shares. DC/OS made CFS strict CPU limitations its default a while ago, but until recently this configuration was respected only by the Mesos executor and not by the Docker executor. The fix for MESOS-6134, which is part of the latest Mesos release and is included in DC/OS 1.10, removes this limitation.

If you recently upgraded to DC/OS 1.10 or configured MESOS_CGROUPS_ENABLE_CFS=true in your Mesos agent configuration and you are now seeing slow-running Docker applications or slow deployments, you probably want to take action!
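One way to check whether a slow task is actually being throttled is to look at the cpu.stat file of its cgroup, which in cgroups v1 reports nr_periods, nr_throttled, and throttled_time. The sketch below only parses that format; the sample values are made up, and where the file lives for a given container depends on your setup, so no path is assumed here. The function names are hypothetical.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup v1 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of enforcement periods in which the group was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Made-up sample contents, not output from a real cluster.
sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 5000000000\n"
ratio = throttled_ratio(parse_cpu_stat(sample))
print(f"throttled in {ratio:.0%} of periods")
```

A ratio that stays well above zero suggests the task regularly wants more CPU than its limit allows.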

If you run into such issues, your apps or deployments are running slowly because they require more CPU cycles than they are allowed to consume. The easiest fix is to raise the resource requirements in your Marathon app definition: change the cpus property to a higher value and test whether this solves the issue.
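As a rough sketch of that change, the snippet below scales the cpus property of an app definition held as a Python dictionary. The app id /my-service, the other field values, and the helper with_more_cpus are hypothetical; submitting the updated definition to Marathon is left out on purpose.

```python
import copy
import json


def with_more_cpus(app_definition, factor):
    """Return a copy of the app definition with cpus scaled by factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    updated = copy.deepcopy(app_definition)
    updated["cpus"] = round(updated["cpus"] * factor, 2)
    return updated


# Hypothetical app definition, for illustration only.
app = {"id": "/my-service", "cpus": 0.5, "mem": 256, "instances": 1}
bumped = with_more_cpus(app, 2)
print(json.dumps(bumped, indent=2))
```

Working on a copy keeps the original definition around, which makes it easy to compare behavior before and after the change.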

In some special cases you may want to change the Mesos agent configuration so that it does not use strict CFS CPU limitations, for example when the majority of your applications have a CPU peak during startup and a lower consumption afterwards, or when you have other unusual CPU load patterns.

If you do not want strict CPU separation, you can change the current default behavior. In this scenario you need to change the configuration for your DC/OS installation as well as your Mesos agent configurations. First, set MESOS_CGROUPS_ENABLE_CFS=false in your dcos-config.yaml. Once this is done, you can either re-install DC/OS or ssh to each Mesos agent node, change the setting in /opt/mesosphere/etc/mesos-slave-common to MESOS_CGROUPS_ENABLE_CFS=false, and restart the Mesos agent process with sudo systemctl restart dcos-mesos-slave. If you are considering changing this configuration, you should also have a look at the Mesos oversubscription feature.
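If you have many agents to update, the per-node edit can be scripted. The following is a minimal sketch that assumes the agent configuration file consists of plain KEY=VALUE lines; it only rewrites text in memory, and the sample contents and the helper set_env_flag are hypothetical. Reading and writing the real file, and restarting the agent, are left to your own tooling.

```python
def set_env_flag(config_text, key, value):
    """Set key=value in KEY=VALUE style text, appending it if missing."""
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(key + "="):
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical file contents, not copied from a real agent.
sample = "MESOS_ISOLATION=cgroups/cpu,cgroups/mem\nMESOS_CGROUPS_ENABLE_CFS=true\n"
updated = set_env_flag(sample, "MESOS_CGROUPS_ENABLE_CFS", "false")
print(updated, end="")
```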