Apache Mesos 0.22.0 released

The latest Mesos release, 0.22.0, was announced March 25th and is now available for download. Mesosphere’s Core Team was deeply involved in this release, and in this post we give an overview of what it means for operators and developers who use Mesos.

Disk quotas

Disk space isolation is now a standard feature in Mesos. Mesos has always supported the scheduling of disk resources. However, disk allocation has previously depended on frameworks operating as “good citizens” and staying within their share of the total available disk space. Now, operators can enforce these limits with the posix/disk isolator.

The new disk isolator checks the disk usage of each task periodically, with a configurable interval, and kills the task if it exceeds its share.
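
Conceptually, the periodic check resembles running `du` over each task’s sandbox and comparing the result to the task’s disk resource. Below is a minimal Python sketch of that idea (illustrative only; the real isolator is implemented in C++ inside Mesos, and the function names here are made up):

```python
import os

def sandbox_disk_usage(path):
    """Roughly what `du` reports: total bytes of all files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def check_quota(path, quota_bytes, enforce=True):
    """One periodic check: report 'kill' only if the sandbox is over
    quota AND enforcement is enabled; otherwise just observe."""
    used = sandbox_disk_usage(path)
    if used > quota_bytes:
        return "kill" if enforce else "over-quota"
    return "ok"
```

Without enforcement, the same check merely records that the task is over quota, which matches the measurement-only behavior described below.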

The isolator is enabled with two compute node flags:

$ mesos-slave ... --isolation="posix/disk" --enforce_container_disk_quota

Without the --enforce_container_disk_quota flag, the isolator will not kill tasks that go over their quota, but will only measure the task’s disk usage. This data is exposed over the metrics endpoint on the compute node:

$ curl http://slave/metrics/snapshot
...
  "slave/disk_percent": 0.000135926701526202,
  "slave/disk_total": 470842,
  "slave/disk_used": 64,
...

With the enforcement flag enabled, a task that consumes more than its allocated share is killed by the disk isolator. Below is an example of a task that writes 320MB into its sandbox even though it was scheduled with only 64MB of disk. As soon as the task exceeds its quota, it is killed by Mesos.

$ mesos-execute ... --resources="...;disk:64" --command="dd ... bs=1024 count=327680 ..."
...
Received status update TASK_RUNNING for task foobar
Received status update TASK_FAILED for task foobar
...

For more information about disk isolation, take a look at the containerization documentation.

Compute node removal rate-limiting

Failure detection is a necessity in distributed systems and therefore crucial in Mesos. Compute nodes that fail, for example due to hardware failure or network issues, or that otherwise get disconnected, are marked as not present and are shut down if they reconnect. However, if the master loses network connectivity, or otherwise concludes that all compute nodes, or a majority of the cluster, have disconnected, disaster can follow. While this sounds scary, Mesos has been continuously improving its safety mechanisms to prevent such disasters, and the latest guard is compute node removal rate-limiting. This is illustrated in the three examples below:

Under normal circumstances

Compute nodes signal their health, and thus their presence, by responding to the master node’s ping messages.

Failure cases

In failure cases where the compute node fails to respond to a number of consecutive ping messages, the master will try to shut down the node immediately, or when (and if) it eventually reconnects.

Disaster

If the master node gets disconnected from all compute nodes, rare circumstances may lead the master to believe that all nodes have disconnected, and to attempt to shut down the entire cluster.

The new removal rate-limiting mechanism lets you specify the number of compute nodes the master is permitted to shut down over a period of time, through a new command line flag: --slave_removal_rate_limit.

$ mesos-master ... --slave_removal_rate_limit=1/1mins

In the example above, at most 1 compute node may be removed per minute. To get an idea of the impact of this feature, consider the examples below. Prior to Mesos 0.22.0, the default behavior (‘no limit’) gave little or no time to recover from an unhealthy master! Now removal can be throttled at configurable rates (here shown with 1 node every minute, 5 nodes every minute, and 10 nodes every 5 minutes), which makes it possible to monitor, and act, when a suspicious number of compute nodes gets disconnected, and to prevent the loss of important tasks running in your cluster.
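
To make the semantics of a rate like 1/1mins concrete, here is a sliding-window sketch in Python. This is only a conceptual approximation of rate limiting (Mesos’s actual implementation lives inside the master and is not this code); the class and method names are invented for illustration:

```python
import collections
import time

class RemovalRateLimiter:
    """Permit at most `limit` compute node removals within any
    `window_secs`-second window, e.g. limit=1, window_secs=60
    approximates --slave_removal_rate_limit=1/1mins."""

    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window = window_secs
        self.removals = collections.deque()  # timestamps of recent removals

    def try_remove(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop removals that have aged out of the window.
        while self.removals and now - self.removals[0] >= self.window:
            self.removals.popleft()
        if len(self.removals) < self.limit:
            self.removals.append(now)
            return True   # removal permitted
        return False      # throttled; the removal must wait
```

With limit=1 and window_secs=60, a second unhealthy node detected 30 seconds after the first would be throttled, giving an operator time to notice and intervene before the cluster drains.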

For more information about compute node removal limits, please refer to the configuration documentation.

Task labels

From our experience working with framework writers and operators, we have found that tagging tasks is a common use case and necessity. Prior to Mesos 0.22.0, the only way for framework writers and operators to encode globally visible traits on tasks was to encode them in the task names, e.g., cassandra.prod.foo.bar.apples.bananas.1.

With this release, tasks can now carry arbitrary key/value pairs, which are exposed over the master and compute node endpoints! Why? This enables a variety of external and internal tooling, such as service discovery, security subsystems, resource accounting, tracing, and much more. Below is an example of a task which has been launched on a Mesos cluster with two labels, environment and bananas:

$ curl http://master/state.json
...
{
  "executor_id": "default",
  "framework_id": "20150312-120017-16777343-5050-39028-0000",
  "id": "3",
  "labels": [
    {
      "key": "environment",
      "value": "prod"
    },
    {
      "key": "bananas",
      "value": "apples"
    }
  ],
  "name": "Task 3",
  "slave_id": "20150312-115625-16777343-5050-38751-S0",
  "state": "TASK_FINISHED",
  ...
},
...
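
External tooling can consume these labels directly from the endpoint. The sketch below filters tasks by label, assuming the state.json shape shown above (tasks nested under frameworks); the master hostname is a placeholder:

```python
import json

def tasks_with_label(state, key, value):
    """Yield tasks from a parsed state.json document that carry the
    given label key/value pair."""
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            labels = task.get("labels", [])
            if any(l.get("key") == key and l.get("value") == value
                   for l in labels):
                yield task

# Against a real cluster (placeholder URL):
# from urllib.request import urlopen
# state = json.load(urlopen("http://master:5050/state.json"))
# prod_tasks = list(tasks_with_label(state, "environment", "prod"))
```

A service-discovery tool, for example, could use such a filter to find every task labeled environment=prod without parsing task names.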

Hooks and decorators

A new extension to the Mesos Modules subsystem landed in Mesos 0.22.0: module hooks and decorators. Mesos Modules, introduced in Mesos 0.21.0, is a powerful extension mechanism that enables developers to extend and replace the internals of Mesos without forking it. This allows customizing Mesos for new environments or use cases; for example, you could insert experimental or proprietary resource isolators and authenticators without making changes to Mesos itself.

Hooks are a new way to extend or modify Mesos in a more lightweight manner. Hooks don’t replace entire subsystems; they work like event callbacks across library borders. The first path within Mesos to be extended with hooks is the launch task sequence, from and within the master, all the way to the compute node executing the task. The figure below shows a simplified picture of the launch task sequence, with the hook points at which module developers can be notified.

A decorator is a special kind of hook that has a return value, which allows module developers to ‘decorate’ the object in transit. For now, decorators can modify task labels and executor environment variables along the launch task sequence. In the example below, the master launch task decorator extends the task labels with a new ‘b’ pair, and the slave executor environment decorator adds a new ‘c’ environment variable.
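
Real hooks are C++ modules loaded into Mesos; the Python sketch below only shows the data flow of the two decorators just described: each receives the object in transit and returns a (possibly extended) replacement. Function names and the ‘b’/‘c’ values are illustrative, not the actual module API:

```python
def master_launch_task_label_decorator(labels):
    """Sketch of a label decorator on the master's launch-task path:
    return an extended copy of the labels; the copy replaces the
    original in the launch sequence."""
    return labels + [{"key": "b", "value": "b-value"}]

def slave_executor_environment_decorator(env):
    """Sketch of an environment decorator on the compute node: add a
    variable to the executor's environment without mutating the input."""
    decorated = dict(env)
    decorated["c"] = "c-value"
    return decorated
```

Because a decorator returns a value rather than just observing, a chain of such modules can incrementally enrich labels and environments as a task travels from the master to the executor.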

For more information about module hooks and decorators, please refer to the modules documentation or reach out on the Mesos Modules developer list.

Upgrading Mesos from 0.21.1 to 0.22.0

Live upgrading from 0.21.1 to 0.22.0 can be done seamlessly, but it requires a bit more care than previous upgrades due to scheduler API changes. The upgrade documentation describes the necessary steps to upgrade a running 0.21.1 cluster to this release.

Furthermore, the master and compute node /stats.json metrics HTTP endpoints went through a deprecation cycle in 0.21.1 and have now been completely disabled. Please refer to the HTTP endpoint documentation for /metrics/snapshot for more information on how to locate the new metrics.

Upcoming releases and features

While Mesos 0.22.0 has introduced exciting new features, even more exciting work lies ahead of us. The Mesos community is actively working on persistence primitives to support dynamic cluster resource reservations and enable a variety of stateful services on Mesos. With regards to Mesos security, we are in the final stages of enabling full SSL support, including live upgrade paths, and of making the ACL subsystem much more powerful.

Stay tuned,

The Mesosphere Core Team

Mesos 0.22.0 talk