DC/OS

Inverse offers in Mesos: The cluster maintenance use case

Oct 07, 2015

Joseph Wu

D2iQ

1 min read

 
A "resource offer" is the Mesos primitive used to allocate resources to frameworks. (For a more in-depth explanation, see the Mesos paper). In Mesos 0.25.0, we introduce the opposite primitive, an "inverse offer."
 
Inverse offers play a key role in the cluster maintenance feature added in 0.25.0, as well as several upcoming features. In this post, we'll give an overview of what inverse offers are and how they can be used for cluster maintenance. In a future post, we will cover potential applications of inverse offers to other Mesos features, like managing a resource quota across frameworks and defragmentation by custom allocators.
 
What is an inverse offer?
 
An inverse offer is a way to deallocate or "drain" resources. An inverse offer shares many of the semantics of a resource offer. It contains information about resources: it can be accepted, rejected, re-offered, and rescinded, and it prompts an action by the recipient framework. An inverse offer and a resource offer can hold many of the same fields, but an inverse offer requests resources rather than offering them. An inverse offer can be used to request a number of different things from the framework:
 
  • Drain an entire framework.
  • Drain an entire agent.
  • Drain a single executor and all tasks it spawned.
  • Drain a single task.
  • Drain some number or resources, from anywhere.
  • Drain some number of resources, from the specified agent, executor or task.
  • Drain any of the above, before some specified "unavailability" time.
 
 
Also new in Mesos 0.25.0, both inverse offers and resources offers now contain a field with unavailability information. The unavailability field takes a start time and a duration (omitting the duration means that the unavailability will last forever). For a resource offer, the duration states if and when the resources in that offer may go away. For an inverse offer, the duration states when the resources should be drained by.
 
Mesos cluster maintenance
 
Mesos 0.25.0 ships with support for scheduling maintenance on a Mesos cluster, followed by bringing machines down and back up. Maintenance primitives, as this feature is called, provide a workflow for a cluster operator and frameworks running on the cluster to cooperate on maintenance events.
 
Before maintenance primitives, maintenance was effectively treated like a normal machine failure because a framework could only adapt to a maintenance event post hoc. When a machine was brought down, it was up to the framework to grab resources and recover.
 
With maintenance primitives, frameworks are given a chance to adapt a priori. And if the framework can't adapt, it can let the operator know. Under the covers, maintenance primitives use inverse offers with unavailability information to communicate the maintenance schedule to frameworks. The Mesos master selectively sends inverse offers to frameworks that are using resources scheduled for maintenance. Each framework only receives inverse offers for resources that it is using.
 
Note that for maintenance primitives, an inverse offer will only ask frameworks to drain entire agents. The finer-grained control provided by inverse offers is not fully utilized. So, for example, maintenance primitives currently cannot be used to roll an upgrade through the cluster one task/resource/framework at a time.
 
Maintenance-ready frameworks
 
One important note before we continue: the interface for inverse offers was not added to the native C++ scheduler interface. Instead, a framework must use the V1 HTTP API to receive inverse offers.
 
For maintenance primitives, inverse offers play a key role in draining a machine of tasks prior to starting maintenance on a machine. Draining a machine means stopping all the tasks running on the machine prior to the maintenance event. Frameworks are responsible for draining machines when instructed by the Mesos master (if they don't, the operator can still "hard drain" the machine by pulling the plug).
 
In the workflow provided by the maintenance primitives feature, a machine starts to drain resources as soon as it is scheduled for maintenance. (For more details on how schedules are constructed, see the maintenance documentation). All frameworks using resources on the draining machine receive inverse offers periodically, asking for the resources back.
 
In the next sections, we will cover what it means for a framework to act upon the inverse offer.
 
Accepting and adapting
 
Framework developers should consider a few questions when deciding whether the framework should accept an inverse offer:
 
  • Can the framework adapt and, if so, what does it require to do so? Does it need extra resources to migrate tasks? Is there a time requirement (for instance, warming up or copying data)?
  • Is the time between the inverse offer and the unavailability sufficient for the framework to adapt? If the framework requires extra resources, is there a reasonable expectation that it will receive resource offers before the unavailability?
  • If the framework is storing data on the agent's disk, is the unavailability short or long? If it is short, the framework might not need to act. If it is long, the framework may need to copy the data elsewhere.
 
Let's consider the following scenario:
 
  • Suppose we have a framework that hosts a set of stateless web servers. This simple framework just tries to keep N instances (tasks) of the server running at once. Each task runs on a different agent.
  • An operator schedules a rolling restart through the entire cluster: starting in a few days, one machine will be taken offline for one hour at a time.
  • The framework receives one inverse offer for each of its N tasks (in no specific order). It accepts them all because there's plenty of time. For N=4, the operator's picture of the cluster might look like this:
 
 
  • The framework deals with maintenance with a simple rule: it prefers to start tasks on agents with unavailability further in the future, since there will be more time for the tasks to complete before the agent is unavailable. In the time before the rolling restart, the framework gets many resources offers, enough to start new servers on agents with the unavailability farthest into the future and to stop the other servers. In the meantime, inverse offers are still being sent periodically by the Mesos master. The framework continues to accept these inverse offers because there is still plenty of time before maintenance begins on any agents the framework is using. By the time the first maintenance window arrives, the cluster might look like this:
 
 
  • As the maintenance time arrives, the operator sees that nothing is running on the first agent and turns it off for maintenance. The operator repeats the process agent-by-agent. When the agent is turned back on, the framework will prefer to start new tasks on that agent because it no longer has scheduled unavailability.
 
 
  • The operator continues to restart machines one by one. By the time the operator reaches the last few machines, all of the framework's tasks have migrated away.
 
This is a rather optimistic picture of maintenance, with only one framework and plenty of free resources floating around. Obviously, this isn't always the case.
 
When to decline inverse offers
 
The overarching concern is whether the maintenance schedule interferes with the framework's ability to satisfy its requirements, such as a service level agreement. There are plenty of reasons why a maintenance schedule might not work for some frameworks:
 
  • There might not be sufficient resources, offers or time to move tasks around.
  • A random agent failure may make the maintenance untenable, especially if this exacerbates the previous point. 
     
    • For example, if you have a framework that handles replicated storage, like HDFS, taking down an agent may reduce the number of available replicas to unacceptable levels. If an agent goes down randomly, then the framework may want to decline the inverse offer until it can bring up another replica.
    • In general, frameworks should be conservative when accepting inverse offers. A framework may want to decline inverse offers unless it has plenty of redundancy ready or until it has migrated tasks away from machines scheduled for maintenance.
  • The maintenance schedule might be badly designed. What if it takes down every machine used by a framework at the same time? Or every machine with a specific set of resources (like GPUs)?
  • The framework might not support maintenance. In this case, the operator should push the framework writer to support maintenance.
 
Filters and inverse offers
 
In the example above, we mentioned that inverse offers are sent periodically, allowing frameworks to keep the operator up to date. The mechanism behind this is the same as the mechanism for resending resource offers—filters. Along with every accept/decline of an inverse offer, the framework can specify a filter. This filter determines when the inverse offer will be sent again.
 
Filtering inverse offers can be useful in many ways:
 
  • A framework might not want to deal with inverse offers until a few hours prior to the maintenance. It could use filters to silence inverse offers until then.
  • A framework might decline an inverse offer, but keep trying to comply. It could use filters to give a periodic update on its attempts to comply. 
 
The double negative: rescinding inverse offers
 
In addition to accepting, declining and filtering inverse offers, inverse offers can be rescinded like other offers. When any offer is rescinded, it means the offer is invalid and framework should not act on the offer nor change any state because of the offer. For example, if the Mesos master rescinds an inverse offer and a framework accepts the same inverse offer simultaneously, the framework should not commit any changes it would have made due to accepting the inverse offer.
 
For maintenance primitives, an inverse offer is rescinded when the operator updates the maintenance schedule. Note that inverse offers can be rescinded and immediately re-sent with updated information.
 
For more information about maintenance primitives, see Ben Mahler's excellent talk at MesosCon '15 in Seattle.

Ready to get started?