Marathon: Gracefully Handling Application Restarts in the Event of Failure

When running any kind of application or service at scale, failure is a fact of life. And as with life itself, it’s how we handle failure that matters. In this blog post we’re going to discuss and play around with a feature of Marathon that’s designed to handle application failures gracefully and in a way that doesn’t impact other running services: retry backoffs.

Note that to follow along with some of the examples below you’ll need a few utilities already installed, namely Docker, curl and jq.

Failure scenarios and their effects

Orchestration engines such as Marathon generally do their best to enforce a particular state, and in this case the desired state is that your application is running and that your service is available to users. However, mistakes are sometimes made, or bugs that aren’t caught during testing are introduced into an application, such that it continually crashes and is subsequently restarted. The problem might also lie with the deployment itself.

Other times it could be some kind of transient infrastructure error, such as a dependent service that’s not responding as expected – perhaps too slowly or even not at all – resulting in your application crashing and restarting. Aggressive restarts, especially of an application with an ‘involved’ startup sequence in which the lack of a cached state puts pressure on other areas of infrastructure (think artifact downloads or a lot of DNS lookups), can be problematic if not handled correctly.

One fairly typical knock-on effect of this kind of problem is that a task is repeatedly and aggressively restarted until the available disk space on an agent is exhausted, potentially taking out other services with it.

In any case, the responsible engineer will need to troubleshoot in order to determine the root cause, but in the meantime we need to make sure our infrastructure is protected from these sorts of scenarios. Thankfully, Marathon has some mechanisms to cope with exactly this.

Handling failures

Marathon has a group of settings that control how frequently it attempts consecutive restarts of your application in the event that it exits for one reason or another. This avoids the scenarios described above by giving operators the ability to configure how aggressively these restarts are triggered. Briefly, these are:

  • backoffSeconds: The initial length of time that Marathon delays an application restart after it has failed;
  • backoffFactor: The rate at which the backoffSeconds delay grows when there are consecutive failures;
  • maxLaunchDelaySeconds: The largest value that the delay is allowed to reach.

The defaults for these are 1 second, “1.15”, and 3600 seconds respectively. If we set them to “3”, “2”, and “3600”, then after a failure Marathon would wait 3 seconds before attempting a restart, and on each consecutive failure this delay would double, until we hit the value specified by maxLaunchDelaySeconds. So the next restart would be in 6 seconds, then 12, then 24, and so on until we hit the one hour (3600 second) mark.
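To make the arithmetic concrete, here’s a quick Python sketch – a model for illustration, not Marathon’s actual implementation – that grows the delay by backoffFactor on each consecutive failure and caps it at maxLaunchDelaySeconds:

```python
def launch_delays(backoff_seconds, backoff_factor, max_launch_delay_seconds, failures):
    """Model Marathon's launch delay: start at backoff_seconds, multiply by
    backoff_factor on each consecutive failure, cap at max_launch_delay_seconds."""
    delays = []
    delay = backoff_seconds
    for _ in range(failures):
        delays.append(min(delay, max_launch_delay_seconds))
        delay *= backoff_factor
    return delays

# The example values above: 3 seconds, doubling, capped at one hour.
print(launch_delays(3, 2, 3600, 6))  # [3, 6, 12, 24, 48, 96]
```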

Let’s look at what happens when we introduce an application that fails and how we can monitor Marathon’s backoff handling behaviour.

Demonstrating Marathon’s backoff feature

Let’s deploy a simple container which has a script that exits with a non-zero return code after a small delay. This will cause Marathon to attempt to reschedule our application each time it fails, and we can monitor its Event Bus to keep an eye on what’s going on.

Firstly we need to create this application. Populate two text files, as follows:

Dockerfile:

FROM alpine:edge
ADD failure.sh /failure.sh
RUN chmod +x /failure.sh
ENTRYPOINT ["/failure.sh"]

And ‘failure.sh’:

#!/usr/bin/env sh

echo "This didn't work, aborting!"
sleep 5
false

And then build and push our image to the Docker Hub – note that you’ll need to amend the tag so that it’s prefixed with your username:

$ docker build -t yankcrime/failure:latest .
Sending build context to Docker daemon  3.072kB
[..]
Successfully built 09b980595fb5
Successfully tagged yankcrime/failure:latest

$ docker push yankcrime/failure:latest
The push refers to repository [docker.io/yankcrime/failure]
030586f82351: Pushed
latest: digest: sha256:1b31369cdc36e72df4608c545bc5551fbc259b80273926d1220244bffb539dc8 size: 942

Now we can deploy an application using this image to our DC/OS cluster. Create the following JSON file, ‘failure.json’:

{
    "id": "failure",
    "instances": 1,
    "container": {
        "volumes": [],
        "type": "MESOS",
        "docker": {
            "privileged": false,
            "image": "yankcrime/failure"
        }
    },
    "backoffSeconds": 3,
    "backoffFactor": 2,
    "maxLaunchDelaySeconds": 3600
}

And then let’s create our application deployment:

$ dcos marathon app add failure.json
Created deployment 1cadb36b-05cb-454b-a6fa-9a3584428182

Investigating backoff

If we examine the state of our deployments we can see that our application is not actually running:

$ dcos marathon app list
ID            MEM   CPUS  TASKS  HEALTH  DEPLOYMENT  WAITING  CONTAINER  CMD
/failure      128    1     0/1    N/A       ---      False      MESOS    N/A
/grafana      512   0.3    1/1    1/1       ---      False      DOCKER   N/A
/marathon-lb  1024   2     1/1    1/1       ---      False      DOCKER   N/A
/prometheus   1024   1     1/1    1/1       ---      False       N/A     export...

And the output from the last task’s log should come as no surprise:

$ dcos task --completed | grep failure | tail -1
failure           172.31.0.6   root    F    failure.f1fb810b-6c47-11e9-b4d3-6239820f53ec            3d5aa737-af67-44d9-85fa-bfa2ba53d271-S0   ---    ---
$ dcos task log --completed failure.f1fb810b-6c47-11e9-b4d3-6239820f53ec
This didn't work, aborting!

To examine this further, we need to take a look at the Marathon API. DC/OS exposes some of this information for us graphically, but we need to poke at the API directly to retrieve the full detail.

First of all, let’s set up a couple of environment variables for convenience:

$ BASEURL="$(dcos config show core.dcos_url)"
$ TOKEN="$(dcos config show core.dcos_acs_token)"

The documentation suggests that in order to see if our service is being delayed, we should query the /v2/queue endpoint, and if delay.timeLeftSeconds is greater than 0 for a service.id matching that of our application, then backoff behaviour is in effect. We can hit the API using any HTTP client such as curl, and to filter this down to the details we care about we can employ a sprinkling of jq:

$ curl -sk -H "Authorization: token=${TOKEN}" ${BASEURL}/marathon/v2/queue | jq -r '.queue[] | select(.app.id | contains("failure")) | .delay.timeLeftSeconds'
11
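If jq isn’t to hand, the same filter is straightforward in Python. This sketch assumes the /v2/queue response shape described above – a top-level "queue" array whose items carry app and delay objects – and uses a trimmed sample payload rather than a live API call:

```python
import json

def time_left(queue_payload, id_fragment):
    """Return delay.timeLeftSeconds for each queued app whose id contains id_fragment."""
    return [item["delay"]["timeLeftSeconds"]
            for item in queue_payload["queue"]
            if id_fragment in item["app"]["id"]]

# A hand-written sample in the shape of a /v2/queue response.
payload = json.loads('''
{"queue": [
  {"app": {"id": "/failure"}, "delay": {"timeLeftSeconds": 11, "overdue": false}},
  {"app": {"id": "/grafana"}, "delay": {"timeLeftSeconds": 0, "overdue": true}}
]}
''')

print(time_left(payload, "failure"))  # [11]
```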

If you run that command a few times in quick succession, you’ll see this value decrease until it hits 0, and then, in the case of our test, reset to double the previous starting value, given our backoffFactor of 2. The delay keeps doubling in this way on each consecutive failure until it reaches the maximum value of maxLaunchDelaySeconds.

Monitoring backoff

As mentioned previously, Marathon has an Event Bus that’s better suited to monitoring events related to services, such as our problematic application that’s in a restart / backoff state. We can take a look at what’s going on here with the following command:

$ curl -H "Authorization: token=${TOKEN}" \
  -H "Accept: text/event-stream" \
  ${BASEURL}/service/marathon/v2/events

This will then output everything from the event bus until we hit Control-C:

event: event_stream_attached
data: {"remoteAddress":"172.31.0.4","eventType":"event_stream_attached","timestamp":"2019-05-02T18:48:03.108Z"}
data: {"slaveId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S6","taskId":"failure.18e9e3fd-6d0b-11e9-b4d3-6239820f53ec","taskStatus":"TASK_STARTING","message":"","appId":"/failure","host":"172.31.0.3","ipAddresses":[{"ipAddress":"172.31.0.3","protocol":"IPv4"}],"ports":[25970],"version":"2019-05-02T18:48:23.694Z","eventType":"status_update_event","timestamp":"2019-05-02T18:50:04.595Z"}

event: instance_changed_event
data: {"instanceId":"failure.marathon-18e9e3fd-6d0b-11e9-b4d3-6239820f53ec","condition":"Starting","runSpecId":"/failure","agentId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S6","host":"172.31.0.3","runSpecVersion":"2019-05-02T18:48:23.694Z","timestamp":"2019-05-02T18:50:04.595Z","eventType":"instance_changed_event"}

event: status_update_event
data: {"slaveId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S6","taskId":"failure.18e9e3fd-6d0b-11e9-b4d3-6239820f53ec","taskStatus":"TASK_RUNNING","message":"","appId":"/failure","host":"172.31.0.3","ipAddresses":[{"ipAddress":"172.31.0.3","protocol":"IPv4"}],"ports":[25970],"version":"2019-05-02T18:48:23.694Z","eventType":"status_update_event","timestamp":"2019-05-02T18:50:04.611Z"}

event: instance_changed_event
data: {"instanceId":"failure.marathon-18e9e3fd-6d0b-11e9-b4d3-6239820f53ec","condition":"Running","runSpecId":"/failure","agentId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S6","host":"172.31.0.3","runSpecVersion":"2019-05-02T18:48:23.694Z","timestamp":"2019-05-02T18:50:04.611Z","eventType":"instance_changed_event"}

event: status_update_event
data: {"slaveId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S6","taskId":"failure.18e9e3fd-6d0b-11e9-b4d3-6239820f53ec","taskStatus":"TASK_FAILED","message":"Command exited with status 1","appId":"/failure","host":"172.31.0.3","ipAddresses":[{"ipAddress":"172.31.0.3","protocol":"IPv4"}],"ports":[25970],"version":"2019-05-02T18:48:23.694Z","eventType":"status_update_event","timestamp":"2019-05-02T18:50:09.805Z"}

This API can be used to export events into a metrics system, giving operators the necessary visibility into the state of their services.

If you keep an eye on this event stream you’ll notice that after a while, the rate at which Marathon attempts to restart our application slows considerably. This observation is somewhat anecdotal; if we were pumping these events into a time-series database then we could be more precise, but for the purposes of this demonstration it’ll suffice. We can actually reset the delay entirely by issuing a DELETE to the relevant REST API endpoint:

$ curl -X DELETE -k -H "Authorization: token=${TOKEN}" ${BASEURL}/marathon/v2/queue/failure/delay

And then immediately we’ll see that Marathon does its best to start our application:

event: instance_changed_event
data: {"instanceId":"failure.marathon-cacb5bfd-6d10-11e9-b4d3-6239820f53ec","condition":"Created","runSpecId":"/failure","agentId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S4","host":"172.31.0.25","runSpecVersion":"2019-05-02T18:48:23.694Z","timestamp":"2019-05-02T19:30:50.076Z","eventType":"instance_changed_event"}

event: status_update_event
data: {"slaveId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S4","taskId":"failure.cacb5bfd-6d10-11e9-b4d3-6239820f53ec","taskStatus":"TASK_STARTING","message":"","appId":"/failure","host":"172.31.0.25","ipAddresses":[{"ipAddress":"172.31.0.25","protocol":"IPv4"}],"ports":[2274],"version":"2019-05-02T18:48:23.694Z","eventType":"status_update_event","timestamp":"2019-05-02T19:30:50.462Z"}

event: instance_changed_event
data: {"instanceId":"failure.marathon-cacb5bfd-6d10-11e9-b4d3-6239820f53ec","condition":"Starting","runSpecId":"/failure","agentId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S4","host":"172.31.0.25","runSpecVersion":"2019-05-02T18:48:23.694Z","timestamp":"2019-05-02T19:30:50.462Z","eventType":"instance_changed_event"}

event: status_update_event
data: {"slaveId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S4","taskId":"failure.cacb5bfd-6d10-11e9-b4d3-6239820f53ec","taskStatus":"TASK_RUNNING","message":"","appId":"/failure","host":"172.31.0.25","ipAddresses":[{"ipAddress":"172.31.0.25","protocol":"IPv4"}],"ports":[2274],"version":"2019-05-02T18:48:23.694Z","eventType":"status_update_event","timestamp":"2019-05-02T19:30:50.473Z"}

event: instance_changed_event
data: {"instanceId":"failure.marathon-cacb5bfd-6d10-11e9-b4d3-6239820f53ec","condition":"Running","runSpecId":"/failure","agentId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S4","host":"172.31.0.25","runSpecVersion":"2019-05-02T18:48:23.694Z","timestamp":"2019-05-02T19:30:50.473Z","eventType":"instance_changed_event"}

event: status_update_event
data: {"slaveId":"3d5aa737-af67-44d9-85fa-bfa2ba53d271-S4","taskId":"failure.cacb5bfd-6d10-11e9-b4d3-6239820f53ec","taskStatus":"TASK_FAILED","message":"Command exited with status 1","appId":"/failure","host":"172.31.0.25","ipAddresses":[{"ipAddress":"172.31.0.25","protocol":"IPv4"}],"ports":[2274],"version":"2019-05-02T18:48:23.694Z","eventType":"status_update_event","timestamp":"2019-05-02T19:30:55.682Z"}

Using the data pulled from this bus we can build comprehensive monitoring to alert on certain thresholds or to help analyze trends across deployments. This would entail subscribing to the event stream and parsing the data into a form suitable for storing in a time-series database, so that it can be queried by eventType and appId (for example).
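As a rough illustration, a subscriber could turn the text/event-stream output above into records ready for a metrics pipeline. This sketch only handles the simple event/data framing shown in the examples; a production consumer would want a proper SSE client library and reconnection handling:

```python
import json

def parse_event_stream(lines):
    """Parse SSE-style 'data:' lines into (eventType, appId, timestamp) records."""
    records = []
    for line in lines:
        if line.startswith("data:"):
            payload = json.loads(line[len("data:"):].strip())
            records.append((
                payload.get("eventType"),
                # instance_changed_event carries runSpecId rather than appId.
                payload.get("appId") or payload.get("runSpecId"),
                payload.get("timestamp"),
            ))
    return records

# A couple of lines in the shape of the event stream output above.
sample = [
    'event: status_update_event',
    'data: {"taskStatus": "TASK_FAILED", "appId": "/failure", '
    '"eventType": "status_update_event", "timestamp": "2019-05-02T18:50:09.805Z"}',
]
print(parse_event_stream(sample))
```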

Conclusion

In this blog post we’ve introduced a set of options in Marathon that are designed to protect the wider infrastructure in the event of application failure. Sensible defaults are in place, but it’s useful for operators to be aware of this behavior so that it can be tweaked or fine-tuned on a per-deployment basis where necessary.

Our suggestion for DC/OS administrators is to build on the APIs that Marathon exposes and store salient metrics in a time-series database in order to provide insight into service behavior and to assist with ad-hoc troubleshooting in the event of anything going awry.