Delivering ‘Day 2’ operations with DC/OS

This blog post is the first in a three-part series on “Day 2” operations for DC/OS.

It takes more to run an application in production than installing some software and starting applications. For operators, the job truly begins on what we like to call “Day 2,” when those applications must be maintained, upgraded, debugged and run without downtime. Without the work that happens on Day 2 (and every day after that), everything that happened before it would be for nothing.

Since DC/OS is an operating system, we have the perfect platform on which to build the APIs and functionality operators need to be successful and efficient at their jobs. Our aim is to give operators rich APIs that are generic enough to fit into any stack, whether it’s ELK for logs or DataDog for metrics, and to ship features that integrate gracefully with the operator’s favorite tools.

Our focus doesn’t stop at the system components. We aim to ship metrics and logs for the applications users run on the cluster, as well. This gives our operators the best possible stack for maintaining uptime and availability. No longer do operators need to implement custom solutions for every component and application running in their datacenters. They can automatically get the data needed to keep that cluster up and running.

We have identified three core areas where we aim to ship new features in the forthcoming DC/OS 1.10 release to improve the Day 2 operations experience:

  • Logging
  • Metrics
  • Debugging

We will cover them in order throughout this series of blog posts, starting with logging. Those running Mesosphere Enterprise DC/OS can expect these APIs to be secured with the same authentication and authorization they’ve come to expect from Enterprise DC/OS 1.8, which was released in September.

DC/OS logging API

Our aim in building a cluster-wide logging API is to ensure our operators can integrate DC/OS with any log aggregator. That means it needs to work as seamlessly with an ELK stack fronted by Redis as it does when dumping to Splunk or other hosted log systems. For our enterprise customers, it needs to obey our security requirements for authorization and authentication when being queried by services or cluster operators.

The logging API has one goal: make logs from DC/OS core services and from applications deployed to DC/OS (frameworks or containers) available through one intuitive HTTP API.

Step 1: Everything goes to journald

In order to do this, we needed to redesign how we get logs from tasks. Today, DC/OS frameworks dump their STDOUT and STDERR to the Mesos sandbox. This is not easily accessible, nor is it integrated with where the host-level systems (read: DC/OS core services, such as Adminrouter) dump their logs. Core services, or anything else running as a systemd unit, dump their logs to journald.

Our first step, then, is to make the task logs go to journald. To do this, we had to write a Mesos module. This module takes every STDERR and STDOUT line that a framework produces and reformats it for ingestion by journald. With this new module in place, all the logs on a cluster node end up in one place, and we can build an API on top of that to expose them to the rest of the world.
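To make the idea of reformatting a line for journald concrete, here is a minimal Python sketch. It is not the Mesos module itself, which this post does not show. It only illustrates journald's native wire format, in which each field is sent as KEY=value followed by a newline, and values that contain newlines are length-prefixed instead. The helper names, the STREAM field and the stream-to-priority mapping are assumptions made for illustration.

```python
import struct


def to_journal_datagram(fields: dict[str, str]) -> bytes:
    """Serialize fields using journald's native protocol framing."""
    out = bytearray()
    for key, value in fields.items():
        name = key.encode("ascii")
        data = value.encode("utf-8")
        if b"\n" in data:
            # Multi-line values: KEY, newline, 64-bit little-endian
            # length, the raw value, then a trailing newline.
            out += name + b"\n" + struct.pack("<Q", len(data)) + data + b"\n"
        else:
            # Single-line values use the simple KEY=value form.
            out += name + b"=" + data + b"\n"
    return bytes(out)


def task_line_to_datagram(line: str, stream: str) -> bytes:
    """Wrap one task output line as a journal entry (hypothetical helper)."""
    # Assumption: stderr is logged as syslog priority 3 (err) and
    # stdout as 6 (info). STREAM is an illustrative custom field.
    priority = "3" if stream == "stderr" else "6"
    return to_journal_datagram({
        "MESSAGE": line.rstrip("\n"),
        "PRIORITY": priority,
        "STREAM": stream.upper(),
    })


if __name__ == "__main__":
    print(task_line_to_datagram("server started\n", "stdout"))
```

On a systemd host, a datagram like this would typically be sent to the journal's Unix socket at /run/systemd/journal/socket. The sketch stops short of that so it can run anywhere.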

Step 2: Add some structured data

The second step is adding some structured data to the log lines. You need to know more than just what a task is outputting in order to debug an issue. It is important to know what host the task is running on and which framework started it. Instead of requiring your developers to add that, we do it for you.
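As a rough illustration of why this matters, the Python sketch below attaches task context to a log line and then filters on it. The field names (AGENT_HOST, FRAMEWORK_ID, TASK_ID) are hypothetical and chosen for the example. The post does not specify which fields DC/OS actually attaches.

```python
def enrich(message: str, task: dict[str, str]) -> dict[str, str]:
    """Return a log record with task context added as extra fields."""
    record = {"MESSAGE": message}
    # Hypothetical context keys; journald field names are upper case.
    for name in ("agent_host", "framework_id", "task_id"):
        if name in task:
            record[name.upper()] = task[name]
    return record


def select(records: list[dict[str, str]], **wanted: str) -> list[dict[str, str]]:
    """Keep only the records whose fields match every wanted value."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in wanted.items())
    ]


if __name__ == "__main__":
    logs = [
        enrich("listening on :8080", {"agent_host": "10.0.0.5", "framework_id": "marathon"}),
        enrich("compaction done", {"agent_host": "10.0.0.6", "framework_id": "cassandra"}),
    ]
    for record in select(logs, FRAMEWORK_ID="marathon"):
        print(record)
```

Because the context travels with each line, a question such as "show me everything this framework logged on that host" becomes a field match rather than a text search.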

Step 3: Proxy the logs API on Adminrouter

The entry point to our logging API for the DC/OS CLI, user interface or external entities is Adminrouter. This customized NGINX proxy figures out how to route requests to a specific host that has been given a Mesos role ID. You get access to your logs for debugging without moving them around at all. If your log aggregation infrastructure is down or you just don’t want the expense of moving bits around, we’ve got you covered.
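The routing step can be pictured with a small Python sketch. It assumes a hypothetical URL layout, /logs/<node-id>/<rest>, and a hypothetical lookup table from node IDs to host addresses. Adminrouter's real routes and configuration are not described in this post, so treat this purely as an illustration of the idea.

```python
def resolve_upstream(path: str, nodes: dict[str, str]) -> str:
    """Map an incoming request path to the node that holds the logs."""
    # Assumed layout: /logs/<node-id>/<rest of the request>
    parts = path.strip("/").split("/", 2)
    if len(parts) < 2 or parts[0] != "logs":
        raise ValueError(f"not a logs request: {path!r}")
    node_id = parts[1]
    rest = parts[2] if len(parts) > 2 else ""
    if node_id not in nodes:
        raise LookupError(f"unknown node: {node_id!r}")
    # Forward the remainder of the path to the selected host.
    return f"http://{nodes[node_id]}/{rest}"


if __name__ == "__main__":
    # Hypothetical node table; a real proxy would build this from cluster state.
    nodes = {"agent-1": "10.0.0.5:8080", "agent-2": "10.0.0.6:8080"}
    print(resolve_upstream("/logs/agent-2/stream", nodes))
```

The point of the design is that the caller only ever talks to one entry point, and the logs stay on the host that produced them until someone asks for them.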

Log Integrations

The logging API and Mesos logging module together provide the foundation for seamless integrations with popular log shipping stacks such as ELK, Splunk or Fluentd. Since all the logs end up in journald, you can easily add shipping agents for these popular log aggregation stacks. These two logging primitives give our customers and end users a first-class experience for both application and DC/OS service logs.
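To show what a shipping agent has to do, here is a minimal Python sketch that converts journal entries into flat events an aggregator could index. It assumes the entries arrive one JSON object per line, which is what `journalctl -o json` produces, with `__REALTIME_TIMESTAMP` given in microseconds since the epoch. The output keys ("@timestamp", "message", "unit", "host") are assumptions for the example, not a schema that any particular aggregator requires.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def journal_json_to_event(line: str) -> dict:
    """Convert one line of journal JSON output into a flat event."""
    entry = json.loads(line)
    # The journal reports wall-clock time as a string of microseconds.
    micros = int(entry["__REALTIME_TIMESTAMP"])
    timestamp = EPOCH + timedelta(microseconds=micros)
    return {
        "@timestamp": timestamp.isoformat(),
        "message": entry.get("MESSAGE", ""),
        "unit": entry.get("_SYSTEMD_UNIT"),
        "host": entry.get("_HOSTNAME"),
    }


if __name__ == "__main__":
    # Sample entry with made-up values, shaped like journal JSON output.
    sample = json.dumps({
        "__REALTIME_TIMESTAMP": "1480000000000000",
        "MESSAGE": "request served",
        "_SYSTEMD_UNIT": "example.service",
        "_HOSTNAME": "node-1",
    })
    print(journal_json_to_event(sample))
```

A real agent would read lines continuously and forward the resulting events to its backend. Only the conversion step is shown here.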

DC/OS CLI Node Log Command

For some time, the DC/OS CLI has had its own log command to get framework logs to the end user. This command will not change in usage, but it will use the new log API. Before, users could only use this CLI command to get logs from tasks, but now they’ll also be able to get logs for DC/OS core services such as Adminrouter or the Mesos master and agent services.

This is invaluable for debugging, for example when you need to view the Marathon logs and your application logs at the same time. That is now possible from a single utility, without having to SSH into a cluster host.
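Viewing two logs "at the same time" usually means interleaving them by timestamp. The Python sketch below shows that idea with the standard library's heapq.merge, assuming each source is already sorted by time and each entry is a (timestamp, source, message) tuple. The tuple layout and sample values are assumptions for illustration, and this is not how the CLI itself is implemented.

```python
import heapq


def interleave(*streams):
    """Merge time-sorted log streams into a single time-ordered list."""
    # heapq.merge walks the inputs lazily and relies on each one
    # already being sorted by the key, here the timestamp.
    return list(heapq.merge(*streams, key=lambda entry: entry[0]))


if __name__ == "__main__":
    # Made-up entries: (seconds, source, message).
    marathon = [(1, "marathon", "deploying app"), (4, "marathon", "task running")]
    app = [(2, "app", "booting"), (3, "app", "listening on :8080")]
    for timestamp, source, message in interleave(marathon, app):
        print(timestamp, source, message)
```

Read side by side like this, the scheduler's decisions and the application's reactions line up in the order they happened.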

Keep reading

We hope this is a helpful and informative look into some of the new features coming to DC/OS. Keep an eye on our blog for Parts 2 and 3 of this series, which will dive deeper into DC/OS metrics and debugging, respectively. For more information about what’s possible in DC/OS today, check out the project documentation.