Hardening Kubernetes on the DCOS with etcd-mesos

As part of our work to put Kubernetes on the Mesosphere DCOS, we built etcd-mesos, a system for maintaining an etcd cluster that can withstand devastating failures.

Kubernetes is a powerful tool for managing application containers. It provides opinionated solutions for deployment and service discovery, allowing you to spend more of your time building great products instead of figuring out how to get them into production. But before you can begin experimenting with its capabilities, you have to get Kubernetes itself into production.

If you want to use Kubernetes, you first have to learn:

  • How to configure networking.
  • How to set up etcd.
  • How to replace a failed Kubernetes API server, scheduler or controller manager.

With the Mesosphere DCOS, we are building a technology that allows someone with zero operational experience to deploy battle-hardened distributed systems, such as Cassandra, Kafka or Kubernetes, with the push of a button. etcd-mesos simplifies the etcd setup process and reduces the recovery time for many types of failures.

etcd-mesos can even recover from failures that impact the majority of the cluster. Without etcd-mesos, these types of failures require an engineer to manually perform an emergency backup-and-restore of the cluster (see this etcd recovery document). etcd-mesos acts as a cautious-yet-efficient operator, taking preventative action where possible and recovering from unhealthy states when encountered.
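To make "failures that impact the majority of the cluster" concrete: etcd relies on Raft consensus, so a cluster can only accept writes while more than half of its members are healthy. The short sketch below works through that arithmetic. It is an illustration only, not code from etcd-mesos, and the function names are hypothetical.

```python
def quorum_size(members: int) -> int:
    """Smallest number of healthy members that still forms a majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail before the cluster loses quorum."""
    return members - quorum_size(members)


def has_quorum(members: int, failed: int) -> bool:
    """True if the surviving members still form a majority."""
    if not 0 <= failed <= members:
        raise ValueError("failed must be between 0 and members")
    return members - failed >= quorum_size(members)


if __name__ == "__main__":
    # Print quorum size and failure tolerance for common cluster sizes.
    for n in (1, 3, 5, 7):
        print(f"members={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

A three-member cluster survives the loss of one member, but losing two leaves it without quorum. That second case is the kind of majority failure that otherwise calls for a manual backup-and-restore.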

Our testing regime

As part of our hardening process, we have sacrificed thousands of clusters to fault injection under close monitoring. Our engineers, who have accumulated invaluable experience running some of the largest database installations on Earth, drive this process. Along the way we found several bugs in etcd itself, which Xiang Li from CoreOS rapidly fixed.

We know that our testing assumptions are insufficient to bet your business on, so we are excited to announce etcd-mesos as alpha-level open source software. You can try to break it and benefit from it! We look forward to your feedback and we are hopeful that you will invalidate some of our assumptions in the coming weeks, as the system continues to be hardened.

This is just one part of our work to make Kubernetes an effortless experience for users of the DCOS. With a rock-solid source of truth, we can get a higher return on work to harden and optimize other parts of the system.

Check it out, try to break it, have fun and let us know how it goes!