The 4 Top Service Orchestration Challenges

Here are the top service orchestration challenges that your IT organization can face.

Service orchestration faces challenges because the lifecycle of distributed stateful services is typically more complex compared to individual containers.

Also, as you scale services, you need to consider the underlying resource utilization and infrastructure on which these services can be deployed.

Challenge 1: Complex Lifecycle Management

Services, especially distributed ones, can have complex deployment steps with multiple dependencies.

Let’s look at Kubernetes as an example: have you ever set up a distributed, highly available Kubernetes cluster (and no, minikube does not count here)? If you have, you know that DIYing your own Kubernetes cluster can be incredibly hard.

Here are the deployment steps for Kelsey Hightower’s Kubernetes the Hard Way:

  1. Check Prerequisites
  2. Installing the Client Tools
  3. Provisioning Compute Resources
  4. Provisioning the CA and Generating TLS Certificates
  5. Generating Kubernetes Configuration Files for Authentication
  6. Generating the Data Encryption Config and Key
  7. Bootstrapping the etcd Cluster (3x for HA)
  8. Bootstrapping the Kubernetes Control Plane (3x for HA)
  9. Bootstrapping the Kubernetes Worker Nodes
  10. Configuring kubectl for Remote Access
  11. Provisioning Pod Network Routes
  12. Deploying the DNS Cluster Add-on (+ Deploying other Add-ons)
  13. Smoke Test
  14. Cleaning Up

These steps, several of which must be repeated three times for an HA setup, are only the tip of the iceberg because they focus solely on deployment. Once the cluster is up, you have to actually manage Kubernetes, so these are not the only challenges faced by Kubernetes operators. In addition to deployment, the operator also has to take care of upgrades, failures, configuration changes, scaling, and more.
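Day-2 operations like these are usually automated as a control loop that compares the desired state of a service against what is actually running and derives the actions an operator would otherwise perform by hand. Here is a minimal sketch of that idea; all names and fields are hypothetical and this is not the API of any real operator framework:

```python
# Minimal reconcile-loop sketch: compare desired vs. observed state and
# emit the day-2 actions (scaling, upgrades, failure recovery) that an
# operator would otherwise have to perform manually.
def reconcile(desired: dict, observed: dict) -> list[str]:
    actions = []
    # Scale up or down to match the desired replica count.
    if observed["replicas"] < desired["replicas"]:
        actions.append(f"scale_up:{desired['replicas'] - observed['replicas']}")
    elif observed["replicas"] > desired["replicas"]:
        actions.append(f"scale_down:{observed['replicas'] - desired['replicas']}")
    # Rolling upgrade when the running version drifts from the spec.
    if observed["version"] != desired["version"]:
        actions.append(f"rolling_upgrade:{desired['version']}")
    # Replace members that failed their health checks.
    for member in observed.get("unhealthy", []):
        actions.append(f"replace:{member}")
    return actions

actions = reconcile(
    desired={"replicas": 3, "version": "1.29"},
    observed={"replicas": 2, "version": "1.28", "unhealthy": ["etcd-1"]},
)
print(actions)  # → ['scale_up:1', 'rolling_upgrade:1.29', 'replace:etcd-1']
```

A real operator runs this loop continuously, which is exactly what makes hand-operating many such services at once impractical.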

Container orchestration with stateless containers is an easier problem to solve than orchestrating stateful services. A failed stateless container can simply be restarted on any node, but when dealing with stateful services, restarting just one component might impact other components (for example, data may need to be reshuffled between nodes when a component moves).
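That difference boils down to a scheduling decision: a stateless task can land on any surviving node, while a stateful one is constrained to nodes that already hold its data, or forces an expensive data move. A toy illustration of that decision, with all names hypothetical:

```python
def reschedule(task: dict, failed_node: str, nodes: list[str],
               data_locations: dict[str, list[str]]) -> tuple[str, bool]:
    """Pick a replacement node; return (node, data_move_required)."""
    candidates = [n for n in nodes if n != failed_node]
    if not task["stateful"]:
        # Stateless: any surviving node will do, nothing to move.
        return candidates[0], False
    # Stateful: prefer a node that already holds a replica of the data.
    local = [n for n in candidates if n in data_locations.get(task["name"], [])]
    if local:
        return local[0], False
    # Otherwise data must be reshuffled to the new node first.
    return candidates[0], True

nodes = ["node-a", "node-b", "node-c"]
data = {"db-shard-1": ["node-a"]}  # the only replica was on the failed node
print(reschedule({"name": "web", "stateful": False}, "node-a", nodes, data))
# → ('node-b', False)  -- restart anywhere, no data move
print(reschedule({"name": "db-shard-1", "stateful": True}, "node-a", nodes, data))
# → ('node-b', True)   -- must reshuffle data to the new node
```

Real systems add replication factors, rack awareness, and anti-affinity on top, but the asymmetry between the two cases is already visible here.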

Plus, as shown by the CI/CD pipeline example, you are not working with a single service, but many services, and each has its own peculiarities in terms of deployment, failure modes, monitoring, and so on.

The operator has to completely understand the nuances of a multitude of services and be able to fix issues on the fly when a node fails.

And since we all know that nodes can only fail at 3 AM on Sunday morning…this is not a pleasant job.

Challenge 2: Resource Allocation and Utilization

Deploying multiple distributed systems natively often also results in suboptimal resource utilization as we create silos, i.e., subsets of CPU, memory, disk, and network resources dedicated for each service.

Each silo for these services is typically over-provisioned to account for the maximum load (e.g., on Mondays between 8 and 9 AM, when everyone is logging on, you might need 10 instances of a particular service, while for the rest of the week two instances would be sufficient).
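The cost of that over-provisioning is easy to quantify. Using the hypothetical numbers above (10 instances needed for one peak hour per week, 2 the rest of the time):

```python
PEAK_INSTANCES = 10      # Mondays, 8-9 AM
BASE_INSTANCES = 2       # sufficient for the rest of the week
HOURS_PER_WEEK = 7 * 24  # 168

# A static silo must be sized for the peak at all times.
provisioned = PEAK_INSTANCES * HOURS_PER_WEEK
# Actual demand: one peak hour, base load for the remaining 167 hours.
used = PEAK_INSTANCES * 1 + BASE_INSTANCES * (HOURS_PER_WEEK - 1)
utilization = used / provisioned
print(f"{utilization:.1%} of provisioned instance-hours actually used")
# → 20.5% of provisioned instance-hours actually used
```

Roughly four out of five provisioned instance-hours sit idle in this scenario, which is the waste that pooling resources across services is meant to recover.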

Challenge 3: Multi-Tenancy

Deploying and operating so many services is already challenging enough, but what happens if you add multiple tenants, each with their own requirements and all isolated with security controls?

Imagine multiple lines of business, each requiring a Kubernetes cluster and additional accompanying services as shown below:

This raises a number of questions, such as, “How can we make sure one tenant cannot take over all resources in the cluster and impact the performance of the other tenants?” or “How do we make sure each tenant can only access their respective services, including metrics, logs, and other metadata?”
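The resource side of the first question is typically answered with per-tenant quotas (in Kubernetes, for instance, a ResourceQuota per namespace). The underlying admission check can be sketched as follows; the names are illustrative, not the real Kubernetes implementation:

```python
def admit(request: dict, quota: dict, usage: dict) -> bool:
    """Reject a tenant's request if it would exceed any quota dimension."""
    for resource, limit in quota.items():
        if usage.get(resource, 0) + request.get(resource, 0) > limit:
            return False
    return True

# Hypothetical quota and current consumption for one tenant.
tenant_a_quota = {"cpu": 10, "memory_gib": 20}
tenant_a_usage = {"cpu": 9, "memory_gib": 12}

print(admit({"cpu": 1, "memory_gib": 2}, tenant_a_quota, tenant_a_usage))  # → True
print(admit({"cpu": 2, "memory_gib": 2}, tenant_a_quota, tenant_a_usage))  # → False
```

The isolation side of the second question needs more than quotas: authentication, per-tenant authorization on the APIs, and scoped access to metrics and logs.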

Challenge 4: Infrastructure

Many of the popular services are offered by cloud providers as proprietary managed services, but most IT leaders can’t stomach being tied to a specific vendor’s cloud API, since it introduces risk to the business. Being tied to one cloud provider can prevent moving to another cloud or to an on-premise datacenter. You might not even be able to move all of your data to the cloud due to data privacy and governance requirements.

Further complicating the matter, many organizations need to have services that can run across different datacenters (or clouds). Imagine, for example, running the base workload on-prem, but having the need to add burst compute resources in the cloud. Yet another example of this issue could be a shared setup across multiple AWS regions for fault tolerance.
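A burst setup like the one described reduces to a placement decision: fill the on-prem base capacity first and overflow the remainder to the cloud. A minimal sketch, with all numbers hypothetical:

```python
def place(demand: int, onprem_capacity: int) -> dict:
    """Split instance demand between on-prem base capacity and cloud burst."""
    onprem = min(demand, onprem_capacity)
    return {"onprem": onprem, "cloud_burst": demand - onprem}

print(place(8, 10))   # fits on-prem → {'onprem': 8, 'cloud_burst': 0}
print(place(14, 10))  # overflow    → {'onprem': 10, 'cloud_burst': 4}
```

The hard part in practice is not this arithmetic but making the service itself portable enough that the burst instances can actually run, unchanged, on the other infrastructure.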

Download the eBook “How to Evolve From Container Orchestration to Service Orchestration” to learn how to develop and deploy your services independent of the underlying infrastructure, regardless of what it is and where it resides.