Praekelt & DC/OS: Solving large social problems | D2iQ

Sep 12, 2018

Matt Jarvis

D2iQ

In the latest in our series exploring user stories from the DC/OS community, Mesosphere's Developer Advocate Elizabeth K. Joseph spoke with the team at Praekelt.org in South Africa. Praekelt is an African nonprofit organization dedicated to using mobile technology to solve some of the world's largest social problems, and it uses DC/OS, Mesos and Marathon as its underlying infrastructure. Praekelt's technology platforms have reached more than 100 million people in over 60 countries, driven by the belief that mobile technology can make it possible for every person on the planet to access essential lifesaving information.

Praekelt has two main projects running on DC/OS. The first is a Python-based health care application that provides pregnant women in rural Africa with maternity information. This is delivered to basic mobile phones as voice, SMS or WhatsApp messages, depending on the available infrastructure in each country. The platform can send between 3 and 5 million messages in a day, and it needs to do so reliably. Any missed message could potentially affect someone's health, and resending at that scale can be prohibitively expensive.
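To put that volume in perspective, the short calculation below converts the daily totals quoted above into an average per-second rate. It is only a back-of-the-envelope sketch: it assumes messages are spread evenly across 24 hours, which is unlikely to match real traffic, and the function name is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(messages_per_day: int) -> float:
    """Average send rate, assuming traffic is spread evenly over a day."""
    return messages_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    for daily_total in (3_000_000, 5_000_000):
        rate = average_rate_per_second(daily_total)
        print(f"{daily_total:,} messages/day is about {rate:.1f} messages/second")
```

Under that even-spread assumption the platform averages roughly 35 to 58 messages every second, around the clock. Real peaks would be higher.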

The second project is focused on youth services, providing mobile sites designed for low-end mobile phones and delivered on platforms like Free Basics, Facebook's project to deliver internet access to rural Africa. These mini sites are delivered in many different countries and languages, which was the driver towards container-based delivery, since containers allow good resource utilization across limited infrastructure. From an architecture perspective, this platform makes use of a range of open source technologies including Django, Celery, RabbitMQ and PostgreSQL.

Elizabeth spoke with Jamie Hewland, Jeremy Thurgood, and Milton Madanda about Praekelt's story and their journey with DC/OS. She started by asking what originally brought the team to Apache Mesos and DC/OS.

Praekelt's first project requirement was to run many very small websites. The original implementation of this used uWSGI Emperor, but they rapidly reached the scaling limits for this approach, and the emergence of Docker containers provided a much better deployment strategy. This naturally led to a requirement for container orchestration, and around 2015 the team evaluated both Marathon/Mesos and Kubernetes. At that time, they found that they couldn't build a test platform using Kubernetes, and so Mesos and Marathon became the first foundation for the project. The team describe their original clusters as FrankenMesos, having tried to effectively build the feature set of DC/OS themselves, and once DC/OS became available as open source, this became the standard platform for new deployments and migrations. Jeremy described DC/OS as easier to understand from an architecture perspective than Kubernetes, which can have too much detail that you need to care about.

Praekelt's challenges from a platform perspective differ from those of many organizations outside Africa. They handle a large amount of medical data, so they have strict regulatory requirements which keep them from using public cloud, and in many of the territories they operate in, public cloud is simply not available. Their particular use case is also restricted by budget, which limits their choice of hosting environment, and in some countries reliable hosting facilities are scarce or non-existent. Mesos' architecture provides for failure in the underlying components, and for Praekelt this has been key. Although their datacenter facilities in more developed countries such as South Africa are reliable, in other countries such as Uganda and Nigeria they often deal with situations where up to 60% of their virtual machine capacity may be unavailable at any one time, and with slow and unreliable links to the internet, sometimes down to double-digit kilobits of bandwidth. While they accept there are limits to the robustness and fault tolerance of any system, the team say they simply could not have built their infrastructure without Mesos' ability to work around failures.

Across both of these platforms, the team found that persistence in containers is a difficult area: lots of legacy applications assume a POSIX filesystem, and getting distributed filesystems to containers is tricky. As the platform matures, they are slowly migrating towards object storage, using S3 where available, but also looking at technologies like OpenStack Swift. Where possible, containers are kept stateless, and where state is required, Praekelt also make use of GlusterFS.

One of the other challenging areas has been load balancer setup. Managing hundreds of individual websites, all over HTTPS, together with managing certificates and integrating with the Let's Encrypt service, turns out to be a complex problem, which Jamie has described on Medium ( https://medium.com/mobileforgood/evolving-our-container-load-balancers-4e0ec9f8cb89 ). Currently the system uses Marathon-LB, and the team are exploring switching to Envoy ( https://www.envoyproxy.io/ ), which is designed from the ground up for dynamic configuration environments. They currently have a prototype discovery mechanism in their lab, integrating Marathon and Let's Encrypt, and using HashiCorp's Vault ( https://www.vaultproject.io/ ) to store certs and secrets.
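As a rough illustration of how one small website might be exposed through Marathon-LB, the sketch below builds a Marathon app definition in Python. It is not Praekelt's configuration. The helper name, app ID prefix, image, port, resource sizes and hostname are all hypothetical, and the field layout assumes a Marathon 1.5-style app definition, so check it against the version you run. The `HAPROXY_GROUP` and `HAPROXY_0_VHOST` labels are the ones Marathon-LB reads to route a virtual host to an app.

```python
import json


def site_app_definition(site_id: str, image: str, vhost: str) -> dict:
    """Build a Marathon app definition for one small containerized website.

    All concrete values here (ID prefix, port, resources) are illustrative.
    """
    return {
        "id": f"/sites/{site_id}",
        "cpus": 0.1,
        "mem": 128,
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
            # hostPort 0 asks Marathon to pick a free port on the agent.
            "portMappings": [
                {"containerPort": 8000, "hostPort": 0, "protocol": "tcp"}
            ],
        },
        "networks": [{"mode": "container/bridge"}],
        "labels": {
            # Marathon-LB instances serve apps whose group matches theirs.
            "HAPROXY_GROUP": "external",
            # Virtual host that should be routed to the first service port.
            "HAPROXY_0_VHOST": vhost,
        },
    }


if __name__ == "__main__":
    app = site_app_definition("example", "example/site:latest", "example.org")
    print(json.dumps(app, indent=2))
```

With hundreds of sites, generating definitions like this from a template keeps each site's routing label next to its deployment, but certificate issuance and renewal still have to be handled separately, which is the part the team describes as complex.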

The team has also written a series of blog posts about their use and installation of Vault.

Prometheus is used for monitoring. It is hosted outside the DC/OS cluster, but makes use of the built-in Prometheus endpoints which DC/OS provides. Praekelt run 5 different clusters, across 2 continents and 4 countries, with a total of around 50 agents. This infrastructure is managed by a team of 4 engineers, responsible for infrastructure development as well as production systems, taking advantage of the low operational overhead that DC/OS and Mesos provide.

Finally, Elizabeth asked about their wish list for DC/OS. In general the team described themselves as happy with the current feature set, although they asked for improvements to log management and for a more transparent product roadmap, to give users more clarity on what's coming down the pipeline. That feedback will get passed on to our engineering and product teams. Although not specific to DC/OS, Jeremy also described how all current orchestration systems could benefit from a more resource-oriented management approach: looking at whole applications and their interconnects rather than individual containers, and providing simplified mechanisms for defining connections inside multi-service applications.

Praekelt's story is a fantastic example of how software can change the lives and outcomes of people who may not have access to information and services otherwise, and illustrates how the fault tolerance of Apache Mesos and DC/OS can provide highly reliable infrastructure even in highly challenging environments. We look forward to hearing from the team again in the future as they continue their great work.
