Q&A with Zhongbo Tian from social networking platform Douban

This is the first post in a six-part series highlighting Chinese DC/OS and Apache Mesos users who presented at MesosCon Asia in late June. MesosCon North America is coming up from September 13-15th in Los Angeles; register today.

Thank you so much for taking some time to answer questions today. Let’s start off with a description of Douban for our US readers, who might not be familiar with your platform.

Douban is an innovative social-network and media company that provides highly personalized book, movie, and music recommendations, and helps its users connect with each other about cultural lifestyle content. We have been helping urban communities discover and discuss insightful and inspirational content via web-based and mobile applications. Our registered user base has been growing steadily for years; as of June 2017, our monthly unique visitors have exceeded 300 million.

That’s a lot of users; is the need to process data from your users part of what brought you to Apache Mesos?

Partly, yes. In 2011 the Mesos paper caught our interest as we looked at building cloud infrastructure for elastic High Performance Computing (HPC) and Big Data. We tested Mesos version 0.3 intensively and adopted version 0.4 in production that year. As we built our own in-house cloud infrastructure, we found that managing resources and keeping the whole design simple are extremely important. Mesos schedules tasks fairly and automatically, and it lets us employ advanced technologies such as containers and GPUs while keeping the whole infrastructure simple.

What frameworks and tools are you using today to facilitate your HPC and Big Data processing?

Currently, we use Mesos for our Big Data tasks together with our own open source framework DPark, which has gained some popularity outside of Douban as well. For HPC and Machine Learning/Deep Learning tasks we use Message Passing Interface (MPI), TensorFlow, XGBoost, and others. GPUs have been very helpful for our intensive computations. Besides that, we use Mesos to manage our cron jobs and to run other offline tasks that support our in-house cloud platform. We use MooseFS for data storage, and InfiniBand for high performance networking.

Things are working well, and we are looking forward to expanding the whole cluster to thousands of nodes. We believe this will be an easy task considering the excellent design and code quality of Mesos, and the fact that many users have already achieved this scale according to public tech talks. We have about 5-15 engineers and operators working on the core cloud platform and on the in-house cloud ecosystem.

You must have run into some challenges in all your time running Mesos. What issues have you encountered, and have you been able to address them?

Before Mesos 1.2, we occasionally encountered the scheduler hanging. Since we upgraded to Mesos 1.2, this bug seems to be fixed. For Deep Learning tasks, we need InfiniBand to increase training speed, so we made some simple modifications to device isolation in Mesos and submitted them upstream.

Given your years of experience running Mesos in production, where would you like to see Mesos go in the future?

Mesos is now powerful and stable, and we hope the Mesos developers can keep it that way. As end users we do hope for more support for debugging and diagnosis, much easier deployment, and more progress on containerization and the Container Storage Interface. We’d also like to see more opportunities for Mesos community members in China to connect with each other. We are looking forward to more summits and meetups where the Mesos community in China can share and exchange ideas and insights.

Thanks so much for taking the time to talk with us today, and for all your feedback!

Thank you!

Want to hear from other big Apache Mesos users like Douban? MesosCon North America is coming up from September 13-15th in Los Angeles; register today.