Apache Mesos – the next big thing?

Paco Nathan educated a standing-room-only meet-up on Apache Mesos this week at “Deep Dive on Apache Mesos”. It was easily one of the best meet-ups of the year so far. The presentation can be found here.

My general conclusion is that Mesos will become an integral part of the big data stack. It provides the elusive hat trick of allowing you to focus on what you want to do instead of how to architect it, making your admin’s life easier, and cutting your overall costs. I can only imagine that in a few years nearly everyone with more than a handful of servers or instances running a series of unrelated scheduled jobs (which means practically everyone dealing with big data) will leverage Mesos – whether in the cloud or on bare metal.

One of the best explanations of the logic behind Mesos can be found on slide 19, “What are the costs of Single Tenancy?”. The idea is very simple, yet powerful. The slide shows three sample loads – Rails CPU load, Hadoop CPU load, and Memcached CPU load – and their respective utilization rates over time. In this example, as in most real-life cases, the loads peak at different times. Given the varying demands on capacity, it makes sense to utilize the excess during each workload’s off-peak hours. By superimposing the load patterns you can easily see how much of that excess capacity can be reclaimed. Mesos focuses on solving this excess capacity problem.
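To make the arithmetic concrete, here is a small Python sketch with entirely made-up hourly numbers, loosely mirroring the slide’s three workloads. It compares the capacity you would have to provision under single tenancy (each workload gets its own cluster sized for its own peak) with a shared pool sized for the peak of the superimposed load – the Mesos model.

```python
# Illustrative only: invented hourly CPU demand (in cores) for three workloads
# whose peaks fall at different times of day.
rails     = [30, 20, 15, 10, 10, 15, 40, 70, 90, 95, 90, 85,
             80, 85, 90, 95, 90, 85, 80, 70, 60, 50, 40, 35]
hadoop    = [90, 95, 95, 90, 80, 60, 30, 20, 15, 10, 10, 10,
             10, 10, 10, 10, 15, 20, 25, 30, 40, 60, 80, 90]
memcached = [20, 15, 10, 10, 10, 15, 30, 50, 60, 65, 60, 55,
             55, 60, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20]

# Single tenancy: each workload runs on its own cluster sized for its own peak.
single_tenant_capacity = max(rails) + max(hadoop) + max(memcached)

# Shared pool: one cluster sized for the peak of the superimposed load.
combined = [r + h + m for r, h, m in zip(rails, hadoop, memcached)]
shared_capacity = max(combined)

avg_demand = sum(combined) / len(combined)
print(f"single-tenant capacity: {single_tenant_capacity} cores")
print(f"shared-pool capacity:   {shared_capacity} cores")
print(f"avg utilization, single-tenant: {avg_demand / single_tenant_capacity:.0%}")
print(f"avg utilization, shared pool:   {avg_demand / shared_capacity:.0%}")
```

With these made-up curves the shared pool needs roughly a third fewer cores than three dedicated clusters, and its average utilization is correspondingly higher – the same effect the slide illustrates visually.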

I have been involved in a variety of big data projects, and the waste is tremendous. You will likely have a batch job that runs periodically, let’s say at 1 a.m., after collecting the daily data. This job is crucial for longer-term analysis but not so important for intra-day decision making. During the day, your data ingestion will likely peak, then slow dramatically at night. During ingestion, some sort of real-time analysis will likely take place as well – and because of periodic spikes in data flow, you always want to leave some excess capacity just in case. The situation this creates is tremendously wasteful: excess capacity just sitting there.

Capacity utilization is a key metric in many industries, and in some it determines which companies are profitable and which are not. Heavy industry is the most obvious example, but many others, like airlines, education, and healthcare, are focused on capacity as well. Years ago, I analyzed an education company and was frankly surprised at just how important “filling the seats” was to the profit margin. I do not think we have reached this stage yet for big data, as the growth curve is still very steep, but the logic is the same. If your competitor can successfully cut costs by doubling their average utilization rate, you are going to be in trouble.

The issue of capacity utilization and big data architecture will surely become more of a problem in the very near future. Hadoop continues to grow but is increasingly being used in conjunction with multiple other tools. Cases in point are Spark and Storm. As the need for real (or near-real) time analysis increases, MapReduce will be sidelined (still used for large batch jobs but not for anything to do with what is happening now) and there will be a migration to other tools, like Storm. And as machine learning and real-time queries become more popular, Spark will gain more traction. Under these conditions, with an increasingly diverse architecture, Mesos will grow in popularity as well.
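To give a feel for how such a diverse architecture shares one pool, here is a toy Python model of Mesos-style two-level scheduling: a “master” offers free resources to registered frameworks, and each framework decides for itself what to launch. This is not the real Mesos API – the framework names, task sizes, and single-offer loop are all invented for illustration.

```python
# Toy model of offer-based, two-level scheduling: one shared pool instead of a
# dedicated cluster per framework. All names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Offer:
    cpus: float
    mem_gb: float

class Framework:
    def __init__(self, name, task_cpus, task_mem_gb, tasks_wanted):
        self.name = name
        self.task_cpus = task_cpus
        self.task_mem_gb = task_mem_gb
        self.tasks_wanted = tasks_wanted

    def consider(self, offer: Offer) -> int:
        """Accept as many tasks as the offer and remaining demand allow."""
        fit = min(int(offer.cpus // self.task_cpus),
                  int(offer.mem_gb // self.task_mem_gb),
                  self.tasks_wanted)
        self.tasks_wanted -= fit
        return fit

# One shared pool of resources.
free_cpus, free_mem = 64.0, 256.0
frameworks = [
    Framework("storm-topology", task_cpus=2, task_mem_gb=4,  tasks_wanted=10),
    Framework("spark-job",      task_cpus=4, task_mem_gb=16, tasks_wanted=4),
    Framework("nightly-batch",  task_cpus=1, task_mem_gb=2,  tasks_wanted=20),
]

# The "master" offers whatever is still free to each framework in turn.
for fw in frameworks:
    launched = fw.consider(Offer(free_cpus, free_mem))
    free_cpus -= launched * fw.task_cpus
    free_mem  -= launched * fw.task_mem_gb
    print(f"{fw.name}: launched {launched} tasks, "
          f"{free_cpus:g} cpus / {free_mem:g} GB left in the pool")
```

The point of the sketch is the shape of the interaction, not the numbers: the resource manager does not need to understand Storm, Spark, or a batch scheduler – it only hands out offers, which is what lets heterogeneous frameworks coexist on the same machines.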

Another interesting point that Paco made in the question-and-answer portion of the talk was that it is not simply a matter of server costs but also personnel costs. Mesos makes administration and scheduling much easier and allows for increased operational leverage, in that your ratio of engineers to devops can be improved. In other words, more money can be funneled towards accelerating your development cycle.