Few months ago, Uber released a blog post on their new processing framework, Hoodie. They’ve probably been working on this idea for a while, based on paradigms from google’s Dataflow. The crux of the post is:
Hot new data processing frameworks such as Lambda and Kappa architecture have good ideas, but very few consumers of data actually need the real time (<5 min) latency they provide. Therefore, lambda and kappa are a waste of time if we can just get batch upserts (portmanteau of update & insert) to update every 5 minutes.
This is more or less a continuation of the thought process over the last 15 years:
2004 – Batch (Batch) -> Data is cheap, compute is cheap. Lets put everything in one big pile (Data Warehouse) and do all the processing at once!
201? – Stream (Stream/Micro-batch) -> Batch is slow AF and I don’t care about schemas or joins for many use cases. I don’t have to pay for storage because I can aggregate data on the fly and put the rest in cold storage.
2013 – Lambda (Stream + Batch) -> We want to keep the fast stuff. However we also want to make raw data join-able with the rest of our data warehouse, and for long time. Why don’t we just process everything twice and get the benefits of both! (one fast, one slow)
2014 – Kappa (Stream only) -> Dual processing is unnecessarily complex. We can use all fast processing and separate consumption between fast and slow for different use cases.
2017 – “Hoodie” (Near-time/Mini batch) -> Dual consumption is unnecessarily complex. We can process a bit slower (5 min instead of real time) and make read access fast for everyone! Our data users don’t need real time anyway. Keeping logs of everything allows us to align streams.
But for workloads that can tolerate latencies of about 10 minutes, there is no need for a separate “speed” serving layer if there is a faster way to ingest and prepare data in HDFS. This unifies the serving layer and reduces the overall complexity and resource usage significantly.
So here we are – I imagine a lot of data infrastructures are likely to move to this mini batch processing in the next few years. The simplicity is too compelling (if it works well). Analysts and basic data science simply don’t need real-time processing, and this is very user-driven approach to data infrastructure tech.
- There are times when real-time info can help decide when to flip a switch – local Uber operators may want to turn off surge immediately based on surge spikes due to catastrophe. That being said, the changes you’d make based off analytics data usually require code pushes or changes in policy, both of which take some time.
- There exist system monitoring and production services that can leverage Lambda/Kappa and happen to use similar technology, such as Hadoop, Kafka, Storm/Samza/Spark Streaming/Flume, etc. However, these services have vastly different requirements and tradeoffs from analytics data and cannot be lumped together.
- The real killer feature here is incremental transforms because of its ability to do mini batch upserts. If Uber actually got this right across complex DAGs, this the promised savings on re-processing are real.
- With more constant processing, will there be locking issues if most of the time is doing read/writes?
- Data can be cheaply reprocessed, but how do you maintain trust with the user that data will not change under their feet every 15 minutes? Since you can never know if data is lagging, is there a way to help the user trace back sources of slow data based on limited information?
- What is the future?
- Can mini-batch processing get good enough to actually power production systems?
Maybe some of these questions are answered in other papers, so would love to hear your comments!