This is a pretty interesting trend – as lots of processing becomes compute-bottlenecked by shitty queries, an on-demand service like Dataproc gets really appealing. The trends outlined in the post involve:
resource isolation (so one query can’t kill everyone)
better auditing & monitoring (so you know who to yell at)
and more flexibility (so a select few can play around with few consequences)
TL;DR: Let's trade off some performance across the board to better handle “lots of people writing lots of shitty queries”.
Also Ryan Noon sighting. Always asking the tough questions.
A few months ago, Uber released a blog post on their new processing framework, Hoodie. They’ve probably been working on this idea for a while, building on paradigms from Google’s Dataflow. The crux of the post is:
Hot new data processing architectures such as Lambda and Kappa have good ideas, but very few consumers of data actually need the real-time (<5 min) latency they provide. Therefore, Lambda and Kappa are a waste of time if we can just get batch upserts (a portmanteau of update and insert) that run every 5 minutes.
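To make “upsert” concrete, here’s a toy Python sketch – nothing Hoodie-specific (Hoodie works over columnar files on HDFS, not Python dicts), and the ids and fields are invented:

```python
# Toy batch upsert: merge a batch of new or changed records into an
# existing table keyed by a primary key. All data here is invented.

def upsert(table, batch, key="id"):
    """Insert rows whose key is new, overwrite rows whose key exists."""
    merged = {row[key]: row for row in table}  # index existing rows by key
    for row in batch:
        merged[row[key]] = row  # update-or-insert in a single step
    return list(merged.values())

existing = [{"id": 1, "fare": 10.0}, {"id": 2, "fare": 7.5}]
batch = [
    {"id": 2, "fare": 8.0},   # update: fare for trip 2 was corrected
    {"id": 3, "fare": 12.0},  # insert: a brand-new trip
]
merged_table = upsert(existing, batch)
print(merged_table)
```

The point is that one operation covers both cases, so a 5-minute batch can absorb late corrections as easily as new arrivals.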
This is more or less a continuation of the thought process over the last 15 years:
2004 – Batch (Batch) -> Data is cheap, compute is cheap. Let’s put everything in one big pile (Data Warehouse) and do all the processing at once!
201? – Stream (Stream/Micro-batch) -> Batch is slow AF and I don’t care about schemas or joins for many use cases. I don’t have to pay for storage because I can aggregate data on the fly and put the rest in cold storage.
2013 – Lambda (Stream + Batch) -> We want to keep the fast stuff. However, we also want to make raw data join-able with the rest of our data warehouse, and for a long time. Why don’t we just process everything twice and get the benefits of both (one fast, one slow)!
2014 – Kappa (Stream only) -> Dual processing is unnecessarily complex. We can use all fast processing and separate consumption between fast and slow for different use cases.
2017 – “Hoodie” (Near-time/Mini batch) -> Dual consumption is unnecessarily complex. We can process a bit slower (5 min instead of real time) and make read access fast for everyone! Our data users don’t need real time anyway. Keeping logs of everything allows us to align streams.
But for workloads that can tolerate latencies of about 10 minutes, there is no need for a separate “speed” serving layer if there is a faster way to ingest and prepare data in HDFS. This unifies the serving layer and reduces the overall complexity and resource usage significantly.
So here we are – I imagine a lot of data infrastructures are likely to move to this mini-batch processing in the next few years. The simplicity is too compelling (if it works well). Analysts and basic data science simply don’t need real-time processing, and this is a very user-driven approach to data infrastructure tech.
There are times when real-time info can help decide when to flip a switch – local Uber operators may want to turn off surge pricing immediately when a catastrophe causes a surge spike. That being said, the changes you’d make based on analytics data usually require code pushes or changes in policy, both of which take some time.
There exist system monitoring and production services that can leverage Lambda/Kappa and happen to use similar technology, such as Hadoop, Kafka, Storm/Samza/Spark Streaming/Flume, etc. However, these services have vastly different requirements and tradeoffs from analytics data and cannot be lumped together.
The real killer feature here is incremental transforms, enabled by mini-batch upserts. If Uber actually got this right across complex DAGs, then the promised savings on re-processing are real.
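To see where those savings come from, here’s a toy sketch (the trip/city schema is invented, and real systems do this over files and partitions, not dicts): instead of recomputing an aggregate over the full table, fold only the upserted records into the previous result.

```python
# Toy incremental transform (schema invented): apply only the delta
# from the upsert log instead of recomputing the whole aggregate.

def full_recompute(trips):
    """Baseline: total fares per city over the entire table."""
    totals = {}
    for t in trips:
        totals[t["city"]] = totals.get(t["city"], 0.0) + t["fare"]
    return totals

def incremental_update(totals, changes):
    """changes: (old_row_or_None, new_row) pairs from the upsert log."""
    out = dict(totals)
    for old, new in changes:
        if old is not None:
            out[old["city"]] -= old["fare"]  # retract the old contribution
        out[new["city"]] = out.get(new["city"], 0.0) + new["fare"]
    return out

trips = [{"city": "SF", "fare": 10.0}, {"city": "NY", "fare": 7.5}]
totals = full_recompute(trips)
changes = [
    ({"city": "NY", "fare": 7.5}, {"city": "NY", "fare": 8.0}),  # update
    (None, {"city": "SF", "fare": 12.0}),                        # insert
]
updated = incremental_update(totals, changes)
print(updated)
```

The incremental result matches a full recompute over the merged table, but the work done is proportional to the size of the change set, not the table – which is the whole pitch when that property holds across a DAG of downstream transforms.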
With more constant processing, will there be locking issues if most of the time is spent doing reads and writes?
Data can be cheaply reprocessed, but how do you maintain trust with the user that data will not change under their feet every 15 minutes? Since you can never know if data is lagging, is there a way to help the user trace back sources of slow data based on limited information?
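One possible answer – my sketch, not anything from the Hoodie post, with table names and the dependency graph invented – is watermarks: every raw source records the commit time of the newest data it reflects, and a derived table is only as fresh as its stalest input, so lag can be traced back to a source.

```python
# Sketch of freshness tracing via watermarks. Table names and the
# dependency graph below are invented for illustration.
from datetime import datetime

watermarks = {
    "raw_trips":   datetime(2017, 5, 1, 12, 0),
    "raw_drivers": datetime(2017, 5, 1, 9, 0),   # this source is lagging
}
deps = {
    "trip_stats":      ["raw_trips"],
    "driver_earnings": ["raw_trips", "raw_drivers"],
}

def stalest_source(table):
    """Walk the DAG upward; return (source, watermark) of the laggard."""
    if table in watermarks:          # a raw source is its own watermark
        return table, watermarks[table]
    return min((stalest_source(d) for d in deps[table]), key=lambda p: p[1])

laggard, as_of = stalest_source("driver_earnings")
print(laggard, as_of)
```

That at least turns “is my data lagging?” into “which upstream source is holding it back, and as of when?” – limited information, but actionable.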
What is the future?
Can mini-batch processing get good enough to actually power production systems?
Maybe some of these questions are answered in other papers, so would love to hear your comments!
A common fear of engineers in the data space is that, regardless of the job description or recruiting hype you produce, you are secretly searching for an ETL engineer.
In case you did not realize it, nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato. It really shouldn’t come as a surprise then that ETL engineering roles are the archetypal breeding ground of mediocrity.
Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
Magnusson has a pretty good insight on how relationships between data scientists and engineers have evolved in the past few years at Stitch Fix (& similar companies). I agree with most of his points. That being said…
The biggest risk of the staffing model he proposes is here:
This is one reason why ETL and API / production algorithm development is typically handed off to an engineer in assembly line style. But, all of those tasks are inherently vertically (point) focused. Talented engineers in the data space are almost always best focused on horizontal applications.
Holy cow, I don’t think this could be any more wrong (the author either intentionally underplayed this or is being naive). A significant part of data transformation involves building data tables that are used by more than just you or your team. If your transforms are useful, other teams will want to access your transformed data. Now you are no longer building for the vertical – you are building for everyone. Changes you make to your ETL can have unintended consequences downstream in your DAG.
The alternative is arguably worse. When other teams access your vertical without using your transforms, you get ETL redundancy, conflicting results (from minute differences in business logic), and a system that becomes increasingly difficult to maintain. There is no happy nirvana. Shuffling incentives doesn’t remove all the shitty ETL work – but the burden becomes much more tolerable when spread across several teams.
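Those “minute differences in business logic” are easy to illustrate (all data and definitions below are invented): two teams answer the same question over the same raw rows and report conflicting numbers.

```python
# Two teams each define "active rider" slightly differently over the
# same raw data. Rows and definitions are invented for illustration.

rides = [
    {"rider": "a", "completed": True,  "fare": 0.0},   # free promo ride
    {"rider": "b", "completed": True,  "fare": 9.0},
    {"rider": "c", "completed": False, "fare": 0.0},   # cancelled
]

# Team 1's ETL: any completed ride makes a rider active.
active_v1 = {r["rider"] for r in rides if r["completed"]}

# Team 2's ETL: only paid, completed rides count.
active_v2 = {r["rider"] for r in rides if r["completed"] and r["fare"] > 0}

print(len(active_v1), len(active_v2))  # same question, two answers
```

Neither team is wrong, but once both numbers land in a dashboard, someone spends an afternoon reconciling them – which is the maintenance cost of redundant ETL in miniature.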
Aside from this, I think this 2016 article is super solid and I will be referencing it in the future.