Nobody enjoys writing ETL (Stitch Fix)

What is an ETL pipeline anyway?

A common fear of engineers in the data space is that, regardless of the job description or recruiting hype you produce, you are secretly searching for an ETL engineer.

In case you did not realize it, Nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato. It really shouldn’t come as a surprise then that ETL engineering roles are the archetypal breeding ground of mediocrity.

Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.

http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/

———————

Magnusson has pretty good insight into how relationships between data scientists and engineers have evolved over the past few years at Stitch Fix (and similar companies). I agree with most of his points. That being said…

The biggest risk of the staffing model he proposes is here:

This is one reason why ETL and API / production algorithm development is typically handed off to an engineer in assembly line style. But, all of those tasks are inherently vertically (point) focused. Talented engineers in the data space are almost always best focused on horizontal applications.

Holy cow, I don’t think this could be any more wrong (the author is either intentionally underplaying this or being naive). A significant part of data transformation involves building tables that are used by more than just you or your team. If your transforms are useful, other teams will want to access your transformed data. Now you are no longer building for the vertical; you are building for everyone. Changes you make to your ETL can have unintended consequences further down your DAG.
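
To make the downstream risk concrete, here is a minimal sketch of how you might list every table touched by a change to one transform. The table names and the DOWNSTREAM map are made up for illustration; nothing here comes from the Stitch Fix article or any particular scheduler.

```python
from collections import deque

# Hypothetical dependency map: table -> tables built directly from it.
DOWNSTREAM = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["team_a_revenue", "team_b_retention"],
    "team_a_revenue": ["exec_dashboard"],
    "team_b_retention": [],
    "exec_dashboard": [],
}


def affected_tables(changed, downstream=DOWNSTREAM):
    """Return every table reachable from `changed`, in breadth-first order."""
    seen = []
    queue = deque(downstream.get(changed, []))
    while queue:
        table = queue.popleft()
        if table in seen:
            continue  # already recorded via another path
        seen.append(table)
        queue.extend(downstream.get(table, []))
    return seen


print(affected_tables("clean_orders"))
```

In this toy graph, a change to clean_orders reaches two other teams' tables and a dashboard, none of which the person editing the transform necessarily owns.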

The alternative is arguably worse. When other teams access your vertical without using your transforms, you get redundant ETL, conflicting results (from minute differences in business logic), and a system that becomes increasingly difficult to maintain. There is no happy nirvana. Shuffling incentives doesn’t remove all the shitty ETL work. However, the burden can become much more tolerable when spread across several teams.

Aside from this, I think this 2016 article is super solid and I will be referencing it in the future.
