The Datavore’s Dilemma

In the workplace, analysts and decision makers are often seen as data consumers.

A data consumer doesn’t necessarily need to know where their data comes from to digest it. But knowing where data is grown, the processing it goes through, and how it is distributed can make one much more effective as an analyst and decision maker.

In a workplace, this responsibility does not fall completely on the data consumer. The producer, the processor, and the distributor all have an obligation to provide a reliable product for their communities. We cannot complain that consumers “don’t understand distribution” when apples are bad – good systems and ingenuity are required for good product, produce, and pies to be delivered reliably.

The Data Chef

An apple goes from seed to fruit, from picking to distribution, from quality control to storefront, from kitchen to pie before it ends up in one’s belly.

The analyst is the chef. Analysts are in charge of obtaining the fruit and making the pie [chart]. The best chefs can make do with what they’re given, but the best chefs also care about sourcing the highest-quality ingredients.

Early on, it is easy to blame bad pies on bad apples (these were the apples I was given!). Soon, the analyst learns that they have power to choose the good from the bad, and select the fruit that satisfies their recipe. If the good apple does not exist, they find proxies in peaches and pears.

This is the worst part of being an analyst – the less you trust your distributor, the more you have to vet by yourself. The less consistently the distributor can provide you with product that meets the bar, the more tradeoffs you will have to make. “We can’t have apple pie on Thursday because we haven’t been getting good quality apples”. “We can’t have trustworthy dashboards on time because we’re not sure if the data is good”.


The easiest way to figure out what’s wrong with your data is to ask: “How did it get here?”

My nearby distributor of expensive non-organic fruit is Safeway. Safeway is constantly pushing the envelope for more perfect, shiny apples, pushing up the price in the process. Some want the perfection, while others prefer not to pay a premium for the processing that preserves an apple’s longevity and shine.

A cheaper option involves low-QA distributors. Asian markets often provide produce for extremely low prices. The tradeoff is that every other apple is rotten, over-ripe, or simply inedible. Quality tolerance and picking are then left to the consumer.

The cheapest option is going straight to the producer. By skipping distribution, you can get the freshest fruit, picked recently from the tree. However, only the most committed consumers can afford to make these tradeoffs. It takes time to hunt down farms that produce Fuji apples instead of Red Delicious. Precious time spent sourcing takes away from cooking or doing things that supplement your desired result (presentation, service, etc.)

A tangential option is the farmers market. Here, small-scale distributors put love and care into their production, processing, and distribution. You can learn directly from the farmer what pesticides were used, how they know which fruit are good, and get seasonal recommendations from experts.

New companies are increasingly less willing to pay for the high overhead of Safeway-level distribution and QA. Why use expensive software, proprietary data architectures, and BI overhead when open source distributed big data software can give you more data, faster latencies, and bring people closer to the data?

Why not go closer to the producer? Well, if you go to the Asian market and pick apples at random, you’ll be in a world of stomachache. If you naively think you can go to the Fuji apple farm at any time, you may embark on a long drive only to find out they are out of season. Just like the challenges in buying farm-direct above, we’re finding out that data closer to the source is not that simple.

Data in Tech

Let’s look at how data flows in a new-ish software company (late 2000s onwards).

Product engineers supply the data via tracking calls scattered throughout code. Data engineers provide the pipelines that guarantee that apples make it to the stores while still fresh, unbruised by transport, and not forgotten at the docks. Centralized BI/Analytics/Data Science teams become the distributor-cooks. They make sure that the storefronts are pretty & well organized, and also prepare pies for those who can’t or don’t have time to bake their own data. Specialized data scientists and analysts flirt with gastronomy at the cutting edge. Decision makers eat the pie to gain access to the secrets of the world.
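As a sketch of the first link in that chain, a product tracking call is typically a one-liner that serializes an event and ships it off to a collector. The event name, property schema, and `track` helper below are hypothetical, for illustration only – not the API of any particular analytics library:

```python
# Hypothetical tracking call, as a product engineer might scatter through code.
# Event names and properties are assumptions, not a real product's schema.
import json
import time

def track(event: str, properties: dict) -> str:
    """Serialize an analytics event; a real client would send this payload
    over the network to a collection endpoint instead of returning it."""
    payload = {
        "event": event,
        "properties": properties,
        "timestamp": time.time(),
    }
    return json.dumps(payload)

# Fired when a user completes checkout:
msg = track("checkout_completed", {"user_id": 42, "cart_size": 3})
```

Every downstream step – pipelines, BI, dashboards – inherits whatever these scattered calls happen to record, which is exactly why the quality question below matters.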

There is a constant push and pull over where the burden of quality lies. Should the transporters know the requirements of the distributors and serve them? Or is their goal simply to be reliable, so distributors can tune QA to their own needs? On one side, we don’t want data pipelines to aggressively throw away over-ripe apples that might be bad for baking but great for apple sauce. However, independent distributors might not have the machinery to efficiently filter data at scale. If experts in charge of pipelines allow others access to their machines, how can they be sure people are using them correctly? Do BI teams have to start hiring data engineers to maintain these machines that are being pushed downstream?
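One way out of the over-ripe-apple problem is for pipelines to flag questionable records rather than filter them, leaving the quality bar to each downstream consumer. A minimal sketch, with made-up field names and checks:

```python
# Hypothetical quality flagging: annotate records instead of discarding them,
# so each downstream "distributor" can apply its own quality bar.
def flag_quality(records):
    flagged = []
    for r in records:
        issues = []
        if r.get("value") is None:
            issues.append("missing_value")
        if r.get("timestamp", 0) <= 0:
            issues.append("bad_timestamp")
        flagged.append({**r, "quality_issues": issues})
    return flagged

rows = [
    {"id": 1, "value": 10, "timestamp": 1700000000},
    {"id": 2, "value": None, "timestamp": 1700000100},  # over-ripe, but fine for apple sauce
]
out = flag_quality(rows)

# A dashboard can keep only pristine rows; exploratory work can keep everything.
clean = [r for r in out if not r["quality_issues"]]
```

The design choice here is that the pipeline records its judgment but never destroys data – the baker and the sauce-maker filter differently from the same shipment.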

The old world was inefficient, bulky, and did not take advantage of the technologies we have today. However, stripping away all that structure reveals ambiguous responsibilities and a new class of organizational challenges.

I don’t know what’s best for each company. I do know, however, that we can be smart and structure teams and technology based on company needs and existing experience. I do know that when datavores start to learn where their food comes from, they can gain an appreciation and a critical eye for what they consume.
