Every time I think there is one model for a data strategy, I find dozens of holes to poke in it. When I attempt one a few months later, there's another litter of bullet points that renders the strategy incomplete.
I’ve come to believe that relying on a single framework is insufficient for a data strategy. However, we can mix and match good frameworks and foundational principles to effectively leverage data across an organization.
Here are several unique but overlapping frameworks that contribute to a realistic data strategy.
- Data hierarchy of needs
- Source of Truth Ownership
- Focus on the stakeholders
- Responsibility for each data layer
There are many data buzzwords these days, starting from Gartner’s famous three V’s of big data (volume, velocity, variety) in 2001, to four V’s (the three plus veracity), to thirteen V’s, to a wild west of acronyms. There’s big data, small data, red data, blue data. Data governance, streaming, data wrangling, data massaging: pick your word. However, data is more than just the operations we perform on it. It is an ecosystem with many players, volatile requirements, and many dimensions.
Four data frameworks
Data Hierarchy of Needs
The Data Hierarchy of Needs is based on Maslow’s Hierarchy of Needs, a popular visual pyramid of human needs starting from physiological needs, then safety, belonging, esteem, and finally self-actualization.
Data in an organization has a similar pyramid. You need a foundation of accurate data before any useful data work can be done. Then you can do analysis, create automation, use AI, and finally find ways to (manually or automatically) provide value with this information through access, recommendations, robotics, or anything, really. Everyone can create their own version; examples are easy to find online.
The strategic dials here are around foundation vs. impact. Building 100% for foundation risks building the wrong foundation before knowing what your customer wants, while focusing 100% on impact risks a shaky foundation that slows down decision making when you don’t know which data to trust and systems don’t work reliably.
The data hierarchy of needs is the most abstract framework, and speaks more to priorities than realities. Forgetting about the pyramid is a dangerous early mistake, but realistically many people are going to contribute to many parts of the pyramid.
Source of Truth Ownership
An accurate source of truth is important for many reasons. Consequential business decisions made on bad data can have negative impact. Data discrepancies can stall decision making – one team’s data says X, and another says Y, each resulting in a different stance. Showing shareholders wrong data comes with social and financial penalties.
A source of truth owner provides the following:
- Clear lines of reporting authority for high-level reports.
- Clear lines of data authority on cross-domain projects, i.e. if Sales wants to use Marketing data.
- Outreach and documentation on important datasets.
In my opinion, the ideal state lies somewhere between decentralized ownership and centralized data ownership.
In decentralized ownership, each domain (Sales, Marketing, Operations, Business Development, Support, Account Management, Product, and so on) owns its own data from end to end. Since raw and transformed data are a reflection of software and processes, domain experts are best suited to understand and react to changes happening in their domain.
Centralized ownership dictates that a central business intelligence or analytics team is responsible for data quality. This exists when decentralized teams don’t have the data expertise to understand the implications of data changes or adapt ETLs/reports to changing business processes. In addition, heavily cross-functional datasets are best suited for a central team to manage. The downside of centralized teams is that they will struggle to know and stay up to date on everything across dozens of business lines and specialized domains; gaps in oversight can form and grow.
A viable compromise is a centralized team that maintains a set of “golden tables”. These golden tables are vetted and reflect an accurate source of truth for a company’s most important metrics, such as financial metrics and key operational metrics, and define the borders between business lines. To handle data outside these golden tables, the central team trains key members in each domain on how to “own” their data. “Owning” the data includes populating data dictionaries, answering questions, and modifying pipelines/reports to reflect business changes.
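As a concrete sketch of what “vetting” a golden table might look like in practice, here is a minimal Python example. The table names, figures, and tolerance are all hypothetical illustrations, not from the original: a central team reconciles a golden revenue table against a domain team’s numbers, and checks that every column is documented in a data dictionary.

```python
# Illustrative sketch of golden-table vetting (all names/values hypothetical):
# 1) reconcile golden figures against the owning domain's figures, and
# 2) flag columns missing from the data dictionary.

GOLDEN_REVENUE = {"2024-01": 1_200_000, "2024-02": 1_350_000}
SALES_DOMAIN_REVENUE = {"2024-01": 1_199_500, "2024-02": 1_350_000}

DATA_DICTIONARY = {
    "monthly_revenue": {
        "month": "Calendar month (UTC)",
        "revenue": "Recognized revenue in USD",
    },
}

def reconcile(golden, domain, tolerance=0.01):
    """Return months where golden and domain figures diverge by more
    than `tolerance` (as a fraction of the golden figure)."""
    return [m for m in golden
            if abs(golden[m] - domain.get(m, 0)) > tolerance * golden[m]]

def undocumented_columns(table, columns):
    """Return columns that have no entry in the data dictionary."""
    documented = DATA_DICTIONARY.get(table, {})
    return [c for c in columns if c not in documented]

discrepancies = reconcile(GOLDEN_REVENUE, SALES_DOMAIN_REVENUE)
missing_docs = undocumented_columns("monthly_revenue",
                                    ["month", "revenue", "region"])
```

In this sketch the January figures differ by only 500 on 1.2M, within the 1% tolerance, so nothing is flagged, while the undocumented `region` column would be sent back to the domain owner to document.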
Focus on the Stakeholders
Focus on internal stakeholders is the data strategy version of “eyes on the prize”.
Stakeholders here are defined as the people who use the data. Execs need highly accurate summary data about the organization. Sales teams need a variety of data for sales pitches, but also accurate reports for dynamic compensation and operational efficiency. Accounting is accountable to the board/shareholders and the government, while many other teams have business targets they need to hit and the data to track progress and make day-to-day decisions.
This model is focused on KPIs. This model is focused on impact. This model is focused not on data for data’s sake, but data for the company to succeed.
In this model, a central data management and governance team primarily ends up serving executive decision making. Individual C or VP level execs need answers to run their individual organizations as well, so they do whatever they can to get those answers.
This can be a model heavily based on trust and relationship-building. Key organization-wide decision makers are often less versed in the nitty gritty of data, but have high level needs. These decision-makers place their trust in proxies that serve as their trusted data experts. When the highly-trusted experts go away, there is risk of a rough transition period.
That being said, a purely stakeholder-based data strategy is the most likely to lead to data silos, since the stakeholders are often willing to spend budget to “get the job done at any cost”.
Responsibility for each data layer
One popular way to treat data is in layers in which people are responsible for their own part of the stack. This is the modern data platform team. It goes something like this:
- The data infrastructure/platform team is in charge of making sure data gets from A to B in a reliable manner.
- The business intelligence team makes sure that data modeling is correct, optimized, and accessible on top of data infrastructure.
- The analysts, operators, and data scientists build reports, analyses, models and dashboards that are correct and useful based on BI’s data tables.
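The three layers above imply an escalation path: when something breaks, the issue should be routed to the layer that owns that class of failure. A minimal sketch of that routing, with layer names and symptom keywords as illustrative assumptions:

```python
# Hypothetical sketch of the layered-responsibility model: each layer
# owns a class of failures, and an issue is routed to the lowest layer
# whose responsibilities match the reported symptoms.

LAYER_OWNERS = [
    # (layer, responsibility keywords) — illustrative, not exhaustive
    ("data platform / infrastructure", {"pipeline", "ingestion", "latency"}),
    ("business intelligence", {"modeling", "schema", "metric definition"}),
    ("analytics / data science", {"dashboard", "report", "analysis"}),
]

def route_issue(symptoms):
    """Return the first (lowest) layer whose responsibilities overlap
    the reported symptoms, or None if no layer clearly owns it."""
    for layer, keywords in LAYER_OWNERS:
        if keywords & symptoms:
            return layer
    return None
```

For example, a stuck ingestion job routes to the platform team, a wrong number on a dashboard routes to analytics, and an issue matching no keywords is exactly the triaging gap described below.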
The system seems to work great when the platform, rules and documentation are well defined for each of these layers. Rules of engagement are clear, and the platform systems feel more resilient to employee churn. For engineers, building a platform is much more rewarding than doing ad hoc requests (so it is easier to hire and retain employees).
This is a model that believes that “as long as the tools are good, people will figure it out”. This is a model that limits the number of ad hoc requests for platform teams. Ultimately I believe that this is a model that, by itself, underestimates the breadth of analytical use cases and problems in a company.
When does a pure layer-based strategy break down?
- Ownership and escalation becomes a challenge. When data in an end-user dashboard is wrong, poor triaging can lead to wasted time.
- Context can be lost in translation when pushed down. Data infrastructure ultimately needs both business and technical reasoning to fix something, but it is difficult for non-technical business owners to communicate with technical platform owners.
- When the platform isn’t comprehensive, custom jobs need to be built or maintained for specific use cases.
- Inflexibility in the platform leads to teams choosing their own platforms and tooling that serves their needs.
- Incentives can be misaligned. When the business needs something that slows the data down significantly, who makes the call?
The critical piece to this working is a strong, centralized data team that can gather data requests around the company and triage them effectively. Otherwise, a lot of the strategy breaks down due to the above. Anytime a strategy requires a critical piece to work, it becomes a framework rather than a “company strategy”.
For me, a single data strategy takes principles from all of these frameworks, and I’m sure there are many more. A lot of these ideas are argued in pieces such as Stitch Fix’s “Engineers Shouldn’t Write ETL” or framed as a dichotomy in HBR’s article on offensive vs. defensive data. I believe there is no silver bullet here, but these frameworks reveal tendencies that we can learn to adjust for.
Newer data companies may tend to ignore the hierarchy of data needs, then overshoot on foundation once they realize their data is crap. Knowing how far to go comes with experience.
Engineering-focused companies may love the idea of a data platform or data service model, but deal with dozens of internal support requests when the platform doesn’t support unique business needs. Processes are then built out to circumvent a glorious platform.
Politically oriented companies might have data teams for each major domain, all experts in their space. Domain experts have a direct responsibility to their group lead, allowing for fast, iterative data-oriented decision making. However, these disparate data teams may resent or look down upon each other without understanding the complexity of problems the other side faces.
Analytics and data science heavy companies may have a penchant for very tight data guidelines to prevent bad analyses from seeing the light of day. Tight, centralized ETL processes and tools make migrations easy and create a safe way to “democratize data” (e.g. the philosophy behind Looker). Too tight, though, and your prioritization scheme creates underserved departments that are held to high data standards but don’t have the means to execute.
In the end, none of this really addresses the complexities of data today, which have far surpassed basic analytical needs. From machine learning pipelines to data-backed product offerings, from real-time streaming tech to advances in sensors and data in the physical world, data has become a beast that cannot be put back in its box. That being said, basic human access to data for day-to-day decision making should not be underestimated.