[Repost] What is a Data Analyst in Tech? (Part 1)

2 and a half years ago, I thought I knew everything about what it meant to be a data analyst in tech and wrote a post “What is a Data Analyst in Tech (Part 1)”.

Looking at it now, there are plenty of moments of naivete, but I think a lot of these points still hold true.

Here is the post from 2015. I hope peers from then can relate to how much we’ve changed, and I hope those of you looking to change into tech can learn something new!

*I thought I had lost the post in my last WordPress migration – however I was able to find it thanks to the Web Archive at http://web.archive.org/web/20160221010248/http://brianylu.com/2015/01/23/what-is-a-data-analyst-in-tech-part-1/. Thanks Internet!

 

————————————————

What is a Data Analyst in Tech? (Part 1)

“So what exactly do you do as a data analyst?”

I get this question all the time.

If you are a consultant, banker, or new grad who wants to get into the analytics field, listen closely, because I am writing this for you.

__

First of all, I will be the first to admit that “big data” is a buzzword and heavily overused. In other words:

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…

– Dan Ariely (Facebook)

OK, OK, it’s not that bad. For example, big data is what allows Google to do the self-driving car, live translations, and all the other Googley things (more like every single Google thing). It is what brings us Watson, Siri, Cortana, and other AIs that will take over the world one day (read: Terminator, The Matrix, every cool future movie).

However, if you’re like me and only know Excel coming in, you might be discouraged from entering the data field at first – you aren’t making the next Watson anytime soon. That being said, I want to let you know that being a data analyst isn’t about big data. It’s about adaptation, grit, and constantly learning.

A great way to understand the data landscape is to define the job titles that have data in them. What do they mean?!? Let’s investigate:

  • Data Scientist: Does heavy machine learning/stats stuff like recommendation engines, clustering, etc. These guys are the brains of “big data” you see in the news.
    • Also known as: Data Mining
    • Need: MS/PhD in Stats or CS
  • Data Engineer: Deals with the server side of data, making sure data transfer systems are reliable and can handle high volumes of data. These guys allow “big data” to happen.
    • Also known as: Data Warehousing, Data Infrastructure
    • Need: CS degree (BS/MS/PhD all good)
  • Data Analyst (me): Does analytics on how either a product or the business is performing. In charge of day-to-day metrics as well as driving projects and recommendations.
    • Also known as: Business Analyst, Business Intelligence, ____ Analyst
    • Need: heart and brain, maybe a bachelor’s degree (for entry level)
    • Career path is unclear. Options are to work toward the roles above, become a better analyst, or move up into management.

A data analyst role sometimes looks a lot like consulting – this is why I sometimes describe my role as an ‘internal consultant’.

There are a lot of posts online about being a data analyst, such as Quora’s “What should I study or learn if I want to be a data analyst?”. To supplement others’ views on data analytics, I want to pitch in my experiences and hopefully an easy way to understand what I do.

__

DATA ANALYST PART 1: THE DATA CYCLE/PIPELINE

To fully explain what a data analyst does in years 0-2 (and on), I will first show you “Blu’s Data Pipeline”. This is the framework I think of day to day, and whenever I get the dreaded “What the hell do you do” question.

 

(Blu’s Data Pipeline diagram. Looks like fun.)

There are no numbers in the pipeline, because there is no beginning or end.

You could work an entire quarter within the ANALYZE and PRESENT tasks, creating reports to help your team better understand a certain area of the business or product. You could spend a quarter focusing on CLEAN and TRACK, working with data already in your warehouse to make dashboards that expose insights and visualize any anomalies.

Personally I’ve never done a full cycle project. Steph, one of my first teammates, could probably tell you about a full product analytics cycle with her work on our recently redesigned Help Center. But since this is my blog, I’ll go ahead and tell you instead.

Let’s see what I mean by going through a full analytics cycle:

  1. First step. Figure out how people were using the old Help Center. What pages were viewed the most? Did users click around? What kind of users clicked what kind of articles? (Analyze)
    1. First, gotta use SQL to get the data out of our warehouse (see the query sketch right after this list).
  2. Look at competitors’ Help Centers to get inspiration on various designs and flows. (Research)
  3. Use data/visuals to make a case for making improvements to the Help Center (Present)
  4. Work with engineering, product, design, marketing, support to discuss direction of project (Collaborate)
  5. Determine what kind of data we want from users visiting help center (browser, OS used, location, etc.) (Design Logging)
  6. Use internal systems to shove all the data into our data warehouse in a way that is easy to understand and query. (Clean)
  7. Build dashboards so you and your project-mates can track who’s viewing, what they’re clicking, and how they are navigating the new Help Center. (Track)
  8. Work with team to constantly A/B test UI, content, and flow improvements, and do additional analysis as necessary. (Do it all again)
  9. Enjoy the impact you made on a beautiful redesign of the Help Center, going from the old design to the new one (before and after screenshots).
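
Since step 1.1 always gets hand-waved, here is roughly what “use SQL to get data out of our warehouse” looks like in practice. This is just a sketch: the connection string and the help_center_pageviews table and its columns are hypothetical, not our actual schema.

```python
# Minimal sketch of pulling Help Center pageview data out of a warehouse.
# The connection string, table, and column names are made up for illustration.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://analyst:secret@warehouse:5432/analytics")

query = """
SELECT
    page_url,
    user_type,
    COUNT(*)                AS pageviews,
    COUNT(DISTINCT user_id) AS unique_viewers
FROM help_center_pageviews
WHERE viewed_at >= DATE '2015-01-01'
GROUP BY page_url, user_type
ORDER BY pageviews DESC
LIMIT 50;
"""

# Most-viewed pages by user type: the starting point for the Analyze step.
top_pages = pd.read_sql(query, engine)
print(top_pages.head(10))
```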

Since I’m not Steph, I cannot give you the full ins and outs of how amazing it is to see and work on a project from start to finish, and have something gorgeous to show for it.

However, I can say that whether you decide you like to focus on breadth or depth, understanding and being able to plug into any part of this cycle is what makes you one of the most versatile members of the org.

___

Think about a tech company and all the various web properties it owns. Think about the desktop version of the product, the web version of the product, the iOS application, the Android application, and just how many different features and flows need nuanced recommendations.

On the operations side, there’s marketing analytics: figuring out how to test ads across various mediums and measure their effectiveness. There’s work to be done in learning how we can help our existing customers become more engaged (retention).

There’s analysis to be done to figure out the financial impact of various initiatives, and what the baselines should even be (what is good? what is great? what is “somebody is going to get fired” land?)

There is so much analysis that can be done, and this is why the data field is so hot right now.

___

Thanks for reading What is a Data Analyst in Tech? (Part 1)

The next few installments will include something along the lines of:

  • What I enjoy most about my role as a Data Analyst
  • Why Data Analyst job descriptions are obsessed with SQL (and why they shouldn’t be)
  • Big list of the buzzwords and terminology you need to sound smart.
    • & the technologies and software being used in the analytics world
  • Career Paths for Data Analyst roles (Opinion/Discussion)

Please stay tuned for more, and hope you have a great day!

Managing distributed data is like managing a team

is there any difference??
  1. There is a central coordinator who needs a lot of memory and their only job is coordinating (the manager)
  2. Adding more nodes (employees) doesn’t automatically make things faster (esp. if work is poorly distributed).
  3. It doesn’t make sense for everyone to know everything (data partitioning)
  4. Transfer of context can be inefficient. It helps for relevant knowledge to be colocated (domain expertise)
  5. When a single node (employee) goes MIA without telling anyone, it’s worse than lost work because it can block others.
  6. When it becomes prohibitively expensive to acquire larger nodes (10x employees), you must expand horizontally.
  7. If one person gets all the work (skew) they will be overworked and sad. Sometimes you won’t realize a task or domain is disproportionately hard until it’s too late.
  8. You need some knowledge overlap (replication) to avoid single points of failure when people leave or go on PTO.
  9. Rebalancing is expensive and creates downtime (meetings, meetings, more meetings)
  10. When networking goes down (Slack), everyone is fucked.
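
If you want the analogy in code form, here’s a toy sketch of three of the ideas above: hash partitioning (#3), replication (#8), and skew (#7). Node names, the replication factor, and the workload are all made up.

```python
# Toy illustration of partitioning, replication, and skew. Everything is fake.
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]
REPLICATION_FACTOR = 2  # each key lives on two nodes, so one departure isn't fatal

def owners(key):
    """Pick which nodes hold a key: primary by hash, replica on the next node."""
    start = hash(key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# Simulate a workload where one customer dominates (skew).
keys = ["acme"] * 80 + [f"cust-{i}" for i in range(20)]

load = defaultdict(int)
for key in keys:
    for node in owners(key):
        load[node] += 1

for node, count in sorted(load.items()):
    print(node, count)
# The nodes that own "acme" end up with most of the work: the overworked, sad employees.
```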

[Article] On Demand Hadoop Processing by Google (a.k.a. “one way of dealing with lots of people writing shitty queries”)

Blog Post: https://cloud.google.com/blog/big-data/2017/06/fastest-track-to-apache-hadoop-and-spark-success-using-job-scoped-clusters-on-cloud-native-architecture

HN: https://news.ycombinator.com/item?id=14499170

This is a pretty interesting trend – as lots of processing becomes compute-bottlenecked by shitty queries, an on-demand service like Dataproc becomes really appealing. The trends as outlined in the post involve:

  • reducing complexity
  • resource isolation (so one query can’t kill everyone)
  • better auditing & monitoring (so you know who to yell at)
  • and more flexibility (so a select few can play around with few consequences)

TLDR: Let’s trade off some performance across the board to better handle “lots of people writing lots of shitty queries”.
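
For the curious, the job-scoped pattern boils down to “create, run, destroy”. Here is a rough sketch with gcloud driven from Python; the cluster name, region, worker count, and job file are placeholders, not a recommended setup.

```python
# Rough sketch of the "job-scoped cluster" pattern from the post: spin up an
# ephemeral Dataproc cluster, run a single job on it, then tear it down so one
# bad query can't starve everybody else. Names and sizes are illustrative only.
import subprocess
import uuid

cluster = f"adhoc-{uuid.uuid4().hex[:8]}"
region = "us-central1"

def gcloud(*args, check=True):
    cmd = ["gcloud", "dataproc", *args, "--region", region]
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=check)

try:
    gcloud("clusters", "create", cluster, "--num-workers", "2")
    gcloud("jobs", "submit", "pyspark", "my_query_job.py", "--cluster", cluster)
finally:
    # Tear down no matter what happened; you only pay while the cluster exists.
    gcloud("clusters", "delete", cluster, "--quiet", check=False)
```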

Also Ryan Noon sighting. Always asking the tough questions.

Data Analytics needs near-time, not real-time processing (Uber Blog, March 2017)

Hot new data processing frameworks such as Lambda and Kappa architecture have good ideas, but very few consumers of data actually need the real time (<5 min) latency those architectures provide. Taken from Uber’s blog post on Hoodie. https://eng.uber.com/hoodie/

A few months ago, Uber released a blog post on their new processing framework, Hoodie. They’ve probably been working on this idea for a while, based on paradigms from Google’s Dataflow. The crux of the post is:

Hot new data processing frameworks such as Lambda and Kappa architecture have good ideas, but very few consumers of data actually need the real-time (<5 min) latency they provide. Therefore, Lambda and Kappa are a waste of time if we can just get batch upserts (a portmanteau of update & insert) that update every 5 minutes.

This is more or less a continuation of the thought process over the last 15 years:

2004 – Batch (Batch) -> Data is cheap, compute is cheap. Let’s put everything in one big pile (Data Warehouse) and do all the processing at once!

201? – Stream (Stream/Micro-batch) -> Batch is slow AF and I don’t care about schemas or joins for many use cases. I don’t have to pay for storage because I can aggregate data on the fly and put the rest in cold storage.

2013 – Lambda (Stream + Batch) -> We want to keep the fast stuff. However, we also want to make raw data join-able with the rest of our data warehouse, and keep it around for a long time. Why don’t we just process everything twice and get the benefits of both! (one fast, one slow)

2014 – Kappa (Stream only) -> Dual processing is unnecessarily complex. We can use all fast processing and separate consumption between fast and slow for different use cases.

2017 – “Hoodie” (Near-time/Mini batch) -> Dual consumption is unnecessarily complex. We can process a bit slower (5 min instead of real time) and make read access fast for everyone! Our data users don’t need real time anyway. Keeping logs of everything allows us to align streams.

But for workloads that can tolerate latencies of about 10 minutes, there is no need for a separate “speed” serving layer if there is a faster way to ingest and prepare data in HDFS. This unifies the serving layer and reduces the overall complexity and resource usage significantly.

So here we are – I imagine a lot of data infrastructures are likely to move to this mini-batch processing in the next few years. The simplicity is too compelling (if it works well). Analysts and basic data science simply don’t need real-time processing, and this is a very user-driven approach to data infrastructure tech.
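
If “batch upsert” sounds abstract, here’s a toy pandas version of the idea: merge a small batch of changed records into the current table, keeping the latest version of each key. Hoodie does this incrementally at HDFS scale; the data and column names below are made up.

```python
# Toy batch upsert: apply a ~5-minute batch of changed records to the current
# snapshot, keeping the latest version of each key. All data here is fake.
import pandas as pd

current = pd.DataFrame({
    "trip_id":    [1, 2, 3],
    "status":     ["completed", "ongoing", "ongoing"],
    "updated_at": pd.to_datetime(["2017-06-01 10:00", "2017-06-01 10:02", "2017-06-01 10:03"]),
})

incoming = pd.DataFrame({  # the last ~5 minutes of changes: one update, one insert
    "trip_id":    [2, 4],
    "status":     ["completed", "ongoing"],
    "updated_at": pd.to_datetime(["2017-06-01 10:06", "2017-06-01 10:07"]),
})

upserted = (
    pd.concat([current, incoming])
      .sort_values("updated_at")
      .drop_duplicates("trip_id", keep="last")   # latest record per key wins
      .sort_values("trip_id")
)
print(upserted)
```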


Personal thoughts:

  • There are times when real-time info can help decide when to flip a switch – local Uber operators may want to turn off surge immediately based on surge spikes due to catastrophe. That being said, the changes you’d make based off analytics data usually require code pushes or changes in policy, both of which take some time.
  • There exist system monitoring and production services that can leverage Lambda/Kappa and happen to use similar technology, such as Hadoop, Kafka, Storm/Samza/Spark Streaming/Flume, etc. However, these services have vastly different requirements and tradeoffs from analytics data and cannot be lumped together.
  • The real killer feature here is incremental transforms and their ability to do mini-batch upserts. If Uber actually got this right across complex DAGs, then the promised savings on re-processing are real.
    • With more constant processing, will there be locking issues if most of the time is spent doing reads/writes?
    • Data can be cheaply reprocessed, but how do you maintain trust with the user that data will not change under their feet every 15 minutes? Since you can never know if data is lagging, is there a way to help the user trace back sources of slow data based on limited information?

    • Incremental processing? Does it work as promised?
  • What is the future?
    • Can mini-batch processing get good enough to actually power production systems?

Maybe some of these questions are answered in other papers, so would love to hear your comments!

Nobody enjoys writing ETL (Stitch Fix)

What is an ETL pipeline anyway?

A common fear of engineers in the data space is that, regardless of the job description or recruiting hype you produce, you are secretly searching for an ETL engineer.

In case you did not realize it, Nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato. It really shouldn’t come as a surprise then that ETL engineering roles are the archetypal breeding ground of mediocrity.

Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.

http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/

———————

Magnusson has a pretty good insight on how relationships between data scientists and engineers have evolved in the past few years at Stitch Fix (& similar companies). I agree with most of his points. That being said…

The biggest risk of the staffing model he proposes is here:

This is one reason why ETL and API / production algorithm development is typically handed off to an engineer in assembly line style. But, all of those tasks are inherently vertically (point) focused. Talented engineers in the data space are almost always best focused on horizontal applications.

Holy cow, I don’t think this could be any more wrong (the author either intentionally underplayed this or is being naive). A significant part of data transformation involves building data tables that are used by more than just you or your team. If your transforms are useful, other teams will want to access your transformed data. Now you are no longer building for the vertical – you are building for everyone. Changes you make to your ETL can have unintended consequences down your DAG.

The alternative is arguably worse. When other teams access your vertical without using your transforms, you get ETL redundancy, conflicting results (from minute differences in business logic), and a system that becomes increasingly difficult to maintain. There is no happy nirvana. Shuffling incentives doesn’t remove all the shitty ETL work – however the burden can become much more tolerable when spread across several teams.
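
To make the “minute differences in business logic” point concrete, here’s a toy example of two teams deriving the same metric from the same raw events and getting different answers. The column names and definitions are made up.

```python
# Toy example of ETL redundancy: two teams re-derive "active users" from the
# same raw events with slightly different logic and quietly disagree.
import pandas as pd

events = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3, 3],
    "event_type": ["view", "purchase", "view", "view", "view", "view"],
})

# Team A: anyone with any event counts as "active".
active_a = events["user_id"].nunique()

# Team B: only users with a purchase count as "active".
active_b = events.loc[events["event_type"] == "purchase", "user_id"].nunique()

print(active_a, active_b)  # 3 vs 1: same data, conflicting "active users" numbers
```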

Aside from this, I think this 2016 article is super solid and I will be referencing it in the future.