If you’re not yet familiar with dbt, it’s a data transformation technology that is widely used to enable ELT workflows in the modern data stack (which these days generally means an ingest tool like Fivetran or HVR, a transformation tool, usually dbt, and a cloud data warehouse like Snowflake or BigQuery). It’s become really popular with analysts, data scientists, and analytics engineers, and in many ways has become the de facto transformation technology for SQL-based cloud warehouses. When I joined dbt Labs and started talking to customers, however, I was surprised to find that the introductory conversations were rarely with analysts and data scientists, and that I was actually talking to data engineering teams in almost every conversation. Additionally, analytics engineering is a very new function for most organizations, and the title for the role was only coined in the past couple of years. It got me wondering... who is actually using dbt?
There are a variety of ways you could go about answering this question. One option would be to go into the dbt Community Slack (which is about 16,000 members strong) and run a poll. The challenge there, however, is that you’re probably getting a heavily biased set of respondents. The people most likely to respond are the most active users, who may not be representative of the full spectrum of users. What seemed like a better way to answer the question was to look at job postings that ask for dbt experience. dbt has become common enough that it shows up in a fair number of job postings, and the added benefit is that job descriptions give a lot of context around which organizations are using dbt, what the people using it call themselves, and what the rest of their tech stack typically looks like.
Using a Python library I found, I was able to scrape about 700 job postings from LinkedIn, and started taking a look at the data. The next challenge was that raw job posting data is pretty messy. Full-text job descriptions are very unstructured. Job titles are a little easier to manage, but many companies put attention-grabbing phrases in them (Pre-IPO! Urgently hiring!) or use varying level conventions (Senior Analyst vs Analyst II), all of which makes it a bit difficult to really see the shape of the data. And for better or for worse, dbt is also a slightly overloaded term. It means something very specific in the analytics and data world, but it is also a commonly used acronym for Dialectical Behavior Therapy. As a result, searching just for dbt on a job site returns a somewhat varied range of data and analytics jobs, as well as therapist and psychology-related roles.
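To make the cleanup problem concrete, here is a minimal Python sketch of the two steps described above: stripping hype phrases and level markers out of titles, and dropping postings where dbt means the therapy. The patterns and the function names clean_title and is_data_posting are my own illustrative assumptions, not the rules actually used for this post (that work was done in SQL with dbt).

```python
import re

# Illustrative patterns only; these are assumptions, not the post's actual cleaning rules.
HYPE = re.compile(r"pre-ipo|urgently hiring", re.IGNORECASE)
LEVEL = re.compile(
    r"\b(senior|sr|junior|jr|lead|staff|principal|i{1,3}|iv)\b\.?", re.IGNORECASE
)
THERAPY = re.compile(
    r"dialectical|behaviou?r(al)? therapy|therapist|psycholog", re.IGNORECASE
)


def clean_title(title: str) -> str:
    """Strip hype phrases, level markers, and punctuation, then lowercase."""
    title = HYPE.sub(" ", title)
    title = LEVEL.sub(" ", title)
    title = re.sub(r"[^A-Za-z/& ]+", " ", title)  # drop punctuation and digits
    return " ".join(title.split()).lower()


def is_data_posting(description: str) -> bool:
    """Keep postings that mention dbt and contain no therapy-related vocabulary."""
    mentions_dbt = re.search(r"\bdbt\b", description, re.IGNORECASE) is not None
    return mentions_dbt and THERAPY.search(description) is None
```

With these rules, "Senior Analyst (Pre-IPO!)" and "Analyst II" both collapse to the same cleaned title, which is the point of the exercise. A keyword filter like this is crude, and a real pass would likely need a manual review of borderline postings.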
The silent majority
What showed up from the data was actually fairly surprising to me, although once I saw it, it made a lot of sense. The dbt Community caters heavily to Analytics Engineers and Analysts. dbt Labs, itself, is very actively pushing forward the Analytics Engineering discipline. For a large portion of the community, the promise of dbt is one of empowerment and upskilling, and so it makes a lot of sense that the groups that would have the most to gain from dbt adoption are the ones who are the most vocal about it.
Less traditionally engineering-oriented roles made up around 37% of the roles where dbt was in use. The “Other” bucket is also very large, mostly because there were a lot of fairly unique titles that didn’t fit neatly into the other major groups (like “AI/ML Knowledge Graph Specialist” and “MarTech Operations Senior Associate”). Plain as day, however, is the fact that the largest bucket of users is Data Engineers.
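For a sense of how titles end up in buckets like these, here is a rough Python sketch that assigns each title to the first matching keyword rule and falls back to “Other”. The rule list and the names bucket_title and bucket_shares are hypothetical; the actual bucketing for this post may well have used different groups and rules.

```python
from collections import Counter

# Hypothetical keyword rules, checked in order; not the rules used for the post.
BUCKET_RULES = [
    ("analytics engineer", "Analytics Engineer"),
    ("data engineer", "Data Engineer"),
    ("data scientist", "Data Scientist"),
    ("analyst", "Analyst"),
]


def bucket_title(title: str) -> str:
    """Return the first bucket whose keyword appears in the title, else 'Other'."""
    lowered = title.lower()
    for keyword, bucket in BUCKET_RULES:
        if keyword in lowered:
            return bucket
    return "Other"


def bucket_shares(titles: list[str]) -> dict[str, float]:
    """Return each bucket's share of the given titles as a fraction of the total."""
    if not titles:
        return {}
    counts = Counter(bucket_title(t) for t in titles)
    return {bucket: count / len(titles) for bucket, count in counts.items()}
```

Rule order matters here: “analytics engineer” has to be checked before the broader “analyst” and “data engineer” keywords, and anything unusual, like the two example titles above, lands in “Other”.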
On some level, the fact that most dbt-related job postings are for Data Engineers is actually not surprising at all. dbt provides a lot of potential value to Data Engineering teams:
Empower more participants / refocus energy: I spend the bulk of my time talking to organizations that are interested in adopting dbt, and the reality is that most of the teams I talk to are data engineering teams. Data engineering teams look at dbt as a way of enabling others to self-serve, which both a) makes their internal customers happier and b) frees up the data engineering teams to focus on platform capabilities rather than adding new columns to tables.
Reduce shadow IT: Misalignment between IT and lines of business often results in disempowerment for the LOB, which invariably results in shadow IT. This has become more prevalent as infrastructure products have become easier to adopt (swipe a credit card and off you go). It’s often more advantageous to find a solution that appeals to both sides of the equation, and dbt seems to provide just that.
Cognitive load / context switching reduction: Python and SQL are the most common languages used by data teams these days, but data pipelines are typically mixed-and-matched frameworks that require context switching between languages to do end-to-end development. dbt reduces cognitive load by leveraging SQL for the full lifecycle.
How different are Data Engineers and Analytics Engineers?
As I mentioned earlier, Analytics Engineer is a fairly new title, but it seems to be popping up all over the place. However, a common source of confusion is how Analytics Engineering roles really differ from Data Engineering roles. Are they actually two different roles? Are they just the same role, scoped to different parts (or different subsets) of the data stack?
We can start to form opinions if we take a look at the technologies used across these roles, as well as the actual responsibilities and activities that these roles orient around. A well-written job description should give us a fairly good view of the primary technologies and activities for both Data Engineers and Analytics Engineers, as well as some of the other data-focused roles. I’ve broken out a number of common technologies, concepts, and vendors that I would expect to show up in job descriptions for roles on data teams. Some of the categories represented include:
Programming Languages: SQL, Python, Scala, Java
Data Infrastructure: Airflow, dbt, Spark, Kafka, Kubernetes, Hadoop, Docker, Dagster, Cloudera, Databricks
Data Warehouses: Snowflake, Redshift, BigQuery, Postgres
BI / Visualization: Looker, Tableau, Mode
Data Integration: Fivetran, Talend, Databricks, Matillion, Informatica
Concepts: ETL, ELT, Batch, Streaming
By breaking out these different categories and looking at how often each term appears in the job descriptions, you can see interesting trends in how the roles and responsibilities are delineated. This is not a huge sample size, so take the data with a grain of salt, but there are some interesting nuggets to pull out. It’s also worth noting that every one of these job descriptions mentions dbt, so you can make some assumptions about the general tech stacks these organizations run internally.
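The counting itself is simple. As a hedged illustration, the Python sketch below computes, for each role, the fraction of postings whose description mentions each term at least once. The function name term_share_by_role and the (role, description) input shape are assumptions made for this example, and the term list is just a small subset of the categories above.

```python
import re
from collections import defaultdict

# A small subset of the terms listed above, chosen for illustration.
TERMS = ["SQL", "Python", "Airflow", "Kafka", "Looker", "Tableau"]


def term_share_by_role(postings, terms=TERMS):
    """Return {role: {term: fraction of that role's postings mentioning the term}}.

    `postings` is an iterable of (role, description) pairs.
    """
    patterns = {
        term: re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE) for term in terms
    }
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for role, description in postings:
        totals[role] += 1
        for term, pattern in patterns.items():
            if pattern.search(description):  # count each posting once per term
                hits[role][term] += 1
    return {
        role: {term: hits[role][term] / totals[role] for term in terms}
        for role in totals
    }
```

Counting each posting at most once per term keeps a single keyword-stuffed description from skewing a role. Whole-word, case-insensitive matching works for names like Kafka, but a term like Mode, which doubles as an ordinary English word, would need stricter handling.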
Unsurprisingly, SQL and Python are the dominant languages for data processing. Historically, Java and Scala have been used heavily for developing underlying data infrastructure (for example, the vast majority of the Hadoop stack was written in Java, and Spark and Kafka both lean heavily on Scala). Java is the clear outlier, sitting largely in the realm of data engineering teams, but SQL and Python are heavily used across all functions. In a lot of ways (some described further on in this post), Data Scientists are more like Analysts than they are like Data Engineers, but interestingly, Data Scientists are at the extreme end for both SQL and Python. It’s often a lot harder to use statistical capabilities through SQL than it is through a procedural language (like Python, or a special-purpose statistical language like R or MATLAB). Linear algebraic methods and other matrix-oriented mathematics are great examples of common Data Science analyses that don’t lend themselves particularly well to SQL. There are a number of startups out there trying to buck this trend (Continual and RasgoML, for example), so maybe this will change in the future.
If we look at a much wider swath of overarching technologies:
There are much clearer delineations between roles in this graph. Data Engineers are clearly weighted towards infrastructural technologies (Kafka, Spark, Airflow). Tooling for interpretation, especially visualization tools like Looker and Tableau, clearly spikes towards Analyst and Analytics Engineer roles, with Data Engineers being much less focused on that set. Airflow is an interesting case. It’s a data pipeline orchestration tool, so decidedly infrastructure, but it is often leveraged to enable SQL-based pipelines, so the spikes across Analytics Engineers and Data Engineers are understandable. Given that Airflow is a Python-based framework, it also likely contributes to the importance of Python across all functions. Interestingly, Fivetran, which is a core ingestion tool for bringing data into warehouses, is heavily used by Analytics Engineers, and much less by anybody else. This is likely indicative of the fact that Fivetran typically enters organizations through a more under-the-radar shadow IT approach, rather than directly through the CIO. On the other end of the spectrum, tooling like Kafka is very clearly a Data Engineering concern. One open question this raises is where the border between Data Engineering and Software Engineering lies, and whether a tool like Kafka, which could be used as a central component in a software architecture, bleeds into the realm of core Software Engineering. The dataset of dbt-related job postings only included 5 job descriptions with a Software Engineer title. That suggests dbt has not yet crossed over into the toolset of core engineering, and it is likely too few postings to gather any meaningful insight.
Overall, there do seem to be meaningful differences between Data Engineers and Analytics Engineers in terms of the sets of technologies they work with. There is significant overlap between the two, enough to potentially argue that Analytics Engineers are a subset of Data Engineers, but the skew towards BI/visualization tools suggests a concrete difference in responsibilities.
What do they actually do all day?
The other area worth spending some time with is the actual responsibilities and activities of each role. I selected a set of verbs and nouns that I expected to skew interestingly across the set of titles. For example, I expect that Analysts would be heavily focused on producing stakeholder alignment and generating insights, whereas I wouldn’t expect those same activities (necessarily) from a Data Engineer. What came out were some interesting patterns:
What immediately pops out are many of the spikes for the Data Scientist. Insight, in particular, is very clearly an expectation of Analysts and, more than anybody else, the Data Scientist. Reading a bit into the chart, you might interpret that Data Science is still a somewhat mysterious discipline, and for many organizations, “insight” is the currency of that function. In that regard, Analysts and Data Scientists have a lot in common, being heavily weighted towards analysis and insight generation.
Many of the individual terms suggest that Analytics Engineering is a function more like an Analyst or Data Scientist than a Data Engineer. You can see this pattern with terms like influence, analysis, stakeholder, insight, and curate. However, terms that seem like they should be more skewed towards Data Engineers, like test, develop, design, and maintain, still seem to be fairly well-distributed over all those roles. Architecture is one activity that does clearly skew towards Data Engineering, but still appears to be a core responsibility of Analytics Engineering.
Overall, Analytics Engineers appear to be a curiously hybrid role. They have significant overlap with Data Engineers in terms of the demands of technical knowledge, though there are clear domains that remain the purview of the Data Engineering team. From an activities and responsibilities perspective, Analytics Engineers actually look a lot more like Analysts and Data Scientists than they do Data Engineers. The role appears to aim to strike a balance between technical competency and business-mindedness. Whether this will become a lasting role, the way Data Scientist did after coming onto the scene around a decade ago, remains to be seen.
Interested in playing with the data?
While it might have been overkill for my needs, the easiest way for me to express some of the cleansing I wanted to do was with SQL, so I ended up using dbt to do my cleaning and transformation. You can find the dbt project that I used to perform the work on GitHub, including the source data, which is provided as seeds, and I’ll probably devote a future post to a light introduction to dbt and how the different pieces fit together.
Special thanks to Chris Riccomini and Anna Filippova for reviewing earlier versions of this post!