Is Snowflake a database?
The other day, a portion of Tristan’s Analytics Engineering Roundup post was misattributed to me on Twitter.
Leaving aside the fact that Drew Banin is a way smarter human being than I, I do find this topic really fascinating. The original context for the conversation boiled down to a question from Tristan: “Can you have a database that supports multiple workloads?” I actually don’t think that question alone is really answerable. You need some more context around what the asker means by “a database,” and depending on that context, I think the answer could actually be yes or no. I want to address this, but I also think this leads to a slightly more interesting meta-question: “how does the UX design of an API impact how a product is used?”
One size definitely doesn’t fit all
Twitter, with its 280 characters, is a pretty challenging form factor for doing much more than shitposting about complex, subtle technical topics (and it was even harder with only 140). In the context of the Twitter thread, I asserted that “yes,” you can have a single database that solves for many workloads. Folks on the thread made a logical leap and boiled that down to the idea that I was essentially in disagreement with Stonebraker’s (and Çetintemel’s) broad statements about database architectures. However, what I hoped to convey was that no, I’m actually in full agreement with Stonebraker (and as a potentially relevant side note, I’ll mention that a) after college I worked for Vertica, which was the product of Stonebraker’s C-Store paper, and b) I worked on the VLDB demo paper for H-Store, which was the OLTP complement to C-Store).
So, no, I don’t think it is likely that a single database system will handle dramatically different workloads. The query patterns, the I/O patterns, the optimization structures that are necessary for these things – they all vary dramatically from workload to workload. But this is why I’m such a stickler for terminology, and for being really specific about what you mean when you ask if “a database” can support multiple workloads. Let’s ask the deeper question.
What is a database?
Is Postgres a database? Yes, probably nobody will disagree that Postgres is a database. Is Redshift a database? Yeah, I think so. What about Amazon RDS? Is RDS a database? Well, it allows you to operate databases, but the actual name is Relational Database Service…so it’s a service for databases?
Let’s make this tougher.
Is Snowflake a database?
Snowflake definitely walks like a duck, and for the most part, it talks like a duck. It’s got a SQL interface that I can query and use to create tables with DDL. But at the same time, it’s also got Snowpark, which is a Scala/Java/Python, Spark-like interface for other work that lets me access the same storage I’m accessing from the SQL interface. Is that a database too? Or something else?
Snowflake, to me, is a product. A platform, or whatever term you want to use. A reductive view of the world might label Snowflake as a database, but early on, they called it DBaaS – database-as-a-service. And that service has evolved. Now it’s The Data Cloud. Is Snowflake still just a database? Or is it an interface to a data system that happens to have a SQL interface that feels like what we understand a database to be?
A database is a solution for a workload
What I would argue is that a particular database system is simply a solution for executing a particular workload. Snowflake’s original product (the SQL interface) was a database that solved for analytics workloads. Snowflake’s latest data system (Snowpark) could be argued to be a database that solves for a data engineering workload.
Other databases solve for other workloads. MongoDB (or Firebase or Couchbase) is a database that frequently solves for web-connected application state workloads. Obviously it solves for other workloads as well, but that’s a pretty darn big one that’s covered by the document store class of systems. CockroachDB, PlanetScale, Fauna – all of these cover a different workload (which, I’ll admit, I don’t totally understand, but I do understand it to involve use cases that require extreme speed and processing at the edge, thus the need for geodistribution). These are all, in my mind, different databases, but the thing each of them solves is a workload.
A platform can have many underlying databases or data systems
So when I ask if Snowflake is a database, the thing I’m homing in on is that I think Snowflake is really a data platform that provides one or more entry points to a potentially diverse set of databases that represent solutions for different workloads.
The promised land comes when you have a single entry point (or let’s say a single entry point per language binding) for all workloads. You can imagine a world where I write a query, and depending on the type of query I’m writing, or characteristics of the application, the query gets routed to different data systems to respond to that query. There ends up being a layer of the query planning infrastructure that looks at all of the possible workload engines, and decides: which is the right workload engine for this particular query?
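The routing layer described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how Snowflake or any real query planner works: the Engine class, the engine names, and the string-matching predicates are all hypothetical stand-ins for what would really be SQL parsing and cost-based planning.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Engine:
    """A hypothetical workload engine sitting behind the single entry point."""
    name: str
    accepts: Callable[[str], bool]  # assumed predicate: is this engine a good fit?


def is_write(sql: str) -> bool:
    # Crude stand-in for statement classification.
    return sql.strip().lower().startswith(("insert", "update", "delete"))


def is_point_lookup(sql: str) -> bool:
    # Crude stand-in: a SELECT filtered on a single id, with no aggregation.
    s = sql.strip().lower()
    return s.startswith("select") and " where id = " in s and "group by" not in s


def is_aggregate(sql: str) -> bool:
    s = sql.strip().lower()
    return "group by" in s or "sum(" in s or "count(" in s


# Engines are checked in order; the first one that accepts the query wins.
ENGINES = (
    Engine("transactional", lambda q: is_write(q) or is_point_lookup(q)),
    Engine("analytical", is_aggregate),
)


def route(sql: str, engines: Sequence[Engine] = ENGINES,
          default: str = "analytical") -> str:
    """Return the name of the engine that should serve this query."""
    for engine in engines:
        if engine.accepts(sql):
            return engine.name
    return default  # nothing matched, so fall back to a general-purpose engine


if __name__ == "__main__":
    for q in (
        "SELECT region, SUM(amount) FROM orders GROUP BY region",
        "SELECT * FROM orders WHERE id = 42",
        "INSERT INTO orders VALUES (1, 'east', 10)",
    ):
        print(route(q), "<-", q)
```

The application developer only ever calls route (or, in a real system, just submits the query); which engine answers is a decision made by the platform rather than by the caller.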
Critically, this is all enabled by cloud-scale economics. Until recently, it was cost-prohibitive to have enough copies of data to optimize for the varying query patterns, or to have enough infrastructure to simultaneously operate a myriad of data systems alongside each other. However, given the reality of cloud, I suspect that along some timescale, Snowflake’s front-end interface will end up effectively being a veneer on top of an analytical database, a transactional database, a data engineering-focused workload engine, a geodistributed engine, etc. Does it matter what database you’re using if it all feels the same to the end user (in this case, the application developer)?
To be fair, I think that this sort of opaqueness may not appeal to every organization. It stands in stark contrast to how AWS builds its offerings, which is more of a Johnny Appleseed-esque strategy of just plopping down a new service every time a new workload appears. The difference is that in an AWS world, you have to know which service is the right service for your workload, rather than just sending a query in and assuming the system is smart enough to figure out the right system to serve your workload.
With that observation, I think we’ve finally arrived at the bigger point that I wanted to allude to, which is:
The UX of an API influences product-market fit
I actually think that this sort of theoretical data-platform-as-a-service strategy runs as a pseudo-one-size-fits-all counterpoint to another potential vision of the world. In that vision, organizations know a lot about their own workloads and want to choose the best-fit platform for each particular workload, while the data itself is consolidated into a data-storage-platform-as-a-service (which is kind of how the companies behind Apache Iceberg and Apache Hudi view the world).
I don’t think that these two worldviews are necessarily mutually exclusive, either. In fact, I think they merely speak to different segments of the market.
At the end of the day, APIs and language bindings are a form of user experience. The way you structure the interface through which people (or programs) interact with your product is going to inherently influence which users or what applications are most likely to consume your platform.
In the two examples above, I think that the Data-Platform-as-a-Service is really a more downmarket solution. An organization that doesn’t want to worry about having contracts with 20 different vendors for each of their data platforms, or wants a single interface to simplify application development is likeliest to want the one-size-fits-all solution. Snowflake is no longer primarily selling to SMBs or mid-market businesses, but they definitely got a lot of their start in that market, so taking a one-size-fits-all, big-tent, workload-oriented strategy to building a product actually makes a lot of sense in my mind.
On the other hand, the Data-Storage-Platform-as-a-Service is geared more towards that Enterprise Architect who recognizes that there are many boxes to fill on their architecture diagram, and that there may be a specific, best-of-breed vendor for each workload, and wants that fine-grained control over their stack. For them, the Data-Storage-Platform-as-a-Service makes a ton of sense, because it gives them the most control over the processing layer, which is likely the layer that requires the most differentiation in terms of the technical underpinnings.
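As a rough sketch of that second worldview, here is one shared copy of the data with two separately chosen processing engines on top of it. Everything in it is an assumption made for illustration: ORDERS stands in for a shared table format, and AnalyticsEngine and LookupEngine are hypothetical placeholders for whichever best-of-breed engines the architect picks. None of this reflects the actual APIs of Apache Iceberg, Apache Hudi, or any vendor.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]

# Hypothetical shared storage layer: a single copy of the data that every engine reads.
ORDERS: List[Row] = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 25},
    {"id": 3, "region": "east", "amount": 5},
]


class AnalyticsEngine:
    """Scans the whole table to aggregate: a stand-in for an analytical engine."""

    def __init__(self, table: List[Row]) -> None:
        self.table = table

    def total_by(self, key: str, measure: str) -> Dict[Any, int]:
        totals: Dict[Any, int] = {}
        for row in self.table:
            totals[row[key]] = totals.get(row[key], 0) + row[measure]
        return totals


class LookupEngine:
    """Builds an index for fast single-row reads: a stand-in for a serving engine."""

    def __init__(self, table: List[Row], key: str) -> None:
        self.index = {row[key]: row for row in table}

    def get(self, value: Any) -> Optional[Row]:
        return self.index.get(value)


if __name__ == "__main__":
    # The developer, not the platform, decides which engine fits which workload.
    print(AnalyticsEngine(ORDERS).total_by("region", "amount"))
    print(LookupEngine(ORDERS, "id").get(2))
```

Here the choice of engine is explicit in the calling code, which is exactly the control (and the burden) that this approach hands to the developer.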
The UX is really defined by a choice of where the interface exists in the application stack
To tie this all together into a hopefully cohesive thought, let’s get back to that initial question: “can you have a database that serves multiple workloads?” I think the right thing to observe is that this is really a question of where the boundary between service and user interface sits. If you put it higher in the stack, closer to the end user or application developer, the product feels more like a single solution handling many different workloads. If you put it lower in the stack, closer to the metal, the developer has a lot more control over the end application, and the minutiae of the implementation become concerns of the developer, rather than the platform.
One workload or many workloads? I’d argue that it’s really just all in the eye of the beholder.