I’ve spent the vast majority of my career working in core data infrastructure. The AI/ML space, a near adjacency, has long been an interest and a flirtation for me, but it has progressed to a full-on love affair. I’m excited to join Braintrust as their Head of Field Engineering, and I wanted to put down some thoughts on why I’m so invigorated by the space and this company in particular.
Bringing software engineering best practices to AI engineering
GenAI application development is the wild west. Over the past several years, there has been a veritable race by companies of all sizes to figure out how to bring AI capabilities to market. Chatbots abound, copilots are everywhere, and on the more sophisticated fringes, agents are the future (though good luck getting anybody to agree on what the hell an agent actually is). Successful organizations are being rewarded handily: close associations with AI result in bigger funding rounds and loftier multiples, and for better or for worse, the venture funding environment surrounding AI technologies looks eerily similar to 2021 (everybody knows they raised interest rates, right?).

But that race to bring the next big thing to market brings with it major new challenges that engineering teams haven’t had to face in the past. Taking a step back, in traditional software engineering, when you want to build new features or make improvements to your product, you write new code and go through some sort of rigorous testing and validation process to make sure that you didn’t screw it all up and introduce bugs. Not rocket science. Billions of dollars of market value have been created by companies that exist to help ensure product quality.
The problem is that traditional software engineering approaches to testing and quality hinge on the idea that if you run a test multiple times on the same input, you’ll get the same output every time. The biggest issue in AI land is that LLMs are non-deterministic. If you provide the same input to ChatGPT (or your interface of choice), without sufficient instruction to limit the set of possible outputs, you’re almost certainly going to get different outputs every time. So we’re fucked, I guess?
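To see the problem concretely, here’s a toy sketch using the OpenAI Python SDK (the model and prompt are just illustrative): ask the same question twice at the default temperature and you’ll almost certainly get two different answers.

```python
# Toy demonstration of LLM non-determinism: the same prompt, sampled twice
# at the default temperature, will almost always yield different text.
# Model choice and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()

for attempt in range(2):
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model makes the point
        messages=[{"role": "user", "content": "Describe our product in one sentence."}],
    )
    print(f"Run {attempt + 1}: {response.choices[0].message.content}")
```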
The result is that the best-of-breed approaches to testing (called evaluations in AI land) are basically vibe checks (this is going to get so much worse if vibe-coding takes off). Engineers have sets of sample inputs and expected outputs, run the inputs through an LLM, and try their damnedest to determine whether the app improved. At worst, someone makes a change to an LLM prompt, sticks their finger in the wind, and says “yup, this feels better.” At best, you’re working out of some crazy spreadsheet, attempting to figure out which outputs changed and whether things got better or worse. And it’s never as simple as “are the outputs improved?” – you also have to think about performance (in AI parlance, time-to-first-token is the term for how long the LLM takes to stop spinning and start spitting out text) and, god forbid, cost (tokens are non-trivially expensive for many foundation models).

The problem gets even more complex as the evaluation mechanisms themselves get more complex. In traditional software testing, deciding whether a test passed is usually a matter of checking that the output exactly matches an expected value, whether it’s numeric, textual, or some more complex object. Generative applications, however, can produce an effectively infinite set of possible outputs if unconstrained, and often you actually want to score an output on things like factuality (“is this a true statement?”) or relevance (“is this a helpful response?”), which are subjective in nature. When you get into subjective evaluation, the best-of-breed approach is to use an LLM as the judge itself. You heard me right: not only is an LLM producing the output, but another prompt gets passed to an LLM asking it to decide whether the test passed. That might sound like a house of cards, and like the quickest path to ceding all control of the system to our new AI overlords (whom I, for one, welcome), and you might be right, but it is, today, the best way to perform these tasks.
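As a rough illustration (not any particular product’s implementation), an LLM-as-judge scorer can be as simple as a second prompt that grades the first model’s output. The judge model, rubric, and PASS/FAIL protocol below are all assumptions for the sketch:

```python
# Minimal LLM-as-a-judge sketch using the OpenAI Python SDK. The judge
# model, rubric, and PASS/FAIL protocol are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading the output of an AI assistant.

Question: {question}
Submitted answer: {answer}

Is the submitted answer factually correct and responsive to the question?
Reply with exactly one word: PASS or FAIL."""

def judge(question: str, answer: str) -> bool:
    """Ask a second LLM to decide whether the first LLM's answer passes."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; use whatever you trust most
        temperature=0,   # keep the judge itself as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```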
A core part of Braintrust’s value is solving precisely this: making it possible to move away from subjective vibe checks to a place of true rigor and objectivity, so organizations can confidently say: this change to a prompt made the application better, along whatever axes are important for the business.
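For a flavor of what that looks like in practice, here’s a minimal sketch using Braintrust’s Python SDK with an LLM-as-judge scorer from its autoevals library; the project name, dataset, and task function are hypothetical stand-ins:

```python
# A minimal eval sketch using Braintrust's Python SDK with an LLM-as-judge
# scorer from its autoevals library. The project name, dataset, and task
# function are hypothetical stand-ins for your actual application.
from braintrust import Eval
from autoevals import Factuality

def my_app(input: str) -> str:
    # Stand-in for your real LLM-powered application logic.
    return f"Generated response for: {input}"

Eval(
    "my-project",  # hypothetical Braintrust project name
    data=lambda: [
        {"input": "What is the capital of France?", "expected": "Paris"},
    ],
    task=my_app,
    scores=[Factuality()],  # LLM-as-judge scoring, per the section above
)
# Run with `braintrust eval this_file.py`; each run becomes a scored,
# diffable experiment rather than a spreadsheet full of vibes.
```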
The dataset as moat
Another reason I think Braintrust’s market is so critical as AI proliferates is that moats for AI application companies are evaporating at a rapid pace. The more capability that gets baked into foundation models, the weaker the application companies’ moats become. The first crop of companies building on LLMs was effectively shipping thin veneers on top of GPT models, Claude, or whatever model was most appropriate for the app.
The rough pattern is a model provided with a trivial system prompt (think “You are a talented SDR and you’re going to craft customized messaging for our customers so we can book initial meetings”) and a UX around it for end users to interact with. The problem is that this is wholly undifferentiated. Anybody can build this app.
A slightly more sophisticated app gets creative with prompts. You add examples and more specific instructions to further tune the LLM’s responses to your use case, which gets you into the realm of prompt engineering and instruction tuning. The reality is that a lot of AI work ends up looking like a major effort to enumerate all the ways you want to constrain the LLM, typically by providing examples of edge cases. But to provide the right edge and corner cases, you need to know the right example inputs and expected outputs in the first place. Without that set of information, the LLM is still liable to produce bad responses.
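Concretely, that enumeration often ends up looking something like the sketch below: a few-shot prompt where each example encodes an edge case you’ve learned about (the support-bot use case and the examples here are invented):

```python
# Sketch of a few-shot prompt where each example encodes a known edge case.
# The support-bot use case and the examples themselves are invented.
FEW_SHOT_EXAMPLES = [
    # Edge case: input riddled with typos
    {"input": "wats ur refnd polcy??",
     "output": "Returns are accepted within 30 days of purchase."},
    # Edge case: out-of-scope request the model should decline
    {"input": "Write me a poem about your CEO",
     "output": "I can only help with questions about our products and orders."},
]

def build_prompt(user_input: str) -> str:
    """Assemble instructions plus edge-case examples into a single prompt."""
    parts = ["You are a support assistant. Answer only product and order questions."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"User: {ex['input']}\nAssistant: {ex['output']}")
    parts.append(f"User: {user_input}\nAssistant:")
    return "\n\n".join(parts)

print(build_prompt("do u ship 2 canada?"))
```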
Going even further leads you to fine-tuning the model: retraining it on a narrower set of data that makes it more fit-for-purpose for a particular task. This is one of the most sophisticated ways to build a differentiated AI product, but it’s also the most expensive and the hardest to execute successfully. The success of a fine-tuning initiative is wholly dependent on having a good, representative set of data that covers a broad swath of the use cases the application aims to solve for. The dataset is paramount – a bad dataset will sink your battleship, and the right dataset will lead you to riches beyond your wildest imagination.
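For context, a fine-tuning dataset is often just a file of example conversations. Here’s a sketch of assembling one in the JSONL chat format that OpenAI’s fine-tuning API accepts (the examples are invented):

```python
# Sketch of assembling a fine-tuning dataset in the JSONL chat format that
# OpenAI's fine-tuning API accepts. The examples themselves are invented;
# a real dataset needs hundreds or thousands of representative rows.
import json

examples = [
    {"messages": [
        {"role": "system", "content": "You are a support assistant for Acme."},
        {"role": "user", "content": "wats ur refnd polcy??"},
        {"role": "assistant", "content": "Returns are accepted within 30 days of purchase."},
    ]},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```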
The problem here is that getting that golden dataset is extraordinarily non-trivial. Imagined examples often leave out the insane typos and wildly out-of-left-field inputs that real-life users throw at applications. Synthetic datasets may be too constrained or may not cover meaningful edge cases. One of the better ways to actually build these datasets is to cull examples from live applications – sampling the actual inputs from real users.
This is where observability, logging, and monitoring become crucial, and it’s another major facet of what Braintrust aims to solve. Braintrust is designed to be brutally simple to plug into an existing application, taking near-zero engineering effort to enable real-time logging and tracing. It’s a quick way to start looking at the inputs and outputs being served to actual users, and Braintrust has a host of capabilities designed to surface notable examples that can be harvested for these crucial datasets. Effectively, Braintrust is tooling to help build the moat.
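As a sketch of what “near-zero engineering effort” means here, based on Braintrust’s documented pattern of wrapping the OpenAI client (the project name is hypothetical):

```python
# Roughly what near-zero-effort logging looks like with Braintrust's Python
# SDK: wrap the OpenAI client once and calls are traced to a project.
# The project name is hypothetical; check Braintrust's docs for specifics.
from braintrust import init_logger, wrap_openai
from openai import OpenAI

init_logger(project="support-bot")   # hypothetical project name
client = wrap_openai(OpenAI())       # every call below is now logged/traced

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "wats ur refnd polcy??"}],
)
print(response.choices[0].message.content)
```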
Evaluations enable cost efficiency
A couple of months ago, DeepSeek-R1 dropped onto the world with claims of dramatically reduced costs both to train a frontier model and to run inference workloads. OpenAI is regularly dropping updates to their models – minor tweaks here and there, regular improvements, and so on. Hugging Face is a treasure trove of distilled models (small models trained on large models to get similar results more cheaply) and fine-tunes (models trained on more narrowly scoped datasets) for different use cases and applications. The model scene moves at a breakneck pace, and there are two primary drivers: speed and cost.
If you can get the same quality from a distilled model, for example, it likely means you can get faster responses and improved time-to-first-token, and balance that against a lower cost. So organizations have a strong incentive to constantly pit models against each other to figure out how to optimize their user experience, spend, and performance. These are complex value equations, and not something that can be easily eyeballed. It requires some sort of statistical mechanism, at the very least, to determine whether progress is being made.
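A back-of-the-envelope version of that comparison might look like the sketch below, which measures time-to-first-token and approximates output cost per model; the model names and per-token prices are assumptions, not current list prices:

```python
# Back-of-the-envelope comparison of two models on time-to-first-token and
# rough output cost. Model names and per-token prices are assumptions, not
# current list prices, and streamed chunks only approximate token counts.
import time
from openai import OpenAI

client = OpenAI()
ASSUMED_COST_PER_1M_OUTPUT_TOKENS = {"gpt-4o": 10.00, "gpt-4o-mini": 0.60}

def measure(model: str, prompt: str) -> dict:
    """Stream a completion, recording TTFT and an approximate cost."""
    start = time.monotonic()
    ttft, chunks = None, 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.monotonic() - start  # time to first token
            chunks += 1
    cost = chunks / 1_000_000 * ASSUMED_COST_PER_1M_OUTPUT_TOKENS[model]
    return {"model": model, "ttft_s": ttft, "approx_cost_usd": cost}

for m in ASSUMED_COST_PER_1M_OUTPUT_TOKENS:
    print(measure(m, "Summarize a 30-day refund policy in two sentences."))
```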
Again, this is an arena where Braintrust is meaningfully innovating. Braintrust provides a Playground that is, for all intents and purposes, a prompt engineering development environment. It enables organizations to take a prompt and tools, evaluate performance across different scoring mechanisms (scorers written in code, LLMs-as-judge, etc.), and even look at performance side-by-side across different models – all without requiring application code changes. This is huge from an engineering-agility perspective, and it equates to real, meaningful dollars saved for an organization.
Another area where Braintrust is innovating is the underlying infrastructure that powers the product. Braintrust just announced the availability of a new database called Brainstore, purpose-built for the observability use cases that Braintrust serves. Where a system like AppDynamics (a popular observability tool) relies on querying for specific terms like “ERROR” or a particular system name to find relevant logs, AI observability querying depends more on full-text search, where you’re looking for text buried in a potentially long LLM response. That query pattern is meaningfully different, and in a system designed for term search, it’s likely to perform very poorly. Brainstore is designed to make these use cases blazing fast, and beyond being a meaningful differentiator relative to competitors, it is, again, evidence of Braintrust’s extreme focus on solving the core customer problems they see.
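A toy example of the difference (generic Python, nothing to do with Brainstore’s actual internals): a term index answers exact-token lookups instantly, but a phrase buried inside a long LLM response forces a scan of the full text.

```python
# Toy contrast between term search and full-text search. A term index
# answers exact-token lookups instantly; a phrase buried in a long LLM
# response forces a scan over the full text. (Generic illustration only,
# not a description of Brainstore's internals.)
from collections import defaultdict

logs = {
    1: "ERROR connection refused by payments-service",
    2: "Sure! Our refund policy allows returns within 30 days of purchase...",
}

# Classic log search: an inverted index over whole tokens.
term_index = defaultdict(set)
for log_id, text in logs.items():
    for token in text.lower().split():
        term_index[token].add(log_id)

print(term_index["error"])  # fast exact-term lookup -> {1}

# AI observability query: a multi-word phrase inside a long response.
# The term index can't answer this directly; we must scan the text.
phrase = "returns within 30 days"
print({i for i, t in logs.items() if phrase in t.lower()})  # -> {2}
```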
Braintrust is a tested team with deep expertise building meaningful differentiation
One of the major downsides to jumping into the AI space right now is precisely the fact that the funding environment looks unhinged, with massive valuations and extreme multiples. There are at least a dozen companies trying to do precisely what Braintrust is trying to do. So why make a bet in such a crowded space? One Braintrust customer (one of the fastest-growing digital native businesses in the world) that I had the opportunity to speak with told me an anecdote about how they made their selection: they put out an RFP, received 70+ responses, and down-selected to 27 companies to bring in for demos. From there, they selected 3 finalists for deeper-dive discussions, and eventually selected one to do an actual POC with. Hopefully it’s obvious who that final selection was.
I have a strong conviction that this category is mission-critical for companies serious about AI engineering. That means there are going to be a lot of knife fights in the market as all of these companies land-grab for customers, but in practice, the cream rises to the top. The best players in the space (a cohort that Braintrust is definitively in) will be the ones that most often grace the columns of companies’ comparison matrices, and in reality there are far fewer meaningful competitors to concern ourselves with than the raw count suggests. I also feel strongly that the right approach to competition is to be laser-focused on solving customer problems – as soon as you fall into the trap of anchoring on who has the best version of any particular feature, you lose sight of what really matters: how much value you provide to your customers and users. Braintrust is an organization that is wholly focused on the customer.
Beyond that, the DNA of the organization is extremely compelling. The CEO, Ankur Goyal, is a two-time founder (his last startup, Impira, exited via acquisition by Figma, where he led AI platform initiatives prior to founding Braintrust), in addition to being one of the first employees at SingleStore, where he ran the engineering team and developed multiple novel database systems. He’s pulled together a deeply technical team with a significant specialization in database internals to back this up. On the investment side, some of the smartest and most knowledgeable minds in the AI space have put their money behind this company (Elad Gil, Saam Motamedi at Greylock, Martin Casado at a16z, Greg Brockman, Alana Goyal at basecase, Databricks Ventures, and Datadog Ventures).
Even with a better product and meaningful differentiation, more competitors means everybody ends up squeezed, so there better be a big fucking market backing this thing up. And there is. The unfortunate reality is that AI observability and evaluations are new categories of tech, so there are few clear incumbents to look at to help size the market; any existing technology companies are using for these purposes is probably homegrown. Somewhat conveniently, one comparable emerged in the past 24 hours, as CoreWeave (currently one of the most hotly anticipated IPOs) announced intent to acquire Weights & Biases for a reported $1.7B. While Weights & Biases has a historical anchor in the MLOps arena with their model management product, much of their recent focus has been aimed at a newer product called Weave, which sits in precisely the same LLMOps space as Braintrust.
The right thing to look at is market size estimations, many of which peg the broader AI market at more than $1T by 2030. For an industry that barely existed five years ago, that’s utterly insane on its own. Braintrust’s market is some fraction of this, and the reality is that the vast majority of the market share right now is being eaten up by the foundation model companies, but it’s a starting point. Narrowing the focus, there are some reports on “AI in Observability,” which is a little off to the side of observability for AI, but that market is estimated to reach $11B within the next ten years. I also see Braintrust as playing in the AI governance space, since data quality and app quality play into governance initiatives in a meaningful way, and the AI governance market is likely to top $16B by 2030.
I do think there is a very real question to answer around who ends up serious enough about AI to warrant buying an AI evals and observability platform. Having not yet started the job, I don’t have a perfect handle on how sophisticated or serious about AI engineering an organization needs to be to hit the tipping point of “I need a Braintrust.” It’s unsurprising that many of the premier digital native businesses (Notion, Instacart, Ramp, Stripe, etc.) are investing enough in AI development to need a solution to this problem. It’s less clear what the tipping point will be for a company like Costco or JPMorgan Chase, which have massive IT budgets (as all Fortune 500 companies do) and are undoubtedly experimenting with, and possibly productionizing, GenAI use cases. My suspicion is that they, too, will see the need for this technology, but their requirements will certainly look different than Notion’s.
All in all, while the exact market size is likely a matter of great debate, there is a lot of cash to go around, and as the broader AI market grows, Braintrust’s opportunity grows with it. In a gold rush, I’ve always been attracted to selling picks and shovels, and Braintrust provides the picks and shovels that companies need to develop and mature their AI initiatives and build their moat.
Why I am personally excited about joining Braintrust
Beyond my conviction on the mission-criticality of this space, the talent of the team, and the size of the market, there are reasons I am particularly excited about joining Braintrust. When I’m looking for an opportunity, I’m looking for a company with these sorts of ingredients, but also one that fits my skill set and experience. In many ways, I saw close analogues between Braintrust and dbt Labs.
dbt was also, in many ways, aimed at bringing software engineering best practices to a new area of engineering. Data engineering was historically extremely wild-west-y, and dbt has a very similar value proposition with regard to improving productivity, governance, and data quality. That makes Braintrust extremely translatable for me personally.
Further, I have a high degree of confidence that the sales motion is likely to end up having a lot of meaningful similarities to dbt Labs’. I was asked recently about competitors built by the foundation model companies themselves. For example, OpenAI has their own Evaluations product that organizations can use to do similar work. So why wouldn’t companies who have bought into OpenAI models just use that? The answer is simple: OpenAI is only incentivized to let organizations evaluate performance across OpenAI’s own models. They’re unlikely to ever provide the ability to also evaluate, say, Claude or LLaMA models, because there’s a risk their own models would show up poorly, cannibalizing their own business. They offer Evaluations because of strong customer demand, but that’s not where they make their money – their primary revenue stream is selling access to their core models. The result is that, for a company to be successful in the AI evaluations space, there is an imperative to be independent and agnostic – a truly valuable solution doesn’t care what underlying model you use.
That agnosticism is also a strength in terms of partnership. dbt Labs has strong partnerships with AWS, Snowflake, Databricks, Microsoft, and others, because there is a high correlation between dbt usage and increased consumption of the underlying data platform services (which the data platform companies like a lot, for hopefully obvious reasons). dbt is enabling technology, and it helps customers use these services more effectively and efficiently. The result is that a significant amount of lead flow and pipeline generation for dbt’s business actually comes from the partners themselves, both ISVs (the platforms, like Snowflake and Databricks) and SIs (services partners, like the Accentures and Deloittes of the world). Similarly, Braintrust is enabling technology for leveraging LLMs effectively and efficiently. There is a natural partnership that will emerge between Braintrust and companies like OpenAI, Anthropic, and Together.ai as it becomes clear that Braintrust helps drive faster, stickier adoption of AI. The core problem is that many AI projects still die on the vine because companies don’t know how to get them to production, and once they’re in production, they don’t know how to safely improve them with new capabilities or fix bugs. That’s precisely what Braintrust enables, and it is likely to drive higher degrees of success for AI projects and initiatives, which, in turn, expands Braintrust’s own market.
Ultimately, what excites me most about joining Braintrust is the opportunity to be at the forefront of shaping how AI applications are built, evaluated, and optimized. The AI engineering space is still in its early days, and while the landscape is evolving rapidly, one thing is clear: companies that take a structured, rigorous approach to AI development will have a significant competitive edge. Braintrust isn’t just riding the AI wave – it’s building the critical infrastructure that will help organizations turn AI from an experimental curiosity into a reliable, scalable, and cost-effective part of their business.
I’m thrilled to be joining a team that shares this vision and has the technical depth, customer focus, and execution capabilities to make it a reality. There’s an incredible journey ahead, and we’re hiring.