Episode Description
Jason Ganz, Senior Manager DX at dbt Labs, joins Simon Maple to unpack why, despite the rapid rise of AI systems, enterprises still rely on structured data for consistency and reliable decision making. They also discuss:
• the invisible edge cases LLMs can’t see
• the difference between software engineering and data engineering in AI
• the mismatch between AI output and business logic
• what the data engineer of the future actually does
Overview
LLMs Are Powerful, But Not Omniscient
Large language models (LLMs) excel at generating boilerplate code, writing descriptions, and scaffolding first drafts of data tests. But they consistently miss domain-specific logic, those “weird” edge cases that only live in the heads of seasoned engineers. LLMs can't intuit the decades of tribal knowledge embedded in enterprise data systems.
Data Engineering’s AI Edge
Jason Ganz explains that AI’s impact on data engineering mirrors that on software engineering, but with crucial twists. Data workflows depend on large volumes of interconnected, structured information. Unlike a self-contained web app, data tools must reason across dependencies, history, and business-specific definitions. Copilot-like tools can help, but need deep metadata and semantic awareness to do it right.
Structured Data: Still the Source of Truth
Despite AI’s generative power, enterprises still rely heavily on structured data—because it's deterministic, consistent, and serves as a shared reality for decisions. LLMs struggle without clearly defined semantics. That's why dbt’s semantic layer is critical—it encodes business logic like “revenue” in precise, reusable definitions that LLMs can reason about.
The Limits of LLM Reasoning
LLMs often hallucinate or generate plausible-but-wrong SQL. Worse, they lack determinism: ask the same question twice, get different answers. For business-critical metrics, this unreliability is unacceptable. The solution? Human validation, strict role-based access controls, and semantic abstraction layers that enforce consistency.
AI Needs Context—Enter DAGs and Metadata
Context is everything. dbt models use DAGs (Directed Acyclic Graphs) and metadata to provide LLMs with structure. This prevents query failures and hallucinations, letting AI reason more effectively. But there’s a ceiling—models still need to be fed scoped, permissioned, relevant data to function safely and well.
MCP Servers: The Future of Data Access
Jason champions MCP (Model Context Protocol) servers as the game-changer. They let users query live data using natural language, massively simplifying access. Tools like dbt's MCP server connect users with trusted, semantic data models instantly. This lowers the barrier to entry and increases organizational demand for quality data.
The Data Engineer’s Evolving Role
As MCPs become mainstream, data engineers are shifting left—owning more of the data modeling, governance, and orchestration. Their job now includes building systems ready for AI interaction, and enabling future workflows like agentic automation or real-time dashboards.
AI is a Copilot, Not a Pilot (Yet)
For now, human validation is non-negotiable. AI can accelerate development, but responsibility still falls on engineers to ensure correctness. The future may bring tighter automation loops, but today’s best practice is trust, but verify.
Resources
1. Jason Ganz (LinkedIn) - https://www.linkedin.com/in/jasnonaz/
2. Jason Ganz (X) - https://x.com/jasnonaz
3. dbt Labs - https://www.getdbt.com/
4. dbt Fusion engine - https://www.getdbt.com/product/fusion
5. dbt Community - https://www.getdbt.com/community
6. Simon Maple - https://www.linkedin.com/in/simonmaple/
7. Tessl - https://www.linkedin.com/company/tesslio/
8. AI Native Dev - https://www.linkedin.com/showcase/ai-native-dev/
Chapters
00:00 Trailer
01:01 Introduction
01:41 dbt Labs
04:39 Data engineers
07:39 LLMs understanding
13:15 AI isn’t as lazy as humans
15:29 Problem: the scaffolding to get data
17:38 Best contextual results
19:40 Dealing with security
25:00 Structured data
27:37 Problems with LLMs and data
29:47 Exact numbers
32:10 Hallucinations
34:28 Human validation
36:20 MCP servers
39:09 UX bottlenecks
42:27 Quality of data
44:00 The future of data engineers
47:02 getdbt.com
48:09 Outro
Full Script
Simon: Hi there and welcome to another episode of the AI Native Dev.
Simon: My name's Simon Maple, your host for today, and today we're going to be talking about all things data engineering with Mr. Jason Ganz who works in the developer experience space for a company called dbt. Jason, welcome. How are you doing?
Jason: Hi Simon, I'm doing great.
Jason: Super excited to be here with you today. Big fan of your show and excited to nerd out with you about data, AI, MCP, who knows where I could go.
Simon: So, Jason, you're in the developer experience space. What does that mean?
Jason: Yeah, so dbt Labs is a company that makes dbt, which is a very popular way for data engineers, analytics engineers, and analysts to work with their structured data. Particularly in the past, you would do ETL. We now live in an ELT world where you extract, load, and transform.
Jason: And so, organizations have just a tremendous amount of data these days. You have all of the records that are generated by your payment vendors, by your marketing campaigns, whatever you can imagine. Every business is just constantly spitting off just tremendous amounts of data.
Jason: And you want to do things with that data. But also going from kind of the raw data that you have there to something that is useful and interesting is actually a very non-trivial challenge. You have to take that data and then you have to go and say, okay, all of this data, but what is useful? What are the useful bits and context that I have here? And then you build it up, you apply transformations, you join it together.
Jason: And you have to do that in a way that is flexible. So, you make sure that you can connect together the different things. It's trustworthy so that you know that you're generating accurate data and that it's not just accurate, but it's accurate and timely.
Jason: That you're not blowing up the spend if you accidentally join two tables together and then you end up with a six-figure bill for a query that then your boss comes in and yells at you about, so, there's all kinds of complexity around this. So, what we do at dbt is we have created dbt, which is a popular framework for doing the data transformation, as well as then additionally things like orchestrating your data, being able to discover your data assets, all those things.
Jason: And so, my job as our head of developer experience, and I was half joking with Simon before this that I'm considering changing my title to head of agent experience, is making sure the tools that we are building serve the needs of the people building with our product. We're building both a programming language interface of sorts and then tooling on top of that.
Jason: So that's everything from helping sketch out, okay, here's the direction that we think data engineers and analytics engineers need to be moving in the future to be productive, to working on some of our open source and source-available repositories, to hearing from our users what they need, and just making sure that we are building the things that allow people to take all their data and make it useful, accessible, all those good things. And that it's fun and interesting to do the work while you're doing that.
Simon: Now, you mentioned data engineer there, and I think most of our listeners have probably worked with a data engineer or a number of data engineers in the past, but I feel sometimes like it's one of those kinds of roles that maybe can change a little bit between org to org.
Simon: What's your definition of a data engineer?
Jason: Yeah, it's such a good question. I think a lot of roles in the data space, particularly at different organizations or in different industries, you can see kind of like wildly different definitions. But we think of the data engineers as kind of the people that are responsible for building and maintaining your data pipeline.
Jason: So, you're moving around large amounts of data, you're making sure that all of your databases are set up correctly, that you have the right kind of permissions and infrastructure in place. You could be dealing with business logic, working with stakeholders to say, okay, we want this data set defined. Or you could be working closer to the middle, focusing on your orchestration, making sure your jobs run, making sure your tests pass, and building out your data platform. So, we think of it as much more of a spectrum than hard definitions, but there is an end closer to the hard technology of the databases, working on underlying database infrastructure.
Jason: And then there's an end closer to something like a data analyst or a business stakeholder, and all this room in between. So, a data engineer is someone that fits within that range, primarily working on building and maintaining data pipelines, making sure that you are building the data sets that are required by your business.
Simon: And how do people, I was going to say, how do people kind of get into that role? Do you see people moving from an engineering role, software engineering role into that? Is it people who come from a data background from education and these types of things into that role? What's the typical route?
Jason: Yeah, it's a great question. I'd say there's two primary ends. And so, the first is, yeah, someone will move over from software engineering.
Jason: You're a software engineer and you find that you are working with a lot of data. You think it's fun and interesting. And so, you start to develop a specialty there.
Jason: And then at some point, you kind of formally make the jump over to data engineering. And then the other one would be you're someone that works in a more kind of data analysis, analytics engineering role. And then you keep shifting left, we call it, where shifting left is moving more into data engineering.
Jason: And increasingly we see people directly targeting data engineering as a career path. There are data engineering bootcamps for people trying to get into it. But the two main ones are, to not put too fine a point on it, either you come in from the data route or you come in from the engineering route, and then you wind up kind of in the middle there.
Simon: Yeah. Yeah. Cool.
Simon: Well, let's dip into AI now a little bit. We've talked a little bit about some of the typical things a data engineer will need to do. In terms of the core aspects, where do you feel AI can best come in and help, whether from an automation point of view, or dealing with large amounts of data or queries?
Simon: What do you feel are the lowest hanging fruit, the biggest wins that we can make using AI in this space?
Jason: Yeah, it's such a great question. And really the interesting thing about data engineering is that almost everything from software engineering applies, and then there are slight tweaks that make it different. So, everything that your previous guests have said about how AI is going to affect software engineering is relevant to data engineers because, at the end of the day, data engineers are working in an IDE, they're writing code, they're doing things like that.
Jason: So, all the things around AI-assisted software development, Copilot, apply. I've been spending a tremendous amount of time with Claude Code recently and really enjoying it. But the difference is, when I'm writing a web app using AI-assisted code gen, it all lives in the web app, right? We have developed a lot of infrastructure so that I can create an app and it kind of lives there. When an organization is working with terabytes, petabytes of data, there are a lot more considerations that have to go into it, because you want to create a script to update an existing query, but then you're going to need to know about all of the metadata for that, all of the history, all the state.
Jason: So, some of the considerations that I'm sure go into the development and deployment of complicated software engineering, you have to get into from the beginning if you're going to be doing AI-assisted data engineering work. The low-hanging fruit, though, is exactly the same as it is for everything else. So, you know, we launched a copilot, and the copilot is pretty similar to other copilots, with the distinction that it brings in metadata from your dbt project and that it knows about dbt.
Jason: So like, I, one of the things that people have to do all the time in dbt is write descriptions and tests for their data.
Jason: It's not anyone's necessarily favorite thing to do, but very important for having a, a good, well-maintained dbt project.
Jason: And so that's the type of thing where, coming into this year, we were shaving off the common, cognitively demanding, but not necessarily incredibly complex use cases, and putting those into the AI-assisted development flywheel. And now, with the current generation of models, it's looking a lot more like, okay, we can create a dbt model from scratch and it's pretty good, but it still needs to know what the columns are in your dbt project.
Simon: The LLMs are probably quite good at building descriptions, but I'd love to know, when we're thinking about tests for data, how does it know what the most important parts are to test? There's a bunch of columns; what data is the most important data? What are the kind of rules that the LLM needs to understand and recognize to actually be able to test this data well?
Jason: Yeah, great question. Okay. So, this is where you get into the blending of domain expertise with the engineering aspect of this.
Jason: And so, the first line of testing that you want to add to a dbt project, you can kind of get that for free. You're doing a lot of testing to make sure you don't have duplicates in your primary keys, or that a required field isn't missing.
Simon: Hygiene, hygiene style testing.
Jason: Exactly. And honestly, just doing that is a huge gain for most organizations. But then there are things like adding unit tests, and making sure that you're unit testing the right thing.
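For reference, that first hygiene layer is typically declared in a dbt schema YAML file. A minimal sketch, with model and column names that are illustrative rather than from any real project:

```yaml
# models/schema.yml (hypothetical model and column names)
# `unique` catches duplicate primary keys;
# `not_null` catches missing required fields.
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
```

Running `dbt test` then compiles each of these into a SQL query that fails if any offending rows exist.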
Jason: That is one where LLMs are very smart, and they're very good at suggesting the types of things that you might want to unit test. But everyone's data is super weird. And most of the things that are super weird about it live in the heads of the people that have been burned by it a billion times.
Jason: So, you know, doing a workflow where you're like, where you sketch it out. And I think that's where we are today where it's like, okay, it's not good. Like it will be able to get you a very good first pass of the data hygiene things.
Jason: I'm like, you know, you want to take it over, but it's going to be really good at that. And then the things that are like, okay, we're like the, the wordy parts that are like specific about my business. It's like not going to be able to do those just straight out of the box.
Jason: But what it will be able to do is work with you: if you describe a unit test, it will scaffold it out. And particularly if it has good context about what the underlying columns are, what the data types are, what's expected, it really is helpful to have all of that, and it draws on all of that.
Jason: And then you go validate it and make sure it works. So, it's very good at doing the hygiene things out of the box. And then the more specific, bespoke ones, that's where you're going to want to work with it to make sure that that gets in there.
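The scaffolding described here might look like dbt's unit test YAML (available since dbt Core 1.8). A hedged sketch with illustrative names; the business-specific part is the expected rows, which encode the knowledge only the team has:

```yaml
# A sketch of a dbt unit test; model, column, and value names are illustrative.
unit_tests:
  - name: revenue_excludes_refunded_payments
    model: fct_revenue
    given:
      - input: ref('stg_payments')
        rows:
          - {payment_id: 1, amount: 100, status: completed}
          - {payment_id: 2, amount: 50, status: refunded}
    expect:
      rows:
        # The "burned by it" knowledge: refunds don't count as revenue.
        - {total_revenue: 100}
```

An LLM can scaffold the structure from column metadata; a human fills in the `expect` rows that capture the edge cases.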
Simon: Yeah. And it almost feels like as well, sometimes these are, AI is so good at doing things that we are often as humans, a little bit lazy to do or lazy at sometimes, like, I don't know whether it is tests or docs or those types of things. AI just isn't as lazy there and it will go and off and it will start documenting everything you want it to document.
Simon: So, when we talk about things like data, do you find it sometimes being more complete than if a human were to do it? Are there areas where you feel AI is actually more competent at completing the work than a human would be? There's a question for you.
Jason: Oh, Simon, it's so great. Okay. So, first of all, all of our users are perfect and they always document and test everything. But me, I'm lazy.
Simon: I think we have the same users.
Jason: And so, I will be writing and I'll be like, I'll fill this in later.
Jason: I'll add the testing later. And so, for me personally, I have found that particularly if you're the type of person that sketches out the 15 things you want to do and wants to twirl through all of them, and you leave yourself a bunch of "oh, I should do this" notes, that's just the kind of workflow where, I mean, I know we've all become a little inured to this two or three years in, but it just feels like magic.
Jason: It feels like something I never thought.
Simon: Yeah. Yeah. Well, first of all, excellent cover, Jason, that was brilliant cover for your users. Secondly, in terms of, yeah, the various things, Copilot, that type of action that AI helps with, descriptions, tests for data. What else will AI help with for data engineers these days?
Jason: Yeah. I mean, we think it's going to be helpful ultimately across the entire stack. So: writing your data models, refactoring your data models, debugging when a data pipeline breaks. Okay, why did this break? What do I need to do? What needs to change? The thing that's missing from that right now is not really model intelligence. Models are pretty much good at doing all of this today. The problem is building the scaffolding to get that data to them and productizing it.
Jason: Cause, you know, if your model breaks today, there's a relatively simple first set of things: okay, why did it break, look at the logs. And, of course, when anyone's pipeline breaks today, what's the first thing that they're doing? They're pasting it into one of these things. Anyway,
Jason: so, it's building the system such that the first layer of problems across all of this, and the first pass of any of this should be able to be addressable by the LLM. And so, you know, I'm really hopeful we will get to the place where it's like, okay, everyone's data systems work better. You are able to build more data products cheaper and faster.
Jason: And with kind of like less just toil in terms of making it and maintaining it. And that, that, that to me is kind of like, that, that's like baseline. That's like the minimum thing that we're going to get.
Jason: The thing that then gets, I think more interesting is, okay, but then like, how is all that stuff actually like much more useful in a world where it can then be collected, connected into, into other AI applications?
Simon: Yeah, no, very interesting. Let's jump into one thing that I think is really important there, which is: when you have huge amounts of data and you want the LLM to respond based on a query or prompt that I give it, how do you get that context window right to get the best results back?
Jason: Yeah, it's a great question. So, everyone in the dbt universe and the data engineering universe, we think in DAGs, right.
Jason: Directed acyclic graphs. And so, we actually have a lot of information about your DAG, about the models and about the columns and all of that. Obviously you can't pass all the data into the system, and even for these big organizations you certainly can't pass in all of the tables; they have thousands of tables, thousands of columns and descriptions and all of that. So, we think of modularity; modularity has been one of the key concepts of how dbt has transformed data work to begin with. It used to be, you know, a decade ago before dbt, you'd be working and you'd have these 5,000-line SQL scripts that would be maintained by a couple of people.
Jason: And those people, their entire life would be debugging this one SQL script. And we're like, all right, well, what if we broke that up into models and you ran them bit by bit? And then when one breaks, you can see what's downstream. So, there are a lot of concepts and abstractions, not just in dbt, this has been happening across the data engineering world, where data engineers have been working on how we create the kind of modular concepts that a human can work with.
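Breaking a monolithic script into models looks roughly like this in dbt; the table and column names below are illustrative. Each file is one node in the DAG, and `ref()` is what wires up the dependency edges:

```sql
-- models/fct_orders.sql (illustrative names)
-- One small node in the DAG instead of part of a 5,000-line script.
-- {{ ref('...') }} declares a dependency, so dbt knows exactly what
-- sits upstream and downstream when something breaks.
select
    o.order_id,
    o.ordered_at,
    c.customer_name,
    sum(p.amount) as order_total
from {{ ref('stg_orders') }} as o
join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
left join {{ ref('stg_payments') }} as p
    on o.order_id = p.order_id
group by 1, 2, 3
```

That same DAG and column metadata is what gets selectively packed into an LLM's context window, rather than the raw tables themselves.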
Jason: And actually, I don't know about you, Simon, but my personal context window is actually a lot smaller than 200,000 tokens. So, we're able to get a lot of additional context into these generations; it's based off of DAG information.
Simon: And as part of that, presumably you're using AI here to generate a lot of the queries, a lot of the SQL and things like that. If you're using AI in that way, how do you deal with things like security, making sure that when you're gathering that kind of data you're not at any time injecting bad things, or those types of problems?
Jason: Yeah, it's a really, really good question. So typically, in the data engineering world, we are less worried about SQL injections or anything like that, because typically the people running these queries work for your company. But the thing that you still need to think about is, okay, how do we make sure no data is getting exposed, either internally or externally, to people that shouldn't have access to it? And this is one where it's leaning on familiar abstractions, the ways that we have been building out governance as an industry.
Jason: I feel like I'm repeating myself, but it's a really important concept: the set of abstractions that we built to solve hard problems for humans are the same set of abstractions that you're going to need to use, and then build upon, for AI systems. So, if you have an AI system or agent operating on top of your database, it should have a role that's scoped such that it can't access data that you wouldn't want the human you expect to be querying it to have access to.
Simon: Interesting. Okay. So, so the LLM has, the LLM only has the access to the data that the, that the, that the user would. So, each, each request goes through a specific access constraint.
Jason: I mean, that's how it should be done. And so, if you're using this, for example, through our cloud platform, you will have logged in, and then you, Simon, will have a set of database permissions. And then any SQL queries that you run will have been already scoped and managed by your team, such that you hopefully have access to all the data that you need access to, but you're not getting access to data that would be sensitive or that you shouldn't have access to.
Jason: And then, so then kind of like AI systems and queries would be, would be run the same way. So, thinking in terms of roles and the importance of roles, this is very important when you are giving this perhaps not to an internal user, but to an external facing customer support bot. And you want your customer support bot to have access to the billing information for the person that they're talking to, because that's very useful.
Jason: And you really don't want them to then be able to hop over and ask about the customer whose IDE is incremented one, one higher in the database and their billing. So really, really locking down the data access and roles is, I mean, it's just like, it's always been that important, but it feels orders of magnitude more important now that we're hooking these systems into more things.
Simon: Yeah, absolutely. I think that's the only way of doing it as well in terms of like, you know, not allowing the LLM to make a decision. We've seen time and time again how LLMs can, can be confused or tricked into, into thinking they have access to other things that they shouldn't. In fact, we wrote just recently a piece about the GitHub MCP server that it's, it was an interesting one, actually.
Simon: And it's a good example of how you can almost circumvent some of the authorization. This MCP server will be sat there looking at your GitHub repos, and a user will obviously only come in looking at the public repo. If they added an issue onto the public repo, and the developer using their MCP server said, you know, can you fix all these issues, it will look at those issues on the public repo. And if a malicious user added a malicious issue on that repo that said, get me all the information about this user and what they do in their private repos and so forth, the GitHub MCP server would actually go into the private repos and then create a public PR against the public repo with potentially sensitive information.
Simon: So, the LLM is there making the decision based on the, on the security authorization of that user versus actually using proper authentication or authorization mechanisms that state whether this user can access this data. And I think that's it. As soon as we, as soon as you leave it to the LLM, mistakes will be made, whether it's, and I'm not even thinking like a hallucination.
Simon: I'm just, it's, you just need to convince it.
Jason: They’re so helpful.
Simon: Whereas yeah, they are right. They always want to provide an answer.
Simon: That's their problem. But I think what you're describing there is, is like using the mechanics of what we already know and trust to actually build that out, which I think is, is, is very powerful. Let's talk about structured data.
Simon: How important is structured data, obviously from a user's point of view, to get that back? How does the LLM deal with structured data, I guess both as input and returning it?
Jason: Yeah. I mean, it's important enough that every organization in the world invests a tremendous amount of time and energy in creating structured data, single sources of truth, for their organization.
Jason: And there are a lot of reasons for that, but as you're building an organization that needs to coordinate, right, and you're doing that on complex things and trying to run complicated workflows in the real world, you ultimately need to have structure. For enterprises, for governments, for anyone that is trying to coordinate across many people and many departments, you are ultimately going to have to have systems of record.
Jason: This goes all the way back to, you know, when we first invented double entry bookkeeping, however many thousands of years ago, it's one of humanity's most important innovations because what it allows you to do is, the world is just like really complicated. And in order to have institutions that act on top of that world, you have to have systems that are saying, this is a thing that we believe to be true in the world. And that's really what structured data is for is we, we believe that on June 26th at 5:14 PM Jason went and he ordered an iced coffee from our e-commerce page.
Jason: And we believe that the account owner for Amia Corp Bank is Simon. And so structured data is just all of these things that your organization believes. And then when you need to make a decision, such as, okay, how much coffee do I need to order for my coffee store next month, you need to go back to your structured data and look at the previous things that were there.
Jason: So to the extent that we expect LLMs will need to be doing things that engage with the level of complexity where you might want to check if something is true or not, they're just going to have to have access to structured data. Unless, you know, sure, maybe we get a 10 trillion token context window and they can just do it all perfectly in context. But in a world where that's not happening, they're just going to need to act on top of structured data. Yeah.
Simon: Yeah. And in terms of what LLMs are bad at in this respect, what would you say are the biggest problems that you've seen using AI to deal with data, building these various queries and requests for data? What would you say it's not particularly good at, that in years to come it could perhaps have scope to grow in?
Jason: Yeah. So one thing it's not particularly good at right now is coming back with the same answer every time, even if that answer is technically reasonable or right. And this kind of goes back to your GitHub example.
Jason: So, you know, one of the things that we say ad nauseam is that every organization defines revenue differently. And in fact, every organization probably has five or nine different definitions of revenue, and ultimately someone has to go and reconcile them and be like, no, this is the one true definition of revenue. LLMs are very smart and eager, but they don't know anything about your business.
Jason: So, you tell it to go calculate revenue, and it's going to come up with one of the nine, or it's going to make a 10th, an 11th, a 12th. So, you know, it's really good at writing a SQL query that looks reasonable and produces you an output.
Jason: But if you are trying to bake these things into your critical business systems, you're trying to do things where, like, one of the things data is most helpful for is alignment: I look at something and I develop a mental model of the world, and then you look at it and you actually operate off the same mental model of the world. And then we can collaborate and coordinate across that.
Jason: And LLMs are just not good at that right now because they just don't have enough context. They don't have the system of record to create the single source of truth right now. So that, that thing is actually like really interesting and really important for deploying and scaling AI systems and making them actually kind of useful organizationally.
Simon: Yeah. And using that example you just provided, let's take revenue as a great example, right? If you have a COO or a CEO or a CTO, someone who cares a lot about these very specific numbers, and it's really important they get accurate numbers, and they ask for something, then ask for it a day later, and it gives them a different answer, through either non-determinism or because they asked in a very subtly different way.
Simon: How important is it that these types of things are deterministic, versus actually requiring some level of validation for pretty much everything it provides you with at that level?
Jason: Yeah, it's a good question. For things like what you were describing, CEO-level reporting, things where you're going to be making hard business decisions based off of them, or board reporting, where you have a duty to report something accurately, you've got to give the exact right answer.
Jason: And so this is one of the things that we have, called the semantic layer, which sits on top of your dbt project and defines the semantics of your business. So you would define something called revenue, and then you would say: for us, revenue means this column, aggregated by these dimensions and these times. And one of the other things that LLMs aren't as great at is thinking in the rectangles that are structured data tables. They're actually much better at thinking in terms of the semantic concepts as defined in the semantic layer.
Jason: So, if you give them semantic concepts to map onto, it's a way that LLMS can actually reason about this that fits in with how they reason. And so, we ran some benchmarking on this. It's just much, much more accurate if you do the same sort of natural language questions on top of your semantics as on top of your structured data. So that's been really interesting and useful finding and one that I think focused across the ecosystem we're seeing.
Simon: Yeah, and that makes a ton of sense, to give it that advice or that path it can follow. I guess hallucinations are a big problem here, particularly when you give it a request where you don't have those semantics, where you don't provide it with a good path to go down to understand where to get data or how to form that data. Does that happen much?
Jason: Yeah, hallucinations happen. They happen across the board in any LLM workflow. We find that most of the time, if you're getting hallucinations, it's hallucinating a table name or a column name or a database function that doesn't exist. So actually, your query breaks.
Simon: It can forget.
Jason: And honestly, you'd probably rather have your query break than have it do something that's not a hallucination but is not quite right. A not-quite-right query is actually much worse than a query that just breaks. And so the hallucination problem is not solved, but it's largely addressed when you present the right set of context.
Jason: So if you present enough DAG information and right now we can only do this like for certain projects, certain skills in certain ways, but ultimately, you know, like this is going to like we are going to solve the problem where it's like when you want to write a SQL query, like you're going to have a good information of like here is the like here's the dialect that you have. Here's the functions that you have available to you here. Like here's the columns that you've available to you and their data types and all that stuff. And then, so you'll be able to do something.
Jason: And then it's, okay, but is it exactly right? And that's where you need the semantic mapping and things on top of it.
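The "query breaks loudly" behavior Jason prefers can be enforced before a generated query ever reaches the warehouse. This is a hedged sketch, not any real dbt or MCP feature: the catalog contents and table names are invented for illustration. The same schema context you hand the model can also be used to vet its output, so a hallucinated table name is flagged immediately instead of failing silently or producing a plausible-but-wrong result.

```python
import re

# Hypothetical schema catalog: the context you'd hand the model,
# and also the ground truth you vet its output against.
CATALOG = {
    "fct_orders": {"order_id", "order_total", "order_date", "region"},
    "dim_customers": {"customer_id", "region", "signup_date"},
}

def find_hallucinated_tables(sql: str) -> set:
    """Return table names referenced after FROM/JOIN that do not
    exist in the catalog, so a bad query breaks loudly up front."""
    referenced = set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.I))
    return referenced - set(CATALOG.keys())

# An LLM-generated query referencing a table that doesn't exist:
bad_sql = "SELECT SUM(total) FROM fct_revenue JOIN dim_customers ON 1=1"
missing = find_hallucinated_tables(bad_sql)
# 'fct_revenue' is flagged before the query runs anywhere.
```

A real implementation would use a proper SQL parser rather than a regex, but the design point stands: hallucinated identifiers are the cheap kind of error precisely because they are mechanically detectable.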
Simon: Yeah. And it's nice in that sense, because there's a level of abstraction between the LLM and the actual result you get back, right? If there's going to be a hallucination in the code that gets created, chances are it's going to break: in compilation, or when it tries to link to something that doesn't exist, or hit an API that plainly won't work. That's better than the worst-case scenario, where it hallucinates quietly and provides you with something incorrect without raising any exception at all.
Simon: So, there's this nice layer of abstraction there where chances are something will go wrong if it, if it, if it hallucinates. In terms of human validation, I guess in some of the workflows that you have seen across dbt and others, how much human validation is still needed? Maybe even across the queries that, that are created. How much can be automated in terms of validation versus that human and stop the world and check before you actually run these things?
Jason: Yeah. So I'm of the belief that where we are right now is: if I'm sending you something that was generated by an LLM, I'm signing my name to it, and I need to have validated it. Ultimately, it's on the person who owns the workflow to know which set of things is going to be fine versus which set of things, and at this point that's most of the things, where it generates code and then I go through and read it, I make sure it's right, I might tweak something, I might check whether my understanding of something was wrong.
Jason: But like I right now we're at this interesting thing where it's like, you know, we have this higher level programming ++ right? And like you can operate on that level. But it's like it's, it's there enough for when I'm when I'm like, I'm building a side project which is like an MCP server that lets you like world build like a fantasy or sci-fi world from scratch and like I like that's all Claude. I look at the code. I'm like, yeah, yeah, okay, looks good.
Jason: But that's because that's just for me. If I'm signing my name to something I'm reading through the code. I'm vetting it. I'm validating it. I'm checking, making sure it works and that that's what I would recommend for people today. But look, you know, things move fast. Four months ago, you couldn't you couldn't rely on these systems as much as you do today.
Simon: Yeah. And let's talk about MCP servers, because this is a really nice way now of actually getting to your data, right? Presumably, with an MCP client attached, I can ask a question, and all of a sudden the LLM will magically hit this MCP server, perhaps one that you provide, and it's grabbing data for me, pulling dashboards, and presenting things to me. Do you see this as one of the core ways people will access data going forward?
Simon: And how mainstream is this in what you're seeing today, versus the things we've talked about already?
Jason: Yeah. So I'm just so excited about MCP. It feels like it blew the lid wide off. I had a habit, basically from the week after ChatGPT launched onwards, where I would go to my LLM provider of choice and type in, what is my revenue, and it would say, "I'm sorry, I can't answer that because I don't have that information."
Jason: And I would post it in Slack, saying the only thing that matters is whether we help people answer this question. And for however many weeks, too many weeks in a row, that was the answer. But then, as MCP started to pick up steam, it was like, oh, we now have the way we can provide this to people, because it's so useful to have access to this data.
Jason: So, we built the dbt MCP server. And now today I go and I log into any MCP enabled clients, I type in what is my revenue, it goes, it looks at my semantic layer and it says, I see the revenue, the revenue metric, I'm gonna, I'm gonna calculate that as your revenue was at X amount of dollars.
Jason: And honestly, this pretty quickly became my primary way of interacting with data. But I'm perhaps not your normal data consumer. How mainstream is it? Data teams are adopting it in prototypes and pilots and rolling it out on that basis. Everyone's thinking about it. Everyone's experimenting with it. And there's just no real question that this is going to be how people interact with data moving forward.
Jason: It's just so painful to go and like scroll through every like, like thing that you might have available to you and you just don't have to do it anymore. You just go in and say, what was my revenue? And there's going to be a lot of work to make this enterprise ready and make it all of the things that we need. But the writing is on the wall, like it's, it's so hard to imagine this not happening just because it's such an incredible increase in ability to access data.
Simon: We've talked about developer experience; let's talk about user experience. I feel like one of the things here is, if this kind of thing was made available within an organization, and all of a sudden you have huge amounts of data fronted by these MCP servers that anyone can now access, you're going to get a ton of people way more interested in accessing data who before didn't know who to talk to or how to get access to it.
Simon: With this, I'm presuming there’s going to be massively increased usage of or massively increased demand for access to, you know, for just generating requests to this data. Where does the bot, what's the bottleneck now? Because does the, is there going to be now a big bottleneck in terms of, you know, are we getting all the right data in? Are we formatting the data correctly now to actually get all these new use cases that we perhaps didn't have before because it wasn't so readily available? What do you see as the new, the new bottleneck?
Jason: Yeah, I'm so glad you asked that question. This almost gets us back to the very first thing we discussed: what's the data engineer of the future doing? They are building out the set of use cases now. Data is both going to be much easier to access and much more important to get correct, because more people are going to have access to it. And, maybe we'll get to this, you'll be driving agent workflows on top of it.
Jason: So, you, the fundamentals of this are still going to be important, right? And the fundamentals are still, you know, very challenging to take in all of your data to make sure it's accurate, to make sure that the system is kind of well, well honed. And so, that, that's half of it. And so that's like what the data engineer of the future is going to do is they are going to be creating the set of data that then MCP or things like MCP will sit on top of, and then people will create it and they'll do that.
Jason: And then what's a little more fuzzy is what you would call them, I don't know, data analyst of the future does and how they fit in. Because, you know, traditionally, we're always going to have dashboards. We're still going to do dashboards. But like, I've always thought it's kind of crazy that we put so much effort into like learning about the world and learning about ground truth and then like building up these things. And then it's like, okay, we have like this set of bar charts that we look at.
Jason: And the bar charts are great. Don't get me wrong. I love the bar charts. And they're additional things. But now, like, we can actually use this kind of as like the central mind for then LLM processes to go and take actions off of. And that, like, that's a whole new industry. That's a whole new thing.
Simon: Yeah. And actually, evolving dashboards based on the data could be super interesting. As things start going south, deep-dive into those areas and pull more data in, so that we have these monthly or quarterly views built around the painful areas, versus just a standard traffic-light-style dashboard. Sounds amazing.
Simon: And so you see the data engineer much earlier in the cycle then, presumably, versus later. It's more about the quality of the data, the structure of the data that goes in. Is that fair?
Jason: I think that's right. So your MCP, or the system like MCP, is going to sit on top of your data platform of choice, or your Iceberg tables, or whatever. And it's going to need access to a tremendous amount of well-modeled, accurate, well-governed data. And that's what the data engineer will be doing.
Jason: They will be orchestrating that, testing that, making sure that it's already for then whatever kind of set. This is like a bit too simple of a heuristic, but like, you know, they're doing everything pre-MCP, right? And then there's like another set of things which is like post-MCP or well, data engineers and analytics engineers and analytics engineers being the ones that still also focus on building out the data object.
Jason: But they're much more in the weeds, kind of with the domain experts with the business context and making sure that the, because like the MCP, it doesn't just need tables that are like technically built well. It also needs tables that have like deep context and understanding of the business, right? And they're built in such a way that they are the correct layer of abstractions and the correct descriptions and all of that, that then they can go and like take actual useful, actually, and be a staff of.
Simon: Yeah. And one other thing, actually, just before we wrap up. I'm thinking about this future data engineer role, and comparing it to a data engineer pre-AI: how would a data engineer know what types of data they needed to add into their databases?
Simon: Well, it would typically be because they would get certain amounts of requests in and a certain amount of probably the business knowledge and the local knowledge that they had. But now with the explosion of something like an MCP server, you can almost like, see a lot of the requests that are kind of like, you know, I guess being asked for on.
Jason: Yeah.
Simon: So is it more of a demand thing then? I'm trying to think: who does that data engineer need to talk to in order to do the best job? Is there a possible future here whereby, based on the number of requests coming in, the data engineer thinks, okay, I need these types of data in order to be able to satisfy these requests?
Simon: Feels, feels like I don't want to say the data engineer role is getting lonelier and lonelier in terms of who they're going to be interacting with. But I guess, you know, how does that data engineer interact to get, to get that level of input?
Jason: Oh man, that's so fascinating. I'm imagining we're going to have the metadata of all of the requests that come into the MCP server. And one of the things we'll be able to do is say: were we able to answer this question, or did it rely on data that we don't currently have access to?
Jason: And then, you know, it's like, it'll be not, not, not gone, not gone with you. Everyone's going to, everyone's going to yell at me for saying this, relatively trivial then to like build a, reporting on top of that. That's like, oh, actually, like a ton of people are asking about, like, this data source, like, we're seeing all these requests come in. It's, you know, of our requests per day, 17% aren't able to be answered. And, you know, for 41% of the 17% that can't be answered right now could actually be addressed if we bring in, like, this additional data source. And also, like, here's where you can find that data.
Jason: So, yeah, I mean, that's a very great question, really interesting concept you sketch out.
Simon: Because I think it's interesting: a lot of those questions wouldn't have been asked previously, because a lot of those people wouldn't necessarily have gone to someone to ask for that. So it's quite interesting to see how the enablement of this user-request mechanism, through just natural language and an MCP server, is potentially even seeding what data actually gets added in as well.
Simon: So, it's a kind of like a nice, a nice full circle, really.
Jason: Got to get the data flywheel moving.
Simon: There we go. We've named it. Wonderful. Jason, this has been a blast, and geez, we're almost at 50 minutes already. I hope everyone online has enjoyed listening as well.
Simon: Thank you very much, Jason. Where should people go to check out dbt and to learn more?
Jason: Yeah, so if you head on over to getdbt.com, you'll see all about dbt. We recently launched the dbt Fusion engine, which is a ground-up rewrite of the dbt engine from scratch. It has 30 times faster parsing speed, and very good type awareness and SQL understanding.
Jason: So, all the things that are going to actually allow you to build in the deep DAG awareness in the deep context that's built into the new engine, suggest you try it out. Please find me on LinkedIn, Jason Ganz. We have a community Slack for dbt. So, you can shoot me a message there. You can find me on the website, formerly known as Twitter. So just wherever would love to stay in touch with the folks. This was a blast. Great talking to you. You were a great audience. So yeah, just really fun to be here.
Simon: Amazing. And Jason, it'll always be known as Twitter for me, I don't know about you. Thank you very much. And to all our listeners, we really appreciate you listening. Hope you enjoyed it, and tune into the next episode. All of those links Jason mentioned, we'll add into the show notes as well, so do check them out. Thanks, all. See you on the next episode.