
AI VS. ENGINEERS: WHO DEBUGS FASTER?
Transforming Debugging and Root Cause Detection with Sentry
Transcript
[00:00:00] Guy: Hello everyone. Welcome back to the AI Native Dev. Thanks for tuning back in. Today we'll have a fun conversation that talks about developing with AI from two angles. One is from a slightly contrarian, maybe pioneering, perspective of how to build with AI. The other is from building tools into AI, which is also a pioneering frontier in its own right.
To cover these, we have with us David Cramer, best known as the co-founder of Sentry, a tool I’ve always appreciated for its developer experience sensibilities and usefulness, along with David’s deep developer background prior to that.
David, thanks for coming onto the show.
David: Yeah, thanks for having me. Looking forward to this.
Guy: To get us started, can you give us a little background about yourself and Sentry, mostly in the context of the world in which you’re experimenting with AI? What are you learning around you?
[00:01:06] David: At Sentry we have a pretty wide breadth of concern. Obviously, Sentry’s a dev tool, and our customers are developers. So we think a lot about what’s changing in how they build software. Are there new developers, and do those new developers have different needs? Right now, I’d say not really.
But there’s also the angle of: can we use new technology? Not just AI, but any new tech, to build different software. And then there’s the greenfield question of whether we should build something completely new, not just incrementally add to Sentry’s software.
Finally, the last piece, which has been more topical in recent months, is how AI actually changes your job. How can you leverage it to ship faster or improve velocity? I’ve personally spent a lot of time on that front, exploring the tools.
I’m not an academic. I’m a self-taught software engineer. But I think of myself as a really good applied engineer. I don’t know much about the theory—I roughly know what a Markov chain is—but I’m very good at debugging software and figuring out how things work. So I’ve spent a lot of time in the weeds of: how can you push this and use it to maximum effectiveness? Which, to be fair, is the same way you’d learn how to apply it to your product or a new product.
[00:02:36] Guy: Yeah, that’s the practitioner’s view. You’re learning by doing, bringing in your competency.
In software, there’s often this blend: I’m building dev tools and I’m using dev tools. That’s actually a privilege we often forget is a privilege. If you’re building tools for lawyers, the people building them are rarely lawyers themselves. There’s a split there.
Here, you’re coming in from the practitioner perspective. Maybe you have less familiarity with the underpinnings of AI math, but you see it in practice.
So in that context, we have two topics to cover. One is to understand, through the Sentry lens, what you’re building as you experiment with bringing value to your users using LLMs. The other is how you build software yourself and minimize the manual typing of code.
Maybe let’s start with Sentry, so we can then relate your practices to things you’ve built. Can you share a couple of interesting projects, even if they’re early stage, that seem useful in applying LLMs to Sentry to make it better for users?
[00:04:09] David: Yeah. The first thing we built—time is fuzzy, first COVID, then this AI wave—was embeddings, about a year ago.
Sentry’s core technology, if you simplify it, is: we capture errors, put them in a dashboard, and process them in between. One critical step is deduplication, which historically comes from fingerprinting—lots of regex or something equivalent.
We found early on that embeddings could drastically improve deduplication. For example, in Ruby, anonymous functions have random integers inserted into function names. You’d end up writing lots of rules to parse those out so they match. Embeddings got us much closer with less work. Not enough on their own, but they helped.
So we applied that and saw massive improvements early. That was a great first application.
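(For illustration only: a minimal sketch of what embedding-based grouping can look like, assuming a caller-supplied embedding function and a cosine-similarity threshold. This is not Sentry’s actual implementation.)

```typescript
// Minimal sketch of embedding-based error grouping (not Sentry's actual
// implementation). `embed` is a caller-supplied function that turns an
// error message into a vector, e.g. via any sentence-embedding model.

type Group = { representative: string; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Attach an incoming error to an existing group when its embedding is close
// enough, instead of maintaining hand-written regex fingerprint rules.
async function assignGroup(
  message: string,
  groups: Group[],
  embed: (text: string) => Promise<number[]>,
  threshold = 0.9,
): Promise<Group> {
  const vector = await embed(message);
  for (const group of groups) {
    if (cosineSimilarity(vector, group.vector) >= threshold) return group;
  }
  const fresh: Group = { representative: message, vector };
  groups.push(fresh); // no close match: start a new group
  return fresh;
}
```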
More recently, we asked: how do we apply LLMs? In dev tools, the most exciting thing about LLMs is code generation. If you can generate and understand code, you can do interesting things. That wasn’t approachable before. Fingerprinting we could always solve; it was just tedious. But code generation? No chance before.
So we asked: what if we could automatically fix bugs or generate tests? You can’t reliably do that today, but it’s worth exploring.
Sentry has a lot of data. Historically, that’s our moat—highly structured, highly semantic data about applications. Other vendors often operate at the system level, with junkier data. So we thought: let’s interlink this data (this was pre-LLMs), build a trace-connected dataset, and then feed it into LLMs. Since they’re effectively pattern-matching at scale, maybe they could help us diagnose issues.
We’ve been building this for about a year. It’s called SEER. For a while, I thought it wasn’t ready—because LLMs hallucinate when they don’t have enough info.
I tested it with a JavaScript bug on a side project. SEER gave me a root cause that seemed totally wrong. I showed it to the team as an example of why it’s not ready. The next day, someone at the company hit that same bug. They showed me the actual error—and SEER had been right.
I was genuinely impressed. The value of senior engineers at big companies is their knowledge of the system and ability to theory-craft and debug quickly. SEER did that pretty well, which shocked me.
It’s still early and doesn’t work that well all the time, but when it works—it’s impressive.
[00:23:22] David: And I think that happens because of how common it is in the training dataset. Again, I’m not an academic, so I don’t know. But either way, it goes off the rails and then you sit there fighting with it. This is the problem. This is where you waste time, and you shouldn’t rely on pure agent development because you’ll spend way more time refining prompts than you would just opening the editor.
In the end, it’s kind of obvious that would be true, but you don’t really know how frequently it’s going to break. I’d bet I got a one-shot fix without serious refinement less than 10% of the time. Even the cases where one or two prompts got it correct were under 10%. Usually, it took about 20 prompts to get something even remotely good.
And again, this is real-world software. I need it to be maintainable and testable. If you don’t review it, it’s even worse. There was one case where I thought I had reviewed it—I usually skim it as it’s generated—but I hadn’t.
[00:24:22] Guy: And you were working on side projects, so there isn’t a mandate or a team that reviews it if you don’t.
[00:24:34] David: Kind of, yeah. We don’t have the same controls, so we don’t require the team to review it. But humans make mistakes too. I actually do review the code, but I’m not going line by line.
There was one case where I had to build an integration test. I thought I had reviewed it and made sure it was doing what I wanted. Later, I found it was inserting something random that wasn’t testing what I needed at all. It was fine because there were other things that checked it, but that’s one of the fundamental problems.
What I learned through this was: don’t be a pessimist. Don’t just assume claims are true—live the experiences yourself. Part of this was me trying out Claude Code. It’s a terminal-based UI. I didn’t know how well it would work. To be clear, it’s phenomenal software.
Guy: That’s what I’m saying. Yep.
David: But my biggest learning—and something the internet would probably say I’m wrong about—is that Claude Code is actually the wrong user interface for this stage of technology.
What I mean is: if I use Cursor—ignoring agent differences, because at the end of the day the model has the biggest impact—they’re close enough to the same. If I use Cursor, run through their agent, and generate a change set, I get diffs in my editor in real time. I can approve blocks of code seamlessly and jump between files easily.
If I’m using Claude Code, I’m stuck in an endless terminal slog. Yes, I can manually approve every change and review it right there, but it’s not easy. You’re reviewing code additions, not the actual source code.
So my biggest point of view is that it’s not the right customer experience. It’s not a good product for how developers need to use it in real-world software. Cursor is a much better experience for this.
[00:26:28] Guy: I love the perspective, and I don’t necessarily disagree. I find this is a learning exercise right now. But if you cast your eyes further, there’s a claim that Claude Code needs to get better at explaining changes because of this constraint of working through the terminal.
Is it in a place where it could improve its ability to engage with you without needing to review the changes? Or does it commoditize the IDE—you don’t really care if it’s Cursor or something else, as long as there are lightweight integrations? Do those resonate, or do you think we’re at least a few years away, where you’ll still need to review all the changes?
[00:27:18] David: I think you’ll look at all the changes forever. I don’t think this technology is enabling AGI or anything people are speculating about online. It’s not removing engineers. I spend as much, if not more, time doing engineering work as before. I’m still designing the system. These things aren’t intelligent, they’re just spitting out patterns.
That’s why I find the Cursor flow far superior. I have to review the code, and with Cursor I can incrementally review it as it goes, and I can interrupt when I need to. That’s important.
When I was hiring our CTO, I didn’t want the usual executive pitch exercise—they’re always useless. I wanted to see if they could story-tell and engage, so I asked them to pitch something they were interested in. He had worked at Apple on the autonomous car division and wanted to talk about self-driving cars.
Guy: Yep. Definitely cool though.
David: One point he made, which I strongly agree with, is that everyone wants to go straight to full self-driving automation. Apple did this wrong. You see the same in LLMs—people want to believe agents can run fully autonomously. The right answer is that humans need to be in the loop. Cursor’s user experience is a much better version of this, because I incrementally review things.
The context for me is smaller, I can see the source code, and I understand it. I don’t build software through a log in the terminal—it doesn’t make sense unless you have full autonomy, which I don’t see with today’s tech. Maybe academics think it could get there, but I don’t think they know either.
The key point: you can’t pretend these things can do what they can’t. You need a human in the loop. And if that’s the case, what’s the best experience? For me, it’s not generating a bunch of stuff, then going to GitHub to review it, then refining it over and over. It’s not running in a terminal while also clicking around an editor. Those are high-friction experiences.
[00:29:52] Guy: Super interesting. Again, maybe back to the fact we’re all experimenting here—nobody really knows. I do wonder about defining the guardrails.
At Tessl, for instance, one of our core premises is that you write a spec at the beginning. That’s your primary artifact. We’ve modeled this after the five levels of autonomy in self-driving cars.
Within that, one premise is: if you have good tests, you don’t need to review every change. If your definitions are covered by tests and you trust them, then you don’t need to go back.
I’m curious whether that resonates. How did your relationship with tests play a role while avoiding manual coding? Was it like: “I need you to understand what I’m doing, and the way to convey that is this test must pass”?
[00:31:23] David: Yeah. My rule was: I cannot edit the code myself. That made it complicated. I’m a big test guy. I don’t do TDD—I don’t write the test before the code—but I don’t manually test code if I can avoid it. That’s been my entire career.
But the reality is that UI code is ever-present, there’s more of it than ever, and it’s a real problem to test. If you’ve ever worked in JavaScript, you know you can’t reliably test things end-to-end. It’s very difficult to set up. Test harnesses are hard. Most people don’t test effectively.
I wrote an evals adapter that implements an eval spec but runs it through the test suite, because I need CI to do CI. Technology doesn’t change—we still need the validation layer. But I needed it to generate the tests.
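(A rough sketch of the idea: an eval that runs as an ordinary test so CI gates on it. Here is one way it might look using node:test; the eval runner and the 0.7 threshold are hypothetical placeholders, not the actual adapter described here.)

```typescript
// Sketch: run an LLM eval as an ordinary test so the existing CI pipeline
// gates on it. `runRootCauseEval` and the 0.7 threshold are hypothetical.
import { test } from "node:test";
import assert from "node:assert/strict";

type EvalCase = { input: string; expected: string };

// Hypothetical eval runner: calls the model, scores the answer from 0 to 1.
async function runRootCauseEval(c: EvalCase): Promise<number> {
  // ...call the model, compare its answer against c.expected...
  return 1; // stubbed so the sketch runs
}

const cases: EvalCase[] = [
  {
    input: "TypeError: cart is undefined in checkout.ts",
    expected: "missing null check before reading cart.items",
  },
];

for (const c of cases) {
  test(`eval: ${c.input}`, async () => {
    const score = await runRootCauseEval(c);
    // A low score fails CI exactly like any other failing test.
    assert.ok(score >= 0.7, `eval score ${score} is below threshold`);
  });
}
```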
What impressed me was that if you build a really simple system, the codegen works. I was using oRPC, a JavaScript API binding layer. The server side was so easy to reason about as a human, and therefore as a machine. Because if it’s easy for a human, it’s easy for a machine to pattern-match.
I was able to code-generate an entire migration from one technology to another because the API was simple. I generated tests for it, and they validated everything because the system was simple and clear.
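(To illustrate the point about simple systems, here is a flat, explicit procedure map. It is not oRPC’s actual API, just an illustration: this shape is easy for a human to reason about, which is exactly what makes it easy for a model to pattern-match when generating a migration or tests.)

```typescript
// Illustrative only (not oRPC's real API): a flat procedure map with plain
// inputs and outputs. Each procedure is trivially testable in isolation.
type Project = { id: string; name: string };

const db = new Map<string, Project>();

export const api = {
  async createProject(input: { name: string }): Promise<Project> {
    const project = { id: crypto.randomUUID(), name: input.name };
    db.set(project.id, project);
    return project;
  },

  async getProject(input: { id: string }): Promise<Project | null> {
    return db.get(input.id) ?? null;
  },
};
```

Because every procedure has an obvious input and output shape, generated tests can exercise the whole surface without any test-harness gymnastics.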
I think that’s really important. What I hope happens over the next few years—fingers crossed, because we need it—is that a lot of this complexity goes away. Then we can go back to spending time solving real problems.
[00:33:09] David: It’s just infinite complexity, and I don’t see the value we’ve gained as an industry from it. I don’t see better software, more reliable software, or faster-to-ship software. I see more complicated software. Things that should not be commonplace are front and center. And I see people not writing tests.
[00:33:26] David: I see people not understanding core systems design. I have to believe this is going to reconcile that, because there’s no magic in these systems. The simpler the technology stack, the better it is for humans. The easier it is for us to write code, the easier it’s going to be for them to debug and build.
[00:33:43] David: So I’m hopeful this implicitly creates more value, just like it’s implicitly creating more value in specs and documentation. Despite what people think, docs aren’t that valuable to humans, because we learn and retain context really well.
[00:33:58] David: When you go into a new codebase, you might read some onboarding docs, but you’re not going back to them again. You just look at the code and know what’s going on. Docs are valuable to LLMs, because they don’t retain context.
[00:34:11] David: It’s interesting how the same things that have always been important to software design are even more important now.
[00:34:21] Guy: Right. I love that. I love both of those points. One is the notion of retaining context versus not—humans retain it, agents don’t. Agents are also more disciplined. There are running jokes like, “We wouldn’t write docs for our teammates, but we do for the agents.” The obvious reply is, “Yeah, because the agents actually read them.”
[00:34:47] Guy: Beyond the snarky version, you’re making a good point: humans remember what’s salient, while LLMs don’t. The other is adaptability. I think a lot about adaptable software. On your point about JavaScript and the mess: much of that comes from interweaving implementation details, browsers changing over time, and the quirks of supporting an infinite stream of historical versions.
[00:35:11] Guy: Browsers evolve by seeing quirky patterns and creating cleaner ways to handle them, but then all the legacy websites remain. So it’s nice to think that abstraction into natural language will reduce that, because LLMs can provide adaptability.
[00:35:33] Guy: I’d love to shift into the broader team and ecosystem for our last 10 minutes. You’ve taken the plunge, subjected yourself to eight weeks of non-coding and building with this. You’ve gathered a lot of learnings on how to interact with these tools. If you think about today and tomorrow, which practices would you recommend? And not just recommend—what are you driving the broader Sentry development team to do?
[00:36:26] David: The first problem—this is unrelated to Sentry because we’ve solved it—is that you need your workplace to enable consumption of these tools. A month ago, I removed the IP compliance restrictions at Sentry. We no longer consider source code protected. That doesn’t mean we won’t enforce our license—I’ll come after you if you steal our code—but I don’t care if, say, Cursor accidentally leaves our source code exposed.
[00:36:49] David: One, it’s already public. Two, I’m not worried about model providers training on it. I’m not saying they can, but if they do the “ask forgiveness later” thing, we don’t consider it a business risk anymore. That was a 24-hour decision I made, and it was really important because it instantly allowed us to adopt new tools.
[00:37:27] David: Not every company can do that, but you need a workplace that enables these tools. Otherwise, you’re paying out of pocket, which I also do, for side projects. That’s the first and most important thing. At Sentry, I asked the team what tools they wanted. I didn’t give them everything—most are garbage—but we gave everyone access to Claude Code, Cursor, and ChatGPT.
[00:37:59] David: The key is to use the tools. The biggest challenge is the senior engineer archetype—the cynical one. They try a prompt, it produces garbage, and they say, “See? It doesn’t work.” And yes, it doesn’t work that way. You have to get people past that.
[00:38:38] David: It has to become a self-feeding cycle where everyone teaches each other. The name of the game is giving the model all the information it needs. If you don’t, it can’t answer. You still have to feed it context.
[00:39:22] David: We’ve been experimenting with exposing all our docs in a useful way. There’s an MCP provider called Context7 that takes open-source projects and exposes their docs over MCP. These systems don’t know your APIs, they’re just guessing. You have to feed them the right docs.
[00:40:21] David: We generated markdown files for every Sentry docs page—cleaned up, no layout junk—and exposed them through our MCP server with tools like “search docs” and “get docs.” That lets you fetch the right documentation at the right time with less human input.
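(A rough sketch of what such tools can look like, assuming a directory of pre-generated markdown files, one per docs page, with the MCP server wiring left to whichever SDK you use. The paths and names here are hypothetical.)

```typescript
// Sketch of "search docs" / "get doc" style tools over a directory of
// pre-generated markdown files. Wire these handlers into your MCP server of
// choice; DOCS_DIR is a hypothetical export location.
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";

const DOCS_DIR = "./docs-md";

// Return the filenames of docs pages whose content mentions the query.
export async function searchDocs(query: string): Promise<string[]> {
  const files = await readdir(DOCS_DIR);
  const needle = query.toLowerCase();
  const hits: string[] = [];
  for (const file of files.filter((f) => f.endsWith(".md"))) {
    const text = await readFile(path.join(DOCS_DIR, file), "utf8");
    if (text.toLowerCase().includes(needle)) hits.push(file);
  }
  return hits;
}

// Return the full markdown for one page so the agent can read it directly.
export async function getDoc(file: string): Promise<string> {
  return readFile(path.join(DOCS_DIR, file), "utf8");
}
```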
[00:41:15] David: I think those investments will be very valuable. But results are anecdotal, not scientific—you kind of vibe check them. There’s no better analogy.
[00:41:32] Guy: Not a great term, but a fairly descriptive one. On that note, how much real-world impact do you expect today versus building a muscle for six or twelve months down the road?
[00:42:04] David: Both. Let me give you an anecdote. Early on, when Cursor was mostly typeahead, I thought it was just autocomplete, nothing special. But I tried it with Composer, their agent. I had a normal workflow—building another API like the existing ones. I showed it an example, and it worked. I asked it to generate a test, showed it an example, and it did it—even better than me, because it added extra branches I wouldn’t have bothered with.
[00:43:22] David: It was successful and basically what I would’ve done. It was significantly faster—two to three times more output. That was already super effective months ago.
[00:44:10] David: But there’s another implicit value. People like me, or founders we’ve acquired, who weren’t coding much anymore, are now back in the weeds. They’re building again. That’s unbelievably important. It excites builders who drive this industry. For a long time, things felt stagnant. This reignites passion, which creates real ROI because it leads to better thought and better decisions.
[00:45:00] David: Personally, I’ll happily spend evenings working on random projects because I’m excited about technology again. That excitement translates into business value, even if it’s immeasurable.
[00:46:01] David: The more people in your company who are in the weeds, the better their opinions and decisions. That’s always been my strength at Sentry—living in the weeds, having strong signal. That’s why getting people involved in the technology is the most important thing. You also get real speedups if you use it rationally.
[00:46:45] Guy: We’re almost out of time. For the broader audience: if you’re a developer today, seeing this transformation in your profession, what’s your number one tip—do’s or don’ts for adapting?
[00:47:31] David: Embrace it. It’s fair to be skeptical. Don’t listen to internet noise or big-company CEOs spouting nonsense. But embrace it. This isn’t removing engineering. I do as much, if not more, engineering now.
[00:47:51] David: You can’t ignore it. If you do, you’re dead in the water, both from a career and company point of view. Every company is moving this way. Smart people are running those companies. Embrace it. Find time to use the technology. Ideally for work. If not, maybe find a job where you can.
[00:48:15] Guy: Agreed. You can’t deny it. You need to build the muscle, even if you don’t see immediate returns.
[00:48:27] Guy: David, thanks again for the insights and learnings. I love the soundbites you put out on social media and the conversations you’re driving. It’s a pioneering space, so keep doing it. Looking forward to hearing more.
David: Awesome.
Guy: Thanks, everyone, for tuning in. Hope you join us for the next one.
In this episode
In this episode of AI Native Dev, Sentry co-founder David Cramer joins Guy Podjarny to discuss how AI is transforming developer tools and workflows. They cover Sentry’s use of embeddings, LLM-powered bug detection, and their SEER root-cause analysis system, along with the challenges of hallucinations, context limits, and non-determinism. David also explains why “vibe coding” is a myth, how he shipped production systems without writing code by hand for eight weeks, and why core engineering practices like specs, tests, and code reviews remain essential in an AI-assisted world.
Introduction
This conversation with David Cramer, co-founder of Sentry, explores the evolving role of large language models (LLMs) in software development, both within Sentry’s product ecosystem and in personal development workflows. It covers practical use cases, limitations, and the cultural and technical shifts necessary to integrate AI into engineering teams. Cramer brings a practitioner’s perspective, focusing on applied engineering rather than academic theory, and emphasizes that AI tools must augment rather than replace the fundamentals of software engineering.
Applying LLMs at Sentry
Sentry, a developer-focused tool, captures and processes application errors with highly structured data. This structured approach has allowed them to experiment meaningfully with LLMs. Early wins included using embeddings to improve error deduplication, replacing tedious rule-based systems. More recently, Sentry has been developing SEER, a root cause analysis tool that interlinks rich, application-specific data with LLM reasoning. While the technology remains prone to hallucinations and non-deterministic results, SEER has demonstrated moments of unexpectedly accurate debugging, similar to the skill of a senior engineer who deeply understands a system.
Sentry’s AI applications focus on real-world utility over novelty, exploring areas like automated bug fixes and test generation. These initiatives hinge on the company’s moat of interconnected, high-quality data in contrast to vendors that operate at more superficial levels of system monitoring.
Eight Weeks Without Writing Code by Hand
Cramer conducted an experiment to avoid manually writing code for eight weeks, instead working entirely through AI agents. The goal was to understand the limitations, strengths, and workflow implications of agent-based development. He rejects the idea of "vibe coding," which is blindly accepting whatever an agent outputs, insisting that proper system design, specifications, tests, and code review remain central.
The experience revealed that AI output quality varies less by task size and more by pattern familiarity. Simple, common patterns are handled well, while novel or nested agent scenarios quickly degrade into confusion. Reviewing every change is essential, making the integration experience critical. Cramer found IDE-integrated tools like Cursor superior to terminal-based workflows like Claude Code because they support incremental review and maintain human-in-the-loop control.
Testing, Documentation, and Simplicity
Tests played a central role in Cramer’s agent-driven workflow, particularly when paired with simple, predictable system architectures. He emphasizes that LLMs excel when APIs and code are easy to reason about. This principle suggests a future where complexity is reduced to make both human and machine reasoning more effective.
While humans rarely revisit documentation after onboarding, LLMs rely heavily on it. This shift makes well-structured, machine-readable documentation more valuable, encouraging practices like exposing internal and external docs via machine-consumable APIs.
Cultural and Organizational Shifts
For teams, the first barrier to AI adoption is access, both in terms of licensing and compliance. Sentry removed internal restrictions on source code exposure to accelerate experimentation with AI tools, offering company-wide access to platforms like Cursor and Claude Code. Overcoming skepticism, particularly from experienced engineers, requires education and demonstration that effective AI use depends on supplying relevant context and documentation.
Cramer notes an indirect but significant benefit. AI tools have re-engaged senior technical talent, drawing them back into hands-on building after years of primarily managing. This renewed enthusiasm drives innovation and better decision-making at the leadership level.
Present and Future Value
LLMs already deliver tangible productivity gains in certain workflows such as boilerplate generation, testing, and repetitive API pattern creation. More importantly, using these tools builds the organizational and individual “muscle” to exploit future advances. While full autonomy remains unlikely in the near term, keeping humans in the loop ensures quality and enables incremental improvements.
Cramer’s advice to individual developers is to embrace the technology, remain skeptical of overblown claims, and focus on hands-on experimentation. Ignoring AI’s trajectory risks obsolescence both for engineers and for companies.