
[00:00:00] Ian Thomas: I think one of the things about Meta culture is that it's very engineering-empowered. If you can get engineers to support something from the ground up, you can go a long way. And there's a popular phrase, "code wins arguments." And I think in this case it's a case of "proof wins arguments."
[00:00:13] Wesley Reisz: One of the questions that I often ask when I talk to people about spec-driven development is: at what level are we talking about?
[00:00:19] Wesley Reisz: So what we did was define a process using RIPER-5 for how we're gonna work with the LLM, starting with the spec, and then we paired developers with that process.
[00:00:28] Sepehr Khosravi: There's Stanford research on over a hundred thousand employees. What they found is that AI is helping generate 30 to 40% more code than before.
[00:00:35] Sepehr Khosravi: However, 15 to 25% of that code ends up being junk that gets reworked. So they estimated the actual productivity that we gained is like 15 to 20%. I would think it's higher if you're using AI in the right way.
[00:00:43] David Stein: All large tech companies have this issue: if your software has been in existence for more than a few years, it's using legacy components that no longer resemble the way that you would rebuild those things today.
[00:00:54] David Stein: So we're talking about shifting to an architecture with a semantic layer and a kind of a query engine on top to serve the same kinds of metrics queries, but on an off-production architecture.
[00:01:13] Simon Maple: Before we jump into this episode, I wanted to let you know that this podcast is for developers building with AI at the core. So whether that's exploring the latest tools, the workflows, or the best practices, this podcast's for you. A really quick ask: 90% of people who are listening to this haven't yet subscribed.
[00:01:30] Simon Maple: So if this content has helped you build smarter, hit that subscribe button and maybe a like. Alright, back to the episode. Hello, and welcome to another episode of the AI Native Dev. We are here live from New York at QCon AI. And on my journey here, I'm gonna be talking with a number of different folks throughout the conference based on whether I think it's a really nailed on presentation for our listeners. And today, I absolutely found one.
[00:01:57] Simon Maple: And so this is Sepehr Khosravi, who gave the session "Choosing Your AI Copilot: Maximising Developer Productivity." And that is such a cool topic. And you actually mentioned a huge ton of different productivity hacks and things like that in and around using various agents and LLMs. First of all, talk to us a little bit about who you are. You work for Coinbase; what do you do there? And tell us a little bit about it.
[00:02:21] Sepehr Khosravi: Yeah, I'm happy to. So, I'm Sepehr, like the fruit "pear," makes it easier for people to remember. I'm a machine learning platform engineer at Coinbase. I've actually only been in the industry about two years. I used to sell Teddy bears in a different career.
[00:02:35] Sepehr Khosravi: And I also teach part-time at UC Berkeley, where I teach people these AI tools and how to use them to maximise productivity or launching startups. And I also run an academy for kids to teach them the same thing for free, which is called AI Scouts. It's kind of my passion project.
[00:02:48] Simon Maple: That's super cool. Super cool. And so what age groups are we talking here?
[00:02:51] Sepehr Khosravi: We actually start anywhere from 11 and up to 70. Some grandpas wanna learn as well, and we put them all together, but we teach them the basics of AI, how to understand software, and how to use some of these tools that we're gonna talk about today.
[00:03:03] Simon Maple: Amazing. Amazing. So we talked about, or rather, you talked about a bunch of different tools, a bunch of different agents, and you went through a number of different tips and tricks to get people to be more proficient, more effective with those agents and those tools. Tell us a little bit about your environment.
[00:03:20] Simon Maple: What is the environment that you find you, and I know this is very personal in terms of everyone is different, but for you personally, what's your ideal environment that makes you the most productive?
[00:03:29] Sepehr Khosravi: Yeah, I think my ideal environment is Cursor IDE set up with Claude Code on my terminal. And that's where I feel I'm the most productive. I'll use Cursor for most things, actually, like 80 to 90% of the things I'm doing on Cursor. But for deep tasks, I will start off my task with asking Claude Code.
[00:03:46] Simon Maple: Oh wow. Okay.
[00:03:47] Sepehr Khosravi: Yeah, yeah.
[00:03:48] Simon Maple: And how do you find, like, was that the first time you moved out of the IDE into the Terminal?
[00:03:52] Sepehr Khosravi: Yeah.
[00:03:54] Simon Maple: How did you bridge that? Not being able to see everything in a typical IDE?
[00:03:57] Sepehr Khosravi: Yeah. I think initially why it started is because our company was actually tracking developers using AI and how much they're using it. And actually, in a good way; I know some companies don't wanna use as much AI because it costs a lot.
[00:04:11] Sepehr Khosravi: Our company was pushing a lot of AI adoption. And then Claude Code uses a lot more tokens than Cursor does. So it seemed like I wasn't using a lot of AI, even though I was pretty sure I'm using AI more than most people. So I'm like, okay, let me try out Claude Code because of the token usage.
[00:04:26] Sepehr Khosravi: But it turns out it is really useful for those deeper tasks. I've had so many projects where Cursor couldn't get it done, but Claude Code could get it done.
[00:04:35] Simon Maple: And why don't we go through a number of tips then? So we'll talk about Cursor first, then we'll jump into Claude Code.
[00:04:39] Simon Maple: Okay. So Cursor. You went through 14 different tips. They're on screen. We'll flick through them and we'll pull them up as you find one that you like. Tell us about a few of the tips that you find are really a game changer, maybe for Cursor specifically, or just generally?
[00:04:55] Sepehr Khosravi: Yeah. I would assume most of your audience hopefully is pretty AI forward. But for the people who don't like AI, Tab AI is where I would start. Cursor has this Tab function. For the people who don't like it, just turn on Cursor and see what it suggests.
[00:05:09] Sepehr Khosravi: Because a lot of times it'll write 10 to 20 lines of code for you just from you hitting tab and not having to lift your finger.
[00:05:14] Simon Maple: Yeah.
[00:05:14] Sepehr Khosravi: So that's where I would start for people who are very against AI.
[00:05:17] Simon Maple: And this is the very AI assisted, tab completion levels of AI usage.
[00:05:22] Sepehr Khosravi: Correct.
[00:05:23] Sepehr Khosravi: Cursor has their own model built for this AI tab complete, which is why I recommended it over a lot of the other IDEs.
[00:05:29] Simon Maple: Cool. Okay. So Tab, what's next?
[00:05:29] Sepehr Khosravi: What's next? Let's see. Most people know about Cursor Agent, but the multi-agent mode I think is really underrated.
[00:05:38] Sepehr Khosravi: And how do you keep up with which one's the best and which one's not? You kind of wanna experiment at the same time, but experimenting takes up a lot of time. So what I like to do is every time some new AI model comes out, use this multi-agent mode where you can set it up so you type in one prompt and have two or more AIs answer that same prompt.
[00:05:54] Sepehr Khosravi: So recently with Chat 5.2 coming out, I want to see if I wanted to switch my daily driver from Opus 4.5 to this new one. I would have it almost shadow whatever Opus 4.5 is doing, see how I like the results, and then after a few tries decide if I want to continue using this or not.
[00:06:15] Sepehr Khosravi: So multi-agent is a great one for people to use.
[00:06:17] Simon Maple: That's awesome. And so with this one, how would it, once you parallelise, how do you then choose which one?
[00:06:22] Sepehr Khosravi: So there's a benchmark for these models where you can see which one tends to be the best. I don't think that's the best metric.
[00:06:29] Sepehr Khosravi: You can tell, like Google Gemini for example, recently ranks really high on that, but people most of the time are saying Claude is better.
[00:06:36] Simon Maple: Yeah.
[00:06:36] Sepehr Khosravi: So it comes down to your personal experience. Most of the time I just take a look at it, see if it did a good job of completing it, how I like the code it generated, and how I like the response that it gave.
[00:06:43] Sepehr Khosravi: For example, I really like Claude because the responses it gives a lot of times are more educational than the other AIs and you can really understand what it did better.
[00:06:53] Simon Maple: Yeah.
[00:06:54] Sepehr Khosravi: So, yeah, I just leave it up to personal experience. You look at it, whatever works best for you, that's the one you end up using. Another great feature from Cursor specifically is they built their own model, Cursor Composer.
[00:06:59] Sepehr Khosravi: The other IDEs don't have this model, so it's native to Cursor and it's specifically good at generating code quickly. The biggest problem we have a lot of times with these AI coding tools is you type something and then you're waiting two to three minutes and sometimes the task isn't even that complex.
[00:07:18] Sepehr Khosravi: For that I'm using Composer. For example, this page was generated with Composer in 24 seconds, whereas something relatively similar with Claude took two minutes and 30 seconds.
[00:07:30] Simon Maple: And that's significant actually. Because I remember when we were originally in the very early days of using AI in a more assisted way, particularly when using the tab completion, for example, we required a really, really fast turnaround. When you hit tab, you want an answer within a split second.
[00:07:41] Simon Maple: Now with more agent-based tools, because it goes deeper, you will be more inclined to say, "Actually I'm happy to wait for it 30 seconds or a minute for a better answer."
[00:07:57] Simon Maple: But now when you look at this, the difference between 30 seconds and two and a half minutes is significant and actually interrupts your flow and thought process. So that actually matters quite a bit.
[00:08:10] Sepehr Khosravi: Yeah, a hundred percent. It's so big of a problem that, it's a funny example, but YC actually invested in this company which is a "brain rot" IDE. It's an IDE that pops up TikTok videos and games for people to do while they're waiting. So if YC is putting money into it,
[00:08:24] Simon Maple: It's like there's a problem there.
[00:08:26] Simon Maple: Yeah. It's like that classic meme: "My code's compiling." Why are you playing games? "Oh, my agent's doing something." Exactly. Okay. So, what else from the Cursor point of view?
[00:08:39] Sepehr Khosravi: Yeah. Next big tip is just setting up your Cursor rules. A lot of people are just using the agent but not putting down the groundwork you need to do to really excel with these tools. One of those is gonna be rules; the other one is gonna be MCPs. Mm-hmm. So for rules, there are actually four different types of rules where you can choose rules that always apply on every prompt, or you can have it be specific where Cursor is figuring out if it should apply it based on the context or based on specific files.
[00:09:03] Sepehr Khosravi: Or you can just have it set up so Cursor will never by its own look at the rule, but only apply it when you manually say, "Hey, I want you to set up this rule."
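(For readers who want to try this: a Cursor project rule is just a small file under .cursor/rules/. The sketch below is illustrative rather than taken from the episode; the file name, globs, and rule text are made up, and the frontmatter fields are what control which of the four rule types Sepehr mentions applies.)

```markdown
---
description: Conventions for our API handlers, applied when the agent decides they are relevant
globs: src/api/**/*.ts
alwaysApply: false
---

- Use the shared error helper in src/api/errors.ts; never throw raw strings.
- Every new endpoint needs an integration test under tests/api/.
- Prefer small, composable handlers over adding branches to existing ones.
```

Roughly speaking, `alwaysApply: true` gives you the "applies on every prompt" type, `globs` auto-attaches the rule when matching files are in context, a description alone lets the agent decide, and a rule with none of those is the manual, opt-in kind Simon mentions next.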
[00:09:12] Simon Maple: Opt-in information, opt-in documentation. Actually a little bit similar to how Tessl does that with a Tessl entity and things like that.
[00:09:17] Simon Maple: Cursor rules, really, really important. Why is context so important?
[00:09:22] Sepehr Khosravi: Yeah. I think you gotta treat your AI like a basic junior engineer. If you don't give it the full requirements of the task, it's not gonna be able to figure it out.
[00:09:32] Sepehr Khosravi: So we need to make sure all the details that the AI needs to know we're providing it. One really good way to do that as well is with MCPs and setting up some sort of documentation MCP, because there are oftentimes a lot of gaps in our code where the AI will read through your code but still not understand what's going on.
[00:09:49] Sepehr Khosravi: But when we give it access to our documentation, it can read that and fill in those gaps, which really boosts the productivity of what you can produce.
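(As a concrete illustration of the documentation MCP idea: MCP servers for Cursor are typically registered in a .cursor/mcp.json file. The server below is hypothetical, a stand-in for whatever internal docs server a team actually exposes.)

```json
{
  "mcpServers": {
    "internal-docs": {
      "command": "node",
      "args": ["./tools/docs-mcp-server.js"],
      "env": { "DOCS_ROOT": "./docs" }
    }
  }
}
```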
[00:09:56] Simon Maple: Yeah. Amazing. And actually in the session I asked one of the questions which was about how to know when to give enough context without giving so much context that it actually degrades the performance.
[00:10:06] Simon Maple: And I guess that's why many of the things, like when we talk about Cursor rules and the "always apply" versus "apply manually", it's for exactly that reason. I suspect when we look at if you had a huge amount of context that you wanted to provide, you were probably more likely to say, "Actually there's too much context here, let's add this either manually or add it more intelligently."
[00:10:21] Simon Maple: So that way you're not actually bloating context for no reason. So super interesting and actually a really, really crucial part. Okay. One more Cursor rule and we'll jump into Claude.
[00:10:34] Sepehr Khosravi: Yeah. Yeah, let's do it.
[00:10:35] Sepehr Khosravi: You can have a bunch of different rules. I think one that's particularly interesting, Claude shared this themselves, relates to the context we just talked about. Oftentimes if you get close to the end of your context window, so you've used up like 90%, and you then ask the AI for something, it will give you a short answer because it's just trying to get something out before it runs out of context.
[00:10:53] Sepehr Khosravi: But if you type in a prompt like this or something similar, you can tell it, "Hey, your context is gonna end, but don't worry about that. You can compact it. Actually give me the best answer." And that's one useful tip.
[00:11:04] Simon Maple: Yeah. One of the things I love, and actually maybe this segues us into Claude, is the slash command, /compact. It's the fact that you can add some text after that, which is super cool.
[00:11:16] Simon Maple: Sometimes it'll auto compact or you just put /compact, which will make it compact. I love the /compact where you put text after it and that text is, "Hey, this is what I'm gonna do next." And what that'll do is it'll also base it on that; it'll release a certain amount of context that it doesn't need because it thinks, "Oh, actually I don't need that for my next task."
[00:11:35] Simon Maple: Super, super cool way of actually saying, "Actually, I want you to release just the right context and leave me with what I need to do my next task." So that's a Claude tip.
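(A quick illustration of that tip: /compact in Claude Code accepts free-text instructions after the command, which steer what gets kept. The task details below are invented for the example.)

```text
/compact Keep the decisions and file list for the payments-webhook refactor; drop the earlier exploration of the auth module. Next I'll implement retry handling in the webhook worker.
```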
[00:11:44] Simon Maple: And that lends us into Claude Code. Now, Claude Code, I would say, is probably one of the most functional agents in terms of the extra value add that it provides you with. Let's jump into the Claude Code section. What would you say is your number one tip for Claude Code?
[00:12:02] Sepehr Khosravi: Yeah. I guess Claude Code is very similar to the way I use it on Cursor, except I think the use cases are different. Cursor is good because it gives you the visual IDE; you can see what you're doing.
[00:12:13] Sepehr Khosravi: It's faster than Claude Code is, uses less tokens, and you can switch between different AI models. But if you're trying to go deep on a task that's difficult, Claude Code does a lot more thinking, takes up a lot more tokens, but usually gives you a better engineered answer. At work, I had a project like this that I put into Cursor and into Claude, and Claude Code saved me a ton of time.
[00:12:33] Sepehr Khosravi: It searched the web, found open source repos, analyzed them, and gave me a proper solution, whereas Cursor just picked something, implemented it, and it was done. But Claude Code really saved me multiple hours.
[00:12:43] Simon Maple: And I think based on your language that you use in the prompt, it will do the correct amount of thinking based on that.
[00:12:48] Simon Maple: So if you type "think deeply," it actually increases the amount of thinking. And, another top tip, if you type in "ultrathink" (I don't know if you have typed that; let's go to the terminal quick)...
[00:13:07] Simon Maple: If I was to run Claude and type in "ultrathink," it does this crazy thing. I didn't even know about this. This is one of the deepest levels of thinking it will do. And so if you type something after that, it'll then go in and do really deep thought on that. So super cool little tip there as well.
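(For anyone trying this at home, the keyword just goes straight into the prompt; the bug described below is made up for illustration.)

```text
ultrathink: our reconciliation job double-counts refunds issued in a foreign currency. Trace the flow end to end and propose a fix before writing any code.
```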
[00:13:22] Simon Maple: Thinking, I totally agree, and I think that's why sometimes it takes that longer time because it does do that deeper thinking.
[00:13:30] Sepehr Khosravi: A hundred percent. Another one is subagents. This is where Claude Code kind of beats Cursor as well, where you can set up different agents and give them all their specific MCP tools that they can use.
[00:13:40] Sepehr Khosravi: And these subagents also have their own context windows. So for more complicated tasks, this might be a lot better. For example, you might set up a subagent that's a PagerDuty investigation subagent. Every time a page comes in and you call Claude Code, it will use this one.
[00:13:55] Sepehr Khosravi: It'll specifically look at Slack, find the alert, and then go into Datadog, research it, and come back with a solution for you. But for specific workflows that need specific sets of tools, these subagents can be super helpful.
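(Claude Code subagents are defined as markdown files with YAML frontmatter under .claude/agents/. The sketch below shows roughly what a PagerDuty-style investigator could look like; the MCP server and tool names are placeholders for whatever Slack and Datadog servers a team has configured.)

```markdown
---
name: pagerduty-investigator
description: Investigates incoming pages. Use proactively whenever an alert or incident is mentioned.
tools: mcp__slack__search_messages, mcp__datadog__query_metrics, Read, Grep
---

You investigate production pages.

1. Find the triggering alert in Slack and pull the surrounding thread.
2. Query the related Datadog monitors and dashboards for the affected service.
3. Summarise the likely cause and a proposed next step.

Report back only the summary and links to the evidence; do not paste raw logs into the main conversation.
```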
[00:14:05] Simon Maple: And all that extra work it did, adding stuff into the context unnecessarily just to get a small amount of information: it passes that small amount of information back to the main agent and then all that other context just gets lost. Right? It's so useful. We use that at Tessl a lot to do our research on our context, and then once we learn this is the piece of context we need to pass back to the main agent, it keeps it nice.
[00:14:28] Sepehr Khosravi: Yeah, a hundred percent. Context management is kind of everything.
[00:14:32] Simon Maple: Absolutely. Now, subagents, that's not a thing which is duplicated across many different agents. It's quite particular to Claude in terms of, I would say, the mainstream agents. Do you feel like agents are gonna learn from each other and start building these additional things?
[00:14:46] Simon Maple: Do you feel like we'll hit a feature parity across these agents or how do you see that?
[00:14:51] Sepehr Khosravi: I think it really depends on how companies handle this context management problem. But right now, even with Claude Code, there's a main agent and subagents, and none of the subagents actually talk to each other. The subagents all talk to the main agent.
[00:15:03] Simon Maple: Yeah.
[00:15:04] Sepehr Khosravi: So even Claude Code itself isn't that advanced compared to what we could be doing in the future. And I'm sure these companies will start to dabble in that.
[00:15:09] Simon Maple: Really interesting. There's also another tool, I think it's called Claude Flow, whereby they do kick off a whole ton of agents.
[00:15:16] Simon Maple: I think it's mostly command line. And they have this thing called Hive mind, which is super cool, where the agents have almost like this shared memory where they can start writing to that memory and learning from each other. So yeah, audience, if you're interested in that, I think it's Claude Flow, which is super cool.
[00:15:29] Simon Maple: Nice subagent. Super cool. Really love that about Claude.
[00:15:32] Sepehr Khosravi: Yeah. Another tip, while we're on other tools: Claudish is another one. Say you really love the Claude Code terminal, but you want to use different AIs with it. There are different open source libraries that do this, but Claudish is one of them.
[00:15:44] Sepehr Khosravi: It looks exactly like Claude Code, but you can call Gemini, GPT, or whatever AI model you want from it.
[00:15:49] Simon Maple: And people would do that because they love the Claude Code DX, but they presumably have a different model that the company has agreed on or something like that.
[00:15:59] Simon Maple: Do you think you'll get much of a benefit in terms of the underlying LLM being certainly better in certain cases or not? Or would you say if you don't care, Claude Code is probably better to use with Claude?
[00:16:11] Sepehr Khosravi: I think if you don't care, Claude Code is probably the best to use at this point. Maybe that changes. Yeah, I wouldn't really recommend Claudish; it's just exactly for that specific use case where your hands are tied.
[00:16:23] Sepehr Khosravi: Another thing: when you're using these AI tools, it's important to evaluate if these are actually working or if this is slowing my team down.
[00:16:29] Sepehr Khosravi: A lot of times, initially, teams will adopt AI and see a little bit of slowdown before they ramp up, and that's the learning curve that it takes to start using these tools. But overall, you wanna be tracking these metrics. There is no perfect metric is what I've found.
[00:16:42] Sepehr Khosravi: But what you wanna do is just track a bunch of different things like PRs merged, how long from a PR being created to merged, et cetera. And then, what I find most times is that within the company you build a qualitative story that you want to tell your management, and when you have these metrics, you can pick and choose which ones to use at the right time to tell that story and back it up with evidence.
[00:17:03] Sepehr Khosravi: So that's where I think it's really important to gather as many different metrics as you can and use them when you think it's right.
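(A minimal sketch of the kind of tracking Sepehr describes, using the GitHub CLI to pull merged PRs and compute open-to-merge time. It assumes `gh` is installed and authenticated; the repository name is a placeholder, and a real dashboard would track many more signals.)

```python
import json
import statistics
import subprocess
from datetime import datetime

# Pull recently merged PRs via the GitHub CLI (assumes `gh` is installed and authenticated).
raw = subprocess.run(
    ["gh", "pr", "list", "--repo", "your-org/your-repo",   # placeholder repository
     "--state", "merged", "--limit", "200",
     "--json", "number,createdAt,mergedAt"],
    check=True, capture_output=True, text=True,
).stdout
prs = json.loads(raw)

def hours_between(created: str, merged: str) -> float:
    """Hours from PR creation to merge, parsing GitHub's ISO-8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (datetime.strptime(merged, fmt) - datetime.strptime(created, fmt)).total_seconds() / 3600

cycle_times = sorted(
    hours_between(p["createdAt"], p["mergedAt"]) for p in prs if p.get("mergedAt")
)
if not cycle_times:
    raise SystemExit("no merged PRs found")

print(f"merged PRs: {len(cycle_times)}")
print(f"median hours from open to merge: {statistics.median(cycle_times):.1f}")
print(f"p90 hours from open to merge: {cycle_times[int(len(cycle_times) * 0.9)]:.1f}")
```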
[00:17:09] Simon Maple: I couldn't agree more. I think in the past, AI adoption has been more like people just wanting to adopt AI tooling, and they have almost that push from up top:
[00:17:21] Simon Maple: "You have to be using AI. Let's all do it." And I think there are certain people who are now looking at this from a productivity point of view: how much value is this providing me? I was talking with Tracy Bannon yesterday, actually. Tracy's giving the closing keynote at QCon and hopefully we'll chat to her as well.
[00:17:34] Simon Maple: But Tracy was talking about value versus things like velocity. Because velocity's just a metric, right? And it can be gamed in a whole bunch of different ways, but value is super important. It's like, what is this agent actually giving to us from a business value point of view?
[00:17:55] Sepehr Khosravi: Yeah, I think the impact is a hundred percent there. Kind of alluding to what you're talking about, the Stanford researchers had some complicated research methodology that they ran over a hundred thousand employees. But what they found is AI is helping generate 30 to 40% more code than before.
[00:18:09] Sepehr Khosravi: However, 15 to 25% of that code ends up being junk that gets reworked. So they estimated the actual productivity we gained is like 15 to 20%. I would think it's higher if you're using AI in the right way, but it's good to know there's a difference between value and just straight output.
[00:18:23] Simon Maple: Absolutely. And that's the biggest trip hazard: output versus outcome, and measuring the wrong thing. 15 to 20% though is still impressive. And we expect that to probably grow over time with LLMs, et cetera. Amazing session today. Thank you so much.
[00:18:38] Simon Maple: A pleasure to be here and meeting you at QCon today. Thank you for having me. Absolutely. Enjoy the rest of the conference and speak to you soon. Hello. And this time I'm with Ian Thomas, who's a software engineer at Meta. So Ian, welcome to our podcast.
[00:18:54] Simon Maple: Pleasure. And as soon as I saw your title on the schedule, I thought I have to have Ian on the podcast because your title today was "AI Native Engineering," and of course this is the AI Native Dev podcast, and I'm like, "That's a perfect match." That's a match made in heaven.
[00:19:11] Simon Maple: So tell us a little bit about the session.
[00:19:13] Ian Thomas: So, this is kind of a case study of what we've been doing in my team and my wider org in the Reality Labs part of Facebook and Meta, where we've had a sort of ground-up adoption program for people who are looking to experiment and develop with AI.
[00:19:29] Ian Thomas: As part of their workflows, building on all the various tools that we've been getting access to over the last few months, and generally seeing how we can accelerate ourselves in terms of productivity and making more product wins and outcomes for our teams.
[00:19:43] Simon Maple: And I guess in terms of, every organization's different, but generally, how would you say adoption is of these types of tools in a large organization like Meta?
[00:19:52] Simon Maple: Do you have people who are diving in, people who are stepping back, and then the majority maybe leaning into it slightly? Or where does that sit?
[00:20:01] Ian Thomas: Well, it's changed a lot over the last six months. I can only really speak to my org and the patterns that I've seen there.
[00:20:07] Ian Thomas: We had a few people that were really keen and experimenting, and they were finding value in these tools outside of work. And they were really keen to see how they could apply it internally. Meta has some interesting engineering things because it's very in-house.
[00:20:22] Ian Thomas: It builds its own tools and platforms and things. So there's some quite bespoke ways of working there that maybe you can't just go and pull some of these things off the shelf and apply them to our code base. But yeah, so we had these few people that were really passionate about it, and then we had a bunch of really senior engineers who were perhaps a little bit more skeptical.
[00:20:39] Ian Thomas: And then gradually over time we've seen adoption grow and there's been more of a push from the company overall saying, "Well, this is gonna be something that we have to take seriously. You've got access to all these tools, let's go for it." So the adoption now is pretty good. I think last time I checked we were over 80% weekly active users.
[00:20:55] Ian Thomas: And the way that we measured that sort of changed subtly. So now it's considering four days out of seven usage of any AI assisted tool.
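(That "four days out of seven" definition is easy to pin down in code. A toy sketch, assuming a per-engineer, per-day export from whatever telemetry the tools emit; the event data here is invented.)

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical telemetry export: (engineer, day on which they used any AI-assisted tool).
events = [
    ("alice", date(2025, 6, 2)), ("alice", date(2025, 6, 3)),
    ("alice", date(2025, 6, 4)), ("alice", date(2025, 6, 6)),
    ("bob", date(2025, 6, 2)), ("bob", date(2025, 6, 5)),
]

def weekly_active(events, week_start: date, min_days: int = 4) -> set[str]:
    """Engineers who used an AI-assisted tool on at least `min_days` distinct days
    within the 7-day window starting at `week_start`."""
    week = {week_start + timedelta(days=i) for i in range(7)}
    days_used = defaultdict(set)
    for engineer, day in events:
        if day in week:
            days_used[engineer].add(day)
    return {eng for eng, days in days_used.items() if len(days) >= min_days}

print(weekly_active(events, date(2025, 6, 2)))  # {'alice'} under the 4-of-7 definition
```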
[00:21:05] Simon Maple: And can you say what tools you're using?
[00:21:07] Ian Thomas: Well we've got a whole spectrum of internal things. We've got Meta Mate and Dev Mate, which are the kind of two main ones.
[00:21:13] Ian Thomas: One's more of a chat interface and the other one's more for coding and agentic workflows. There's a whole bunch of stuff like Gemini, Codex, Claude Code, et cetera. We've got access to third party tools as well, like Cursor.
[00:21:30] Simon Maple: So devs are trying to find what works for them and teams are trying to work out...
[00:21:34] Ian Thomas: Exactly.
[00:21:35] Ian Thomas: And a lot of the tools like Meta Mate and Dev Mate have access to the models used by things like Claude Code as well. So you get to compare things that have got a bit more internal knowledge than external. So it's quite good. The versions that we use generally are kind of tailored for Meta use cases, so they're a bit more restricted than the generally available stuff, but they're still pretty powerful.
[00:21:56] Simon Maple: One part of your session that really resonated with me was when you started talking about community and growing that community. Why is building that community so important when we talk about sharing that knowledge and that growth in adoption?
[00:22:11] Ian Thomas: I think one of the things about Meta culture is that it's very engineering-empowered. And so what that means is, if you can get engineers to support something from the ground up, you can go a long way. And there's a popular phrase, "code wins arguments." And I think in this case, it's a case of "proof wins arguments."
[00:22:29] Ian Thomas: So if you've got people that are gonna be forming around this idea and these concepts, they can bring them up to a level where people are saying, "Hey, I'm getting value from this and this is how I'm using it and this is what's working." It attracts other people who are going, "Okay, well if they're doing it, then maybe this will work in my case too."
[00:22:45] Ian Thomas: And I think that community aspect helps to get a bit of authenticity and people feeling like they're part of a community. It helps to push everything along with more momentum. Whereas sometimes if you get a top down mandate, engineers, we can be quite skeptical people at times.
[00:23:00] Ian Thomas: You know, it can be a bit cynical, and someone's like, "Oh, well, if I have to do it..." But when it's bottom-up instead, it's a bit more like, "Okay, we can make this work and I think it's gonna be a great thing."
[00:23:09] Simon Maple: So what was being shared then? Problems when people were having problems, or people had questions, wins? Maybe people saying, "Hey, this works really well for me," maybe context or rules, things like that? How did the community work?
[00:23:23] Ian Thomas: All of the above. So we have, one of the things that we rely on a lot is Workplace, which is sort of like the professional layer on top of Facebook.
[00:23:32] Ian Thomas: It was an external product until fairly recently. So we use this heavily and that's the basis of where this community is. And we share things in there. You can post, people can comment, it's like a social experience but for work. And so what we were seeing was people were coming in, they were asking questions, they were saying, "Well, I'm trying this thing," or "Have you seen this tool's available?"
[00:23:51] Ian Thomas: Generally keeping everyone informed about what's going on, and over time it's just gradually grown and grown. We never really forced anyone to join. It's not got auto enrollment or anything. And I think last week we hit 400 members. Which is pretty good for something that's kind of a grassroots initiative.
[00:24:07] Simon Maple: Self-sustaining, like some of the best communities. And I guess one of the things that are really nice about communities is that you get so many different people at different levels of their experience and adoption. And this leans into one thing that you mentioned as well, which is these maturity models.
[00:24:22] Simon Maple: I guess as maturity in terms of how an individual is using it as well as a team. Talk us through the maturity models and why they're important.
[00:24:30] Ian Thomas: So the thing about the maturity model, I intended it to be used for a team, but there is a dimension on there which is about individual productivity as well.
[00:24:38] Ian Thomas: So you can reflect on your own performance in ways that you are getting value from it. The benefit of it being a team-based thing is that it opens up the conversation within your team. And so you can generate ideas and have action plans that are specific to you because every team's gonna have slightly different context or different levels of ability and different interests, so that's great.
[00:24:54] Ian Thomas: We tried to model it in a way that was gonna be fairly agnostic of the tooling and be durable because the value, I think, is that you can have these models and you can repeat the assessments time after time and see how you're progressing.
[00:25:10] Ian Thomas: And we do subtly tweak it every now and again, but generally it's kept fairly consistent. And then the teams run these assessment workshops and they could have the discussion and that's where the real value lies.
[00:25:20] Simon Maple: Let's talk about value and wins. What were the big wins that you saw and how did you share that across the community?
[00:25:27] Ian Thomas: So initially the wins were people saying, "Oh, I'm using Dev Mate," which is part of our tool set in VS Code that we work with day to day. "So I'm finding ways to use this to understand the code base better" or what have you. And there are some early examples of people just going for a big problem and putting a prompt in.
[00:25:47] Ian Thomas: And they were getting a bit lucky. And then there was equally some people that were suffering because that wasn't working at all. But then we found there was kind of repeatable patterns emerging around things like test improvements, or how to make code quality improvements, or reducing complexity of code.
[00:26:03] Ian Thomas: And that was when we started to experiment with things like unsupervised agents. And you could say, "Okay, with this category of problem, say like we've got test coverage gaps, we want to go and find all the files that are related to this part of the code base, find the ones that have got the biggest coverage gaps and then, using this runbook that we've put together, go and cover them. Produce diffs that help us to bridge the gap."
[00:26:18] Ian Thomas: And that was the sort of thing that would take lots of hours of manual work. And as this evolved, we found actually we can go and use the tools to go and query the data, find and do the analysis, and then generate the tasks for itself to go in and fix the tests and add the coverage.
[00:26:43] Ian Thomas: And I think the end result was something like 93.5% coverage was achieved. Wow. Which, when we were at way less than 60% to start with, meant a lot of diffs landed.
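(The general pattern Ian describes, finding the files with the biggest coverage gaps and turning them into a task list an agent can work through with a runbook, can be sketched in a few lines. This assumes a coverage.py JSON report; Meta's internal tooling is obviously different, and the threshold and file names are placeholders.)

```python
import json
from pathlib import Path

# Assumes a coverage.py report produced with `coverage json`; file names are placeholders.
report = json.loads(Path("coverage.json").read_text())

THRESHOLD = 60.0  # only target files below this coverage, worst gaps first

gaps = sorted(
    (data["summary"]["percent_covered"], path)
    for path, data in report["files"].items()
    if data["summary"]["percent_covered"] < THRESHOLD
)

tasks = [
    f"- [ ] Raise coverage for `{path}` (currently {covered:.1f}%). "
    "Follow the team test runbook, keep production behaviour unchanged, "
    "and produce one diff per file."
    for covered, path in gaps
]

Path("coverage_tasks.md").write_text("\n".join(tasks) + "\n")
print(f"generated {len(tasks)} tasks for the agent to work through")
```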
[00:26:54] Simon Maple: And it sounds like huge productivity gains.
[00:26:57] Ian Thomas: Yeah. So that one, I think we ended up doing it in about three hours to achieve this.
[00:27:01] Ian Thomas: Which is incredible, and this is again one engineer just saying, "I've got a hunch I think this is gonna work. I'm gonna go and play with it." They found a pattern that works and then it became repeatable. They shared it to the group. And I think I said in the talk, one of the things that came about from that was another engineer was like, "Hey, this kind of works.
[00:27:19] Ian Thomas: I've got all these tests that are running fine, but they're a bit slower than I need them to be to be eligible to run on diffs. I wonder if I can do the same sort of thing. Can I go and put this tool to use and go and find all the tests and see if I can reduce the runtime down?" And I think they achieved something ridiculous, like 1,900 tests were improved.
[00:27:38] Ian Thomas: And so that's a huge amount of work that you have. It's hard to estimate.
[00:27:42] Simon Maple: It's hard to estimate how much time that is. And you're probably right; it just wouldn't get done. So it's a beneficial increase in quality versus just a time saver.
[00:27:50] Ian Thomas: Exactly. And then in terms of the value of that one, I was chatting with that individual engineer as well, and I said, "Look, what would be amazing to know is not just how many tests you've fixed or improved, can you tell me how many issues that's actually caught?"
[00:28:02] Ian Thomas: Because they're now eligible to run on these diffs. And he went away and he found the data: "Yeah, look, in the time since I launched this, at least 200 changes have been stopped because these tests now run and previously they didn't." So yeah. That's the kind of, it is harder to get those kind of value metrics for everything.
[00:28:18] Ian Thomas: But that's the sort of thing that I think makes the difference and it helps to persuade people that this is gonna be a differentiator. And we're just starting.
[00:28:27] Simon Maple: Amazing. Well, thank you so much for the session and joining our discussion here. I very much encourage folks, go to the QCon website, check out the talk in full. But for now, Ian, thank you very much. Appreciate it.
[00:28:37] Simon Maple: Hi there. Now I am with David Stein, who is a principal AI engineer at ServiceTitan. And David gave a really great session yesterday, in fact, which was called "Moving Mountains: Migrating Legacy Code in Weeks instead of Years."
[00:28:57] Simon Maple: Very nice title. Really engages the thought processes of what is possible with migrations in and around AI. David, first of all, tell us a little bit about what you do. Tell us a bit about ServiceTitan.
[00:29:08] David Stein: Right. So ServiceTitan, we are the operating system of the trades. And so what that means is this is for the trades and residential and commercial building services industries, like plumbing, electrical, HVAC, roofing, garage doors, and so on.
[00:29:25] David Stein: Many of these are contractor businesses doing regular service work and construction. ServiceTitan builds a technology platform that helps these businesses run their business. It's an end-to-end platform that supports everything from customer relationship management to taking payments to answering the phones, and we have a whole suite of capabilities in our platform around everything you need to run a business like that.
[00:29:54] David Stein: So that's what ServiceTitan is. I work as a principal AI engineer at ServiceTitan. And you might wonder at first: what are the AI use cases in the trades businesses? There are actually a lot of really interesting AI applications that are really important for companies that want to run a sophisticated operation today.
[00:30:21] David Stein: And just to name a few of them, and I can keep going if that's useful: there's one thing that we are doing around what's called job value prediction. A lot of trades businesses, when they have many dozens of technicians and lots and lots of customers and many jobs that they're gonna serve in a particular day, they have a big task to do at their main office around matching which technicians are gonna serve which customers on a particular day.
[00:30:49] David Stein: And we have a bunch of intelligence in our product that allows for this scheduling and dispatch to be matched efficiently to basically help our customers run their business better, more efficiently.
[00:31:11] David Stein: That's one example, but there's a bunch of other areas too. For example, we have a voice agent product, a voice AI answering service, that can answer the phones for our customers so that their customers can schedule service appointments 24/7. Those are a few things that I work on.
[00:31:35] Simon Maple: Yeah. And so tell us, one of the big things about your session was about how you took some legacy application and you wanted to migrate it. So first of all, why don't we talk a little bit about the legacy application.
[00:31:49] Simon Maple: What were the pieces of that application that you needed to migrate? There was a bit of data; I think there was some code. Anything else?
[00:31:57] David Stein: Yeah, so the talk actually focused on a different area than the things that I just mentioned, and it related to a migration that we've been working on. Like all large tech companies, we have this issue.
[00:32:12] David Stein: If your software has been in existence for more than a few years, it is using legacy components that have been around for some time, that no longer resemble the way that you would rebuild those things today using more state-of-the-art foundations.
[00:32:31] David Stein: And so anyone who's been in software engineering for long enough has had to work on a migration at some point where you have to unpack code that was written a long time ago. Maybe the people who wrote it aren't on the team anymore, where not all of the context is easy to find, but you have to do the work of picking those things up and moving them into a new implementation on an improved platform.
[00:32:57] David Stein: And those kinds of projects, all companies have these projects, ServiceTitan included, they're notorious for taking a long time. There's a lot of toil in understanding the old legacy code and moving it over. So the example I talked about in the talk was around our reporting application.
[00:33:15] David Stein: ServiceTitan has a reporting product that basically allows our customers to see all sorts of business metrics and KPIs relating to their operations and their financials and their business. And we have a bunch of very complex infrastructure that supports that. And something that we do periodically is move the underlying machinery to use better and more state-of-the-art infrastructure.
[00:33:41] David Stein: And so that's what this is about. So the specifics are: we've been working with dbt MetricFlow, which is a metric store technology layered on top of Snowflake; those are some of the details of the stack underneath. But there's this big task of how you take this code that was written a long time ago, touching legacy...
[00:34:06] David Stein: Well, I don't know how many of these dirty details I should go into here, but I'll let you actually ask the question. Do you wanna know the interesting bit?
[00:34:13] Simon Maple: Why don't we talk about the AI piece first of all? So we know there's a reporting part of the application.
[00:34:19] Simon Maple: There was some code that I think you said you changed language on? Yes, for that. There's an amount of data as well, of course. Did the data migrate or was that pretty much staying where it was?
[00:34:29] David Stein: So... yeah.
[00:34:32] David Stein: So we are talking about shifting from an architecture that is based on a legacy stack which is written in C with an ORM programming model that is operating against SQL DBs in production.
[00:34:40] David Stein: And shifting that to a data lake like architecture with a semantic layer and a query engine on top that allows us to serve the same kinds of metrics queries but on an off-production architecture. And there's lots of reasons for that.
[00:35:11] Simon Maple: No, that's fine. And so, you used agents to help you with this, right? What would happen if you just said to an agent, "Here's my environment. I need you to migrate this. Create a plan, do it yourself"? What would be the problems?
[00:35:26] David Stein: So for a product like this, when you have hundreds of metrics and each of them are underpinned with a bunch of code written in C with all the issues that I mentioned before about not all the context necessarily being exactly where you need it, there's a lot of complexity in that.
[00:35:39] David Stein: You can't just open up even the state-of-the-art coding tools like Cursor and say, "Hey, please migrate all of our metrics into this new abstraction on this new framework. And by the way, convert from C into writing SQL with a YAML for Metric Flow on top." It doesn't work to do that is what we found.
[00:36:00] David Stein: In order to get traction there, you have to really break down the mountain of that problem into small pieces. It sounds kind of obvious if you say it in this way: you break it down into standardised, similar tasks that can be verified in a standardised way.
[00:36:21] David Stein: And then you assign a long task list with phases for individual sections of the task like, "Move these first five metrics" is the first phase and then the next.
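(In practice that phased list can be as plain as a checklist the agent works down, one verifiable item at a time. The metric names below are invented placeholders, not ServiceTitan's real metrics.)

```markdown
## Phase 1: first five metrics (each must match legacy output before moving on)
- [ ] metric_completed_jobs_count
- [ ] metric_total_invoiced_amount
- [ ] metric_average_ticket_value
- [ ] metric_technician_utilisation
- [ ] metric_memberships_sold

## Phase 2: next batch
- [ ] ...
```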
[00:36:34] Simon Maple: And who's doing this? Is this humans doing this? Is this humans with agents as an assistant?
[00:36:41] David Stein: Humans are choosing the task list, right? They are enumerating which metrics we're going to migrate in phase one and which metrics we're going to migrate in phase two. Mm-hmm. So humans are making those choices, as well as what the target architecture is that we're gonna be putting these things into.
[00:36:58] David Stein: And humans are also, with some help from AI tools, constructing all of the context that's gonna go to the coding agents to actually enable them to do the migration work for those pieces.
[00:37:10] Simon Maple: And you went into a bunch more detail in the talk. I very much recommend, it's a recorded talk, I believe, so I very much recommend folks go to the QCon site and check that out.
[00:37:15] Simon Maple: But in terms of the agents, now we have those tasks. The agents now need to take those bite-sized tasks, they need to perform that migration, and then essentially you can run the validation to say: did this work, did this not work, what needs to be changed after that?
[00:37:34] Simon Maple: In terms of, I'm really interested in what needed to be done to the agent to best prepare it for that actual migration.
[00:37:41] David Stein: It's kind of hard to say it all without the visuals that we had in the talk, so people can watch the talk if they wanna see more of these details.
[00:37:48] David Stein: But we talked about standardising the tools that need to be used to do each of these tasks. And so that means standardised tools for context acquisition. So the agent, just like an engineer, is gonna need certain context in order to be able to migrate one of these pieces of metrics codes.
[00:38:05] Simon Maple: What type of context are we talking about here?
[00:38:08] David Stein: Context in terms of what the underlying data looks like in the database. So the agent will be looking at the reference code and it will have references to some table. And you need to know what some example data actually looks like that lives in those tables and what the schemas of those tables are.
[00:38:27] David Stein: And in terms of what's available in the destination platform, like confirming, "Okay, we have this table in Snowflake, we know that the data is there." Understanding anything that a human engineer would need to see in order to be able to correctly write the code that's gonna calculate those metrics in the new platform, we want the agent to be equipped to see that too.
[00:38:54] David Stein: So we would have a standardised tool for context acquisition or a set of tools. What that would mean is, not tools in the sense of MCP, we didn't get that complicated with it, we just made sure that the CLI tools an engineer would be able to use to get these things are set up and ready for the agent to use to be able to get access to what it needed in order to do the task.
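(A rough sketch of what that context-acquisition step can look like from the agent's side. run_query here is a stand-in for whichever warehouse CLI or client the team actually exposes, and the table name is a placeholder.)

```python
# Context the agent gathers before rewriting one metric: destination schema,
# example rows, and basic sanity checks against what the legacy system reports.

def run_query(sql: str) -> list[dict]:
    """Placeholder for the CLI/connector the agent is given (e.g. a Snowflake client)."""
    raise NotImplementedError

TABLE = "analytics.jobs_completed"  # placeholder destination table

schema = run_query(f"DESCRIBE TABLE {TABLE}")                       # column names and types
sample_rows = run_query(f"SELECT * FROM {TABLE} LIMIT 20")          # real shapes of data
row_stats = run_query(
    f"SELECT MIN(completed_at), MAX(completed_at), COUNT(*) FROM {TABLE}"
)                                                                   # date coverage and volume
```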
[00:39:20] David Stein: Then the other really important piece is around giving the agent a kind of environment in which to run the code and a kind of simulation engine, I call it a "physics engine" in the talk, which is not really about physics, but it's about a script or a program that the agent can run to try out the code that it has written.
[00:39:40] David Stein: To actually run the metrics code in a context that resembles a legacy application so that it can produce an output that can be directly compared to the output that would be produced by the legacy code.
[00:39:53] Simon Maple: And I think what you said in the session was you're not using production data at this point.
[00:39:56] Simon Maple: It's almost dummy data a little bit to say: if you mess up, I want you to mess up in a safe environment. But it gives a good enough way of you being able to validate. I see what you're doing; you're within an environment where you can test yourself and make sure you're able to migrate and we can understand if that's successful.
[00:40:13] Simon Maple: And so from a validation point of view then, trust is a massive piece of this, right? In terms of when you push big pieces of code and things like that, what levels of validation did you have there? You obviously mentioned you had this environment that you created.
[00:40:31] Simon Maple: What else did you need to do before you said, "Right, we're gonna put this live"?
[00:40:34] David Stein: Right. So the validation that enables the tasks to be able to be done automatically by the agent, we have machinery there that can compare the output produced by the new system to the output produced by the old system.
[00:40:52] David Stein: And so there's a separate set of validations that we also run before actually doing production cutover because it's obviously extremely important that the data be right when we're showing it to our customers. In terms of the details of those, I'm not sure what level of specifics to go into there, but hopefully that...
[00:41:16] Simon Maple: I think what's important is you were effectively replaying things that you saw in your previous production environment, and you were ensuring that the output that you got from your previous production environment was the same and equivalent of what you were getting with the updated code and the updated flows.
[00:41:34] David Stein: Yeah, that's right. So "replay" is a good keyword there. We of course have the logs of what queries are given to the old system. We're able to play those queries against this engine that runs using the new platform as well.
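(A stripped-down sketch of that replay idea: take queries logged against the legacy reporting path, run each against both backends, and flag any that disagree. The two run_on_* callables are placeholders for the real query paths, and a float tolerance stands in for whatever equivalence rules the real machinery applies.)

```python
import math

def rows_match(old_rows, new_rows, tolerance=1e-6):
    """Compare two result sets row by row, allowing tiny floating-point drift."""
    if len(old_rows) != len(new_rows):
        return False
    for old, new in zip(old_rows, new_rows):
        if len(old) != len(new):
            return False
        for a, b in zip(old, new):
            if isinstance(a, float) or isinstance(b, float):
                if not math.isclose(a, b, rel_tol=tolerance, abs_tol=tolerance):
                    return False
            elif a != b:
                return False
    return True

def replay(logged_queries, run_on_legacy, run_on_new):
    """Replay logged queries against both systems and return the ones that disagree."""
    return [q for q in logged_queries
            if not rows_match(run_on_legacy(q), run_on_new(q))]
```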
[00:41:48] Simon Maple: In terms of the LLM itself and the agent itself, how smart does that actually need to be? Did you hit any limitations in terms of needing to change models or anything like that? Or was it pretty basic in terms of what it actually needed to do, given the context and the validation around it?
[00:42:06] David Stein: I had a couple slides about this.
[00:42:09] David Stein: At one level, the agent doesn't need to be that smart; it doesn't really need to be as smart as a human engineer. Part of the point that I try to make in the talk is that it's fine as long as you have that kind of self-healing loop, where you're able to empower the agent to check its work and then try to make corrections if it didn't pass validation and the metric computed by the new system doesn't match the metric produced by the old system.
[00:42:39] David Stein: It's kind of interesting. This project really kicked off for us when Claude 3 Opus was released in March of this year. We had done some experiments before that at seeing how well the different coding LLMs could understand some of our legacy metrics code. And it turned out that they could understand it to an extent.
[00:43:04] David Stein: But the first time that we encountered the ability of an LLM to really be able to understand that stuff well enough to do a pretty good job of restructuring the code for the new platform was with Claude 3 Opus, which is one of the bigger models out there.
[00:43:25] David Stein: So it needed in our case to be pretty smart. But it doesn't need to be perfect, which is an important point. I think that sometimes people get caught up in "Well, bots will hallucinate sometimes and are they really as good as a human engineer?" But part of what I'm trying to say is that's not the point. As long as you can have this self-healing loop.
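(The self-healing loop David keeps coming back to reduces to something like the sketch below. The three callables, the agent, the simulator, and the legacy baseline, are placeholders for the pieces described in the talk, not actual ServiceTitan code.)

```python
def migrate_metric(metric_name, ask_agent, run_in_simulator, legacy_output, max_attempts=5):
    """Ask the agent for migrated metric code and keep iterating until the simulated
    output matches the legacy system, or give up after max_attempts."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = ask_agent(metric_name, feedback)   # agent writes or revises the code
        result = run_in_simulator(candidate)           # run it against the test fixtures
        expected = legacy_output(metric_name)          # same inputs through the legacy path
        if result == expected:
            return candidate                           # validated: ready to go up as a diff
        feedback = (
            f"Attempt {attempt} failed validation: expected {expected!r}, got {result!r}. "
            "Re-read the schema context and try again."
        )
    raise RuntimeError(f"{metric_name}: did not converge after {max_attempts} attempts")
```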
[00:43:44] Simon Maple: So it's interesting that you needed that. Did they get stuck at any point? Were there any common ways in which the agent or the LLM failed, maybe through lack of context or lack of intelligence or anything like that?
[00:44:00] David Stein: Yeah, so we have a slide about this as well. In the talk, we talked about a few pieces that we put up to kind of govern this process where you're going to have the agent go through and do these tests one after the other. And it's not the case that we came up with this idea and we tried it one time and it was able to just do all of the metrics and wire all of them in just one take.
[00:44:25] David Stein: That is not how it worked. Instead, the first 10, 15, 20 of them, we did them many times in order to get to a point where the standard context that we present in the system prompt and the migration_goals.txt that I mentioned in the talk, as well as the way we even break down the tasks under the right granularity and also the behavior of the validator and simulator tool itself...
[00:44:55] David Stein: We had to kind of tune those to get them to work well enough that the agent was able to reliably get through. And so we would encounter situations where the agent thinks that it successfully migrated or that it successfully rewritten the code for a particular metric so that it would work in the new platform.
[00:45:16] David Stein: But for a set of them where it would say, "Okay, finished number one. Great. Okay, try number two. Great." And then we just let it go until it had done several of them. And then we would find upon inspection, this is obviously before shipping any of that code.
[00:45:32] David Stein: We would find that there are some problems here in what the agent did. And so we would have to go back and rewrite or add certain things to the context and improve the behavior of the validator and simulator so that from that point on the bot would be able to do a better job of producing the code.
[00:45:49] David Stein: And another thing you can find is cases where the agent would get stuck because it knows that it's not able to make progress on a particular task. There could be many reasons why that would happen. An example that I like to talk about is if the agent just doesn't have sufficient test data for that particular case, it will struggle to be able to confirm that the code that it tried to rewrite is actually going to work.
[00:46:13] David Stein: Which is a similar problem that a human engineer could have when they're trying to write a new version of some code that fits a new paradigm but where they don't have sufficient test data to be able to confirm that it's gonna behave in the same way. So making sure the agent has that context.
[00:46:25] Simon Maple: Absolutely. And I guess in your session, the title was "Migrating Legacy Code in Weeks instead of Years." Do you have any thoughts in terms of how long was that process? And I guess secondly, how long do you think it would've been had you not used AI as a part of that?
[00:46:43] David Stein: So we talked about, I think the number of metrics I mentioned in the talk is 247 metrics that were moved into the new platform. It really does go very, very quickly once you get that flywheel running using the approach that we're talking about. So this is part of, once you get that assembly line working where you have sufficient context in there and the right tools for context acquisition and the right tools for validation, you can get pretty good agents to run your whole migration really fast.
[00:47:16] David Stein: So in terms of how long it took, I would say that the first 20 or 30, as well as getting those tools and those fixtures right, probably took one to two months, trying to remember my exact estimates there. Once those things were in place, going from that 20 or 30 point all the way to almost the end of that list of metrics was really fast.
[00:47:45] David Stein: So it was just a few weeks. And the reason that's so compelling is anytime you would be needing to have an engineer read the code in one language, look at the underlying data, make sure that the logic can be captured and put into a new abstraction, that would be work you have to assign to an engineer.
[00:48:11] David Stein: That task would stretch out over a long period of time. The way we estimate it, it would take quarters to do that kind of work where you're gonna have engineers migrate all that code into a new platform.
[00:48:23] Simon Maple: And to distract them potentially from future product work or anything else.
[00:48:27] David Stein: Right.
[00:48:27] David Stein: So that's time that engineer could be spending working on something else. But there's also a sense of agility that I want people to take away from this. A lot of companies where I've worked in the past, there are often some big migrations that are deemed high enough priority to actually kick off and do despite the huge costs that these things used to take.
[00:48:51] David Stein: But there's also a bunch of other kinds of tech debt cleanups where you might almost say there is a desire to migrate something to use a better foundation, but where engineering teams struggle to prioritise those things because of how long they take. And so what I'm trying to say here is that if you can break down that problem in the right way, you might be able to do those migrations that you've been wanting to do much faster than you might've been able to do them in the past. So it hopefully helps people think about their strategy a little differently given what we have now.
[00:49:28] Simon Maple: Amazing. David, thank you so much. And by the way, ServiceTitan, they're hiring right now at the time of recording?
[00:49:34] David Stein: That's right. At the time of recording, we're hiring. So folks can find me on LinkedIn. You can reach out if you're interested. There's a lot of great stuff that we're doing here.
[00:49:43] Simon Maple: Amazing. David, thank you so much. Appreciate the amazing session you gave yesterday. Thank you for giving us some time and enjoy the rest of QCon.
[00:49:50] David Stein: Thank you so much. Appreciate it.
[00:49:52] Simon Maple: Hi there. Joining me this time is Wesley Reisz. And Wes is a Technical Principal at ThoughtWorks. Wes, welcome to the podcast.
[00:50:00] Wesley Reisz: Thank you. It's great to be here. I'm excited to be on it.
[00:50:02] Simon Maple: Tell us a little bit about what you do at ThoughtWorks, first of all.
[00:50:05] Wesley Reisz: Great. So I'm a Technical Principal. We call them technical partners as well. And so what I basically do is I lead an account, all the technical aspects of an account.
[00:50:06] Wesley Reisz: So on this particular account that I'm leading, it's a large state organization in the United States. I've got about 10 developers, about 18 people total on the project with design, project, product, those types of folks.
[00:50:25] Wesley Reisz: So we're basically building a knowledge graph for a state agency that's using a deep research agent to be able to populate this knowledge graph. And we're building applications to access that knowledge graph to be able to answer questions for the state agency. So I've been doing about three months for this particular project and yeah, that's what we've been doing.
[00:50:42] Simon Maple: Awesome. And I joined your session yesterday. Really, really interesting session. The title for those obviously not at QCon AI: "AI-First Software Delivery: Balancing Innovation with Proven Practices." And you talked a ton about the various different levels of using specifications as part of your software development.
[00:51:01] Simon Maple: Super interesting. Going from using specifications to assist your development all the way through to a fully spec-centric approach. There's an amazing blog from Beita, who we had on the podcast before as well. Absolutely. And you introduced a really interesting framework called Ripper... or RIPPER?
[00:51:23] Simon Maple: RIPPER. RIPPER. It goes back and forth. RIPPER-5. Why don't you introduce that to the audience and then we can delve deeper into it.
[00:51:32] Wesley Reisz: First, to be fair, I didn't come up with it. We discovered it on a blog post or a Cursor forum. That was in, I think, March of this year or so when we first ran across it.
[00:51:43] Wesley Reisz: But what it stands for, it really stands for Research, Innovate, Plan, Execute, and then Review. So it's that plan-execute model, but it goes a little bit deeper. And the reason why I think it's so important is: you've been interacting with an LLM, you've been in a chat console, you're trying to do something, and it has to have the context of what you're doing. But you aren't always in the same mindset.
[00:52:05] Wesley Reisz: You may be trying to research a particular thing but it jumps into coding, or you may be planning and it jumps into coding. Or you're coding and it's not really doing any planning. So what the RIPPER-5 model does is it provides a set of instructions that you can pass as a command.
[00:52:21] Wesley Reisz: We're using Cursor, so we pass it as a kind of property for our IDE to be able to have this context. So we give it a command. We say we're in "research mode," and what that does is it tells the LLM: "Ask me questions. Analyse the code base of where you're at."
[00:52:45] Wesley Reisz: But the "don't do" is as important as the "do": what it can't do matters as much as what it can. Don't do coding, don't do planning. Right now, I just want you to understand what the code is, what my spec is actually trying to do. So that way I can provide more details to refine the spec. So this RIPPER-5 is kind of like an execution model of how you work with the LLM. And we do that in pairing with the developer.
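As a rough illustration of what such a mode instruction can look like, here is a minimal sketch of a "research mode" command file; the path and wording are hypothetical, not the team's actual RIPPER-5 command.

```markdown
<!-- Hypothetical .cursor/commands/research.md; illustrative wording only -->
# RIPPER-5: Research mode

You are in RESEARCH mode only.

Allowed:
- Read the attached spec and the relevant parts of the codebase.
- Ask me clarifying questions about intent, constraints, and acceptance criteria.
- Summarise what you believe the spec is asking for.

Not allowed:
- Do NOT write or modify code.
- Do NOT produce an implementation plan yet.

Finish by listing open questions so the developer can refine the spec before
moving to the next stage.
```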
[00:53:12] Simon Maple: It's really interesting because when you sometimes give an LLM a task, it will think it knows enough and jump to the next thing. It's almost like: if I have enough information that I can give a response, I'll try and do that.
[00:53:25] Simon Maple: Rather than actually thinking: "Well, I have enough information to give a response, but I don't have enough information to actually give the response that you as a user want." So in terms of the context that you give the LLM at that stage, I presume it's similar to rules or it's essentially like a step-by-step process to say: "Look, this is what you need to do and then this is what you shouldn't do.
[00:53:46] Simon Maple: Or if you find yourself doing this, stop and at least complete that before you say I'm done and then pass over to the next." Is that how it works?
[00:53:55] Wesley Reisz: Yeah. So I kind of jumped straight into the RIPPER-5. What we start with is specification-driven development. So what we define is a well-defined spec. So it has things like acceptance criteria.
[00:54:01] Wesley Reisz: We're moving to actually put behavior-driven development type tests in there so that we're getting the tests upfront, before the code is written. A lot of times when you have it generate tests, it will fit the tests to the code rather than creating the tests first and having the code validate against them.
[00:54:22] Wesley Reisz: So by getting BDD style tests upfront and in the specification, it kind of creates a test-driven development type approach that the spec follows. So the first step is getting a well-defined spec. One of the questions that I often ask when I talk to people about specification-driven development though is: at what level are we talking about?
[00:54:44] Wesley Reisz: Are we talking epics? Are we talking stories? Where are we when we talk about specs? What we're focused on right now, remember if you go back, this is a team that we just stood up with a new account, brought these people together, working with kind of a traditional company, a state organization. So we hadn't been here long; we didn't have a lot of domain knowledge yet about all the processes that we were doing or how we were working.
[00:55:05] Wesley Reisz: So what we took was defining a process using RIPPER-5, how we're gonna work with the LLM starting with the spec, and then we paired developers with that process. So with that well-defined spec, with that definition of done, with the acceptance criteria, with the information that's there, given that spec, we then go into that first "R" stage.
[00:55:28] Wesley Reisz: And again, this is just a command that we share with all of our development teams, a submodule that we pull into our repo. That means "do research." So do research: give it the spec, give it the code base that you're working on, make sure you understand what it is that this is going to do. Ask me questions.
[00:55:46] Wesley Reisz: And then we refine that spec. As it asks questions, we put the answers back into the spec so that we keep refining it. Then we go to the next stage.
[00:55:54] Simon Maple: Yeah, that's a really interesting point actually. Because when you say "ask me questions," I like that because when we talk about research, I guess there are different ways in which the LLM can get answers.
[00:56:02] Simon Maple: It can determine it itself based on its training. There are a ton of tools that are available to LLMs these days and agents, maybe it does a web search or something like that. How autonomous is this research? Is it very interactive with the user and kind of almost prescribed by the user, or can it just go off and come back?
[00:56:20] Wesley Reisz: Well, so that's the beauty of it, right? You do have the tools. So you can use a tool for a web search. In my particular case, we don't have like an MCP client that has broader context. But if you did have context, like some of the products out there that might provide some context back to it, you could certainly use those.
[00:56:38] Wesley Reisz: In the case that we're doing, it's primarily "ask me." You can do a web search, but generally it's "ask me" and then the developer's providing it back. But there's nothing that precludes adding the additional tools into it so that it can ask, right? That you can use it. But in the way that we're doing it right now, we're using the developer upfront in the process.
[00:56:58] Wesley Reisz: As we move along, as we gain more domain knowledge, as we gain more of the context of how this business operates, we can start to put evals in place where we can do things more autonomously. Right now, we're taking a very supervised approach, being that a developer, a pair actually, is engaged with this research and it keeps that developer in the loop as well.
[00:57:18] Simon Maple: So who determines when research is done? Is that a developer decision?
[00:57:23] Wesley Reisz: Yeah, it's a developer decision. So we're firmly keeping that developer in the process. Once that research is done, then the next stage of RIPPER is Innovate. And that's the part where you can do things many different ways in software.
[00:57:35] Wesley Reisz: Like implement this: what are the ways that you would implement this? And it might give you one, two, three, four different options. And you're like, "You know what? I like two. Or actually I like two, but let's make it more event-based and have it pass events." You pass that to it. That then goes back into the spec; that refines it.
[00:57:51] Wesley Reisz: And that's the Innovation.
[00:57:52] Simon Maple: And so the Innovate stage is still updating the spec, but it's updating the spec more in terms of the how the developer would want that to be implemented.
[00:57:59] Wesley Reisz: Yes. So it's refining the spec; it's adding more detail to it. A lot of the tools do some of this stuff as you go through.
[00:58:04] Wesley Reisz: But this, we're taking a very simplified approach. This is just straight Markdown, or interacting with the LLM and the developer and enriching the Markdown file.
[00:58:12] Simon Maple: And is it typically, I mean, of course it's just Markdown, so both an LLM or a user can go in and play with that. Do you find people leaning into the LLM only to make changes?
[00:58:22] Simon Maple: Or do you find people more handcrafting as they go as well?
[00:58:25] Wesley Reisz: So there's a common question that goes back and forth. What I push for is to make sure you go back to the spec at this stage. You're firmly in the spec. Now, I do take a spec-first approach. I believe once the code is developed, that the code is the source of truth. At this stage, though, we're refining the actual spec.
[00:58:45] Simon Maple: Yep. So we've done the Innovation. That refines the spec.
[00:58:50] Wesley Reisz: Now P is Planning. So in Planning, what we do, if you think back before AI, if I had a story, a well-defined story, and we're on a Scrum team together... say we're on this team.
[00:59:01] Wesley Reisz: We got three or four other developers; we don't know who's gonna pick up this story. So we've got the story, we go into planning and we look at this thing. It's like, "How are we gonna develop this?" And you're like, "Oh, there'll be a React component that we need to build. There's maybe a service. It needs these methods; we need these kind of things passed into it.
[00:59:17] Wesley Reisz: Maybe this model's there. We might need to create this repository. We need to do something to be able to migrate for the database to do updates." We literally break down what's gonna be done. We write the tasks. So if you and I aren't here and somebody else picks it up, another pair, they have the same task, they have the same mental model of what's being required to do it.
[00:59:38] Wesley Reisz: We use the same thing with Planning. So what we do in Planning is, once we have this stage, there's a command that we'll run that's "do plan." And we give it the spec. What "do plan" will do is it will first break it down into individual tasks. Do that same tasking stage.
[00:59:56] Wesley Reisz: Then as a pair, we review those tasks and we look at it and go, "Okay, do these things match what we want? Or do we wanna rearrange them or do things differently?" One common example I think I used in the talk yesterday was in the example that I did when I set up the project. When it tasked it out, I was doing a Python project and it didn't set up a virtual environment first.
[01:00:19] Wesley Reisz: And everything I do, I want it set up in a virtual environment so I'm sandboxed off and nothing's manipulating my machine. So the first thing I did was go back to it and say, "Hey, add a virtual environment to this plan," going back to the spec. I'll keep trying to update that. Sometimes I have to go to the plan if I can't do the spec.
[01:00:38] Wesley Reisz: But what I do is I task these things out. So now I have this task: set up the project, do this component that I mentioned, or do this service that was there, this repository, the migrations. I lay all those things out. Once we've done that, then what we can do is go to the next step where we actually create the plan for those tasks.
[01:00:58] Wesley Reisz: In the part where we actually do the task for that, we try to keep these things as atomic as possible so that way they're like a discrete commit if at all possible. So they're not leaking across different tasks. There's a couple reasons for that. One is they do have that atomicity, but also we might split tasks up.
[01:01:17] Wesley Reisz: You might do one; I might do another. We might parallelise some of the work. Some things might be dependent. We'll build that into a little task plan that we do, but we'll do that so that way we can implement each of these things individually. So that's the Planning stage.
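To make the Planning output concrete, here is a hedged sketch of what a tasked-out plan for the Python example above might look like; the structure and task names are illustrative, not the team's actual template.

```markdown
<!-- Illustrative plan.md produced by a "do plan" command -->
# Plan: <spec title>

- [ ] Task 1: Set up the project
  - Create and activate a virtual environment (python -m venv .venv) so nothing touches the host machine
  - Pin dependencies in requirements.txt
- [ ] Task 2: Add the data model and repository (depends on Task 1)
- [ ] Task 3: Implement the service with the agreed methods (depends on Task 2)
- [ ] Task 4: Write the database migration (depends on Task 2)
- [ ] Task 5: Build the React component against the service (can run in parallel with Task 4)

Keep each task atomic enough to land as a single, discrete commit.
```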
[01:01:33] Simon Maple: Awesome. And with Planning, it seems like particularly with things like Python for example, there are a number of common best practices. Like, "Yeah, I want it to be contained in my own environment." How much of this is almost reusable across different projects? Because there's probably an amount, whether it's styling, whether it's just ways of working, whether it's test-driven development, or whether it's a stack that the teams are most familiar with or want to deploy with.
[01:01:52] Simon Maple: Is there an amount that you kind of say, "You know what, for this customer I wanna reuse this for every single project, therefore I'll make this some kind of context that every project can reuse"?
[01:02:13] Wesley Reisz: For us in particular, we're using Cursor, so I'm gonna answer through a Cursor lens, but the answer really isn't specific to the tool that you use. What we really start with, even before you jump into the RIPPER-5, is: what are some of the foundational rules that we're gonna establish?
[01:02:23] Wesley Reisz: So the first rules we start with are around documentation: anything that's created, use Mermaid to be able to diagram things. We'll create some things like that.
[01:02:40] Wesley Reisz: There're some general rules that we have on programming abstractions that we want to have. I know some people will actually put rules in there about DDD for example, if you wanna leverage domain-driven design or maybe a clean architecture that you'll add into the rule set. So you start there.
[01:02:56] Wesley Reisz: As you go through that RIPPER-5 and you go into Planning and you discover those things that are cross-cutting that you want, we create rules. Current project I'm working on, I'm using Spanner. So there're some rules on how Spanner operates. But I'm still somewhat new to Spanner.
[01:03:13] Wesley Reisz: So as I'm going through that and I'm working with Spanner and I'm doing DML or DDL, doing updates, and Spanner has a certain way of dealing with it, I'm like, "Oh, I ran into an issue here." So I add that into my rule file. I share that out with the submodule; other developers pull it in. So when they're running another specification that touches on Spanner, they're picking up that DML rule that I added into it. So we use rules for those particular things.
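For illustration, a cross-cutting rule captured after hitting a Spanner quirk might read something like the sketch below; the wording is invented, not the project's real rule file.

```markdown
<!-- Hypothetical shared rule, e.g. rules/spanner.md -->
# Spanner conventions

- Run DML (INSERT/UPDATE/DELETE) inside read-write transactions; don't assume
  the auto-commit behaviour of other databases.
- Treat DDL (schema changes) as separate, long-running schema update operations,
  never mixed into application transactions.
- When a task touches a Spanner migration, call it out explicitly in the plan so
  the pair can review it against these conventions.
```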
[01:03:37] Simon Maple: Gotcha. And how much of that is checked into the repositories?
[01:03:42] Wesley Reisz: All of it is. What we do is just create a submodule. So we have a Cursor submodule; it's our AI rules. We put rules and commands into that. And then for all of our repos, we've probably got five repos, and this one's not a mono-repo, what we do is we just have them clone the submodule. That pulls down the latest rules, and then we apply those rules as we go throughout them.
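The sharing mechanism itself is plain git; a minimal sketch, with a placeholder repository URL and path, would be:

```sh
# In each project repo: add the shared AI-rules repo as a submodule
git submodule add git@example.com:team/ai-rules.git .cursor/shared-rules

# In each clone: pull the submodule and pick up the latest rules
git submodule update --init --remote .cursor/shared-rules
```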
[01:04:07] Simon Maple: E is Execution.
[01:04:10] Wesley Reisz: So, I should say, as we talked about before, what you cannot do in each of the stages is as important as what you can do. So when you're in Research, Innovate, and Plan, one of the rules is that you cannot execute. You cannot create code. And that's important to make sure that the LLM and the developer are in sync before they get to that stage of execution and creating the code. So now we're in Execution.
[01:04:33] Wesley Reisz: And now you don't plan anymore; you follow the plan. Those are the rules that have to actually be followed.
[01:04:39] Simon Maple: What if you realise during execution, "Actually, we need to go back to planning"? Is it a process that you can bounce around or is it kind of best done where you agree completion before you go, and it actually makes it a bit messy to bounce around?
[01:04:52] Wesley Reisz: It's a fine line. So the answer is: it depends. The general rule is to go back to the specification and replan. That's the general rule. But, you know, sometimes all of that can take a while, and maybe you just wanna make a small change to the plan. That's frowned upon, but that's okay. If you do it, you do it.
[01:05:10] Wesley Reisz: Once you get to the execution stage, it really begins to depend. I'm a firm believer that there are different models you can take when we talk about spec-first development. Are we doing specification-driven development? Is the spec the source of truth, where every change must go through the spec?
[01:05:28] Wesley Reisz: A lot of folks and a lot of tools out there that drive specifications maybe take that approach. I haven't found that to be really successful with me because it forces you to move up too high in the abstraction. Because if you're too low, then you have specifications that start to collide.
[01:05:45] Wesley Reisz: So I haven't found that perfect balance for me. I tend to use spec-first, where the spec is used to define what we wanna do in plain English. We review that, we generate that, and then we review what's there. If there're small changes that have to happen, one of the techniques that I picked up from a colleague of mine when I worked for Equal Experts, Marco Vermeulen (the creator and core contributor of SDKMAN!)
[01:06:13] Wesley Reisz: what he would do is when the code was generated, as you review the code with a pair, he would drop a TODO in the code that says, "Hey, I don't wanna do nested for-loops here. I want to use streams to be able to do this." Drop a TODO in there, put that definition in there, and then you would have the LLM roll through there, pick up all the TODOs, and update the specification from that. Interesting technique.
[01:06:37] Wesley Reisz: I've tried that a few times; it is a useful way of doing it. Other times I will just make the change. Because in my approach with spec-first and the approach that we try to follow, the code is our source of truth, not the spec.
[01:06:54] Wesley Reisz: The spec is used to help us get there, help us align on it, help us to do some of the thinking, and then just automate some of the code generation. But the code is actually the source of truth.
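A minimal sketch of the TODO technique described above, adapted here to Python (the original anecdote was about Java streams); the function and comments are hypothetical.

```python
# During pair review of generated code, drop a TODO describing the change you want,
# then ask the LLM to sweep the TODOs, apply them, and fold the intent back into the spec.

def active_order_ids(customers):
    # TODO(review): replace the nested for-loops with a single comprehension,
    # and update the spec to say order filtering happens in one pass.
    ids = []
    for customer in customers:
        for order in customer["orders"]:
            if order["status"] == "active":
                ids.append(order["id"])
    return ids
```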
[01:07:02] Simon Maple: So as that code changes, then do you go back to update the spec, or the spec is kind of an old document at that stage?
[01:07:10] Wesley Reisz: It depends. You mean as we go forward? Right now, we get rid of the spec. We drop the spec. I haven't completely excluded it from the code base, but I exclude it from what we ingest. Because you don't want a mixed source of truth between the spec and then the code. The code is the source of truth.
[01:07:29] Wesley Reisz: So we exclude the spec in what we analyse. But I think it's gonna be completely removed very soon.
[01:07:36] Simon Maple: So the execution then just leaves you with pure code. Does it also generate any tests from there at all, or is that kind of more in the last stage?
[01:07:44] Wesley Reisz: So this is what we were talking about before with specification-driven development or with BDD. In many cases, one of our tasks would be to generate a spec.
[01:07:53] Wesley Reisz: And what you've seen, I'm sure, is when you generate code and then you have a task afterward that says, "Create 95% code coverage across this," what the LLM will do is take the code and create tests from that. Which makes for incredibly brittle tests.
[01:08:14] Wesley Reisz: So what I then wound up doing as we go through this is I'm spending so much time trying to fix these very, very brittle tests that I wind up just getting rid of the tests and regenerating them. That feels disingenuous. That doesn't feel right. So what we're doing right now is we're pushing tests into the specification.
[01:08:31] Wesley Reisz: During the planning process, part of the spec defines BDD. So when you're generating code, you're generating the tests that go with it. So it's together; it's not done as an after-effect.
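As an illustration of pushing BDD-style tests into the spec, a fragment of such a specification might look like this; the feature and wording are invented for the example.

```markdown
<!-- Illustrative spec.md excerpt -->
## Acceptance criteria (BDD)

Scenario: Answer a question from the knowledge graph
  Given the knowledge graph contains the agency's published guidance
  When a user asks a question covered by that guidance
  Then the answer cites the source document
  And the response is returned within 5 seconds

Because these scenarios exist before any code, the Execute stage generates the
implementation and its tests from the same definition of done.
```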
[01:08:44] Simon Maple: Which I think at Tessl as well, in our very early learnings, we originally had specs for capabilities and features and tests separate.
[01:08:52] Simon Maple: And we realised pretty quick: capability is just a different way of describing a test, and a test actually describes the capability. So we merged it all together.
[01:09:07] Simon Maple: And yeah, I've seen similar examples of tests being written for code and those being for the absolute implementation of the code as-is. And it makes it hard. It puts LLMs into that loop when it wants to make changes and things start breaking and it chases its tail a little bit in terms of what it should change, the test or the code and so forth.
[01:09:26] Wesley Reisz: What's interesting is the more you have these conversations, like you and I haven't specifically talked about that, but we're finding we are all converging on similar behaviors with specs and what goes into the specs and how they're created. I find that really interesting. People come from similar backgrounds and create a similar approach to things.
[01:09:42] Simon Maple: Yeah, I think it's good.
[01:09:45] Wesley Reisz: Final R is Review. So this is actually something that I think we miss too much, and that's the QA, right? This is QC. We've all established this pattern of plan and execute. You're seeing them brought into tools like Cursor; now all that's there. RIPPER expands that a little bit.
[01:10:01] Wesley Reisz: And what it does is now verify: did the thing that was created match the specification? So in that Review stage, what it does is it looks at the plan, looks at the code that was created, and shows you drift. Did it do what you asked it to do?
[01:10:20] Wesley Reisz: And it pushes this back to the developer, and the developer can say, "Yes, actually I want that; that drift is okay. It does what I want, because I'm reviewing and accepting it." Otherwise, we have to remediate it. But what it does is it verifies that the plan that was given is what was actually created and then puts it in the hands of the developer to decide what to do.
[01:10:42] Simon Maple: And is that like "LLM as a judge" style? It identifies the drift based on what it's seeing versus actually running some tests to validate?
[01:10:51] Wesley Reisz: I haven't called it "LLM as a judge," but yes, it is a stage that says: given this plan and the code that was executed, does it match?
[01:11:02] Wesley Reisz: And it has things like "create a checklist." It'll give us our little green check boxes that say it was all there, and red identifies the things that were mismatched. An interesting feature of this, though, is sometimes you want that drift; it actually is something that's beneficial. I say something in my talk that the reason why LLMs are so powerful is because they're non-deterministic.
[01:11:23] Wesley Reisz: If it were a pure function, well, we've had those for a while: if we give it input, it gives us an exact output; that's called a template. We don't necessarily need non-deterministic behavior from an LLM to achieve that. So sometimes it will give you something that you're like, "Oh, I like that."
[01:11:39] Wesley Reisz: So what we'll do is the user reviews it, identifies that drift, and then tells it to go back and update the specification, so that we fold the change back in and remediate it that way.
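A sketch of what the Review stage's output could look like; the format is invented for illustration, the point being that it compares plan, spec, and code and reports drift back to the developer.

```markdown
<!-- Illustrative "do review" output -->
# Review: plan vs. implementation

- [x] Task 1: project set up in a virtual environment
- [x] Task 2: model and repository created as planned
- [ ] Task 4: migration adds an index that is not in the plan  <-- drift

Drift found: 1 item. Developer decision required:
- Accept it and fold the change back into the specification, or
- Reject it and remediate the code to match the plan.
```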
[01:11:50] Simon Maple: For people who want to read more about this, where's the best place? The ThoughtWorks blog?
[01:11:57] Wesley Reisz: To be honest, we haven't blogged about it. This talk will be published in a few weeks; it'll be available. There'll be some things I'll put out on LinkedIn. I'll get some posts out there, but I don't think there's anything I can point to other than the Cursor forum where I actually found it originally.
[01:12:14] Simon Maple: Where's the best place to follow you?
[01:12:16] Wesley Reisz: BlueSky or LinkedIn primarily. Don't really use X or Twitter as much these days, but I'm out there.
[01:12:24] Simon Maple: Amazing. Wes, it's been an absolute pleasure.
[01:12:26] Wesley Reisz: Thank you, Simon. I really appreciate the opportunity to chat with you.
[01:12:29] Simon Maple: Thank you.
In this episode recorded live at QCon AI in New York, host Simon Maple and Coinbase machine learning platform engineer Sepehr Khosravi explore the dynamics of maximising developer productivity with AI. Delving into Sepehr's insights on choosing the right AI copilot, they discuss the cultural shifts, process improvements, and architectural changes necessary for effective AI-native development. Key takeaways include adopting a proof-first culture, clarifying AI task levels, and making context a priority to convert speed into meaningful outcomes.
Live from QCon AI in New York, host Simon Maple sits down with Coinbase machine learning platform engineer Sepehr Khosravi—plus contributions from David Stein, Ian Thomas, and Wesley Reisz—to unpack what actually moves the needle on developer productivity with AI. The episode centers on Sepehr’s talk “Choosing Your AI Copilot: Maximising Developer Productivity,” but widens to culture, process, and the architecture shifts needed to sustain AI-native workflows at scale.
Ian Thomas opens with a cultural truth that underpins successful AI adoption: proof wins arguments. In engineering-led organizations, debates about tools and approaches are best settled by working software and measurable outcomes. That mindset showed up throughout the episode—instrument what you try, show the delta, then scale what works. It’s a useful antidote to hype and a way to move beyond opinion toward reproducible value.
Wesley Reisz adds a crucial framing question when teams say they want “AI in development”: at what level are we talking? Is the goal code completion, agentic task execution, or changes to upstream process and architecture? In his work, they defined a clear, repeatable process (referenced as “RIPPER-5”) that starts with a written spec and then pairs developers with LLMs through each step. The emphasis is on clarity of intent, bounded tasks, and fast feedback—so the AI’s output is both checkable and usable.
To keep conversations grounded, the team points to data like Stanford’s study of 100,000 employees: AI helped generate 30–40% more code, but 15–25% of that needed rework, netting 15–20% productivity. The implication is not “AI underdelivers” but “process quality determines the yield.” Spec-first work, clear acceptance criteria, and tight review loops convert raw code volume into shipped, maintainable features.
Sepehr’s daily environment is Cursor IDE paired with Claude Code in the terminal. He performs 80–90% of tasks in Cursor, then kicks off deep or ambiguous work in Claude Code, where the larger context handling and agent-style depth often succeed when general autocomplete doesn’t. Interestingly, his team tracks AI usage (in a supportive way), and token consumption differences across tools highlighted real utilization patterns while nudging him to try Claude Code more deeply—an experiment that stuck because it worked.
For developers skeptical of AI or new to it, Sepehr recommends starting with Cursor’s Tab AI. It’s low-friction autocomplete that can output 10–20 lines at a time, shaping muscle memory without changing your entire workflow. From there, activate Cursor Agent for bigger changes, then lean on multi-agent mode when you want to evaluate models or approaches side-by-side without derailing your day.
Multi-agent mode is especially useful when new models appear (e.g., comparing “Chat 5.2” against an existing daily driver like “Opus 4.5”). Benchmarks can be noisy or not match your codebase, so shadowing new models in real tasks is key: issue the same prompt to multiple models, compare the code and explanations, and decide based on clarity, correctness, and follow-through. Sepehr often prefers Claude because it explains the “why” behind changes, improving your understanding and future autonomy.
Speed isn’t just convenience; it’s cognition. Cursor’s Composer model exists for this reason—it generates code quickly. Sepehr cited a page generated by Composer in 24 seconds versus 2 minutes and 30 seconds with another model. That delta is large enough to pull you out of flow, increasing context-switching costs and, ironically, error rates later. The joke that YC backed a “brain rot IDE” with TikTok while you wait for your agent to finish is a tongue-in-cheek signal: latency is now a developer-experience priority.
A practical pattern emerges: use Composer for scaffolding, boilerplate, and shorter single-file edits where speed dominates. When you hit ill-defined problems, cross-file refactors, or tasks with tricky domain invariants, escalate to Claude Code. This bifurcation helps you retain flow—fast where you can be, deep where you must be—rather than forcing every task through the same slow agentic path.
To reinforce flow, timebox agent runs. If an agent doesn’t meaningfully advance the task within 30–45 seconds for small jobs (or a few minutes for complex, multi-file changes), pause, refine the spec, and retry. Latency is feedback: if the model can’t move fast, your prompt may be under-specified, you may be overloading the context window, or you need to decompose the task into smaller steps.
Sepehr emphasises that you should treat the AI like a junior engineer: it can be brilliant, but it needs the right context and constraints. Cursor’s Rules are the foundation here. He outlines four useful modes you can mix per project: always-apply rules for global preferences (style, security posture, diff-only edits), context-aware rules that Cursor applies when relevant, file- or directory-scoped rules for module-specific conventions, and manual-only rules for sensitive operations you explicitly opt into. Done well, rules serve as the “house style” and guardrails an onboarded teammate would receive.
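As a rough sketch of how those four modes map onto Cursor rule files (the frontmatter fields follow Cursor's rule format as commonly documented, but treat the exact syntax as approximate):

```markdown
<!-- .cursor/rules/house-style.mdc: an always-apply rule (illustrative) -->
---
description: Global house style and security posture
alwaysApply: true
---
Prefer small, reviewable diffs. Never weaken auth, logging, or tests to make a change pass.

<!-- .cursor/rules/payments.mdc: a directory-scoped rule (illustrative) -->
---
description: Conventions for the payments module
globs: src/payments/**
alwaysApply: false
---
Money amounts are integer minor units; never use floating point for currency.
```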
Beyond code context, the Model Context Protocol (MCP) lets you wire documentation, APIs, and tools directly into the agent. This solves a big gap: code alone rarely explains domain invariants, data contracts, and “why it’s this way.” A documentation MCP allows the AI to answer questions and fill in missing intent, reducing hallucinations and preventing invasive refactors that violate non-obvious constraints. For many teams, connecting design docs, runbooks, and ADRs is the single highest-leverage improvement after enabling autocomplete.
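Wiring a documentation source in is typically a small JSON config (in Cursor, a file such as .cursor/mcp.json); the server name and command below are placeholders, so check your MCP server's own docs for the real invocation.

```json
{
  "mcpServers": {
    "team-docs": {
      "command": "npx",
      "args": ["-y", "your-docs-mcp-server", "--root", "./docs"]
    }
  }
}
```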
Finally, manage the context window actively. As you near the limit, LLMs may default to terse, low-quality outputs. In Claude, you can use the /compact command or instruct explicitly: “You may compact prior context; produce the best possible answer.” Even better, tell it what you’ll do next so it can jettison irrelevant context. Paired with spec-first prompts (task intent, constraints, acceptance criteria), this keeps responses high quality without bloating tokens or slowing the loop.
The Stanford numbers (30–40% more code, but 15–25% rework) quantify something teams feel: AI is an accelerant, not a substitute for engineering rigor. To convert speed into outcomes, keep the bar high. Start with a spec and acceptance tests; use multi-agent comparison for risky changes; insist on readable diffs with rationale; and execute code under test harnesses or ephemeral environments before committing. You’ll ship faster than before, but just as importantly, you’ll ship with confidence.
David Stein zooms out to architecture, reminding us that productivity gains hit ceilings when they run into legacy systems. Most large companies have stacks that no longer match how they’d build today. His team is shifting to an off-production analytics architecture: a semantic layer with a query engine that serves metrics and insights without hammering production systems. That’s not just a data win; it’s an AI win. Agents can safely query, aggregate, and reason over business metrics when you give them a consistent semantic contract and a performant, isolated execution path.
This architecture pattern—semantic layer + query engine, off production—untangles operational concerns, improves performance, and creates a safe substrate for agentic analytics, test data generation, and observability. Combined with a culture of proof (prototype, measure, iterate) and a process that respects context (rules, MCPs, spec-first prompts), you get the compounding benefits promised by AI-native development rather than a collection of flashy demos.
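What a "consistent semantic contract" for a metric can look like in practice, as a generic hedged sketch that is not tied to ServiceTitan's stack or any specific semantic-layer product:

```yaml
# Illustrative metric definition served by the semantic layer / query engine
metric: active_jobs
description: Count of jobs currently in an active state
source: analytics.jobs          # off-production replica, not the OLTP database
measure:
  type: count
  filter: status = 'active'
dimensions: [region, trade, created_month]
owners: [data-platform]
```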
Adopt a proof-first culture. Small, instrumented pilots beat debates—measure cycle time, review burden, defect rates, and deploy frequency to decide which tools and patterns stick. Clarify the “level” of AI you want on a task, then choose appropriately: Tab AI for low-friction speed, Composer for fast scaffolding, multi-agent for evaluation, and Claude Code for deep, cross-file work. Use multi-agent shadowing to assess new models in your codebase rather than relying solely on general benchmarks.
Make context a first-class citizen. Codify rules (global, context-aware, file-scoped, manual) so the AI behaves like a well-briefed teammate. Connect documentation via MCP to eliminate domain blind spots. Manage the context window deliberately, using /compact and spec-first prompts to keep outputs crisp and high quality. Finally, remember that sustained gains require modern foundations: consider a semantic layer with an off-production query engine to safely power agentic analytics and developer tooling at scale.
