
Faster Reviews, Faster Shipping
Instant PR Feedback Without Leaving GitHub
Transcript
[00:00:00] Guy: Hello everyone. Welcome back to the AI Native Dev. Thanks for tuning back in. Today we're going to review everything about review and code review. And to dig into that, we have the CEO and co-founder of Graphite, Merrill Lutsky.
[00:00:20] Guy: Merrill, thanks for coming on and joining us here on the show.
[00:00:23] Merrill: Thanks for having me guys.
[00:00:25] Guy: So, let me say a couple of words about Graphite, and keep me honest here if I misrepresent. You've been around a bunch of years, and your core innovation has been on code review, but human code review before, right?
[00:00:37] Guy: You had this smart insight and great tooling that allowed these sort of stacked PRs where you can create a snippet of your work and start sharing it and get reviews as you build those out. And that's been great. You know, my team here at Tessl has been using it, loving it. And that was kind of a great innovation.
[00:00:54] Guy: Then AI came along, and you've been leaning much more heavily into AI code review now and your Reviewer there, Diamond. And so, let's maybe start from just understanding a little bit more about what it is that you do in that sort of AI code review.
[00:01:12] Merrill: From the beginning, we've been focused on the problem of code review and how we unblock the outer loop. The second half of the story is what happens once a PR is finished and ready to be reviewed.
[00:01:41] Merrill: There's still so much work that needs to happen to get that from that first draft of the PR out to production. We started by taking inspiration from companies like Google and Meta that have built amazing internal tooling and great workflows to help tens of thousands of engineers coordinate across different time zones and be able to ship as quickly as those companies do.
[00:02:05] Merrill: So from the beginning, we were focused on how do we unblock the outer loop, like accelerate code review, taking more of a workflow and tooling approach to that. Stack PRs was the main innovation that we were borrowing there. Basically, the idea is how do we break down big changes and let you review and put changes up for review, and then keep developing on top of them while you do that.
[00:02:31] Merrill: So you've decoupled the review and development processes and let those go on in parallel. The first chapter of Graphite was bringing that to every company on top of GitHub. Then, a year and a half ago, you had Sonnet 3.5 released.
[00:02:48] Merrill: All of a sudden everything changes as far as how software is built, and every team is adopting Cursor and Claude Code and these amazing tools. I think it both made the problem that we were working on 10 times more important, and it gave us an incredible tool to go and solve it with AI. So, the problem's 10 times more important because there are now
[00:03:07] Merrill: 10 times more PRs in some cases, like every engineer has a bunch of Claude Code windows open, working simultaneously. You're able to build features so much faster than ever before. But the result of that is you just have a mountain of PRs piling up, waiting for review. We hear from a lot of our customers that many of their most senior engineers are now just buried in mountains of PRs from more junior folks that are just churning these out with AI tooling now. So we knew we wanted to solve that. Our existing tooling helps teams solve that, but we also knew we needed to do more.
[00:03:43] Merrill: So we created Graphite Agent, formerly called Diamond, which is our AI reviewer. The idea there was, can we take a lot of the work of the first pass that a human reviewer would do and hand that to AI, have it catch many of the bugs, style inconsistencies, follow custom rules, find security vulnerabilities.
[00:04:09] Merrill: Like can it do that first pass of a code review and shortcut that, so that as an author of a PR, you get feedback in a few seconds? And then by the time you loop in a human colleague, there's still a human in the loop checking that, but fundamentally they're able to look at higher-order problems and not have to spend as much time nitpicking and reading every single line of code.
[00:04:32] Merrill: And that's exactly what we've done with Graphite Agent. Many of our top customers have now expanded to use Graphite Agent. I think the big change that we've made there is bringing the chat functionality into the Graphite PR page. So you can now collaborate with Graphite Agent live on the pull request from Graphite. You can have it update things for you.
[00:05:09] Merrill: You can ask it questions, and it can go into your code base and search for relevant context. It's much more about making the PR a collaborative experience rather than just a point of validation. And I think that's super interesting for working with AI.
[00:05:25] Guy: So, as you review and as you kind of build in and you introduce this code reviewer, I'm curious first just to understand a little bit more about the substance of what it does and doesn't review.
[00:06:09] Guy: Yeah. Well, I guess, what are the types of problems you've seen it most successful at dealing with, and maybe the other way around, what are the gaps? Where, at the moment, and I'm sure it gets better over time, but at the moment, what are its capability boundaries?
[00:06:29] Merrill: The things that it's really good at: it's great at finding bugs and any logic errors. Interestingly, we've also seen there's a class of bug that's hard for a human to spot, or where the type checker wouldn't find it.
[00:06:49] Merrill: Where just looking at it with your eyes, you wouldn't necessarily spot it immediately. AI, on the other hand, is very good at finding those types of bugs and calling those out. So we've seen that as an interesting class of issues that it was particularly good at finding.
[00:07:10] Merrill: It's very good at style, like style guide adherence. It's good at finding security vulnerabilities. It can pretty much do most of what a human reviewer can do, at least at that first level of review of looking at the diff itself.
[00:07:29] Merrill: A couple of things it's not great at yet. One is higher-level architecture decisions. I think that's still very much in the realm of, you need a senior engineer who's thinking about the entire system. It's getting better, certainly, but I don't think we yet have the context to be able to do that.
[00:07:49] Merrill: The other piece I'd say is, well, I think there's a really interesting opportunity and something we're actively exploring: can we validate that this change is actually doing what it was supposed to, end to end? Where the agent is spinning this up on a box, clicking through it, recording a video, and adding that to the change.
[00:08:11] Merrill: Something like that I think is an interesting direction for this to go. And then I'd say the other piece that's a little more difficult right now is that there are some languages that just aren't well represented in AI training sets. Languages that are more esoteric end up being a bit more difficult for the reviewer to get right.
[00:08:31] Merrill: For some reason it's bad at Kotlin right now. We don't know why exactly. But that's something we keep seeing whenever we're in mobile repos.
[00:08:43] Guy: So it sounds like the challenges are mostly on either LLM limitations, where the brain is still not good enough,
[00:08:50] Guy: or on, I guess, breadth of context, like when you think about architecture or end-to-end testing. But with the architecture, would you say that today's limits are more around the reasoning capabilities, like you can give it all the information about all of that context and it's just not smart enough yet to extract that out? Or is it more of a product problem still, of I just need to bring all the relevant information to the hands of the LLM, but the intelligence is already there?
[00:09:20] Merrill: I think it's more the latter. I think the reasoning is there if you give it perfect information. But the challenge is that a lot of the considerations that go into good architecture come from so many disparate sources of context. It might be in
[00:09:36] Merrill: one engineer's head, who built the system two years ago. It might be in a Notion doc that's been lying dormant for a long time. There are so many places where this is fanned out in an organization as tribal knowledge. One of the things that we're working on is how do we bring all of those pieces of context and provenance for a change into the PR, more so than ever before.
[00:10:05] Merrill: Provide the agent with enough context to actually think through the problem at that scale. And I do think that we're close, if not there already, as far as the capabilities of the models themselves, given the right information.
[00:10:21] Guy: Yeah, that's interesting. Although, connecting these two gaps, it is interesting to think about whether the training data for what is good architecture is probably also a bit more limited.
[00:10:33] Merrill: Fair, fair.
[00:10:33] Guy: So as you look at this, you know, you've sort of had a stretch.
[00:10:37] Guy: You still do see a bunch of human code and reviews, and now you're seeing a bunch of AI. So I guess kind of two related questions. One is, can you tell the two apart? I mean, does the system track those? I don't know if you have any stats or data around it, but in general, what's your perception of the amount of AI code coming in? And importantly, are there types of
[00:10:59] Guy: mistakes or problems that you think are more AI-prevalent than human-prevalent?
[00:11:04] Merrill: Yeah, we don't have that. This is a challenge that I think a lot of the industry has right now: there isn't an agreed-upon standard for storing in Git history
[00:11:18] Merrill: whether this was written by AI or not. That would be something very interesting. At a high level, what we've seen is that engineers are generating or changing about 70% more lines of code per week than what we saw at the end of 2023. So we think a lot of that should largely be attributable to AI. GitHub has published stats in the past about over half of the code on GitHub now being AI-generated; I'm sure it's probably closer to two-thirds to three-quarters at this point.
[00:12:27] Merrill: Given the trend, I don't think we hit Dario's 90%-by-September projection, but the path to that I think is fairly clear. So we're seeing a lot more AI-generated code, and there's no way to really tell it apart at the moment. Especially for organizations with top-tier engineering talent, the quality does tend to be lower than the median engineer there. It's interesting, because I think there are some organizations for which AI is now better than the median engineer.
[00:13:15] Merrill: For many of our customers, though, that's decidedly not the case yet. And there's still very much a need for us to check and refine particularly AI-generated code.
[00:13:28] Guy: You mentioned that it adheres to style guides and the likes.
[00:13:40] Guy: I'd love to talk a little bit about what is the source of truth that you contrast the reviews against. What has been effective in terms of defining correctness and quality? And what's the experience for someone to input this? Who is it that is entering this gospel?
[00:14:03] Merrill: Yeah. So we've put a lot into letting customers customize their reviewer and add their own rules. So much of code review is this kind of store of institutional knowledge and style. With Graphite Agent, you can define custom rules either in English, or you can give it a regex that you want it to look for,
[00:14:30] Merrill: helping it to focus on particular types of issues.
[00:14:36] Guy: Do people like the regex more because it's controlled?
[00:14:38] Merrill: Yes.
[00:14:39] Guy: Or do they like the sort of the,
[00:14:41] Merrill: Yeah, they like the conversational language. We'll also look for whether there's a Claude MD or a Cursor rules file.
[00:14:54] Merrill: We'll ingest that and add it to context as well, just so there's consistency there. This is another place where there are a lot of different standards evolving, and we basically just try to vacuum up as many of those as possible from the repo.
[00:15:13] Merrill: We'll actually say in the comment if it was attributable to a particular rule, and we show insights in the product of which rules are triggering the most, and what we're catching and what we're not.
[00:15:30] Merrill: So we try to have a good feedback loop there of customization, and then seeing the impact of that in terms of the AI reviews.
[00:15:40] Guy: Whose job is it to enter this data? Things like the Claude MD that you read from a repo, for instance: that is the repo's information, and presumably owned by the app dev team that is building it.
[00:15:54] Guy: But maybe the definition of what high-quality code is, or what secure code is, and all of that, is more some platform team's responsibility, I guess. Who configures the rules, or the languages, in Graphite?
[00:16:08] Merrill: At scale, it's usually a developer productivity or developer infra team.
[00:16:15] Merrill: They're usually the ones who are admins on the GitHub repo, and they're configuring Graphite more generally for the organization. So at a large company like Shopify, we will work really closely with that dev infra team and make sure that we're pulling in the right sources of context and setting up the right knowledge there.
[00:16:36] Merrill: For smaller teams, though, much like on GitHub, it's kind of a free-for-all. Many of them will set it so that anybody can add a rule and anyone can update it. There's something nice about that, where you get as many inputs as possible and you can converge on the right set of rules much quicker.
[00:17:00] Merrill: But obviously for large organizations, it makes more sense to centralize it and we wanna support that as well.
[00:17:05] Guy: So you're not alone in introducing a reviewer, and specifically a pull request reviewer, but clearly you've been around before. I would kind of posit that Graphite's original success was very much around user experience.
[00:17:21] Guy: It was a methodology and a kind of ease, and you nailed some of that developer zeitgeist of, this is an elegant way of working, in a way that I guess we've come to appreciate that it's hard to get simple things right. Now in AI code review, there's a certain sense that everything you've described, if I was to bring along many, many other code review
[00:17:47] Guy: builders, there will probably be a lot of repetition. What do you think makes a company stand out in this space? What makes it not just, to use the cliché lingo, a wrapper on top of the LLM?
[00:18:02] Merrill: Yeah, there's obviously a lot of interest in the space right now; we're certainly not the only ones who've recognized that.
[00:18:11] Merrill: The volume of AI-generated code means there will be a massive opportunity to help teams review it faster. I think the thing that differentiates us, and what we believe is Graphite's core advantage in this space, is that we span the entire lifecycle of the pull request.
[00:18:33] Merrill: It's not just a bot that plugs into GitHub and comments in this one part of the lifecycle. We're able to go all the way from the moment the PR is created: enhancing it with a generated description, pulling in the right reviewers, doing the AI review certainly, but then helping the author iterate on it and helping reviewers understand the change better.
[00:18:59] Merrill: Walking reviewers through the change as part of their review, helping to solve CI failures as they come up and prompting the author to accept the change and rerun it, solving merge conflicts. We're not quite there yet, but I think the view of Graphite Agent is that it really should be almost a companion that's helping to shepherd your changes through
[00:19:25] Merrill: all the different steps of the process. That holistic view of the PR lifecycle is what differentiates us from many of the point solutions here. I'd say the other piece is that even on AI review, I don't think we've really seen what the best modality here is.
[00:19:47] Merrill: For AI code generation, you've seen several very quick iterations on what that modality is. You went from Copilot, which just had auto-complete, to Cursor tab, to Cursor agent mode, to now moving entirely away from interfaces that focus on the code and moving towards just the prompt with Claude Code.
[00:20:08] Merrill: Now we're seeing background agents. So the way that you interact with these tools is evolving so rapidly. And I think the bot that comments on GitHub is, to me, just the Copilot auto-complete V1. There will still be several iterations of what AI code review looks like.
[00:20:28] Merrill: And you know, I think there's something nice about it in that it meets users where they are. I think it is the right logical starting point here, but I think it's still so early in terms of pushing the boundaries of what this review experience should look like in an AI-native world.
[00:20:45] Merrill: And that's part of why, with bringing the Graphite agent chat into the pull request page, I think that is our belief of what the next step here looks like. You know, can it be not just a static reviewer that's asynchronously leaving comments, but rather a real companion that you're building with, that's helping you get the PR ready and iterate on it live from the PR page.
[00:21:09] Merrill: And I think that's why a lot of customers are really excited about it. That's why we're really thrilled to get that out.
[00:21:17] Guy: So, in this context now, you can interact with the LLMs through Graphite.
[00:21:29] Guy: You can interact with the Graphite agent to edit the code within the PR. And you've seen a bunch of Claude and others introduce maybe review elements, and you're kind of encroaching into that space. I guess, how do you see this playing out, with all the caveats and as you described, you know, this is all theory, 'cause we don't really know.
[00:21:51] Guy: Yeah. But still, how do you see the boundaries or the right scope of a product? Does it need to be a single product that does kind of the whole agent perspective? Like, is it eventually that someone using Graphite will be using that for their end-to-end development, and Claude Code or a DevIn or such will not be part of their flow?
[00:22:10] Guy: Or do you think there's a specific slice, a specific realm of responsibility that will be distinct? Because it definitely feels like a massive blurring of the lines.
[00:22:21] Merrill: Yeah, I think it is. It's changing rapidly. One thing I'll say is, it's very hard to know where this will go. And also, part of this is what
Guy: In the next week.
[00:22:31] Merrill: Yeah, in the next week. It does feel like it changes on a weekly basis. But there's also something interesting: we've seen different levels of interest from the base model providers. I think at one point it looked like Anthropic was going to focus entirely on just code generation and OpenAI was going to go full consumer.
[00:22:51] Merrill: And now OpenAI has kind of gone back the other direction and said, no, we actually want to own everything, which is kind of cool and ambitious of them. But I think there is this question of to what extent the base model providers remain
[00:23:07] Merrill: base model providers, or whether they continue to go up-stack and own more of the application layer. My belief there is that, at least so far, they will only own applications insofar as it feeds back training data to them.
[00:23:26] Merrill: So I think that's kind of the logical boundary for the base model companies: what is the minimum scope needed to get good training data back so they can RL? For us, our view is that we'll likely have a lot of competition among who is generating the code,
[00:23:47] Merrill: and even among who is providing the review agents. I could see a world where Graphite isn't necessarily building a first-party review agent, if others get good enough. Our goal is very much to be the interface layer, the best place to interact with and control many different agents working on your code changes.
[00:24:09] Merrill: So we think there will be opportunities to build the best-in-class agents for each task in the software development lifecycle. I also think there's another opportunity to be the control plane for all of those agents, and that is really where we see the most opportunity right now.
[00:24:29] Merrill: In some of these cases, like with AI review, I think you have to build the agent, because there just aren't good alternatives out there. But, for instance, there are a ton of companies doing background agents for code generation.
[00:24:47] Merrill: We're not super interested in building a Graphite code-gen agent. We would much rather be the place where you prompt your agents, they go off and create PRs for you and hand them back, and you review them in Graphite and merge them in Graphite. That entire create, review, merge process all happens from our control plane.
[00:25:09] Guy: Yeah. Interesting. So I guess I'm interpreting what you're saying as more of the user experience, and maybe that core element of
[00:25:18] Merrill: Yes.
[00:25:19] Guy: of what you started Graphite with, is where you feel there's an opportunity to orchestrate this activity. But I guess to challenge that, stepping out of it a little bit:
[00:25:29] Guy: it feels to me, and feel free to call me an idiot on this, that the stacked PRs, for instance, that you started with in the human-review reality were, and still are, very necessary, because you're working on something big and you publish this piece of work so that humans can review it.
[00:25:51] Guy: If AI is doing the review, I think I would want to consume that review as part of my ongoing development as opposed to waiting until I publish a PR. Maybe it even reduces the value of that stacked PR because the AI's already reviewing it as they build. But even for that first bit, when you have review, and the review is done by AI, why do I need to wait until I open the pull request? Why is that point in time needed for the AI review? I get why it's needed for the human review, but why for the AI review?
[00:26:27] Merrill: I'd say a couple of things. One, I would challenge the assumption that stacked PRs are less valuable in a world where AI's generating the code. Because I think if anything, what we've seen is that, especially for background agents, if you just let them run wild, the result is you get multi-thousand-line, unintelligible PRs. Certainly, there are things that we can do on the PR interface itself to guide you through the change and help you understand it.
[00:26:58] Merrill: But I think fundamentally, especially as you're building with multiple agents or having these agents do longer autonomous tracks of work, it will be increasingly important to have those changes be understandable and digestible by the humans that need to check them.
[00:27:19] Merrill: So we've built the Graphite MCP, and a lot of our customers now actually have their agents create stacked PRs, precisely because it makes them easier to review on the other side.
[00:27:31] Guy: Yeah, that's actually super interesting and I can totally relate, which is the agents, when they do long work, if you think of stacked PR as a way for you to continue working while you are submitting stuff to review.
[00:27:43] Guy: Kind of the core of it. Actually, these sorts of agents are a great example of that because they don't want to wait. And you want to be able to review their work incrementally and I guess probably there will be interfaces of once they did submit something, if there was substantial correction in the review, they might kind of revert back to that point and continue the work from there.
[00:28:02] Guy: Which is something that humans will probably be quite annoyed at doing.
[00:28:07] Merrill: Exactly. And I think the revertibility piece is another good point there. Having smaller pieces and a clear history enables you to more easily figure out what was the change that introduced this problem and lets you roll back more precisely.
[00:28:25] Merrill: Like there are a lot of benefits to having your agents work in smaller pieces, and stacking is a fantastic workflow to enable them to do that. The second point that I wanted to make, though, is that I think a lot of folks really get stuck on the idea of the only value of code review being this validation step.
[00:28:52] Merrill: That's certainly one of the main benefits. But the other two pieces, I'd say, that are often missed are that code review is also super important in sharing knowledge across the team, in helping other engineers understand what is actually going out to production and what is changing. It's also a great moment of teaching, where you get a lot of feedback and are able to improve the way that you're approaching your projects.
[00:29:27] Merrill: So certainly the validation and quality piece I think AI can take on, but the latter two are actually more important than ever in many ways. If even the engineer who's authoring the PR doesn't fully understand the change, it's all the more important that you have a moment where you walk through everything and can clearly communicate what is changing in the code base. Similarly, there's sharing that with others, teaching others what is good about this change and what is not, and what learnings they want to apply.
[00:30:06] Merrill: And also having a moment then to feed any systemic things back into the reviewer and give it more rules for the future of what it should or shouldn't look at. That piece is really important. So I think the purpose of code review, of human review at least, will shift from validation more to the pedagogy and knowledge-sharing pieces.
[00:30:35] Merrill: But I do think that those two are, arguably, fundamentally more important in terms of why we do code review today than the validation piece.
[00:30:48] Guy: Interesting. So I guess you're describing a reality in which maybe the code review that is fully automated, you might eventually want to consume as a natural part
[00:30:58] Guy: of the agent development. It's a
[00:31:00] Guy: point in time in which it is done in a different system, and having to come back to it is maybe a hiccup; maybe they call into Graphite through that MCP, not just to open the pull request, but to actually run the review and get the results.
[00:31:11] Guy: But then, I guess you mentioned two things that are maybe long-term relevant. One is the notion of a stacked PR, whether or not it is reviewed incrementally, just as a means of demarcation of different points in time. And second is as a means of reviewing work that's been done, to be able to get guidance and provide guidance on what's there, to learn and guide the AI.
[00:31:41] Guy: And I guess in that context, at some point you'd probably have to veer away from reviewing the code and toward reviewing some analysis of the code, right? Yes. It's an unpleasant future to imagine in which all we do is read code and review and comment and nitpick on it.
[00:32:00] Guy: Does that sound about right? Like you're gonna need some summaries or sort of reviews of the changes. Is that a likely feature?
[00:32:09] Merrill: I mean, I think the framework that we talk about internally here is that we see three steps in the process and the evolution of engineering and code review is a product of that.
[00:32:20] Merrill: Like the first step, which I think we're somewhat moving past already, is one where the IC engineer is primarily the one writing the code or generating it with tab complete. And then review is very much like a process of looking at every single line of code, like really carefully inspecting it.
[00:32:39] Merrill: I think phase, the second phase that we're now entering or have entered is one where the job of engineering feels more like an engineering manager where you're working with and orchestrating a team of agents that are going and building, they're handing you back changes.
[00:32:56] Merrill: You are reviewing, you're certainly reviewing the code. But a lot of it is reviewing the design docs and the architecture and kind of the higher level points. Still being able to introspect into the code. But a lot of that is happening of the review and like what you're looking for is happening at a higher level of abstraction.
[00:33:15] Merrill: The details there are starting to be handled more by AI review in that world. And then I'd say the third level, which we have not gotten to yet but will start to see if everything gets good enough, is one where
[00:33:34] Merrill: it almost looks more like working with an external dev agency: you're giving just high-level specs, and then you're getting back the artifact of the finished product and reviewing that. You may not ever necessarily look at the underlying code or need to understand it; you're primarily reviewing whatever that finished product is.
[00:33:57] Merrill: And I think we're somewhere between phases one and two right now, and in the coming years we have the opportunity to jump from two to three.
[00:34:06] Guy: Yeah. And clearly we're quite aligned in terms of spec-driven development.
[00:34:13] Merrill: You guys think about this a lot.
Guy: Yeah. It's about that. And I guess the question here would be, review is an after-the-fact kind of reality, right? You're looking at the output and asking, is the output proper? And to an extent, the more you've defined upfront, the less you need to review, or at least the review becomes a bit more straightforward, which is, yes:
[00:34:38] Guy: Did you adhere to the specification? And I think that's a future, but I agree that it's a journey, you know? And at the moment it's there. And to an extent, my question about what is the source of truth is a bit about that, right?
Merrill: It is about, yeah.
[00:34:55] Guy: What's the spec-assisted review that you have there to tell you what's correct? Whether it's a Claude MD or a style guide. Yep. And I think it's interesting to think about how that gets managed as well within an organization.
[00:35:06] Merrill: Yeah. I think there's a lot there. Connecting back to what I was saying earlier: how do we start to show the chat history, or the prompts, the agent logs, in the pull request?
[00:35:21] Merrill: How do we pull in the right provenance and context around where the spec was, where the Figma design file was, what the intention of this change was? And capture that more completely, because I think just a PR description today isn't sufficient anymore,
[00:35:39] Merrill: if we want agents to be able to do a really high-quality review. Even for humans, I think there is a level of subjectivity about review today that would really benefit from having more context. I'd be curious to hear a little bit more about how you think about it.
[00:36:00] Merrill: How do you most completely capture the spec of what the intention of a change was, and then turn that into something that can be digestible by either an AI reviewer or a human on the other side of that?
[00:36:16] Guy: I mean, clearly we're flipping roles here a little bit, but clearly there's an evolution there to think about: how do you create a spec?
[00:36:26] Guy: How do you read it? What does a spec even contain? You referred now to the spec of a change, of going from state A to state B. We also think a lot about long-lived specs: not just the task at hand, but some documented, long-lived definition of either what you're building or how you want to build it, and then how you adhere to that.
[00:36:45] Guy: One thing that's quite interesting, and I don't know if you've encountered it, well, in your case you're running your own agents, so you're fine. But because we connect to different agents, we find that the same guidance that you want to provide would get very different adherence.
[00:36:59] Guy: You might have one state-of-the-art agent take some instruction and follow it 80% of the time, and another follow it 30% of the time. And it might just be a wording difference. 4.5 requires very different prompting than 4, and 5, when it came out, versus its predecessors.
[00:37:14] Guy: And so I think the notion of how you manage this knowledge, and the source of truth, is interesting, and we need visibility, we need observability. Which I guess comes back, for us, and I'm a bit of a broken record there, to spec-driven development as a new development paradigm over time.
[00:37:31] Guy: But the transition period is an interesting one, as it's still quite impactful. And I think review is clearly at a very important place there, and we need to see what is being reviewed. There will always be a need to review, but what is being reviewed? I want to shift to talk a little bit about the business and the startup side of the company,
[00:37:57] Guy: Sure. Then kind of close off on some futures. So, a common debate today in startup land, specifically in developer land, is the pricing model, right? Today, I believe you're still primarily seat-based, how many developers are using this, which makes sense.
[00:38:15] Guy: You're collaborating on it. But you're going to have a lot of AI workers, and I don't know if they're going to purchase a seat. How do you think about what's the right long-term value attribution, or pricing model, for products like yours?
[00:38:31] Merrill: Yeah, I think there's certainly a big shift happening right now across industries on collaborative tooling, where historically it made sense to price per seat.
[00:38:42] Merrill: I think that is still largely what buyers are anchored to. We've certainly been exploring what a usage-based model would look like, if we charged by PR volume, or by PRs reviewed, or lines of code; there are different parameters we could use there.
[00:39:01] Merrill: But the challenge there is, what is the right time to innovate? Is pricing the thing that we want to spend our innovation points on? And if so, when do we want to lead the market in doing that?
[00:39:22] Merrill: I think that for now, certainly, per-seat is still very much the model and the paradigm that we're operating in. Over the next year, especially as companies start to change the way they think about scaling their teams, maybe not hire as many engineers as they scale, and shift more towards relying on background agents for those workloads,
[00:39:50] Merrill: it's the type of thing that would make us more seriously consider moving over to usage-based. I think that will be the end state here; it's more a matter of when is the right time to pull that lever.
[00:40:01] Guy: Yeah. I do think it's super interesting to see how things play out. With a Snyk hat on, thinking about security, it's similar.
[00:40:11] Guy: Yeah. We secure the work of developers, and so the more developers you have contributing, the more it makes sense to charge. And when you think about outcome-based pricing, and a lot of that aspiration in this world, well, fine, maybe we charge for every vulnerability fixed, but then you get into the conversation of, what's a vulnerability, and was it vulnerable?
[00:40:31] Guy: So it gets very hairy. I think there's a world of learning there. If you switch to utility-based pricing and you charge by scan, you are potentially incentivizing the wrong behavior, because people would want to scan less. There will be some sort of similarity for you, where the number of reviews is a tricky proposition.
[00:40:51] Merrill: Right. I think there are kind of two competing factors there. On the one hand, as you say, we don't want to disincentivize the thing that we believe our users should be doing more of. We want users to create more PRs, we want them to stack. We don't want you to feel like, oh, you're now paying more because you're doing the thing that Graphite told you to do.
[00:41:13] Merrill: At the same time, though, Gokul Rajaram, who is one of our investors and a great product mentor of mine from the Square days, has a great blog post on this, on how Google thought about charging: why they charge for impressions and not for conversions, or even click-throughs.
[00:41:35] Merrill: And that being because you want to charge for things that are under your control as the business. If somebody makes shitty ad copy, it's not on Google whether or not you actually click through to the site.
[00:41:50] Merrill: Google's job is just to serve you the ad in the first place. And I think about a similar thing here, where even with completion, what is the definition of completion? If you expose yourself to too much subjectivity and variability on the user input side, then all of a sudden your pricing becomes really hard to model, hard to reason about, and hard to really rely on, versus keeping it as something that's fundamentally under your control.
[00:42:21] Guy: Yeah. Yeah. I find it a little bit funny how so many things get emphasized with AI, and one of them is the measure of developer productivity. 'Cause really what you want is you want to say, hey, I'm going to charge by the amount of productivity I've provided to you. Yeah. If only we can measure what that actually means.
[00:42:41] Guy: Yeah. So, Merrill, thanks for all the great insights. I want to finish off with just a bit of futures of it. Yeah. So for starters, you touched on this a little bit, but, what is it that you think is the role of a developer in the future? I'm talking about someone who does it for a living, if you think that exists.
[00:43:00] Guy: Yep. But if you try to categorize far like five years or maybe even 10, what do you think is the role of a developer? What does it look like?
[00:43:12] Merrill: Yeah. I think that it really becomes one of defining at a high level what the experience should be.
[00:43:21] Merrill: You know, to some extent, like what are the technologies that should be used? Like what are the hard problems that need to be solved, and then working collaboratively with agents to find the best solution to them, to iterate on that, and then to get that out to your end customers.
[00:43:43] Merrill: And I think it does look more like review, or like you're operating at a higher level of abstraction than code today. I still think that you need to be able to dive deep into the code in some cases, and know what approaches to use at the right time.
[00:44:06] Merrill: Some of that will be like, the agents will be informing you of this. Sometimes you'll be informing them, like a collaborative team works today. Not everyone on your team knows everything or has the right answer to every problem. But together with your human and agent colleagues, you'll figure out the right solution.
[00:44:26] Guy: I guess just sort of poke on that a little bit. If you spend most of your time at a higher level, would you not lose, or maybe in some cases never acquire, the skills to be able to review the code? Don't we need to kind of find a way to not have that uncanny valley, like it can nearly code, but not quite, and solve that with other tools?
[00:44:49] Merrill: I think this is a great debate, right? Because a lot of us don't have to understand compilers today, or you don't need to get down to machine code and understand it.
[00:45:05] Merrill: Like there are, yeah, there are levels of abstraction we've jumped up before successfully. The difference there is that those are far more deterministic than what we're dealing with now. But I do think that we have been able to navigate those changes in the past.
[00:45:23] Merrill: I still think that there is a reason why we've learned those things. Even if we don't use them in our day-to-day as engineers, there is value in learning that, and that's certainly something that I'm curious to see the result of: every CS graduate having access to Cursor now, versus a few years ago. There's a bull case there, that it makes everyone far more productive and able to learn faster, build things faster, and get more excited about the field.
[00:45:48] Merrill: I also think that there is potentially a challenge coming of a whole class of new grad engineers not necessarily having the skills and knowledge that they need to be able to jump down a layer of abstraction anymore.
[00:46:16] Guy: Yeah. And that's actually precisely the last question I had for you, which is: if you had an 18-year-old child right now and they were considering whether or not to go into a CS degree at university, would you recommend they do or they don't?
[00:46:31] Merrill: I would say that there are basically two angles you could go. Either you go super deep and become an expert in a deep technology, something that requires a lot of specialized knowledge and will take a long time for AI to catch up to.
[00:46:52] Merrill: Or the other side of it is, go wide: develop design skills, develop business skills, product, and have a breadth of influences and abilities there, such that you can better
[00:47:14] Merrill: inform the agent of what to build. On that piece, we talk about whether product and design and engineering all merge together in some not-so-distant future. I think there is some world where the role of engineers either gets much broader in some cases or much deeper in others.
[00:47:35] Merrill: But the in-between of just existing in the scope that we had before, I think, is not going to last very long.
Guy: I love the perspective, sort of pick your battle, but the middle might be a little bit tricky.
Merrill: Right.
[00:47:47] Guy: Merrill, thanks a lot for all the great insights and good luck reviewing and kind of evolving review with Graphite.
Merrill: Thank you, I really enjoyed this. Thanks for having me, Guy.
Guy: Thanks everybody for tuning in and I hope you join us for the next one.
In this episode
In this episode of AI Native Dev, host Guy Podjarny sits down with Merrill Lutsky, CEO and co-founder of Graphite, to explore how AI is revolutionizing code review in software development. They discuss the transformation of the "outer loop," focusing on Graphite's AI reviewer, Graphite Agent, which enhances productivity by catching logic errors and enforcing standards while allowing human reviewers to focus on architectural decisions and systemic risks. Discover how to effectively integrate AI-assisted reviews into your workflow, codify standards, and maintain a collaborative PR process that leverages both AI and human expertise.
Code review is being reimagined in the age of AI. In this episode of AI Native Dev, host Guy Podjarny talks with Merrill Lutsky, CEO and co-founder of Graphite, about how AI is transforming the “outer loop” of software development—particularly code review—and what it takes to make AI a productive collaborator rather than noise. From Graphite’s origins with stacked PRs to its AI reviewer, Graphite Agent (formerly “Diamond”), Merrill shares what’s working, what isn’t, and how teams can operationalize AI-assisted review at scale.
From Stacked PRs to AI-Assisted Review: Unblocking the Outer Loop
Graphite started by focusing squarely on the outer loop: everything that happens after code is “ready” and needs to get through review, CI, and merge. Drawing inspiration from internal tooling at Google and Meta, Graphite popularized stacked PRs—breaking a large change into small, reviewable slices and decoupling development from review so both can proceed in parallel. That early bet paid off by reducing merge conflicts, getting faster feedback on smaller diffs, and making it easier for teams to keep shipping even as complexity grows.
The AI inflection point—spurred by tools like Cursor and Claude Code and large model upgrades like Claude 3.5 Sonnet—multiplied the volume of changes in flight. Engineers now spin up multiple AI-assisted workstreams concurrently, pushing out far more PRs than before. This made Graphite’s original problem 10x more pressing: senior reviewers were getting buried in mountains of diffs. In response, Graphite introduced Graphite Agent, an AI reviewer designed to take the first pass on PRs, catch issues quickly, and reduce the human burden.
Crucially, Graphite doesn’t treat the PR page as a static checkpoint. The Agent’s chat is embedded directly in the PR, allowing authors and reviewers to ask questions, request changes, and have the AI propose or apply updates in context. It can search the codebase for relevant examples and supporting files, turning the PR into a live collaboration space rather than a one-way gate.
What Graphite Agent Does Well (and Where Humans Still Shine)
On strengths, Merrill says the Agent is particularly good at finding logic errors—including subtle bugs that slip past type checkers and careful human reading. Think boundary conditions, incorrect default values, asynchronous missteps (like missing awaits), misuse of APIs, or error paths that look fine at a glance but fail in edge cases. It also reliably enforces style guides and catches security footguns, offering feedback in seconds so authors can iterate before looping in a human.
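To make that class of bug concrete, here is a minimal TypeScript sketch (an illustrative example, not one taken from the episode or from Graphite) of a missing await that type-checks cleanly but silently changes behavior:

```typescript
// Hypothetical example: a permissions check that compiles and "works",
// yet hides exactly the kind of logic bug an AI reviewer tends to flag.

async function userCanDelete(userId: string): Promise<boolean> {
  // Imagine this calls a database or an auth service.
  return userId === "admin";
}

async function deletePost(userId: string, postId: string): Promise<void> {
  // BUG: the `await` is missing, so the condition checks the Promise
  // object itself, which is always truthy. Every user passes the check.
  // The type checker is satisfied, and a human skimming the diff can
  // easily miss it.
  if (userCanDelete(userId)) {
    console.log(`Deleting post ${postId} for ${userId}`);
  }
}

deletePost("guest", "42");
```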
That said, system-level architecture decisions remain a human domain. The Agent can read a diff and retrieve local context, but architecture involves trade-offs, performance characteristics, and organizational constraints that often live in disparate places—or in someone’s head. Today, senior engineers still need to set direction, evaluate cross-cutting concerns, and arbitrate design choices that span services and teams.
There are also uneven spots in model coverage. Languages underrepresented in training data, such as Kotlin in some mobile repos, yield lower accuracy today. And while end-to-end validation (spin up environment, click through, record a video, and attach it to the PR) is a promising near-term direction, it’s not yet a turnkey capability. The right pattern is clear: use the Agent for the high-signal first pass and keep humans in the loop for architecture, product risk, and final accountability.
Context Over IQ: Feeding the Reviewer the Right Signals
Is the main limiter AI reasoning or missing context? Merrill’s view: given perfect information, models can already reason well enough to be useful at higher levels. The harder problem is product and workflow: corralling the right context from scattered sources—design docs in Notion, old decision records, code comments, tickets, and tribal knowledge—into the PR surface so the AI (and humans) can make sound judgments.
Graphite tackles this by retrieving repository context and ingesting codified standards wherever they live. If a repo has a Claude MD or Cursor rules file, Graphite pulls that into the Agent’s context to ensure consistent advice across tools. In the PR chat, developers can ask the Agent to search the codebase for prior art, relevant helpers, or historical diffs, making it easier to validate a change against established patterns.
For developers, the takeaway is to make context machine-consumable and PR-attached. Link relevant tickets and design docs in the PR description, summarize intent and constraints, and cite performance or security requirements up front. Treat “change provenance” as a first-class object—what problem this solves, why this approach, and what trade-offs were considered—so the Agent and human reviewers evaluate the same facts.
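As one illustration of that practice, a PR description skeleton along these lines (a hypothetical template, not a format Graphite prescribes) keeps intent, provenance, and constraints in front of both the Agent and human reviewers:

```markdown
## Intent
Fix slow dashboard loads by caching the per-team usage query.

## Context and provenance
- Ticket: PROJ-1234 (p95 latency regression)
- Design note: link to the caching decision doc
- Constraint: results may be up to 60 seconds stale; never cache per-user data

## Risk and review focus
- Cache invalidation when team membership changes
- No change to the public API surface
```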
The AI Code Wave: Volume, Quality, and the Human-in-the-Loop
Across Graphite customers, engineers are changing about 70% more lines of code per week than at the end of 2023—a surge likely attributable to AI copilots. That translates into more PRs, more context-switching, and more pressure on senior reviewers. AI-assisted review relieves the immediate bottleneck by triaging routine issues, but organizational process has to evolve alongside it.
One challenge: there’s no agreed-upon way to mark AI involvement in Git history. Without a standard trailer or metadata convention, it’s hard to measure downstream impacts of AI-authored code or tailor review heuristics accordingly. While public stats suggest more than half of code is now AI-generated—and likely trending toward two-thirds or more—Graphite (and the industry) can’t reliably distinguish authorship post hoc.
Quality varies by baseline. In top-tier engineering orgs, AI-generated code often isn’t yet at the median reviewer’s bar, reinforcing the need for human refinement. In other environments, AI can exceed the average contributor on routine tasks. The operational model that works across contexts is consistent: let the Agent run a fast, thorough first pass, then have humans focus on architectural coherence, systemic risk, and product correctness.
Codifying Quality: Custom Rules, Shared Standards, and Feedback Loops
Graphite Agent supports organization-specific standards via custom rules written in plain English or regex. Teams can encode naming conventions, security policies, dependency hygiene, error handling expectations, and other house rules. The Agent also ingests existing config where possible (e.g., Cursor or Claude rule files) to reduce duplication and keep guidance consistent across local editors and the PR surface.
Transparency matters for trust. Graphite attributes its comments to the rule that triggered them and exposes insights on which rules fire most often and where the Agent is catching issues versus being ignored. That makes it easier to tune thresholds, de-noise low-value checks, and spot policy gaps that deserve promotion from “advice” to “must.”
Implementation best practices: start with a short list of high-signal rules (security, breaking changes, API contracts, error handling), use regex for precise lint-like checks, and use natural-language rules for style and policy that benefit from flexibility. Wire rules into PR templates and contributor docs, and revisit your rule set monthly to eliminate noisy checks, add new patterns surfaced by incidents, and keep your signal-to-noise ratio high.
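To make the regex side of that concrete, the sketch below (illustrative only; it is not Graphite's rule format or engine) shows the kind of precise, lint-like check a regex rule encodes, applied to the added lines of a diff:

```typescript
// Illustrative sketch of regex-style review rules applied to a diff.
// The rule names and messages are hypothetical.

interface RegexRule {
  name: string;
  pattern: RegExp;
  message: string;
}

const rules: RegexRule[] = [
  {
    name: "no-console-log",
    pattern: /console\.log\(/,
    message: "Use the structured logger instead of console.log.",
  },
  {
    name: "no-plain-http",
    pattern: /\bhttp:\/\//,
    message: "External calls must use https.",
  },
];

// Check only the added lines of a unified diff (lines starting with "+").
function reviewDiff(diff: string): string[] {
  const findings: string[] = [];
  diff.split("\n").forEach((line, index) => {
    if (!line.startsWith("+") || line.startsWith("+++")) return;
    for (const rule of rules) {
      if (rule.pattern.test(line)) {
        findings.push(`line ${index + 1}: [${rule.name}] ${rule.message}`);
      }
    }
  });
  return findings;
}

const exampleDiff = [
  "+++ b/src/checkout.ts",
  "+  console.log(cart);",
  '+  const url = "http://payments.internal/charge";',
].join("\n");

console.log(reviewDiff(exampleDiff));
```

Natural-language rules cover the judgment calls this style of check cannot express, which is why the two complement each other.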
Key Takeaways
- Treat the PR as a collaborative workspace. Embed an AI reviewer in the PR, enable chat, and let it propose changes, search the codebase, and enforce rules before humans step in.
- Use AI for the first pass, not the final say. Let the Agent catch logic bugs, style drift, and security footguns quickly; keep humans focused on architecture, product risk, and system-level trade-offs.
- Feed the model context, not just code. Link tickets and design docs, summarize intent and constraints in PR descriptions, and centralize “change provenance” so both AI and humans judge the same facts.
- Codify standards once and reuse them everywhere. Write custom rules in natural language or regex, ingest existing Cursor/Claude rule files, and keep guidance consistent across IDEs and PRs.
- Measure and tune the feedback loop. Track which rules fire and which are ignored, reduce noise, and iterate monthly. Watch PR throughput, time-to-merge, and “AI catch rate” as leading indicators.
- Plan for known gaps. Expect weaker performance in underrepresented languages (e.g., Kotlin in some repos) and on architecture. Add targeted human review and, where possible, higher-level tests or E2E validations.
- Push toward provenance for AI-generated code. Until there’s a standard, consider internal conventions (e.g., commit trailers) to tag AI involvement and inform review policies and post-merge analysis.
Related episodes

Revolutionising Spec-Driven Development with Tessl’s Framework & Registry
16 Sept 2025
with Guy Podjarny, Simon Maple

Transforming Dev Practices with Kiro’s Spec-Driven Development Tools
19 Aug 2025
with Nikhil Swaminathan, Richard Threlkeld

Can AI Really Build Enterprise-Grade Software?
26 Aug 2025
with Maor Shlomo