
35% Higher Abstraction Adherence
[00:00:00] Simon Maple: Hello, and welcome to another episode of the AI Native Dev. My name's Simon Maple, and joining me today is Maria Gorinova, who is a member of technical staff on the AI engineering team at Tessl.
[00:00:14] Simon Maple: Maria, welcome, and we're in the London office, of course, in Tessl HQ, so we can have a nice face-to-face chat. Maria, how are you?
[00:00:20] Maria Gorinova: I'm good. Thank you. I'm very excited to be here, Simon. Thank you for having me.
[00:00:24] Simon Maple: Long time listener, first time caller, right? Yeah. Well, today, Maria, you were part of a team that wrote a report and some findings in and around,
[00:00:36] Simon Maple: well, basically a bunch of evals testing how good coding agents can be, with and without various contexts, at performing certain coding tasks. We'll dig into that, and it shows some really interesting information about when coding agents can provide solid results versus results that are a little bit scrappier, where you're not sure about the quality.
[00:01:03] Simon Maple: So really interesting data coming up. But first, who is Maria? Tell us a little bit about yourself, Maria.
[00:01:10] Maria Gorinova: Yeah, sure. Well, as you said, I'm a member of technical staff here at Tessl in the AI engineering team. Before that, I have a PhD from the University of Edinburgh in data science, in machine learning.
[00:01:25] Maria Gorinova: And then I moved to industry research. Firstly Twitter, back when it was still Twitter.
[00:01:34] Simon Maple: It still is Twitter. Come on. It still is Twitter. In our hearts. It's still Twitter.
[00:01:37] Maria Gorinova: Yeah. Well, no comment. Yeah. And after that, I was in a few other startups before joining Tessl.
[00:01:47] Simon Maple: Awesome.
[00:01:47] Simon Maple: Awesome. From the AI engineering point of view, what drives you into the wonderful space of machine learning and LLMs? What's your passion around that?
[00:01:57] Maria Gorinova: Good question. I think I've always been passionate about this intersection between programming and logic and machine learning and data.
[00:02:05] Maria Gorinova: So my PhD was kind of in that space. Even before that, some projects that I did were in that space. So I've always been really keen on this. And then I stumbled on Tessl and I was like, wow, these guys are really doing something and I wanna be part of it. I think it really matches my journey so far.
[00:02:25] Maria Gorinova: Yeah.
[00:02:26] Simon Maple: Oh, wonderful.
[00:02:27] Maria Gorinova: So that's what excites me.
[00:02:28] Simon Maple: Wonderful. So let's talk a little bit about the report. So yeah, there were a number of people. So there was yourself, who else was on the report?
[00:02:35] Maria Gorinova: Max.
[00:02:36] Simon Maple: Who we had on the podcast just a few weeks ago as well, which is cool.
[00:02:40] Maria Gorinova: Yeah. Max. He never sleeps.
[00:02:44] Maria Gorinova: There's so much work. So there's Max, there's Rob, and there's Dru. And myself. So the four of us did this report. I did a lot of work on different evaluations. We actually have, you know, more things cooking. So, cool. We are very excited for follow-ups.
[00:03:03] Simon Maple: Yeah. And I know certainly within the Tessl organization, whenever, when, you know, we always love the AI engineering team doing a whole bunch of this research cause we find such amazing kinda like little gems and learnings from this.
[00:03:15] Simon Maple: And and it's, you know, one of the things about LLMs just being so non-deterministic is we have these experiences, we have these anecdotal experiences and. People will recognize them when we, when we say, oh yeah, the LLM did this. Someone will typically, a user of an agent will typically say, oh yeah, yeah, I saw, I saw that.
[00:03:33] Simon Maple: Yeah, I've, I've, I've experienced that before. But it's hard to actually understand, well, what is the pain though? What, you know, how bad is it? And I think one of the things that I love that, you know, your team and you, you and the rest of the team do is with these, with this, with this research, you kinda like go a little bit deeper and you kinda like show the real pain.
[00:03:50] Simon Maple: You know, what, what, what are the problems? What are the capabilities? And and let's, and let's do that. With this report and with this, and it's out on the Tessl blog, we can, we can share some links in the in the show notes and things like that. What are the what are the problems then that this report is trying to discover the I guess the scope of the depth of what what are the what are the key challenges to agent coding?
[00:04:13] Simon Maple: Wow,
[00:04:14] Maria Gorinova: That is a big question. Very big question. Thank you for asking. I find this report, and the work on it, very interesting, because I think it really hits on something that to me is fundamental when it comes to programming and software engineering, and that is abstraction.
[00:04:35] Maria Gorinova: I feel software is really built on abstractions that get more and more high level. Yeah. Like a stack, right? Everything is built from, you know, machine code at the very bottom, and then you have higher and higher level APIs. And I think this is really fundamental when it comes to software, and I don't believe that it will change, or that it should change, when it comes to AI agents.
[00:05:02] Maria Gorinova: And the reason for that is that, I mean, yeah, sure, an AI agent can generate everything from scratch, can generate machine code from scratch. Yeah. But does it make sense for it to regenerate it every time and never use any existing abstractions that are already out there? This sounds very expensive.
[00:05:21] Maria Gorinova: Right. So thinking of this abstraction one thing that I personally observed with AI agents that really annoys me is that when I use them and ask them to write some sort of code, they very often will just implement everything from scratch. Not use any libraries, or even if I tell them to use a library they're not gonna use it very well.
[00:05:44] Maria Gorinova: You have to really micromanage how they use it. And sometimes that's fine. Sometimes you just don't care. Sometimes you just want to generate code from scratch. But sometimes you really care, because libraries have been optimized. They're very performant.
[00:06:01] Maria Gorinova: Right. They they help with saving costs, saving time, things like this.
[00:06:08] Simon Maple: They're trusted as well I suppose cause they're out in the wild. Exactly. Everyone's using them. People can, you know, come in and and have expectations of how things work.
[00:06:16] Maria Gorinova: Exactly. Yeah. Exactly. And it's it's also, you know, we are still at the stage that maybe that's not going to be the case one day but right now we are still at the stage that we are collaborating with the agent.
[00:06:24] Maria Gorinova: So we still want to understand that code or at least to some extent be able to work with it. Right. So there are many many factors then and I think it's it's a very important problem.
[00:06:37] Maria Gorinova: One thing we were thinking about we can do with with libraries and just helping the agent use libraries more effectively is if we if we can firstly evaluate how well it's using them right?
[00:06:51] Maria Gorinova: Because if we want to improve this, the first step is we have to be able to measure it. We have to be able to evaluate it. And unfortunately, to the best of our knowledge, there are not that many, if any, existing benchmarks out there that touch on this.
[00:07:11] Maria Gorinova: A lot of the benchmarks that are for coding have to do with functional correctness. They evaluate maybe based on some tests or something like this. So they evaluate the behavior of the program.
[00:07:17] Simon Maple: Like SWE-bench or T-bench or something like that?
[00:07:19] Maria Gorinova: Exactly, exactly. And, you know, that is very useful, and it shows us that the code works. But it misses exactly this problem of reusability.
[00:07:30] Maria Gorinova: And reusability of libraries. So that's why we created this evaluation framework that generates data that really touches on the abstractions and how well the agent is adhering to those abstractions of a particular library. And I will go into more detail.
[00:07:50] Simon Maple: Yeah, and it's very interesting, cause I think in the majority of cases people probably aren't building from scratch necessarily.
[00:07:56] Simon Maple: They're trying to use agents to adapt or change their existing applications. I'm sure there are plenty of greenfield projects out there as well, but in an organization you're trying to work out how you can do your day job quicker, and a developer is 95% of the time in their own code base working on, you know, small fixes, new features, et cetera, et cetera.
[00:08:18] Simon Maple: So in that case the agent is not just gonna decide to use certain libraries, it's gonna have to use certain libraries, cause that's what's already in the existing application. So, exactly. So getting an agent to really understand and code well against that open source library is super, super important.
[00:08:41] Simon Maple: Cause it needs to do it. You don't want it to all of a sudden start building in its own way, because we as humans then have to start reading this and going, well, why is this building everything for itself versus just using these libraries that we already import? Now, you mentioned evals, and, you know, I can't go any longer than five to ten minutes talking to someone from the AI engineering team without them saying the word eval.
[00:09:04] Simon Maple: I'm sure. I'm sure you, you must say that like a thousand times a day or something like that.
[00:09:09] Maria Gorinova: Oh, so we think about this all the time.
[00:09:10] Simon Maple: It's, yeah. You dream about evals.
[00:09:14] Maria Gorinova: Yeah, absolutely we do. And I think we think about this a lot because
[00:09:21] Simon Maple: Why, why, why are evals so important?
[00:09:23] Maria Gorinova: Excellent question. I think when it comes to AI, everything's probabilistic, right? Before AI and AI agents, we were really used to deterministic code, right? We are working with programs that are deterministic. We know what the output is when we give it certain input, and so on.
[00:09:43] Maria Gorinova: We can test it. We can predict and expect how it'll behave.
[00:09:49] Maria Gorinova: But that's not the case anymore. It's just not the case when it comes to machine learning. Yeah. So it is very dangerous if we look at an isolated example that works or doesn't work and then make judgements on this.
[00:10:06] Maria Gorinova: It is dangerous because then we're going to make decisions about a product or a feature or even just our day-to-day workflow that are based on this one isolated example. That is just an anecdote.
[00:10:19] Simon Maple: Yes.
[00:10:20] Maria Gorinova: And it does. It's not reflective of the average. It's not reflective of the population.
[00:10:25] Maria Gorinova: So by doing evals, which involves really doing it on many examples, on a population, then we are able to start making those statistical judgements about how the model is behaving, the agent is behaving, or whatever tool we are using is behaving.
[00:10:43] Simon Maple: So it's not just about the correctness of an agent, it's about how often that agent will do the right thing.
[00:10:50] Maria Gorinova: Yes. Because it's never going to be, well, never say never, but statistically it's never going to be a hundred percent correct on anything. Yeah. If you make it a hundred percent correct on a dataset, you are likely running into overfitting issues. Yeah. You are likely not generalizing beyond that dataset that you're evaluating it on.
[00:11:09] Maria Gorinova: So it's very important for us to work with evals that actually statistically measure how well we are doing on a population.
[00:11:21] Simon Maple: And there are of course many different things that affect that number. Or, or that eval score. One of which is the model itself, the underlying model, how good the, the LLM is at being able to perform that task.
[00:11:34] Simon Maple: But of course, there are a bunch of other things, such as the prompt you use, the context you provide, and things like that, that, you know, that give the LLM a better chance of success. What, we'll jump into that in a little bit more depth in terms of the eval. What is the, what are, what are your eval suites look like for this, for, for this report that we, that we released.
[00:11:55] Maria Gorinova: Yeah, good question. So what we did was we generated a dataset, an evaluation dataset.
[00:12:02] Simon Maple: Yeah.
[00:12:02] Maria Gorinova: That consists of pairs of a question, a coding question.
[00:12:06] Maria Gorinova: And a criteria for evaluation.
[00:12:10] Maria Gorinova: But these questions, they don't come from thin air. Hmm. We actually take an existing library, an open source library, and we use an agent to analyze this library and then generate the question and the evaluation criteria based on the real API of this library. And we focus on questions where the nature of the coding task is that it tries to evaluate a person's, or an agent's, ability to use a library efficiently and correctly.
[00:12:47] Maria Gorinova: So it's less.
[00:12:50] Maria Gorinova: The typical coding tasks that agents are used to are sort of LeetCode-style, you know, general implement-from-scratch, something like breadth-first search, that sort of task. This is not it. This is more of a: oh, you have Pydantic, yeah, the Pydantic Python library. Now use Pydantic to generate that sort of class with that sort of validation, you know?
[00:13:15] Maria Gorinova: So it's more about, it's more focused on using the library. Yeah. So yeah. So that's the data set. Yeah. And then we, in order to evaluate a method, we ask it to solve the question. Mm-hmm. And then separately we use an agent as a judge Mm-hmm. To, evaluate the solution based on the criteria that we previously generated.
[00:13:38] Maria Gorinova: Right, right. Okay. So, and the criteria is all about how the API was used.
[00:13:42] Simon Maple: Gotcha. So your tests are twofold, and they're essentially: this is the challenge, and then this is what I expect, this is the criteria. This is how I expect a correct solution to look: it needs to use this library, it needs to use this method on the library.
[00:14:01] Simon Maple: Perhaps it needs to pass this information, and we need to be able to see this kind of outcome. So it still allows the LLM to be creative in whatever way it wants to code it, but it needs to hit those several things which require that correct use of the library.
[00:14:20] Simon Maple: Okay. So when things go good, they go good: you get a pass. When things go bad, what are the kind of things that we're looking for? I can already guess things like, you know, hallucination of an API that doesn't exist, maybe very often because it's actually picking the wrong version of the API.
[00:14:39] Simon Maple: Is that a common thing? Yeah. What are the things that we look for in failures?
[00:14:44] Maria Gorinova: Yeah. So what you've mentioned with the versions happens, for sure. Pydantic, actually, that I mentioned already, is a good example of this, because Pydantic version two is quite different to Pydantic version one, and LLMs often confuse them.
[00:14:57] Maria Gorinova: But also, I think there is a more fundamental problem, which is new or private libraries, right? Yeah. Things that might not be in the pre-training data of the LLM. Mm-hmm. So there is no preexisting knowledge about this.
[00:15:18] Maria Gorinova: And now suddenly you are asked to use this library, but you don't know anything about it. So if it's just an LLM, it's just going to hallucinate.
[00:15:28] Maria Gorinova: If it's an agent, it might do web search, it might read documentation and so on. But this is quite expensive, quite a lengthy process.
[00:15:35] Maria Gorinova: And it might not find what it needs because some libraries are not that well documented.
[00:15:39] Simon Maple: Yeah.
[00:15:39] Maria Gorinova: So niche libraries, not well known libraries, libraries that have a smaller community, maybe an academic community behind them, things like this. They're going to be overlooked.
[00:15:50] Simon Maple: Yeah.
[00:15:50] Maria Gorinova: And not to mention private libraries where there is no preexisting knowledge.
[00:15:55] Maria Gorinova: Yeah. And you might not be able to access that much knowledge about them. Yeah.
[00:15:59] Simon Maple: So the problems, I think, you know, when you say those types of issues that agents and LLMs have, I'm sure this isn't news to our listeners. They'll have recognized it, they'll have seen this firsthand.
[00:16:14] Simon Maple: Not every single time, but every now and then they go, oh yeah, I'm coding and this happens. Mm. People recognize it. So what's the solution? How do we make LLMs more predictable, able to say, yes, I know where to get that information from? How do we get the answers, the eval scores, higher?
[00:16:34] Simon Maple: What did you learn?
[00:16:35] Maria Gorinova: Well, it's all about context, Simon.
[00:16:37] Simon Maple: Oh, of course it is. Of course, it is.
[00:16:38] Maria Gorinova: So what Tessl did: we launched something that we call the Tessl Registry.
[00:16:48] Maria Gorinova: That is a collection of more than 10,000 tiles, we call them tiles.
[00:16:53] Simon Maple: I could see between you and I, we know what the previous name was and I know our listeners maybe don't, but I could see, I could see your thought process.
[00:17:02] Simon Maple: There was saying, don't say this, don't say this, but we, we we're good. We got through it. And I think, I think despite my, despite my commentary there, I think we got away with it.
[00:17:12] Maria Gorinova: Yes. Alright. Yeah. So the tiles. Mm-hmm. I should tell you what tiles are. So tiles, they're collections of context.
[00:17:20] Maria Gorinova: And in this case, the 10,000, the more than 10,000 tiles, they have context about libraries.
[00:17:27] Maria Gorinova: About different libraries and different versions of those libraries. So everything is versioned. And what we found with our evaluation that we previously talked about is that if we give the agent the tiles for the libraries that it's using in a specific project, then it has better abstraction adherence.
[00:17:50] Maria Gorinova: So remember the abstraction adherence, the adherence to those specific APIs when using a particular library. We observed that the agent behaves better and is more likely to use those abstractions correctly when it uses the Tessl tiles.
[00:18:06] Simon Maple: Yeah. So there's a couple terms there. Let's jump into tiles and then, and then, adherence abstraction, no.
[00:18:12] Simon Maple: Abstraction adherence. Abstraction adherence. That's right. I suppose adherence abstraction is kind of a thing as well. No. No, it's not. It's not. But let's start with tiles. So I always, I always love thinking of tiles as almost like if you think about a package like an NPM package or a Java package or something like that.
[00:18:28] Simon Maple: A collection, a grouping of, of some source code and some whatever it is, some part, part of your application. And it, and it's typically, you know, a package for that ecosystem. A tile for me, I always think of it in a similar way. A tile is like a package for your context. And, and of course there are three types of context, and I think the one, what we're gonna be talking about today is more about the documentation.
[00:18:48] Simon Maple: So there's documentation, that's right, rules, and third is commands, I think, right? But that's a little bit more future thinking. The first is documentation, which is context that describes the best usage of things like open source packages, or beyond.
[00:19:07] Simon Maple: We can talk about private packages and things like that as well. Rules are more descriptions of how you want an agent to behave. So it's actually a, it's kinda like steering an agent in a particular way or doing a process in a particular way. How do those two between documentation and, and kind of rules, how does an agent, use those two?
[00:19:30] Simon Maple: Is, is one more forced upon it? One more, kinda like optional for it? What's the, what's the way that works?
[00:19:35] Maria Gorinova: Oh, that's an interesting question. Yes. The way I think about it is: documentation is more about, as you say, knowledge, more about something that's not really behavioral. Yeah. It's just there, it's just facts.
[00:19:52] Maria Gorinova: And that is something that the agent can reach for. Right. So when, when it doesn't know something, for example, how to use the library.
[00:19:58] Simon Maple: Yeah.
[00:19:59] Maria Gorinova: It can, it can reach for it, it can ask a question. And go look for it. Rules are more behavioral. And they're more about steering the agent.
[00:20:09] Maria Gorinova: And this is not something that it makes sense for the agent to reach for; it doesn't really know how to reach for behavioral instructions, do you know what I mean? Yeah. It's more of a "this is encoded in me" or, yeah, you know, "this is what I've been asked to do."
[00:20:27] Maria Gorinova: So there's a slight subtlety there. Yeah. And it's a bit subtle; I don't think there is a mathematical formula to distinguish between the two. But that's how I think about it: reaching. Can it actually reach, does it make sense for it to ask a question?
[00:20:44] Simon Maple: Yeah.
[00:20:45] Maria Gorinova: About that piece of information.
[00:20:46] Simon Maple: And it's interesting, cause I think of the volume of context that you would expect in each of those. Documentation is a large knowledge base, right? You wouldn't expect to push all of that into context. Exactly, yes. Whereas with a set of rules you'd expect, yeah, okay, this is important.
[00:21:02] Simon Maple: I need you to follow this, and you'd expect that to be a much shorter kind of artifact, so it could probably keep that in context. No problem at all.
[00:21:09] Maria Gorinova: Not just expect it, actually. You know, this is getting ahead of ourselves, but one thing we've observed in other experiments is that the longer the steering context is, the more you try to ask the agent to behave in certain ways, yeah,
[00:21:26] Maria Gorinova: The less it's going to follow any of those suggestions.
[00:21:28] Simon Maple: Okay. Yeah, that's, that's super interesting actually. And it's like, I always think of it like kids, like I always think of it like kids where it's like, LLMs are kids basically. Or maybe my kids are just acting like LLMs, but that would be worrying.
[00:21:40] Simon Maple: Yeah, it's like, it's like you wanna, you wanna give your kids knowledge to be able to use at the right time, at the appropriate time when they decide to use that knowledge. But there are certain rules. You wanna give them, and you always expect them to abide by those rules. And so like, that's a nice little parallel in my mind that kind of, that kind of goes off.
[00:21:56] Simon Maple: The second term that we mentioned was abstraction adherence. So you mentioned the levels of abstraction, and of course open source, using open source, being a key level of abstraction that we've learned well over decades of development. It's the adherence of agents to using those; that's abstraction adherence in terms of what we're talking about here.
[00:22:23] Maria Gorinova: Yes, it is about the agent adhering to the abstractions that are available, given by some library or some private project or something like that. So is the agent actually using that abstraction? We have a very good example in the blog post, which has to do with attention, for the machine learning community, in PyTorch.
[00:22:52] Maria Gorinova: The PyTorch library. So PyTorch provides a one-liner implementation of attention, but only after version 2.0 or something, I don't quite remember the version. So if you ask an LLM that is trained before this, it would generate attention from scratch, while if you provide context for using Torch,
[00:23:21] Maria Gorinova: It'll actually use the one-liner implementation of attention. Right. And that is good for many reasons. That is good because that implementation of attention is optimized. It's much faster. There are actually papers that show this. Yeah. So we want that. And this is what abstraction adherence is about.
[00:23:39] Maria Gorinova: So did the agent actually use the intended abstraction, or did it implement it from scratch?
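To make that PyTorch example concrete, here is a minimal illustrative sketch (our own, not code from the report): a hand-rolled scaled dot-product attention next to the one-liner torch.nn.functional.scaled_dot_product_attention that PyTorch ships from version 2.0 onwards, which can dispatch to fused, optimized kernels.

```python
import torch
import torch.nn.functional as F

# Toy tensors: (batch, heads, sequence length, head dimension).
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)

# What an agent without up-to-date context tends to produce:
# re-implementing scaled dot-product attention by hand.
def attention_from_scratch(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# The abstraction PyTorch already provides (version 2.0 and later).
out = F.scaled_dot_product_attention(q, k, v)

# The two agree up to floating point tolerance.
print(torch.allclose(out, attention_from_scratch(q, k, v), atol=1e-5))
```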
[00:23:46] Simon Maple: Right. Right. Okay. Let's jump into the results, but first, why don't we talk about the scenarios. What was the baseline, I guess, that we kinda like worked out? This is from scratch, what we wanna improve from?
[00:24:00] Simon Maple: What did that look like?
[00:24:01] Maria Gorinova: The baseline is just a vanilla agent, either Claude Code or Cursor. These are the two agents that we evaluated against. And just letting it do its thing, right? So with no restrictions: it can search the web, it can read the code if it wants to, but it's not explicitly
[00:24:25] Maria Gorinova: asked to do any of those things.
[00:24:25] Simon Maple: So, do I dare ask which was better between Claude Code and Cursor in the results? Or is that setting a fire underneath something? I don't need to set a fire.
[00:24:34] Maria Gorinova: I can't tell you.
[00:24:35] Simon Maple: Okay. Okay. That's fine.
[00:24:37] Maria Gorinova: Check the blog post.
[00:24:38] Simon Maple: Check the blog post. See, this is really good: check the blog post.
[00:24:39] Simon Maple: Okay. Cool. Um, so in terms of the key result, sorry, in terms of the next scenario: what would the next scenario beyond the baseline be that we tried?
[00:24:50] Maria Gorinova: Yes. So before getting to our approach, we tried another scenario, which was explicitly telling the agent to read the source code of the package.
[00:25:03] Simon Maple: Okay.
[00:25:03] Maria Gorinova: So including a folder within the project that has the source code, and telling the agent that it can read the source code to double-check implementation. Yeah. Right. We wanted to include this scenario because we thought this is kind of the not-distilled, not specially curated version of tiles.
[00:25:30] Maria Gorinova: Yeah. Right. Yeah. Because you can imagine the source code itself is like a really inflated tile.
[00:25:37] Simon Maple: Yeah, yeah.
[00:25:37] Maria Gorinova: Because it has, it has context. Yeah.
[00:25:39] Simon Maple: It's just huge. It's just like massive in terms of the volume.
[00:25:42] Maria Gorinova: It's not, it's not optimized for reading by an agent. Right? Yeah. Yeah.
[00:25:47] Simon Maple: And of course not always available, right?
[00:25:48] Simon Maple: Like, I come from Java land where we have, you know, the wonderful world of compiled code, where once you've compiled that code it's much, much harder, I guess, for the agent to be able to read and understand it, because it doesn't really understand bytecode as much as it does the typical GitHub source.
[00:26:05] Maria Gorinova: Exactly. And everything will be also dependent on the language and Right. The frameworks you're using. Yeah. Yeah, absolutely.
[00:26:12] Simon Maple: So we have the baseline, just Claude Code and Cursor. We don't have access to the source code. And then beyond that?
[00:26:19] Maria Gorinova: And then the third one is using the Tessl registry.
[00:26:24] Simon Maple: Of course we couldn't do this without Tessl, could we? So we have the Tessl Registry, and the Tessl Registry contains the context that describes the open source packages. In the source code scenario, agents have access to the full source code and they have to work it out themselves, but they have everything in front of them.
[00:26:44] Simon Maple: And then literally just training data, essentially, with the baseline. What were the results?
[00:26:50] Maria Gorinova: The results were that using Tessl tiles on average showed 35% higher abstraction adherence.
[00:26:59] Maria Gorinova: Which was even higher for newer libraries. So libraries that were released in the last three years,
[00:27:06] Maria Gorinova: there we saw a 50% improvement.
[00:27:09] Simon Maple: A 50% improvement, so 50% more likely to use the open source in the correct way, per the criteria that was set of how we would expect a developer, or an agent as well, to use an open source library to solve the problem that was given. Yes. That's incredible.
[00:27:29] Maria Gorinova: Yeah, I think so as well. And we saw a lot more interesting things there. There are so many learnings from this experiment. Really, check the blog post, it is very, very interesting. But for example, we saw that the source code setting also performs well,
[00:27:50] Simon Maple: Yeah.
[00:27:51] Maria Gorinova: But it was much slower and, yes, it took more turns. And I think it makes sense. It's because the information is there, the context is there, but it's not optimized for usage. It is messy. The agent needs to go and search for it and grep all the time and do a lot of turns, and this slows things down.
[00:28:12] Simon Maple: Yeah.
[00:28:13] Simon Maple: So a turn here, is that the agent itself making more and more decisions about what it needs to do, or is that back and forth with the user?
[00:28:23] Maria Gorinova: It's agent turns. It's the steps that it takes in order to, gotcha, to get there. And it's because it has to
do a lot more manual work. Yeah. In order to
[00:28:35] Simon Maple: So lengthier, and I guess as a result more expensive just in sheer tokens that it's gonna use. And presumably, was there any context bloat as well because it does all that? Or was that more done in, I guess, a subagent?
[00:28:48] Maria Gorinova: That's a very good question.
[00:28:50] Maria Gorinova: I think Claude Code in particular uses subagents extensively. Yeah. So we didn't necessarily observe anything like this, but I expect that for complex projects we would,
[00:29:05] Maria Gorinova: whereas, I feel, you know, our tasks were quite small. It wasn't something super complicated that needed to be done, especially if you're using the abstractions correctly.
[00:29:16] Maria Gorinova: Yeah. But for larger projects, where the task is to implement a whole feature or something of the sort, I expect there'll be more problems, yeah, with exactly this, because there will be so much more that the agent will need to manually discover.
[00:29:33] Simon Maple: Yes. Yeah, absolutely. In terms of, you mentioned newer libraries, or, yeah,
[00:29:41] Simon Maple: newer versions of libraries would cause a greater number of problems, and that's where you saw Tessl succeed well. Which kinda makes sense, because, you know, there's less chance of it being in the training data; fewer examples out in the wild for the agents to train on, I guess.
[00:30:00] Simon Maple: What about super old, legacy-style libraries where, you know, I guess again, similarly, you're gonna have fewer people potentially using the older versions? But I'm sure there's a ton of listeners we have who have these dungeons of legacy apps that they don't wanna go near because they're
[00:30:20] Simon Maple: petrified of making changes. And, you know, it's a wonderful place for agents to actually come in and start understanding that code base. But can we trust the agent to equally make changes in that code base? How did legacy applications perform?
[00:30:35] Maria Gorinova: Very good question. So what we observed is that the baseline
[00:30:41] Simon Maple: Yep.
[00:30:41] Maria Gorinova: underperformed for very old and relatively old packages, and for very new packages.
[00:30:48] Maria Gorinova: So the graph looks like this. Yeah.
[00:30:51] Simon Maple: Kinda.
[00:30:52] Maria Gorinova: And I think that kind of makes sense. Yeah. Because probably the training data was really focused towards relatively new libraries cause that's what's more used and, and so on.
[00:31:05] Maria Gorinova: But not the newest ones, because there is less information about them. So it checks out in my mind. What we observed when using Tessl tiles is that this was relatively stable. It did not depend so much on how old or new the library is. It was relatively stable performance, and higher than that of the baseline in all cases.
[00:31:29] Simon Maple: Yeah. How about popularity of libraries as well? Because obviously the most popular libraries are probably gonna have a ton of usage examples in open source repositories today, so much of that will get into the training data, and the agents will have a good chance; maybe getting versions and things like that slightly wrong, but a good chance of actually getting close.
[00:31:51] Simon Maple: What about libraries that are lesser known, lesser used, with fewer examples in the wild? How did they perform?
[00:32:07] Maria Gorinova: Yeah. So again, what we thought of intuitively, we saw in the data, which is that the agents had a negative bias towards more niche libraries, niche here judged by the number of forks on GitHub.
[00:32:25] Maria Gorinova: So we used that as a metric, and then we could see that the baseline, so just the agent, performed worse on libraries with fewer forks. Well, that wasn't the case for tiles. For the baseline, we saw a big cliff there.
[00:32:47] Simon Maple: Yeah. So, using tiles, would you say you can get those niche or maybe even legacy
[00:32:54] Simon Maple: versions up to a similar kind of level as more popular ones, or is there still a gap between more popular and more niche, or lesser used, legacy maybe, even using tiles across both?
[00:33:09] Maria Gorinova: Yes, absolutely. So we see that niche libraries and popular libraries, when using Tessl tiles, are kind of at the same level of performance
[00:33:19] Simon Maple: Wow.
[00:33:19] Maria Gorinova: When it comes to abstraction adherence.
[00:33:21] Simon Maple: Yeah. Yeah. Gotcha. There's also a case study in the report, a very impressive case study. And it's with LangGraph, right?
[00:33:31] Maria Gorinova: That's right. Yeah. LangGraph.
[00:33:32] Simon Maple: Tell us a little bit about, well, a little bit maybe about LangGraph, but again, what.
[00:33:37] Simon Maple: What were the results with this case study?
[00:33:38] Maria Gorinova: Yeah, so that was a case study that one of the other authors of the blog post did, Rob. The idea there was to really deep dive on this particular library, and in particular deep dive on new features: features that we knew were introduced after the training cutoff.
[00:34:03] Maria Gorinova: I think it was January; I think it was Sonnet 4.5. But yeah, check the blog post.
[00:34:09] Simon Maple: I think. Yeah, I think it's January. January this year, I think.
[00:34:11] Maria Gorinova: Yes, yes. So, really looking at those features that were introduced afterwards, so that we know they were not part of the pre-training data
[00:34:22] Maria Gorinova: Yeah. For the model.
[00:34:23] Simon Maple: And presumably LangGraph is a good example to choose because it's had a number of releases fairly frequent, and as a result you see that change. So it's a good example. Not all, this is, not all libraries will be like this, but it's a good example of when you see changes fairly regularly.
[00:34:38] Simon Maple: It's something like LangGraph. Yes. Yeah. Yes,
[00:34:40] Maria Gorinova: Exactly. And what we saw there was that when using the LangGraph tiles, the agents performed up to 90% better on those new features. Yeah, which is, I mean, quite impressive, I think. It makes sense, because now we are giving them the context and we are making it so much easier for them.
[00:35:09] Simon Maple: Yeah.
[00:35:09] Maria Gorinova: But also, isn't it cool? Yeah. Right. Like how much easier we can make it for them. Yeah. Just by
[00:35:17] Simon Maple: And very simple to implement, actually. We'll talk about that in just a sec. One burning question that I've got is, we mentioned a number of the problems, yeah, at the start of this podcast episode, with agents and how they get certain things wrong.
[00:35:31] Simon Maple: Have we just fixed everything then, or are there still issues? What are the remaining problems, I guess? Even with using something like Tessl's tiles as context, what are the remaining problems that we should still expect to see?
[00:35:49] Maria Gorinova: I think to me the biggest, the biggest problem I see is with making the agent use this context correctly.
[00:36:01] Maria Gorinova: So the steering of the agent: how does the agent actually know when to reach for that context versus not to reach for that context? This is the big question. And I've seen it over and over again: steering is quite difficult to do.
[00:36:18] Simon Maple: Yeah.
[00:36:18] Maria Gorinova: And it's also something that's quite difficult to evaluate, which makes things hard, because, as I mentioned at the beginning, we need to be able to measure something, yeah, in order to improve it.
[00:36:23] Maria Gorinova: So I think it's quite tricky, because a lot of it is that we are at the mercy of whoever trains the models. Yeah. Right. We are at the mercy of the big labs.
[00:36:44] Maria Gorinova: Yeah. Because that is the biggest factor in how the agent behaves. How does it know what the next step is? Right. What training data they put in post-training in order to make it behave this way.
[00:37:00] Simon Maple: Yeah.
[00:37:00] Maria Gorinova: Of course, there is a lot we can do to control this. And we have been very busy with that at Tessl.
[00:37:10] Maria Gorinova: But it's a hard problem. Yeah. It's a hard and a very interesting problem.
[00:37:13] Simon Maple: Interesting. How would you recommend someone get started if they wanted to try the Tessl tiles themselves?
[00:37:18] Maria Gorinova: Go on the website. Yeah, go on the website, check out the Tessl registry, and try it out. Anyone can try it. For public libraries.
[00:37:28] Maria Gorinova: Yeah. Anyone can, can go and install Tessl and get going.
[00:37:33] Simon Maple: So, yeah, it's an MCP server that you can add to your agents, and then your agents can effectively, through MCP, allow Tessl to, you know, install that context, search for that context, and then provide that steering so that the agent uses that context for its requests.
[00:37:51] Simon Maple: And there's also a CLI if you wanted to use it like that. From a producer point of view, you can also create your own context as well, right? Yes. And if you have, I think we mentioned, private libraries, or if your open source library isn't captured by our 10,000 tiles already,
you can actually create a markdown-like spec as your context. You can publish that into your private registry and even make a request to make that public in the global registry. So there's a bunch you can do there, as both a consumer, as a user, as well as
someone who wants to publish that context for others to use with their libraries or private
[00:38:36] Simon Maple: libraries.
[00:38:36] Maria Gorinova: Yeah, absolutely. Absolutely.
[00:38:38] Simon Maple: Amazing. Maria, thank you so much, first of all for, for sharing the work on the blog with, with your colleagues Rob and Max and, Dru.
[00:38:47] Simon Maple: And thank you for coming on the podcast. I really had a great time having a chat about this and doing a deep dive into it. Always love chatting evals, and it won't be the last time, I'm sure. But Maria, thank you so, so much.
[00:39:01] Maria Gorinova: Thank you so much. Thank you for having me.
[00:39:03] Simon Maple: Appreciate it. And for those who wanna read that blog, check it out on the Tessl blog, but we'll also mention it in the show notes. Thanks very much for listening, and we'll see you next time. Bye for now.
In this episode of AI Native Dev, host Simon Maple and guest Maria Gorinova from Tessl explore a groundbreaking research report on coding agents' ability to effectively use existing libraries and abstractions. They discuss the importance of evaluating agents not just on functional correctness but on their adeptness at leveraging real-world APIs, highlighting how developers can improve agent performance by providing curated context and integrating library usage into workflows. Tune in to learn how these insights can help teams create more efficient and maintainable code with AI assistance.
In this episode of AI Native Dev, host Simon Maple sits down with Maria Gorinova, Member of Technical Staff on the AI engineering team at Tessl, to unpack a new research report on how coding agents perform when they must use existing libraries and abstractions—both with and without extra context. Instead of asking agents to implement algorithms from scratch, Maria’s team evaluates whether agents can correctly and efficiently use real-world APIs. The discussion dives into why evals matter, how the benchmark was built, and what developers can do today to make agents more library-savvy in everyday workflows.
Maria frames the core challenge: software is built on layers of abstractions, and productive engineers don’t reinvent low-level logic when solid libraries exist. Yet, many coding agents default to reimplementing functionality from scratch or use libraries clumsily, requiring human micromanagement. That’s inefficient, costly, and leaves teams with code that’s harder to maintain and reason about.
In modern development, agents must integrate into existing codebases and follow the patterns, dependencies, and APIs teams already trust. Using known libraries yields performance gains, cost savings, predictable behavior, and shared understanding across teammates. Since humans still collaborate closely with agents, the code produced needs to be idiomatic and familiar—not bespoke stovepipes that diverge from the project’s conventions or dependencies.
The report zeroes in on this gap by benchmarking an agent’s ability to use libraries properly. Rather than “solve the algorithm,” the tasks say, “use the library to solve the problem.” Think: “use Pydantic to define a validated data model,” not “implement your own validators.” That shift brings the benchmark closer to everyday engineering and exposes how agents handle real abstractions.
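As a concrete illustration of this shift (a sketch of our own, not an example taken from the report), here is how a library-adherent solution expresses constraints through Pydantic v2's declarative API rather than hand-written checks; the model and field names are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, Field, field_validator


# Library-adherent solution: constraints are declared through Pydantic's API
# instead of custom if/raise validation scattered through the code.
class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0)          # must be positive
    description: Optional[str] = None   # optional free text

    @field_validator("name")
    @classmethod
    def strip_name(cls, value: str) -> str:
        return value.strip()


product = Product(name="  Keyboard ", price=49.99)
print(product.model_dump())
# {'name': 'Keyboard', 'price': 49.99, 'description': None}
```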
Simon and Maria underline a key reality of AI systems: they’re probabilistic. One great or terrible interaction with a model is just an anecdote. Without evaluating agents across many examples, teams risk making product decisions and workflow changes on misleading impressions. Good evals quantify both what an agent can do and how often it does it reliably.
Traditional coding benchmarks—like those focused on functional correctness—are useful but incomplete. They typically test whether the final behavior matches expected outputs (often via unit tests). Maria’s team wanted to capture a different dimension: whether the agent uses the right abstractions to achieve that behavior. A solution that passes tests but recreates a JSON parser by hand instead of using the project’s standard utility is still problematic in real projects.
Evals also reveal how other levers—prompt design, context quality, and model choice—affect outcomes. A stronger model may help, but poorly structured or missing context can still derail library usage. By measuring performance statistically across many tasks, the team can isolate what truly moves the needle.
Tessl’s evaluation dataset is built from pairs of “question + evaluation criteria,” grounded in real open-source libraries. An agent first analyzes a library’s API surface and documentation, then generates coding questions that demand the correct use of that API. For example, in Python: “Use Pydantic to define a model with specific validation rules,” rather than “write validation logic from scratch.”
Each question comes with explicit evaluation criteria keyed to the library's abstractions. The scoring rubric emphasizes "API adherence": Did the solution call the correct functions and options? Did it rely on the library's intended patterns and types? Is the solution idiomatic to the library and aligned with its guarantees? This pushes beyond pass/fail outputs to assess whether the approach leverages the right abstraction.
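The report does not publish its dataset format, but an entry in this style of benchmark might look roughly like the following hypothetical sketch, pairing a library-grounded question with an API-adherence rubric:

```python
# Hypothetical shape of one evaluation example: a coding question grounded in a
# real library, plus criteria about *how* the API should be used, not just
# whether the output is functionally correct.
example = {
    "library": "pydantic",
    "version": ">=2.0",
    "question": (
        "Define a `Product` model with a non-empty `name`, a positive `price`, "
        "and an optional `description`, using Pydantic."
    ),
    "criteria": [
        "Defines a class inheriting from pydantic.BaseModel",
        "Uses Field constraints (e.g. gt=0, min_length=1) instead of hand-written validation",
        "Marks `description` as optional with a default value",
        "Does not re-implement validation logic that the library already provides",
    ],
}
```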
To scale evaluation, the team uses an agent-as-judge to score the generated code against the criteria. This judge is separate from the solving agent and is guided by the rubric to focus on library use, not just end results. While agent-as-judge introduces its own considerations, it enables rapid iteration on large datasets. The result is a benchmark tailored to the real challenge developers face: getting agents to use libraries effectively.
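A minimal sketch of that judging step, simplified to a single model call (the report uses an agent as judge) and assuming a generic call_llm helper, which is hypothetical and stands in for whatever model client you use: the judge sees only the rubric and the candidate code and returns a per-criterion verdict plus an aggregate score.

```python
import json


def judge_solution(solution_code: str, criteria: list[str], call_llm) -> dict:
    """Score candidate code against API-adherence criteria using an LLM judge.

    `call_llm` is a hypothetical callable: prompt string in, response text out.
    """
    rubric = "\n".join(f"- {c}" for c in criteria)
    prompt = (
        "You are reviewing code for correct use of a library's API.\n"
        "For each criterion, answer pass or fail with a one-line reason.\n"
        f"Criteria:\n{rubric}\n\n"
        f"Candidate solution:\n```python\n{solution_code}\n```\n"
        'Respond with JSON only: {"results": [{"criterion": "...", "pass": true, "reason": "..."}]}'
    )
    verdict = json.loads(call_llm(prompt))
    # Aggregate score: fraction of criteria that passed.
    verdict["score"] = sum(r["pass"] for r in verdict["results"]) / len(criteria)
    return verdict
```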
A major part of the study probes how context impacts agent performance. Agents were tested with and without different forms of support—library docs, API signatures, examples, or repository-specific guidance. If you ask an agent to use Pydantic but don’t provide its docs or code references, you’re forcing the model to recall details from pretraining or guess. With well-targeted context, you dramatically narrow the search space and nudge the agent toward the library’s “happy path.”
For developers, this suggests a practical recipe. Package and retrieve the most relevant context: minimal API docs, type signatures, short examples, and project-local usage samples. Compact context beats a massive dump—curate snippets that clearly demonstrate canonical usage and constraints. When possible, add an explicit rubric to the prompt: “Prefer library calls over custom code; do not reimplement X; follow these patterns.” This aligns the agent with the evaluation criteria.
Beyond raw context, prompt structure matters. A system message that states “You must use the project’s existing libraries” and enumerates the preferred import paths sets expectations. You can also include “banned patterns” (e.g., “Do not write your own JSON schema validator”) and ask the agent to self-check for violations before finalizing an answer. These guardrails often pay bigger dividends than swapping models.
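One lightweight way to encode these guardrails is a reusable system prompt along the following lines; the wording and the wrapper module name (app.http_client) are illustrative assumptions, not taken from the report.

```python
# Illustrative system prompt: names the required libraries, bans reinvention,
# and asks for a self-check before the final answer.
SYSTEM_PROMPT = """\
You are contributing to an existing codebase. Follow these rules:

1. Prefer the project's existing libraries over custom code.
   - Data models and validation: use pydantic (v2).
   - HTTP calls: use our wrapper `app.http_client` (never `requests` directly).
2. Banned patterns:
   - Do not hand-roll JSON schema validation.
   - Do not re-implement functionality the libraries above already provide.
3. Before finalizing, re-read your answer and check that every rule above is
   satisfied. If any rule is violated, revise and only then output the final code.
"""
```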
If you’re integrating coding agents into your team’s workflow, start by making library usage an explicit goal. Bake it into your prompts, your tooling, and your acceptance criteria. For day-to-day tasks, supply curated context: the library’s README snippet, key function signatures, one or two canonical examples, and any project-specific wrappers or utility functions the team expects. Keep the context fresh and easy to maintain.
Adopt a lightweight eval loop for your own codebase. Create a small suite of tasks that represent your common patterns—“create a DB migration using our ORM,” “add a Pydantic model for this payload,” “call our HTTP client wrapper instead of requests directly.” Pair each with a short rubric: which APIs must be used, which anti-patterns to avoid. Run your agent against this suite periodically and whenever you change models, prompts, or context retrieval strategies.
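A lightweight version of that loop might look like the sketch below, reusing the hypothetical judge_solution helper from earlier and assuming a run_agent callable that sends a task to your coding agent and returns the code it produced.

```python
def run_eval_suite(tasks, run_agent, judge_solution, call_llm, n_samples=3):
    """Report average API-adherence per task across repeated samples.

    `tasks` is a list of dicts with "question" and "criteria" keys;
    `run_agent`, `judge_solution`, and `call_llm` are hypothetical helpers.
    """
    report = {}
    for task in tasks:
        scores = []
        for _ in range(n_samples):  # agents are stochastic, so sample repeatedly
            code = run_agent(task["question"])
            verdict = judge_solution(code, task["criteria"], call_llm)
            scores.append(verdict["score"])
        report[task["question"]] = sum(scores) / len(scores)
    return report
```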
Add verification layers. Simple AST or regex checks can catch telltale reinvention (e.g., hand-rolled parsing). Unit tests should verify behavior, and a rubric-based judge (human or agent) should verify library adherence. Consider sampling multiple candidates at a low temperature and selecting the one that passes tests and rubric checks. In prompts, explicitly instruct the model to self-critique: “Verify you used the specified API,” then revise before finalizing. These practices make agents more predictable and their outputs more maintainable.
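Some of those checks need no LLM at all. For example, a small AST-based verifier using Python's standard library can flag direct requests imports when the team expects an in-house wrapper instead (the banned-module list here is an illustrative assumption):

```python
import ast

# Modules the team does not want imported directly (illustrative choice).
BANNED_MODULES = {"requests"}


def check_banned_imports(source: str) -> list[str]:
    """Return a list of violations found in candidate code."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        for name in names:
            if name in BANNED_MODULES:
                violations.append(f"line {node.lineno}: direct import of '{name}'")
    return violations


print(check_banned_imports("import requests\nfrom requests import get\n"))
# ["line 1: direct import of 'requests'", "line 2: direct import of 'requests'"]
```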
Maria hints that more research is coming—this is an evolving space. As benchmarks mature, we can expect more nuanced scoring (e.g., degrees of idiomatic use, performance-aware choices, and maintainability metrics) and wider coverage of languages, frameworks, and domains. Future iterations may assess multi-step planning: reading docs, proposing an approach, implementing with the chosen APIs, and self-checking against the rubric.
On the product side, better retrieval, richer tool use, and tighter library adapters will reduce the need for micromanagement. However, even as models improve, the underlying engineering truth remains: good software depends on good abstractions. Teams that articulate, expose, and test those abstractions—through docs, examples, and evals—will get the most from coding agents.
Ultimately, this report reframes the question from “Can an agent code?” to “Can an agent code like our team codes?” That’s the standard that matters in production.
