
Baruch
[00:00:00] Simon: You said, we want something to automate the test.
[00:00:02] Simon: We want something to create a test because we don't want to look at code. How true is that, though? Because I did a keynote at Devoxx a few weeks ago, and I was talking about how the world is moving into this new space where developers will continue to be creators.
[00:00:21] Simon: We will. We'll do things in a different way. And one of those ways is we will look more at specifications and less at code.
[00:00:29] Baruch: Exactly.
[00:00:31] Simon: A developer came up to me at the end, though. My last slide, because it was Devoxx, was "Keep calm and carry on", because we as developers need to change.
[00:00:41] Baruch: Evolve. Evolve.
[00:00:42] Simon: He was super worried though. He was like, how can you put that slide up when it means we're not looking at code? He loves coding.
[00:00:49] Baruch: Yeah, yeah, yeah.
[00:00:50] Simon: Absolutely. His safe space was clearly in an IDE looking through code. Yeah. His mind was extremely technical. Of course. Loves the challenges. Loves the depth. Yeah. So how do those people deal with that?
[00:01:01] Baruch: So I think when I say we don't want to look at code, we mostly don't want to look at code which is not ours. I mean, we're humans. I agree. Well, selfish. We're self-centric.
[00:01:13] Baruch: Most of the time when we listen to someone, what we're really doing is coming up with a clever reply to what they're about to say, without even listening. Yeah. We know all those flaws. Humans are flawed and very self-centric. And this is true about code as well. Like the example I gave you earlier: I'm obviously in love with my own code. But I'm much less inclined to look at your code, because, well, that's just some other code.
[00:01:40] Simon: Or my code that's one month old, which might as well be someone else's code.
[00:01:45] Baruch: Oh yeah, absolutely. And this is the problem with AI-generated code: it's someone else's code. So you might look at it, especially now when it's a novelty, out of curiosity to see what the LLM does.
[00:02:00] Baruch: Yeah. And you're like, oh my God, this is brilliant, or this is crap. But you look at it because you're curious. In six months, when more and more code is generated by AI, you'll just go: okay, someone else's code. So first of all, you don't want to look at this code. And obviously, non-technical people definitely don't want to look at this code.
[00:02:12] Simon: They want to look at intent. They want to see: this test does this, and I can read that.
[00:02:24] Baruch: Exactly. Exactly.
[00:02:29] Baruch: Right. And generally, the idea that the tests are technical and written in code is, I think (and that's just an assumption), one of the reasons why TDD really didn't take over the world. Because at the end of the day,
[00:02:43] Simon: TDD, did you say that?
[00:02:44] Baruch: Yeah. Test-driven development didn't take over the world because, at the end of the day, if developers are the only ones who can write the tests and read the tests, the real question becomes, at least for me in my experience, why am I doing it this way? Yeah. I have a problem. I'm biased for action.
[00:03:03] Simon: Yeah.
[00:03:03] Baruch: I know the solution; I see algorithms running, code written before my eyes. I want to go and write code, I don't want to write tests. I don't mind writing tests afterwards to check that my code does what it's supposed to do.
[00:03:11] Baruch: But starting with code doesn't make sense if developers are the only ones involved in the picture. Yeah. And I guess I'm not the only one who thought about this problem. This is one of the reasons why behavior-driven development, the BDD thing, came to be 20 years ago.
[00:03:33] Simon: Yeah.
[00:03:33] Baruch: The way BDD works is: hey, what if the tests weren't defined in code, but were defined in a spec? And what BDD calls a spec is some kind of pseudo-natural language. It's called Gherkin.
[00:03:42] Baruch: And it's basically a set of rules that go: given a context, when something happens, then expect these results.
[00:03:59] Simon: Yeah.
[00:04:00] Baruch: So basically those are the steps. And the beauty of a Gherkin spec is that it's human-readable. And the idea was that it's also human-writable. So the product managers, the business people, maybe even the customer can participate in writing those specs.
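For readers who haven't seen Gherkin, a minimal scenario of the kind Baruch describes might look like this; the feature, the steps, and the numbers are invented for illustration, not taken from the episode:

```gherkin
Feature: Checkout discounts

  Scenario: Returning customer gets a loyalty discount
    Given a customer with 3 previous orders
    When they check out a basket worth 100 euros
    Then the total charged is 90 euros
```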
[00:04:18] Simon: And it's interesting, because when Agile came in and we started writing use cases, and starting with use cases, it was similar. The language is slightly different, but it was actually kind of a similar thing: as a whoever, I want to do something.
[00:04:31] Baruch: Exactly. And the beauty of the idea of a Gherkin spec was that you take an algorithm and compile those specs into tests. Now look at us moving one step forward toward building the integrity that we spoke about. Because now suddenly we have something that technical people and non-technical people can see and write. And then, if you remember the rest of the chain that we started to build, the spec is compiled to the tests.
[00:05:09] Simon: Yeah.
[00:05:10] Baruch: We can trust the tests, because they are exactly what the spec was.
[00:05:12] Simon: Yeah.
[00:05:13] Baruch: The code implements the test. So we can trust the code.
[00:05:18] Simon: Yeah.
[00:05:18] Baruch: Because it has to: if it passes the tests, it does exactly what the tests want. And the tests are the result of compiling something that we agreed our application should do.
[00:05:31] Simon: So in the areas that you care about, you should have as many of those tests as you can. In the areas that you don't care about, you just allow the LLM to infer whatever code it wants, whatever it decides. That's the right way... well, that's a way of doing it. That's an okay way.
[00:05:45] Baruch: If it's wrong, it means that you need some tests that will enforce the behavior you want. It means it misses some spec; it misses one of those given-when-then use cases, and you just need to go ahead and write it.
[00:05:58] Simon: So it's an uncaptured intent, essentially.
[00:06:00] Baruch: Exactly. Yeah.
[00:06:00] Baruch: Now, BDD also didn't really take over the world. No. You see it even less than actual TDD. And again, my speculation why is that writing this structured intent, the spec, was still a little bit too much for the non-technical people. Of course it's readable, but writing it is kind of annoying. When you speak with business people or product managers, they are kind of free souls. They want to create. They want to write Shakespeare. They don't want to write in this rigid structure.
[00:06:42] Simon: Yeah.
[00:06:43] Baruch: You read a PRD and it's like poetry. So again, those specs didn't really catch on, because product managers are like, ah, that's too rigid. And then if they don't do it, developers are definitely not going to do it, right?
[00:06:58] Simon: I also think one of the issues with BDD is there was no hard connection between the BDD intents, the spec part, and anything else. It was left up to someone to then implement that.
[00:07:05] Simon: And as change happened in the business, it gets stale. It gets stale in the same way a PRD would. And then all of a sudden no one goes back to the PRD because it's out of date. The code is the source of truth.
[00:07:25] Baruch: But now, if we automate all the way from the spec to the end product, the spec is very important, but no one wants to write this stupid Gherkin thing. Suddenly, AI can help.
[00:07:33] Baruch: Yeah. Because we don't need to write it ourselves. Once we have a specification document...
[00:07:44] Simon: Yeah.
[00:07:45] Baruch: Right. Whatever we call it, a so-called definition document, whatever, we can ask AI to formulate those specs, those given-when-thens, from it. Now, we know that it will hallucinate, and we know that the spec won't be exactly what we wanted, for mainly two reasons. The first, as I mentioned, is that it hallucinates. The second is that it's very hard for humans to express intent in words.
[00:08:17] Simon: Yeah.
[00:08:17] Baruch: Because of the context that we have in our heads, which the LLM obviously lacks. So there is this chasm between the prompt and the intent, between what we ask it to do and what we really wanted it to do.
[00:08:30] Simon: Yes.
[00:08:30] Baruch: So, for those two reasons, the spec won't be perfect.
[00:08:35] Baruch: But coming back to the biggest benefit of the spec: because it is readable, we can actually review it. It's different from tests, which theoretically we could also declare the source of truth and review instead of specs. But as we said, we don't want to read anyone else's code, and the non-technical people can't even do that.
[00:09:01] Simon: Yeah.
[00:09:01] Baruch: So the spec is different because it's readable, in your language; there are translations of the Gherkin spec into hundreds of languages. You can read it in Urdu if you feel like it.
[00:09:12] Baruch: And this is suddenly something that everybody around the table can sync on and say, okay, now we can read the spec.
[00:09:22] Simon: Yeah.
[00:09:22] Baruch: It's not painful; it's in English. And we can verify that what's in the spec is actually what we wanted to be generated from our prompt. Now, if it's not, we go back and forth and iterate until it is, or we just fix the spec. It's almost plain language.
[00:09:42] Simon: And presumably you could almost use an LLM as a judge as well, to some extent, almost as a final backup, because people are going to be lazy, right? And I think this is one of those interesting things: even if you look at something like code review, people don't like looking at each other's code. And the more text, the more change there is, the more we just glance at it and go, yeah, that kind of looks good, looks good to me. If there are two lines, we're going to be questioning the variable choices. So if there is going to be a big set of tests or a big specification, are we actually going to go through it?
[00:10:27] Baruch: And LLM as a judge could actually be one of those levels of automation; we need that backup, almost. There are potentially two levels there: one for the human, and then we can bring in different models. We can generate the spec with one model and then ask another model whether the intent is being captured, and we can iterate on that process. But at the end of the day, human-readable specs are much easier to review than tests or code. And this is the main idea.
[00:10:53] Baruch: So now we have the entire chain. We start with the prompt: hey, I want to write an application that does ABC, come up with a set of specs that will satisfy those requirements. Then we have the specs. We don't trust the LLM. The problem with an LLM is that it is non-deterministic, not much better than a random code generator, right?
[00:11:17] Simon: Yeah.
[00:11:19] Simon: I like that.
[00:11:20] Baruch: So my analogy is: you remember the thought experiment that if you get an endless number of monkeys and an endless number of typewriters, eventually they will come up with the works of Shakespeare? This is LLMs. They're slightly better than random monkeys typing on typewriters, but essentially that's exactly it. They try to get it right, and it sometimes almost works. And the idea is that you cannot really trust a monkey to write Shakespeare, especially not on the first run. So the idea here is that...
[00:11:54] Simon: I think that's the quote that we should take out from this entire podcast. You can't trust a monkey to write Shakespeare.
[00:12:03] Baruch: This is exactly,
[00:12:04] Simon: That's one thing. I don't care whether you're spec-centric, code-centric.
[00:12:09] Baruch: That's one thing we can all agree on. Never trust a monkey to write Shakespeare.
[00:12:09] Simon: Or any kind of famous literature.
[00:12:10] Baruch: Or code.
[00:12:12] Simon: Yeah, or code. Well, I've seen some code...
[00:12:18] Baruch: No, but that's, you’ve seen some code, but it wasn't true.
[00:12:22] Simon: No, that's true. Absolutely not. Yeah.
[00:12:23] Baruch: Yeah. So let's make sure that we iterate until it is. And that's exactly the idea, right? So the monkeys generate the spec. Will they get it right the first time?
[00:12:27] Baruch: Probably not. We want to bring in other monkeys. And eventually, most importantly, we look at the spec.
[00:12:28] Baruch: And we say, well, it rhymes, but it's not Shakespeare. Try it again, with five different prompts. Or we can just edit the spec directly. Now, once we have the spec, we say, okay, we cannot trust the monkey to generate correct tests from the spec, because we cannot trust the monkey.
[00:12:58] Baruch: Now, the problem is that we decided we're not going to review the tests. It's too much for us, and it's someone else's code; we don't want to do it. The beauty of Cucumber specifically, but of a parsable spec in general, is that we don't need a monkey here. Because we take an algorithm and compile the specs into tests, deterministically, every time. We don't need the monkey.
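As a rough sketch of that deterministic step, assuming Cucumber's Java bindings and the illustrative checkout scenario from earlier: each Gherkin step is matched to a step definition by its text, so the same spec yields the same executable test on every run, with no LLM involved. The step wording and the discount rule below are hypothetical.

```java
import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Cucumber parses the Gherkin spec and binds each step to one of these
// methods by matching the step text; the mapping is deterministic.
public class CheckoutSteps {

    private int previousOrders;
    private double total;

    @Given("a customer with {int} previous orders")
    public void a_customer_with_previous_orders(int orders) {
        previousOrders = orders;
    }

    @When("they check out a basket worth {int} euros")
    public void they_check_out_a_basket_worth_euros(int basketValue) {
        // Hypothetical rule under test: customers with 3+ orders get 10% off.
        // In a real project this would call the production code instead.
        double discount = previousOrders >= 3 ? 0.10 : 0.0;
        total = basketValue * (1 - discount);
    }

    @Then("the total charged is {int} euros")
    public void the_total_charged_is_euros(int expected) {
        assertEquals(expected, total, 0.001);
    }
}
```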
[00:13:15] Simon: Yeah.
[00:13:24] Simon: Yeah. If you run it 10 times, you'll get the same thing 10 times. Whereas if you do it with LLMs, there'll be subtle differences and maybe sometimes bigger.
[00:13:32] Baruch: Of course. And since we don't review the tests, we won't even know that something is wrong. So monkeys are out of the picture for this part. So we generate the tests. Now, once we have the tests, we can let the monkeys loose to type on those typewriters forever, until the tests pass.
[00:13:50] Simon: Yeah.
[00:13:51] Baruch: The only thing is we need to protect the tests, because the monkeys will be inclined to adjust the tests to make them pass. So how do you do it?
[00:13:58] Baruch: Whatever works. Make the files read-only. Put them in a Docker container and make them inaccessible, or read-only there too. Do whatever to protect the tests, and then let the monkeys loose.
[00:14:06] Baruch: Let them type as long as they want to until the tests pass. Yeah. And now we have the chain. Right? We have a prompt that is guaranteed to capture our intent because we reviewed the specs.
[00:14:08] Baruch: And then we have code that is guaranteed to match the specs, because we generated the tests without the monkeys, and the code actually passes those tests. So we have the entire picture, from our ideation to the end product, that we can absolutely, 100% trust. This is the intent integrity chain.
[00:14:55] Simon: Intent integrity.
[00:14:56] Baruch: Because we guarantee the integrity of our intent in the code.
[00:15:01] Liran
[00:15:01] Simon: The LLM is just as capable of, you know, suggesting fixes the first time around versus identifying issues and then creating fixes around them. What are the risks in relying upon those fixes, or suggested fixes, to my code?
Simon: And I guess we're talking about security here today, but we just had a performance session last time around. It's going to apply to a ton of different areas, but specifically, Liran, I think you're probably well placed here from the first-party code point of view. Yes, AI can suggest fixes. What are the risks?
[00:15:35] Liran: So I would say, again, it suggests fixes, but the fixes will be suggested based on some statistical model of whatever it was trained on and whatever context it thinks it has; it doesn't really have the right context. And I think that is key. There's so much going on when you go beyond the benchmark sample repository that you compare various tools against, and the to-do apps you build with AI. Once you go deep enough and things get complicated, there's a ton of context that's needed, and understanding of code flow.
[00:16:04] Liran: So I'll give you one example that makes total sense, where an LLM would actually try to fix something but would actually get you hacked.
[00:16:22] Liran: So imagine that you are in a piece of code in your file, and you are working with URLs; you're doing something like URL encoding, the AI is giving you some code that relates to that, and you ask it to secure it, right?
[00:16:42] Liran: And it says, okay, let's do URL encoding, or the other way around, let's do URL decoding, whatever; it depends on the current flow. Let's say that this is what it suggests, and by the way, that's probably one of the things it should do. That's a proper practice. But what happens when it is not aware that somewhere up the chain, right?
[00:17:02] Liran: Let's say this is your service, and in some other service you interact with, or the route, the controller, whatever the architecture is, you actually already did URL encoding. So now you have double URL encoding, and that is potentially a security vulnerability waiting to happen due to the double encoding.
[00:17:21] Liran: And so if the LLM doesn't know to trace your code paths and say, well, you actually did encoding before, you should not reapply the encoding because you could suffer a double-encoding injection, then that's going to get you busted. And that's why it needs to be more than just a statistical model.
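To make the failure mode concrete, here is a minimal, self-contained sketch, assuming a path-traversal payload and a downstream filter that decodes exactly once; the payload and the check are illustrative, not from any real codebase:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    public static void main(String[] args) {
        String payload = "../secret";

        // Somewhere up the chain (the route, the controller) the value was already encoded once.
        String once = URLEncoder.encode(payload, StandardCharsets.UTF_8);        // ..%2Fsecret

        // The suggested "fix" encodes it again, unaware of the upstream step.
        String twice = URLEncoder.encode(once, StandardCharsets.UTF_8);          // ..%252Fsecret

        // A downstream filter that decodes once and looks for "../" sees nothing suspicious...
        String afterOneDecode = URLDecoder.decode(twice, StandardCharsets.UTF_8);
        System.out.println("filter sees traversal? " + afterOneDecode.contains("../")); // false

        // ...but a later component that decodes again restores the traversal payload.
        System.out.println(URLDecoder.decode(afterOneDecode, StandardCharsets.UTF_8));  // ../secret
    }
}
```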
[00:17:45] Liran: And honestly, in most cases vulnerabilities are very nuanced and very complicated. While it's sometimes easy to demonstrate a code injection or some cross-site scripting, and there it is, most of the time vulnerabilities are chained together. They are more complicated; people chain one to another, like a prototype pollution to do code injection, to do command injection, to open a shell.
[00:18:05] Liran: There's a layered approach to how things happen in the real world, rather than what we can demo. So having that nuance and understanding, I don't think that fits the statistical model. Maybe those LLMs and AI agents and the whole gen-AI way of coding get smarter, where they'll be able to fuse several learnings and several paradigms together, both the model and some applied logic, and that would be great.
[00:18:25] Liran: But at this point in time, it's just super basic. It's as if it saved you the time of Googling how to properly do path validation, versus just asking the LLM. It's not at the point where you would ask it to fix something and could count on it.
[00:18:55] Simon: Yeah, very interesting. We've got a minute or so till Q&A, so let's invite the audience to ask some questions. While the audience is thinking of questions and pinging them into the chat, I'd be interested in asking more of a cultural question.
[00:19:10] Simon: When we think about DevOps changes over the last 10, 20 years or so, we've seen platform teams; we've seen this amazing DevOps engineer role being created. I personally see there's a place on that platform team for certain security engineers, people who work closely with the developers in the paved-road style of approach. So first of all, maybe it'd be interesting to hear your thoughts on that. But beyond that, from an organizational change point of view, do we see an AI security paved road, or anything like that, also sitting in the platform team? I'd love to know how you think the teams are going to change.
Simon: Ashish, why don't you lead us with this?
[00:19:53] Ashish: So, many companies have already started working on AI projects, and the way they're dealing with this at this point in time is that the platform engineers themselves are the ones handling it, because from a workload perspective, if you look at the components that get used for building an AI application, those really haven't changed.
[00:20:10] Ashish: You still use Kubernetes as your foundation layer. You still have databases. So all that layer hasn't really changed, and those people will still continue to work on it. I think the only point where you would see evolution like that is when the number of AI applications outweighs the number of regular applications you have.
[00:20:33] Ashish: It kind of happened, to your point, when we started with DevOps: we had cloud, which became cloud security engineers, cloud security this, cloud security that; then Kubernetes became popular, and it was, oh, let's just have a Kubernetes security engineer. I think it just comes down to: if we ever get to a point where the number of AI-driven applications in your organization is greater than the regular, non-AI-driven, or by then quote-unquote legacy, applications, then yeah, maybe we'll see that. But maybe not right now.
[00:21:03] Ashish: That's kind of where I'm coming from, I guess.
[00:21:06] Simon: Yeah. Any thoughts, Liran?
[00:21:09] Liran: Yeah, I agree. I think it's a bit early to call it an AI security engineer. I think it's definitely not too early to have AI engineering teams, just because AI is here, right? LLMs are here.
[00:21:23] Liran: They're among us. Sometimes they present themselves as Java LLMs, which you shouldn't trust. But they are among us, and I think AI engineering is something that should probably exist. I think AI security engineering is a bit deeper down. Indeed, I don't know if there are Kubernetes security engineers.
[00:21:44] Liran: It's very, very specific. You kind of depend on the tooling, the ecosystem, the foundation to carry a bunch of the security responsibility; to an extent you outsource the risk because it's a third party, and things like that. And I somewhat feel, like Ashish, that it's kind of early to understand what it is, since we don't exactly know even how to secure it properly.
[00:22:07] Liran: So, we'll see.
[00:22:10] Simon: Yeah. Question coming in from the audience. Min asks: it's interesting that open source code used to be seen as more secure than proprietary code, because open source projects are transparent about potential vulnerabilities. What I gather from what you are saying now is that that's no longer the case when it comes to models and agents?
[00:22:30] Simon: Did I understand this correctly?
[00:22:32] Ashish: I think it's still the case, though; it's still the case that both open source and proprietary cannot simply be trusted. The whole foundation of the nervousness that we all have is that it's there no matter which way you face. If you look internally, there was the Ashish example that I mentioned, who left a bag of rotten tomatoes ages ago.
[00:22:52] Ashish: That is still hanging around. There are parts of the organization that we still don't touch or talk about. That's internal, and it is never going away. If you look at open source, I'll go back to the episode example that you gave earlier, Simon, where Arman kind of spoke about the levels of...
[00:23:09] Ashish: I did my first GitHub repository back when I was trying to understand what it's like to code something, and that repository is still there, to the point that today, when I'm teaching people Kubernetes and cloud and all of that, I have the mature repository as well. The same Ashish has two repositories, which are completely different experiences.
[00:23:32] Ashish: I've never touched the first one again, because I'm just too embarrassed to talk about it now that I did it publicly. But still, that's where I find there's distrust or nervousness on both sides. I wouldn't say one is better than the other.
[00:23:46] Simon: Yeah.
[00:23:47] Ashish: Or more secure.
[00:23:48] Liran: I would say I think it's more nuanced.
[00:23:52] Liran: It's really hard to compare open source code, comparing third-party dependencies with third-party models, if that's what you want to compare. It's super hard because I think the foundation is very different. So for example, take reviewing and scanning, getting an audit of an open source, third-party package.
[00:24:14] Liran: I can see the maintainers, I can see who contributed, I can see a git log of all the code added, how it was added, when fixes were made, in the code and in the releases. I can see what third parties it uses to do things. I can have provenance and attestation at the signature level of how versions were published.
[00:24:37] Liran: I don't really have that level of data for a model, right? I cannot see exactly how it was trained. I am maybe being told how, but I don't know how it actually was. I don't think the same standards exist yet for how you train models as for how you write code. And I think that's the nuance that kind of escapes us; it's easy to overlook, but it's an important aspect when you need to audit code versus a model.
[00:25:10] Simon: Yeah, we're pretty much out of time, but I want to add this last question, because I feel like it's a really great question from Macy here. What about social engineering risks when teams are leveraging generative AI? What steps can teams take against this? Or is it too early?
[00:25:28] Simon: Do we not know what those are just yet?
[00:25:30] Ashish: I would say there's definitely some, and I guess there are different ways to take the social engineering aspect. The one that comes to mind with gen AI is the fact that I work as a developer and I want to know what the salary of the CISO is, because I'm just curious.
[00:25:45] Ashish: He seems to be quite fancy on the internet; I wonder where he gets all the money from. So the way people are talking about this at the moment is that there needs to be almost a quote-unquote data access manager, for lack of a better word. The way people are doing this is: how do you manage access for different identities based on their roles?
[00:26:03] Ashish: That is one aspect of social engineering. I would argue this could be done better in the AI world, because you have a lot more understanding of where your data is, what kind of data it is, who can access it, and what they should be doing. It's not just a policy document; it's actually implemented.
[00:26:22] Ashish: I think that's where it'll be really interesting, but in some aspects it's really early, because we haven't really seen AI-based attacks at scale, so we don't really know what that social engineering could look like either. So in that aspect, maybe too early.
[00:26:36] Simon: Cool. I'm being shoved off stage here because we have another session starting in just a few minutes.
[00:26:44] Simon: So first of all, a massive thank you to both Ashish and Liran. It's an absolute pleasure to have you on the AI Native Dev.
[00:26:50] Ashish: Likewise. Thank you. Yeah, thanks for having us.
[00:26:52] Alex
[00:26:52] Simon: We're talking a lot about spec-driven development here at Devoxx, and Alex, first of all, tell us a little bit about yourself, because I think some of the things we're going to talk about with Backlog are very much built in a spec-driven mindset.
[00:27:10] Alex: Yeah.
[00:27:10] Simon: Yeah. In terms of going forward. So tell us a little bit about yourself, a little bit about Backlog.md.
[00:27:15] Alex: Right, so I'm a lead engineer in Vienna for a gaming company. I'm coming from an agile environment: using Scrum, having all the requirements in PRDs, thinking a few months ahead with all of the features, really thinking them through.
[00:27:34] Alex: And then there's the coding part. Right now everything happens between humans. So it's no AI, no nothing, just classic development. We have a backend team, we have a mobile development team, and they all collaborate together. This is what I'm doing during the day, but obviously in my free time I'm trying to learn as much as possible about AI and keep up to date with the latest trends.
[00:28:05] Alex: And Backlog.md was basically my trial of AI, a challenge for myself to try to be as autonomous as possible and let AI code 100% of my tasks. The reason I started building Backlog.md is that I started with side projects, working with AI.
[00:28:32] Alex: I was just prompting all the time, and I was getting really bad results. I was basically doing vibe coding. And then I realized how the processes I use at work make the collaboration between humans effective enough to build successful gaming features.
[00:28:54] Alex: And then I tried to learn from the human process and adapt it to AI. By doing this, I realized the most important part is the specs. Everything starts with specifications, requirements: not just about the feature you want to build, but also everything around it, like security specifications, CI/CD specifications, even just the language that you should use, like C#, TypeScript, Java, whatever language.
[00:29:24] Alex: And you should have all of this context right before you start. So in order to be really successful with your tasks, really…
[00:29:32] Simon: Really interesting. Let's just break that down a little bit. Yeah. So, you were doing a lot of vibe coding. What typical tools were you using for vibe coding?
[00:29:40] Alex: CLAUDE.md and Claude.
[00:29:45] Simon: Yeah. So, what are examples of those problems you hit in Claude Code?
[00:29:57] Alex: So the problems were single-task problems that would repeat themselves on every single task. What I mean by this is: I tell the agent to build a certain feature, and it reaches the goal and manages to build this feature.
[00:30:20] Alex: But only by going back and forth on a lot of things. And with the next task, I would repeat the same instructions again, and I would stumble upon the same issues again and again and again. Gotcha. This is the problem with vibe coding: basically, every time you have a new session with your agent, you try to build as much as possible within your context window.
[00:30:39] Alex: And this doesn't scale, obviously.
[00:30:41] Simon: Yeah. And as soon as you close that window, context is lost. You have to build that up again in your next window that you created.
[00:30:47] Alex: Exactly. And maybe you forget the instructions that you gave it to prevent certain issues, and you have the same problems again.
[00:30:54] Simon: Yeah.
[00:30:56] Simon: And a lot of the things that you mentioned that you would put into specs, like security considerations and things like that. How would you describe those needs to your vibe coding tool?
[00:31:05] Alex: Yeah, so obviously with vibe coding you can do a lot of damage. You can deploy changes, patches to production, that actually break things. You need guardrails, but you need these guardrails with humans too: we all use specifications about security, and measures and checkpoints, for example a staging environment, to test all of these measures and prevent issues. And with vibe coding, suddenly everyone forgot about them.
[00:31:41] Simon: Yeah. So then you mentioned you moved all into the spec environment. Tell us a little bit about that.
[00:31:48] Alex: Yes. So I started creating markdown tasks manually. Actually, before that I started creating one huge markdown file with all of the specifications and all of the features that I wanted to build.
[00:31:57] Alex: The problem with that was it was a huge context, and I would immediately inject this whole context into the model. The agent with this model would maybe be effective, maybe not; I would not have a good success rate, and sometimes I would have to roll back, but it's very hard to roll back a whole product or a whole feature.
[00:32:21] Simon: And what's interesting there is you're nowhere near the context window, nowhere near the maximum context size. You're just hitting that mark where the results degrade massively based on the amount of context you provide.
[00:32:37] Alex: Yes. These agents have a feature called compaction.
[00:32:43] Alex: This basically tells the agent to make a summary of all of the previous conversation and start with the minimum context from that summary. But the problem is that the instructions contained in this summary are half of the ones that you gave it initially, and you most likely have to start from scratch again, because you don't know what it's missing.
[00:33:07] Simon: Okay, so what's next in the spec journey?
[00:33:11] Alex: What was the immediate next step? The immediate next step was to split the big markdown file into smaller tasks. So I would have, similar to what we have in Jira or Linear or other project management tools, single tasks that just define what has to be built.
[00:33:28] Alex: And I basically translated this into markdown files, because markdown is a universal language, plain text with formatting, that both humans and agents can understand very well. And agents were actually more effective, and I could easily roll back single tasks at a time.
[00:33:47] Simon: And you did this manually, or did you…?
[00:33:50] Alex: Yes, at the very beginning I was creating them manually. When I reached around 50 manually created tasks, I did what any software engineer would do and tried to automate it: how can I automate this manual creation of tasks? So I built Backlog.md to enable this and have an easy way to create tasks from your terminal.
[00:34:13] Simon: The best way to learn about it is to see it, right? Yeah. Should we take a look at a quick demo of Backlog?
[00:34:19] Alex: Sure.
[00:34:20] Simon: Talk us through it.
[00:34:21] Alex: So the first thing you have to do to even run Backlog is install it. You can use Bun, npm, or brew, and you should install it globally on your computer.
[00:34:26] Simon: Why globally?
[00:34:26] Alex: Because then you don't have to install it in every single project that you're using, and you can immediately use it in every folder. This is going to take a few seconds, and even on conference Wi-Fi it works. This is because Bun is very fast at installing dependencies.
[00:34:55] Alex: And afterwards, you can run backlog as is, with no other arguments. This is going to give you some initial hints about what you can do with Backlog. You can see here: you can create tasks, you can list the tasks, you can see the board, which I will show very quickly. You can also run a browser UI, and again, I will go into detail about this. And there's an overview that shows the statistics about your project, how many tasks are finished and how many are remaining, the link to the docs, and the next thing to do. Normally you would want to create tasks, but I'm running Backlog inside the Backlog project, and I'm using Backlog to keep track of the tasks for Backlog itself.
[00:35:27] Alex: I already have hundreds of tasks, so let me show you how it looks. The board: Backlog gives you a board that is configurable with however many statuses you want. The default ones are To Do, In Progress, and Done, and here you can see what is going on in this project. And all the tasks that you see here are in sync between multiple Git branches.
[00:36:06] Alex: Backlog natively uses Git to fetch information from other branches. So if you work with other people, as long as they push their changes on a task, for example, I take over task 200, assign it to myself, and put it in progress, then as soon as I push my feature branch, you will also see it in your Backlog instance on the main branch, for example.
[00:36:28] Alex: So it's also nice to use as a collaboration tool.
[00:36:31] Simon: And can you filter based on branches and things like that here?
[00:36:35] Alex: So, the logic behind it is a bit hidden from the developers. Yep. Backlog should be smart enough to find the task with the latest update date.
[00:36:46] Simon: Nice.
[00:36:47] Alex: And that means it's the newest state. Obviously, if you change a task by mistake, that will mark it as having the latest update. But in normal use cases, this should not happen.
[00:37:01] Simon: Gotcha. Okay.
[00:37:02] Alex: And what you can do from here is find the details of the tasks. So for example, let's see one task that I would like to build next. This one: I press enter and I immediately see the details of the task. So, what I would like to have in the near future is drag and drop in the terminal.
[00:37:24] Simon: Yeah.
[00:37:25] Alex: So like you do in Jira, you drag your task from To Do to In Progress; I would like to achieve the same in this board. So this is the task for that feature. It's not implemented yet; it's in To Do. And what we can see here are the task ID, the title, some metadata about the task, and some dependencies, because Backlog also supports dependencies, which basically prevent the agent from working on tasks that are not ready to be started.
[00:37:36] Alex: So this is also important in spec-driven development. We have a description that basically tells us why we are even doing this feature: adding the drag and drop functionality using Shift plus arrow will allow users to not leave the terminal interface and to change the progress very easily.
[00:38:04] Alex: So this is very nice from a user perspective. And we have some acceptance criteria, which are the core of Backlog basically.
[00:38:24] Alex: Yeah, the acceptance criteria are something that can be tested, something that can be easily verified and measured, and they basically represent the smaller increments of implementation within this single task. And whenever an acceptance criterion is done, it'll be ticked. I will show you very soon how that looks. Let's open a task that has been completed. This one, for example: you can see the status is Done. Yep. It has been taken over by OpenAI Codex.
[00:38:57] Alex: Another feature of Backlog is labels, so you can label your tasks and filter by labels later. And here we can see the acceptance criteria are all implemented. Yep. We also have an implementation plan here. I will talk about the implementation plan a bit later, but it's part of the development.
[00:39:18] Alex: You always want to ask the agent how it would like to develop the task, and then you review that. And at the end we have the implementation notes, which are a sort of permanent context of what has been done in this task. So if for any reason a human or an agent would like to see what happened in this task,
[00:39:41] Alex: they would read the implementation notes and have that permanent context about it.
[00:39:50] Simon: Yep.
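For a sense of what one of those task files can look like, here is a rough, illustrative sketch of a Backlog-style markdown task along the lines Alex describes; the exact front matter and section names are assumptions, not the tool's verbatim format:

```markdown
---
id: task-200
title: Drag and drop tasks in the terminal board
status: To Do
labels: [ui, board]
dependencies: [task-187]
---

## Description
Allow moving a task between columns with Shift + arrow keys, so users can
change progress without leaving the terminal interface.

## Acceptance Criteria
- [ ] Shift + left/right moves the selected task to the adjacent column
- [ ] The new status is persisted to the task's markdown file
- [ ] The board re-renders without losing the current selection

## Implementation Plan
(The agent proposes a plan here; the human reviews it before work starts.)

## Implementation Notes
(Filled in when the task is done; permanent context for humans and agents.)
```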
[00:39:51] Alex: So yeah, this is the primary interface. We also have a task list view, which is this one here: we have just a long backlog of tasks. We can also check the details, and if it's a longer list, scroll. We can also search for tasks; maybe let's find tasks about Tailwind.
[00:40:21] Alex: Yeah, this one. There is actually no task about Tailwind, but anyway, we can also filter by status, so To Do, Done, In Progress, and by priority again. So this is also like a second command panel where you manage your tasks. And that's basically it for the UI. We also have a web interface.
[00:40:48] Alex: For people that like to have a more visual GUI for interacting with tasks. We also have a plain mode for AI agents; let me show you that one, because I think it's more important. So for example, we were looking at task 200: backlog task 200 --plain. This will show the same information, but in plain text.
[00:41:16] Alex: And this is what agents should use.
[00:41:18] Simon: Yeah, I see. Okay. And so, let's talk a little bit about the way in which, so you mentioned you built this a lot through specifications and things like that. How did you go about building those specifications?
[00:41:34] Simon: Did you use the tool, did you handcraft the specifications yourself? How did you deliver that to them?
[00:41:41] Alex: That's quite interesting, because the first tasks I created manually. So this kind of format I came up with manually, by creating them myself. And after four or five tasks, I already had a Backlog CLI command to create tasks.
[00:41:56] Alex: So afterwards I just recursively started using Backlog to manage tasks about Backlog.
[00:42:02] Simon: And so Backlog creates the specification about the task, or not necessarily the specification of the task, but the information in that task, so that it can be used as a spec for an agent to implement that change. Would you say…
[00:42:15] Alex: Yes and no. Backlog itself doesn't create specifications. Backlog is a tool that AI agents use, but it's not calling agents; it's not connected to AI agents by default. So you'd have to start the flow this way: I start Claude or Codex, or any CLI agent, and I tell these agents, "Hey, I want to build this feature. Use Backlog and its CLI to keep track of the tasks and to split it into multiple sub-tasks." There are some agent instructions that come with Backlog that tell the agent how to use Backlog itself. So afterwards the agents will know how to split a bigger task that you have in mind into smaller tasks that feed Backlog.
[00:43:04] Alex: But Backlog itself, it should be as minimal as possible. Right. And should not be in the way. It should be sort of a tool that both the human and the agent use on the side.
[00:43:11] Simon: Gotcha, gotcha. So it's Claude that kind of determines the split almost and then adds that into Backlog.
Alex: Exactly. Yeah
[00:43:17] Josh
[00:43:17] Simon: We're at Devoxx UK today, and joining me is the wonderful Josh Long.
[00:43:22] Josh: Hi buddy,
[00:43:23] Simon: Josh. How are you doing?
[00:43:24] Josh: Oh, so good. I'm at Devoxx UK.
[00:43:27] Simon: Josh, we go back like how long? 15 years?
[00:43:34] Josh: Too long.
Simon: 2010 maybe, or?
[00:43:34] Josh: Yeah, yeah, I think so. Yeah. And, I've been a fan since 2011.
[00:43:40] Simon: Was 2010 a bad year?
[00:43:41] Josh: No, I'm just kidding. Thank you. I mean, you've brought so much joy to my heart. But then you brought joy to the world. I remember the virtual JUG.
[00:43:50] Simon: Yes.
[00:43:50] Josh: That was, I just think about that all the time.
[00:43:52] Josh: That was such a genius idea. A decade ahead of its time.
[00:43:55] Josh: It took a pandemic for the rest of the world to realize. Oh yeah. Genius.
[00:43:57] Simon: And that's still going. There are almost 20,000 people in the virtual JUG these days.
[00:44:01] Josh: Wow.
[00:44:02] Simon: Apparently.
[00:44:02] Josh: Whoa.
[00:44:02] Simon: Apparently
[00:44:03] Josh: It's crazy.
[00:44:03] Simon: So, people know Josh Long as the Spring advocate, the advocate in JUG space, but, how long have you been with SpringSource, Pivotal, VMware, Broadcom?
[00:44:17] Josh: Since 2010. And, you know, still going strong.
[00:44:22] Simon: Yeah.
[00:44:22] Josh: But, yeah, it's about as long as I've known you, actually.
[00:44:25] Simon: It's, yeah.
[00:44:26] Josh: Yeah. Yeah. It's been coincidental. Yeah.
[00:44:28] Simon: Wonderful. Yeah. Awesome. And today we're gonna be talking about Spring AI.
[00:44:33] Josh: What else?
[00:44:33] Simon: We'll be talking a little bit about why people would use AI when they're typically a Java developer, a Spring developer. So we'll talk a little bit about the reasoning behind why people use it, and then we'll go into a demo as well, to actually show Spring AI in action.
[00:44:50] Simon: But I guess, first of all, what are the capabilities of Spring AI? What does it, what does it provide?
[00:44:54] Josh: So Spring AI is your one-stop shop for AI engineering. And I think we in the Java and Spring communities in particular are in a uniquely amazing position right now, just an amazing position, because most people are going to use AI as an integration with their existing business logic and applications and services.
[00:45:13] Josh: That's all written in Spring; that's all written on the JVM, in Kotlin and Java and whatever, right? You are already there. They just want to hang AI integrations off of that code and make it work. Your business logic, the things that drive your business, and the data that feeds your business are all governed, controlled, and orchestrated by Spring-based microservices.
[00:45:31] Josh: And so this is a really natural place to start your AI journey, I think: to enable access to that data and that business logic from your AI models and via your AI models. So sure, some people are going to use Python to train new models.
[00:45:46] Josh: But that's not most of us, in the same way that most of us aren't building our own SQL database in C or whatever.
[00:45:53] Josh: Most of us don't need to do that either.
[00:45:54] Simon: Yeah.
[00:45:54] Josh: So, I think we're in a uniquely great position. And when it comes to production, production-worthy, scalable, fast, secure, observable production, there's nothing like the JVM, you know?
[00:46:06] Simon: Yeah. Okay, so how does AI fit into all of this?
[00:46:11] Josh: Well, we have a framework point of view on this. Spring is a set of frameworks: Spring Framework, and then Spring Boot on top of that, and a bunch of verticals on top of that serving different use cases, including microservices, batch processing, integration, data, security, whatever, right?
[00:46:24] Josh: And we have one called Spring AI, which goes GA, by the way: 1.0 GA on the 20th of March... no, May. So we had ambitions, I think. I don't know if I'm speaking out of school here or not, but I think at one point we hoped it would already have gone GA. But, and you're not going to believe this, the AI space has changed.
[00:46:42] Simon: Really?
[00:46:43] Josh: Yeah.
[00:46:43] Simon: I don't believe it.
[00:46:44] Josh: No. I, you, you,
[00:46:46] Simon: if there's one, if there's one constant,
[00:46:48] Josh: right?
[00:46:49] Simon: It's a, it's the AI space.
[00:46:50] Josh: AI's change
[00:46:51] Simon: Yeah.
[00:46:51] Josh: Is too much.
[00:46:52] Simon: Yeah.
[00:46:52] Josh: It's too quick, too fast, too much, whatever. We have a whole team of people working on this.
[00:46:57] Josh: And even then, you know, it's one of our busiest
[00:47:01] Josh: open source projects, right? The star history is like a hockey puck... hockey stick, rather.
[00:47:08] Simon: Yeah.
[00:47:08] Josh: Just through the roof. A meteoric rise in popularity and GitHub issues and contributions and everything. So every time we think we're about to settle down and reach a GA milestone or GA release,
[00:47:21] Josh: a whole new paradigm gets dropped in our laps, you know.
[00:47:23] Simon: I just, I just assumed it was GA just by the amount I hear about it.
[00:47:27] Josh: It's, it's mature. People are using it. It's growing all the time. It's, it's very, very popular. But obviously we've just wanted to get to a point where we had the things that mattered and, um, I think we're there.
[00:47:37] Simon: So as soon as there's a week of no change or May 20th,
[00:47:41] Josh: whichever should happen.
[00:47:42] Simon: Whichever should happen first.
[00:47:43] Josh: Yeah. Well, I think it's going to be May 20th either way. At this point we're on the 7th of May, so yeah, we've got a couple of weeks. Well, we don't even have two weeks.
[00:47:51] Simon: In fact, depending on when this gets released, yeah, we could be around the May 20th.
[00:47:55] Simon: It may be May 27th. Actually.
[00:47:57] Josh: Actually, that may be seven days too late. Yeah.
[00:48:01] Simon: Really? So let's just say Spring AI is out today.
[00:48:04] Josh: Yeah. It's out.
[00:48:05] Simon: Yeah.
[00:48:05] Josh: Okay. Go get the bits. Yeah, fresh off the press. We might even have the first patch release by then. Who knows?
[00:48:10] Simon: Yeah. Yeah.
[00:48:10] Josh: By the time you watch this.
[00:48:11] Josh: But all that to say, things are moving quickly, and that's okay. But remember, we want to pair the innovation in the AI space with the idiomatic approach to building apps that Spring has always embodied. And we want that to build upon some of the pillars that Spring has always talked about, right?
[00:48:28] Josh: Portable service abstractions, to isolate you from the differences between different models: image models, chat models, transcription models, et cetera. Dependency injection, aspect-oriented programming, and Spring Boot-style auto-configuration, you know?
[00:48:42] Simon: Yeah. Yep.
[00:48:42] Josh: So you take those four pillars and
[00:48:45] Josh: Did I say three earlier? I was talking about four.
[00:48:47] Simon: Yeah.
[00:48:48] Josh: And you get an approach that gives you purchase in this strange new land, right? It gives you the ability to hit the ground running. You already know all that stuff; you already know the component model. It's just a matter of applying those facets of your understanding of Spring
[00:49:03] Josh: To this new domain.
[00:49:05] Simon: And you're gonna demo Spring AI?
[00:49:06] Josh: I sure am. I'm gonna try,
[00:49:07] Simon: And we'll talk a little bit about it as we go through.
[00:49:09] Josh: We're going to build a very simple application here, because we're kind of pressed for time. But I wanted to demonstrate a simple application that helps... we're going to build an assistant to help people adopt dogs, right?
[00:49:18] Josh: And I talk about dogs all the time, because I think it's really cute and I've got a dog. And I talk about this one in particular, my dog, who, look, he's not the best dog, but he is mine.
[00:49:30] Simon: he's a good dog. He's still a good dog.
[00:49:31] Josh: Look at that dog. That's a cute dog right there. So, all that to say: not good, but he is ours, and his name is Peanut. Okay? And Peanut is the worst dog, except then I learned about this other dog in the pandemic, whose name is Prancer. Prancer, as it turns out, is even more of a spicy dog, right?
[00:49:54] Josh: And this owner, this lady, was trying to find a new home for this dog, and she put out this hysterical ad saying: okay, I've tried for the last several months to post this dog for adoption and make him sound palatable. The problem is he's just not; there's not a very big market for neurotic, man-hating, animal-hating, children-hating dogs that look like gremlins. And she continues: if you own a chihuahua, you probably know what I'm talking about. He is literally the chihuahua meme that describes him as being 50% hate and 50% tremble. She continues: I kind of liked him better that way. He was quiet and just laid on the couch, didn't bother anyone. I was excited to see him come out of his shell and become a real dog.
[00:50:27] Josh: I'm convinced at this point that he's not a real dog, but more like a vessel for a traumatized Victorian child that now haunts our home. And she continues, and this goes on for a long time, and she signs off, oh, he is only two years old and will probably live to be 21 through pure spite. So take that into account if you're interested. That said, super cute, right? Like that's a cute dog. Is that a cute dog? That is a cute dog. I'd pet that dog.
[00:50:47] Simon: I've got two Labradors. I dunno if that's cute.
[00:50:49] Josh: The big dogs are great. Yeah, the big dogs. But my dog is just like this dog: small. They have the Napoleon complex.
[00:50:55] Simon: Yeah, yeah.
[00:50:55] Josh: Something about it.
[00:50:56] Simon: Yeah. Angry. Yeah, angry by default.
[00:50:58] Josh: By default. And I don't know why, 'cause I just wanna pet this cute little guy.
[00:51:01] Simon: Yeah.
[00:51:01] Josh: So I think about this dog a lot too. Rent free, all the time, right? I mean, just, how did such a dog come to exist? And by the way, this ad went viral, right? This ad went viral. So for example, here's People magazine talking about Prancer.
Simon: Wow.
Josh: The demonic chihuahua. Here's USA Today talking about Prancer, the demonic chihuahua. Here's BuzzFeed talking about the nightmare chihuahua, the viral nightmare chihuahua. And of course, here's the New York Times talking about Prancer, the demonic chihuahua. Right?
[00:51:31] Josh: So, very, very famous dog. And I thought, well, that's nice. That's good that people learned about this dog. But that's not how most people roll, right? Most people don't find dogs by finding them on the internet. You go to a shelter and you have a conversation with somebody, and you interview to discover the dog of your dreams, or in this case, your nightmares. So what I wanted to do is build such an assistant to help people go through that process, to find the right dog. So we're going to go to start.spring.io. I've already got this dog database here, and you can see there's our dog, old Prancer. His ID is 45 and his name is Prancer, right?
[00:52:02] Josh: He’s in a Post-base database. So we’re gonna build an application here. We’re gonna call it assistant.
[00:52:09] Simon: Strong enough to contain Prancer.
[00:52:13] Josh: Yeah, it is a very, very tough ask. Yeah. GraalVM, and we use the web stuff. We'll use OpenAI. Now, I'm going to use OpenAI because it's just a very good model and a lot of people probably have access to it.
[00:52:21] Josh: But it’s not the only model. Not even close. Yeah. Here in, uh, uh, data, data privacy centric, uh, sensitive Europe. You might prefer something like a Llama, which is a fine choice. Or, uh, you know, alternatively we got things like Bedrock and, uh, Gemini and, um, everything. Everything. I mean, just. There, there’s dozens and dozens of different models that we officially support, and the ones that we don’t officially support, most of them speak the, uh, OpenAI API.
[00:52:50] Josh: And so you can talk to them via our OpenAI integration. Right. So I’m gonna bring in OpenAI. I’ve got the web support, don’t I?
[00:52:57] Josh: Oh, I took that away. I'll bring in the Spring Boot Actuator support. And I'm going to bring in... I need a vector store. Now look, just type vector store and you can see we've got Milvus, Neo4j, Pinecone, MariaDB, Weaviate, Oracle, Redis, Qdrant, Azure, Apache Cassandra, Chroma, Elasticsearch, MongoDB, PGvector, Typesense, Azure Cosmos DB, et cetera.
[00:53:16] Josh: I’m gonna use a PG Vector store because I’ve got a SQL database. This is a vector plugin. So we’re gonna go ahead and open that up.
[00:53:25] Simon: So what we did there is you added a bunch of dependencies into your Spring project, which then allows you to effectively build that into, I presume, a Maven POM file.
[00:53:34] Josh: Yep.
[00:53:34] Simon: When you build that, it'll pull all the Java dependencies straight into your, uh, into your Spring project.
[00:53:39] Josh: You know it, and actually, you know what I did, you know what I did wrong there? I, uh, I forgot to select, uh, PG? No, I forgot to select dev tools. Okay, so I'm gonna actually go down to M7 here because I don't know what the, they changed something in M8 and I don't remember the idiomatic way to do it already, but I'll use that.
[00:54:01] Josh: It’s downloading the internet, which is not a good thing. It’s, we’re actually on conference wifi. We are, no, we’re not. Stop that. Stop it. No.
[00:54:04] Josh: I’m used, I’m, I’m live streaming here. Okay. So IntelliJ is amazing, but if you start the project in IntelliJ before adding the dev tools, yeah.
[00:54:23] Simon: yeah.
[00:54:23] Josh: It'll, it'll not use, it won't enable the dev tools integration.
[00:54:27] Simon: Gotcha.
[00:54:27] Josh: So I, I retroactively added the dev tools there didn't I?
[00:54:36] Josh: I did not. Oh, that is so awkward. Okay, we'll go back over here. Copy and paste. Dev tools is there now, right? There you go. There it is. So now we'll go back again. Do this whole thing again. Normally you don't have to do any of this stuff, but I've screwed it up twice now. Okay. pom.xml and, uh, M7.
[00:55:02] Josh: There we are. Fantastic. We load. So here's our application, and we know we're gonna build a controller that'll act as the thing that we can ask questions to, right? So, system controller, and, uh, just to have an endpoint here with the user context and then the inquiry endpoint, right? Inquire. So a String inquire, and we're gonna use it.
[00:55:25] Josh: To do our work, we’re gonna talk to a chat model. And that chat model, by the way, is gonna be connected. We’re gonna connect to it via OpenAI. Uh, and we have a key there. Now, friends, I’ve already, I’ve already, uh, exported an environment variable here like so. Right. So that’s already done in my shell.
[00:55:39] Josh: And Spring Boot will normalize that into the property that you just saw there a second ago. These two are the same, but you need to specify that yourself when you connect. Okay. We're also gonna connect to a data source, not that one, spring.datasource. Uh, url=jdbc:postgresql, right.
[00:56:01] Josh: localhost/mydatabase, and then we'll create the username, myuser, and then the password is secret. Okay. Go back to here, and we're gonna use the chat client. There's a chat model, and then you can use a chat client. You can create as many of these chat clients as you want; they will talk to the chat model behind the scenes.
[00:56:18] Josh: I’m gonna inject the chat client builder and build a new one. And here I’m gonna put my defaults, and then I’ll use that model here to answer questions from the user to this endpoint, right? So, call content, et cetera. And then the prompt is a user prompt coming from, uh, the user. And that'll be a request parameter, right?
[00:56:36] Josh: RequestParam String question. Question. So confusing. Ignore this user path variable for now. Okay? Let's just try this. So we're gonna start that up.
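For readers following along, here is a minimal sketch of the controller Josh is assembling, using Spring AI's ChatClient fluent API. The class and endpoint names (AssistantController, /{user}/inquire) are assumptions based on the demo, not the exact code on screen.

```java
// A minimal sketch of the assistant controller, assuming Spring AI's ChatClient API.
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.*;

@RestController
class AssistantController {

    private final ChatClient chatClient;

    // Spring AI auto-configures a ChatClient.Builder bound to the configured chat model (OpenAI here)
    AssistantController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // the user path variable is ignored for now, exactly as in the demo
    @GetMapping("/{user}/inquire")
    String inquire(@PathVariable String user, @RequestParam String question) {
        return this.chatClient
                .prompt()
                .user(question) // the request parameter becomes the user message
                .call()
                .content();     // return the model's text response
    }
}
```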
[00:56:48] Simon: Okay, so a user can hit that endpoint, that inquire endpoint, right? Asking a question as that, as that variable, that RequestParam variable, right?
[00:56:55] Simon: That RequestParam, that then goes to the chat client, which does some stuff, talks to the model, which is OpenAI in this case, does some stuff, provides an answer, passes back to you here.
[00:57:06] Josh: Right? Says, nice to see you. Nice to meet you, Josh. How can I assist you today?
[00:57:09] Simon: Nice.
[00:57:09] Josh: Great. So it's, we made it wiggle, right? Yeah, there's there's a dial tone there.
[00:57:13] Simon: Go by the dog. By the dog,
[00:57:14] Josh: Yeah. Right. Well, what's my name? I don't have access to. Okay. So it does, it's already forgotten me.
[00:57:19] Simon: Yep.
[00:57:19] Josh: Right?
[00:57:19] Simon: Yep.
[00:57:20] Josh: Quite like the first time we met, I said hi, and you're like, ah. And then moved on and we didn't talk for a year.
[00:57:25] Simon: I can't believe you're lying, Josh.
[00:57:28] Josh: So. So, uh. Anyway, it doesn't know. Yeah. So we need to help it because remember, you use ChatGPT, you use, uh, Claude Desktop. Mm-hmm. Whatever. They have memory, they have conversational memory, but that's not the case for the models, right?
[00:57:39] Simon: Yeah, yeah,
[00:57:39] Josh: The AI APIs don't.
[00:57:41] Simon: So we need to continue that context. We send that context back to it every time, right?
[00:57:43] Josh: Yeah. And the way you do that is by creating, configuring an advisor. Okay? So what I'm gonna do is I'm gonna have a per-user map, you know, and I'm gonna pass this chat memory advisor. Okay. So there we, oh, there we go. There we go. And I'll go down here. And then the advisor, the, the map will say computeIfAbsent(user), right.
[00:58:16] Josh: Right. And I'm just gonna create a new one if it doesn't exist. And I'll start in memory. Now there's other implementations of this chat memory interface. Yep. That you can use, uh, that will write to different, uh, abstractions. Right. You can do like Neo4j and JDBC, and, mm-hmm. But I think there's one in Redis coming along.
[00:58:31] Josh: I dunno, but it's all sorts of JDBC, of course, you know, all that stuff. So, okay. This is an advisor, this is like a filter, right? Uh, it's a pre-processor on the requests intended for the model. And basically as we have a conversation with the model, this will get, um, stored per user in that map. And that'll be retransmitted to the model on every subsequent request so that the model remembers, oh, we talked about A, B, and C. Mm-hmm. When that person asks about A, B, and C, uh, remember it, right?
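As a rough sketch of that per-user conversational memory, extending the controller above and assuming Spring AI's PromptChatMemoryAdvisor and InMemoryChatMemory as they existed in the milestone releases discussed (class names may differ in later versions):

```java
// A hedged sketch of per-user chat memory via an advisor; prior turns are
// re-sent to the model on every subsequent request for that user.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.PromptChatMemoryAdvisor;
import org.springframework.ai.chat.memory.InMemoryChatMemory;
import org.springframework.web.bind.annotation.*;

@RestController
class AssistantController {

    private final ChatClient chatClient;

    // one advisor (and therefore one conversation history) per user
    private final Map<String, PromptChatMemoryAdvisor> memory = new ConcurrentHashMap<>();

    AssistantController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    @GetMapping("/{user}/inquire")
    String inquire(@PathVariable String user, @RequestParam String question) {
        // lazily create the user's chat memory, mirroring the computeIfAbsent in the demo
        var advisor = this.memory.computeIfAbsent(user,
                u -> new PromptChatMemoryAdvisor(new InMemoryChatMemory()));
        return this.chatClient
                .prompt()
                .user(question)
                .advisors(advisor) // the advisor pre-processes the request and replays prior turns
                .call()
                .content();
    }
}
```

Swapping InMemoryChatMemory for one of the persistent ChatMemory implementations mentioned (JDBC, Neo4j, and so on) changes only the object passed to the advisor.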
[00:58:57] Josh: So here we go. So now we go back and we say, uh, my name is Josh. Great. What's my name? Your name is Josh.
[00:59:21] Josh: Yeah. Okay, great. But it's not supposed to be helping people with their homework. It's supposed to be a model to help people adopt a dog. Clearly we've got, we've kind of wandered off in the deep end here.
[00:59:29] Simon: And this is interesting because you see a ton of companies. What was the, what was the, there were a couple of big ones. I think one was, was it Amazon, Chrysler or someone, where, yeah, they were basically getting it to, to, like, write a bunch of, like, malicious code and things like that from the sites, which is kind of
[00:59:44] Josh: Amazon for a moment. I think their app has an assistant there. Yeah. And you can actually, like, somebody prompt poisoned it. Yep. And got it to, like, generate code for them instead of helping them with shopping because of Amazon then. Anyway, that's not what we want. Yeah. So we want, we don't want this thing getting too off, too far off in the weeds. We have a mission, we want people to adopt dogs.
[01:00:03] Simon: Unless you want to know how many, two dogs plus another two dogs, right? Right. Could be right. Yeah.
[01:00:08] Josh: So what we wanna do is, uh, is to, um, give it a system prompt. Yeah. The system prompt is the overall tone and tenor. So our system, okay, here we are. And, uh, cat, desktop, talk system. I happen to have a system prompt. Okay, I'll paste that there. Ah, you know what I just did? It's the wrong tool. Yeah, it's the wrong one.
Josh: There that’s better.
[01:00:31] Simon: Do you know when you switched from Cat to Dog there? I was about to make that joke and I thought, oh no. The only reason I know that joke is 'cause you said it last time I saw it, a couple of months ago.
[01:00:41] Josh: I love it. So, okay, that's better. Right. So we're gonna say: you are an AI-powered assistant to help people adopt a dog from the adoption agency called Pooch Palace, with locations in Antwerp, Seoul, Tokyo, Singapore, Paris, Mumbai, New Delhi, Barcelona, San Francisco and London. That's where we are, in DevOps Hub. Information about the dogs available will be presented below. If there's no information, return a polite response suggesting we don't have any dogs available.
[01:01:03] Simon: I bet if you, I bet if you put two plus two in that prompt, it'll still be four.
[01:01:07] Josh: Sure. But we don't want it to.
[01:01:09] Simon: Nice.
[01:01:09] Josh: Okay, so that's a system prompt that's going to dictate the manner in which it responds to us. Right. It'll try and frame all responses in terms of that mission, that overarching mission. Like, let's see what it says actually.
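For reference, a hedged sketch of how that system prompt could be wired into the ChatClient's defaults; the prompt text paraphrases what Josh reads out, and defaultSystem is Spring AI's hook for exactly this kind of overall instruction.

```java
// A sketch of setting the system prompt as a default on the ChatClient,
// so it frames every subsequent exchange with the model.
import org.springframework.ai.chat.client.ChatClient;

class AssistantChatClientFactory {

    static ChatClient create(ChatClient.Builder builder) {
        var system = """
                You are an AI-powered assistant to help people adopt a dog from the
                adoption agency named Pooch Palace, with locations in Antwerp, Seoul,
                Tokyo, Singapore, Paris, Mumbai, New Delhi, Barcelona, San Francisco
                and London. Information about the dogs available will be presented below.
                If there is no information, then return a polite response suggesting we
                don't have any dogs available.
                """;
        return builder
                .defaultSystem(system) // sets the tone and mission for every request
                .build();
    }
}
```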
[01:01:22] Josh: Yeah, it helps, but it gets us back on track.
[01:01:29] Simon: It's like, it's always trying to help, right? That's the thing with AI: it, it, it has all that background information, it has a specific bit of context, right? That doesn't mean it's gonna forget all the other information. It still knows how to answer your question.
In this episode of AI Native Dev, host Simon Maple is joined by Alex Gavrilescu, Baruch Sadogursky, Josh Long, and Liran Tal to explore a provocative shift: moving from code-centric to spec-centric development in the age of AI. The throughline is simple but powerful—developers and stakeholders can align on intent through human-readable specifications, then let deterministic tooling and bounded AI do the heavy lifting. As Baruch quips, you shouldn’t trust a monkey to write Shakespeare—or tests. Instead, build a chain of trust that starts at the spec and ends in code that demonstrably does what it should.
The conversation opens with a common anxiety: if AI writes more code, will developers stop looking at code altogether? Baruch reframes the concern. It’s not that developers hate reading code; they hate reading code that isn’t theirs. AI-generated code is, by definition, “someone else’s code,” and the novelty of reviewing LLM output will wear off. That means we need a representation of intent that is easier to agree on than raw code.
Human-readable specs fill this gap. Rather than arguing over diffs in an IDE, teams can encode product intent in a shared language that both technical and non-technical stakeholders understand. The more you formalize intent up front—and make it reviewable—the less you have to rely on brittle, after-the-fact interpretation of what the code “should” do. The hosts argue that this is how developers keep doing deep, meaningful work, even as the mechanics of code production become more automated.
This approach also creates a practical division of labor. Areas you deeply care about get explicit specs and tests. For less-critical paths, you can let the LLM infer implementations, confident that if it goes off course it highlights missing intent rather than a failed engineering process.
The panel gets candid about why TDD didn’t “win.” If developers are the only ones who can write and read tests, the tooling excludes the product and business voices who own intent. Developers, biased toward action, often rush to implementation and backfill tests later. The result: tests validate code, but they don’t source the truth of the product.
BDD tried to fix this. Frameworks like Cucumber introduced Gherkin’s Given-When-Then structure to describe behavior in semi-structured natural language. The goal was inclusivity—let anyone propose and review behavior. In practice, however, the syntax felt rigid for many product folks, and the coupling between BDD specs and implementation was often loose. Specs drifted like stale PRDs, and teams returned to “the code is the source of truth.”
AI offers a way out without compromising rigor. Instead of asking product managers to handcraft Gherkin, you can feed a requirement doc to an LLM to draft initial scenarios. Yes, the model may hallucinate or miss context. But that’s acceptable if you commit to a human-in-the-loop review cycle and keep the spec readable. The order of operations changes: draft specs with AI, review and correct as a team, and then turn that intent into executable tests deterministically.
Baruch lays out a repeatable chain of trust. Start with a prompt, requirements, or PRD-like doc. Use AI to generate Gherkin-style scenarios that capture intent in Given-When-Then form. Because these artifacts are plain language, anyone can review them. Crucially, treat this as iterative—expect to refine the spec until it reflects the product truth.
Next, convert specs into executable tests deterministically using tooling like Cucumber (JVM), SpecFlow (.NET), Behave or pytest-bdd (Python), Godog (Go), or Serenity for richer reporting. The key insight is to avoid nondeterminism at this stage. LLMs are “slightly better than random monkeys typing,” so don’t ask them to write tests. Parsing and wiring Gherkin into step definitions and fixtures should be ruled by algorithms, not generative models. Run the conversion 10 times; get the same result 10 times.
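To make the "deterministic step definitions" point concrete, here is a minimal Cucumber (JVM) sketch for a hypothetical dog-adoption scenario. The Gherkin text in the comment and the step names are illustrative, not taken from the episode; the point is that the mapping from scenario to code is fixed by annotations, not generated by a model.

```java
// Step definitions for a hypothetical Gherkin scenario such as:
//
//   Given a dog named "Prancer" is available for adoption
//   When a visitor asks to adopt "Prancer"
//   Then the adoption is confirmed
//
import java.util.HashSet;
import java.util.Set;

import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;

import static org.junit.jupiter.api.Assertions.assertTrue;

public class AdoptionSteps {

    private final Set<String> availableDogs = new HashSet<>();
    private boolean adopted;

    @Given("a dog named {string} is available for adoption")
    public void aDogIsAvailable(String name) {
        availableDogs.add(name);
    }

    @When("a visitor asks to adopt {string}")
    public void aVisitorAsksToAdopt(String name) {
        adopted = availableDogs.remove(name);
    }

    @Then("the adoption is confirmed")
    public void theAdoptionIsConfirmed() {
        assertTrue(adopted, "the requested dog should have been adoptable");
    }
}
```

Because the scenario text binds to these methods through the annotation expressions, compiling the same spec ten times produces the same executable test ten times.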
Finally, unleash the LLM to implement code that makes those tests pass—under guardrails. If the model tries to “cheat” by editing tests to green them, block it. Make test directories read-only, mount them as read-only volumes in containers, or enforce pre-commit hooks and CI policies that reject test modifications. The invariant is simple: tests encode intent; code must conform. If code passes, you can trust it because the tests were compiled from an agreed spec.
Specs are only useful if people actually review them. Long specs are a real risk—humans skim, miss details, and wave things through. The team proposes a two-tier safety net. First, human review for high-signal sections and critical flows. Second, LLM-as-judge to cross-check coverage and consistency. Use a different model than the one that generated the specs to ask questions like: Do these scenarios cover all acceptance criteria? Are edge cases addressed? Are there contradictions or ambiguous steps?
This “model ensemble” approach turns AI into a verification tool rather than a generative authority. Treat its output as suggestions, not truth. Incorporate lightweight prompts to expose gaps in coverage, such as comparing scenarios against user stories, acceptance criteria, and known non-functional requirements (performance, security, compliance). The result is a pragmatic review loop that scales as specs grow without sacrificing human judgment.
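A small sketch of what such an LLM-as-judge pass could look like with Spring AI's ChatClient; the SpecReviewer class and the prompt wording are illustrative, and in practice the injected builder would be configured against a different model than the one that drafted the scenarios.

```java
// A hedged sketch of an "LLM as judge" reviewer; its output is advice for
// human reviewers, never a source of truth.
import org.springframework.ai.chat.client.ChatClient;

class SpecReviewer {

    private final ChatClient judge;

    SpecReviewer(ChatClient.Builder judgeBuilder) {
        // assumption: judgeBuilder points at a different model than the spec generator
        this.judge = judgeBuilder
                .defaultSystem("""
                        You review Gherkin scenarios against acceptance criteria.
                        Point out missing coverage, contradictions and ambiguous steps.
                        Make suggestions only; do not rewrite the scenarios.
                        """)
                .build();
    }

    String review(String acceptanceCriteria, String scenarios) {
        return judge.prompt()
                .user("Acceptance criteria:\n" + acceptanceCriteria
                        + "\n\nScenarios:\n" + scenarios)
                .call()
                .content();
    }
}
```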
Over time, you can template this process. Maintain a cookbook of scenario archetypes—CRUD, search, auth, billing, error handling—and let the LLM propose instantiations. Bake in domain-specific step libraries so compiled tests always target stable step definitions. Your review then focuses on scenario correctness, not syntactic ceremony.
The episode closes with actionable guidance to operationalize this approach. Start with a thin slice: pick one critical feature or service boundary. Draft specs from the existing PRD with an LLM, then workshop them with engineering and product. Adopt Cucumber or an equivalent in your language stack, and standardize a minimal set of step definitions for your domain to ensure the spec-to-test path is deterministic.
Enforce guardrails. Protect the test tree with OS permissions, Git attributes, and CI policies. If you generate code in a container, mount the tests directory as read-only. In your CI, require that any changes to tests come with human approval and are not authored by automation accounts. Consider mutation testing or coverage gating on the compiled tests to ensure they actually fail when behavior regresses, not just when code changes.
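One way to realize the "protect the test tree with OS permissions" guardrail, sketched in Java under the assumption of a POSIX file system; the src/test path is illustrative, and on other platforms a read-only volume mount or CI policy does the same job.

```java
// A hedged sketch: mark the test tree read-only before handing the workspace
// to a code-generating agent, so tests encoding intent cannot be "greened" by edits.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.stream.Stream;

public class ProtectTestTree {

    public static void main(String[] args) throws IOException {
        Path tests = Path.of("src/test"); // hypothetical test directory
        try (Stream<Path> paths = Files.walk(tests)) {
            paths.forEach(path -> {
                try {
                    // read and traverse only; no writes allowed for anyone
                    Files.setPosixFilePermissions(path,
                            PosixFilePermissions.fromString("r-xr-xr-x"));
                } catch (IOException e) {
                    throw new RuntimeException("could not protect " + path, e);
                }
            });
        }
    }
}
```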
Tune your LLM usage. Keep model temperature low for code generation to limit variance. Use a separate model for judge/critic tasks. Log prompts and outputs for traceability. Most importantly, be explicit about where you’ll accept inference. For low-risk utility functions, let the LLM implement without exhaustive specs. If something comes out wrong, treat it as an intent gap—write or refine the spec, recompile tests, and rerun. This creates a virtuous loop where missing behavior becomes visible and fixable.
This spec-first, AI-assisted workflow keeps developers in control, invites stakeholders into the conversation, and turns LLMs into powerful, bounded tools—so you can move fast without trusting monkeys to write Shakespeare.
