Podcast

The Missing Gap In Workflows For AI Devs

With

Baruch Sadogursky

1 Jul 2025

Episode Description

Baruch Sadogursky, Head of Developer Relations at TuxCare, joins Simon Maple to explore why automated integrity needs to be built in before we rely on AI outputs. On the docket:

• the difference between specs and tests
• why PMs sidelined specs
• the "intent-integrity gap" between human goals and LLM outputs
• the non-determinism of LLMs as a feature, not a flaw
• Baruch's belief: devs are not going anywhere

Overview

Introduction: Revisiting Trust in Code at AI-Fokus

Simon and Baruch reconnect at the AI-Fokus conference to explore how LLMs, specifications, and modern development practices are reshaping software engineering. They reflect on their shared history and shift the conversation toward the integrity of AI-generated code.

The Core Issue with AI-Generated Code

AI-generated code often lacks developer trust. Developers tend to avoid reviewing code they did not write, especially when produced by machines. This problem mirrors poor human code review practices and highlights a need for better accountability mechanisms.

Tests as Guardrails and Specs as the Foundation

Baruch proposes that software quality should be driven by well-defined tests. If code passes trustworthy tests, it can be accepted without manual inspection. However, for this to be effective, those tests must be generated from a clear and agreed-upon specification. The specification becomes the authoritative source of truth, accessible to both technical and non-technical stakeholders.

The Promise and Limitations of BDD and Gherkin

Behavior-Driven Development (BDD), supported by Gherkin syntax, attempted to make specifications readable and writable by all stakeholders. While human-readable, Gherkin proved too rigid for product managers and non-technical users, limiting adoption. Additionally, the disconnect between specifications and implementation caused them to become outdated and unmaintained.

The Intent Integrity Chain

Baruch introduces the concept of an intent integrity chain, a structured process for aligning software with human intent:

  1. Begin with a prompt or product definition.

  2. Generate specifications using an LLM and review them with stakeholders.

  3. Compile specifications into deterministic tests (outside the LLM).

  4. Use LLMs to generate code until it passes those tests.

  5. Lock tests to prevent tampering and ensure reliability.

In this model, code is treated as a disposable output, with integrity preserved through specifications and tests.
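As a rough, hypothetical sketch (no tooling from the episode is implied), the chain reads as a loop in which the spec is the only human-reviewed artifact; every callable below is a placeholder for an LLM call or for the deterministic spec-to-test compiler, not a real API:

```python
from typing import Callable

def intent_integrity_chain(
    product_definition: str,
    generate_spec: Callable[[str], str],     # steps 1-2: LLM drafts, humans review
    compile_tests: Callable[[str], object],  # step 3: deterministic, no LLM
    generate_code: Callable[[str], str],     # step 4: LLM, untrusted output
    passes: Callable[[object, str], bool],   # runs the locked tests against the code
    max_attempts: int = 50,
) -> str:
    """Hypothetical sketch: code is disposable; trust flows from the
    reviewed spec through the locked, deterministically compiled tests."""
    spec = generate_spec(product_definition)
    tests = compile_tests(spec)  # step 5: lock these before any code is generated
    for _ in range(max_attempts):
        candidate = generate_code(spec)
        if passes(tests, candidate):
            return candidate  # accepted without manual code review
    raise RuntimeError("no candidate satisfied the spec-derived tests")
```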

Microservices and Iterative Regeneration

The conversation emphasizes the importance of modular architecture. With microservices, updates or new requirements can be addressed by regenerating individual components. The prompt and spec drive the process, allowing for scalable and maintainable evolution of the system.

Expanding the Role of Specification with Tessl

Tessl is positioned as a more capable successor to Gherkin. It allows for richer specifications that include behavioral expectations, API interfaces, and non-functional concerns such as performance, security, and language preferences, so that Gherkin-style behavioral specs are no longer overloaded with information they were never designed to carry.

Continuous Validation and Feedback Loops

The podcast highlights the value of feedback loops from production telemetry and quality metrics. These inputs can inform changes to specs, enabling continuous iteration while maintaining alignment with original intent.

The Evolving Role of the Developer

In a spec-centric future, developers will not disappear but evolve. Some will focus on architecture and composability, while others will act as domain experts ensuring feasibility and guiding prompt formulation. Technical knowledge remains critical, particularly for non-obvious constraints and system-wide considerations.

Conclusion

The intent integrity chain, when implemented with advanced spec tooling like Tessl, provides a reliable structure for developing software with AI. It allows teams to scale LLM-based development while maintaining trust, aligning stakeholders, and ensuring that code reflects shared intent.

Chapters

00:00 Trailer 

00:59 Introduction 

02:24 Code Accountability 

05:06 Upfront Test Guards 

09:32 Monkey Metaphor 

11:43 BDD Origins 

15:01 Spec-Driven Chain 

20:04 Disposable Code 

24:26 Spec Evolution 

33:59 SDLC Refined 

46:27 Closing Remarks

Full Script

Simon: Hello and welcome to another episode of the AI Native Dev. I am at the AI-Fokus conference, which, for those who maybe, kind of recognize the name slightly, is a conference by Mattias Karlsson, again from Jfokus.

Baruch: It's a spin-off. 

Simon: It's a spin-off of Jfokus, a conference which we've both been to countless times. And joining me today, Baruch. Baruch Sadogursky.

Baruch: Sadogursky. 

Simon: Close, close. It's been a while. So Baruch, we go back how many years? 10 or 15? 

Baruch: Yeah, I guess almost 20, like ZeroTurnaround, early days. Something like that.

Simon: Yeah, yeah, yeah. And then you were in JFrog? 

Baruch: Yeah, like the early 2010s.

Simon: Okay. I feel old now, Baruch.

Baruch:  We are, we are. 

Simon: And yeah, we were, so I saw you at the speaker dinner yesterday here in Stockholm. And I thought, oh, I need to sit down with Baruch, chat to him, see what he's talking about. And wow, we were talking the same language yesterday.

Baruch: Absolutely, absolutely. Yeah, and that's what I was excited about because I thought about when you, so your work at Tessl and the idea of spec driven development. And this is something that obviously resonated with me because I was thinking about this idea of how can we make code generation in AI accountable in an easy way.

Baruch: Right, because at the end of the day, the whole idea of vibe coding or whatever flavors, less vibe more coding, more vibe less coding, all of them boil down to the lack of trust in the generated code. And it's not because someone is evil and wants to like screw others. It's because it's only natural for us to not give the full attention to the code that something else wrote, compared to the code that we write.

Simon: And vibe coding is an interesting one, because with vibe coding we tend to reduce the number of tests we end up writing, and we do far less to validate the code's integrity ourselves. We tend not to read it. We trust the AI doing it, or we click through and go, oh yeah, this kind of does what it does.

Baruch: And that looks good to me. Now, the thing is, it's not because AI is in the mix. This is also true for everything else. If we are on the same team and you ask me to do a code review, I'll look at the code. Well, it looks okay to me. Looks good to me.

Baruch: Let's ship it. This is terrible. Yeah. And this is one of the reasons why software is terrible. But most software still works.

Simon: Yeah. 

Baruch: And it still works because, as you mentioned, of integrity, and the integrity that we're talking about when we are two humans in the mix is a human integrity. Right. You are a professional. You have standards. I know that, and I trust you to an extent. I know that the code that you will give me for a code review won't be absolute garbage.

Simon: Well, it's been a while since I've written code, but yeah, yeah. Okay. Okay. I'll accept that. Right. 

Baruch: Now, how much worse is it when code is generated by a machine that has no integrity, no standards, and is usually absolutely clueless about the code it has generated? Yeah. Yeah. So it is much worse, but we still won't apply any rigorous code review, because it's not our code. This is not something we can do.

Simon: Yeah. 

Baruch: Right. So the question is, how do we automate this integrity into the code that we see on the other side? And that was like my question to begin with. And the way I thought about it was, well, we have guard rails for code existing for many decades. And those guard rails are obviously tests, tests, of course.

Baruch: Right. So what if we put those guard rails up front, and then we don't care that much about the quality of the code? So we can write tests, or we can generate tests somehow, and make sure that we can trust those tests. And once we trust those tests, we will have to trust the code that passes the tests.

Simon: So the tests are then, by, from what you're saying there, the tests become the most important thing, the most important piece of code effectively, because if you have bad tests, you have bad code, if you have good tests, you have good code.

Baruch: Yes. Those are the guard rails. Those are the guard rails. Now, obviously, the next question is, okay, but how is it different? We don't want to write tests either. So if the tests are generated, we don't trust those tests either. So in order to trust those tests, even if they're generated, we need to read them.

Baruch: And we don't want to read code. Yes. As we just spoke about. 

Baruch: Not only that, if the tests define what our software does, it's probably a wider circle of stakeholders that need to have a say in those guard rails, not only developers. People like product managers, like business stakeholders, they are definitely not going to read tests.

Simon: Yeah. But they're actually the people who have probably the stronger intention behind how something works. 

Baruch: They actually know what the software should do. We're just code monkeys who put their intent into code, and we're going to be replaced by other code monkeys, right? But the really important people in the mix are the product managers, the business people and the customers.

Simon: Let me pull you up on something that you said just before. You said we want something to automate the tests, something to create tests, because we don't want to look at code. Right. How true is that though? Right. Because I did a keynote at Devoxx a few weeks ago, and I was talking about how the world is moving into this new space where developers will continue to be creators. We will. We'll do things in a different way. And one of those ways is we will look more into specifications, less at code.

Baruch: Exactly. 

Simon: A developer came up to me though, at the end. Okay. My last slide, because it was Devoxx, was: Keep calm and carry on. Because we as developers need to change.

Baruch: Evolve. Evolve. 

Simon: He was super worried though. He was like, how can you put that slide up when it means we're not looking at code? 

Simon: He loves coding. Yeah. Absolutely. His safe space was clearly in an IDE looking through code. Yeah. His mind was extremely technical. Of course. Loves the challenges. Loves the depth. Yeah. So how do those people deal with that?

Baruch: So I think when I say like we don't want to look at code, we mostly don't want to look at code which is not ours. Like we're, I mean we're humans. I agree. Well, selfish. We're self-centric.

Baruch: We most of the time listen to someone, and when we listen, what we really do is come up with a clever reply to what they're about to say without even listening. Yeah. We know all those flaws. Like humans are flawed and very self-centric. And this is true about code as well.

Baruch: And the example that I gave you earlier, I'm obviously in love with my code. Yeah. Yeah. But I'm much less inclined to look at your code than you are, because, well, okay, that's just some other code. Yeah.

Simon: Or my code that's one month old. Which might as well be someone else’s code.

Baruch:  Oh yeah. Absolutely. Right. So and this is, and this is the problem with the AI generated code, that it’s someone else's code. Yeah. Right. So you might look at it and especially now when it's a novelty, you look at it out of curiosity to see what everyone does. 

Baruch: Yeah. And you're like, oh my God, this is brilliant. This is crap. 

Baruch: But you look at it because you're curious. Like in six months, when more and more code will be generated by AI, you will just, okay, someone else's code. Right. So first of all, you don't want to look at this code. And obviously, the non-technical people definitely don't want to look at this code.

Simon: They want to look at intent. They want to look at this test does this, and I can read that.

Baruch: Exactly. Exactly. 

Baruch:  Right. And generally, the idea that code is, that the tests are technical and written in code, I think, and that's just an assumption, is one of the reasons why TDD really didn't take over the world. Because in the end of the day,

Simon: TDD did you say? 

Baruch: Yeah. Like test-driven development didn't take over the world because, in the end of the day, if developers are the only ones who can write the tests and read the tests, the real question becomes, at least for me in my experiences, why am I doing it this way? Yeah. I have a problem. I'm biased for action.

Simon: Yeah. 

Baruch: I know the solution in my, I see algorithms running, code written in my eyes. I want to go and write code. I want to write tests. I don't mind writing tests afterwards to check that my code does what it's supposed to do.

Baruch: But starting with code doesn't make sense if developers are the only ones involved in the picture. Yeah. And I guess I'm not the only one who thought about this problem. And this is one of the reasons why behavior-driven development came to be 20 years ago. Like the BDD thing.

Simon: Yeah. 

Baruch: The way BDD works is like, hey, what if the tests weren't defined in code, but were defined in a spec? And what BDD refers to as a spec is some kind of pseudo-natural language. It's called Gherkin.

Baruch: And it's basically a set of rules that goes like given context when something happened, then expect those results.  Right. 

Baruch: So basically this is the steps. And the beauty of Gherkin spec is that it's human readable. And the idea was that it's also human writable. Yeah. Right. So the product managers, the business people, maybe even the customer can participate in writing those specs. 

Simon: And it's interesting because when Agile came in, very often we started writing use cases and things like that, and starting with use cases. It was similar. The language is slightly different, but it was actually kind of a similar thing: as a whoever, I want to use it to...

Baruch: Exactly. Right. And then the beauty of the idea of the Gherkin spec was that you take an algorithm and compile those specs into tests. Now, look at us moving one step forward toward building the integrity that we spoke about. Because now suddenly we can have something that technical people and non-technical people can see and write. And then, if you remember the rest of the chain that we started to build, the spec is compiled to the tests.

Baruch: We can trust the tests; they are exactly what the spec said.

Simon: Yeah. 

Baruch: The code implements the test. So we can trust the code. 

Baruch: Because it has to, we have to, because if it passes the test, it does exactly what the test wants. And the tests are the result of compilation of something that we agreed our application should do. 

Simon: So the areas that you care about, you should have as many of those tests as you can. The areas that you don't care about. You just allow the LLM to infer whatever code it wants and whatever it decides. Yeah. That's the right way. That's a way of doing it. That's an okay way. 

Baruch: If it's wrong, it means that you need some tests that will enforce the way that you want it. And it means it's missing some spec. It misses one of those given-when-then use cases, and you just need to go ahead and write it.

Simon: It's an uncaptured intent essentially. 

Baruch: Now, BDD also didn't really take over the world. No. You see it even less than actual TDD. And again, my speculation why is that writing this structured intent, the spec, was still a little bit too much for the non-technical people. Right? Yes. Of course it's readable, but writing it is kind of annoying. You know, when you speak with business people or product managers, they are kind of free souls. They want to create. They want to write Shakespeare. They don't want to write this rigid structure.

Baruch: You read a PRD, it's like, it's like poetry. Yeah. So, so again, those specs didn't really catch on, because product managers were like, ah, that's too rigid. And then if they don't do it, developers are definitely not going to do it either. Right?

Simon: I also think, I also think one of the issues with BDD is there was no hard connection between the BDD intents, the spec part, and anything else. It was left up to someone to then implement that.

Simon: And as change happened in the business and things like that, it gets stale, in the same way a PRD would. And then all of a sudden, you know, no one goes back to the PRD because it's out of date. The code is the source of truth.

Baruch: But now, if we automate all the way from the spec to the end product, the spec is very important, but no one wants to write this stupid Gherkin thing. Yeah. Suddenly, AI can help.

Baruch: Yeah. Because we don't need to write. Once we have a, like a specification document. Yeah. 

Baruch: Right. Whatever we call it, a so-called definition document, whatever. We can ask AI to formulate those specs, those given-when-thens, from it. Now, we know that it will hallucinate, and we know that the spec won't be exactly what we wanted, for mainly two reasons. The first, as I mentioned, is that it hallucinates. But the second is that it's very hard for humans to express intent in words.

Simon: Yeah. 

Baruch: Because of context that we have in our heads that obviously LLM lacks. So there is this chasm between the prompt and the intent, between what we ask it to do and what we really wanted it to do.

Baruch: So for those two reasons, the spec won't be perfect. But coming back to the biggest benefit of the spec: because it is readable, we can actually review it. It's different from tests, which we theoretically could also declare a source of truth and review instead of specs. But as we spoke about, we don't want to read anyone else's code, and the non-technical people can't even do that.

Baruch: Yeah. So the spec is different because it's readable, in your own language. There are hundreds of languages, translations of the Gherkin spec. You can read it in Urdu if you feel like it. And this is suddenly something that everybody around the table can sync on and say, okay, now we can read the spec.

Baruch: It's not painful. In English. And we can verify that what's in the spec is actually what we wanted to be generated from the prompt. Yeah. Now, if it's not, we go back and forth, we iterate until it is, or we just fix the spec. It's almost plain language.

Simon: And presumably you could almost use LLM as a judge as well to some extent almost as a final backup to say, look, because people are going to be lazy. Right? Yeah. And I think this is one of those interesting things where even if you look at something like code review, okay, people don't like looking at each other's code, for example. But it's something that the more text, the more change there is, we'll kind of like look at that and go, yeah, that kind of looks good. 

Baruch: It looks good. 

Simon: If there's two lines, we're going to be questioning the variable choices. So if there is going to be a big set of tests or a big specification, are we actually going to go through it? 

Baruch: And LLM as a judge could actually be that. There are going to be levels of automation where, okay, we need that backup almost. There's potentially two levels there. One for the human.

Baruch: And we can bring in different models. We can generate the spec with one model and then ask another model to check that the intent is being captured. Yes. Yeah. Right? And then we can iterate on this process. But in the end of the day, human-readable specs are much easier to review than tests or code. Right? And this is the main idea.

Baruch: So now we have the entire chain. We start with the prompt, like, hey, I want to write an application that does ABC, come up with a set of specs that will satisfy those requirements. Then we have the specs. We don't trust the LLM. 

Baruch: So the problem with an LLM is that it is non-deterministic, no better than a random code generator. Right?

Simon: I like that.

Baruch: So my analogy is, you remember the thought experiment that if you get an endless number of monkeys and an endless number of typewriters, eventually they will come up with the works of Shakespeare. This is LLMs. They're slightly better than random monkeys typing on typewriters. But essentially, that's exactly it. They try to get it right.

Baruch: And it almost, sometimes almost works. So, and the idea is that you cannot really trust a monkey to write Shakespeare, especially not on the first run. Yeah. So the idea here is that... 

Simon: I think that's the quote that we should take out from this entire podcast. You can't trust a monkey to write Shakespeare. 

Baruch: This is exactly... 

Simon: That's one thing. I don't care whether you're spec-centric, code-centric. 

Baruch: That's one thing we can all agree on. Never trust a monkey to write Shakespeare. 

Simon: Or any kind of famous literature. 

Baruch: Or code. 

Simon: Yeah, or code. Or a code. Well, I've seen some code. 

Baruch: No, but that's, you’ve seen some code, but it wasn't true. 

Simon: No, that's true. Absolutely not. 

Baruch: Yeah. So let's make sure that we iterate until it is. Yeah. And that's exactly the idea, right? So the monkeys generate spec. Will they do it right for the first time? 

Baruch: Probably not. We want to bring in other monkeys. And eventually, the most important thing is that we look at the spec.

Baruch: And we say, well, it rhymes, but it's not Shakespeare. Yeah. Try it again. 

Baruch: With five different prompts. Yeah. Or we can just edit the spec directly. Now, once we have the spec, we say, okay, we cannot trust the monkey to generate correct tests from the spec, because we cannot trust the monkey.

Baruch: Now, the problem is that we decided we're not going to review the tests. It's too much for us, and it's someone else's code, we don't want to do it. The beauty of Cucumber specifically, but of any parsable spec in general, is that we don't need a monkey here. Because we take an algorithm and compile the specs into tests, deterministically, every time. We don't need the monkey.

Simon: Yeah. If you run it 10 times, you'll get the same thing 10 times. Whereas if you do it with LLMs, there'll be subtle differences and maybe sometimes bigger. 

Baruch: Of course. And since we don't review the tests, we won't even know that something is wrong. Yeah. Right? So monkeys are out of the picture for this part. Right? So we generate the tests. Now, once we have the tests, we can let the monkeys loose, typing on those typewriters forever until the tests pass.
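To make that concrete, here is a minimal sketch of such a deterministic spec-to-test compiler (the registry and step names are invented for illustration, not a real framework): the same Given/When/Then text always compiles to the same test, with no LLM anywhere in this stage.

```python
import re
from typing import Callable

STEP_REGISTRY: dict[str, Callable] = {}  # step pattern -> handler, written once by the team

def step(pattern: str):
    """Register a handler for Given/When/Then lines matching `pattern`."""
    def register(fn: Callable) -> Callable:
        STEP_REGISTRY[pattern] = fn
        return fn
    return register

def compile_scenario(spec: str) -> Callable[[dict], None]:
    """Deterministically compile a Gherkin-style scenario into one callable test.
    Identical spec text always yields an identical test: no monkeys in this stage."""
    lines = re.findall(r"^\s*(?:Given|When|Then)\s+(.+)$", spec, re.MULTILINE)
    def run_test(context: dict) -> None:
        for line in lines:
            for pattern, handler in STEP_REGISTRY.items():
                match = re.fullmatch(pattern, line)
                if match:
                    handler(context, *match.groups())
                    break
            else:
                raise AssertionError(f"No step handler defined for: {line!r}")
    return run_test
```

A step handler might look like `@step(r'a cart totalling (\d+) EUR')`, binding the captured amount to a setup function; the point is only that this stage is a plain algorithm, not a model.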

Baruch: The only thing is we need to protect the tests, because the monkeys will be inclined to adjust the tests to make them pass. So how do you do it?

Baruch: Whatever. Make the files read only. Put them in a Docker container and make them inaccessible, like, also read only. Do whatever to protect the tests and then let the monkeys loose.
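As a sketch of that locking step (the paths are assumptions, and the container mount is just one of the options mentioned here):

```python
import stat
from pathlib import Path

def lock_tests(test_dir: str) -> None:
    """Strip write permission from generated test files so a code-generating
    loop cannot quietly edit a failing test into a passing one.
    A stronger variant mounts the directory read-only into the container
    that runs the generator, e.g. `docker run -v ./tests:/tests:ro ...`."""
    for path in Path(test_dir).rglob("*.py"):
        path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # 0o444: read-only
```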

Baruch: Let them type as long as they want to until the tests pass. Yeah. And now we have the chain. Right? We have a prompt that is guaranteed to capture our intent because we reviewed the specs.

Baruch: And then we have the code that is guaranteed to match the specs, because we generated the tests without the monkeys, and the code actually passes the tests. So we have the entire picture, from our ideation to the end product, that we can absolutely 100% trust. This is an intent integrity chain.

Simon: Intent integrity. 

Baruch: Because we guarantee the integrity of our intent in the code. 

Simon: So let's say now that we release that. Yep. The world is wonderful. Yep. We have farms of both cucumbers and monkeys. In separate cages and separate fields and forests. Now someone comes in and says I want to make a change. 

Baruch: Beautiful. Right. Beautiful. 

Simon: And then we have maybe there's some bug changes, maybe there's some feature changes, maybe there's new features that need to be added. What does that flow look like? How do we stop the monkeys from attacking the cucumbers? 

Baruch: So I will mention now a word that you expect the least. Microservices. 

Simon: Okay. Yeah. 

Simon: I could have said several words before you said microservices then. But God.

Baruch: The intent integrity chain, microservices is its best friend. Because once your code is modular enough, and the services are micro enough, you know what you do?

Baruch: There is a new requirement, or there is a bug, you go back to the beginning. You go to the prompt, you fix this prompt to include this feature, or to express your intent better. You run the process all over again, and you have a new service that will replace the old one and you’re good to go.

Simon: Well, maybe it is microservices. It's composability taken to the extreme. Whereby when something changes, there's a component or a number of components that just need to be, well, in fact, it comes to that stage whereby you don't throw it away in the sense that this code is never usable again, because other things might rely on it, I suppose, but you would find something else that does that job in a different way and use that or add that in.

Baruch: Yeah, and you define APIs, and they are very easily definable in prompt and spec and test. And then, once you have the API defined, every part is replaceable. So you can actually replace every component with a better version of this component, because you improve the prompt.

Simon: So what do you…

Baruch: The code is absolutely disposable. It's garbage to begin with. It's monkeys who wrote it. It is garbage. And no one even looked at it, ever, because we don't care as long as the tests pass.

Simon: So, okay, here's a new feature brief. What are you changing? What is the doc you're changing? 

Baruch: So we start with the prompt. We always start with the prompt. So the prompt is some reference to a software definition document that product managers probably maintain. So we change this software definition document, and our prompt will be like, hey, we have additions and let's do the entire thing again.

Simon: So that changes a spec. In your vision, do you break the spec down?

Baruch: So from here, you can go in tons of different directions, right? You can say, okay, our spec should be very small and limited in scope, so we can replace it. You can say, okay, our spec should be modular, so we can direct the prompt to change only part of it. You can say, you know what, the change is so small, I won't even bother with generating new specs. I'm just going to edit the spec. There is like one sentence that I need to change.

Baruch: IIt's easier for me to change it than to go through the entire prompt thing. It doesn't matter. In the end of the day, what you need to make sure is that the spec, this is your source of truth, right? Because again, it's still readable by all the humans. We can agree on it and iterate on it and need it. And if it's too hard for us to write it, we can use LLM to generate it, and it's completely fine. But in the end of the day, what we all agree upon will be the spec. And we evolve it as we see fit, replace it, edit it, regenerate it, or do whatever we like.

Simon: And how good is, I'm going to be honest, I haven't used cucumber in like, I was going to say years, I'm going to say decades. 

Baruch: This is why we're old. The benefit of being old: I dug up something that the new generation has no idea actually exists.

Simon: Yeah, yeah, yeah. Or even if you say BDD, most people will go, what's a BDD? Is that a new TDD? So how good is Cucumber at changing something that already exists? Or would you just redo the whole thing?

Baruch: So the idea is that you redo the whole thing, right? And this is why having the code small and modular is so important. Not only because it's important for Cucumber, but remember that we don't look at the code. So if we try to refactor something, we actually don't know if it's any good and if it does what we wanted it to do. We need a new, updated set of tests, which we don't look at either. The only thing that we do look at is the specs. So once we change the specs, we have to regenerate everything down the road, because that's the only way to guarantee that the code actually matches the spec, and then the code matches the tests, and then everything else works with the chain. Right, now, Cucumber is just an example. It's what I had in my toolbox.

Simon: It exists. It exists. 

Baruch: Exactly. And obviously, Gherkin specs are not perfect, because frankly, there are concepts that are impossible to describe in given-when-then. For example, security constraints. Right, you can do something with performance.

Baruch: You can write specs that will kind of express it, like: given a certain load, when you throw more users at it, then the application responds in under a required time. But it already starts to be awkward, and especially stuff like other non-functional requirements, like security, is impossible to express. Right, because the whole idea of behavior-driven development is to express behavior, and cross-cutting non-functional concerns are hard to express as behavior. So obviously, it was never intended to be used for this kind of stuff, and the only reason I demonstrated with Gherkin and Cucumber is because it's just there. But what if we had a better way to express the spec? Something that is born for AI, born for this kind of problem. The rest of the engine doesn't change.

Baruch: You only have a better tool in this part when you need to go from spec to code. Everything else remains the same. You still have the same concerns of never trust a monkey, and how do we make sure that our intent is properly captured? 

Baruch: How do we close the chasm between prompt and intent? All those problems and the concept that solves them is the same. We will just have a better implementation for the part that goes from spec to code. 

Simon: Really interesting. It's that hardening, really, isn't it? It's not going to happen, unfortunately, the way LLMs are trained. We're not going to get that hardening, will we?

Baruch: Oh, no, no. We never will, because the whole idea of an LLM is that it's by definition non-deterministic. That's the whole idea. So you probably know, and probably most of your AI Native Dev audience knows, how it works: the neural networks have to have this level of freedom in order to generate something that is different from what they have been asked. It needs to guess a reply that is related to what you asked, but not exactly what you asked, because then it's just useless. So it has to have this degree of freedom.

Baruch: It has to be stochastic, non-deterministic by definition. And when you play with the temperature of the network and how creative it is, this is the balance. If the temperature is too high, it will hallucinate more. If it is too low, you won't get the responses that you want from it, because you limit how far it can go in search of the right answer. But regardless of the temperature that you set, unless it's zero, which makes the model absolutely unusable, you will have non-deterministic behavior built in. So the idea of not trusting the monkey will never go away, because it will generate different replies to your request, and only one of them will be the right one. So the rest of them, by definition, will be the wrong ones.
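To make the temperature point concrete, here is a small, self-contained sketch of temperature-scaled sampling (generic softmax sampling over raw scores, not any particular vendor's API):

```python
import math
import random

def sample_token(logits: list[float], temperature: float) -> int:
    """Pick a token index from raw scores. Temperature 0 degenerates to
    argmax (deterministic); higher temperatures flatten the distribution,
    so repeated calls with the same input return different answers."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]
```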

Simon: And other than, I guess, resources being used, overly used, it's actually not a problem that we don't trust the monkey because of the framework and the guardrails that we have. 

Baruch: Exactly. 

Baruch: And we will let it type non-Shakespeare until Shakespeare appears. 

Simon: Yeah, because the rest is just thrown away. So, okay, we've talked about a technology that is, oh my god, like 30 years old probably, in Cucumber and Gherkin spec and things like that. Let's project forward now. Okay, in this vision, in this projected way that we will build apps, I guess two questions. We'll start with the SDLC and then we'll go on to what a developer is. SDLC, how would that need to change?

Baruch: So in the end of the day, we're speaking here about a pretty narrow area of proper code, code generation, right? Of how can we trust the generated code? Now, there is like the entire chain of writing, of delivering software that is after that, right? Or the build and everything else. Now, this all can obviously be improved by using AI, but this is kind of a little bit out of scope for the intent integrity chain in particular, mostly because most of it is already kind of algorithmic, right? 

Baruch: So the build is a completely algorithmic problem. The only thing that changes now, and this is the beauty of it, is the raising of the level of abstraction. And this is the evolution that you spoke about at the keynote: you can look at everything after the spec as a part of the build.

Baruch: Right? In the end of the day, we took the spec, the spoken word, the written word, and compiled it into working Java code or whatever other code. And then the build compiles it even further, to whatever machine code.

Simon: To liquid software. 

Baruch: Yeah, exactly. Right? And then from there it goes on to being deployed, or whatever. So we kind of made a compiler out of a non-deterministic system, very wasteful, but also very fun, right? Which feeds another compiler that compiles to bytecode, which feeds another system that puts it on data centers and so on. So eventually, it's all compilers down there, from the spec. So the spec becomes our programming language and everything else is just SDLC.

Simon: And so that then leads us on to that next question of, I guess, so let's break this down into two now in terms of developers. I guess the scope of who can be a developer broadens, diversifies. Absolutely. What about the developer of today? What would you see? So first of all, let me ask you the question, that first question. What does a developer look like in five, 10 years?

Baruch: So for me, what really is not going away, and I don't think it ever will, is the expertise. And the expertise for us is what can be called grasping the art of the possible in computer engineering or computer science. Because if you let non-technical people write the prompts and then read the specs, what they ask for can be wildly inconceivable in terms of machine implementation. The requirements might be absolutely unrealistic in terms of functional requirements, but also in terms of non-functional requirements, like performance requirements that can never be met, right? Or anything like that.

Simon: But are these things that can be learned by non-technical people? There’s no reason..

Baruch: Well, to an extent, right? They can learn by kind of observing that, hey, this is not possible, without really understanding why. But we, the technical people, do understand why.

Baruch: Which gives us much more ability here. So we have this imaginary table where all the stakeholders, the business, the customers, the product, security people, and everybody else, have a say when they read those specs. And let's say someone says, well, we need to make this application very much faster. And as the developers sitting at the table, this is where we say, no, we actually cannot, and this is why. And instead, we need to go the other route and architect something in order to make it happen.

Simon: So the architect, that's interesting. So the level of abstraction is left to the architect in question. And so you have certain developers there that care about the architecture, and certain developers, as we were talking about previously, who care more about that intent: I need this to do this. They're providing that input. And it's the person who's more on that architecture side that says, okay, I can help you get to this stage by making these changes.

Baruch: Yes. And then there is a lot of technical expertise that goes into it. Because people can say, okay, now you have this service that you want to generate, but it is actually too wide. So it will be very, very wasteful on resources to regenerate it every time that you have a change across it.

Baruch: So we might want to make it small. So it's not only about understanding what the end application should do. It's also about understanding how the intent integrity chain works and what needs to go into the code and what doesn't. And this is not different from what we do today. Right. We have opinions on the quality of the code itself.

Baruch: Well, this variable is not properly named, or this should be refactored in this way or that. And we have opinions about the architecture of our application. Well, this is, should be designed this way, or those components should talk to each other and those do not. This all translates. 

Baruch: It just translates a little bit differently. So whoever really cares about the gritty details of the implementation might be interested in the implementation of the intent integrity chain mechanism, of this compiler of compilers. And whoever cares more about architecture will still care about architecture.

Baruch: They will just discuss it, not in terms of interface names and implementation class names, but in terms of the specs and how the microservices are talking to each other or not talking to each other. Right. So the expertise, the technical expertise that we have, is still absolutely critical for software engineering.

Baruch: This is not going away. And the knowledge, the domain knowledge is suddenly even more important. Yeah. Yeah. 

Baruch: Before we wrap up, let me ask you back. Oh, okay. Right. So, Tessl.

Simon: Yes. 

Baruch: It's about specs. So it sounds to me, when I told you, you know what would be wonderful? If we had something better to capture the spec and then translate it to code. It sounds like you might have something that might fit into this piece that is not exactly missing, but I would say not perfect, in the intent integrity chain.

Simon: Yeah. I think this is at the core of what we're looking at. You know, we believe that this spec-centric world is going to be the place where, with the validation, with the checks and that feedback loop, we are in a better place to actually be able to say, this is what we want.

Simon: And then allow LLMs to kind of fill in the gaps. And on tests, again, there's so much that you said that resonates. Tests become the most important thing. Code is essentially a disposable artifact that can be generated.

Baruch: But so long as your tests are good, your code will be proven good enough. And I think when you talk about the componentization and things like that, this all resonates very, very much. All right. So let's do a fun experiment. Talk me through the intent integrity chain with Tessl.

Baruch: How will it work, the process, when we replace the 20-year-old, not-really-fitting technology with something that was born for the AI age? Say I'm a product manager. I don't read or understand code. I have a software definition document, which is a beautiful piece of literature. And I want it in code, with the intent integrity chain with Tessl inside. How does it work?

Simon: So I think there's three key things that are important to be able to describe. One key thing is: what are the things that we want this to be able to do? Capabilities are really core there, being able to describe: this is the set of capabilities that I want this unit of software, this software component, to be able to do.

Baruch: And I can ask an LLM to take my beautiful piece of literature, the software definition document, and describe those capabilities for testing. Okay. Got that.

Simon: The next most important thing is tests. Okay. And the way we're looking at it, tests are actually part of that specification. So we actually go per capability.

Simon: Well, what are the groups, the areas, that we assert need to be true, or negatively need to be true, in order for that capability to be realized?

Baruch: And this still tracks with the spec generation that we spoke about. The LLM will review my software definition document and will come up with a list of capabilities and the test criteria or test scenarios for those capabilities. And we as humans, everybody, product managers, business, technical people, will review this and will say, well, that's not exactly what I meant.

Baruch: Let me adjust something in the document and regenerate all of that, or maybe just edit it right there and tweak it. Okay. Now we have capabilities, we have the test descriptions, test scenarios.

Simon: The third thing though, because this composability is super interesting. So the third thing is being able to say, this is the API. This is how I would describe this component to others. Right. So that you allow for that, you know, microservices interchangeability.

Simon: Exactly. And I think the API is a great way of actually saying to the LLM, this is the intent with which I want this component to be used. You're actually describing the interactions as well as the capabilities, which is very, very important.

Baruch: Right. And this is the third part of this amorphous spec. Yeah. And most of this part you kind of can describe with Gherkin, but not really, especially not the API. Yeah. And this is where we have a much more powerful tool to describe those specs.
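As a purely hypothetical illustration of the shape being described (this is not Tessl's actual spec format), the three ingredients might be pictured like this:

```python
from dataclasses import dataclass, field

@dataclass
class ComponentSpec:
    """Invented illustration of a richer, AI-era spec: the three parts named
    in the conversation. Non-behavioral context (stack, performance targets)
    is deliberately kept outside the spec, as discussed next."""
    capabilities: list[str] = field(default_factory=list)      # what the component must do
    tests: dict[str, list[str]] = field(default_factory=dict)  # test scenarios per capability
    api: dict[str, str] = field(default_factory=dict)          # interface other components rely on
```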

Simon: Yeah. Yeah. Now I think the interesting thing that we see as a potential future is whereby that spec, tests can be generated, code can be generated. But the important thing is there is a level of context that can then be added, which is outside of the spec, because the spec describes the behavior, but it doesn't actually describe the pure implementation as much.

Simon: Whereas the context can say, these are the things that I care about. It could be language, it could be performance requirements, it could be stack, it could be availability.

Baruch: Accessibility, whatever. 

Simon: A ton of things like that. 

Baruch: And all those pieces are exactly what we miss from the intent integrity chain when it's based on Cucumber, because when it's based on Cucumber, it's behavior only. Yes.

Simon: Yeah. Yeah. And I think that then allows you to have one spec with multiple versions of this implementation that can kind of like come out. Exactly. 

Baruch: That, and it's all compilers down there. Yeah. Beautiful. 

Simon: Also, let's say there's a security issue. Yeah. Or a bug. Does that actually require a spec change? Sometimes that's just in the implementation, and we won't find it until later. Perhaps a test is missing or something like that.

Simon: Or a new vulnerability, like a third-party vulnerability has been found. Is that a change to your implementation? Or is that a change to your spec? Nine times out of 10, I feel like it's not a spec change. Very often it's, okay, there's a bug in the way this was implemented. Yeah. We need to make sure we're using this version or this or that kind of thing.

Baruch: With the original intent integrity chain, this is still a problem. Yeah. Because we don't read the code. Yeah. We don't read the tests. Yeah. We definitely don't read the, whatever, build scripts that dictate the versions. So yes, theoretically, we can write something like: when this is used, use Spring Boot 3.5 and not 3.4. But this is very awkward, because it's completely foreign to the idea of behavior-driven development. It doesn't describe a behavior.

Simon: But on the other side, if I have multiple of these being created, and stacks being created, I don't want everything to use completely different stacks, or to just choose various languages and things like that. At some point, I'm going to want to say we remain consistent, because it reduces my attack surface, of course, and it reduces the effort I need to then stand all of that up. So I might say, actually, for consistency reasons, this is the stack I choose, these are the languages I choose, etc.

Simon: As a result, I want consistency between my components, between my applications. So, and yeah, that isn't necessarily something you want to add in at the spec. It's actually the implementation detail. 

Simon: Because if I want behavioral changes, I go to my spec; if I then want deployment changes or implementation changes, I need to change the context with which this is generated.

Baruch: And this is, again, a missing piece in what I've described. But if we take Tessl instead for this spec-to-code part, then we can add those non-behavioral concerns in the native way that you envisioned. And then it makes it even more usable. And you don't need to abuse the spec for stuff it was never really intended for.

Simon: Yeah, you almost pollute it a little bit, right? Yeah, exactly. You have that distinction. One of the things that I do love in this model is this loop. And this loop goes, we talk about that kind of like the SDLC and then all the way into production, there are so many places at which you can loop, whether that's quality tests, whether that's performance tests, all the way through to observability data in production and pulling that useful information back that we can actually pass into the generation. And the value is, with that validation, verification all the way through to say, does it still adhere to the spec? Does it still adhere to the things that are cared about by the stakeholders, by the people who are actually describing it?

Simon: So as long as that is well described and the things you care about are documented there, you can iterate through. And you can just let it go wild. Exactly. Like, you know, get those monkeys to go in, let's loop, loop, loop until the specification, which at its core we always constantly check against, is satisfied. But also, you know, these tweaks and changes are made to satisfy the non-functional requirements, to satisfy the business needs, etc. I think it's a super powerful space.

Baruch: And here you have a win-win, right? The intent integrity chain, but implemented with something much more powerful than just BDD, covers those areas, the kind of uncomfortable questions that I had about this model. Like, what do you do with security? What do you do with implementation details that you care about, when you still don't want to read code, and this kind of stuff. And it's a beautiful win-win.

Simon: Yeah, well, the beautiful win-win, that's where we're leaving it, man.

Baruch: That's what we do. 

Simon: It's been a pleasure. 

Baruch: Thank you so much.

Simon: And I'm looking forward to the rest of AI-Fokus. So yeah, thanks very much. 

Simon: Thank you very much for listening and be sure to tune into the next episode. 

Baruch: Yep. Thank you. Bye-bye. 

Subscribe to our podcasts here

Welcome to the AI Native Dev Podcast, hosted by Guy Podjarny and Simon Maple. If you're a developer or dev leader, join us as we explore and help shape the future of software development in the AI era.
