Just over a week ago, in a community space at Tessl's King's Cross office, a group of developers (what’s the collective noun – a “package”?) gathered to discuss a hot topic in the future of AI-assisted development: writing specifications. We’re not referring to the heavyweight requirements documents of the waterfall era, but a new breed of markdown-based specs designed to manage an increasingly frustrating reality: AI agents that code at lightning speed but often lack the memory and judgment needed to produce high-quality, production-grade code.
It was against this backdrop that I moderated a panel consisting of Guy Podjarny (Guypo), founder and CEO of Tessl; Alan Pope (Popey), Director of DevRel at Anchore at the time; and Don Syme, Principal Researcher at GitHub.
Our conversation surfaced a number of themes, the first being the gap between intent and implementation in software development.
Intent vs. implementation: what gets lost
Guy highlighted an important distinction between intent and implementation: "Code is a representation of an intent – it's a choice,” he said. “You ask for, say, an e-commerce website, and you get an implementation of that. You get one of multiple viable decisions. The problem is that once code exists, the ‘why’ behind those decisions evaporates.”
This has always been true in software development, where code embodies intent entangled with implementation decisions. The intent often turns into institutional knowledge, manifesting either in documentation (which can become stale) or in the heads of developers.
Popey shared a relatable scenario: "I come back to code months, or maybe years later, and think, ‘oh, I need to make some adjustments to that’. And, ‘I have completely forgotten about all the choices that I made at the point when I wrote that code’."
His solution? Feeding old code into Tessl to generate a specification that both documents what was built and suggests how to rebuild it better. This resonates strongly with me, although I’d replace ‘months and years’ with ‘hours and days’!
The validation gap
Perhaps the most telling pain point came from an audience member's observation about the gap between instruction and execution. You write something down, the LLM reads it, and it produces something completely different. What's missing isn't just accuracy; it's a systematic way to validate that what was created matches what was intended.
Guy described this as the difference between "inspirational specs" and "canonical specs." You start with your intent, your request, your clarification. But that's still inspirational until you introduce validation layers. “You need to introduce levels of validation to be able to say, okay, how do I know that what was created on the other side matches?” he said.
The panel also discussed what they called "stable regeneration" – an agent’s ability to produce consistent results. If it generates something once, will it generate it the same way again? One audience member noted the frustration of seeing agents apply the same fix multiple times, each in a different way within a single build.
The compiler revolution revisited
Don's historical perspective provides crucial context. He recalled sitting next to subroutine inventor Maurice Wilkes at Cambridge. “I think back to the day — what was programming like once upon a time, when they had the program, that is, the machine code? Then these things called compilers suddenly turned up. People were like, ‘We can’t trust that. This is the only thing that’s trustable. The machine code is the reality.’”
We're experiencing a similar transition right now with AI. The spectrum between intent and reality, between ‘I want to build something’ and ‘it's running on hardware,’ has always existed. What's changing is where on that spectrum it's viable to work. For decades, code was the only place. Now we're discovering that working at higher levels of abstraction, with more ambiguity and less determinism, can deliver value in ways that were previously impossible.
What makes a good specification?
We started the discussion around scope and lifecycle: “How long should a spec live?” This question split the panel, or rather, revealed that ‘it depends’ is the only honest answer in this exploratory phase.
Alan described his evolution: "I initially thought [in terms of] one enormous text document, which was just super unwieldy. And [now] I've embraced the atomic unit of work as a spec."
He breaks applications into micro-specs. One for the frontend, multiple for different backend components, each representing a success checkpoint. "If I could throw away all the source code, and then tell Tessl to regenerate the whole thing from scratch, and it does, and it works – which it can do – then that's double success."
Guy distinguished between different use cases: "When you look online at AWS Kiro's work, for example, they often refer to a spec for an implementation plan. That spec is really the spec of the change, so it’s about moving from state A to state B, and then that spec kind of goes away."
In contrast, specs for long-term guidance – the product intent, the architectural decisions – should be durable artifacts.
Don emphasized that while he's "totally okay with 'it depends' answers," one thing is clear: "There's not a single coding scenario today where high-level declarative guidance, guardrails, boundaries, and success conditions – checked into your repo to define what it’s for – don’t help. It always gives AI agents the guardrails they need.”
Even if you eventually return to a ‘code as king’ mode, those specs become valuable context files that dramatically improve agent performance.
The 3-level (spec)trum: spec-assisted to spec-centric
Guy proposed a useful framework for thinking about depth-of-spec commitment:
Spec-assisted development is the baseline. This is your Claude.md file, your Cursor rules, your usage specs from a registry. "It's just help – information available for the agent to do its job," Guy explained.
This might be product intent, API definitions, or company policies. It's guidance, not gospel.
Spec-driven development represents deeper commitment. "You've defined a portion of the functionality, that is, you're committed to keeping it well captured, and when you make a change... the first thing you do is you modify the spec. You apply that change, and then you apply the change to the code."
Spec-centric development is a possible future state: "Code becomes disposable. The spec contains enough detail about everything that matters to you, all the things that you need. There are enough tests referenced from the spec and accompanying it that you can have confidence and trust in the outcome. And then you can just throw [the code] away and recreate it in another language."
The panel agreed we're not at spec-centric yet, but spec-assisted is table stakes, and spec-driven is increasingly viable for many use cases.
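To make the baseline level concrete, here is a minimal sketch of the kind of guidance file a spec-assisted setup might check into a repo. The filename, sections, and every detail are illustrative, not any tool's required format:

```markdown
<!-- GUIDANCE.md: hypothetical spec-assisted context file -->

## Product intent
A small-merchant e-commerce storefront. Checkout must complete
in three steps or fewer.

## Architecture
- REST API over PostgreSQL; server-rendered frontend

## Policies
- Every endpoint except /health requires authentication
- Never log payment details
```

Nothing in such a file is binding on the agent; in Guy's terms it is guidance, not gospel, but it supplies context the agent would otherwise have to guess.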
What to specify, what to leave flexible
One of the most nuanced discussions centered on what should go in a spec. Guy articulated a key principle: "There are these implementation details. We tag them with .impl to indicate that they’re lower-importance bits. You should try to preserve them for stability, but they don't trump the spec."
Example: the button can be red or green; the spec doesn't care initially. But once chosen, it should stay that way for stability. "I don't want to change the color every version,” Guy said.
However, if a later spec update explicitly says the button should be green, that trumps the previous implementation detail.
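As a sketch of how such a tag might sit alongside the spec (the exact syntax here is invented for illustration, not Tessl's documented format):

```markdown
## Checkout page

The page shows a single "Place order" button.

.impl The button is green. An arbitrary first-generation choice:
keep it stable across regenerations, but an explicit spec change
("make the button red") takes precedence over this detail.
```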
Guy also noted: "Sometimes using, whatever, a vector or an array, they're interchangeable. Doesn't matter that over here it uses a vector, and over there it uses an array. And sometimes it does matter, because maybe they're optimized for different data set sizes." Getting the right level of granularity is key. Too granular and you might as well write code. Too vague and you lose the guardrails.
Alan took a more minimalist approach: "I try to keep the spec as detail-lite as possible in terms of technicality. Just enough to try and coerce the agent to stay on track."
He puts in framework choices and API endpoints, but avoids over-specification. "I generally don't look at the spec. The spec is an intermediate step between how my brain creates text through conversations... and it's just raw text, it's just like a stream of consciousness."
The determinism vs. adaptability trade-off
Don framed the core tension well: "There's a straight-up trade-off between determinism and adaptability. The more deterministic you want the code to act, the more rigid it is, the less adaptable it is."
He used Docker's FROM something:latest as an example. Every build gets something different, which is frustrating but also means the system stays patched and current. "Generally, it annoys people, it's not deterministic, and certain people are not willing to tolerate that, and they switch [to pinning versions]. But then they get rigid systems that are oftentimes vulnerable, or they can't progress."
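Don's example maps directly onto a Dockerfile. Both base tags below are real public images, but the two FROM lines are alternatives, shown together only to contrast the trade-off:

```dockerfile
# Adaptable: each build pulls whatever "latest" points to today.
# Non-deterministic, but the base image stays patched and current.
FROM ubuntu:latest

# Deterministic(ish): pinned to a fixed release, so builds are
# reproducible but never pick up fixes until someone bumps the pin.
# (Even this tag can move; pinning a sha256 digest is the fully
# deterministic extreme.)
FROM ubuntu:24.04
```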
One audience member asked about idempotent specs – a question that revealed just how strong our attachment to determinism still is. But as Don noted: "If you want to be 100% idempotent, this isn't the game to play. But there are loads of cases in the industry where that doesn't matter, and you can deliver huge amounts of value without having that, and where the ambiguity is genuinely useful."
The industry will bifurcate. Some teams will embrace looser specs and gain adaptability. Others, working on kernels, safety-critical systems, and domains with highly deterministic requirements, will stick with traditional approaches. "For half of you in your current job, it's not going to be the right thing," Don warned. "For the other half, it's going to be the best thing ever."
The key is knowing which half you're in.
Team collaboration: the standardization question
Don acknowledged the challenge: "It's a great question about how teams actually collaborate on that. Sometimes it requires multiple perspectives on how you're going to spec out." Guy also tackled this head-on: "Just imagine a team in which everybody's programming in a different language. Sure, they all compile the same, but I don't think that's a happy reality. I think you want to allow people to pick up and continue from where someone else has left off."
The solution isn't global standardization, but team-level conventions. Guy compared it to legal contracts: "A great lawyer can read any contract and understand it, but there are [still] norms in the ecosystem. [For example], in a commercial contract you expect to find a liability section — so you pick it up and jump straight to it.”
Similarly, software specs need conventions. You expect an API definition in a certain section, expressed in one of a small number of ways. You expect versioning elements. "Right now, it’s the Wild West, legitimately, because it's all being formed. But I think to be able to collaborate, you need some element of standards."
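What might those conventions look like? Here is one hypothetical team-level skeleton, in the spirit of Guy's contract analogy rather than any emerging standard:

```markdown
# Spec: order-service
version: 0.3

## Intent
Why this component exists, in a sentence or two.

## API
Endpoints and contracts, always in this section, in one agreed style.

## Constraints
Policies, performance bounds, and other guardrails.

## Tests
Pointers to the canonical tests that validate this spec.
```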
The unresolved questions remain: What happens when you modify a spec in a pull request, and a colleague suggests changes? Do you regenerate code immediately? How do you think about the code as a build artifact versus a source of truth? How does versioning work when software is adaptable and can be created in Rust just as easily as Python?
Tests, trust, and the future of development
Another audience member asked a crucial question: In iterative development, we rely on regression tests. In a spec-driven world, how do we ensure tests stay stable, are added to, and continue to express system intent?
Guy explained Tessl's approach: "You start in prototyping mode, so you don't have tests yet. You generate a spec, you capture your intent, build it, and check the product. If it looks good — then you generate tests.”
Those tests then become locked in as the source of truth. “When you make another change, the new code won’t have tests yet. But you build it, the regression tests run, and if they fail, the agent regenerates the code until all the tests pass.”
Critically: "It's only a human, the user, that should be able to say, ‘yeah, this regression test is a failure’."
Later, another audience member asked about generating tests from specs, versus generating code from tests. Guy confirmed: "Specs are definitely a source for test creation, and then the tests become part of the spec. They turn an inspirational spec into a canonical spec."
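A sketch of what that might look like inside a spec; the test file names are invented for illustration:

```markdown
## Cart totals

Adding an item updates the cart total, including tax.

### Tests (canonical)
- tests/cart_totals.test.ts: adding one item updates the total
- tests/cart_tax.test.ts: tax is applied at the configured rate

<!-- Once generated and human-approved, these tests are locked in:
     regenerated code must pass them, and only a human decides
     whether a failing regression test should itself change. -->
```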
Levels of agent autonomy
Guy referenced a recent piece he wrote for AI Native Dev, “The 5 levels of AI agent autonomy: learning from self-driving cars”, which adapts the familiar self-driving framework (levels 0–5) to describe stages of agent capability.
Today, Guy argues, we're still at the lower levels of AI agent autonomy, where regressions should fail the build and be surfaced to humans.
"I imagine a future, as we get into level 4, level 5 autonomy levels, [where] there's enough information at agents’ disposal, for instance, about how your product is used, what is the business context, that they would be allowed to break regressions. It's like, they know how the app is used. I think this is worth it. I don't think we're there yet."
The transition challenge
Building on Don’s point about the industry bifurcating, the real challenge is knowing which problems welcome adaptability and which demand exactness. The industry is already starting to sort this out: some areas benefit from looser specifications and greater flexibility, while others still require absolute precision. "You do not want the Linux kernel done just in spec-driven mode," Don emphasized. "You want absolute precision, and all of this is built on a sea and ocean of exact programming."
But for integration work, for business applications, for the countless systems where button colors and implementation details don't matter as long as business value gets delivered? Spec-driven development offers a compelling path forward.
The pain points are real: agent amnesia, implementation inconsistency, validation gaps. These are the problems driving adoption; the solutions are still emerging.
The answer, it turns out, might be as old as Maurice Wilkes' subroutine: create the right abstraction, trust the tools beneath it, and work at the level that delivers the most value.