
Jennifer Sand
While most people idly ponder whether the glass is half full or half empty, Jennifer Sand goes right to the source, asking: where is the waiter, what is the problem, and how do I get the waiter's attention and solve it as efficiently as possible? That same enterprising attitude motivated Jennifer to leave West Virginia to study at Wellesley College in Massachusetts. She has spent decades honing her tech skills at everything from Series A startups to public companies. At Everbridge, she drove growth from $5M to $75M ARR. At CloudLock, she helped build the solution through the company's $293M acquisition by Cisco. As Co-Founder and CEO of Codential, Jennifer is solving a problem she's witnessed throughout her career: even the best teams spend significant time chasing preventable quality issues.
QA [quality assurance] has always been kind of broken.
Waterfall: QA was the gate at the end of the process. Developers threw code over the wall, QA pushed back, and this handoff often resulted in unanticipated delays.
Agile: QA moved into sprints, engineers took more ownership of quality, and automation expanded, but testing still lagged the pace of development and the appetite of the business.
Quality in depth: We added layers: unit, integration, regression, end-to-end, and load testing, alongside static analysis and (still) manual QA. Each layer adds coverage, but maintaining tests can consume as much effort as writing code.
Production: Some issues only appear under real-world conditions, and it’s impossible to test every scenario. Canary releases, feature flags, controlled rollouts, and observability tools are the norm. Certain problems still escape: race conditions, deadlocks, or failures that occur only when your code meets infrastructure under some unpredictable scenario.
Organizations today still treat quality as a threshold traded off against dev velocity, accepting that some bugs will always reach production.

Now that most teams use AI agents to create code, the generated code might pass every test yet still introduce subtle issues that are more difficult to identify and diagnose. A new approach to software development requires a new approach to software quality.
As code creation accelerates, we’re hearing from engineers about the bottlenecks that they’re encountering downstream: more code means more tests to create, and the old way of writing tests can’t keep up. So we throw AI at that problem.
With more code to ship, there’s more code to review, so we throw AI at that problem too, and introduce a new tradeoff: velocity versus code familiarity.
These AI dev tools are hugely valuable, but we continue to apply the same software quality approaches to a new mode of development. Perhaps the entire SDLC needs to be revisited and redesigned.
AI agents produce code that looks correct very quickly, but they work from what you tell them, not what you mean. They don’t take into account the same constraints (and battle scars) that human engineers learn (and earn) over time.
Additionally, they don’t consider the system that the generated code is integrating into. They might produce fantastic component-level code, but once that code is operating in the context of a larger system, obscure system-level issues (the kind that wake on-call engineers at 3am) surface well after it has passed every test and been deployed to production.
A few key areas that are particularly challenging to test intersect with the kinds of issues that AI-generated code often creates:
Too hard to test: Concurrency bugs that appear only sometimes. The team calls the test flaky, reruns CI, and moves on, until the failure hits production (a minimal sketch of this kind of bug appears below).
Too expensive to test: Code that performs perfectly in a small test environment but collapses when real data arrives, pulling four times the data it needs and consuming far too much memory. Yes, you can try to catch these with load testing, but load testing focuses on behavior under stress rather than on logical oversights baked into the code itself, which emerge over time through complex, system-level interactions.
Too complex to test: Systems with many moving parts and overlapping states. The number of possible combinations is so high that no team can test them all. We are now working with a design partner whose software lets users build highly complex, custom workflows; they cannot predict (and therefore cannot test) every execution path users might traverse through the software.
These are not ordinary testing failures. They are verification failures, and they happen because the system’s deeper properties are never defined clearly enough for an agent (or even a human) to verify.
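To make the first category concrete, here is a minimal, hypothetical sketch (plain Python, invented for this post) of the kind of bug that slips through: a counter whose single-threaded unit tests pass, but which silently loses updates when several threads hit it at once, and only on some runs.

```python
import threading

class HitCounter:
    """Tracks page hits. Reads correctly, and a single-threaded unit test passes."""

    def __init__(self):
        self.count = 0

    def record_hit(self):
        # Read-modify-write without a lock: two threads can read the same
        # value and both write back value + 1, silently losing a hit.
        current = self.count
        self.count = current + 1


def hammer(counter, hits):
    for _ in range(hits):
        counter.record_hit()


if __name__ == "__main__":
    counter = HitCounter()
    threads = [threading.Thread(target=hammer, args=(counter, 100_000)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Expected 800,000. Some runs print exactly that; others come up short.
    # That intermittent gap is the "flaky" result a team reruns until it bites in production.
    print(counter.count)
```

No reasonable number of CI reruns proves this code correct; it only tells you how lucky the scheduler was that day.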
Since not everything can be tested, some of these issues can be prevented by applying verification techniques that assert what must always be true. In other words, what are the invariants that AI agents need to be aware of in order to avoid producing these subtle issues in the first place?
Code can pass tests perfectly but still be incorrect: by implementing a weaker architecture than the requirements call for, by not adhering to non-functional requirements (some of which are often unknown at the point of code creation), or by introducing seemingly minor logic issues that only surface in the most inconvenient and unexpected scenarios.
To avoid some of these issues in AI-generated code, invariants can be defined at various levels of scope to drive adherence to core principles that matter for your business.
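As a small, hypothetical illustration (the `Inventory` class below is invented for this post, not taken from any particular framework or product), a component-level invariant can be written down once and checked after every state change, so a violation fails loudly at the point of the bug rather than surfacing downstream:

```python
class Inventory:
    """Toy inventory with an explicit invariant: 0 <= reserved <= on_hand."""

    def __init__(self, on_hand: int):
        self.on_hand = on_hand
        self.reserved = 0
        self._check_invariant()

    def _check_invariant(self):
        # The property that must always hold, no matter which code path
        # (human- or AI-written) mutated the state.
        assert 0 <= self.reserved <= self.on_hand, (
            f"Invariant violated: reserved={self.reserved}, on_hand={self.on_hand}"
        )

    def reserve(self, qty: int):
        self.reserved += qty
        self._check_invariant()  # fail fast at the point of the bug

    def release(self, qty: int):
        self.reserved -= qty
        self._check_invariant()


# Inventory(on_hand=5).reserve(7) raises immediately with a clear message,
# instead of an oversold order surfacing somewhere downstream in production.
```

The same idea scales with scope: component-level properties like this one, service-level guarantees (for example, retries must be idempotent), and business-level rules (an account balance never goes negative) can all be stated up front and handed to an AI agent as constraints to verify against, rather than discovered in production.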
I will be talking about this topic at AI Native DevCon in New York on November 19, in a session titled Beyond Testing: What to Verify in AI-Generated Code.
I’ll be sharing an invariant-driven framework that helps AI agents understand what “production-ready” actually means. We’ll talk about what to verify in AI-generated code before testing even begins, provide a taxonomy for how to think about leveraging invariants in your AI-Native Development workflow, and demonstrate how to implement it with your existing tools.
Register at ainativedev.io/devcon.
Jennifer Sand is CEO & Co-Founder of Codential. She was previously VP of Product at Everbridge and VP of Product at CloudLock (acquired by Cisco for $293M). Reach me at jennifer@codential.ai.