
Code Centric Eval First Development: Accelerating AI Features for Devs

13 May 2025 with Dru Knox

Also available on

YouTube

Dru Knox

Head of AI, Product, Tessl

LinkedIn
X
Medium
Webinar
Table of Contents
Understanding Evaluation Phases in AI Development
The Initial "Vibe-Based" Phase
The "Quality-Obsessed" Extremes
Striking a Balance: Numbers and Pragmatism
Product-Led Evaluations
Designing with Evaluation in Mind

In this talk

In the world of AI-native development, evals aren’t just quality gates; they’re powerful design tools. Especially for codegen products, targeted evals can drive clarity, speed, and alignment across teams. This session dives into how code-centric evals can be used to prototype developer-facing AI features, bring cross-functional (XFN) stakeholders into the loop early, and set measurable goals for launch quality. Learn how to evolve your product rapidly with user feedback loops grounded in meaningful code evaluations, treating evals like the “Figma mock” for AI coding tools.

Understanding Evaluation Phases in AI Development

Dru Knox begins by challenging the conventional wisdom that evaluations inherently slow down development. Instead, he positions evals as pivotal for enabling rapid, collaborative iteration. Knox elaborates on Tessl’s journey through different evaluation "stages" that reflect common experiences among teams, especially when dealing with code-centric AI features.

The Initial "Vibe-Based" Phase

In the early stages of AI development, teams often rely on a "vibe-based" approach, where outputs are manually inspected for correctness. Knox warns that while this stage is expedient for early prototyping, it lacks consistency and scalability. He states, “When you look at a specific example... it’s hard to know where to target your fix if you don’t have a broader, more representative set.” This phase is fraught with challenges as the number of team members and code implementations grows.

The "Quality-Obsessed" Extremes

Following the initial phase, teams may swing to the opposite extreme, becoming "quality-obsessed." Here, massive annotated datasets and rigorous metrics are used in an attempt to quantify "correctness." Knox cautions that this can lead to paralysis: “You just never ship anything, because you can never finish the eval that you’re trying to build.” In code contexts, the subjective nature of code style and comments often complicates achieving a singular, definitive metric.

Striking a Balance: Numbers and Pragmatism

The solution, according to Knox, lies in a middle ground. He emphasizes the importance of embracing imperfect, yet representative input sets and "directionally correct" metrics. Knox advises designing focused checks to highlight areas for improvement rather than attempting to control every possible output. As he succinctly puts it: “You just need signal.” This approach facilitates faster iterations and better integration of synthetic data, with a focus on statistical outliers.
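To make "directionally correct" checks concrete, here is a minimal sketch (not taken from the talk) of what a signal-oriented eval over generated Python snippets might look like. The sample outputs, check names, and per-check pass-rate reporting are illustrative assumptions, not Tessl's actual harness.

```python
import ast

# Hypothetical outputs from a codegen pipeline; real samples would come
# from running the model over a representative set of prompts.
generated_samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b) return a + b",  # deliberately broken sample
]

def parses(code: str) -> bool:
    """Cheap, directional check: does the output even parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def mentions_expected_symbol(code: str) -> bool:
    """Rough proxy: did the model produce the function the prompt asked for?"""
    return "add" in code

# Report pass rates per check rather than chasing a single "correctness" score;
# the goal is signal about where to target fixes, not a definitive metric.
for check in (parses, mentions_expected_symbol):
    passed = sum(check(sample) for sample in generated_samples)
    print(f"{check.__name__}: {passed}/{len(generated_samples)} passed")
```

Checks this crude would never certify quality, but they are enough to show whether a change moved the pass rate in the right direction, which is exactly the kind of signal Knox describes.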

Product-Led Evaluations

Tessl’s current methodology, which Knox terms “product-led evals,” involves teams collaboratively defining precise input distributions and creating an operation taxonomy. This sharpens focus on what matters most, aligns evals with product priorities, and distinguishes “P0” must-succeed cases from lower-priority ones. Knox stresses, “Everyone on the team should generate some outputs by hand,” underscoring the value of deep product understanding.
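As an illustration of how such a taxonomy and priority split might be encoded, here is a small sketch; the field names, operation buckets, and stubbed pass check are assumptions for the example, not Tessl's schema.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str     # the input a user would realistically give
    operation: str  # taxonomy bucket, e.g. "add-test", "refactor", "fix-bug"
    priority: str   # "P0" = must succeed at launch, "P1" = lower priority

# Hypothetical cases; in practice each team member contributes some by hand.
cases = [
    EvalCase("Add a unit test for parse_config", "add-test", "P0"),
    EvalCase("Rename variable x to user_count everywhere", "refactor", "P0"),
    EvalCase("Convert this loop to a list comprehension", "refactor", "P1"),
]

def passed(case: EvalCase) -> bool:
    """Stub: run the model on case.prompt and apply whatever checks you trust."""
    return True

# Track pass rates per priority so a P1 failure never masks a P0 regression.
for prio in ("P0", "P1"):
    bucket = [c for c in cases if c.priority == prio]
    rate = sum(passed(c) for c in bucket) / len(bucket)
    print(f"{prio}: {rate:.0%} of {len(bucket)} cases passing")
```

Splitting the report by priority keeps the launch-gating question ("are all P0 cases green?") separate from longer-tail quality work.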

Designing with Evaluation in Mind

Knox advocates for designing products around existing model limitations. This includes transparently surfacing errors, displaying alternative outputs, and pausing workflows to manage known weak spots. Such strategies enable earlier launches and garner essential user feedback, even if the user experience is not entirely polished. He concludes by recommending a balanced approach: leverage synthetic data, adopt simple metrics, and engage in human-in-the-loop processes to ensure reliability.
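One way to picture designing around known weak spots is a simple gate that applies a generated change only when checks pass and otherwise pauses the workflow for human review. The function name and the signals it consumes below are hypothetical; they stand in for whatever confidence or test signals a real product exposes.

```python
def handle_generation(code: str, tests_passed: bool, known_weak_spot: bool) -> dict:
    """Apply the change when checks pass; otherwise surface it for review."""
    if tests_passed and not known_weak_spot:
        return {"action": "apply", "code": code}
    # Surface the limitation instead of silently applying a risky edit:
    # pause the workflow and hand the output back to the user.
    return {
        "action": "review",
        "code": code,
        "note": "Change failed checks or touches a known weak spot; "
                "please review before applying.",
    }

result = handle_generation(
    "def add(a, b):\n    return a + b\n",
    tests_passed=True,
    known_weak_spot=True,
)
print(result["action"])  # -> "review"
```

Gating like this is what allows the earlier, less polished launches Knox describes: the product admits its weak spots up front and collects user feedback on them instead of hiding them.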

About The Speaker

Dru Knox

Head of AI, Product, Tessl

Head of AI Product at Tessl, building LLM-powered developer tools, with prior experience at Google and Airtable.

Related Events

From Vibe Coding to AI Native Dev as a Craft

13 May 2025

with Guy Podjarny

AI-assisted Programming: From Inline Completions to Agentic Workflow

13 May 2025

with Anton Arhipov

From Code Completion to Multi-Agent Coding Workflow

13 May 2025

with Itamar Friedman
