Webinar

Code-Centric, Eval-First Development: Accelerating AI Features for Devs

With

Dru Knox

13 May 2025

In the world of AI-native development, evals aren’t just quality gates; they’re powerful design tools. Especially for codegen products, targeted evals can drive clarity, speed, and alignment across teams. This session dives into how code-centric evals can be used to prototype developer-facing AI features, bring cross-functional (XFN) stakeholders into the loop early, and set measurable goals for launch quality. Learn how to evolve your product rapidly with user feedback loops grounded in meaningful code evaluations, treating evals like the “Figma mock” for AI coding tools.

Understanding Evaluation Phases in AI Development

Dru Knox begins by challenging the conventional wisdom that evaluations inherently slow down development. Instead, he positions evals as pivotal for enabling rapid, collaborative iteration. Knox walks through Tessl’s journey across distinct evaluation “stages” that reflect common experiences among teams, especially those building code-centric AI features.

The Initial "Vibe-Based" Phase

In the early stages of AI development, teams often rely on a “vibe-based” approach, where outputs are manually inspected for correctness. Knox warns that while this stage is expedient for early prototyping, it lacks consistency and scalability. He states, “When you look at a specific example... it’s hard to know where to target your fix if you don’t have a broader, more representative set.” The approach breaks down as the team grows and the variety of generated code expands.
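To make that failure mode concrete, here is a minimal sketch of what vibe-based checking tends to look like in practice. The `generate_code` function is a stand-in for whatever codegen model you call, not a real API, and the prompts are invented for illustration:

```python
import random

def generate_code(prompt: str) -> str:
    """Placeholder: swap in your real model call here."""
    return f"# model output for: {prompt}\n"

prompts = [
    "Write a function that reverses a string.",
    "Add a retry decorator with exponential backoff.",
    "Parse a CSV file into a list of dicts.",
]

# Eyeball a random sample -- workable on day one, but it yields no
# aggregate signal, and two reviewers may disagree on what "looks right".
for prompt in random.sample(prompts, k=2):
    print("PROMPT:", prompt)
    print(generate_code(prompt))
```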

The "Quality-Obsessed" Extremes

Following the initial phase, teams may swing to the opposite extreme, becoming "quality-obsessed." Here, massive annotated datasets and rigorous metrics are used in an attempt to quantify "correctness." Knox cautions that this can lead to paralysis: “You just never ship anything, because you can never finish the eval that you’re trying to build.” In code contexts, the subjective nature of code style and comments often complicates achieving a singular, definitive metric.
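A toy illustration of why a singular, definitive metric is elusive for code (this snippet is ours, not from the talk): a strict exact-match metric against an annotated gold answer rejects a semantically identical solution that merely differs in naming and comments.

```python
gold = "def add(a, b):\n    return a + b\n"
model_output = "def add(x, y):  # sum two values\n    return x + y\n"

# A strict exact-match metric calls this a failure...
print("exact match:", gold == model_output)  # False

# ...even though both behave identically.
def behaves_like_gold(src: str) -> bool:
    ns: dict = {}
    exec(src, ns)  # fine for a toy example; sandbox real model output
    return ns["add"](2, 3) == 5

print(behaves_like_gold(gold), behaves_like_gold(model_output))  # True True
```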

Striking a Balance: Numbers and Pragmatism

The solution, according to Knox, lies in a middle ground. He emphasizes embracing imperfect yet representative input sets and “directionally correct” metrics. Knox advises designing focused checks that highlight areas for improvement rather than attempting to control every possible output. He puts it succinctly: “You just need signal.” This approach enables faster iterations and better integration of synthetic data, with a focus on statistical outliers.
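One way to read “you just need signal” in code terms is a handful of cheap boolean checks, none of which proves correctness on its own. The sketch below assumes the model emits Python and uses a hypothetical `slugify` task:

```python
import ast

def parses(src: str) -> bool:
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

def defines(src: str, name: str) -> bool:
    return any(isinstance(node, ast.FunctionDef) and node.name == name
               for node in ast.walk(ast.parse(src)))

def smoke_test(src: str) -> bool:
    ns: dict = {}
    try:
        exec(src, ns)  # sandbox this in a real harness
        return ns["slugify"]("Hello World") == "hello-world"
    except Exception:
        return False

candidates = [
    'def slugify(s):\n    return "-".join(s.lower().split())\n',
    'def slugify(s)\n    return s\n',  # missing colon: fails to parse
]

for src in candidates:
    checks = {"parses": parses(src)}
    if checks["parses"]:
        checks["defines_slugify"] = defines(src, "slugify")
        checks["smoke_test"] = smoke_test(src)
    print(checks)
```

A pass rate over checks like these is “directionally correct”: it cannot certify style or intent, but it tells you quickly whether a prompt or model change moved things the right way.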

Product-Led Evaluations

Tessl’s current methodology, which Knox terms “product-led evals,” involves teams collaboratively defining precise input distributions and building a taxonomy of operations. This sharpens focus on what matters most, aligns evals with product priorities, and distinguishes “P0” must-succeed cases from lower-priority ones. Knox stresses, “Everyone on the team should generate some outputs by hand,” underscoring the value of deep product understanding.
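In practice that might look like tagging each eval case with an operation from the team’s taxonomy and a priority tier, then reporting pass rates per tier. The field names and the all-P0s-must-pass launch gate below are our assumptions for illustration, not Knox’s exact scheme:

```python
from collections import defaultdict

# Hypothetical eval cases: operation taxonomy + priority tier + outcome.
EVAL_SET = [
    {"op": "add_function",    "priority": "P0", "passed": True},
    {"op": "rename_symbol",   "priority": "P0", "passed": False},
    {"op": "write_docstring", "priority": "P2", "passed": True},
]

by_priority = defaultdict(list)
for case in EVAL_SET:
    by_priority[case["priority"]].append(case["passed"])

for priority, results in sorted(by_priority.items()):
    rate = sum(results) / len(results)
    print(f"{priority}: {rate:.0%} pass ({len(results)} cases)")

# Assumed launch policy: every P0 case must succeed before shipping.
print("ready to ship:", all(by_priority["P0"]))
```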

Designing with Evaluation in Mind

Knox advocates for designing products around existing model limitations. This includes transparently surfacing errors, displaying alternative outputs, and pausing workflows to manage known weak spots. Such strategies enable earlier launches and garner essential user feedback, even if the user experience is not entirely polished. He concludes by recommending a balanced approach: leverage synthetic data, adopt simple metrics, and engage in human-in-the-loop processes to ensure reliability.
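As a sketch of one such strategy (ours, not a description of Tessl’s product): sample several candidates, score them with cheap checks, and pause for human review when the checks cannot separate the top options. `generate_candidates` and `check_score` are assumed placeholders:

```python
import ast

def check_score(src: str) -> int:
    """One point for parsing; a real harness would add more checks."""
    try:
        ast.parse(src)
        return 1
    except SyntaxError:
        return 0

def generate_candidates(prompt: str, n: int = 3) -> list[str]:
    """Placeholder for n samples from your codegen model."""
    return ["def f():\n    return 1\n", "def f():\n    return 2\n", "def f(:\n"]

def propose(prompt: str) -> dict:
    ranked = sorted(generate_candidates(prompt), key=check_score, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if check_score(best) == check_score(runner_up):
        # Known weak spot: the checks can't pick a winner. Surface the
        # alternatives and pause the workflow instead of guessing.
        return {"status": "needs_review", "options": [best, runner_up]}
    return {"status": "auto_accept", "code": best}

print(propose("write f")["status"])  # needs_review in this toy example
```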


About The Speaker

Dru Knox

Head of AI, Product, Tessl

Dru Knox is Head of AI, Product at Tessl, where he leads the development of AI-native tools purpose-built for developers. A seasoned product manager, Dru has spent his career working on deeply technical and developer-facing products at companies like Google and Airtable. Over the past several years, he’s focused on machine learning and generative AI at scale, driving product innovation at Grammarly, his own startup, and the AI-native social network Cantina.

Dru is passionate about making LLM-powered products that are fast, useful, and intuitive, especially in the codegen space. He brings a thoughtful, pragmatic approach to building tools that serve real developer needs, grounded in experience across both big tech and startups.

Outside of work, Dru’s interests include improv comedy, Dungeons & Dragons, and philosophy. His favorite podcasts are Cortex, The Adventure Zone, and Sharp Tech. Originally from Virginia, he’s currently based in London and still counts winter (and snow) as his favorite season.
