
Code Centric Eval First Development: Accelerating AI Features for Devs

13 May 2025 with Dru Knox

Also available on

YouTube

Dru Knox

Head of AI, Product, Tessl

LinkedIn
X
Medium
Webinar
Table of Contents
Understanding Evaluation Phases in AI Development
The Initial "Vibe-Based" Phase
The "Quality-Obsessed" Extremes
Striking a Balance: Numbers and Pragmatism
Product-Led Evaluations
Designing with Evaluation in Mind

In this talk

In the world of AI-native development, evals aren’t just quality gates; they’re powerful design tools. Especially for codegen products, targeted evals can drive clarity, speed, and alignment across teams. This session dives into how code-centric evals can be used to prototype developer-facing AI features, bring cross-functional (XFN) stakeholders into the loop early, and set measurable goals for launch quality. Learn how to evolve your product rapidly with user feedback loops grounded in meaningful code evaluations, treating evals like the “Figma mock” for AI coding tools.

Understanding Evaluation Phases in AI Development

Dru Knox begins by challenging the conventional wisdom that evaluations inherently slow down development. Instead, he positions evals as pivotal for enabling rapid, collaborative iteration. Knox elaborates on Tessl’s journey through different evaluation "stages" that reflect common experiences among teams, especially when dealing with code-centric AI features.

The Initial "Vibe-Based" Phase

In the early stages of AI development, teams often rely on a "vibe-based" approach, where outputs are manually inspected for correctness. Knox warns that while this stage is expedient for early prototyping, it lacks consistency and scalability. He states, “When you look at a specific example... it’s hard to know where to target your fix if you don’t have a broader, more representative set.” This phase is fraught with challenges as the number of team members and code implementations grows.

The "Quality-Obsessed" Extremes

Following the initial phase, teams may swing to the opposite extreme, becoming "quality-obsessed." Here, massive annotated datasets and rigorous metrics are used in an attempt to quantify "correctness." Knox cautions that this can lead to paralysis: “You just never ship anything, because you can never finish the eval that you’re trying to build.” In code contexts, the subjective nature of code style and comments often complicates achieving a singular, definitive metric.

Striking a Balance: Numbers and Pragmatism

The solution, according to Knox, lies in a middle ground. He emphasizes the importance of embracing imperfect, yet representative input sets and "directionally correct" metrics. Knox advises designing focused checks to highlight areas for improvement rather than attempting to control every possible output. As he succinctly puts it: “You just need signal.” This approach facilitates faster iterations and better integration of synthetic data, with a focus on statistical outliers.
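To make "directionally correct" checks concrete, here is a minimal sketch (not taken from the talk) of what a signal-oriented eval over generated Python snippets might look like. The sample outputs, check names, and per-check pass-rate reporting are illustrative assumptions, not Tessl's actual harness.

```python
import ast

# Hypothetical outputs from a codegen pipeline; real samples would come
# from running the model over a representative set of prompts.
generated_samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b) return a + b",  # deliberately broken sample
]

def parses(code: str) -> bool:
    """Cheap, directional check: does the output even parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def mentions_expected_symbol(code: str) -> bool:
    """Rough proxy: did the model produce the function the prompt asked for?"""
    return "add" in code

# Report pass rates per check rather than chasing a single "correctness" score;
# the goal is signal about where to target fixes, not a definitive metric.
for check in (parses, mentions_expected_symbol):
    passed = sum(check(sample) for sample in generated_samples)
    print(f"{check.__name__}: {passed}/{len(generated_samples)} passed")
```

Checks this crude would never certify quality, but they are enough to show whether a change moved the pass rate in the right direction, which is exactly the kind of signal Knox describes.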

Product-Led Evaluations

Tessl’s current methodology, which Knox terms “product-led evals,” involves teams collaboratively defining precise input distributions and creating an operation taxonomy. This sharpens focus on what matters most, aligns evals with product priorities, and distinguishes “P0” must-succeed cases from lower-priority ones. Knox stresses, “Everyone on the team should generate some outputs by hand,” underscoring the value of deep product understanding.
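As an illustration of how such a taxonomy and priority split might be encoded, here is a small sketch; the field names, operation buckets, and stubbed pass check are assumptions for the example, not Tessl's schema.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str     # the input a user would realistically give
    operation: str  # taxonomy bucket, e.g. "add-test", "refactor", "fix-bug"
    priority: str   # "P0" = must succeed at launch, "P1" = lower priority

# Hypothetical cases; in practice each team member contributes some by hand.
cases = [
    EvalCase("Add a unit test for parse_config", "add-test", "P0"),
    EvalCase("Rename variable x to user_count everywhere", "refactor", "P0"),
    EvalCase("Convert this loop to a list comprehension", "refactor", "P1"),
]

def passed(case: EvalCase) -> bool:
    """Stub: run the model on case.prompt and apply whatever checks you trust."""
    return True

# Track pass rates per priority so a P1 failure never masks a P0 regression.
for prio in ("P0", "P1"):
    bucket = [c for c in cases if c.priority == prio]
    rate = sum(passed(c) for c in bucket) / len(bucket)
    print(f"{prio}: {rate:.0%} of {len(bucket)} cases passing")
```

Splitting the report by priority keeps the launch-gating question ("are all P0 cases green?") separate from longer-tail quality work.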

Designing with Evaluation in Mind

Knox advocates for designing products around existing model limitations. This includes transparently surfacing errors, displaying alternative outputs, and pausing workflows to manage known weak spots. Such strategies enable earlier launches and garner essential user feedback, even if the user experience is not entirely polished. He concludes by recommending a balanced approach: leverage synthetic data, adopt simple metrics, and engage in human-in-the-loop processes to ensure reliability.
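One way to picture designing around known weak spots is a simple gate that applies a generated change only when checks pass and otherwise pauses the workflow for human review. The function name and the signals it consumes below are hypothetical; they stand in for whatever confidence or test signals a real product exposes.

```python
def handle_generation(code: str, tests_passed: bool, known_weak_spot: bool) -> dict:
    """Apply the change when checks pass; otherwise surface it for review."""
    if tests_passed and not known_weak_spot:
        return {"action": "apply", "code": code}
    # Surface the limitation instead of silently applying a risky edit:
    # pause the workflow and hand the output back to the user.
    return {
        "action": "review",
        "code": code,
        "note": "Change failed checks or touches a known weak spot; "
                "please review before applying.",
    }

result = handle_generation(
    "def add(a, b):\n    return a + b\n",
    tests_passed=True,
    known_weak_spot=True,
)
print(result["action"])  # -> "review"
```

Gating like this is what allows the earlier, less polished launches Knox describes: the product admits its weak spots up front and collects user feedback on them instead of hiding them.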

About The Speaker

Dru Knox

Head of AI, Product, Tessl

Head of AI Product at Tessl, building LLM-powered developer tools, with prior experience at Google and Airtable.

Related Events

From Vibe Coding to AI Native Dev as a Craft

13 May 2025

with Guy Podjarny

AI-assisted Programming: From Inline Completions to Agentic Workflow

13 May 2025

with Anton Arhipov

From Code Completion to Multi-Agent Coding Workflow

13 May 2025

with Itamar Friedman
