o3 vs GPT-4.5: Observations in AI-Native Development
17 Mar 2025 • Baptiste Fernandez
The challenge of AI-native development
AI-native development is a challenge that demands precision. Success hinges on integrating code understanding, specification-to-code translation, intelligent code generation, and automated testing across multiple modules. Beyond OpenAI’s insights on o3-mini’s coding performance (see below), there’s a shared sentiment in the developer community about its effectiveness. For example, Simon Willison noted his surprise at o3-mini’s ability to generate programs that solve novel tasks, as well as its strength in producing documentation.

However, within AI-native development, each layer introduces new complexities, requiring models to not only generate functional code but also understand dependencies, adapt to evolving specifications, and self-correct through testing. Traditional approaches struggle with this level of integration, making it clear that AI-native development demands a fundamentally different paradigm.
Advancements in reasoning models, from the early “chain-of-thought” approach to “hybrid reasoning” in Claude 3.7, make this an exciting time for tackling these complex problems. Tessl’s AI engineering team built an evaluation framework that enables continuous testing of new models as they are released. When GPT-4.5 launched, described by OpenAI as its “last non-chain-of-thought model”, the team assessed which model best suited their use case.
Comparing o3-mini vs GPT-4.5: Tessl’s approach
Tessl, focused on AI-native development, initially used GPT-4o for most of its generation tasks, but transitioned to o3-mini after it demonstrated stronger performance. With the release of GPT-4.5, and OpenAI’s claims of more accurate responses and fewer hallucinations, the team conducted a comparative analysis to assess its performance against o3-mini.
The evaluation process involved testing the models across a range of packages. Each package represented a distinct coding challenge, such as implementing a calculator, performing color transformations, or creating a spreadsheet engine.
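To make this concrete, below is a minimal sketch of what such a package-based evaluation loop could look like. The package layout, file names, and the generate_implementation helper are illustrative assumptions, not Tessl’s actual framework:

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical packages, each a distinct coding challenge with a spec and a test suite.
PACKAGES = ["calculator", "color_transform", "spreadsheet_engine"]


def generate_implementation(model: str, spec: str) -> str:
    """Ask the model under evaluation to produce code for the given spec.
    Placeholder: wire this to whichever model API you are testing."""
    raise NotImplementedError


def evaluate(model: str, packages_root: Path) -> dict[str, bool]:
    """Generate an implementation per package and run its test suite."""
    results: dict[str, bool] = {}
    for name in PACKAGES:
        pkg = packages_root / name
        spec = (pkg / "spec.md").read_text()
        code = generate_implementation(model, spec)
        with tempfile.TemporaryDirectory() as tmp:
            workdir = Path(tmp)
            (workdir / "impl.py").write_text(code)
            # The test suite is fixed per package; only the implementation varies by model.
            (workdir / "test_impl.py").write_text((pkg / "test_impl.py").read_text())
            proc = subprocess.run(["pytest", "-q", str(workdir)],
                                  capture_output=True, text=True)
        results[name] = proc.returncode == 0  # pass only if every test passes
    return results
```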
The tasks focused on the key capabilities being tested within AI-native development:
Code understanding
Code generation
Translating specifications into code
Debugging and error resolution
Test case generation
The results provided meaningful insights.
Initially, the team left GPT-4.5 and o3-mini to generate their own test cases, and o3-mini demonstrated a significantly higher pass rate. However, to ensure a fair comparison, the team standardised the evaluation by using test specifications and cases generated by o3-mini for both models. With this apples-to-apples comparison, o3-mini still proved to be significantly stronger in their internal pass rate benchmarks.
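As a rough illustration of that apples-to-apples setup, and building on the hypothetical evaluate helper sketched above, the comparison reduces to running both models against the same o3-mini-generated test suites and comparing pass rates. The model names are just labels here, and this harness is not Tessl’s:

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of packages whose full test suite passed."""
    return sum(results.values()) / len(results) if results else 0.0


# Both models are scored against the same fixed test suites, so the only
# variable is the code each model generates.
for model in ["o3-mini", "gpt-4.5"]:
    results = evaluate(model, Path("packages"))
    print(f"{model}: {pass_rate(results):.0%} of packages passing")
```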
Our findings match OpenAI’s statement that GPT-4.5 is “showing strong capabilities in […] multi-step coding workflows”. Even so, o3-mini ultimately proved to be a stronger fit for Tessl’s use case in AI-native development. These findings resonate fairly well with SWE-bench, a well-known benchmark that evaluates models on real issues collected from GitHub (see below).

Ultimately, the most interesting insights lie between the lines.
OpenAI’s claim that GPT-4.5 is a strong model holds true. While o3-mini outperformed it in Tessl’s benchmarks, GPT-4.5 still delivered decent results. One notable detail is that when given the autonomy to test itself, GPT-4.5 organically generated more tests. This raises compelling questions:
Were the tests GPT-4.5 generated superior?
Could GPT-4.5 be better suited for test generation rather than code creation?
Would it be more effective to leverage GPT-4.5 for specific aspects of AI-native development rather than applying it universally?
Implications for the future of AI-native development
Model advancements are reshaping development workflows and making AI-driven coding a more practical reality. These early results could push more AI-powered dev tools to integrate reasoning models like o3-mini.
“[o3-mini] was built with a load of RL methods. I was a bit cynical about all of this post-training […] but o3-mini really does feel like a real step change. It is making well-considered code that suddenly makes it feel like we're a lot closer to an AI-native future.” – Amy Heineike, AI Engineer at Tessl
That said, should we explore further experiments in model pairing—where one model (potentially o3-mini at this stage) manages the overarching system architecture while another refines the finer details? We believe the future of AI-native development lies in leveraging multiple models, stacked on top of one another, each optimised for a specific stage of the development workflow. Just as a hammer and a screwdriver can both put a screw in place—but with different levels of effort and precision—different models excel in different roles within the development process.
For instance, GPT-4.5 is known for its human-like writing, while o3-mini excels in coding output. Could o3-mini generate the code while GPT-4.5 refines and explains it in a more natural way? What role does each model play in this complex puzzle? And which pairings create the most effective AI-native development stack?
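As a sketch of how such a pairing might be wired up, assuming a generic call_model helper and prompts invented purely for illustration, one model could own code generation while the other handles explanation and documentation:

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to the named model."""
    raise NotImplementedError


def paired_generation(spec: str) -> tuple[str, str]:
    """Two-stage pipeline: one model writes the code, another explains it."""
    # Stage 1: the stronger coding model produces the implementation.
    code = call_model("o3-mini", f"Implement this specification:\n{spec}")
    # Stage 2: the stronger writing model documents it in natural language.
    explanation = call_model(
        "gpt-4.5",
        f"Explain the following code for a reviewer, noting assumptions and edge cases:\n{code}",
    )
    return code, explanation
```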
We’re still in the early days of AI Native development, and the possibilities ahead are exciting. Let’s explore, build, and learn together. What models are you using? What evals are you running? What insights have you unearthed? We will be looking into this in more detail at the 2025 AI Native DevCon event. If you’re interested in AI Native development, emerging trends, and how to stay ahead in this fast-moving space, join us!
Plus, be part of the conversation—join our AI Native Development community on Discord.