
Ray Myers
Ray Myers is a legacy code expert with 18 years of Software Engineering experience across four industries. His recent work explores the delicate intersection of AI and maintainability. He co-hosts the Empathy In Tech podcast and publishes guidance on the Craft vs Cruft YouTube channel with influences from DevOps to Taoism.
AI coding agents are our pride and joy. They work miracles in Pitchdeck Paradise, that sunny land of prototypes and demos.
The problem is that enterprise users don't live there. They live on Legacy Mountain, a harsher place with treacherous cliffs and systems old enough to bite back. When they try to use these same miraculous tools, their results range from frustrating to catastrophic.
I've spent most of my 20-year career maintaining and modernizing old systems at companies large and small. Today I work on OpenHands, the leading open source coding agent.
AI hates legacy code, and we need to change that. But what do I mean by "AI hates legacy code"?
Some of you will be itching to point out that "AI" can mean other things. Don't worry, we'll get to that.
Today's coding agents are largely based on LLM calls run in loops with tools and environment feedback. That's gotten us far, as benchmarks like SWE-bench show. It's even gotten real results, especially on codebases that are new, small, or well-tested. However, it will not get us up Legacy Mountain.
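To make that concrete, here is a minimal sketch of that loop in Python. The model call is stubbed out, and the single "run the tests" tool is only illustrative; a real agent such as OpenHands layers planning, sandboxing, and history management on top of this skeleton.

```python
import subprocess
import sys


def call_llm(history: list[dict]) -> dict:
    """Placeholder for a real model call that would send `history` to an API."""
    # Stubbed for illustration: run the tests once, then stop.
    if len(history) < 2:
        return {"tool": "run_tests", "args": {}}
    return {"tool": "finish", "args": {}}


def run_tests() -> str:
    """Tool: run the project's test suite and return its output as feedback."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q"], capture_output=True, text=True
    )
    return result.stdout + result.stderr


TOOLS = {"run_tests": run_tests}


def agent_loop(task: str, max_steps: int = 10) -> list[dict]:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_llm(history)                                 # 1. model proposes an action
        if action["tool"] == "finish":
            break
        observation = TOOLS[action["tool"]](**action["args"])      # 2. execute the tool
        history.append({"role": "tool", "content": observation})   # 3. feed the result back
    return history


if __name__ == "__main__":
    agent_loop("Fix the failing test in billing.py")
```

The environment feedback in step 3 is what separates an agent from a single LLM completion: the model's next decision is grounded in what actually happened, not in what it predicted would happen.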

Legacy is any code in production. It's live, creating value, and your business depends on it. When we talk about the challenges of legacy code, we usually refer to some combination of these:
The one thing all these overlapping kinds of legacy code have in common is that we have to deal with them. Existing production code must be supported.
LLMs give us the best return on investment on tasks that are low risk, have clear context, and are easy to check.
Legacy code is the opposite situation. It's often high risk, has unclear context, and is hard to check. The more that's true for the project in front of us, the more trouble we will have applying LLM-based solutions. They can still provide value, but we need to take elaborate measures to manage the risk.
To start improving the situation, let's dispense with a few common excuses.
When limitations of AI are brought up, some people will retort that humans are also imperfect. That is a thought-terminating cliché used to dismiss legitimate concerns. It could just as well be used to defend a hammer made of jello.
Just because two things both carry risk doesn't mean the risks are interchangeable. LLMs make different mistakes for different reasons. The software industry already has a quality problem; it's reasonable to want to be sure we're making the situation better, not worse.
Of course people make mistakes, but that should be the beginning of a thought, not the end. We engineer processes to take human mistakes into account; we must do the same with AI.
Whenever there's some task that LLMs do poorly at, someone is ready to remind us that the models will get better. They invite us to assess technology that does not exist today, only in some hypothetical future.
The models do get better; improvements in context windows and tool use, in particular, have been very helpful for coding agents. Still, we can't assume they will improve equally in every dimension. We've clearly seen models improve in some ways while plateauing in others.
Unless we want to keep having the same conversation about counting the R's in Strawberry, we have to learn to use the right tool for the job.
Next time you feel the need to point out that "the models will get better," specify which capability you expect to improve, and consider whether an alternative is already better at that thing today.
For any behavior that a coding agent exhibits, no matter how ridiculous, someone is ready to blame the user.
You should have prompted better. You should have checked every line. You should have used RAG, now it's Rules, wait... now it's Spec-Driven Development!
Have you tried writing "DON'T DELETE THE DATABASE" in all caps?
These workarounds are helpful but are a heavy burden for users and still far from foolproof.
Blaming the user invites stagnation. Empathy for users inspires better tools.
We hear in one breath that we should "forget the code is even there," and in the next that we should have checked every line of output: all the thousands of lines it cranked out in minutes.
We hear that we must adopt this today or become irrelevant, but then when it breaks, suddenly we shouldn’t criticize the current capabilities because "this is the worst it will ever be".
It won't serve us to push this contradictory fear-based messaging. What we need is a clear value proposition that fits realistic needs.
If you're using coding agents today, your best bet is to understand that they can be used well, but they won't ensure that on their own. You will have to understand the pitfalls at every turn. You can do this by studying software craft.
In addition to learning AI, treat software engineering as its own learning path, including skills specifically about working with legacy code. With that fuller understanding, you'll be in a better position to leverage coding agents successfully.
Once you understand how you want coding agents to act, expect to spend some effort customizing them to match your workflow. The docs will show how to configure things like instructions, permissions, and tools. Use that to the fullest. You might then graduate to scripting your own agent CLI invocations or coding your workflow with an Agent SDK directly.
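As a hedged illustration of that last step, scripted invocation might look something like the sketch below: loop over a backlog of small, well-scoped tasks, run the agent headlessly on each, and only keep changes that pass the tests. The CLI name and flags here (`agent`, `--task`, `--max-iterations`) are placeholders, not any particular tool's real interface; consult your tool's docs for the equivalents.

```python
import subprocess
import sys

TASKS = [
    "Add a docstring to every public function in billing.py",
    "Replace deprecated logging calls in orders.py",
]

for task in TASKS:
    # Hypothetical headless invocation of a coding agent CLI.
    subprocess.run(["agent", "--task", task, "--max-iterations", "20"], check=False)

    # Trust, but verify: only keep the change if the tests still pass.
    tests = subprocess.run([sys.executable, "-m", "pytest", "-q"])
    if tests.returncode != 0:
        # Discard the attempt (tracked changes only) and move on.
        subprocess.run(["git", "checkout", "--", "."])
        print(f"Reverted failed attempt: {task}")
```

The design choice worth copying is not the script itself but the gate: the agent never gets to decide whether its own work was good enough.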
Before we get into some long-term ideas for tool builders, let's consider what's possible today. When we study users' current tribal knowledge, we can see how to improve. Try making today's best practice into tomorrow's default. For example:
Similarly, any experienced wrangler of coding agents has a mental model of when the agent is likely to be out of its depth and which tasks are appropriate to delegate. We don't have to leave the next user to figure this out through trial and error; we can provide signals.
The next wave of developers is ready to try coding agents, but we can't expect everyone to be as eager to wrap their minds around the confusing quirks as we were.
Perhaps you're starting to suspect that "AI hates legacy code" is not just a defeatist catchphrase or a temporary state of affairs. Suppose it's a reality we need to contend with. If we embrace that, where does it lead?
Let's go back to the original problem statement, now with the qualifiers in bold.
This offers us three paths:
Path 3 invites us to create a new foundation for software that is more manageable by coding agents. There are some promising approaches there: my best bet would be a new renaissance of Domain-Specific Languages, and another interesting one is Universalis by Erik Meijer. However, for this discussion we're trying to climb Legacy Mountain, not create a new mountain.
That leaves us options 1 and 2, which are different ways to say the same thing: reduce our dependence on LLMs.
We've already taken steps in that direction with the focus on agents. Incorporating tool-use and environment feedback means we are no longer limited to an LLM's guess at what the outcome will be. In high-risk contexts, the value we get from these agents is constrained by how much we can trust their guardrails. In other words, our bottleneck is the safety of the tools.
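Here is a minimal sketch of that idea, not OpenHands' actual implementation: a shell tool that refuses obviously destructive commands and confines everything else to a sandbox directory. A denylist like this is far from foolproof; the point is only to show where guardrails live, at the tool boundary rather than in the prompt.

```python
import re
import subprocess
from pathlib import Path

SANDBOX = Path("/tmp/agent-sandbox")  # hypothetical working directory for the agent
DENYLIST = [r"\brm\s+-rf\b", r"\bdrop\s+table\b", r"\bgit\s+push\s+--force\b"]


def run_shell_tool(command: str) -> str:
    """Execute a command proposed by the agent, with guardrails at the tool level."""
    for pattern in DENYLIST:
        if re.search(pattern, command, re.IGNORECASE):
            return f"REFUSED: '{command}' matches a destructive pattern."
    SANDBOX.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        command, shell=True, cwd=SANDBOX,
        capture_output=True, text=True, timeout=60,
    )
    # The agent sees real environment feedback, not its own guess at the outcome.
    return result.stdout + result.stderr


if __name__ == "__main__":
    print(run_shell_tool("echo hello"))
    print(run_shell_tool("rm -rf /"))
```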
Consider what an untapped resource this notion of safe tools is, especially on legacy code. We fear to change untested code, and with good reason. Yet we trust compilers to transform millions of lines of code into assembly language and other forms. We do this without a care, whether we have tests or not! We trust that however many bugs were in the source code coming in, that's how many will be in the executable that comes out.
Compilers can handle high-risk code with near certainty. The rare compiler bugs that happen can be found and fixed, unlike LLM hallucinations. We are also able to transform code within the source language with related technology: refactoring algorithms that operate on syntax trees.
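For a small taste of what tree-based transformation looks like, here is a sketch using Python's standard ast module to rename a function and every call to it. Production refactoring engines work on concrete syntax trees so they can preserve comments and formatting, but the principle is the same: the change is computed from the program's structure, not guessed.

```python
import ast


class RenameFunction(ast.NodeTransformer):
    """Rename a function definition and every reference to it by name."""

    def __init__(self, old_name: str, new_name: str):
        self.old_name = old_name
        self.new_name = new_name

    def visit_FunctionDef(self, node: ast.FunctionDef):
        if node.name == self.old_name:
            node.name = self.new_name
        self.generic_visit(node)
        return node

    def visit_Name(self, node: ast.Name):
        if node.id == self.old_name:
            node.id = self.new_name
        return node


source = """
def calc_total(items):
    return sum(items)

print(calc_total([1, 2, 3]))
"""

tree = ast.parse(source)
tree = RenameFunction("calc_total", "compute_total").visit(tree)
print(ast.unparse(tree))  # definition and call site renamed consistently
```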
Another natural area to explore is Formal Methods. "Symbolic AI" approaches such as theorem provers are typically more powerful within a limited scope of operation than Machine Learning techniques like LLMs. When we build systems, we're not limited to either side; we can use the strengths of both. For instance, you could imagine sketching a mathematical proof with the help of an LLM chatbot and verifying it in one of these formalized prover languages. That's what some of the world's leading mathematicians, like Terence Tao, have started to do.
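For a flavor of what machine-checked proof looks like, here is a tiny example in Lean 4, one of the proof assistants Tao has used publicly. The statement is trivial; the point is that the checker, not a human reviewer, confirms the proof is correct.

```lean
-- A machine-checked proof: this `theorem` only compiles if the proof term
-- really establishes the statement. `Nat.add_comm` is a standard library lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```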
If we need more options, we have all of Computer Science to choose from.
"AI first" is AI failure. We need to put our actual goals first. Let's demand excellence and use all of the pieces on the board.
I’m speaking at AI Native DevCon NYC on November 19th, with my talk “AI Hates Legacy Code”. I hope you get a chance to join in person or on YouTube. Please feel free to reach out and continue the discussion.
Register at ainativedev.io/devcon