For AI coding agents to become truly useful in everyday software development work, they must be able to handle long, multi-step tasks. However, their ability to do so is constrained by a basic limitation: they cannot retain unlimited context. As conversations, code changes, and debugging sessions stretch on, important details are often compressed or discarded, raising questions about how well agents can actually continue work over time.
That problem sits at the center of a new evaluation framework from Factory, a company that builds agentic AI software development tools. The framework examines how different approaches to context compression affect an agent’s ability to preserve useful memory during real software engineering tasks.
Context compression, for the uninitiated, refers to techniques that reduce the amount of prior interaction an AI system carries forward, typically by summarizing earlier conversation or state. In longer tasks, compression becomes the mechanism that determines what an agent remembers and what it forgets once earlier context no longer fits in memory. Factory’s new evaluation framework measures how effectively agents can continue work under different compression methods.
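To make the mechanism concrete, the sketch below shows one simple form of budget-driven compression in Python: once the conversation history exceeds a token budget, the oldest turns are folded into a summary. The tokenizer stand-in and the summarize() placeholder are illustrative assumptions, not any vendor’s implementation.

```python
# Illustrative sketch of context compression (not any vendor's actual system).
# When the history exceeds a token budget, older turns are replaced by a summary
# so the agent can keep working within its context window.

def count_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: approximate tokens by whitespace words.
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # Placeholder summarizer; a real system would call a language model here.
    return "SUMMARY: " + " | ".join(turn[:40] for turn in turns)

def compress_history(history: list[str], budget: int) -> list[str]:
    """Keep recent turns verbatim; fold the oldest turns into a single summary."""
    kept = list(history)
    dropped = []
    while kept and sum(count_tokens(t) for t in kept) > budget:
        dropped.append(kept.pop(0))
    return ([summarize(dropped)] if dropped else []) + kept

history = [
    "user: the /orders endpoint returns 500 on empty carts",
    "agent: reproduced; stack trace points to orders/service.py",
    "agent: the bug is a missing None check in calculate_total()",
    "user: please add a regression test as well",
]
print(compress_history(history, budget=25))
```

The trade-off the Factory study examines is exactly what such a summarize() step keeps and what it quietly drops.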
As AI agents move beyond short prompts into longer-running work such as debugging, code review, and feature development, the amount of context they generate can quickly exceed what even large language models can hold. These sessions may span tens or hundreds of thousands of tokens (the units of text that models process), making some form of summarization unavoidable.
The industry’s early response has often been to focus on aggressive reduction, producing compact summaries that minimize token usage. In practice, that approach has begun to show limits. Some summaries remain readable while quietly discarding operational details such as file names, API endpoints, or error conditions, leaving agents unable to reason correctly about what happened earlier in a task.
That tension is increasingly visible across the ecosystem. Sourcegraph, for example, recently retired “compaction” in its Amp coding agent in favor of a cleaner hand-off mechanism, after finding that repeated compression made it harder for agents to maintain continuity across phases of work. Elsewhere, Tessl has proposed its own evaluation framework for coding agents, reflecting a similar push to measure whether agents can apply technical context correctly over time rather than simply generate plausible code.
At the same time, benchmarking efforts such as Context-Bench point to a broader shift in how agent performance is being assessed. Such benchmarks attempt to measure how well agents retain, reuse, and reason over information across extended interactions.
Factory’s evaluation framework fits squarely within that trend. Rather than asking how small a summary can be, it focuses on whether compressed context still supports effective task continuation, a question that is becoming harder to ignore as agents take on longer, more complex roles in everyday development work.
The Factory study compared three compression approaches across more than 36,000 messages from real engineering sessions. One was Factory’s own structured summarization system, which incrementally maintains context around intent, changes made, decisions taken, and next steps. The others were compression features from OpenAI and Anthropic, both designed to produce compact representations of prior context.
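Factory has not released code for this system, so the Python sketch below is only a rough illustration of what maintaining context as structured state around those four categories might look like; the class, field names, and update calls are assumptions made for the example.

```python
# Rough sketch of "structured state" summarization, based only on the categories
# Factory describes publicly (intent, changes, decisions, next steps). The field
# names and update logic are illustrative assumptions, not Factory's code.

from dataclasses import dataclass, field

@dataclass
class TaskState:
    intent: str = ""                                     # what the user is trying to do
    changes: list[str] = field(default_factory=list)     # edits made so far
    decisions: list[str] = field(default_factory=list)   # choices and their rationale
    next_steps: list[str] = field(default_factory=list)  # remaining work

    def render(self) -> str:
        # The rendered state stands in for raw history in the agent's prompt.
        return (
            f"Intent: {self.intent}\n"
            f"Changes: {'; '.join(self.changes) or 'none'}\n"
            f"Decisions: {'; '.join(self.decisions) or 'none'}\n"
            f"Next steps: {'; '.join(self.next_steps) or 'none'}"
        )

# Each phase of work updates named fields instead of re-summarizing a transcript.
state = TaskState(intent="Fix 500 errors on the /orders endpoint")
state.changes.append("Added a None check in calculate_total() in orders/service.py")
state.decisions.append("Return an empty total rather than raising for empty carts")
state.next_steps.append("Add a regression test for empty-cart orders")
print(state.render())
```

The appeal of the pattern is that operational details live in named fields that are updated incrementally, rather than surviving (or not) each round of free-text re-summarization.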
Factory’s data highlights how these approaches perform across dimensions tied to task continuation, including accuracy, context awareness, completeness, continuity, and instruction following.
Across targeted tests, Factory’s data suggests that structured summaries more consistently retained the details required to answer follow-up questions correctly. In debugging scenarios, for example, those summaries were more likely to preserve the relationship between an error code, the affected endpoint, and the underlying cause.
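The invented example below illustrates the kind of linkage at stake: a structured record keeps the error, the endpoint, and the cause explicitly tied together, so a follow-up question can be answered directly rather than reconstructed from loose prose.

```python
# Invented data, for illustration only. A free-text summary might retain all three
# facts yet lose which error belongs to which endpoint; a structured record keeps
# the relationship explicit.

debugging_record = {
    "error": "HTTP 500",
    "endpoint": "/orders",
    "root_cause": "calculate_total() raises TypeError when the cart is empty",
    "fix": "guard against empty carts before summing line items",
}

# A follow-up question such as "which endpoint returned the 500, and why?"
# can be answered directly from the record:
print(f"{debugging_record['error']} on {debugging_record['endpoint']}: "
      f"{debugging_record['root_cause']}")
```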
All three approaches achieved similar reductions in token count. The difference lay not in how much information was removed, but in what kind of information survived compression.

It is worth stressing that the evaluation was designed and conducted internally by Factory. While the company has published the framework’s structure, methodology, and rationale, it has not been released as an open benchmark that external teams can run independently.
However, the work reflects a broader shift in how AI systems are being evaluated as they take on more autonomous roles. As agents become embedded in development environments and other professional settings, their ability to maintain continuity over time becomes a real reliability issue.
Factory’s results suggest that treating context as structured state, rather than reducing it to compressed text, may offer a more dependable path forward. That approach also raises new questions about how summaries are maintained, audited, and updated as work progresses.
As AI agents are asked to operate for longer stretches with less human supervision, the mechanics of memory and compression are becoming part of the infrastructure that determines whether these systems can sustain useful work over time, rather than only performing well in short interactions.