This page summarizes Factory Research’s post: Evaluating Context Compression for AI Agents (Dec 16, 2025).
For the full methodology, charts, and examples, see the original post.

TL;DR

  • Long-running agent sessions can exceed any model’s context window, so some form of context compression is required.
  • The key metric isn’t tokens per request; it’s tokens per task (because missing details force costly re-fetching and rework).
  • In Factory’s evaluation, structured summarization retained more “continue-the-task” information than OpenAI’s /responses/compact and Anthropic’s SDK compression, at similar compression rates.

Why context compression matters

As agent sessions stretch into hundreds or thousands of turns, the full transcript can reach millions of tokens. If an agent loses critical state (e.g., the exact endpoint, which file paths were changed, or the current next step), it often:
  • re-reads files it already read
  • repeats debugging dead ends
  • forgets what changed and where
That costs more time and tokens than the compression saved.
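
As a rough illustration of that framing, the sketch below counts a task's cost across every request it took to finish, including rework; the request structure and field names are assumptions, not anything from the post.

```python
# Hypothetical "tokens per task" accounting: the cost that matters is every
# token spent until the task is done, including tokens burned on re-reading
# files or repeating debugging after context was lost. Field names are made up.

def tokens_per_task(requests: list[dict]) -> int:
    """Sum prompt + completion tokens over every request the task required,
    rather than looking at the size of any single request."""
    return sum(r["prompt_tokens"] + r["completion_tokens"] for r in requests)

# A compression scheme that shrinks each individual request but forces three
# extra file re-reads can easily come out worse on this metric.
```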

How Factory evaluated “context quality”

Instead of using summary similarity metrics (e.g., ROUGE), Factory used a probe-based evaluation, sketched in code after the steps below:
  1. Take real, long-running production sessions.
  2. Compress the earlier portion.
  3. Ask probes that require remembering specific, task-relevant details from the truncated history.
  4. Grade the answers for functional usefulness.
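
A minimal sketch of this loop, with placeholder compress_fn, answer_fn, and judge_fn callables standing in for Factory's actual pipeline:

```python
# Probe-based evaluation, sketched with placeholder callables; these stand in
# for Factory's actual pipeline, not describe it.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    kind: str      # "recall", "artifact", "continuation", or "decision" (see probe types below)
    question: str  # e.g. "What was the original error?"

def evaluate_session(
    messages: list[dict],
    probes: list[Probe],
    split_index: int,
    compress_fn: Callable[[list[dict]], str],           # compression method under test
    answer_fn: Callable[[list, str], str],               # agent answering from the compressed context
    judge_fn: Callable[[str, str, list[dict]], float],   # blinded LLM judge, returns a 0-5 score
) -> dict[str, float]:
    """Compress the early history, then check whether an agent that only sees
    the compressed context can still answer task-relevant probes about it."""
    early, recent = messages[:split_index], messages[split_index:]
    compressed = compress_fn(early)
    context = [{"role": "system", "content": compressed}] + recent

    scores: dict[str, float] = {}
    for probe in probes:
        answer = answer_fn(context, probe.question)
        # Assumption: the judge may consult the full, uncompressed history.
        scores[probe.kind] = judge_fn(probe.question, answer, early + recent)
    return scores
```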

Probe types

  • Recall: factual retention (e.g., “What was the original error?”)
  • Artifact: file tracking (e.g., “Which files did we modify and how?”)
  • Continuation: task planning (e.g., “What should we do next?”)
  • Decision: reasoning chain (e.g., “What did we decide and why?”)

Scoring dimensions

Responses were scored (0–5) by an LLM judge (GPT-5.2) across:
  • Accuracy
  • Context awareness
  • Artifact trail
  • Completeness
  • Continuity
  • Instruction following
The judge is blinded to which compression method produced the response.
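
One way such a blinded, multi-dimension judge call could be wired up is sketched below using the OpenAI Python SDK; the rubric wording, dimension keys, model string, and the choice to show the judge the full history are assumptions rather than Factory's actual judge configuration.

```python
# Blinded LLM-judge sketch. Only the six dimensions and the 0-5 scale come
# from the post; everything else here is illustrative.

import json
from openai import OpenAI

DIMENSIONS = [
    "accuracy", "context_awareness", "artifact_trail",
    "completeness", "continuity", "instruction_following",
]

RUBRIC = (
    "Grade the agent's answer to the probe against the full session history.\n"
    "Score each dimension from 0 to 5: " + ", ".join(DIMENSIONS) + ".\n"
    "You are not told which compression method produced the agent's context.\n"
    'Reply with a JSON object like {"accuracy": 4, ...}.'
)

def judge(client: OpenAI, probe: str, answer: str, full_history: str) -> dict[str, int]:
    """Return per-dimension 0-5 scores from the judge model."""
    resp = client.chat.completions.create(
        model="gpt-5.2",  # judge model named in the post; the exact API model id is an assumption
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Probe:\n{probe}\n\nAnswer:\n{answer}\n\nHistory:\n{full_history}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```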

Compression approaches compared

  • Factory: a structured, persistent summary with explicit sections (intent, file modifications, decisions, next steps), updated by summarizing only the newly truncated span and merging it into the existing summary ("anchored iterative summarization", sketched below). Trade-off: slightly larger summaries than the most aggressive compression, but better retention of task-critical details.
  • OpenAI: /responses/compact, an opaque compressed representation optimized for reconstruction fidelity. Trade-off: highest compression, but low interpretability (you can't inspect what was preserved).
  • Anthropic: the Claude SDK's built-in compression, producing detailed structured summaries (often 7–12k characters) that are regenerated in full on each compression. Trade-off: high-quality summaries, but regenerating the whole summary each time can cause drift across repeated compressions.
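
A rough sketch of the anchored iterative summarization update described for Factory above; the section names follow the post, while the prompts and the llm callable are assumptions.

```python
# Anchored iterative summarization, roughly: summarize only the span that just
# fell out of the context window, then merge it into a persistent, sectioned
# summary instead of regenerating everything. Prompts are illustrative.

SECTIONS = ["Intent", "File modifications", "Decisions", "Next steps"]

def update_summary(existing_summary: str, newly_truncated: list[str], llm) -> str:
    """`llm` is any text-in/text-out completion callable."""
    delta = llm(
        "Summarize ONLY these newly truncated messages. Note any new intent, "
        "file modifications, decisions, and next steps:\n\n" + "\n".join(newly_truncated)
    )
    return llm(
        "Merge the delta into the existing summary. Keep the sections "
        f"{SECTIONS}, update 'Next steps', and never silently drop entries "
        "from 'File modifications' or 'Decisions'.\n\n"
        f"EXISTING SUMMARY:\n{existing_summary}\n\nDELTA:\n{delta}"
    )
```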

Results (high-level)

Factory reports evaluating 36,000+ messages from production sessions across tasks like PR review, bug fixes, feature implementation, and refactoring.

Overall scores (0–5)

Method    | Overall | Accuracy | Context | Artifact | Completeness | Continuity | Instruction
Factory   | 3.70    | 4.04     | 4.01    | 2.45     | 4.44         | 3.80       | 4.99
Anthropic | 3.44    | 3.74     | 3.56    | 2.33     | 4.37         | 3.67       | 4.95
OpenAI    | 3.35    | 3.43     | 3.64    | 2.19     | 4.37         | 3.77       | 4.92

Compression ratio vs. quality

The post notes similar compression rates across methods:
  • OpenAI: 99.3% token removal
  • Anthropic: 98.7% token removal
  • Factory: 98.6% token removal
Factory retained ~0.7 percentage points more of the original tokens than OpenAI (i.e., kept more context) and scored 0.35 points higher on overall quality.
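
To make that gap concrete, a bit of illustrative arithmetic (the 1,000,000-token session length is invented; only the removal rates come from the post):

```python
# Illustrative only: for a hypothetical 1,000,000-token session, the gap between
# 99.3% and 98.6% removal is a few thousand retained tokens, small compared to
# the cost of re-fetching even one lost detail later.

history_tokens = 1_000_000
removal_rate = {"OpenAI": 0.993, "Anthropic": 0.987, "Factory": 0.986}

for method, rate in removal_rate.items():
    kept = round(history_tokens * (1 - rate))
    print(f"{method}: keeps ~{kept:,} of {history_tokens:,} tokens")
# OpenAI: keeps ~7,000 of 1,000,000 tokens
# Anthropic: keeps ~13,000 of 1,000,000 tokens
# Factory: keeps ~14,000 of 1,000,000 tokens
```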

What Factory says it learned

  • Structure matters: forcing explicit sections (files/decisions/next steps) reduces the chance that critical details “silently disappear” over time.
  • Compression ratio is a misleading target: aggressive compression can “save tokens” but lose details that cause expensive rework; optimize for tokens per task.
  • Artifact tracking is still hard: all methods scored low on tracking which files were created/modified/examined (Factory's best was 2.45/5), suggesting this may need dedicated state tracking beyond summarization (one possible shape is sketched after this list).
  • Probe-based evaluation is closer to agent reality than text similarity metrics, because it tests whether work can continue effectively.
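
One possible shape for that dedicated state tracking, purely as an illustration and not Factory's design: log file operations directly from tool calls so the artifact trail never depends on the summarizer's memory.

```python
# Illustrative artifact tracker: record file operations from the agent's
# tool-execution layer, so "which files did we touch and how" survives any
# amount of summarization.

from collections import defaultdict

class ArtifactTracker:
    def __init__(self) -> None:
        # path -> ordered list of operations, e.g. ["read", "edit", "edit"]
        self.operations: dict[str, list[str]] = defaultdict(list)

    def record(self, operation: str, path: str) -> None:
        """Call this whenever a tool reads, creates, or modifies a file."""
        self.operations[path].append(operation)

    def render(self) -> str:
        """Compact artifact trail to append to any compressed context."""
        return "\n".join(
            f"{path}: {', '.join(ops)}" for path, ops in sorted(self.operations.items())
        )
```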