> ## Documentation Index
> Fetch the complete documentation index at: https://docs.factory.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluating Context Compression

> Summary of Factory Research’s evaluation of context compression strategies for long-running AI agent sessions.

This page summarizes Factory Research’s post: **[Evaluating Context Compression for AI Agents](https://factory.ai/news/evaluating-compression)** (Dec 16, 2025).

<Info>
  For the full methodology, charts, and examples, read the original post linked above.
</Info>

***

## TL;DR

* Long-running agent sessions can exceed any model’s context window, so some form of **context compression** is required.
* The key metric isn’t *tokens per request*; it’s **tokens per task** (because missing details force costly re-fetching and rework).
* In Factory’s evaluation, **structured summarization** retained more “continue-the-task” information than OpenAI’s `/responses/compact` and Anthropic’s SDK compression, at similar compression rates.

***

## Why context compression matters

As agent sessions stretch into hundreds/thousands of turns, the full transcript can reach **millions of tokens**. If an agent loses critical state (e.g., the exact endpoint, file paths changed, or the current next step), it often:

* re-reads files it already read
* repeats debugging dead ends
* forgets what changed and where

That costs more time and tokens than the compression saved.

***

## How Factory evaluated “context quality”

Instead of using summary similarity metrics (e.g., ROUGE), Factory used a **probe-based evaluation**:

1. Take real, long-running production sessions.
2. Compress the earlier portion.
3. Ask probes that require remembering specific, task-relevant details from the truncated history.
4. Grade the answers for functional usefulness.

### Probe types

* **Recall**: factual retention (e.g., “What was the original error?”)
* **Artifact**: file tracking (e.g., “Which files did we modify and how?”)
* **Continuation**: task planning (e.g., “What should we do next?”)
* **Decision**: reasoning chain (e.g., “What did we decide and why?”)

### Scoring dimensions

Responses were scored (0–5) by an LLM judge (**GPT-5.2**) across:

* Accuracy
* Context awareness
* Artifact trail
* Completeness
* Continuity
* Instruction following

The judge is blinded to which compression method produced the response.

***

## Compression approaches compared

| Approach      | What it produces                                                                                                                                                                                                                                      | Key trade-off                                                                                                      |
| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Factory**   | A **structured, persistent summary** with explicit sections (intent, file modifications, decisions, next steps). Updates by summarizing only the newly-truncated span and **merging** into the existing summary (“anchored iterative summarization”). | Slightly larger summaries than the most aggressive compression, but better retention of task-critical details.     |
| **OpenAI**    | `/responses/compact`: an **opaque** compressed representation optimized for reconstruction fidelity.                                                                                                                                                  | Highest compression, but low interpretability (you can’t inspect what was preserved).                              |
| **Anthropic** | Claude SDK built-in compression: detailed structured summaries (often 7–12k chars), regenerated on each compression.                                                                                                                                  | High-quality summaries, but regenerating the whole summary each time can cause drift across repeated compressions. |

***

## Results (high-level)

Factory reports evaluating **36,000+ messages** from production sessions across tasks like PR review, bug fixes, feature implementation, and refactoring.

### Overall scores (0–5)

| Method    |  Overall | Accuracy |  Context | Artifact | Completeness | Continuity | Instruction |
| --------- | -------: | -------: | -------: | -------: | -----------: | ---------: | ----------: |
| Factory   | **3.70** | **4.04** | **4.01** | **2.45** |     **4.44** |   **3.80** |    **4.99** |
| Anthropic |     3.44 |     3.74 |     3.56 |     2.33 |         4.37 |       3.67 |        4.95 |
| OpenAI    |     3.35 |     3.43 |     3.64 |     2.19 |         4.37 |       3.77 |        4.92 |

### Compression ratio vs. quality

The post notes similar compression rates across methods:

* OpenAI: **99.3%** token removal
* Anthropic: **98.7%** token removal
* Factory: **98.6%** token removal

Factory retained \~0.7 percentage points more tokens than OpenAI (kept more context), and scored **+0.35** higher on overall quality.

***

## What Factory says it learned

* **Structure matters**: forcing explicit sections (files/decisions/next steps) reduces the chance that critical details “silently disappear” over time.
* **Compression ratio is a misleading target**: aggressive compression can “save tokens” but lose details that cause expensive rework; optimize for **tokens per task**.
* **Artifact tracking is still hard**: all methods scored low on tracking which files were created/modified/examined (Factory’s best was **2.45/5**), suggesting this may need dedicated state tracking beyond summarization.
* **Probe-based evaluation is closer to agent reality** than text similarity metrics, because it tests whether work can continue effectively.

***

## Related docs

* [Memory and Context Management](/guides/power-user/memory-management)
* [Token Efficiency Strategies](/guides/power-user/token-efficiency)
