How Do You Prevent AI Agent Performance from Degrading as Context Fills Up?

Chroma's context rot research tested 18 large language models on retrieval tasks and found something unexpected: performance dropped 15-40% as context length increased, even on simple lookups. The models weren't hitting token limits. They weren't running out of memory. They were just getting worse at reasoning as their context windows filled.

For AI agents, this is a reliability problem hiding in plain sight. Anthropic's analysis of millions of Claude Code sessions shows the longest agent sessions now run 45+ minutes, up from 25 minutes just three months ago. More time in session means more context accumulation. More context means more degradation. By turn 50, your agent is operating with degraded reasoning even though no individual piece of context is wrong.

Here's the operator thesis: Context rot causes gradual performance degradation as agent sessions grow longer. While teams focus on model selection and prompt engineering, the real culprit behind degrading outputs is often simpler: as the context window fills, reasoning quality drops. The fix isn't bigger context windows. It's smarter context management: compaction at task boundaries, offloading to filesystem, and monitoring context utilization as a reliability metric.

What Is Context Rot and Why It Matters

Context rot describes how models become worse at reasoning and completing tasks as their context window fills up. This isn't about hitting a token limit and crashing. It's about the gradual degradation that happens before you hit any hard ceiling.

The Needle in a Haystack (NIAH) benchmark suggests long-context models maintain consistent performance. But NIAH is a simple retrieval task. Drop a sentence in a document, ask the model to find it. Real agent work involves reasoning over ambiguous information, maintaining constraints across turns, and synthesizing insights from multiple sources.

Chroma's research extended NIAH to test more realistic scenarios. They found:

  • Semantic matching degrades faster than lexical matching. When the needle and question require inference rather than exact word matches, performance drops more sharply as context grows.
  • Distractors have non-uniform impact. Adding topically related but incorrect information degrades performance more than adding unrelated noise.
  • Haystack structure matters. Shuffled content performs differently than coherent narratives, suggesting models use logical flow as a reasoning aid.

For agents, this means the problem compounds. Agents don't just retrieve. They reason, plan, execute, observe results, and adjust. Each turn adds context. Each addition degrades future reasoning. Fifty turns in, the agent is reasoning over an accumulation that is individually accurate but collectively corrosive.

Anthropic's session data reveals the scale of the problem. In October 2025, the 99.9th percentile Claude Code session lasted under 25 minutes. By January 2026, it was over 45 minutes. The longest sessions are getting longer, which means more time for context rot to accumulate.

Bigger context windows don't solve this. A 1M token window just means you can fit more context before degradation starts. It doesn't prevent degradation. The Chroma study tested models with context windows from 32K to 1M tokens. All showed performance degradation with increasing input length. The slope varies, but the direction is consistent.

Four Signals Your Agent Has Context Rot

Context rot doesn't announce itself. You have to watch for the symptoms.

Signal 1: Increasing error rate late in session

Track errors per turn over the session. If the error rate in the last third of a session is at least twice the rate in the first third, context rot is likely a factor. The model isn't getting dumber. It's operating with corrupted context that leads to worse decisions.
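A sketch of this check, assuming you log error counts per turn. The last-third versus first-third comparison and the 2x threshold come straight from the signal description; the function name is illustrative:

```python
def context_rot_signal(errors_per_turn: list[int], threshold: float = 2.0) -> bool:
    """Flag likely context rot: compare the error rate in the last third
    of a session against the first third."""
    n = len(errors_per_turn)
    if n < 6:  # too few turns to compare thirds meaningfully
        return False
    third = n // 3
    first = sum(errors_per_turn[:third]) / third
    last = sum(errors_per_turn[-third:]) / third
    if first == 0:
        return last > 0  # any late errors after a clean start deserve a look
    return last / first >= threshold
```

Run it over a sliding window if sessions are long enough that early turns stop being a meaningful baseline.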

Signal 2: Repeating information already stated

The agent restates something it said ten turns ago, or asks a question it already asked. This suggests the earlier context has been compressed or is no longer accessible in working memory. The agent is operating as if that information doesn't exist.

Signal 3: Losing track of earlier constraints

You specified at the start: maintain backward compatibility, don't touch the database schema, keep changes under 500 lines. By turn 40, the agent introduces a breaking change. When challenged, it has no memory of the original constraint. The constraint is buried in compressed context.

Signal 4: Hallucinating details that contradict early context

Late in a session, the agent states something with high confidence that directly contradicts what it said earlier. This is different from changing its mind based on new information. This is the model generating plausible-sounding content because the original context is no longer reliably accessible.

The Parameters That Define Context Health

Production agents need explicit numbers for context management.

| Parameter | Conservative | Standard | Aggressive |
|-----------|-------------|----------|------------|
| Context utilization before compaction | 70% | 80% | 90% |
| Minimum recent context retained | 15% | 10% | 5% |
| Maximum session duration | 30 minutes | 45 minutes | 60 minutes |
| Error rate increase threshold | 1.5x baseline | 2x baseline | 3x baseline |
| Token count per turn warning | 30% increase | 50% increase | 100% increase |
| Compression ratio target | 5x | 3x | 2x |
| Maximum tool output retention | 1000 tokens | 2000 tokens | 5000 tokens |
| Compaction cooldown | 10 minutes | 5 minutes | 2 minutes |
| Context utilization alert | 85% | 90% | 95% |
| Max consecutive errors before reset | 2 | 3 | 5 |

Recommended defaults: 80% utilization before compaction, 10% recent context retained, 45-minute max session, 2x error rate threshold, 3x compression ratio, 2000-token tool output retention, 5-minute compaction cooldown, 90% alert threshold, 3 consecutive errors before reset.

These aren't magic numbers. They're starting points based on observed behavior. Adjust based on your agent's task profile. A coding agent working on large refactors needs different thresholds than a research agent summarizing documents.
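If you want these defaults in code, a config object keeps them explicit and auditable. A sketch of the standard column, with illustrative field names rather than any real SDK's API:

```python
from dataclasses import dataclass

@dataclass
class ContextHealthConfig:
    """'Standard' defaults from the table above. Adjust per task profile."""
    compaction_utilization: float = 0.80   # compact at 80% of window
    min_recent_retained: float = 0.10      # keep the most recent 10% of context
    max_session_minutes: int = 45
    error_rate_threshold: float = 2.0      # last-third vs first-third ratio
    compression_ratio_target: float = 3.0
    max_tool_output_tokens: int = 2000
    compaction_cooldown_minutes: int = 5
    utilization_alert: float = 0.90
    max_consecutive_errors: int = 3

cfg = ContextHealthConfig()
```

A coding agent might override `max_session_minutes` and `max_tool_output_tokens` upward; a research agent might tighten `compaction_utilization` instead.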

Four Failure Scenarios from Context Rot

These scenarios come from production agent deployments.

Scenario 1: Long refactor loses original requirements

An agent edits 47 files over 25 minutes. Each individual edit looks correct. By file 30, it has forgotten the original constraint about maintaining backward compatibility with the v2 API. It introduces a breaking change that contradicts the initial instructions.

The operator reviews the diff and catches the breaking change. When asked why, the agent has no coherent explanation. The original constraint is somewhere in the context, but it's buried under thousands of tokens of file contents, tool outputs, and intermediate reasoning.

Mitigation:

  • Write constraints to a file and re-inject at intervals
  • Set maximum continuous edit duration to 20 minutes
  • Require constraint re-confirmation after every 10 file edits
  • Track constraint mentions in context with semantic search

Parameters: max_edit_duration=20m, constraint_file_required=true, reconfirm_after_edits=10, constraint_search_enabled=true

Scenario 2: Multi-file edit corrupts earlier changes

An agent modifies a configuration file, then edits code that depends on that config, then updates tests. Later edits accidentally undo earlier fixes because the agent has lost track of what it changed and why.

The root cause is asymmetric context retention. Recent changes are in working memory. Earlier changes have been compressed into a summary. The summary captures what was done but not the dependencies between changes.

Mitigation:

  • Offload all file diffs to filesystem immediately after edit
  • Maintain a changes.md file with structured change log
  • Require dependency check before each edit
  • Use git commits as context checkpoints

Parameters: diff_offload_enabled=true, changes_log_required=true, dependency_check_before_edit=true, git_checkpoint_interval=5
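The change log and dependency check can be sketched together. The changes.md format here is illustrative; the point is that dependency information lives on the filesystem rather than in compressible context:

```python
from pathlib import Path

def log_change(log: Path, file_edited: str, summary: str, depends_on: list[str]) -> None:
    """Append a structured entry to a changes.md log so dependencies
    between edits survive context compaction."""
    entry = (
        f"## {file_edited}\n"
        f"- change: {summary}\n"
        f"- depends on: {', '.join(depends_on) or 'none'}\n\n"
    )
    with log.open("a") as f:
        f.write(entry)

def dependents_of(log: Path, filename: str) -> list[str]:
    """Files whose logged changes declared a dependency on `filename`.
    Check this before re-editing a file, to avoid undoing earlier fixes."""
    out: list[str] = []
    current = None
    for line in log.read_text().splitlines():
        if line.startswith("## "):
            current = line[3:]
        elif line.startswith("- depends on:") and filename in line:
            out.append(current)
    return out
```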

Scenario 3: Research task hallucinates findings

An agent reads 50 documents to synthesize a report. By document 40, it starts stating facts that don't appear in any document. Confidence increases as accuracy decreases.

This is the dangerous form of context rot. The model isn't just forgetting. It's generating plausible content to fill gaps in its compressed context. The user sees high-confidence assertions and assumes they're grounded in the source material.

Mitigation:

  • Extract and store key findings after each document
  • Require citation for every claim in final output
  • Cross-reference claims against source material before synthesis
  • Split research across multiple sessions for large corpora

Parameters: extraction_per_document=true, citation_required=true, claim_verification=true, max_documents_per_session=20
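Claim verification can start as a simple cross-reference pass. This toy version uses exact substring matching, where a production system would need semantic comparison, but the failure modes it catches are the same; the claim schema is an assumption:

```python
def verify_claims(claims: list[dict], sources: dict[str, str]) -> list[str]:
    """Return the text of claims that fail verification: missing citation,
    unknown source, or quoted support absent from the cited source."""
    failures = []
    for claim in claims:
        cite = claim.get("citation")
        quote = claim.get("quote", "")
        if not cite or cite not in sources or quote not in sources[cite]:
            failures.append(claim["text"])
    return failures
```

Anything in the failures list is a candidate hallucination: a high-confidence assertion with no grounding in the source material.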

Scenario 4: Debugging session chases ghosts

An agent traces an error through 20 log files and code paths. By log 15, it misremembers what the original error message was. It spends the next hour investigating a problem that doesn't exist.

The original error is still in context somewhere. But it's surrounded by so much diagnostic noise that the model can't reliably access it. Each new log file adds context that makes the original signal harder to find.

Mitigation:

  • Write error hypothesis to file after initial analysis
  • Re-inject original error at regular intervals
  • Set maximum debugging chain length before reset
  • Require explicit hypothesis validation at checkpoints

Parameters: hypothesis_file_required=true, error_reinjection_interval=10, max_debug_chain_length=15, checkpoint_validation_required=true
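The re-injection interval and chain-length cap can be tracked with a small state object. Thresholds match the parameters above; the return values are illustrative signals for whatever harness drives the agent:

```python
class DebugSession:
    """Track a debugging chain: re-inject the original error every
    `reinject_every` steps and force a reset past `max_chain` steps."""

    def __init__(self, original_error: str,
                 reinject_every: int = 10, max_chain: int = 15):
        self.original_error = original_error
        self.reinject_every = reinject_every
        self.max_chain = max_chain
        self.steps = 0

    def next_step(self) -> str:
        self.steps += 1
        if self.steps > self.max_chain:
            return "reset"  # chain too long; restart with a fresh summary
        if self.steps % self.reinject_every == 0:
            return f"reinject: {self.original_error}"
        return "continue"
```

Re-injecting the original error keeps the signal in recent context, where the model can still reliably find it amid the diagnostic noise.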

When to Compact: The Timing Question

Compaction reduces context by summarizing older messages and retaining only recent content. But timing matters. Compact at the wrong moment and you lose critical information. Fail to compact and context rot accumulates.

Good times to compact:

  • Task boundaries. The user signals they're moving to a new task. Earlier context is likely irrelevant. This is the cleanest compaction trigger.
  • After extracting results. The agent has consumed a large document and extracted key findings. The document content can be compressed, retaining only the findings.
  • Before consuming new large context. The agent is about to read a large file or make a long API call. Compacting first maximizes available space.
  • After completing a deliverable. The agent has finished a feature, report, or fix. The working context can be compressed to a completion summary.

Bad times to compact:

  • Mid-complex-refactor. The agent is in the middle of a multi-file change. Compacting now loses the thread connecting the changes.
  • Mid-debugging-chain. The agent is tracing an error through multiple files. Compacting breaks the causal chain.
  • During planning execution. The agent has a plan and is executing steps. Compacting may lose the plan or the progress tracking.
  • When constraints are active. The agent is operating under specific constraints that need to remain accessible.

LangChain's Deep Agents SDK now exposes a compaction tool directly to the agent, letting models trigger compression at opportune moments rather than at fixed thresholds. This is a shift from harness-controlled compaction to agent-controlled compaction. Early testing shows agents are conservative but choose reasonable moments when given this control.
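The good-time and bad-time rules above reduce to a small decision function. The flags and thresholds here are illustrative; a real harness would derive them from session state rather than take them as arguments:

```python
def should_compact(utilization: float, at_task_boundary: bool,
                   mid_refactor: bool, mid_debug: bool,
                   threshold: float = 0.80, force_at: float = 0.95) -> bool:
    """Compact at task boundaries once utilization crosses the threshold;
    never mid-refactor or mid-debug unless utilization forces the issue."""
    if utilization >= force_at:
        return True  # rot risk now outweighs losing the thread
    if mid_refactor or mid_debug:
        return False
    return at_task_boundary and utilization >= threshold
```

The `force_at` escape hatch encodes the tradeoff: past 95% utilization, even a badly timed compaction beats letting the window fill completely.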

Context Management Strategies

Knowing when to compact is half the battle. The other half is choosing the right strategy for your agent's workload. Here are four approaches, each with tradeoffs.

Strategy 1: Autonomous compression

Give the agent a tool to compress its own context. The agent decides when based on task awareness. LangChain's implementation retains the most recent 10% of messages and summarizes the rest.

Pros: Leverages agent's task understanding. Compacts at natural boundaries. Cons: Agent may not recognize critical context. Requires model capability.
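The retention mechanics can be sketched in a few lines. The 10% figure follows the LangChain default described above; the summarizer here is a stub standing in for an actual model call:

```python
def stub_summary(msgs: list[str]) -> str:
    """Placeholder for a real summarization call to the model."""
    return f"[summary of {len(msgs)} earlier messages]"

def compact_messages(messages: list[str], retain_fraction: float = 0.10,
                     summarize=stub_summary) -> list[str]:
    """Keep the most recent fraction of messages verbatim and replace
    everything older with a single summary message."""
    keep = max(1, int(len(messages) * retain_fraction))
    older, recent = messages[:-keep], messages[-keep:]
    if not older:
        return messages
    return [summarize(older)] + recent
```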

Strategy 2: Filesystem offloading

Offload large tool outputs, file contents, and intermediate results to the filesystem. Keep only references in context. When the agent needs details, it reads from file.

Pros: Preserves all information. Agent can retrieve on demand. Cons: Requires explicit read actions. Adds latency.
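A sketch of the offloading pattern, using a rough four-characters-per-token heuristic. The filename scheme, the 2000-token budget, and the reference format are assumptions:

```python
import hashlib
from pathlib import Path

def offload_output(workdir: Path, tool_name: str, output: str,
                   max_inline_tokens: int = 2000) -> str:
    """If a tool output exceeds the inline budget, write it to a file and
    return a short reference for the context; otherwise return it as-is."""
    approx_tokens = len(output) // 4  # crude chars-to-tokens estimate
    if approx_tokens <= max_inline_tokens:
        return output
    digest = hashlib.sha256(output.encode()).hexdigest()[:12]
    path = workdir / f"{tool_name}-{digest}.txt"
    path.write_text(output)
    return (f"[output of {tool_name}: ~{approx_tokens} tokens, "
            f"offloaded to {path.name}; read the file for details]")
```

The agent pays a read action and some latency when it needs the details back, but nothing is lost to compression.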

Strategy 3: Session splitting

For tasks that exceed threshold duration, split into multiple sessions. Each session starts fresh with relevant context injected from the previous session's summary.

Pros: Clean context slate. Prevents rot accumulation. Cons: Loses continuity. Requires good summarization.

Strategy 4: Progressive disclosure

Load minimal context at session start. Add context progressively as needed. Skills and MCP servers are loaded on demand rather than upfront. This works because large tool descriptions and system prompts consume context before the agent does any work. By deferring these loads until the agent actually needs them, you preserve headroom for actual task context.

Pros: Starts with maximum headroom. Reduces initial rot potential. Cons: Requires good demand prediction. May miss useful context if agent doesn't know what to request.

Monitoring Context Health in Production

Context health should be a first-class reliability metric alongside error rates and latency.

Metrics to track:

  • Context utilization percentage. Current tokens / max tokens. Alert at 90%.
  • Session duration. Time since session start. Alert at 45 minutes.
  • Error rate trend. Errors per turn over session lifetime. Alert if last-third rate is 2x first-third.
  • Compression events. Frequency and timing of compactions.
  • Post-compaction error rate. Whether compression introduces errors.

Alert thresholds:

  • Context utilization > 90%: warning
  • Context utilization > 95%: critical, force compaction or session split
  • Session duration > 45 minutes: warning
  • Error rate 2x baseline: review session for context rot
  • Error rate 3x baseline: critical, recommend session restart

Automated responses:

  • At 85% utilization: suggest compaction to user
  • At 90% utilization: trigger autonomous compaction if enabled
  • At 95% utilization: force compaction or pause for user input
  • After 3 consecutive errors: recommend session restart with summary
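The response ladder maps cleanly onto a single dispatch function. Threshold values follow the standard column from the parameter table; the action names are placeholders for whatever your harness actually does:

```python
def context_action(utilization: float, consecutive_errors: int) -> str:
    """Map context-health readings to an automated response.
    Checks run from most to least severe; first match wins."""
    if consecutive_errors >= 3:
        return "recommend_restart_with_summary"
    if utilization >= 0.95:
        return "force_compaction_or_pause"
    if utilization >= 0.90:
        return "autonomous_compaction"
    if utilization >= 0.85:
        return "suggest_compaction"
    return "ok"
```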

Context rot isn't a problem you solve once. It's a condition you manage continuously. The agents that stay reliable over long sessions are the ones that treat context as a scarce resource: they compact at task boundaries, offload large outputs to files, and reset sessions before rot accumulates. Monitor your context utilization. Set alert thresholds. And when the error rate doubles in the last third of a session, don't blame the model. Check the context.

Want the boring part handled?

Keepmyclaw gives OpenClaw operators encrypted backups, restore drills, and a faster path from "oh no" to "we're back". If this article sounds like your problem, stop whiteboarding it forever.