Context Engineering: Strategies & Failure Modes
A synthesis of research from Anthropic, LangChain, and Drew Breunig
The Core Problem
Context is Finite
Even with 1M+ token windows:
- Attention degrades with length (context rot)
- O(n²) self-attention complexity in transformers
- Training bias toward shorter sequences
- Cost scales linearly with tokens
Key Insight: Bigger windows ≠ better results. Smart curation > raw capacity.
The LeanSpec Connection
LeanSpec exists because:
- Context Economy - Specs must fit in working memory (human + AI)
- Signal-to-Noise - Every word must inform decisions
- Context failures happen when we violate these principles
This spec addresses how to maintain Context Economy programmatically.
Four Context Engineering Strategies
Based on LangChain's synthesis:
1. Partitioning (Write & Select)
What: Split content across multiple contexts with selective loading
LeanSpec Application:
```
# Instead of one 4,800-token spec:
specs/045/README.md           (~830 tokens - overview)
specs/045/DESIGN.md           (~1,500 tokens - design)
specs/045/IMPLEMENTATION.md   (~580 tokens - plan)
specs/045/TESTING.md          (~740 tokens - tests)

# AI loads only what it needs for current task
```
Mechanisms:
- Sub-spec files (spec 012 pattern)
- Lazy loading (read files on demand)
- Progressive disclosure (overview → details)
When to Use:
- ✅ Spec >3,500 tokens (warning threshold)
- ✅ Multiple distinct concerns (design + testing + config)
- ✅ Different concerns accessed independently
Benefits:
- ✅ Each file <3,500 tokens (fits in working memory)
- ✅ Less irrelevant context (only needed sections load, as sketched below)
- ✅ Parallel work (edit DESIGN without affecting TESTING)
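A minimal sketch of what selective loading could look like. The keyword index and `loadContextForTask` helper are hypothetical illustrations, not part of LeanSpec's actual API:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical index mapping task keywords to sub-spec files.
const SUB_SPEC_INDEX: Record<string, string> = {
  design: "DESIGN.md",
  architecture: "DESIGN.md",
  implement: "IMPLEMENTATION.md",
  plan: "IMPLEMENTATION.md",
  test: "TESTING.md",
};

// Load the overview plus only the sub-specs relevant to the task.
// Everything else stays out of the context window.
function loadContextForTask(specDir: string, task: string): string {
  const files = new Set<string>(["README.md"]);
  for (const [keyword, file] of Object.entries(SUB_SPEC_INDEX)) {
    if (task.toLowerCase().includes(keyword)) files.add(file);
  }
  return [...files]
    .map((name) => join(specDir, name))
    .filter(existsSync)
    .map((path) => readFileSync(path, "utf8"))
    .join("\n\n---\n\n");
}
```

With this, "Write tests for the dashboard" pulls in README.md plus TESTING.md (~1,570 tokens) instead of all 4,800.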
2. Compaction (Remove Redundancy)
What: Eliminate duplicate or inferable content
LeanSpec Application:
```
# Before compaction (verbose):

## Authentication
The authentication system uses JWT tokens. JWT tokens are
industry-standard and provide stateless authentication. The
benefit of JWT tokens is that they don't require server-side
session storage...

## Implementation
We'll implement JWT authentication. JWT was chosen because...
[repeats same rationale]

# After compaction (concise):

## Authentication
Uses JWT tokens (stateless, no session storage).

## Implementation
[links to Authentication section for rationale]
```
Mechanisms:
- Duplicate detection (same content in multiple places)
- Inference removal (obvious from context)
- Reference consolidation (one canonical source, others link)
When to Use:
- ✅ Repeated explanations across sections
- ✅ Obvious/inferable information stated explicitly
- ✅ "For completeness" sections with little decision value
Benefits:
- ✅ Fewer tokens = faster processing
- ✅ Less distraction = better attention
- ✅ Easier maintenance = single source of truth
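A sketch of naive duplicate detection via normalized paragraph hashing. Real redundancy detection would need fuzzier matching (e.g. embeddings), but exact-after-normalization catches the copy-paste cases:

```typescript
import { createHash } from "node:crypto";

// Normalize so near-identical restatements hash the same:
// lowercase, strip punctuation, collapse whitespace.
function fingerprint(paragraph: string): string {
  const normalized = paragraph
    .toLowerCase()
    .replace(/[^\w\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Report paragraphs appearing more than once in a spec:
// candidates for one canonical section plus links.
function findDuplicates(markdown: string): string[] {
  const seen = new Set<string>();
  const duplicates: string[] = [];
  for (const para of markdown.split(/\n\s*\n/)) {
    if (para.trim().length < 40) continue; // skip headings and stubs
    const key = fingerprint(para);
    if (seen.has(key)) duplicates.push(para.trim());
    else seen.add(key);
  }
  return duplicates;
}
```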
3. Compression (Summarize)
What: Condense while preserving essential information
LeanSpec Application:
```
# Before compression:

## Phase 1: Infrastructure Setup
Set up project structure:
- Create src/ directory
- Create tests/ directory
- Configure TypeScript with tsconfig.json
- Set up ESLint with .eslintrc
- Configure Prettier with .prettierrc
- Add npm scripts for build, test, lint
- Set up CI pipeline with GitHub Actions
[50 lines of detailed steps...]

# After compression (completed phase):

## ✅ Phase 1: Infrastructure Setup (Completed 2025-10-15)
Project structure established with TypeScript, testing, and CI.
See git commit abc123 for implementation details.
```
Mechanisms:
- Historical summarization (completed work → summary)
- Phase rollup (detailed steps → outcomes)
- Selective detail (keep decisions, summarize execution)
When to Use:
- ✅ Completed phases (outcomes matter, details don't)
- ✅ Historical context (need to know it happened, not how)
- ✅ Approaching line limits (preserve signal, reduce noise)
Benefits:
- ✅ Maintain project history without bloat
- ✅ Focus on active work, not past details
- ✅ Easy to expand if details needed later
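A sketch of phase rollup. It assumes completed phases are marked with `## ✅ Phase N:` headings, as in the example above; that marker convention is an assumption, not a LeanSpec requirement:

```typescript
// Roll each completed phase up to its heading, first line, and a
// pointer to git history; the detailed steps are dropped.
function compressCompletedPhases(markdown: string): string {
  // Heading, then body, up to the next H2 heading or end of input.
  const completedPhase = /(## ✅ Phase \d+:[^\n]*\n)([\s\S]*?)(?=\n## |$(?![\s\S]))/g;
  return markdown.replace(
    completedPhase,
    (_match: string, heading: string, body: string) => {
      const summary = body.trim().split("\n")[0] ?? "";
      return `${heading}${summary}\n(Details compressed; see git history.)\n`;
    },
  );
}
```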
4. Isolation (Move to Separate Context)
What: Split unrelated concerns into separate specs
LeanSpec Application:
```
# Before isolation (one spec):
specs/045-unified-dashboard/README.md
- Dashboard implementation
- Velocity tracking algorithm
- Health scoring system
- Chart library evaluation
- API design for metrics endpoint
[4,800 tokens covering 5 distinct concerns]

# After isolation (multiple specs):
specs/045-unified-dashboard/   # Dashboard UI
specs/060-velocity-algorithm/  # Velocity tracking
specs/061-health-scoring/      # Health metrics
specs/062-metrics-api/         # API endpoint
[Each spec <3,500 tokens, independent lifecycle]
```
Mechanisms:
- Concern extraction (identify unrelated topics)
- Dependency analysis (what must stay together?)
- Spec creation (move to new spec with cross-references)
When to Use:
- ✅ Multiple concerns with different lifecycles
- ✅ Sections could be standalone features
- ✅ Parts updated by different people/teams
- ✅ Spec still >3,500 tokens after partitioning
Benefits:
- ✅ Independent evolution (velocity algorithm changes ≠ dashboard changes)
- ✅ Clear ownership (different concerns, different owners)
- ✅ Easier review (focused scope per spec)
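One possible heuristic for concern extraction: score each `##` section by vocabulary overlap with the rest of the spec, and flag low-overlap sections as isolation candidates. Purely illustrative; real analysis would need more signal than term overlap:

```typescript
// Sections sharing few terms with the rest of the spec are likely
// unrelated concerns. The 15% shared-vocabulary threshold is arbitrary.
function isolationCandidates(markdown: string, threshold = 0.15): string[] {
  const sections = markdown.split(/\n(?=## )/);
  const termSets = sections.map(
    (s) => new Set(s.toLowerCase().match(/[a-z]{4,}/g) ?? []),
  );
  return sections
    .filter((_, i) => {
      const others = new Set(termSets.flatMap((t, j) => (j === i ? [] : [...t])));
      const shared = [...termSets[i]].filter((term) => others.has(term)).length;
      return shared / Math.max(termSets[i].size, 1) < threshold;
    })
    .map((section) => section.split("\n")[0]); // report headings only
}
```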
Four Context Failure Modes
Based on Drew Breunig's research:
1. Context Poisoning
Definition: Hallucinated or erroneous content makes it into context and gets repeatedly referenced
Symptoms in LeanSpec:
```
# AI hallucinates during edit:
"The authentication module uses Redis for session storage"
(Reality: We use JWT tokens, not Redis sessions)

# Hallucination gets saved to spec.
# Later, AI reads the spec and builds on the hallucination:
"Redis configuration should use cluster mode for HA"
(Building on the original error)

# Context is now poisoned - wrong info compounds
```
Detection:
- ✅ Validate references against codebase
- ✅ Check for internal contradictions
- ✅ Flag content not matching implementation
Mitigation:
- ✅ Programmatic validation (catch before save)
- ✅ Regular spec-code sync checks
- ✅ Remove corrupted sections immediately
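One cheap validation pass, sketched below under the assumption that specs reference repo files by relative path: any mentioned path that doesn't resolve is a typo or a hallucination:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Flag file paths a spec mentions that don't exist in the repo,
// so poisoned references are caught before the edit is saved.
function findBrokenPathReferences(markdown: string, repoRoot = "."): string[] {
  // Match path-like tokens, e.g. src/auth/jwt.ts or specs/045/README.md.
  const pathLike = /\b[\w.-]+(?:\/[\w.-]+)+\.\w+\b/g;
  const refs = markdown.match(pathLike) ?? [];
  return [...new Set(refs)].filter((ref) => !existsSync(join(repoRoot, ref)));
}
```

This only validates paths, not claims like "uses Redis"; those require spec-code sync checks against actual imports and config.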
2. Context Distraction
Definition: The context grows so long that the model over-focuses on it, neglecting what it learned during training
Symptoms in LeanSpec:
```
# Spec grows to 800+ lines with extensive history
# AI behavior changes:
- Repeats past actions from spec history
- Ignores training knowledge
- Suggests outdated approaches documented in spec
- Fails to synthesize new solutions

# Example: Gemini Pokémon agent
At >100k tokens: Repeated past moves instead of new strategy
(even though training knows better strategies)
```
Detection:
- ✅ Monitor spec token count (>3,500 = warning, >5,000 = error; sketched below)
- ✅ Track AI repetitive behavior
- ✅ Measure task completion degradation
Mitigation:
- ✅ Split at 3,500 tokens (Context Economy warning)
- ✅ Compress historical sections
- ✅ Partition by concern
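The token-count check is mechanical. The 4-characters-per-token ratio below is a common rough estimate; exact counts require the model's tokenizer:

```typescript
// Rough token estimate: ~4 characters per token for English prose.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

type ContextStatus = "ok" | "warning" | "error";

// Context Economy thresholds: warn at 3,500 tokens, error at 5,000.
function checkContextEconomy(spec: string): ContextStatus {
  const tokens = estimateTokens(spec);
  if (tokens > 5000) return "error";
  if (tokens > 3500) return "warning";
  return "ok";
}
```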
Research: Databricks found that degradation begins around 32k tokens for Llama 3.1 405B, and earlier for smaller models
3. Context Confusion
Definition: Superfluous content in the context steers the model toward wrong decisions
Symptoms in LeanSpec:
```
# Spec includes MCP tool definitions for 20 integrations
# (GitHub, Jira, Slack, Linear, Notion, Asana, ...)

# Task: "Update the GitHub issue status"

# AI behavior:
- Confused about which tool to use
- Sometimes calls wrong tool (Jira instead of GitHub)
- Slower processing (evaluating irrelevant options)
- Lower accuracy

# Berkeley Function-Calling Leaderboard confirms:
ALL models perform worse with >1 tool
```
Detection:
- ✅ Identify sections irrelevant to current task
- ✅ Track tool/reference usage patterns
- ✅ Measure decision accuracy vs context size
Mitigation:
- ✅ Remove irrelevant sections before AI processing
- ✅ Use selective loading (only relevant sub-specs)
- ✅ Clear separation of concerns
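A sketch of selective tool loading via naive keyword matching. The `ToolDef` shape is an illustrative assumption; production systems typically retrieve relevant tools with embeddings (RAG) rather than substring matches:

```typescript
interface ToolDef {
  name: string;
  description: string;
}

// Keep only tools whose name or description matches the task.
// Fewer options in context means fewer wrong-tool calls.
function selectRelevantTools(tools: ToolDef[], task: string): ToolDef[] {
  const words = task.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  return tools.filter((tool) =>
    words.some((w) => `${tool.name} ${tool.description}`.toLowerCase().includes(w)),
  );
}

// "Update the GitHub issue status" keeps the GitHub tool and drops
// Jira, Slack, Linear, Notion, Asana, and the rest.
```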
4. Context Clash
Definition: Conflicting information coexists within the same context
Symptoms in LeanSpec:
```
# Early in spec:
"We'll use PostgreSQL for data storage"

# Middle of spec (after discussion):
"Actually, MongoDB is better for this use case"

# Later in spec (forgot to update):
"PostgreSQL schema design: ..."

# AI sees conflicting info:
- Both PostgreSQL AND MongoDB mentioned
- Unclear which is current decision
- May mix approaches (SQL queries against MongoDB)
```
Detection:
- ✅ Scan for contradictory statements
- ✅ Check for outdated decisions not marked as superseded
- ✅ Validate consistency across sections
Mitigation:
- ✅ Single source of truth per decision
- ✅ Mark superseded decisions clearly
- ✅ Use compaction to remove outdated info
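A naive clash scan: if decision statements in one spec name more than one option from the same category of alternatives, flag it. The category lists here are illustrative stand-ins for real configuration:

```typescript
// Mutually exclusive options per decision category (illustrative).
const ALTERNATIVES: Record<string, string[]> = {
  database: ["PostgreSQL", "MongoDB", "MySQL", "SQLite"],
  auth: ["JWT", "session cookies", "OAuth"],
};

// Flag categories where decision-style sentences mention two or more
// competing options, as in the PostgreSQL/MongoDB example above.
function findDecisionClashes(markdown: string): string[] {
  const clashes: string[] = [];
  for (const [category, options] of Object.entries(ALTERNATIVES)) {
    const decided = options.filter((opt) =>
      new RegExp(`\\b(use|uses|using|chose|chosen)\\b[^.\\n]*\\b${opt}\\b`, "i").test(markdown),
    );
    if (decided.length > 1) {
      clashes.push(`${category}: conflicting decisions (${decided.join(" vs ")})`);
    }
  }
  return clashes;
}
```

Regex matching misses phrasings like "MongoDB is better here", so this is a first pass, not a guarantee.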
Research: A Microsoft/Salesforce paper showed a 39% average performance drop when information is gathered across multiple turns (early wrong answers stay in context and mislead later reasoning)
Strategy Selection Framework
Decision Matrix
| Situation | Primary Strategy | Secondary | Why |
|---|---|---|---|
| Spec >3,500 tokens, multiple concerns | Partition | Compaction | Separate concerns, remove redundancy in each |
| Spec verbose but single concern | Compaction | Compression | Remove redundancy, summarize if still too long |
| Historical phases bloating spec | Compression | - | Keep outcomes, drop details |
| Unrelated concerns in same spec | Isolation | Partition | Move to separate spec, then partition if needed |
| Spec approaching 3,500 tokens | Compaction | - | Proactive cleanup before hitting warning threshold |
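The matrix, expressed as a first-pass heuristic. The `SpecStats` fields and thresholds are illustrative assumptions, and the concern count is a crude proxy for "unrelated":

```typescript
type Strategy = "partition" | "compaction" | "compression" | "isolation";

interface SpecStats {
  tokens: number;
  distinctConcerns: number; // unrelated top-level topics detected
  completedPhases: number;
  redundancyRatio: number;  // duplicated lines / total lines
}

function pickStrategy(s: SpecStats): Strategy {
  if (s.tokens > 3500 && s.distinctConcerns > 1) {
    // Many loosely related concerns: separate specs.
    // A few related concerns: sub-spec files.
    return s.distinctConcerns > 3 ? "isolation" : "partition";
  }
  if (s.redundancyRatio > 0.2) return "compaction"; // verbose single concern
  if (s.completedPhases > 0) return "compression";  // history bloat
  return "compaction"; // proactive cleanup near the threshold
}
```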
Combining Strategies
Often multiple strategies apply:
Example: Spec 045 (4,800 tokens):
- Partition: Split into README + DESIGN + IMPLEMENTATION + TESTING (primary)
- Compaction: Remove redundancy within each file (secondary)
- Compression: Summarize research phase (already complete)
- Isolation: Consider moving velocity algorithm to separate spec (future)
Result:
- Before: 4,800 tokens (approaching 5K limit)
- After: Largest file ~1,500 tokens (well within limits)
Implementation Priorities
High Priority (v0.3.0)
- ✅ Partition (most common need)
- ✅ Compaction (easy wins)
- ✅ Failure detection (prevent problems)
Medium Priority (v0.4.0)
- ✅ Compression (useful but more nuanced)
- ✅ Isolation (requires deeper analysis)
Low Priority (v0.5.0)
- ✅ Automatic strategy selection
- ✅ Continuous monitoring/auto-compaction
- ✅ AI-powered conflict resolution
Measuring Success
Quantitative Metrics
Partition effectiveness:
- Spec count with >5,000 tokens: Target 0
- Spec count with >3,500 tokens: Target <10%
- Average spec size: Target <2,000 tokens
- Largest sub-spec file: Target <3,500 tokens
Compaction effectiveness:
- Redundancy ratio: lines removed / total lines
- Target: 20-30% reduction for verbose specs
Failure prevention:
- Context poisoning incidents: Target 0/month
- Context distraction reports: Target 0/month
- Context confusion: AI wrong tool selection <1%
- Context clash: Contradictions detected before commit
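A sketch of computing the partition metrics for a specs directory, reusing the rough 4-characters-per-token estimate from earlier:

```typescript
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

// Walk specs/, estimate tokens per markdown file, and report the
// partition-effectiveness numbers against their targets.
function partitionMetrics(specsDir: string) {
  const sizes: number[] = [];
  const walk = (dir: string): void => {
    for (const entry of readdirSync(dir)) {
      const full = join(dir, entry);
      if (statSync(full).isDirectory()) walk(full);
      else if (entry.endsWith(".md")) {
        sizes.push(Math.ceil(readFileSync(full, "utf8").length / 4));
      }
    }
  };
  walk(specsDir);
  const n = Math.max(sizes.length, 1);
  return {
    over5000: sizes.filter((t) => t > 5000).length,          // target: 0
    over3500Share: sizes.filter((t) => t > 3500).length / n, // target: <10%
    averageTokens: sizes.reduce((sum, t) => sum + t, 0) / n, // target: <2,000
    largestFile: sizes.length ? Math.max(...sizes) : 0,      // target: <3,500
  };
}
```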
Qualitative Measures
Developer experience:
- "Splitting specs is now instant"
- "No more AI corruption during edits"
- "Specs stay clean automatically"
AI agent effectiveness:
- Fewer errors on large specs
- Faster task completion
- Better decision quality
Related Research
Key Papers & Articles
- Anthropic: Effective Context Engineering for AI Agents
- Context as finite resource
- Compaction, structured note-taking, sub-agents
- Claude Code auto-compact at 95% window
- LangChain: Context Engineering for Agents
- Four strategies: Write, Select, Compress, Isolate
- LangGraph state management patterns
- Tool selection via RAG
- Drew Breunig: How Contexts Fail and How to Fix Them
- Four failure modes with evidence
- Berkeley Function-Calling Leaderboard insights
- Microsoft/Salesforce sharded prompts research
Application to LeanSpec
Core insight: LeanSpec is a context engineering methodology for human-AI collaboration on software specs.
Evolution:
- v0.1.0: Manual context management (write good specs)
- v0.2.0: Detection (validate specs, warn at limits)
- v0.3.0: Programmatic transformation (this spec)
- v0.4.0: Continuous management (auto-optimization)
Remember: Context engineering is the #1 job when building with AI. These aren't just optimization techniques—they're fundamental to making AI-assisted spec management work.