Durable State for Long-Running LLM Coding Sessions

A long-running LLM coding session usually fails in a predictable, boring way: the context window fills up with operational residue before the implementation is finished.

Situation

Most LLM coding workflows treat the context window as both an execution environment and a system of record. That is fine for small, isolated edits. However, as agentic coding shifts toward multi-phase, architectural changes, the session needs to retain memory of decisions, progress, and recovery instructions over a much longer horizon.

The root cause of collapse is architectural. Large changes create more than one kind of state, and each kind ages differently:

State class	Example
Repository understanding	Entry points, call graphs, config surface
Decisions	Positional args vs required options
Execution progress	Phase 1 done, Phase 2 partial
Recovery instructions	What to do after reset

The Problem

The failure signature is usually dull rather than dramatic. The session starts repeating conclusions it already reached, requires more prompting to stay on task, and spends tokens re-explaining the repository back to itself. This happens because token pressure compounds even when work is progressing: the session retains old hypotheses, rejected decisions, and raw tool output alongside the actual implementation state. The model keeps paying rent on old reasoning. Eventually, the operator faces a bad tradeoff: keep the context and risk degradation, or clear it and lose the implementation thread.

The checkpoint needs to preserve only the state that would be expensive to rediscover:

Persist this	Do not persist this
Locked decisions	Full reasoning transcript
Phase status	Every exploratory dead end
Remaining risks	Raw tool output
Exact resume point	Verbose prose summaries
Files/modules to re-read	Ephemeral conversational phrasing
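
The "persist this" column can be read as a small record type. The sketch below is illustrative only: the `Checkpoint` class, its field names, and the `is_resumable` check are assumptions made for this example, not something the pattern prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical record holding only the state worth persisting."""

    locked_decisions: list[str] = field(default_factory=list)
    phase_status: dict[str, bool] = field(default_factory=dict)  # phase name -> done?
    remaining_risks: list[str] = field(default_factory=list)
    resume_point: str = ""
    files_to_reread: list[str] = field(default_factory=list)

    def is_resumable(self) -> bool:
        """Useful only if it names a restart point and has pending work."""
        has_pending = any(not done for done in self.phase_status.values())
        return bool(self.resume_point) and has_pending


checkpoint = Checkpoint(
    locked_decisions=["required flags, not positional"],
    phase_status={
        "Phase 1: parser changes": True,
        "Phase 2: matcher abstraction": False,
    },
    resume_point="Start at Phase 2",
    files_to_reread=["parser.ts", "parser.test.ts"],
)
```

Nothing from the right-hand column (transcripts, dead ends, raw tool output) has a field here, which is the point: if a piece of state has no slot, it does not get persisted.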

How can an LLM session maintain durable state across a large implementation without collapsing under its own context weight?

Core Concept

The durable-state pattern separates planning from execution, externalizing execution state before the context window becomes the bottleneck.

Problem	Default LLM workflow	Durable-state workflow
Planning for multi-phase changes	Lives inside one context window	Written to external state
Ambiguity handling	Mixed into implementation	Resolved first as explicit unanswered questions
Token pressure	Grows monotonically	Reset between phases
Session interruption	Often loses momentum	Resume with `claude --continue`
Cross-session continuity	Weak	Restore from GitHub issue
Main failure mode	Context collapse	State drift between model view and filesystem

1. Use the LLM for exploration and planning.
2. Force it to emit unresolved questions first.
3. Convert the result into a compact multi-phase checklist.
4. Persist that checklist outside the context window (e.g., as a GitHub issue).
5. Rehydrate the next session from that external state.

flowchart TD
    Engineer["Engineer"] -->|"Start in plan mode"| AgentA["Agent Session A"]
    AgentA -->|"Explore codebase"| Repo["Repository"]
    AgentA -->|"Return unresolved questions"| Engineer
    Engineer -->|"Provide answers"| AgentA
    AgentA -->|"Generate multi-phase plan"| Engineer
    Engineer -->|"Execute Phase 1"| AgentA
    AgentA -->|"Patch files"| Repo
    Engineer -->|"Execute Phase 2"| AgentA
    AgentA -->|"Create checkpoint issue"| GH["GitHub Issue"]
    Engineer -->|"Start fresh session"| AgentB["Agent Session B"]
    AgentB -->|"Read checkpoint issue"| GH
    AgentB -->|"Re-read relevant files"| Repo
    AgentB -->|"Resume at next pending phase"| Engineer

In Practice

In practice, the pattern comes down to keeping planning separate from execution. The motivation is that instruction adherence tends to degrade as the context window fills with token-heavy tool output.

1. Start in plan mode, not patch mode

Force the agent to surface uncertainties before it commits to an implementation path. Ambiguity is cheap to resolve during planning but expensive once a half-finished patch set exists.

Example operator sequence for planning:

claude
# instruct agent:
# - explore relevant files
# - stay concise
# - list unresolved questions first
# - do not implement yet

2. Compress the plan aggressively

Compression reduces the token footprint while preserving operational meaning. “Strict by default, fuzzy flag optional” is compressed and still useful. “Matching done” is compressed but operationally useless, because it does not say what was decided.

Example plan format:

Phase 1
- add parser opts
- validate mutually exclusive flags
- unit tests happy path

Phase 2
- strict/fuzzy matcher abstraction
- wire config
- test edge cases
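
A format this regular is easy to read back mechanically. The following is a minimal sketch, assuming the plan uses exactly the `Phase N` heading and `- item` shape shown above; `parse_plan` is a hypothetical helper name.

```python
import re


def parse_plan(text: str) -> dict[str, list[str]]:
    """Parse 'Phase N' headings followed by '- item' lines into a dict."""
    phases: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"Phase \d+", line):
            current = line
            phases[current] = []
        elif line.startswith("- ") and current is not None:
            phases[current].append(line[2:].strip())
    return phases


PLAN = """\
Phase 1
- add parser opts
- validate mutually exclusive flags
- unit tests happy path

Phase 2
- strict/fuzzy matcher abstraction
- wire config
- test edge cases
"""

plan = parse_plan(PLAN)
```

Lines that match neither shape are ignored, so stray prose in the plan is dropped rather than being misread as a task.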

3. Execute in bounded phases

Phases are bounded units that keep the live context focused on the current step. Checkpoint before the session feels degraded, not after: once the context is obviously degraded, the checkpoint it produces may already be low quality.

for phase in plan.phases:
    implement(phase)
    inspect(diff)
    commit_or_iterate()
    if context_pressure_high:
        persist_state()
        clear_context()
        resume_from_external_state()
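
The loop above leaves `context_pressure_high` undefined. One possible reading, sketched below, is an estimated token count compared against a fraction of the window. The `Session` class, the per-phase token costs, the 200,000-token budget, and the 0.6 threshold are all made-up values for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Stand-in for a live agent session; it only tracks estimated tokens."""

    budget: int
    used: int = 0
    log: list[str] = field(default_factory=list)

    def pressure(self) -> float:
        return self.used / self.budget


def run_phases(phases: dict[str, int], budget: int, threshold: float = 0.6) -> list[str]:
    """Run each phase, checkpointing and resetting once pressure crosses threshold.

    `phases` maps a phase name to an assumed token cost for that phase.
    """
    session = Session(budget=budget)
    for name, cost in phases.items():
        session.used += cost  # stands in for implement(phase) and inspect(diff)
        session.log.append(f"done:{name}")
        if session.pressure() >= threshold:
            session.log.append(f"checkpoint-after:{name}")  # persist_state()
            session.used = 0  # clear_context(), then rehydrate from the checkpoint
    return session.log


events = run_phases(
    {"Phase 1": 50_000, "Phase 2": 80_000, "Phase 3": 30_000},
    budget=200_000,
)
```

With these numbers the checkpoint fires after Phase 2, at 65% of the assumed budget, so Phase 3 starts from an empty context rather than inheriting 130,000 tokens of residue.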

4. Persist execution state before the reset

The GitHub CLI (`gh issue create`) works as a low-friction state store. The issue becomes the working-memory checkpoint: it captures what is done, which decisions should not be reopened casually, the remaining risks, and exact resume instructions.

GitHub issues work well here for practical reasons:

- They are already part of the engineering workflow.
- They are durable and searchable.
- They are reviewable by humans.
- They are easy to create from the command line.
- They are stable across terminal resets and model restarts.

gh issue create \
  --title "LLM execution checkpoint: CLI refactor" \
  --body "$(cat plan-status.md)"
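
If the checkpoint is created from a script rather than typed by hand, the same command can be assembled as an argument list. This is a sketch under assumptions: the helper names are hypothetical, it uses `gh`'s `--body-file` option instead of the `$(cat ...)` substitution, and it requires an installed, authenticated `gh` to actually run.

```python
import subprocess
from pathlib import Path


def build_checkpoint_command(title: str, body_file: Path) -> list[str]:
    """Return the argv for `gh issue create`, reading the body from a file."""
    return ["gh", "issue", "create", "--title", title, "--body-file", str(body_file)]


def create_checkpoint(title: str, body_file: Path) -> str:
    """Run the command and return whatever gh prints on stdout."""
    result = subprocess.run(
        build_checkpoint_command(title, body_file),
        check=True,  # raise if gh exits non-zero
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


argv = build_checkpoint_command(
    "LLM execution checkpoint: CLI refactor",
    Path("plan-status.md"),
)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the title or file path contains spaces or quotes.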

Recommended body shape:

## Current status
- [x] Phase 1: parser changes
- [ ] Phase 2: matcher abstraction

## Decisions locked
- required flags, not positional

## Resume instruction
Start at Phase 2. Re-read parser module and tests before editing matcher code.
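
Generating the body from structured data is one way to keep its shape identical from checkpoint to checkpoint. A minimal sketch, assuming the three sections shown above; `render_checkpoint` and its parameters are hypothetical names.

```python
def render_checkpoint(
    phases: list[tuple[str, bool]],
    decisions: list[str],
    resume: str,
) -> str:
    """Render the three fixed sections as Markdown for the issue body."""
    lines = ["## Current status"]
    lines += [f"- [{'x' if done else ' '}] {name}" for name, done in phases]
    lines += ["", "## Decisions locked"]
    lines += [f"- {decision}" for decision in decisions]
    lines += ["", "## Resume instruction", resume]
    return "\n".join(lines) + "\n"


body = render_checkpoint(
    phases=[
        ("Phase 1: parser changes", True),
        ("Phase 2: matcher abstraction", False),
    ],
    decisions=["required flags, not positional"],
    resume="Start at Phase 2. Re-read parser module and tests before editing matcher code.",
)
```

The rendered text could be written to `plan-status.md` and passed to `gh issue create` unchanged.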

5. Clear context and rehydrate cleanly

Clearing the session and fetching the GitHub issue in a fresh prompt resets the context to a low baseline. Because the checkpoint is an ordinary issue, it also ties agent execution to normal engineering review habits.

# Session A
claude
# ... plan, implement, checkpoint to GitHub issue ...

# clear session

# Session B
claude
# instruct agent:
# fetch issue 24
# rebuild working context from issue
# continue at next unchecked phase
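
The "continue at next unchecked phase" step can also be computed outside the model. The sketch below assumes the checkpoint uses `- [ ]` and `- [x]` checklist items and that `gh issue view <number> --json body` is available; the function names and the prompt wording are illustrative.

```python
import json
import re
import subprocess
from typing import Optional


def fetch_issue_body(number: int) -> str:
    """Fetch an issue body with the GitHub CLI (needs an authenticated gh)."""
    out = subprocess.run(
        ["gh", "issue", "view", str(number), "--json", "body"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)["body"]


def next_unchecked(body: str) -> Optional[str]:
    """Return the first '- [ ]' checklist item, or None if all are checked."""
    match = re.search(r"^- \[ \] (.+)$", body, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def rehydration_prompt(number: int, body: str) -> str:
    """Build the first message for the fresh session from the checkpoint text."""
    phase = next_unchecked(body) or "nothing pending; confirm with the engineer"
    return (
        f"Checkpoint from issue {number}:\n{body}\n"
        "Rebuild working context from the checkpoint only.\n"
        f"Continue at: {phase}"
    )
```

A call such as `rehydration_prompt(24, fetch_issue_body(24))` would produce the opening message for Session B, with the resume point chosen by the checklist rather than by the model's reading of it.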

6. Resynchronize the filesystem deliberately

The checkpoint records decisions, not file contents. If an operator runs a formatter or edits a file out-of-band, the agent’s prior mental model of that file is stale. An explicit refresh step forces the agent to re-read specific modules before executing the next phase.

Read issue 24.
Re-read parser.ts and parser.test.ts.
Assume any earlier mental model is stale.
Continue at Phase 2 only after confirming current file state.
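
"Confirming current file state" can be made mechanical. One option, which is an assumption of this sketch and not part of the pattern as described, is to record a content hash per file at checkpoint time and compare on resume. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(paths: list[str]) -> dict[str, str]:
    """Map each path to the SHA-256 of its current contents."""
    return {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def stale_files(recorded: dict[str, str]) -> list[str]:
    """Return paths whose contents changed, or vanished, since the snapshot."""
    changed = []
    for path, digest in recorded.items():
        file = Path(path)
        if not file.exists():
            changed.append(path)
        elif hashlib.sha256(file.read_bytes()).hexdigest() != digest:
            changed.append(path)
    return changed
```

Session A would call `snapshot(["parser.ts", "parser.test.ts"])` when writing the checkpoint; Session B would call `stale_files(...)` on the stored mapping and re-read anything it returns. Re-reading everything unconditionally, as the prompt above does, remains the simpler and safer default.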

7. Keep planning prompts and execution prompts structurally different

Mode confusion occurs when planning and execution prompts sound similar. A planning prompt asks for unresolved questions first; an execution prompt asks for a bounded diff against an existing plan.
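
One way to enforce the structural difference is to build the two prompts from separate templates that share no wording. In this sketch the `MODE:` header line and both function names are assumptions chosen for illustration.

```python
def planning_prompt(task: str) -> str:
    """Planning mode: questions first, no edits."""
    return "\n".join([
        "MODE: PLAN",
        f"Task: {task}",
        "Explore the relevant files. Stay concise.",
        "List unresolved questions first.",
        "Do not implement yet.",
    ])


def execution_prompt(phase: str, plan_ref: str) -> str:
    """Execution mode: one phase, bounded diff, decisions stay locked."""
    return "\n".join([
        "MODE: EXECUTE",
        f"Plan: {plan_ref}",
        f"Implement only: {phase}",
        "Produce a bounded diff. Do not reopen locked decisions.",
    ])


plan_msg = planning_prompt("CLI refactor")
exec_msg = execution_prompt("Phase 2: matcher abstraction", "issue 24")
```

Because the execution template takes a phase and a plan reference as required arguments, it cannot be sent before a plan exists.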

Where It Breaks

Scenario	Failure Mode	Mitigation
Context collapse without checkpoints	Session becomes slower and noisier over time	Persist execution state before degradation
State drift from out-of-band edits	Agent patches code against a stale mental model	Explicitly instruct agent to re-read files upon resume
Mode confusion	Agent continues planning during execution	Keep planning and execution prompts structurally different
Rapid parallel human edits	Repository changes invalidate the checkpoint	Ensure the checkpoint locks specific, stable decisions
Summary drift	Each new session interprets the checkpoint differently	Make the checkpoint format stricter and operationally specific
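
The mitigation for summary drift, a stricter checkpoint format, can be checked before the issue is created. A minimal sketch, assuming the three section headings from the recommended body shape; `validate_checkpoint` is a hypothetical helper.

```python
REQUIRED_SECTIONS = (
    "## Current status",
    "## Decisions locked",
    "## Resume instruction",
)


def validate_checkpoint(body: str) -> list[str]:
    """Return a list of problems; an empty list means the body passes."""
    lines = [line.strip() for line in body.splitlines()]
    problems = [
        f"missing section: {section}"
        for section in REQUIRED_SECTIONS
        if section not in lines
    ]
    if not any(line.startswith(("- [ ]", "- [x]")) for line in lines):
        problems.append("no checklist items found")
    return problems
```

Running this on the checkpoint text before calling `gh issue create` rejects a body that has drifted into free-form prose, which is the case where a later session is most likely to interpret it differently.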

What to Do Next

Problem: Long-running LLM coding sessions fail due to context collapse and state drift.
Solution: Separate planning from execution and externalize multi-phase checklists into GitHub issues.
Proof: Clearing context and rehydrating from a compact external checkpoint resets the session to a low-token baseline, which limits the instruction degradation that builds up as tool output accumulates.
Action: Adopt a lightweight GitHub issue template with fixed sections for completion state, locked decisions, open risks, and exact resume instructions to make cross-session recovery reliable.

Rajiv
