Problem

A long-running LLM coding session usually fails in a boring way.

Not with a model error. Not with a bad patch. It fails because working memory fills up before the implementation is done.

That failure shows up earlier than most operators expect. In the workflow described here, repository exploration and planning consumed about 33,000 tokens before any code was written. One implementation phase added another roughly 3,000 tokens. Clearing the context dropped the session back to a roughly 16,000-token baseline made up of system prompt, memory files, and tool overhead.

That framing matters because it changes the problem definition. The hard part is often not code generation. It is state management.

Most LLM coding workflows treat the context window as both execution environment and system of record. That is fine for small edits. It breaks for multi-phase work where the session needs to remember architecture, decisions, progress, and recovery instructions over a longer horizon.

The failure signature is usually dull rather than dramatic. The session starts repeating conclusions it already reached. It needs more prompting to stay on task. It spends tokens re-explaining the repository back to itself. Diff quality may still look acceptable in isolated moments, but the cost to maintain forward momentum keeps rising. By the time the operator notices, the session is already paying a large tax just to preserve continuity.

The Short Version

| Problem | Default LLM workflow | Durable-state workflow |
|---|---|---|
| Planning for multi-phase changes | Lives inside one context window | Written to external state |
| Ambiguity handling | Mixed into implementation | Resolved first as explicit unanswered questions |
| Token pressure | Grows monotonically | Reset between phases |
| Session interruption | Often loses momentum | Resume with claude continue |
| Cross-session continuity | Weak | Restore from GitHub issue |
| Main failure mode | Context collapse | State drift between model view and filesystem |

The core design is simple:

  1. Use the LLM for planning.
  2. Force it to emit unresolved questions first.
  3. Convert the result into a compact multi-phase checklist.
  4. Persist that checklist outside the context window.
  5. Rehydrate the next session from that external state.

GitHub issues are enough for the persistence layer in this pattern because they are durable, searchable, reviewable, and already inside the engineering toolchain. The write path is gh issue create. The read path is a later fetch of that same issue.

Architecture

The root cause is architectural. Large changes create more than one kind of state, and each kind ages differently.

| State class | Example |
|---|---|
| Repository understanding | Entry points, call graphs, config surface |
| Decisions | Positional args vs required options |
| Execution progress | Phase 1 done, Phase 2 partial |
| Recovery instructions | What to do after reset |

When all four live only in prompt history, token usage grows even when no new information is added. The model keeps paying rent on old reasoning. Eventually the operator faces a bad tradeoff: keep the context and risk degradation, or clear it and lose the implementation thread.

The durable-state pattern separates planning from execution, then externalizes execution state before the context window becomes the bottleneck.

Why token pressure compounds even when work is progressing

The pressure is not just “more words over time.” It comes from a specific operational shape:

  1. The model explores the repository and builds a tentative map.
  2. The operator answers ambiguities.
  3. The model revises the plan.
  4. The model edits files and inspects diffs.
  5. The operator asks for fixes, clarifications, or narrower follow-up edits.

Each step leaves residue. The session now contains old hypotheses, updated hypotheses, accepted decisions, rejected decisions, tool output, and prompts that were only useful at a specific moment. Prompt history becomes a mixture of durable state and expired state, but the model still has to consume both.

That is why large sessions degrade earlier than people expect. The total token count is not just implementation state. It is implementation state plus the archaeology of how that state was reached.

What must be externalized

The checkpoint needs to preserve only the state that would be expensive to rediscover. That usually means:

| Persist this | Do not persist this |
|---|---|
| Locked decisions | Full reasoning transcript |
| Phase status | Every exploratory dead end |
| Remaining risks | Raw tool output |
| Exact resume point | Verbose prose summaries |
| Files/modules to re-read | Ephemeral conversational phrasing |

This distinction matters. If the checkpoint becomes a mini transcript, it recreates the same problem outside the context window. A good checkpoint is not a diary. It is a compressed control surface for the next session.
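One way to keep the checkpoint on the left side of that table is to force it through a fixed shape before writing it anywhere. A minimal sketch, with hypothetical field names chosen to mirror the table above:

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """Only state that is expensive to rediscover; never a transcript."""
    decisions: list[str]            # locked decisions, not to be reopened casually
    phase_status: dict[str, bool]   # phase name -> done?
    risks: list[str]                # what could still invalidate the plan
    resume_point: str               # the exact next action
    reread_files: list[str] = field(default_factory=list)  # files to reload first

cp = Checkpoint(
    decisions=["strict matching by default"],
    phase_status={"Phase 1": True, "Phase 2": False},
    risks=["existing scripts may rely on old arg order"],
    resume_point="Start at Phase 2",
    reread_files=["parser.ts"],
)
```

Anything that does not fit one of those fields is, by construction, transcript material and stays out of the checkpoint.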

Why GitHub issues are enough, but not magical

GitHub issues work well here for boring reasons:

  • They are already part of the engineering workflow.
  • They are durable and searchable.
  • They are reviewable by humans.
  • They are easy to create from the command line.
  • They are stable across terminal resets and model restarts.

That said, they are still just text. They do not enforce schema unless you impose one. They do not know whether a box was checked honestly. They do not reflect current filesystem reality unless the operator refreshes that state. So “GitHub issue as memory” is not an automation system. It is a disciplined habit backed by durable storage.
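Since issues enforce no schema, the operator can impose one with a few lines of checking before the issue is created. A sketch; the section names are an assumption matching the checkpoint format recommended later in this piece:

```python
REQUIRED_SECTIONS = [
    "## Current status",
    "## Decisions locked",
    "## Remaining risks",
    "## Resume instruction",
]

def validate_checkpoint(body: str) -> list[str]:
    """Return the required sections missing from an issue body.

    An empty result means the body is structurally complete; it says
    nothing about whether the boxes were checked honestly."""
    return [s for s in REQUIRED_SECTIONS if s not in body]
```

Running this before gh issue create turns "disciplined habit" into a cheap mechanical gate.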

sequenceDiagram
    participant U as Engineer
    participant A as Agent Session A
    participant FS as Repository
    participant GH as GitHub Issue
    participant B as Agent Session B

    U->>A: Start in plan mode
    A->>FS: Explore codebase
    A-->>U: Return unresolved questions
    U-->>A: Provide answers
    A-->>U: Generate multi-phase plan

    U->>A: Execute Phase 1
    A->>FS: Patch files
    U->>A: Execute Phase 2
    A->>FS: Patch files

    Note over A: Context pressure rises

    A->>GH: Create checkpoint issue with status
    Note over A,B: Context cleared

    U->>B: Start fresh session
    B->>GH: Read checkpoint issue
    B->>FS: Re-read relevant files
    B-->>U: Resume at next pending phase

What I Tested

I tested the workflow as an operational loop rather than as a prompt trick.

The loop is:

  1. Start in plan mode instead of patch mode.
  2. Force unresolved questions before implementation.
  3. Compress the plan into short operational bullets.
  4. Execute one bounded phase at a time.
  5. Persist progress into a GitHub issue before reset.
  6. Clear context.
  7. Start a fresh session and rehydrate from the issue plus a targeted file re-read.

The important constraint is that each phase should be small enough that its implementation does not need the full planning history in memory.

I also tested a sharper distinction between three moments that are often collapsed together in ad hoc agent use:

  1. Exploration: understand the repository and identify ambiguities.
  2. Commitment: lock decisions and translate them into a short execution plan.
  3. Execution: patch code against one bounded phase at a time.

That separation matters because each moment benefits from a different prompt shape. Exploration should bias toward uncertainty discovery. Commitment should bias toward compression and decision capture. Execution should bias toward file-local correctness and diff inspection.

Example operator sequence for planning:

claude
# instruct agent:
# - explore relevant files
# - stay concise
# - list unresolved questions first
# - do not implement yet

Expected output shape:

Unresolved questions
1. Positional args or required flags
2. Strict match or fuzzy match
3. Backward compatibility for existing CLI calls
4. Error contract for invalid inputs

Only after those questions are answered should the agent generate a phased plan.

In practice, the phase boundary should be chosen by dependency shape, not by rough equal size. A good phase has:

  • One main objective.
  • A limited file surface.
  • A clear done condition.
  • A small enough blast radius that the diff can be inspected quickly.

That usually means “parser and validation,” “matching strategy,” and “docs and cleanup” are good phases, while “finish refactor” is not.
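Those properties can be encoded as a quick check on the phase itself. A sketch with hypothetical names; the max_files threshold is an arbitrary assumption standing in for "limited file surface":

```python
from dataclasses import dataclass

@dataclass
class Phase:
    objective: str     # exactly one main objective
    files: list[str]   # the file surface this phase may touch
    done_when: str     # explicit done condition

    def is_bounded(self, max_files: int = 5) -> bool:
        """Reject phases whose blast radius is too large to inspect quickly."""
        has_shape = bool(self.objective and self.done_when)
        return has_shape and 0 < len(self.files) <= max_files
```

"parser and validation" with two files and a test-based done condition passes; "finish refactor" with a dozen files and no done condition does not.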

I also tested whether the persistence layer needed to be more sophisticated than GitHub issues. For this pattern, it did not. A separate notes file would also work, but issues have two practical advantages: they already sit in the repo’s operational perimeter, and they are visible to humans who may need to review or extend the work later.

What Failed

The first failure mode is the one this pattern is designed to avoid: context collapse.

Without an external checkpoint, repository exploration, planning notes, prior diffs, and half-resolved ambiguities all compete for the same window. The result is a session that becomes slower, noisier, and more brittle over time even if the implementation work is proceeding normally.

The second failure mode is state drift.

If the engineer edits files out of band in the IDE, runs a formatter, or changes the branch state while the agent is paused, the model’s internal view is stale. At that point the agent is reasoning about a repository that no longer exists. The fix is simple but non-optional: explicitly tell the model to re-read the relevant files before it continues.

That failure is worse than simple forgetfulness because it can look like competence. The agent may produce a syntactically clean patch built against a mental model of the repository that no longer matches reality. The operator sees output, not necessarily alignment. That is why resume instructions should name concrete files or modules to reload, not just say “continue from Phase 2.”

The third failure mode is mode confusion.

An agent that was heavily instructed to plan can sometimes keep behaving like it is still in planning mode after execution begins. That is recoverable, but it proves a larger operational point: planning and patching need different prompt shapes. The operator should not assume the model will infer the transition cleanly.

This pattern also degrades when:

  • The task is small enough to finish in one bounded session.
  • The codebase is too sensitive to allow broad execution autonomy.
  • The work product cannot be reduced to a checklist with stable decisions.
  • The repository changes rapidly from parallel human edits.

In those cases, externalizing state still helps, but a GitHub issue alone may not be sufficient as the recovery mechanism.

There is also a subtler failure mode: summary drift. If the checkpoint is written too loosely, each new session interprets it slightly differently. One agent reads “clean up matcher” as naming cleanup. Another reads it as a structural refactor. A third assumes tests are implied; a fourth does not. Durable state only works if the stored state is operationally specific.

What Worked

1) Start in plan mode, not patch mode

Do not begin by asking for edits. Begin by asking for exploration and unanswered questions.

That order matters because ambiguity is cheap to resolve during planning and expensive to resolve after a half-finished patch set exists. A good planning pass should force the agent to surface uncertainties before it commits to an implementation path.

2) Compress the plan aggressively

One useful memory rule is to sacrifice grammar for concision.

That is not cosmetic. It reduces token footprint while preserving operational meaning. In this workflow, compact plans stay readable and cost less to keep around.

Example plan format:

Phase 1
- add parser opts
- validate mutually exclusive flags
- unit tests happy path

Phase 2
- strict/fuzzy matcher abstraction
- wire config
- test edge cases

Phase 3
- docs
- migration notes
- cleanup names

This is the right abstraction level. Not prose. Not a full design document. Just enough state to restart work without re-deriving the architecture.

Compression works only if the missing information is truly recoverable. The operator should compress syntax, not semantics. Remove filler words, not decisions. “strict by default, fuzzy flag optional” is compressed and useful. “matching done” is shorter but operationally useless.

3) Execute in bounded phases

Once the plan exists, switch to execution. Work one phase at a time, inspect the diff, and either commit or iterate before moving on.

for phase in plan.phases:
    implement(phase)                  # patch files for this phase only
    inspect(diff)                     # review the diff before moving on
    commit_or_iterate()
    if context_pressure_high:
        persist_state()               # checkpoint to the GitHub issue
        clear_context()
        resume_from_external_state()

The point is not that phases are formal project-management objects. The point is that bounded units let the operator keep the live context focused on the current step.

The practical rule is to checkpoint before the session feels bad, not after. Waiting until the context is obviously degraded means the checkpoint itself may already be low quality. If exploration took 33,000 tokens and one implementation phase cost another 3,000, that is already enough evidence that the session has entered a stateful regime where reset planning should begin early.
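The "before it feels bad" rule can be approximated numerically. A sketch using the token figures from this write-up; the 100,000-token window budget and the 30% reserve are assumptions, not measured values:

```python
def should_checkpoint(used_tokens: int, window_budget: int, reserve: float = 0.3) -> bool:
    """Checkpoint while enough headroom remains to write a good checkpoint,
    not after the session is already degraded."""
    return used_tokens >= window_budget * (1 - reserve)

# Figures from the session described above: exploration + one implementation phase.
used = 33_000 + 3_000
assert should_checkpoint(used, 100_000) is False  # headroom left, but start planning the reset
```

The exact threshold matters less than having one: any explicit trigger beats waiting for the session to visibly degrade.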

4) Persist execution state before the reset

This is the key move.

Before clearing context, instruct the agent to write the current multi-phase checklist and completion state into a GitHub issue via CLI. The issue becomes the working-memory checkpoint.

gh issue create \
  --title "LLM execution checkpoint: CLI refactor" \
  --body "$(cat plan-status.md)"

Recommended body shape:

## Current status
- [x] Phase 1: parser changes
- [ ] Phase 2: matcher abstraction
- [ ] Phase 3: docs and cleanup

## Decisions locked
- required flags, not positional
- strict matching by default
- fuzzy matching behind explicit option

## Remaining risks
- existing scripts may rely on old arg order
- tests for invalid combinations incomplete

## Resume instruction
Start at Phase 2. Re-read parser module and tests before editing matcher code.

This turns GitHub into a low-friction state store. It is not magical. It is just durable text in a system engineers already use.
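The body can be generated rather than hand-written, which keeps the four sections consistent across resets. A sketch that renders the recommended shape from structured inputs (function and parameter names are illustrative):

```python
def render_checkpoint(phases: dict[str, bool], decisions: list[str],
                      risks: list[str], resume: str) -> str:
    """Render a checkpoint body in the fixed four-section shape."""
    lines = ["## Current status"]
    for name, done in phases.items():
        lines.append(f"- [{'x' if done else ' '}] {name}")
    lines += ["", "## Decisions locked"] + [f"- {d}" for d in decisions]
    lines += ["", "## Remaining risks"] + [f"- {r}" for r in risks]
    lines += ["", "## Resume instruction", resume]
    return "\n".join(lines)
```

The resulting string can be written to plan-status.md or passed directly to gh issue create --body.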

The quality of the checkpoint matters more than the existence of the checkpoint. A strong checkpoint usually contains four things:

  1. Status: what is done and what is not.
  2. Decisions: what should not be reopened casually.
  3. Risks: what could still invalidate the remaining plan.
  4. Resume instruction: the next exact action plus the files to re-read first.

A weak checkpoint usually fails by being too literary. If it reads like a status update to management, it will not help the next coding session. If it reads like a handoff note to another engineer about what to do next in the repo, it is probably at the right level.

5) Clear context and rehydrate cleanly

After the issue exists, clear the session. In the source workflow, the reset baseline returned to roughly 16,000 tokens. Then start fresh, fetch the issue, and continue from the next pending phase.

The operational loop becomes:

# Session A
claude
# ... plan, implement, checkpoint to GitHub issue ...

# clear session

# Session B
claude
# instruct agent:
# fetch issue 24
# rebuild working context from issue
# continue at next unchecked phase

This same pattern also supports human review. You can pause with Ctrl+C, inspect the repository in your IDE, and resume the same terminal session with claude continue. That bridges agent execution with normal engineering review habits instead of replacing them.

6) Resynchronize the filesystem deliberately

This is the part many operators skip because it feels redundant. It is not redundant.

When resuming, the next prompt should include both the checkpoint and an explicit refresh step:

Read issue 24.
Re-read parser.ts and parser.test.ts.
Assume any earlier mental model is stale.
Continue at Phase 2 only after confirming current file state.

The agent does not automatically know which facts survived the pause. A small amount of explicit resynchronization is cheaper than debugging a patch produced against stale assumptions.
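That refresh step can be templated so it is never skipped. A sketch with hypothetical inputs; the point is that the file list is a required argument, not an afterthought:

```python
def resume_prompt(issue: int, files: list[str], phase: str) -> str:
    """Build a resume prompt that always includes an explicit refresh step."""
    refresh = "\n".join(f"Re-read {f}." for f in files)
    return (
        f"Read issue {issue}.\n"
        f"{refresh}\n"
        "Assume any earlier mental model is stale.\n"
        f"Continue at {phase} only after confirming current file state."
    )
```

Calling resume_prompt(24, ["parser.ts", "parser.test.ts"], "Phase 2") reproduces the prompt shown above.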

7) Keep planning prompts and execution prompts structurally different

Mode confusion gets worse when both modes sound similar.

A planning prompt should sound like this:

Explore the relevant modules.
Return unresolved questions first.
Do not edit files yet.
Then propose a compact phased plan.

An execution prompt should sound like this:

Implement Phase 2 only.
Re-read matcher files first.
Do not revisit earlier decisions unless the code contradicts them.
Show the diff summary and any new risks.

The difference is not stylistic. It reduces the chance that the agent keeps widening scope when it should be converging on a patch.
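One way to keep the two shapes from blurring is to store them as separate templates and select by explicit mode, so the operator never improvises a hybrid. A sketch:

```python
# Templates condensed from the two prompt shapes above; names are illustrative.
PROMPTS = {
    "plan": (
        "Explore the relevant modules.\n"
        "Return unresolved questions first.\n"
        "Do not edit files yet."
    ),
    "execute": (
        "Implement {phase} only.\n"
        "Re-read {files} first.\n"
        "Do not revisit earlier decisions unless the code contradicts them."
    ),
}

def prompt_for(mode: str, **kwargs: str) -> str:
    """Fail loudly on an unknown mode instead of letting modes blur."""
    return PROMPTS[mode].format(**kwargs)
```

An unknown mode raises KeyError, which is the desired behavior: mode confusion should fail before the prompt is sent, not after.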

Final Decision and Constraints

The practical conclusion is straightforward: context windows are execution buffers, not durable project state.

For short tasks, that distinction does not matter much. For architectural or multi-phase work, it matters immediately. The stable design is to separate planning from execution, force unresolved questions first, compress the work into a checklist, and persist that checklist outside the window before the session degrades.

GitHub issues are sufficient when:

  • The work can be decomposed into stable phases.
  • Key decisions can be locked early.
  • A concise checkpoint is enough to reconstruct the next action.
  • The repository is not changing too quickly in parallel.

GitHub issues are not sufficient by themselves when:

  • Files are being changed frequently out of band.
  • The checklist cannot capture the real shape of the remaining work.
  • The task depends on transient execution context more than durable decisions.

The main operational risk is not forgetting a prompt. It is filesystem drift between the agent’s mental model and the actual repository. That is why any resume instruction should explicitly include which files to re-read before the next edit.

The more general lesson is that LLM coding tools behave better when treated less like autonomous pair programmers and more like stateless workers that need deliberate rehydration. The stronger the state boundary, the more predictable the session becomes. Durable-state workflows work not because they make the model smarter, but because they reduce how much the model must remember implicitly.

Decision Checklist

Before using this pattern, ask:

  • Is the change large enough that planning and execution should be separate artifacts?
  • Can the work be decomposed into independent phases with explicit done criteria?
  • Are the key ambiguities known early enough to force unresolved questions first?
  • Is GitHub issue text sufficient to reconstruct the next step without replaying prior reasoning?
  • Will humans edit files out of band during execution, requiring explicit resync steps?

Next Step

The next refinement is to make the checkpoint format stricter.

A lightweight template with fixed sections for completion state, locked decisions, open risks, and exact resume instructions would reduce ambiguity further and make cross-session recovery more reliable. The core workflow already works. The remaining improvement is making the persisted state even less lossy.