QodeAssist/docs/context-architecture.md

# QodeAssist — Context Architecture (v1.0)

Status: design proposal, extends `target-architecture.md` (§7 ContextEngine,
delta #9) and `agent-templates-design.md` (the `ctx.*` template contract).
Scope: everything between "facts exist in the IDE / on disk / in the
conversation" and "bytes leave in the request body" — what context each
pipeline needs, who acquires it, where it lands in the prompt. One assembly
runs per `send()`; tool continuations stay inside LLMQore (§4.3).

---

## 1. Taxonomy — the five kinds of context

Every piece of context the model ever sees falls into one of five categories.
The categories differ in *acquisition mode*, *volatility*, and therefore
*placement* — conflating them is the root cause of today's problems (§3).

| # | Category | What it answers | Examples | Volatility |
|---|----------|-----------------|----------|------------|
| C1 | **Identity** | who is the assistant | agent `system_prompt` (persona inline or via `read_file()`), always-on skills, skills catalog | per agent change |
| C2 | **Environment** | where is it working | project name + source root, build dir, language/file info, recent changes | per project / slow |
| C3 | **Task** | what is asked *now* | chat message, attachments, images, invoked-skill body, completion prefix/suffix, refactor selection + instruction | every turn |
| C4 | **Conversation** | what happened so far | history (text, thinking, tool use/results), compression summary | grows every turn |
| C5 | **Pulled** | what the model asked for | tool results (read file, search, build, diagnostics), MCP tool results | inside the turn |

Two acquisition modes cut across the categories:

- **Push** — we inject proactively (C1–C3, C4). Push is a *per-pipeline
  policy*: completion must push everything (no latency budget for tools);
  chat should push little and let the model pull.
- **Pull** — the model requests through tools (C5). Pull needs no assembly
  policy at all, but its *results* become C4 and therefore must flow through
  the same budget and serialization rules as everything else.

One more orthogonal property drives placement: **stability**. Provider prompt
caches (Claude `cache_control`) reward byte-stable prefixes. Stable content
belongs early (system), volatile content belongs late (near the last user
message). This single rule decides almost every placement question below.

---

## 2. Context inventory per pipeline

What each use case (numbering from `target-architecture.md` §1) actually
needs, against the taxonomy:

| Context item | Cat | U1 completion | U2 chat | U3 refactor | compression | Source port |
|---|---|---|---|---|---|---|
| agent `system_prompt` (persona) | C1 | ✓ | ✓ (persona switch = agent switch) | ✓ | ✓ | AgentProfile + ContextRenderer |
| skills catalog + always-on | C1 | — | ✓ | — | — | SkillsEngine |
| project root / build dir | C2 | — | ✓ | — | — | `IProjectScanner` |
| language + file info | C2 | ✓ | — | ✓ | — | `IDocumentReader` |
| recent project changes | C2 | optional (setting) | — | optional | — | ChangesManager |
| prefix / suffix (FIM) | C3 | ✓ | — | — | — | `IDocumentReader` |
| selection + position markers | C3 | — | — | ✓ | — | `IDocumentReader` |
| user message text | C3 | — | ✓ | ✓ (instruction) | ✓ (directive) | UI |
| attachments / images | C3 | — | ✓ | — | — | chat storage (loader) |
| invoked skill body (`/cmd`) | C3 | — | ✓ | — | — | SkillsEngine |
| linked files (pinned) | C3/C2 | — | ✓ | — | — | `IProjectScanner` + fs |
| open-files sync | C3/C2 | — | ✓ | — | — | `IProjectScanner` |
| history | C4 | — (fresh session) | ✓ | — (fresh) | ✓ (read-only input) | ConversationHistory |
| tool results | C5 | — | ✓ | ✓ (optional) | — | ToolsManager / McpHub |

---

## 3. Problems in the current code this design removes

1. ~~**Two assembly paths.**~~ — RECLASSIFIED 2026-06-12 as by-design, not a
   problem: the first request renders from `ConversationHistory`; tool
   continuations are LLMQore's replay of that payload plus appended tool
   results. The replay carries the full filtered history of its base payload,
   so the feared filter divergence does not materialize in practice (§4.3).
2. **No budget.** History is never trimmed, estimated, or compacted; every
   send ships everything, forever.
3. **Volatile content in system.** Linked-file contents live in the
   `chat.context` system layer; any file edit between turns invalidates the
   provider prompt cache for the whole request.
4. **Invoked skills evaporate.** A `/skill` body is injected into the system
   layer for one send only — the next turn the model has lost the skill's
   instructions, although the conversation continues to rely on them.
5. **Silent loss.** A failed attachment load drops the block with no trace —
   neither the model nor the user learns the image is gone.
6. **Repeated materialization.** Every send re-reads and re-base64s every
   stored image/attachment of the whole history from disk.
7. **Placement decided ad hoc.** Each feature hand-formats markdown and picks
   a system layer by habit (`completion.context`, `refactor`, `chat.context`);
   there is no shared rule for what goes where, and the project-info block is
   formatted three different ways.

---

## 4. Architecture — Acquire → Assemble → Shape

Three stages with hard ownership boundaries:

```mermaid
flowchart LR
    subgraph L3["Acquire — ContextEngine (L3, ports + QtC adapters)"]
        EC["EditorContext<br/>prefix/suffix, selection,<br/>language, copyright strip"]
        PC["ProjectContext<br/>root, ignore filter,<br/>open files, changes"]
        TE["TokenEstimator<br/>calibrated by Usage"]
    end
    subgraph L4["Features (L4) — decide WHAT"]
        F["chat / completion / refactor<br/>set layers, pin providers,<br/>build user blocks"]
    end
    subgraph L2["Assemble — Session (L2) — decide WHERE & HOW MUCH"]
        SPB["SystemPromptBuilder<br/>stable layers only"]
        PIN["Pinned providers<br/>re-materialized every dispatch"]
        CA["ContextAssembler<br/>history + layers + pinned<br/>+ loader + budget → ctx"]
    end
    subgraph L1["Shape — JsonPromptTemplate (L1/L2)"]
        TPL["[body] jinja over ctx.*"]
    end
    EC --> F
    PC --> F
    F --> SPB
    F --> PIN
    F --> CA
    SPB --> CA
    PIN --> CA
    TE --> CA
    CA --> TPL
```

- **Acquire (L3)** — `ContextEngine` services behind IDE-agnostic ports read
  facts from the IDE/fs. No prompt text, no placement decisions. One shared
  `EnvBlockFormatter` renders the project/file info block so it is identical
  in every pipeline.
- **Features (L4)** decide *what* context a turn needs: they set their system
  layer, pin refreshable providers, and compose user blocks. They never
  decide request shape and never concatenate history.
- **Assemble (L2)** — `ContextAssembler` (successor of
  `Session::toLegacyContext`) is the **only** producer of the template
  context, once per `send()` dispatch; tool continuations replay that payload
  inside LLMQore (§4.3). It owns placement policy, budget enforcement,
  materialization, and the manifest.
- **Shape (L1)** — the agent's `[body]` table renders `ctx.*` into the wire
  request. Templates own *shape per provider*, never content.

### 4.1 The three injection mechanisms

| Mechanism | For | Lifetime | Refresh | Persisted |
|---|---|---|---|---|
| **System layers** (`SystemPromptBuilder`) | stable C1/C2: `agent.system`, `env.project`, `skills.catalog`, `refactor`, `compression` | conversation | on send | no |
| **Pinned providers** (new) | refreshable C3/C2: linked files, open-files sync | until unpinned | **every `send()`** | as reference only |
| **User blocks** (`send(blocks)`) | one-shot C3: message, attachments, images, invoked-skill body, completion content | that turn | never (history is immutable) | yes |

Pinned providers are the new piece:

```
session->pinContext(id, [](){ return materialized blocks; });
session->unpinContext(id);
```

The assembler calls every pinned provider at **every `send()`** and splices
the result as text blocks
**prepended to the turn's typed user message** — the last user-role wire
message that does not carry tool results (falling back to the tool-result
carrier, after its leading `tool_result` blocks, and to a synthetic user
message when the history has no user message at all). Prepending into an
existing message rather than inserting a separate one keeps strict
user/assistant alternation, which some provider APIs enforce.

The fixed anchor and the per-turn refresh split the cache cost fairly:
within a turn's tool loop the pinned blocks are byte-identical (continuations
replay the payload — pure appends, cache hits); the next `send()` re-reads
the files, and a change invalidates the cache only from the turn's anchor,
not from the system prefix. The materialized block's label states its capture
time ("content as of this turn") because a tool may mutate the file mid-loop;
the model sees such changes through the tool results themselves. Pinned
content is never stored in history and never persisted — never duplicated
turn-over-turn.

Invoked-skill bodies move the opposite way: out of the system layer into the
**user blocks of that turn** (a dedicated block type), so they persist in
history and survive the rest of the conversation (fixes problem 4).

### 4.2 Placement policy (single table, owned by the assembler)

| Content | Position in request | Why |
|---|---|---|
| `agent.system` (rendered TOML `system_prompt`) | system, first | static per agent → max cache reuse |
| `env.project`, `skills.catalog` | system, after agent | changes rarely |
| pipeline layers (`refactor`, `compression`, `completion.context`) | system, last | fresh session each time, ordering irrelevant |
| history | messages | as is |
| pinned materializations | text blocks prepended to the turn's typed user message, live content | fixed anchor keeps the prefix cache-stable; content refreshes because tools mutate files at any moment |
| task blocks | last user message | the turn itself |

`ClaudeCacheControl` breakpoints stay as they are (system / history tail);
this ordering is what makes them effective.

### 4.3 Tool continuations stay in LLMQore (replay)

The tool loop deliberately stays in LLMQore — the library is a complete,
standalone agentic client, and the loop (execute tools, count rounds,
schedule the next request, stream) is *mechanism*, which per
`target-architecture.md` design principle 3 belongs in C++ identically for
all providers. Continuation *content* is the library's default replay: the
base payload plus the assistant message and appended tool results.

An inversion hook (`setContinuationPayloadBuilder`, an optional per-request
callback letting `Session` re-assemble each continuation through
`ContextAssembler`) was implemented and **reverted 2026-06-12**: the problem
it solved was judged contrived. The replay already carries the full filtered
history of its base payload, mid-loop file changes reach the model through
the tool results themselves, and continuation growth within one turn is
bounded by `maxToolContinuations` — budget enforcement at `send()` time
covers the realistic cases. Consequences accepted with the revert: the
manifest logs one entry per `send()` (not per wire request), and pinned
content is byte-stable for the duration of a turn's tool loop (§4.1).

2026-06-13 the loop's *shape* inside LLMQore was refactored without changing
this decision (see `tool-loop-runner-plan.md`): the loop policy now lives in
`ToolLoopRunner` (per-request round state, limit, continuation decision) and
`BaseClient` slimmed to transport + tool dispatch with public primitives
`continueRequest` / `buildReplayContinuation` / `abortRequest`. Continuation
content is still the replay. QodeAssist sets the round limit via
`client->toolLoop()->setMaxRounds(...)`; the old `setMaxToolContinuations`
stays as a forwarder for compatibility.

### 4.4 Budget

`ContextAssembler` consults a `BudgetPolicy` before producing the context:

```
input_estimate = TokenEstimator(system + history + pinned + task)
limit          = agent context_window − body.max_tokens (output reserve)
```

`context_window` comes from provider/model metadata with an optional agent
TOML override. When the estimate exceeds the limit the policy returns a trim
plan executed in deterministic order:

1. elide bodies of tool results older than the last N rounds
   (`[tool result elided — N tokens]` placeholder, pairing preserved);
2. elide materializations of old stored images/attachments (placeholder
   block, reference kept in history);
3. below a hard floor — refuse with `ErrorCategory::Validation` and surface
   "compress the conversation" (ChatCompressor) in the UI.

v1.0 ships stages: **estimate + manifest + UI warning** first (no silent
trimming), then stage 1–2 elision, then auto-compression hooks. The
architecture fixes the *seam*; the policy can stay minimal.

`TokenEstimator` is calibrated per provider/model from `Usage` events
(§8.5 of the target architecture) — chars-per-token ratio updated after every
response; the chat token counter and the budget share this one estimator.

### 4.5 Materialization and caching

Stored content (attachments, images) stays reference-only in history;
materialization happens in the assembler through the `ContentLoader`. Two
fixes over today:

- the loader result is cached per `(storedPath, mtime, size)` — no re-reading
  the whole conversation's binaries on every send, and byte-identical turns
  keep the provider prompt cache warm;
- a failed load produces an **explicit placeholder block**
  (`[attachment unavailable: name.png]`) instead of silently vanishing —
  the model can say so, the manifest records it (fixes problem 5).

### 4.6 Observability: the context manifest

Every `assemble()` emits one debug-category log entry and a struct on the
event stream:

```
manifest {
  layers:   { agent.system: ~1.9k tok, env.project: ~70, skills.catalog: ~640 }
  history:  26 messages, ~14.2k tok (3 tool rounds)
  pinned:   { linked:src/main.cpp: ~2.1k }
  task:     ~310 tok, 1 image (cached)
  elided:   [ tool_result a4f1 (~8k) ]
  estimate: ~19.3k / limit 32k
}
```

Nothing is dropped silently — every filter (unsigned thinking, orphaned tool
pairs, failed loads, budget elisions) leaves a manifest record. The token
counter UI reads the same struct.

---

## 5. Wire contract — `ctx.*` stays, gains one producer

`Templates::ContextData` (→ `ctx.system_prompt`, `ctx.history`,
`ctx.prefix/suffix`, `ctx.files_metadata`) remains the contract between the
core and `[body]` templates — it is not legacy, it is the template-facing
view of the assembled context. The change is that exactly one function
produces it (`ContextAssembler::assemble`), for every request, and
`toLegacyContext`/`buildLegacyContext` are renamed into it. Existing
serialization rules carry over unchanged: system messages never enter
history, unsigned thinking is dropped, orphaned tool_use/tool_result pairs
are filtered, `CompletionContent` becomes `prefix`/`suffix`.

---

## 6. Migration plan

Ordered so every step lands independently and shrinks risk:

1. **Extract `ContextAssembler`** from `buildLegacyContext` (pure, unit-tested
   against fixture histories) + manifest logging + failed-load placeholder
   blocks. No behavior change otherwise. — DONE 2026-06-12
   (`sources/Session/ContextAssembler.{hpp,cpp}`, `test/ContextAssemblerTest.cpp`;
   manifest logged under the `qodeassist.context` category).
2. **ContentLoader cache** keyed by `(path, mtime, size)`. — DONE 2026-06-12
   (`StoredContentCache` in `ChatSerializer`, owned per-chat by
   `ClientInterface`, cleared on chat switch).
3. **Pinned providers**: linked files and open-files sync move out of the
   `chat.context` system layer; invoked-skill bodies move into the turn's
   user blocks. `chat.context` shrinks to project info + skills catalog.
   — DONE 2026-06-12 (`Session::pinContext/unpinContext`, pinned splice in
   `ContextAssembler::assemble`; `SkillInvocationContent` block persisted via
   `MessageSerializer`, invisible in the chat UI by design; open-files sync is
   covered because `ChatRootView` merges open editors into the linked list).
4. **Shared `EnvBlockFormatter`** in ContextEngine; chat/refactor/completion
   stop hand-formatting project/file info. — DONE 2026-06-12
   (`context/EnvBlockFormatter.{hpp,cpp}`: pure `formatProject`/`formatFile`
   + the `currentProject()` QtC gatherer; chat project block, refactor file
   header, and completion's `getLanguageAndFileInfo` all route through it).
5. ~~**Continuation payload callback**~~ — REVERTED 2026-06-12 (implemented,
   then judged a solution to a contrived problem; see §4.3). Continuations
   are LLMQore's default replay; `ContextAssembler` runs once per `send()`.
6. **TokenEstimator + BudgetPolicy seam** — estimate + warning first, then
   elision stages.
7. **ContextEngine port split** (delta #9 of the target architecture) —
   `EditorContext` / `ProjectContext` / `TokenEstimator` behind ports, QtC
   API only in `ide/context` adapters.

---

## 7. Open questions

1. ~~**Pinned placement**~~ — RESOLVED 2026-06-12: text blocks prepended to
   the last user-role wire message (synthetic user message only when there is
   none). A separate synthetic message would break strict role alternation on
   some provider APIs; cache behaviour of the two shapes is identical.
2. ~~**Tool-loop relocation cost**~~ — RESOLVED 2026-06-12: relocation
   rejected (LLMQore is deliberately a standalone agentic client). The
   follow-up `setContinuationPayloadBuilder` inversion hook was also
   implemented and reverted the same day — replay is the accepted behaviour
   (§4.3).
3. **Budget v1 scope** — warn-only vs. enabling tool-result elision
   immediately. Elision changes what the model sees; needs live validation.
4. **Completion and open files** — should completion gain pinned open-files
   context (cheap with this design), or stay prefix/suffix-only for latency?