refactor: Move to agent architecture

2026-06-30 01:59:11 -04:00 · 2026-05-30 14:50:49 +02:00
parent 34ce787320
commit ccc2ec2e80
364 changed files with 10801 additions and 19020 deletions
--- a/docs/context-architecture.md
+++ b/docs/context-architecture.md
@@ -0,0 +1,347 @@
+# QodeAssist — Context Architecture (v1.0)
+
+Status: design proposal, extends `target-architecture.md` (§7 ContextEngine,
+delta #9) and `agent-templates-design.md` (the `ctx.*` template contract).
+Scope: everything between "facts exist in the IDE / on disk / in the
+conversation" and "bytes leave in the request body" — what context each
+pipeline needs, who acquires it, where it lands in the prompt. One assembly
+runs per `send()`; tool continuations stay inside LLMQore (§4.3).
+
+---
+
+## 1. Taxonomy — the five kinds of context
+
+Every piece of context the model ever sees falls into one of five categories.
+The categories differ in *acquisition mode*, *volatility*, and therefore
+*placement* — conflating them is the root cause of today's problems (§3).
+
+| # | Category | What it answers | Examples | Volatility |
+|---|----------|-----------------|----------|------------|
+| C1 | **Identity** | who is the assistant | agent `system_prompt` (persona inline or via `read_file()`), always-on skills, skills catalog | per agent change |
+| C2 | **Environment** | where is it working | project name + source root, build dir, language/file info, recent changes | per project / slow |
+| C3 | **Task** | what is asked *now* | chat message, attachments, images, invoked-skill body, completion prefix/suffix, refactor selection + instruction | every turn |
+| C4 | **Conversation** | what happened so far | history (text, thinking, tool use/results), compression summary | grows every turn |
+| C5 | **Pulled** | what the model asked for | tool results (read file, search, build, diagnostics), MCP tool results | inside the turn |
+
+Two acquisition modes cut across the categories:
+
+- **Push** — we inject proactively (C1–C3, C4). Push is a *per-pipeline
+  policy*: completion must push everything (no latency budget for tools);
+  chat should push little and let the model pull.
+- **Pull** — the model requests through tools (C5). Pull needs no assembly
+  policy at all, but its *results* become C4 and therefore must flow through
+  the same budget and serialization rules as everything else.
+
+One more orthogonal property drives placement: **stability**. Provider prompt
+caches (Claude `cache_control`) reward byte-stable prefixes. Stable content
+belongs early (system), volatile content belongs late (near the last user
+message). This single rule decides almost every placement question below.
+
+---
+
+## 2. Context inventory per pipeline
+
+What each use case (numbering from `target-architecture.md` §1) actually
+needs, against the taxonomy:
+
+| Context item | Cat | U1 completion | U2 chat | U3 refactor | compression | Source port |
+|---|---|---|---|---|---|---|
+| agent `system_prompt` (persona) | C1 | ✓ | ✓ (persona switch = agent switch) | ✓ | ✓ | AgentProfile + ContextRenderer |
+| skills catalog + always-on | C1 | — | ✓ | — | — | SkillsEngine |
+| project root / build dir | C2 | — | ✓ | — | — | `IProjectScanner` |
+| language + file info | C2 | ✓ | — | ✓ | — | `IDocumentReader` |
+| recent project changes | C2 | optional (setting) | — | optional | — | ChangesManager |
+| prefix / suffix (FIM) | C3 | ✓ | — | — | — | `IDocumentReader` |
+| selection + position markers | C3 | — | — | ✓ | — | `IDocumentReader` |
+| user message text | C3 | — | ✓ | ✓ (instruction) | ✓ (directive) | UI |
+| attachments / images | C3 | — | ✓ | — | — | chat storage (loader) |
+| invoked skill body (`/cmd`) | C3 | — | ✓ | — | — | SkillsEngine |
+| linked files (pinned) | C3/C2 | — | ✓ | — | — | `IProjectScanner` + fs |
+| open-files sync | C3/C2 | — | ✓ | — | — | `IProjectScanner` |
+| history | C4 | — (fresh session) | ✓ | — (fresh) | ✓ (read-only input) | ConversationHistory |
+| tool results | C5 | — | ✓ | ✓ (optional) | — | ToolsManager / McpHub |
+
+---
+
+## 3. Problems in the current code this design removes
+
+1. ~~**Two assembly paths.**~~ — RECLASSIFIED 2026-06-12 as by-design, not a
+   problem: the first request renders from `ConversationHistory`; tool
+   continuations are LLMQore's replay of that payload plus appended tool
+   results. The replay carries the full filtered history of its base payload,
+   so the feared filter divergence does not materialize in practice (§4.3).
+2. **No budget.** History is never trimmed, estimated, or compacted; every
+   send ships everything, forever.
+3. **Volatile content in system.** Linked-file contents live in the
+   `chat.context` system layer; any file edit between turns invalidates the
+   provider prompt cache for the whole request.
+4. **Invoked skills evaporate.** A `/skill` body is injected into the system
+   layer for one send only — the next turn the model has lost the skill's
+   instructions, although the conversation continues to rely on them.
+5. **Silent loss.** A failed attachment load drops the block with no trace —
+   neither the model nor the user learns the image is gone.
+6. **Repeated materialization.** Every send re-reads and re-base64s every
+   stored image/attachment of the whole history from disk.
+7. **Placement decided ad hoc.** Each feature hand-formats markdown and picks
+   a system layer by habit (`completion.context`, `refactor`, `chat.context`);
+   there is no shared rule for what goes where, and the project-info block is
+   formatted three different ways.
+
+---
+
+## 4. Architecture — Acquire → Assemble → Shape
+
+Three stages with hard ownership boundaries:
+
+```mermaid
+flowchart LR
+    subgraph L3["Acquire — ContextEngine (L3, ports + QtC adapters)"]
+        EC["EditorContext<br/>prefix/suffix, selection,<br/>language, copyright strip"]
+        PC["ProjectContext<br/>root, ignore filter,<br/>open files, changes"]
+        TE["TokenEstimator<br/>calibrated by Usage"]
+    end
+    subgraph L4["Features (L4) — decide WHAT"]
+        F["chat / completion / refactor<br/>set layers, pin providers,<br/>build user blocks"]
+    end
+    subgraph L2["Assemble — Session (L2) — decide WHERE & HOW MUCH"]
+        SPB["SystemPromptBuilder<br/>stable layers only"]
+        PIN["Pinned providers<br/>re-materialized every dispatch"]
+        CA["ContextAssembler<br/>history + layers + pinned<br/>+ loader + budget → ctx"]
+    end
+    subgraph L1["Shape — JsonPromptTemplate (L1/L2)"]
+        TPL["[body] jinja over ctx.*"]
+    end
+    EC --> F
+    PC --> F
+    F --> SPB
+    F --> PIN
+    F --> CA
+    SPB --> CA
+    PIN --> CA
+    TE --> CA
+    CA --> TPL
+```
+
+- **Acquire (L3)** — `ContextEngine` services behind IDE-agnostic ports read
+  facts from the IDE/fs. No prompt text, no placement decisions. One shared
+  `EnvBlockFormatter` renders the project/file info block so it is identical
+  in every pipeline.
+- **Features (L4)** decide *what* context a turn needs: they set their system
+  layer, pin refreshable providers, and compose user blocks. They never
+  decide request shape and never concatenate history.
+- **Assemble (L2)** — `ContextAssembler` (successor of
+  `Session::toLegacyContext`) is the **only** producer of the template
+  context, once per `send()` dispatch; tool continuations replay that payload
+  inside LLMQore (§4.3). It owns placement policy, budget enforcement,
+  materialization, and the manifest.
+- **Shape (L1)** — the agent's `[body]` table renders `ctx.*` into the wire
+  request. Templates own *shape per provider*, never content.
+
+### 4.1 The three injection mechanisms
+
+| Mechanism | For | Lifetime | Refresh | Persisted |
+|---|---|---|---|---|
+| **System layers** (`SystemPromptBuilder`) | stable C1/C2: `agent.system`, `env.project`, `skills.catalog`, `refactor`, `compression` | conversation | on send | no |
+| **Pinned providers** (new) | refreshable C3/C2: linked files, open-files sync | until unpinned | **every `send()`** | as reference only |
+| **User blocks** (`send(blocks)`) | one-shot C3: message, attachments, images, invoked-skill body, completion content | that turn | never (history is immutable) | yes |
+
+Pinned providers are the new piece:
+
+```
+session->pinContext(id, [](){ return materialized blocks; });
+session->unpinContext(id);
+```
+
+The assembler calls every pinned provider at **every `send()`** and splices
+the result as text blocks
+**prepended to the turn's typed user message** — the last user-role wire
+message that does not carry tool results (falling back to the tool-result
+carrier, after its leading `tool_result` blocks, and to a synthetic user
+message when the history has no user message at all). Prepending into an
+existing message rather than inserting a separate one keeps strict
+user/assistant alternation, which some provider APIs enforce.
+
+The fixed anchor and the per-turn refresh split the cache cost fairly:
+within a turn's tool loop the pinned blocks are byte-identical (continuations
+replay the payload — pure appends, cache hits); the next `send()` re-reads
+the files, and a change invalidates the cache only from the turn's anchor,
+not from the system prefix. The materialized block's label states its capture
+time ("content as of this turn") because a tool may mutate the file mid-loop;
+the model sees such changes through the tool results themselves. Pinned
+content is never stored in history and never persisted — never duplicated
+turn-over-turn.
+
+Invoked-skill bodies move the opposite way: out of the system layer into the
+**user blocks of that turn** (a dedicated block type), so they persist in
+history and survive the rest of the conversation (fixes problem 4).
+
+### 4.2 Placement policy (single table, owned by the assembler)
+
+| Content | Position in request | Why |
+|---|---|---|
+| `agent.system` (rendered TOML `system_prompt`) | system, first | static per agent → max cache reuse |
+| `env.project`, `skills.catalog` | system, after agent | changes rarely |
+| pipeline layers (`refactor`, `compression`, `completion.context`) | system, last | fresh session each time, ordering irrelevant |
+| history | messages | as is |
+| pinned materializations | text blocks prepended to the turn's typed user message, live content | fixed anchor keeps the prefix cache-stable; content refreshes because tools mutate files at any moment |
+| task blocks | last user message | the turn itself |
+
+`ClaudeCacheControl` breakpoints stay as they are (system / history tail);
+this ordering is what makes them effective.
+
+### 4.3 Tool continuations stay in LLMQore (replay)
+
+The tool loop deliberately stays in LLMQore — the library is a complete,
+standalone agentic client, and the loop (execute tools, count rounds,
+schedule the next request, stream) is *mechanism*, which per
+`target-architecture.md` design principle 3 belongs in C++ identically for
+all providers. Continuation *content* is the library's default replay: the
+base payload plus the assistant message and appended tool results.
+
+An inversion hook (`setContinuationPayloadBuilder`, an optional per-request
+callback letting `Session` re-assemble each continuation through
+`ContextAssembler`) was implemented and **reverted 2026-06-12**: the problem
+it solved was judged contrived. The replay already carries the full filtered
+history of its base payload, mid-loop file changes reach the model through
+the tool results themselves, and continuation growth within one turn is
+bounded by `maxToolContinuations` — budget enforcement at `send()` time
+covers the realistic cases. Consequences accepted with the revert: the
+manifest logs one entry per `send()` (not per wire request), and pinned
+content is byte-stable for the duration of a turn's tool loop (§4.1).
+
+2026-06-13 the loop's *shape* inside LLMQore was refactored without changing
+this decision (see `tool-loop-runner-plan.md`): the loop policy now lives in
+`ToolLoopRunner` (per-request round state, limit, continuation decision) and
+`BaseClient` slimmed to transport + tool dispatch with public primitives
+`continueRequest` / `buildReplayContinuation` / `abortRequest`. Continuation
+content is still the replay. QodeAssist sets the round limit via
+`client->toolLoop()->setMaxRounds(...)`; the old `setMaxToolContinuations`
+stays as a forwarder for compatibility.
+
+### 4.4 Budget
+
+`ContextAssembler` consults a `BudgetPolicy` before producing the context:
+
+```
+input_estimate = TokenEstimator(system + history + pinned + task)
+limit          = agent context_window − body.max_tokens (output reserve)
+```
+
+`context_window` comes from provider/model metadata with an optional agent
+TOML override. When the estimate exceeds the limit the policy returns a trim
+plan executed in deterministic order:
+
+1. elide bodies of tool results older than the last N rounds
+   (`[tool result elided — N tokens]` placeholder, pairing preserved);
+2. elide materializations of old stored images/attachments (placeholder
+   block, reference kept in history);
+3. below a hard floor — refuse with `ErrorCategory::Validation` and surface
+   "compress the conversation" (ChatCompressor) in the UI.
+
+v1.0 ships stages: **estimate + manifest + UI warning** first (no silent
+trimming), then stage 1–2 elision, then auto-compression hooks. The
+architecture fixes the *seam*; the policy can stay minimal.
+
+`TokenEstimator` is calibrated per provider/model from `Usage` events
+(§8.5 of the target architecture) — chars-per-token ratio updated after every
+response; the chat token counter and the budget share this one estimator.
+
+### 4.5 Materialization and caching
+
+Stored content (attachments, images) stays reference-only in history;
+materialization happens in the assembler through the `ContentLoader`. Two
+fixes over today:
+
+- the loader result is cached per `(storedPath, mtime, size)` — no re-reading
+  the whole conversation's binaries on every send, and byte-identical turns
+  keep the provider prompt cache warm;
+- a failed load produces an **explicit placeholder block**
+  (`[attachment unavailable: name.png]`) instead of silently vanishing —
+  the model can say so, the manifest records it (fixes problem 5).
+
+### 4.6 Observability: the context manifest
+
+Every `assemble()` emits one debug-category log entry and a struct on the
+event stream:
+
+```
+manifest {
+  layers:   { agent.system: ~1.9k tok, env.project: ~70, skills.catalog: ~640 }
+  history:  26 messages, ~14.2k tok (3 tool rounds)
+  pinned:   { linked:src/main.cpp: ~2.1k }
+  task:     ~310 tok, 1 image (cached)
+  elided:   [ tool_result a4f1 (~8k) ]
+  estimate: ~19.3k / limit 32k
+}
+```
+
+Nothing is dropped silently — every filter (unsigned thinking, orphaned tool
+pairs, failed loads, budget elisions) leaves a manifest record. The token
+counter UI reads the same struct.
+
+---
+
+## 5. Wire contract — `ctx.*` stays, gains one producer
+
+`Templates::ContextData` (→ `ctx.system_prompt`, `ctx.history`,
+`ctx.prefix/suffix`, `ctx.files_metadata`) remains the contract between the
+core and `[body]` templates — it is not legacy, it is the template-facing
+view of the assembled context. The change is that exactly one function
+produces it (`ContextAssembler::assemble`), for every request, and
+`toLegacyContext`/`buildLegacyContext` are renamed into it. Existing
+serialization rules carry over unchanged: system messages never enter
+history, unsigned thinking is dropped, orphaned tool_use/tool_result pairs
+are filtered, `CompletionContent` becomes `prefix`/`suffix`.
+
+---
+
+## 6. Migration plan
+
+Ordered so every step lands independently and shrinks risk:
+
+1. **Extract `ContextAssembler`** from `buildLegacyContext` (pure, unit-tested
+   against fixture histories) + manifest logging + failed-load placeholder
+   blocks. No behavior change otherwise. — DONE 2026-06-12
+   (`sources/Session/ContextAssembler.{hpp,cpp}`, `test/ContextAssemblerTest.cpp`;
+   manifest logged under the `qodeassist.context` category).
+2. **ContentLoader cache** keyed by `(path, mtime, size)`. — DONE 2026-06-12
+   (`StoredContentCache` in `ChatSerializer`, owned per-chat by
+   `ClientInterface`, cleared on chat switch).
+3. **Pinned providers**: linked files and open-files sync move out of the
+   `chat.context` system layer; invoked-skill bodies move into the turn's
+   user blocks. `chat.context` shrinks to project info + skills catalog.
+   — DONE 2026-06-12 (`Session::pinContext/unpinContext`, pinned splice in
+   `ContextAssembler::assemble`; `SkillInvocationContent` block persisted via
+   `MessageSerializer`, invisible in the chat UI by design; open-files sync is
+   covered because `ChatRootView` merges open editors into the linked list).
+4. **Shared `EnvBlockFormatter`** in ContextEngine; chat/refactor/completion
+   stop hand-formatting project/file info. — DONE 2026-06-12
+   (`context/EnvBlockFormatter.{hpp,cpp}`: pure `formatProject`/`formatFile`
+   + the `currentProject()` QtC gatherer; chat project block, refactor file
+   header, and completion's `getLanguageAndFileInfo` all route through it).
+5. ~~**Continuation payload callback**~~ — REVERTED 2026-06-12 (implemented,
+   then judged a solution to a contrived problem; see §4.3). Continuations
+   are LLMQore's default replay; `ContextAssembler` runs once per `send()`.
+6. **TokenEstimator + BudgetPolicy seam** — estimate + warning first, then
+   elision stages.
+7. **ContextEngine port split** (delta #9 of the target architecture) —
+   `EditorContext` / `ProjectContext` / `TokenEstimator` behind ports, QtC
+   API only in `ide/context` adapters.
+
+---
+
+## 7. Open questions
+
+1. ~~**Pinned placement**~~ — RESOLVED 2026-06-12: text blocks prepended to
+   the last user-role wire message (synthetic user message only when there is
+   none). A separate synthetic message would break strict role alternation on
+   some provider APIs; cache behaviour of the two shapes is identical.
+2. ~~**Tool-loop relocation cost**~~ — RESOLVED 2026-06-12: relocation
+   rejected (LLMQore is deliberately a standalone agentic client). The
+   follow-up `setContinuationPayloadBuilder` inversion hook was also
+   implemented and reverted the same day — replay is the accepted behaviour
+   (§4.3).
+3. **Budget v1 scope** — warn-only vs. enabling tool-result elision
+   immediately. Elision changes what the model sees; needs live validation.
+4. **Completion and open files** — should completion gain pinned open-files
+   context (cheap with this design), or stay prefix/suffix-only for latency?