Dynamic Tool Loader

Source Code: src/gaia/agents/base/tool_loader.py · bundles hub/agents/python/chat/gaia_agent_chat/tool_bundles.py

Component: Per-turn tool visibility for agents (issue #688) Module: gaia.agents.base.tool_loader Status: Part 0 (#1448) + Part 1 (#1449) + Part 2 (#1450) + Part 3 (#1451) landed. Part 1 ships the selection mechanism behind a default-off toggle on the ChatAgent doc profile; Part 2 adds the explicit load_tools escape hatch (so native tool-calling models can recover a semantic miss) plus the escape-hatch activation-rate tuning signal; Part 3 adds the skill-driven signal — the tools_required of a recalled learned procedure (#887) load ahead of the semantic results. Target agent (v1): ChatAgent (doc profile), behind a default-off toggle.

Why this exists

Every tool registered in _TOOL_REGISTRY is rendered into the agent’s prompt on every turn, whether or not the conversation needs it. That couples prompt cost to the size of the toolbox rather than to what the turn actually needs. Two concrete problems follow — and this spec exists to fix both:

Cold-start latency (the user-felt problem). On local inference the model must prefill the entire system prompt before emitting the first token. The first query of a session is the worst case — nothing is cached, so the whole prompt is processed from scratch. A large tool block directly inflates first-turn time-to-first-token (TTFT). Trimming it is the most visible win: users perceive a slow first reply, never a token count.
Scaling headroom (the architectural problem). As agents gain tools (MCP connectors, skills, new mixins), the prompt grows linearly with the registry. Today’s absolute waste may be modest, but the slope is the problem: every tool added taxes every turn forever. This mechanism breaks that link — prompt cost scales with tools loaded, not tools registered.

The original issue-body spec justified this on a fixed “~12K tokens / ~400 tokens per tool” figure. That premise does not match the current renderer (see Correcting the original spec). Do not build against the 12K number — measure first, then justify on TTFT and slope.

Decided design

A semantic selector over a CORE floor, with a model-driven safety net — plus a skill-driven signal layered on once #887 landed (see Part 3). Keyword matching is explicitly out — it is too brittle to maintain and was the source of repeated review failures on the prior attempts (#811 / #922 / #957 / #958).

┌──────────────────────────────────────────────────────────┐
│ CORE  — always loaded. The functional floor.             │
│   e.g. memory, read/write, search, finish (set TBD).     │
├──────────────────────────────────────────────────────────┤
│ SEMANTIC  — primary mechanism.                           │
│   embed(conversation) → cosine vs. embedded tool docs    │
│   score tool docs ≥ τ → open matched tools' bundles.     │
├──────────────────────────────────────────────────────────┤
│ ESCAPE HATCH  — safety net for semantic misses.          │
│   recovery is PATH-DEPENDENT (see "Escape hatch"):       │
│   non-tool-calling → unlisted tool still executes;       │
│   native tool-calling → explicit load_tools meta-tool.   │
├──────────────────────────────────────────────────────────┤
│ SKILL  — future signal, gated on #887 (Part 3).          │
│   recall_skill(goal) → union matched skills' tools.      │
│   precision booster; returns [] cleanly until #887.      │
└──────────────────────────────────────────────────────────┘
   loaded = CORE ∪ SKILL ∪ SEMANTIC ∪ (escape-hatch additions)

Bundles are cohesion groups, not triggers

With keyword gone, a bundle is just “these tools belong together.” Semantic matching scores individual tool descriptions, but tools often need their siblings (query_table is useless without create_table). So when a tool matches, pull in its bundle-mates. A bundle carries no activation policy — only a name, a one-line description (used for the escape-hatch menu), and its member tools.

The escape hatch — free on one path, explicit on the other

How the model recovers from a semantic miss depends on the render path (the same split that governs filtering — see the two render paths):

Non-tool-calling models — recovery is nearly free. The model emits free-form tool names, and Agent._execute_tool validates against self._tools_registry — the full registry — with unknown-name resolution (MCP-prefix stripping, bare-prefix candidate lists) already built in. So if the loader filters only the prompt and leaves the registry whole, a tool the model names that semantic didn’t surface still executes. The escape hatch’s only extra job here is re-surfacing that tool’s bundle in later turns.
Native tool-calling models — recovery must be explicit. These models are physically constrained to the schemas passed in tools=; they cannot emit a call for a tool that isn’t listed. Reactive-via-execution is impossible. This path needs an always-loaded load_tools(bundle) meta-tool (in the schema list every turn) plus a short menu of bundle names + one-liners; the model calls it, and the next turn’s tools= includes the expanded set. On a small local model this is the less-reliable path — measure whether it actually fires.

Design rule: never tighten _execute_tool to the loaded subset. Keeping execution validated against the full registry is what makes recovery automatic on the non-tool-calling path. Restricting it re-introduces the catastrophic failure mode — “the model wanted a real tool and was refused.”

When memory is off, all tools load (the legacy path)

Semantic selection rides on the memory subsystem’s embedder (MemoryMixin._embed_text), and memory is a user-controlled setting — users can turn it off deliberately. Dynamic tool loading must inherit that switch cleanly:

When memory is disabled, the loader reverts to the legacy behavior: every registered tool is exposed, exactly as before this feature existed. No semantic selection, no filtering. This is a documented off-state, not a degradation — the user chose it, and the agent must behave identically to a build without the tool loader.

This collapses three distinct conditions onto the same safe behavior — load everything — so a user can never lose access to a tool by toggling memory:

Condition	Who triggers it	Tool-loader behavior
Memory disabled by the user	Deliberate user setting	Legacy path: expose all registered tools. No-op loader.
Loader’s own toggle off (default in v1)	Deliberate (default-off)	Legacy path: expose all registered tools.
Embedder unavailable (Lemonade down, model missing)	Failure	Disable loading for the session, expose all tools, log loudly.

The distinction matters for messaging, not mechanism: the user-disabled and toggle-off cases are silent and expected; the embedder-failure case logs an actionable error (it’s a fault, not a choice). All three land on “all tools available,” so the old method is always the floor the loader degrades to.

Correcting the original spec

The issue-body spec drifted from the code. These are the corrections an implementer must internalize:

Original spec said	Reality in the code	Consequence
~12K tokens, ~400/tool	`_format_tools_for_prompt` emits one line per tool (~30–60 tokens)	The text block is ~1–2K, not 12K. Justify on TTFT + slope, not absolute waste.
Filter `_format_tools_for_prompt`	There are two render paths	Must also filter the native path or tool-calling models get zero benefit.
Three tiers incl. skill (#887)	#887 not landed; #606 memory has	Defer the skill tier to Part 3 (gated on #887); v1 is CORE + semantic. Reuse `MemoryMixin._embed_text` — it exists and is already in ChatAgent.
Tidy hook into `_compose_system_prompt`	System prompt is cached (`_system_prompt_cache`) and takes no user turn	Must recompute when selection changes and thread the user message in.
Embedder unavailable → default raise, opt-in degrade	Memory v2 disables itself for the session on embedder failure	Mirror memory v2: embedder down → disable dynamic loading for the session, fall back to full registry, log loudly.

The two render paths (most important correction)

Tool descriptions reach the model two different ways, depending on the model:

Non-tool-calling models → _format_tools_for_prompt() renders text into the system prompt (cheap, one line per tool).
Tool-calling models → _build_openai_tool_schemas() (via the _openai_tools property) passes full JSON function schemas as the tools= API param. This is where the real tokens live.

A single tool selection must drive both renderers. Filtering only the text path was the gap every prior PR missed.

KPIs

Savings alone is a trap — it is maximized by loading nothing, which breaks the agent. The KPI set is benefit gated by no-harm:

KPI	Type	Target
First-turn TTFT reduction	Headline (user-felt)	Materially lower vs. unfiltered baseline; concrete target set from the Part-0 measurement
Prompt-cost slope (tokens per added tool)	Architectural	≈ flat — adding N tools ≈ 0 extra prompt cost
Tool recall (turns where every needed tool was available)	Hard guardrail	~100% — the merge gate
Task success rate	Outcome guardrail	Within ~2% of unfiltered baseline
Later-turn TTFT	Regression guardrail	No regression from cache thrash (see below)
Escape-hatch activation rate	Tuning signal	Low; rising = τ too strict

Measuring recall without hand-labeling: run the unfiltered agent as baseline and record which tools it actually calls per turn — that is the “needed” set. Run the loader and check each needed tool was visible when called. A miss is a recall failure.

The cache-thrash trap (TTFT’s flip side)

The cold-start win and a later-turn regression come from the same mechanism, so the design must handle both:

After turn 1 the system-prompt prefix is KV-cached — that is why turns 2+ are fast. If the loader recomputes selection every turn and the tool set changes, it invalidates the cache and re-prefills every turn, turning a first-turn win into an every-turn tax.

Two rules follow:

Select once at conversation start (or change rarely), not aggressively per-turn. Stable selection preserves both the cold-start win and warm-cache reuse.
Place the dynamic tool block at the end of the prompt, after all stable instructions. KV cache is prefix-based: everything stays cached up to the first byte that differs, so the volatile part must come last. Note this is a text-path lever — native tool-calling models carry tools in the tools= param, not the prompt body, so there the only lever is set stability (rule 1), not ordering.

Phased build

Ship in distinct reviewable phases so each layer earns its place with data — not one 1,000-line drop like the prior attempts.

Part 0 — Measure first (do this before any loader code)

Measure the real tool-prompt cost on ChatAgent across both render paths, and the prompt-cost slope (add dummy tools, show the unfiltered prompt grows while the loaded one stays flat). This validates the TTFT/slope justification — and sets the concrete TTFT target — before more is built. Success criteria:

A reproducible harness reports tool-prompt token cost for ChatAgent on both the text path (_format_tools_for_prompt) and the native path (_build_openai_tool_schemas).
Prompt-cost slope (tokens per added tool) is quantified for both paths.
Baseline first-turn TTFT is recorded for the unfiltered agent.
A go/no-go note states whether the measured cost justifies the loader, and sets the concrete TTFT-reduction target Part 1 must hit.

Part 1 — Selection + dual-path filtering ✅ landed (#1449)

Wire selection into _compose_system_prompt; filter both _format_tools_for_prompt and _build_openai_tool_schemas.
CORE always-on + semantic via _embed_text; bundles as cohesion groups.
Fix the caching seam: recompute when selection changes; tool block last.
Embedder down → session-disable fallback (mirror memory v2), logged.
Memory disabled or loader toggle off → expose all tools (legacy path). See When memory is off.
Leave _execute_tool on the full registry. This alone gives the free non-tool-calling recovery (an unlisted tool the model names still runs) — so Part 1 is not without a safety net, even though the explicit escape-hatch machinery lands in Part 2.
Default-off toggle; scope to ChatAgent doc.

Success criteria:

First-turn TTFT drops by at least the Part-0 target with the loader on.
Tool recall ~100% — every tool the unfiltered baseline called was available when the loader-on run called it (the merge gate).
Task success within ~2% of the unfiltered baseline on gaia eval agent.
No later-turn TTFT regression — turns 2+ stay warm-cached (selection stable).
Both render paths verified — text block and native tools= schemas measurably shrink, not just one.
All three off-states revert to all tools — memory disabled, loader toggle off, and embedder unavailable each expose the full registry (the last logs).

How Part 1 shipped (implementation reference)

Toggle & tunables. Default-off ChatAgentConfig fields, each overridable by an env var (env wins) so the eval harness can flip them without code changes:

Knob	Config field	Env override	Default
Enable	`dynamic_tools`	`GAIA_DYNAMIC_TOOLS` (`1`/`true`/`yes`/`on`)	`False`
Match threshold τ	`dynamic_tools_threshold`	`GAIA_DYNAMIC_TOOLS_TAU`	`0.20` (inclusive)
Loaded-set cap	`dynamic_tools_max`	`GAIA_DYNAMIC_TOOLS_MAX`	`14`

A malformed *_TAU / *_MAX value raises at construction (no silent default). The loader is built only for prompt_profile == "doc" with the toggle on; otherwise self.tool_loader is None and the agent stays on the legacy path. As of #1798 the dynamic_tools enable knob is also reachable as a Beta toggle in the Agent UI Settings panel (default off). GAIA_DYNAMIC_TOOLS still wins when set — the toggle then reflects the effective value and disables — so τ and the cap stay env-only tuning. CORE (10, always-on, cap- & eviction-exempt) — defined in tool_bundles.py: remember, recall, update_memory, forget, search_past_conversations, read_file, query_documents, query_specific_file, set_loop_state, request_user_input. Bundles (cohesion groups, pulled in whole on a member match): rag_query, rag_index, file_search, file_browse, file_edit, data, shell, clipboard, desktop, vision, memory, loop_control. CORE ∪ all bundle members must equal the 37-tool doc registry exactly. The drift guard is the CI test test_chat_tool_bundles.py (it compares both sets, so a new doc tool forces a conscious bundling decision); ToolLoader.validate_registry, called once on first select, additionally fails loudly at runtime if a CORE/bundle name is missing from the registry. Selection each turn = CORE ∪ semantically-matched tools (+ their bundles), accumulated into a session-scoped loaded set that only grows, then sorted. The query embedded is the previous user message + current, trailing 4,000 chars (Open Q4). Below the cap the set is monotonic; at the cap a non-CORE tool is LRU-evicted (oldest last_call_ts, falling back to load time for never-called tools); CORE and tools admitted this turn are exempt; if nothing is evictable the candidate is skipped and logged. Each turn emits one structured TOOL_LOADER {json} INFO line (scores, matched, bundle_pulled, admitted, evicted, skipped, loaded) for Part-2 tuning and the recall gate. Caching seam. The system prompt is recomputed only when the sorted selection changes, so steady-state turns serialize byte-identically and the backend KV prefix stays warm. When a filter is active the tools block moves after the response-format template (volatile content last); with no filter the legacy order and bytes are preserved exactly. Native known gap (Amendment 2) — closed by Part 2. _execute_tool is never tightened, so a non-tool-calling model that names an unlisted tool still runs it (free recovery) and the loader logs TOOL_LOADER_ESCAPE_HATCH. In Part 1 native tool-calling models had no such hatch — a semantic miss could not self-recover. Part 2 closes the recovery gap with the always-on load_tools meta-tool (the model loads the bundle it needs and calls the tool on its next step), and the recall gate’s native exemption is removed accordingly. Approved deviations from this sketch (flagged in the #1449 PR):

No finish in CORE — turn completion is protocol-level in GAIA; there is no finish registry tool. CORE = 10 real names.
max_tools=14, not 8–12 — CORE alone is 10 and eviction-exempt, so the cap is 10 CORE + 4 dynamic slots.
Config via ChatAgentConfig + env, not config.toml — the repo has no [agent.tool_loader] TOML section; tunables live on the dataclass with GAIA_DYNAMIC_TOOLS* env overrides, leaving GaiaConfig untouched.

Measured reduction is smaller than the original estimate. The 5 memory CORE tools carry verbose docstrings, so CORE alone renders ~40% of the 37-tool native baseline — meaning CORE-only is the ~60%-reduction best case and a full 14-tool loaded set lands around ~50% of baseline (~50% token reduction), not the ~68% first guessed. Token count is only a proxy; the authoritative ≥60% first-turn TTFT gate is the live gaia.eval.tool_cost --live-ttft --filter run. test_tool_loader_token_budget.py pins these filtered costs as a static guard.

Part 2 — Explicit escape hatch + tuning ✅ landed (#1450)

Add bundle re-surfacing + a discoverability menu of bundle names, and the load_tools meta-tool that native tool-calling models need (the free recovery from Part 1 covers only the non-tool-calling path).
Instrument escape-hatch activation rate as the semantic-threshold tuning dial.

Success criteria:

A native tool-calling model recovers a semantically-missed tool via load_tools within one extra turn (demonstrated end-to-end).
The non-tool-calling free-recovery path is verified (an unlisted tool the model names still executes).
Hard recall failures = 0 — with recovery in place, no task fails because a needed tool was permanently unreachable.
Escape-hatch activation rate is logged per session and usable as the threshold-tuning signal (rising rate ⇒ τ too strict).

How Part 2 shipped (implementation reference)

load_tools is always-on via CORE. load_tools is added to DOC_CORE_TOOLS (CORE = 11), so once registered it renders in both the text prompt and the native tools= schema every active turn and is cap-/eviction-exempt. It is registered only when the loader is active (self.tool_loader is not None), so the default-off doc path stays byte-identical — the unfiltered 37-tool baseline is unchanged. Recovery lands on the next model step, not the next user turn. The load_tools(bundle) handler calls ToolLoader.load_bundle, then Agent._apply_tool_filter — the one place the active filter and the cached system prompt move together. Because system_prompt and _openai_tools are read live at every LLM call, the expanded set is visible to the very next step in the same query, which is what lets smart_discovery recover on turn 1. load_bundle is cap-aware. It resolves a bundle name (or a bare tool name, via the reverse index) and admits members with the same LRU-evict path select() uses — protecting CORE and the members being loaded now — so max_tools holds at all times. It emits a same-turn TOOL_LOADER {…, "event": "load_tools", …} superset line. Menu is stable and native-only. A compact bundle menu (name + one-line description, from ToolBundle.description) is injected into the stable prefix of the doc system prompt (before the volatile tools tail → no KV thrash), and only for native tool-calling models — non-native models already have free recovery and are the TTFT-sensitive path. Tuning signal is log-derived. The loader counts escape-hatch (free) and load_tools (explicit) activations per session and emits a TOOL_LOADER_SESSION summary on reset_session() (escape_hatch_rate = (escape_hatch + load_tools) / turns). gaia.eval.tool_recall aggregates these from the server log and reports the per-turn rate alongside recall — no UI-DB migration. Recall gate flipped correctly. tool_recall.py unions same-turn load_tools superset lines into that turn’s loaded set and treats load_tools as always-satisfied; only then is the native “known gap” exemption removed, so a successful recovery passes the gate and a genuinely unrecovered miss fails it on every model. Cap unchanged at 14 (→ 3 dynamic slots now that CORE = 11). The eval gates recall; bump the default only if recall or the escape-hatch rate regresses.

Part 3 — Skill-driven signal ✅ landed (#1451)

A third selection signal, unblocked once #887 (skill auto-synthesis) landed (PR #1794, refined by #1818/#1828). A skill is a procedure the agent distilled from its own successful multi-step runs; each stored procedure declares the exact tools_required for that recipe.

When recall_skill(goal) matches the user’s goal, union in the matched skills’ tools_required ahead of semantic results.
This is high precision, low recall: it fires only for goals solved before, but when it does it is exact — the tools come from a recorded successful run, not a similarity guess. It complements semantic match (high recall, lower precision), it does not replace it.
Same retrieval mechanism, different corpus. Skills are stored semantically in the same MemoryStore as #606 memory — each procedure carries an embedding, and recall_skill(goal) is a vector search over procedures.embedding. So this tier reuses the same embedder as the semantic tier; it just searches learned procedures instead of tool docs.
Additive, not a reshape. It slots in as one more input to the same selector: loaded = CORE ∪ SKILL ∪ SEMANTIC ∪ escape-hatch. Precedence is CORE > SKILL > SEMANTIC. Nothing in Parts 0–2 changes.
Graceful absence is feature-detected, not import-guarded. When no procedure has been synthesized yet — the state every new user is in — the procedures corpus is empty, so recall_skill returns [] at its index.ntotal == 0 guard and the loader runs on CORE + semantic exactly as in Parts 1–2. The signal checks the populated corpus at runtime, never catches a missing module, so the regression test exercises the real fall-through path.

Success criteria:

When a skill matches the goal, its tools_required are loaded ahead of semantic results (verified with a fixture skill).
Measurable precision lift on tasks with a known matching skill — fewer escape-hatch activations and/or higher recall than the Part 1–2 baseline.
Graceful absence — with #887 disabled/absent, the signal returns [] and Parts 0–2 behavior is byte-for-byte unchanged (regression test).
No regression when no skill matches the goal (falls through to semantic).

How Part 3 shipped (implementation reference)

The loader stays memory-agnostic. recall_skill and Skill are not imported into tool_loader.py (that would be an upward dependency). Instead ChatAgent — the composition layer — flattens the recalled procedures’ tools_required into a List[str] and passes it to select() via a new keyword-only skill_tools param. The loader admits plain tool names; it never learns what a skill is. Recall runs once per turn; the loader reuses the cache (zero extra TTFT cost). recall_skill already runs once per turn in MemoryMixin._refresh_recalled_skills (before tool selection, via the process_query → super() ordering). Part 3 refactors that method to cache the matched Skill objects in self._recalled_skills alongside the rendered prompt; _recalled_skill_tools() reads that cache. So the SKILL signal adds no second embed/FAISS call — TTFT is the whole point of the loader. Precedence is realized by admission order, not set union. select() admits CORE first (cap-exempt), then the SKILL tools (cap-bound, in recall order, deduped, no bundle pull-in — the exact recipe), then the semantic candidates. Because new_candidates is computed after, it naturally excludes skill-admitted tools. A fat recipe is cap-bound like any non-CORE tool, so it can never exceed max_tools; an idle skill tool LRU-evicts next turn, and recall re-runs each turn so the set self-heals. Graceful absence is feature-detected. _recalled_skill_tools() returns [] whenever self._recalled_skills is empty — the state of every new user (empty procedures corpus → recall_skill []). With skill_tools None/empty the loaded set and the TOOL_LOADER log line are byte-identical to a CORE+SEMANTIC build (the skill key is emitted only when the signal fires). The query asymmetry is intentional. The semantic query is previous + current user message (Part 1), but the SKILL signal derives from the clean current goal that recall_skill matched — skill recall matches a goal-shaped trigger, not a conversation window, so the prior turn would blur the match. Where the criteria are proven. The mechanism (ahead-of-semantic, escape-hatch avoided, graceful absence, no double recall) is proven deterministically in test_tool_loader_selection.py and test_memory_mixin.py (pure-numpy embedder, no model/Lemonade/GPU). The metric (precision lift / no-regression) is the gaia eval agent --category tool_selection run vs the committed scorecard_tool_selection.json baseline — a cold eval runs from an empty corpus, so it proves Part 3 = Parts 0–2 (the no-regression criterion); the precision lift needs a seeded procedure matching a scenario goal.

Open questions — resolved in Part 1

These were open in the design sketch; Part 1 (#1449) decided them as follows:

Bundle definitions and CORE membership — decided. CORE = 10 names and 12 bundles, in tool_bundles.py (see How Part 1 shipped), pinned to cover the 37-tool doc registry exactly.
Similarity threshold τ / cap — decided. τ = 0.20 inclusive, cap = 14 (both tunable via config/env). τ was calibrated against live nomic embeddings: question↔tool-description similarity is weak (real scores ~0.05–0.25), so the original 0.55 guess made the semantic tier inert (CORE-only). 0.20 surfaces document-action tools while excluding noise.
Selection stability policy — decided. Re-select every turn, but the loaded set only grows (“expand-on-new-match”) and the prompt is recomputed only when it changes, so non-expansion turns stay KV-warm. At the cap a non-CORE tool is LRU-evicted.
Query construction — decided. Previous user message + current, trailing 4,000 chars (assistant/RAG text excluded by construction).
Config / toggle location — decided. ChatAgentConfig fields + GAIA_DYNAMIC_TOOLS* env overrides (no config.toml section; GaiaConfig untouched).
Part-0 methodology — decided in #1448. gaia.eval.tool_cost measures deterministic cost/slope and live prompt-prefill TTFT.

Current state of the code

tool_loader.py is the live semantic loader (Part 1, #1449): ChatAgent builds it for the doc profile when the toggle is on, calls select() each turn via the base _select_tools_for_turn hook, and both render paths filter from the same selection. The old keyword/bundle-policy skeleton was removed; the class name ToolLoader and reset_session() were kept so the existing (guarded) call sites in cli.py / chat/app.py needed no change. Recall recovery for native tool-calling models has shipped (Part 2, #1450): the loader exposes bundle_names / format_bundle_menu / load_bundle and per-session escape-hatch counters; ChatAgent registers the load_tools meta-tool and injects the native-only bundle menu; and gaia.eval.tool_recall unions mid-loop load_tools lines, drops the native exemption, and reports the escape-hatch activation rate. The skill-driven tier has shipped (Part 3, #1451): select() takes a keyword-only skill_tools list and admits it after CORE and ahead of the semantic candidates; ChatAgent._select_tools_for_turn feeds it MemoryMixin._recalled_skill_tools(), which flattens the tools_required of the procedures recall_skill matched this turn — reusing the per-turn recall cache, so the tier adds no extra embed/FAISS cost and stays byte-identical to Parts 1–2 whenever no procedure matches.

How #800 (scratchpad/memory collision) is resolved

#800 tracked a feared collision between scratchpad.query_data and memory.recall. The loader resolves it by design asymmetry, and the literal pair never actually arises: scratchpad tools are registered only for the data/full profiles (chat/agent.py), never doc — the only profile the loader is wired to. The doc-profile analog the loader does arbitrate is analyze_data_file (the structured-data tool) vs recall:

recall ∈ CORE (tool_bundles.py DOC_CORE_TOOLS) — always loaded, cap- and eviction-exempt: persistent recall is always relevant.
analyze_data_file ∈ the conditional data bundle — loaded only when the turn’s query semantically clears τ.

So the two co-occur only when the turn justifies the data tool; recall is never the gated side. Pinned by tests/unit/test_tool_loader_disambiguation.py (deterministic, real CORE/bundle config) and the live eval/scenarios/tool_selection/data_vs_recall_disambiguation.yaml scenario.

Dependencies

#606 (memory v2) — landed. Provides MemoryMixin._embed_text (nomic-embed-text-v2-moe-GGUF, 768-dim) already mixed into ChatAgent. Hard dependency for semantic selection; no new infra needed. Note memory is user-controllable — when the user turns it off, the loader reverts to the legacy all-tools path (see When memory is off).
#887 (procedural memory / skills) — landed (PR #1794, refined by #1818/#1828). Provides recall_skill and the per-procedure tools_required. The skill tier shipped on top of it as Part 3; when the procedures corpus is empty (every new user) recall_skill returns [], so the loader runs on CORE + semantic exactly as in Parts 0–2.

What's Next

Agent UI

Ecosystem

Infrastructure

Agents

Dynamic Tool Loader

Why this exists

Decided design

Bundles are cohesion groups, not triggers

The escape hatch — free on one path, explicit on the other

When memory is off, all tools load (the legacy path)

Correcting the original spec

The two render paths (most important correction)

KPIs

The cache-thrash trap (TTFT’s flip side)

Phased build

Part 0 — Measure first (do this before any loader code)

Part 1 — Selection + dual-path filtering ✅ landed (#1449)

How Part 1 shipped (implementation reference)

Part 2 — Explicit escape hatch + tuning ✅ landed (#1450)

How Part 2 shipped (implementation reference)

Part 3 — Skill-driven signal ✅ landed (#1451)

How Part 3 shipped (implementation reference)

Open questions — resolved in Part 1

Current state of the code

How #800 (scratchpad/memory collision) is resolved

Dependencies

​Why this exists

​Decided design

​Bundles are cohesion groups, not triggers

​The escape hatch — free on one path, explicit on the other

​When memory is off, all tools load (the legacy path)

​Correcting the original spec

​The two render paths (most important correction)

​KPIs

​The cache-thrash trap (TTFT’s flip side)

​Phased build

​Part 0 — Measure first (do this before any loader code)

​Part 1 — Selection + dual-path filtering ✅ landed (#1449)

​How Part 1 shipped (implementation reference)

​Part 2 — Explicit escape hatch + tuning ✅ landed (#1450)

​How Part 2 shipped (implementation reference)

​Part 3 — Skill-driven signal ✅ landed (#1451)

​How Part 3 shipped (implementation reference)

​Open questions — resolved in Part 1

​Current state of the code

​How #800 (scratchpad/memory collision) is resolved

​Dependencies

Why this exists

Decided design

Bundles are cohesion groups, not triggers

The escape hatch — free on one path, explicit on the other

When memory is off, all tools load (the legacy path)

Correcting the original spec

The two render paths (most important correction)

KPIs

The cache-thrash trap (TTFT’s flip side)

Phased build

Part 0 — Measure first (do this before any loader code)

Part 1 — Selection + dual-path filtering ✅ landed (#1449)

How Part 1 shipped (implementation reference)

Part 2 — Explicit escape hatch + tuning ✅ landed (#1450)

How Part 2 shipped (implementation reference)

Part 3 — Skill-driven signal ✅ landed (#1451)

How Part 3 shipped (implementation reference)

Open questions — resolved in Part 1

Current state of the code

How #800 (scratchpad/memory collision) is resolved

Dependencies