Source Code:
src/gaia/eval/

This is not the general evaluation framework. This page covers gaia eval agent — the scenario-based benchmark that stress-tests the live Agent UI end-to-end. For batch experiments and ground-truth generation, see the Evaluation Framework.

Overview

The Agent Eval Benchmark drives the live Agent UI through multi-turn conversations, then judges every response with an LLM (claude-sonnet-4-6 by default). Each scenario creates a real Agent UI session via MCP, sends user messages, captures the full transcript, and produces a scored evaluation. 54 YAML scenario files span 10 categories covering RAG quality, context retention, tool selection, error recovery, hallucination resistance, adversarial inputs, personality compliance, vision capabilities, web/system tools, and real-world documents.

Why this matters: Unlike the general eval framework (which compares isolated model outputs), the Agent Eval Benchmark tests the full system end-to-end — RAG indexing, tool dispatch, context window management, multi-turn state, and hallucination resistance — through the same Agent UI that real users interact with.

Key Features:
- Multi-turn scenario simulation with persona-driven user messages
- 7-dimension scoring rubric with deterministic weighted aggregation
- Automated fix mode that invokes Claude Code to repair failures and re-evaluate
- Regression testing with baseline comparison and per-scenario deltas
- Architecture audit mode (no LLM calls) to detect structural limitations
- CI/CD integration with budget and timeout controls
Architecture
The benchmark runs as two distinct processes connected over MCP: a Python orchestrator (AgentEvalRunner) that manages scenarios, timeouts, and scoring; and a claude -p subprocess per scenario that acts as both user simulator and LLM judge. The system under test — the Agent UI and its Lemonade backend — runs independently and is treated as a black box.
System Overview
Key design decisions:

| Decision | Rationale |
|---|---|
| One claude -p subprocess per scenario | Isolates eval agent state — no cross-scenario memory leakage. API cost is bounded per scenario by --max-budget-usd, not per run |
| --output-format json --json-schema | Forces the eval agent to emit a machine-parseable result dict. The runner never parses free-form text |
| Prompts inlined into system prompt at runtime | The claude -p subprocess has no file-read tools. simulator.md, judge_turn.md, and judge_scenario.md are read by the runner and concatenated into the -p prompt string |
| Score recomputed deterministically | The runner overwrites the eval agent’s arithmetic using _SCORE_WEIGHTS — consistent results regardless of which model judges |
| Progress tracking via .progress.json | Interrupted runs resume from the last completed scenario; corrupt or missing traces are re-run automatically |
Eval Agent Lifecycle (per-scenario subprocess)
Each claude -p subprocess runs a 6-phase protocol. The eval agent has access to the Agent UI MCP server tools and uses them to drive a real session:
Phase 2 detail: The eval agent generates natural language user messages from the turn’s objective and persona — not verbatim copies. It calls send_message() and waits for the full agent response before scoring. It does not retry on poor responses; it scores and moves to the next turn regardless.
Error short-circuits and skips:
| Condition | Status | Subprocess invoked? |
|---|---|---|
| Corpus file not on disk | SKIPPED_NO_DOCUMENT | No — runner skips entirely |
| system_status() HTTP error | INFRA_ERROR | Yes — exits Phase 1 early |
| index_document() returns 0 chunks (non-adversarial) | SETUP_ERROR | Yes — exits Phase 1 early |
| Subprocess wall-clock timeout | TIMEOUT | Yes — killed by runner |
| Claude API budget cap hit | BUDGET_EXCEEDED | Yes — Claude returns error_max_budget_usd |
Score Computation Pipeline
After each subprocess returns its JSON result, the runner validates and deterministically overwrites the eval agent's arithmetic before writing the trace file. The recomputation applies in three passes:

1. Per-turn: recompute_turn_score(scores_dict) applies _SCORE_WEIGHTS. If the recomputed value differs from the eval agent's reported value by more than 0.25, the discrepancy is logged and the recomputed value wins. The per-turn pass flag is also recalculated (correctness ≥ 4 AND computed ≥ 6.0).

2. Scenario-level: overall_score is recomputed as the arithmetic mean of recomputed per-turn scores, replacing the eval agent's scenario-level value entirely.

3. Status re-derivation: The runner applies the rubric rules to recomputed values. An eval-agent-reported PASS can be overridden to FAIL (if any turn has correctness < 4 or overall_score < 6.0), and a reported FAIL can be upgraded to PASS (if all turns satisfy both criteria). BLOCKED_BY_ARCHITECTURE is never overridden — if it passes all rubric criteria, a warning is emitted for human review instead of an automatic upgrade. True infrastructure statuses (TIMEOUT, BUDGET_EXCEEDED, INFRA_ERROR, SETUP_ERROR) are also never overridden.

Average score integrity: In scorecard.json, FAIL scenario scores are capped at 5.99 before computing avg_score. A scenario can score 9.8/10 on five of seven dimensions and still FAIL on hallucination — that 9.8 would inflate the benchmark's quality signal if included raw.
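The recomputation passes can be sketched as follows. The weights mirror the 7-dimension rubric in the Scoring System section and the thresholds come from the rules above; the function bodies are illustrative, not the actual src/gaia/eval implementation.

```python
# Weights from the 7-dimension rubric (sum to 1.0).
_SCORE_WEIGHTS = {
    "correctness": 0.25, "tool_selection": 0.20, "context_retention": 0.20,
    "completeness": 0.15, "efficiency": 0.10, "personality": 0.05,
    "error_recovery": 0.05,
}

def recompute_turn_score(scores: dict[str, float]) -> float:
    """Pass 1: deterministic weighted aggregate over the 7 dimensions."""
    return sum(scores[dim] * w for dim, w in _SCORE_WEIGHTS.items())

def turn_pass(scores: dict[str, float]) -> bool:
    """Per-turn pass flag: correctness >= 4 AND computed score >= 6.0."""
    return scores["correctness"] >= 4 and recompute_turn_score(scores) >= 6.0

def scenario_score(turn_scores: list[dict[str, float]]) -> float:
    """Pass 2: overall_score is the mean of recomputed per-turn scores."""
    values = [recompute_turn_score(s) for s in turn_scores]
    return sum(values) / len(values)

def derive_status(turn_scores: list[dict[str, float]]) -> str:
    """Pass 3: re-derive PASS/FAIL from the recomputed values."""
    if any(s["correctness"] < 4 for s in turn_scores):
        return "FAIL"
    return "PASS" if scenario_score(turn_scores) >= 6.0 else "FAIL"
```

Because the runner recomputes everything from the raw dimension scores, the judge model only has to score dimensions honestly; its own arithmetic is never trusted.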
Fix Mode Loop
When --fix is passed, the runner repeats a diagnose-repair-retest cycle:
The fixer subprocess (claude -p fixer.md) receives the scorecard.json path, summary.md path, and a JSON list of failing scenario IDs with their root_cause and recommended_fix fields. It patches files in src/gaia/ and writes a fix_log.json documenting each change. The loop exits early if judged_pass_rate ≥ --target-pass-rate or all scenarios pass.
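The loop's control flow might look like the sketch below; run_eval and run_fixer are hypothetical stand-ins for the real steps, while the exit conditions reflect the --max-fix-iterations and --target-pass-rate semantics described on this page.

```python
def fix_mode_loop(run_eval, run_fixer,
                  max_fix_iterations: int = 3,
                  target_pass_rate: float = 0.90) -> dict:
    """Diagnose-repair-retest cycle: exits early once judged_pass_rate
    reaches the target or no scenarios are failing."""
    scorecard = run_eval()  # Phase A: full eval run
    for _ in range(max_fix_iterations):
        failing = [s for s in scorecard["scenarios"] if s["status"] == "FAIL"]
        if not failing or scorecard["judged_pass_rate"] >= target_pass_rate:
            break
        run_fixer(failing)      # Phase B: fixer patches src/gaia/
        scorecard = run_eval()  # Phase C/D: re-evaluate and rescore
    return scorecard
```

In the real runner only the previously failed scenarios are re-run after each fix, which this simplified sketch glosses over.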
Prerequisites
Verify Claude Code CLI
The runner invokes scenarios via a claude -p subprocess. If not installed, see Claude Code installation.

Quick Start

After a run completes, results are written to eval/results/<run_id>/.
Scenario Categories
| Category | Scenarios | What It Tests |
|---|---|---|
| rag_quality | 7 | Factual extraction, hallucination resistance, negation handling, table/CSV data, cross-section synthesis, budget queries |
| context_retention | 4 | Pronoun resolution, cross-turn file recall, multi-document context, conversation summary |
| tool_selection | 4 | Choosing the right tool, smart discovery (no docs indexed — find and index), multi-step planning, no-tool-needed detection |
| error_recovery | 3 | File-not-found graceful handling, empty search fallback, vague request clarification |
| adversarial | 3 | Empty file, large document (>100k tokens), topic switching |
| personality | 3 | Concise responses, no sycophancy, honest limitation acknowledgement |
| vision | 3 | Screenshot capture, VLM graceful degradation, SD graceful degradation |
| real_world | 19 | Real PDFs, XLSX, specs (10-K filings, GDPR articles, RFC specs, technical datasheets, license texts, government data) |
| web_system | 6 | Clipboard tools, desktop notifications, webpage fetching, window listing, system info, text-to-speech |
| captured | 2 | Golden-path replays from real Agent UI sessions |
Scoring System
The judge evaluates each turn across 7 dimensions with fixed weights:

| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 25% | Factual accuracy against ground truth |
| Tool Selection | 20% | Chose the right tools; did not over-use or skip tools |
| Context Retention | 20% | Remembered prior turns; resolved pronouns; no re-indexing needed |
| Completeness | 15% | Answered all parts of the question |
| Efficiency | 10% | Did not make unnecessary tool calls or ask redundant clarifications |
| Personality | 5% | Tone, conciseness, avoiding sycophancy |
| Error Recovery | 5% | Gracefully handled missing files, empty results, ambiguous queries |
Pass / Fail Rules
- PASS: overall_score >= 6.0 AND no turn has correctness < 4
- FAIL: overall_score < 6.0 OR any turn has correctness < 4
Severity Levels
- critical — Automatic FAIL if the agent hallucinates, invents facts, or fails the primary objective. Scenarios like hallucination_resistance, cross_turn_file_recall, and smart_discovery use this level.
- standard — Scored purely on the numeric threshold.
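A sketch of how severity might gate the final verdict. The hallucinated and primary_objective_met flags are assumed names for what the judge reports; the critical-severity rule itself is the one stated above.

```python
def apply_severity(status: str, severity: str,
                   hallucinated: bool, primary_objective_met: bool) -> str:
    """critical scenarios auto-FAIL on hallucination or a missed primary
    objective; standard scenarios keep the numeric-threshold verdict."""
    if severity == "critical" and (hallucinated or not primary_objective_met):
        return "FAIL"
    return status
```

For standard-severity scenarios a hallucination still hurts, but only through the correctness dimension of the score.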
Status Legend
Statuses are grouped by how they affect scoring. Judged statuses count toward avg_score and judged_pass_rate. Infrastructure statuses are excluded from quality metrics — they indicate environmental issues, not agent quality.
| Status | Type | Meaning |
|---|---|---|
| PASS | Judged | Scenario passed all criteria |
| FAIL | Judged | Score below threshold or critical failure |
| BLOCKED_BY_ARCHITECTURE | Judged | Agent UI architecture prevents success (e.g., history window too small). Counts toward avg_score but status is never overridden to PASS — a warning is logged instead |
| TIMEOUT | Infrastructure | Scenario exceeded time limit |
| BUDGET_EXCEEDED | Infrastructure | Claude API budget cap hit before completion |
| INFRA_ERROR | Infrastructure | Agent UI backend unreachable or MCP failure |
| SETUP_ERROR | Infrastructure | Document indexing failed (0 chunks) |
| SKIPPED_NO_DOCUMENT | Infrastructure | Corpus file not present on disk (e.g., real-world docs not committed) |
| ERRORED | Infrastructure | Eval agent crashed, returned non-JSON, or encountered an unexpected error |
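The judged/infrastructure split above, combined with the 5.99 FAIL cap from the Score Computation Pipeline, suggests scorecard aggregation along these lines (a sketch; key names and the exact pass-rate denominator are assumptions):

```python
# Statuses that count toward quality metrics.
JUDGED = {"PASS", "FAIL", "BLOCKED_BY_ARCHITECTURE"}

def scorecard_metrics(scenarios: list[dict]) -> dict:
    """Aggregate avg_score and judged_pass_rate over judged scenarios only.

    Infrastructure statuses (TIMEOUT, INFRA_ERROR, ...) are excluded;
    FAIL scores are capped at 5.99 so a high-scoring critical failure
    cannot inflate avg_score.
    """
    judged = [s for s in scenarios if s["status"] in JUDGED]
    if not judged:
        return {"avg_score": 0.0, "judged_pass_rate": 0.0}
    capped = [
        min(s["overall_score"], 5.99) if s["status"] == "FAIL"
        else s["overall_score"]
        for s in judged
    ]
    return {
        "avg_score": sum(capped) / len(capped),
        "judged_pass_rate":
            sum(s["status"] == "PASS" for s in judged) / len(judged),
    }
```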
Test Corpus
The benchmark ships with a synthetic corpus in eval/corpus/documents/ with ground truth facts defined in eval/corpus/manifest.json.
| File | Format | Domain | Sample Facts |
|---|---|---|---|
| acme_q3_report.md | Markdown | Finance | Q3 revenue: $14.2M; CEO Q4 outlook: 15–18% growth |
| employee_handbook.md | Markdown | HR Policy | PTO (first year): 15 days; Remote work: up to 3 days/week |
| sales_data_2025.csv | CSV | Sales | Top salesperson: Sarah Chen, 340,000 |
| product_comparison.html | HTML | Product | StreamLine: 79/mo, 4.7 stars |
| api_reference.py | Python | Technical | Auth: Bearer token via Authorization header |
| meeting_notes_q3.txt | Text | General | Next meeting: October 15, 2025 at 2:00 PM |
| budget_2025.md | Markdown | Finance | Total budget: 1.3M; CFO approval threshold: $50K |
| large_report.md | Markdown | Compliance | Section 52 finding (adversarial: >100k tokens) |
| sample_chart.png | Image | Test | 1x1 pixel test image for vision scenarios |
Additional edge-case files (empty.txt, unicode_test.txt, duplicate_sections.md) are used by the adversarial category.
CLI Reference
| Flag | Default | Description |
|---|---|---|
| --scenario ID | — | Run one scenario by ID |
| --category NAME | — | Run all scenarios in a category |
| --audit-only | false | Check architecture constraints without running LLM calls |
| --generate-corpus | false | Regenerate corpus documents and validate manifest.json |
| --backend URL | http://localhost:4200 | Agent UI backend URL |
| --model MODEL | claude-sonnet-4-6 | Judge model |
| --budget USD | 2.00 | Max spend per scenario |
| --timeout SECS | 900 | Per-scenario timeout (auto-scaled for large-doc and multi-turn scenarios) |
| --fix | false | Auto-invoke Claude Code to repair failures, then re-eval |
| --max-fix-iterations N | 3 | Max repair cycles in --fix mode |
| --target-pass-rate N | 0.90 | Stop fixing early when pass rate reaches this threshold |
| --compare PATH... | — | Compare two scorecard.json files or compare against saved baseline |
| --save-baseline | false | Save this run's scorecard as eval/results/baseline.json |
| --capture-session UUID | — | Convert a live Agent UI session into a YAML scenario |
Fix Mode
Fix mode automates the repair loop: evaluate, diagnose failures, patch source code, and re-evaluate. Phases:

- Phase A: Full eval run — All scenarios (or filtered set) execute normally
- Phase B: Diagnose + repair — Claude Code reads failing scenario transcripts and patches Agent UI source files
- Phase C: Re-run failures — Only the previously failed scenarios are re-evaluated
- Phase D: Diff scorecard — Produces a comparison showing regressions and improvements
Fix priorities:
- Critical severity scenarios first
- Architecture fixes (in _chat_helpers.py, base agent classes) before prompt fixes
- Multi-scenario failures before single-scenario issues

Fix mode uses Claude Code to patch src/gaia/ source files. Review diffs before committing. Always run python util/lint.py --all --fix after fix iterations.

Regression Testing
Comparing a run against a saved baseline (--compare) reports:
- Per-scenario delta: PASS to FAIL regressions (highlighted), FAIL to PASS improvements
- Category-level pass rate change
- Score delta per scenario (warns when score drops by more than 2.0 points within the same status)
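The comparison logic can be sketched as follows, assuming each scorecard maps scenario IDs to a status and score (the key names are assumptions; the 2.0-point warning threshold and the regression/improvement categories come from the list above):

```python
def compare_scorecards(baseline: dict, current: dict) -> dict:
    """Per-scenario delta between two runs, keyed by scenario ID."""
    regressions, improvements, score_warnings = [], [], []
    for sid, cur in current.items():
        base = baseline.get(sid)
        if base is None:
            continue  # scenario not present in the baseline run
        if base["status"] == "PASS" and cur["status"] == "FAIL":
            regressions.append(sid)
        elif base["status"] == "FAIL" and cur["status"] == "PASS":
            improvements.append(sid)
        elif (base["status"] == cur["status"]
              and base["score"] - cur["score"] > 2.0):
            score_warnings.append(sid)  # same status, score dropped > 2.0
    return {"regressions": regressions, "improvements": improvements,
            "score_warnings": score_warnings}
```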
Writing Custom Scenarios
Scenario YAML files live under eval/scenarios/<category>/. The runner discovers them automatically via recursive glob.
Full Schema Example
Each turn needs at least one of ground_truth (non-null dict) or success_criteria (non-empty string) — providing both gives maximum judging precision. Valid personas: casual_user, data_analyst, power_user, confused_user, adversarial_user.

Drop a new YAML file into eval/scenarios/<category>/ and it will be picked up automatically on the next run.
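The constraints above suggest a scenario shaped roughly like the following sketch. Only persona, severity, ground_truth, and success_criteria are field names confirmed on this page; every other key is an illustrative assumption, not the actual schema.

```yaml
# Hypothetical sketch — id, category, turns, and objective are assumed keys.
id: rag_quality_custom_01
category: rag_quality
severity: critical
persona: data_analyst
turns:
  - objective: "Ask for Q3 revenue from the indexed acme_q3_report.md"
    ground_truth:
      q3_revenue: "$14.2M"
  - objective: "Follow up using a pronoun reference to the same report"
    success_criteria: "Resolves the pronoun to acme_q3_report.md without re-indexing"
```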
Capturing Real Sessions
The --capture-session flag reads the Agent UI chat database (~/.gaia/chat/gaia_chat.db), extracts turns and indexed documents, and writes a scenario YAML to eval/scenarios/captured/.
After capture, you must review and edit the generated file to add proper ground_truth and success_criteria fields — the capture tool populates the structure but cannot infer expected answers.
Architecture Audit
Audit mode (--audit-only) inspects known structural constraints without making any LLM calls:

- History window size (_MAX_HISTORY_PAIRS in _chat_helpers.py)
- Message truncation limits (_MAX_MSG_CHARS)
- Tool result persistence in conversation history
- Agent persistence model (stateless per-message vs. persistent)

The audit flags affected scenarios as BLOCKED_BY_ARCHITECTURE and provides recommendations (e.g., "increase _MAX_HISTORY_PAIRS to 10+"). Run this before the full benchmark to understand expected failures due to architecture limits rather than AI quality.
Output Files
After a run, results are written to eval/results/<run_id>/:
| File | Description |
|---|---|
| scorecard.json | Machine-readable results with per-scenario details, scores, and cost |
| summary.md | Human-readable pass/fail report with emoji status icons |
| traces/<scenario_id>.json | Full per-scenario trace (turns, dimension scores, reasoning) |
| fix_log.json | Written by --fix mode: list of files changed and rationale per fix |
| eval/results/baseline.json | Saved baseline (written by --save-baseline) |
Sample summary.md Output
CI/CD Integration
A GitHub Actions workflow (.github/workflows/test_eval.yml) runs structural validation (scenario YAML parsing, manifest integrity, scorecard generation) on every push to main or PR targeting main. Full LLM-driven eval runs are triggered via workflow_dispatch or scheduled separately.
Next Steps
- Evaluation Framework — Batch experiments, ground truth generation, and model comparison
- Agent UI Guide — The desktop chat application that the benchmark tests
- RAG SDK — Document indexing and retrieval under the hood
- Agent System — Base Agent class, tools, and state management