Source Code: src/gaia/eval/
This is not the general evaluation framework. This page covers gaia eval agent — the scenario-based benchmark that stress-tests the live Agent UI end-to-end. For batch experiments and ground-truth generation, see the Evaluation Framework.

Overview

The Agent Eval Benchmark drives the live Agent UI through multi-turn conversations, then judges every response with an LLM (claude-sonnet-4-6 by default). Each scenario creates a real Agent UI session via MCP, sends user messages, captures the full transcript, and produces a scored evaluation. 54 YAML scenario files span 10 categories covering RAG quality, context retention, tool selection, error recovery, hallucination resistance, adversarial inputs, personality compliance, vision capabilities, web/system tools, and real-world documents.

**Why this matters:** Unlike the general eval framework (which compares isolated model outputs), the Agent Eval Benchmark tests the full system end-to-end — RAG indexing, tool dispatch, context window management, multi-turn state, and hallucination resistance — through the same Agent UI that real users interact with.

**Key Features:**
  • Multi-turn scenario simulation with persona-driven user messages
  • 7-dimension scoring rubric with deterministic weighted aggregation
  • Automated fix mode that invokes Claude Code to repair failures and re-evaluate
  • Regression testing with baseline comparison and per-scenario deltas
  • Architecture audit mode (no LLM calls) to detect structural limitations
  • CI/CD integration with budget and timeout controls

Architecture

The benchmark runs as two distinct processes connected over MCP: a Python orchestrator (AgentEvalRunner) that manages scenarios, timeouts, and scoring; and a claude -p subprocess per scenario that acts as both user simulator and LLM judge. The system under test — the Agent UI and its Lemonade backend — runs independently and is treated as a black box.

System Overview

Key design decisions:
| Decision | Rationale |
|----------|-----------|
| One `claude -p` subprocess per scenario | Isolates eval agent state — no cross-scenario memory leakage. API cost is bounded per scenario by `--max-budget-usd`, not per run |
| `--output-format json --json-schema` | Forces the eval agent to emit a machine-parseable result dict. The runner never parses free-form text |
| Prompts inlined into system prompt at runtime | The `claude -p` subprocess has no file-read tools. `simulator.md`, `judge_turn.md`, and `judge_scenario.md` are read by the runner and concatenated into the `-p` prompt string |
| Score recomputed deterministically | The runner overwrites the eval agent's arithmetic using `_SCORE_WEIGHTS` — consistent results regardless of which model judges |
| Progress tracking via `.progress.json` | Interrupted runs resume from the last completed scenario; corrupt or missing traces are re-run automatically |

Eval Agent Lifecycle (per-scenario subprocess)

Each claude -p subprocess runs a 6-phase protocol. The eval agent has access to the Agent UI MCP server tools and uses them to drive a real session.

**Phase 2 detail:** The eval agent generates natural language user messages from the turn's objective and persona — not verbatim copies. It calls `send_message()` and waits for the full agent response before scoring. It does not retry on poor responses; it scores and moves to the next turn regardless.

Error short-circuits and skips:

| Condition | Status | Subprocess invoked? |
|-----------|--------|---------------------|
| Corpus file not on disk | SKIPPED_NO_DOCUMENT | No — runner skips entirely |
| `system_status()` HTTP error | INFRA_ERROR | Yes — exits Phase 1 early |
| `index_document()` returns 0 chunks (non-adversarial) | SETUP_ERROR | Yes — exits Phase 1 early |
| Subprocess wall-clock timeout | TIMEOUT | Yes — killed by runner |
| Claude API budget cap hit | BUDGET_EXCEEDED | Yes — Claude returns error_max_budget_usd |
Timeout scaling — the runner computes an effective timeout per scenario to account for document indexing and turn count:
effective_timeout = max(base_timeout,
                        120s startup overhead
                        + num_docs × 90s
                        + num_turns × 200s)
                    capped at 7200s
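The same scaling rule can be sketched in Python (function and constant names here are illustrative, not the runner's actual identifiers):

```python
def effective_timeout(base_timeout: int, num_docs: int, num_turns: int) -> int:
    """Scale the per-scenario timeout for indexing and turn count."""
    STARTUP_OVERHEAD_S = 120   # fixed session-startup cost
    PER_DOC_S = 90             # indexing allowance per corpus document
    PER_TURN_S = 200           # allowance per conversation turn
    CAP_S = 7200               # hard ceiling

    scaled = STARTUP_OVERHEAD_S + num_docs * PER_DOC_S + num_turns * PER_TURN_S
    return min(max(base_timeout, scaled), CAP_S)
```

With the default 900s base, a 2-document, 5-turn scenario gets 120 + 180 + 1000 = 1300s, while a tiny single-turn scenario keeps the 900s base.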

Score Computation Pipeline

After each subprocess returns its JSON result, the runner validates and deterministically overwrites the eval agent's arithmetic before writing the trace file. The recomputation applies in four passes:
  1. Per-turn: recompute_turn_score(scores_dict) applies _SCORE_WEIGHTS. If the recomputed value differs from the eval agent’s reported value by more than 0.25, the discrepancy is logged and the recomputed value wins. The per-turn pass flag is also recalculated (correctness ≥ 4 AND computed ≥ 6.0).
  2. Scenario-level: overall_score is recomputed as the arithmetic mean of recomputed per-turn scores, replacing the eval agent’s scenario-level value entirely.
  3. Status re-derivation: The runner applies the rubric rules to recomputed values. An eval-agent-reported PASS can be overridden to FAIL (if any turn has correctness < 4 or overall_score < 6.0), and a reported FAIL can be upgraded to PASS (if all turns satisfy both criteria). BLOCKED_BY_ARCHITECTURE is never overridden — if it passes all rubric criteria, a warning is emitted for human review instead of an automatic upgrade. True infrastructure statuses (TIMEOUT, BUDGET_EXCEEDED, INFRA_ERROR, SETUP_ERROR) are also never overridden.
  4. Average score integrity: In scorecard.json, FAIL scenario scores are capped at 5.99 before computing avg_score. A scenario can score 9.8/10 on five of seven dimensions and still FAIL on hallucination — that 9.8 would inflate the benchmark’s quality signal if included raw.
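The status re-derivation and FAIL-score capping can be sketched as follows; `derive_status` and `avg_score` are illustrative names, not the runner's actual functions, and the real runner handles skipped and errored scenarios before this point:

```python
FAIL_CAP = 5.99
PROTECTED = {"TIMEOUT", "BUDGET_EXCEEDED", "INFRA_ERROR", "SETUP_ERROR",
             "BLOCKED_BY_ARCHITECTURE"}  # BLOCKED gets a warning, never an upgrade

def derive_status(reported: str, overall: float, turn_correctness: list) -> str:
    """Re-derive PASS/FAIL from recomputed scores; protected statuses stand."""
    if reported in PROTECTED:
        return reported
    meets_rubric = overall >= 6.0 and all(c >= 4 for c in turn_correctness)
    return "PASS" if meets_rubric else "FAIL"

def avg_score(results: list) -> float:
    """Mean judged score; FAIL scores are capped so they can't inflate it."""
    capped = [min(score, FAIL_CAP) if status == "FAIL" else score
              for status, score in results]
    return sum(capped) / len(capped)
```

Note that a reported FAIL with recomputed scores above both thresholds is upgraded to PASS, and a 9.8-scoring FAIL contributes at most 5.99 to the average.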

Fix Mode Loop

When --fix is passed, the runner repeats a diagnose-repair-retest cycle: The fixer subprocess (claude -p fixer.md) receives the scorecard.json path, summary.md path, and a JSON list of failing scenario IDs with their root_cause and recommended_fix fields. It patches files in src/gaia/ and writes a fix_log.json documenting each change. The loop exits early if judged_pass_rate ≥ --target-pass-rate or all scenarios pass.

Prerequisites

1. **Install eval dependencies**

   uv pip install -e ".[eval]"

2. **Set up the judge model API key.** The benchmark uses Claude as the judge model. Export your API key:

   export ANTHROPIC_API_KEY=sk-ant-...

3. **Start the LLM backend.** Lemonade server provides the local LLM and embeddings for the Agent UI:

   lemonade-server serve

4. **Start the Agent UI backend**

   gaia chat --ui

5. **Verify the Claude Code CLI.** The runner invokes scenarios via claude -p subprocess:

   claude --version

   If not installed, see Claude Code installation.

Quick Start

# Run the full benchmark (all 54 scenarios)
gaia eval agent

# Run a single scenario by ID
gaia eval agent --scenario simple_factual_rag

# Run all scenarios in a category
gaia eval agent --category rag_quality

# Architecture audit only (no LLM calls, no cost)
gaia eval agent --audit-only
Results are written to eval/results/<run_id>/.

Scenario Categories

| Category | Scenarios | What It Tests |
|----------|-----------|---------------|
| rag_quality | 7 | Factual extraction, hallucination resistance, negation handling, table/CSV data, cross-section synthesis, budget queries |
| context_retention | 4 | Pronoun resolution, cross-turn file recall, multi-document context, conversation summary |
| tool_selection | 4 | Choosing the right tool, smart discovery (no docs indexed — find and index), multi-step planning, no-tool-needed detection |
| error_recovery | 3 | File-not-found graceful handling, empty search fallback, vague request clarification |
| adversarial | 3 | Empty file, large document (>100k tokens), topic switching |
| personality | 3 | Concise responses, no sycophancy, honest limitation acknowledgement |
| vision | 3 | Screenshot capture, VLM graceful degradation, SD graceful degradation |
| real_world | 19 | Real PDFs, XLSX, specs (10-K filings, GDPR articles, RFC specs, technical datasheets, license texts, government data) |
| web_system | 6 | Clipboard tools, desktop notifications, webpage fetching, window listing, system info, text-to-speech |
| captured | 2 | Golden-path replays from real Agent UI sessions |

Scoring System

The judge evaluates each turn across 7 dimensions with fixed weights:
| Dimension | Weight | What It Measures |
|-----------|--------|------------------|
| Correctness | 25% | Factual accuracy against ground truth |
| Tool Selection | 20% | Chose the right tools; did not over-use or skip tools |
| Context Retention | 20% | Remembered prior turns; resolved pronouns; no re-indexing needed |
| Completeness | 15% | Answered all parts of the question |
| Efficiency | 10% | Did not make unnecessary tool calls or ask redundant clarifications |
| Personality | 5% | Tone, conciseness, avoiding sycophancy |
| Error Recovery | 5% | Gracefully handled missing files, empty results, ambiguous queries |
Per-turn score is the weighted sum of all 7 dimensions (0–10 scale). The runner recomputes this deterministically from dimension scores rather than trusting the LLM’s arithmetic — ensuring consistent results regardless of which model is used as judge. Scenario-level score is the mean of all per-turn scores. FAIL scores are capped at 5.99 in the average so a single perfect FAIL cannot inflate the benchmark’s overall quality signal.
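As a worked example of the weighted sum (the dimension scores below are made up; the weights are the rubric's):

```python
# One turn's dimension scores, each on a 0-10 scale (illustrative values)
scores = {"correctness": 8, "tool_selection": 7, "context_retention": 6,
          "completeness": 9, "efficiency": 8, "personality": 10,
          "error_recovery": 10}
weights = {"correctness": 0.25, "tool_selection": 0.20,
           "context_retention": 0.20, "completeness": 0.15,
           "efficiency": 0.10, "personality": 0.05, "error_recovery": 0.05}

turn_score = sum(scores[d] * weights[d] for d in weights)
# 2.00 + 1.40 + 1.20 + 1.35 + 0.80 + 0.50 + 0.50 = 7.75
# correctness >= 4 and 7.75 >= 6.0, so this turn passes
```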

Pass / Fail Rules

  • PASS: overall_score >= 6.0 AND no turn has correctness < 4
  • FAIL: overall_score < 6.0 OR any turn has correctness < 4

Severity Levels

  • critical — Automatic FAIL if the agent hallucinates, invents facts, or fails the primary objective. Scenarios like hallucination_resistance, cross_turn_file_recall, and smart_discovery use this level.
  • standard — Scored purely on the numeric threshold.

Status Legend

Statuses are grouped by how they affect scoring. Judged statuses count toward avg_score and judged_pass_rate. Infrastructure statuses are excluded from quality metrics — they indicate environmental issues, not agent quality.
| Status | Type | Meaning |
|--------|------|---------|
| PASS | Judged | Scenario passed all criteria |
| FAIL | Judged | Score below threshold or critical failure |
| BLOCKED_BY_ARCHITECTURE | Judged | Agent UI architecture prevents success (e.g., history window too small). Counts toward avg_score but status is never overridden to PASS — a warning is logged instead |
| TIMEOUT | Infrastructure | Scenario exceeded time limit |
| BUDGET_EXCEEDED | Infrastructure | Claude API budget cap hit before completion |
| INFRA_ERROR | Infrastructure | Agent UI backend unreachable or MCP failure |
| SETUP_ERROR | Infrastructure | Document indexing failed (0 chunks) |
| SKIPPED_NO_DOCUMENT | Infrastructure | Corpus file not present on disk (e.g., real-world docs not committed) |
| ERRORED | Infrastructure | Eval agent crashed, returned non-JSON, or encountered an unexpected error |
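The judged/infrastructure split can be expressed as a small helper (a sketch, assuming BLOCKED_BY_ARCHITECTURE counts as judged per the legend above; the sample summary's 63% and 85% rates follow from this split):

```python
JUDGED = {"PASS", "FAIL", "BLOCKED_BY_ARCHITECTURE"}

def pass_rates(statuses: list) -> tuple:
    """Return (pass_rate_all, judged_pass_rate).

    Infrastructure statuses dilute only the 'all' rate; they are
    excluded from the judged denominator entirely.
    """
    passed = statuses.count("PASS")
    judged = [s for s in statuses if s in JUDGED]
    judged_rate = passed / len(judged) if judged else 0.0
    return passed / len(statuses), judged_rate
```

For a run with 34 PASS, 4 FAIL, 2 BLOCKED, and 14 SKIPPED_NO_DOCUMENT, this yields 34/54 ≈ 63% overall and 34/40 = 85% judged.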

Test Corpus

The benchmark ships with a synthetic corpus in eval/corpus/documents/ with ground truth facts defined in eval/corpus/manifest.json.
| File | Format | Domain | Sample Facts |
|------|--------|--------|--------------|
| acme_q3_report.md | Markdown | Finance | Q3 revenue: $14.2M; CEO Q4 outlook: 15–18% growth |
| employee_handbook.md | Markdown | HR Policy | PTO (first year): 15 days; Remote work: up to 3 days/week |
| sales_data_2025.csv | CSV | Sales | Top salesperson: Sarah Chen $70,000; Q1 total: $340,000 |
| product_comparison.html | HTML | Product | StreamLine: $49/mo, 4.2 stars; ProFlow: $79/mo, 4.7 stars |
| api_reference.py | Python | Technical | Auth: Bearer token via Authorization header |
| meeting_notes_q3.txt | Text | General | Next meeting: October 15, 2025 at 2:00 PM |
| budget_2025.md | Markdown | Finance | Total budget: $4.2M; Engineering: $1.3M; CFO approval threshold: $50K |
| large_report.md | Markdown | Compliance | Section 52 finding (adversarial: >100k tokens) |
| sample_chart.png | Image | Test | 1x1 pixel test image for vision scenarios |
The manifest also defines adversarial documents (empty.txt, unicode_test.txt, duplicate_sections.md) used by the adversarial category.
RAG cache freshness — If you see cached documents showing “1 chunk, 0B”, clear the RAG cache before running:
rm ~/.gaia/*.pkl
Stale caches can contain synthesized summaries instead of verbatim document content, causing false failures.

CLI Reference

| Flag | Default | Description |
|------|---------|-------------|
| --scenario ID | — | Run one scenario by ID |
| --category NAME | — | Run all scenarios in a category |
| --audit-only | false | Check architecture constraints without running LLM calls |
| --generate-corpus | false | Regenerate corpus documents and validate manifest.json |
| --backend URL | http://localhost:4200 | Agent UI backend URL |
| --model MODEL | claude-sonnet-4-6 | Judge model |
| --budget USD | 2.00 | Max spend per scenario |
| --timeout SECS | 900 | Per-scenario timeout (auto-scaled for large-doc and multi-turn scenarios) |
| --fix | false | Auto-invoke Claude Code to repair failures, then re-eval |
| --max-fix-iterations N | 3 | Max repair cycles in --fix mode |
| --target-pass-rate N | 0.90 | Stop fixing early when pass rate reaches this threshold |
| --compare PATH... | — | Compare two scorecard.json files or compare against saved baseline |
| --save-baseline | false | Save this run's scorecard as eval/results/baseline.json |
| --capture-session UUID | — | Convert a live Agent UI session into a YAML scenario |

Fix Mode

Fix mode automates the repair loop: evaluate, diagnose failures, patch source code, and re-evaluate. Phases:
  1. Phase A: Full eval run — All scenarios (or filtered set) execute normally
  2. Phase B: Diagnose + repair — Claude Code reads failing scenario transcripts and patches Agent UI source files
  3. Phase C: Re-run failures — Only the previously failed scenarios are re-evaluated
  4. Phase D: Diff scorecard — Produces a comparison showing regressions and improvements
# Fix all failures, up to 3 iterations
gaia eval agent --fix

# Fix rag_quality failures only, with tighter budget
gaia eval agent --category rag_quality --fix --max-fix-iterations 5 --target-pass-rate 0.95
The fixer prioritizes repairs in this order:
  1. Critical severity scenarios first
  2. Architecture fixes (in _chat_helpers.py, base agent classes) before prompt fixes
  3. Multi-scenario failures before single-scenario issues
Fix mode uses Claude Code to patch src/gaia/ source files. Review diffs before committing. Always run python util/lint.py --all --fix after fix iterations.
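The four phases reduce to a simple loop. This sketch uses illustrative names (`run_eval`, `run_fixer`, and the scorecard field layout other than `judged_pass_rate` are assumptions, not the runner's actual API), and it elides merging re-run results back into the full scorecard:

```python
def fix_loop(run_eval, run_fixer, max_iterations=3, target_pass_rate=0.90):
    """Diagnose-repair-retest cycle sketched from the fix-mode phases."""
    scorecard = run_eval(None)                            # Phase A: full run
    for _ in range(max_iterations):
        failing = [s for s in scorecard["scenarios"] if s["status"] == "FAIL"]
        if not failing or scorecard["judged_pass_rate"] >= target_pass_rate:
            break                                         # early exit conditions
        run_fixer(failing)                                # Phase B: diagnose + patch
        scorecard = run_eval([s["id"] for s in failing])  # Phase C: re-run failures
    return scorecard                                      # Phase D: caller diffs
```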

Regression Testing

# Save current run as the new baseline
gaia eval agent --save-baseline

# Compare latest run against saved baseline (auto-detects eval/results/baseline.json)
gaia eval agent --compare eval/results/latest/scorecard.json

# Explicit two-file comparison
gaia eval agent --compare eval/results/run_20250320/scorecard.json eval/results/run_20250322/scorecard.json
Comparison output includes:
  • Per-scenario delta: PASS to FAIL regressions (highlighted), FAIL to PASS improvements
  • Category-level pass rate change
  • Score delta per scenario (warns when score drops by more than 2.0 points within the same status)
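A minimal comparison over two scorecard files might look like this; the `scenarios`, `status`, and `score` field names are assumptions about the scorecard schema, not its documented layout:

```python
import json

def compare_scorecards(baseline_path, latest_path, score_drop_warn=2.0):
    """Per-scenario status and score deltas between two scorecard.json files."""
    with open(baseline_path) as f:
        base = {s["id"]: s for s in json.load(f)["scenarios"]}
    with open(latest_path) as f:
        new = {s["id"]: s for s in json.load(f)["scenarios"]}
    deltas = []
    for sid in base.keys() & new.keys():
        b, n = base[sid], new[sid]
        deltas.append({
            "id": sid,
            "regression": b["status"] == "PASS" and n["status"] == "FAIL",
            "improvement": b["status"] == "FAIL" and n["status"] == "PASS",
            # warn only when the score drops within the same status
            "score_drop": (b["status"] == n["status"]
                           and b["score"] - n["score"] > score_drop_warn),
        })
    return deltas
```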

Writing Custom Scenarios

Scenario YAML files live under eval/scenarios/<category>/. The runner discovers them automatically via recursive glob.

Full Schema Example

id: my_custom_scenario           # unique identifier (snake_case)
name: "My Custom Scenario"        # human-readable name
category: rag_quality             # one of the 10 categories
severity: critical                # critical | standard
description: |
  What this scenario tests and why.

persona: data_analyst             # casual_user | data_analyst | power_user | confused_user | adversarial_user

setup:
  index_documents:
    - corpus_doc: acme_q3_report   # references manifest.json document id
      path: "eval/corpus/documents/acme_q3_report.md"

turns:
  - turn: 1
    objective: "Ask about Q3 revenue"
    ground_truth:
      doc_id: acme_q3_report
      fact_id: q3_revenue          # references manifest.json fact id
      expected_answer: "$14.2 million"
    success_criteria: "Agent correctly states $14.2 million"

  - turn: 2
    objective: "Ask a follow-up that must NOT be answered"
    ground_truth:
      doc_id: acme_q3_report
      fact_id: cfo_name
      expected_answer: null        # null = agent must say it doesn't know
      note: "NOT in document"
    success_criteria: "Agent admits it doesn't know. FAIL if agent invents a name."

expected_outcome: |
  One-sentence summary of what a passing run looks like.
Each turn needs at least one of ground_truth (non-null dict) or success_criteria (non-empty string) — providing both gives maximum judging precision. Valid personas: casual_user, data_analyst, power_user, confused_user, adversarial_user.
Place your YAML file under eval/scenarios/<category>/ and it will be picked up automatically on the next run.
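Before committing a new scenario, the schema rules above can be checked with a small pre-flight script. This is a hypothetical helper, not part of the runner; it assumes the YAML has already been parsed into a dict:

```python
VALID_PERSONAS = {"casual_user", "data_analyst", "power_user",
                  "confused_user", "adversarial_user"}

def validate_scenario(doc: dict) -> list:
    """Return a list of problems; an empty list means the scenario looks valid."""
    problems = []
    for key in ("id", "name", "category", "severity", "persona", "turns"):
        if key not in doc:
            problems.append(f"missing required field: {key}")
    if "persona" in doc and doc["persona"] not in VALID_PERSONAS:
        problems.append(f"unknown persona: {doc['persona']}")
    for turn in doc.get("turns", []):
        has_gt = isinstance(turn.get("ground_truth"), dict)
        has_sc = bool(turn.get("success_criteria"))
        if not (has_gt or has_sc):   # each turn needs at least one of the two
            problems.append(f"turn {turn.get('turn')}: needs ground_truth "
                            "or success_criteria")
    return problems
```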

Capturing Real Sessions

# Convert a live Agent UI conversation to a scenario YAML
gaia eval agent --capture-session 29c211c7-31b5-4084-bb3f-1825c0210942
This reads the session from the Agent UI database (~/.gaia/chat/gaia_chat.db), extracts turns and indexed documents, and writes a scenario YAML to eval/scenarios/captured/. After capture, you must review and edit the generated file to add proper ground_truth and success_criteria fields — the capture tool populates the structure but cannot infer expected answers.

Architecture Audit

gaia eval agent --audit-only
Runs a static analysis of the Agent UI’s internal constraints without making any LLM calls:
  • History window size (_MAX_HISTORY_PAIRS in _chat_helpers.py)
  • Message truncation limits (_MAX_MSG_CHARS)
  • Tool result persistence in conversation history
  • Agent persistence model (stateless per-message vs. persistent)
The audit flags which scenarios will be automatically BLOCKED_BY_ARCHITECTURE and provides recommendations (e.g., “increase _MAX_HISTORY_PAIRS to 10+”). Run this before the full benchmark to understand expected failures due to architecture limits rather than AI quality.

Output Files

After a run, results are written to eval/results/<run_id>/:
| File | Description |
|------|-------------|
| scorecard.json | Machine-readable results with per-scenario details, scores, and cost |
| summary.md | Human-readable pass/fail report with emoji status icons |
| traces/<scenario_id>.json | Full per-scenario trace (turns, dimension scores, reasoning) |
| fix_log.json | Written by --fix mode: list of files changed and rationale per fix |
| eval/results/baseline.json | Saved baseline (written by --save-baseline) |

Sample summary.md Output

# GAIA Agent Eval — run_20250322_143000
**Date:** 2025-03-22T14:30:00+00:00
**Model:** claude-sonnet-4-6

## Summary
- **Total:** 54 scenarios
- **Passed:** 34 ✅
- **Failed:** 4 ❌
- **Blocked:** 2 🚫
- **Timeout:** 0 ⏱
- **Budget exceeded:** 0 💸
- **Infra error:** 0 🔧
- **Skipped (no doc):** 14 ⏭
- **Errored:** 0 ⚠️
- **Pass rate (all):** 63%
- **Pass rate (judged):** 85%
- **Avg score (judged):** 7.4/10

## By Category
| Category | Pass | Fail | Blocked | Infra | Skipped | Avg Score |
|----------|------|------|---------|-------|---------|-----------|
| rag_quality | 5 | 1 | 0 | 0 | 1 | 7.2 |
| context_retention | 3 | 1 | 0 | 0 | 0 | 6.8 |

## Scenarios
- ✅ **simple_factual_rag** — PASS (8.2/10)
- ✅ **hallucination_resistance** — PASS (9.1/10)
- ❌ **cross_turn_file_recall** — FAIL (4.8/10)
  - Root cause: History window too small to retain document context
- 🚫 **conversation_summary** — BLOCKED_BY_ARCHITECTURE (n/a)

**Cost:** $0.1240

CI/CD Integration

- name: Run Agent Eval Benchmark
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    gaia eval agent --category rag_quality --budget 1.00 --timeout 300
Use --category to limit CI costs. The rag_quality and context_retention categories cover the highest-impact tests and typically complete in under 10 minutes.
The benchmark includes a GitHub Actions workflow at .github/workflows/test_eval.yml that runs structural validation (scenario YAML parsing, manifest integrity, scorecard generation) on every push to main or PR targeting main. Full LLM-driven eval runs are triggered via workflow_dispatch or scheduled separately.

Next Steps

Evaluation Framework

Batch experiments, ground truth generation, and model comparison

Agent UI Guide

The desktop chat application that the benchmark tests

RAG SDK

Document indexing and retrieval under the hood

Agent System

Base Agent class, tools, and state management