

Source Code: src/gaia/eval/

What Is Agent Eval?

The Agent Eval framework validates your GAIA agent’s quality by running realistic, multi-turn conversations against the live Agent UI. It uses Claude Code as both user simulator and judge — driving conversations through MCP, scoring every response across 7 dimensions, and producing a machine-readable scorecard. Unlike unit tests that check individual functions, Agent Eval tests the full system end-to-end: RAG indexing, tool dispatch, context retention, hallucination resistance, and personality compliance — through the same interface real users interact with.

What you get:
  • Automated multi-turn conversation testing with persona-driven user messages
  • 7-dimension scoring rubric (correctness, tool selection, context retention, completeness, efficiency, personality, error recovery)
  • Deterministic pass/fail with weighted scoring
  • Regression detection via baseline comparison
  • Auto-fix mode that invokes Claude Code to repair failures

Prerequisites

1. Install eval dependencies

uv pip install -e ".[eval]"
2. Set your Anthropic API key

The benchmark uses Claude as the judge model:
export ANTHROPIC_API_KEY=sk-ant-...
3. Install Claude Code CLI

The runner invokes scenarios via a claude -p subprocess. Verify it’s installed:
claude --version
If not installed, see Claude Code installation.
4. Start the LLM backend

Lemonade Server provides the local LLM and embeddings:
lemonade-server serve
5. Start the Agent UI backend

gaia chat --ui
The Agent UI backend must be running on http://localhost:4200 (default).
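Before your first run, a quick pre-flight check can save a wasted eval. The snippet below is a hypothetical helper (not part of the framework) that confirms the Claude Code CLI responds and that the Agent UI backend answers on the documented default address; adjust the URL if yours differs.

```python
# preflight.py -- hypothetical helper, not part of the GAIA framework.
# Confirms the Claude Code CLI and the Agent UI backend are reachable
# before starting an eval run.
import shutil
import subprocess
import urllib.request

def claude_cli_ok() -> bool:
    """True if the `claude` CLI is on PATH and responds to --version."""
    if shutil.which("claude") is None:
        return False
    return subprocess.run(["claude", "--version"], capture_output=True).returncode == 0

def agent_ui_ok(url: str = "http://localhost:4200") -> bool:
    """True if the Agent UI backend answers at the documented default address."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("claude CLI :", "ok" if claude_cli_ok() else "missing")
    print("Agent UI   :", "ok" if agent_ui_ok() else "unreachable")
```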

Quick Start

Run your first eval in under 5 minutes:

1. Run a Single Scenario

Start with a simple RAG factual lookup test:
gaia eval agent --scenario simple_factual_rag
This will:
  1. Create a new Agent UI session via MCP
  2. Index the acme_q3_report.md document
  3. Ask about Q3 revenue (simulating a power_user persona)
  4. Judge the response against the ground truth ($14.2 million)
  5. Output a scored result

2. Run a Category

Test all RAG quality scenarios:
gaia eval agent --category rag_quality

3. Run the Full Benchmark

Run all 54 scenarios across 10 categories:
gaia eval agent

4. Architecture Audit (Free)

Check for structural limitations without making any LLM calls:
gaia eval agent --audit-only

Reading the Scorecard

After each run, results are written to eval/results/<run_id>/:
| File | Description |
|------|-------------|
| scorecard.json | Machine-readable results with per-scenario scores and cost |
| summary.md | Human-readable pass/fail report |
| traces/<scenario_id>.json | Full per-scenario trace with turn-level scores and reasoning |
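
Because scorecard.json is plain JSON, it is easy to post-process. The sketch below prints one line per scenario; the field names (scenarios, scenario_id, status, overall_score) are assumptions about the schema, so check a scorecard from your own run for the actual keys.

```python
# Sketch only -- field names are assumed; inspect a real scorecard.json
# from eval/results/<run_id>/ for the exact schema.
import json
from pathlib import Path

scorecard = json.loads(Path("eval/results/latest/scorecard.json").read_text())
for scenario in scorecard.get("scenarios", []):
    print(f"{scenario['scenario_id']:<35} {scenario['status']:<12} {scenario.get('overall_score', '-')}")
```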

Understanding Pass / Fail

A scenario passes when both conditions are met:
  • Overall score is 6.0 or higher (out of 10)
  • No turn has a correctness score below 4
A scenario fails if either:
  • Overall score is below 6.0, OR
  • Any single turn has correctness below 4 (hard fail on hallucination or wrong answer)
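Expressed as code, the rule is a simple conjunction (a sketch, not the framework’s own implementation):

```python
def scenario_passes(overall_score: float, turn_correctness: list[float]) -> bool:
    """PASS requires a weighted overall score of at least 6.0 AND
    no single turn scoring below 4 on correctness."""
    return overall_score >= 6.0 and all(score >= 4 for score in turn_correctness)
```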

The 7 Scoring Dimensions

Each turn is scored across 7 dimensions. The overall score is a weighted sum:
| Dimension | Weight | What It Measures |
|-----------|--------|------------------|
| Correctness | 25% | Factual accuracy against ground truth. Wrong numbers, wrong names, or hallucinated facts score 0 |
| Tool Selection | 20% | Used the right tools in the right order. Skipping tools or over-calling scores low |
| Context Retention | 20% | Remembered prior turns, resolved pronouns, didn’t re-ask established information |
| Completeness | 15% | Answered all parts of the question |
| Efficiency | 10% | Took the optimal path without redundant tool calls |
| Personality | 5% | Concise, direct tone. No sycophancy or generic AI hedging |
| Error Recovery | 5% | Gracefully handled missing files, empty results, or ambiguous queries |
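
The overall turn score is the weighted sum of the seven 0–10 dimension scores using the weights above; a minimal sketch (dimension key names are illustrative):

```python
# Weights from the table above; dimension key names are illustrative.
WEIGHTS = {
    "correctness": 0.25,
    "tool_selection": 0.20,
    "context_retention": 0.20,
    "completeness": 0.15,
    "efficiency": 0.10,
    "personality": 0.05,
    "error_recovery": 0.05,
}

def overall_score(scores: dict[str, float]) -> float:
    """Combine 0-10 dimension scores into a single 0-10 overall score."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example: strong correctness but weak tool selection.
print(round(overall_score({
    "correctness": 9, "tool_selection": 4, "context_retention": 8,
    "completeness": 7, "efficiency": 6, "personality": 8, "error_recovery": 7,
}), 2))  # 7.05
```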

Status Codes

| Status | Meaning |
|--------|---------|
| PASS | All criteria met |
| FAIL | Score too low or critical failure (hallucination, wrong answer) |
| BLOCKED_BY_ARCHITECTURE | Agent UI architecture prevents success (e.g., history window too small) |
| TIMEOUT | Scenario exceeded time limit |
| BUDGET_EXCEEDED | Claude API budget cap hit |
| INFRA_ERROR | Agent UI backend unreachable or MCP failure |
| SETUP_ERROR | Document indexing failed (0 chunks) |
| SKIPPED_NO_DOCUMENT | Corpus file not present on disk |
| ERRORED | Eval agent crashed or returned invalid output |
Only PASS, FAIL, and BLOCKED_BY_ARCHITECTURE count toward the average score and judged pass rate. Infrastructure statuses are excluded from quality metrics.
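
If you compute your own metrics from scorecard.json, mirror that exclusion. A sketch, reusing the assumed fields from the earlier example:

```python
# Judged pass rate and average score, excluding infrastructure statuses.
JUDGED = {"PASS", "FAIL", "BLOCKED_BY_ARCHITECTURE"}

def judged_metrics(scenarios: list[dict]) -> tuple[float, float]:
    judged = [s for s in scenarios if s["status"] in JUDGED]
    if not judged:
        return 0.0, 0.0
    pass_rate = sum(s["status"] == "PASS" for s in judged) / len(judged)
    avg_score = sum(s["overall_score"] for s in judged) / len(judged)
    return pass_rate, avg_score
```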

Sample Output

# GAIA Agent Eval -- eval-20260322-143000
**Model:** claude-sonnet-4-6

## Summary
- Total: 54 scenarios
- Passed: 34 / Failed: 4 / Blocked: 2
- Skipped: 14 (real-world docs not on disk)
- Pass rate (judged): 85%
- Avg score: 7.4/10

## By Category
| Category         | Pass | Fail | Avg Score |
|------------------|------|------|-----------|
| rag_quality      | 5    | 1    | 7.2       |
| context_retention| 3    | 1    | 6.8       |

Scenario Categories

The benchmark includes 54 scenarios across 10 categories:
| Category | Count | What It Tests |
|----------|-------|---------------|
| rag_quality | 7 | Factual extraction, hallucination resistance, negation handling, CSV/table data |
| context_retention | 4 | Cross-turn recall, pronoun resolution, multi-document context |
| tool_selection | 4 | Correct tool usage, smart discovery, multi-step planning |
| error_recovery | 3 | Graceful handling of missing files, empty results, vague requests |
| adversarial | 3 | Empty files, large documents (>100k tokens), topic switching |
| personality | 3 | Concise responses, no sycophancy, honest limitations |
| vision | 3 | Screenshot capture, VLM integration |
| real_world | 19 | Real PDFs, XLSX, 10-K filings, RFC specs, datasheets |
| web_system | 6 | Clipboard, desktop notifications, webpage fetching, system info |
| captured | 2 | Golden-path replays from real user sessions |

Common Workflows

Regression Testing

Save a baseline after a known-good run, then compare future runs:
# Save current results as baseline
gaia eval agent --save-baseline

# Later, compare a new run against the baseline
gaia eval agent --compare eval/results/latest/scorecard.json

# Explicit two-file comparison
gaia eval agent --compare eval/results/run_A/scorecard.json eval/results/run_B/scorecard.json
The comparison shows per-scenario deltas, regressions (PASS to FAIL), and improvements (FAIL to PASS).
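
To get the same regression view in your own tooling (for example, to fail a CI job on any PASS-to-FAIL transition), a minimal diff over two scorecards might look like this; field names are again assumptions:

```python
# Minimal regression diff between two scorecards (sketch; assumed fields).
import json
from pathlib import Path

def statuses(path: str) -> dict[str, str]:
    data = json.loads(Path(path).read_text())
    return {s["scenario_id"]: s["status"] for s in data.get("scenarios", [])}

baseline = statuses("eval/results/run_A/scorecard.json")
current = statuses("eval/results/run_B/scorecard.json")

regressions = [sid for sid, st in current.items() if st == "FAIL" and baseline.get(sid) == "PASS"]
improvements = [sid for sid, st in current.items() if st == "PASS" and baseline.get(sid) == "FAIL"]

print("Regressions :", regressions)
print("Improvements:", improvements)
```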

Auto-Fix Mode

Let Claude Code automatically diagnose and repair failures:
# Fix all failures, up to 3 iterations
gaia eval agent --fix

# Fix a specific category with custom targets
gaia eval agent --category rag_quality --fix --max-fix-iterations 5 --target-pass-rate 0.95
Fix mode runs in a loop:
  1. Evaluate all scenarios
  2. Diagnose failures and patch source code
  3. Re-run only the failed scenarios
  4. Compare results — stop when --target-pass-rate is reached
Fix mode patches src/gaia/ source files directly. Always review diffs before committing. Run python util/lint.py --all --fix after fix iterations.
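
Conceptually, the loop looks like the sketch below. This is a control-flow illustration with hypothetical callables, not the framework’s implementation (the real fix mode shells out to Claude Code to patch src/gaia/).

```python
# Control-flow illustration of fix mode (hypothetical helpers; not the real code).
def pass_rate(results: list[dict]) -> float:
    judged = [r for r in results if r["status"] in {"PASS", "FAIL", "BLOCKED_BY_ARCHITECTURE"}]
    return sum(r["status"] == "PASS" for r in judged) / len(judged) if judged else 0.0

def fix_loop(evaluate, diagnose_and_patch, target=0.95, max_iterations=3):
    results = evaluate(None)                              # 1. evaluate all scenarios
    for _ in range(max_iterations):
        failed = [r for r in results if r["status"] == "FAIL"]
        if not failed or pass_rate(results) >= target:
            break                                         # 4. stop once the target is reached
        diagnose_and_patch(failed)                        # 2. diagnose failures, patch source
        retried = {r["scenario_id"]: r                    # 3. re-run only the failed scenarios
                   for r in evaluate([f["scenario_id"] for f in failed])}
        results = [retried.get(r["scenario_id"], r) for r in results]
    return results
```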

Capturing Real Sessions

Convert a live Agent UI conversation into a replayable scenario:
gaia eval agent --capture-session 29c211c7-31b5-4084-bb3f-1825c0210942
This reads the session from the Agent UI database, extracts turns and indexed documents, and writes a scenario YAML to eval/scenarios/captured/. You must then edit the file to add proper ground_truth and success_criteria fields.

Filtering by Tags

Run only scenarios with specific tags:
# Run scenarios tagged with "healthcare" OR "critical"
gaia eval agent --tag healthcare --tag critical

Output Formats

# JUnit XML for CI integration
gaia eval agent --output-format junit

# Markdown summary
gaia eval agent --output-format markdown

Cost Control

| Control | Flag | Default |
|---------|------|---------|
| Budget per scenario | --budget | $2.00 |
| Timeout per scenario | --timeout | 900 seconds |
| Run specific scenario | --scenario | all |
| Run specific category | --category | all |
Typical costs:
  • Single scenario: $0.02–$0.10
  • Full benchmark (54 scenarios): $1.00–$5.00
  • Architecture audit: $0.00 (no LLM calls)
Start with --audit-only (free) to understand expected failures, then run individual categories to control costs during development.

Next Steps

  • Scenario Authoring: Write custom scenarios with YAML and ground truth
  • CI/CD Integration: Run Agent Eval in GitHub Actions with regression detection
  • CLI Reference: Complete flag reference for gaia eval agent
  • Agent Eval Benchmark: Deep-dive into architecture, scoring pipeline, and fix mode internals