Source Code: `src/gaia/eval/`

## What Is Agent Eval?
The Agent Eval framework validates your GAIA agent’s quality by running realistic, multi-turn conversations against the live Agent UI. It uses Claude Code as both user simulator and judge: driving conversations through MCP, scoring every response across 7 dimensions, and producing a machine-readable scorecard. Unlike unit tests that check individual functions, Agent Eval tests the full system end-to-end (RAG indexing, tool dispatch, context retention, hallucination resistance, and personality compliance) through the same interface real users interact with.

What you get:

- Automated multi-turn conversation testing with persona-driven user messages
- 7-dimension scoring rubric (correctness, tool selection, context retention, completeness, efficiency, personality, error recovery)
- Deterministic pass/fail with weighted scoring
- Regression detection via baseline comparison
- Auto-fix mode that invokes Claude Code to repair failures
## Prerequisites

### Install Claude Code CLI

The runner invokes scenarios via a `claude -p` subprocess. Verify it’s installed:
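```bash
# prints the installed Claude Code version
claude --version
```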
If not installed, see Claude Code installation.

## Quick Start
Run your first eval in under 5 minutes.

### 1. Run a Single Scenario
Start with a simple RAG factual lookup test (an example invocation follows the list below). The run will:

- Create a new Agent UI session via MCP
- Index the `acme_q3_report.md` document
- Ask about Q3 revenue (simulating a `power_user` persona)
- Judge the response against the ground truth (`$14.2 million`)
- Output a scored result
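A sketch of the invocation, using the `--scenario` flag from the Cost Control table below; the scenario ID here is hypothetical:

```bash
# hypothetical scenario ID; see the files under eval/scenarios/ for real IDs
gaia eval agent --scenario rag_factual_lookup
```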
### 2. Run a Category
Test all RAG quality scenarios:
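```bash
# runs all 7 scenarios in the rag_quality category
gaia eval agent --category rag_quality
```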
### 3. Run the Full Benchmark

Run all 54 scenarios across 10 categories:
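```bash
# with no --scenario or --category filter, all 54 scenarios run by default
gaia eval agent
```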
### 4. Architecture Audit (Free)

Check for structural limitations without making any LLM calls:
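The flag below is an assumption (this page only confirms that the audit makes no LLM calls); see the CLI Reference for the actual option:

```bash
gaia eval agent --audit  # hypothetical flag name
```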
## Reading the Scorecard

After each run, results are written to `eval/results/<run_id>/`:
| File | Description |
|---|---|
| `scorecard.json` | Machine-readable results with per-scenario scores and cost |
| `summary.md` | Human-readable pass/fail report |
| `traces/<scenario_id>.json` | Full per-scenario trace with turn-level scores and reasoning |
## Understanding Pass / Fail

A scenario passes when both conditions are met:

- Overall score is 6.0 or higher (out of 10)
- No turn has a correctness score below 4

Conversely, a scenario fails when:

- Overall score is below 6.0, OR
- Any single turn has correctness below 4 (a hard fail on hallucination or wrong answer)
## The 7 Scoring Dimensions
Each turn is scored across 7 dimensions. The overall score is a weighted sum:

| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 25% | Factual accuracy against ground truth. Wrong numbers, wrong names, or hallucinated facts score 0 |
| Tool Selection | 20% | Used the right tools in the right order. Skipping tools or over-calling scores low |
| Context Retention | 20% | Remembered prior turns, resolved pronouns, didn’t re-ask established information |
| Completeness | 15% | Answered all parts of the question |
| Efficiency | 10% | Took the optimal path without redundant tool calls |
| Personality | 5% | Concise, direct tone. No sycophancy or generic AI hedging |
| Error Recovery | 5% | Gracefully handled missing files, empty results, or ambiguous queries |
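As a concrete illustration, here is a minimal sketch of the weighted-sum and pass/fail logic described above; the names, data shapes, and mean-over-turns aggregation are assumptions, not the framework’s actual code:

```python
# Illustrative sketch only: names, shapes, and the mean-over-turns
# aggregation are assumptions, not the framework's actual API.
WEIGHTS = {
    "correctness": 0.25,
    "tool_selection": 0.20,
    "context_retention": 0.20,
    "completeness": 0.15,
    "efficiency": 0.10,
    "personality": 0.05,
    "error_recovery": 0.05,
}

def turn_score(turn: dict[str, float]) -> float:
    """Weighted sum of the 7 dimension scores (each 0-10) for one turn."""
    return sum(WEIGHTS[dim] * turn[dim] for dim in WEIGHTS)

def scenario_passes(turns: list[dict[str, float]]) -> bool:
    """PASS requires overall >= 6.0 and no turn with correctness below 4."""
    overall = sum(turn_score(t) for t in turns) / len(turns)
    return overall >= 6.0 and all(t["correctness"] >= 4 for t in turns)
```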
## Status Codes
| Status | Meaning |
|---|---|
| `PASS` | All criteria met |
| `FAIL` | Score too low or critical failure (hallucination, wrong answer) |
| `BLOCKED_BY_ARCHITECTURE` | Agent UI architecture prevents success (e.g., history window too small) |
| `TIMEOUT` | Scenario exceeded time limit |
| `BUDGET_EXCEEDED` | Claude API budget cap hit |
| `INFRA_ERROR` | Agent UI backend unreachable or MCP failure |
| `SETUP_ERROR` | Document indexing failed (0 chunks) |
| `SKIPPED_NO_DOCUMENT` | Corpus file not present on disk |
| `ERRORED` | Eval agent crashed or returned invalid output |
Only `PASS`, `FAIL`, and `BLOCKED_BY_ARCHITECTURE` count toward the average score and judged pass rate. Infrastructure statuses are excluded from quality metrics.

## Sample Output
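As an illustration only, a single scorecard entry might have roughly this shape; every field name here is an assumption inferred from the descriptions above, not the actual schema:

```json
{
  "scenario_id": "rag_factual_lookup",
  "status": "PASS",
  "overall_score": 8.2,
  "cost_usd": 0.11,
  "turns": [
    {
      "correctness": 9,
      "tool_selection": 8,
      "context_retention": 8,
      "completeness": 9,
      "efficiency": 7,
      "personality": 8,
      "error_recovery": 8
    }
  ]
}
```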
Scenario Categories
The benchmark includes 54 scenarios across 10 categories:

| Category | Count | What It Tests |
|---|---|---|
| `rag_quality` | 7 | Factual extraction, hallucination resistance, negation handling, CSV/table data |
| `context_retention` | 4 | Cross-turn recall, pronoun resolution, multi-document context |
| `tool_selection` | 4 | Correct tool usage, smart discovery, multi-step planning |
| `error_recovery` | 3 | Graceful handling of missing files, empty results, vague requests |
| `adversarial` | 3 | Empty files, large documents (>100k tokens), topic switching |
| `personality` | 3 | Concise responses, no sycophancy, honest limitations |
| `vision` | 3 | Screenshot capture, VLM integration |
| `real_world` | 19 | Real PDFs, XLSX, 10-K filings, RFC specs, datasheets |
| `web_system` | 6 | Clipboard, desktop notifications, webpage fetching, system info |
| `captured` | 2 | Golden-path replays from real user sessions |
## Common Workflows

### Regression Testing

Save a baseline after a known-good run, then compare future runs:
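A sketch of the two steps; the flag names are assumptions, so check the CLI Reference for the actual options:

```bash
# flag names are assumptions; see the CLI Reference
gaia eval agent --save-baseline eval/baseline.json   # after a known-good run
gaia eval agent --baseline eval/baseline.json        # compare a later run against it
```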
### Auto-Fix Mode

Let Claude Code automatically diagnose and repair failures. In fix mode, the runner will (see the sketch after this list):

- Evaluate all scenarios
- Diagnose failures and patch source code
- Re-run only the failed scenarios
- Compare results, stopping when `--target-pass-rate` is reached
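Only `--target-pass-rate` appears on this page, so the `--fix` flag and the value format below are assumptions:

```bash
# --fix and the percentage format are assumptions; see the CLI Reference
gaia eval agent --fix --target-pass-rate 90
```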
### Capturing Real Sessions

Convert a live Agent UI conversation into a replayable scenario. Captured scenarios are written to `eval/scenarios/captured/`; you must then edit the file to add proper `ground_truth` and `success_criteria` fields.
### Filtering by Tags

Run only scenarios with specific tags:
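A sketch, assuming a `--tags` flag and tag names drawn from the scenario descriptions above:

```bash
# flag name and tag values are assumptions; see the CLI Reference
gaia eval agent --tags hallucination,csv
```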
## Output Formats

Results are emitted as machine-readable `scorecard.json` and human-readable `summary.md`; see Reading the Scorecard above.

## Cost Control
| Control | Flag | Default |
|---|---|---|
| Budget per scenario | `--budget` | $2.00 |
| Timeout per scenario | `--timeout` | 900 seconds |
| Run specific scenario | `--scenario` | all |
| Run specific category | `--category` | all |
Approximate costs:

- Single scenario: $0.10
- Full benchmark (54 scenarios): $5.00
- Architecture audit: $0.00 (no LLM calls)
## Next Steps

- **Scenario Authoring**: write custom scenarios with YAML and ground truth
- **CI/CD Integration**: run Agent Eval in GitHub Actions with regression detection
- **CLI Reference**: complete flag reference for `gaia eval agent`
- **Agent Eval Benchmark**: deep-dive into architecture, scoring pipeline, and fix-mode internals