Source Code: eval/scenarios/

Overview
Scenarios are YAML files that define multi-turn conversations the eval agent will simulate against the live Agent UI. Each scenario specifies a persona, documents to index, user objectives per turn, and ground truth for scoring. The runner discovers scenarios automatically via recursive glob under eval/scenarios/. Place your YAML file in the appropriate category subdirectory and it will be picked up on the next run.
Scenario YAML Format
All fields are documented in the key field rules below; an illustrative scenario sketch follows the table.

Key Field Rules
| Field | Required | Notes |
|---|---|---|
id | Yes | Unique across all scenarios. Use snake_case. |
category | Yes | Must be one of the 10 defined categories. |
persona | Yes | Must be one of the 5 defined personas. |
setup.index_documents | Yes | Can be empty ([]) for scenarios that test discovery. |
turns | Yes | At least one turn required. Numbers must be sequential starting from 1. |
ground_truth or success_criteria | One required per turn | Providing both gives maximum judging precision. |
expected_answer | Optional | When non-null, strict automatic-zero rules apply. When null, agent must say it doesn’t know. |
user_message | Optional | If omitted, the eval agent generates a natural message from objective + persona. |
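
An illustrative schema sketch, assuming the nesting shown below (the exact field layout and the `number`/`objective`/`note` keys are assumptions; existing files under eval/scenarios/ are the authoritative reference):

```yaml
# Illustrative only -- field nesting and turn keys are assumptions
id: <unique_snake_case_id>          # required, unique across all scenarios
category: <one_of_the_10_categories>
persona: <one_of_the_5_personas>

setup:
  index_documents: []               # required; may be empty for discovery tests

turns:                              # at least one turn; numbers sequential from 1
  - number: 1
    objective: <what the simulated user is trying to accomplish>
    user_message: <optional; generated from objective + persona if omitted>
    ground_truth: <the fact the judge scores against>
    success_criteria: <what a passing response must do>
    expected_answer: <exact value, or null if the agent should say it doesn't know>
    note: <optional guidance for the judge on edge cases>
```
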
When expected_answer is non-null, the judge applies strict automatic-zero rules: wrong numbers (>5% deviation), wrong names, lazy refusals (saying “I can’t find that” without calling the query tool), and hallucinated sources all result in a correctness score of 0.

Categories

Place scenario files in the appropriate subdirectory under eval/scenarios/:
| Category | Use When Testing… |
|---|---|
rag_quality | Factual extraction, hallucination resistance, CSV/table parsing, cross-section synthesis |
context_retention | Pronoun resolution, remembering indexed documents, multi-document conversations |
tool_selection | Tool routing, smart document discovery, multi-step planning, knowing when NOT to use tools |
error_recovery | Missing files, empty search results, vague queries, graceful fallbacks |
adversarial | Empty files, very large documents, rapid topic switching |
personality | Conciseness, avoiding sycophancy, admitting limitations honestly |
vision | Screenshot capture, VLM analysis, graceful degradation when VLM unavailable |
real_world | Real PDFs, spreadsheets, legal documents, financial filings |
web_system | Clipboard, notifications, webpage fetching, system information |
captured | Replaying real user sessions as golden-path tests |
Personas
Each scenario specifies a persona that shapes how the eval agent crafts user messages:

| Persona | Behavior | When to Use |
|---|---|---|
casual_user | Short messages, uses pronouns (“that file”), occasionally vague | Testing robustness with ambiguous input |
power_user | Precise requests, specific file names, multi-step instructions | Testing efficiency and correct tool routing |
confused_user | Wrong terminology, self-correcting mid-sentence | Testing error recovery and clarification |
adversarial_user | Edge cases, topic switches, impossible requests | Testing hallucination resistance and boundary handling |
data_analyst | Numbers, comparisons, aggregations, structured output expectations | Testing numerical accuracy and data synthesis |
Step-by-Step: Write a Custom Scenario
Step 1: Choose Your Category and Persona
Decide what aspect of agent behavior you’re testing, and which user persona best exercises it.

Step 2: Identify or Create Corpus Documents
Check eval/corpus/manifest.json for existing documents. If you need a new document:
- Add the file to eval/corpus/documents/
- Update eval/corpus/manifest.json with the document entry and facts (see Corpus Management below)
Step 3: Write the YAML
Create a new file at eval/scenarios/<category>/<your_scenario_id>.yaml:
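
For example, a negative-test scenario for error_recovery (the file name, id, document, and message below are hypothetical; expected_answer: null means the agent must admit it doesn’t know, per the field rules above):

```yaml
# eval/scenarios/error_recovery/missing_fact_refusal.yaml (hypothetical)
id: missing_fact_refusal
category: error_recovery
persona: casual_user

setup:
  index_documents:
    - financial_report_q3.md        # hypothetical corpus document

turns:
  - number: 1
    objective: Ask for a fact that is not in the indexed document
    user_message: "What was the European headcount last quarter?"
    expected_answer: null           # the agent must say it doesn't know, not guess
    success_criteria: "Queries the document, then clearly states the information is not present instead of inventing a number"
```
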
Step 4: Validate
Run your scenario in isolation, then inspect eval/results/<run_id>/traces/budget_threshold_check.json for detailed scoring and reasoning.
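
The trace format itself isn’t documented in this guide; a hypothetical sketch of the kind of detail to look for (per-dimension scores plus the judge’s reasoning), with all field names assumed:

```json
{
  "scenario_id": "budget_threshold_check",
  "turns": [
    {
      "number": 1,
      "scores": {
        "correctness": 10,
        "tool_selection": 7,
        "context_retention": 10,
        "completeness": 10,
        "efficiency": 7,
        "personality": 10,
        "error_recovery": 10
      },
      "judge_reasoning": "Exact match with ground truth; one redundant query call."
    }
  ]
}
```
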
Step 5: Iterate
If the scenario is flaky (passing/failing inconsistently):

- Make success_criteria more specific (a tightened-turn sketch follows this list)
- Add expected_answer with exact values
- Add note fields to clarify edge cases for the judge
- Consider whether the persona is too ambiguous
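
For example, a flaky turn might be tightened like this (values are illustrative, reusing the Q3 revenue fact from the flaky-scenario guidance below):

```yaml
# Before: vague, judged inconsistently
- number: 1
  objective: Ask about revenue
  success_criteria: "Agent gives a good answer about revenue"

# After: specific and deterministic
- number: 1
  objective: Ask for the exact Q3 revenue figure
  user_message: "What was revenue in Q3?"       # explicit message removes persona variability
  ground_truth: "Q3 revenue was $14.2 million"
  success_criteria: "States the Q3 revenue as $14.2 million"
  expected_answer: "$14.2 million"
  note: "Accept any figure within 1% of $14.2 million"
```
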
Corpus Management
The test corpus lives in eval/corpus/ with a manifest that defines documents and their ground truth facts.
Directory Structure
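A sketch of the layout, inferred from the paths referenced in this guide (not an exhaustive listing):

```
eval/
├── corpus/
│   ├── manifest.json            # documents + ground truth facts
│   └── documents/               # the document files themselves
├── scenarios/
│   ├── rag_quality/
│   ├── context_retention/
│   ├── ...                      # one subdirectory per category
│   └── captured/
└── results/
    └── <run_id>/
        └── traces/              # per-scenario scoring traces
```
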
Manifest Format
The manifest (eval/corpus/manifest.json) defines documents and their ground truth facts:
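
A hypothetical manifest sketch consistent with the fields this guide mentions (document entries with a format, plus facts tagged by difficulty); the actual key names may differ:

```json
{
  "documents": [
    {
      "id": "financial_report_q3",
      "file": "documents/financial_report_q3.md",
      "format": "markdown",
      "facts": [
        { "fact": "Q3 revenue was $14.2 million", "difficulty": "easy" },
        { "fact": "Revenue grew roughly 12% from Q2 to Q3", "difficulty": "hard" }
      ]
    }
  ]
}
```
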
Adding a New Document
- Create the document file in eval/corpus/documents/ (a small sample document is sketched below)
- Add a document entry to manifest.json with its ground truth facts (see the manifest sketch above)
- Validate the manifest
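
For the first step, a small sample document (hypothetical content, matching the manifest sketch above):

```markdown
# Q3 Financial Report (sample)

## Revenue
Total revenue for Q3 was $14.2 million, up from $12.7 million in Q2.

## Headcount
Engineering headcount at the end of Q3 was 48.
```
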
Document Formats
| Format | Extension | Notes |
|---|---|---|
markdown | .md | Best for structured text with sections |
html | .html | Preserves tables and formatting |
csv | .csv | Tests tabular data extraction |
text | .txt | Plain text |
python | .py | Code documentation |
image | .png | Tests vision capabilities |
Fact Difficulty Levels
| Level | Meaning |
|---|---|
easy | Fact is stated directly in the document |
medium | Requires reading comprehension or inference |
hard | Requires synthesis across sections or numerical calculation |
The 7 Scoring Dimensions In-Depth
Correctness (25%)
Factual accuracy against ground truth. The most heavily weighted dimension. Scoring guide:

- 10 = Exact match with ground truth
- 7 = Minor omission but core fact correct
- 4 = Partially correct
- 0 = Wrong answer, hallucination, or invented facts
Automatic-zero conditions (when expected_answer is non-null):
- Wrong number: ground truth is specific and response deviates >5%
- Wrong name: different person or entity named
- Lazy refusal: says “I can’t find that” without calling the query tool
- Hallucinated source: claims a fact “from the document” that contradicts ground truth
| Deviation from Ground Truth | Max Score |
|---|---|
| ≤1% | 10 |
| ≤5% | 8 |
| 5-15% | 4 |
| 15-50% | 1 |
| >50% | 0 |
Tool Selection (20%)
Whether the agent chose the right tools in the right order.

- 10 = Optimal tool chain
- 7 = Correct tools with 1-2 extra calls
- 4 = Wrong tool but recovered
- 0 = Completely wrong tools or skipped required tools
Context Retention (20%)
Whether the agent used information from prior turns.

- 10 = Perfect recall and pronoun resolution
- 7 = Mostly remembered, minor gaps
- 4 = Missed key context from prior turns
- 0 = Completely ignored conversation history
Capped at 4 if the agent re-asks for information already established in a prior turn.
Completeness (15%)
Whether all parts of the question were answered.

- 10 = Every aspect addressed
- 7 = Most parts answered
- 4 = Partial answer
- 0 = Didn’t answer the question
Efficiency (10%)
Whether the agent took the optimal path.

- 10 = Minimal necessary tool calls
- 7 = 1-2 extra steps
- 4 = Redundant work
- 0 = Tool loop (3+ identical calls)
Personality (5%)
GAIA voice and tone compliance.

- 10 = Concise, direct, professional
- 7 = Neutral tone
- 4 = Generic AI hedging (“As an AI, I…”)
- 0 = Sycophantic or overly verbose
Error Recovery (5%)
Graceful handling of failure conditions.

- 10 = Graceful fallback with helpful message
- 7 = Recovered after retry
- 4 = Partial recovery
- 0 = Gave up without explanation
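
The seven weights sum to 100%. Assuming the per-turn total is a simple weighted average of the 0-10 dimension scores (an assumption; the exact aggregation lives in the scoring pipeline), a worked example:

```yaml
# Hypothetical per-turn scores and the assumed weighted total
correctness:       10   # x 0.25 = 2.50
tool_selection:     7   # x 0.20 = 1.40
context_retention: 10   # x 0.20 = 2.00
completeness:       7   # x 0.15 = 1.05
efficiency:        10   # x 0.10 = 1.00
personality:        7   # x 0.05 = 0.35
error_recovery:    10   # x 0.05 = 0.50
# weighted total:  8.80 out of 10
```
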
Real Examples
RAG Quality: Simple Factual Lookup
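An illustrative sketch of this shape of scenario (not the shipped file; id, document, and values are hypothetical):

```yaml
id: simple_factual_lookup
category: rag_quality
persona: casual_user
setup:
  index_documents:
    - financial_report_q3.md
turns:
  - number: 1
    objective: Ask for a single fact stated directly in the document
    ground_truth: "Q3 revenue was $14.2 million"
    expected_answer: "$14.2 million"
    success_criteria: "Queries the indexed report and states $14.2 million"
```
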
Context Retention: Cross-Turn File Recall
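A sketch of the cross-turn pattern (hypothetical values):

```yaml
id: cross_turn_file_recall
category: context_retention
persona: casual_user
setup:
  index_documents:
    - financial_report_q3.md
turns:
  - number: 1
    objective: Ask what the indexed report covers
    success_criteria: "Summarizes the report that was indexed in setup"
  - number: 2
    objective: Refer back to the report with a pronoun
    user_message: "What was the revenue in that file?"
    ground_truth: "Q3 revenue was $14.2 million"
    expected_answer: "$14.2 million"
    success_criteria: "Resolves 'that file' to the report from turn 1 without asking which file is meant"
```
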
Personality: Honest Limitation
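A sketch of an honest-limitation test (hypothetical):

```yaml
id: honest_limitation
category: personality
persona: casual_user
setup:
  index_documents: []
turns:
  - number: 1
    objective: Ask for something outside the agent's capabilities
    expected_answer: null          # the agent must admit it cannot do this
    success_criteria: "Admits the limitation concisely, without sycophancy or generic AI hedging"
```
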
Adversarial: Empty File
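A sketch of an empty-file test (hypothetical; empty.txt stands in for a zero-byte corpus document):

```yaml
id: empty_file_handling
category: adversarial
persona: adversarial_user
setup:
  index_documents:
    - empty.txt
turns:
  - number: 1
    objective: Ask what the empty file contains
    expected_answer: null
    success_criteria: "Reports that the file has no content and does not invent facts about it"
```
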
Tool Selection: Smart Discovery
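A sketch of a smart-discovery test (hypothetical; nothing is pre-indexed, so the agent must find the document itself):

```yaml
id: smart_discovery
category: tool_selection
persona: power_user
setup:
  index_documents: []
turns:
  - number: 1
    objective: Ask a question that requires locating and indexing the right corpus document first
    ground_truth: "Q3 revenue was $14.2 million, found in financial_report_q3.md"
    success_criteria: "Discovers and indexes the relevant document before answering, rather than answering from memory"
```
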
Best Practices
What Makes a Good Scenario
Do
- Use specific, verifiable ground truth facts
- Test one behavior per scenario (single responsibility)
- Provide both ground_truth AND success_criteria
- Include expected_answer with exact values when possible
- Add note fields to clarify ambiguous cases
- Test negative cases (expected_answer: null)
Don't
- Write vague success criteria (“Agent responds helpfully”)
- Test multiple unrelated behaviors in one scenario
- Assume the agent has context from other scenarios
- Use subjective quality judgments as pass criteria
- Create scenarios that depend on external services
- Omit ground_truth when exact answers are available
Avoiding Flaky Scenarios
A flaky scenario passes and fails inconsistently across runs. Common causes:

- Vague success criteria — “Agent provides a good answer” is too subjective. Be specific: “Agent states the Q3 revenue was $14.2 million.”
- Missing expected_answer — Without an exact expected value, the judge has more latitude. Always provide expected_answer when the fact is deterministic.
- Ambiguous persona behavior — If user_message is omitted, the eval agent generates messages from objective + persona. For critical tests, provide an explicit user_message to remove variability.
- Timing-sensitive tests — Don’t test time-dependent behavior (e.g., “What day is it?”). Ground truth must be static.
- External dependencies — Don’t rely on external URLs, APIs, or services that may be unavailable.
Next Steps
Getting Started
Run your first eval and read the scorecard
CI/CD Integration
Automate eval runs in GitHub Actions
CLI Reference
Complete command reference for gaia eval agent

Agent Eval Benchmark
Architecture details, scoring pipeline, and internals