## Overview
The Agent Eval framework is designed for CI/CD integration. You can run evaluations on every push, compare results against baselines, detect regressions, and control costs with per-scenario budgets and timeouts.
This guide covers:

- Setting up GitHub Actions workflows
- Baseline management for regression detection
- Cost budgeting strategies
- Interpreting scorecard diffs between releases
## Quick Start: GitHub Actions
Here’s a minimal workflow that runs the RAG quality scenarios on every push:
`.github/workflows/agent-eval.yml`:

```yaml
name: Agent Eval

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system

      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code

      - name: Start Agent UI backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5  # Wait for the server to start

      - name: Run Agent Eval (RAG quality)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          gaia eval agent \
            --category rag_quality \
            --budget 1.00 \
            --timeout 300

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval/results/
```
The Agent UI backend requires a running Lemonade Server for inference. In CI, you can:

- Use a cloud LLM provider via `--backend` pointing to a hosted instance
- Pre-install Lemonade Server in your CI environment
- Use a self-hosted runner with AMD hardware
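For example, a run against a hosted backend might look like the following. This is a minimal sketch: the URL is a placeholder, and the exact value depends on how your deployment exposes the backend.

```bash
# Hypothetical hosted endpoint; substitute your provider's instance
gaia eval agent \
  --backend https://my-hosted-llm.example.com \
  --category rag_quality
```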
## Baseline Workflow
Regression detection works by comparing scorecard results between runs. The typical workflow:
### 1. Establish a Baseline
After a known-good release, save the scorecard as a baseline:
```bash
gaia eval agent --save-baseline
```

This writes `eval/results/baseline.json`. Commit this file to your repository.
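A typical release flow regenerates and commits the baseline in one step:

```bash
# After a known-good release: regenerate and commit the baseline
gaia eval agent --save-baseline
git add eval/results/baseline.json
git commit -m "Update eval baseline"
```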
### 2. Compare Against Baseline
On subsequent runs, compare the current results against the saved baseline:
```bash
gaia eval agent --compare eval/results/latest/scorecard.json
```

When only one path is provided, it's compared against `eval/results/baseline.json` automatically.
### 3. Explicit Two-File Comparison
Compare any two scorecards directly:
```bash
gaia eval agent --compare eval/results/v1.0/scorecard.json eval/results/v1.1/scorecard.json
```
### Complete Baseline CI Workflow
`.github/workflows/agent-eval-regression.yml`:

```yaml
name: Agent Eval Regression Check

on:
  pull_request:
    branches: [main]
    paths:
      - "src/gaia/agents/**"
      - "src/gaia/ui/**"
      - "src/gaia/rag/**"

jobs:
  eval-regression:
    runs-on: ubuntu-latest
    timeout-minutes: 45
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system
          npm install -g @anthropic-ai/claude-code

      - name: Start Agent UI backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5

      - name: Run eval and compare against baseline
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # Run eval
          gaia eval agent \
            --category rag_quality \
            --category context_retention \
            --budget 1.50 \
            --timeout 300

          # Compare against committed baseline
          gaia eval agent \
            --compare eval/results/latest/scorecard.json

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-scorecard
          path: |
            eval/results/latest/scorecard.json
            eval/results/latest/summary.md
```
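To make regressions visible on the PR itself, you can append a step that posts `summary.md` as a comment. A sketch using the GitHub CLI, assuming the workflow's `GITHUB_TOKEN` has pull-request write permission:

```yaml
      - name: Comment summary on PR
        if: always() && github.event_name == 'pull_request'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh pr comment ${{ github.event.pull_request.number }} \
            --body-file eval/results/latest/summary.md
```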
## Cost Budgeting

### Per-Scenario Budget

The `--budget` flag sets the maximum USD spend per scenario:
```bash
# Conservative: $1 per scenario
gaia eval agent --budget 1.00

# Default: $2 per scenario
gaia eval agent

# Generous: $5 per scenario (for complex multi-turn scenarios)
gaia eval agent --budget 5.00
```
When a scenario exceeds its budget, it receives the `BUDGET_EXCEEDED` status and is excluded from quality metrics.
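If you want CI to call these out explicitly, a small post-processing step can scan the scorecard. A sketch, assuming the scorecard JSON lists scenarios with a `status` field (check your `scorecard.json` for the actual key names):

```bash
# List scenarios that blew their budget; field names are assumptions
jq -r '.scenarios[] | select(.status == "BUDGET_EXCEEDED") | .name' \
  eval/results/latest/scorecard.json
```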
### Typical Costs

| Scope | Approximate Cost | Time |
|-------|------------------|------|
| Single scenario | $0.02 – $0.10 | 30s – 3min |
| RAG quality category (7 scenarios) | $0.20 – $0.70 | 5 – 15min |
| Context retention (4 scenarios) | $0.10 – $0.40 | 3 – 10min |
| Full benchmark (54 scenarios) | $1.00 – $5.00 | 20 – 60min |
| Architecture audit | $0.00 | < 1min |
Costs depend on the judge model. The default `claude-sonnet-4-6` is cost-effective. Using `claude-opus-4.1` as the judge increases costs ~5x but may improve scoring accuracy for edge cases.
### Cost-Optimized CI Strategy
Run different tiers of evaluation based on the trigger:
`.github/workflows/agent-eval-tiered.yml`:

```yaml
name: Agent Eval (Tiered)

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      full_eval:
        description: "Run full benchmark"
        type: boolean
        default: false

jobs:
  # Always run: architecture audit (free)
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system

      - name: Architecture Audit
        run: gaia eval agent --audit-only

  # PR only: run core categories (~$0.50)
  core-eval:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system
          npm install -g @anthropic-ai/claude-code

      - name: Start backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5

      - name: Run core eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          gaia eval agent \
            --category rag_quality \
            --category context_retention \
            --budget 1.00 \
            --timeout 300

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: core-eval-results
          path: eval/results/

  # Manual or main push: full benchmark (~$3.00)
  full-eval:
    if: >
      (github.event_name == 'workflow_dispatch' && github.event.inputs.full_eval == 'true')
      || (github.event_name == 'push' && github.ref == 'refs/heads/main')
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system
          npm install -g @anthropic-ai/claude-code

      - name: Start backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5

      - name: Run full eval and save baseline
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          gaia eval agent --save-baseline --budget 2.00

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: full-eval-results
          path: eval/results/
```
## Interpreting Scorecard Diffs

When you run `--compare`, the output shows per-scenario deltas:
### Regressions (PASS to FAIL)

```
❌ REGRESSION: simple_factual_rag
   Before: PASS (8.2/10)
   After:  FAIL (3.1/10)
   Delta:  -5.1 points
```
Regressions are highlighted and should block the PR. The trace file (`traces/simple_factual_rag.json`) contains the full conversation, dimension scores, and reasoning to help diagnose the issue.
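For a quick look without opening the whole trace, `jq` can pull out the judge's verdict. A sketch with field names that are assumptions; inspect a real trace file for the actual keys:

```bash
# Extract the dimension scores and reasoning from a failing scenario's trace
jq '{scores: .dimension_scores, reasoning: .reasoning}' \
  eval/results/latest/traces/simple_factual_rag.json
```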
### Improvements (FAIL to PASS)

```
✅ IMPROVEMENT: cross_turn_file_recall
   Before: FAIL (4.8/10)
   After:  PASS (7.5/10)
   Delta:  +2.7 points
```
### Score Drops Within Same Status

```
⚠️ SCORE DROP: hallucination_resistance
   Before: PASS (9.1/10)
   After:  PASS (6.2/10)
   Delta:  -2.9 points
```
Even if the status stays `PASS`, a score drop of more than 2.0 points triggers a warning — the scenario is getting closer to failing.
### Category-Level Changes

```
## Category Summary

| Category          | Before | After | Change |
|-------------------|--------|-------|--------|
| rag_quality       | 85%    | 71%   | -14%   |
| context_retention | 75%    | 100%  | +25%   |
```
## JUnit Output for CI

Generate JUnit XML output for integration with CI dashboards (e.g., GitHub Actions test summary):

```bash
gaia eval agent --output-format junit
```
The JUnit output maps:

- Each scenario = one test case
- `PASS` = test passed
- `FAIL` = test failed (with failure message from root cause analysis)
- `BLOCKED_BY_ARCHITECTURE` = test skipped
- Infrastructure statuses = test errored
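To surface these results in the GitHub Actions UI, feed the XML to a JUnit reporter action. A sketch using `dorny/test-reporter`; the `path` is an assumption, so point it at wherever your run writes the JUnit file:

```yaml
      - name: Publish eval results
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: Agent Eval
          path: "eval/results/**/*.xml"
          reporter: java-junit
```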
## Timeout Management

The `--timeout` flag sets the base timeout per scenario in seconds. The runner automatically scales it based on scenario complexity:

```
effective_timeout = max(base_timeout,
                        120s startup overhead
                        + num_docs  × 90s per document
                        + num_turns × 200s per turn)
```

The result is capped at 7200s (2 hours).
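As a worked example, a 5-document, 4-turn scenario run with `--timeout 300` gets:

```
effective_timeout = max(300, 120 + 5 × 90 + 4 × 200)
                  = max(300, 1370)
                  = 1370s
```

so complexity scaling can raise a tight base timeout, but never past the 7200s cap.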
CI recommendations:
| Context | Timeout | Reasoning |
|---------|---------|-----------|
| Quick PR check | 300 (5 min) | Most single-turn RAG scenarios finish in 1–2 minutes |
| Standard CI run | 900 (default) | Covers multi-turn and multi-document scenarios |
| Full benchmark | 1200 | Extra buffer for large-document and vision scenarios |
```bash
# Quick PR check with tight timeout
gaia eval agent --category rag_quality --timeout 300

# Standard CI run
gaia eval agent --timeout 900
```
## Custom Scenario Directories

Use `--scenario-dir` to include scenarios from external directories:
```bash
# Include project-specific scenarios alongside the built-in ones
gaia eval agent --scenario-dir ~/my-project/eval-scenarios

# Multiple additional directories
gaia eval agent \
  --scenario-dir ~/project-a/scenarios \
  --scenario-dir ~/project-b/scenarios
```
Similarly, use `--corpus-dir` for additional corpus directories:

```bash
gaia eval agent --corpus-dir ~/my-project/eval-corpus
```
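The two flags can be combined, so a project can keep its scenarios and their corpus side by side:

```bash
# Project-local scenarios evaluated against a project-local corpus
gaia eval agent \
  --scenario-dir ~/my-project/eval-scenarios \
  --corpus-dir ~/my-project/eval-corpus
```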
## Best Practices for CI

### Do

- Run `--audit-only` on every push (free)
- Use `--category` to limit CI costs on PRs
- Save baselines after each release
- Upload `eval/results/` as artifacts
- Set `timeout-minutes` on the job
- Use `--budget` to cap per-scenario costs
- Trigger full benchmarks via `workflow_dispatch` (see the sketch after this list)
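With the tiered workflow above, a full benchmark can also be kicked off from the command line (assumes an authenticated GitHub CLI):

```bash
# Trigger the full benchmark on demand via workflow_dispatch
gh workflow run agent-eval-tiered.yml -f full_eval=true
```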
### Don't

- Run the full 54-scenario benchmark on every commit
- Skip `--compare` — regressions are the whole point
- Use `--fix` in CI (patches should be reviewed by humans)
- Ignore `BLOCKED_BY_ARCHITECTURE` — track these as known issues
- Set budget too low ($0.50) — scenarios may hit `BUDGET_EXCEEDED`
**Never use `--fix` in CI pipelines.** Fix mode patches source code directly and should only be used in local development, where changes can be reviewed before committing.
## Next Steps

- **Getting Started**: Run your first eval and read the scorecard
- **Scenario Authoring**: Write custom scenarios with YAML and ground truth
- **CLI Reference**: Complete flag reference for `gaia eval agent`
- **Agent Eval Benchmark**: Architecture deep-dive and scoring internals