
Source Code: src/gaia/eval/

Overview

The Agent Eval framework is designed for CI/CD integration. You can run evaluations on every push, compare results against baselines, detect regressions, and control costs with per-scenario budgets and timeouts. This guide covers:
  • Setting up GitHub Actions workflows
  • Baseline management for regression detection
  • Cost budgeting strategies
  • Interpreting scorecard diffs between releases

Quick Start: GitHub Actions

Here's a minimal workflow that runs the RAG quality scenarios on pushes and pull requests to main:
.github/workflows/agent-eval.yml
name: Agent Eval

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system

      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code

      - name: Start Agent UI backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5  # Wait for server to start

      - name: Run Agent Eval (RAG quality)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          gaia eval agent \
            --category rag_quality \
            --budget 1.00 \
            --timeout 300

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval/results/
The Agent UI backend requires a running Lemonade Server for inference. In CI, you have a few options:
  • Use a cloud LLM provider via --backend pointing to a hosted instance (see the sketch after this list)
  • Pre-install Lemonade Server in your CI environment
  • Use a self-hosted runner with AMD hardware
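If you take the hosted-backend route, only the eval step changes. A minimal sketch, assuming --backend accepts the URL of a hosted inference endpoint (the address below is a placeholder; check gaia eval agent --help for the exact argument format):
# Point the eval at a hosted endpoint instead of a local Lemonade Server
gaia eval agent \
  --category rag_quality \
  --backend https://inference.example.com/v1 \
  --budget 1.00 \
  --timeout 300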

Baseline Workflow

Regression detection works by comparing scorecard results between runs. The typical workflow:

1. Establish a Baseline

After a known-good release, save the scorecard as a baseline:
gaia eval agent --save-baseline
This writes eval/results/baseline.json. Commit this file to your repository.
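For example, to check the refreshed baseline into version control after a release:
git add eval/results/baseline.json
git commit -m "Update agent eval baseline"   # example commit message
git push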

2. Compare Against Baseline

On subsequent runs, compare the current results against the saved baseline:
gaia eval agent --compare eval/results/latest/scorecard.json
When only one path is provided, it’s compared against eval/results/baseline.json automatically.

3. Explicit Two-File Comparison

Compare any two scorecards directly:
gaia eval agent --compare eval/results/v1.0/scorecard.json eval/results/v1.1/scorecard.json

Complete Baseline CI Workflow

.github/workflows/agent-eval-regression.yml
name: Agent Eval Regression Check

on:
  pull_request:
    branches: [main]
    paths:
      - "src/gaia/agents/**"
      - "src/gaia/ui/**"
      - "src/gaia/rag/**"

jobs:
  eval-regression:
    runs-on: ubuntu-latest
    timeout-minutes: 45

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system
          npm install -g @anthropic-ai/claude-code

      - name: Start Agent UI backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5

      - name: Run eval and compare against baseline
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # Run eval
          gaia eval agent \
            --category rag_quality \
            --category context_retention \
            --budget 1.50 \
            --timeout 300

          # Compare against committed baseline
          gaia eval agent \
            --compare eval/results/latest/scorecard.json

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-scorecard
          path: |
            eval/results/latest/scorecard.json
            eval/results/latest/summary.md

Cost Budgeting

Per-Scenario Budget

The --budget flag sets the maximum USD spend per scenario:
# Conservative: $1 per scenario
gaia eval agent --budget 1.00

# Default: $2 per scenario
gaia eval agent

# Generous: $5 per scenario (for complex multi-turn scenarios)
gaia eval agent --budget 5.00
When a scenario exceeds its budget, it receives the BUDGET_EXCEEDED status and is excluded from quality metrics.
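To make budget overruns visible in CI logs, you can scan the scorecard for scenarios that hit the cap. A small sketch, assuming the scorecard JSON exposes a scenarios array with name and status fields (adjust the jq paths to the actual schema):
# List scenarios that exceeded their budget in the latest run
jq -r '.scenarios[] | select(.status == "BUDGET_EXCEEDED") | .name' \
  eval/results/latest/scorecard.json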

Typical Costs

| Scope | Approximate Cost | Time |
|-------|------------------|------|
| Single scenario | $0.02 – $0.10 | 30s – 3min |
| RAG quality category (7 scenarios) | $0.20 – $0.70 | 5 – 15min |
| Context retention (4 scenarios) | $0.10 – $0.40 | 3 – 10min |
| Full benchmark (54 scenarios) | $1.00 – $5.00 | 20 – 60min |
| Architecture audit | $0.00 | < 1min |
Costs depend on the judge model. The default claude-sonnet-4-6 is cost-effective. Using claude-opus-4.1 as the judge increases costs ~5x but may improve scoring accuracy for edge cases.

Cost-Optimized CI Strategy

Run different tiers of evaluation based on the trigger:
.github/workflows/agent-eval-tiered.yml
name: Agent Eval (Tiered)

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      full_eval:
        description: "Run full benchmark"
        type: boolean
        default: false

jobs:
  # Always run: architecture audit (free)
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system
      - name: Architecture Audit
        run: gaia eval agent --audit-only

  # PR only: run core categories (~$0.50)
  core-eval:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system
          npm install -g @anthropic-ai/claude-code
      - name: Start backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5
      - name: Run core eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          gaia eval agent \
            --category rag_quality \
            --category context_retention \
            --budget 1.00 \
            --timeout 300
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: core-eval-results
          path: eval/results/

  # Manual or main push: full benchmark (~$3.00)
  full-eval:
    if: >
      (github.event_name == 'workflow_dispatch' && github.event.inputs.full_eval == 'true')
      || (github.event_name == 'push' && github.ref == 'refs/heads/main')
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install
        run: |
          pip install uv
          uv pip install -e ".[eval]" --system
          npm install -g @anthropic-ai/claude-code
      - name: Start backend
        run: |
          uv run python -m gaia.ui.server &
          sleep 5
      - name: Run full eval and save baseline
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          gaia eval agent --save-baseline --budget 2.00
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: full-eval-results
          path: eval/results/

Interpreting Scorecard Diffs

When you run --compare, the output shows per-scenario deltas:

Regressions (PASS to FAIL)

❌ REGRESSION: simple_factual_rag
   Before: PASS (8.2/10)
   After:  FAIL (3.1/10)
   Delta:  -5.1 points
Regressions are highlighted and should block the PR. The trace file (traces/simple_factual_rag.json) contains the full conversation, dimension scores, and reasoning to help diagnose the issue.
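To diagnose locally, open the trace and read the judged conversation. The traces directory location below is an assumption (it may live elsewhere in your results folder); jq is only used for pretty-printing:
# Inspect the trace for the failing scenario
jq . eval/results/latest/traces/simple_factual_rag.json | less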

Improvements (FAIL to PASS)

✅ IMPROVEMENT: cross_turn_file_recall
   Before: FAIL (4.8/10)
   After:  PASS (7.5/10)
   Delta:  +2.7 points

Score Drops Within Same Status

⚠️ SCORE DROP: hallucination_resistance
   Before: PASS (9.1/10)
   After:  PASS (6.2/10)
   Delta:  -2.9 points
Even if the status stays PASS, a score drop of more than 2.0 points triggers a warning — the scenario is getting closer to failing.

Category-Level Changes

## Category Summary
| Category         | Before | After  | Change |
|------------------|--------|--------|--------|
| rag_quality      | 85%    | 71%    | -14%   |
| context_retention| 75%    | 100%   | +25%   |

JUnit Output for CI

Generate JUnit XML output for integration with CI dashboards (e.g., GitHub Actions test summary):
gaia eval agent --output-format junit
The JUnit output maps:
  • Each scenario = one test case
  • PASS = test passed
  • FAIL = test failed (with failure message from root cause analysis)
  • BLOCKED_BY_ARCHITECTURE = test skipped
  • Infrastructure statuses = test errored
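To surface these results in the GitHub Actions UI, pair the JUnit output with a test-reporting action. A sketch using the third-party dorny/test-reporter action; the XML path below assumes the report is written under eval/results/ (verify the actual output filename):
      - name: Publish eval results
        if: always()
        uses: dorny/test-reporter@v1    # note: needs 'checks: write' permission on the job
        with:
          name: Agent Eval
          path: eval/results/**/*.xml
          reporter: java-junit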

Timeout Management

The --timeout flag sets the base timeout per scenario in seconds. The runner automatically scales it based on scenario complexity:
effective_timeout = max(base_timeout,
                        120s startup overhead
                        + num_docs x 90s per document
                        + num_turns x 200s per turn)
                   capped at 7200s (2 hours)
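For example, a two-turn scenario over three documents run with --timeout 300 scales to max(300, 120 + 3 x 90 + 2 x 200) = 790 seconds, well under the 7200s cap:
# Illustration of the scaling rule: 3 documents, 2 turns, --timeout 300
base=300
scaled=$((120 + 3*90 + 2*200))                       # 120 + 270 + 400 = 790
effective=$(( scaled > base ? scaled : base ))       # 790
effective=$(( effective < 7200 ? effective : 7200 )) # still 790
echo "${effective}s"                                 # 790s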
CI recommendations:
| Context | Timeout | Reasoning |
|---------|---------|-----------|
| Quick PR check | 300 (5min) | Most single-turn RAG scenarios finish in 1-2 minutes |
| Standard CI run | 900 (default) | Covers multi-turn and multi-document scenarios |
| Full benchmark | 1200 | Extra buffer for large-document and vision scenarios |
# Quick PR check with tight timeout
gaia eval agent --category rag_quality --timeout 300

# Standard CI run
gaia eval agent --timeout 900

Custom Scenario Directories

Use --scenario-dir to include scenarios from external directories:
# Include project-specific scenarios alongside the built-in ones
gaia eval agent --scenario-dir ~/my-project/eval-scenarios

# Multiple additional directories
gaia eval agent \
  --scenario-dir ~/project-a/scenarios \
  --scenario-dir ~/project-b/scenarios
Similarly, use --corpus-dir for additional corpus directories:
gaia eval agent --corpus-dir ~/my-project/eval-corpus
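If your external scenarios reference their own documents, you will typically pass both flags together (paths are illustrative):
gaia eval agent \
  --scenario-dir ~/my-project/eval-scenarios \
  --corpus-dir ~/my-project/eval-corpus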

Best Practices for CI

Do

  • Run --audit-only on every push (free)
  • Use --category to limit CI costs on PRs
  • Save baselines after each release
  • Upload eval/results/ as artifacts
  • Set timeout-minutes on the job
  • Use --budget to cap per-scenario costs
  • Trigger full benchmarks via workflow_dispatch

Don't

  • Run the full 54-scenario benchmark on every commit
  • Skip --compare — regressions are the whole point
  • Use --fix in CI (patches should be reviewed by humans)
  • Ignore BLOCKED_BY_ARCHITECTURE — track these as known issues
  • Set the budget too low (e.g., $0.50) — scenarios may hit BUDGET_EXCEEDED
Never use --fix in CI pipelines. Fix mode patches source code directly and should only be used in local development where changes can be reviewed before committing.

Next Steps

  • Getting Started: Run your first eval and read the scorecard
  • Scenario Authoring: Write custom scenarios with YAML and ground truth
  • CLI Reference: Complete flag reference for gaia eval agent
  • Agent Eval Benchmark: Architecture deep-dive and scoring internals