Skip to main content

Testing

17.1 Testing Agents

import pytest
from my_package.agent import MyAgent

def test_agent_creation():
    """Test agent can be created."""
    agent = MyAgent()
    assert agent is not None

def test_tool_execution():
    """Test a tool's implementation directly.

    The base Agent does not expose a public `execute_tool()` — tools are
    invoked by the planner after parsing LLM output. For unit tests, either
    call the tool function directly (it's just a decorated Python function)
    or look it up in the global registry.
    """
    from gaia.agents.base.tools import _TOOL_REGISTRY
    agent = MyAgent()
    entry = _TOOL_REGISTRY["my_tool"]
    result = entry["function"](param="value")
    assert result["status"] == "success"

def test_query_processing():
    """Test full query processing."""
    agent = MyAgent()
    result = agent.process_query("Do something")
    assert "result" in result

17.2 Silent Mode for Testing

from my_package.agent import MyAgent

def test_silent_mode():
    """Test with no console output."""
    agent = MyAgent(silent_mode=True)
    result = agent.process_query("Test query")
    # No console output, just results
    assert result is not None

17.3 Mocking LLM Responses

from unittest.mock import patch
from gaia.chat.sdk import AgentResponse

def test_with_mocked_llm():
    """Test agent with a mocked AgentSDK.

    The AgentSDK exposes `send()` (and `send_stream()`), not `.complete()`.
    Return an AgentResponse whose `text` contains the JSON the planner
    expects — the agent will parse it and invoke the referenced tool.
    """
    with patch("gaia.chat.sdk.AgentSDK") as mock_chat:
        mock_chat.return_value.send.return_value = AgentResponse(
            text='{"tool": "my_tool", "tool_args": {"param": "value"}}',
        )

        agent = MyAgent()
        agent.process_query("Test")

        assert mock_chat.return_value.send.called
Consider using the built-in MockLLMProvider / MockVLMClient helpers in gaia.testing (src/gaia/testing/mocks.py) for richer fixtures. The require_lemonade pytest fixture in tests/conftest.py skips integration tests automatically if no Lemonade server is running.

Behavior-E2E — Assert side-effects, not replies

The class of bug “agent claims success but the tool never ran” (see #1428) is invisible to:
  • Unit tests that mock the LLM and feed clean tool calls
  • UI render E2E tests that assert “the agent replied”
The pattern to catch it: drive each tool-using agent through the real server with a real model, repeat each scenario N× (the failure is output-format-dependent and non-deterministic), and assert the tool’s side-effect actually occurred — not merely that the agent produced a reply.
  • False-success is a hard fail: a “success” reply with no side-effect is worse than honest failure.
  • Planted unguessable facts: use secrets.token_hex(4) in the prompt so a cached or hallucinated reply cannot accidentally pass.
  • N× repetition: repeat ≥5 times; treat any single run with a missing side-effect as a failure.
See src/gaia/eval/behavior_harness.py for the reusable Scenario/BehaviorHarness implementation. New agents plug in by adding a Scenario to BUILDER_SCENARIOS (or their own list). The live tests live in tests/integration/eval/test_behavior_e2e.py and are gated by @pytest.mark.real_model — they skip automatically on Mac/standard CI and only run on [self-hosted, strix-halo] runners once #1297 lands. The helper unit tests (tests/unit/eval/test_behavior_harness.py) are fully deterministic and run in normal CI with no model required.