Testing
17.1 Testing Agents
import pytest
from my_package.agent import MyAgent
def test_agent_creation():
"""Test agent can be created."""
agent = MyAgent()
assert agent is not None
def test_tool_execution():
"""Test a tool's implementation directly.
The base Agent does not expose a public `execute_tool()` — tools are
invoked by the planner after parsing LLM output. For unit tests, either
call the tool function directly (it's just a decorated Python function)
or look it up in the global registry.
"""
from gaia.agents.base.tools import _TOOL_REGISTRY
agent = MyAgent()
entry = _TOOL_REGISTRY["my_tool"]
result = entry["function"](param="value")
assert result["status"] == "success"
def test_query_processing():
"""Test full query processing."""
agent = MyAgent()
result = agent.process_query("Do something")
assert "result" in result
17.2 Silent Mode for Testing
from my_package.agent import MyAgent
def test_silent_mode():
"""Test with no console output."""
agent = MyAgent(silent_mode=True)
result = agent.process_query("Test query")
# No console output, just results
assert result is not None
17.3 Mocking LLM Responses
from unittest.mock import patch
from gaia.chat.sdk import AgentResponse
def test_with_mocked_llm():
"""Test agent with a mocked AgentSDK.
The AgentSDK exposes `send()` (and `send_stream()`), not `.complete()`.
Return an AgentResponse whose `text` contains the JSON the planner
expects — the agent will parse it and invoke the referenced tool.
"""
with patch("gaia.chat.sdk.AgentSDK") as mock_chat:
mock_chat.return_value.send.return_value = AgentResponse(
text='{"tool": "my_tool", "tool_args": {"param": "value"}}',
)
agent = MyAgent()
agent.process_query("Test")
assert mock_chat.return_value.send.called
Consider using the built-in MockLLMProvider / MockVLMClient helpers in
gaia.testing (src/gaia/testing/mocks.py) for richer fixtures. The
require_lemonade pytest fixture in tests/conftest.py skips integration
tests automatically if no Lemonade server is running.
Behavior-E2E — Assert side-effects, not replies
The class of bug “agent claims success but the tool never ran” (see #1428) is invisible to:
- Unit tests that mock the LLM and feed clean tool calls
- UI render E2E tests that assert “the agent replied”
The pattern to catch it: drive each tool-using agent through the real server with a real model,
repeat each scenario N× (the failure is output-format-dependent and non-deterministic),
and assert the tool’s side-effect actually occurred — not merely that the agent produced a reply.
- False-success is a hard fail: a “success” reply with no side-effect is worse than honest failure.
- Planted unguessable facts: use
secrets.token_hex(4) in the prompt so a cached or hallucinated reply cannot accidentally pass.
- N× repetition: repeat ≥5 times; treat any single run with a missing side-effect as a failure.
See src/gaia/eval/behavior_harness.py for the reusable Scenario/BehaviorHarness implementation.
New agents plug in by adding a Scenario to BUILDER_SCENARIOS (or their own list).
The live tests live in tests/integration/eval/test_behavior_e2e.py and are gated by
@pytest.mark.real_model — they skip automatically on Mac/standard CI and only run on
[self-hosted, strix-halo] runners once #1297 lands.
The helper unit tests (tests/unit/eval/test_behavior_harness.py) are fully deterministic
and run in normal CI with no model required.