
Introduction

This playbook demonstrates building a multi-modal agent by composing image generation and vision capabilities into a single agent class. You’ll use Python’s multiple inheritance to combine SDToolsMixin (Stable Diffusion image generation) and VLMToolsMixin (Vision Language Model analysis) with GAIA’s base Agent class.

The architecture is straightforward: your agent runs an LLM as its reasoning engine, which has access to tools from both mixins. When a user asks to generate an image and create a story, the LLM decides the sequence: call generate_image(), then call create_story_from_image() to analyze the result. Behind the scenes, Lemonade Server hosts the model endpoints locally, providing LLM inference for the agent’s reasoning, VLM inference for image analysis, and Stable Diffusion for image generation. Everything runs on your machine with no external API calls.

The key pattern you’ll learn is tool composition through mixins. Each mixin registers its tools via the @tool decorator during initialization. You’ll see how domain-specific tools can wrap generic capabilities, creating specialized workflows while keeping the underlying components reusable. By the end, you’ll have a working multi-modal agent in ~60 lines of code and understand how to compose tool capabilities using GAIA’s mixin architecture.
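To make the pattern concrete before you build the real thing, here is a minimal, hypothetical sketch of what a tool mixin looks like from the inside. The GreetingToolsMixin class and its greet tool are purely illustrative and are not part of GAIA; the real SDToolsMixin and VLMToolsMixin are used later in this guide.
from gaia.agents.base.tools import tool

class GreetingToolsMixin:
    """Illustrative mixin (not GAIA's API): registers one tool during init."""

    def init_greetings(self, default_name: str = "world"):
        self._default_name = default_name

        @tool(atomic=True)
        def greet(name: str = None) -> dict:
            """Return a friendly greeting for the given name."""
            return {"status": "success", "greeting": f"Hello, {name or self._default_name}!"}
An agent class would then inherit from Agent and GreetingToolsMixin and call init_greetings() in its __init__, exactly as ImageStoryAgent does with init_sd() and init_vlm() below.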

Quick Test

First time? See Setup Guide to install prerequisites (uv, Python, AMD drivers).
Want to try it first before building? Install models and start Lemonade Server:
gaia init --profile sd
Generate one image + story with the built-in agent:
gaia sd "generate an image of a robot exploring ancient ruins and tell me the story of what it discovers"

What You’ll Build

An ImageStoryAgent that generates images and creates stories about them. ~60 lines of code total. When you run agent.process_query("create a robot exploring ancient ruins"), here’s what happens:
  1. LLM decides the plan: Qwen3-8B analyzes your query and returns {plan: [{tool: "generate_image", tool_args: {prompt: "..."}}, {tool: "create_story_from_image", tool_args: {...}}]}
  2. Executes generate_image(): Sends prompt to SDXL-Turbo (4-step diffusion model), receives PNG bytes, saves to disk (~17s)
  3. Executes create_story_from_image(): Calls your custom tool → sends image bytes to Qwen3-VL-4B vision model → receives 2-3 paragraph narrative (~15s)
  4. Returns result: {"status": "success", "result": "Here's your image and story...", "conversation": [...], "steps_taken": 3}
You’ll create the create_story_from_image tool yourself using VLM capabilities. Three models, one agent:
  • Qwen3-8B-GGUF (5GB): Reasoning engine - decides which tools to call and when
  • SDXL-Turbo (6.5GB): Diffusion model - generates images from text prompts
  • Qwen3-VL-4B (3.2GB): Vision-language model - analyzes images and generates text
All run locally via Lemonade Server. GGUF models use AMD Radeon iGPU (Vulkan), SDXL runs on CPU.
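Here is roughly how the finished agent is driven from Python, a sketch that matches the return value described in item 4 above (you will build ImageStoryAgent step by step in the sections below):
agent = ImageStoryAgent()
result = agent.process_query("create a robot exploring ancient ruins")

# result is a dict like the one shown in item 4 above
if result.get("status") == "success":
    print(result["result"])        # final answer, including the story text
    print(result["steps_taken"])   # how many planning/tool steps were used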

Prerequisites

1. Complete setup

If you haven’t already, follow the Setup guide to install prerequisites (uv, Lemonade Server, etc.).
2. Create project folder

mkdir my-sd-agent
cd my-sd-agent
3. Create virtual environment

uv venv .venv --python 3.12
uv will automatically download Python 3.12 if not already installed.
Activate the environment:
source .venv/bin/activate
4. Install GAIA

uv pip install amd-gaia
5. Initialize SD profile

gaia init --profile sd
This command:
  1. Installs Lemonade Server (if not already installed)
  2. Starts Lemonade Server automatically
  3. Configures Lemonade Server with 16K context (required for SD agent multi-step planning)
  4. Downloads three models (~15GB):
    • SDXL-Turbo (6.5GB) - Diffusion model in safetensors format
    • Qwen3-8B-GGUF (5GB) - Agent LLM in quantized GGUF format (loaded with 16K context)
    • Qwen3-VL-4B-Instruct-GGUF (3.2GB) - Vision LLM in quantized GGUF format
  5. Verifies models and context size
Hardware acceleration:
  • GGUF models run on iGPU (Radeon integrated graphics) via Vulkan
  • SDXL-Turbo currently runs on CPU (GPU acceleration planned)
Video: Running gaia init --profile sd and watching models download

Building the Agent

Create a new file image_story_agent.py and add this code:

Step 1: Initialize the Agent

from gaia.agents.base import Agent
from gaia.sd import SDToolsMixin
from gaia.vlm import VLMToolsMixin

class ImageStoryAgent(Agent, SDToolsMixin, VLMToolsMixin):
    """Agent that generates images and creates stories."""

    def __init__(self, output_dir="./generated_images"):
        super().__init__(model_id="Qwen3-8B-GGUF")
        self.init_sd(output_dir=output_dir, default_model="SDXL-Turbo")
        self.init_vlm(model="Qwen3-VL-4B-Instruct-GGUF")
What this does:
  • Agent, SDToolsMixin, VLMToolsMixin: Inherits three capabilities - reasoning, image generation, vision analysis
  • super().__init__(model_id="Qwen3-8B-GGUF"): Sets up the reasoning LLM (downloaded by gaia init --profile sd)
  • init_sd(): Registers image generation tools (generate_image, list_sd_models, get_generation_history)
  • init_vlm(): Registers vision tools (analyze_image, answer_question_about_image)
The agent now has 5 tools from mixins. You’ll add a 6th custom tool next.
From SDToolsMixin:
generate_image(
    prompt: str,
    model: str = None,
    size: str = None,
    steps: int = None,
    cfg_scale: float = None,
    seed: int = None
) -> dict
  • Sends prompt to SDXL-Turbo diffusion model
  • Returns: {"status": "success", "image_path": "...", "generation_time_s": 17.2, "seed": 123456}
  • Saves PNG to output_dir
list_sd_models() -> dict
  • Returns available models with speed/quality characteristics
get_generation_history(limit: int = 10) -> dict
  • Returns recent generations from this session
From VLMToolsMixin:
analyze_image(image_path: str, focus: str = "all") -> dict
  • Sends image bytes to Qwen3-VL-4B vision model
  • Returns: {"status": "success", "description": "...", "focus_analysis": "..."}
answer_question_about_image(image_path: str, question: str) -> dict
  • Sends image + question to VLM
  • Returns: {"status": "success", "answer": "...", "confidence": "high"}
You can change default_model in init_sd():
  • SDXL-Turbo (default): ~17s, balanced
  • SD-Turbo: ~13s, fast prototyping
  • SDXL-Base-1.0: ~9min, photorealistic
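For example, to trade some quality for speed, you could default to SD-Turbo by changing one line of the __init__ from Step 1:
    def __init__(self, output_dir="./generated_images"):
        super().__init__(model_id="Qwen3-8B-GGUF")
        self.init_sd(output_dir=output_dir, default_model="SD-Turbo")  # ~13s per image
        self.init_vlm(model="Qwen3-VL-4B-Instruct-GGUF")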

Step 2: Add Custom Tool

Add _register_tools() to create create_story_from_image:
    def _register_tools(self):
        from gaia.agents.base.tools import tool
        from pathlib import Path

        @tool(atomic=True)
        def create_story_from_image(image_path: str, story_style: str = "any") -> dict:
            """Generate a creative short story (2-3 paragraphs) based on an image."""
            path = Path(image_path)
            if not path.exists():
                return {"status": "error", "error": f"Image not found: {image_path}"}

            # Read image bytes
            image_bytes = path.read_bytes()

            # Build story prompt based on style
            style_map = {
                "whimsical": "playful and lighthearted",
                "dramatic": "intense and emotionally charged",
                "adventure": "exciting with action and discovery",
                "educational": "informative and teaches something",
                "any": "engaging and imaginative"
            }
            style_desc = style_map.get(story_style, "engaging and imaginative")

            # Call VLM to generate story
            prompt = f"Create a short creative story (2-3 paragraphs) that is {style_desc}. Bring the image to life with narrative."
            story = self.vlm_client.extract_from_image(image_bytes, prompt=prompt)

            return {
                "status": "success",
                "story": story,
                "story_style": story_style,
                "image_path": str(path)
            }
What this does:
  • @tool(atomic=True): Registers create_story_from_image as an atomic tool
    • atomic=True: Tool is self-contained - executes completely without multi-step planning
    • Most tools are atomic (generate image, analyze image, query database)
    • Non-atomic: multi-step tasks like “research a topic”
  • Reads image bytes from file path
  • Builds a storytelling prompt based on the style parameter
  • Calls self.vlm_client.extract_from_image() - sends image + prompt to Qwen3-VL-4B vision model
  • Returns structured dict with story text and metadata
  • Decorator auto-generates JSON schema from function signature and docstring
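If you want the agent to support an additional tone, extend style_map inside create_story_from_image. The "noir" entry below is a hypothetical addition, not part of the original example:
            style_map = {
                "whimsical": "playful and lighthearted",
                "dramatic": "intense and emotionally charged",
                "adventure": "exciting with action and discovery",
                "educational": "informative and teaches something",
                "noir": "moody and atmospheric, like a detective story",  # hypothetical new style
                "any": "engaging and imaginative"
            }
If you add a style, also mention it in the system prompt (Step 3) so the LLM knows it is available.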

Step 3: Add System Prompt (Optional)

Add _get_system_prompt() to customize instructions:
    def _get_system_prompt(self) -> str:
        return """You are an image generation and storytelling agent.

WORKFLOW for image + story requests:
1. Call generate_image() - THIS RETURNS: {"status": "success", "image_path": ".gaia/cache/sd/images/..."}
2. Extract the ACTUAL image_path value from step 1's result
3. Call create_story_from_image(image_path=<ACTUAL PATH FROM STEP 1>, story_style=<user's tone>)
4. Include the COMPLETE story text in your final answer

CRITICAL: Extract actual values from tool results, NEVER use placeholders like "$IMAGE_PATH$" or "generated_image_path"

CORRECT parameter passing:
Step 1 result: {"image_path": ".gaia/cache/sd/images/robot_123.png"}
Step 2 args: {"image_path": ".gaia/cache/sd/images/robot_123.png", "story_style": "adventure"} ✓

INCORRECT (DO NOT DO THIS):
Step 2 args: {"image_path": "$IMAGE_PATH$"} ✗
Step 2 args: {"image_path": "generated_image_path"} ✗

OTHER RULES:
- Generate ONE image by default (multiple only if explicitly requested: "3 images", "variations")
- Match story_style to user's request: "whimsical" (cute/playful), "adventure" (action), "dramatic" (intense), "any" (default)
- Include full story text in answer - users want to read it immediately"""
What this does:
  • Documents your custom create_story_from_image tool (parameters, return value, purpose)
  • Provides explicit workflow: generate image → create story using the image_path → include full text in answer
  • Sets critical rules: use image_path from results, include complete story text, default to one image
  • Gives example showing proper parameter passing between tool calls
Why this is important:
  • Mixins only describe their own tools (SD guidelines don’t know about VLM or custom tools)
  • Your custom tool needs explicit documentation and usage instructions
  • Clear workflow prevents errors like missing image_path or incomplete responses
  • Examples help the LLM understand expected behavior
Tool schemas (parameters, types) are auto-added by the @tool decorator. Your system prompt provides usage guidelines and workflow that schemas can’t express.

Step 4: Add CLI and Run

Add the CLI loop to test your agent:
if __name__ == "__main__":
    import os

    os.makedirs("./generated_images", exist_ok=True)
    agent = ImageStoryAgent()

    print("Image Story Agent ready! Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit", "q"):
            break
        if user_input:
            result = agent.process_query(user_input)
            if result.get("result"):
                print(f"\nAgent: {result['result']}\n")
Quick summary:
  • Creates output directory and initializes agent (loads Qwen3-8B, registers 6 tools)
  • Loops: reads input → calls process_query(input) → prints result
  • process_query() handles LLM planning and tool execution automatically
When you call agent.process_query("create a robot exploring ruins"):

Step 1 - LLM Planning:
# Agent sends to Qwen3-8B
{
  "messages": [{"role": "user", "content": "create a robot exploring ruins"}],
  "tools": [
    {"name": "generate_image", "parameters": {"prompt": "str", ...}},
    {"name": "create_story_from_image", "parameters": {"image_path": "str", ...}},
    # ... other tools
  ]
}
Step 2 - LLM Response:
# LLM decides the plan
{
  "plan": [
    {"tool": "generate_image", "tool_args": {"prompt": "robot exploring ancient ruins"}},
    {"tool": "create_story_from_image", "tool_args": {"image_path": "..."}}
  ]
}
Step 3 - Tool Execution:
  • Calls generate_image("robot exploring ancient ruins") → Saves PNG to disk (~17s)
  • Calls create_story_from_image("path/to/image.png") → Gets story from VLM (~15s)
Step 4 - Final Synthesis:
  • Agent sends tool results back to LLM
  • LLM creates final answer combining both results
Step 5 - Return Value:
{
  "status": "success",
  "result": "I created an image of a robot exploring ancient ruins and wrote a story about it...",
  "conversation": [...],  # Full conversation history
  "steps_taken": 3,
  "duration": 42.3
}
Agent limits: Max 20 steps (configurable). Stops when LLM returns answer field instead of requesting more tools.
Run it:
python image_story_agent.py
Or skip the typing and run the complete example from the GAIA repository:
python examples/sd_agent_example.py
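If you prefer a one-shot script over the interactive loop, you can adapt the __main__ block to take the query from the command line. A sketch (the argument handling is illustrative):
import sys

if __name__ == "__main__":
    agent = ImageStoryAgent()
    query = " ".join(sys.argv[1:]) or "create a robot exploring ancient ruins"
    result = agent.process_query(query)
    print(result.get("result", result))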
Try:
You: create a robot exploring ancient ruins
You: generate a sunset over mountains and describe the colors
You: make a cyberpunk street scene and tell me a story
🎉 Congratulations! You’ve built a fully functional multi-modal agent in ~60 lines of code. The agent can generate images with Stable Diffusion, analyze them with vision models, and create stories—all running locally on AMD hardware.

Troubleshooting

Error: LemonadeClientError: Cannot connect to http://localhost:8000

Solution: Lemonade Server isn’t running. Check status:
lemonade-server status
If not running, gaia init --profile sd should have started it. Try restarting:
lemonade-server serve
Error: Model 'SDXL-Turbo' not found

Solution: Download the missing models:
lemonade-server pull SDXL-Turbo
lemonade-server pull Qwen3-VL-4B-Instruct-GGUF
Or re-run init:
gaia init --profile sd
Error: ModuleNotFoundError: No module named 'gaia.sd'

Solution: Upgrade GAIA:
uv pip install --upgrade amd-gaia
Verify installation:
python -c "from gaia.sd import SDToolsMixin; print('OK')"
Question: How do I generate the same image twice?

Solution: Use a fixed seed.

CLI:
gaia sd "robot kitten" --seed 42
Python:
result = agent.process_query("generate robot kitten with seed 42")
By default, images use random seeds for variety. Specify a seed for reproducibility.
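Under the hood, reproducibility just means the seed ends up in the generate_image tool arguments. Conceptually, the plan for the Python example above should look roughly like this (illustrative; the exact plan depends on the LLM):
# Illustrative plan for "generate robot kitten with seed 42"
{
  "plan": [
    {"tool": "generate_image", "tool_args": {"prompt": "robot kitten", "seed": 42}}
  ]
}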

Still Having Issues?

If you’re experiencing a problem not listed above, we’re here to help.

Report an Issue:
  • 🐛 Create a GitHub Issue - Best for bugs, feature requests, and technical issues
  • Include your GAIA version (gaia --version), OS, and error messages
Before reporting:
  1. Check the FAQ and Troubleshooting Guide
  2. Search existing issues to avoid duplicates
  3. Include reproduction steps, error messages, and environment details

What’s Next?