
Introduction

This playbook demonstrates building a multi-modal agent by composing image generation and vision capabilities into a single agent class. You’ll use Python’s multiple inheritance to combine SDToolsMixin (Stable Diffusion image generation) and VLMToolsMixin (Vision Language Model analysis) with GAIA’s base Agent class.

The architecture is straightforward: your agent runs an LLM as its reasoning engine, which has access to tools from both mixins. When a user asks to generate an image and create a story, the LLM decides the sequence: call generate_image(), then call create_story_from_image() to analyze the result. Behind the scenes, Lemonade Server hosts the model endpoints locally, providing LLM inference for the agent’s reasoning, VLM inference for image analysis, and Stable Diffusion for image generation. Everything runs on your machine with no external API calls.

The key pattern you’ll learn is tool composition through mixins. Each mixin registers its tools via the @tool decorator during initialization. You’ll see how domain-specific tools can wrap generic capabilities, creating specialized workflows while keeping the underlying components reusable. By the end, you’ll have a working multi-modal agent in ~60 lines of code and understand how to compose tool capabilities using GAIA’s mixin architecture.
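To make the pattern concrete before you build the real thing, here is a minimal, hypothetical sketch of what a tool mixin looks like from the inside. The GreetingToolsMixin class and its greet tool are purely illustrative and are not part of GAIA; the real SDToolsMixin and VLMToolsMixin are used later in this guide.
from gaia.agents.base.tools import tool

class GreetingToolsMixin:
    """Illustrative mixin (not GAIA's API): registers one tool during init."""

    def init_greetings(self, default_name: str = "world"):
        self._default_name = default_name

        @tool(atomic=True)
        def greet(name: str = None) -> dict:
            """Return a friendly greeting for the given name."""
            return {"status": "success", "greeting": f"Hello, {name or self._default_name}!"}
An agent class would then inherit from Agent and GreetingToolsMixin and call init_greetings() in its __init__, exactly as ImageStoryAgent does with init_sd() and init_vlm() below.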

Quick Test

First time? See Setup Guide to install prerequisites (uv, Python, AMD drivers).
Want to try it first before building? Install models and start Lemonade Server:
gaia init --profile sd
Generate one image + story with the built-in agent:
gaia sd "generate an image of a robot exploring ancient ruins and tell me the story of what it discovers"

What You’ll Build

An ImageStoryAgent that generates images and creates stories about them. ~60 lines of code total. When you run agent.process_query("create a robot exploring ancient ruins"), here’s what happens:
  1. LLM decides the plan: Qwen3-8B analyzes your query and returns {plan: [{tool: "generate_image", tool_args: {prompt: "..."}}, {tool: "create_story_from_image", tool_args: {...}}]}
  2. Executes generate_image(): Sends prompt to SDXL-Turbo (4-step diffusion model), receives PNG bytes, saves to disk (~17s)
  3. Executes create_story_from_image(): Calls your custom tool → sends image bytes to Qwen3-VL-4B vision model → receives 2-3 paragraph narrative (~15s)
  4. Returns result: {"status": "success", "result": "Here's your image and story...", "conversation": [...], "steps_taken": 3}
You’ll create the create_story_from_image tool yourself using VLM capabilities. Three models, one agent:
  • Qwen3-8B-GGUF (5GB): Reasoning engine - decides which tools to call and when
  • SDXL-Turbo (6.5GB): Diffusion model - generates images from text prompts
  • Qwen3-VL-4B (3.2GB): Vision-language model - analyzes images and generates text
All run locally via Lemonade Server. GGUF models use AMD Radeon iGPU (Vulkan), SDXL runs on CPU.
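Here is roughly how the finished agent is driven from Python, a sketch that matches the return value described in item 4 above (you will build ImageStoryAgent step by step in the sections below):
agent = ImageStoryAgent()
result = agent.process_query("create a robot exploring ancient ruins")

# result is a dict like the one shown in item 4 above
if result.get("status") == "success":
    print(result["result"])        # final answer, including the story text
    print(result["steps_taken"])   # how many planning/tool steps were used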

Prerequisites

1. Complete setup

If you haven’t already, follow the Setup guide to install prerequisites (uv, Lemonade Server, etc.).
2. Create project folder

mkdir my-sd-agent
cd my-sd-agent
3. Create virtual environment

uv venv .venv --python 3.12
uv will automatically download Python 3.12 if not already installed.
Activate the environment:
source .venv/bin/activate
4. Install GAIA

uv pip install amd-gaia
5. Initialize SD profile

gaia init --profile sd
This command:
  1. Installs Lemonade Server (if not already installed)
  2. Starts Lemonade Server automatically
  3. Configures Lemonade Server with 16K context (required for SD agent multi-step planning)
  4. Downloads three models (~15GB):
    • SDXL-Turbo (6.5GB) - Diffusion model in safetensors format
    • Qwen3-8B-GGUF (5GB) - Agent LLM in quantized GGUF format (loaded with 16K context)
    • Qwen3-VL-4B-Instruct-GGUF (3.2GB) - Vision LLM in quantized GGUF format
  5. Verifies models and context size
Hardware acceleration:
  • GGUF models run on iGPU (Radeon integrated graphics) via Vulkan
  • SDXL-Turbo currently runs on CPU (GPU acceleration planned)
Video: Running gaia init --profile sd and watching models download

Building the Agent

Create a new file image_story_agent.py and add this code:

Step 1: Initialize the Agent

from gaia.agents.base import Agent
from gaia.sd import SDToolsMixin
from gaia.vlm import VLMToolsMixin

class ImageStoryAgent(Agent, SDToolsMixin, VLMToolsMixin):
    """Agent that generates images and creates stories."""

    def __init__(self, output_dir="./generated_images"):
        super().__init__(model_id="Qwen3-8B-GGUF")
        self.init_sd(output_dir=output_dir, default_model="SDXL-Turbo")
        self.init_vlm(model="Qwen3-VL-4B-Instruct-GGUF")
What this does:
  • Agent, SDToolsMixin, VLMToolsMixin: Inherits three capabilities - reasoning, image generation, vision analysis
  • super().__init__(model_id="Qwen3-8B-GGUF"): Sets up the reasoning LLM (downloaded by gaia init --profile sd)
  • init_sd(): Registers image generation tools (generate_image, list_sd_models, get_generation_history)
  • init_vlm(): Registers vision tools (analyze_image, answer_question_about_image)
The agent now has 5 tools from mixins. You’ll add a 6th custom tool next.
From SDToolsMixin:
generate_image(
    prompt: str,
    model: str = None,
    size: str = None,
    steps: int = None,
    cfg_scale: float = None,
    seed: int = None
) -> dict
  • Sends prompt to SDXL-Turbo diffusion model
  • Returns: {"status": "success", "image_path": "...", "generation_time_s": 17.2, "seed": 123456}
  • Saves PNG to output_dir
list_sd_models() -> dict
  • Returns available models with speed/quality characteristics
get_generation_history(limit: int = 10) -> dict
  • Returns recent generations from this session
From VLMToolsMixin:
analyze_image(image_path: str, focus: str = "all") -> dict
  • Sends image bytes to Qwen3-VL-4B vision model
  • Returns: {"status": "success", "description": "...", "focus_analysis": "..."}
answer_question_about_image(image_path: str, question: str) -> dict
  • Sends image + question to VLM
  • Returns: {"status": "success", "answer": "...", "confidence": "high"}
You can change default_model in init_sd():
  • SDXL-Turbo (default): ~17s, balanced
  • SD-Turbo: ~13s, fast prototyping
  • SDXL-Base-1.0: ~9min, photorealistic
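For example, to trade some quality for speed, you could default to SD-Turbo by changing one line of the __init__ from Step 1:
    def __init__(self, output_dir="./generated_images"):
        super().__init__(model_id="Qwen3-8B-GGUF")
        self.init_sd(output_dir=output_dir, default_model="SD-Turbo")  # ~13s per image
        self.init_vlm(model="Qwen3-VL-4B-Instruct-GGUF")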

Step 2: Add Custom Tool

Add _register_tools() to create create_story_from_image:
    def _register_tools(self):
        from gaia.agents.base.tools import tool
        from pathlib import Path

        @tool(atomic=True)
        def create_story_from_image(image_path: str, story_style: str = "any") -> dict:
            """Generate a creative short story (2-3 paragraphs) based on an image."""
            path = Path(image_path)
            if not path.exists():
                return {"status": "error", "error": f"Image not found: {image_path}"}

            # Read image bytes
            image_bytes = path.read_bytes()

            # Build story prompt based on style
            style_map = {
                "whimsical": "playful and lighthearted",
                "dramatic": "intense and emotionally charged",
                "adventure": "exciting with action and discovery",
                "educational": "informative and teaches something",
                "any": "engaging and imaginative"
            }
            style_desc = style_map.get(story_style, "engaging and imaginative")

            # Call VLM to generate story
            prompt = f"Create a short creative story (2-3 paragraphs) that is {style_desc}. Bring the image to life with narrative."
            story = self.vlm_client.extract_from_image(image_bytes, prompt=prompt)

            return {
                "status": "success",
                "story": story,
                "story_style": story_style,
                "image_path": str(path)
            }
What this does:
  • @tool(atomic=True): Registers create_story_from_image as an atomic tool
    • atomic=True: Tool is self-contained - executes completely without multi-step planning
    • Most tools are atomic (generate image, analyze image, query database)
    • Non-atomic: multi-step tasks like “research a topic”
  • Reads image bytes from file path
  • Builds a storytelling prompt based on the style parameter
  • Calls self.vlm_client.extract_from_image() - sends image + prompt to Qwen3-VL-4B vision model
  • Returns structured dict with story text and metadata
  • Decorator auto-generates JSON schema from function signature and docstring
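If you want the agent to support an additional tone, extend style_map inside create_story_from_image. The "noir" entry below is a hypothetical addition, not part of the original example:
            style_map = {
                "whimsical": "playful and lighthearted",
                "dramatic": "intense and emotionally charged",
                "adventure": "exciting with action and discovery",
                "educational": "informative and teaches something",
                "noir": "moody and atmospheric, like a detective story",  # hypothetical new style
                "any": "engaging and imaginative"
            }
If you add a style, also mention it in the system prompt (Step 3) so the LLM knows it is available.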

Step 3: Add System Prompt (Optional)

Add _get_system_prompt() to customize instructions:
    def _get_system_prompt(self) -> str:
        return """You are an image generation and storytelling agent.

WORKFLOW for image + story requests:
1. Call generate_image() - THIS RETURNS: {"status": "success", "image_path": ".gaia/cache/sd/images/..."}
2. Extract the ACTUAL image_path value from step 1's result
3. Call create_story_from_image(image_path=<ACTUAL PATH FROM STEP 1>, story_style=<user's tone>)
4. Include the COMPLETE story text in your final answer

CRITICAL: Extract actual values from tool results, NEVER use placeholders like "$IMAGE_PATH$" or "generated_image_path"

CORRECT parameter passing:
Step 1 result: {"image_path": ".gaia/cache/sd/images/robot_123.png"}
Step 2 args: {"image_path": ".gaia/cache/sd/images/robot_123.png", "story_style": "adventure"} ✓

INCORRECT (DO NOT DO THIS):
Step 2 args: {"image_path": "$IMAGE_PATH$"} ✗
Step 2 args: {"image_path": "generated_image_path"} ✗

OTHER RULES:
- Generate ONE image by default (multiple only if explicitly requested: "3 images", "variations")
- Match story_style to user's request: "whimsical" (cute/playful), "adventure" (action), "dramatic" (intense), "any" (default)
- Include full story text in answer - users want to read it immediately"""
What this does:
  • Documents your custom create_story_from_image tool (parameters, return value, purpose)
  • Provides explicit workflow: generate image → create story using the image_path → include full text in answer
  • Sets critical rules: use image_path from results, include complete story text, default to one image
  • Gives example showing proper parameter passing between tool calls
Why this is important:
  • Mixins only describe their own tools (SD guidelines don’t know about VLM or custom tools)
  • Your custom tool needs explicit documentation and usage instructions
  • Clear workflow prevents errors like missing image_path or incomplete responses
  • Examples help the LLM understand expected behavior
Tool schemas (parameters, types) are auto-added by the @tool decorator. Your system prompt provides usage guidelines and workflow that schemas can’t express.

Step 4: Add CLI and Run

Add the CLI loop to test your agent:
if __name__ == "__main__":
    import os

    os.makedirs("./generated_images", exist_ok=True)
    agent = ImageStoryAgent()

    print("Image Story Agent ready! Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit", "q"):
            break
        if user_input:
            result = agent.process_query(user_input)
            if result.get("result"):
                print(f"\nAgent: {result['result']}\n")
Quick summary:
  • Creates output directory and initializes agent (loads Qwen3-8B, registers 6 tools)
  • Loops: reads input → calls process_query(input) → prints result
  • process_query() handles LLM planning and tool execution automatically
When you call agent.process_query("create a robot exploring ruins"):

Step 1 - LLM Planning:
# Agent sends to Qwen3-8B
{
  "messages": [{"role": "user", "content": "create a robot exploring ruins"}],
  "tools": [
    {"name": "generate_image", "parameters": {"prompt": "str", ...}},
    {"name": "create_story_from_image", "parameters": {"image_path": "str", ...}},
    # ... other tools
  ]
}
Step 2 - LLM Response:
# LLM decides the plan
{
  "plan": [
    {"tool": "generate_image", "tool_args": {"prompt": "robot exploring ancient ruins"}},
    {"tool": "create_story_from_image", "tool_args": {"image_path": "..."}}
  ]
}
Step 3 - Tool Execution:
  • Calls generate_image("robot exploring ancient ruins") → Saves PNG to disk (~17s)
  • Calls create_story_from_image("path/to/image.png") → Gets story from VLM (~15s)
Step 4 - Final Synthesis:
  • Agent sends tool results back to LLM
  • LLM creates final answer combining both results
Step 5 - Return Value:
{
  "status": "success",
  "result": "I created an image of a robot exploring ancient ruins and wrote a story about it...",
  "conversation": [...],  # Full conversation history
  "steps_taken": 3,
  "duration": 42.3
}
Agent limits: Max 20 steps (configurable). Stops when LLM returns answer field instead of requesting more tools.
Run it:
python image_story_agent.py
Or skip the typing and run the complete example from the GAIA repository:
python examples/sd_agent_example.py
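If you prefer a one-shot script over the interactive loop, you can adapt the __main__ block to take the query from the command line. A sketch (the argument handling is illustrative):
import sys

if __name__ == "__main__":
    agent = ImageStoryAgent()
    query = " ".join(sys.argv[1:]) or "create a robot exploring ancient ruins"
    result = agent.process_query(query)
    print(result.get("result", result))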
Try:
You: create a robot exploring ancient ruins
You: generate a sunset over mountains and describe the colors
You: make a cyberpunk street scene and tell me a story
🎉 Congratulations! You’ve built a fully functional multi-modal agent in ~60 lines of code. The agent can generate images with Stable Diffusion, analyze them with vision models, and create stories—all running locally on AMD hardware.

Troubleshooting

Error: LemonadeClientError: Cannot connect to http://localhost:8000

Solution: Lemonade Server isn’t running. Check status:
lemonade-server status
If not running, gaia init --profile sd should have started it. Try restarting:
lemonade-server serve
Error: Model 'SDXL-Turbo' not found

Solution: Download the missing models:
lemonade-server pull SDXL-Turbo
lemonade-server pull Qwen3-VL-4B-Instruct-GGUF
Or re-run init:
gaia init --profile sd
Error: ModuleNotFoundError: No module named 'gaia.sd'

Solution: Upgrade GAIA:
uv pip install --upgrade amd-gaia
Verify installation:
python -c "from gaia.sd import SDToolsMixin; print('OK')"
Question: How do I generate the same image twice?

Solution: Use a fixed seed.

CLI:
gaia sd "robot kitten" --seed 42
Python:
result = agent.process_query("generate robot kitten with seed 42")
By default, images use random seeds for variety. Specify a seed for reproducibility.
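Under the hood, reproducibility just means the seed ends up in the generate_image tool arguments. Conceptually, the plan for the Python example above should look roughly like this (illustrative; the exact plan depends on the LLM):
# Illustrative plan for "generate robot kitten with seed 42"
{
  "plan": [
    {"tool": "generate_image", "tool_args": {"prompt": "robot kitten", "seed": 42}}
  ]
}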

Still Having Issues?

If you’re experiencing a problem not listed above, we’re here to help.

Report an Issue:
  • 🐛 Create a GitHub Issue - Best for bugs, feature requests, and technical issues
  • Include your GAIA version (gaia --version), OS, and error messages
Before reporting:
  1. Check the FAQ and Troubleshooting Guide
  2. Search existing issues to avoid duplicates
  3. Include reproduction steps, error messages, and environment details

What’s Next?