Source Code:
src/gaia/agents/sd/agent.py

Introduction
This playbook demonstrates building a multi-modal agent by composing image generation and vision capabilities into a single agent class. You’ll use Python’s multiple inheritance to combine SDToolsMixin (Stable Diffusion image generation) and VLMToolsMixin (Vision Language Model analysis) with GAIA’s base Agent class.
The architecture is straightforward: your agent runs an LLM as its reasoning engine, which has access to tools from both mixins. When a user asks to generate an image and create a story, the LLM decides the sequence—call generate_image(), then call create_story_from_image() to analyze the result. Behind the scenes, Lemonade Server hosts the model endpoints locally, providing LLM inference for the agent’s reasoning, VLM inference for image analysis, and Stable Diffusion for image generation. Everything runs on your machine with no external API calls.
The key pattern you’ll learn is tool composition through mixins. Each mixin registers its tools via the @tool decorator during initialization. You’ll see how domain-specific tools can wrap generic capabilities, creating specialized workflows while keeping the underlying components reusable.
By the end, you’ll have a working multi-modal agent in ~60 lines of code and understand how to compose tool capabilities using GAIA’s mixin architecture.
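The mixin pattern described above can be sketched with plain Python before touching GAIA itself. The classes below (BaseAgent, register, and the stub tool bodies) are illustrative stand-ins, not GAIA's actual API; the point is how multiple inheritance lets each mixin register its tools into one shared registry during initialization:

```python
class BaseAgent:
    """Stand-in for GAIA's Agent base class (illustration only)."""

    def __init__(self):
        self.tools = {}  # name -> callable; the LLM planner chooses from these

    def register(self, fn):
        self.tools[fn.__name__] = fn


class SDToolsMixin:
    def init_sd(self):
        # Each mixin registers its own tools during initialization.
        self.register(self.generate_image)

    def generate_image(self, prompt):
        return {"status": "success", "image_path": "output/demo.png"}


class VLMToolsMixin:
    def init_vlm(self):
        self.register(self.analyze_image)

    def analyze_image(self, image_path):
        return {"status": "success", "description": "a placeholder description"}


class ImageStoryAgent(BaseAgent, SDToolsMixin, VLMToolsMixin):
    def __init__(self):
        super().__init__()
        self.init_sd()   # tools from both mixins land in one registry
        self.init_vlm()


agent = ImageStoryAgent()
print(sorted(agent.tools))  # ['analyze_image', 'generate_image']
```

Because both mixins write into the same `self.tools` dict, the reasoning LLM sees one flat tool list regardless of which mixin contributed each tool.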
Quick Test
First time? See Setup Guide to install prerequisites (uv, Python, AMD drivers).
What You’ll Build
An ImageStoryAgent that generates images and creates stories about them. ~60 lines of code total.
When you run agent.process_query("create a robot exploring ancient ruins"), here’s what happens:
- LLM decides the plan: Qwen3-8B analyzes your query and returns
  {plan: [{tool: "generate_image", tool_args: {prompt: "..."}}, {tool: "create_story_from_image", tool_args: {...}}]}
- Executes generate_image(): Sends prompt to SDXL-Turbo (4-step diffusion model), receives PNG bytes, saves to disk (~17s)
- Executes create_story_from_image(): Calls your custom tool → sends image bytes to Qwen3-VL-4B vision model → receives 2-3 paragraph narrative (~15s)
- Returns result:
{"status": "success", "result": "Here's your image and story...", "conversation": [...], "steps_taken": 3}
You’ll build the create_story_from_image tool yourself using VLM capabilities.
Three models, one agent:
- Qwen3-8B-GGUF (5GB): Reasoning engine - decides which tools to call and when
- SDXL-Turbo (6.5GB): Diffusion model - generates images from text prompts
- Qwen3-VL-4B (3.2GB): Vision-language model - analyzes images and generates text
Prerequisites
Complete setup
If you haven’t already, follow the Setup guide to install prerequisites (uv, Lemonade Server, etc.).
Create virtual environment
uv will automatically download Python 3.12 if not already installed.
- Linux / macOS
- Windows
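The activation commands for this step are, as a sketch (standard `uv` usage; check the uv documentation if your version differs):

```shell
# Creates .venv, downloading Python 3.12 automatically if needed
uv venv --python 3.12

# Linux / macOS
source .venv/bin/activate

# Windows (PowerShell): .venv\Scripts\Activate.ps1
```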
Initialize SD profile
- Installs Lemonade Server (if not already installed)
- Starts Lemonade Server automatically
- Configures Lemonade Server with 16K context (required for SD agent multi-step planning)
- Downloads three models (~15GB):
- SDXL-Turbo (6.5GB) - Diffusion model in safetensors format
- Qwen3-8B-GGUF (5GB) - Agent LLM in quantized GGUF format (loaded with 16K context)
- Qwen3-VL-4B-Instruct-GGUF (3.2GB) - Vision LLM in quantized GGUF format
- Verifies models and context size
- GGUF models run on iGPU (Radeon integrated graphics) via Vulkan
- SDXL-Turbo currently runs on CPU (GPU acceleration planned)
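All of the steps above are driven by the single profile-initialization command this page references:

```shell
gaia init --profile sd
```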
Video: Running gaia init --profile sd and watching models download

Building the Agent
Create a new file image_story_agent.py and add this code:
Step 1: Initialize the Agent
- Agent, SDToolsMixin, VLMToolsMixin: Inherits three capabilities - reasoning, image generation, vision analysis
- super().__init__(model_id="Qwen3-8B-GGUF"): Sets up the reasoning LLM (downloaded by gaia init --profile sd)
- init_sd(): Registers image generation tools (generate_image, list_sd_models, get_generation_history)
- init_vlm(): Registers vision tools (analyze_image, answer_question_about_image)
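A Step 1 skeleton might look like the following. Caution: the import paths are assumptions based on the class names this playbook uses (only the `gaia.sd` module name appears verbatim, in the Troubleshooting section below); check your installed GAIA version if they differ.

```python
# image_story_agent.py -- Step 1 skeleton (import paths are assumptions)
from gaia.agents import Agent        # assumed path
from gaia.sd import SDToolsMixin     # module name used in Troubleshooting below
from gaia.vlm import VLMToolsMixin   # assumed path


class ImageStoryAgent(Agent, SDToolsMixin, VLMToolsMixin):
    """Generates images with SDXL-Turbo and narrates them with Qwen3-VL-4B."""

    def __init__(self, output_dir: str = "output"):
        # Reasoning LLM, downloaded by `gaia init --profile sd`
        super().__init__(model_id="Qwen3-8B-GGUF")
        self.output_dir = output_dir
        self.init_sd()   # generate_image, list_sd_models, get_generation_history
        self.init_vlm()  # analyze_image, answer_question_about_image
```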
View all 5 mixin tools and their signatures
From SDToolsMixin:
- generate_image: Sends prompt to SDXL-Turbo diffusion model, saves a PNG to output_dir, and returns
  {"status": "success", "image_path": "...", "generation_time_s": 17.2, "seed": 123456}
- list_sd_models: Returns available models with speed/quality characteristics
- get_generation_history: Returns recent generations from this session

From VLMToolsMixin:
- analyze_image: Sends image bytes to the Qwen3-VL-4B vision model and returns
  {"status": "success", "description": "...", "focus_analysis": "..."}
- answer_question_about_image: Sends image + question to the VLM and returns
  {"status": "success", "answer": "...", "confidence": "high"}
Available SD models
You can change default_model in init_sd():

| Model | Speed | Quality |
|---|---|---|
| SDXL-Turbo (default) | ~17s | Balanced |
| SD-Turbo | ~13s | Fast prototyping |
| SDXL-Base-1.0 | ~9min | Photorealistic |
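As a sketch, switching models would be a one-line change (the `default_model` keyword name is taken from the text above; verify it against your GAIA version):

```python
# Inside ImageStoryAgent.__init__ -- pick a faster model for prototyping.
self.init_sd(default_model="SD-Turbo")
```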
Step 2: Add Custom Tool
Add _register_tools() to create create_story_from_image:
- @tool(atomic=True): Registers create_story_from_image as an atomic tool
- atomic=True: Tool is self-contained - executes completely without multi-step planning
- Most tools are atomic (generate image, analyze image, query database)
- Non-atomic: multi-step tasks like “research a topic”
- Reads image bytes from file path
- Builds a storytelling prompt based on the style parameter
- Calls self.vlm_client.extract_from_image() - sends image + prompt to Qwen3-VL-4B vision model
- Returns structured dict with story text and metadata
- Decorator auto-generates JSON schema from function signature and docstring
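Putting those steps together, a Step 2 sketch might look like this. Only `@tool(atomic=True)` and `self.vlm_client.extract_from_image()` are named in this playbook; the `@tool` import path, the exact `extract_from_image` signature, and the return-dict fields are assumptions to adjust against your GAIA version:

```python
from gaia.agents import tool  # assumed import path for the @tool decorator


class ImageStoryAgent(Agent, SDToolsMixin, VLMToolsMixin):
    # ... __init__ from Step 1 ...

    def _register_tools(self):
        @tool(atomic=True)  # self-contained: runs without multi-step planning
        def create_story_from_image(image_path: str, style: str = "whimsical") -> dict:
            """Create a 2-3 paragraph story about the image at image_path."""
            # Read the PNG that generate_image() saved to disk
            with open(image_path, "rb") as f:
                image_bytes = f.read()
            prompt = (
                f"Write a {style} 2-3 paragraph story about this image. "
                "Describe the scene, then imagine what happens next."
            )
            # Sends image + prompt to the Qwen3-VL-4B vision model
            story = self.vlm_client.extract_from_image(image_bytes, prompt)
            return {"status": "success", "story": story, "style": style}
```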
Step 3: Add System Prompt (Optional)
Add _get_system_prompt() to customize instructions:
- Documents your custom create_story_from_image tool (parameters, return value, purpose)
- Provides explicit workflow: generate image → create story using the image_path → include full text in answer
- Sets critical rules: use image_path from results, include complete story text, default to one image
- Gives example showing proper parameter passing between tool calls
- Mixins only describe their own tools (SD guidelines don’t know about VLM or custom tools)
- Your custom tool needs explicit documentation and usage instructions
- Clear workflow prevents errors like missing image_path or incomplete responses
- Examples help the LLM understand expected behavior
Tool schemas (parameters, types) are auto-added by the @tool decorator. Your system prompt provides usage guidelines and workflow that schemas can’t express.

Step 4: Add CLI and Run
Add the CLI loop to test your agent:
How the CLI works
Quick summary:
- Creates output directory and initializes agent (loads Qwen3-8B, registers 6 tools)
- Loops: reads input → calls process_query(input) → prints result
- process_query() handles LLM planning and tool execution automatically
See detailed step-by-step execution
When you call agent.process_query("create a robot exploring ruins"):

Step 1 - LLM Planning: Qwen3-8B analyzes the query and returns a plan of tool calls.

Step 2 - LLM Response: The agent parses the plan and schedules each requested tool.

Step 3 - Tool Execution:
- Calls generate_image("robot exploring ancient ruins") → Saves PNG to disk (~17s)
- Calls create_story_from_image("path/to/image.png") → Gets story from VLM (~15s)
- Agent sends tool results back to LLM
- LLM creates final answer combining both results

Agent limits: Max 20 steps (configurable). Stops when the LLM returns an answer field instead of requesting more tools.

Troubleshooting
Connection Error: Cannot connect to Lemonade Server
Error: LemonadeClientError: Cannot connect to http://localhost:8000

Solution: Lemonade Server isn’t running. Check its status; gaia init --profile sd should have started it. If it is not running, try restarting it.

Model not found error
Model not found error
Error: Model 'SDXL-Turbo' not found

Solution: Download the missing models, or re-run init: gaia init --profile sd

Import error: No module named 'gaia.sd'
Import error: No module named 'gaia.sd'
Error: ModuleNotFoundError: No module named 'gaia.sd'

Solution: Upgrade GAIA, then verify the installation.

How to get reproducible results
How to get reproducible results
Question: How do I generate the same image twice?

Solution: Use a fixed seed. By default, images use random seeds for variety; specify a seed for reproducibility.
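As a sketch in Python (the `seed` keyword is inferred from the "seed" field in generate_image()'s result payload shown earlier on this page; verify it against your GAIA version):

```python
# Same prompt + same model + same seed should yield the same image.
a = agent.generate_image("a robot exploring ancient ruins", seed=42)
b = agent.generate_image("a robot exploring ancient ruins", seed=42)
```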
Still Having Issues?
If you’re experiencing a problem not listed above, we’re here to help:

Report an Issue:
- 🐛 Create a GitHub Issue - Best for bugs, feature requests, and technical issues
- Include your GAIA version (gaia --version), OS, and error messages
- 📖 GAIA Documentation - Complete guides and API reference
- 💬 GitHub Discussions - Ask questions and share ideas
- 📧 Email: [email protected] - For direct support
- Check the FAQ and Troubleshooting Guide
- Search existing issues to avoid duplicates
- Include reproduction steps, error messages, and environment details