# CLI Command: gaia sd

Demonstrates multi-modal agents with SDToolsMixin + VLMToolsMixin.

SDK Reference: SDToolsMixin API • VLMToolsMixin API

## What It Is

The `gaia sd` command demonstrates GAIA's multi-modal agent architecture through a practical example: generating images with AI-enhanced prompts and automatic story creation.
Three AI models working together (all local on Ryzen AI):
- 🧠 LLM (Qwen3-8B) - Analyzes your input, plans workflow, adds quality keywords
- 🖼️ Stable Diffusion - Generates images from enhanced prompts
- 👁️ VLM (Qwen3-VL-4B) - Analyzes images and creates creative stories
## See It In Action

Complete multi-modal workflow: Prompt enhancement → Image generation → VLM analysis → Story creation (all in ~35 seconds).

## Quick Start
First time? See Setup Guide to install GAIA and Lemonade Server.
Initialize GAIA with `gaia init --profile sd` (first time only). This downloads:
- SDXL-Turbo (image generation - 6.5GB)
- Qwen3-8B-GGUF (agentic reasoning + prompt enhancement - 5GB)
- Qwen3-VL-4B-Instruct-GGUF (vision + stories - 3.2GB)
Generate an image with a story, e.g. `gaia sd "create a cute robot kitten and tell me a short story about it"`. Output files:

- Images: `.gaia/cache/sd/images/robot_kitten.png`
- Stories: `.gaia/cache/sd/images/robot_kitten_story.txt` (auto-generated)
### What models are needed?
The agent uses three models (auto-downloaded on first use):
- SDXL-Turbo (6.5GB) - Image generation
- Qwen3-8B-GGUF (5.0GB) - Agentic reasoning and prompt enhancement
- Qwen3-VL-4B-Instruct-GGUF (3.2GB) - Vision analysis and stories
## How It Works

### Multi-Modal Agent Architecture

Execution flow:

1. The agent plans which tools to call based on the request
2. `generate_image()` → LLM enhances the prompt → SDXL-Turbo generates (~17s)
3. `analyze_image()` → Qwen3-VL-4B describes the image
4. `create_story_from_image()` → Qwen3-VL-4B writes a creative story (~17s)
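The flow above can be sketched in plain Python. The function names mirror the documented tools; the bodies are stubs standing in for the real local LLM, Stable Diffusion, and VLM inference:

```python
# Minimal sketch of the gaia sd workflow; each stub stands in for a model call.

def enhance_prompt(user_prompt: str) -> str:
    """LLM step: add model-specific quality keywords (stubbed)."""
    return f"{user_prompt}, detailed lighting, cinematic"

def generate_image(prompt: str) -> str:
    """SD step: SDXL-Turbo renders the image and returns its path (stubbed)."""
    return ".gaia/cache/sd/images/robot_kitten.png"

def analyze_image(image_path: str) -> str:
    """VLM step: Qwen3-VL-4B describes the image (stubbed)."""
    return f"A cute robot kitten (from {image_path})"

def create_story_from_image(image_path: str) -> str:
    """VLM step: Qwen3-VL-4B writes a creative story (stubbed)."""
    return "Once upon a time, a robot kitten booted up for the first time..."

def run_workflow(user_prompt: str) -> dict:
    # The agent plans these steps, then executes them in order.
    enhanced = enhance_prompt(user_prompt)
    image = generate_image(enhanced)
    description = analyze_image(image)
    story = create_story_from_image(image)
    return {"image": image, "description": description, "story": story}
```

In the real agent the ordering is not hard-coded like this; the planner decides which of these steps to run based on your request.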
## Example Output

A typical run, `gaia sd "create a cute robot kitten and tell me a short story about it"`, enhances the prompt, generates the image, then analyzes it and writes the story.

## Different Use Cases

The agent adapts to what you ask for.

## Models
| Model | Speed | Quality | What Agent Adds |
|---|---|---|---|
| SDXL-Turbo (default) | ~17s | ⭐⭐⭐ | Artistic style, detailed lighting, Cinematic/Photographic keywords |
| SD-Turbo | ~13s | ⭐⭐ | Concise enhancement, key elements only |
| SDXL-Base-1.0 | ~9min | ⭐⭐⭐⭐⭐ | Camera settings (f/2.8, ISO 500), photorealistic focus |
| SD-1.5 | ~88s | ⭐⭐⭐ | Balanced traditional keywords |
Performance benchmarks measured on AMD Ryzen AI MAX+ 395 with Radeon 8060S (16 cores, 32 threads @ 3.0 GHz). Your performance may vary based on hardware configuration.
## Under the Hood: Architecture

### Multi-Modal Agent Design

The SD Agent combines three mixins, which together expose these tools:

- `generate_image(prompt, model, size, steps, cfg_scale)` - Generate with SD
- `analyze_image(image_path, focus)` - VLM description
- `create_story_from_image(image_path, story_style)` - VLM narrative
- `create_story_from_last_image()` - SD-specific wrapper (finds the last image)
- `list_sd_models()` - Show available models
- `get_generation_history()` - See generated images
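A rough sketch of the mixin composition, assuming simplified class shapes (see the SDToolsMixin / VLMToolsMixin API references for the real signatures):

```python
# Illustrative mixin composition; method bodies are elided stubs.

class SDToolsMixin:
    def generate_image(self, prompt, model=None, size=None, steps=None, cfg_scale=None):
        ...
    def list_sd_models(self):
        ...

class VLMToolsMixin:
    def analyze_image(self, image_path, focus=None):
        ...
    def create_story_from_image(self, image_path, story_style=None):
        ...

class SDAgent(SDToolsMixin, VLMToolsMixin):
    """Inherits image-generation and vision tools; adds SD-specific wrappers."""
    def __init__(self):
        self.sd_generations = []  # generation history for session context

    def create_story_from_last_image(self):
        # Convenience wrapper: find the most recent image, then delegate.
        if not self.sd_generations:
            return None
        return self.create_story_from_image(self.sd_generations[-1])
```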
### Tool Composition Example

`create_story_from_last_image()` is an SD-specific convenience tool registered in the agent itself (not in the mixin).

Why `atomic=True`? This tool wraps multiple VLM calls internally (analyze + story). Marking it atomic tells the agent to execute it as a single step without trying to break it down further, so the agent can call it in one step instead of planning multiple sub-steps.

This demonstrates how agents can add custom tools that compose mixin functionality.
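A sketch of what such a registration might look like. The `@tool(atomic=...)` decorator here is hypothetical shorthand for GAIA's actual registration mechanism, not its real API:

```python
# Hypothetical tool registration with an atomic flag.
TOOL_REGISTRY = {}

def tool(atomic=False):
    def register(fn):
        TOOL_REGISTRY[fn.__name__] = {"fn": fn, "atomic": atomic}
        return fn
    return register

@tool(atomic=True)
def create_story_from_last_image(agent):
    """Composes two VLM calls (analyze + story) into one atomic planner step."""
    image = agent.sd_generations[-1]
    agent.analyze_image(image)                   # VLM call 1: description
    return agent.create_story_from_image(image)  # VLM call 2: story
```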
### Agent Orchestration & Planning

The agent uses a plan-execute-reflect cycle powered by the LLM:

1. Planning Phase - The LLM analyzes the user request and creates a plan
   - Examines available tools in the registry
   - Decides which tools to call and in what order
   - Can create multi-step plans: `[generate_image, analyze_image, create_story_from_image]`
2. Execution Phase - The agent executes tools sequentially
   - Calls each tool with generated parameters
   - Captures results and errors
   - Maintains state in the `self.sd_generations` list
3. Reflection Phase - The LLM examines results
   - Checks whether the goal was achieved
   - Decides whether to continue or provide a final answer
   - Can replan if errors occur (max 10 steps by default)
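The cycle can be sketched schematically. In the real agent, `plan()` and `reflect()` are LLM calls against the tool registry; here they are injected callables:

```python
# Schematic plan-execute-reflect loop (simplified; not GAIA's actual internals).

def run_agent(request, tools, plan, reflect, max_steps=10):
    history = []
    for _ in range(max_steps):
        step = plan(request, history)       # Planning: pick next tool + args
        if step is None:                    # Planner says nothing left to do
            break
        name, args = step
        try:
            result = tools[name](**args)    # Execution: call the tool
        except Exception as exc:
            result = f"error: {exc}"        # Captured so the LLM can replan
        history.append((name, result))
        if reflect(request, history):       # Reflection: goal achieved?
            break
    return history
```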
### Tool Discovery & Registration

Tools are discovered through a global registry pattern.
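A minimal sketch of such a global registry (names here are illustrative, not GAIA's actual internals):

```python
# Global tool registry: tools self-register at import time, and the
# planner discovers them by name and docstring.
_GLOBAL_TOOLS = {}

def register_tool(fn):
    """Decorator: add a callable to the global registry under its name."""
    _GLOBAL_TOOLS[fn.__name__] = fn
    return fn

def discover_tools():
    """What the planner sees: tool names mapped to their docstrings."""
    return {name: fn.__doc__ for name, fn in _GLOBAL_TOOLS.items()}

@register_tool
def list_sd_models():
    """Show available SD models."""
    return ["SDXL-Turbo", "SD-Turbo", "SDXL-Base-1.0", "SD-1.5"]
```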
### State Management

The agent tracks generation history for session context:

- `create_story_from_last_image()` uses it to find the most recent image
- `get_generation_history()` shows what was created this session
- Session continuity ("create another one" references the previous image)
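A minimal sketch of this bookkeeping. The list name `sd_generations` comes from the docs; the record shape is an assumption:

```python
# Session history backing the last-image lookup and history listing.

class GenerationHistory:
    def __init__(self):
        self.sd_generations = []  # newest entry last

    def record(self, prompt, image_path):
        self.sd_generations.append({"prompt": prompt, "image": image_path})

    def last_image(self):
        # Used by create_story_from_last_image() to find the newest image.
        return self.sd_generations[-1]["image"] if self.sd_generations else None

    def summary(self):
        # Backs get_generation_history(): what was created this session.
        return [g["prompt"] for g in self.sd_generations]
```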
### Multi-Model Coordination

Three models collaborate with different roles:

| Model | Role | Context Size | When Used |
|---|---|---|---|
| Qwen3-8B (LLM) | Orchestration & Enhancement | 16K | Every request - plans workflow, enhances prompts |
| SDXL-Turbo (SD) | Image Generation | N/A | When generate_image tool called |
| Qwen3-VL-4B (VLM) | Vision & Stories | 16K | When VLM tools called (analyze/story) |
### System Prompt Intelligence

The agent's system prompt encodes research-backed prompt engineering strategies tailored to each SD model; the LLM applies these patterns automatically.

Enhancement strategy by model:

- SDXL-Turbo (fast): `size="512x512", steps=4, cfg_scale=1.0`
- SDXL-Base-1.0 (photorealistic): `size="1024x1024", steps=20, cfg_scale=7.5`

Example enhancement flow:

1. Sees user input: "robot kitten"
2. Matches the enhancement pattern for the current model
3. Generates an enhanced prompt with quality keywords
4. Calls `generate_image(enhanced_prompt, size="512x512", steps=4, cfg_scale=1.0)`
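As a sketch, the per-model defaults and the enhancement step might look like this. The parameter values come from the docs above; the keyword strings are illustrative, not the agent's actual wording:

```python
# Per-model generation defaults (values from the docs) plus an
# illustrative prompt-enhancement step.

MODEL_DEFAULTS = {
    "SDXL-Turbo":    {"size": "512x512",   "steps": 4,  "cfg_scale": 1.0},
    "SDXL-Base-1.0": {"size": "1024x1024", "steps": 20, "cfg_scale": 7.5},
}

QUALITY_KEYWORDS = {
    "SDXL-Turbo":    "artistic style, detailed lighting, cinematic",
    "SDXL-Base-1.0": "photorealistic, f/2.8, ISO 500",
}

def build_generate_call(user_prompt: str, model: str) -> dict:
    """Match the enhancement pattern for the current model, then assemble
    the generate_image() arguments the agent would emit."""
    enhanced = f"{user_prompt}, {QUALITY_KEYWORDS[model]}"
    return {"prompt": enhanced, **MODEL_DEFAULTS[model]}
```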
## Options

| Option | Default | Description |
|---|---|---|
| `--sd-model` | SDXL-Turbo | SD model for generation |
| `--size` | auto | Image size (auto per model) |
| `--steps` | auto | Inference steps (auto per model) |
| `--cfg-scale` | auto | CFG scale (auto per model) |
| `--seed` | random | Seed for reproducible results |
| `--no-open` | - | Skip viewer prompt |
| `-i, --interactive` | - | Chat mode |
| `--max-steps` | 10 | Limit agent planning steps |
## Troubleshooting

### Missing VLM stories

If you only see image generation without a story, the agent chose not to create one based on your request. To get a story, ask for one explicitly in your prompt (e.g. "...and tell me a short story about it"), or request it separately afterward.

### Slow story creation

VLM analysis + story creation takes ~17 seconds total (two VLM calls). To skip stories and run faster, simply don't ask for one - the agent decides based on your phrasing.

### Context size warnings

Qwen3-8B-GGUF needs 16K context for multi-step planning. Start Lemonade Server with a 16K context, or use `gaia init --profile sd` to configure it automatically.

## Under the Hood: Composable System Prompts
The SD Agent uses GAIA's composable system prompt pattern introduced in the playbook.

### How It Works
SDToolsMixin contributes its prompt engineering guidelines to the agent's system prompt automatically.
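One way to sketch the composable pattern: each mixin contributes a prompt fragment and the agent concatenates them. The method name `get_prompt_fragment` and the fragment texts are assumptions for illustration, not GAIA's real API:

```python
# Composable system prompts: each mixin adds its own guidance fragment.

class SDToolsMixin:
    def get_prompt_fragment(self):
        return "When generating images, enhance prompts with quality keywords."

class VLMToolsMixin:
    def get_prompt_fragment(self):
        return "When analyzing images, describe them before writing stories."

class SDAgent(SDToolsMixin, VLMToolsMixin):
    BASE_PROMPT = "You are a multi-modal image agent."

    def system_prompt(self):
        fragments = [self.BASE_PROMPT]
        # Walk the MRO so every mixin's fragment is appended exactly once.
        for cls in type(self).__mro__:
            fragment = cls.__dict__.get("get_prompt_fragment")
            if fragment:
                fragments.append(fragment(self))
        return "\n\n".join(fragments)
```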
### Debugging Prompts

Inspect the agent's assembled system prompt to see exactly what guidance the LLM receives.

### Random Seeds for Variety

By default, each generation uses a random seed for unique results; pass `--seed` to reproduce an image.
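The seed behavior can be summarized as: random per generation unless the user pins one (mirroring the `--seed` option; the helper name is illustrative):

```python
import random

def resolve_seed(user_seed=None):
    """Random seed for variety by default; a fixed seed reproduces an image."""
    if user_seed is not None:
        return user_seed
    return random.randint(0, 2**32 - 1)
```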
### Hardware Acceleration

Current implementation:

| Component | Format | Hardware | Performance |
|---|---|---|---|
| Qwen3-8B-GGUF | GGUF (quantized) | iGPU (Radeon) | Fast reasoning |
| SDXL-Turbo | Safetensors | CPU | ~17s per image |
| Qwen3-VL-4B-GGUF | GGUF (quantized) | iGPU (Radeon) | Fast vision |