# CLI Command: gaia sd

Demonstrates multi-modal agents with SDToolsMixin + VLMToolsMixin.

SDK Reference: SDToolsMixin API • VLMToolsMixin API

## What It Is

The `gaia sd` command demonstrates GAIA's multi-modal agent architecture through a practical example: generating images with AI-enhanced prompts and automatic story creation.
Three AI models working together (all local on Ryzen AI):
- 🧠 LLM (Qwen3-8B) - Analyzes your input, plans workflow, adds quality keywords
- 🖼️ Stable Diffusion - Generates images from enhanced prompts
- 👁️ VLM (Qwen3-VL-4B) - Analyzes images and creates creative stories
## See It In Action

Complete multi-modal workflow: Prompt enhancement → Image generation → VLM analysis → Story creation (all in ~35 seconds).

## Quick Start
First time? See Setup Guide to install GAIA and Lemonade Server.
Initialize GAIA with `gaia init --profile sd` (first time only). This downloads:
- SDXL-Turbo (image generation - 6.5GB)
- Qwen3-8B-GGUF (agentic reasoning + prompt enhancement - 5GB)
- Qwen3-VL-4B-Instruct-GGUF (vision + stories - 3.2GB)
Generate an image with a story, e.g. `gaia sd "create a cute robot kitten and tell me a short story about it"`. Output files:

- Images: `.gaia/cache/sd/images/robot_kitten.png`
- Stories: `.gaia/cache/sd/images/robot_kitten_story.txt` (auto-generated)
### What models are needed?
The agent uses three models (auto-downloaded on first use):
- SDXL-Turbo (6.5GB) - Image generation
- Qwen3-8B-GGUF (5.0GB) - Agentic reasoning and prompt enhancement
- Qwen3-VL-4B-Instruct-GGUF (3.2GB) - Vision analysis and stories
## How It Works

### Multi-Modal Agent Architecture

Execution flow:

1. The agent plans which tools to call based on the request
2. `generate_image()` → LLM enhances the prompt → SDXL-Turbo generates (~17s)
3. `analyze_image()` → Qwen3-VL-4B describes the image
4. `create_story_from_image()` → Qwen3-VL-4B writes a creative story (~17s)
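The flow above can be sketched in plain Python. The function names mirror the documented tools; the bodies are stubs standing in for the real local LLM, Stable Diffusion, and VLM inference:

```python
# Minimal sketch of the gaia sd workflow; each stub stands in for a model call.

def enhance_prompt(user_prompt: str) -> str:
    """LLM step: add model-specific quality keywords (stubbed)."""
    return f"{user_prompt}, detailed lighting, cinematic"

def generate_image(prompt: str) -> str:
    """SD step: SDXL-Turbo renders the image and returns its path (stubbed)."""
    return ".gaia/cache/sd/images/robot_kitten.png"

def analyze_image(image_path: str) -> str:
    """VLM step: Qwen3-VL-4B describes the image (stubbed)."""
    return f"A cute robot kitten (from {image_path})"

def create_story_from_image(image_path: str) -> str:
    """VLM step: Qwen3-VL-4B writes a creative story (stubbed)."""
    return "Once upon a time, a robot kitten booted up for the first time..."

def run_workflow(user_prompt: str) -> dict:
    # The agent plans these steps, then executes them in order.
    enhanced = enhance_prompt(user_prompt)
    image = generate_image(enhanced)
    description = analyze_image(image)
    story = create_story_from_image(image)
    return {"image": image, "description": description, "story": story}
```

In the real agent the ordering is not hard-coded like this; the planner decides which of these steps to run based on your request.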
## Example Output

A typical run, `gaia sd "create a cute robot kitten and tell me a short story about it"`, enhances the prompt, generates the image, then analyzes it and writes the story.

## Different Use Cases

The agent adapts to what you ask for.

## Models
| Model | Speed | Quality | What Agent Adds |
|---|---|---|---|
| SDXL-Turbo (default) | ~17s | ⭐⭐⭐ | Artistic style, detailed lighting, Cinematic/Photographic keywords |
| SD-Turbo | ~13s | ⭐⭐ | Concise enhancement, key elements only |
| SDXL-Base-1.0 | ~9min | ⭐⭐⭐⭐⭐ | Camera settings (f/2.8, ISO 500), photorealistic focus |
| SD-1.5 | ~88s | ⭐⭐⭐ | Balanced traditional keywords |
Performance benchmarks measured on AMD Ryzen AI MAX+ 395 with Radeon 8060S (16 cores, 32 threads @ 3.0 GHz). Your performance may vary based on hardware configuration.
## Under the Hood: Architecture

### Multi-Modal Agent Design

The SD Agent combines three mixins, which together expose these tools:

- `generate_image(prompt, model, size, steps, cfg_scale)` - Generate with SD
- `analyze_image(image_path, focus)` - VLM description
- `create_story_from_image(image_path, story_style)` - VLM narrative
- `create_story_from_last_image()` - SD-specific wrapper (finds the last image)
- `list_sd_models()` - Show available models
- `get_generation_history()` - See generated images
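A rough sketch of the mixin composition, assuming simplified class shapes (see the SDToolsMixin / VLMToolsMixin API references for the real signatures):

```python
# Illustrative mixin composition; method bodies are elided stubs.

class SDToolsMixin:
    def generate_image(self, prompt, model=None, size=None, steps=None, cfg_scale=None):
        ...
    def list_sd_models(self):
        ...

class VLMToolsMixin:
    def analyze_image(self, image_path, focus=None):
        ...
    def create_story_from_image(self, image_path, story_style=None):
        ...

class SDAgent(SDToolsMixin, VLMToolsMixin):
    """Inherits image-generation and vision tools; adds SD-specific wrappers."""
    def __init__(self):
        self.sd_generations = []  # generation history for session context

    def create_story_from_last_image(self):
        # Convenience wrapper: find the most recent image, then delegate.
        if not self.sd_generations:
            return None
        return self.create_story_from_image(self.sd_generations[-1])
```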
### Tool Composition Example

`create_story_from_last_image()` is an SD-specific convenience tool registered in the agent itself (not in the mixin).

Why `atomic=True`? This tool wraps multiple VLM calls internally (analyze + story). Marking it atomic tells the agent to execute it as a single step without trying to break it down further, so the agent can call it in one step instead of planning multiple sub-steps.

This demonstrates how agents can add custom tools that compose mixin functionality.
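A sketch of what such a registration might look like. The `@tool(atomic=...)` decorator here is hypothetical shorthand for GAIA's actual registration mechanism, not its real API:

```python
# Hypothetical tool registration with an atomic flag.
TOOL_REGISTRY = {}

def tool(atomic=False):
    def register(fn):
        TOOL_REGISTRY[fn.__name__] = {"fn": fn, "atomic": atomic}
        return fn
    return register

@tool(atomic=True)
def create_story_from_last_image(agent):
    """Composes two VLM calls (analyze + story) into one atomic planner step."""
    image = agent.sd_generations[-1]
    agent.analyze_image(image)                   # VLM call 1: description
    return agent.create_story_from_image(image)  # VLM call 2: story
```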
### Agent Orchestration & Planning

The agent uses a plan-execute-reflect cycle powered by the LLM:

1. Planning Phase - The LLM analyzes the user request and creates a plan
   - Examines available tools in the registry
   - Decides which tools to call and in what order
   - Can create multi-step plans: `[generate_image, analyze_image, create_story_from_image]`
2. Execution Phase - The agent executes tools sequentially
   - Calls each tool with generated parameters
   - Captures results and errors
   - Maintains state in the `self.sd_generations` list
3. Reflection Phase - The LLM examines results
   - Checks whether the goal was achieved
   - Decides whether to continue or provide a final answer
   - Can replan if errors occur (max 10 steps by default)
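The cycle can be sketched schematically. In the real agent, `plan()` and `reflect()` are LLM calls against the tool registry; here they are injected callables:

```python
# Schematic plan-execute-reflect loop (simplified; not GAIA's actual internals).

def run_agent(request, tools, plan, reflect, max_steps=10):
    history = []
    for _ in range(max_steps):
        step = plan(request, history)       # Planning: pick next tool + args
        if step is None:                    # Planner says nothing left to do
            break
        name, args = step
        try:
            result = tools[name](**args)    # Execution: call the tool
        except Exception as exc:
            result = f"error: {exc}"        # Captured so the LLM can replan
        history.append((name, result))
        if reflect(request, history):       # Reflection: goal achieved?
            break
    return history
```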
### Tool Discovery & Registration

Tools are discovered through a global registry pattern.
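A minimal sketch of such a global registry (names here are illustrative, not GAIA's actual internals):

```python
# Global tool registry: tools self-register at import time, and the
# planner discovers them by name and docstring.
_GLOBAL_TOOLS = {}

def register_tool(fn):
    """Decorator: add a callable to the global registry under its name."""
    _GLOBAL_TOOLS[fn.__name__] = fn
    return fn

def discover_tools():
    """What the planner sees: tool names mapped to their docstrings."""
    return {name: fn.__doc__ for name, fn in _GLOBAL_TOOLS.items()}

@register_tool
def list_sd_models():
    """Show available SD models."""
    return ["SDXL-Turbo", "SD-Turbo", "SDXL-Base-1.0", "SD-1.5"]
```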
### State Management

The agent tracks generation history for session context:

- `create_story_from_last_image()` uses it to find the most recent image
- `get_generation_history()` shows what was created this session
- Session continuity ("create another one" references the previous image)
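A minimal sketch of this bookkeeping. The list name `sd_generations` comes from the docs; the record shape is an assumption:

```python
# Session history backing the last-image lookup and history listing.

class GenerationHistory:
    def __init__(self):
        self.sd_generations = []  # newest entry last

    def record(self, prompt, image_path):
        self.sd_generations.append({"prompt": prompt, "image": image_path})

    def last_image(self):
        # Used by create_story_from_last_image() to find the newest image.
        return self.sd_generations[-1]["image"] if self.sd_generations else None

    def summary(self):
        # Backs get_generation_history(): what was created this session.
        return [g["prompt"] for g in self.sd_generations]
```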
### Multi-Model Coordination

Three models collaborate with different roles:

| Model | Role | Context Size | When Used |
|---|---|---|---|
| Qwen3-8B (LLM) | Orchestration & Enhancement | 16K | Every request - plans workflow, enhances prompts |
| SDXL-Turbo (SD) | Image Generation | N/A | When generate_image tool called |
| Qwen3-VL-4B (VLM) | Vision & Stories | 16K | When VLM tools called (analyze/story) |
### System Prompt Intelligence

The agent's system prompt encodes research-backed prompt engineering strategies tailored to each SD model; the LLM applies these patterns automatically.

Enhancement strategy by model:

- SDXL-Turbo (fast): `size="512x512", steps=4, cfg_scale=1.0`
- SDXL-Base-1.0 (photorealistic): `size="1024x1024", steps=20, cfg_scale=7.5`

Example enhancement flow:

1. Sees user input: "robot kitten"
2. Matches the enhancement pattern for the current model
3. Generates an enhanced prompt with quality keywords
4. Calls `generate_image(enhanced_prompt, size="512x512", steps=4, cfg_scale=1.0)`
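As a sketch, the per-model defaults and the enhancement step might look like this. The parameter values come from the docs above; the keyword strings are illustrative, not the agent's actual wording:

```python
# Per-model generation defaults (values from the docs) plus an
# illustrative prompt-enhancement step.

MODEL_DEFAULTS = {
    "SDXL-Turbo":    {"size": "512x512",   "steps": 4,  "cfg_scale": 1.0},
    "SDXL-Base-1.0": {"size": "1024x1024", "steps": 20, "cfg_scale": 7.5},
}

QUALITY_KEYWORDS = {
    "SDXL-Turbo":    "artistic style, detailed lighting, cinematic",
    "SDXL-Base-1.0": "photorealistic, f/2.8, ISO 500",
}

def build_generate_call(user_prompt: str, model: str) -> dict:
    """Match the enhancement pattern for the current model, then assemble
    the generate_image() arguments the agent would emit."""
    enhanced = f"{user_prompt}, {QUALITY_KEYWORDS[model]}"
    return {"prompt": enhanced, **MODEL_DEFAULTS[model]}
```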
## Options

| Option | Default | Description |
|---|---|---|
| `--sd-model` | SDXL-Turbo | SD model for generation |
| `--size` | auto | Image size (auto per model) |
| `--steps` | auto | Inference steps (auto per model) |
| `--cfg-scale` | auto | CFG scale (auto per model) |
| `--seed` | random | Seed for reproducible results |
| `--no-open` | - | Skip viewer prompt |
| `-i, --interactive` | - | Chat mode |
| `--max-steps` | 10 | Limit agent planning steps |
## Troubleshooting

### Missing VLM stories

If you only see image generation without a story, the agent chose not to create one based on your request. To get a story, ask for one explicitly in your prompt (e.g. "...and tell me a short story about it"), or request it separately afterward.

### Slow story creation

VLM analysis + story creation takes ~17 seconds total (two VLM calls). To skip stories and run faster, simply don't ask for one - the agent decides based on your phrasing.

### Context size warnings

Qwen3-8B-GGUF needs 16K context for multi-step planning. Start Lemonade Server with a 16K context, or use `gaia init --profile sd` to configure it automatically.

## Under the Hood: Composable System Prompts
The SD Agent uses GAIA's composable system prompt pattern introduced in the playbook.

### How It Works
SDToolsMixin contributes its prompt engineering guidelines to the agent's system prompt automatically.
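One way to sketch the composable pattern: each mixin contributes a prompt fragment and the agent concatenates them. The method name `get_prompt_fragment` and the fragment texts are assumptions for illustration, not GAIA's real API:

```python
# Composable system prompts: each mixin adds its own guidance fragment.

class SDToolsMixin:
    def get_prompt_fragment(self):
        return "When generating images, enhance prompts with quality keywords."

class VLMToolsMixin:
    def get_prompt_fragment(self):
        return "When analyzing images, describe them before writing stories."

class SDAgent(SDToolsMixin, VLMToolsMixin):
    BASE_PROMPT = "You are a multi-modal image agent."

    def system_prompt(self):
        fragments = [self.BASE_PROMPT]
        # Walk the MRO so every mixin's fragment is appended exactly once.
        for cls in type(self).__mro__:
            fragment = cls.__dict__.get("get_prompt_fragment")
            if fragment:
                fragments.append(fragment(self))
        return "\n\n".join(fragments)
```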
### Debugging Prompts

Inspect the agent's assembled system prompt to see exactly what guidance the LLM receives.

### Random Seeds for Variety

By default, each generation uses a random seed for unique results; pass `--seed` to reproduce an image.
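The seed behavior can be summarized as: random per generation unless the user pins one (mirroring the `--seed` option; the helper name is illustrative):

```python
import random

def resolve_seed(user_seed=None):
    """Random seed for variety by default; a fixed seed reproduces an image."""
    if user_seed is not None:
        return user_seed
    return random.randint(0, 2**32 - 1)
```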
### Hardware Acceleration

Current implementation:

| Component | Format | Hardware | Performance |
|---|---|---|---|
| Qwen3-8B-GGUF | GGUF (quantized) | iGPU (Radeon) | Fast reasoning |
| SDXL-Turbo | Safetensors | CPU | ~17s per image |
| Qwen3-VL-4B-GGUF | GGUF (quantized) | iGPU (Radeon) | Fast vision |