This is Part 3 of 3. Complete Part 1 and Part 2 first: Part 1 | Part 2
  • Time to complete: 15-20 minutes
  • What you’ll learn: Agent intelligence, advanced patterns, production deployment, and performance optimization

Understanding the Agent’s Intelligence

How the Agent “Thinks”

When you ask a question, the agent goes through a reasoning loop. Here’s a real example.

User: “Find the oil manual on my drive and tell me the vision statement”

Agent’s Internal Reasoning (this is what the LLM decides):
// Step 1: Analyze what's needed
{
  "thought": "User wants oil manual. Need to find it first.",
  "goal": "Locate oil manual on drive",
  "tool": "search_file",
  "tool_args": {"file_pattern": "oil manual"}
}

// Result: Found "Oil-Gas-Regulatory-Manual.pdf"

// Step 2: Index it
{
  "thought": "Found the file. Need to index before querying.",
  "goal": "Index the oil manual",
  "tool": "index_document",
  "tool_args": {"file_path": "C:/Users/user/Documents/Oil-Gas-Regulatory-Manual.pdf"}
}

// Result: Successfully indexed, 150 chunks created

// Step 3: Search for vision
{
  "thought": "Document indexed. Now search for vision statement.",
  "goal": "Find vision statement in oil manual",
  "tool": "query_specific_file",
  "tool_args": {
    "file_path": "C:/Users/user/Documents/Oil-Gas-Regulatory-Manual.pdf",
    "query": "vision statement"
  }
}

// Result: Retrieved 5 chunks, top score: 0.92
// Chunk: "The vision is to be recognized as..."

// Step 4: Formulate answer
{
  "thought": "I have the vision statement from the document",
  "goal": "Provide answer with citation",
  "answer": "According to the Oil & Gas Regulatory Manual, the vision is to be recognized as the leading regulatory authority in energy sector governance... (Page 3)"
}
Note: This orchestration is handled by the LLM at runtime based on:
  1. System prompt (defines agent behavior)
  2. Tool schemas (available functions and their parameters)
  3. User input (the query to process)
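
The tool schemas in item 2 come from ordinary Python functions registered with the @tool decorator (covered earlier in this series). Here is a minimal sketch of the idea; the import path and the function body are assumptions for illustration, not GAIA’s actual implementation:

# Sketch only: import path and body are illustrative
import fnmatch
import os
from typing import List

from gaia.agents.base.tools import tool  # import path is an assumption

@tool
def search_file(file_pattern: str) -> List[str]:
    """Find files whose names match file_pattern.

    The decorator publishes this signature and docstring to the LLM as
    a tool schema, which is how the model knows it can emit
    {"tool": "search_file", "tool_args": {...}} in the loop above.
    """
    matches = []
    for root, _dirs, files in os.walk("."):
        for name in files:
            if fnmatch.fnmatch(name.lower(), f"*{file_pattern.lower()}*"):
                matches.append(os.path.join(root, name))
    return matches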

Search Key Generation (Smart Retrieval)

The agent generates multiple variations of your query to improve retrieval.

User asks: “What is the vision of the oil & gas regulator?”

Agent generates multiple search keys:
  1. Original: “What is the vision of the oil & gas regulator?”
  2. Keywords: “vision oil gas regulator”
  3. Reformulated: “oil gas regulator vision definition”
  4. Alternate: “oil gas regulator vision explanation”
It then searches with all of these keys and combines the results.
# This happens in the _generate_search_keys method
from typing import List  # needed for the List[str] annotation

def _generate_search_keys(self, query: str) -> List[str]:
    keys = [query]  # Original query

    # Extract keywords (STOP_WORDS is a module-level set of common
    # words like "the" and "is"; very short tokens are dropped too)
    words = query.lower().split()
    keywords = [w for w in words if w not in STOP_WORDS and len(w) > 2]
    keys.append(" ".join(keywords))

    # Add reformulations for definition-style questions
    if query.lower().startswith("what is"):
        topic = query[8:].strip("?").strip()  # text after "what is "
        keys.append(f"{topic} definition")
        keys.append(f"{topic} explanation")

    return keys
Impact on retrieval:
  • Increases recall by matching different phrasings
  • Compensates for keyword mismatches
  • Trade-off: More compute (multiple searches) for better coverage
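
To make the combination step concrete, here is a minimal sketch of multi-key retrieval. It assumes a search callable returning (chunk_id, score) pairs; the function name and merge strategy are illustrative, not GAIA’s internal API:

from typing import Callable, Dict, List, Tuple

def multi_key_search(
    search: Callable[[str, int], List[Tuple[str, float]]],
    keys: List[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Run every generated key through `search`, keeping the best score
    seen per chunk so a chunk matched by several phrasings counts once."""
    best: Dict[str, float] = {}
    for key in keys:
        for chunk_id, score in search(key, k):
            best[chunk_id] = max(best.get(chunk_id, 0.0), score)
    # Highest-scoring chunks first, trimmed back to the usual top-k
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]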

Advanced Patterns

Pattern 1: Multi-Document Synthesis

agent.process_query(
    "Compare safety protocols in oil manual vs gas manual. "
    "What are the key differences?"
)

Pattern 2: Contextual Follow-ups

# Initial query
agent.process_query("What are the installation requirements?")
# Response: "Python 3.10+, 8GB RAM, 50GB disk..."

# Follow-up (implicit context)
agent.process_query("What about for production?")
# Agent understands context: still discussing installation
# Retrieves production-specific requirements
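
What makes the follow-up work is conversation state: the agent carries recent turns into the next prompt, as the example shows. ChatAgent handles this internally; the buffer below is only an illustrative sketch of the pattern, not the library’s code:

from typing import List, Tuple

class HistoryBuffer:
    """Keep recent turns so a terse follow-up like "What about for
    production?" can be interpreted against the prior topic."""

    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.turns: List[Tuple[str, str]] = []

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]  # drop oldest turns

    def as_prompt_context(self) -> str:
        # Prepended to the next LLM prompt so the model sees the thread
        return "\n".join(f"User: {q}\nAgent: {a}" for q, a in self.turns)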

Pattern 3: Progressive Discovery

# Query 1
agent.process_query("What documents do you have about AI?")
# Agent: Searches filesystem, finds AI PDFs, indexes them

# Query 2
agent.process_query("Tell me about neural networks")
# Agent: Searches newly-indexed AI documents

# Query 3
agent.process_query("Are there any documents about transformers?")
# Agent: Searches again, finds more, indexes on-demand

Deployment Options

As a Web API

api_server.py
from gaia.agents.chat.agent import ChatAgent, ChatAgentConfig
from gaia.api.openai_server import create_app
from gaia.api.agent_registry import registry
import uvicorn

# Configure agent
config = ChatAgentConfig(
    watch_directories=["./company_docs"],
    silent_mode=True  # Disable console output for API
)

agent = ChatAgent(config)

# Create OpenAI-compatible API
app = create_app()

# Register agent
registry.register("doc-qa", lambda: agent)

# Run server
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
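
Once the server is running, any OpenAI-compatible client can call it. A sketch, assuming the server exposes the standard /v1/chat/completions route and that the requests package is available:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "doc-qa",  # the agent name registered above
        "messages": [
            {"role": "user", "content": "What are the installation requirements?"}
        ],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])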

As a CLI Tool

cli.py
#!/usr/bin/env python3
import sys
from gaia.agents.chat.agent import ChatAgent, ChatAgentConfig

def main():
    if len(sys.argv) < 2:
        print("Usage: doc-qa <question>")
        print("       doc-qa --interactive")
        sys.exit(1)

    config = ChatAgentConfig(
        watch_directories=["./documents"],
        silent_mode=False
    )
    agent = ChatAgent(config)

    if sys.argv[1] == "--interactive":
        # Interactive mode
        while True:
            question = input("You: ")
            if question.lower() in ['quit', 'exit']:
                break
            result = agent.process_query(question)
            print(f"Agent: {result.get('answer')}\n")
    else:
        # One-shot query
        question = " ".join(sys.argv[1:])
        result = agent.process_query(question)
        print(result.get('answer'))

if __name__ == "__main__":
    main()

Troubleshooting

Symptom: Agent responds with “No documents indexed”
Debugging:
# In a shell: verify RAG dependencies are installed
uv pip install -e ".[rag]"

# Then in Python: check the document exists
import os
print(os.path.exists("./manual.pdf"))  # Should be True

# Check indexed state
print(f"Indexed: {agent.indexed_files}")
print(f"RAG files: {agent.rag.indexed_files}")
Common causes:
  • Missing RAG dependencies (uv pip install -e ".[rag]")
  • Incorrect file path
  • File not readable
Symptom: Agent returns irrelevant information
Tuning options:
# Retrieve more chunks
config = ChatAgentConfig(max_chunks=10)

# Use semantic chunking (slower, more accurate)
config = ChatAgentConfig(use_llm_chunking=True)

# Larger chunks for more context
config = ChatAgentConfig(chunk_size=800)

# Debug what's being retrieved
config = ChatAgentConfig(debug=True)
Check retrieval scores:
response = agent.rag.query("your question")
print(f"Scores: {response.chunk_scores}")
# Scores < 0.5 indicate weak matches
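
If scores are consistently weak, you can drop low-scoring chunks before synthesis. A minimal sketch, assuming the response object also exposes the retrieved chunks alongside chunk_scores (the chunks attribute name is an assumption):

MIN_SCORE = 0.5  # below this threshold, matches are usually noise

strong_chunks = [
    chunk
    for chunk, score in zip(response.chunks, response.chunk_scores)
    if score >= MIN_SCORE
]
if not strong_chunks:
    print("No strong matches - try rephrasing, or re-chunk the document")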
Symptom: New files not auto-indexed
Solution:
# Install watchdog dependency
uv pip install "watchdog>=2.1.0"
Verify:
# Check watchers are active
print(f"Watching: {agent.watch_directories}")
print(f"Observers: {len(agent.observers)}")  # Should be > 0

# Check file handler telemetry
if agent.file_handlers:
    telemetry = agent.file_handlers[0].get_telemetry()
    print(f"Events: {telemetry}")
Symptom: Indexing takes excessive time
Optimizations:
# Smaller chunks = faster indexing
config = ChatAgentConfig(chunk_size=300)

# Disable LLM chunking (use fast heuristic)
config = ChatAgentConfig(use_llm_chunking=False)

# Index incrementally, not all at once
# Let file watching handle gradual indexing
Benchmark:
import time
start = time.time()
agent.rag.index_document("large.pdf")
print(f"Indexing took: {time.time() - start:.2f}s")
Typical performance on an AI PC with Ryzen AI:
  • Small PDF (10 pages): ~2-3 seconds
  • Medium PDF (50 pages): ~8-12 seconds
  • Large PDF (200 pages): ~30-45 seconds
  • Embedding generation: ~50-100 chunks/second on NPU

Best Practices

Start Small

Index a few representative documents first, then scale. Full drive indexing is rarely necessary.

Tune for Document Type

Technical docs need larger chunks (600-800 tokens). FAQs work well with smaller chunks (300-400 tokens).
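
For example, using the chunk_size parameter shown throughout this series (these values are starting points from the guidance above, not hard rules):

# Technical manuals: larger chunks keep multi-step procedures intact
technical_config = ChatAgentConfig(chunk_size=700)

# FAQs: smaller chunks map roughly one chunk to one Q&A pair
faq_config = ChatAgentConfig(chunk_size=350)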

Use Sessions

Session persistence avoids re-indexing on every restart. Critical for large document sets.
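
As noted under Session Management below, state persists as JSON. A minimal sketch of the idea (the file name and schema here are illustrative; agent.indexed_files is the attribute used in the troubleshooting section above):

import json
import os

SESSION_FILE = "./session.json"  # illustrative location

# On shutdown: record which files are already indexed
with open(SESSION_FILE, "w") as f:
    json.dump({"indexed_files": list(agent.indexed_files)}, f)

# On startup: skip re-indexing anything recorded in the session
if os.path.exists(SESSION_FILE):
    with open(SESSION_FILE) as f:
        already_indexed = set(json.load(f)["indexed_files"])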

Monitor Selectively

Watch only actively changing directories. Static archives don’t benefit from monitoring.

Security

Set allowed_paths in production. Prevents path traversal and unauthorized access.
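
For example, assuming allowed_paths is accepted by ChatAgentConfig alongside the options used earlier:

config = ChatAgentConfig(
    watch_directories=["./company_docs"],
    allowed_paths=["./company_docs"],  # reject files outside this tree
)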

Incremental Development

Build step-by-step. Test each component before combining them.

Performance Tuning

| Profile  | chunk_size | max_chunks | use_llm_chunking | Use Case                  | AI PC Performance         |
|----------|------------|------------|------------------|---------------------------|---------------------------|
| Fast     | 300        | 3          | False            | Development, quick FAQs   | ~2-3s per PDF (NPU)       |
| Balanced | 500        | 5          | False            | Production default        | ~5-7s per PDF (NPU)       |
| Accurate | 800        | 7          | True             | Complex technical queries | ~10-15s per PDF (NPU+LLM) |
# Development (fast)
config = ChatAgentConfig(
    chunk_size=300,
    max_chunks=3,
    use_llm_chunking=False
)

# Production (balanced)
config = ChatAgentConfig(
    chunk_size=500,
    max_chunks=5,
    use_llm_chunking=False
)

# Research (accurate)
config = ChatAgentConfig(
    chunk_size=800,
    max_chunks=7,
    use_llm_chunking=True
)

What You’ve Learned

Agent Architecture

The reasoning loop pattern: query → tool selection → execution → synthesis

Tool System

Using @tool decorator to register Python functions as LLM-invocable capabilities

RAG Integration

Vector embeddings, FAISS indexing, and semantic retrieval

Mixin Pattern

Composing agent functionality via multiple inheritance

File Monitoring

Watchdog-based reactive indexing with debouncing

Session Management

JSON serialization for state persistence

Security

Path validation, symlink detection, and access control

Performance

Tuning chunk size, retrieval count, and chunking strategies

Next Steps

Extend with Voice

Add Whisper (ASR) and Kokoro (TTS) for voice interaction. See: Voice-Enabled Assistant (coming soon)

Build Code Agent

Create an agent with code generation and debugging capabilities. See: Code Generation Agent (coming soon)

Multi-Agent Orchestration

Route queries to specialized agents based on intent. See: Multi-Agent System (coming soon)

Package for Distribution

Package as a Windows installer or Electron app. See: Deployment Guide

Source Code Reference


Resources