GAIA’s talk mode enables voice-based interaction with LLMs, using Whisper for automatic speech recognition (ASR) and Kokoro for text-to-speech (TTS). Have natural conversations with AI through your microphone and speakers.
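Conceptually, a talk session is a loop of transcribe, query, and speak. The sketch below illustrates that loop with hypothetical function names (this is not GAIA’s actual API; real audio I/O is replaced with canned strings):

```python
# Conceptual talk-mode loop (all function names hypothetical, not GAIA's API):
# transcribe speech, query the LLM, speak the reply, repeat until "exit".

def talk_loop(transcribe, ask_llm, speak):
    history = []
    while True:
        text = transcribe()          # ASR step: microphone -> text
        cmd = text.strip().lower()
        if cmd in {"exit", "quit"}:
            break                    # end the session
        if cmd == "restart":
            history.clear()          # wipe conversation history
            continue
        history.append({"role": "user", "content": text})
        reply = ask_llm(history)     # LLM step: history -> response text
        history.append({"role": "assistant", "content": reply})
        speak(reply)                 # TTS step: text -> speakers
    return history

# Toy run with canned input/output instead of real audio:
inputs = iter(["hello", "restart", "what is GAIA?", "exit"])
log = talk_loop(lambda: next(inputs),
                lambda h: f"echo: {h[-1]['content']}",
                lambda s: None)
```

Note how “restart” clears the history mid-session, matching the voice commands described below.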
First time here? Complete the Setup guide first to install GAIA and its dependencies.

Quick Start

1. Install talk extras

With GAIA installed, add the talk extras:
uv pip install "amd-gaia[talk]"
2. Start Lemonade Server

Launch the AI backend server:
lemonade-server serve
You can also start the server by double-clicking the desktop shortcut.
3. Launch Talk Mode

Start a voice conversation:
gaia talk
4. Start Speaking

When you see ⠴ Listening..., you can start talking:
[2025-02-06 11:56:30] | INFO | Starting audio processing thread...
[2025-02-06 11:56:30] | INFO | Listening for voice input...
 Listening...
Say “exit” or “quit” to end the session

Voice Commands

  • Exit session: say “exit” or “quit”
  • Clear history: say “restart”
  • Trigger response: pause naturally for more than 1 second
  • Stop playback: press Enter during audio playback
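The “>1 second pause” trigger is a form of endpointing: a run of low-energy audio frames longer than the pause threshold is treated as end-of-utterance. A minimal sketch (illustrative only; frame length and threshold are assumptions, not GAIA’s actual values or implementation):

```python
# Pause-based endpointing sketch: count trailing low-energy frames and
# trigger a response once the silence exceeds one second.

FRAME_MS = 100           # assumed frame length
PAUSE_MS = 1000          # ">1 second" pause triggers a response
ENERGY_THRESHOLD = 0.01  # assumed silence threshold

def utterance_ended(frame_energies):
    """Return True once the trailing silence exceeds PAUSE_MS."""
    silent_ms = 0
    for energy in reversed(frame_energies):
        if energy < ENERGY_THRESHOLD:
            silent_ms += FRAME_MS
        else:
            break
    return silent_ms > PAUSE_MS

speech = [0.5] * 8                    # ~0.8 s of speech frames
short_pause = speech + [0.001] * 5    # 0.5 s pause: keep listening
long_pause = speech + [0.001] * 11    # 1.1 s pause: trigger a response
```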

Configuration Options

Customize your voice interaction experience:
# Choose model size for speech recognition
gaia talk --whisper-model-size medium

# Available: tiny, base, small, medium, large
Larger models provide better accuracy but require more resources

Document Q&A with Voice

Voice interaction supports document-based Q&A through RAG (Retrieval-Augmented Generation). Ask questions about your PDF documents using natural speech!

Quick Start with Documents

# Voice chat with a document
gaia talk --index manual.pdf

Use Cases

  • Technical support: voice chat with product manuals and troubleshooting guides
  • Research: speak questions about research papers and documentation
  • Learning: voice interaction with textbooks and educational materials
  • Accessibility: hands-free document Q&A for users with mobility needs
  • Field work: voice queries about procedures when your hands are busy
  • Documentation: quick reference lookup while working

How It Works

1. Document Indexing

PDFs are automatically indexed when you start talk mode with --index

2. Voice Input

Speak your question about the documents

3. Context Retrieval

Relevant document sections are retrieved automatically

4. Voice Response

The AI answers based on document context and speaks the response
See the Document Q&A section of the Chat documentation for more details on RAG capabilities.
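The retrieve-then-answer flow can be sketched in a few lines. This is illustrative only: the chunking and word-overlap scoring below are stand-ins for GAIA’s actual indexing and retrieval, and all names are hypothetical:

```python
# Minimal RAG sketch: chunk the document, score chunks against the question,
# and build a context-augmented prompt for the LLM.

def chunk(text, size=40):
    """Split document text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, chunks, k=1):
    """Rank chunks by word overlap with the question; return the top k.
    (Real systems use embeddings, not raw word overlap.)"""
    q = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, context):
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

doc = ("The reset button is on the back panel. "
       "Hold it for five seconds to restore factory settings.")
ctx = retrieve("where is the reset button", chunk(doc, size=8))
prompt = build_prompt("where is the reset button", "\n".join(ctx))
```

The retrieved chunk, not the whole document, is what the LLM sees, which is why indexing quality directly affects answer quality.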

Testing ASR Components

Test the automatic speech recognition system using various test modes:

Audio File Transcription

Test transcription of existing audio files:
gaia test --test-type asr-file-transcription --input-audio-file path/to/audio.wav
Supported formats:
  • WAV
  • MP3
  • M4A
  • Other common formats
Options:
  • --input-audio-file: Path to the audio file (required)
  • --whisper-model-size: Model size (default: “base”)
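If you need an input file to exercise this test, the stdlib-only sketch below writes a short 16 kHz mono WAV containing a 440 Hz tone (the filename and parameters are arbitrary; real speech will of course transcribe better, but a generated file verifies the file-handling path):

```python
# Generate a 2-second, 16 kHz, mono, 16-bit WAV file using only the stdlib.
# Whisper pipelines commonly resample input to 16 kHz mono internally.
import math
import struct
import wave

RATE = 16000   # sample rate in Hz
SECONDS = 2

frames = b"".join(
    struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / RATE)))
    for i in range(RATE * SECONDS)
)

with wave.open("test_tone.wav", "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(RATE)
    w.writeframes(frames)

# Read it back to confirm the duration:
with wave.open("test_tone.wav", "rb") as w:
    duration = w.getnframes() / w.getframerate()   # 2.0 seconds
```

You can then pass the generated file to the test via --input-audio-file test_tone.wav.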

List Audio Devices

Discover available audio input devices:
gaia test --test-type asr-list-audio-devices

Microphone Recording Test

Test real-time transcription from your microphone:
gaia test --test-type asr-microphone --recording-duration 15
Options:
  • --recording-duration: Recording duration in seconds (default: 10)
  • --whisper-model-size: Model size (default: “base”)
  • --audio-device-index: Specific microphone (optional)

Testing TTS Components

Test text-to-speech capabilities with various test modes:

Text Preprocessing

Test how TTS processes and formats text:
gaia test --test-type tts-preprocessing

Streaming Playback

Test real-time audio generation and playback:
gaia test --test-type tts-streaming --test-text "Your test text here"
The test displays:
  • Processing progress
  • Playback progress
  • Currently spoken text
  • Performance metrics
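Streaming TTS typically synthesizes text sentence by sentence so playback can begin before the full response is generated. A minimal chunker sketch (illustrative only; GAIA’s actual pipeline is more sophisticated, and this naive split also breaks on abbreviations like “Dr.”):

```python
# Split text into sentence-sized chunks for incremental TTS synthesis.
import re

def sentence_chunks(text):
    """Split on sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = sentence_chunks("First sentence. Second one! Is this the third? Yes.")
```

Each chunk can then be handed to the synthesizer while the next one is still being generated, which is what makes the playback feel responsive.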

Audio File Generation

Generate and save audio to WAV file:
gaia test --test-type tts-audio-file \
  --test-text "Your test text here" \
  --output-audio-file ./test_output.wav

Troubleshooting

Microphone not detected:
  • Try different --audio-device-index values
  • List available devices: gaia test --test-type asr-list-audio-devices
  • Check system audio input settings (Settings > Audio > Input)
  • Ensure the correct microphone is selected as the default input device
Poor recognition accuracy:
  • Try larger Whisper models: --whisper-model-size medium or large
  • Ensure you’re in a quiet environment
  • Speak clearly at a moderate pace
  • Check microphone positioning and quality
  • Verify the microphone is not muted
No audio output:
  • Check system audio output/speaker settings
  • Verify TTS is enabled (not using the --no-tts flag)
  • Ensure system volume is not muted
  • Verify espeak-ng is properly installed
  • Test with: gaia test --test-type tts-streaming
Microphone not working:
  • Check microphone permissions
  • Verify the microphone works in other applications
  • Test with: gaia test --test-type asr-microphone
  • Adjust --audio-device-index if you have multiple microphones
Missing RAG dependencies:
uv pip install -e ".[rag]"
Other issues:
  • PDF processing errors: Ensure PDFs have extractable text (not scanned images)
  • Slow indexing: Use --stats to monitor; larger documents take time
  • Context not used: Verify documents indexed successfully at startup
  • Empty responses: Check PDFs contain extractable text

Best Practices

  • Optimal environment: use a quiet environment for best recognition accuracy
  • Speech clarity: speak clearly and at a moderate pace
  • Model selection: balance accuracy vs. performance based on your hardware
  • Natural pauses: use natural pauses to trigger AI responses

Next Steps