Looking for usage docs? See the Code Index guide for CLI walkthroughs and architecture overview.
CodeIndexSDK indexes repository files (no git history, no PRs) using Lemonade Server embeddings and FAISS for vector similarity search. The SDK is exposed to agents via CodeIndexToolsMixin, which is already composed into the built-in CodeAgent.
Installation
uv pip install -e '.[rag]'
FAISS and numpy live under the [rag] extras. If they are missing, the SDK raises:
code_index dependencies missing. Install with: pip install -e '.[rag]'
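To check up front whether the extras are installed, a small probe like the one below may help. This is only a sketch: `missing_rag_dependencies` is a hypothetical helper, not part of the SDK, and it assumes the import names are `faiss` and `numpy`.

```python
import importlib.util
from typing import Iterable, List


def missing_rag_dependencies(modules: Iterable[str] = ("faiss", "numpy")) -> List[str]:
    """Return the module names that cannot be found on the import path.

    Hypothetical helper for illustration; the SDK performs its own check.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_rag_dependencies()
if missing:
    print(f"Missing: {', '.join(missing)}. Install with: uv pip install -e '.[rag]'")
else:
    print("FAISS and numpy are importable.")
```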
CodeIndexConfig
from gaia.code_index.sdk import CodeIndexConfig
config = CodeIndexConfig(
    repo_path=".",
    max_files=5000,
    max_file_size_mb=1.0,
    chunk_overlap=50,
    embedding_model="nomic-embed-text-v2-moe-GGUF",
    cache_dir="~/.gaia/code_index",
    embedding_base_url=None,
)
| Field | Type | Default | Description |
|---|---|---|---|
| repo_path | str | (required) | Repository root |
| max_files | int | 5000 | Max files to scan |
| max_file_size_mb | float | 1.0 | Skip files larger than this |
| chunk_overlap | int | 50 | Token overlap between chunks |
| embedding_model | str | "nomic-embed-text-v2-moe-GGUF" | Lemonade embedding model |
| cache_dir | str | "~/.gaia/code_index" | Cache directory |
| embedding_base_url | Optional[str] | None | Custom Lemonade base URL |
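To illustrate what chunk_overlap controls, here is a minimal sliding-window sketch. It is not the SDK's chunker: `chunk_with_overlap` and its `chunk_size` parameter are hypothetical, and "tokens" here are just list items.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens.

    Hypothetical illustration of overlapping chunks; not the SDK's algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


demo = chunk_with_overlap(list("abcdefghij"), chunk_size=4, overlap=2)
for window in demo:
    print("".join(window))
```

Each window repeats the tail of the previous one, so a symbol that straddles a chunk boundary still appears whole in at least one chunk when it fits inside the overlap.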
CodeIndexSDK
from gaia.code_index.sdk import CodeIndexSDK
sdk = CodeIndexSDK(config)
index_repository() -> IndexResult
Discover source files, parse them, embed via Lemonade, and persist a FAISS index. Re-runs are incremental — unchanged files (matched by SHA-256) reuse their existing embeddings.
result = sdk.index_repository()
print(f"Files: {result.files_indexed}, "
      f"Chunks: {result.chunks_created}, "
      f"Took: {result.duration_seconds:.1f}s")
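The incremental behaviour can be pictured as a hash comparison. The sketch below assumes a simple mapping from file path to a previously stored SHA-256 digest; `sha256_of` and `files_needing_reindex` are hypothetical names, and the SDK's actual bookkeeping may differ.

```python
import hashlib
from typing import Dict, List


def sha256_of(content: bytes) -> str:
    """Hex digest of the file content."""
    return hashlib.sha256(content).hexdigest()


def files_needing_reindex(current: Dict[str, bytes], known_hashes: Dict[str, str]) -> List[str]:
    """Return paths whose content hash is new or differs from the stored one.

    Hypothetical helper: `current` maps path -> file bytes and
    `known_hashes` maps path -> digest recorded by an earlier run.
    """
    return sorted(
        path
        for path, content in current.items()
        if known_hashes.get(path) != sha256_of(content)
    )


previous = {"a.py": sha256_of(b"print('a')")}
now = {"a.py": b"print('a')", "b.py": b"print('b')"}
print(files_needing_reindex(now, previous))  # only the new file is listed
```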
search(query, scope="all", top_k=10) -> List[SearchResult]
results = sdk.search("error handling in agent tools", top_k=5)
for r in results:
    print(f"[{r.score:.3f}] {r.chunk.file_path}:{r.chunk.start_line} "
          f"— {r.chunk.symbol_name or '?'}")
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | (required) | Natural language or code query |
| scope | str | "all" | "all" or "code" (reserved for future filters) |
| top_k | int | 10 | Number of results |
L2 distance is converted to a similarity score in [0, 1] via 1 / (1 + dist).
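The conversion is a one-liner. The function name below is hypothetical, and the snippet only restates the formula above:

```python
def l2_to_similarity(dist: float) -> float:
    """Map a non-negative L2 distance to a similarity score via 1 / (1 + dist)."""
    if dist < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + dist)


for d in (0.0, 1.0, 3.0):
    print(d, l2_to_similarity(d))
```

A distance of 0 maps to a score of 1.0, and the score falls towards 0 as the distance grows.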
get_status() -> Dict
status = sdk.get_status()
# {
# "indexed": True,
# "repo_path": "/abs/path",
# "embedding_model": "nomic-embed-text-v2-moe-GGUF",
# "total_chunks": 1234,
# "code_chunks": 1234,
# "files_tracked": 312,
# "created_at": 1714000000.0,
# "cache_path": "/Users/me/.gaia/code_index/<hash>"
# }
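A status dict of this shape can be turned into a one-line summary. The helper below is a hypothetical sketch: it assumes created_at is a Unix timestamp in seconds, and that an unindexed repo reports "indexed": False, which the example above does not confirm.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def summarize_status(status: Dict[str, Any]) -> str:
    """Format a get_status()-style dict as a short human-readable line.

    Hypothetical helper; key names follow the example dict above.
    """
    if not status.get("indexed"):
        return "No index found."
    # Assumption: created_at is seconds since the Unix epoch.
    created = datetime.fromtimestamp(status["created_at"], tz=timezone.utc)
    return (
        f"{status['total_chunks']} chunks across {status['files_tracked']} files "
        f"(model {status['embedding_model']}, built {created:%Y-%m-%d} UTC)"
    )


sample = {
    "indexed": True,
    "embedding_model": "nomic-embed-text-v2-moe-GGUF",
    "total_chunks": 1234,
    "files_tracked": 312,
    "created_at": 1714000000.0,
}
print(summarize_status(sample))
```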
clear_index()
Removes the FAISS index and metadata for this repo from the cache.
Data types
CodeChunk
| Field | Type | Description |
|---|---|---|
| content | str | Source text |
| file_path | str | Path relative to repo root |
| language | str | Detected language |
| start_line | int | First line (1-indexed) |
| end_line | int | Last line |
| symbol_name | Optional[str] | Function / class name |
| symbol_type | Optional[str] | "function", "class", … |
| docstring | Optional[str] | Extracted docstring |
| imports | List[str] | Imports parsed from the file |
SearchResult
| Field | Type | Description |
|---|---|---|
| chunk | CodeChunk | The matched chunk |
| score | float | Similarity score in [0, 1] (higher is better) |
| result_type | str | Currently always "code" |
IndexResult
| Field | Type | Description |
|---|---|---|
| files_indexed | int | Files newly parsed in this run |
| chunks_created | int | Total chunks in the resulting index (reused + new) |
| duration_seconds | float | Wall-clock time for the run |
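As an example of working with these shapes, the sketch below groups search hits by file and drops low-scoring ones. `ChunkStandIn` and `ResultStandIn` are stand-in dataclasses that mirror the tables above so the snippet runs without GAIA installed. They are not the SDK's classes, and their default values are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class ChunkStandIn:
    """Stand-in mirroring the CodeChunk fields (defaults are assumptions)."""
    content: str
    file_path: str
    language: str
    start_line: int
    end_line: int
    symbol_name: Optional[str] = None
    symbol_type: Optional[str] = None
    docstring: Optional[str] = None
    imports: List[str] = field(default_factory=list)


@dataclass
class ResultStandIn:
    """Stand-in mirroring the SearchResult fields."""
    chunk: ChunkStandIn
    score: float
    result_type: str = "code"


def group_by_file(results: Iterable[ResultStandIn], min_score: float = 0.0) -> Dict[str, List[ResultStandIn]]:
    """Group hits by file path, best score first, skipping hits below min_score."""
    grouped: Dict[str, List[ResultStandIn]] = defaultdict(list)
    for hit in results:
        if hit.score >= min_score:
            grouped[hit.chunk.file_path].append(hit)
    for hits in grouped.values():
        hits.sort(key=lambda h: h.score, reverse=True)
    return dict(grouped)


hits = [
    ResultStandIn(ChunkStandIn("def a(): ...", "tools.py", "python", 1, 1, "a", "function"), 0.4),
    ResultStandIn(ChunkStandIn("def b(): ...", "tools.py", "python", 3, 3, "b", "function"), 0.9),
    ResultStandIn(ChunkStandIn("class C: ...", "agent.py", "python", 1, 1, "C", "class"), 0.2),
]
for path, file_hits in group_by_file(hits, min_score=0.3).items():
    print(path, [h.chunk.symbol_name for h in file_hits])
```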
Parsers
gaia.code_index.parsers provides language detection and symbol-aware chunking.
from gaia.code_index.parsers import (
    detect_language,
    is_binary_file,
    parse_python_file,
    parse_generic_file,
    chunk_code_file,
)
# Detect language from filename
lang = detect_language("agent.py") # "python"
# Parse a Python file with the AST
chunks = parse_python_file("agent.py", source_code)
# Parse JS/TS/Go/Rust/Java/C/C++ via regex
chunks = parse_generic_file("agent.ts", source_code, "typescript")
# High-level dispatcher
chunks = chunk_code_file("agent.go", source_code, max_size_mb=1.0)
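To give a feel for AST-based, symbol-aware chunking, here is a minimal sketch built on the standard library's `ast` module. It is not the implementation of parse_python_file: `sketch_python_chunks` is a hypothetical function that only handles top-level functions and classes and returns plain dicts keyed like the CodeChunk fields.

```python
import ast
from typing import Any, Dict, List


def sketch_python_chunks(source: str) -> List[Dict[str, Any]]:
    """Return one chunk per top-level function or class in `source`.

    Hypothetical illustration; nested symbols and module-level code are ignored.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[Dict[str, Any]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol_name": node.name,
                "symbol_type": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": node.lineno,        # 1-indexed line of the def/class keyword
                "end_line": node.end_lineno,      # available on Python 3.8+
                "docstring": ast.get_docstring(node),
                "content": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks


sample_source = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    '    """Say hello."""\n'
    '    return f"hi {name}"\n'
    "\n"
    "class Box:\n"
    "    pass\n"
)
for chunk in sketch_python_chunks(sample_source):
    print(chunk["symbol_type"], chunk["symbol_name"], chunk["start_line"], chunk["end_line"])
```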
The mixin lives at gaia.agents.code_index.tools.mixin.CodeIndexToolsMixin. It exposes four @tool-decorated methods through register_code_index_tools():
| Tool | Description |
|---|---|
| index_codebase(repo_path="") | Build / refresh the FAISS index |
| search_code_index(query, scope="all", top_k=10) | Semantic search |
| get_index_status() | Report indexed state, chunk counts, embedding model, cache path |
| clear_code_index() | Remove the cached index |
CodeAgent already composes this mixin, so gaia-code index chat and any CodeAgent instance get the tools for free.
Composing onto a custom Python agent
from gaia.agents.base.agent import Agent
from gaia.agents.code_index.tools.mixin import CodeIndexToolsMixin
class MyResearchAgent(CodeIndexToolsMixin, Agent):
    def __init__(self, repo_path: str = "."):
        super().__init__()
        self._init_code_index_state(repo_path=repo_path)

    def _register_tools(self):
        super()._register_tools()
        self.register_code_index_tools()
State is initialised lazily — calling _init_code_index_state is optional but recommended so the agent honours the repo path you intended.
CodeIndexToolsMixin is also registered in KNOWN_TOOLS["code_index"] (src/gaia/agents/registry.py) for dynamic composition when scaffolding agents — see the custom-agent guide.
Privacy and safety
- All embeddings are generated locally by Lemonade Server.
- Sensitive filenames are skipped during discovery: .env, .env.*, SSH private keys, *.pem, *.key, *.pfx, *.p12, *.jks, *.keystore.
- The SDK uses gaia.security.PathValidator to keep file reads scoped to the configured repo_path, and the index_codebase tool refuses any repo_path that isn’t a sub-path of the agent’s original root.
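The two safeguards above can be approximated with the standard library. This sketch does not use gaia.security.PathValidator. `is_sensitive` and `is_within_root` are hypothetical helpers, and the SSH key filenames in the pattern list are an assumption, since the list above only says "SSH private keys".

```python
import fnmatch
from pathlib import Path

# Patterns from the list above; the two SSH key names are assumed examples.
SENSITIVE_PATTERNS = (
    ".env", ".env.*", "*.pem", "*.key", "*.pfx", "*.p12", "*.jks", "*.keystore",
    "id_rsa", "id_ed25519",
)


def is_sensitive(filename: str) -> bool:
    """True if the file's base name matches any sensitive pattern."""
    name = Path(filename).name
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in SENSITIVE_PATTERNS)


def is_within_root(candidate: str, root: str) -> bool:
    """True if `candidate`, resolved against `root`, stays inside `root`."""
    root_path = Path(root).resolve()
    # Joining with an absolute candidate discards root_path, which is what we want.
    candidate_path = (root_path / candidate).resolve()
    return candidate_path == root_path or root_path in candidate_path.parents


print(is_sensitive("config/.env.local"), is_sensitive("main.py"))
print(is_within_root("src/app.py", "/tmp/repo"), is_within_root("../other", "/tmp/repo"))
```

Resolving both paths before comparing them matters: a naive string-prefix check would accept paths such as "../other" that escape the root.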
Roadmap
- #869 — MCP server wrapper exposing the index to external code assistants.
- #870 — Multi-repo indexing in a single namespace.
- #871 — Verilog / VHDL parsers.