Looking for usage docs? See the Code Index guide for CLI walkthroughs and architecture overview.
CodeIndexSDK indexes repository files (no git history, no PRs) using Lemonade Server embeddings and FAISS for vector similarity search. The SDK is exposed to agents via CodeIndexToolsMixin, which is already composed into the built-in CodeAgent.

Installation

uv pip install -e '.[rag]'
FAISS and numpy are installed through the [rag] extra. If they are missing, the SDK raises:
code_index dependencies missing. Install with: pip install -e '.[rag]'
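To check for the two dependencies before constructing the SDK, a small pre-flight test along these lines may help. It is a sketch that uses only the standard library; the helper name missing_rag_dependencies is hypothetical and not part of GAIA.

```python
import importlib.util


def missing_rag_dependencies(modules=("faiss", "numpy")):
    """Return the module names that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_rag_dependencies()
if missing:
    print(f"Missing: {', '.join(missing)}. Install with: uv pip install -e '.[rag]'")
else:
    print("FAISS and numpy are importable.")
```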

CodeIndexConfig

from gaia.code_index.sdk import CodeIndexConfig

config = CodeIndexConfig(
    repo_path=".",
    max_files=5000,
    max_file_size_mb=1.0,
    chunk_overlap=50,
    embedding_model="nomic-embed-text-v2-moe-GGUF",
    cache_dir="~/.gaia/code_index",
    embedding_base_url=None,
)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| repo_path | str | | Repository root (required) |
| max_files | int | 5000 | Max files to scan |
| max_file_size_mb | float | 1.0 | Skip files larger than this |
| chunk_overlap | int | 50 | Token overlap between chunks |
| embedding_model | str | "nomic-embed-text-v2-moe-GGUF" | Lemonade embedding model |
| cache_dir | str | "~/.gaia/code_index" | Cache directory |
| embedding_base_url | Optional[str] | None | Custom Lemonade base URL |
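The chunk_overlap setting describes how many tokens adjacent chunks share. The sketch below illustrates the general sliding-window idea only: the chunk_size parameter and the list-of-tokens input are assumptions made for the example, and the SDK's own chunker (which is symbol-aware) may apply the overlap differently.

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Split tokens into windows where neighbours share `overlap` tokens.

    Illustrative only: chunk_size is a hypothetical parameter, not a
    CodeIndexConfig field.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


for chunk in sliding_chunks("a b c d e f g".split(), chunk_size=3, overlap=1):
    print(chunk)
```

A larger overlap repeats more context at chunk boundaries, at the cost of more chunks to embed.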

CodeIndexSDK

from gaia.code_index.sdk import CodeIndexSDK

sdk = CodeIndexSDK(config)

index_repository() -> IndexResult

Discovers source files, parses them, embeds the resulting chunks via Lemonade, and persists a FAISS index. Re-runs are incremental: files whose SHA-256 hash is unchanged reuse their existing embeddings.
result = sdk.index_repository()
print(f"Files: {result.files_indexed}, "
      f"Chunks: {result.chunks_created}, "
      f"Took: {result.duration_seconds:.1f}s")
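The incremental behaviour described above can be pictured as comparing each file's SHA-256 digest against a manifest saved by the previous run. The following is a minimal sketch of that idea, not the SDK's implementation; file_sha256, plan_incremental_run, and the manifest layout are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file's bytes in blocks so large files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_incremental_run(files, previous_manifest):
    """Split files into (reuse, reembed) and return the updated manifest."""
    reuse, reembed, manifest = [], [], {}
    for path in files:
        digest = file_sha256(path)
        manifest[str(path)] = digest
        if previous_manifest.get(str(path)) == digest:
            reuse.append(path)  # unchanged: keep the stored embeddings
        else:
            reembed.append(path)  # new or modified: parse and embed again
    return reuse, reembed, manifest


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp, "a.py"), Path(tmp, "b.py")
    a.write_text("x = 1\n")
    b.write_text("y = 2\n")

    # First run: the manifest is empty, so every file is embedded.
    reuse, reembed, manifest = plan_incremental_run([a, b], {})
    print(len(reuse), len(reembed))

    # Second run after editing b.py: only b.py needs new embeddings.
    b.write_text("y = 3\n")
    reuse, reembed, manifest = plan_incremental_run([a, b], manifest)
    print([p.name for p in reuse], [p.name for p in reembed])
```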

search(query, scope="all", top_k=10) -> List[SearchResult]

results = sdk.search("error handling in agent tools", top_k=5)
for r in results:
    print(f"[{r.score:.3f}] {r.chunk.file_path}:{r.chunk.start_line} "
          f"— {r.chunk.symbol_name or '?'}")
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | | Natural language or code query |
| scope | str | "all" | "all" or "code" (reserved for future filters) |
| top_k | int | 10 | Number of results |
L2 distance is converted to a similarity score in [0, 1] via 1 / (1 + dist): a distance of 0 gives a score of 1, and the score approaches 0 as the distance grows.
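As a quick worked example of the formula, the helper below applies 1 / (1 + dist) to a few distances. The function name l2_to_similarity is hypothetical; only the formula comes from the text above.

```python
def l2_to_similarity(dist: float) -> float:
    """Map a non-negative distance to a similarity score via 1 / (1 + dist)."""
    if dist < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + dist)


for dist in (0.0, 1.0, 3.0):
    print(f"dist={dist} -> score={l2_to_similarity(dist)}")
# dist=0.0 -> score=1.0
# dist=1.0 -> score=0.5
# dist=3.0 -> score=0.25
```

Because the mapping is monotonically decreasing, ranking by descending score is the same as ranking by ascending distance.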

get_status() -> Dict

status = sdk.get_status()
# {
#   "indexed": True,
#   "repo_path": "/abs/path",
#   "embedding_model": "nomic-embed-text-v2-moe-GGUF",
#   "total_chunks": 1234,
#   "code_chunks": 1234,
#   "files_tracked": 312,
#   "created_at": 1714000000.0,
#   "cache_path": "/Users/me/.gaia/code_index/<hash>"
# }

clear_index()

Removes the FAISS index and metadata for this repo from the cache.
sdk.clear_index()

Data types

CodeChunk

| Field | Type | Description |
| --- | --- | --- |
| content | str | Source text |
| file_path | str | Path relative to repo root |
| language | str | Detected language |
| start_line | int | First line (1-indexed) |
| end_line | int | Last line |
| symbol_name | Optional[str] | Function / class name |
| symbol_type | Optional[str] | "function", "class", … |
| docstring | Optional[str] | Extracted docstring |
| imports | List[str] | Imports parsed from the file |

SearchResult

| Field | Type | Description |
| --- | --- | --- |
| chunk | CodeChunk | The matched chunk |
| score | float | Similarity score in [0, 1] (higher is better) |
| result_type | str | Currently always "code" |

IndexResult

| Field | Type | Description |
| --- | --- | --- |
| files_indexed | int | Files newly parsed in this run |
| chunks_created | int | Total chunks in the resulting index (reused + new) |
| duration_seconds | float | Wall-clock time for the run |
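Note that the two counters measure different things: files_indexed counts only files parsed in this run, while chunks_created is the size of the whole index. The sketch below uses a stand-in dataclass (IndexResultSketch, not the real IndexResult) to show how a caller might report a re-run. That an unchanged re-run reports files_indexed as 0 is an inference from the field descriptions, not a documented guarantee.

```python
from dataclasses import dataclass


@dataclass
class IndexResultSketch:
    """Stand-in mirroring the documented IndexResult fields."""
    files_indexed: int
    chunks_created: int
    duration_seconds: float


def describe_run(result: IndexResultSketch) -> str:
    """Summarise a run without confusing new work with index size."""
    if result.files_indexed == 0:
        return f"nothing re-parsed, {result.chunks_created} chunks in index"
    return f"{result.files_indexed} files parsed, {result.chunks_created} chunks in index"


print(describe_run(IndexResultSketch(312, 1234, 41.7)))
print(describe_run(IndexResultSketch(0, 1234, 0.4)))
```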

Parsers

gaia.code_index.parsers provides language detection and symbol-aware chunking.
from gaia.code_index.parsers import (
    detect_language,
    is_binary_file,
    parse_python_file,
    parse_generic_file,
    chunk_code_file,
)

# Detect language from filename
lang = detect_language("agent.py")  # "python"

# Parse a Python file with the AST
chunks = parse_python_file("agent.py", source_code)

# Parse JS/TS/Go/Rust/Java/C/C++ via regex
chunks = parse_generic_file("agent.ts", source_code, "typescript")

# High-level dispatcher
chunks = chunk_code_file("agent.go", source_code, max_size_mb=1.0)
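To give a feel for what AST-based, symbol-aware chunking involves, here is a minimal standard-library sketch that extracts top-level functions and classes with their line ranges and docstrings. It is not the implementation of parse_python_file; the function python_symbol_chunks and its dict output are hypothetical, with keys chosen to echo the CodeChunk fields.

```python
import ast


def python_symbol_chunks(source: str):
    """Return one dict per top-level function or class in `source`."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbol_type = "function"
        elif isinstance(node, ast.ClassDef):
            symbol_type = "class"
        else:
            continue  # imports, assignments, etc. are skipped in this sketch
        chunks.append({
            "symbol_name": node.name,
            "symbol_type": symbol_type,
            "start_line": node.lineno,      # 1-indexed
            "end_line": node.end_lineno,    # available on Python 3.8+
            "docstring": ast.get_docstring(node),
            "content": "\n".join(lines[node.lineno - 1:node.end_lineno]),
        })
    return chunks


SAMPLE = '''import os

def greet(name):
    """Return a greeting."""
    return f"Hello, {name}"

class Agent:
    """A tiny agent."""

    def run(self):
        return greet("world")
'''

for chunk in python_symbol_chunks(SAMPLE):
    print(chunk["symbol_type"], chunk["symbol_name"],
          chunk["start_line"], chunk["end_line"])
```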

Agent integration: CodeIndexToolsMixin

The mixin lives at gaia.agents.code_index.tools.mixin.CodeIndexToolsMixin. It exposes four @tool-decorated methods through register_code_index_tools():
| Tool | Description |
| --- | --- |
| index_codebase(repo_path="") | Build / refresh the FAISS index |
| search_code_index(query, scope="all", top_k=10) | Semantic search |
| get_index_status() | Report indexed state, chunk counts, embedding model, cache path |
| clear_code_index() | Remove the cached index |
CodeAgent already composes this mixin, so gaia-code index chat and any CodeAgent instance get the tools for free.

Composing onto a custom Python agent

from gaia.agents.base.agent import Agent
from gaia.agents.code_index.tools.mixin import CodeIndexToolsMixin

class MyResearchAgent(CodeIndexToolsMixin, Agent):
    def __init__(self, repo_path: str = "."):
        super().__init__()
        self._init_code_index_state(repo_path=repo_path)

    def _register_tools(self):
        super()._register_tools()
        self.register_code_index_tools()
State is initialised lazily, so calling _init_code_index_state is optional, but it is recommended so that the agent honours the repo path you intended. CodeIndexToolsMixin is also registered in KNOWN_TOOLS["code_index"] (src/gaia/agents/registry.py) for dynamic composition when scaffolding agents; see the custom-agent guide.
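The composition pattern can be tried without GAIA installed using stand-in classes. Everything in the sketch below is hypothetical (BaseAgentSketch, IndexToolsMixinSketch, and their methods); in particular, it assumes the base class calls _register_tools from its constructor, which may not match the real Agent. It shows why lazily initialised state is convenient: under that assumption, tools are registered before the subclass has set its repo path.

```python
class BaseAgentSketch:
    """Stand-in base agent; assumed to register tools during construction."""

    def __init__(self):
        self.tools = {}
        self._register_tools()

    def _register_tools(self):
        self.tools["base_tool"] = lambda: "base"


class IndexToolsMixinSketch:
    """Stand-in mixin; state is created on first use if not set explicitly."""

    def _init_index_state(self, repo_path="."):
        self._index_repo_path = repo_path

    def _index_state(self):
        if not hasattr(self, "_index_repo_path"):
            self._init_index_state()  # lazy fallback to the default path
        return self._index_repo_path

    def register_index_tools(self):
        self.tools["get_index_status"] = lambda: {"repo_path": self._index_state()}


class MyAgentSketch(IndexToolsMixinSketch, BaseAgentSketch):
    def __init__(self, repo_path="."):
        super().__init__()  # runs BaseAgentSketch.__init__ via the MRO
        self._init_index_state(repo_path=repo_path)

    def _register_tools(self):
        super()._register_tools()  # base tools first
        self.register_index_tools()  # then the mixin's tools


agent = MyAgentSketch(repo_path="/work/repo")
print(sorted(agent.tools))
print(agent.tools["get_index_status"]())
```

Listing the mixin before the base class keeps the mixin's methods ahead of the base class in the method resolution order, while super() calls still reach the base implementation.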

Privacy and safety

  • All embeddings are generated locally by Lemonade Server.
  • Sensitive filenames are skipped during discovery: .env, .env.*, SSH private keys, *.pem, *.key, *.pfx, *.p12, *.jks, *.keystore.
  • The SDK uses gaia.security.PathValidator to keep file reads scoped to the configured repo_path, and the index_codebase tool refuses any repo_path that isn’t a sub-path of the agent’s original root.
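The two file-level safeguards above can be approximated with the standard library, as a rough illustration. This sketch is not gaia.security.PathValidator and not the SDK's discovery code; is_sensitive and is_within are hypothetical helpers. The pattern list copies the globs named above and omits SSH private key names, since the source does not list them.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Globs taken from the list above; SSH key filenames are not specified
# in the source, so they are left out of this sketch.
SENSITIVE_PATTERNS = (
    ".env", ".env.*", "*.pem", "*.key", "*.pfx", "*.p12", "*.jks", "*.keystore",
)


def is_sensitive(path) -> bool:
    """True if the file name matches one of the sensitive patterns."""
    name = Path(path).name.lower()
    return any(fnmatchcase(name, pattern) for pattern in SENSITIVE_PATTERNS)


def is_within(root, candidate) -> bool:
    """True if `candidate` resolves to `root` itself or a path beneath it."""
    root_resolved = Path(root).resolve()
    candidate_resolved = (root_resolved / candidate).resolve()
    return (candidate_resolved == root_resolved
            or root_resolved in candidate_resolved.parents)


for name in (".env", ".env.local", "certs/server.pem", "src/agent.py"):
    print(name, is_sensitive(name))

for candidate in ("src/agent.py", "../other", "/etc/passwd"):
    print(candidate, is_within("/tmp/repo", candidate))
```

Resolving both paths before comparing them means that ".." segments and symlinks cannot be used to step outside the root.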

Roadmap

  • #869 — MCP server wrapper exposing the index to external code assistants.
  • #870 — Multi-repo indexing in a single namespace.
  • #871 — Verilog / VHDL parsers.