Import: from gaia.vlm import StructuredVLMExtractor (structured extraction) or from gaia.llm import VLMClient (raw VLM)
Detailed Spec: spec/vlm-client
Setup: gaia init --profile vlm installs Lemonade Server and downloads the Qwen3-VL-4B vision model (~3 GB).
Purpose: Extract structured data (tables, charts, key-value pairs, timelines) from images and documents using vision-language models.
The StructuredVLMExtractor is the high-level API for extracting structured data from images and documents. It handles prompt engineering, JSON parsing, and format conversion automatically.
from gaia.vlm import StructuredVLMExtractor
extractor = StructuredVLMExtractor()
Pull specific fields from an image (invoices, forms, receipts):
from pathlib import Path
image_bytes = Path("invoice.png").read_bytes()
fields = extractor.extract_key_values(
    image_bytes,
    keys=["invoice_number", "date", "total", "vendor_name"]
)
# {"invoice_number": "INV-12345", "date": "2024-01-15", "total": 150.00, "vendor_name": "Acme Corp"}
Pull tabular data as a list of row dictionaries:
image_bytes = Path("spreadsheet.png").read_bytes()
rows = extractor.extract_table(image_bytes, table_description="a sales report table")
# [
# {"product": "Widget A", "units": "150", "revenue": "$7,500"},
# {"product": "Widget B", "units": "230", "revenue": "$11,500"}
# ]
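Cell values come back as strings in the sample output above, so numeric columns usually need a post-processing step. The following is a minimal sketch, assuming cells shaped like "150" and "$7,500"; `to_number` and `normalize_rows` are hypothetical helpers for illustration, not part of gaia.

```python
import re


def to_number(cell: str):
    """Return a float for numeric-looking cells, otherwise the original string."""
    # Strip currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[$,\s]", "", cell)
    try:
        return float(cleaned)
    except ValueError:
        return cell


def normalize_rows(rows, numeric_columns):
    """Convert only the named columns; leave every other cell untouched."""
    return [
        {key: to_number(value) if key in numeric_columns else value
         for key, value in row.items()}
        for row in rows
    ]


# Sample rows copied from the extract_table output shown above.
rows = [
    {"product": "Widget A", "units": "150", "revenue": "$7,500"},
    {"product": "Widget B", "units": "230", "revenue": "$11,500"},
]
clean = normalize_rows(rows, numeric_columns={"units", "revenue"})
```

Cells that do not parse as numbers are returned unchanged, so a stray "n/a" is preserved for later inspection rather than silently dropped.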
Pull values from bar charts, pie charts, timelines, or any visual data representation. The value_format parameter controls how values are returned:
# Bar chart with numeric values
sales = extractor.extract_chart_data(
    image_bytes,
    categories=["Q1", "Q2", "Q3", "Q4"],
    value_format="number"
)
# {"Q1": 150000, "Q2": 175000, "Q3": 200000, "Q4": 225000}
# Pie chart with percentages
shares = extractor.extract_chart_data(
    image_bytes,
    categories=["Product A", "Product B", "Product C"],
    value_format="percentage"
)
# {"Product A": 45.5, "Product B": 32.0, "Product C": 22.5}
# Timeline with time durations (HH:MM:SS converted to decimal hours)
hours = extractor.extract_chart_data(
    image_bytes,
    categories=["Active", "Idle", "Offline", "Maintenance"],
    value_format="time_hms_decimal"
)
# {"Active": 14.777, "Idle": 0.022, "Offline": 7.654, "Maintenance": 1.547}
Supported formats:
| value_format | Use Case | VLM Extracts | Python Converts | Output |
|---|---|---|---|---|
| "auto" | Unknown charts | As-is | No conversion | Mixed |
| "number" | Bar/line charts | Numbers | No conversion | Number |
| "percentage" | Pie charts | Percentages | No conversion | Float |
| "time_hms" | Time charts | "14:46:38" | No conversion | String |
| "time_hms_decimal" | Time charts | "14:46:38" | HH:MM:SS to decimal | Float |
Define a typed schema for complex documents:
schema = {
    "fields": {
        "driver_name": {"type": "string", "description": "driver full name", "required": True},
        "total_hours": {"type": "number", "description": "total duty hours"},
        "record_date": {"type": "date", "description": "log date"}
    }
}
data = extractor.extract_structured(image_bytes, schema)
# {"driver_name": "HOLMAN, UNDRA D", "total_hours": 24.0, "record_date": "2024-08-05"}
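Because the result is a plain dictionary, it can be checked against the same schema before it is passed downstream. The sketch below is one possible approach, not behavior the extractor is documented to provide: `validate_extraction` and `TYPE_CHECKS` are hypothetical names, and it assumes dates arrive as ISO `YYYY-MM-DD` strings, as in the sample output.

```python
from datetime import date


def _is_iso_date(value) -> bool:
    """True if value is a string in ISO YYYY-MM-DD form (assumed date format)."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# Assumed mapping from the schema's type names to Python-side checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": _is_iso_date,
}


def validate_extraction(data: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the data matches the schema."""
    problems = []
    for name, spec in schema["fields"].items():
        if data.get(name) is None:
            # Only fields marked required are reported when absent.
            if spec.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        check = TYPE_CHECKS.get(spec["type"])
        if check is not None and not check(data[name]):
            problems.append(f"{name}: expected {spec['type']}, got {data[name]!r}")
    return problems


schema = {
    "fields": {
        "driver_name": {"type": "string", "description": "driver full name", "required": True},
        "total_hours": {"type": "number", "description": "total duty hours"},
        "record_date": {"type": "date", "description": "log date"},
    }
}
data = {"driver_name": "HOLMAN, UNDRA D", "total_hours": 24.0, "record_date": "2024-08-05"}
problems = validate_extraction(data, schema)
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that went wrong on a page in a single pass.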
Process Full Documents
Extract from multi-page PDFs with a single call:
result = extractor.extract(
    "driver-logs.pdf",
    pages="all",
    extract_tables=True,
    extract_timelines=True,
    timeline_status_types=["OFF", "SB", "D", "ON"],
    on_progress=lambda current, total: print(f"Page {current}/{total}")
)
# Access structured data
for page in result["pages"]:
    print(f"Page {page['page']}: {page.get('timeline', {})}")
# Aggregated totals across all pages
print(result["aggregated_data"]["timeline_totals"])
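To make the aggregation step concrete, here is a rough sketch of how per-page timelines could be summed into totals. It assumes each page's `timeline` is a dictionary mapping a status code to decimal hours, which the example above does not confirm; `sum_timelines` is a hypothetical helper and not the extractor's own implementation.

```python
from collections import defaultdict


def sum_timelines(pages):
    """Sum hours per status across pages; pages without a timeline are skipped."""
    totals = defaultdict(float)
    for page in pages:
        for status, hours in page.get("timeline", {}).items():
            totals[status] += hours
    return dict(totals)


# Made-up page data in the assumed shape, for illustration only.
pages = [
    {"page": 1, "timeline": {"OFF": 10.0, "D": 8.5, "ON": 5.5}},
    {"page": 2, "timeline": {"OFF": 12.0, "SB": 2.0, "D": 10.0}},
    {"page": 3},  # a page with no timeline chart
]
totals = sum_timelines(pages)
```

Doing the summation in Python rather than asking the model for totals keeps the arithmetic out of the VLM.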
A key design principle: VLMs read text accurately but do math poorly. When extracting numeric data from charts, the extractor uses a two-step approach:
- VLM extracts strings: the model is asked to read values as text (e.g., "14:46:38")
- Python converts: deterministic code does the math (e.g., 14 + 46/60 + 38/3600 ≈ 14.777)
This achieves 100% accuracy on numeric conversions, compared to ~40% when asking the VLM to convert directly. The value_format parameter controls whether conversion happens.
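The Python half of the two-step approach can be sketched as a small pure function. This is an illustrative version under the assumption that the VLM returns a well-formed `HH:MM:SS` string; `hms_to_decimal_hours` is a hypothetical name, not necessarily the function the extractor uses internally.

```python
def hms_to_decimal_hours(text: str) -> float:
    """Convert an 'HH:MM:SS' string to decimal hours."""
    # Unpacking raises ValueError if there are not exactly three integer parts.
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"not a valid HH:MM:SS duration: {text!r}")
    return hours + minutes / 60 + seconds / 3600


value = hms_to_decimal_hours("14:46:38")
print(round(value, 3))  # 14.777
```

Rejecting malformed strings with an exception, instead of guessing, makes it visible when the model misread a value.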
Low-Level VLMClient
For custom prompts or direct VLM access, use VLMClient:
from gaia.llm import VLMClient
from pathlib import Path
vlm = VLMClient()
if vlm.check_availability():
    image_bytes = Path("document.png").read_bytes()
    # Raw text extraction (default OCR prompt)
    text = vlm.extract_from_image(image_bytes)
    # Custom prompt
    result = vlm.extract_from_image(
        image_bytes,
        prompt="Describe all charts and diagrams in this image."
    )
StructuredVLMExtractor uses VLMClient internally. Use VLMClient directly only when you need full control over the prompt.
VLM in an Agent
from gaia.agents.base.agent import Agent
from gaia.agents.base.tools import tool
from gaia.vlm import StructuredVLMExtractor
from pathlib import Path
class InvoiceAgent(Agent):
    """Agent that processes scanned invoices."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.extractor = StructuredVLMExtractor()
    def _get_system_prompt(self) -> str:
        return "You process scanned invoices and extract billing data."
    def _register_tools(self):
        agent = self
        @tool
        def extract_invoice(image_path: str) -> dict:
            """Extract invoice fields from a scanned image."""
            path = Path(image_path)
            if not path.exists():
                return {"error": f"Image not found: {image_path}"}
            return agent.extractor.extract_key_values(
                path.read_bytes(),
                keys=["invoice_number", "date", "total", "vendor", "line_items"]
            )
        @tool
        def extract_table(image_path: str) -> dict:
            """Extract line items table from an invoice image."""
            path = Path(image_path)
            if not path.exists():
                return {"error": f"Image not found: {image_path}"}
            rows = agent.extractor.extract_table(
                path.read_bytes(),
                table_description="an invoice line items table"
            )
            return {"line_items": rows, "count": len(rows)}