parsr.

Integration · Chroma

The fastest local-dev vector DB — Chroma + parsr chunks for prototypes and edge deployments.

Chroma (~14K GitHub stars) is the most widely used vector DB for prototyping: a single pip install, no daemon, an embedded SQLite-backed store by default, and a path to a single-node Chroma Cloud deployment for production. The wedge for parsr customers is iteration speed — a fintech team can stand up RAG over a parsr-extracted dataset in under 50 lines, with no infrastructure ticket. Chroma's collection metadata filters mirror parsr's chunk metadata exactly, so `chunk.metadata` JSON hands directly to `collection.add(..., metadatas=[...])`. Production-scale users move on to Chroma Cloud or pgvector; Chroma is the right starting point.
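The hand-off is a plain dict-to-parallel-lists mapping. A minimal sketch — the `Chunk` dataclass here is a stand-in for the parsr SDK chunk object, not the real class:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a parsr chunk (assumption: .id, .text, .metadata attributes)."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def to_chroma_args(chunks: list[Chunk]) -> dict:
    """Split chunks into the parallel lists that collection.add() expects."""
    return {
        "ids": [c.id for c in chunks],
        "documents": [c.text for c in chunks],
        "metadatas": [dict(c.metadata) for c in chunks],
    }


chunks = [Chunk("c1", "Total due: EUR 1,200", {"doc_type": "invoice", "section": "totals"})]
print(to_chroma_args(chunks))
```

The three lists stay index-aligned, which is all `collection.add(**to_chroma_args(chunks))` needs.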

Install

One command

pip install parsr-sdk chromadb openai

Code

Working sample

Chroma integration (python)
from parsr_sdk import AsyncParsr
import chromadb

parsr = AsyncParsr(api_key="sk_eu_live_...")
client = chromadb.PersistentClient(path="./.chroma")

What you get

Highlights

  • Zero-ops local dev — pip install, no daemon
  • Chroma metadata filters map 1:1 onto parsr chunk metadata
  • Persistent client backed by SQLite — your dev DB survives restarts
  • Production migration path to Chroma Cloud (or pgvector) with a one-line client swap
  • Built-in OpenAI / Cohere / Voyage embedding functions — no separate embedder wiring

Architecture

How the pieces fit

A single Chroma collection (or one per doc_type). PersistentClient writes to local SQLite for development; HttpClient targets Chroma Cloud for production. `parsr.parse(*, include_chunks=True)` → `collection.add(ids, documents, metadatas, embeddings)`. Embeddings can be supplied explicitly (the OpenAI route) or computed by Chroma's `embedding_function`.
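One wrinkle in the metadata step: Chroma accepts only scalar metadata values (str, int, float, bool) — no `None`, no lists. A small sanitizer sketch (the helper name is our own, not part of either SDK):

```python
def chroma_safe_metadata(raw: dict) -> dict:
    """Keep only scalar values Chroma accepts; drop None, stringify the rest."""
    out = {}
    for key, value in raw.items():
        if value is None:
            continue  # Chroma rejects None-valued metadata
        if isinstance(value, (str, int, float, bool)):
            out[key] = value
        else:
            out[key] = str(value)  # e.g. a list of page numbers -> "[1, 2]"
    return out


print(chroma_safe_metadata({"section": "totals", "page_numbers": [1, 2], "reviewer": None}))
```

Run chunk metadata through this before `collection.add` and malformed-metadata errors disappear.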

Quickstart

End-to-end example

Parse a document with `include_chunks=True`, embed each chunk, upsert into Chroma, query.

parsr → embed → Chroma → query (python)
import asyncio
import os

from parsr_sdk import AsyncParsr
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

parsr = AsyncParsr(api_key=os.environ["PARSR_API_KEY"])
client = chromadb.PersistentClient(path="./.chroma")

embedder = OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
collection = client.get_or_create_collection(
    name="parsr_invoices",
    embedding_function=embedder,
    metadata={"hnsw:space": "cosine"},
)

async def main():
    # 1. Parse + chunks.
    result = await parsr.parse_invoice(
        document_url="https://files.example.com/invoice.pdf",
        include_chunks=True,
        chunking={"strategy": "block"},
    )

    # 2. Insert. Chroma calls the embedder for us. Metadata values must be
    #    str/int/float/bool — no None — so default the first page number to 0.
    collection.add(
        ids=[c.id for c in result.chunks],
        documents=[c.text for c in result.chunks],
        metadatas=[
            {
                "doc_type": c.metadata.get("doc_type", "invoice"),
                "org_id": "org_acme",
                "section": c.metadata.get("section", ""),
                "page_numbers_first": (c.page_numbers or [0])[0],
            }
            for c in result.chunks
        ],
    )

    # 3. Filtered semantic query.
    hits = collection.query(
        query_texts=["Largest line item"],
        where={"$and": [
            {"org_id": {"$eq": "org_acme"}},
            {"doc_type": {"$eq": "invoice"}},
        ]},
        n_results=3,
    )
    for doc, meta, dist in zip(hits["documents"][0], hits["metadatas"][0], hits["distances"][0]):
        print(dist, meta["section"], doc[:80])

asyncio.run(main())

Cost

What you'll actually pay

Chroma local is free — your only cost is the embedding provider (~€0.02 per 1K chunks at OpenAI text-embedding-3-small). Chroma Cloud Starter is $0/mo up to ~3M vectors as of writing; production tiers start around $30/mo. parsr cost is unchanged. The combined dev-time pipeline costs <€5 for a 10K-chunk demo on a developer's laptop.
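The arithmetic behind those figures, assuming roughly 1,000 tokens per chunk at the quoted $0.02 per 1M tokens — both assumptions; measure your own chunk sizes:

```python
def embedding_cost_usd(
    n_chunks: int,
    tokens_per_chunk: int = 1_000,       # assumption: average chunk size
    usd_per_million_tokens: float = 0.02,  # quoted text-embedding-3-small rate
) -> float:
    """Back-of-envelope embedding spend for a batch of chunks."""
    return n_chunks * tokens_per_chunk / 1_000_000 * usd_per_million_tokens


print(embedding_cost_usd(1_000))   # per 1K chunks
print(embedding_cost_usd(10_000))  # the 10K-chunk demo
```

Even the 10K-chunk demo lands at cents of embedding spend; parsr parsing dominates the bill.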

Performance

Tuning tips

  • Switch from PersistentClient to HttpClient (Chroma Cloud or self-hosted) once your collection passes ~500K vectors — local SQLite gets slow
  • Use the OpenAI embedding function with text-embedding-3-small — it batches calls automatically (saves round-trip latency)
  • Split into one collection per doc_type rather than a single mixed collection at scale — querying a smaller dedicated collection is cheaper than `where`-filtering a large mixed one
  • Don't store full chunk text in metadata — Chroma already stores `documents`; duplicating bloats the collection and slows queries

A few lines and your parsr chunks are queryable in Chroma.

Start building