parsr.

Integration · Weaviate

RAG over parsr-extracted documents with Weaviate's hybrid search and managed embeddings.

Weaviate (~14K GitHub stars) is the leading open-source vector DB with a managed cloud offering and strong hybrid search story (BM25 + vector). Customers running on Weaviate Cloud get parsr's structured extraction plus Weaviate's `text2vec-openai` / `text2vec-cohere` modules — the embedding step happens server-side, eliminating the client-side embedding round-trip. The other angle: Weaviate's class-based schema maps cleanly onto parsr's doc_type — one Weaviate class per parsr specialist (`BankStatementChunk`, `InvoiceChunk`, etc.). Hybrid search shines on financial-document RAG because customers query both for semantic content ("the rent payment last month") and exact tokens (an IBAN, an invoice number) in the same call.

Install

One command

pip install parsr-sdk weaviate-client

Code

Working sample

Weaviate integrationcode
from parsr_sdk import AsyncParsr
import weaviate

parsr = AsyncParsr(api_key="sk_eu_live_...")
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="https://....weaviate.network",
    auth_credentials=weaviate.auth.AuthApiKey("..."),
)

What you get

Highlights

  • Hybrid search (BM25 + vector) — exact-token matches on IBANs and invoice numbers stay reliable
  • Server-side embeddings via text2vec-openai / text2vec-cohere modules — no client embedding round-trip
  • Multi-tenant collections give first-class isolation per customer org
  • Self-host or Weaviate Cloud, both EU-region-available
  • GraphQL query layer composes with parsr metadata filters

Architecture

How the pieces fit

Define one Weaviate collection per parsr doc_type (or one shared collection with a doc_type filter). parsr.parse(*, include_chunks=true) → for each chunk, batch.add_object(properties={text, doc_type, page_numbers, section}, vector=<auto if text2vec-openai is enabled>). At query time, hybrid search combines BM25 with vector similarity — the right model for financial RAG where customers mix semantic questions with exact-token lookups.

Quickstart

End-to-end example

Parse a document with `include_chunks=true`, embed each chunk, upsert into Weaviate, query.

parsr → embed → Weaviate → querypython
import os
from parsr_sdk import AsyncParsr
import weaviate
from weaviate.classes.config import Configure, Property, DataType

parsr = AsyncParsr(api_key=os.environ["PARSR_API_KEY"])
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

# 1. Define a collection. text2vec-openai handles embeddings server-side.
if not client.collections.exists("InvoiceChunk"):
    client.collections.create(
        name="InvoiceChunk",
        vectorizer_config=Configure.Vectorizer.text2vec_openai(
            model="text-embedding-3-small",
        ),
        properties=[
            Property(name="text", data_type=DataType.TEXT),
            Property(name="doc_type", data_type=DataType.TEXT),
            Property(name="page_numbers", data_type=DataType.INT_ARRAY),
            Property(name="section", data_type=DataType.TEXT),
            Property(name="org_id", data_type=DataType.TEXT),
        ],
    )
collection = client.collections.get("InvoiceChunk")

# 2. Parse + chunks.
result = await parsr.parse_invoice(
    document_url="https://files.example.com/invoice.pdf",
    include_chunks=True,
    chunking={"strategy": "block"},
)

# 3. Bulk insert. Weaviate embeds server-side; we just hand it text.
with collection.batch.dynamic() as batch:
    for c in result.chunks:
        batch.add_object(
            properties={
                "text": c.text,
                "doc_type": c.metadata.get("doc_type", "invoice"),
                "page_numbers": c.page_numbers,
                "section": c.metadata.get("section", ""),
                "org_id": "org_acme",
            },
            uuid=c.id,
        )

# 4. Hybrid search — both semantic and exact-token in one query.
hits = collection.query.hybrid(
    query="Largest line item on the Cloudflare invoice",
    alpha=0.6,  # 0.0 = pure BM25, 1.0 = pure vector
    limit=3,
    filters=weaviate.classes.query.Filter.by_property("org_id").equal("org_acme"),
)
for h in hits.objects:
    print(h.properties["section"], "page", h.properties["page_numbers"])

Cost

What you'll actually pay

Weaviate Cloud Standard tier starts ~$25/mo for 1M-vector capacity; Serverless adds per-query billing on top. Self-hosting on a CAX21 box is ~€7/mo for the same 1M-vector capacity if you can run the operator yourself. parsr ingestion cost is identical to other vector DB integrations (€99 Growth + per-page overage). The differentiator is hybrid search — Weaviate is one of the cheapest paths to production hybrid retrieval; Pinecone's Inference equivalent is more expensive at scale.

Performance

Tuning tips

  • Use one collection per parsr doc_type to keep BM25 indices small + relevant
  • text2vec-openai with text-embedding-3-small is the right starting point; switch to a self-hosted text2vec-transformers if EU-only embedding is required
  • Set alpha around 0.5–0.7 for finance docs — pure semantic underweights exact matches on IBANs and amounts
  • Use multi-tenancy collections rather than org_id property filters once you have >100 customer tenants — it's faster and simpler

Three lines and you're calling parsr from Weaviate.

Start building