Integration · Weaviate
RAG over parsr-extracted documents with Weaviate's hybrid search and managed embeddings.
Weaviate (~14K GitHub stars) is the leading open-source vector DB with a managed cloud offering and strong hybrid search story (BM25 + vector). Customers running on Weaviate Cloud get parsr's structured extraction plus Weaviate's `text2vec-openai` / `text2vec-cohere` modules — the embedding step happens server-side, eliminating the client-side embedding round-trip. The other angle: Weaviate's class-based schema maps cleanly onto parsr's doc_type — one Weaviate class per parsr specialist (`BankStatementChunk`, `InvoiceChunk`, etc.). Hybrid search shines on financial-document RAG because customers query both for semantic content ("the rent payment last month") and exact tokens (an IBAN, an invoice number) in the same call.
Install
One command
pip install parsr-sdk weaviate-client
Working sample
from parsr_sdk import AsyncParsr
import weaviate
parsr = AsyncParsr(api_key="sk_eu_live_...")
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="https://....weaviate.network",
    auth_credentials=weaviate.auth.AuthApiKey("..."),
)
What you get
Highlights
- Hybrid search (BM25 + vector) — exact-token matches on IBANs and invoice numbers stay reliable
- Server-side embeddings via text2vec-openai / text2vec-cohere modules — no client embedding round-trip
- Multi-tenant collections give first-class isolation per customer org
- Self-host or Weaviate Cloud, both EU-region-available
- GraphQL query layer composes with parsr metadata filters
Architecture
How the pieces fit
Define one Weaviate collection per parsr doc_type (or one shared collection with a doc_type filter). parsr.parse(*, include_chunks=true) → for each chunk, batch.add_object(properties={text, doc_type, page_numbers, section}, vector=<auto if text2vec-openai is enabled>). At query time, hybrid search combines BM25 with vector similarity — the right model for financial RAG where customers mix semantic questions with exact-token lookups.
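The "one collection per parsr doc_type" layout above can be sketched as a small naming helper. `collection_name_for` is a hypothetical function (not part of either SDK), and the doc_type strings are assumptions — substitute whatever parsr actually emits:

```python
def collection_name_for(doc_type: str) -> str:
    """Map a parsr doc_type to a per-type Weaviate collection name.

    e.g. "bank_statement" -> "BankStatementChunk", "invoice" -> "InvoiceChunk"
    """
    camel = "".join(part.capitalize() for part in doc_type.split("_"))
    return f"{camel}Chunk"
```

A shared-collection setup would skip this and filter on a `doc_type` property instead; the per-type layout trades a little bookkeeping for smaller, more relevant BM25 indices.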
Quickstart
End-to-end example
Parse a document with `include_chunks=True`, upsert the chunks into Weaviate (which embeds them server-side via text2vec-openai), then query.
import os
from parsr_sdk import AsyncParsr
import weaviate
from weaviate.classes.config import Configure, Property, DataType
parsr = AsyncParsr(api_key=os.environ["PARSR_API_KEY"])
client = weaviate.connect_to_weaviate_cloud(
cluster_url=os.environ["WEAVIATE_URL"],
auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)
# 1. Define a collection. text2vec-openai handles embeddings server-side.
if not client.collections.exists("InvoiceChunk"):
    client.collections.create(
        name="InvoiceChunk",
        vectorizer_config=Configure.Vectorizer.text2vec_openai(
            model="text-embedding-3-small",
        ),
        properties=[
            Property(name="text", data_type=DataType.TEXT),
            Property(name="doc_type", data_type=DataType.TEXT),
            Property(name="page_numbers", data_type=DataType.INT_ARRAY),
            Property(name="section", data_type=DataType.TEXT),
            Property(name="org_id", data_type=DataType.TEXT),
        ],
    )
collection = client.collections.get("InvoiceChunk")
# 2. Parse + chunks. (Run inside an async function — parsr's client is async.)
result = await parsr.parse_invoice(
    document_url="https://files.example.com/invoice.pdf",
    include_chunks=True,
    chunking={"strategy": "block"},
)
# 3. Bulk insert. Weaviate embeds server-side; we just hand it text.
with collection.batch.dynamic() as batch:
    for c in result.chunks:
        batch.add_object(
            properties={
                "text": c.text,
                "doc_type": c.metadata.get("doc_type", "invoice"),
                "page_numbers": c.page_numbers,
                "section": c.metadata.get("section", ""),
                "org_id": "org_acme",
            },
            uuid=c.id,
        )
# 4. Hybrid search — both semantic and exact-token in one query.
hits = collection.query.hybrid(
    query="Largest line item on the Cloudflare invoice",
    alpha=0.6,  # 0.0 = pure BM25, 1.0 = pure vector
    limit=3,
    filters=weaviate.classes.query.Filter.by_property("org_id").equal("org_acme"),
)
for h in hits.objects:
    print(h.properties["section"], "page", h.properties["page_numbers"])
Cost
What you'll actually pay
Weaviate Cloud's Standard tier starts at ~$25/mo for 1M-vector capacity; Serverless adds per-query billing on top. Self-hosting on a CAX21 box is ~€7/mo for the same 1M-vector capacity if you can run the operator yourself. parsr ingestion cost is identical to other vector DB integrations (€99 Growth + per-page overage). The differentiator is hybrid search — Weaviate is one of the cheapest paths to production hybrid retrieval; Pinecone's Inference equivalent is more expensive at scale.
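A back-of-envelope comparison of the two hosting paths, using the ballpark figures above — the exchange rate and like-for-like capacity are assumptions, not quotes:

```python
EUR_PER_USD = 0.92            # assumed rate, for a rough like-for-like comparison
cloud_usd_per_month = 25.0    # Weaviate Cloud Standard, ~1M-vector capacity
selfhost_eur_per_month = 7.0  # CAX21-class box, same ~1M-vector capacity

cloud_eur = cloud_usd_per_month * EUR_PER_USD
savings_eur = cloud_eur - selfhost_eur_per_month
print(f"self-hosting saves ~€{savings_eur:.0f}/mo per 1M vectors")
```

At small scale the gap is noise; it matters once you run many 1M-vector tenants — and the self-host figure excludes the ops time of running the operator yourself.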
Performance
Tuning tips
- Use one collection per parsr doc_type to keep BM25 indices small + relevant
- text2vec-openai with text-embedding-3-small is the right starting point; switch to a self-hosted text2vec-transformers if EU-only embedding is required
- Set alpha around 0.5–0.7 for finance docs — pure semantic underweights exact matches on IBANs and amounts
- Use multi-tenant collections rather than org_id property filters once you have >100 customer tenants — it's faster and simpler
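The alpha tip above can be automated: weight BM25 more heavily when the query contains an exact token like an IBAN or an invoice number. A minimal sketch — the regexes, the `pick_alpha` helper, and the 0.3/0.6 cutoffs are illustrative assumptions, not part of either SDK:

```python
import re

# Rough shapes of exact tokens common in financial documents.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")
INVOICE_NO_RE = re.compile(r"\b(?:INV|RE|NR)[-/]?\d{3,}\b", re.IGNORECASE)

def pick_alpha(query: str) -> float:
    """Return a hybrid-search alpha (0.0 = pure BM25, 1.0 = pure vector)."""
    if IBAN_RE.search(query) or INVOICE_NO_RE.search(query):
        return 0.3  # exact-token lookup: lean on BM25
    return 0.6  # semantic question: lean on the vector side
```

The returned value plugs straight into `collection.query.hybrid(query=..., alpha=pick_alpha(query))`.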
A handful of lines and you're running hybrid retrieval over parsr output in Weaviate.
Start building