Integration · Elasticsearch
Hybrid search over parsr-extracted finance docs with Elasticsearch's BM25 + dense vector RRF.
Elasticsearch (~67K GitHub stars) added first-class dense_vector and HNSW indexing in 8.x, and its RRF (reciprocal rank fusion) support runs hybrid retrieval as a single search call; the `retriever` syntax used in the quickstart requires 8.14 or later. For fintechs already operating Elasticsearch for log aggregation or full-text search, adding parsr-extracted finance documents alongside is a one-mapping change. The killer use case: an EU compliance team auditing transactions can run a single hybrid query ("all SEPA transfers over €10K from the last quarter") against parsr's structured fields plus the raw text, with an exact filter on the structured side and semantic recall on the chunks. Elastic Cloud's EU regions (Frankfurt, Amsterdam, Dublin) keep residency end-to-end with parsr EU.
Install
One command
pip install parsr-sdk "elasticsearch[async]" openai
Code
Working sample
from parsr_sdk import AsyncParsr
from elasticsearch import AsyncElasticsearch
parsr = AsyncParsr(api_key="sk_eu_live_...")
es = AsyncElasticsearch("https://....cloud.es.io:9243", api_key="...")
What you get
Highlights
- RRF hybrid search (BM25 + vector) in a single call: exact terms such as invoice numbers match lexically, while paraphrased compliance and finance questions match semantically
- kNN clauses accept a `filter` on structured fields (date ranges, amounts, doc_type) that is applied during the vector search, so k matching hits still come back
- Single cluster handles logs + RAG — no second datastore for ops to manage
- Elastic Cloud Frankfurt/Amsterdam/Dublin pairs with parsr EU for end-to-end residency
- Aggregations work natively on parsr-extracted fields — facet by month, currency, counterparty
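As a rough illustration of the last point, the request body below facets indexed chunks by month and currency. It is a sketch under assumptions: `issued_at` (a `date` field) and `currency` (a `keyword` field) are hypothetical names that are not part of the quickstart mapping and would have to be added to it first, and `build_facets` is an illustrative helper, not part of any SDK.

```python
def build_facets(org_id: str) -> dict:
    """Build a search body that counts documents per month and per currency.

    Assumes hypothetical `issued_at` (date) and `currency` (keyword) fields.
    """
    return {
        "size": 0,  # return aggregation buckets only, no hits
        "query": {"bool": {"filter": [{"term": {"org_id": org_id}}]}},
        "aggs": {
            "by_month": {
                "date_histogram": {
                    "field": "issued_at",
                    "calendar_interval": "month",
                },
                # Nested breakdown: currencies within each month bucket.
                "aggs": {
                    "by_currency": {"terms": {"field": "currency", "size": 10}}
                },
            }
        },
    }


body = build_facets("org_acme")
```

The body could then be sent with the client from the working sample, for example `await es.search(index="parsr-invoices", **body)`; each `by_month` bucket carries its own `by_currency` breakdown.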
Architecture
How the pieces fit
One Elasticsearch index per doc_type (or a single index with a doc_type keyword field for filtering). Each chunk becomes one document with `text` (analyzed for BM25), `embedding` (dense_vector with HNSW), and flat keyword metadata fields such as `doc_type`, `org_id`, and `section`. The flow: parse with `include_chunks=True` → embed each chunk's text → bulk index. Query via the `_search` API with an RRF retriever combining a BM25 `match` and a `knn` clause.
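To make the fusion step concrete, here is a minimal pure-Python sketch of reciprocal rank fusion as it is usually defined: each document scores the sum of 1 / (rank_constant + rank) over the result lists it appears in. It illustrates the idea and is not Elasticsearch's implementation; `rrf_fuse` and the chunk IDs are made up for the example, and 60 is used because it is the documented default for `rank_constant`.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], rank_constant: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes the most.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


bm25_hits = ["chunk-7", "chunk-2", "chunk-9"]  # lexical ranking
knn_hits = ["chunk-2", "chunk-4", "chunk-7"]   # vector ranking
fused = rrf_fuse([bm25_hits, knn_hits])
```

Because only ranks enter the sum, the BM25 and cosine scores never need to be normalized against each other. In this toy input, chunk-2 ends up first since it ranks well in both lists.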
Quickstart
End-to-end example
Parse a document with `include_chunks=True`, embed each chunk, bulk index the chunks into Elasticsearch, then run a hybrid query.
import asyncio
import os

from elasticsearch import AsyncElasticsearch
from elasticsearch.helpers import async_bulk
from openai import AsyncOpenAI
from parsr_sdk import AsyncParsr

INDEX = "parsr-invoices"
ORG_ID = "org_acme"
EMBED_MODEL = "text-embedding-3-small"  # returns 1536-dim vectors


async def main() -> None:
    parsr = AsyncParsr(api_key=os.environ["PARSR_API_KEY"])
    es = AsyncElasticsearch(os.environ["ES_URL"], api_key=os.environ["ES_API_KEY"])
    openai = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    # 1. Mapping (run once).
    if not await es.indices.exists(index=INDEX):
        await es.indices.create(
            index=INDEX,
            mappings={
                "properties": {
                    "text": {"type": "text"},  # analyzed for BM25
                    "embedding": {
                        "type": "dense_vector",
                        "dims": 1536,  # must match the embedding model's output size
                        "index": True,
                        "similarity": "cosine",
                    },
                    "doc_type": {"type": "keyword"},
                    "org_id": {"type": "keyword"},
                    "page_numbers": {"type": "integer"},
                    "section": {"type": "keyword"},
                }
            },
        )

    # 2. Parse with chunks.
    result = await parsr.parse_invoice(
        document_url="https://files.example.com/invoice.pdf",
        include_chunks=True,
        chunking={"strategy": "block"},
    )

    # 3. Embed + bulk index.
    texts = [c.text for c in result.chunks]
    embeds = await openai.embeddings.create(model=EMBED_MODEL, input=texts)
    docs = [
        {
            "_index": INDEX,
            "_id": c.id,
            "_source": {
                "text": c.text,
                "embedding": e.embedding,
                "doc_type": c.metadata.get("doc_type", "invoice"),
                "org_id": ORG_ID,
                "page_numbers": c.page_numbers,
                "section": c.metadata.get("section", ""),
            },
        }
        for c, e in zip(result.chunks, embeds.data)
    ]
    await async_bulk(es, docs)
    # Refresh so the new chunks are searchable by the query that follows.
    await es.indices.refresh(index=INDEX)

    # 4. Hybrid query: RRF combines BM25 + kNN in one shot.
    question = "Largest Cloudflare line item"
    qe = await openai.embeddings.create(model=EMBED_MODEL, input=[question])
    # A top-level `query` cannot be combined with `retriever`, so the tenant
    # filter goes inside each sub-retriever instead.
    tenant_filter = {"term": {"org_id": ORG_ID}}
    hits = await es.search(
        index=INDEX,
        retriever={
            "rrf": {
                "retrievers": [
                    {"standard": {
                        "query": {"match": {"text": question}},
                        "filter": tenant_filter,
                    }},
                    {"knn": {
                        "field": "embedding",
                        "query_vector": qe.data[0].embedding,
                        "k": 10,
                        "num_candidates": 50,
                        "filter": tenant_filter,
                    }},
                ],
                "rank_window_size": 50,
            }
        },
        size=3,
    )
    for hit in hits["hits"]["hits"]:
        src = hit["_source"]
        print(hit["_score"], src["section"], src["page_numbers"])

    await es.close()


asyncio.run(main())
Cost
What you'll actually pay
Elastic Cloud Standard EU starts at ~€95/mo for the smallest hot-tier deployment that supports HNSW. Self-hosting on a 4 vCPU / 8 GB box is achievable but ops-heavy. parsr cost is unchanged from other integrations. For teams already running Elastic for logs, adding a parsr index is essentially free; for greenfield RAG-only deployments, pgvector or Qdrant is cheaper.
Performance
Tuning tips
- Use RRF through the `retriever` API (8.14+) over manual BM25+vector blending: it fuses ranks rather than raw scores, so there are no score scales to normalize or weights to hand-tune
- Set num_candidates ≥ 5× k for the kNN clause; smaller values silently degrade recall
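- Put the org_id `term` clause in each retriever's `filter` rather than in a top-level `query` or a `post_filter`: a filter inside the kNN clause is applied during the approximate search, so k matching chunks still come back, whereas post-filtering can return fewer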
- Run a separate index per region for EU/US tenants; cross-region replicas are an unnecessary expense for RAG queries that should never cross borders
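The tips above can be folded into a small helper that builds the `retriever` body. This is a sketch under assumptions rather than a prescribed pattern: `build_hybrid_retriever`, `index_for_region`, and the `parsr-invoices-<region>` naming scheme are hypothetical, the 5× multiplier is simply the rule of thumb from the list, and the tenant `term` clause is attached to each sub-retriever.

```python
MAX_NUM_CANDIDATES = 10_000  # Elasticsearch rejects larger num_candidates values


def index_for_region(region: str, base: str = "parsr-invoices") -> str:
    """Hypothetical naming scheme: one index per residency region."""
    return f"{base}-{region.lower()}"


def build_hybrid_retriever(
    question: str,
    query_vector: list[float],
    org_id: str,
    k: int = 10,
    candidate_factor: int = 5,
) -> dict:
    """Build an RRF retriever with a tenant filter on both sub-retrievers."""
    tenant_filter = {"term": {"org_id": org_id}}
    # Keep num_candidates at candidate_factor x k, capped at the server limit.
    num_candidates = min(k * candidate_factor, MAX_NUM_CANDIDATES)
    return {
        "rrf": {
            "retrievers": [
                {"standard": {
                    "query": {"match": {"text": question}},
                    "filter": tenant_filter,
                }},
                {"knn": {
                    "field": "embedding",
                    "query_vector": query_vector,
                    "k": k,
                    "num_candidates": num_candidates,
                    "filter": tenant_filter,
                }},
            ],
            "rank_window_size": num_candidates,
        }
    }


retriever = build_hybrid_retriever("Largest Cloudflare line item", [0.1, 0.2], "org_acme")
eu_index = index_for_region("EU")
```

The result could be passed as `await es.search(index=eu_index, retriever=retriever, size=3)`. Keeping the multiplier in one place makes it harder for `k` and `num_candidates` to drift apart as queries are tuned.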
Three lines and you're calling parsr from Elasticsearch.