Integration · Elasticsearch

Hybrid search over parsr-extracted finance docs with Elasticsearch's BM25 + dense vector RRF.

Elasticsearch (~67K GitHub stars) added HNSW-indexed dense_vector kNN search in 8.x and supports RRF (reciprocal rank fusion) for hybrid retrieval in a single search call; the `retriever` syntax used in the sample below requires 8.14 or later. For fintechs already operating Elasticsearch for log aggregation or full-text search, adding parsr-extracted finance documents alongside is a one-mapping change. The killer use case: an EU compliance team auditing transactions can run a single hybrid query ("all SEPA transfers over €10K from the last quarter") against parsr's structured fields plus the raw text, with an exact filter on the structured side and semantic recall on the chunks. Elastic Cloud's EU regions (Frankfurt, Amsterdam, Dublin) keep data residency in the EU end to end when paired with parsr EU.
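As a rough sketch of the "exact filter on the structured side" half of that query, the snippet below builds filter clauses for SEPA transfers over €10K in the previous calendar quarter. The field names `payment_scheme`, `amount_eur`, and `value_date` are hypothetical; substitute whatever structured fields your parsr output is mapped to.

```python
from datetime import date


def previous_quarter(today: date) -> tuple[date, date]:
    """Return [start, end) of the calendar quarter before the one containing `today`."""
    current_q_month = 3 * ((today.month - 1) // 3) + 1  # first month of current quarter
    end = date(today.year, current_q_month, 1)          # exclusive upper bound
    if current_q_month == 1:
        start = date(today.year - 1, 10, 1)             # Q4 of the previous year
    else:
        start = date(today.year, current_q_month - 3, 1)
    return start, end


def sepa_filter(today: date, min_amount: int = 10_000) -> list[dict]:
    """Filter clauses for the example query. All field names are assumptions."""
    start, end = previous_quarter(today)
    return [
        {"term": {"payment_scheme": "SEPA"}},
        {"range": {"amount_eur": {"gt": min_amount}}},
        {"range": {"value_date": {"gte": start.isoformat(), "lt": end.isoformat()}}},
    ]


clauses = sepa_filter(date(2024, 5, 15))
```

The resulting list can be used as the `filter` of a `bool` query or of a retriever, so the semantic part of the search only ever sees matching transfers.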


Install

One command

pip install parsr-sdk elasticsearch openai

Code

Working sample

Elasticsearch integration

from parsr_sdk import AsyncParsr
from elasticsearch import AsyncElasticsearch

parsr = AsyncParsr(api_key="sk_eu_live_...")
es = AsyncElasticsearch("https://....cloud.es.io:9243", api_key="...")

What you get

Highlights

RRF hybrid search (BM25 + vector) in a single call: exact-term precision plus semantic recall for compliance and finance queries
kNN queries support filters on structured fields (date ranges, amounts, doc_type)
Single cluster handles logs + RAG — no second datastore for ops to manage
Elastic Cloud Frankfurt/Amsterdam/Dublin pairs with parsr EU for end-to-end residency
Aggregations work natively on parsr-extracted fields — facet by month, currency, counterparty
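To make the aggregation highlight concrete, here is a minimal sketch of a faceting request body. The fields `invoice_date`, `currency`, and `counterparty` are hypothetical and are not part of the quickstart mapping; they would need to be mapped as `date` and `keyword` fields respectively.

```python
def facet_request(org_id: str) -> dict:
    """Build a search body that returns only facet counts (no hits).

    Field names under "aggs" are assumptions for illustration.
    """
    return {
        "size": 0,  # skip hits, return aggregations only
        "query": {"bool": {"filter": [{"term": {"org_id": org_id}}]}},
        "aggs": {
            "by_month": {
                "date_histogram": {
                    "field": "invoice_date",
                    "calendar_interval": "month",
                }
            },
            "by_currency": {"terms": {"field": "currency", "size": 10}},
            "by_counterparty": {"terms": {"field": "counterparty", "size": 10}},
        },
    }


body = facet_request("org_acme")
```

With the async client this would be sent as `await es.search(index=INDEX, **body)`, and the buckets read from `resp["aggregations"]`.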

Architecture

How the pieces fit

One Elasticsearch index per doc_type (or a single index with a doc_type keyword field for filtering). Each chunk becomes one document with `text` (analyzed for BM25), `embedding` (dense_vector with HNSW), and `metadata.*` flat keyword fields. parsr.parse(*, include_chunks=true) → embed text → bulk index. Query via the `_search` API with an RRF block combining a BM25 `match` and a `knn` clause.
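For intuition about what the RRF step does with the two result lists, here is a small pure-Python sketch of reciprocal rank fusion. It illustrates the standard formula (each list contributes `1 / (rank_constant + rank)` per document) and is not Elasticsearch's implementation; the chunk IDs are made up, and 60 is used as the rank constant because that is the Elasticsearch default.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=60, size=3):
    """Fuse ranked lists of doc IDs by summing 1 / (rank_constant + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:size]


bm25_ranking = ["c1", "c2", "c3"]  # hypothetical BM25 order
knn_ranking = ["c3", "c1", "c4"]   # hypothetical kNN order
fused = rrf_fuse([bm25_ranking, knn_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because only ranks are used, the BM25 and cosine scores never need to be normalized against each other; a chunk that ranks well in both lists (here `c1`) rises to the top.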

Quickstart

End-to-end example

Parse a document with `include_chunks=true`, embed each chunk, upsert into Elasticsearch, query.

parsr → embed → Elasticsearch → query (Python)

import os
from parsr_sdk import AsyncParsr
from elasticsearch import AsyncElasticsearch
from elasticsearch.helpers import async_bulk
from openai import AsyncOpenAI

parsr = AsyncParsr(api_key=os.environ["PARSR_API_KEY"])
es = AsyncElasticsearch(
    os.environ["ES_URL"], api_key=os.environ["ES_API_KEY"]
)
openai = AsyncOpenAI()

INDEX = "parsr-invoices"

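# Note: the steps below use top-level `await`. Run them inside an
# `async def main()` driven by `asyncio.run(main())`, or in `python -m asyncio`.

# 1. Mapping (run once).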
if not await es.indices.exists(index=INDEX):
    await es.indices.create(
        index=INDEX,
        mappings={
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": 1536,
                    "index": True,
                    "similarity": "cosine",
                },
                "doc_type": {"type": "keyword"},
                "org_id": {"type": "keyword"},
                "page_numbers": {"type": "integer"},
                "section": {"type": "keyword"},
            }
        },
    )

# 2. Parse with chunks.
result = await parsr.parse_invoice(
    document_url="https://files.example.com/invoice.pdf",
    include_chunks=True,
    chunking={"strategy": "block"},
)

# 3. Embed + bulk index.
texts = [c.text for c in result.chunks]
embeds = await openai.embeddings.create(model="text-embedding-3-small", input=texts)
docs = [
    {
        "_index": INDEX,
        "_id": c.id,
        "_source": {
            "text": c.text,
            "embedding": e.embedding,
            "doc_type": c.metadata.get("doc_type", "invoice"),
            "org_id": "org_acme",
            "page_numbers": c.page_numbers,
            "section": c.metadata.get("section", ""),
        },
    }
    for c, e in zip(result.chunks, embeds.data)
]
await async_bulk(es, docs)

# 4. Hybrid query — RRF combines BM25 + kNN in one shot.
question = "Largest Cloudflare line item"
qe = await openai.embeddings.create(
    model="text-embedding-3-small", input=[question]
)
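# The tenant filter goes inside each retriever: `retriever` cannot be combined
# with a top-level `query`, and a kNN filter is applied during the vector search.
org_filter = {"term": {"org_id": "org_acme"}}
hits = await es.search(
    index=INDEX,
    retriever={
        "rrf": {
            "retrievers": [
                {"standard": {
                    "query": {"match": {"text": question}},
                    "filter": org_filter,
                }},
                {"knn": {
                    "field": "embedding",
                    "query_vector": qe.data[0].embedding,
                    "k": 10,
                    "num_candidates": 50,
                    "filter": org_filter,
                }},
            ],
            "rank_window_size": 50,
        }
    },
    size=3,
)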
for hit in hits["hits"]["hits"]:
    src = hit["_source"]
    print(hit["_score"], src["section"], src["page_numbers"])

Cost

What you'll actually pay

Elastic Cloud Standard EU starts ~€95/mo for the smallest hot-tier deployment that supports HNSW. Self-hosting on a 4 vCPU / 8 GB box is achievable but ops-heavy. parsr cost is unchanged from other integrations. For teams already running Elastic for logs, adding a parsr index is essentially free; for greenfield RAG-only deployments pgvector or Qdrant is cheaper.

Performance

Tuning tips

Use RRF over manual BM25+vector blending (the `retriever` syntax requires 8.14+): the algorithm is well documented and fuses ranks rather than raw scores, so there are no score weights to hand-tune
Set num_candidates ≥ 5× k for the kNN clause; smaller values silently degrade recall
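Filter by org_id with a `term` clause in each retriever's `filter` rather than a top-level `query`: Elasticsearch rejects `query` alongside `retriever`, and a kNN `filter` is applied during the vector search, whereas post-filtering can leave fewer than k hits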
Run a separate index per region for EU/US tenants; cross-region replicas are an unnecessary expense for RAG queries that should never cross borders
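One way to keep the `num_candidates` rule from drifting is to derive it from `k` in a small helper. The sketch below is an illustration only: `knn_retriever` and its `candidate_factor` parameter are hypothetical names, and the 10,000 cap reflects Elasticsearch's upper limit on `num_candidates`.

```python
MAX_NUM_CANDIDATES = 10_000  # Elasticsearch rejects larger values


def knn_retriever(query_vector, k=10, candidate_factor=5, filters=None):
    """Build a `knn` retriever clause with num_candidates of at least 5 * k."""
    if candidate_factor < 5:
        raise ValueError("candidate_factor below 5 risks degraded recall")
    clause = {
        "field": "embedding",
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": min(k * candidate_factor, MAX_NUM_CANDIDATES),
    }
    if filters is not None:
        clause["filter"] = filters  # passed through to the clause unchanged
    return {"knn": clause}


retriever = knn_retriever(
    [0.1, 0.2, 0.3], k=10, filters={"term": {"org_id": "org_acme"}}
)
```

The returned dict can be dropped into the `retrievers` list of an RRF block in place of a hand-written `knn` clause.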

Three lines and you're calling parsr from Elasticsearch.
