parsr.

Schema-driven extraction

Any financial document, your schema, structured JSON

Provide a JSON Schema, get structured data back validated against that schema. For document types we don't have specialist endpoints for yet, or for highly customized extraction. Stays vertical: financial documents only.

extract-with-schema.shbash
curl -X POST https://eu-api.tryparsr.dev/v1/extract \
  -H "Authorization: Bearer $PARSR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_url": "https://example.com/etrade-q1-2026.pdf",
    "schema": {
      "type": "object",
      "required": ["account_number", "period_end", "positions"],
      "properties": {
        "account_number": { "type": "string" },
        "period_end":     { "type": "string", "format": "date" },
        "total_value":    { "type": "string" },
        "positions": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["symbol", "quantity", "market_value"],
            "properties": {
              "symbol":       { "type": "string" },
              "quantity":     { "type": "string" },
              "cost_basis":   { "type": "string" },
              "market_value": { "type": "string" }
            }
          }
        }
      }
    },
    "wait": 60
  }'

Format coverage

Long-tail financial docs

Languages

30+

Avg latency

~5.8s p50

Field accuracy

~88% flat schemas

What we extract

Every field, with confidence and citations

The response shape mirrors your input schema 1:1 under result. Every leaf field gets a confidence score and bounding box in field_metadata, plus a top-level validation.schema_conformance summary you can branch on.

Input

Anonymized E*TRADE Q1 2026 brokerage statement, page 1 of 4 — extracted against a custom 5-field schema

Anonymized custom extraction preview
response.jsonjson
{
  "request_id": "req_01JXM4P9V8QK7Z3D",
  "schema_version": "extract.v1",
  "result": {
    "account_number": "X47-512983",
    "period_end": "2026-03-31",
    "total_value": "184230.55",
    "positions": [
      {
        "symbol": "VOO",
        "quantity": "412",
        "cost_basis": "152844.00",
        "market_value": "198736.40"
      },
      {
        "symbol": "MSFT",
        "quantity": "85",
        "cost_basis": "31204.50",
        "market_value": "38492.15"
      }
    ]
  },
  "validation": {
    "schema_conformance": {
      "valid": true,
      "missing_fields": [],
      "type_mismatches": [],
      "extra_fields_dropped": []
    },
    "field_completeness": {
      "declared": 5,
      "populated": 5,
      "ratio": 1.0
    }
  },
  "field_metadata": {
    "account_number":             { "confidence": 0.98, "bbox": { "page": 1, "x": 0.71, "y": 0.08, "w": 0.18, "h": 0.018 } },
    "period_end":                 { "confidence": 0.99, "bbox": { "page": 1, "x": 0.71, "y": 0.11, "w": 0.18, "h": 0.018 } },
    "positions[0].symbol":        { "confidence": 0.99, "bbox": { "page": 2, "x": 0.06, "y": 0.32, "w": 0.08, "h": 0.018 } },
    "positions[0].market_value":  { "confidence": 0.96, "bbox": { "page": 2, "x": 0.74, "y": 0.32, "w": 0.14, "h": 0.018 } }
  }
}
FieldTypeDescriptionConf. typical
request_idstringServer-assigned ULID for this extraction. Use it in support tickets and for log correlation.100%
schema_versionstring (literal)Always 'extract.v1' for the schema-driven endpoint. Specialist endpoints have their own schema_version (e.g. 'bank_statement.v2').100%
resultobject (your schema)The extracted payload, shape-for-shape mirror of your input JSON Schema. All leaf values are returned as strings to preserve source precision; cast in your code.88%
validation.schema_conformanceobjectvalid (bool), missing_fields[] (required fields the model couldn't find), type_mismatches[] (where source disagrees with declared type), extra_fields_dropped[] (model output that wasn't in the schema).100%
validation.field_completenessobjectdeclared / populated / ratio. Optional fields not present in the source are counted as un-populated, not missing.100%
field_metadata.<json_path>objectOne entry per leaf field, keyed by JSON path (dot + bracket notation for arrays). Each holds confidence in [0,1] and a normalized bbox on the source page.92%

Domain-specific validation

What makes this a specialist

Schema conformance

Every required field in your JSON Schema is verified to be present and type-correct in the response. Missing required fields, type mismatches, and any model output not declared in the schema are all surfaced — never silently coerced or dropped without a record.

validation.schema_conformance.valid

exampleSchema requires period_end as date. Source statement printed 'Q1 2026' instead of an ISO date. Returned with valid=false, missing_fields=['period_end'], result.period_end=null. Caller decides: re-prompt, fall back, or queue for review.

Field completeness ratio

Reports the share of declared fields that were populated from the source. Optional fields that simply aren't in the document are counted as not-populated rather than missing — useful for measuring schema fit across a corpus before scaling spend.

validation.field_completeness.ratio

exampleSchema declared 18 fields; source statement only contains 11 of them (cost_basis, prior_period, dividends not printed). ratio: 0.61. Signal that the schema is over-specified for this document type — trim it or split into sub-schemas.

Per-field confidence + bbox

Every leaf field in your schema gets a confidence score in [0,1] and a normalized bbox in field_metadata. Confidence < 0.85 is the recommended escalation threshold; bbox is what your UI uses to render an inline citation when a human reviews the document.

field_metadata.<json_path>.confidence

examplepositions[2].cost_basis returned with confidence 0.62 — source row had a smudged digit. Agent does not auto-book the trade; the page renders the bbox so a human approves the corrected value in one click.

Format coverage

Where customers reach for /v1/extract

Long-tail financial documents — anything outside the specialist endpoints, scoped to finance only

Brokerage statements

  • E*TRADE
  • Fidelity
  • Charles Schwab
  • Degiro

ESPP / RSU vesting schedules

  • Carta
  • Shareworks (Solium)
  • Computershare
  • E*TRADE Stock Plan

K-1 partnership statements (US tax)

  • Form 1065 K-1
  • Form 1120-S K-1
  • PTP K-1 packages

Crypto exchange statements

  • Coinbase tax docs
  • Kraken statements
  • Binance transaction reports

Loan amortization schedules

  • Mortgage amortization PDFs
  • SMB term-loan schedules
  • Auto-loan payoff letters

Lease agreements (financial clauses only)

  • Rent + escalation tables
  • Operating-expense pass-throughs
  • Renewal options & rates

Custom internal templates

  • Reconciliation reports
  • Firm-specific underwriting forms
  • Internal P&L exports

Code recipes

From document to JSON in five lines

extract.shbash
curl -X POST https://eu-api.tryparsr.dev/v1/extract \
  -H "Authorization: Bearer $PARSR_API_KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d '{
    "document_url": "https://example.com/etrade-q1.pdf",
    "schema": {
      "type": "object",
      "required": ["account_number", "period_end", "positions"],
      "properties": {
        "account_number": { "type": "string" },
        "period_end":     { "type": "string", "format": "date" },
        "total_value":    { "type": "string" },
        "positions": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["symbol", "quantity", "market_value"],
            "properties": {
              "symbol":       { "type": "string" },
              "quantity":     { "type": "string" },
              "cost_basis":   { "type": "string" },
              "market_value": { "type": "string" }
            }
          }
        }
      }
    },
    "wait": 60
  }'
extract_brokerage.pypython
import os, httpx

schema = {
    "type": "object",
    "required": ["account_number", "period_end", "positions"],
    "properties": {
        "account_number": {"type": "string"},
        "period_end":     {"type": "string", "format": "date"},
        "total_value":    {"type": "string"},
        "positions": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["symbol", "quantity", "market_value"],
                "properties": {
                    "symbol":       {"type": "string"},
                    "quantity":     {"type": "string"},
                    "cost_basis":   {"type": "string"},
                    "market_value": {"type": "string"},
                },
            },
        },
    },
}

resp = httpx.post(
    "https://eu-api.tryparsr.dev/v1/extract",
    headers={"Authorization": f"Bearer {os.environ['PARSR_API_KEY']}"},
    json={
        "document_url": "https://example.com/etrade-q1.pdf",
        "schema": schema,
        "wait": 60,
    },
    timeout=70,
)
body = resp.json()
conformance = body["validation"]["schema_conformance"]
if not conformance["valid"]:
    raise ValueError(f"Missing fields: {conformance['missing_fields']}")

for pos in body["result"]["positions"]:
    print(pos["symbol"], pos["quantity"], pos["market_value"])
extractBrokerage.tstypescript
const schema = {
  type: "object",
  required: ["account_number", "period_end", "positions"],
  properties: {
    account_number: { type: "string" },
    period_end:     { type: "string", format: "date" },
    total_value:    { type: "string" },
    positions: {
      type: "array",
      items: {
        type: "object",
        required: ["symbol", "quantity", "market_value"],
        properties: {
          symbol:       { type: "string" },
          quantity:     { type: "string" },
          cost_basis:   { type: "string" },
          market_value: { type: "string" },
        },
      },
    },
  },
} as const;

const resp = await fetch("https://eu-api.tryparsr.dev/v1/extract", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.PARSR_API_KEY}`,
    "Content-Type": "application/json",
    "Idempotency-Key": crypto.randomUUID(),
  },
  body: JSON.stringify({
    document_url: "https://example.com/etrade-q1.pdf",
    schema,
    wait: 60,
  }),
});

const body = await resp.json();
const { schema_conformance } = body.validation;
if (!schema_conformance.valid) {
  throw new Error(`missing fields: ${schema_conformance.missing_fields.join(", ")}`);
}
for (const pos of body.result.positions) {
  console.log(pos.symbol, pos.quantity, pos.market_value);
}
agent.pypython
from langchain_parsr import ParsrToolkit
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# ParsrToolkit exposes 'parsr_extract' — the agent supplies the schema at call time.
tools = ParsrToolkit.from_env().get_tools()
agent = create_react_agent(ChatOpenAI(model="gpt-4o"), tools)

result = await agent.ainvoke({
    "messages": [(
        "user",
        "Extract account_number, period_end, and a positions[] list "
        "(symbol, quantity, market_value) from this E*TRADE statement: "
        "https://example.com/etrade-q1.pdf"
    )]
})
print(result["messages"][-1].content)

Compared

How parsr's /v1/extract compares

VendorPricing per pageEU residencySchema-drivenConfidence + bboxVertical specialization
parsrfrom €0.022Default (eu-api region-bound key)Yes — JSON Schema at request time, no trainingPer leaf field, in field_metadataFinance-only by design
Reducto ExtractCustomNo (US default)Yes — JSON Schema at request timePartialGeneral parser
Mindee Custom TrainingQuote-based; requires labeled training setPro tier+ onlyNo — template-based, train per formatPartialNo (horizontal)
DocuPipeQuote-based; dashboard-drivenNoYes — schemas authored in-dashboardYesNo (horizontal)

Hitting >10K pages/mo of one doc type? Ask for a specialist.

/v1/extract is for the long tail. Once a single doc type crosses ~10K pages/mo we ship a dedicated specialist endpoint in 48 hours — curated prompts, per-doc-type validators, fixture tests, and a published accuracy bar. Specialists run faster, cost the same, and are measurably more accurate than schema-driven extraction. Email a sample (anonymized fine) and we'll confirm within 24 hours.

Request a format →

FAQ

Common questions

  • How is /v1/extract different from /v1/parse?

    /v1/parse with doc_type='bank_statement' (or 'invoice', 'pay_stub', etc.) hits a specialist pipeline with curated prompts, per-doc-type validators (like balance-chain integrity), and a published accuracy bar from fixture tests. /v1/extract is the schema-driven escape hatch: you supply a JSON Schema at request time and we extract against it. More flexible, less accurate. Use specialists when one exists; use /v1/extract for everything else in your finance stack.

  • What if my schema is huge?

    Token cost scales with schema size — both the input cost (we send your schema to the vision LLM) and accuracy. Flat schemas with 5–20 fields are the sweet spot at ~88% field accuracy. 50+ fields or deeply nested schemas drop noticeably. If you have a large schema, split it: run two or three /v1/extract calls with sub-schemas instead of one mega-call. The page-cache rebates the second call when the file bytes match.

  • Do you train on my schema?

    No. parsr never trains on customer schemas, documents, or outputs. /v1/extract is vision-LLM extraction with few-shot prompting against the JSON Schema you send at request time. Each request is independent. There is no model fine-tuning step, no labeled-data requirement, and nothing to set up beforehand — paste a schema in your request body and you're live.

  • Schema design — any tips?

    Three rules. (1) Flat beats nested: a 12-field flat object accurately extracts more reliably than a 6-field schema with two levels of nesting. (2) Use 'string' for numeric leaves and cast on your side — preserves source precision and avoids silent rounding. (3) Mark only what's truly required as required. Optional fields not present in the source surface in field_completeness, not as conformance failures. Full guide at /docs/extract/schema-design.

  • What happens when a required field is missing?

    The response is still well-formed. validation.schema_conformance.valid=false, missing_fields lists every required field the model couldn't find on the document, and result.<missing_field> is null. Type mismatches (declared 'date', source printed 'Q1 2026') surface in type_mismatches[] with both the declared type and the raw value. Your code branches on conformance.valid — re-prompt, fall back to a different schema, or queue for human review.

  • When should I ask you for a specialist endpoint instead?

    If you process more than ~10K pages/mo of a single doc type, ask. Specialists are measurably more accurate (curated prompts + per-doc-type validators + fixture tests), run a touch faster, and cost the same per page. We ship new specialist endpoints in 48 hours — email a sample to support@tryparsr.dev and we'll confirm within 24 hours whether it's a good fit.

200 free pages. No credit card. No sales call.

Drop custom extraction parsing into your stack in an afternoon. If it doesn't earn its keep, walk away — no lock-in.

Get an API key