How is /v1/extract different from /v1/parse?

/v1/parse with doc_type='bank_statement' (or 'invoice', 'pay_stub', etc.) hits a specialist pipeline with curated prompts, per-doc-type validators (like balance-chain integrity), and a published accuracy bar from fixture tests. /v1/extract is the schema-driven escape hatch: you supply a JSON Schema at request time and we extract against it. More flexible, less accurate. Use specialists when one exists; use /v1/extract for everything else in your finance stack.

What if my schema is huge?

Token cost scales with schema size — both the input cost (we send your schema to the vision LLM) and accuracy. Flat schemas with 5–20 fields are the sweet spot at ~88% field accuracy. 50+ fields or deeply nested schemas drop noticeably. If you have a large schema, split it: run two or three /v1/extract calls with sub-schemas instead of one mega-call. The page-cache rebates the second call when the file bytes match.

Do you train on my schema?

No. parsr never trains on customer schemas, documents, or outputs. /v1/extract is vision-LLM extraction with few-shot prompting against the JSON Schema you send at request time. Each request is independent. There is no model fine-tuning step, no labeled-data requirement, and nothing to set up beforehand — paste a schema in your request body and you're live.

Schema design — any tips?

Three rules. (1) Flat beats nested: a 12-field flat object accurately extracts more reliably than a 6-field schema with two levels of nesting. (2) Use 'string' for numeric leaves and cast on your side — preserves source precision and avoids silent rounding. (3) Mark only what's truly required as required. Optional fields not present in the source surface in field_completeness, not as conformance failures. Full guide at /docs/extract/schema-design.

What happens when a required field is missing?

The response is still well-formed. validation.schema_conformance.valid=false, missing_fields lists every required field the model couldn't find on the document, and result. is null. Type mismatches (declared 'date', source printed 'Q1 2026') surface in type_mismatches[] with both the declared type and the raw value. Your code branches on conformance.valid — re-prompt, fall back to a different schema, or queue for human review.

When should I ask you for a specialist endpoint instead?

If you process more than ~10K pages/mo of a single doc type, ask. Specialists are measurably more accurate (curated prompts + per-doc-type validators + fixture tests), run a touch faster, and cost the same per page. We ship new specialist endpoints in 48 hours — email a sample to support@tryparsr.dev and we'll confirm within 24 hours whether it's a good fit.

Schema-driven extraction

Any financial document, your schema, structured JSON

Provide a JSON Schema, get structured data back validated against that schema. For document types we don't have specialist endpoints for yet, or for highly customized extraction. Stays vertical: financial documents only.

Get an API key in 60 seconds Read the docs →

extract-with-schema.shbash

curl -X POST https://eu-api.tryparsr.dev/v1/extract \
  -H "Authorization: Bearer $PARSR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_url": "https://example.com/etrade-q1-2026.pdf",
    "schema": {
      "type": "object",
      "required": ["account_number", "period_end", "positions"],
      "properties": {
        "account_number": { "type": "string" },
        "period_end":     { "type": "string", "format": "date" },
        "total_value":    { "type": "string" },
        "positions": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["symbol", "quantity", "market_value"],
            "properties": {
              "symbol":       { "type": "string" },
              "quantity":     { "type": "string" },
              "cost_basis":   { "type": "string" },
              "market_value": { "type": "string" }
            }
          }
        }
      }
    },
    "wait": 60
  }'

Format coverage

Long-tail financial docs

Languages

30+

Avg latency

~5.8s p50

Field accuracy

~88% flat schemas

What we extract

Every field, with confidence and citations

The response shape mirrors your input schema 1:1 under result. Every leaf field gets a confidence score and bounding box in field_metadata, plus a top-level validation.schema_conformance summary you can branch on.

Input

Anonymized E*TRADE Q1 2026 brokerage statement, page 1 of 4 — extracted against a custom 5-field schema

Anonymized custom extraction preview

response.jsonjson

{
  "request_id": "req_01JXM4P9V8QK7Z3D",
  "schema_version": "extract.v1",
  "result": {
    "account_number": "X47-512983",
    "period_end": "2026-03-31",
    "total_value": "184230.55",
    "positions": [
      {
        "symbol": "VOO",
        "quantity": "412",
        "cost_basis": "152844.00",
        "market_value": "198736.40"
      },
      {
        "symbol": "MSFT",
        "quantity": "85",
        "cost_basis": "31204.50",
        "market_value": "38492.15"
      }
    ]
  },
  "validation": {
    "schema_conformance": {
      "valid": true,
      "missing_fields": [],
      "type_mismatches": [],
      "extra_fields_dropped": []
    },
    "field_completeness": {
      "declared": 5,
      "populated": 5,
      "ratio": 1.0
    }
  },
  "field_metadata": {
    "account_number":             { "confidence": 0.98, "bbox": { "page": 1, "x": 0.71, "y": 0.08, "w": 0.18, "h": 0.018 } },
    "period_end":                 { "confidence": 0.99, "bbox": { "page": 1, "x": 0.71, "y": 0.11, "w": 0.18, "h": 0.018 } },
    "positions[0].symbol":        { "confidence": 0.99, "bbox": { "page": 2, "x": 0.06, "y": 0.32, "w": 0.08, "h": 0.018 } },
    "positions[0].market_value":  { "confidence": 0.96, "bbox": { "page": 2, "x": 0.74, "y": 0.32, "w": 0.14, "h": 0.018 } }
  }
}

Field	Type	Description	Conf. typical
request_id	string	Server-assigned ULID for this extraction. Use it in support tickets and for log correlation.	100%
schema_version	string (literal)	Always 'extract.v1' for the schema-driven endpoint. Specialist endpoints have their own schema_version (e.g. 'bank_statement.v2').	100%
result	object (your schema)	The extracted payload, shape-for-shape mirror of your input JSON Schema. All leaf values are returned as strings to preserve source precision; cast in your code.	88%
validation.schema_conformance	object	valid (bool), missing_fields[] (required fields the model couldn't find), type_mismatches[] (where source disagrees with declared type), extra_fields_dropped[] (model output that wasn't in the schema).	100%
validation.field_completeness	object	declared / populated / ratio. Optional fields not present in the source are counted as un-populated, not missing.	100%
field_metadata.<json_path>	object	One entry per leaf field, keyed by JSON path (dot + bracket notation for arrays). Each holds confidence in [0,1] and a normalized bbox on the source page.	92%

Domain-specific validation

What makes this a specialist

Schema conformance

Every required field in your JSON Schema is verified to be present and type-correct in the response. Missing required fields, type mismatches, and any model output not declared in the schema are all surfaced — never silently coerced or dropped without a record.

validation.schema_conformance.valid

exampleSchema requires period_end as date. Source statement printed 'Q1 2026' instead of an ISO date. Returned with valid=false, missing_fields=['period_end'], result.period_end=null. Caller decides: re-prompt, fall back, or queue for review.

Field completeness ratio

Reports the share of declared fields that were populated from the source. Optional fields that simply aren't in the document are counted as not-populated rather than missing — useful for measuring schema fit across a corpus before scaling spend.

validation.field_completeness.ratio

exampleSchema declared 18 fields; source statement only contains 11 of them (cost_basis, prior_period, dividends not printed). ratio: 0.61. Signal that the schema is over-specified for this document type — trim it or split into sub-schemas.

Per-field confidence + bbox

Every leaf field in your schema gets a confidence score in [0,1] and a normalized bbox in field_metadata. Confidence < 0.85 is the recommended escalation threshold; bbox is what your UI uses to render an inline citation when a human reviews the document.

field_metadata.<json_path>.confidence

examplepositions[2].cost_basis returned with confidence 0.62 — source row had a smudged digit. Agent does not auto-book the trade; the page renders the bbox so a human approves the corrected value in one click.

Format coverage

Where customers reach for /v1/extract

Long-tail financial documents — anything outside the specialist endpoints, scoped to finance only

Brokerage statements

E*TRADE
Fidelity
Charles Schwab
Degiro

ESPP / RSU vesting schedules

Carta
Shareworks (Solium)
Computershare
E*TRADE Stock Plan

K-1 partnership statements (US tax)

Form 1065 K-1
Form 1120-S K-1
PTP K-1 packages

Crypto exchange statements

Coinbase tax docs
Kraken statements
Binance transaction reports

Loan amortization schedules

Mortgage amortization PDFs
SMB term-loan schedules
Auto-loan payoff letters

Lease agreements (financial clauses only)

Rent + escalation tables
Operating-expense pass-throughs
Renewal options & rates

Custom internal templates

Reconciliation reports
Firm-specific underwriting forms
Internal P&L exports

Read the schema-design guide →

Code recipes

From document to JSON in five lines

extract.shbash

curl -X POST https://eu-api.tryparsr.dev/v1/extract \
  -H "Authorization: Bearer $PARSR_API_KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d '{
    "document_url": "https://example.com/etrade-q1.pdf",
    "schema": {
      "type": "object",
      "required": ["account_number", "period_end", "positions"],
      "properties": {
        "account_number": { "type": "string" },
        "period_end":     { "type": "string", "format": "date" },
        "total_value":    { "type": "string" },
        "positions": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["symbol", "quantity", "market_value"],
            "properties": {
              "symbol":       { "type": "string" },
              "quantity":     { "type": "string" },
              "cost_basis":   { "type": "string" },
              "market_value": { "type": "string" }
            }
          }
        }
      }
    },
    "wait": 60
  }'

extract_brokerage.pypython

import os, httpx

schema = {
    "type": "object",
    "required": ["account_number", "period_end", "positions"],
    "properties": {
        "account_number": {"type": "string"},
        "period_end":     {"type": "string", "format": "date"},
        "total_value":    {"type": "string"},
        "positions": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["symbol", "quantity", "market_value"],
                "properties": {
                    "symbol":       {"type": "string"},
                    "quantity":     {"type": "string"},
                    "cost_basis":   {"type": "string"},
                    "market_value": {"type": "string"},
                },
            },
        },
    },
}

resp = httpx.post(
    "https://eu-api.tryparsr.dev/v1/extract",
    headers={"Authorization": f"Bearer {os.environ['PARSR_API_KEY']}"},
    json={
        "document_url": "https://example.com/etrade-q1.pdf",
        "schema": schema,
        "wait": 60,
    },
    timeout=70,
)
body = resp.json()
conformance = body["validation"]["schema_conformance"]
if not conformance["valid"]:
    raise ValueError(f"Missing fields: {conformance['missing_fields']}")

for pos in body["result"]["positions"]:
    print(pos["symbol"], pos["quantity"], pos["market_value"])

extractBrokerage.tstypescript

const schema = {
  type: "object",
  required: ["account_number", "period_end", "positions"],
  properties: {
    account_number: { type: "string" },
    period_end:     { type: "string", format: "date" },
    total_value:    { type: "string" },
    positions: {
      type: "array",
      items: {
        type: "object",
        required: ["symbol", "quantity", "market_value"],
        properties: {
          symbol:       { type: "string" },
          quantity:     { type: "string" },
          cost_basis:   { type: "string" },
          market_value: { type: "string" },
        },
      },
    },
  },
} as const;

const resp = await fetch("https://eu-api.tryparsr.dev/v1/extract", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.PARSR_API_KEY}`,
    "Content-Type": "application/json",
    "Idempotency-Key": crypto.randomUUID(),
  },
  body: JSON.stringify({
    document_url: "https://example.com/etrade-q1.pdf",
    schema,
    wait: 60,
  }),
});

const body = await resp.json();
const { schema_conformance } = body.validation;
if (!schema_conformance.valid) {
  throw new Error(`missing fields: ${schema_conformance.missing_fields.join(", ")}`);
}
for (const pos of body.result.positions) {
  console.log(pos.symbol, pos.quantity, pos.market_value);
}

agent.pypython

from langchain_parsr import ParsrToolkit
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# ParsrToolkit exposes 'parsr_extract' — the agent supplies the schema at call time.
tools = ParsrToolkit.from_env().get_tools()
agent = create_react_agent(ChatOpenAI(model="gpt-4o"), tools)

result = await agent.ainvoke({
    "messages": [(
        "user",
        "Extract account_number, period_end, and a positions[] list "
        "(symbol, quantity, market_value) from this E*TRADE statement: "
        "https://example.com/etrade-q1.pdf"
    )]
})
print(result["messages"][-1].content)

What teams build with this

Three integration patterns

AI bookkeeping agents — edge-case docs

Bookkeeping copilots hit a long tail of statements that no specialist endpoint covers — a niche brokerage, an ESPP grant letter, a custom reconciliation export. /v1/extract lets the agent declare the schema it needs at runtime and get structured JSON back, instead of falling back to raw OCR and regex.

See use case →

Lending & underwriting — custom income docs

Underwriters routinely see income proofs that aren't pay stubs or bank statements: K-1s, RSU vesting schedules, brokerage dividend reports. Define the income-relevant schema once, run it across the borrower's document bundle, and join the structured output back into your decisioning model.

See use case →

Accounting automation — firm-specific templates

Every accounting firm has its own internal templates: client onboarding worksheets, recurring journal-entry exports, partner-comp summaries. /v1/extract turns those firm-specific PDFs and spreadsheets into JSON your workflow can consume — without waiting for a specialist endpoint to be built.

See use case →

Compared

How parsr's /v1/extract compares

Vendor	Pricing per page	EU residency	Schema-driven	Confidence + bbox	Vertical specialization
parsr	from €0.022	Default (eu-api region-bound key)	Yes — JSON Schema at request time, no training	Per leaf field, in field_metadata	Finance-only by design
Reducto Extract	Custom	No (US default)	Yes — JSON Schema at request time	Partial	General parser
Mindee Custom Training	Quote-based; requires labeled training set	Pro tier+ only	No — template-based, train per format	Partial	No (horizontal)
DocuPipe	Quote-based; dashboard-driven	No	Yes — schemas authored in-dashboard	Yes	No (horizontal)

See full comparison vs Reducto →

Hitting >10K pages/mo of one doc type? Ask for a specialist.

/v1/extract is for the long tail. Once a single doc type crosses ~10K pages/mo we ship a dedicated specialist endpoint in 48 hours — curated prompts, per-doc-type validators, fixture tests, and a published accuracy bar. Specialists run faster, cost the same, and are measurably more accurate than schema-driven extraction. Email a sample (anonymized fine) and we'll confirm within 24 hours.

Request a format →

FAQ

Common questions

How is /v1/extract different from /v1/parse?
/v1/parse with doc_type='bank_statement' (or 'invoice', 'pay_stub', etc.) hits a specialist pipeline with curated prompts, per-doc-type validators (like balance-chain integrity), and a published accuracy bar from fixture tests. /v1/extract is the schema-driven escape hatch: you supply a JSON Schema at request time and we extract against it. More flexible, less accurate. Use specialists when one exists; use /v1/extract for everything else in your finance stack.
What if my schema is huge?
Token cost scales with schema size — both the input cost (we send your schema to the vision LLM) and accuracy. Flat schemas with 5–20 fields are the sweet spot at ~88% field accuracy. 50+ fields or deeply nested schemas drop noticeably. If you have a large schema, split it: run two or three /v1/extract calls with sub-schemas instead of one mega-call. The page-cache rebates the second call when the file bytes match.
Do you train on my schema?
No. parsr never trains on customer schemas, documents, or outputs. /v1/extract is vision-LLM extraction with few-shot prompting against the JSON Schema you send at request time. Each request is independent. There is no model fine-tuning step, no labeled-data requirement, and nothing to set up beforehand — paste a schema in your request body and you're live.
Schema design — any tips?
Three rules. (1) Flat beats nested: a 12-field flat object accurately extracts more reliably than a 6-field schema with two levels of nesting. (2) Use 'string' for numeric leaves and cast on your side — preserves source precision and avoids silent rounding. (3) Mark only what's truly required as required. Optional fields not present in the source surface in field_completeness, not as conformance failures. Full guide at /docs/extract/schema-design.
What happens when a required field is missing?
The response is still well-formed. validation.schema_conformance.valid=false, missing_fields lists every required field the model couldn't find on the document, and result.<missing_field> is null. Type mismatches (declared 'date', source printed 'Q1 2026') surface in type_mismatches[] with both the declared type and the raw value. Your code branches on conformance.valid — re-prompt, fall back to a different schema, or queue for human review.
When should I ask you for a specialist endpoint instead?
If you process more than ~10K pages/mo of a single doc type, ask. Specialists are measurably more accurate (curated prompts + per-doc-type validators + fixture tests), run a touch faster, and cost the same per page. We ship new specialist endpoints in 48 hours — email a sample to support@tryparsr.dev and we'll confirm within 24 hours whether it's a good fit.

200 free pages. No credit card. No sales call.

Drop custom extraction parsing into your stack in an afternoon. If it doesn't earn its keep, walk away — no lock-in.

Get an API key