Grounded Extraction: Getting to 98% Zero-Failure at Scale

"Mostly correct" is a failing grade when the output feeds a legal workflow. The target was document-level zero-failure: every field on a document either correct or explicitly flagged — never silently wrong. We hit 98% across 40 lakh test rows. This is how.

Zero-failure is not the same as accuracy

A 95%-accurate field that lies confidently 5% of the time is worse than one that abstains. Zero-failure reframes the goal: it is acceptable to say "I don't know," and unacceptable to invent. Every design choice below trades raw coverage for honesty.

Grounding against structured sources

Free-text LLM extraction drifts. Grounding it against a database or CSV of known values pins it down. Instead of asking the model to recall a court name or statute, we constrain it to select from a canonical list, and reject anything outside it.

Canonical entities (courts, judges, statutes) live in a DB/CSV the model must map to.
An extracted value with no match is not "close enough" — it is flagged for review.
This converts a generation problem into a constrained matching problem, which is far easier to verify.

Windowed precedent extraction

Long judgments blow past comfortable context limits and bury the model's attention. Rather than feed an entire document and hope, we slide a window over it, extract within each window, and merge. Precedents and citations — which cluster in specific sections — get found because the relevant window is small enough for the model to read carefully.

The verification ledger

Every extracted value carries provenance: where in the source it came from, and whether it grounded to a canonical record. A value that cannot point back to evidence is downgraded to "needs review" rather than emitted as fact. That single rule is most of the zero-failure number.

A system that knows what it doesn't know is more valuable than one that's slightly more accurate and silent about its mistakes.

Measuring it honestly

40 lakh rows is enough that anecdotes don't matter — distributions do. We tracked failure as "emitted a wrong value without a flag," not "got something wrong." Under that definition, the 2% residual is almost entirely flagged-for-review, not silent errors.

Takeaways

Ground generation against canonical data; turn recall into constrained selection.
Window long documents so attention stays sharp where the signal lives.
Attach provenance to every value and abstain when it's missing.
Define your metric as silent failures — that's the number users actually feel.