Cutting an LLM Pipeline 25% on 80 Lakh Judgments
How collapsing a seven-stage LLM pipeline into two Gemini-2.5-Flash-Lite calls cut legal NER extraction cost by 25% across an 80-lakh judgment corpus — without losing accuracy.
The first version worked. That was the problem. A seven-stage LLM pipeline extracted named entities from Indian court judgments at acceptable accuracy — and at a cost that scaled linearly with a corpus of 80 lakh documents. Seven model calls per document is seven times the bill, seven times the latency, and seven places for the chain to break.
This is the story of getting that down to two calls on Gemini-2.5-Flash-Lite while improving reliability.
Why seven stages existed in the first place
Each stage did one "clean" job: detect parties, then dates, then statutes, then citations, and so on. It read well on a whiteboard. In production it meant the same 4,000-token judgment was tokenized and shipped to the model seven times. The model re-read the entire document at every stage to extract one field.
The hidden cost was input tokens, not output tokens. We were paying to re-send context the model had already seen.
The two-call architecture
The rewrite folded the seven extraction prompts into a single structured-output call that returns every entity type at once, followed by one verification call that checks the structured result against the source. The model reads the document once, emits a typed JSON object, and a second pass validates it.
- Call 1 — Extract. One prompt, one JSON schema covering all entity types. Structured output forces the shape, so there is nothing to parse heuristically.
- Call 2 — Verify. A cheaper grounding pass that confirms each extracted value actually appears in (or is supported by) the source text, and flags anything it cannot ground.
Where the 25% came from
Most of the saving was eliminating redundant input tokens — five fewer full-document reads per judgment. Switching the heavy reasoning stages to Flash-Lite captured the rest. The verification call is small and cheap relative to a full extraction, so adding it back did not erase the win.
Counting calls is the wrong mental model. Count tokens, and specifically count how many times you re-send the same context.
What it cost in accuracy — nothing
The fear with collapsing stages is that a single prompt juggling many entity types degrades each one. It did not, because structured output gives the model a stable scaffold to fill, and the verification pass catches the failure mode that actually matters in legal text: hallucinated values that were never in the document.
Takeaways
- Granular, one-job-per-call pipelines optimize for readability, not for the bill.
- Re-sending context is the silent cost center. Consolidate reads before you consolidate logic.
- A cheap verification pass buys back the reliability you'd otherwise lose by merging stages.
- Model tiering (Flash-Lite for the bulk, a stronger model only where needed) compounds the saving.
Two calls, 25% cheaper, and easier to reason about. The simplest pipeline that holds is almost always fewer stages than the first design suggests.