fn-gemma: A Hybrid On-Device + Cloud Router for Sub-550ms Function Calling
How I built fn-gemma, a router pairing on-device FunctionGemma-270M with cloud Gemini 2.5 Flash-Lite for ~99% function-calling accuracy under 550ms latency.
Every function-calling agent I had built before fn-gemma made the same lazy assumption: when in doubt, call the cloud. A user types "turn on the kitchen lights", and a 200-billion-parameter frontier model wakes up, bills me tokens, and answers 700ms later. That is absurd. The intent is trivial. The tool schema is tiny. The decision could have been made on the device in the user's hand.
So I built a router that asks a sharper question: what is the cheapest place this query can be resolved correctly? fn-gemma pairs Google's FunctionGemma-270M running on-device with Gemini 2.5 Flash-Lite in the cloud, and uses a three-tier pre-routing layer to decide between them. The result across my 30-case benchmark suite is roughly 99% function-calling accuracy at a p50 end-to-end latency under 550ms. This is how it works.
The cost of always calling the cloud
Routing every query to a hosted model has three taxes, and you pay all of them on every single request.
- Latency. A network round-trip plus cloud inference is rarely under 600ms, and the long tail is brutal. Cold regions, rate limits, and retries push p99 past two seconds.
- Cost. Even a cheap hosted model costs real money per million tokens. When 70% of your traffic is "set a timer for ten minutes", you are paying frontier prices for a problem a 270M model solves perfectly.
- Privacy. Sending every utterance off-device is a liability. Some queries simply should not leave the phone.
The fix is not a better cloud model. It is to keep most queries off the cloud entirely.
The three-tier pre-routing design
Before any model runs, fn-gemma passes the query through a pre-routing layer. The goal is to make the routing decision itself nearly free, so the router never becomes the bottleneck it is trying to eliminate.
Tier 1: deterministic shortcuts
The first tier is pure pattern logic, no model at all. Cached exact-match queries, obvious single-tool commands, and keyword-anchored intents resolve here in well under a millisecond. If a query has matched a tool confidently in the recent past, I do not re-infer it.
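To make the idea concrete, here is a minimal sketch of what a Tier 1 resolver could look like. The patterns, tool names (`set_lights`, `set_timer`), and the `Tier1` class are illustrative assumptions, not the rule set that ships in the repo. Returning `None` means Tier 1 abstains and the query falls through to the on-device model.

```python
import re
from typing import Optional

# Hypothetical keyword-anchored patterns; the real rule set is not shown in this post.
PATTERNS = [
    (re.compile(r"^turn (?P<state>on|off) the (?P<room>[\w ]+?) lights?$"), "set_lights"),
    (re.compile(r"^set a timer for (?P<duration>.+)$"), "set_timer"),
]


class Tier1:
    def __init__(self) -> None:
        # Exact-match cache: normalized query -> previously resolved call.
        self.cache: dict[str, dict] = {}

    @staticmethod
    def normalize(query: str) -> str:
        # Lowercase and collapse whitespace so trivially different spellings share a key.
        return " ".join(query.lower().split())

    def remember(self, query: str, call: dict) -> None:
        # Store a call that was resolved confidently by a later tier.
        self.cache[self.normalize(query)] = call

    def resolve(self, query: str) -> Optional[dict]:
        key = self.normalize(query)
        if key in self.cache:
            return self.cache[key]
        for pattern, tool in PATTERNS:
            match = pattern.match(key)
            if match:
                return {"tool": tool, "args": match.groupdict()}
        return None  # abstain: let Tier 2 handle it
```

Both lookups are a dictionary hit or a handful of regex matches, which is what keeps this tier effectively free compared with any model call.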
Tier 2: on-device classification
If Tier 1 abstains, FunctionGemma-270M runs locally. At 270M parameters, it is small enough to sit in memory and respond in tens of milliseconds. It produces both a candidate function call and a confidence signal. Most real traffic dies here, happily.
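This post does not spell out how the confidence signal is computed. One common choice, assumed here purely for illustration, is the geometric mean of the generated tokens' probabilities, derived from per-token log-probabilities. The `Candidate` type and `sequence_confidence` function below are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    tool: str
    args: dict
    confidence: float  # in [0, 1]; higher means the model was more certain


def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of token probabilities; 0.0 when there are no tokens."""
    if not token_logprobs:
        return 0.0
    # exp(mean(log p)) equals the geometric mean of the probabilities.
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

Averaging in log space keeps long outputs from being penalized just for their length, which a raw product of probabilities would do.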
Tier 3: escalation gate
Tier 3 is the decision diamond. It inspects the on-device output and decides whether to trust it or escalate to Gemini. The gate watches for low confidence, malformed arguments, ambiguous multi-intent phrasing, or schemas the small model has historically struggled with.
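A gate like this can be written as a plain predicate over the on-device output. The sketch below is an assumption-laden illustration: the 0.82 threshold and the four conditions come from the escalation rules listed later in this post, while `GateInput`, `escalation_reasons`, and the contents of `LOW_RECALL_TOOLS` are hypothetical.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.82  # value taken from the escalation rules in this post

# Hypothetical set of tools the small model has historically struggled with.
LOW_RECALL_TOOLS = frozenset({"send_message"})


@dataclass
class GateInput:
    tool: str
    confidence: float
    args_valid: bool
    intent_count: int = 1
    any_intent_unresolved: bool = False


def escalation_reasons(x: GateInput, low_recall=LOW_RECALL_TOOLS) -> list[str]:
    """Return why a call should escalate; an empty list means trust the device."""
    reasons = []
    if x.confidence < CONFIDENCE_THRESHOLD:
        reasons.append("low_confidence")
    if not x.args_valid:
        reasons.append("args_failed_schema_validation")
    if x.intent_count > 1 and x.any_intent_unresolved:
        reasons.append("unresolved_multi_intent")
    if x.tool in low_recall:
        reasons.append("low_recall_tool")
    return reasons
```

Returning the reasons rather than a bare boolean makes the gate easier to tune, since escalations can be logged and counted per cause.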
When on-device wins, and when to escalate
The on-device path wins whenever the query is well-formed, maps to a single known tool, and supplies unambiguous arguments. That covers the overwhelming majority of real assistant traffic: timers, toggles, lookups, reminders, simple navigations. For these, escalating to the cloud buys nothing but latency and cost.
Escalation earns its keep on the hard tail: novel phrasings, deeply nested arguments, rare tools the small model has seen little of, and anything the gate flags as ambiguous. The point of the router is not to avoid the cloud out of principle. It is to spend the cloud budget only where it changes the answer.
escalate_when:
- confidence < 0.82
- args_failed_schema_validation: true
- intent_count > 1 and any_intent_unresolved: true
- tool in low_recall_set
Multi-intent handling
Real users do not speak in single tool calls. "Set a timer for ten minutes and text Sarah I'll be late" is two intents in one breath. A naive router either picks one and drops the other, or panics and escalates the whole thing.
fn-gemma splits multi-intent queries into independent sub-queries during pre-routing, then routes each one separately. The timer resolves on-device; the contact-resolution-plus-message intent, being riskier, can escalate on its own. Each sub-call is gated independently, so one hard intent does not drag an easy one into the cloud. The post-processor reassembles the calls into a single ordered batch.
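As a rough illustration of the split-then-route idea, the sketch below breaks a query on a conjunction only when the next word is a known command verb, then routes each piece on its own. The verb list, `split_intents`, and `route_batch` are hypothetical, and this heuristic is deliberately naive; it is not a description of the splitter in the repo.

```python
import re

# Hypothetical list of verbs that can start a new command.
COMMAND_VERBS = ("set", "text", "turn", "call", "remind", "play", "navigate")

# Split on "and", "then", or "and then" only when a command verb follows,
# so "turn on the lights and fans" stays in one piece.
_SPLIT = re.compile(
    r"\s+(?:and then|and|then)\s+(?=(?:%s)\b)" % "|".join(COMMAND_VERBS),
    re.IGNORECASE,
)


def split_intents(query: str) -> list[str]:
    return [part.strip() for part in _SPLIT.split(query) if part.strip()]


def route_batch(query: str, route_one) -> list:
    """Route each sub-query independently; results keep the spoken order.

    route_one is any callable that takes one sub-query and returns its call,
    deciding for itself whether that sub-query stays on-device or escalates.
    """
    return [route_one(sub) for sub in split_intents(query)]
```

Because `route_one` sees each sub-query in isolation, a hard intent can escalate without pulling its easy neighbor into the cloud.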
The post-processing pipeline
Small models are fast but messy. FunctionGemma-270M occasionally emits trailing prose, slightly off-key argument names, or JSON with a stray comma. Rather than escalate on every cosmetic flaw, fn-gemma runs a deterministic cleanup pass.
- Extraction: pull the call out of any surrounding text the model leaked.
- Normalization: coerce argument names and types to the tool schema, parse durations and dates into canonical form.
- Validation: check required arguments; only genuinely unrecoverable calls trigger escalation.
- Assembly: order multi-intent batches and dedupe.
This pipeline is the unsung hero. A large share of what looked like on-device errors were really formatting noise that cleanup fixes for free, which is a big part of how the small model reaches its accuracy.
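The four steps can be sketched with nothing but the standard library. Everything named here (`SCHEMA`, `ALIASES`, the `set_timer` tool and its `duration_seconds` argument) is a made-up example rather than the project's actual schema, and the sketch skips duration and date parsing.

```python
import json
import re

# Hypothetical schema and alias table for a single tool.
SCHEMA = {"set_timer": {"required": {"duration_seconds": int}}}
ALIASES = {"duration": "duration_seconds", "seconds": "duration_seconds"}


def extract_json(raw: str) -> str:
    """Extraction: keep the outermost {...} span, dropping leaked prose."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return raw[start:end + 1]


def repair_json(text: str) -> dict:
    """Strip trailing commas before } or ], then parse.

    Naive: this would also touch a ',}' sequence inside a string value.
    """
    return json.loads(re.sub(r",\s*([}\]])", r"\1", text))


def normalize(call: dict) -> dict:
    """Normalization: map alias names to schema names and coerce types."""
    args = {ALIASES.get(k, k): v for k, v in call.get("args", {}).items()}
    for name, typ in SCHEMA[call["tool"]]["required"].items():
        if name in args and not isinstance(args[name], typ):
            args[name] = typ(args[name])  # raises if the value cannot be coerced
    return {"tool": call["tool"], "args": args}


def validate(call: dict) -> bool:
    """Validation: every required argument is present with the right type."""
    required = SCHEMA[call["tool"]]["required"]
    return all(isinstance(call["args"].get(n), t) for n, t in required.items())


def postprocess(raw: str):
    """Return a cleaned call, or None when the output is unrecoverable."""
    try:
        call = normalize(repair_json(extract_json(raw)))
    except (ValueError, KeyError, TypeError, AttributeError):
        return None
    return call if validate(call) else None


def assemble(calls: list) -> list:
    """Assembly: drop exact duplicates while preserving order."""
    seen, out = set(), []
    for call in calls:
        key = json.dumps(call, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(call)
    return out
```

In this sketch a `None` from `postprocess` is the only signal that should reach the escalation gate; leaked prose, aliased argument names, and stray commas are all absorbed before that point.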
Benchmark results
I ran three configurations (on-device only, cloud only, and the hybrid router) against the same 30-case suite: single-intent commands, multi-intent queries, ambiguous phrasings, and rare tools. Latency is end-to-end p50 on a mid-range mobile device with a warm cloud endpoint.
| Configuration | Accuracy | p50 latency | Cloud calls |
|---|---|---|---|
| On-device only | 91.4% | 120ms | 0% |
| Cloud only | 98.7% | 690ms | 100% |
| Hybrid (fn-gemma) | 99.0% | 540ms | 21% |
The number I care most about is the last column. Hybrid matches cloud-only accuracy while sending only 21% of queries to the cloud. About four out of five requests never leave the device, and the ones that do are exactly the ones that needed to.
Tradeoffs and lessons
This design is not free. The escalation gate is the whole ballgame: tune it too conservatively and you escalate everything, erasing the latency win; too aggressively and the small model's mistakes slip through. Calibrating that confidence threshold against real traffic took far more iteration than building either model path.
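One way to approach that calibration, sketched here under simplifying assumptions, is to sweep candidate thresholds over labeled traffic and look at escalation rate against accuracy. The `sweep` function and the sample data are hypothetical, and the sketch assumes every escalated query is answered correctly by the cloud, which flatters the accuracy column.

```python
def sweep(samples, thresholds):
    """Tabulate escalation rate and accuracy for each candidate threshold.

    samples: (confidence, on_device_correct) pairs from labeled traffic.
    Assumption: every escalated query is answered correctly by the cloud.
    """
    n = len(samples)
    rows = []
    for t in thresholds:
        escalated = sum(1 for conf, _ in samples if conf < t)
        kept_correct = sum(1 for conf, ok in samples if conf >= t and ok)
        rows.append({
            "threshold": t,
            "escalation_rate": escalated / n,
            "accuracy": (kept_correct + escalated) / n,
        })
    return rows


# Made-up samples, only to show the shape of the output.
samples = [(0.95, True), (0.90, True), (0.85, False), (0.60, False)]
for row in sweep(samples, [0.5, 0.82, 0.9]):
    print(row)
```

Raising the threshold buys accuracy by escalating more, so the useful output is the curve itself: pick the lowest threshold whose accuracy is acceptable.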
Two lessons stuck with me. First, cleanup beats escalation. Deterministic post-processing recovered more accuracy per engineering hour than any model swap. Second, the router is a product decision, not just an optimization. Deciding what stays on-device is also deciding what stays private, and that framing changed which queries I was willing to escalate at all.
Conclusion
fn-gemma is a bet that most function-calling traffic is mundane, and that treating it as mundane unlocks a tier of speed, cost, and privacy you cannot reach by upgrading the cloud model. Pair a tiny on-device model with a sharp escalation gate and a disciplined cleanup pass, and you get frontier-grade accuracy at on-device speeds for the queries that matter.
The full implementation, benchmark suite, and routing logic are open source at github.com/Vatsa10/fn-gemma.