Essays on shipping AI systems — the architecture decisions, the cost trade-offs, and the things that broke.
A fully local GraphRAG pipeline: extract a knowledge graph from raw text with LangExtract + Ollama, store it in Neo4j, index entities in Qdrant, and answer questions with graph-grounded retrieval — no cloud, no API keys.
A RAG system that adapts to the query and corrects its own reasoning — adaptive web scraping, FAISS/ChromaDB retrieval, and a two-stage self-correcting LLM that critiques and improves its first answer.
Config-driven multi-agent systems beat hard-coded chains. Lessons from routing requests across dozens of specialist agents — when to add an agent, and when not to.
How I used FAISS HNSW for fast approximate vector search to deduplicate multilingual content at scale, tuning M, ef params, and cosine thresholds for high recall.
How collapsing a seven-stage LLM pipeline into two Gemini-2.5-Flash-Lite calls cut legal NER extraction cost by 25% across an 80-lakh judgment corpus — without losing accuracy.
DB/CSV grounding and windowed precedent extraction took a legal extraction system to 98% document-level zero-failure across 40 lakh test rows. Here's the architecture.
How I built fn-gemma, a router pairing on-device FunctionGemma-270M with cloud Gemini 2.5 Flash-Lite for ~99% function-calling accuracy under 550ms latency.
How I built a working transformer in Rust with no PyTorch and no autograd, hand-rolling tensors, scaled dot-product attention, and the full backward pass myself.
A full-parameter fine-tune of Qwen3.5-0.8B for Hindi→Gujarati translation on AI4Bharat IN22-Conv — instruction formatting, loss masking, bf16, gradient accumulation, and Hugging Face Accelerate.