Vector Search at Scale: FAISS HNSW for Multilingual Dedup & Retrieval

When I started building a content-intelligence pipeline, the assumption was simple: ingest articles, embed them, and find duplicates. Then the corpus crossed eight million documents across eleven languages, and my naive approach quietly fell apart. The same news story would arrive in English, Hindi, and Spanish within minutes, and I needed to recognize all three as one event. The bottleneck was not embedding quality. It was search. This is the story of how I made vector search fast, multilingual, and accurate enough to trust in production.

Why exact search dies at scale

The textbook answer to similarity search is brute force: compute the distance from your query vector to every vector in the index, then take the top k. With a flat index this is exact and trivially correct. It is also a quadratic trap.

My embeddings were 768-dimensional. A single query against eight million vectors means eight million dot products of length 768. That is roughly six billion floating-point multiply-adds per query. On one CPU core that lands around 200 to 400 milliseconds. Fine for a demo. But deduplication is not one query, it is one query per ingested document. At a few hundred documents per second, flat search would need dozens of cores doing nothing but distance math, and the cost would scale linearly with corpus size forever. I needed sublinear search, and that meant accepting approximation.

What HNSW actually is

Hierarchical Navigable Small World graphs, or HNSW, are the approach that finally worked. The idea borrows from skip lists. Instead of a flat list of vectors, you build a layered graph. Each node is a vector. Edges connect a node to its near neighbors. The trick is the hierarchy: the bottom layer contains every node and is densely connected, while each layer above it is an increasingly sparse sample of nodes, acting like an express lane.

A node's maximum layer is chosen randomly with an exponentially decaying probability, so most nodes live only at the bottom and a rare few reach the top. Searching becomes a descent: you start at a single entry point in the top layer, greedily hop toward the query, and when you can get no closer at that layer you drop down one level and repeat. By the time you reach the bottom layer you are already in the right neighborhood, so the final fine-grained search is cheap.

Greedy descent through an HNSW graph: start at the sparse top entry point, hop toward the query at each layer, drop down, and finish in the dense bottom layer.

The greedy search, concretely

At each layer the algorithm keeps a candidate set and a dynamic list of the best results found so far. It expands the closest unvisited candidate, looks at that node's neighbors, and pushes any that improve the result list. The search at a layer terminates when the nearest remaining candidate is farther than the worst result already kept. The size of that result list during search is governed by efSearch. The whole thing is a beam search over a graph, and its expected complexity is logarithmic in the number of nodes rather than linear.

M, efConstruction, and efSearch

Three parameters control the recall and latency tradeoff, and getting them right was most of the work.

M is the number of bidirectional edges each node gets at the upper layers (the bottom layer gets 2*M). Higher M means a richer graph, better recall, and more memory. I found M=32 a sweet spot for 768-dim embeddings; M=16 saved memory but cost recall on hard multilingual pairs.
efConstruction controls how thoroughly neighbors are searched while inserting nodes. It is a build-time cost only. I used efConstruction=200. Pushing it to 400 barely moved recall but doubled build time.
efSearch is the query-time beam width. This is the dial you actually tune in production: larger means higher recall and higher latency. It does not change the index, so you can adjust it per query class without rebuilding.

The mental model: M and efConstruction determine the ceiling on quality your graph can offer; efSearch decides how much of that ceiling you pay for on each query.

Building the index in FAISS

FAISS makes HNSW a one-liner to construct. The important detail for deduplication is the metric. I wanted cosine similarity, so I L2-normalize every vector and use inner product, which on unit vectors is exactly cosine.

import faiss
import numpy as np

d = 768
M = 32
index = faiss.IndexHNSWFlat(d, M, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = 200
index.hnsw.efSearch = 128

# vectors must be L2-normalized so inner product == cosine
faiss.normalize_L2(embeddings)
index.add(embeddings)

faiss.normalize_L2(query)
scores, ids = index.search(query, k=10)
# scores are cosine similarities in [-1, 1]

That is the entire core. Everything else is plumbing around it.

Multilingual embeddings and cosine-threshold dedup

HNSW only finds neighbors that are actually near in vector space. For cross-lingual dedup to work, the same story in different languages must land near each other. That is an embedding problem, not a search problem. I used a multilingual sentence encoder trained with a contrastive objective on parallel and back-translated pairs, so that a sentence and its translation are pulled together in the shared space. Alignment is not free: idiomatic translations and culturally rephrased headlines drift apart, and I measured a real gap between same-language and cross-language similarity for true duplicates.

That gap is exactly why a single global cosine threshold is dangerous. After embedding I run each new document through the index, take the top neighbors, and apply a threshold to decide keep or drop. But I learned to use two thresholds: a strict one (around 0.92) for same-language pairs and a looser one (around 0.86) for cross-language pairs, with the language pair detected up front. A near-duplicate above threshold is dropped or merged into the existing cluster; everything else is kept and added to the index.

The dedup pipeline: text is embedded, queried against the HNSW index, scored, and either kept (and indexed) or dropped as a near-duplicate.

Results

On an eight-million-vector index with M=32 and efSearch=128, recall@10 measured against a flat exact baseline held at roughly 0.97. Median query latency dropped from about 280 ms (flat) to under 2 ms (HNSW) on a single core, a hundred-fold improvement that made real-time dedup feasible. Across a labeled evaluation set the pipeline removed about 31 percent of incoming documents as near-duplicates, with cross-language pairs accounting for nearly a third of those catches. False merges stayed under 1 percent after the two-threshold change.

The single biggest accuracy win was not a better index. It was admitting that English-to-English and English-to-Hindi duplicates do not live at the same cosine distance.

Pitfalls I hit

Forgetting to normalize. Inner product on un-normalized vectors is not cosine. One missed normalize_L2 call silently wrecked thresholds.
Deletes. HNSW does not support true deletion well. I used tombstoning and periodic full rebuilds rather than removing nodes in place.
Memory. HNSW keeps the full graph and all vectors in RAM. At M=32 the graph overhead alone was significant; I eventually paired HNSW with product quantization for the largest shards.
One global threshold. Covered above, but worth repeating because it caused the most subtle bugs.

Lessons

Approximate search is not a compromise you apologize for; at this scale it is the only thing that works, and 97 percent recall at a hundredth of the latency is an obviously good trade. Tune efSearch in production and leave M and efConstruction fixed at build time. Treat the embedding model and the threshold as a single coupled system, because the geometry your encoder produces dictates every distance decision downstream. And measure recall against an exact baseline on real data, not on synthetic vectors, because the hard cases are always the cross-lingual ones.

Conclusion

FAISS HNSW turned an intractable quadratic problem into a sub-millisecond graph walk, and multilingual embeddings plus language-aware cosine thresholds turned that fast search into trustworthy deduplication. The pipeline now keeps a multilingual corpus clean in real time, and the architecture has held up as the corpus has grown. If I were starting again, I would spend less time chasing a perfect threshold and more time on the embedding alignment that makes the threshold meaningful in the first place.