Vector Search Algorithms from scratch

Building IVF and HNSW from scratch to understand the accuracy/speed trade-off in vector search.

Exact nearest-neighbor search is essentially impossible at scale. A billion 1024-dimensional vectors means a trillion floating-point operations per query, and you need an answer in milliseconds. Every production vector database makes the same trade-off: give up the guarantee of finding the exact match in exchange for finding an excellent match, fast.

Databases like OpenSearch, Pinecone, and Qdrant handle this using specialized algorithms like IVF and HNSW. This post builds both from scratch to understand the accuracy/speed trade-off. It does not cover product quantization, ScaNN, or DiskANN. Familiarity with vector embeddings is assumed.

Measuring closeness

Given a search query, vector search finds the documents whose embeddings are closest to the query’s embedding. “Closest” requires a distance function. Two are standard:

Euclidean distance vs. dot product

Euclidean distance is the default in most vector databases and the metric used throughout this post. Dot product (or its normalized form, cosine similarity) is preferred when vector magnitudes are meaningful, such as in recommendation systems where a larger magnitude signals stronger preference.
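
Both metrics are a few lines of code. As a minimal plain-Python sketch (no library assumed):

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors after normalising away their magnitudes
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

print(euclidean([0, 0], [3, 4]))  # 5.0
```

Note that cosine similarity ignores magnitude entirely: `[1, 0]` and `[100, 0]` have similarity 1.0 but Euclidean distance 99.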

Given a corpus of N vectors and a query vector q, k-nearest-neighbor (kNN) search finds the k vectors closest to q. The rest of this post is about making that selection fast.

Naive approach: the linear scan

The simplest kNN implementation is a linear scan: compute the distance from the query to every vector in the corpus, then return the k smallest.

We call this a Flat Index.

import numpy as np


class FlatIndex:
    def __init__(self):
        self._vectors = []

    def index(self, vector: np.ndarray):
        self._vectors.append(vector)

    def search(self, query: np.ndarray, k: int) -> list[tuple[int, float]]:
        # 1. Compute euclidean distance from query to every vector in the corpus
        dataset_matrix = np.stack(self._vectors)
        distances = np.linalg.norm(dataset_matrix - query, axis=1)

        # 2. Partial sort: O(n) to get top-k, then sort only those k elements
        nearest_indices = np.argpartition(distances, k)[:k]
        nearest_indices = nearest_indices[np.argsort(distances[nearest_indices])]

        # 3. Return (id, distance) pairs
        return [(int(i), float(distances[i])) for i in nearest_indices]
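
The partial sort in step 2 is the one non-obvious trick: np.argpartition places the k smallest distances in the first k slots in O(n), without fully sorting, and only those k elements are then sorted. Isolated as a standalone example:

```python
import numpy as np

distances = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
k = 2

# O(n): the first k positions hold the k smallest values, in arbitrary order
top_k_unsorted = np.argpartition(distances, k)[:k]
# O(k log k): sort only those k by their actual distance
top_k = top_k_unsorted[np.argsort(distances[top_k_unsorted])]

print(top_k)  # indices of the two smallest distances: [1 3]
```

A full np.argsort would cost O(n log n); for large n and small k the difference is substantial.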

This works for small datasets. The problem is that it’s a brute-force scan.

If we have N vectors of dimension D, every search takes O(N · D) time. At 1 million vectors and 128 dimensions, that is 128 million multiplications per query. Production indexes routinely hold a billion vectors at 1024 dimensions.

Running a simple benchmark on a consumer laptop shows roughly 1.26 s per query at 1 million vectors and 128 dimensions.

> uv run plot_flat_scaling.py
n=  1,000  0.85 ms/query
n=  5,000  4.79 ms/query
n= 10,000  8.50 ms/query
n= 50,000  45.79 ms/query
n=100,000  98.35 ms/query
n=500,000  658.69 ms/query
n=1,000,000  1258.07 ms/query

To do better, we need to prune the search to only vectors likely to be close to the query.

Aside: Curse of dimensionality

In low dimensions, a natural way to prune the search space is spatial partitioning. A QuadTree recursively divides 2D space into cells. To find the nearest neighbor of a query, locate its cell and only check the vectors inside. An excellent visualization is available here.

  +--------------------+------------------------+
  |                    |                        |
  |   ·   ·    ·       |    ·             ·     |
  |                    |          ·             |   (skip)
  |       ·            |                  ·     |
  |                    |                        |
  +----------+---------+------------------------+
  |   · ·    |  · [Q]  |                        |
  |   ·      |  ·   ·  |    ·         ·         |   (search Q's cell only)
  |          |    ·    |                        |
  +----------+---------+------------------------+
  |    ·          ·    |       ·           ·    |
  |          ·         |             ·          |   (skip)
  +--------------------+------------------------+

  [Q] = query     · = indexed vector

This doesn’t generalise to high dimensions. The natural extension splits every cell along all d axes at once, producing 2^d sub-regions per split. But 2^d is also exactly the number of distinct binary vectors in d dimensions, so after a single split, each cell holds at most one data point and the index offers no pruning at all. For continuous embeddings the same logic applies: at d = 30 you already have a billion cells, more than most datasets have vectors, and the index degenerates into a linear scan.

A smarter approach, the k-d tree, splits along one dimension at a time and stays shallow. But even this fails at high d: the deeper problem is that high-dimensional space is mostly empty. As d grows, all points drift to roughly the same distance from any query, and there is no “nearby” region worth exploiting. This is the curse of dimensionality.
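
The distance-concentration effect is easy to observe empirically: draw random points in increasing dimension and compare the farthest and nearest distances from a query. A plain-Python sketch (uniform points in the unit hypercube, an assumption for illustration):

```python
import math
import random

def distance_ratio(d: int, n: int = 1000, seed: int = 0) -> float:
    """Farthest/nearest distance ratio from a random query to n random points in [0,1]^d."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n):
        point = [rng.random() for _ in range(d)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return max(dists) / min(dists)

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  farthest/nearest = {distance_ratio(d):.2f}")
```

In 2D the nearest point is typically hundreds of times closer than the farthest; by d = 1000 the ratio collapses toward 1, and “nearest” barely means anything.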

Scaling with IVF (Inverted File Index)

IVF sidesteps the curse of dimensionality by learning the partitions from data. K-means clustering finds centroids that summarise the natural groupings in the corpus. Unlike axis-aligned splits, these clusters naturally group vectors in high dimensions.

Building the index is two steps:

  1. Train: run k-means on the corpus to find nlist centroids.
  2. Index: assign every vector to its nearest centroid.

Each centroid defines a Voronoi cell. This is the region of space closer to it than to any other centroid.

At query time, find the nprobe nearest centroids and run exact search only within those clusters. Higher nprobe improves recall but slows down search; this is the core accuracy/speed trade-off in IVF. OpenSearch, for example, exposes both nlist and nprobe as first-class IVF parameters.

import numpy as np


class IVFIndex:
    def __init__(self, nlist: int, nprobe: int):
        self._nlist = nlist
        self._nprobe = nprobe
        self._centroids = None
        self._lists = {}  # cluster_id → [(vector, id)]

    def train(self, vectors: np.ndarray):
        # Random initialisation
        centroids = np.random.default_rng().choice(vectors, size=self._nlist, replace=False, axis=0)

        for _ in range(25):
            # Assign each vector to its nearest centroid
            labels = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
            # Recompute centroids as the mean of their assigned vectors
            new_centroids = np.array([
                vectors[labels == i].mean(axis=0) if (labels == i).any() else centroids[i]
                for i in range(self._nlist)
            ])
            # End loop early if the centroids converge
            if np.allclose(centroids, new_centroids):
                break
            centroids = new_centroids

        self._centroids = centroids

    def index(self, vector_id: int, vector: np.ndarray):
        cell = int(np.argmin(np.linalg.norm(self._centroids - vector, axis=1)))
        self._lists.setdefault(cell, []).append((vector, vector_id))

    def search(self, query: np.ndarray, k: int) -> list[tuple[int, float]]:
        # 1. Find the nprobe nearest centroids (clamp kth so nprobe == nlist stays valid)
        centroid_dists = np.linalg.norm(self._centroids - query, axis=1)
        cells = np.argpartition(centroid_dists, min(self._nprobe, self._nlist - 1))[:self._nprobe]

        # 2. Gather candidates from those clusters
        candidates, ids = [], []
        for cell in cells:
            for vec, vid in self._lists.get(int(cell), []):
                candidates.append(vec)
                ids.append(vid)

        # 3. Exact search within candidates only
        distances = np.linalg.norm(np.stack(candidates) - query, axis=1)
        k = min(k, len(candidates))  # probed clusters may hold fewer than k vectors
        nearest = np.argpartition(distances, k - 1)[:k]
        nearest = nearest[np.argsort(distances[nearest])]

        return [(ids[i], float(distances[i])) for i in nearest]

IVF in action: a worked example

To make this concrete, consider 8 vectors in 2D and a query q = [5, 4], with nlist=3 and nprobe=1. We obtain 3 clusters:

Vectors:  v0=[1,2]  v1=[2,1]  v2=[4,3]        (cluster A)
          v3=[8,9]  v4=[9,8]  v5=[8.5,8.5]    (cluster B)
          v6=[5,1]  v7=[6,2]                   (cluster C)

After k-means training:
  Centroid A = [2.3, 2.0]
  Centroid B = [8.5, 8.5]
  Centroid C = [5.5, 1.5]

Note that v2 = [4, 3] is geometrically close to cluster C, but k-means assigned it to A (dist to A = 1.97, dist to C = 2.12). This is a borderline point.

At query time, we compute the distance from q to each centroid:

  dist(q, A) = √((5-2.3)² + (4-2.0)²) = 3.36
  dist(q, B) = √((5-8.5)² + (4-8.5)²) = 5.70
  dist(q, C) = √((5-5.5)² + (4-1.5)²) = 2.55   ← nearest

With nprobe=1, we only search cluster C: vectors v6 = [5, 1] (dist = 3.0) and v7 = [6, 2] (dist = 2.24). IVF returns v7 as the nearest neighbor. But the true nearest neighbor is v2 = [4, 3] at dist = √2 = 1.41, sitting in cluster A. We never look there.

Setting nprobe=2 would search clusters C and A, finding v2. This is the fundamental IVF trade-off: nprobe=1 checked 2 out of 8 vectors (75% pruned) but missed the best match. nprobe=nlist degenerates to a flat scan, which is correct but slow.
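
The arithmetic of the worked example can be replayed directly. This sketch hard-codes the vectors and trained centroids from above and routes the query with nprobe=1:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vectors = {0: [1, 2], 1: [2, 1], 2: [4, 3], 3: [8, 9],
           4: [9, 8], 5: [8.5, 8.5], 6: [5, 1], 7: [6, 2]}
clusters = {"A": [0, 1, 2], "B": [3, 4, 5], "C": [6, 7]}
centroids = {"A": [2.3, 2.0], "B": [8.5, 8.5], "C": [5.5, 1.5]}
q = [5, 4]

# nprobe=1: route to the single nearest centroid, scan only that cluster
nearest_cell = min(centroids, key=lambda c: dist(q, centroids[c]))
ivf_answer = min(clusters[nearest_cell], key=lambda v: dist(q, vectors[v]))

# Ground truth: linear scan over everything
true_answer = min(vectors, key=lambda v: dist(q, vectors[v]))

print(nearest_cell, ivf_answer, true_answer)  # C 7 2: IVF misses the true neighbor
```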

IVF recall problem

Tuning IVF parameters

Benchmarking IVF against the flat index on 128-dimensional clustered data (using nlist = √N and nprobe = 5% of nlist) shows the latency benefits:

> uv run plot_ivf_performance.py
  n=  1,000 | Flat=    1.12 ms | IVF=    0.24 ms
  n=  5,000 | Flat=    5.77 ms | IVF=    2.11 ms
  n= 10,000 | Flat=   11.20 ms | IVF=    1.70 ms
  n= 25,000 | Flat=   29.05 ms | IVF=    4.27 ms
  n= 50,000 | Flat=   66.09 ms | IVF=    5.36 ms
  n=100,000 | Flat=  157.11 ms | IVF=   10.53 ms

IVF vs Flat latency scaling

At 100k vectors, IVF is 15x faster. The gap widens with scale: flat-index latency grows linearly with N, while IVF only scans a fixed fraction of the data. But speed alone does not matter if you are returning the wrong results. Sweeping nprobe on the 100k index reveals the trade-off:

  nprobe=  1 (  0.3%) | Recall=0.588 | 2.37 ms
  nprobe=  4 (  1.3%) | Recall=0.986 | 3.19 ms
  nprobe= 16 (  5.1%) | Recall=1.000 | 12.27 ms
  nprobe=316 (100.0%) | Recall=1.000 | 237.25 ms

Recall vs Latency trade-off as nprobe increases

Recall climbs steeply at first (probing just 1.3% of clusters already hits 98.6% recall) and then flatlines while latency keeps growing linearly. Past the knee of the curve, every additional cluster probed costs time without improving results. Finding that optimal value of nprobe is key.

HNSW

Hierarchical Navigable Small World (HNSW) requires no pre-training and no centroid count to tune. The implementation follows the original paper.

HNSW Hierarchy

The name describes the structure:

  - Hierarchical: the index is a stack of layers, each a sparser sample of the layer below; every vector appears in layer 0.
  - Navigable small world: each layer is a graph with enough long-range edges that greedy hops reach any region of the graph in few steps.

If you’ve seen a skip list, this is the same idea generalised to vectors: express lanes at the top let you take long hops toward the target, then you drop to denser layers for precision. The difference is that a skip list’s “closeness” is a 1D key comparison; HNSW’s is a vector distance.

Skip List

Image source: Wikipedia

Search enters at the top layer, greedily descends toward the query, then repeats on each denser layer until layer 0. The core primitive is a greedy best-first search with a bounded candidate set: expand the closest unvisited node, but skip any neighbor that cannot improve on the current ef best results. With ef=1 this is pure greedy search; with ef>1 it becomes a beam search that tracks multiple promising paths. This bounds exploration to a local neighborhood rather than the full graph.

def _search_layer(self, query: np.ndarray, entry_ids: list[int], ef: int, layer: int) -> list:
    # Seed the candidate min-heap with all entry points
    candidates = []  # min-heap: (dist, id): closest node explored next
    for node_id in entry_ids:
        dist = np.linalg.norm(query - self._nodes[node_id].vector)
        heapq.heappush(candidates, (dist, node_id))

    best = []    # max-heap (Python 3.14+): worst of current top-ef results at front
    visited = set()

    while candidates:
        d, node_id = heapq.heappop(candidates)
        if node_id in visited:
            continue
        visited.add(node_id)

        heapq.heappush_max(best, (d, node_id))
        if len(best) > ef:
            heapq.heappop_max(best)  # evict the furthest if over budget

        for nb_id in self._nodes[node_id].neighbors.get(layer, []):
            nb_dist = np.linalg.norm(query - self._nodes[nb_id].vector)
            if len(best) < ef or nb_dist < best[0][0]:  # only explore if promising
                heapq.heappush(candidates, (nb_dist, nb_id))

    return sorted(best)  # closest first

The index itself holds two things: the nodes (each storing a vector and a per-layer neighbor list), and bookkeeping for the current entry point and top layer. M, ef_construction, and ef_search are the HNSW parameters OpenSearch exposes for tuning.

import heapq
from dataclasses import dataclass, field

import numpy as np


@dataclass
class HNSWNode:
    vector: np.ndarray
    neighbors: dict[int, list[int]] = field(default_factory=dict)  # layer → [node_ids]


class HNSWIndex:
    def __init__(self, M: int = 16, ef_construction: int = 200, ef_search: int = 50):
        self._M = M                          # max neighbors per node per layer
        self._ef_construction = ef_construction  # beam width during insert
        self._ef_search = ef_search          # beam width during search
        self._nodes: dict[int, HNSWNode] = {}
        self._entry_point: int | None = None
        self._max_layer: int = 0
        self._mL = 1 / np.log(M)            # layer probability scale

    def _random_level(self) -> int:
        # Exponential distribution: most nodes at layer 0, exponentially fewer above
        return int(-np.log(np.random.uniform()) * self._mL)

    def index(self, vector: np.ndarray) -> None:
        node_id = len(self._nodes)
        node = HNSWNode(vector=vector)
        self._nodes[node_id] = node
        node_layer = self._random_level()

        if self._entry_point is None:
            self._entry_point = node_id
            self._max_layer = node_layer
            return

        # Phase 1: ef=1 greedy descent through layers above node_layer: find a close entry point
        entry_ids = [self._entry_point]
        for layer in range(self._max_layer, node_layer, -1):
            entry_ids = [nid for _, nid in self._search_layer(vector, entry_ids, ef=1, layer=layer)]

        # Phase 2: beam search at each layer from node_layer down to 0: wire edges
        for layer in range(min(node_layer, self._max_layer), -1, -1):
            M_limit = self._M * 2 if layer == 0 else self._M
            results = self._search_layer(vector, entry_ids, ef=self._ef_construction, layer=layer)
            entry_ids = [nid for _, nid in results]

            # Connect new node to its M nearest neighbors at this layer
            node.neighbors[layer] = [nid for _, nid in heapq.nsmallest(M_limit, results)]

            # Bidirectional: each neighbor adds a back-edge, then prunes to M
            for nb_id in node.neighbors[layer]:
                nb = self._nodes[nb_id]
                all_edges = nb.neighbors.get(layer, []) + [node_id]
                scored = [(np.linalg.norm(nb.vector - self._nodes[n].vector), n) for n in all_edges]
                nb.neighbors[layer] = [nid for _, nid in heapq.nsmallest(M_limit, scored)]

        if node_layer > self._max_layer:
            self._max_layer = node_layer
            self._entry_point = node_id

    def search(self, query: np.ndarray, k: int) -> list[tuple[int, float]]:
        # ef=1 descent to layer 1: fast coarse navigation
        entry_ids = [self._entry_point]
        for layer in range(self._max_layer, 0, -1):
            entry_ids = [nid for _, nid in self._search_layer(query, entry_ids, ef=1, layer=layer)]
        # Full beam search at layer 0: precision
        return self._search_layer(query, entry_ids, ef=self._ef_search, layer=0)[:k]
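
The layer-assignment rule in _random_level can be sanity-checked in isolation. With mL = 1/ln(M), the probability that a node reaches layer l or above is M^(-l), so with M=16 only about 1/16 of nodes rise above layer 0. A standalone sketch of the same formula:

```python
import math
import random

M = 16
mL = 1 / math.log(M)  # same layer probability scale as in HNSWIndex
rng = random.Random(0)

# Sample 100k node levels from the exponential rule
levels = [int(-math.log(rng.random()) * mL) for _ in range(100_000)]

frac_layer0 = levels.count(0) / len(levels)
print(f"fraction at layer 0: {frac_layer0:.3f}, tallest node: layer {max(levels)}")
```

Roughly 93.75% of samples land at layer 0, and the tallest node in 100k draws is only a few layers up, which is why the hierarchy stays logarithmically shallow.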

HNSW in action: a worked example

Same 8 vectors as before: v0=[1,2], v1=[2,1], v2=[4,3], v3=[8,9], v4=[9,8], v5=[8.5,8.5], v6=[5,1], v7=[6,2]. Suppose the graph has been built with M=2:

Layer 1 (sparse):    v3 ——— v0

Layer 0 (dense):     v0 — v1 — v2
                      |         |
                     v6 — v7   v3
                            \   |
                             v4—v5

To search for q = [5, 4] with k=1:

Phase 1: Layer 1 (ef=1): Start at entry point v3=[8,9] (dist=5.83). Its only layer-1 neighbor is v0=[1,2] (dist=4.47). v0 is closer, so it becomes the entry. Descend to layer 0.

Phase 2: Layer 0 (ef=ef_search): Start at v0 (dist=4.47). Expand its neighbors: v1 (4.24), v6 (3.0). Pop closest candidate v6, expand: v7 (2.24). Pop v7, expand: v4 (5.66, no improvement). Pop v1, expand: v2 (1.41). v2 is now the closest candidate. Expand v2: v3 (5.83, no improvement). No candidates left that can beat 1.41. Return v2.

HNSW found the exact nearest neighbor by visiting most of the 8 nodes. In a toy graph that is expected, but the structure scales logarithmically: at a million vectors with M=16, a search typically visits a few hundred nodes.

Tuning HNSW parameters

Benchmarking HNSW

Our pure-Python implementation above is faithful to the paper, but far too slow for realistic benchmarks. Indexing 100k vectors takes minutes instead of seconds. The numbers below use hnswlib, a production-grade C++ implementation of the same algorithm, with the same M, ef_construction, and ef_search parameters. On the same clustered dataset:

> uv run plot_hnsw_performance.py
  n=  1,000 | Flat=    1.54 ms | IVF=  0.51 ms | HNSW=  0.058 ms
  n=  5,000 | Flat=    5.36 ms | IVF=  0.96 ms | HNSW=  0.042 ms
  n= 10,000 | Flat=   12.10 ms | IVF=  1.57 ms | HNSW=  0.046 ms
  n= 25,000 | Flat=   28.73 ms | IVF=  2.46 ms | HNSW=  0.060 ms
  n= 50,000 | Flat=   64.24 ms | IVF=  6.55 ms | HNSW=  0.073 ms
  n=100,000 | Flat=  143.35 ms | IVF= 11.15 ms | HNSW=  0.095 ms

Flat vs IVF vs HNSW latency scaling

At 100k vectors, HNSW queries take 0.095ms, which is 1500x faster than flat search and 100x faster than IVF. The O(log N) scaling is visible: latency barely moves as the corpus grows from 1k to 100k.

Sweeping ef_search on the 100k index shows the same recall/latency pattern as IVF’s nprobe:

HNSW recall vs latency as ef_search increases

At ef_search=50, recall is 96.8% at 0.115ms per query. Doubling to ef_search=100 reaches 99.6% at 0.160ms. The full useful range fits within sub-millisecond latency.

The recall trade-off

Every approximate algorithm sacrifices some accuracy for speed. Recall@k measures this: of the true k nearest neighbors, what fraction does the algorithm actually return?

Recall@k = |ANN result ∩ true kNN| / k

A recall of 0.95 means the algorithm finds 95 of the true 100 nearest neighbors. The remaining 5 are close. They’re near-misses, not random vectors from the other side of the space. For most applications (RAG, recommendations, semantic search), this is indistinguishable from exact search.
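
As a minimal sketch, the metric is one set intersection over result-id lists:

```python
def recall_at_k(ann_result: list[int], true_knn: list[int]) -> float:
    """Fraction of the true k nearest neighbors present in the ANN result."""
    k = len(true_knn)
    return len(set(ann_result) & set(true_knn)) / k

print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4]))  # 0.75
```

In practice the ground-truth list comes from a flat index run once over the same queries, which is exactly how the benchmarks in this post compute recall.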

The relationship between recall and speed is different for each algorithm:

                 Flat              IVF                             HNSW
  Recall         1.0 (exact)       Controlled by nprobe            Controlled by ef_search
  Search time    O(N · D)          O((N/nlist) · nprobe · D)       O(log N · M · D)
  Index memory   O(N · D)          O(N · D) + centroids            O(N · (D + M))
  Build time     None              K-means training                Incremental (no batch step)
  Sweet spot     N < 10,000        Large N, memory-constrained     When recall and speed both matter

In practice, HNSW typically achieves 95%+ recall while searching less than 1% of the dataset. IVF can match this recall but needs a well-chosen nlist and nprobe, a training step that HNSW avoids entirely. The flat index remains the right choice for small datasets or when you need guaranteed exact results.

Conclusion

These algorithms are building blocks, not final answers. Production systems layer them: IVF with product quantization for compression, HNSW with metadata pre-filtering, or hybrid approaches that use IVF to coarsely shard the graph. The ANN Benchmarks project tracks how these combinations perform across real datasets, and it is the best place to start when choosing an index for a specific workload.