- Vector Index: Optimized for high-dimensional data (images, audio, or text embeddings)
- Full-Text Search Index: Enables fast keyword-based searches
- Scalar Index: Accelerates filtering and sorting of structured numeric or categorical data
Understanding the IVF-PQ Index
An Approximate Nearest Neighbor (ANN) index re-represents data so that searches become faster (with a slight accuracy trade-off vs. brute-force search). LanceDB’s disk-based IVF-PQ index is a variant of the Inverted File Index (IVF) that uses Product Quantization (PQ) to compress embeddings. LanceDB is built on top of Lance, an open-source columnar format designed for performant ML workloads and fast random access. Because of this foundation, LanceDB adopts a primarily disk-based indexing philosophy.
IVF-PQ
IVF-PQ combines inverted file partitions with product quantization. The implementation in LanceDB exposes parameters you can fine-tune for index size, throughput, latency, and recall.
Product Quantization
Quantization compresses embeddings to speed up search. Product quantization (PQ) divides a high-dimensional vector into equal-sized subvectors. Each subvector maps to the nearest centroid in its PQ codebook.
Quantization is lossy, so reconstructed vectors differ slightly from the originals. This trades accuracy for memory.
Original:
128 × 32 = 4096 bits
Quantized:
4 × 8 = 32 bits
This yields a 128× reduction in memory requirements per vector.
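The encode/decode round trip above can be sketched in a few lines of numpy. This is a toy illustration, not LanceDB's implementation: the codebooks here are random stand-ins, whereas a real index trains them with k-means over the sub-vector space.

```python
import numpy as np

# Toy product quantization: split a 128-dim vector into 4 sub-vectors
# and replace each with the index of its nearest codebook centroid.
rng = np.random.default_rng(0)
dim, n_sub = 128, 4                  # 4 sub-vectors of 32 dims each
sub_dim = dim // n_sub
# 256 centroids per sub-space -> each code fits in 8 bits.
# Random codebooks for illustration; a real index trains them with k-means.
codebooks = rng.normal(size=(n_sub, 256, sub_dim))

def pq_encode(vec):
    codes = []
    for i in range(n_sub):
        sub = vec[i * sub_dim:(i + 1) * sub_dim]
        dists = np.linalg.norm(codebooks[i] - sub, axis=1)
        codes.append(int(np.argmin(dists)))
    return codes

def pq_decode(codes):
    # Lossy reconstruction: concatenate the chosen centroids.
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

vec = rng.normal(size=dim)
codes = pq_encode(vec)               # 4 bytes instead of 128 floats
approx = pq_decode(codes)
error = float(np.linalg.norm(vec - approx))   # nonzero: quantization is lossy
```

The nonzero reconstruction error is the accuracy-for-memory trade discussed above: storage drops from 4096 bits to 32 bits per vector, at the cost of approximate distances.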
Inverted File Index (IVF) Implementation
PQ reduces index size, while IVF accelerates search by narrowing the candidate space. IVF partitions the PQ vector space into Voronoi cells by running K-means and using the centroids as region seeds.
At query time, a query vector may lie near the boundary of multiple cells. The nprobe parameter controls how many cells are searched; a higher nprobe improves recall but increases latency.
HNSW Index Implementation
Hierarchical Navigable Small Worlds (HNSW) is one of the most accurate and fastest ANN algorithms for high-dimensional data.
Types of ANN Search Algorithms
- Tree-based: organize points into a tree.
- Hash-based: rely on geometric hashing; good theoretical guarantees but weaker empirical performance.
- Graph-based: represent points as a graph. HNSW is graph-based and combines k-NN graphs with skip-list concepts.
Understanding k-Nearest Neighbor Graphs
Each vector becomes a vertex connected to its k nearest neighbors (k edges). A greedy walk from an entry point toward closer neighbors yields good approximate neighbors. Building k-NN graphs exactly is quadratic, so practical systems use approximate constructions or incremental updates. One downside: getting good accuracy can require large k, increasing index size.
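The greedy walk can be sketched as follows, using an exact all-pairs graph build for clarity (the quadratic step that practical systems avoid with approximate construction):

```python
import numpy as np

# Greedy walk on an exact k-NN graph: from an entry vertex, repeatedly
# step to the neighbor closest to the query until no neighbor improves.
rng = np.random.default_rng(2)
points = rng.normal(size=(200, 8))
k = 8

# Quadratic all-pairs build; each vertex keeps k edges.
d = np.linalg.norm(points[:, None] - points[None], axis=2)
neighbors = np.argsort(d, axis=1)[:, 1:k + 1]

def greedy_search(query, entry=0):
    current = entry
    while True:
        best = min(neighbors[current],
                   key=lambda i: np.linalg.norm(points[i] - query))
        if np.linalg.norm(points[best] - query) < np.linalg.norm(points[current] - query):
            current = int(best)
        else:
            return current

query = rng.normal(size=8)
result = greedy_search(query)   # approximate nearest neighbor
exact = int(np.argmin(np.linalg.norm(points - query, axis=1)))
```

The walk can stall in a local minimum, so `result` is not guaranteed to equal `exact`; that gap is what larger k (and HNSW's layering, below) works to close.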
Hierarchical Navigable Small Worlds (HNSW)
HNSW improves upon k-NN graphs by:
- Pruning edges with heuristics so each vertex keeps only a small constant number of edges.
- Using multiple layers (like a skip list) and dynamic entry points to speed up search.
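The skip-list-style descent can be sketched with nested layer samples, each carrying its own small k-NN graph; the greedy result on a sparse layer seeds the search on the next, denser one. This omits HNSW's edge-pruning heuristics and insertion logic, so it is a structural sketch only.

```python
import numpy as np

# Layered greedy descent, HNSW-style: upper layers hold progressively
# sparser samples of the points; searching top-down narrows the entry
# point before the full bottom-layer search.
rng = np.random.default_rng(3)
points = rng.normal(size=(500, 8))

# Layer 0 holds all points; each upper layer keeps a nested sample.
layers = [np.arange(500), np.arange(0, 500, 10), np.arange(0, 500, 100)]

def knn_graph(ids, k=5):
    # Exact k-NN graph restricted to this layer's members.
    d = np.linalg.norm(points[ids][:, None] - points[ids][None], axis=2)
    order = np.argsort(d, axis=1)[:, 1:k + 1]
    return {int(ids[i]): [int(ids[j]) for j in order[i]]
            for i in range(len(ids))}

graphs = [knn_graph(ids) for ids in layers]

def greedy(query, graph, entry):
    # Step to the best neighbor until no neighbor is closer.
    current, dist = entry, np.linalg.norm(points[entry] - query)
    improved = True
    while improved:
        improved = False
        for n in graph[current]:
            dn = np.linalg.norm(points[n] - query)
            if dn < dist:
                current, dist, improved = n, dn, True
    return current

def hnsw_search(query):
    entry = int(layers[-1][0])
    for graph in reversed(graphs):   # sparsest layer first
        entry = greedy(query, graph, entry)
    return entry

found = hnsw_search(rng.normal(size=8))
```

Because the samples are nested, each layer's result is a valid entry point one layer down; the coarse layers do the long-range hops, leaving the dense bottom layer only local refinement.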