Embeddings for a dataset become searchable through an index, a data structure that stores embeddings for efficient scans and lookups. LanceDB provides a comprehensive suite of indexing strategies:
  • Vector Index: Optimized for high-dimensional data (images, audio, or text embeddings)
  • Full-Text Search Index: Enables fast keyword-based searches
  • Scalar Index: Accelerates filtering and sorting of structured numeric or categorical data
Scalar indexes serve as a foundational optimization layer. They can be combined with vector search (prefilter or post-filter), full-text search, SQL scans, or key-value lookups for rapid primary-key retrievals.
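To make the pre-filter idea concrete, here is a minimal numpy sketch of the semantics (not LanceDB's implementation): a scalar predicate is evaluated first, and only the surviving rows are ranked by vector distance. The column name `years` and the parameter `min_year` are made-up illustration names.

```python
import numpy as np

def prefiltered_search(query, vectors, years, min_year, k=3):
    """Pre-filter semantics: apply the scalar predicate first,
    then rank only the surviving rows by vector distance."""
    # This predicate scan is the step a scalar index accelerates.
    keep = np.flatnonzero(years >= min_year)
    d = np.linalg.norm(vectors[keep] - query, axis=1)
    return keep[d.argsort()[:k]]
```

A post-filter would invert the order: run the vector search first, then discard results that fail the predicate, which can return fewer than `k` rows.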

Understanding the IVF-PQ Index

An Approximate Nearest Neighbor (ANN) index re-represents data so that searches become faster (with a slight accuracy trade-off vs. brute-force search). LanceDB’s disk-based IVF-PQ index is a variant of the Inverted File Index (IVF) that uses Product Quantization (PQ) to compress embeddings. LanceDB is built on top of Lance, an open-source columnar format designed for performant ML workloads and fast random access. Because of this foundation, LanceDB adopts a primarily disk-based indexing philosophy.

IVF-PQ

IVF-PQ combines inverted file partitions with product quantization. The implementation in LanceDB exposes parameters you can fine-tune for index size, throughput, latency, and recall.

Product Quantization

Quantization compresses embeddings to speed up search. Product quantization (PQ) divides a high-dimensional vector into equal-sized subvectors. Each subvector maps to the nearest centroid in its PQ codebook. Quantization is lossy, so reconstructed vectors differ slightly from the originals. This trades accuracy for memory.
Original: 128 dimensions × 32-bit floats = 4,096 bits per vector
Quantized: 4 subvectors × 8-bit centroid IDs = 32 bits per vector
This yields a 128× reduction in memory per vector (4,096 ÷ 32).
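The encode/decode cycle above can be sketched in a few lines of numpy. This is a conceptual illustration, not LanceDB's PQ implementation: one small k-means codebook is trained per subvector, each subvector is stored as a one-byte centroid ID, and decoding concatenates the chosen centroids (hence the lossy reconstruction).

```python
import numpy as np

def train_pq(vectors, n_subvectors=4, n_centroids=256, iters=5, seed=0):
    """Train one k-means codebook per subvector (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    sub_dim = vectors.shape[1] // n_subvectors
    codebooks = []
    for s in range(n_subvectors):
        sub = vectors[:, s * sub_dim:(s + 1) * sub_dim]
        # init centroids from random training points, then refine
        centroids = sub[rng.choice(len(sub), n_centroids, replace=False)]
        for _ in range(iters):
            d = np.linalg.norm(sub[:, None, :] - centroids[None, :, :], axis=2)
            assign = d.argmin(axis=1)
            for c in range(n_centroids):
                mask = assign == c
                if mask.any():
                    centroids[c] = sub[mask].mean(axis=0)
        codebooks.append(centroids)
    return codebooks

def pq_encode(vec, codebooks):
    """Map each subvector to the ID of its nearest centroid (one byte each)."""
    sub_dim = len(vec) // len(codebooks)
    return np.array(
        [np.linalg.norm(cb - vec[s * sub_dim:(s + 1) * sub_dim], axis=1).argmin()
         for s, cb in enumerate(codebooks)],
        dtype=np.uint8,
    )

def pq_decode(codes, codebooks):
    """Lossy reconstruction: concatenate the selected centroids."""
    return np.concatenate([cb[c] for c, cb in zip(codes, codebooks)])
```

With 128-dimensional float32 input, the 4 uint8 codes are exactly the 4,096-bit → 32-bit compression described above.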

Inverted File Index (IVF) Implementation

PQ reduces index size, while IVF accelerates search by narrowing the candidate space. IVF partitions the PQ vector space into Voronoi cells by running K-means and using the centroids as region seeds. At query time, a query vector may lie near the boundary of several cells, so the nprobe parameter controls how many cells are searched (higher nprobe improves recall at the cost of latency).
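A minimal numpy sketch of the IVF mechanics (again, not LanceDB's implementation, and using full-precision vectors rather than PQ codes for clarity): k-means assigns every vector to its nearest centroid's inverted list, and a query scans only the `nprobe` lists whose centroids are closest.

```python
import numpy as np

def build_ivf(vectors, n_partitions=8, iters=10, seed=0):
    """Partition vectors with k-means; each centroid owns an inverted list."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_partitions, replace=False)]
    for _ in range(iters):
        assign = np.linalg.norm(
            vectors[:, None] - centroids[None], axis=2).argmin(axis=1)
        for c in range(n_partitions):
            m = assign == c
            if m.any():
                centroids[c] = vectors[m].mean(axis=0)
    lists = [np.flatnonzero(assign == c) for c in range(n_partitions)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=2):
    """Scan only the nprobe partitions whose centroids are closest."""
    order = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    cand = np.concatenate([lists[c] for c in order])
    d = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[d.argsort()[:k]]
```

Setting `nprobe` equal to the partition count degenerates to a brute-force scan, which is a handy way to sanity-check recall.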

HNSW Index Implementation

Hierarchical Navigable Small Worlds (HNSW) is one of the most accurate and fastest ANN algorithms for high-dimensional data.

Types of ANN Search Algorithms

  • Tree-based: organize points into a tree.
  • Hash-based: rely on geometric hashing; good theoretical guarantees but weaker empirical performance.
  • Graph-based: represent points as a graph. HNSW is graph-based and combines k-NN graphs with skip-list concepts.

Understanding k-Nearest Neighbor Graphs

Each vector becomes a vertex connected to its k nearest neighbors (k edges). A greedy walk from an entry point toward closer neighbors yields good approximate neighbors. Building k-NN graphs exactly is quadratic, so practical systems use approximate constructions or incremental updates. One downside: getting good accuracy can require large k, increasing index size.
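The greedy walk described above can be sketched directly. This toy version builds the k-NN graph exactly (the quadratic construction the paragraph warns about, fine at this scale) and walks from an entry point to a local minimum; names and parameters are illustrative.

```python
import numpy as np

def knn_graph(vectors, k=8):
    """Exact k-NN graph: O(n^2) construction, fine for a small example."""
    d = np.linalg.norm(vectors[:, None] - vectors[None], axis=2)
    np.fill_diagonal(d, np.inf)        # a vertex is not its own neighbor
    return d.argsort(axis=1)[:, :k]    # k neighbor ids per vertex

def greedy_walk(query, vectors, graph, entry=0):
    """Move to whichever neighbor is closer to the query; stop when none is."""
    cur = entry
    cur_d = np.linalg.norm(vectors[cur] - query)
    while True:
        neigh = graph[cur]
        nd = np.linalg.norm(vectors[neigh] - query, axis=1)
        best = nd.argmin()
        if nd[best] >= cur_d:
            return cur, cur_d          # local minimum: no neighbor improves
        cur, cur_d = neigh[best], nd[best]
```

Because the distance strictly decreases at every step, the walk always terminates, but it can stop at a local minimum rather than the true nearest neighbor, which is exactly why larger k (better connectivity) helps accuracy.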

Hierarchical Navigable Small Worlds (HNSW)

HNSW improves upon plain k-NN graphs by:
  • Pruning edges with heuristics so each vertex only keeps a small constant number of edges.
  • Using multiple layers (like a skip list) and dynamic entry points to speed up search.
Layers contain decreasing fractions of the dataset. Greedy search starts at the top layer, descends layer by layer, and finishes with a bottom-layer search to retrieve multiple neighbors.
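The layered descent can be approximated with a two-layer sketch, which is a simplification of real HNSW, not its implementation: a sparse upper layer (a small subset of points) selects a good entry point, and a greedy walk on the dense bottom-layer graph refines it. All names are illustrative.

```python
import numpy as np

def knn_graph(vectors, k=8):
    """Exact bottom-layer graph (O(n^2); illustration only)."""
    d = np.linalg.norm(vectors[:, None] - vectors[None], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.argsort(axis=1)[:, :k]

def layered_search(query, vectors, graph, upper_ids):
    """Two-layer sketch of HNSW's descent: the sparse upper layer picks
    the entry point, then a greedy walk refines it on the bottom layer."""
    # upper layer: scan the small subset for the best entry point
    ud = np.linalg.norm(vectors[upper_ids] - query, axis=1)
    cur, cur_d = upper_ids[ud.argmin()], ud.min()
    # bottom layer: greedy walk until no neighbor is closer
    improved = True
    while improved:
        improved = False
        for n in graph[cur]:
            nd = np.linalg.norm(vectors[n] - query)
            if nd < cur_d:
                cur, cur_d, improved = n, nd, True
    return cur
```

Real HNSW runs a greedy search within every layer (not a linear scan of the top layer) and keeps a beam of candidates at the bottom to return multiple neighbors, but the shape of the descent is the same.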