Versioning & Reproducibility in LanceDB

LanceDB provides built-in, automatic versioning for AI/ML workflows, powered by the Lance columnar format. Every table mutation (append, update, deletion, or schema change) is tracked with zero configuration, enabling:

Time-Travel Debugging: Pinpoint production issues by querying historical table states.
Atomic Rollbacks: Revert terabyte-scale datasets to any prior version in seconds.
ML Reproducibility: Exactly reproduce training snapshots (vectors + metadata).
Branching Workflows: Conduct A/B tests on embeddings/models via lightweight table clones.

Basic Versioning Example

This walkthrough builds a small table of sample quotes and uses it to demonstrate LanceDB’s versioning capabilities.

Setting Up the Table

First, let’s create a table with some sample data:

Checking Initial Version

After creating the table, let’s check the initial version information:

Modifying Data

Each operation that modifies data, such as an update, append, or delete, automatically creates a new table version.

Updating Existing Data

Let’s update some existing records to see versioning in action:

Adding New Data

Now let’s add more records to the table:

Checking Version Changes

Let’s see how the versions have changed after our modifications:

Tracking Changes in Schema

LanceDB’s versioning system automatically tracks every schema modification. This is critical when handling evolving embedding models. For example, adding a new vector_minilm column creates a fresh version, enabling seamless A/B testing between embedding generations without recreating the table.

Preparing Data for Embeddings

First, let’s get the data we want to embed:

import pyarrow as pa

# Get data from table
df = table.search().limit(5).to_pandas()

Generating Embeddings

Now let’s generate embeddings using the all-MiniLM-L6-v2 model:

from sentence_transformers import SentenceTransformer

# Embed the quotes with the "all-MiniLM-L6-v2" model
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

# Generate one normalized embedding per quote
vectors = model.encode(
    df["quote"].tolist(), convert_to_numpy=True, normalize_embeddings=True
)
vector_dim = vectors[0].shape[0]
print(f"Vector dimension: {vector_dim}")

# Pair each embedding with the id of the row it was computed from
vectors_with_ids = [
    {"id": int(row_id), "vector_minilm": vec.tolist()}
    for row_id, vec in zip(df["id"], vectors)
]

Adding Vector Column to Schema

Now let’s add the vector column to our table schema:

# Add an empty vector column, then fill it by merging on "id"
table.add_columns(
    {"vector_minilm": f"arrow_cast(NULL, 'FixedSizeList({vector_dim}, Float32)')"}
)

(
    table.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(vectors_with_ids)
)

Checking Version Changes After Schema Modification

Let’s see how the schema change affected our versioning:

# Check versions after schema change
versions = table.list_versions()
version_count_after_embed = len(versions)
version_after_embed = table.version
print(f"Number of versions after adding embeddings: {version_count_after_embed}")
print(f"Current version: {version_after_embed}")

# Verify the schema change
# The table should now include a vector_minilm column containing
# embeddings generated by the all-MiniLM-L6-v2 model
print(table.schema)

Rollback to Previous Versions

LanceDB supports fast rollbacks to any previous version without data duplication.

Viewing All Versions

First, let’s see all the versions we’ve created:

Rolling Back to a Previous Version

Now let’s roll back to before we added the vector column:

Making Changes from Previous Versions

After restoring a table to an earlier version, you can continue making modifications. In this example, we rolled back to a version before adding embeddings. This allows us to experiment with different embedding models and compare their performance.

Switching to a Different Embedding Model

Let’s try a different embedding model (all-mpnet-base-v2) to see how it performs:

# Switch to the all-mpnet-base-v2 model to embed the quotes
model = SentenceTransformer("all-mpnet-base-v2", device="cpu")

# Generate one normalized embedding per quote
vectors = model.encode(
    df["quote"].tolist(), convert_to_numpy=True, normalize_embeddings=True
)
vector_dim = vectors[0].shape[0]
print(f"Vector dimension: {vector_dim}")

# Pair each embedding with the id of the row it was computed from
vectors_with_ids = [
    {"id": int(row_id), "vector_mpnet": vec.tolist()}
    for row_id, vec in zip(df["id"], vectors)
]

Adding the New Vector Column

Now let’s add the new vector column with the different model:

# Add an empty vector column, then fill it by merging on "id"
table.add_columns(
    {"vector_mpnet": f"arrow_cast(NULL, 'FixedSizeList({vector_dim}, Float32)')"}
)

(
    table.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(vectors_with_ids)
)

Checking Version Changes

Let’s see how this new model affects our versioning:

# Check versions after schema change
versions = table.list_versions()
version_count_after_alter_embed = len(versions)
version_after_alter_embed = table.version
print(
    f"Number of versions after switching model: {version_count_after_alter_embed}"
)
print(f"Current version: {version_after_alter_embed}")

# The table should now include a vector_mpnet column containing
# embeddings generated by the all-mpnet-base-v2 model
print(table.schema)

Delete Data From the Table

Let’s demonstrate how deletions also create new versions:

Going Back to Latest Version

First, let’s return to the latest version:

Deleting Data

Now let’s delete some data to see how it affects versioning:

Version History and Operations

Throughout this guide, we’ve demonstrated various operations that create new versions in LanceDB. Here’s a summary of the version history we created:

Initial Creation (v1): Created table with quotes data and basic schema
First Update (v2): Changed “Richard” to “Richard Daniel Sanchez”
Data Append (v3): Added new quotes from both characters
Schema Evolution (v4): Added vector_minilm column for embeddings
Embedding Merge (v5): Populated vector_minilm with embeddings
Version Rollback (v6): Restored to v3 (pre-vector state)
Alternative Schema (v7): Added vector_mpnet column
Alternative Merge (v8): Populated vector_mpnet embeddings
Data Cleanup (v9): Kept only Richard Daniel Sanchez quotes
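The numbering above can be replayed with a small bookkeeping model in plain Python. This is a toy illustration, not LanceDB's implementation: it only tracks a version number and a column list per commit, and it treats a rollback as a new commit that copies an older snapshot. The `author` column name is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    operation: str
    columns: tuple


history = []  # append-only: committed snapshots are never edited or removed


def commit(operation, columns):
    """Record a new snapshot whose version is one past the latest."""
    snapshot = Snapshot(len(history) + 1, operation, tuple(columns))
    history.append(snapshot)
    return snapshot


def restore(version):
    """Model a rollback as a new commit that copies an older snapshot's columns."""
    return commit(f"restore v{version}", history[version - 1].columns)


base = ("id", "author", "quote")  # "author" is an assumed column name
commit("create table", base)                              # v1
commit("update author", base)                             # v2
commit("append rows", base)                               # v3
commit("add vector_minilm", base + ("vector_minilm",))    # v4
commit("merge vector_minilm", base + ("vector_minilm",))  # v5
restore(3)                                                # v6
commit("add vector_mpnet", base + ("vector_mpnet",))      # v7
commit("merge vector_mpnet", base + ("vector_mpnet",))    # v8
commit("delete rows", base + ("vector_mpnet",))           # v9

for snap in history:
    print(f"v{snap.version}: {snap.operation} -> {', '.join(snap.columns)}")
```

In this model the rollback lands at v6 with the v3 columns, while v4 and v5 stay in the history with `vector_minilm` intact, which mirrors the summary above.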

Each version represents a distinct state of your data, allowing you to:

Track changes over time
Compare different embedding strategies
Revert to previous states
Maintain data lineage for ML reproducibility

System operations like index updates and table compaction automatically increment the table version number. These background processes are tracked in the version history, though their version numbers are omitted from this example for clarity.
