feat(lancedb): implement InsertDocuments for LanceDbVectorIndex #1658

Open
abhicris wants to merge 1 commit into 0xPlaygrounds:main from kcolbchain:feat/lancedb-insert-documents

@abhicris

Summary

Implements the InsertDocuments trait for LanceDbVectorIndex, matching the existing implementations for MongoDB / Postgres / Qdrant / Milvus / ScyllaDB / SurrealDB / HelixDB / S3 Vectors / SQLite. LanceDB was one of only two integrations still missing this trait (Neo4j is the other, and is covered by #1636).

Before this PR, users had to hand-build Arrow RecordBatches to insert data into LanceDB; they can now do:

let index = LanceDbVectorIndex::new(table, model.clone(), "id", SearchParams::default()).await?;
let docs = EmbeddingsBuilder::new(model).documents(my_docs)?.build().await?;
index.insert_documents(docs).await?;

Behaviour

  • Each Embedding produces one row. Document fields are flattened onto that row (or stored under a document field when the serialized doc is not a JSON object), and the embedding vector goes into the table's FixedSizeList<Float32|Float64> column.
  • The embedding column is taken from SearchParams::column when set; otherwise it is auto-detected as the sole FixedSizeList<Float32|Float64> column in the table schema (the same default LanceDB uses for search column inference). An explicit error is returned when zero or multiple candidate columns exist.
  • Supports both Float32 and Float64 embedding columns; embedding dimensionality is validated against the schema before the write is attempted.
  • Uses the arrow_json decoder (ReaderBuilder::build_decoder() followed by Decoder::serialize(&rows)) to decode serialized docs against the non-embedding portion of the schema, then splices the embedding column back in to produce a batch that matches the full table schema. This avoids reimplementing per-column Arrow conversion for every scalar type.
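
For reviewers, the column-resolution and dimension-check rules above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in this PR: `ColType`, `ResolveError`, `resolve_embedding_column` and `check_dim` are hypothetical names, and `ColType` is a simplified stand-in for the Arrow data types rather than the real `arrow_schema::DataType`.

```rust
/// Simplified stand-in for the Arrow types that matter here
/// (hypothetical; not `arrow_schema::DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Utf8,
    Float32,
    Float64,
    /// Element type plus fixed list length.
    FixedSizeList(Box<ColType>, usize),
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ResolveError {
    NoCandidate,
    Ambiguous(Vec<String>),
    UnknownColumn(String),
    NotAVectorColumn(String),
    DimMismatch { expected: usize, got: usize },
}

/// Returns `Some(dim)` when `ty` is FixedSizeList<Float32|Float64, dim>.
fn vector_dim(ty: &ColType) -> Option<usize> {
    match ty {
        ColType::FixedSizeList(inner, n)
            if matches!(**inner, ColType::Float32 | ColType::Float64) =>
        {
            Some(*n)
        }
        _ => None,
    }
}

/// Picks the embedding column: the explicit name when given, otherwise the
/// sole float FixedSizeList column. Zero or several candidates is an error.
fn resolve_embedding_column<'a>(
    schema: &'a [(String, ColType)],
    explicit: Option<&str>,
) -> Result<(&'a str, usize), ResolveError> {
    if let Some(name) = explicit {
        let (col, ty) = schema
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .ok_or_else(|| ResolveError::UnknownColumn(name.to_string()))?;
        let dim = vector_dim(ty)
            .ok_or_else(|| ResolveError::NotAVectorColumn(name.to_string()))?;
        return Ok((col.as_str(), dim));
    }

    let candidates: Vec<(&str, usize)> = schema
        .iter()
        .filter_map(|(n, ty)| vector_dim(ty).map(|d| (n.as_str(), d)))
        .collect();

    match candidates.as_slice() {
        [] => Err(ResolveError::NoCandidate),
        [one] => Ok(*one),
        many => Err(ResolveError::Ambiguous(
            many.iter().map(|(n, _)| n.to_string()).collect(),
        )),
    }
}

/// Validates an embedding's length against the column's fixed size
/// before any write is attempted.
fn check_dim(expected: usize, embedding: &[f64]) -> Result<(), ResolveError> {
    if embedding.len() == expected {
        Ok(())
    } else {
        Err(ResolveError::DimMismatch { expected, got: embedding.len() })
    }
}
```

With the fixture schema from the test plan (id, definition, embedding as FixedSizeList<Float64, 1536>), calling `resolve_embedding_column(&schema, None)` in this sketch resolves to the `embedding` column with dimension 1536, and a schema with two float FixedSizeList columns yields the `Ambiguous` error.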

Implementation notes

  • Added arrow-json = "56" as a dependency (aligned with the workspace's arrow-array = "56").
  • No public API change to LanceDbVectorIndex / SearchParams; the trait impl reads the embedding column name from existing config.
  • RecordBatchIterator + Table::add(...).execute() is used for the write (same API pattern the examples already use for initial table creation).

Test plan

Added insert_documents_round_trip to tests/integration_tests.rs:

  • Seeds a LanceDB table with one row using the existing fixture schema (id, definition, embedding as FixedSizeList<Float64, 1536>).

  • Calls InsertDocuments::insert_documents with two fresh (Word, Embeddings) pairs driven by a mocked OpenAI embeddings endpoint.

  • Asserts that Table::count_rows() grows from 1 to 3 after the insert.

  • Asserts that the two new ids (inserted-1, inserted-2) are returned by top_n via the same VectorStoreIndex handle, which shows that the rows were written to the correct columns and that the embedding column was populated with a valid FixedSizeList vector.

  • New integration test added and covers the happy path.

  • cargo check -p rig-lancedb / cargo test -p rig-lancedb --test integration_tests — unable to run locally (host disk full); happy to rebase / adjust on CI feedback.


kcolbchain / Abhishek Krishna

Mirrors the existing implementations for MongoDB, Postgres, Qdrant, etc.
so callers can push `(Doc, OneOrMany<Embedding>)` pairs directly into a
LanceDB table without hand-building Arrow `RecordBatch`es.

- Document fields are flattened onto each row (or stored under a
  `document` field for non-object docs), matching the table schema.
- The embedding column is taken from `SearchParams::column` when set,
  otherwise auto-detected as the sole `FixedSizeList<Float32|Float64>`
  column (same default LanceDB uses for search).
- Supports both Float32 and Float64 embedding columns; validates vector
  dims against the schema before writing.
- Uses `arrow_json::ReaderBuilder` to decode serialized docs against the
  non-embedding portion of the schema, then splices the embedding column
  back in to produce a batch that matches the full table schema.

Adds an integration test that seeds a LanceDB table with one row, calls
`insert_documents` with two fresh `(Word, Embeddings)` pairs, and
asserts both that `count_rows()` grows and that the new ids are
returned by `top_n`.

— [kcolbchain](https://kcolbchain.com) / [Abhishek Krishna](https://abhishekkrishna.com)