feat(lancedb): implement InsertDocuments for LanceDbVectorIndex #1658
Open
abhicris wants to merge 1 commit into 0xPlaygrounds:main from
Mirrors the existing implementations for MongoDB, Postgres, Qdrant, etc. so callers can push `(Doc, OneOrMany<Embedding>)` pairs directly into a LanceDB table without hand-building Arrow `RecordBatch`es.

- Document fields are flattened onto each row (or stored under a `document` field for non-object docs), matching the table schema.
- The embedding column is taken from `SearchParams::column` when set, otherwise auto-detected as the sole `FixedSizeList<Float32|Float64>` column (same default LanceDB uses for search).
- Supports both Float32 and Float64 embedding columns; validates vector dims against the schema before writing.
- Uses `arrow_json::ReaderBuilder` to decode serialized docs against the non-embedding portion of the schema, then splices the embedding column back in to produce a batch that matches the full table schema.

Adds an integration test that seeds a LanceDB table with one row, calls `insert_documents` with two fresh `(Word, Embeddings)` pairs, and asserts both that `count_rows()` grows and that the new ids are returned by `top_n`.

[kcolbchain](https://kcolbchain.com) / [Abhishek Krishna](https://abhishekkrishna.com)
Summary

Implements the `InsertDocuments` trait for `LanceDbVectorIndex`, matching the existing implementations for MongoDB / Postgres / Qdrant / Milvus / ScyllaDB / SurrealDB / HelixDB / S3 Vectors / SQLite. LanceDB was one of only two integrations still missing this trait (Neo4j is the other, and is covered by #1636).

Before this PR, users had to hand-build Arrow `RecordBatch`es to insert data into LanceDB; they can now call `insert_documents` with `(Doc, OneOrMany<Embedding>)` pairs directly.

Behaviour
- Each `Embedding` produces one row. Document fields are flattened onto that row (or stored under a `document` field when the serialized doc is not a JSON object), and the embedding vector goes into the table's `FixedSizeList<Float32|Float64>` column.
- The embedding column is taken from `SearchParams::column` when set; otherwise it is auto-detected as the sole `FixedSizeList<Float32|Float64>` column in the table schema (the same default LanceDB uses for search column inference). An explicit error is returned when zero or multiple candidate columns exist.
- Supports both `Float32` and `Float64` embedding columns; embedding dimensionality is validated against the schema before the write is attempted.
- Uses `arrow_json::ReaderBuilder::build_decoder().serialize(&rows)` to decode serialized docs against the non-embedding portion of the schema, then splices the embedding column back in to produce a batch that matches the full table schema. This avoids having to reimplement per-column Arrow conversion for every scalar type.

Implementation notes
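As a rough illustration of the embedding-column rule and the dimension check described under Behaviour, here is a minimal, std-only sketch. It is not the code in this PR: `ColumnType`, `embedding_len`, `select_embedding_column`, and `validate_dims` are hypothetical names, and the enum is a simplified stand-in for Arrow's schema types so the example compiles without the `arrow` or `lancedb` crates.

```rust
/// Simplified, hypothetical stand-in for an Arrow column type.
/// The real implementation would inspect the table's Arrow schema instead.
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    Utf8,
    Float32,
    Float64,
    /// Fixed-size list holding `len` values of type `item` per row.
    FixedSizeList { item: Box<ColumnType>, len: usize },
}

/// Returns the vector length if the column is a fixed-size list of
/// Float32 or Float64 values, i.e. a candidate embedding column.
fn embedding_len(ty: &ColumnType) -> Option<usize> {
    match ty {
        ColumnType::FixedSizeList { item, len }
            if matches!(**item, ColumnType::Float32 | ColumnType::Float64) =>
        {
            Some(*len)
        }
        _ => None,
    }
}

/// Picks the embedding column and its dimensionality.
/// An explicit name wins; otherwise the schema must contain exactly one
/// candidate column, and zero or several candidates are reported as errors.
fn select_embedding_column(
    schema: &[(&str, ColumnType)],
    explicit: Option<&str>,
) -> Result<(String, usize), String> {
    if let Some(name) = explicit {
        let (_, ty) = schema
            .iter()
            .find(|(n, _)| *n == name)
            .ok_or_else(|| format!("column `{}` not found in schema", name))?;
        let len = embedding_len(ty)
            .ok_or_else(|| format!("column `{}` is not a float FixedSizeList", name))?;
        return Ok((name.to_string(), len));
    }

    let candidates: Vec<(&str, usize)> = schema
        .iter()
        .filter_map(|(n, ty)| embedding_len(ty).map(|len| (*n, len)))
        .collect();

    match candidates.as_slice() {
        [(name, len)] => Ok((name.to_string(), *len)),
        [] => Err("no candidate embedding column in schema".to_string()),
        _ => Err(format!(
            "{} candidate embedding columns; set one explicitly",
            candidates.len()
        )),
    }
}

/// Checks every vector against the dimensionality declared by the schema,
/// so a mismatch is caught before any write is attempted.
fn validate_dims(vectors: &[Vec<f64>], expected: usize) -> Result<(), String> {
    for (i, v) in vectors.iter().enumerate() {
        if v.len() != expected {
            return Err(format!(
                "embedding {} has {} dims, expected {}",
                i,
                v.len(),
                expected
            ));
        }
    }
    Ok(())
}
```

With a schema shaped like the test fixture (`id`, `definition`, and a 1536-wide `Float64` embedding column), `select_embedding_column(&schema, None)` would resolve to the `embedding` column, and `validate_dims` would reject any vector whose length differs from 1536.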
- Adds `arrow-json = "56"` as a dependency (aligned with the workspace's `arrow-array = "56"`).
- No changes to `LanceDbVectorIndex` / `SearchParams`; the trait impl reads the embedding column name from existing config.
- `RecordBatchIterator` + `Table::add(...).execute()` is used for the write (same API pattern the examples already use for initial table creation).

Test plan
Added `insert_documents_round_trip` to `tests/integration_tests.rs`:

- Seeds a LanceDB table with one row using the existing fixture schema (id, definition, embedding as `FixedSizeList<Float64, 1536>`).
- Calls `InsertDocuments::insert_documents` with two fresh `(Word, Embeddings)` pairs driven by a mocked OpenAI embeddings endpoint.
- Asserts that `Table::count_rows()` grows from 1 to 3 after the insert.
- Asserts that the two new ids (`inserted-1`, `inserted-2`) are returned by `top_n` via the same `VectorStoreIndex` handle, proving that the rows were written to the correct columns and the embedding column was populated with a valid FSL vector.

Checklist:

- New integration test added and covers the happy path.
- `cargo check -p rig-lancedb` / `cargo test -p rig-lancedb --test integration_tests`: unable to run locally (host disk full); happy to rebase / adjust on CI feedback.

kcolbchain / Abhishek Krishna