WSFeatherIndex: replace SQLite with Polars #30

Open
jpc wants to merge 3 commits into jpc/wsds-inspect-head from jpc/feather-index

Member

@jpc jpc commented Feb 6, 2026

📚 PR Stack

| # | Branch | Base | PR |
|---|--------|------|----|
| 1 | jpc/preloading | main | #16 |
| 2 | jpc/validate-shards-in-sql | jpc/preloading | #18 |
| 3 | jpc/shard-pipe | jpc/validate-shards-in-sql | #19 |
| 4 | jpc/sql-without-index | jpc/shard-pipe | #20 |
| 5 | jpc/special-shard-columns | jpc/sql-without-index | #21 |
| 6 | jpc/remove-in-progress | jpc/special-shard-columns | #22 |
| 7 | jpc/fix-jupyter-repr | jpc/remove-in-progress | #23 |
| 8 | jpc/sql-select-dotted | jpc/fix-jupyter-repr | #24 |
| 9 | jpc/optimize-durations | jpc/sql-select-dotted | #25 |
| 10 | jpc/source-links | jpc/optimize-durations | #26 |
| 11 | jpc/audio-qol | jpc/source-links | #27 |
| 12 | jpc/warn-subsampling-in-dataloader | jpc/audio-qol | #28 |
| 13 | jpc/wsds-inspect-head | jpc/warn-subsampling-in-dataloader | #29 |
| 14 | jpc/feather-index | jpc/wsds-inspect-head | #30 ◀️ |
| 15 | jpc/indexer | jpc/feather-index | #31 |
| 16 | jpc/keys-diff | jpc/indexer | #32 |
| 17 | jpc/fields-cleanup | jpc/keys-diff | #33 |
| 18 | jpc/s3-shards | jpc/fields-cleanup | #34 |
| 19 | jpc/indexing-fixes | jpc/s3-shards | #35 |
| 20 | jpc/validate-keys-on-the-fly | jpc/indexing-fixes | #36 |
| 21 | jpc/repr-missing-last | jpc/validate-keys-on-the-fly | #37 |
| 22 | jpc/drop-slots | jpc/repr-missing-last | #38 |

  • WSFeatherIndex: replace SQLite with Polars

Member Author

jpc commented Feb 6, 2026

This is still WIP, not enabled anywhere yet.

@jpc jpc marked this pull request as ready for review February 6, 2026 19:37

# Binary search in sorted name series
idx = self._name_df.select(pl.col('name').search_sorted(file_name, side='right')).item()

if idx >= len(self._names) or self._names[idx] != file_name:
Contributor

self._names is undefined

raise RuntimeError("episode-name-index.feather is required to search by episode name")

# Binary search in sorted name series
idx = self._name_df.select(pl.col('name').search_sorted(file_name, side='right')).item()
Contributor

With side='right', search_sorted returns the index where the element would be inserted, i.e. the position just after the match.

"Find indices where elements should be inserted to maintain order." (Polars documentation)

So the index of the match we are looking for is idx-1, not idx.
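
To illustrate the off-by-one described in this comment, here is a minimal sketch that uses the standard library's `bisect_right` as a stand-in for `search_sorted(..., side='right')`. It assumes the two follow the same insertion-point convention. The helper `find_exact` and the sample names are hypothetical and are not part of this PR.

```python
from bisect import bisect_right


def find_exact(sorted_names, target):
    """Return the position of target in sorted_names, or None if absent.

    Hypothetical helper: bisect_right stands in for a right-sided
    sorted search; it is not the code under review.
    """
    # Insertion point to the right of any entries equal to target.
    idx = bisect_right(sorted_names, target)
    # If target is present, its last occurrence sits one slot earlier.
    pos = idx - 1
    if pos < 0 or sorted_names[pos] != target:
        return None
    return pos


names = ["ep-001", "ep-002", "ep-005"]
print(find_exact(names, "ep-002"))  # 1
print(find_exact(names, "ep-003"))  # None
```

Checking `sorted_names[idx]` directly, as the reviewed snippet does, would compare against the element after the match and can also run past the end of the list when the match is the last entry.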

# Binary search for the shard containing this index
# search_sorted with side="right" returns index where global_index would be inserted
# We want the shard where global_offset <= global_index, so subtract 1
idx = self._shard_df.select(pl.col('segment_id').search_sorted(global_index, side='right')).item()
Contributor

Same issue as below
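
For the shard lookup, the same reasoning can be sketched with plain Python. This assumes each shard is described by the global index of its first sample, sorted ascending. The function `locate` and the offsets below are hypothetical, and `bisect_right` again stands in for a right-sided sorted search.

```python
from bisect import bisect_right


def locate(shard_offsets, global_index):
    """Map a global sample index to (shard_number, index_within_shard).

    shard_offsets[i] is assumed to be the global index of the first
    sample in shard i, sorted ascending and starting at 0.
    """
    if global_index < 0:
        raise IndexError("global_index must be non-negative")
    # Insertion point to the right of any offset equal to global_index.
    idx = bisect_right(shard_offsets, global_index)
    # The containing shard is the last one whose offset <= global_index.
    shard = idx - 1
    return shard, global_index - shard_offsets[shard]


offsets = [0, 100, 250]
print(locate(offsets, 99))   # (0, 99)
print(locate(offsets, 100))  # (1, 0)
```

The sketch does not check the upper bound, because that needs the total sample count, which is not part of the offsets list.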

Contributor

@shahbaz-humeai shahbaz-humeai left a comment

Approving, but we should address the comments before merging.

Member Author

jpc commented Mar 12, 2026

Thanks. I think we could skip this PR, since we are unlikely to actually use this code. I was able to remove the SQLite creation bottleneck.

The main remaining benefit of this code is the ability to lazily download the relatively large key index. But we could get the same effect by splitting the SQLite index into two databases.
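
A rough sketch of what such a split might look like, assuming a small shard database that is opened eagerly and a large key database that is only fetched on the first lookup by name. The class `SplitIndex`, the `fetch_key_db` callable, and the table and column names are all hypothetical; nothing here reflects the actual wsds schema.

```python
import sqlite3


class SplitIndex:
    """Hypothetical two-file index with a lazily fetched key database."""

    def __init__(self, shard_db_path, fetch_key_db):
        # The small shard database is opened immediately.
        self._shards = sqlite3.connect(shard_db_path)
        # fetch_key_db is a callable that downloads the key database
        # (if needed) and returns its local path.
        self._fetch_key_db = fetch_key_db
        self._keys = None

    def shard_count(self):
        # Served entirely from the small database.
        return self._shards.execute("SELECT COUNT(*) FROM shards").fetchone()[0]

    def lookup(self, name):
        # The large key database is fetched only on first use.
        if self._keys is None:
            self._keys = sqlite3.connect(self._fetch_key_db())
        row = self._keys.execute(
            "SELECT global_index FROM keys WHERE name = ?", (name,)
        ).fetchone()
        return None if row is None else row[0]
```

With this layout, callers that only iterate over shards never trigger the download of the key database.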
