WSFeatherIndex: replace SQLite with Polars by jpc · Pull Request #30 · HumeAI/wsds

jpc · 2026-02-06T18:22:50Z

📚 PR Stack

#	Branch	Base	PR
1	`jpc/preloading`	`main`	#16
2	`jpc/validate-shards-in-sql`	`jpc/preloading`	#18
3	`jpc/shard-pipe`	`jpc/validate-shards-in-sql`	#19
4	`jpc/sql-without-index`	`jpc/shard-pipe`	#20
5	`jpc/special-shard-columns`	`jpc/sql-without-index`	#21
6	`jpc/remove-in-progress`	`jpc/special-shard-columns`	#22
7	`jpc/fix-jupyter-repr`	`jpc/remove-in-progress`	#23
8	`jpc/sql-select-dotted`	`jpc/fix-jupyter-repr`	#24
9	`jpc/optimize-durations`	`jpc/sql-select-dotted`	#25
10	`jpc/source-links`	`jpc/optimize-durations`	#26
11	`jpc/audio-qol`	`jpc/source-links`	#27
12	`jpc/warn-subsampling-in-dataloader`	`jpc/audio-qol`	#28
13	`jpc/wsds-inspect-head`	`jpc/warn-subsampling-in-dataloader`	#29
14	`jpc/feather-index`	`jpc/wsds-inspect-head`	#30 ◀️
15	`jpc/indexer`	`jpc/feather-index`	#31
16	`jpc/keys-diff`	`jpc/indexer`	#32
17	`jpc/fields-cleanup`	`jpc/keys-diff`	#33
18	`jpc/s3-shards`	`jpc/fields-cleanup`	#34
19	`jpc/indexing-fixes`	`jpc/s3-shards`	#35
20	`jpc/validate-keys-on-the-fly`	`jpc/indexing-fixes`	#36
21	`jpc/repr-missing-last`	`jpc/validate-keys-on-the-fly`	#37
22	`jpc/drop-slots`	`jpc/repr-missing-last`	#38

WSFeatherIndex: replace SQLite with Polars

jpc · 2026-02-06T19:37:12Z

This is still WIP, not enabled anywhere yet.

shahbaz-humeai · 2026-03-11T20:00:48Z

wsds/ws_feather_index.py

+        # Binary search in sorted name series
+        idx = self._name_df.select(pl.col('name').search_sorted(file_name, side='right')).item()
+
+        if idx >= len(self._names) or self._names[idx] != file_name:


`self._names` is undefined here; the lookup on the previous line uses `self._name_df`.

shahbaz-humeai · 2026-03-11T20:06:24Z

wsds/ws_feather_index.py

+            raise RuntimeError("episode-name-index.feather is required to search by episode name")
+
+        # Binary search in sorted name series
+        idx = self._name_df.select(pl.col('name').search_sorted(file_name, side='right')).item()


With `side='right'`, `search_sorted` returns the index at which the element would be inserted, which is the position just after a match.

"Find indices where elements should be inserted to maintain order." (Polars documentation)

So the match we are looking for is at `idx - 1`, not `idx`.

shahbaz-humeai · 2026-03-11T20:08:06Z

wsds/ws_feather_index.py

+        # Binary search for the shard containing this index
+        # search_sorted with side="right" returns index where global_index would be inserted
+        # We want the shard where global_offset <= global_index, so subtract 1
+        idx = self._shard_df.select(pl.col('segment_id').search_sorted(global_index, side='right')).item()


Same `search_sorted` issue as in the episode-name lookup.
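For the range lookup in the quoted diff, a small sketch of why the code comment subtracts 1. It assumes the sorted column holds each shard's starting global offset; `locate`, `shard_offsets`, and `total` are hypothetical names, and `bisect_right` again stands in for `search_sorted(..., side='right')`.

```python
from bisect import bisect_right


def locate(shard_offsets, total, global_index):
    """Map a global sample index to (shard number, index within that shard).

    `shard_offsets` is a sorted list of each shard's first global index and
    `total` is the overall sample count; both are assumed inputs.
    """
    if not 0 <= global_index < total:
        raise IndexError(f"global index {global_index} out of range")
    # The right insertion point is one past the last offset <= global_index,
    # so the containing shard is the entry just before it.
    shard = bisect_right(shard_offsets, global_index) - 1
    return shard, global_index - shard_offsets[shard]


offsets = [0, 100, 250]  # three shards starting at these global indices
total = 300
```

Unlike the exact-match case, a range lookup does not compare the found element for equality, so the bounds check has to happen before the search.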

shahbaz-humeai

Approving, but we should address the comments before merging.

jpc · 2026-03-12T11:57:07Z

Thanks. I think we could skip this PR, since we are unlikely to actually use this code. I was able to remove the SQLite creation bottleneck.

The main remaining benefit of this code is being able to lazily download the relatively huge key index. But we could get the same effect by splitting the SQLite index into two databases.
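One way the two-database idea could look, as a rough sketch only: a small shard database is opened eagerly and the large key database is attached on first use. The class `SplitIndex`, the file names, and the table and column names are all hypothetical; the actual wsds schema is not shown in this thread, and the download step is left as a comment.

```python
import os
import sqlite3


class SplitIndex:
    """Hypothetical two-file index: a small shard table opened up front,
    and a large key table attached only when a key lookup needs it."""

    def __init__(self, shards_path, keys_path):
        self._conn = sqlite3.connect(shards_path)
        self._keys_path = keys_path
        self.keys_attached = False

    def shard_count(self):
        return self._conn.execute("SELECT COUNT(*) FROM shards").fetchone()[0]

    def _ensure_keys(self):
        if not self.keys_attached:
            # A real implementation would fetch the key file here if missing.
            self._conn.execute("ATTACH DATABASE ? AS keydb", (self._keys_path,))
            self.keys_attached = True

    def lookup(self, name):
        self._ensure_keys()
        row = self._conn.execute(
            "SELECT pos FROM keydb.names WHERE name = ?", (name,)
        ).fetchone()
        return None if row is None else row[0]


def build_demo(directory):
    """Create two tiny demo databases and return their paths."""
    shards_path = os.path.join(directory, "shards.sqlite")
    keys_path = os.path.join(directory, "keys.sqlite")
    conn = sqlite3.connect(shards_path)
    conn.execute("CREATE TABLE shards (name TEXT, n_samples INTEGER)")
    conn.executemany("INSERT INTO shards VALUES (?, ?)", [("s0", 2), ("s1", 1)])
    conn.commit()
    conn.close()
    conn = sqlite3.connect(keys_path)
    conn.execute("CREATE TABLE names (name TEXT PRIMARY KEY, pos INTEGER)")
    conn.executemany(
        "INSERT INTO names VALUES (?, ?)", [("ep_a", 0), ("ep_b", 1), ("ep_c", 2)]
    )
    conn.commit()
    conn.close()
    return shards_path, keys_path
```

Shard-level queries never touch the key file in this layout, so callers that only iterate shards would not pay for the large download.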

WSFeatherIndex: replace SQLite with Polars

a1d0471

jpc marked this pull request as ready for review February 6, 2026 19:37

shahbaz-humeai reviewed Mar 11, 2026

shahbaz-humeai approved these changes Mar 11, 2026

Added a new Polars-based indexing implementation

bf3641d

wsds diff_keys: added a tool for checking key missalignment (#32)

a01beb2

jpc wants to merge 3 commits into `jpc/wsds-inspect-head` from `jpc/feather-index`
