WSFeatherIndex: replace SQLite with Polars #30
jpc wants to merge 3 commits into jpc/wsds-inspect-head
This is still WIP and not enabled anywhere yet.
    # Binary search in sorted name series
    idx = self._name_df.select(pl.col('name').search_sorted(file_name, side='right')).item()

    if idx >= len(self._names) or self._names[idx] != file_name:
self._names is undefined
    raise RuntimeError("episode-name-index.feather is required to search by episode name")

    # Binary search in sorted name series
    idx = self._name_df.select(pl.col('name').search_sorted(file_name, side='right')).item()
search_sorted returns the index at which the element would be inserted; with side='right' that is the position just after a match.
The Polars documentation describes it as: "Find indices where elements should be inserted to maintain order."
So the index of the match we are looking for is idx-1, not idx.
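To illustrate the off-by-one described in this comment, here is a minimal sketch using Python's standard `bisect` module as a stand-in for Polars' `search_sorted`. It is not the code from this PR: `find_exact` and the sample names are hypothetical, and it assumes `bisect_right` mirrors the side='right' insertion-point behaviour discussed above.

```python
from bisect import bisect_right


def find_exact(sorted_names, name):
    """Return the index of `name` in `sorted_names`, or None if it is absent.

    `sorted_names` must be sorted in ascending order.
    """
    # bisect_right returns the insertion point *after* any equal element,
    # so a match, if present, sits one position to the left.
    idx = bisect_right(sorted_names, name)
    if idx == 0 or sorted_names[idx - 1] != name:
        return None
    return idx - 1


# Hypothetical, already-sorted episode names.
names = ["ep-001", "ep-002", "ep-004"]
```

With these sample names, the insertion point for "ep-002" is 2, while the matching row is at index 1, which is why the check has to look at idx-1.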
    # Binary search for the shard containing this index
    # search_sorted with side="right" returns index where global_index would be inserted
    # We want the shard where global_offset <= global_index, so subtract 1
    idx = self._shard_df.select(pl.col('segment_id').search_sorted(global_index, side='right')).item()
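The "subtract 1" step in the code comment above can be sketched with the standard `bisect` module. This is an illustration only, not the PR's implementation: `find_shard` and `shard_offsets` (the first global index of each shard, sorted, starting at 0) are hypothetical names.

```python
from bisect import bisect_right


def find_shard(shard_offsets, global_index):
    """Map a global sample index to (shard number, index within that shard).

    `shard_offsets` holds the first global index of each shard, in ascending
    order, with shard_offsets[0] == 0.
    """
    if not shard_offsets or global_index < 0:
        raise IndexError(f"global index {global_index} is out of range")
    # bisect_right gives the insertion point after any equal offset; the shard
    # whose offset is <= global_index is therefore one position to the left.
    shard = bisect_right(shard_offsets, global_index) - 1
    return shard, global_index - shard_offsets[shard]


# Hypothetical layout: three shards starting at global indices 0, 100 and 250.
offsets = [0, 100, 250]
```

Note that this sketch does not check an upper bound, because the offsets alone do not say how long the last shard is.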
shahbaz-humeai left a comment
Approving, but we should address the comments before merging.
Thanks. I think we could skip this PR, since we are unlikely to actually use this code. I was able to remove the SQLite creation bottleneck. The main remaining benefit of this code is the ability to lazily download the relatively huge key index, but we could get the same effect by splitting the SQLite database into two.
📚 PR Stack
jpc/preloading -> main
jpc/validate-shards-in-sql -> jpc/preloading
jpc/shard-pipe -> jpc/validate-shards-in-sql
jpc/sql-without-index -> jpc/shard-pipe
jpc/special-shard-columns -> jpc/sql-without-index
jpc/remove-in-progress -> jpc/special-shard-columns
jpc/fix-jupyter-repr -> jpc/remove-in-progress
jpc/sql-select-dotted -> jpc/fix-jupyter-repr
jpc/optimize-durations -> jpc/sql-select-dotted
jpc/source-links -> jpc/optimize-durations
jpc/audio-qol -> jpc/source-links
jpc/warn-subsampling-in-dataloader -> jpc/audio-qol
jpc/wsds-inspect-head -> jpc/warn-subsampling-in-dataloader
jpc/feather-index -> jpc/wsds-inspect-head
jpc/indexer -> jpc/feather-index
jpc/keys-diff -> jpc/indexer
jpc/fields-cleanup -> jpc/keys-diff
jpc/s3-shards -> jpc/fields-cleanup
jpc/indexing-fixes -> jpc/s3-shards
jpc/validate-keys-on-the-fly -> jpc/indexing-fixes
jpc/repr-missing-last -> jpc/validate-keys-on-the-fly
jpc/drop-slots -> jpc/repr-missing-last