WSSample: print fields with missing shards last by jpc · Pull Request #37 · HumeAI/wsds

jpc · 2026-02-06T18:23:21Z

📚 PR Stack

| # | Branch | Base | PR |
|---|---|---|---|
| 1 | `jpc/preloading` | `main` | #16 |
| 2 | `jpc/validate-shards-in-sql` | `jpc/preloading` | #18 |
| 3 | `jpc/shard-pipe` | `jpc/validate-shards-in-sql` | #19 |
| 4 | `jpc/sql-without-index` | `jpc/shard-pipe` | #20 |
| 5 | `jpc/special-shard-columns` | `jpc/sql-without-index` | #21 |
| 6 | `jpc/remove-in-progress` | `jpc/special-shard-columns` | #22 |
| 7 | `jpc/fix-jupyter-repr` | `jpc/remove-in-progress` | #23 |
| 8 | `jpc/sql-select-dotted` | `jpc/fix-jupyter-repr` | #24 |
| 9 | `jpc/optimize-durations` | `jpc/sql-select-dotted` | #25 |
| 10 | `jpc/source-links` | `jpc/optimize-durations` | #26 |
| 11 | `jpc/audio-qol` | `jpc/source-links` | #27 |
| 12 | `jpc/warn-subsampling-in-dataloader` | `jpc/audio-qol` | #28 |
| 13 | `jpc/wsds-inspect-head` | `jpc/warn-subsampling-in-dataloader` | #29 |
| 14 | `jpc/feather-index` | `jpc/wsds-inspect-head` | #30 |
| 15 | `jpc/indexer` | `jpc/feather-index` | #31 |
| 16 | `jpc/keys-diff` | `jpc/indexer` | #32 |
| 17 | `jpc/fields-cleanup` | `jpc/keys-diff` | #33 |
| 18 | `jpc/s3-shards` | `jpc/fields-cleanup` | #34 |
| 19 | `jpc/indexing-fixes` | `jpc/s3-shards` | #35 |
| 20 | `jpc/validate-keys-on-the-fly` | `jpc/indexing-fixes` | #36 |
| 21 | `jpc/repr-missing-last` | `jpc/validate-keys-on-the-fly` | #37 ◀️ |
| 22 | `jpc/drop-slots` | `jpc/repr-missing-last` | #38 |

WSSample: print fields with missing shards last
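
As a rough illustration of the behavior the title describes, the sketch below reorders a sample's fields so that those whose shard is unavailable are printed after the ones that loaded. This is not the wsds implementation: `order_fields_for_display`, `format_sample`, the `(name, value)` tuple layout, and the `MISSING` sentinel are all hypothetical names chosen for this example.

```python
# Hypothetical sentinel standing in for "the shard backing this field is missing".
MISSING = object()


def order_fields_for_display(fields):
    """Return fields with available values first and missing ones last.

    `fields` is a list of (name, value) tuples. sorted() is stable, so the
    original relative order is preserved inside each group.
    """
    return sorted(fields, key=lambda field: field[1] is MISSING)


def format_sample(fields):
    """Render one line per field, marking missing values explicitly."""
    lines = []
    for name, value in order_fields_for_display(fields):
        shown = "<missing shard>" if value is MISSING else repr(value)
        lines.append(f"{name}: {shown}")
    return "\n".join(lines)


fields = [
    ("audio", MISSING),
    ("transcript", "hello world"),
    ("embedding", MISSING),
    ("duration", 1.5),
]
print(format_sample(fields))
```

Sorting on a boolean key keeps the change small: no field is dropped, and the only visible difference is that the unavailable entries move to the end of the printout.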

* Always keep fields as a list of tuples.
* WSS3Shard: remote audio shards on S3 (#34)
* WSDataset: scan the dataset folder even if the index contains a field list
* WSShardInterface: remove source_dataset from the from_link interface
* WSS3Shard: remote audio shards on S3
* pupyarrow: a pure-Python PyArrow implementation with good lazy-loading support
* WSDataset: added an ignore_index option; ws_indexer: improved error handling; ws_indexer: ensure we use relative shard paths when possible (#35)
* WSDataset: added an ignore_index option
* ws_indexer: improved error handling
* ws_indexer: ensure we use relative shard paths when possible
* WSSample: always validate that keys match across subdirs (#36)
* WSSample: always validate that keys match across subdirs
* WSSample: print fields with missing shards last (#37)
* WSSample: print fields with missing shards last
* WSAudio: disable slots because it breaks code auto-reload (#38)
* WSAudio: disable slots because it breaks code auto-reload
* Added WSModalShard (#41)
* Extract audio codec layer from ws_audio.py into audio_codec.py
  Separates codec concerns (decoder backends, encoder, format utils) from the data model layer (AudioReader, WSAudio) for better reusability and testability.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add ModalFileReader for Modal Volume range requests
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Centralize binary column decoding into ws_decode module
  Extract duplicated npy/pyd/txt/audio decode logic from WSShard and WSS3Shard into a shared decode_sample() function. Dispatch is now based on column type (binary) rather than column-name heuristics.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Added WSModalShard
* Big renaming and cleanups (#45)
* Move index SQL queries from WSDataset into WSIndex
  - Add old/new index format detection (partition vs dataset_path columns)
  - Add _partition_col property for unified SQL partition expression
  - Add lookup_by_index() and lookup_by_key() methods to WSIndex
  - Add shard_n_samples() and shard_global_offset() via _query_shard() helper
  - Simplify WSDataset.__getitem__ to delegate to WSIndex lookups
  - Replace all raw index.query() calls in WSDataset with WSIndex methods
  - Update ws_tools.py to use new format detection
* Fix env var name in README: WSDS_DATASET_PATH → WSDS_DATASET_SEARCH_PATH
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Move is_notebook() to utils, guard _ipython_display_ for terminal use
  - Extract is_notebook() from convplayer.py into utils.py (simplified)
  - Remove redundant _ipython_display_ from AudioReader and WSAudio (IPython already calls _repr_html_ automatically)
  - Add is_notebook() guard to WSDataset and WSSample _ipython_display_ so they fall back to print() in terminal IPython sessions
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Improve naming consistency across codebase
  - subdir → column_dir (utils, ws_sample, ws_modal_shard)
  - shard_name → shard_ref on shard interfaces and WSSample
  - dataset_path → partition in index and shard code
  - dataset_dir → dataset_root in ws_indexer
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add get_audio() helper in ws_decode, use in WSSample
  Centralizes audio column lookup logic so it can be reused outside of WSSample (e.g. from plain dicts or other sample types).
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add rng parameter to WSDataset for reproducible sampling
  Allows passing rng=42 (or a Random instance) to get deterministic sample ordering in random_sample() and sql_select(). Also removes unused needs_key variable.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix hume_wsds module path remapping and sql_filter pl.first() usage
  - Remap hume_wsds.* loader paths to wsds.* for backward compatibility with old index files that reference the former package name
  - Use pl.first() instead of exprs[0] in sql_filter for correctness
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update module docstring with rich examples and fix doctests
  - Add comprehensive docstring to __init__.py with working doctests showcasing SQL queries, random access, lazy loading, and audio
  - Fix ws_dataset.py doctests (AudioReader src type, add shard_subsample=1)
  - Fix ws_sink.py doctest (remove invalid batch_size param)
  - Update tests.py to run wsds module doctests and fix imports
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Apply ruff formatting
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Moved the library showcase to README.md
* Use shard ref

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>
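
The "rng parameter" commit above says that either a seed such as `rng=42` or a `Random` instance can be passed to get deterministic ordering. One common way to support both is to normalize the argument into a `random.Random` up front. The sketch below shows that pattern using only the standard library; `resolve_rng` and `sample_indices` are hypothetical helpers for illustration and are not taken from the wsds code.

```python
import random


def resolve_rng(rng=None):
    """Normalize an int seed, a random.Random, or None into a random.Random.

    Hypothetical helper: an existing Random instance is reused as-is, an int
    seeds a fresh generator deterministically, and None falls back to a
    generator seeded by the interpreter.
    """
    if isinstance(rng, random.Random):
        return rng
    return random.Random(rng)


def sample_indices(n_items, k, rng=None):
    """Pick k distinct indices out of range(n_items) using the given rng."""
    return resolve_rng(rng).sample(range(n_items), k)


# The same seed yields the same selection on every call.
first = sample_indices(100, 5, rng=42)
second = sample_indices(100, 5, rng=42)
print(first == second)
```

Passing a shared `random.Random` instance instead of a seed lets several calls draw from one stream, so consecutive samples differ from each other while the whole run stays reproducible.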

WSSample: print fields with missing shards last

feff1f6

jpc marked this pull request as ready for review February 6, 2026 19:35

shahbaz-humeai approved these changes Mar 11, 2026


shahbaz-humeai merged commit da238e8 into jpc/validate-keys-on-the-fly Mar 13, 2026

shahbaz-humeai merged 2 commits into `jpc/validate-keys-on-the-fly` from `jpc/repr-missing-last`
