refactor: simplify validator CLI by removing --files parameter#11
Conversation
The validator now progressively loads .mjson files until it has enough valid hora events to meet the --samples target. This removes the confusion of needing to specify both --files and --samples, where too few files would yield fewer samples than requested. Usage is now simply: python agari_validator.py /path/to/data --samples 5000
Pull request overview
Refactors the validator CLI to remove the --files parameter and instead progressively load .mjson files until enough valid hora events are collected to satisfy the --samples target.
Changes:
- Removed the `--files` CLI argument and the associated file-sampling helper.
- Shuffles discovered `.mjson` files and incrementally parses them until `--samples` valid `hora` events are collected (then samples down if needed).
- Updates the summary/returned stats to report `files_processed` instead of skipped invalid horas.
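The progressive-loading behavior described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: `collect_horas` and `extract_horas` are hypothetical names standing in for the validator's internals.

```python
import random


def collect_horas(files, extract_horas, num_samples, seed=None):
    """Parse shuffled files until enough valid hora events are collected.

    `extract_horas` stands in for the validator's per-file parser and
    returns a list of hora events for one file.
    """
    files = list(files)
    if seed is not None:
        random.seed(seed)
    random.shuffle(files)

    all_horas = []
    files_processed = 0
    for path in files:
        if len(all_horas) >= num_samples:
            break  # stop early once the target is met
        all_horas.extend(extract_horas(path))
        files_processed += 1

    # Sample down if we got more than requested
    if len(all_horas) > num_samples:
        all_horas = random.sample(all_horas, num_samples)
    return all_horas, files_processed
```

Because the loop breaks as soon as the target is met, only as many files are parsed as necessary, which is the point of dropping `--files`.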
```diff
  print(
      f"Processed {files_processed} files, extracted {len(all_horas)} valid hora events"
  )

- # Sample hora events
+ # Sample down if we got more than requested
```
If there aren't enough valid hora events across the dataset to reach `num_samples`, the script will silently validate fewer than requested. Consider adding an explicit warning when `len(all_horas) < num_samples` after extraction so users understand the target could not be met.
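A minimal sketch of such a warning, assuming the `all_horas` and `num_samples` names from the diff (the helper name is hypothetical):

```python
import warnings


def check_sample_target(all_horas: list, num_samples: int) -> None:
    """Warn when extraction could not reach the requested sample count."""
    if len(all_horas) < num_samples:
        warnings.warn(
            f"Only {len(all_horas)} valid hora events found; "
            f"{num_samples} were requested. Validating all available events.",
            stacklevel=2,
        )
```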
```python
except Exception as e:
    files_processed += 1
    if verbose:
```
`files_processed` is incremented in both the success and exception paths, so it effectively counts files attempted rather than successfully parsed. The summary text ("Files processed") and earlier message ("Processed X files…") can be misleading; consider renaming to `files_attempted` or tracking separate `files_parsed`/`files_failed` counters.
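One way to track the two counters separately, sketched with a hypothetical `parse_files` helper (`parse_one` stands in for the validator's per-file parser):

```python
def parse_files(paths, parse_one, verbose=False):
    """Parse each path, counting successes and failures separately.

    Returns (results, files_parsed, files_failed) so the summary can
    report both instead of one ambiguous "processed" count.
    """
    results = []
    files_parsed = 0
    files_failed = 0
    for path in paths:
        try:
            results.append(parse_one(path))
            files_parsed += 1
        except Exception as e:
            files_failed += 1
            if verbose:
                print(f"Failed to parse {path}: {e}")
    return results, files_parsed, files_failed
```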
```python
# Shuffle files for random sampling
if seed is not None:
    random.seed(seed)
random.shuffle(all_files)
```
`--seed` is advertised as providing reproducibility, but `Path.rglob()` order is filesystem-dependent. With the current implementation, the same seed can still yield different file orders across OS/filesystems. Consider sorting `all_files` (e.g., `all_files.sort()`) before `random.shuffle(all_files)` so seeded runs are deterministic.
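The fix suggested above could look like this (the `discover_files` helper is illustrative, not the PR's actual code):

```python
import random
from pathlib import Path


def discover_files(root, seed=None):
    """Collect .mjson files in a deterministic order, then shuffle.

    Sorting first makes the seeded shuffle reproducible regardless of
    the filesystem's rglob() iteration order.
    """
    all_files = sorted(Path(root).rglob("*.mjson"))  # stable across platforms
    if seed is not None:
        random.seed(seed)
    random.shuffle(all_files)
    return all_files
```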