Skip to content

refactor: simplify validator CLI by removing --files parameter#11

Merged
rysb-dev merged 1 commit intomainfrom
refactor/simplify-validator-cli
Feb 4, 2026
Merged

refactor: simplify validator CLI by removing --files parameter#11
rysb-dev merged 1 commit intomainfrom
refactor/simplify-validator-cli

Conversation

@rysb-dev
Copy link
Copy Markdown
Owner

@rysb-dev rysb-dev commented Feb 3, 2026

The validator now progressively loads mjson files until it has enough valid hora events to meet the --samples target. This removes the confusion of needing to specify both --files and --samples where insufficient files would result in fewer samples than requested.

Usage is now simply:
uv run agari_validator.py /path/to/data --samples 5000

The validator now progressively loads mjson files until it has enough
valid hora events to meet the --samples target. This removes the
confusion of needing to specify both --files and --samples where
insufficient files would result in fewer samples than requested.

Usage is now simply:
  python agari_validator.py /path/to/data --samples 5000
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the validator CLI to remove the --files parameter and instead progressively load .mjson files until enough valid hora events are collected to satisfy the --samples target.

Changes:

  • Removed --files CLI argument and associated file-sampling helper.
  • Shuffles discovered .mjson files and incrementally parses them until --samples valid hora events are collected (then samples down if needed).
  • Updates summary/returned stats to report files_processed instead of skipped invalid horas.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1027 to +1031
print(
f"Processed {files_processed} files, extracted {len(all_horas)} valid hora events"
)

# Sample hora events
# Sample down if we got more than requested
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there aren’t enough valid hora events across the dataset to reach num_samples, the script will silently validate fewer than requested. Consider adding an explicit warning when len(all_horas) < num_samples after extraction so users understand the target could not be met.

Copilot uses AI. Check for mistakes.
Comment on lines 1022 to 1024
except Exception as e:
files_processed += 1
if verbose:
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files_processed is incremented in both the success and exception paths, so it effectively counts files attempted rather than successfully parsed. The summary text (“Files processed”) and earlier message (“Processed X files…”) can be misleading; consider renaming to files_attempted or tracking separate files_parsed/files_failed counters.

Copilot uses AI. Check for mistakes.
Comment on lines +994 to +997
# Shuffle files for random sampling
if seed is not None:
random.seed(seed)
random.shuffle(all_files)
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--seed is advertised as providing reproducibility, but Path.rglob() order is filesystem-dependent. With the current implementation, the same seed can still yield different file orders across OS/filesystems. Consider sorting all_files (e.g., all_files.sort()) before random.shuffle(all_files) so seeded runs are deterministic.

Copilot uses AI. Check for mistakes.
@rysb-dev rysb-dev merged commit 507e5c4 into main Feb 4, 2026
14 checks passed
@rysb-dev rysb-dev deleted the refactor/simplify-validator-cli branch February 4, 2026 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants