Data Generation Second Phase - service.py cache fix and job scripts add #138
Pull Request Overview
This pull request implements a cache fix in the data generation service and adds a new job command for dataset cleaning as part of the second-phase data generation effort. Key changes include:
- Updating the cache logic in service.py to avoid raising an error on cache misses.
- Adjusting the task preference inclusion probabilities in the response mapper.
- Adding a new CLI command (clean_dataset) along with accompanying documentation in benchmarks/README.md.
Reviewed Changes
Copilot reviewed 5 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| benchmarks/README.md | Updated the dataset filename used in the testing/example command. |
| aif_gen/generate/service.py | Changed cache lookup logic to only process non-None cache output. |
| aif_gen/api/response_mapper/response_mapper.py | Adjusted the preference inclusion probabilities. |
| aif_gen/_cli/main.py | Added the clean_dataset command to the CLI. |
| aif_gen/_cli/commands/clean.py | Introduced a new command to clean datasets by removing specified words. |
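To make the cleaning step concrete, here is a minimal sketch of what a word-removal cleaner might look like. The function name and signature are hypothetical illustrations, not the actual code in `aif_gen/_cli/commands/clean.py`:

```python
import re

def remove_words(text: str, words: list[str]) -> str:
    """Remove whole-word occurrences of each given word from the text.

    Hypothetical sketch of the clean_dataset behavior; collapses the
    whitespace left behind after a word is removed.
    """
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    cleaned = re.sub(pattern, '', text)
    return re.sub(r'\s+', ' ', cleaned).strip()
```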
Files not reviewed (3)
- jobs/generate_all_downsampled.sh: Language not supported
- jobs/generate_all_static.sh: Language not supported
- jobs/validate_all_static.sh: Language not supported
Comments suppressed due to low confidence (1)
aif_gen/generate/service.py:536
- Previously an error was raised for a missing cache entry; please ensure that the updated behavior (doing nothing if output is None) is intended, or consider adding logging or alternative handling for cache misses.
```python
if output is not None:
```
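One way to act on this suggestion is to log the miss instead of silently skipping it. The helper name and dict-shaped cache below are assumptions for illustration, not the actual `service.py` API:

```python
import logging

def read_cache(cache: dict, key: str):
    """Return the cached value for key, logging (rather than raising) on a miss."""
    output = cache.get(key)
    if output is None:
        # Record the miss so silent regeneration is visible in the logs.
        logging.info('Cache miss for key %r; falling back to regeneration', key)
        return None
    return output
```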
```python
logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
input_dataset.to_json(output_data_file)
logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')
```
The variable 'dataset' still refers to the last iterated dataset at this point, so len(dataset) does not reflect the aggregate number of cleaned samples; consider using the total sample count (or another appropriately aggregated value) so the log accurately reports the number of samples written.
Suggested change:
```diff
-logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
-input_dataset.to_json(output_data_file)
-logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')
+total_samples = sum(len(dataset.samples) for dataset in input_dataset.datasets)
+logging.info(f'Writing {total_samples} samples to {output_data_file}')
+input_dataset.to_json(output_data_file)
+logging.info(f'Wrote {total_samples} samples to {output_data_file}')
```
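The suggested aggregation can be sanity-checked in isolation. The dataclasses below are hypothetical stand-ins for the real dataset containers, assumed only to expose `.datasets` and `.samples` as the suggestion does:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    # Stand-in for a single dataset holding cleaned samples.
    samples: list = field(default_factory=list)

@dataclass
class ContinualDataset:
    # Stand-in for the aggregate container written to output_data_file.
    datasets: list = field(default_factory=list)

input_dataset = ContinualDataset(
    datasets=[Dataset(samples=[1, 2]), Dataset(samples=[3])]
)
# Sum across all constituent datasets, not just the last one iterated.
total_samples = sum(len(d.samples) for d in input_dataset.datasets)
```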
jobs/generate_all_downsampled.sh (Outdated)
```sh
# 2) run all generation jobs sequentially
uv run aif generate \
    --include-preference-axes \
    --max_concurrency 256 \
    --output_file "data/70B_generation/education_qna_direct/data.json" \
    config/static_copy/education_qna_direct.yaml \
    Meta-Llama-3.1-70B-Instruct
```
To reiterate: is this necessary to merge, or just useful as a pre-processing step for our data? By the way, I would rather we not use this in excess (or at all), since it's unclear how removing specific words could alter the latent preference that we are aiming to model.
Assuming we are finalized on this approach for our first release and paper, we should move this to config (in a subsequent PR)
It could also be worth fixing #140 while we are here, unless you'd rather do it separately for a clean git history.
This is a branch used for generating the second phase of our data.