[examples] feat: add filtered ASearcher dataset source to the preprocess script #74
Code Review
This pull request updates the asearcher.py preprocessing script to make the --input_json argument optional, automatically downloading a filtered ASearcher dataset from HuggingFace if no local file is provided. The review feedback suggests two improvements: dynamically adjusting the train/test split sizes if the downloaded dataset is smaller than the requested total to prevent a runtime crash, and wrapping the datasets import in a try-except block to provide a user-friendly error message if the optional dependency is missing.
if args.input_json:
    input_json_path = os.path.expanduser(args.input_json)
    df_raw = _read_input_as_dataframe(input_json_path)
    logger.info(f"Loaded {len(df_raw)} records from {input_json_path}")
else:
    df_raw = _load_filtered_dataset()
    logger.info(f"Loaded {len(df_raw)} records from {HF_FILTERED_REPO}")
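The branch above only works if --input_json may be omitted. A minimal sketch of how such a parser could look, assuming the script uses argparse: the flag names --train_rows and --test_rows are inferred from the attribute names args.train_rows and args.test_rows, the defaults 8192 and 100 come from the review comment, and build_parser and describe_source are hypothetical helpers, not part of asearcher.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the flag names are taken from the PR discussion."""
    parser = argparse.ArgumentParser(description="Preprocess the ASearcher dataset.")
    # Leaving the flag out yields None, which selects the download branch.
    parser.add_argument("--input_json", default=None, help="Optional local input file.")
    # Defaults assumed from the review comment (8,192 train + 100 test).
    parser.add_argument("--train_rows", type=int, default=8192)
    parser.add_argument("--test_rows", type=int, default=100)
    return parser


def describe_source(args: argparse.Namespace) -> str:
    """Report which branch of the if/else would run for these arguments."""
    return "local file" if args.input_json else "HuggingFace download"


if __name__ == "__main__":
    print(describe_source(build_parser().parse_args([])))
```

Because the check is a plain truthiness test, passing an empty string for --input_json would also fall through to the download branch.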
If the downloaded filtered dataset contains fewer than the default 8,292 rows (8,192 train + 100 test), the script will crash with a ValueError at line 182. To make the script robust, we should dynamically adjust the train/test split sizes if the loaded dataset is smaller than the requested total.
if args.input_json:
    input_json_path = os.path.expanduser(args.input_json)
    df_raw = _read_input_as_dataframe(input_json_path)
    logger.info(f"Loaded {len(df_raw)} records from {input_json_path}")
else:
    df_raw = _load_filtered_dataset()
    logger.info(f"Loaded {len(df_raw)} records from {HF_FILTERED_REPO}")
if len(df_raw) < args.train_rows + args.test_rows:
    args.test_rows = max(1, len(df_raw) // 10) if len(df_raw) > 1 else 0
    args.train_rows = len(df_raw) - args.test_rows
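To see what the suggested fallback does for concrete dataset sizes, the same arithmetic can be pulled into a small pure function. This is only a sketch: clamp_split_sizes is a hypothetical name, and the function mirrors the three suggested lines rather than anything that exists in the script.

```python
from typing import Tuple


def clamp_split_sizes(n_rows: int, train_rows: int, test_rows: int) -> Tuple[int, int]:
    """Return (train_rows, test_rows) adjusted so they never exceed n_rows.

    Mirrors the suggested change: if the dataset is too small, reserve
    roughly 10% of it (at least one row) for test and the rest for train.
    """
    if n_rows >= train_rows + test_rows:
        # Enough data: keep the requested sizes unchanged.
        return train_rows, test_rows
    test_rows = max(1, n_rows // 10) if n_rows > 1 else 0
    return n_rows - test_rows, test_rows


if __name__ == "__main__":
    for n in (10_000, 5_000, 5, 1):
        print(n, clamp_split_sizes(n, 8192, 100))
```

Note that the fallback does not try to preserve the requested test size: with 5,000 rows it yields a 4,500/500 split rather than 4,900/100, which may or may not be what the author wants.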
def _load_filtered_dataset() -> pd.DataFrame:
    """Download the filtered ASearcher dataset from HuggingFace as a DataFrame."""
    from datasets import load_dataset

    logger.info(f"Downloading {HF_FILTERED_REPO}:{HF_FILTERED_FILE} from HuggingFace...")
    dataset = load_dataset(HF_FILTERED_REPO, data_files=HF_FILTERED_FILE, split="train")
    return dataset.to_pandas()
The datasets library is an optional dependency and might not be installed in all environments. If a user runs this script without --input_json and does not have datasets installed, they will get a generic ImportError. Wrapping the import in a try-except block with a clear, actionable error message will improve the user experience.
def _load_filtered_dataset() -> pd.DataFrame:
    """Download the filtered ASearcher dataset from HuggingFace as a DataFrame."""
    try:
        from datasets import load_dataset
    except ImportError as e:
        raise ImportError(
            "The 'datasets' library is required to download the default dataset from HuggingFace. "
            "Please install it using 'pip install datasets' or provide a local path via '--input_json'."
        ) from e
    logger.info(f"Downloading {HF_FILTERED_REPO}:{HF_FILTERED_FILE} from HuggingFace...")
    dataset = load_dataset(HF_FILTERED_REPO, data_files=HF_FILTERED_FILE, split="train")
    return dataset.to_pandas()
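If the same guard is needed for more than one optional dependency, the pattern could be factored into a helper. The sketch below uses importlib from the standard library so it runs without datasets installed; require_optional and its hint parameter are hypothetical and not part of the PR.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(f"The '{module_name}' library is required. {hint}") from e


if __name__ == "__main__":
    try:
        require_optional(
            "surely_not_installed_module_xyz",
            "Install it with pip or provide a local path via '--input_json'.",
        )
    except ImportError as err:
        print(err)
```

Raising ImportError again, rather than a custom exception, keeps existing `except ImportError` handlers in calling code working.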
What does this PR do?
Use aidenjhwu/ASearcher_en_no-math_Qwen3-8B-reject-sample as the ASearcher dataset source, which filters out: