[examples] feat: add filtered ASearcher dataset source to the preprocess script#74

Merged
Begunner merged 1 commit into main from asearcher-data on Jun 29, 2026

@Begunner

Collaborator

What does this PR do?

Use aidenjhwu/ASearcher_en_no-math_Qwen3-8B-reject-sample as the ASearcher dataset source. This dataset filters out:

  1. Chinese samples, which can't be solved well using wiki
  2. math problems
  3. samples that are too easy or too difficult

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the asearcher.py preprocessing script to make the --input_json argument optional, automatically downloading a filtered ASearcher dataset from HuggingFace if no local file is provided. The review feedback suggests two improvements: dynamically adjusting the train/test split sizes if the downloaded dataset is smaller than the requested total to prevent a runtime crash, and wrapping the datasets import in a try-except block to provide a user-friendly error message if the optional dependency is missing.
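
As a minimal sketch of what "optional `--input_json`" can look like, the parser below leaves the flag unset by default and lets the caller fall back to the download path. The flag names `--train_rows` and `--test_rows` are inferred from `args.train_rows` and `args.test_rows` in the review, and the defaults come from the 8,192 + 100 figure quoted there. `build_parser` and `describe_source` are hypothetical helpers for illustration, not functions from asearcher.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI where --input_json is optional (illustrative, not the real parser)."""
    parser = argparse.ArgumentParser(description="Preprocess ASearcher data (sketch).")
    # default=None means "no local file given"; the caller then uses the HuggingFace download.
    parser.add_argument("--input_json", default=None, help="Optional path to a local input file.")
    # Defaults assumed from the review comment: 8,192 train rows + 100 test rows.
    parser.add_argument("--train_rows", type=int, default=8192)
    parser.add_argument("--test_rows", type=int, default=100)
    return parser


def describe_source(args: argparse.Namespace) -> str:
    """Report which data source the parsed arguments select."""
    return "local" if args.input_json else "huggingface"
```

Calling `build_parser().parse_args([])` leaves `input_json` as `None`, so the script would take the download branch.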

Comment on lines +173 to +179
if args.input_json:
    input_json_path = os.path.expanduser(args.input_json)
    df_raw = _read_input_as_dataframe(input_json_path)
    logger.info(f"Loaded {len(df_raw)} records from {input_json_path}")
else:
    df_raw = _load_filtered_dataset()
    logger.info(f"Loaded {len(df_raw)} records from {HF_FILTERED_REPO}")

high

If the downloaded filtered dataset contains fewer than the default 8,292 rows (8,192 train + 100 test), the script will crash with a ValueError at line 182. To make the script robust, we should dynamically adjust the train/test split sizes if the loaded dataset is smaller than the requested total.

Suggested change
if args.input_json:
    input_json_path = os.path.expanduser(args.input_json)
    df_raw = _read_input_as_dataframe(input_json_path)
    logger.info(f"Loaded {len(df_raw)} records from {input_json_path}")
else:
    df_raw = _load_filtered_dataset()
    logger.info(f"Loaded {len(df_raw)} records from {HF_FILTERED_REPO}")
if len(df_raw) < args.train_rows + args.test_rows:
    args.test_rows = max(1, len(df_raw) // 10) if len(df_raw) > 1 else 0
    args.train_rows = len(df_raw) - args.test_rows
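
The size adjustment in this suggestion can be isolated into a pure function, which makes the edge cases easier to check. The sketch below mirrors the suggested arithmetic (roughly 10% of the rows go to the test split, with at least one test row when more than one row exists). `clamp_split_sizes` is a hypothetical name, not a helper in the script.

```python
from typing import Tuple


def clamp_split_sizes(n_rows: int, train_rows: int, test_rows: int) -> Tuple[int, int]:
    """Return (train_rows, test_rows) adjusted so they never exceed n_rows."""
    if n_rows >= train_rows + test_rows:
        # Enough data: keep the requested sizes unchanged.
        return train_rows, test_rows
    # Too little data: reserve about 10% for test, but at least 1 row if possible.
    test = max(1, n_rows // 10) if n_rows > 1 else 0
    return n_rows - test, test


print(clamp_split_sizes(500, 8192, 100))  # (450, 50)
```

Returning new values instead of mutating `args` in place keeps the requested sizes available for logging, which may be useful if the script should warn the user that the split was shrunk.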

Comment on lines +134 to +140
def _load_filtered_dataset() -> pd.DataFrame:
    """Download the filtered ASearcher dataset from HuggingFace as a DataFrame."""
    from datasets import load_dataset

    logger.info(f"Downloading {HF_FILTERED_REPO}:{HF_FILTERED_FILE} from HuggingFace...")
    dataset = load_dataset(HF_FILTERED_REPO, data_files=HF_FILTERED_FILE, split="train")
    return dataset.to_pandas()

medium

The datasets library is an optional dependency and might not be installed in all environments. If a user runs this script without --input_json and does not have datasets installed, they will get a generic ImportError. Wrapping the import in a try-except block with a clear, actionable error message will improve the user experience.

def _load_filtered_dataset() -> pd.DataFrame:
    """Download the filtered ASearcher dataset from HuggingFace as a DataFrame."""
    try:
        from datasets import load_dataset
    except ImportError as e:
        raise ImportError(
            "The 'datasets' library is required to download the default dataset from HuggingFace. "
            "Please install it using 'pip install datasets' or provide a local path via '--input_json'."
        ) from e

    logger.info(f"Downloading {HF_FILTERED_REPO}:{HF_FILTERED_FILE} from HuggingFace...")
    dataset = load_dataset(HF_FILTERED_REPO, data_files=HF_FILTERED_FILE, split="train")
    return dataset.to_pandas()
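
If more than one optional dependency ends up needing this treatment, the same pattern can be factored into a small helper. The following is only a sketch using the standard library's `importlib.import_module`; `require_optional` and its `hint` parameter are hypothetical and not part of the PR.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, hint: str) -> ModuleType:
    """Import an optional module, or raise ImportError with an actionable hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Chain the original error so the underlying cause stays visible in the traceback.
        raise ImportError(f"The '{module_name}' library is required. {hint}") from e
```

A call such as `require_optional("datasets", "Install it with 'pip install datasets' or pass '--input_json'.")` would then return the module, from which `load_dataset` can be read as an attribute.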

@Begunner Begunner merged commit 79750f5 into main Jun 29, 2026
3 checks passed