[examples] feat: add filtered ASearcher dataset source to the preprocess script by Begunner · Pull Request #74 · verl-project/uni-agent

Begunner · 2026-06-29T11:18:13Z

What does this PR do?

Use aidenjhwu/ASearcher_en_no-math_Qwen3-8B-reject-sample as the ASearcher dataset source. This dataset filters out:

Chinese samples, which can't be solved well with the wiki
math problems
samples that are too easy or too difficult

gemini-code-assist

Code Review

This pull request updates the asearcher.py preprocessing script to make the --input_json argument optional, automatically downloading a filtered ASearcher dataset from HuggingFace if no local file is provided. The review feedback suggests two improvements: dynamically adjusting the train/test split sizes if the downloaded dataset is smaller than the requested total to prevent a runtime crash, and wrapping the datasets import in a try-except block to provide a user-friendly error message if the optional dependency is missing.
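
As a rough illustration of the fallback summarized above, the argument handling could look like the sketch below. The flag names `--train_rows` and `--test_rows` are inferred from `args.train_rows` and `args.test_rows` in the review suggestion, and `describe_source` is a hypothetical helper used only for this example; neither is confirmed to match asearcher.py.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse CLI arguments; --input_json is optional (assumed flag names)."""
    parser = argparse.ArgumentParser(description="Preprocess ASearcher data (sketch)")
    # When omitted, the script is expected to fall back to the HuggingFace download.
    parser.add_argument("--input_json", default=None, help="Optional local input file")
    # Defaults follow the 8,192 train + 100 test figures quoted in the review.
    parser.add_argument("--train_rows", type=int, default=8192)
    parser.add_argument("--test_rows", type=int, default=100)
    return parser.parse_args(argv)


def describe_source(args) -> str:
    """Hypothetical helper: report which data source would be used."""
    if args.input_json:
        return f"local file {os.path.expanduser(args.input_json)}"
    return "HuggingFace download"
```

Calling `parse_args([])` leaves `input_json` as `None`, so the download branch is taken; passing `--input_json <path>` selects the local file instead.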

gemini-code-assist · 2026-06-29T11:19:15Z

+    if args.input_json:
+        input_json_path = os.path.expanduser(args.input_json)
+        df_raw = _read_input_as_dataframe(input_json_path)
+        logger.info(f"Loaded {len(df_raw)} records from {input_json_path}")
+    else:
+        df_raw = _load_filtered_dataset()
+        logger.info(f"Loaded {len(df_raw)} records from {HF_FILTERED_REPO}")


If the downloaded filtered dataset contains fewer than the default 8,292 rows (8,192 train + 100 test), the script will crash with a ValueError at line 182. To make the script robust, we should dynamically adjust the train/test split sizes if the loaded dataset is smaller than the requested total.

Suggested change

    if args.input_json:
        input_json_path = os.path.expanduser(args.input_json)
        df_raw = _read_input_as_dataframe(input_json_path)
        logger.info(f"Loaded {len(df_raw)} records from {input_json_path}")
    else:
        df_raw = _load_filtered_dataset()
        logger.info(f"Loaded {len(df_raw)} records from {HF_FILTERED_REPO}")
    if len(df_raw) < args.train_rows + args.test_rows:
        args.test_rows = max(1, len(df_raw) // 10) if len(df_raw) > 1 else 0
        args.train_rows = len(df_raw) - args.test_rows
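
To make the arithmetic of this suggestion easier to check, here is the same adjustment as a standalone function. `clamp_split_sizes` is a hypothetical name for illustration; the logic mirrors the suggested lines and takes plain integers instead of `args` and `df_raw`.

```python
def clamp_split_sizes(n_rows: int, train_rows: int, test_rows: int) -> tuple[int, int]:
    """Return (train_rows, test_rows) adjusted to fit a dataset of n_rows."""
    if n_rows >= train_rows + test_rows:
        # Enough data: keep the requested sizes unchanged.
        return train_rows, test_rows
    # Too little data: reserve about 10% for test (at least one row when
    # there are two or more rows) and give the remainder to train.
    test = max(1, n_rows // 10) if n_rows > 1 else 0
    return n_rows - test, test


print(clamp_split_sizes(10000, 8192, 100))  # (8192, 100)
print(clamp_split_sizes(5000, 8192, 100))   # (4500, 500)
print(clamp_split_sizes(1, 8192, 100))      # (1, 0)
```

Note that when the fallback triggers, the test split is set to roughly 10% of the data rather than capped at the requested `test_rows`, so a 5,000-row dataset yields 500 test rows instead of 100. Whether that is desirable is a judgment call for the script's author.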

gemini-code-assist · 2026-06-29T11:19:15Z

+def _load_filtered_dataset() -> pd.DataFrame:
+    """Download the filtered ASearcher dataset from HuggingFace as a DataFrame."""
+    from datasets import load_dataset
+
+    logger.info(f"Downloading {HF_FILTERED_REPO}:{HF_FILTERED_FILE} from HuggingFace...")
+    dataset = load_dataset(HF_FILTERED_REPO, data_files=HF_FILTERED_FILE, split="train")
+    return dataset.to_pandas()


The datasets library is an optional dependency and might not be installed in all environments. If a user runs this script without --input_json and does not have datasets installed, they will get a generic ImportError. Wrapping the import in a try-except block with a clear, actionable error message will improve the user experience.

def _load_filtered_dataset() -> pd.DataFrame:
    """Download the filtered ASearcher dataset from HuggingFace as a DataFrame."""
    try:
        from datasets import load_dataset
    except ImportError as e:
        raise ImportError(
            "The 'datasets' library is required to download the default dataset from HuggingFace. "
            "Please install it using 'pip install datasets' or provide a local path via '--input_json'."
        ) from e
    logger.info(f"Downloading {HF_FILTERED_REPO}:{HF_FILTERED_FILE} from HuggingFace...")
    dataset = load_dataset(HF_FILTERED_REPO, data_files=HF_FILTERED_FILE, split="train")
    return dataset.to_pandas()
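
The same guard can be sketched in a generic, dependency-free form, which makes the behavior easy to try without `datasets` installed. `require_optional` is a hypothetical helper for illustration and is not part of the PR; it uses only the standard library's `importlib.import_module`.

```python
import importlib


def require_optional(module_name: str, hint: str):
    """Import an optional dependency, or raise ImportError with an actionable hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(
            f"The '{module_name}' library is required for this code path. {hint}"
        ) from e
```

For example, `require_optional("datasets", "Install it with 'pip install datasets' or pass '--input_json'.")` would return the module when it is available and otherwise raise an error that tells the user how to proceed.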

modify preprocess script

5c1a80d

gemini-code-assist Bot reviewed Jun 29, 2026

Begunner merged commit 79750f5 into main Jun 29, 2026
3 checks passed

Begunner merged 1 commit into main from asearcher-data
