This module handles the generation, downloading, and execution of training and evaluation data. The pipeline is designed to work with large-scale datasets while managing memory constraints through sharded processing.
Install the required packages:

```shell
pip install -r exec_requirements.txt
```

Key dependencies include:
- Data Processing: datasets (2.21.0), pandas (2.2.3), numpy (2.2.4)
- Visualization: matplotlib (3.10.0)
- Code Quality: black (24.8.0), pylint (3.3.4)
- Math Utilities: sympy (1.13.3), scipy (1.15.2)
This module relies on two custom forks with specific functionality:
- **Simple Code Execution** (`simple-code-execution`)
  - Provides core code execution functionality
  - Used at commit `bf0cb87`
- **EvalPlus** (`evalplus`)
  - Modified version of EvalPlus with:
    - Test limiting capabilities
    - Enhanced timing controls
    - Additional execution metrics
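Pinning the forks in a requirements file might look like the following sketch; the repository owner (`<org>`) and the EvalPlus ref are placeholders, not values from this module (only the `bf0cb87` commit is documented above):

```
# Hypothetical pins -- replace <org> and <fork-ref> with the actual fork locations
git+https://github.com/<org>/simple-code-execution.git@bf0cb87
git+https://github.com/<org>/evalplus.git@<fork-ref>
```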
The data processing pipeline consists of three main steps:
1. **Generate Data** (`generate.py` in root directory)
   - Creates files for evaluation or training
   - Supports both chat and non-chat formats
   - Configurable sampling parameters

   ```shell
   python generate.py eval-ds [MODEL_NAME] [OPTIONS]
   ```

2. **Download Execution Data** (`download_exec_data.py`)
   - Downloads and processes datasets:
     - DeepMind Code Contests
     - GSM8K (math word problems)
   - Filters and combines train/validation splits
   - Saves in compressed JSONL format

   ```shell
   python download_exec_data.py [OUTPUT_DIR]
   ```

3. **Execute Shards** (`exec_shard.py`)
   - Distributed execution of synthetic data
   - Supports both Code Contests and GSM8K tasks
   - Includes syntax validation and deduplication

   ```shell
   python exec_shard.py [COMMAND] [SHARD_PATH] [DATASET_INFO_PATH]
   ```
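The compressed JSONL format mentioned in step 2 can be handled with the standard library alone; the sketch below illustrates the round trip (the helper names are ours, not functions exported by this module):

```python
import gzip
import json

def write_jsonl_gz(path, records):
    """Write an iterable of dicts to a gzip-compressed JSONL file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl_gz(path):
    """Yield one dict per line from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Round-trip a single record
write_jsonl_gz("sample.jsonl.gz", [{"question": "2+2?", "answer": "4"}])
rows = list(read_jsonl_gz("sample.jsonl.gz"))
```

One record per line keeps the files streamable, so shards can be processed without loading a whole dataset into memory.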
The pipeline is designed to handle large-scale data processing with the following optimizations:
- **Sharded Processing**
  - Generation and execution are split into shards
  - Manages GPU and CPU memory constraints
  - Parallel processing with configurable workers
- **Resource Management**
  - Configurable timeouts for execution
  - Test limiting to prevent resource exhaustion
  - Deduplication of predictions to reduce computation
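Prediction deduplication can be as simple as keeping one copy of each whitespace-normalized prediction; a rough sketch (the normalization `exec_shard.py` actually applies may differ):

```python
import hashlib

def dedup_predictions(predictions):
    """Drop duplicates, comparing predictions after whitespace normalization."""
    seen = set()
    unique = []
    for pred in predictions:
        # Collapse runs of whitespace so trivially reformatted copies match
        key = hashlib.sha256(" ".join(pred.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pred)
    return unique

preds = ["print(1 + 1)", "print(1  +  1)", "print(2)"]
unique = dedup_predictions(preds)  # the second prediction is a duplicate
```

Because identical predictions produce identical execution results, running each unique prediction once cuts execution cost without changing the outcome.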
The following environment variables can be configured:
- `CUDA_VISIBLE_DEVICES`: Control GPU device selection
- `DEFAULT_OUTPUT_DIR`: Set output directory (defaults to `outputs`)
- `TOKENIZERS_PARALLELISM`: Control tokenizer parallelism
- `ACCELERATE_DOWNCAST_BF16`: Control BF16 precision
- `PRECISION`: Set model precision
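Variables like these are typically read with a fallback; a minimal sketch using the names above (only the `outputs` default is documented, the `TOKENIZERS_PARALLELISM` value is an assumption):

```python
import os

# DEFAULT_OUTPUT_DIR falls back to "outputs" when unset
output_dir = os.environ.get("DEFAULT_OUTPUT_DIR", "outputs")

# Setting TOKENIZERS_PARALLELISM explicitly silences fork-related
# tokenizer warnings; "false" here is an illustrative choice
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```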
Generated data and execution results are organized as follows:
```
outputs/
├── eval_ds_raw_preds/        # Raw model predictions
│   └── [MODEL_NAME]/
│       └── [SAMPLING_PARAMS]/
├── code_contests.jsonl.gz    # Processed code contests data
└── gsm8k.jsonl.gz            # Processed GSM8K data
```
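A prediction directory following this layout can be assembled with `pathlib`; this is illustrative only (the scripts may build paths differently, and the slash replacement for model names is our assumption):

```python
from pathlib import Path

def raw_preds_dir(output_dir, model_name, sampling_params):
    """Build outputs/eval_ds_raw_preds/[MODEL_NAME]/[SAMPLING_PARAMS]/."""
    # Hugging Face model names contain "/", which would nest directories
    safe_model = model_name.replace("/", "_")
    return Path(output_dir) / "eval_ds_raw_preds" / safe_model / sampling_params

p = raw_preds_dir("outputs", "bigcode/starcoder2", "temp0.8_n16")
```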