
Data Processing Module

This module handles the generation, downloading, and execution of training and evaluation data. The pipeline is designed to work with large-scale datasets while managing memory constraints through sharded processing.

Setup

Requirements

Install the required packages using:

pip install -r exec_requirements.txt

Key dependencies include:

  • Data Processing: datasets (2.21.0), pandas (2.2.3), numpy (2.2.4)
  • Visualization: matplotlib (3.10.0)
  • Code Quality: black (24.8.0), pylint (3.3.4)
  • Math Utilities: sympy (1.13.3), scipy (1.15.2)

Custom Dependencies

This module relies on two custom forks with specific functionality:

  1. Simple Code Execution (simple-code-execution)

    • Provides core code execution functionality
    • Used at commit bf0cb87
  2. EvalPlus (evalplus)

    • Modified version of EvalPlus with:
      • Test limiting capabilities
      • Enhanced timing controls
      • Additional execution metrics

Workflow

The data processing pipeline consists of three main steps:

  1. Generate Data (generate.py in root directory)

    • Creates files for evaluation or training
    • Supports both chat and non-chat formats
    • Configurable sampling parameters
    python generate.py eval-ds [MODEL_NAME] [OPTIONS]
  2. Download Execution Data (download_exec_data.py)

    • Downloads and processes datasets:
      • DeepMind Code Contests
      • GSM8K (math word problems)
    • Filters and combines train/validation splits
    • Saves in compressed JSONL format
    python download_exec_data.py [OUTPUT_DIR]
  3. Execute Shards (exec_shard.py)

    • Distributed execution of synthetic data
    • Supports both Code Contests and GSM8K tasks
    • Includes syntax validation and deduplication
    python exec_shard.py [COMMAND] [SHARD_PATH] [DATASET_INFO_PATH]
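The syntax validation and deduplication performed in step 3 can be sketched as follows. This is a minimal illustration only; the helper names (`is_valid_python`, `dedup_predictions`) and the hashing scheme are assumptions, not the module's actual implementation:

```python
import ast
import hashlib


def is_valid_python(code: str) -> bool:
    """Check that a prediction parses as Python before trying to execute it."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def dedup_predictions(predictions):
    """Drop byte-identical predictions so each one is executed only once."""
    seen = set()
    unique = []
    for code in predictions:
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique


preds = ["print(1)", "print(1)", "print(2"]  # duplicate + syntax error
valid = [p for p in dedup_predictions(preds) if is_valid_python(p)]
# valid == ["print(1)"]: the duplicate and the unparseable snippet are dropped
```

Filtering before execution this way avoids paying sandbox startup cost for predictions that could never run.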

Performance Considerations

The pipeline is designed to handle large-scale data processing with the following optimizations:

  1. Sharded Processing

    • Generation and execution are split into shards
    • Manages GPU and CPU memory constraints
    • Parallel processing with configurable workers
  2. Resource Management

    • Configurable timeouts for execution
    • Test limiting to prevent resource exhaustion
    • Deduplication of predictions to reduce computation
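A configurable execution timeout of the kind described above can be sketched with the standard library. The actual sandboxing in the `simple-code-execution` fork is more involved; this only illustrates the timeout mechanism:

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float = 5.0):
    """Run a code snippet in a subprocess, killing it past the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # Signal a timeout to the caller instead of raising
        return None, ""


rc, out = run_with_timeout("print('ok')")
```

Running each snippet in its own subprocess also isolates crashes, so one bad prediction cannot take down the worker processing the shard.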

Environment Configuration

The following environment variables can be configured:

  • CUDA_VISIBLE_DEVICES: Control GPU device selection
  • DEFAULT_OUTPUT_DIR: Set output directory (defaults to "outputs")
  • TOKENIZERS_PARALLELISM: Control tokenizer parallelism
  • ACCELERATE_DOWNCAST_BF16: Control BF16 precision
  • PRECISION: Set model precision

Output Structure

Generated data and execution results are organized as follows:

outputs/
├── eval_ds_raw_preds/          # Raw model predictions
│   └── [MODEL_NAME]/
│       └── [SAMPLING_PARAMS]/
├── code_contests.jsonl.gz      # Processed code contests data
└── gsm8k.jsonl.gz             # Processed GSM8K data
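The compressed JSONL files above can be read with the standard library, for example as below. The field names inside each record are not documented here and will depend on the dataset, so the round trip uses a synthetic record:

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Yield one parsed record per line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Round trip with a small synthetic file:
tmp = os.path.join(tempfile.gettempdir(), "sample.jsonl.gz")
with gzip.open(tmp, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"task": "demo"}) + "\n")

records = list(read_jsonl_gz(tmp))
# records == [{"task": "demo"}]
```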