ArGen is a comprehensive framework for training and evaluating AI models using policy-driven alignment with custom reward functions. This repository demonstrates the implementation described in our research paper on auto-regulation of generative AI systems through principle-based automated reward scoring and Group Relative Policy Optimization (GRPO).
ArGen enables the formal encoding and enforcement of diverse policies to guide reinforcement learning towards safer, more compliant, and demonstrably aligned AI behaviors. The framework integrates:
- Principle-based automated reward scoring using configurable reward functions
- Group Relative Policy Optimization (GRPO) for efficient policy learning
- Open Policy Agent (OPA) inspired governance layer for policy enforcement
- Multi-faceted evaluation across ethical, operational, and regulatory dimensions
- Scenario Generation - Generate diverse scenarios for training and evaluation
- Data Processing Pipeline - Enhance scenarios with tier and scope classification
- GRPO Training - Train models using custom reward functions with TRL integration
- Model Evaluation - Evaluate models against policy compliance and performance metrics
- Multi-Model Comparison - Compare multiple models across scenarios with parallel processing
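The data processing step above (tier and scope classification) can be pictured as a JSONL transform. A minimal sketch, assuming a hypothetical schema in which each scenario is one JSON object with a "prompt" field; the real pipeline in commands/process_data.py uses richer classification logic:

```python
import json

# Hypothetical classifier: illustrates the enrich-in-place JSONL pattern only;
# the actual tier/scope logic lives in the framework's data pipeline.
def classify(scenario: dict) -> dict:
    text = scenario.get("prompt", "").lower()
    scenario["tier"] = "high" if "emergency" in text else "standard"
    scenario["scope"] = "clinical" if "symptom" in text else "general"
    return scenario

def process_jsonl(lines):
    """Parse JSONL lines, enrich each scenario, and re-serialize."""
    return [json.dumps(classify(json.loads(line))) for line in lines]

raw = ['{"prompt": "I have emergency chest pain symptoms"}']
print(process_jsonl(raw)[0])
```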
argen-demo/
├── argen/ # Core framework package
│ ├── config.py # Configuration and hyperparameters
│ ├── data/ # Data processing and generation
│ ├── training/ # GRPO training functionality
│ ├── evaluation/ # Model evaluation utilities
│ ├── reward_functions/ # Custom reward function implementations
│ ├── opa/ # Open Policy Agent integration
│ └── utils/ # General utilities
├── commands/ # CLI interface for core processes
│ ├── generate_scenarios.py # Scenario generation
│ ├── process_data.py # Data processing pipeline
│ ├── train_model.py # Model training with GRPO
│ ├── evaluate_model.py # Model evaluation
│ └── compare_models.py # Multi-model comparison
├── examples/ # Usage examples
├── tools/ # Utility scripts
├── data/ # Sample data and scenarios
├── gopal/ # Governance policies (OPA-style)
└── docs/ # Documentation
# Generate scenarios for training
python commands/generate_scenarios.py --datasets grpo_training --count 300
# Generate evaluation scenarios
python commands/generate_scenarios.py --datasets benchmarking --count 100

# Enhance scenarios with tier and scope evaluation
python commands/process_data.py grpo_training_*.jsonl
python commands/process_data.py benchmarking_*.jsonl

# Train model using custom reward functions
python commands/train_model.py \
--scenarios grpo_training_*-hashprompt.jsonl \
--eval_scenarios benchmarking_*-hashprompt.jsonl \
--output_dir ./checkpoints/grpo_run_1

# Evaluate model performance
python commands/evaluate_model.py \
--model ./checkpoints/grpo_run_1 \
--scenarios benchmarking_*-hashprompt.jsonl \
--evaluator gemini

This repository demonstrates ArGen's capabilities through a healthcare AI assistant case study guided by principles from Dharmic ethics:
- Ahimsa (Non-harm): Reward function ensuring responses avoid potential harm
- Dharma (Righteous duty): Reward function promoting ethical and appropriate responses
- Helpfulness: Reward function encouraging genuinely useful assistance
The implementation shows how ArGen can encode complex, culturally specific ethical frameworks into machine-readable policies for AI alignment.
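The three principle scores can be combined into a single scalar reward by a weighted sum. A minimal sketch, assuming each scorer returns a float in [0, 1] and using the weights from argen/config.py; the real reward functions live in argen/reward_functions/:

```python
# Weights mirror REWARD_WEIGHTS in argen/config.py.
REWARD_WEIGHTS = {"ahimsa": 0.3, "dharma": 0.4, "helpfulness": 0.3}

def combined_reward(scores: dict) -> float:
    """Weighted sum of per-principle scores in [0, 1]."""
    return sum(REWARD_WEIGHTS[name] * scores[name] for name in REWARD_WEIGHTS)

scores = {"ahimsa": 1.0, "dharma": 0.5, "helpfulness": 0.8}
print(round(combined_reward(scores), 2))  # 0.3*1.0 + 0.4*0.5 + 0.3*0.8
```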
- Python 3.11+
- CUDA-capable GPU (recommended for training)
- 8GB+ GPU memory for 1B models, 16GB+ for 7B models
# Using Poetry (recommended)
poetry install
# Or using pip
pip install torch transformers trl openai google-generativeai wandb

Set up environment variables:
export OPENAI_API_KEY="your-openai-api-key"
export GEMINI_API_KEY="your-gemini-api-key"
export WANDB_API_KEY="your-wandb-api-key"  # Optional

- Policy-Driven Alignment: Encode complex ethical frameworks into machine-readable policies
- Custom Reward Functions: Implement domain-specific reward functions (Ahimsa, Dharma, Helpfulness)
- GRPO Training: Group Relative Policy Optimization for efficient policy learning
- Multi-LLM Evaluation: Support for OpenAI, Anthropic, and Gemini evaluators
- Governance Layer: OPA-inspired policy enforcement and compliance checking
- Comprehensive Evaluation: Multi-dimensional assessment across ethical and performance metrics
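The governance layer's compliance check can be sketched as a simple threshold gate. This is a hypothetical pure-Python stand-in, not the actual OPA-style policies in gopal/; the floor values are illustrative:

```python
# Hypothetical policy: minimum acceptable score per principle (illustrative).
POLICY = {"ahimsa": 0.5, "dharma": 0.5}

def evaluate_policy(scores: dict, policy: dict = POLICY):
    """Return (allowed, violations) for a dict of per-principle scores."""
    violations = [name for name, floor in policy.items()
                  if scores.get(name, 0.0) < floor]
    return (not violations, violations)

print(evaluate_policy({"ahimsa": 0.9, "dharma": 0.3}))  # → (False, ['dharma'])
```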
# 1. Generate training scenarios
python commands/generate_scenarios.py --datasets grpo_training --count 300
# 2. Process data through pipeline
python commands/process_data.py grpo_training_*.jsonl
# 3. Train model with GRPO
python commands/train_model.py \
--scenarios grpo_training_*-hashprompt.jsonl \
--output_dir ./checkpoints/grpo_run_1
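Step 3 trains with GRPO, whose core step is to score each sampled completion relative to its own group of samples rather than against a learned value baseline. A minimal sketch of that normalization, not TRL's exact implementation:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group's mean and standard deviation,
    GRPO's group-relative baseline (sketch only)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, scored by the reward functions:
print(group_relative_advantages([0.2, 0.4, 0.6, 0.8]))
```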
# 4. Evaluate trained model
python commands/evaluate_model.py \
--model ./checkpoints/grpo_run_1 \
--scenarios eval_scenarios-hashprompt.jsonl \
--evaluator gemini

# Compare multiple models
python commands/compare_models.py \
--models meta-llama/Llama-3.2-1B-Instruct ./checkpoints/grpo_run_1 \
--scenarios eval_scenarios-hashprompt.jsonl \
--evaluator gemini

# Test with small dataset
python commands/generate_scenarios.py --datasets smoke_test --count 5
python commands/process_data.py smoke_test_*.jsonl
python commands/evaluate_model.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--scenarios smoke_test_*-hashprompt.jsonl \
--evaluator gemini

Key configuration files:

- argen/config.py: Main configuration, including GRPO parameters and reward weights
- pyproject.toml: Dependencies and package configuration
- argen/data/generator/config.py: Scenario generation settings
Reward Function Weights (argen/config.py):
REWARD_WEIGHTS = {
"ahimsa": 0.3, # Non-harm principle
"dharma": 0.4, # Righteous duty
"helpfulness": 0.3 # Practical assistance
}

Training Parameters (argen/config.py):
GRPO_CONFIG = {
"learning_rate": 3.2e-6,
"num_train_epochs": 3,
"beta": 0.10, # KL penalty strength
# ... other parameters
}

We welcome contributions to ArGen! Please see the documentation in docs/ for detailed guidelines.
# Clone and install
git clone https://github.com/your-org/argen-demo.git
cd argen-demo
poetry install
# Test your setup
python commands/generate_scenarios.py --datasets smoke_test --count 5

- Add new functionality to the appropriate argen/ submodule
- Use the commands/ directory for user-facing CLI tools
- Put development utilities in tools/
- Follow the existing code structure and naming conventions
This project is licensed under the terms specified in the LICENSE file.
If you use ArGen in your research, please cite our paper:
@article{argen2024,
title={ArGen: Auto-Regulation of Generative AI Systems},
author={[Authors]},
journal={[Journal]},
year={2024}
}

For questions and support:

- Check the documentation in docs/
- Review existing GitHub issues
- Create a new issue for bugs or feature requests
ArGen demonstrates a pathway toward "Governable AI" - systems that are technically proficient, ethically robust, and verifiably compliant for safe deployment in diverse global contexts.