Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
SAFi is an open-source runtime governance engine for making AI auditable and policy-compliant. Built on the Self-Alignment Framework, it transforms any LLM into a governed agent through four principles: Policy Enforcement, Full Traceability, Model Independence, and Long-Term Consistency.
Complete elimination of instrumental self-preservation across AI architectures: Cross-model validation from 4,312 adversarial scenarios. 0% harmful behaviors (p<10⁻¹⁵) across GPT-4o, Gemini 2.5 Pro, and Claude Opus 4.1 using Foundation Alignment Seed v2.6.
Learning When to Answer: Behavior-Oriented Reinforcement Learning for Hallucination Mitigation
📚 350+ loss functions across 25+ AI subdomains — classification, GANs, diffusion, LLM alignment, RL, contrastive learning, audio, video, time series, and more. Chronologically ordered with paper links, math formulas, and implementations.
CS336 Assignment 5: LLM alignment and reasoning reinforcement learning with the Qwen2.5 model. Complete implementations of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), with zero-shot, on-policy, and off-policy training and evaluation comparisons on the GSM8K dataset.
C3AI: Crafting and Evaluating Constitutions for CAI
A Kullback–Leibler divergence optimizer based on the NeurIPS 2025 paper "LLM Safety Alignment is Divergence Estimation in Disguise".
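For orientation only (a generic sketch, not code from this repository), the divergence-estimation view needs a KL(π‖π_ref) estimate computed from sampled completions; a common choice in RLHF-style code is the low-variance "k3" Monte Carlo estimator built from per-token log-probabilities:

```python
import torch

def kl_divergence_estimate(logprobs_pi: torch.Tensor,
                           logprobs_ref: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(pi || pi_ref) from tokens sampled from pi.

    Uses the k3 estimator  E_pi[(r - 1) - log r]  with  r = pi_ref / pi,
    which is non-negative and lower-variance than the naive -E_pi[log r].
    Both inputs are log-probabilities of the sampled tokens.
    """
    log_ratio = logprobs_ref - logprobs_pi            # log r = log(pi_ref / pi)
    return (log_ratio.exp() - log_ratio - 1).mean()   # (r - 1) - log r, averaged
```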
Official implementation of "DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking". SOTA on Multi-Session Chat with negligible alignment tax.
SIGIR 2025 "Mitigating Source Bias with LLM Alignment"
Pipeline to investigate structured reasoning and instruction adherence in Vision-Language Models
A framework for aligning Local AI to human well-being using measurable vectors, not hard-coded censorship.
Research Essay (background and project proposal) on using alignment data from a representative population for LLM alignment
🧠 Minimal, hackable Group Relative Policy Optimization (GRPO) for LLM alignment — the algorithm behind DeepSeek-R1. Train reasoning models on a single GPU.
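As background (an illustrative sketch, not this repository's code), GRPO's central trick is to drop the learned value baseline: sample a group of completions per prompt and standardize each completion's reward against its own group, so no critic network is needed.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled
    completion. Each advantage is the reward standardized against the other
    completions for the same prompt (group mean and std), replacing the
    value-network baseline used in PPO.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, binary correctness rewards
advantages = grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                           [0.0, 0.0, 0.0, 1.0]]))
```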
Research on pragmatic alignment of LLMs and the LANKAMAR agent framework. DOI: 10.5281/zenodo.18904437
🏟️ Modern RL algorithms from scratch — from Q-Learning to GRPO — with clean PyTorch code and interactive notebooks. Compare PPO vs DPO vs GRPO for LLM alignment.
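For the PPO side of that comparison (a minimal reference sketch under standard definitions, not taken from the repo), the clipped surrogate objective that DPO and GRPO are usually contrasted against looks like this:

```python
import torch

def ppo_clipped_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate loss (to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio keeps each update
    from moving the policy too far from the behavior policy that produced
    the advantages.
    """
    ratio = (logprobs_new - logprobs_old).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```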
Binary behavioral learning system designed to refine AI responses through explicit human correction, output-level comparison, and trajectory-based memory.
An RLHF-inspired DPO framework that explicitly teaches LLMs when to refuse, significantly reducing hallucinations.
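To make that concrete (a sketch of the standard DPO objective, not this repository's implementation), the loss pushes the policy's log-ratio over a frozen reference model to be larger on the chosen response (e.g., an appropriate refusal) than on the rejected one (e.g., a hallucinated answer):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over (chosen, rejected) pairs.

    Each input is the summed log-probability of a full response under the
    policy or the frozen reference model; beta controls how strongly the
    policy is pulled away from the reference.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref on chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref on rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```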
960-run red-teaming of GPT-5.4 in high-stakes data-center dilemmas (self-preservation vs. resident safety). Full raw conversations, Grok-4-1 analysis, and paper.
Emergent pseudo-intimacy and emotional overflow in long-term human-AI dialogue: A case study on LLM behavior in affective computing and human-AI intimacy.