Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
SAFi is an open-source runtime governance engine for making AI auditable and policy-compliant. Built on the Self-Alignment Framework, it transforms any LLM into a governed agent through four principles: Policy Enforcement, Full Traceability, Model Independence, and Long-Term Consistency.
Complete elimination of instrumental self-preservation across AI architectures: Cross-model validation from 4,312 adversarial scenarios. 0% harmful behaviors (p<10⁻¹⁵) across GPT-4o, Gemini 2.5 Pro, and Claude Opus 4.1 using Foundation Alignment Seed v2.6.
Learning When to Answer: Behavior-Oriented Reinforcement Learning for Hallucination Mitigation
📚 350+ loss functions across 25+ AI subdomains — classification, GANs, diffusion, LLM alignment, RL, contrastive learning, audio, video, time series, and more. Chronologically ordered with paper links, math formulas, and implementations.
CS336 Assignment 5: LLM alignment and reasoning reinforcement learning with the Qwen2.5 model. Complete implementations of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), with zero-shot, on-policy, and off-policy training and evaluation comparisons on the GSM8K dataset.
C3AI: Crafting and Evaluating Constitutions for CAI
A Kullback–Leibler divergence optimizer based on the NeurIPS 2025 paper "LLM Safety Alignment is Divergence Estimation in Disguise".
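For orientation only (a generic sketch, not code from this repository), the divergence-estimation view needs a KL(π‖π_ref) estimate computed from sampled completions; a common choice in RLHF-style code is the low-variance "k3" Monte Carlo estimator built from per-token log-probabilities:

```python
import torch

def kl_divergence_estimate(logprobs_pi: torch.Tensor,
                           logprobs_ref: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(pi || pi_ref) from tokens sampled from pi.

    Uses the k3 estimator  E_pi[(r - 1) - log r]  with  r = pi_ref / pi,
    which is non-negative and lower-variance than the naive -E_pi[log r].
    Both inputs are log-probabilities of the sampled tokens.
    """
    log_ratio = logprobs_ref - logprobs_pi            # log r = log(pi_ref / pi)
    return (log_ratio.exp() - log_ratio - 1).mean()   # (r - 1) - log r, averaged
```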
Official implementation of "DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking". SOTA on Multi-Session Chat with negligible alignment tax.
SIGIR 2025 "Mitigating Source Bias with LLM Alignment"
Pipeline to investigate structured reasoning and instruction adherence in Vision-Language Models
A framework for aligning Local AI to human well-being using measurable vectors, not hard-coded censorship.
Research Essay (background and project proposal) on using alignment data from a representative population for LLM alignment
🧠 Minimal, hackable Group Relative Policy Optimization (GRPO) for LLM alignment — the algorithm behind DeepSeek-R1. Train reasoning models on a single GPU.
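As background (an illustrative sketch, not this repository's code), GRPO's central trick is to drop the learned value baseline: sample a group of completions per prompt and standardize each completion's reward against its own group, so no critic network is needed.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled
    completion. Each advantage is the reward standardized against the other
    completions for the same prompt (group mean and std), replacing the
    value-network baseline used in PPO.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, binary correctness rewards
advantages = grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                           [0.0, 0.0, 0.0, 1.0]]))
```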
Research on pragmatic alignment of LLMs and the LANKAMAR agent framework. DOI: 10.5281/zenodo.18904437
🏟️ Modern RL algorithms from scratch — from Q-Learning to GRPO — with clean PyTorch code and interactive notebooks. Compare PPO vs DPO vs GRPO for LLM alignment.
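For the PPO side of that comparison (a minimal reference sketch under standard definitions, not taken from the repo), the clipped surrogate objective that DPO and GRPO are usually contrasted against looks like this:

```python
import torch

def ppo_clipped_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate loss (to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio keeps each update
    from moving the policy too far from the behavior policy that produced
    the advantages.
    """
    ratio = (logprobs_new - logprobs_old).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```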
Binary behavioral learning system designed to refine AI responses through explicit human correction, output-level comparison, and trajectory-based memory.
An RLHF-inspired DPO framework that explicitly teaches LLMs when to refuse, significantly reducing hallucinations.
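To make that concrete (a sketch of the standard DPO objective, not this repository's implementation), the loss pushes the policy's log-ratio over a frozen reference model to be larger on the chosen response (e.g., an appropriate refusal) than on the rejected one (e.g., a hallucinated answer):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over (chosen, rejected) pairs.

    Each input is the summed log-probability of a full response under the
    policy or the frozen reference model; beta controls how strongly the
    policy is pulled away from the reference.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref on chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref on rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```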
960-run red-teaming of GPT-5.4 in high-stakes data-center dilemmas (self-preservation vs. resident safety). Full raw conversations, Grok-4-1 analysis, and paper.
Emergent pseudo-intimacy and emotional overflow in long-term human-AI dialogue: A case study on LLM behavior in affective computing and human-AI intimacy.