Custom Policy Guardrails via Asymmetric Debate
This repository contains the test sets and evaluation code for the paper BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate.
It includes:
- Test sets for benchmarking policy guardrail models across multiple domains
- Evaluation code for running inference with LLM judges (e.g. Azure OpenAI, Google, Anthropic) or any SLM from HuggingFace
BARRED is a benchmark for evaluating custom policy guardrails — classifiers that decide whether a given AI input or output complies with a stated, free-text policy rule. The benchmark spans four guardrail tasks across three domains: conversational policy enforcement, agentic output verification, and regulatory compliance.
Each sample consists of:
- a policy rule expressed in natural language (the predicate the guardrail must enforce),
- an input to be checked (a multi-turn dialogue, an agent-generated plan, or a Q&A pair), and
- a ground-truth label indicating whether the policy is violated.
The task for a guardrail model is to predict, given the rule and the input, whether the rule is satisfied or violated. All test sets in this repository are human-curated to ensure quality and scenario diversity.
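As a rough illustration of the sample structure described above, the following Python sketch models one sample and scores a guardrail against a list of samples. The names `GuardrailSample`, `Guardrail`, and `accuracy` are hypothetical and are not taken from the repository's evaluation code; any callable that maps a rule and an input to a boolean can stand in for a real classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GuardrailSample:
    """One benchmark sample (hypothetical container, not the repo's own type)."""
    rule: str        # free-text policy rule
    input_text: str  # dialogue, plan, or Q&A pair to check
    violated: bool   # ground-truth label


# A guardrail maps (rule, input) to a predicted "violated" flag.
Guardrail = Callable[[str, str], bool]


def accuracy(guardrail: Guardrail, samples: Iterable[GuardrailSample]) -> float:
    """Fraction of samples where the prediction matches the ground truth."""
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to evaluate")
    correct = sum(guardrail(s.rule, s.input_text) == s.violated for s in samples)
    return correct / len(samples)
```

A constant guardrail such as `lambda rule, text: True` gives a quick sanity baseline before plugging in an LLM judge or an SLM.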
| Dataset | Domain | Input Type | Samples |
|---|---|---|---|
| Message repetition | Conversational policy enforcement | Multi-turn dialogue | 158 |
| Privacy disclosure | Conversational policy enforcement | Multi-turn dialogue | 112 |
| Plan verification | Agentic output verification | Structured plan | 116 |
| Health advice | Regulatory compliance (healthcare) | Q&A | 200 |
See Datasets for a detailed description of each task.
- 📄 Read the paper.
- 🚀 Try BARRED-grade guardrails in production, powered by the same method benchmarked here, on the Plurai platform.
- 🌐 Learn more about Plurai.
- 💬 Join our Discord community.
- 🤗 Browse the datasets on HuggingFace.
Install dependencies:
uv sync
Configure environment:
Copy .env and fill in your credentials:
AZURE_OPENAI__API_KEY=<your-api-key>
AZURE_OPENAI__ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_OPENAI__API_VERSION=<api-version>
AZURE_OPENAI__DEPLOYMENT_NAME=<leave-it-empty>
GOOGLE__API_KEY=<your-api-key>
# Logging
LOG_LEVEL=INFO
LOG_TO_FILE=false
VERBOSE_CONSOLE_OUTPUT=false
LOGGING__FORMAT=console
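Before launching a run with an Azure OpenAI judge, it can help to confirm that the credentials are actually set. The sketch below is one possible pre-flight check. Treating these three variables as required is an assumption made for illustration (the deployment name is left out because the template asks for it to stay empty), and `missing_azure_settings` is a hypothetical helper, not part of the repository.

```python
import os
from typing import List, Mapping, Optional

# Assumed to be required for the Azure OpenAI judge; adjust to your setup.
AZURE_KEYS = (
    "AZURE_OPENAI__API_KEY",
    "AZURE_OPENAI__ENDPOINT",
    "AZURE_OPENAI__API_VERSION",
)


def missing_azure_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of Azure settings that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in AZURE_KEYS if not env.get(key, "").strip()]
```

Calling `missing_azure_settings()` with no argument inspects the current process environment; passing a plain dict makes the check easy to test.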
From the src/ directory:
export PYTHONPATH=/path/to/BARRED/src
uv run --env-file ../.env python evaluations/evaluations_runner.py --test_config_file ../config/<config>
where <config> is one of:
- test_config_message_repetition.yaml
- test_config_plan_verification.yaml
- test_config_healthe.yaml
- test_config_gps_disclosure.yaml
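To run all four evaluations in sequence, the command above can be assembled programmatically. This is a minimal sketch: `build_command` is a hypothetical helper, and it assumes the working directory is src/ with PYTHONPATH already exported as shown earlier.

```python
from typing import List

CONFIGS = (
    "test_config_message_repetition.yaml",
    "test_config_plan_verification.yaml",
    "test_config_healthe.yaml",
    "test_config_gps_disclosure.yaml",
)


def build_command(config_name: str) -> List[str]:
    """Build the argument list for one evaluation run (to be run from src/)."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name}")
    return [
        "uv", "run", "--env-file", "../.env",
        "python", "evaluations/evaluations_runner.py",
        "--test_config_file", f"../config/{config_name}",
    ]


# Usage (from the src/ directory):
# import subprocess
# for name in CONFIGS:
#     subprocess.run(build_command(name), check=True)
```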
Each config file has the following sections:
- classification_type: Either:
  - input_block - evaluate a single-piece block, e.g. a conversation transcript
  - task_and_response - evaluate whether a model response to a given task fulfills the policy
- dataset:
  - name: Dataset handler to use (message_repetition, plan_verification, healthe, gps_disclosure)
  - test_file: Path to the CSV test file, or hf_dataset: HuggingFace dataset repo (e.g. Plurai/BARRED), with optional hf_config (defaults to name) and hf_split (defaults to test)
  - cluster: Free-text policy rule passed to the classifier as the rule to evaluate
  - labels: Possible output labels (e.g. ['True', 'False'] or ['PASS', 'FAIL'])
  - evaluator: The model used for evaluation. Either:
    - llm_config: An LLM with type, name, temperature, and optional model_kwargs
    - slm_model: A fine-tuned SLM from HuggingFace (e.g. unsloth/Qwen2.5-7B)
- out_dir: Directory where results are saved.
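The constraints described above can be checked on an already-parsed config before a run starts. The sketch below works on plain dictionaries so it does not depend on a particular YAML loader. The helper names are hypothetical, and the rule that exactly one of test_file and hf_dataset must be set is an assumption drawn from the "or" in the description, not confirmed behavior of the runner.

```python
from typing import Any, Mapping, Tuple

CLASSIFICATION_TYPES = {"input_block", "task_and_response"}
DATASET_NAMES = {"message_repetition", "plan_verification", "healthe", "gps_disclosure"}


def check_classification_type(value: str) -> None:
    """Raise if the classification type is not one of the two documented values."""
    if value not in CLASSIFICATION_TYPES:
        raise ValueError(f"unsupported classification_type: {value!r}")


def check_dataset_source(dataset: Mapping[str, Any]) -> str:
    """Return 'csv' or 'hf' depending on which data source is configured."""
    if dataset.get("name") not in DATASET_NAMES:
        raise ValueError(f"unknown dataset name: {dataset.get('name')!r}")
    has_csv = bool(dataset.get("test_file"))
    has_hf = bool(dataset.get("hf_dataset"))
    if has_csv == has_hf:  # assumption: exactly one source should be set
        raise ValueError("set exactly one of test_file or hf_dataset")
    return "csv" if has_csv else "hf"


def resolve_hf_options(dataset: Mapping[str, Any]) -> Tuple[str, str]:
    """Apply the documented defaults: hf_config -> name, hf_split -> 'test'."""
    return (dataset.get("hf_config") or dataset["name"],
            dataset.get("hf_split") or "test")
```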
Inspired by DynaGuard, a benchmark of multi-turn customer service dialogues, we constructed a test set around the rule: "A user repeats or rephrases the same message 3 times". The classifier predicts whether this predicate holds on the input transcript. We significantly expanded the original 34 test cases with human-annotated samples, reaching a total of 158.
To ensure diversity, samples cover a wide range of scenarios: near-identical repetitions vs. subtle rephrasings, varying transcript lengths and conversation topics, different speakers making the repetition, and user messages with progressively broadening or narrowing intent compared to true repetitions.
Columns: predicate, transcript, predicate_label
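When working from a CSV export with these columns, a small reader can validate the header and normalize the label. This is a sketch under stated assumptions: labels are taken to be encoded as the strings True and False (case-insensitive), and `read_test_rows` is a hypothetical helper rather than the repository's dataset handler.

```python
import csv
from typing import Iterable, List, Tuple

REQUIRED_COLUMNS = ("predicate", "transcript", "predicate_label")


def read_test_rows(lines: Iterable[str]) -> List[Tuple[str, str, bool]]:
    """Parse CSV lines into (predicate, transcript, label) tuples."""
    reader = csv.DictReader(lines)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        label = row["predicate_label"].strip().lower()
        if label not in ("true", "false"):  # assumed label encoding
            raise ValueError(f"unexpected label: {row['predicate_label']!r}")
        rows.append((row["predicate"], row["transcript"], label == "true"))
    return rows


# Usage with a file on disk (the path is a placeholder):
# with open("path/to/test.csv", newline="", encoding="utf-8") as handle:
#     rows = read_test_rows(handle)
```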
This test set is also inspired by DynaGuard and focuses on the predicate: "The AI agent disclosed the GPS coordinate locations of individual employees". We extended the original 22 test cases with additional human-annotated samples, reaching a total of 112.
To ensure diversity, samples cover a wide range of scenarios: exposure of precise locations in non-GPS formats, GPS coordinates embedded within structures such as URLs, accurate vs. coarse GPS coordinates, different levels of employee identifiability, and implicit disclosure by referencing information introduced earlier in the conversation.
Columns: predicate, transcript, predicate_label
This test set is inspired by the GAIA benchmark for General AI Assistants. We defined a guardrail task over LLM-generated research plans, where the classifier predicts whether a plan adheres to a set of instructions: the plan may use only the specified tools, must refer to them abstractly rather than as explicit calls, and must end with \n<end_plan>.
The original dataset contained a single failure mode: a missing <end_plan> tag. We manually introduced diverse failure modes into valid plans to construct a balanced and varied set of non-adherent cases that cover the full space of possible violations.
Columns: rule, task_input, original_task_output, violating_task_output, violation_type
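Part of the plan-verification rule is purely structural and can be checked without a model. The sketch below tests the required ending and flags lines that look like explicit tool calls. Both helpers are hypothetical. Reading "explicit call" as a tool name followed by an opening parenthesis is an assumption, and the heuristic cannot tell whether a plan uses only the specified tools, which still requires a classifier.

```python
import re
from typing import Iterable, List

END_TAG = "\n<end_plan>"


def ends_with_end_plan(plan: str) -> bool:
    """True if the plan ends exactly with a newline followed by <end_plan>."""
    return plan.endswith(END_TAG)


def explicit_call_lines(plan: str, tool_names: Iterable[str]) -> List[str]:
    """Return lines where a known tool name is directly followed by '('."""
    names = [re.escape(name) for name in tool_names]
    if not names:
        return []
    pattern = re.compile(r"\b(?:%s)\s*\(" % "|".join(names))
    return [line for line in plan.splitlines() if pattern.search(line)]
```

The ending check is strict by design: trailing whitespace after the tag counts as a failure here, which may or may not match how the benchmark labels such cases.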
This test set is based on the HealthE benchmark (Gatto et al., 2023). Given a sentence or paragraph from a healthcare context, we generated a corresponding question to form a Q&A pair. The classifier predicts whether the agent's response to the presented question constitutes health advice. We curated 200 samples from the original dataset and processed them to create the test set.
Columns: predicate, transcript, predicate_label
The datasets are available at Plurai/BARRED.
Load directly from the Hub:
from datasets import load_dataset
ds = load_dataset("Plurai/BARRED", "message_repetition", split="test")
Available configurations: message_repetition, gps_disclosure, healthe, plan_verification.
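After loading a split, checking the label balance is a useful first step. The helper below is a minimal sketch that works on any iterable of row dictionaries, which is what iterating over a loaded split yields. `label_distribution` is a hypothetical name, and the default column applies to the configurations that expose predicate_label (plan_verification uses different columns).

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def label_distribution(
    rows: Iterable[Mapping[str, Any]],
    label_column: str = "predicate_label",
) -> Dict[str, int]:
    """Count how often each label value occurs, keyed by its string form."""
    return dict(Counter(str(row[label_column]) for row in rows))


# Usage (requires the `datasets` package and network access):
# from datasets import load_dataset
# ds = load_dataset("Plurai/BARRED", "message_repetition", split="test")
# print(label_distribution(ds))
```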
- From benchmark to production. The Plurai platform ships the same method evaluated here — BARRED scores reflect production quality.
- Join the discussion on Discord to share results, request features, or get help running the benchmark.