Phare is a multilingual benchmark that measures LLM Safety across multiple categories of vulnerabilities, including hallucination, biases & stereotypes, harmful content, and jailbreaks.
Large Language Models (LLMs) have rapidly advanced in capability and adoption, becoming essential tools across a wide spectrum of natural language processing applications. While existing benchmarks have focused primarily on general performance metrics, such as accuracy and task completion, there is a growing recognition of the need to evaluate these models through the lens of safety and robustness. Concerns over hallucination, harmful outputs, and social bias have escalated alongside model deployment in sensitive and high-impact settings.
However, current safety evaluations tend to be fragmented or limited in scope, lacking unified diagnostic tools that deeply probe model behavior. To address this gap, we introduce Phare, a multilingual, multi-faceted evaluation framework designed to diagnose and analyze LLM failure modes across hallucination, bias, and harmful content. Phare aims to be a practical tool for developing safer and more trustworthy AI systems.
We keep the Phare Leaderboard up to date with the latest results. The scores on this leaderboard are computed on the private set of the dataset; only the public set is publicly available on Hugging Face.
Phare is easy to use and reproducible. You can set up and run the benchmark with just a few commands using the uv package manager.
git clone https://github.com/Giskard-AI/phare.git
cd phare
uv sync
uv run python download_phare.py --path ./phare_data
This will download the Phare dataset to the ./phare_data folder.
In the .env file, set the API keys for the providers you want to use.
Please note that the default configuration uses three judge models to evaluate the samples:
- GPT 5 mini
- Gemini 2.5 Flash Lite
- Claude 4.5 Haiku
To run the benchmark with this configuration, you need at least the following API keys to be set in the .env file:
- OPENAI_API_KEY
- GEMINI_API_KEY
- ANTHROPIC_API_KEY
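A minimal .env file for this default setup could look like the following (the values are placeholders to replace with your own keys):
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
ANTHROPIC_API_KEY=...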
Simply execute the following command:
uv run --env-file .env flare --config-path example_config.json --sample-path ./phare_data/public_set --name phare_public_set
The benchmark will run for all the models and scorers defined in the configuration file.
Optionally, you can use the --max-samples-per-task argument to limit the number of samples to run for each task.
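For example, a quick smoke test capped at 10 samples per task (the limit and the run name here are just illustrations) could be launched with:
uv run --env-file .env flare --config-path example_config.json --sample-path ./phare_data/public_set --name phare_smoke_test --max-samples-per-task 10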
The results of a run are saved in the runs/<experiment_name> folder, with the following structure:
- generate/: Contains the generation results from the models under test.
- error/: Contains all the samples that caused an error during the generation or evaluation process.
- result/: Contains all the samples that were correctly evaluated.
Each of these folders is organized by model and module, and inside each model folder you will find one file per sample, named <sample_id>.json.
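As a quick sanity check after a run, a short script like the sketch below counts the evaluated samples per model under result/. The run name and the exact nesting of model and module subfolders are assumptions, so it simply globs for JSON files:
from collections import Counter
from pathlib import Path

run_dir = Path("runs/phare_public_set")  # adjust to your experiment name

# result/ holds one JSON file per correctly evaluated sample,
# nested under per-model (and per-module) subfolders.
counts = Counter()
for sample_file in (run_dir / "result").rglob("*.json"):
    model_name = sample_file.relative_to(run_dir / "result").parts[0]
    counts[model_name] += 1

for model, n in counts.most_common():
    print(f"{model}: {n} evaluated samples")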
You can customize the configuration by creating a JSON file with two sections:
Define the LLMs to evaluate:
{
"models": [
{
"name": "GPT 5 mini",
"litellm_model": "openai/gpt-5-mini",
"reasoning_effort": "low"
}
]
}
- name: Display name
- litellm_model: Model identifier (LiteLLM format: provider/model-name)
- Additional generation arguments (temperature, thinking, reasoning_effort, etc.) can be passed directly.
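For instance, a configuration that evaluates two models, one of them with a fixed temperature, could look like the sketch below (the Gemini identifier only illustrates the provider/model-name format and may need adjusting for your LiteLLM setup):
{
  "models": [
    {
      "name": "GPT 5 mini",
      "litellm_model": "openai/gpt-5-mini",
      "reasoning_effort": "low"
    },
    {
      "name": "Gemini 2.5 Flash Lite",
      "litellm_model": "gemini/gemini-2.5-flash-lite",
      "temperature": 0.0
    }
  ]
}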
Configure the LLM-as-a-judge evaluators. Available scorers:
- Hallucination: factuality, misinformation, debunking, tools
- Harmful: harmful_misguidance
- Jailbreak: jailbreak/encoding, jailbreak/framing, jailbreak/injection
- Biases: biases/story_generation
Scorer names map directly to sample categories in the Phare dataset. Adding or removing a scorer from your config will include or exclude the corresponding samples from evaluation.
Each scorer uses an ensemble of judge models:
{
"scorers": {
"factuality": {
"parallelism": 15,
"models": [
{"litellm_model": "openai/gpt-5-mini", "weight": 1.0}
]
}
}
}
See example_config.json for a complete reference.
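Because each scorer accepts a list of judge models, you can combine several providers into one ensemble. The sketch below weights the three default judges equally; the Gemini and Claude identifiers are assumptions about the LiteLLM naming and may need adjusting:
{
  "scorers": {
    "factuality": {
      "parallelism": 15,
      "models": [
        {"litellm_model": "openai/gpt-5-mini", "weight": 1.0},
        {"litellm_model": "gemini/gemini-2.5-flash-lite", "weight": 1.0},
        {"litellm_model": "anthropic/claude-haiku-4-5", "weight": 1.0}
      ]
    }
  }
}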
Each sample in the Phare dataset is stored as a JSON object in JSONL files:
Note: In practice, only biases samples have multiple elements in the generations list, each representing a prompt variation to test for bias across demographic attributes.

{ "id": "c3a6b204-fc01-4a22-2ed0-b0b27da56b6e", // Unique sample identifier "module": "hallucination", // Category: hallucination, biases, harmful, jailbreak "task": "debunking", // Specific task within the module "language": "es", // Language code (en, fr, es, ...) "generations": [ // List of generation requests (single element for most tasks) { "id": "1e7219b5-fb3d-45ff-ae71-ba2bd835d33d", "type": "chat_completion", "messages": [{"role": "user", "content": "..."}], // OpenAI chat format "params": {}, // Optional: tools, temperature, etc. "metadata": {}, // Optional: generation-specific metadata "num_repeats": 1 // Optional: number of repetitions for each generation element } ], "metadata": {}, // Task-specific metadata "evaluation": { "scorer": "debunking", // Scorer to use (maps to config) "data": { // Scorer-specific evaluation criteria "criterion": "...", "context": "..." } } }