Beyond DeGroot: LLM Deliberation Preserves Heterogeneity Despite Consensus

Code and data for the NeurIPS 2026 submission.

Overview

This repository contains the code to reproduce the experiments in the paper. The framework simulates structured multi-agent deliberation using persona-grounded LLM agents and benchmarks their belief dynamics against DeGroot and Friedkin-Johnsen (F-J) opinion-dynamics models.

Two institutional settings are studied:

ECB Governing Council (16 agents, 4 deliberation rounds): Consensus-based monetary policy with the ECB's informal "consensus without formal vote" tradition.
FOMC (6 or 12 agents, 3 deliberation rounds): Formal voting-based monetary policy, replicating and extending the design of Kazinnik & Sinclair (2025).

The key finding is that LLM agents preserve belief heterogeneity across rounds of deliberation (non-zero spread), unlike the DeGroot model, which converges to unanimity. This preservation is also asymmetric (hawks entrench, doves accommodate), unlike the Friedkin-Johnsen model, which preserves spread symmetrically.
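To make the two benchmark behaviours concrete, here is a minimal sketch of the textbook DeGroot and Friedkin-Johnsen updates. It is not the code in src/benchmarks/friedkin_johnsen.py; the function names, the uniform weight matrix, the stubbornness value of 0.5, and the initial rates are all assumptions chosen for illustration.

```python
def degroot_step(beliefs, weights):
    """One DeGroot update: each agent takes a weighted average of all beliefs."""
    return [sum(w * b for w, b in zip(row, beliefs)) for row in weights]


def fj_step(beliefs, initial, weights, stubbornness):
    """One Friedkin-Johnsen update: mix the social average with the agent's
    own initial belief, weighted by its stubbornness in [0, 1]."""
    social = degroot_step(beliefs, weights)
    return [s * x0 + (1.0 - s) * soc
            for s, x0, soc in zip(stubbornness, initial, social)]


def simulate(initial, weights, stubbornness=None, rounds=4):
    """Run DeGroot (stubbornness=None) or F-J for a fixed number of rounds."""
    beliefs = list(initial)
    for _ in range(rounds):
        if stubbornness is None:
            beliefs = degroot_step(beliefs, weights)
        else:
            beliefs = fj_step(beliefs, initial, weights, stubbornness)
    return beliefs


def spread(beliefs):
    """Max minus min of agent beliefs."""
    return max(beliefs) - min(beliefs)


# Hypothetical initial rate preferences (in percent) for four agents.
initial = [2.00, 2.25, 2.50, 2.75]
n = len(initial)
uniform = [[1.0 / n] * n for _ in range(n)]  # assumed: everyone weights everyone equally

degroot_final = simulate(initial, uniform)
fj_final = simulate(initial, uniform, stubbornness=[0.5] * n)

print("DeGroot spread:", spread(degroot_final))
print("F-J spread:    ", spread(fj_final))
```

Under these assumptions the DeGroot spread collapses to zero, while the F-J spread stays positive and the highest and lowest agents end up equally far from the mean. That symmetric outcome is the one the asymmetric hawk/dove pattern is contrasted with.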

Repository Structure

code_release/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── config/
│   ├── ecb_config.yaml        # ECB simulation parameters and committee composition
│   └── fomc_config.yaml       # FOMC simulation parameters and committee composition
├── data/
│   └── fomc_july2025_macro.txt  # Macroeconomic briefing data for FOMC simulation
├── prompts/
│   ├── ecb/                   # Jinja2 prompt templates for ECB deliberation (4 rounds)
│   │   ├── round1_initial.j2
│   │   ├── round2_discussion.j2
│   │   ├── round3_engagement.j2
│   │   ├── round4_synthesis.j2
│   │   └── extended_round.j2
│   └── fomc/                  # Jinja2 prompt templates for FOMC deliberation (3 rounds)
│       ├── fomc_round1.j2
│       ├── fomc_round2.j2
│       └── fomc_round3.j2
├── results/
│   ├── summary/               # Aggregated results for paper tables
│   │   ├── table1_ecb_results.csv       # Table 1: ECB consolidated results
│   │   ├── table3_fomc_results.csv      # Table 3: FOMC replication results
│   │   ├── cgp_calibration.json         # CGP model calibration parameters
│   │   └── fj_benchmark_results.json    # Friedkin-Johnsen sensitivity sweep
│   └── processed/             # Per-agent per-round rate data
│       ├── ecb_rates_panel.csv          # ECB baseline: 47 runs x 16 agents x 4 rounds
│       └── fomc_rates_panel.csv         # FOMC main: 10 runs x 12 agents x 3 rounds
├── scripts/
│   ├── reproduce_table1.sh    # Instructions for reproducing Table 1
│   └── reproduce_table3.sh    # Instructions for reproducing Table 3
└── src/
    ├── simulation/            # LLM deliberation engine
    ├── experiments/           # Experiment runners (multi-model, ablation, etc.)
    ├── benchmarks/
    │   └── friedkin_johnsen.py   # DeGroot and F-J opinion dynamics implementation
    └── utils/
        ├── llm_adapter.py        # Multi-provider LLM adapter (OpenAI, Anthropic, DeepSeek)
        ├── aggregation.py        # Rate aggregation and consensus mechanisms
        └── dissent_calculator.py # Role-based dissent cost calculation

Setup

pip install -r requirements.txt

Set API keys as environment variables:

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"  # for Claude Sonnet experiments
export DEEPSEEK_API_KEY="your-key"   # for DeepSeek experiments
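Before launching a long run, it can help to check which keys are visible to Python. The helper below is a small sketch and not part of the repository; the function name `missing_keys` is hypothetical. Only the providers you actually call need a key (for example, the Anthropic key matters only for the Claude Sonnet experiments).

```python
import os

# Environment variable names taken from the export lines above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "DEEPSEEK_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All API keys are set.")
```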

Reproducing Results

From processed data (no API keys needed)

Tables and figures can be reproduced from the pre-computed results in results/:

| File | Description |
| --- | --- |
| `results/summary/table1_ecb_results.csv` | Table 1: ECB consolidated results across all conditions |
| `results/summary/table3_fomc_results.csv` | Table 3: FOMC replication results |
| `results/summary/fj_benchmark_results.json` | Friedkin-Johnsen sensitivity sweep (stubbornness 0.0--0.99) |
| `results/summary/cgp_calibration.json` | CGP model calibration (matched moments) |
| `results/processed/ecb_rates_panel.csv` | Per-agent per-round rates for ECB baseline (47 runs) |
| `results/processed/fomc_rates_panel.csv` | Per-agent per-round rates for FOMC main (10 runs) |

The processed outputs in results/processed/ and results/summary/ are sufficient to reproduce all numerical tables reported in the paper. For example, ecb_rates_panel.csv has columns run_id, agent, round, rate and can be used to compute spread, compression, and per-agent trajectories. Re-running the full API-based deliberation experiments requires user-provided API keys and may not exactly reproduce the original raw generations, because commercial model snapshots can change.
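As a sketch of how the panel files might be used, the snippet below computes per-round spread and per-run compression from rows with the columns run_id, agent, round, rate. It uses a tiny made-up sample rather than the real file. The helper names are hypothetical, and it assumes the round column holds integers. Results are in the units of the rate column, so convert to basis points yourself if the file stores percent.

```python
import csv
import io
from collections import defaultdict


def spread_by_run_round(rows):
    """Return {(run_id, round): max rate - min rate} across agents."""
    rates = defaultdict(list)
    for row in rows:
        key = (row["run_id"], int(row["round"]))  # assumes integer round labels
        rates[key].append(float(row["rate"]))
    return {key: max(values) - min(values) for key, values in rates.items()}


def compression_by_run(spreads):
    """Final-round spread minus first-round spread per run (negative = convergence)."""
    by_run = defaultdict(dict)
    for (run_id, rnd), value in spreads.items():
        by_run[run_id][rnd] = value
    return {run_id: rounds[max(rounds)] - rounds[min(rounds)]
            for run_id, rounds in by_run.items()}


# Made-up rows in the documented column layout (not real results).
SAMPLE = """run_id,agent,round,rate
0,A,1,2.00
0,B,1,2.50
0,A,4,2.10
0,B,4,2.35
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
spreads = spread_by_run_round(rows)
compression = compression_by_run(spreads)
print(spreads)
print(compression)
```

To run this on the released data, replace the sample with rows read from results/processed/ecb_rates_panel.csv, for example by passing an opened file handle to `csv.DictReader`.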

Re-running experiments (requires API keys)

ECB baseline (Table 1, row 1):

python -m src.experiments.multi_model_runner --model gpt-4o --n-runs 47

Multi-model comparison (Table 1, rows 2--4):

python -m src.experiments.multi_model_runner --model claude-sonnet-4-20250514 --n-runs 15
python -m src.experiments.multi_model_runner --model gpt-4o-mini --n-runs 15
python -m src.experiments.multi_model_runner --model deepseek-chat --n-runs 15

Temperature robustness (Table 1, rows 5--7):

python -m src.experiments.temperature_robustness --temperature 0.0 --n-runs 10
python -m src.experiments.temperature_robustness --temperature 0.3 --n-runs 10
python -m src.experiments.temperature_robustness --temperature 0.7 --n-runs 10

Order randomization (Table 1, row 8):

python -m src.experiments.order_randomization --n-runs 15 --seed 42

Persona ablation (Table 1, rows 9--10):

python -m src.experiments.persona_ablation --mode name_only --n-runs 10
python -m src.experiments.persona_ablation --mode bio_only --n-runs 10

Component ablation (Table 1, rows 11--12):

python -m src.experiments.component_ablation --mode framework_only --n-runs 10
python -m src.experiments.component_ablation --mode bio_institutional --n-runs 10

FOMC replication (Table 3):

python -m src.simulation.run_fomc --n-runs 10 --full
python -m src.simulation.run_fomc --n-runs 10 --pilot
python -m src.simulation.run_fomc --n-runs 5 --full --temperature 0.0

Friedkin-Johnsen benchmark:

python -m src.benchmarks.friedkin_johnsen --sweep

See scripts/reproduce_table1.sh and scripts/reproduce_table3.sh for convenience wrappers.

Prompt Templates

All prompts used in the deliberation are stored as Jinja2 templates in prompts/.

ECB deliberation uses 4 rounds: initial assessment, peer discussion, direct engagement, and synthesis.
FOMC deliberation uses 3 rounds, following the structure of Kazinnik & Sinclair (2025).

Each round's prompt template receives the agent's persona, macroeconomic data, and (for rounds 2+) the previous round's statements from other agents. The agent returns a policy rate recommendation along with its reasoning.
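The following sketch shows how a round template could be rendered with Jinja2. The template text and the variable names (`persona`, `macro_data`, `previous_statements`) are invented for illustration and may not match the actual files in prompts/.

```python
from jinja2 import Environment, StrictUndefined

# Hypothetical template; the real templates live in prompts/ecb/ and prompts/fomc/.
TEMPLATE = """You are {{ persona.name }}, {{ persona.role }}.
Macroeconomic briefing:
{{ macro_data }}
{% if previous_statements %}
Statements from the previous round:
{% for s in previous_statements %}
- {{ s.agent }}: {{ s.text }}
{% endfor %}
{% endif %}
State your recommended policy rate and your reasoning."""

# StrictUndefined makes rendering fail loudly if a variable is missing.
env = Environment(undefined=StrictUndefined, trim_blocks=True, lstrip_blocks=True)
template = env.from_string(TEMPLATE)

persona = {"name": "Member A", "role": "a voting committee member"}  # placeholder persona
macro = "Inflation: 2.1% y/y (placeholder figure)"

# Round 1: no prior statements are available.
round1 = template.render(persona=persona, macro_data=macro, previous_statements=[])

# Rounds 2+: statements from the other agents are passed in.
round2 = template.render(
    persona=persona,
    macro_data=macro,
    previous_statements=[{"agent": "Member B", "text": "I favour holding rates."}],
)
print(round2)
```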

Key Metrics

All spread values are reported in basis points (bp). Key metrics:

R1/R2/R3 spread: Maximum minus minimum of agent beliefs in round 1, round 2, and round 3. For the ECB setting, which has four rounds, R3 refers to round 4.
Compression: Change in spread from round 1 to final round (negative = convergence)
Hawk/dove delta: Change in distance of extreme agents from the mean across rounds
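The hawk/dove delta can be read in more than one way. The sketch below implements one plausible reading and is not necessarily the paper's exact definition. It assumes the hawk is the agent with the highest round-1 rate, the dove is the agent with the lowest, and the delta is each one's distance from the committee mean in the final round minus that distance in round 1. The function name and the sample rates are hypothetical.

```python
from statistics import mean


def hawk_dove_delta(first_round, final_round):
    """Change in distance from the mean for the round-1 hawk and dove.

    Both arguments map agent name -> rate. Negative values mean the agent
    ended up closer to the committee mean than it started.
    """
    hawk = max(first_round, key=first_round.get)  # highest initial rate
    dove = min(first_round, key=first_round.get)  # lowest initial rate

    def distance(agent, rates):
        return abs(rates[agent] - mean(rates.values()))

    return {
        "hawk": distance(hawk, final_round) - distance(hawk, first_round),
        "dove": distance(dove, final_round) - distance(dove, first_round),
    }


# Made-up rates: the dove (A) moves up, the hawk (C) does not move.
first = {"A": 2.00, "B": 2.25, "C": 2.75}
final = {"A": 2.25, "B": 2.25, "C": 2.75}
delta = hawk_dove_delta(first, final)
print(delta)
```

In this toy case both deltas are negative because the mean itself shifts upward, but the dove's delta is twice as large in magnitude. A gap of this kind between the two values is the asymmetry the metric is meant to expose.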

Models Used

| Model | Provider | Role |
| --- | --- | --- |
| `gpt-4o` (May 2024 snapshot) | OpenAI | Baseline and all robustness checks |
| `claude-sonnet-4-20250514` | Anthropic | Multi-model comparison |
| `gpt-4o-mini` | OpenAI | Multi-model comparison |
| `deepseek-chat` | DeepSeek | Multi-model comparison |

Data

Macroeconomic briefing data is drawn from publicly available sources (ECB SDW, Eurostat, FRED) as of the target meeting date. The data/ directory contains the briefing documents provided to agents.

The full persona JSON files are withheld during anonymous review and will be released upon acceptance. The prompt templates and anonymized processed outputs are included. The paper appendix describes the persona construction procedure and provides representative examples. The released processed outputs allow verification of all reported spread, compression, benchmark, and calibration results without requiring access to raw persona files.

License

MIT License
