Code and data for the NeurIPS 2026 submission.
This repository contains the code to reproduce the experiments in the paper. The framework simulates structured multi-agent deliberation using persona-grounded LLM agents and benchmarks their belief dynamics against DeGroot and Friedkin-Johnsen (F-J) opinion-dynamics models.
Two institutional settings are studied:
- ECB Governing Council (16 agents, 4 deliberation rounds): Consensus-based monetary policy with the ECB's informal "consensus without formal vote" tradition.
- FOMC (6 or 12 agents, 3 deliberation rounds): Formal voting-based monetary policy, replicating and extending the design of Kazinnik & Sinclair (2025).
The key finding is that LLM agents preserve belief heterogeneity across rounds of deliberation (non-zero spread), unlike DeGroot, which converges to unanimity. This preservation is asymmetric (hawks entrench, doves accommodate), unlike Friedkin-Johnsen, which preserves spread symmetrically.
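To make the two benchmark behaviors concrete, here is a minimal sketch of the textbook DeGroot and Friedkin-Johnsen updates. It is not the code in src/benchmarks/friedkin_johnsen.py, which may parameterize the models differently. The uniform weight matrix, the stubbornness value of 0.5, and the initial beliefs are illustrative assumptions.

```python
def degroot_step(weights, beliefs):
    """One DeGroot update: each agent adopts a weighted average of all beliefs."""
    return [sum(w * b for w, b in zip(row, beliefs)) for row in weights]


def fj_step(weights, beliefs, initial, stubbornness):
    """One Friedkin-Johnsen update: mix the social average with the agent's
    own initial belief, weighted by that agent's stubbornness."""
    social = degroot_step(weights, beliefs)
    return [s * x0 + (1.0 - s) * avg
            for s, x0, avg in zip(stubbornness, initial, social)]


def spread_bp(beliefs):
    """Max minus min of beliefs (in percent), expressed in basis points."""
    return (max(beliefs) - min(beliefs)) * 100.0


# Hypothetical 4-agent committee with uniform influence weights (rows sum to 1).
n = 4
W = [[1.0 / n] * n for _ in range(n)]
x0 = [2.00, 2.25, 2.50, 3.00]  # illustrative initial rate beliefs, in percent

x_dg = list(x0)
x_fj = list(x0)
for _ in range(50):
    x_dg = degroot_step(W, x_dg)
    x_fj = fj_step(W, x_fj, x0, stubbornness=[0.5] * n)

print(f"initial spread: {spread_bp(x0):.1f} bp")
print(f"DeGroot spread: {spread_bp(x_dg):.1f} bp")
print(f"F-J spread:     {spread_bp(x_fj):.1f} bp")
```

In this toy setup the DeGroot spread collapses to zero, while the F-J spread settles at half the initial spread, with every agent's deviation from the mean shrinking by the same factor.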
code_release/
├── README.md  # This file
├── requirements.txt  # Python dependencies
├── config/
│   ├── ecb_config.yaml  # ECB simulation parameters and committee composition
│   └── fomc_config.yaml  # FOMC simulation parameters and committee composition
├── data/
│   └── fomc_july2025_macro.txt  # Macroeconomic briefing data for FOMC simulation
├── prompts/
│   ├── ecb/  # Jinja2 prompt templates for ECB deliberation (4 rounds)
│   │   ├── round1_initial.j2
│   │   ├── round2_discussion.j2
│   │   ├── round3_engagement.j2
│   │   ├── round4_synthesis.j2
│   │   └── extended_round.j2
│   └── fomc/  # Jinja2 prompt templates for FOMC deliberation (3 rounds)
│       ├── fomc_round1.j2
│       ├── fomc_round2.j2
│       └── fomc_round3.j2
├── results/
│   ├── summary/  # Aggregated results for paper tables
│   │   ├── table1_ecb_results.csv  # Table 1: ECB consolidated results
│   │   ├── table3_fomc_results.csv  # Table 3: FOMC replication results
│   │   ├── cgp_calibration.json  # CGP model calibration parameters
│   │   └── fj_benchmark_results.json  # Friedkin-Johnsen sensitivity sweep
│   └── processed/  # Per-agent per-round rate data
│       ├── ecb_rates_panel.csv  # ECB baseline: 47 runs x 16 agents x 4 rounds
│       └── fomc_rates_panel.csv  # FOMC main: 10 runs x 12 agents x 3 rounds
├── scripts/
│   ├── reproduce_table1.sh  # Instructions for reproducing Table 1
│   └── reproduce_table3.sh  # Instructions for reproducing Table 3
└── src/
    ├── simulation/  # LLM deliberation engine
    ├── experiments/  # Experiment runners (multi-model, ablation, etc.)
    ├── benchmarks/
    │   └── friedkin_johnsen.py  # DeGroot and F-J opinion dynamics implementation
    └── utils/
        ├── llm_adapter.py  # Multi-provider LLM adapter (OpenAI, Anthropic, DeepSeek)
        ├── aggregation.py  # Rate aggregation and consensus mechanisms
        └── dissent_calculator.py  # Role-based dissent cost calculation
pip install -r requirements.txt

Set API keys as environment variables:
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key" # for Claude Sonnet experiments
export DEEPSEEK_API_KEY="your-key" # for DeepSeek experiments

Tables and figures can be reproduced from the pre-computed results in results/:
| File | Description |
|---|---|
| results/summary/table1_ecb_results.csv | Table 1: ECB consolidated results across all conditions |
| results/summary/table3_fomc_results.csv | Table 3: FOMC replication results |
| results/summary/fj_benchmark_results.json | Friedkin-Johnsen sensitivity sweep (stubbornness 0.0--0.99) |
| results/summary/cgp_calibration.json | CGP model calibration (matched moments) |
| results/processed/ecb_rates_panel.csv | Per-agent per-round rates for ECB baseline (47 runs) |
| results/processed/fomc_rates_panel.csv | Per-agent per-round rates for FOMC main (10 runs) |
The processed outputs in results/processed/ and results/summary/ are sufficient to reproduce all numerical tables reported in the paper. Re-running the full API-based deliberation experiments requires user-provided API keys and may not exactly reproduce the original raw generations because commercial model snapshots can change. For example, ecb_rates_panel.csv has columns run_id, agent, round, rate and can be used to compute spread, compression, and per-agent trajectories.
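As a rough illustration of how the panel can be used, the sketch below computes per-round spread and per-run compression from rows in the documented column layout (run_id, agent, round, rate). The sample rows are made up for the example and are not results from the paper. The code also assumes that rate is stored in percent, so multiplying by 100 gives basis points; check the released file before relying on that conversion.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the documented column layout (run_id, agent, round, rate).
# These are NOT results from the paper; rate is assumed to be in percent.
SAMPLE = """run_id,agent,round,rate
0,A,1,2.00
0,B,1,2.50
0,C,1,3.00
0,A,4,2.25
0,B,4,2.50
0,C,4,3.00
"""


def spread_by_round(rows):
    """Return {(run_id, round): max minus min rate, in basis points}."""
    rates = defaultdict(list)
    for row in rows:
        rates[(row["run_id"], int(row["round"]))].append(float(row["rate"]))
    return {key: (max(v) - min(v)) * 100.0 for key, v in rates.items()}


def compression_by_run(spreads):
    """Last-round spread minus earliest-round spread per run
    (negative = convergence)."""
    rounds = defaultdict(dict)
    for (run_id, rnd), value in spreads.items():
        rounds[run_id][rnd] = value
    return {run: by_round[max(by_round)] - by_round[min(by_round)]
            for run, by_round in rounds.items()}


# To use the released file instead, replace io.StringIO(SAMPLE) with
# open("results/processed/ecb_rates_panel.csv", newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
spreads = spread_by_round(rows)
compression = compression_by_run(spreads)
print(spreads)
print(compression)
```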
ECB baseline (Table 1, row 1):
python -m src.experiments.multi_model_runner --model gpt-4o --n-runs 47

Multi-model comparison (Table 1, rows 2--4):
python -m src.experiments.multi_model_runner --model claude-sonnet-4-20250514 --n-runs 15
python -m src.experiments.multi_model_runner --model gpt-4o-mini --n-runs 15
python -m src.experiments.multi_model_runner --model deepseek-chat --n-runs 15

Temperature robustness (Table 1, rows 5--7):
python -m src.experiments.temperature_robustness --temperature 0.0 --n-runs 10
python -m src.experiments.temperature_robustness --temperature 0.3 --n-runs 10
python -m src.experiments.temperature_robustness --temperature 0.7 --n-runs 10

Order randomization (Table 1, row 8):
python -m src.experiments.order_randomization --n-runs 15 --seed 42

Persona ablation (Table 1, rows 9--10):
python -m src.experiments.persona_ablation --mode name_only --n-runs 10
python -m src.experiments.persona_ablation --mode bio_only --n-runs 10

Component ablation (Table 1, rows 11--12):
python -m src.experiments.component_ablation --mode framework_only --n-runs 10
python -m src.experiments.component_ablation --mode bio_institutional --n-runs 10

FOMC replication (Table 3):
python -m src.simulation.run_fomc --n-runs 10 --full
python -m src.simulation.run_fomc --n-runs 10 --pilot
python -m src.simulation.run_fomc --n-runs 5 --full --temperature 0.0

Friedkin-Johnsen benchmark:
python -m src.benchmarks.friedkin_johnsen --sweep

See scripts/reproduce_table1.sh and scripts/reproduce_table3.sh for convenience wrappers.
All prompts used in the deliberation are stored as Jinja2 templates in prompts/.
- ECB deliberation uses 4 rounds: initial assessment, peer discussion, direct engagement, and synthesis.
- FOMC deliberation uses 3 rounds, following the structure of Kazinnik & Sinclair (2025).
Each round's prompt template receives the agent's persona, macroeconomic data, and (for rounds 2+) the previous round's statements from other agents. The agent returns a policy rate recommendation along with its reasoning.
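The inputs described above could be assembled along the following lines before being passed to a template. This is only a sketch: the function name, the dictionary keys, and the statement fields are hypothetical, and the released .j2 templates define the variable names actually used.

```python
def build_round_context(agent, round_number, macro_briefing, previous_statements):
    """Assemble the variables a round template might receive for one agent.

    All field names here are hypothetical; the released .j2 templates define
    the actual variable names.
    """
    context = {
        "persona": agent["persona"],
        "macro_data": macro_briefing,
        "round": round_number,
    }
    if round_number >= 2:
        # Only other agents' statements from the previous round are included.
        context["peer_statements"] = [
            s for s in previous_statements if s["agent"] != agent["name"]
        ]
    return context


# Illustrative agents and statements (placeholders, not released personas).
agents = [
    {"name": "Agent A", "persona": "Illustrative hawkish persona."},
    {"name": "Agent B", "persona": "Illustrative dovish persona."},
]
round1_statements = [
    {"agent": "Agent A", "rate": 2.50, "reasoning": "Inflation risks remain."},
    {"agent": "Agent B", "rate": 2.00, "reasoning": "Growth is weakening."},
]

ctx = build_round_context(agents[0], 2, "Illustrative macro briefing.", round1_statements)
print([s["agent"] for s in ctx["peer_statements"]])
```

In round 1 the context carries no peer statements, so the agent forms its initial assessment from the persona and briefing alone.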
All spread values are reported in basis points (bp). Key metrics:
- R1/R2/R3 spread: Max minus min of agent beliefs in rounds 1, 2, and 3. For the ECB setting, which has 4 rounds, R3 refers to round 4 (the final round).
- Compression: Change in spread from round 1 to final round (negative = convergence)
- Hawk/dove delta: Change in distance of extreme agents from the mean across rounds
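One possible reading of the hawk/dove delta is sketched below. The rule for picking the extreme agents (highest and lowest first-round rate) and the use of absolute distance from the committee mean are assumptions made for illustration. The paper's exact definition may differ, and the rates are made up.

```python
def hawk_dove_delta(first_round, final_round):
    """Change in each extreme agent's distance from the committee mean.

    Inputs map agent name -> rate (percent). The hawk and dove are taken to be
    the agents with the highest and lowest first-round rate; this selection
    rule is an assumption, not necessarily the paper's definition.
    Returns (hawk_delta_bp, dove_delta_bp); negative = moved toward the mean.
    """
    hawk = max(first_round, key=first_round.get)
    dove = min(first_round, key=first_round.get)

    def distance_bp(rates, agent):
        mean = sum(rates.values()) / len(rates)
        return abs(rates[agent] - mean) * 100.0

    return tuple(
        distance_bp(final_round, a) - distance_bp(first_round, a)
        for a in (hawk, dove)
    )


# Made-up rates for three hypothetical agents (not results from the paper).
r1 = {"A": 2.00, "B": 2.50, "C": 3.00}
r_final = {"A": 2.25, "B": 2.50, "C": 3.00}
hawk_delta, dove_delta = hawk_dove_delta(r1, r_final)
print(f"hawk delta: {hawk_delta:+.1f} bp, dove delta: {dove_delta:+.1f} bp")
```

In this toy case the hawk holds its rate while the dove moves up, so the dove's distance from the mean shrinks about twice as much as the hawk's. The hawk's distance still falls slightly because the mean itself shifts upward.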
| Model | Provider | Role |
|---|---|---|
| gpt-4o (May 2024 snapshot) | OpenAI | Baseline and all robustness checks |
| claude-sonnet-4-20250514 | Anthropic | Multi-model comparison |
| gpt-4o-mini | OpenAI | Multi-model comparison |
| deepseek-chat | DeepSeek | Multi-model comparison |
Macroeconomic briefing data is drawn from publicly available sources (ECB SDW, Eurostat, FRED) as of the target meeting date. The data/ directory contains the briefing documents provided to agents.
The full persona JSON files are withheld during anonymous review and will be released upon acceptance. The prompt templates and anonymized processed outputs are included. The paper appendix describes the persona construction procedure and provides representative examples. The released processed outputs allow verification of all reported spread, compression, benchmark, and calibration results without requiring access to raw persona files.
MIT License