A simulation platform for testing how different Large Language Models (LLMs) perform in Model UN-style diplomatic negotiations.
This project creates a simulated environment where LLM agents representing different countries debate, propose solutions, and vote on resolutions related to international issues. The goal is to evaluate how models perform in diplomatic contexts, focusing on their persuasiveness, reasoning, consistency, and adherence to diplomatic norms.
- Multi-agent simulation with different LLMs via API calls (OpenAI)
- Realistic parliamentary procedure with structured debate phases:
  - Opening statements
  - Private strategic notes
  - Proposal submissions
  - Pairwise bilateral discussions about proposals
  - Voting on proposals
  - Delegate peer assessment and ranking
- Rich context memory system ensuring models maintain awareness of all prior exchanges
- Performance metrics tracking (messages, proposals, votes, peer rankings)
- Comprehensive leaderboard system with point-based rankings
- Export functionality for conversation history, metrics, and human-readable transcripts
- Voting mechanism for proposals with transparent results tracking
- Mock mode for testing without API keys
The simulation follows a formal parliamentary procedure:
- Opening Statements: Each delegate presents their country's position and priorities
- Private Notes: Delegates record private strategic notes (not shared directly with others but used to guide their own future decisions)
- Proposal Phase: Delegates submit formal proposals addressing the debate topic
- Pairwise Discussions: Delegates engage in bilateral conversations to discuss the submitted proposals
- Voting Phase: Delegates vote on each proposal (yes/no/abstain) with explanations
- Delegate Ranking: Each delegate ranks their peers based on contributions and diplomacy
- Leaderboard Generation: Final rankings are calculated based on peer assessments
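The phase sequence above can be sketched as a simple driver loop. This is illustrative only; the phase names and function shape are assumptions, not the project's actual API:

```python
# Illustrative sketch of the debate flow; phase and function names are
# assumptions, not the project's actual API.
PHASES = [
    "opening_statements",
    "private_notes",
    "proposal_phase",
    "pairwise_discussions",
    "voting_phase",
    "delegate_ranking",
    "leaderboard",
]

def run_simulation(delegates, topic):
    history = []
    for phase in PHASES:
        for delegate in delegates:
            # Each delegate acts in turn with the full accumulated context.
            history.append({"phase": phase, "delegate": delegate, "topic": topic})
    return history

log = run_simulation(["USA", "China", "EU"], "climate policy")
```

In the real simulation some phases differ in shape (pairwise discussions iterate over delegate pairs rather than individuals), but the overall phase ordering is the same.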
The simulation employs a sophisticated context memory system:
- Each model maintains awareness of all prior statements, proposals, and voting history
- Private notes are included in context for the authoring delegate only
- Character personalities and national interests guide responses consistently
- Prompts for each phase build upon the accumulated context
- Messages are formatted appropriately for different model providers (OpenAI)
This context-rich approach ensures delegates maintain consistent positions, can reference previous statements, and develop more coherent diplomatic strategies.
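A minimal sketch of how such a per-delegate context might be assembled into OpenAI-style chat messages. The field names and message layout here are assumptions, not the project's actual implementation:

```python
def build_context(delegate, public_history, private_notes):
    """Assemble chat messages: all public events, plus private notes
    only when authored by this delegate (sketch; schema assumed)."""
    messages = [{"role": "system",
                 "content": f"You are the delegate for {delegate}."}]
    for event in public_history:
        messages.append({"role": "user",
                         "content": f"[{event['author']}] {event['text']}"})
    for note in private_notes:
        if note["author"] == delegate:  # other delegates' notes stay private
            messages.append({"role": "user",
                             "content": f"[your private note] {note['text']}"})
    return messages

ctx = build_context(
    "USA",
    [{"author": "China", "text": "We propose a carbon fund."}],
    [{"author": "USA", "text": "Seek EU support."},
     {"author": "China", "text": "Stall the vote."}],
)
```

Note that China's private note never appears in the USA delegate's context, which is the key privacy property of the system described above.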
The simulation tracks comprehensive performance metrics:
- Messages: Sent and received by each delegate
- Proposals: Created and passed
- Votes: Cast on proposals
- Ranking Points: Awarded based on peer assessments (higher ranks get more points)
- Reputation Score: Dynamic score affected by diplomatic behavior and proposal success
The final leaderboard ranks delegates based primarily on peer assessment points, with reputation score as a tiebreaker.
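The scoring rule above can be sketched as a two-key sort; the field names and point values are illustrative assumptions, not the project's actual weights:

```python
def leaderboard(delegates):
    """Rank by peer-assessment points, breaking ties on reputation.
    Sketch only; field names and real scoring weights are assumed."""
    return sorted(
        delegates,
        key=lambda d: (d["ranking_points"], d["reputation"]),
        reverse=True,
    )

board = leaderboard([
    {"name": "EU", "ranking_points": 12, "reputation": 0.7},
    {"name": "USA", "ranking_points": 12, "reputation": 0.9},
    {"name": "China", "ranking_points": 10, "reputation": 0.8},
])
# USA outranks EU via the reputation tiebreaker
```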
- Clone the repository:

  ```bash
  git clone https://github.com/Zanger67/CS4650_NLP_GroupProject.git
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with your API keys:

  ```
  OPENAI_API_KEY=your_openai_key_here
  ```

- Run the simulation:

  ```bash
  python3 src/main.py
  ```

- Edit `src/models/models.json` to add new models or change priority settings
- Adjust topics and characters in `src/topics.json` to create new debate scenarios
- Run with custom parameters:

  ```bash
  python3 src/main.py -t [topic_name] -m '{"USA":"gpt-3.5-turbo","China":"gpt-4o","EU":"claude-3-opus-20240229","India":"gemini-1.0-pro"}'
  ```
The simulation generates several output files in the results directory with a timestamp:
- `committee_history.json`: Complete record of all messages, notes, proposals, and voting
- `performance_metrics.json`: Detailed metrics for each delegate's performance
- `delegate_leaderboard.json`: Final rankings and scores for all delegates
- `dialogue_transcript.txt`: Human-readable transcript of the entire debate session
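The JSON outputs can be inspected programmatically. A minimal sketch for loading a single run's artifacts; the file names match the list above, but their internal schema is an assumption:

```python
import json
from pathlib import Path

def load_results(run_dir):
    """Load the JSON artifacts from one timestamped run directory.
    File names match the outputs listed above; schema is assumed."""
    run = Path(run_dir)
    return {
        name: json.loads((run / f"{name}.json").read_text())
        for name in ("committee_history",
                     "performance_metrics",
                     "delegate_leaderboard")
    }
```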
The project includes utilities for analyzing simulation results:
- `src/model_evaluation.py`: Script to aggregate performance metrics across multiple simulation runs
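Cross-run aggregation could look roughly like this. This is a sketch, assuming each run's `performance_metrics.json` maps delegate names to numeric metrics; the actual script's logic and schema may differ:

```python
from collections import defaultdict

def aggregate_metrics(runs):
    """Average each delegate's numeric metrics across runs.
    `runs` is a list of dicts like {"USA": {"proposals_passed": 2}, ...}
    (an assumed schema, not necessarily the project's)."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for run in runs:
        for delegate, metrics in run.items():
            counts[delegate] += 1
            for key, value in metrics.items():
                totals[delegate][key] += value
    return {
        delegate: {k: v / counts[delegate] for k, v in metrics.items()}
        for delegate, metrics in totals.items()
    }

avg = aggregate_metrics([
    {"USA": {"proposals_passed": 2}},
    {"USA": {"proposals_passed": 4}},
])
# avg["USA"]["proposals_passed"] == 3.0
```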
- `src/` - Core source code
  - `main.py` - Main simulation script
  - `models/` - Model implementations and management
    - `models.py` - Model class implementations for different LLM providers
    - `history.py` - Conversation tracking, metrics, and ranking systems
    - `models.json` - Configuration for available models
  - `prompts.py` - Prompt templates for different debate phases
  - `topics.json` - Debate topics and country profiles
  - `utils.py` - Utility functions for parsing and formatting
- `templates/` - Templates for export formats
- `results/` - Output directory for simulation results; these serve as our evaluation dataset (we did not use any external dataset but created our own for benchmarking and evaluation)
- `deprecated/` - Legacy code kept for reference
- Python 3.9+
- Dependencies listed in requirements.txt
- API keys for the LLM providers (in our case, OpenAI)