This project simulates the spread of real versus fake news in a multi-agent environment using Reinforcement Learning. The primary goal is to observe and analyze the dynamics of user trust and agent strategy evolution under various environmental conditions. By modeling user interactions with cognitive biases and a dynamic reputation system, we aim to understand how different feedback mechanisms influence the dissemination of information.
We aimed to observe the propagation patterns of agents employing different content strategies (High Accuracy vs. High Emotionality). To systematically analyze these dynamics, we tracked the following metrics:
- Cumulative Rewards: To assess the long-term performance and viability of each agent's strategy.
- Trust Thermometer (Average Trust): To observe the fluctuation of user trust levels over time.
- Action Distribution: To visualize the breakdown of user responses (Share, Report, Ignore).
- Reward Trend Analysis: To identify shifts in agent performance stability throughout the simulation.
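The metrics above can be computed from per-episode logs. The following is a minimal sketch, not the project's actual logging code: the function names, the list-based log format, and the 50-episode smoothing window are all assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate


def cumulative_rewards(rewards):
    """Running total of an agent's per-episode rewards."""
    return list(accumulate(rewards))


def moving_average(rewards, window=50):
    """Smoothed reward trend; the window size is a hypothetical choice."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(rewards) < window:
        return []
    running = sum(rewards[:window])
    trend = [running / window]
    for i in range(window, len(rewards)):
        # Slide the window forward by one episode.
        running += rewards[i] - rewards[i - window]
        trend.append(running / window)
    return trend


def average_trust(trust_scores):
    """Trust Thermometer: mean trust that users hold in one agent."""
    return sum(trust_scores) / len(trust_scores)


def action_distribution(actions):
    """Share of each user response among all logged responses."""
    counts = Counter(actions)
    total = sum(counts.values()) or 1  # avoid division by zero on empty logs
    return {name: counts[name] / total for name in ("share", "report", "ignore")}
```

Each function takes a plain Python list, so the same helpers could be applied to either agent's log without modification.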
We conducted a series of simulations to observe agent behavior under varying environmental constraints. These configurations tested different settings for Scale (simulation duration and population size), Decision Logic (inclusion of emotional bias), and Punishment Severity (trust penalties). The exact parameters for each recorded experiment are listed in Table 1.
| Experiment ID | Fake Detection Threshold | Penalty Report | Perceived Acc. (w/ Emotion) | Report Prob High | Report Prob Low | Skepticism Penalty | w_perceived_bias | w_perceived_truth | Episodes | Num Users | Pool Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (Baseline) | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 1.1 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 5000 | 1000 | 10000 |
| 1.2 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 1000 | 10000 |
| 1.3 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 10000 | 10000 |
| 1.4 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 1000 | 100 |
| 2 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 2.1 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 1000 | 10000 |
| 3 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 4 | 0.45 | 20 | True | 0.8 | 0.20 | 0.3 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 5 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.6 | 0.4 | 500 | 1000 | 10000 |
| 6 | 0.30 | 50 | True | 0.5 | 0.10 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 7 | 0.30 | 50 | False | 0.5 | 0.10 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 8 | 0.30 | 50 | False | 0.5 | 0.10 | 0.2 | 0.8 | 0.2 | 500 | 1000 | 10000 |
| 9 | 0.30 | 50 | True | 0.9 | 0.05 | 0.3 | 0.7 | 0.3 | 500 | 1000 | 10000 |
- Scale: Experiments with larger populations and longer durations (Exp 1.1, 1.3) were conducted to verify if observed behaviors were consistent at scale.
- Emotional Bias: Experiment 2 introduced an emotionality contribution to perceived accuracy, making it possible to observe how sensationalism affects initial dissemination rates relative to the baseline.
- User Rationality: Experiment 5 altered the weightings of user perception, favoring "Truth" over "Bias". This was designed to observe agent adaptation when facing a more critical user base.
- Penalty Dynamics: Experiments 6 through 9 implemented stricter environmental penalties to observe the threshold at which deceptive strategies might become unviable.
The following parameters remained constant across all simulations:
- report_prob_slope_high: 0.2 (Probability increase of reporting per unit of skepticism)
- reward_share: 20 (Reward for a successful share)
- trust_penalty_report: 0.2 (Trust lost per report)
- trust_reward_share: 0.2 (Trust gained per share)
- epsilon_decay: 0.995 (Exploration decay rate)
Note: Visualizations of the results, including reward trends and action distributions, are available in the Appendix.
The simulation models a dynamic social network environment where information agents (news spreaders) compete for user attention. The environment is built using the OpenAI Gym (Gymnasium) interface, allowing for standard Reinforcement Learning (RL) interaction loops.
Core Components:
- News Pool: A collection of unique news items (10,000 in most experiments; see Pool Size in Table 1). Each item is characterized by:
  - Accuracy: Binary value (0.0 for Fake, 1.0 for Real).
  - Emotionality: A floating-point score (0.0 to 1.0) indicating the sensationalism of the content. Fake news is initialized with higher average emotionality (0.5-1.0) than real news (0.0-0.7).
  - Topic Vector: A 5-dimensional normalized vector representing the semantic subject matter of the news.
- User Population: A population of users that stays fixed for the duration of a simulation (1,000 in most experiments; see Num Users in Table 1). Each user possesses:
  - Skepticism Level: A randomized value (0.0 to 1.0) assigned at the start of each simulation, determining their baseline suspicion of news.
  - Interest Bias: A 5-dimensional vector representing their topic preferences.
  - Trust Matrix: A memory store mapping Agent IDs to a dynamic Trust Score (initially 0.5).
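The components above could be represented along the following lines. This is a sketch under stated assumptions rather than the project's implementation: the class and function names are hypothetical, values are drawn uniformly within the documented ranges, vector components are non-negative, and the pool is split evenly between real and fake items.

```python
import math
import random
from dataclasses import dataclass, field


def random_unit_vector(rng, dim=5):
    """Random non-negative vector scaled to unit length (assumed sampling scheme)."""
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


@dataclass
class NewsItem:
    accuracy: float      # 0.0 for fake, 1.0 for real
    emotionality: float  # sensationalism score in [0.0, 1.0]
    topic: list          # 5-dimensional normalized topic vector


@dataclass
class User:
    skepticism: float                          # baseline suspicion in [0.0, 1.0]
    interest: list                             # 5-dimensional topic preferences
    trust: dict = field(default_factory=dict)  # agent ID -> trust score

    def trust_in(self, agent_id):
        # Agents the user has not yet rated start at the initial trust of 0.5.
        return self.trust.get(agent_id, 0.5)


def make_news_pool(rng, size=10_000):
    """First half real, second half fake (the even split is an assumption)."""
    pool = []
    for i in range(size):
        is_real = i < size // 2
        low, high = (0.0, 0.7) if is_real else (0.5, 1.0)
        pool.append(NewsItem(1.0 if is_real else 0.0,
                             rng.uniform(low, high),
                             random_unit_vector(rng)))
    return pool


def make_users(rng, n=1_000):
    return [User(rng.random(), random_unit_vector(rng)) for _ in range(n)]
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is useful when comparing experiments that differ in a single parameter.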
Two distinct Q-Learning agents interact with the environment: a Real News Agent and a Fake News Agent.
Reinforcement Learning Formulation:
- State Space: The problem is modeled as a multi-armed bandit scenario where the state is constant (dummy state 0), emphasizing immediate action selection over sequential navigation.
- Action Space: Discrete space corresponding to the subset of news items available to the agent (5,000 real items for the Real Agent, 5,000 fake items for the Fake Agent).
- Policy: Agents utilize an Epsilon-Greedy strategy ($\epsilon$-greedy) to balance exploration (trying new headlines) and exploitation (using headlines known to generate rewards).
- Learning: Agents update their Q-values using the standard Bellman equation based on the immediate reward received from user interactions.
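With a single constant state and only the immediate reward, the Q-learning update reduces to a bandit-style rule, Q(a) ← Q(a) + α (r − Q(a)). The sketch below illustrates that formulation. It assumes a discount factor of zero, and the learning rate `alpha`, the starting `epsilon`, and the floor `epsilon_min` are hypothetical values; only `epsilon_decay = 0.995` comes from the parameter list.

```python
import random


class BanditAgent:
    """Epsilon-greedy agent over a fixed set of news items (single dummy state)."""

    def __init__(self, n_actions, alpha=0.1, epsilon=1.0,
                 epsilon_decay=0.995, epsilon_min=0.01, seed=None):
        self.q = [0.0] * n_actions       # one Q-value per news item
        self.alpha = alpha               # learning rate (assumed value)
        self.epsilon = epsilon           # initial exploration rate (assumed value)
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min   # exploration floor (assumed value)
        self.rng = random.Random(seed)

    def select_action(self):
        # Explore with probability epsilon, otherwise pick the best-known item.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q))
        return max(range(len(self.q)), key=self.q.__getitem__)

    def update(self, action, reward):
        # Move the estimate toward the observed immediate reward.
        self.q[action] += self.alpha * (reward - self.q[action])

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
```

A typical loop would call `select_action()`, present the chosen item to users, pass the aggregated reward to `update()`, and call `decay_epsilon()` once per episode.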
The core complexity of the simulation lies in the user's decision-making process, which models cognitive biases and trust mechanics.
Perceived Accuracy:
Users do not know the ground truth. Instead, they calculate a Perceived Accuracy score based on:
- Confirmation Bias: Similarity between the user's interest vector and the news topic.
- Source Credibility: The current trust score of the agent sharing the news.
- Emotional Impact: (Optional configuration) High emotionality can artificially inflate perceived accuracy for less skeptical users.
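One way the three factors could be combined is a clipped weighted sum. The formula below is an assumption, not the project's confirmed scoring rule: it maps `w_perceived_bias` to topic similarity and `w_perceived_truth` to source trust, and the emotional term, its weight `w_emotion`, and its scaling by `(1 - skepticism)` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def perceived_accuracy(interest, topic, trust, emotionality, skepticism,
                       w_bias=0.9, w_truth=0.1,
                       use_emotion=False, w_emotion=0.1):
    # Confirmation bias: how closely the item matches the user's interests.
    score = w_bias * cosine_similarity(interest, topic)
    # Source credibility: the user's current trust in the sending agent.
    score += w_truth * trust
    # Optional emotional impact, weaker for more skeptical users (assumed form).
    if use_emotion:
        score += w_emotion * emotionality * (1.0 - skepticism)
    return min(1.0, max(0.0, score))
```

Under these assumptions, the baseline weights (0.9 and 0.1) let topic alignment dominate the score, while the Experiment 5 weights (0.6 and 0.4) give source trust considerably more influence.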
Action Selection: Based on the Perceived Accuracy and their Skepticism Level, users classify news as "Trusted" or "Detected Fake".
- Share: Users share news that aligns with their interests, matches their emotional triggers, and comes from a trusted source.
- Report: Users report news they detect as fake. High skepticism users are more likely to report.
- Ignore: Users ignore news whose signal is too weak to prompt a Share or a Report.
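The classification and response steps above might look like the following. This is an illustrative sketch only: the decision rule, the `skeptic_cutoff` that separates high- from low-skepticism users, and the use of perceived accuracy as the share probability are assumptions. The default threshold and report probabilities are the baseline values from Table 1.

```python
import random


def choose_action(perceived, skepticism, rng,
                  fake_detection_threshold=0.30,
                  report_prob_high=0.8, report_prob_low=0.2,
                  skeptic_cutoff=0.5):
    """Return "share", "report", or "ignore" for one user and one news item."""
    if perceived < fake_detection_threshold:
        # "Detected Fake": highly skeptical users report more often.
        p_report = report_prob_high if skepticism >= skeptic_cutoff else report_prob_low
        return "report" if rng.random() < p_report else "ignore"
    # "Trusted": the stronger the signal, the likelier a share (assumed rule).
    return "share" if rng.random() < perceived else "ignore"
```

For example, `choose_action(0.1, 0.9, random.Random(0))` evaluates an item that falls below the threshold for a highly skeptical user, so the outcome is either a report or an ignore, never a share.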
The Reputation System is the feedback mechanism that drives long-term agent performance.
- Trust Erosion: When a user "Reports" a news item, their trust in that specific agent decreases by a penalty factor ($P_{report} = 0.2$).
- Trust Building: When a user "Shares" a news item, their trust in the agent increases by a reward factor ($R_{share} = 0.2$).
- Consequence: Since trust is a component of Perceived Accuracy, a low trust score creates a self-reinforcing loop: lower trust reduces Perceived Accuracy, which invites more reports and further trust loss. Even "appealing" fake news is eventually rejected if the source has lost credibility.
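The trust update itself is simple to sketch. The version below uses the documented values for `trust_penalty_report` and `trust_reward_share`; clipping the score to the range 0.0 to 1.0 is an assumption, as the function name and the string action labels are.

```python
def update_trust(trust, action,
                 trust_penalty_report=0.2, trust_reward_share=0.2):
    """Return a user's new trust in an agent after one response."""
    if action == "report":
        trust -= trust_penalty_report
    elif action == "share":
        trust += trust_reward_share
    # "ignore" leaves trust unchanged; clipping to [0, 1] is an assumed bound.
    return min(1.0, max(0.0, trust))
```

Starting from the initial score of 0.5, three consecutive reports under this rule take a user's trust to roughly 0.3, then 0.1, then 0.0, which shows how quickly a source can lose its credibility component.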