brkeudunman/CENG567_MultiAgentReinforcementLearning_Project

MARL_News_Spread_Simulation

1. Project Overview

This project simulates the spread of real versus fake news in a multi-agent environment using Reinforcement Learning. The primary goal is to observe and analyze the dynamics of user trust and agent strategy evolution under various environmental conditions. By modeling user interactions with cognitive biases and a dynamic reputation system, we aim to understand how different feedback mechanisms influence the dissemination of information.

2. Experimental Work

2.1 Overview and Objectives

We aimed to observe the propagation patterns of agents employing different content strategies (High Accuracy vs. High Emotionality). To systematically analyze these dynamics, we tracked the following metrics:

  • Cumulative Rewards: To assess the long-term performance and viability of each agent's strategy.
  • Trust Thermometer (Average Trust): To observe the fluctuation of user trust levels over time.
  • Action Distribution: To visualize the breakdown of user responses (Share, Report, Ignore).
  • Reward Trend Analysis: To identify shifts in agent performance stability throughout the simulation.
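
As a rough illustration of how two of these metrics could be computed from logged data, the sketch below derives an action distribution and a rolling-mean reward trend. The function names and the choice of a rolling mean for trend analysis are assumptions made for this example; they are not taken from the project code.

```python
from collections import Counter

ACTIONS = ("share", "report", "ignore")


def action_distribution(actions):
    """Return the fraction of each user response in a list of logged actions."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {a: counts.get(a, 0) / total for a in ACTIONS}


def rolling_mean(values, window):
    """Smooth a per-episode reward series with a trailing window (assumed trend measure)."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


print(action_distribution(["share", "share", "report", "ignore"]))
print(rolling_mean([1, 2, 3, 4], window=2))
```

Cumulative rewards follow from a running sum of the same per-episode series, and the Trust Thermometer from averaging the per-user trust scores after each episode.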

2.2 Experimental Configurations

We conducted a series of simulations to observe agent behavior under varying environmental constraints. These configurations tested different settings for Scale (episode count, population size, and news pool size), Decision Logic (inclusion of emotional bias and the balance between bias and truth weights), and Punishment Severity (report and skepticism penalties). The exact parameters for each recorded experiment are listed in Table 1.

Table 1: Configuration Variations

| Experiment ID | Fake Detection Threshold | Penalty Report | Perceived Acc. (w/ Emotion) | Report Prob High | Report Prob Low | Skepticism Penalty | w_perceived_bias | w_perceived_truth | Episodes | Num Users | Pool Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (Baseline) | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 1.1 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 5000 | 1000 | 10000 |
| 1.2 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 1000 | 10000 |
| 1.3 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 10000 | 10000 |
| 1.4 | 0.30 | 20 | False | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 1000 | 100 |
| 2 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 2.1 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 1000 | 1000 | 10000 |
| 3 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 4 | 0.45 | 20 | True | 0.8 | 0.20 | 0.3 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 5 | 0.30 | 20 | True | 0.8 | 0.20 | 0.2 | 0.6 | 0.4 | 500 | 1000 | 10000 |
| 6 | 0.30 | 50 | True | 0.5 | 0.10 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 7 | 0.30 | 50 | False | 0.5 | 0.10 | 0.2 | 0.9 | 0.1 | 500 | 1000 | 10000 |
| 8 | 0.30 | 50 | False | 0.5 | 0.10 | 0.2 | 0.8 | 0.2 | 500 | 1000 | 10000 |
| 9 | 0.30 | 50 | True | 0.9 | 0.05 | 0.3 | 0.7 | 0.3 | 500 | 1000 | 10000 |
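
One way to keep such variations manageable in code is to define the baseline once and express each experiment as a set of overrides. The sketch below is a hypothetical representation: the `ExperimentConfig` class and most field names are invented for this example (only `w_perceived_bias` and `w_perceived_truth` appear verbatim in Table 1), and only a few rows are shown.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    # Defaults mirror the "1 (Baseline)" row of Table 1.
    fake_detection_threshold: float = 0.30
    penalty_report: float = 20
    emotion_in_perceived_acc: bool = False
    report_prob_high: float = 0.8
    report_prob_low: float = 0.20
    skepticism_penalty: float = 0.2
    w_perceived_bias: float = 0.9
    w_perceived_truth: float = 0.1
    episodes: int = 500
    num_users: int = 1000
    pool_size: int = 10000


BASELINE = ExperimentConfig()

# Each experiment lists only the values that differ from the baseline.
EXPERIMENTS = {
    "1": BASELINE,
    "1.1": replace(BASELINE, episodes=5000),
    "1.3": replace(BASELINE, episodes=1000, num_users=10000),
    "5": replace(BASELINE, emotion_in_perceived_acc=True,
                 w_perceived_bias=0.6, w_perceived_truth=0.4),
    "9": replace(BASELINE, penalty_report=50, emotion_in_perceived_acc=True,
                 report_prob_high=0.9, report_prob_low=0.05,
                 skepticism_penalty=0.3,
                 w_perceived_bias=0.7, w_perceived_truth=0.3),
}

for exp_id, cfg in EXPERIMENTS.items():
    print(exp_id, cfg.episodes, cfg.w_perceived_bias, cfg.w_perceived_truth)
```

Writing experiments as overrides makes it easy to see at a glance which parameters each run actually changes.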

2.3 Experimental Observations

  • Scale: Experiments with longer durations (Exp 1.1, 5,000 episodes) and larger populations (Exp 1.3, 10,000 users) were conducted to verify whether observed behaviors were consistent at scale.
  • Emotional Bias: Experiment 2 introduced an emotionality contribution to perceived accuracy, allowing us to observe how sensationalism affects initial dissemination rates compared to the baseline.
  • User Rationality: Experiment 5 shifted the user perception weights toward "Truth" (w_perceived_truth raised from 0.1 to 0.4, w_perceived_bias lowered from 0.9 to 0.6), although "Bias" still carries the larger weight. This was designed to observe agent adaptation when facing a more critical user base.
  • Penalty Dynamics: Experiments 6 through 9 implemented stricter environmental penalties (Penalty Report raised from 20 to 50) to observe the threshold at which deceptive strategies might become unviable.

2.4 Consistent Parameters

The following parameters remained constant across all simulations:

  • report_prob_slope_high: 0.2 (Probability increase of reporting per unit of skepticism)
  • reward_share: 20 (Reward for a successful share)
  • trust_penalty_report: 0.2 (Trust lost per report)
  • trust_reward_share: 0.2 (Trust gained per share)
  • epsilon_decay: 0.995 (Exploration decay rate)
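
To give a feel for what epsilon_decay = 0.995 means over a run, the sketch below computes the exploration rate after a given number of episodes. It assumes an initial epsilon of 1.0 and one multiplicative decay step per episode; neither assumption is stated in this document.

```python
import math

EPSILON_DECAY = 0.995  # from the constant-parameter list


def epsilon_after(episodes, epsilon_start=1.0, epsilon_min=0.0):
    """Exploration rate after `episodes` decay steps (assumes one step per episode)."""
    return max(epsilon_min, epsilon_start * EPSILON_DECAY ** episodes)


def episodes_until(target, epsilon_start=1.0):
    """Smallest number of decay steps after which epsilon is at or below `target`."""
    return math.ceil(math.log(target / epsilon_start) / math.log(EPSILON_DECAY))


print(round(epsilon_after(500), 3))   # about 0.082 under the stated assumptions
print(episodes_until(0.1))            # 460
```

Under these assumptions a 500-episode run ends with roughly 8% exploration, while a 5,000-episode run is almost purely greedy for most of its duration.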

Note: Visualizations of the results, including reward trends and action distributions, are available in the Appendix.

3. Methodology

3.1 Simulation Environment

The simulation models a dynamic social network environment where information agents (news spreaders) compete for user attention. The environment is built using the OpenAI Gym (Gymnasium) interface, allowing for standard Reinforcement Learning (RL) interaction loops.

Core Components:

  • News Pool: A collection of 10,000 unique news items by default (the Pool Size column in Table 1; Exp 1.4 uses 100). Each item is characterized by:
    • Accuracy: Binary value (0.0 for Fake, 1.0 for Real).
    • Emotionality: A floating-point score (0.0 to 1.0) indicating the sensationalism of the content. Fake news is initialized with higher average emotionality (0.5-1.0) compared to real news (0.0-0.7).
    • Topic Vector: A 5-dimensional normalized vector representing the semantic subject matter of the news.
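  • User Population: A population of 1,000 users by default (the Num Users column in Table 1; Exp 1.3 uses 10,000), fixed for the duration of a simulation. Each user possesses: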
    • Skepticism Level: A randomized value (0.0 to 1.0) assigned at the start of each simulation, determining their baseline suspicion of news.
    • Interest Bias: A 5-dimensional vector representing their topic preferences.
    • Trust Matrix: A memory store mapping Agent IDs to a dynamic Trust Score (initially 0.5).
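
The components above could be initialized along the following lines. This is a minimal sketch using only the standard library: the uniform sampling, the L2 normalization of topic vectors, the even real/fake split, and all function and key names are assumptions for illustration, not the project's actual code.

```python
import math
import random

TOPIC_DIM = 5


def random_unit_vector(rng, dim=TOPIC_DIM):
    """Sample a non-negative vector and scale it to unit L2 norm (assumed normalization)."""
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def make_news_pool(rng, pool_size=10_000):
    """First half real (accuracy 1.0), second half fake (accuracy 0.0)."""
    pool = []
    for i in range(pool_size):
        is_real = i < pool_size // 2
        # Emotionality ranges from the text: real 0.0-0.7, fake 0.5-1.0.
        lo, hi = (0.0, 0.7) if is_real else (0.5, 1.0)
        pool.append({
            "accuracy": 1.0 if is_real else 0.0,
            "emotionality": rng.uniform(lo, hi),
            "topic": random_unit_vector(rng),
        })
    return pool


def make_users(rng, num_users=1_000, agent_ids=("real", "fake")):
    """Each user gets a skepticism level, an interest vector, and trust 0.5 per agent."""
    return [{
        "skepticism": rng.uniform(0.0, 1.0),
        "interest": random_unit_vector(rng),
        "trust": {agent: 0.5 for agent in agent_ids},
    } for _ in range(num_users)]


rng = random.Random(42)
pool = make_news_pool(rng, pool_size=100)
users = make_users(rng, num_users=10)
print(len(pool), len(users), users[0]["trust"])
```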

3.2 Agent Design

Two distinct Q-Learning agents interact with the environment: a Real News Agent and a Fake News Agent.

Reinforcement Learning Formulation:

  • State Space: The problem is modeled as a multi-armed bandit scenario where the state is constant (dummy state 0), emphasizing immediate action selection over sequential navigation.
  • Action Space: Discrete space corresponding to the subset of news items available to the agent (5,000 real items for the Real Agent, 5,000 fake items for the Fake Agent).
  • Policy: Agents utilize an Epsilon-Greedy strategy ($\epsilon$-greedy) to balance exploration (trying new headlines) and exploitation (using headlines known to generate rewards).
  • Learning: Agents update their Q-values with the standard Q-learning (Bellman) update, driven by the immediate reward received from user interactions.
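
The formulation above can be sketched as a single-state, epsilon-greedy Q-learning agent. The class name, the learning rate `alpha`, the discount `gamma` (set to 0 here so that only the immediate reward matters), and the starting epsilon are assumptions for illustration; only the decay rate of 0.995 comes from this document.

```python
import random


class BanditQAgent:
    """Epsilon-greedy Q-learning over a single dummy state (a hypothetical sketch)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.0, epsilon=1.0,
                 epsilon_decay=0.995, epsilon_min=0.0, seed=None):
        self.q = [0.0] * n_actions      # one Q-value per news item
        self.alpha = alpha              # learning rate (assumed)
        self.gamma = gamma              # discount factor (assumed 0: bandit setting)
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.rng = random.Random(seed)

    def select_action(self):
        # Explore with probability epsilon, otherwise pick a best-known item.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q))
        best = max(self.q)
        return self.rng.choice([a for a, v in enumerate(self.q) if v == best])

    def update(self, action, reward):
        # Q(a) <- Q(a) + alpha * (r + gamma * max Q - Q(a)); the state never changes.
        target = reward + self.gamma * max(self.q)
        self.q[action] += self.alpha * (target - self.q[action])

    def end_episode(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


agent = BanditQAgent(n_actions=5, seed=0)
action = agent.select_action()
agent.update(action, reward=20.0)   # e.g. one successful share
agent.end_episode()
print(action, agent.q[action], agent.epsilon)
```

With a constant state, the Q-table collapses to one value per news item, which is why the problem reduces to a multi-armed bandit.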

3.3 User Dynamics & Decision Logic

The core complexity of the simulation lies in the user's decision-making process, which models cognitive biases and trust mechanics.

Perceived Accuracy: Users do not know the ground truth. Instead, they calculate a Perceived Accuracy score based on:

  1. Confirmation Bias: Similarity between the user's interest vector and the news topic.
  2. Source Credibility: The current trust score of the agent sharing the news.
  3. Emotional Impact: (Optional configuration) High emotionality can artificially inflate perceived accuracy for less skeptical users.

$$ \text{Perceived} = w_{truth} \cdot \text{Accuracy} + w_{bias} \cdot \left( \frac{\text{TopicMatch} + \text{Trust} + \text{Emotion}}{3} \right) $$
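
The formula can be traced with a short numeric sketch. Two details below are assumptions, not taken from the source: topic match is computed as cosine similarity, and when the emotion term is disabled the bias term is averaged over the two remaining components.

```python
import math


def cosine_similarity(u, v):
    """Topic match between a user's interest vector and a news topic vector (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def perceived_accuracy(accuracy, topic_match, trust, emotionality,
                       w_truth=0.1, w_bias=0.9, use_emotion=False):
    """Weighted mix of ground truth and the averaged bias components."""
    components = [topic_match, trust]
    if use_emotion:
        components.append(emotionality)
    bias_term = sum(components) / len(components)
    return w_truth * accuracy + w_bias * bias_term


# A well-matched, emotional fake item versus a poorly matched, dry real item,
# both from sources at the initial trust of 0.5, with baseline weights.
fake = perceived_accuracy(0.0, topic_match=0.8, trust=0.5, emotionality=0.9, use_emotion=True)
real = perceived_accuracy(1.0, topic_match=0.2, trust=0.5, emotionality=0.1, use_emotion=True)
print(round(fake, 2), round(real, 2))   # 0.66 0.34
```

With the baseline weights (0.9 on bias, 0.1 on truth), the fake item in this example is perceived as more accurate than the real one, which is the effect the bias weighting is meant to model.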

Action Selection: Based on the Perceived Accuracy and their Skepticism Level, users classify news as "Trusted" or "Detected Fake".

  • Share: Users share news that aligns with their interests, matches their emotional triggers, and comes from a trusted source.
  • Report: Users report news they detect as fake. High-skepticism users are more likely to report.
  • Ignore: Users ignore news when the signal is too weak to prompt a Share or Report.
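
The exact decision rule is not spelled out here, so the following is only one possible reading of it. It assumes that skepticism lowers the effective score via the Skepticism Penalty, that items below the Fake Detection Threshold count as "Detected Fake", that the report probability interpolates between Report Prob Low and Report Prob High with skepticism, and that trusted items are shared with probability equal to the effective score. All of these rules and the function name are hypothetical.

```python
import random


def choose_action(perceived, skepticism, rng,
                  threshold=0.30, skepticism_penalty=0.2,
                  report_prob_high=0.8, report_prob_low=0.20):
    """Return "share", "report", or "ignore" for one user and one news item."""
    # Assumption: more skeptical users discount the perceived accuracy.
    effective = perceived - skepticism_penalty * skepticism

    if effective < threshold:
        # "Detected Fake": assumed linear interpolation of the report probability.
        p_report = report_prob_low + (report_prob_high - report_prob_low) * skepticism
        return "report" if rng.random() < p_report else "ignore"

    # "Trusted": assumed share probability equal to the clipped effective score.
    p_share = min(1.0, max(0.0, effective))
    return "share" if rng.random() < p_share else "ignore"


rng = random.Random(7)
actions = [choose_action(0.25, skepticism=0.9, rng=rng) for _ in range(1000)]
print({a: actions.count(a) for a in ("share", "report", "ignore")})
```

Under this reading, an item perceived at 0.25 by a highly skeptical user is never shared and is reported most of the time, with the remainder ignored.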

3.4 Reputation System (Trust Dynamics)

The Reputation System is the feedback mechanism that drives long-term agent performance.

  • Trust Erosion: When a user "Reports" a news item, their trust in that specific agent decreases by a penalty factor ($P_{report} = 0.2$).
  • Trust Building: When a user "Shares" a news item, their trust in the agent increases by a reward factor ($R_{share} = 0.2$).
  • Consequence: Since trust is a component of Perceived Accuracy, a low trust score feeds back on itself: lower trust reduces perceived accuracy, which invites more reports and further trust loss. Even "appealing" fake news is eventually rejected if the source has lost credibility.
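
The trust update can be sketched in a few lines. The penalty and reward values come from the constant-parameter list; clipping trust to the range 0.0 to 1.0 and the function name are assumptions for this example.

```python
TRUST_PENALTY_REPORT = 0.2  # trust lost per report
TRUST_REWARD_SHARE = 0.2    # trust gained per share


def update_trust(trust, action):
    """Apply one user response to that user's trust in the publishing agent."""
    if action == "report":
        trust -= TRUST_PENALTY_REPORT
    elif action == "share":
        trust += TRUST_REWARD_SHARE
    # Assumption: trust is kept within [0.0, 1.0].
    return min(1.0, max(0.0, trust))


trust = 0.5  # initial trust score
for action in ["report", "report", "report", "share"]:
    trust = update_trust(trust, action)
    print(action, round(trust, 2))
```

Starting from 0.5, three consecutive reports drive a user's trust to the floor, and a single later share only restores it to 0.2, which illustrates how quickly a source can lose credibility.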
