aic-factcheck/tracking_emerging_topics

Emerging Topics in Social Networks

Python 3.9+ · License: MIT · Thesis

A framework for temporal topic modeling and analysis of social media data from Threads, Mastodon, and Bluesky. This repository implements and compares four approaches to tracking emerging topics and their evolution over time, with applications in cybersecurity, misinformation detection, and harmful narrative analysis.

Overview

This project addresses the need for methods that track emerging topics and their temporal evolution on social media platforms. Social media has become a dominant source of information and public discourse, but also a conduit for misinformation and malicious activity. This work provides tools and methodologies for temporal topic analysis, with emphasis on interpretability, topic representation, and temporal coherence.

Key Features

  • Multi-platform data collection: Scrapers for Threads, Mastodon, and Bluesky
  • Novel datasets: Curated and preprocessed temporal datasets from multiple social networks
  • Four temporal topic modeling approaches:
    • NMF with Temporal Regularization - Statistical matrix factorization with temporal smoothing
    • BERTopic - Transformer-based semantic topic modeling
    • NetDTM - Neural dynamic topic model with time-aware attention
    • ST-DBSCAN (Modified) - Spatio-temporal clustering with semantic similarity
  • Interactive visualization tools for exploring topic lifecycles

Repository Structure

.
├── Scraping/                    # Data collection modules
│   ├── threads_scraper.py       # Threads API scraper using metathreads
│   ├── bluesky_scraper.py       # Bluesky/AT Protocol scraper
│   └── mastodon_hashtag_scraper.py  # Mastodon hashtag-based scraper
│
├── Preprocessing/               # Data preprocessing pipelines
│   ├── threads_preprocess.py    # Threads data cleaner
│   ├── bluesky_preprocess.py    # Bluesky data cleaner
│   └── mastodon_preprocess.py   # Mastodon data cleaner
│
├── NMF_Temp_Reg/               # NMF with Temporal Regularization
│   └── NMF_temp_reg.py         # Implementation of temporal NMF
│
├── BERTopic/                   # BERTopic implementation
│   └── bertopic.ipynb          # Jupyter notebook with BERTopic experiments
│
├── NetDTM/                     # Neural Dynamic Topic Model
│   └── NetDTM.py               # PyTorch implementation of NetDTM
│
├── ST_DBSCAN_mod/              # Modified ST-DBSCAN
│   └── stdbscan_modified.py    # Semantic ST-DBSCAN clustering
│
└── docs/                       # Documentation
    └── thesis.md               # Full thesis document

Installation

Prerequisites

  • Python 3.9 or higher
  • CUDA-capable GPU (recommended for NetDTM and BERTopic)

Dependencies

Core dependencies include:

pip install numpy pandas scikit-learn torch transformers sentence-transformers
pip install bertopic umap-learn hdbscan plotly matplotlib seaborn
pip install nltk requests metathreads

For the complete environment setup, see the individual module documentation.

Usage

1. Data Collection

Threads Scraper

from Scraping.threads_scraper import get_thread_posts

# Scrape the posts of a single thread, identified by its URL
posts = get_thread_posts("https://www.threads.net/@username/post/...")

Bluesky Scraper

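from datetime import datetime

from Scraping.bluesky_scraper import BlueskyScraper

# Log in with your Bluesky credentials (avoid hard-coding real passwords)
scraper = BlueskyScraper("username", "password")
# Fetch the posts of one profile within a date range
posts = scraper.get_profile_posts("handle.bsky.social",
                                  start_time=datetime(2024, 9, 1),
                                  end_time=datetime(2024, 12, 31))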

Mastodon Scraper

from Scraping.mastodon_hashtag_scraper import MastodonHashtagScraper

scraper = MastodonHashtagScraper("https://mastodon.social")
posts = scraper.scrape_hashtag_with_date_range("topic", 
                                               start_date="2024-09-01",
                                               end_date="2024-12-31")

2. Data Preprocessing

from Preprocessing.threads_preprocess import preprocess_threads
from Preprocessing.bluesky_preprocess import preprocess_bluesky
from Preprocessing.mastodon_preprocess import preprocess_mastodon

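# Preprocess Threads data (returns cleaned posts and their timestamps)
threads_posts, threads_timestamps = preprocess_threads("threads_data.json")

# Preprocess Bluesky data (also returns the input prepared for NetDTM)
bluesky_posts, bluesky_timestamps, bluesky_netdtm_data = preprocess_bluesky("bluesky_data.json")

# Preprocess Mastodon data (also returns the input prepared for NetDTM)
mastodon_posts, mastodon_timestamps, mastodon_netdtm_data = preprocess_mastodon("mastodon_data.json")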

3. Topic Modeling

NMF with Temporal Regularization

from NMF_Temp_Reg.NMF_temp_reg import TemporalNMF

# df: a pandas DataFrame of preprocessed posts with 'text' and 'timestamp' columns
model = TemporalNMF(n_topics=10, temporal_regularization=0.1)
grouped_data = model.preprocess_data(df, text_col='text', time_col='timestamp')
model.fit(grouped_data)

BERTopic

# See BERTopic/bertopic.ipynb for complete workflow
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# docs: a list of preprocessed post texts (strings)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
topic_model = BERTopic(embedding_model=embedding_model)
# topics: one topic id per document; probs: the assignment probabilities
topics, probs = topic_model.fit_transform(docs)

NetDTM

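from NetDTM.NetDTM import NetDTMTrainer

# vocab_size, topic_dim, embedding_dim, train_data and val_data must be
# defined beforehand to match your preprocessed dataset
trainer = NetDTMTrainer(vocab_size, topic_dim, embedding_dim)
trainer.train(train_data, val_data, epochs=100)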

Modified ST-DBSCAN

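from ST_DBSCAN_mod.stdbscan_modified import SemanticSTDBSCAN

# embeddings: one text embedding per post; timestamps: the matching post times
clusterer = SemanticSTDBSCAN(eps_semantic=0.2, eps_temporal=3, min_samples=5)
clusterer.fit(embeddings, timestamps)
labels = clusterer.labels_  # one cluster label per post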

Methods Overview

NMF with Temporal Regularization

A statistical approach that extends Non-Negative Matrix Factorization with temporal smoothing. The method applies regularization to pull document-topic distributions toward the previous time period's average, ensuring temporal coherence in topic evolution.

Key Parameters:

  • n_topics: Number of topics to extract
  • temporal_regularization: Weight for temporal smoothing (0-1)
  • max_features: Vocabulary size limit
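The smoothing idea described above can be sketched in a few lines. This is a minimal illustration and not the code in NMF_temp_reg.py: the helper names `column_means` and `smooth_toward_previous`, and the choice of a convex combination as the form of the pull, are assumptions made for the example.

```python
from typing import List


def column_means(weights: List[List[float]]) -> List[float]:
    """Average topic weights over all documents of one time period."""
    n_docs = len(weights)
    n_topics = len(weights[0])
    return [sum(row[k] for row in weights) / n_docs for k in range(n_topics)]


def smooth_toward_previous(
    current: List[List[float]],
    previous_mean: List[float],
    temporal_regularization: float,
) -> List[List[float]]:
    """Blend each document-topic row with the previous period's average.

    Assumption: the pull is a convex combination weighted by
    `temporal_regularization` (0 keeps the row unchanged, 1 replaces it).
    """
    lam = temporal_regularization
    return [
        [(1.0 - lam) * w + lam * m for w, m in zip(row, previous_mean)]
        for row in current
    ]


# Two periods, two documents each, two topics (toy numbers).
period_1 = [[0.9, 0.1], [0.7, 0.3]]
period_2 = [[0.2, 0.8], [0.4, 0.6]]

prev_mean = column_means(period_1)
smoothed = smooth_toward_previous(period_2, prev_mean, 0.1)
```

With a small weight such as 0.1, each row of the second period moves only slightly toward the first period's average, which is what keeps topic weights from jumping abruptly between adjacent periods.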

BERTopic

A transformer-based topic modeling approach leveraging BERT embeddings and c-TF-IDF. Uses UMAP for dimensionality reduction and HDBSCAN for clustering, with dynamic topic modeling capabilities for temporal analysis.

Features:

  • Semantic embeddings via sentence transformers
  • Dynamic topic modeling over time
  • Interactive visualizations
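To make the c-TF-IDF step more concrete, here is a small pure-Python sketch of the class-based weighting, following the formula given in the BERTopic documentation: the term frequency within a topic times log(1 + A / f), where A is the average number of words per topic and f is the word's frequency across all topics. The function name `c_tf_idf` and the toy documents are illustrative assumptions; the library's own implementation works on sparse matrices and offers further options.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score words per topic: tf(word, topic) * log(1 + A / f(word))."""
    # Treat all documents of a topic as one long "class document".
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f(word): frequency of the word across all topics.
    total: Counter = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(term_counts)
    return {
        topic: {w: tf * math.log(1 + avg_words / total[w]) for w, tf in counts.items()}
        for topic, counts in term_counts.items()
    }


# Toy input: documents already grouped by their assigned topic id.
toy = {
    0: ["phishing email scam", "new phishing campaign"],
    1: ["election rumor spreads", "new election poll"],
}
scores = c_tf_idf(toy)
top_word_topic_0 = max(scores[0], key=scores[0].get)
```

Words that occur in several topics (here "new") are weighted down, while words concentrated in one topic rise to the top of that topic's representation.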

NetDTM (Neural Dynamic Topic Model)

A neural approach implementing time-aware attention mechanisms and optimal transport for modeling topic evolution. Uses sinusoidal time embeddings and attention-based topic-word distributions.

Key Components:

  • TimeEmbedding: Sinusoidal positional encoding for timestamps
  • TimeAwareAttention: Attention mechanism incorporating temporal information
  • TimeAwareOT: Optimal transport for topic distribution alignment
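As a rough illustration of the first component, the sketch below computes a sinusoidal encoding for a scalar time index in the style of Transformer positional encodings. It is an assumption that the TimeEmbedding module follows this interleaved sin/cos layout and the base of 10000; the function name `sinusoidal_time_embedding` is hypothetical, and the PyTorch implementation in NetDTM.py may differ in both respects.

```python
import math
from typing import List


def sinusoidal_time_embedding(t: float, dim: int, base: float = 10000.0) -> List[float]:
    """Encode a scalar time index as interleaved sin/cos features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    embedding = []
    for i in range(dim // 2):
        # Frequencies decrease geometrically across the feature pairs.
        freq = 1.0 / (base ** (2 * i / dim))
        embedding.append(math.sin(t * freq))
        embedding.append(math.cos(t * freq))
    return embedding


e0 = sinusoidal_time_embedding(0, 4)  # time index 0
e1 = sinusoidal_time_embedding(1, 4)  # time index 1
```

Nearby time indices produce similar vectors, which gives an attention mechanism a smooth notion of temporal distance without learning a separate embedding per timestamp.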

Modified ST-DBSCAN

An adaptation of Spatio-Temporal DBSCAN that replaces spatial distance with a semantic measure: the cosine similarity between text embeddings. Two posts are neighbors only if they are close in both content and time, so clusters group posts that are semantically similar and temporally proximate.

Parameters:

  • eps_semantic: Semantic similarity threshold
  • eps_temporal: Temporal distance threshold (in days)
  • min_samples: Minimum number of neighboring points required for a point to be a core point
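The neighborhood test at the heart of this method can be sketched as follows. This is a simplified illustration, not the code in stdbscan_modified.py. It assumes that eps_semantic is applied to cosine distance (1 minus cosine similarity), that timestamps are given as day numbers, and that a point counts itself toward min_samples; the helper names and the toy min_samples value of 2 are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def neighbors(
    idx: int,
    embeddings: Sequence[Sequence[float]],
    days: Sequence[float],
    eps_semantic: float,
    eps_temporal: float,
) -> List[int]:
    """Indices of points close to `idx` in both meaning and time."""
    return [
        j
        for j in range(len(embeddings))
        if j != idx
        and abs(days[j] - days[idx]) <= eps_temporal
        and cosine_distance(embeddings[idx], embeddings[j]) <= eps_semantic
    ]


# Toy data: four posts with 2-d embeddings and posting day numbers.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
days = [0, 2, 1, 10]

hood = neighbors(0, embeddings, days, eps_semantic=0.2, eps_temporal=3)
is_core = len(hood) + 1 >= 2  # hypothetical min_samples=2, counting the point itself
```

In the toy data, post 2 is near in time but unrelated in content, and post 3 is nearly identical in content but posted ten days later, so only post 1 qualifies as a neighbor of post 0.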

Datasets

The repository includes preprocessing pipelines for three social media platforms:

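| Platform | Time Period | Data Points | Description               |
|----------|-------------|-------------|---------------------------|
| Threads  | Sep 2024+   | Variable    | Thread posts and replies  |
| Mastodon | Sep 2024+   | Variable    | Hashtag-based posts       |
| Bluesky  | Sep 2024+   | Variable    | Profile posts and replies |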

All datasets are preprocessed to:

  • Remove URLs, mentions, and special characters
  • Filter by minimum text length (30 characters)
  • Remove stopwords and platform-specific noise
  • Format timestamps for temporal analysis
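The cleaning steps listed above might look roughly like the sketch below. It is not taken from the Preprocessing modules: the function name `clean_post`, the regular expressions, the tiny stopword set, and the order of the steps (length filter before stopword removal) are all assumptions for illustration. Timestamp formatting is left out.

```python
import re
from typing import Optional

# Tiny illustrative stopword list; the real pipelines may use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "this"}
MIN_LENGTH = 30  # minimum text length in characters


def clean_post(text: str) -> Optional[str]:
    """Return the cleaned text, or None if the post is too short."""
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"@[\w.\-]+", " ", text)       # remove mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove special characters (ASCII-only assumption)
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    if len(text) < MIN_LENGTH:                   # assumed to apply before stopword removal
        return None
    tokens = [tok for tok in text.lower().split() if tok not in STOPWORDS]
    return " ".join(tokens)


kept = clean_post(
    "Check this out https://example.com @alice The new phishing campaign is targeting banks!"
)
dropped = clean_post("lol https://t.co/x")
```

Posts that fall under the length threshold after cleaning are discarded, which removes link-only or reaction-only posts that carry little topical signal.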

Evaluation & Discussion

The framework includes comparative evaluation across methods focusing on:

  • Interpretability: Human-understandable topic representations
  • Temporal Coherence: Smooth topic evolution over time
  • Representation Quality: Meaningful semantic content
  • Visualization: Interactive exploration of topic lifecycles

Note: Due to the lack of ground truth for temporal topic evolution, classical metrics have limitations. The framework emphasizes visual interpretation and qualitative assessment.

Thesis Reference

This work is based on the Master's Thesis "Emerging Topics in Social Networks" by Bc. Diana Korladinova, supervised by Ing. Jan Drchal, Ph.D. at the Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Science (May 2025).

The full thesis document is available in the official CTU repository: https://dspace.cvut.cz/handle/10467/115824

Citation

@mastersthesis{korladinova2025emerging,
  title={Emerging Topics in Social Networks},
  author={Korladinova, Diana},
  school={Czech Technical University in Prague, Faculty of Electrical Engineering},
  year={2025},
  month={May},
  url={https://dspace.cvut.cz/handle/10467/115824}
}

Acknowledgments

This work was created with the state support of the Ministry of Industry and Trade of the Czech Republic, project no. Z220312000000, within the National Recovery Plan Programme. The access to the computational infrastructure of the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics" is also gratefully acknowledged.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Contact

For questions or inquiries, please contact:

  • Author: Diana Korladinova (CTU FEE)
  • Supervisor: Ing. Jan Drchal, Ph.D. (CTU FEE)

Copyright © 2025 AIC, FEE, CTU in Prague
