A comprehensive framework for temporal topic modeling and analysis of social media data from Threads, Mastodon, and Bluesky platforms. This repository implements and compares four distinct approaches for tracking emerging topics and their evolution over time, with applications in cybersecurity, misinformation detection, and harmful narrative analysis.
This project addresses the critical need for robust methods to track emerging topics and their temporal evolution on social media platforms. The proliferation of social media has made it a dominant source of information and public discourse, but also a conduit for misinformation and malicious activities. This work provides tools and methodologies for temporal topic analysis with emphasis on interpretability, representation, and temporal coherence.
- Multi-platform data collection: Scrapers for Threads, Mastodon, and Bluesky
- Novel datasets: Curated and preprocessed temporal datasets from multiple social networks
- Four temporal topic modeling approaches:
  - NMF with Temporal Regularization - Statistical matrix factorization with temporal smoothing
  - BERTopic - Transformer-based semantic topic modeling
  - NetDTM - Neural dynamic topic model with time-aware attention
  - ST-DBSCAN (Modified) - Spatio-temporal clustering with semantic similarity
- Interactive visualization tools for exploring topic lifecycles
.
├── Scraping/                          # Data collection modules
│   ├── threads_scraper.py             # Threads API scraper using metathreads
│   ├── bluesky_scraper.py             # Bluesky/AT Protocol scraper
│   └── mastodon_hashtag_scraper.py    # Mastodon hashtag-based scraper
│
├── Preprocessing/                     # Data preprocessing pipelines
│   ├── threads_preprocess.py          # Threads data cleaner
│   ├── bluesky_preprocess.py          # Bluesky data cleaner
│   └── mastodon_preprocess.py         # Mastodon data cleaner
│
├── NMF_Temp_Reg/                      # NMF with Temporal Regularization
│   └── NMF_temp_reg.py                # Implementation of temporal NMF
│
├── BERTopic/                          # BERTopic implementation
│   └── bertopic.ipynb                 # Jupyter notebook with BERTopic experiments
│
├── NetDTM/                            # Neural Dynamic Topic Model
│   └── NetDTM.py                      # PyTorch implementation of NetDTM
│
├── ST_DBSCAN_mod/                     # Modified ST-DBSCAN
│   └── stdbscan_modified.py           # Semantic ST-DBSCAN clustering
│
└── docs/                              # Documentation
    └── thesis.md                      # Full thesis document
- Python 3.9 or higher
- CUDA-capable GPU (recommended for NetDTM and BERTopic)
Core dependencies include:
pip install numpy pandas scikit-learn torch transformers sentence-transformers
pip install bertopic umap-learn hdbscan plotly matplotlib seaborn
pip install nltk requests metathreads

For the complete environment setup, see the individual module documentation.
from Scraping.threads_scraper import get_thread_posts
# Initialize and scrape threads
posts = get_thread_posts("https://www.threads.net/@username/post/...")

from datetime import datetime
from Scraping.bluesky_scraper import BlueskyScraper
scraper = BlueskyScraper("username", "password")
posts = scraper.get_profile_posts("handle.bsky.social",
                                  start_time=datetime(2024, 9, 1),
                                  end_time=datetime(2024, 12, 31))

from Scraping.mastodon_hashtag_scraper import MastodonHashtagScraper
scraper = MastodonHashtagScraper("https://mastodon.social")
posts = scraper.scrape_hashtag_with_date_range("topic",
                                               start_date="2024-09-01",
                                               end_date="2024-12-31")

from Preprocessing.threads_preprocess import preprocess_threads
from Preprocessing.bluesky_preprocess import preprocess_bluesky
from Preprocessing.mastodon_preprocess import preprocess_mastodon
# Preprocess Threads data
posts, timestamps = preprocess_threads("threads_data.json")
# Preprocess Bluesky data
posts, timestamps, netdtm_data = preprocess_bluesky("bluesky_data.json")
# Preprocess Mastodon data
posts, timestamps, netdtm_data = preprocess_mastodon("mastodon_data.json")

from NMF_Temp_Reg.NMF_temp_reg import TemporalNMF
model = TemporalNMF(n_topics=10, temporal_regularization=0.1)
grouped_data = model.preprocess_data(df, text_col='text', time_col='timestamp')
model.fit(grouped_data)

# See BERTopic/bertopic.ipynb for complete workflow
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)

from NetDTM.NetDTM import NetDTMTrainer
trainer = NetDTMTrainer(vocab_size, topic_dim, embedding_dim)
trainer.train(train_data, val_data, epochs=100)

from ST_DBSCAN_mod.stdbscan_modified import SemanticSTDBSCAN
clusterer = SemanticSTDBSCAN(eps_semantic=0.2, eps_temporal=3, min_samples=5)
clusterer.fit(embeddings, timestamps)
labels = clusterer.labels_

A statistical approach that extends Non-Negative Matrix Factorization with temporal smoothing. The method applies regularization to pull document-topic distributions toward the previous time period's average, ensuring temporal coherence in topic evolution.
Key Parameters:
- n_topics: Number of topics to extract
- temporal_regularization: Weight for temporal smoothing (0-1)
- max_features: Vocabulary size limit
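The temporal pull described above can be sketched with a penalized multiplicative update. This is a minimal illustration under stated assumptions, not the repository's `TemporalNMF` implementation: the function names `update_doc_topics` and `fit_time_slices` are hypothetical, the topic-word matrix `H` is held fixed for brevity, and the penalty `lam * ||W - prev_mean||^2` is one plausible reading of "pull toward the previous period's average".

```python
import numpy as np


def update_doc_topics(X, H, prev_mean, lam=0.1, n_iter=200, seed=0, eps=1e-10):
    """Find non-negative W minimizing ||X - W H||^2 + lam * ||W - prev_mean||^2.

    X: (n_docs, n_terms) term matrix for one time slice.
    H: (n_topics, n_terms) topic-word matrix, kept fixed in this sketch.
    prev_mean: (n_topics,) average document-topic row of the previous slice.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], H.shape[0])) + eps
    XHt = X @ H.T
    HHt = H @ H.T
    for _ in range(n_iter):
        # Multiplicative update: negative gradient part over positive part.
        numer = XHt + lam * prev_mean
        denom = W @ HHt + lam * W + eps
        W *= numer / denom
    return W


def fit_time_slices(slices, H, lam=0.1):
    """Process slices in chronological order, regularizing each toward the last."""
    doc_topics, prev_mean = [], None
    for X in slices:
        if prev_mean is None:
            # The first slice has no predecessor, so no temporal penalty applies.
            W = update_doc_topics(X, H, np.zeros(H.shape[0]), lam=0.0)
        else:
            W = update_doc_topics(X, H, prev_mean, lam=lam)
        doc_topics.append(W)
        prev_mean = W.mean(axis=0)
    return doc_topics
```

With `lam=0` each slice is fitted independently; larger values trade reconstruction accuracy for smoother document-topic weights across periods.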
A transformer-based topic modeling approach leveraging BERT embeddings and c-TF-IDF. Uses UMAP for dimensionality reduction and HDBSCAN for clustering, with dynamic topic modeling capabilities for temporal analysis.
Features:
- Semantic embeddings via sentence transformers
- Dynamic topic modeling over time
- Interactive visualizations
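To make the c-TF-IDF step more concrete, here is a small NumPy sketch of the class-based weighting as it is described in the BERTopic documentation: the term frequency within a topic is multiplied by `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` is the frequency of word `x` across all topics. The function name `class_tfidf` and the whitespace tokenization are assumptions made for illustration; BERTopic itself performs this step internally.

```python
import numpy as np


def class_tfidf(docs, topics):
    """Score words per topic by treating each topic as one concatenated document."""
    vocab = sorted({w for d in docs for w in d.split()})
    col = {w: i for i, w in enumerate(vocab)}
    topic_ids = sorted(set(topics))
    row = {t: i for i, t in enumerate(topic_ids)}

    # Term counts per topic (class), accumulated over the documents in it.
    tf = np.zeros((len(topic_ids), len(vocab)))
    for doc, topic in zip(docs, topics):
        for word in doc.split():
            tf[row[topic], col[word]] += 1

    avg_words_per_topic = tf.sum() / len(topic_ids)  # A
    word_freq = tf.sum(axis=0)                       # f_x
    idf = np.log(1 + avg_words_per_topic / word_freq)

    # L1-normalize term frequencies within each topic before weighting.
    tf_norm = tf / tf.sum(axis=1, keepdims=True)
    return tf_norm * idf, vocab, topic_ids
```

The highest-scoring columns in each row are the words that best distinguish that topic from the others.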
A neural approach implementing time-aware attention mechanisms and optimal transport for modeling topic evolution. Uses sinusoidal time embeddings and attention-based topic-word distributions.
Key Components:
- TimeEmbedding: Sinusoidal positional encoding for timestamps
- TimeAwareAttention: Attention mechanism incorporating temporal information
- TimeAwareOT: Optimal transport for topic distribution alignment
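As a rough illustration of the sinusoidal encoding behind a component like TimeEmbedding, the sketch below maps scalar timestamps (for example, days since the first post) to fixed-size vectors. It uses NumPy rather than PyTorch to stay dependency-light; the function name, the `dim` and `max_period` parameters, and the interleaved sine/cosine layout follow the common Transformer positional-encoding recipe and are assumptions, not the repository's exact code.

```python
import numpy as np


def sinusoidal_time_embedding(timestamps, dim=16, max_period=10000.0):
    """Encode each scalar timestamp as interleaved sine and cosine features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    t = np.asarray(timestamps, dtype=float)[:, None]
    # Geometrically spaced frequencies, from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(0, dim, 2) / dim)
    angles = t * freqs
    emb = np.zeros((t.shape[0], dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb
```

Nearby timestamps receive similar vectors, which lets an attention layer condition on temporal distance without learning a separate embedding per time step.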
An adaptation of Spatio-Temporal DBSCAN that replaces spatial distance with semantic distance, measured by cosine similarity between text embeddings. Posts are clustered together only when they are close in both content and time.
Parameters:
- eps_semantic: Semantic similarity threshold
- eps_temporal: Temporal distance threshold (in days)
- min_samples: Minimum points for core cluster
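The clustering rule can be sketched as a plain DBSCAN whose neighborhood requires both thresholds at once. This is a simplified stand-in, not the `SemanticSTDBSCAN` class: the function name `semantic_st_dbscan` is hypothetical, `eps_semantic` is assumed to be a maximum cosine distance (1 minus cosine similarity), timestamps are assumed to be numeric day offsets, and embeddings are assumed to be non-zero vectors.

```python
from collections import deque

import numpy as np


def semantic_st_dbscan(embeddings, days, eps_semantic=0.2, eps_temporal=3, min_samples=5):
    """Cluster posts that are close in both embedding space and time."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos_dist = 1.0 - emb @ emb.T
    days = np.asarray(days, dtype=float)
    time_gap = np.abs(days[:, None] - days[None, :])

    # Two posts are neighbors only if BOTH thresholds are satisfied.
    neighbors = (cos_dist <= eps_semantic) & (time_gap <= eps_temporal)
    is_core = neighbors.sum(axis=1) >= min_samples  # count includes the point itself

    labels = np.full(len(emb), -1)  # -1 marks noise
    cluster = 0
    for i in range(len(emb)):
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in np.flatnonzero(neighbors[p]):
                if labels[q] == -1:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    return labels
```

Because the temporal threshold is part of the neighborhood test, two bursts of near-identical posts separated by several weeks end up in different clusters, which is what allows a recurring topic to be tracked as separate episodes.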
The repository includes preprocessing pipelines for three social media platforms:
| Platform | Time Period | Data Points | Description |
|---|---|---|---|
| Threads | Sep 2024+ | Variable | Thread posts and replies |
| Mastodon | Sep 2024+ | Variable | Hashtag-based posts |
| Bluesky | Sep 2024+ | Variable | Profile posts and replies |
All datasets are preprocessed to:
- Remove URLs, mentions, and special characters
- Filter by minimum text length (30 characters)
- Remove stopwords and platform-specific noise
- Format timestamps for temporal analysis
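The cleaning steps listed above can be approximated with the standard library alone. The sketch below is illustrative rather than a copy of the `Preprocessing/` modules: the function names, the tiny stopword set, the regular expressions, and the choice to apply the 30-character filter after cleaning are all assumptions.

```python
import re
from datetime import datetime, timezone

# A deliberately tiny stopword list for illustration; a real pipeline
# would likely use a fuller list such as the one shipped with NLTK.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@[\w.\-]+(?:@[\w.\-]+)?")  # also covers user@instance handles
NON_TEXT_RE = re.compile(r"[^a-z0-9\s]")


def clean_post(text, min_length=30):
    """Return the cleaned text, or None if it is shorter than min_length."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    cleaned = " ".join(tokens)
    return cleaned if len(cleaned) >= min_length else None


def parse_timestamp(value):
    """Parse an ISO 8601 string into a timezone-aware UTC datetime."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return parsed.astimezone(timezone.utc)
```

Normalizing every timestamp to UTC keeps posts from different platforms comparable when they are later grouped into time slices.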
The framework includes comparative evaluation across methods focusing on:
- Interpretability: Human-understandable topic representations
- Temporal Coherence: Smooth topic evolution over time
- Representation Quality: Meaningful semantic content
- Visualization: Interactive exploration of topic lifecycles
Note: Due to the lack of ground truth for temporal topic evolution, classical metrics have limitations. The framework emphasizes visual interpretation and qualitative assessment.
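As a rough numeric companion to visual inspection, temporal coherence can be approximated by how much a topic's top words overlap between consecutive time slices. The sketch below is a hypothetical proxy, not a metric defined in the thesis: the function names and the choice of Jaccard overlap are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def temporal_coherence(top_words_by_slice):
    """Mean Jaccard overlap of a topic's top words across consecutive slices."""
    pairs = list(zip(top_words_by_slice, top_words_by_slice[1:]))
    if not pairs:
        raise ValueError("need at least two time slices")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Values near 1 indicate a topic whose vocabulary barely changes, while values near 0 suggest either rapid drift or that unrelated topics were matched across slices, so the score is best read next to the lifecycle plots rather than on its own.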
This work is based on the Master's Thesis "Emerging Topics in Social Networks" by Bc. Diana Korladinova, supervised by Ing. Jan Drchal, Ph.D. at the Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Science (May 2025).
The full thesis document is available in the official CTU repository.
@mastersthesis{korladinova2025emerging,
title={Emerging Topics in Social Networks},
author={Korladinova, Diana},
school={Czech Technical University in Prague, Faculty of Electrical Engineering},
year={2025},
month={May},
url={https://dspace.cvut.cz/handle/10467/115824}
}

This work was created with the state support of the Ministry of Industry and Trade of the Czech Republic, project no. Z220312000000, within the National Recovery Plan Programme. The access to the computational infrastructure of the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics" is also gratefully acknowledged.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
For questions or inquiries, please contact:
- Author: Diana Korladinova (CTU FEE)
- Supervisor: Ing. Jan Drchal, Ph.D. (CTU FEE)
Copyright © 2025 AIC, FEE, CTU in Prague