Authors: Tal Ishon, Yoav Goldberg, Uri Shaham
Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While many recent methods incorporate external knowledge (e.g., pre-trained embeddings), such reliance can limit applicability in emerging or underexplored domains. PRISM is a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, often rivaling externally guided approaches.
PRISM is a corpus-intrinsic framework for enhancing LDA by distilling informative structure directly from the target corpus, without reliance on external knowledge, making it particularly well suited to data-scarce and under-characterized domains such as biological scRNA-seq.
Core goals
- Methodological goal (text): Develop PRISM as a state-of-the-art corpus-intrinsic topic modeling framework that improves LDA’s coherence and interpretability using only corpus-derived signals, and remains competitive with approaches that rely on external resources.
- Scientific goal (biology): Translate PRISM to scRNA-seq to discover gene-expression programs in data-scarce, under-characterized settings, where curated priors and pretrained representations may be fragmented, biased, or poorly matched to the target modality.
Both goals are realized through the corpus-intrinsic design specified in the paper.
PRISM is a corpus-intrinsic initialization pipeline for LDA with two stages:
- Corpus-derived embeddings
  - Compute a PPMI matrix from document-level co-occurrence (each document is a context window).
  - Build a word similarity graph using cosine similarity between PPMI rows.
  - Apply diffusion maps (anisotropic normalization, $t=1$) to obtain low-dimensional word embeddings.
- Prior estimation + LDA initialization
  - Fit a GMM over embeddings to obtain soft topic memberships $p(z\mid w)$, then convert to $p(w\mid z)$ via Bayes’ rule using unigram frequencies.
  - Estimate a vocabulary-level Dirichlet prior $\hat{\beta}$ (method-of-moments).
  - Initialize MALLET LDA with vector-valued $\hat{\beta}$, keeping the LDA generative process and inference unchanged.
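The Bayes' rule conversion in the second stage can be illustrated with a minimal sketch. It is not the repository's code: the function name `word_given_topic`, the dictionary-based inputs, and the toy numbers are assumptions made for illustration, and the memberships are hard-coded rather than produced by a fitted GMM.

```python
def word_given_topic(p_z_given_w, word_counts):
    """Convert soft memberships p(z|w) into topic-word distributions p(w|z).

    p_z_given_w: dict word -> list of K membership probabilities (sums to 1).
    word_counts: dict word -> corpus frequency, used for the unigram p(w).
    (Both input layouts are assumptions of this sketch.)
    """
    total = sum(word_counts.values())
    p_w = {w: c / total for w, c in word_counts.items()}
    n_topics = len(next(iter(p_z_given_w.values())))

    # Bayes' rule: p(w|z) = p(z|w) p(w) / sum_w' p(z|w') p(w')
    p_w_given_z = []
    for k in range(n_topics):
        joint = {w: p_z_given_w[w][k] * p_w[w] for w in p_z_given_w}
        p_z = sum(joint.values())  # marginal p(z = k)
        p_w_given_z.append({w: v / p_z for w, v in joint.items()})
    return p_w_given_z


# Toy memberships for three words and two topics (made-up values).
memberships = {"goal": [0.9, 0.1], "match": [0.8, 0.2], "vote": [0.1, 0.9]}
counts = {"goal": 30, "match": 20, "vote": 50}
topics = word_given_topic(memberships, counts)
```

Each entry of `topics` is a distribution over the vocabulary, so frequent words with a high membership in a topic receive most of that topic's mass.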
# 1) Clone repo
git clone https://github.com/shaham-lab/PRISM.git
cd PRISM
# 2) Run the setup script
chmod +x setup_mallet.sh
./setup_mallet.sh
PRISM requires a patched build of MALLET 2.0.8 to support a vector-valued $\hat{\beta}$.
# 3) Set venv
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Datasets are expected under data/ as <DATASET>.json files.
PRISM/
└── data/
    ├── 20NewsGroup.json
    ├── BBC.json
    ├── M10.json
    ├── DBLP.json
    └── TrumpTweets.json
Each JSON file is an array of documents, and each document is a list of pre-tokenized word strings (e.g., data/BBC.json).
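As a rough illustration of this format, the sketch below writes and re-reads a tiny dataset. The helper names `write_dataset` and `load_dataset`, the file name `Toy.json`, and the tokens are hypothetical; the repository may load its data differently.

```python
import json
import tempfile
from pathlib import Path


def write_dataset(docs, path):
    """Write tokenized documents as a JSON array of string lists."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(docs), encoding="utf-8")


def load_dataset(path):
    """Load a dataset and check that it is a list of lists of strings."""
    docs = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(docs, list):
        raise ValueError("top level must be a JSON array of documents")
    for i, doc in enumerate(docs):
        if not isinstance(doc, list) or not all(isinstance(t, str) for t in doc):
            raise ValueError(f"document {i} must be a list of token strings")
    return docs


# Round-trip a two-document toy corpus through a temporary data/ directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "data" / "Toy.json"
    write_dataset([["stock", "market", "shares"], ["match", "goal", "league"]], toy_path)
    toy_docs = load_dataset(toy_path)
```

Running a check like `load_dataset` on a new corpus before starting the pipeline catches the common mistake of storing documents as raw strings instead of token lists.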
Below are the sources for the datasets used in our evaluation:
- 20NewsGroup - https://github.com/MIND-Lab/OCTIS/tree/master/preprocessed_datasets/20NewsGroup (octis)
- BBC News - https://github.com/MIND-Lab/OCTIS/tree/master/preprocessed_datasets/BBC_News (octis)
- M10 - https://github.com/MIND-Lab/OCTIS/tree/master/preprocessed_datasets/M10 (octis)
- DBLP - https://github.com/MIND-Lab/OCTIS/tree/master/preprocessed_datasets/DBLP (octis)
- Trump's Tweets - https://www.kaggle.com/datasets/codebreaker619/donald-trump-tweets-dataset (kaggle)
These links provide the original sources for the datasets referenced in the PRISM paper.
To run PRISM from scratch, execute the pipeline in this order.
This step computes the PPMI matrix, builds the word-similarity graph, and generates the prior inputs used later in the pipeline.
Example:
python create_word_embeddings.py \
--n_comp 5 \
--dataset BBC \
--metric prism \
[OPTIONS]
Required arguments:
- --n_comp: Number of components/topics used when creating the prior
- --dataset: Dataset name, used to locate the saved prior files
- --metric: Metric name used in the saved prior filename, such as prism or glove
- -h, --help: Show the help message
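To make the first step more concrete, here is a minimal pure-Python sketch of document-level PPMI and cosine similarity between PPMI rows. It is not the code in create_word_embeddings.py: the function names, the use of document fractions as probabilities, and the toy corpus are assumptions, and the diffusion-map stage is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def document_ppmi(docs):
    """Return {(w, c): ppmi} for word pairs, treating each document as one context window."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each word
    pair_freq = Counter()  # number of documents containing both words
    for doc in docs:
        vocab = sorted(set(doc))
        doc_freq.update(vocab)
        pair_freq.update(combinations(vocab, 2))

    ppmi = {}
    for (w, c), n_wc in pair_freq.items():
        # PMI = log( p(w, c) / (p(w) p(c)) ), with probabilities as document fractions
        pmi = math.log(n_wc * n_docs / (doc_freq[w] * doc_freq[c]))
        if pmi > 0:  # keep only above-chance associations
            ppmi[(w, c)] = ppmi[(c, w)] = pmi
    return ppmi


def cosine(ppmi, u, v, vocab):
    """Cosine similarity between the PPMI rows of words u and v."""
    row_u = [ppmi.get((u, c), 0.0) for c in vocab]
    row_v = [ppmi.get((v, c), 0.0) for c in vocab]
    dot = sum(a * b for a, b in zip(row_u, row_v))
    norm = math.sqrt(sum(a * a for a in row_u)) * math.sqrt(sum(b * b for b in row_v))
    return dot / norm if norm else 0.0


docs = [["goal", "match"], ["goal", "match", "league"], ["vote", "election"], ["vote", "party"]]
scores = document_ppmi(docs)
vocab = sorted({w for d in docs for w in d})
sim_sport = cosine(scores, "goal", "match", vocab)
sim_mixed = cosine(scores, "goal", "vote", vocab)
```

In this toy corpus the two sports words share a context ("league") and get a positive similarity, while "goal" and "vote" never share a context and get zero. Similarities of this kind form the edge weights of the word graph.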
Example:
python methods_of_moments.py \
--n_comp 5 \
--dataset BBC \
--metric prism
Required arguments:
- --n_comp: Number of components/topics used when creating the prior
- --dataset: Dataset name, used to locate the saved prior files
- --metric: Metric name used in the saved prior filename, such as prism or glove
For more information:
-h, --help: Show the help message
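For intuition about what a method-of-moments Dirichlet estimate involves, the sketch below uses the standard moment-matching identities for a Dirichlet distribution. It is an assumption-laden illustration, not the estimator in methods_of_moments.py: the function name, the choice of the median to combine per-coordinate estimates, and the toy rows are all hypothetical.

```python
from statistics import median


def dirichlet_moment_match(rows):
    """Estimate Dirichlet parameters from rows that each sum to 1.

    Uses E[p_i] = b_i / s and Var[p_i] = E[p_i] (1 - E[p_i]) / (1 + s),
    where s = sum_i b_i is the concentration.
    """
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(dim)]
    var = [sum((r[i] - mean[i]) ** 2 for r in rows) / n for i in range(dim)]

    # One concentration estimate per coordinate with non-zero variance;
    # taking the median is an assumed way to combine them.
    estimates = [m * (1 - m) / v - 1 for m, v in zip(mean, var) if v > 0]
    s = median(estimates)
    return [s * m for m in mean]


# Two toy topic-word distributions over a two-word vocabulary.
beta_hat = dirichlet_moment_match([[0.2, 0.8], [0.4, 0.6]])
```

With these rows the means are 0.3 and 0.7 and both variances are 0.01, so the concentration estimate is 20 and the result is approximately [6, 14]. Words with a larger average probability across topics receive a larger prior weight.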
Example:
python main.py \
--n_comp 20 \
--dataset 20NewsGroup \
--metric prism \
--num_iterations 1000 \
--run_number FINAL \
--seed 0
@article{ishon2026prism,
  title={PRISM: PRIor from corpus Statistics for topic Modeling},
  author={Ishon, Tal and Shaham, Uri},
  journal={arXiv preprint arXiv:2601.18900},
  year={2026}
}
Made for reliable topic discovery in data-scarce, under-characterized domains.
