MSc Thesis — University of Amsterdam, 2023
I studied why some advocacy organizations are treated as authoritative voices in U.S. congressional floor speeches while others are mentioned only in passing. Using NLP text classification and multilevel regression models, I analyzed prominence patterns across the 114th and 115th Congress (2015–2019).
- The top 1% of organizations account for ~31% of all prominent mentions. Prominence is heavily concentrated.
- Organizations using external lobbyists had significantly higher prominence odds (p = 0.001), contradicting the expectation that outsourcing advocacy signals weak leadership.
- Medium-salience policy areas show higher prominence than high-salience ones. Crowded issue spaces appear to dilute individual group visibility.
- Politician seniority has a small but significant negative effect on affording prominence, counter to the intuition that longer-serving members have stronger group relationships.
- ~78,000 Congressional Record documents (114th–115th Congress)
- 5,447 organizations from the Washington Representatives Study (2011)
- ~20,699 mention passages extracted, 19,165 in the analytical sample
- Full-sample design: includes organizations with zero mentions as baseline
- Text classification: SVM with count vectorization (F1 = 0.79, accuracy ~81%, ROC AUC 0.72)
- Statistical models: Generalized Linear Mixed-Effects Models (GLMMs) in R (lme4)
- Random effects: Organization and Policy Area (crossed)
- Training data: 1,000 hand-coded passages following Fraussen et al. (2018) coding scheme
| Notebook | What it does |
|---|---|
| 01_Exploratory_Prominence_Analysis.ipynb | Concentration, distribution, partisan patterns |
| 02_Statistical_Models.ipynb | Python implementation of GLMM models A, B, C |
| 03_Multilevel_Models.Rmd | Primary R implementation with lme4 (thesis models) |
analysis/ Jupyter notebooks and R Markdown for analysis
legacy/ Original pipeline code (data collection, processing, classification, modeling)
data/ Analysis datasets (managed via Git LFS)
git clone https://github.com/kmazurek95/MastersThesis_InterestGroupAnalysis.git
cd MastersThesis_InterestGroupAnalysis
pip install -r requirements.txt
jupyter notebook analysis/01_Exploratory_Prominence_Analysis.ipynb
For R models: install lme4, ggplot2, broom.mixed, readr, dplyr, knitr, then open analysis/03_Multilevel_Models.Rmd in RStudio.
Data files are tracked with Git LFS. Run git lfs pull if CSVs appear as pointer files.
A modernized re-implementation of this pipeline with improved extraction and classification (F1 = 0.91, Cohen's kappa = 0.82) is available at ThesisPipelineRework.
See CITATION.cff for citation metadata.
MIT License - see LICENSE.