NLP analysis of 6,500+ startup press releases to discover latent topics using unsupervised learning. Built as a capstone project for the Galvanize Data Science Immersive program.
- Scraped press releases from PR Newswire's API using a Crunchbase company list as seed data, stored in MongoDB
- Cleaned and featurized text (lemmatization, sentiment analysis via TextBlob, city/state/region extraction)
- Transformed corpus with TF-IDF, then factored the term-document matrix with Non-negative Matrix Factorization (NMF) to extract latent topics
NMF has no standard accuracy metric and no straightforward way to cross-validate: you can't simply remove 20% of the rows from the TF-IDF matrix without breaking the factorization. Following published approaches for NMF grid search, I instead zeroed out a random 20% of the matrix entries in each fold, reconstructed the approximate matrix from W and H, and measured the MSE against the original. Five-fold cross-validation showed diminishing returns beyond 3 components (roughly 0.1-0.5% MSE improvement per additional component), but 3 topics were too coarse for human interpretation. After reviewing the top 15 words per topic at several settings, I settled on 8 topics, which gave the best separation.
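
A rough sketch of that entry-masking scheme is below. It makes several assumptions that the description above does not settle: the `masked_cv_mse` helper is hypothetical, the matrix is dense (a sparse TF-IDF matrix would need `.toarray()`), folds are assigned per entry at random so each holds only roughly 20% of the entries, and the error is measured on the held-out entries only. The synthetic matrix is a stand-in for the real TF-IDF data.

```python
import numpy as np
from sklearn.decomposition import NMF


def masked_cv_mse(X, n_components, n_folds=5, seed=0):
    """Mean held-out MSE for one NMF rank, hiding about 1/n_folds of entries per fold."""
    rng = np.random.default_rng(seed)
    # Assign every matrix entry to exactly one fold.
    fold_of = rng.integers(0, n_folds, size=X.shape)
    fold_scores = []
    for fold in range(n_folds):
        held_out = fold_of == fold
        X_train = np.where(held_out, 0.0, X)     # zero out the held-out entries
        model = NMF(n_components=n_components, init="nndsvda",
                    random_state=seed, max_iter=500)
        W = model.fit_transform(X_train)
        X_hat = W @ model.components_            # reconstruction from W and H
        # Assumption: score only the entries the model did not see.
        fold_scores.append(np.mean((X[held_out] - X_hat[held_out]) ** 2))
    return float(np.mean(fold_scores))


# Synthetic non-negative matrix (hypothetical stand-in for the TF-IDF matrix).
rng = np.random.default_rng(42)
X = rng.random((60, 3)) @ rng.random((3, 40))

scores = {k: masked_cv_mse(X, k) for k in range(1, 6)}
for k, mse in scores.items():
    print(f"{k} components: MSE = {mse:.4f}")
```

Because the hidden entries are replaced with zeros rather than ignored by the solver, the fit is pulled toward zero at those positions. The scores are therefore most useful for comparing ranks against each other, not as absolute error estimates.
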
External validation: the 8 discovered topics correlated well with PR Newswire's own industry classifications, which the model never saw.
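
One simple way to run this kind of comparison is a contingency table of dominant topic against industry label. The sketch below is illustrative only: the data frame, its column names, and the industry tags are made up, and the original analysis may have measured the agreement differently.

```python
import pandas as pd

# Hypothetical data: dominant NMF topic per release and an external industry tag.
df = pd.DataFrame({
    "topic": [3, 3, 5, 5, 5, 7, 7, 3],
    "industry": ["Health Care", "Health Care", "Energy", "Energy",
                 "Utilities", "Telecom", "Telecom", "Biotech"],
})

# Each row shows how one topic's releases are distributed across industries.
table = pd.crosstab(df["topic"], df["industry"], normalize="index")

# Share of each topic's releases that fall in its single most common industry.
purity = table.max(axis=1)

print(table.round(2))
print(purity.round(2))
```

A topic whose releases concentrate in one or two related industries lines up well with the external labels; a topic spread evenly across many industries does not.
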
Eight latent topics with clear thematic separation:
| Topic | Theme | Top words |
|---|---|---|
| 1 | Insurance | insurance, quote, car, auto, coverage, driver, online |
| 2 | Enterprise SaaS | data, cloud, service, solution, software, enterprise, platform |
| 3 | Healthcare | health, patient, care, medical, hospital, clinical, cancer |
| 4 | Digital Media | content, video, social, game, digital, marketing, medium |
| 5 | Clean Energy | solar, energy, power, renewable, module, utility, battery |
| 6 | Hardware/Printing | epson, printer, label, print, 3d, projector, ink |
| 7 | Mobile Apps | app, mobile, user, device, android, apps, iphone |
| 8 | SEC Filings | statement, forward, risk, securities, future, differ, exchange |
- `data_acquisition.py`: PR Newswire API scraping with rate limit handling, batch processing, HTML→text extraction via BeautifulSoup, MongoDB storage
- `data_munging.py`: Text cleaning, feature engineering (industry dummies, geographic extraction, lemmatization, sentiment)
- `model.py`: TF-IDF vectorization, NMF factorization, topic extraction. Run with `python model.py`
Python, scikit-learn (TF-IDF, NMF), MongoDB, pandas, TextBlob, BeautifulSoup, Flask (demo app)