🏀 NBA Player Sentiment Engine

A from-scratch NLP system that measures public sentiment toward NBA players from live social data — and a study of which sentiment methods actually hold up on messy, sarcastic sports talk.

Core finding: social sentiment is noisy at the micro level but reliable at the macro level. A single comment is a coin-flip; aggregate a few hundred and a real, stable signal emerges. Off-the-shelf transformers trained on Twitter underperform lexicon methods on Reddit sports vernacular — so domain fit beats model size here.
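The micro-versus-macro claim is what you would expect from averaging: if each comment's score is a player-level signal plus independent noise, the spread of the mean shrinks roughly as 1/sqrt(n). Below is a minimal simulation of that idea. The signal value, the noise level, and the function names are illustrative assumptions, not measurements from this project.

```python
import random
import statistics

TRUE_SENTIMENT = 0.15   # assumed player-level signal (hypothetical value)
NOISE_SD = 0.6          # assumed per-comment noise (hypothetical value)


def noisy_comment_score(rng):
    """One simulated comment score: signal plus noise, clipped to [-1, 1]."""
    score = rng.gauss(TRUE_SENTIMENT, NOISE_SD)
    return max(-1.0, min(1.0, score))


def spread_of_mean(n_comments, n_trials=500, seed=0):
    """Standard deviation of the mean score across repeated samples of n comments."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(noisy_comment_score(rng) for _ in range(n_comments))
        for _ in range(n_trials)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (1, 25, 100, 400):
        print(f"n={n:>3}  spread of mean score: {spread_of_mean(n):.3f}")
```

Under these assumptions a single comment is dominated by noise, while the mean over a few hundred comments sits close to the underlying signal. Real comments are not independent (threads pile on), so the actual improvement is likely somewhat slower than this idealized case.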

Built and hand-coded for an Applied AI (AI 102) course. No black-box "call an API and trust it" — every stage is implemented and evaluated.

What it does

- Collects real mentions of a player across NBA subreddits using Reddit's public JSON endpoints (posts and their comment threads).
- Cleans & normalizes the text: lowercasing, regex cleanup, stopword removal, and lemmatization so dunked/dunking collapse to one token.
- Scores sentiment with multiple approaches and compares them head-to-head against a human-labeled test set.
- Reports aggregate sentiment per player plus stability/consistency diagnostics.
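The first two steps can be sketched as follows. This is not the code in `DataPipeline.py`; the function names, the tiny stopword list, and the sample thread are assumptions made for illustration. It relies on the shape of Reddit's public JSON: appending `.json` to a thread URL returns a two-element list whose second element is a comment Listing, where each comment has `kind` `"t1"` and its `replies` field is either an empty string or a nested Listing. Fetching is left out so the sketch runs offline.

```python
import re


def iter_comment_bodies(listing):
    """Yield comment text from a Reddit comment Listing, depth-first."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":       # skip "more" stubs and non-comments
            continue
        data = child["data"]
        yield data.get("body", "")
        replies = data.get("replies")
        if isinstance(replies, dict):       # "" means the comment has no replies
            yield from iter_comment_bodies(replies)


URL_RE = re.compile(r"https?://\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")     # also drops digits and apostrophes

# Tiny stand-in stopword list (assumption); NLTK's English list is much larger.
STOPWORDS = {"the", "a", "an", "is", "was", "on", "and", "to", "of", "he", "that"}


def clean(text, stopwords=STOPWORDS, lemmatize=lambda token: token):
    """Lowercase, strip URLs and non-letters, drop stopwords, lemmatize tokens."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_LETTER_RE.sub(" ", text)
    return [lemmatize(tok) for tok in text.split() if tok not in stopwords]


# Hand-written sample in the shape of a Reddit comment Listing (not real data).
sample_thread = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "body": "He DUNKED on the whole team!! https://example.com/clip",
            "replies": {"kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {"body": "That was a travel", "replies": ""}},
            ]}},
        }},
        {"kind": "more", "data": {"children": ["abc123"]}},
    ]},
}

if __name__ == "__main__":
    for body in iter_comment_bodies(sample_thread):
        print(clean(body))
```

The `lemmatize` argument defaults to the identity function so the sketch has no dependencies. To collapse verb forms with NLTK, one option is to pass `lambda t: WordNetLemmatizer().lemmatize(t, pos="v")`; the lemmatizer's default part of speech is noun, which would leave a form like "dunked" unchanged.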

Methods compared

| Approach | Implementation | Takeaway |
| --- | --- | --- |
| VADER (lexicon) | NLTK `SentimentIntensityAnalyzer` | Strong, fast baseline; best macro-level reliability here |
| TF-IDF weighting | `sklearn` `TfidfVectorizer` | Surfaces the terms actually driving a player's sentiment |
| Pre-trained BERT / RoBERTa | HuggingFace `transformers` | Twitter-trained models transfer poorly to Reddit sports slang → lower accuracy on human-labeled comments |
| Human-labeled evaluation | hand-labeled comment set + `classification_report` | Honest precision/recall instead of vibes |
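To make the evaluation row concrete, here is a dependency-free sketch of scoring predictions against human labels. It maps a VADER-style compound score to a three-way label using the commonly cited ±0.05 cutoffs (whether this project uses the same cutoffs is an assumption), then computes the per-class precision, recall, and support that `classification_report` also reports. The scores and labels below are made up for illustration.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a three-way sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def per_class_metrics(y_true, y_pred):
    """Per-label precision, recall, and support (count of true examples)."""
    hits, predicted, actual = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        actual[truth] += 1
        predicted[pred] += 1
        if truth == pred:
            hits[truth] += 1
    report = {}
    for label in LABELS:
        report[label] = {
            "precision": hits[label] / predicted[label] if predicted[label] else 0.0,
            "recall": hits[label] / actual[label] if actual[label] else 0.0,
            "support": actual[label],
        }
    return report


# Made-up compound scores and human labels (illustration only).
# The fourth item mimics sarcasm: positive wording, negative intent.
compound_scores = [0.62, -0.48, 0.01, 0.30, -0.10, 0.70]
human_labels = ["positive", "negative", "neutral", "negative", "negative", "positive"]

predicted_labels = [label_from_compound(score) for score in compound_scores]
report = per_class_metrics(human_labels, predicted_labels)

if __name__ == "__main__":
    for label, row in report.items():
        print(f"{label:<9} precision={row['precision']:.2f} "
              f"recall={row['recall']:.2f} support={row['support']}")
```

In this toy set the sarcastic comment costs the positive class precision and the negative class recall, which is the kind of error a per-class report exposes and a single accuracy number hides.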

Why it matters (the product angle)

This is the kind of noisy-data-to-signal problem that shows up everywhere in retail and sports decision intelligence: the raw input is messy and individually unreliable, but with the right aggregation and the right (not necessarily fanciest) model, it becomes a trustworthy input to a decision. The project deliberately optimizes for measured accuracy and honest evaluation, not a flashy demo.

Tech stack

Python · NLTK (VADER, lemmatization, stopwords) · scikit-learn (TF-IDF, metrics) · HuggingFace Transformers (BERT/RoBERTa) · PyTorch · NumPy · Matplotlib · Reddit public JSON API

Running it

1. Install dependencies

pip install -r requirements.txt

2. Download the NLTK data (run once in a Python session)

import nltk
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('wordnet')

3. Run a method. Open `sentimentModel.py`, scroll to the bottom, and uncomment exactly one routine (run only one at a time), for example:

- `playerSentimentScore` → core VADER scoring
- `scoreComparisonVisual` → visualize scoring methods
- `consistencyTest` → check model stability
- `preTrainedBERTtest` / `BERTbaseCasedtest` → transformer comparisons

Then run the script:

python3 sentimentModel.py

Files

| File | Role |
| --- | --- |
| `DataPipeline.py` | Reddit collection + text cleaning/normalization |
| `sentimentModel.py` | Sentiment models, comparison, evaluation, visualization |

Author: Austin Stevens · Applied AI & Data Analytics, University of Tennessee, Knoxville.
