AYOAP/nba-player-sentiment-engine

🏀 NBA Player Sentiment Engine

A from-scratch NLP system that measures public sentiment toward NBA players from live social data — and a study of which sentiment methods actually hold up on messy, sarcastic sports talk.

Core finding: social sentiment is noisy at the micro level but reliable at the macro level. A single comment is a coin-flip; aggregate a few hundred and a real, stable signal emerges. Off-the-shelf transformers trained on Twitter underperform lexicon methods on Reddit sports vernacular — so domain fit beats model size here.
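The micro/macro claim is the usual averaging effect: if per-comment scores scatter around a player's underlying sentiment, the spread of their mean shrinks roughly as 1/sqrt(n). Here is a minimal simulation of that effect. The true sentiment (0.15), the Gaussian noise model, and the noise level (0.6) are illustrative assumptions, not values measured in this project, and `mean_score_spread` is a hypothetical helper name.

```python
import random
import statistics

def mean_score_spread(n_comments, true_sentiment=0.15, noise_sd=0.6,
                      trials=500, seed=0):
    """Std. dev. of the average score across many simulated samples.

    All parameter values are illustrative assumptions, not project data.
    Scores are left unclipped to keep the arithmetic simple.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        scores = [rng.gauss(true_sentiment, noise_sd) for _ in range(n_comments)]
        means.append(statistics.fmean(scores))
    return statistics.pstdev(means)

# Compare one comment against small and large aggregates.
for n in (1, 25, 400):
    print(f"n={n:>3}  spread of mean ~ {mean_score_spread(n):.3f}")
```

Under these assumptions the printed spread should land near noise_sd / sqrt(n), so a 400-comment aggregate is about 20 times tighter than a single comment.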

Built and hand-coded for an Applied AI (AI 102) course. Nothing here is a black-box "call an API and trust it" step: every stage is implemented and evaluated.


What it does

  1. Collects real mentions of a player across NBA subreddits using Reddit's public JSON endpoints (posts and their comment threads).
  2. Cleans & normalizes the text — lowercasing, regex cleanup, stopword removal, and lemmatization so dunked/dunking collapse to one token.
  3. Scores sentiment with multiple approaches and compares them head-to-head against a human-labeled test set.
  4. Reports aggregate sentiment per player plus stability/consistency diagnostics.
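Steps 3 and 4 can be sketched roughly as follows. This assumes a scorer that returns a compound value in [-1, 1], as VADER's `polarity_scores(text)["compound"]` does, and uses the ±0.05 cut-offs commonly applied to VADER output. Whether `sentimentModel.py` uses those thresholds is an assumption. `to_label`, `per_class_metrics`, and `toy_scorer` are hypothetical names, the hand-rolled metrics are a standard-library stand-in for sklearn's `classification_report`, and the four comments and labels are made up.

```python
import re
from statistics import fmean

def to_label(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a class label.

    +/-0.05 is the cut-off commonly used with VADER; assumed here.
    """
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

def per_class_metrics(y_true, y_pred):
    """Precision and recall per class, computed by hand."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted_count = sum(p == label for p in y_pred)
        actual_count = sum(t == label for t in y_true)
        metrics[label] = {
            "precision": tp / predicted_count if predicted_count else 0.0,
            "recall": tp / actual_count if actual_count else 0.0,
        }
    return metrics

def toy_scorer(text):
    """Hypothetical stand-in scorer so the sketch runs without NLTK data."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in {"great", "clutch", "elite"} for w in words)
    score -= sum(w in {"washed", "bricked", "overrated"} for w in words)
    return max(-1.0, min(1.0, score / 2))

# Tiny made-up sample; the last comment is sarcastic on purpose.
comments = ["clutch shot, elite defense", "he is washed", "traded on tuesday",
            "great, he bricked another one"]
human = ["pos", "neg", "neu", "neg"]

scores = [toy_scorer(c) for c in comments]
predicted = [to_label(s) for s in scores]
print(per_class_metrics(human, predicted))
print("aggregate sentiment:", round(fmean(scores), 3))
```

To score with VADER instead, replace `toy_scorer` with a function that returns `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` from `nltk.sentiment`. The sarcastic comment shows why per-class recall matters: a word-counting scorer sees "great" cancel out "bricked" and misses the negative label.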

Methods compared

| Approach | Implementation | Takeaway |
| --- | --- | --- |
| VADER (lexicon) | NLTK SentimentIntensityAnalyzer | Strong, fast baseline; best macro-level reliability here |
| TF-IDF weighting | sklearn TfidfVectorizer | Surfaces the terms actually driving a player's sentiment |
| Pre-trained BERT / RoBERTa | HuggingFace transformers | Twitter-trained models transfer poorly to Reddit sports slang → lower accuracy on human-labeled comments |
| Human-labeled evaluation | hand-labeled comment set + classification_report | Honest precision/recall instead of vibes |
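To make the TF-IDF row concrete, here is a standard-library sketch of how term weighting surfaces distinctive words. It uses the smoothed IDF formula that scikit-learn documents as the `TfidfVectorizer` default, idf = ln((1 + n) / (1 + df)) + 1, but skips the L2 normalization that `TfidfVectorizer` also applies by default. `tfidf_top_terms` is a hypothetical helper and the three "comments" are invented, pre-cleaned examples.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, top_k=3):
    """Return the top_k terms per document ranked by tf * smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    ranked = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Sort by weight (descending), then alphabetically for stable output.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        ranked.append([(t, round(w, 3)) for t, w in top])
    return ranked

# Invented, already-cleaned comments about one player.
docs = [
    "clutch clutch shot late",
    "bad shot selection late",
    "bad turnover late again",
]
for doc, top in zip(docs, tfidf_top_terms(docs, top_k=2)):
    print(f"{doc!r}: {top}")
```

A word such as "late" appears in every comment, so its IDF falls to the minimum of 1 and it drops out of the top terms, while a repeated, comment-specific word such as "clutch" rises to the top.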

Why it matters (the product angle)

This is the kind of noisy-data-to-signal problem that shows up everywhere in retail and sports decision intelligence: the raw input is messy and individually unreliable, but with the right aggregation and the right (not necessarily fanciest) model, it becomes a trustworthy input to a decision. The project deliberately optimizes for measured accuracy and honest evaluation, not a flashy demo.

Tech stack

Python · NLTK (VADER, lemmatization, stopwords) · scikit-learn (TF-IDF, metrics) · HuggingFace Transformers (BERT/RoBERTa) · PyTorch · NumPy · Matplotlib · Reddit public JSON API


Running it

1. Install dependencies

pip install -r requirements.txt

2. Download the NLTK data

import nltk
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('wordnet')

3. Run a method. Open sentimentModel.py, scroll to the bottom, and uncomment one routine to run (run only one at a time), e.g.:

  • playerSentimentScore → core VADER scoring
  • scoreComparisonVisual → visualize scoring methods
  • consistencyTest → check model stability
  • preTrainedBERTtest / BERTbaseCasedtest → transformer comparisons

Then run the script:

python3 sentimentModel.py

Files

| File | Role |
| --- | --- |
| DataPipeline.py | Reddit collection + text cleaning/normalization |
| sentimentModel.py | Sentiment models, comparison, evaluation, visualization |
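For orientation, here is a rough sketch of the kind of work the collection and cleaning half involves. It is not the code in DataPipeline.py. Reddit serves listings as JSON when `.json` is appended to a path, with items nested under `data.children[].data`. The function names, the `nba` subreddit default, the query parameters chosen, the User-Agent string, and the tiny stopword set are all assumptions made for illustration. Lemmatization is left out, and the stopword set stands in for NLTK's list.

```python
import json
import re
import urllib.parse
import urllib.request

# Tiny illustrative stopword set; the project uses NLTK's stopword list.
STOPWORDS = {"the", "a", "an", "is", "was", "he", "and", "to", "of", "in", "on"}

def search_url(player, subreddit="nba", limit=25):
    """Build a subreddit search URL for Reddit's public JSON endpoint."""
    query = urllib.parse.urlencode({"q": player, "restrict_sr": 1, "limit": limit})
    return f"https://www.reddit.com/r/{subreddit}/search.json?{query}"

def fetch_listing(url):
    """Download and decode one listing. Not called below; needs network access."""
    request = urllib.request.Request(url, headers={"User-Agent": "sentiment-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

def post_texts(listing):
    """Pull title + body text out of a listing's data.children[].data."""
    texts = []
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        text = " ".join(filter(None, [data.get("title"), data.get("selftext")]))
        if text:
            texts.append(text)
    return texts

def clean(text):
    """Lowercase, strip URLs and non-letters, drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

# Hand-written sample in the listing shape, so the sketch runs offline.
sample = {"data": {"children": [
    {"kind": "t3", "data": {"title": "He DUNKED on the whole team!!",
                            "selftext": "Clip: https://example.com/x"}},
]}}
for text in post_texts(sample):
    print(clean(text))
```

In a live run, `fetch_listing(search_url("LeBron James"))` would replace the hand-written sample, and the cleaned tokens would then go through lemmatization before scoring.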

Author: Austin Stevens · Applied AI & Data Analytics, University of Tennessee, Knoxville.
