Real-time tweet sentiment classifier using NLP and ML.
Classifies tweets as positive or negative using a pipeline of text preprocessing and ML models trained on 1.6 million tweets (Sentiment140 dataset).
| Model | Train Accuracy | Test Accuracy | Overfitting Gap (percentage points) |
|---|---|---|---|
| Logistic Regression | 79.87% | 77.67% | 2.2 |
| Bernoulli Naive Bayes | 81.45% | 76.48% | 5.0 |
| LinearSVC | 86.23% | 76.97% | 9.3 |
Logistic Regression was selected as the final model: it has the smallest train/test gap (2.2 percentage points, versus 9.3 for LinearSVC) and the highest test accuracy (77.67%), which indicates better generalization to unseen data.
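The gap is train accuracy minus test accuracy. A minimal sketch that recomputes it from the figures in the table above; the dictionary layout and variable names are illustrative and not taken from the notebook:

```python
# Accuracies (in percent) copied from the results table.
results = {
    "Logistic Regression": (79.87, 77.67),
    "Bernoulli Naive Bayes": (81.45, 76.48),
    "LinearSVC": (86.23, 76.97),
}

# Overfitting gap = train accuracy - test accuracy, in percentage points.
gaps = {name: round(train - test, 2) for name, (train, test) in results.items()}

# Pick the model with the smallest gap.
best = min(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points")
print("Smallest gap:", best)
```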
| Tweet | Sentiment |
|---|---|
| I love this product, it works amazingly | Positive 😊 |
| This is the worst experience I have ever | Negative 😞 |
| Just got promoted at work, so happy righ | Positive 😊 |
| My phone broke and customer service was | Negative 😞 |
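Note: The model struggles with negation (e.g. "I am not happy" → Positive). This is a known limitation of bag-of-words and TF-IDF features, which discard word order, so "not" is never tied to the word it negates. Stopword removal can make this worse when the stopword list itself contains negation words (NLTK's default English list includes "not"). Future improvement: use a sequence model (LSTM, BERT) that captures context.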
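To see why word order matters, here is a small sketch of a bag-of-words representation built with only the standard library. The two sentences are made up for illustration. They contain the same words in a different order, so their bags are identical even though the sentiment flips:

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (order is discarded)."""
    return Counter(text.lower().split())


a = "the food was good not bad"
b = "the food was bad not good"

# Both sentences map to exactly the same word counts.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)  # True
```

Any classifier fed unigram counts or unigram TF-IDF weights receives the same input for both sentences. Adding bigrams (for example `ngram_range=(1, 2)` in scikit-learn's `TfidfVectorizer`) is a cheaper partial remedy than a sequence model, though whether it helps on this dataset has not been tested here.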
Raw Tweet → Clean Text → Tokenize → Remove Stopwords → Stem → TF-IDF Vectors → ML Model → Sentiment Label
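The flow above could be sketched roughly as follows with scikit-learn. This is an illustration under stated assumptions, not the notebook's code: the `clean_tweet` helper and its regexes are hypothetical, scikit-learn's built-in English stopword list stands in for NLTK's, the stemming step is left out, and the six training tweets are made up.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def clean_tweet(text):
    """Hypothetical cleaner: lowercase, then strip URLs, @mentions, non-letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                    # @mentions
    text = re.sub(r"[^a-z\s]", " ", text)                # digits, punctuation, emoji
    return re.sub(r"\s+", " ", text).strip()


# Tiny made-up training set: 1 = positive, 0 = negative.
tweets = [
    "I love this phone, it is great!",
    "What a wonderful day :) http://example.com",
    "So happy with the service @shop",
    "I hate waiting, this is awful",
    "Worst purchase ever, totally broken",
    "Terrible support and a sad experience",
]
labels = [1, 1, 1, 0, 0, 0]

# Clean -> tokenize -> drop stopwords -> TF-IDF -> classifier (no stemming here).
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_tweet, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(tweets, labels)

predictions = model.predict(["love the great service", "awful broken phone"])
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the same cleaning and vocabulary are applied at training and prediction time, which avoids fitting TF-IDF on test data by accident.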
Python, Scikit-learn, NLTK, Pandas, NumPy, Matplotlib, Seaborn, Swifter
Sentiment140 — 1.6M tweets (800K positive, 800K negative), sourced from Kaggle.
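A minimal loading sketch with pandas, assuming the usual Sentiment140 layout: no header row, six columns, Latin-1 encoding, and polarity coded as 0 (negative) and 4 (positive). The column names and the `load_sentiment140` helper are illustrative choices, and the two sample rows are made up so the snippet runs without the download:

```python
import io

import pandas as pd

# Assumed column order of the raw file (it ships without a header row).
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]


def load_sentiment140(source):
    """Read a Sentiment140-style CSV and return the text plus a 0/1 label."""
    df = pd.read_csv(source, header=None, names=COLUMNS, encoding="latin-1")
    # The raw file codes negative as 0 and positive as 4; remap 4 -> 1.
    df["label"] = (df["target"] == 4).astype(int)
    return df[["text", "label"]]


# Two made-up rows in the same layout as the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my phone broke"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","best day ever"\n'
)
df = load_sentiment140(sample)
print(df)
```

For the real dataset, pass the path of the downloaded CSV instead of the in-memory sample.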
twitter-sentiment-analysis/
├── notebooks/
│ └── TSA.ipynb
├── README.md
└── requirements.txt
Open notebooks/TSA.ipynb in Google Colab or Jupyter and run all cells.
Kaggle API credentials (kaggle.json) are required for the dataset download. See the Kaggle API docs for setup.