MahimaBhagwat/Twitter-Sentiment-Analysis

Twitter Sentiment Analysis

Real-time tweet sentiment classifier using NLP and ML.

Overview

Classifies tweets as positive or negative using a pipeline of text preprocessing and ML models trained on 1.6 million tweets (Sentiment140 dataset).

Results

| Model | Train Accuracy | Test Accuracy | Overfitting Gap (train minus test, percentage points) |
|---|---|---|---|
| Logistic Regression | 79.87% | 77.67% | 2.2 |
| Bernoulli Naive Bayes | 81.45% | 76.48% | 5.0 |
| LinearSVC | 86.23% | 76.97% | 9.3 |

Logistic Regression was selected as the final model: it has the smallest train/test gap (2.2 percentage points, vs 9.3 for LinearSVC) and the highest test accuracy (77.67%), which indicates better generalization to unseen data.
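The gap column is train accuracy minus test accuracy, expressed in percentage points. As a quick check, the sketch below recomputes it from the figures reported in the table; the `results` dict and the variable names are illustrative and not taken from the notebook.

```python
# Accuracies (in %) copied from the Results table: (train, test).
results = {
    "Logistic Regression": (79.87, 77.67),
    "Bernoulli Naive Bayes": (81.45, 76.48),
    "LinearSVC": (86.23, 76.97),
}

# Overfitting gap = train accuracy - test accuracy, in percentage points.
gaps = {name: round(train - test, 1) for name, (train, test) in results.items()}

# The model with the smallest gap overfits the least.
best = min(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name:<22} gap = {gap:.1f} pp")
print("Smallest gap:", best)
```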

Live Prediction Demo

=======================================================
Tweet                                    Sentiment
=======================================================
I love this product, it works amazingly  Positive 😊
This is the worst experience I have ever Negative 😞
Just got promoted at work, so happy righ Positive 😊
My phone broke and customer service was  Negative 😞

Note: The model struggles with negation (e.g. "I am not happy" → Positive). This is a known limitation of bag-of-words + TF-IDF approaches, which discard word order, so a negator such as "not" is no longer tied to the word it modifies. Future improvement: use a model that captures word order and context, such as an LSTM or BERT.
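To make the limitation concrete, here is a small sketch of how a bag-of-words representation throws away what negation depends on. It assumes the stopword step drops "not", as NLTK's English stopword list does; whether the notebook uses that exact list is not confirmed here. The tiny `STOPWORDS` set and the `bag_of_words` helper are illustrative stand-ins, not the notebook's code.

```python
from collections import Counter

# Illustrative subset of a stopword list. Assumption: "not" is treated as a
# stopword, as it is in NLTK's English list.
STOPWORDS = {"i", "am", "not", "the", "a", "is"}

def bag_of_words(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, drop stopwords, and count tokens."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return Counter(tokens)

# With "not" removed, the negated and the plain sentence become identical.
print(bag_of_words("I am not happy") == bag_of_words("I am happy"))

# Even when "not" is kept, unigram counts ignore order, so they cannot tell
# which word the negation applies to.
keep_not = STOPWORDS - {"not"}
print(bag_of_words("good not bad", keep_not) == bag_of_words("bad not good", keep_not))
```

A cheaper mitigation than a sequence model, offered only as a suggestion, would be to keep negation words out of the stopword list and add bigram features (for example `ngram_range=(1, 2)` in scikit-learn's `TfidfVectorizer`), so that "not happy" becomes a feature of its own.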

Pipeline

Raw Tweet → Clean Text → Tokenize → Remove Stopwords → Stem → TF-IDF Vectors → ML Model → Sentiment Label
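The stages above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's implementation: the `preprocess` function, its regexes, the short `STOPWORDS` set, and the toy training tweets are all made up for illustration. Stemming is left as a pluggable `stem` callable (identity by default); NLTK's `PorterStemmer().stem` could be supplied instead, for example via `functools.partial`.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative stopword subset; a real list (e.g. NLTK's) is much longer.
STOPWORDS = {"i", "it", "this", "is", "the", "a", "was", "my", "and", "so", "at"}

def preprocess(tweet, stem=lambda token: token):
    """Clean -> tokenize -> remove stopwords -> stem, returning a token list."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # drop URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)           # keep letters only
    tokens = text.split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

# Toy training data standing in for Sentiment140 (1 = positive, 0 = negative).
tweets = [
    "I love this product, it works amazingly",
    "Just got promoted at work, so happy",
    "What a great day, love it",
    "This is the worst experience ever",
    "My phone broke and support was awful",
    "Terrible service, worst day",
]
labels = [1, 1, 1, 0, 0, 0]

# Passing a callable as `analyzer` makes TfidfVectorizer use our token lists
# instead of its built-in tokenization.
model = make_pipeline(
    TfidfVectorizer(analyzer=preprocess),
    LogisticRegression(max_iter=1000),
)
model.fit(tweets, labels)

label_names = {0: "Negative", 1: "Positive"}
for tweet in ["love this great product", "worst service ever"]:
    predicted = int(model.predict([tweet])[0])
    print(tweet, "->", label_names[predicted])
```

Putting the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is fitted on training data only, which keeps train and test accuracy comparable.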

Tech Stack

Python, Scikit-learn, NLTK, Pandas, NumPy, Matplotlib, Seaborn, Swifter

Dataset

Sentiment140 — 1.6M tweets (800K positive, 800K negative), sourced from Kaggle.
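For orientation, a hedged sketch of loading the raw file with Pandas. It relies on the published Sentiment140 layout: no header row, six columns, Latin-1 encoding, and polarity stored as 0 (negative) or 4 (positive). The `load_sentiment140` helper and the 0/1 relabeling are assumptions for illustration, and the two sample rows are made up so the snippet runs without the download.

```python
import io

import pandas as pd

# Column layout of the Sentiment140 CSV (the file has no header row).
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

def load_sentiment140(source):
    """Read the raw CSV and return a frame with `text` and a 0/1 `label`."""
    df = pd.read_csv(source, encoding="latin-1", header=None, names=COLUMNS)
    # The raw file encodes negative as 0 and positive as 4; map to 0/1.
    df["label"] = df["target"].map({0: 0, 4: 1})
    return df[["text", "label"]]

# Two made-up rows in the same layout, standing in for the real file.
sample = io.BytesIO(
    b'"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my phone broke"\n'
    b'"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","so happy today"\n'
)
df = load_sentiment140(sample)
print(df)
```

With the real download, the path to the extracted CSV would be passed in place of `sample`.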

Project Structure

twitter-sentiment-analysis/
├── notebooks/
│   └── TSA.ipynb
├── README.md
└── requirements.txt

Run

Open notebooks/TSA.ipynb in Google Colab or Jupyter and run all cells.

Kaggle API credentials (kaggle.json) are required to download the dataset. See the Kaggle API docs for setup.
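As a rough guide, the Kaggle client looks for the token at `~/.kaggle/kaggle.json` and expects it to be readable only by its owner. The sketch below shows one way to put it there from Python; the `install_kaggle_credentials` helper is hypothetical, and the demonstration writes a placeholder token into a temporary directory so that no real files are touched.

```python
import os
import shutil
import tempfile
from pathlib import Path

def install_kaggle_credentials(src, home=None):
    """Copy kaggle.json to <home>/.kaggle/ and restrict it to the owner."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    os.chmod(target, 0o600)  # equivalent to `chmod 600`
    return target

# Demonstration against a temporary directory with a placeholder token.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "kaggle.json"
    fake_token.write_text('{"username": "example", "key": "placeholder"}')
    installed = install_kaggle_credentials(fake_token, home=Path(tmp) / "home")
    installed_mode = installed.stat().st_mode & 0o777
    print(installed.name, oct(installed_mode))
```

For real use, call `install_kaggle_credentials("kaggle.json")` with the token downloaded from your Kaggle account page and leave `home` unset.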

About

NLP pipeline classifying 1.6M tweets as positive or negative — Logistic Regression, Naive Bayes, and LinearSVC compared via TF-IDF vectorization on the Sentiment140 dataset.
