The objective of this project is to determine which sentiment analysis method provides better results when analyzing X data. We compared the performance of traditional machine learning models, deep neural networks, and Transformers by leveraging Natural Language Processing (NLP) techniques. We evaluate these models using accuracy, confusion matrix, precision, and loss.
The dataset used is Kaggle's Sentiment140. It consists of 1.6 million tweets labelled as positive or negative. You can access the dataset on Kaggle at the following link: https://www.kaggle.com/datasets/kazanova/sentiment140
- Distribution of Positive and Negative Tweets
- Distribution of @UserMentions , Links & #Hashtags
- Frequency of Positive words
- Frequency of Negative words
NLP plays a pivotal role in extracting, processing, and understanding textual data to determine the sentiment expressed within it. NLP techniques are used to clean and prepare textual data for analysis. This includes removing noise (e.g., stopwords and punctuation), normalizing text (e.g., lowercasing, stemming, and lemmatization), and handling slang and abbreviations.
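The cleaning steps above can be sketched with plain Python and regular expressions. This is a minimal illustration, not the project's actual pipeline; the tiny `STOPWORDS` set is a placeholder assumption (a real pipeline would use a full stopword list, e.g. from NLTK):

```python
import re

# Illustrative stopword list only; a real pipeline would use NLTK's full list.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "and", "of", "in", "i", "this"}

def clean_tweet(text: str) -> str:
    """Normalize a raw tweet: lowercase, strip links/mentions/hashtags,
    drop punctuation, and remove stopwords."""
    text = text.lower()                        # normalize case
    text = re.sub(r"https?://\S+", "", text)   # strip links
    text = re.sub(r"[@#]\w+", "", text)        # strip @mentions and #hashtags
    text = re.sub(r"[^a-z\s]", "", text)       # drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("@user I LOVE this!! http://t.co/xyz #happy"))  # -> "love"
```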
- ML Models: Logistic Regression , Naive Bayes
- Neural Network: LSTM
- Transformer: BERT
- Logistic Regression
A widely used statistical model for binary classification. It predicts the probability that a given input belongs to a particular class by applying the logistic function (also known as the sigmoid function) to a linear combination of input features.
Tokenizer: TfidfVectorizer
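A minimal sketch of this setup with scikit-learn, assuming the standard `TfidfVectorizer` + `LogisticRegression` pipeline. The six example tweets and their labels are invented stand-ins for the Sentiment140 data (the real project trains on 1.6 million rows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the Sentiment140 tweets; 1 = positive, 0 = negative.
texts = ["i love this", "great day so happy", "awesome fun time",
         "i hate this", "terrible awful day", "worst time ever"]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each tweet into a weighted bag-of-words vector; logistic
# regression applies the sigmoid to a linear score over those weights.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["what a great happy day"]))  # expected: [1]
```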
- Naive Bayes
A probabilistic classifier based on Bayes' theorem, with the "naive" assumption that features are conditionally independent given the class.
Tokenizer: CountVectorizer
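The analogous sketch for Naive Bayes, assuming the `CountVectorizer` + `MultinomialNB` pairing from scikit-learn; again, the training texts are toy examples, not the real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the tweet corpus; 1 = positive, 0 = negative.
texts = ["love this movie", "happy great fun", "best day ever",
         "hate this movie", "sad awful day", "worst thing ever"]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer builds raw word counts; MultinomialNB applies Bayes' theorem
# under the assumption that words occur independently given the class.
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(texts, labels)

print(nb_model.predict(["great fun movie"]))  # expected: [1]
```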
- Evaluation Table
| Model | Accuracy | Dataset size |
|---|---|---|
| Logistic Regression | 78.1 | 1,600,000 |
| Naive Bayes | 76.7 | 1,600,000 |
- LSTM (Long Short-Term Memory)
A type of recurrent neural network. An LSTM recurrent unit tries to "remember" all the relevant knowledge the network has seen so far and to "forget" irrelevant data.
Embeddings: Word2Vec
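The "remember/forget" behaviour comes from the LSTM cell's gates. Below is a minimal NumPy sketch of a single LSTM time step, purely to illustrate the mechanism; it is not the project's actual Keras/Word2Vec implementation, and the random weights are placeholders:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates decide what to forget, store, and emit."""
    z = W @ x + U @ h_prev + b   # stacked pre-activations, shape (4*n,)
    n = len(h_prev)
    f = sigmoid(z[:n])           # forget gate: how much old cell state to keep
    i = sigmoid(z[n:2*n])        # input gate: how much new information to write
    o = sigmoid(z[2*n:3*n])      # output gate: how much cell state to expose
    g = np.tanh(z[3*n:])         # candidate cell state
    c = f * c_prev + i * g       # "remember" old state, blend in the new
    h = o * np.tanh(c)           # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
n, d = 4, 3                                  # hidden size, input size
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, d)):            # a sequence of 5 "word vectors"
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

In the actual model, the input vectors `x` would be the Word2Vec embeddings of the tweet's words rather than random noise.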
- Evaluation Table
| Model | Accuracy | Dataset size | Number of Epochs |
|---|---|---|---|
| LSTM | 79 | 300,000 | 8 |
- BERT (Bidirectional Encoder Representations from Transformers)
BERT is a deep bidirectional, unsupervised language representation, pre-trained on a plain-text corpus. BERT converts words into numbers; this matters because machine learning models take numbers, not words, as inputs, which lets you train models directly on your textual data.
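BERT's word-to-number step is WordPiece tokenization: unknown words are greedily split into known subwords, each mapped to a vocabulary ID. The toy vocabulary and IDs below are entirely made up for illustration; real BERT ships a ~30k-token vocabulary plus special tokens like `[CLS]` and `[SEP]`:

```python
# Made-up toy vocabulary; IDs are placeholders for illustration only.
VOCAB = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "great": 4, "un": 5}

def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, in the style of BERT's tokenizer."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub     # continuation pieces carry a '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]         # no piece matched: whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

tokens = wordpiece("playing", VOCAB)
ids = [VOCAB[t] for t in tokens]
print(tokens, ids)  # ['play', '##ing'] [1, 2]
```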
BERT model used: bert-base-multilingual-uncased-sentiment
I was able to use only a limited number of epochs due to BERT's long training time, and the dataset size was greatly reduced for the same reason.
- Evaluation Table
| Model | No of Epochs | Train Loss | Precision | Recall | F1 score | Dataset size |
|---|---|---|---|---|---|---|
| BERT | 3 | 38.2 | 80 | 77.6 | 75.4 | 20,000 |
A sentiment analyzer is built using the Streamlit interface. The user can upload a file of texts to be analyzed; after analysis, the app shows the predicted sentiment next to each text and, to aid visual inspection, provides a bar chart and a pie chart of the overall results. If needed, the user can download a CSV file containing all the sentiment results directly from the app.
Model used: bert-base-multilingual-uncased-sentiment
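The batch step behind the app can be sketched with pandas. This is only an outline of the data flow: `fake_bert_sentiment` is a hypothetical stub standing in for the real BERT model, and the in-memory CSV simulates the user's uploaded file:

```python
import io
import pandas as pd

def fake_bert_sentiment(text: str) -> str:
    """Hypothetical stand-in for the real model
    (bert-base-multilingual-uncased-sentiment)."""
    return "positive" if ("good" in text or "love" in text) else "negative"

# Simulate an uploaded CSV with a 'text' column.
uploaded = io.StringIO("text\nlove this phone\nbad battery\ngood camera\n")
df = pd.read_csv(uploaded)

# Score every row, then aggregate counts for the bar and pie charts.
df["sentiment"] = df["text"].apply(fake_bert_sentiment)
counts = df["sentiment"].value_counts()
print(counts.to_dict())  # {'positive': 2, 'negative': 1}

# Export the per-row results, as the app's CSV download does.
result_csv = df.to_csv(index=False)
```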
- Demo video
0801.mp4
- Batch sentiment analysis