# Twitter Sentiment Analysis

A comprehensive machine learning project that performs binary sentiment classification on Twitter data using Natural Language Processing techniques and Logistic Regression. The model analyzes tweet text to determine whether the sentiment expressed is positive or negative.
## Table of Contents

- Overview
- Dataset
- Features
- Installation
- Usage
- Model Architecture
- Performance Metrics
- Project Structure
- API Reference
- Contributing
- Future Improvements
- License
## Overview

This project implements a complete sentiment analysis pipeline that:
- Preprocesses raw tweet data using advanced NLP techniques
- Trains a Logistic Regression classifier on 1.6 million labeled tweets
- Evaluates model performance using standard metrics
- Saves the trained model for future deployment and predictions
The classifier maps each tweet to one of two labels:

- 0: Negative sentiment 😞
- 1: Positive sentiment 😊
## Dataset

The project utilizes the Sentiment140 dataset from Kaggle, which contains:
- 1,600,000 tweets with sentiment labels
- 6 features: target, ids, date, flag, user, text
- Balanced dataset with equal positive and negative samples
- Pre-labeled data for supervised learning
```
Columns: [target, ids, date, flag, user, text]
Shape:   (1,600,000, 6)
Target Distribution: 50% positive, 50% negative
```
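A minimal loading sketch (not taken from the notebook; the `ISO-8859-1` encoding and the raw 0/4 labels are standard for Sentiment140, with 4 remapped to 1 to get the binary labels used here):

```python
import pandas as pd

# Column names follow the schema listed above
cols = ['target', 'ids', 'date', 'flag', 'user', 'text']
df = pd.read_csv('sentiment140/training.1600000.processed.noemoticon.csv',
                 names=cols, encoding='ISO-8859-1')

# The raw file labels positive tweets as 4; map them to 1
df['target'] = df['target'].replace(4, 1)
```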
## Features

### Text Preprocessing

- Special Character Removal: Cleans non-alphabetic characters
- Case Normalization: Converts all text to lowercase
- Stop Words Removal: Eliminates common English stop words
- Stemming: Reduces words to their root form using Porter Stemmer
- Tokenization: Splits text into meaningful tokens
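A minimal sketch of this preprocessing chain, assuming NLTK's Porter stemmer and English stop word list (the function name matches the `stemming()` entry in the API Reference below):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')  # one-time download of the stop word list

port_stem = PorterStemmer()
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Special character removal: keep alphabetic characters only
    cleaned = re.sub('[^a-zA-Z]', ' ', content)
    # Case normalization and whitespace tokenization
    tokens = cleaned.lower().split()
    # Stop word removal followed by Porter stemming
    stemmed = [port_stem.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(stemmed)
```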
### Machine Learning Pipeline

- Feature Extraction: TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
- Data Splitting: 80% training, 20% testing with stratified sampling
- Model Training: Logistic Regression with optimized parameters
- Model Persistence: Saves trained model using pickle for deployment
## Installation

### Prerequisites

- Python 3.7+
- Kaggle account and API credentials
### Setup

1. Clone the repository

   ```bash
   git clone https://github.com/Rajaykumar12/Twitter_Sentimental_analysis.git
   cd Twitter_Sentimental_analysis
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Configure Kaggle API

   ```bash
   # Download kaggle.json from your Kaggle account
   mkdir -p ~/.kaggle
   cp kaggle.json ~/.kaggle/
   chmod 600 ~/.kaggle/kaggle.json
   ```
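With the credentials in place, the dataset can also be fetched programmatically. A minimal sketch using the `opendatasets` package from the requirements (the Kaggle URL is assumed to be the standard Sentiment140 listing):

```python
import opendatasets as od

# Downloads into ./sentiment140/; may prompt for your Kaggle
# username and API key if kaggle.json is not found
od.download('https://www.kaggle.com/datasets/kazanova/sentiment140')
```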
### Requirements

```
pandas>=1.3.0
numpy>=1.21.0
nltk>=3.6
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
kaggle>=1.5.12
opendatasets>=0.1.20
```
## Usage

1. Launch Jupyter Notebook

   ```bash
   jupyter notebook Sentiment_analysis.ipynb
   ```

2. Run all cells to execute the complete pipeline:
   - Data download and preprocessing
   - Model training and evaluation
   - Model saving and testing
### Making Predictions

To classify new text with the saved model:

```python
import pickle

# Load the trained model
model = pickle.load(open('trained_model.sav', 'rb'))

# Load the vectorizer (you'll need to save this alongside the model)
vectorizer = pickle.load(open('vectorizer.sav', 'rb'))

def predict_sentiment(text):
    # Apply the same preprocessing used on the training data
    # (the stemming() function documented in the API Reference)
    processed_text = stemming(text)
    vectorized_text = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_text)
    return "Positive" if prediction[0] == 1 else "Negative"

# Example usage
sentiment = predict_sentiment("I love this product!")
print(f"Sentiment: {sentiment}")
```

## Model Architecture

### Preprocessing Pipeline

```
Raw Tweet → Clean Text → Tokenize → Remove Stopwords → Stem → TF-IDF → Features
```
### Model Configuration

- Algorithm: Logistic Regression
- Solver: Default (lbfgs)
- Max Iterations: 1000
- Feature Extraction: TF-IDF Vectorization
- Input Dimension: Variable (depends on vocabulary size)
- Output: Binary classification (0 or 1)
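A condensed training sketch matching this configuration (variable names are illustrative; `df` is the DataFrame from the Dataset section with `text` already run through `stemming()`):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 80/20 stratified split, as described in Features
X_train, X_test, y_train, y_test = train_test_split(
    df['text'].values, df['target'].values,
    test_size=0.2, stratify=df['target'].values, random_state=42)

# TF-IDF features: fit on training data only, then transform both splits
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Persist both artifacts so predictions reuse the exact same vocabulary
pickle.dump(model, open('trained_model.sav', 'wb'))
pickle.dump(vectorizer, open('vectorizer.sav', 'wb'))
```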
## Performance Metrics

| Metric | Training Set | Test Set |
|---|---|---|
| Accuracy | ~79.0% | ~79.0% |
| Dataset Size | 1,280,000 | 320,000 |
- No Overfitting: Similar training and testing accuracy
- Balanced Performance: Works well on both positive and negative sentiments
- Scalable: Can handle large volumes of text data efficiently
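Continuing the training sketch above, the overfitting check is a direct comparison of the two accuracies:

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, model.predict(X_train_vec))
test_acc = accuracy_score(y_test, model.predict(X_test_vec))
print(f'Training accuracy: {train_acc:.3f}')  # ~0.79 reported above
print(f'Test accuracy:     {test_acc:.3f}')   # ~0.79 reported above
```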
## Project Structure

```
Twitter_Sentimental_analysis/
│
├── Sentiment_analysis.ipynb      # Main Jupyter notebook
├── README.md                     # Project documentation
├── requirements.txt              # Python dependencies
├── trained_model.sav             # Serialized trained model
├── kaggle.json                   # Kaggle API credentials (not tracked)
│
└── sentiment140/                 # Dataset directory (auto-created)
    └── training.1600000.processed.noemoticon.csv
```
## API Reference

### `stemming(content)`

Preprocesses text by removing special characters, converting to lowercase, removing stop words, and applying stemming.
Parameters:

- `content` (str): Raw text to be processed

Returns:

- `str`: Preprocessed text ready for vectorization
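For example, with the sketch from the Features section (exact stems depend on the Porter implementation):

```python
print(stemming("Loving the new features!"))  # -> 'love new featur'
```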
The notebook's workflow maps onto these calls:

- Data Loading: `pd.read_csv()` with proper encoding
- Preprocessing: apply the `stemming()` function to all tweets
- Vectorization: `TfidfVectorizer().fit_transform()`
- Training: `LogisticRegression().fit()`
- Evaluation: `accuracy_score()`
## Contributing

We welcome contributions! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Areas where contributions are especially welcome:
- Model performance improvements
- Additional preprocessing techniques
- Web interface development
- Real-time analysis features
- Documentation enhancements
## Future Improvements

- Save the vectorizer alongside the model for complete pipeline persistence
- Add cross-validation for more robust performance estimation
- Implement confusion matrix and detailed classification metrics
- Create requirements.txt with exact version specifications
- Advanced preprocessing: Handle emojis, URLs, and mentions
- Ensemble methods: Combine multiple algorithms for better accuracy
- Feature engineering: Add sentiment lexicon features
- Model comparison: Test Random Forest, SVM, and Neural Networks
- Deep Learning: Implement LSTM/BERT for improved accuracy
- Multi-class classification: Extend to neutral, very positive/negative
- Real-time API: Deploy model as REST API service
- Web interface: Create user-friendly web application
- Streaming analysis: Process live Twitter data
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Sentiment140 Dataset: Go et al. (2009) for providing the labeled dataset
- NLTK Library: For comprehensive natural language processing tools
- Scikit-learn: For machine learning algorithms and utilities
- Kaggle: For hosting and providing access to the dataset
Made with ❤️ for the NLP and Machine Learning community