This repository contains a machine learning application for detecting offensive language in Yoruba text. The app can be run with either Streamlit or Flask and uses a Logistic Regression model trained on a dataset of tweets. Input text is converted to TF-IDF features before classification.
The goal of this project is to detect offensive language in Yoruba text. The app takes a sentence as input and predicts whether it contains offensive language, contains hate speech, or is normal. The model uses TF-IDF vectorization for feature extraction and Logistic Regression for classification.
- Text Input: Users can input a sentence to check for offensive language.
- Real-Time Prediction: The app provides instant predictions using a pre-trained machine learning model.
- Clean and Preprocess Text: The app cleans and preprocesses the input text (e.g., removes emojis, URLs, and special characters) before making predictions.
- Streamlit and Flask Support: The app can be deployed using either Streamlit or Flask.
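The cleaning feature listed above can be illustrated with a minimal sketch. The function name `clean_text`, the helper `_keep`, and the regular expressions are assumptions made for illustration, not the app's actual code. Because Yoruba tone marks are often stored as combining characters, this sketch deliberately keeps the combining diacritics block (U+0300 to U+036F) while dropping emojis, digits, and punctuation.

```python
import re
import unicodedata

# Illustrative patterns only; the app's actual cleaning rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\S+")


def _keep(ch):
    """Keep letters, whitespace, and combining diacritics (U+0300 to U+036F)."""
    return (
        unicodedata.category(ch).startswith("L")
        or ch.isspace()
        or "\u0300" <= ch <= "\u036f"
    )


def clean_text(text):
    """Hypothetical helper: lowercase the text and strip URLs, mentions,
    hashtags, emojis, digits, punctuation, and extra spaces."""
    text = unicodedata.normalize("NFC", text.lower())
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    # Replace every character that is not a letter, space, or tone mark.
    text = "".join(ch if _keep(ch) else " " for ch in text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

For example, `clean_text("Hello WORLD!! 123 https://example.com @user #tag")` returns `"hello world"`.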
- Python 3.7 or higher
- pip (Python package manager)
- Clone the Repository:
  git clone https://github.com/DominionAkinrotimi/Yoruba-Offensive-Language-Detection-Model.git
  cd Yoruba-Offensive-Language-Detection-Model
- Create a Virtual Environment (Optional but Recommended):
  python -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install Dependencies:
  pip install -r requirements.txt
- Download NLTK Data: The app uses NLTK for text preprocessing. Download the required NLTK data by running:
  python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
- Run the Streamlit App:
  streamlit run app.py
- Open the App: The app will open in your default web browser at http://localhost:8501.
- Enter Text: Input a sentence in the text box and click Predict to see the result.
- Run the Flask App:
  python main.py
- Open the App: The app will be available at http://localhost:5000.
- Enter Text: Input a sentence in the text box and click Predict to see the result.
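Whichever interface is used, clicking Predict presumably comes down to the same two calls: transform the sentence with the saved vectorizer, then classify it with the saved model. The sketch below shows that path under stated assumptions. The function names `load_artifacts` and `predict_label` are hypothetical, and the actual code in `app.py` and `main.py` may be organized differently. Only the file names `lr_model.pkl` and `tfidf_vectorizer.pkl` come from this repository.

```python
import joblib


def load_artifacts(model_path="lr_model.pkl",
                   vectorizer_path="tfidf_vectorizer.pkl"):
    """Load the saved model and vectorizer from disk (hypothetical helper)."""
    return joblib.load(model_path), joblib.load(vectorizer_path)


def predict_label(text, model, vectorizer):
    """Return the predicted class for one sentence.

    Assumes `text` has already been cleaned the same way as the
    training data, since the vectorizer only knows that vocabulary.
    """
    features = vectorizer.transform([text])  # one-row feature matrix
    return model.predict(features)[0]
```

A typical call would be `model, vectorizer = load_artifacts()` once at startup, followed by `predict_label(sentence, model, vectorizer)` for each request.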
The model was developed using the following steps:
- Data Cleaning:
  - Convert text to lowercase.
  - Remove emojis, URLs, hashtags, mentions, and special characters.
  - Remove digits and extra spaces.
- Text Preprocessing:
  - Tokenize the text.
  - Remove stopwords.
- Feature Extraction:
  - Use TF-IDF Vectorization to convert text into numerical features.
- Model Training:
  - Train a Logistic Regression model on the preprocessed data.
- Model Saving:
  - Save the trained model and vectorizer using joblib.
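The feature extraction, training, and saving steps above can be sketched with scikit-learn and joblib. This is an illustration under assumptions, not the notebook's actual code: the function name `train_and_save` is hypothetical, hyperparameters are scikit-learn defaults apart from `max_iter`, and the NLTK tokenization and stopword removal are omitted, so `TfidfVectorizer` applies its own default tokenizer here.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def train_and_save(texts, labels,
                   model_path="lr_model.pkl",
                   vectorizer_path="tfidf_vectorizer.pkl"):
    """Fit TF-IDF and Logistic Regression, then persist both with joblib.

    `texts` is a list of already-cleaned strings and `labels` a list of
    class names of the same length (hypothetical signature).
    """
    # Learn the vocabulary and IDF weights, and build the feature matrix.
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)

    # max_iter is raised as a precaution against convergence warnings.
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)

    # Save both objects; predictions need the same fitted vectorizer.
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)
    return model, vectorizer
```

The vectorizer is saved alongside the model because a prediction is only meaningful when new text is mapped onto the vocabulary learned during training.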
offensive-language-detection/
├── app.py # Streamlit app
├── main.py # Flask app
├── requirements.txt # List of dependencies
├── lr_model.pkl # Trained Logistic Regression model
├── tfidf_vectorizer.pkl # Fitted TF-IDF Vectorizer
├── README.md # Project documentation
├── templates/ # Flask HTML templates
│ └── index.html # Flask app homepage
└── notebooks/ # Jupyter notebooks for model development
    └── model_development.ipynb
Contributions are welcome! If you'd like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Commit your changes and push to the branch.
- Submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
- The dataset used for training was sourced from 🙈 (I'm shy).
- Special thanks to the developers of scikit-learn, Streamlit, and Flask for their amazing libraries.
For questions or feedback, please contact:
- Dominion Akinrotimi
- Email: [email protected]
Enjoy using the Offensive Yoruba Language Detection App! 🚀