Classify whether a user is anomalous or normal using supervised and unsupervised models. This is best viewed in Google Colab, where all the visualisations and outputs are preserved in the interactive Python notebook.
Given users' ratings (0-5) of items, identify whether each user is anomalous or not. The data consists of three columns (user, item, rating); for example, user 3 gives item 1 a rating of 5. The dataset is imbalanced, as there are few anomalous users compared to normal ones. The project runs in three phases across three weeks; in each phase, the labels for the previous phase's data are released along with a new test set. We are recommended to try at least two supervised methods and one unsupervised method, and a team needs to rank at the top in terms of performance (ROC AUC) to score well for this project.
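Since teams are ranked on ROC AUC, it helps to recall what the metric measures: the probability that a randomly chosen anomalous user receives a higher score than a randomly chosen normal user, with ties counted as half. The snippet below is a minimal pure-Python sketch of that definition. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not taken from the project notebook.

```python
def roc_auc(labels, scores):
    """ROC AUC as P(score of a random positive > score of a random negative).

    labels: 1 for anomalous users, 0 for normal users.
    scores: model outputs where higher means "more anomalous".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical, imbalanced toy data (2 anomalous users out of 6).
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)
print(auc)
```

Because the metric depends only on how users are ranked, it is unaffected by the class imbalance in the way plain accuracy would be. The pairwise loop is quadratic, so for large data a library implementation is the more practical choice.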
- Logistic Regression
- KNN Classifier
- Random Forest
- XGBoost Classifier
- Neural Networks
- Autoencoder
- Isolation Forest
- Local Outlier Factor (LOF)
- Feature engineering
  - IQR of the user's ratings
  - Number of items rated / not rated
  - Number of items rated neutral
  - fsti: The ratio between the number of items rated by the user and the total number of items in the recommender system.
  - fsmaxrti: The ratio between the number of items the user rated with the maximum score and the total number of items in the recommender system.
  - fsminrti: The ratio between the number of items the user rated with the minimum score and the total number of items in the recommender system.
  - fspi: The ratio between the number of popular items rated by the user and the total number of popular items, K, in the recommender system.
  - fspii: The ratio between the number of popular items rated by the user and the total number of items rated by the user.
- Models
  - CatBoost
  - SVM
- Dealing with data imbalance
  - mixup data augmentation for the DNN
  - SMOTE
  - Class weights
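The ratio features listed under feature engineering can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the notebook's implementation: the function name `user_features`, the choice of 0 and 5 as minimum and maximum scores, and the way the set of popular items is supplied are all hypothetical. How the K popular items are selected is not specified above, so the set is simply passed in.

```python
from collections import defaultdict
from statistics import quantiles


def user_features(ratings, popular_items, n_items, min_score=0, max_score=5):
    """Compute per-user ratio features from (user, item, rating) tuples.

    popular_items: set of the K popular items (selection rule is an assumption).
    n_items: total number of items in the recommender system.
    """
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating

    features = {}
    for user, rated in by_user.items():
        values = list(rated.values())
        n_rated = len(rated)
        n_popular_rated = sum(1 for item in rated if item in popular_items)

        # IQR needs at least two ratings; fall back to 0.0 otherwise.
        if len(values) >= 2:
            q1, _, q3 = quantiles(values, n=4)
            iqr = q3 - q1
        else:
            iqr = 0.0

        features[user] = {
            "fsti": n_rated / n_items,
            "fsmaxrti": sum(1 for r in values if r == max_score) / n_items,
            "fsminrti": sum(1 for r in values if r == min_score) / n_items,
            "fspi": n_popular_rated / len(popular_items),
            "fspii": n_popular_rated / n_rated,
            "iqr": iqr,
            "n_not_rated": n_items - n_rated,
        }
    return features


# Hypothetical toy data: 10 items in total, of which "a" and "b" are popular.
ratings = [(1, "a", 5), (1, "b", 0), (1, "c", 3), (1, "d", 5), (2, "a", 4)]
feats = user_features(ratings, popular_items={"a", "b"}, n_items=10)
print(feats[1])
```

Each user ends up with one fixed-length feature row, which is the shape that the classifiers listed earlier expect as input.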
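Two of the imbalance techniques listed above are simple enough to sketch directly. Class weights are shown using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, and mixup blends two training examples with a coefficient drawn from a Beta(alpha, alpha) distribution. The function names, the `alpha=0.2` default, and the toy data are assumptions made for illustration; the settings actually used in the project are not stated here. SMOTE is not shown.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples: lam * (x_i, y_i) + (1 - lam) * (x_j, y_j)."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = lam * y_i + (1.0 - lam) * y_j    # soft label between the two classes
    return x, y, lam


# Hypothetical labels: 8 normal users (0) and 2 anomalous users (1).
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)

# Mix one anomalous example with one normal example (toy feature vectors).
rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair([1.0, 0.0], 1, [0.0, 1.0], 0, rng=rng)
print(x_mix, y_mix)
```

The rare class receives the larger weight, so misclassifying an anomalous user costs more during training. Mixup produces soft labels, so it suits models trained with a loss that accepts probabilities as targets, such as a neural network with cross-entropy.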