Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
We have data for genetic variations and mutations along with text-based clinical literature. There are nine classes, and we have to classify each data point into one of them, which makes this a multiclass classification problem. Classification is based on the probabilities our model assigns to each class. Since this is a critical problem, along with the probabilities we also report how many of a given number of top features are present in a test data point.
https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
We have two data files:
- training_variants: ID, Gene, Variation, Class
- training_text: ID, clinical literature text
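A minimal loading sketch with pandas; the file names and the `||` separator in training_text match the Kaggle data, while the merge step is our assumption about how the two files are combined:

```python
import pandas as pd

# Gene/variation metadata: ID, Gene, Variation, Class (1-9)
variants = pd.read_csv('training_variants')

# The text file separates ID and text with '||', so parse it explicitly
text = pd.read_csv('training_text', sep=r'\|\|', engine='python',
                   names=['ID', 'TEXT'], skiprows=1)

# One row per data point: variant metadata joined with its literature
data = pd.merge(variants, text, on='ID', how='left')
```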
- No low-latency requirement.
- Interpretability is important.
- Errors can be very costly.
- Probability of a data-point belonging to each class is needed.
- Penalize the errors: log loss.
- Log loss: the primary metric, since it works directly on the predicted class probabilities and heavily penalizes confident but wrong predictions.
- Confusion matrix: along with the confusion matrix, we also use the precision and recall matrices. These give a better understanding of how the model performs per class; a dark-coloured diagonal means the model is doing well (see the sketch below).
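A sketch of these metrics with scikit-learn and seaborn; `y_true`, `y_pred`, and `y_prob` are placeholder names for the labels, predictions, and predicted probabilities:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, log_loss

def show_metrics(y_true, y_pred, y_prob):
    """Print multi-class log loss and plot confusion/precision/recall matrices."""
    print('Log loss:', log_loss(y_true, y_prob))
    C = confusion_matrix(y_true, y_pred)
    P = C / C.sum(axis=0)                 # precision matrix: column-normalized
    R = C / C.sum(axis=1, keepdims=True)  # recall matrix: row-normalized
    for M, title in [(C, 'Confusion'), (P, 'Precision'), (R, 'Recall')]:
        plt.figure(figsize=(8, 5))
        sns.heatmap(M, annot=True, fmt='.2f', cmap='YlGnBu')
        plt.title(title + ' matrix')  # a dark diagonal means the model is doing well
        plt.xlabel('Predicted class')
        plt.ylabel('Actual class')
        plt.show()
```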
We will try a number of models to see which gives the best result (a sketch of the two ensemble models follows this list). Models used:
- Naive Bayes
- KNN
- Logistic Regression with class balancing
- Logistic Regression without class balancing
- Random Forest
- Linear SVM
- Stacking classifier (Ensemble model)
- Majority voting (Ensemble model)
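A minimal sketch of the two ensembles with scikit-learn; the base models and their settings here are placeholders, not the tuned models from this write-up:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# LinearSVC has no predict_proba, so wrap it in a calibrator
base = [
    ('nb', MultinomialNB(alpha=0.1)),
    ('lr', LogisticRegression(class_weight='balanced', max_iter=1000)),
    ('svm', CalibratedClassifierCV(LinearSVC())),
]

# Voting ensemble: 'soft' averages class probabilities ('hard' would be
# plain majority voting, but we need probabilities for log loss)
voting = VotingClassifier(estimators=base, voting='soft')

# Stacking ensemble: a logistic regression meta-model on the base outputs
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(max_iter=1000))
```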
We preprocess the clinical text before featurizing it (a sketch follows this list):
- Remove all the stop words
- Convert the text to lower case
- Replace multiple spaces with a single space
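A sketch of these steps, assuming NLTK's English stop-word list:

```python
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def preprocess(text: str) -> str:
    # Keep only alphanumerics and convert the text to lower case
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text).lower()
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove all the stop words
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)
```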
We use two approaches for featurizing data:
- Response coding: Better for Random forest
- One hot encoding: Better for Logistic Regression
- After vectorization, we stack all three features (Gene, Variation, text) to get the complete feature set for each data point (see the sketch below).
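A sketch of the one-hot path, assuming a `train` dataframe with Gene, Variation, and TEXT columns; for response coding, each categorical value would instead be replaced by its per-class response probabilities:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

gene_vec, var_vec = CountVectorizer(), CountVectorizer()
text_vec = CountVectorizer(min_df=3)  # min_df is an assumed cut-off for rare words

# One-hot / bag-of-words encode each feature, fitting on the training split only
X_gene = gene_vec.fit_transform(train['Gene'])
X_var  = var_vec.fit_transform(train['Variation'])
X_text = text_vec.fit_transform(train['TEXT'])

# Stack all three sparse matrices into the complete feature set
X_train = hstack([X_gene, X_var, X_text]).tocsr()
```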
- We split the whole data into a training set, a cross-validation set, and a test set (see the sketch below).
- As this is a multiclass problem, we check that the distribution of classes is consistent across the train, CV, and test sets.
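A sketch of a stratified split; the 64/16/20 proportions are an assumption, not from the write-up:

```python
from sklearn.model_selection import train_test_split

# Stratifying on the class label keeps the nine-class distribution
# consistent across the train, CV, and test sets
X_rest, X_test, y_rest, y_test = train_test_split(
    data, data['Class'], test_size=0.2, stratify=data['Class'], random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42)
```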
We tried building a model on each of the individual features to ascertain whether the feature is stable. If the test and CV errors were not very different from the train error, we concluded that the feature was stable (a sketch of the check follows).
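A sketch of this check for a single feature, using a logistic regression stand-in for the per-feature model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def check_stability(X_train, y_train, X_cv, y_cv, X_test, y_test):
    """Fit on one feature and compare log loss across the three splits."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for name, X, y in [('train', X_train, y_train),
                       ('cv', X_cv, y_cv),
                       ('test', X_test, y_test)]:
        # CV/test losses close to the train loss => the feature is stable
        print(name, 'log loss:', log_loss(y, clf.predict_proba(X)))
```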
Once we have done the EDA and concluded that we will use all the features, we apply our ML models. We use Naive Bayes as our baseline model, as it works well on text data (a baseline sketch follows).
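A minimal baseline sketch on the stacked features; the alpha value is a placeholder, and the calibration wrapper is there so the probabilities fed to log loss are better behaved:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes baseline; calibrate so predict_proba is reliable for log loss
nb = CalibratedClassifierCV(MultinomialNB(alpha=0.1))
nb.fit(X_train, y_train)
print('CV log loss:', log_loss(y_cv, nb.predict_proba(X_cv)))
```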
- Perform hyperparameter tuning to get the best values of the hyperparameters for each model (a tuning sketch follows the hyperparameter list below)
- Use those hyperparameters to perform the prediction
The hyperparameters tuned for each model:
- Naive Bayes: alpha (used in Laplace smoothing)
- KNN: number of neighbours
- Logistic Regression with class balancing: lambda (regularization parameter)
- Logistic Regression without class balancing: lambda (regularization parameter)
- Random Forest: number of estimators (base learners) and maximum depth
- Linear SVM: lambda (regularization parameter)
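A sketch of the tuning loop, shown for the Naive Bayes alpha; the grid values are assumptions:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

best_alpha, best_loss = None, np.inf
for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]:
    clf = CalibratedClassifierCV(MultinomialNB(alpha=alpha))
    clf.fit(X_train, y_train)
    loss = log_loss(y_cv, clf.predict_proba(X_cv))  # select on the CV set
    if loss < best_loss:
        best_alpha, best_loss = alpha, loss

# Refit with the best hyperparameter and evaluate once on the test set
final = CalibratedClassifierCV(MultinomialNB(alpha=best_alpha))
final.fit(X_train, y_train)
print('best alpha:', best_alpha,
      '| test log loss:', log_loss(y_test, final.predict_proba(X_test)))
```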