Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
We have data for genetic variations and mutations along with text-based clinical literature. There are nine classes, and we have to classify each data point into one of them, which makes this a multiclass classification problem. Classification is based on the probabilities our model assigns to each class. Since this is a critical problem, along with the probabilities we also report how many of a given number of top features are present in a test data point.
https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
We have two data files:
- training_variants: ID, Gene, Variation, Class
- training_text: ID, clinical literature text
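A minimal loading sketch with pandas; the file names and the `||` separator in training_text match the Kaggle data, while the merge step is our assumption about how the two files are combined:

```python
import pandas as pd

# Gene/variation metadata: ID, Gene, Variation, Class (1-9)
variants = pd.read_csv('training_variants')

# The text file separates ID and text with '||', so parse it explicitly
text = pd.read_csv('training_text', sep=r'\|\|', engine='python',
                   names=['ID', 'TEXT'], skiprows=1)

# One row per data point: variant metadata joined with its literature
data = pd.merge(variants, text, on='ID', how='left')
```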
- No low-latency requirement.
- Interpretability is important.
- Errors can be very costly.
- Probability of a data-point belonging to each class is needed.
- Penalize the errors: log loss.
- Log loss: the primary metric, since it works directly on the predicted class probabilities and heavily penalizes confident but wrong predictions.
- Confusion matrix: along with the confusion matrix, we also use the precision and recall matrices. These give a better understanding of how the model performs per class; a dark-coloured diagonal means the model is doing well (see the sketch below).
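A sketch of these metrics with scikit-learn and seaborn; `y_true`, `y_pred`, and `y_prob` are placeholder names for the labels, predictions, and predicted probabilities:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, log_loss

def show_metrics(y_true, y_pred, y_prob):
    """Print multi-class log loss and plot confusion/precision/recall matrices."""
    print('Log loss:', log_loss(y_true, y_prob))
    C = confusion_matrix(y_true, y_pred)
    P = C / C.sum(axis=0)                 # precision matrix: column-normalized
    R = C / C.sum(axis=1, keepdims=True)  # recall matrix: row-normalized
    for M, title in [(C, 'Confusion'), (P, 'Precision'), (R, 'Recall')]:
        plt.figure(figsize=(8, 5))
        sns.heatmap(M, annot=True, fmt='.2f', cmap='YlGnBu')
        plt.title(title + ' matrix')  # a dark diagonal means the model is doing well
        plt.xlabel('Predicted class')
        plt.ylabel('Actual class')
        plt.show()
```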
We will try a number of models to see which gives the best result (a sketch of the two ensemble models follows this list). Models used:
- Naive Bayes
- KNN
- Logistic Regression with class balancing
- Logistic Regression without class balancing
- Random Forest
- Linear SVM
- Stacking classifier (Ensemble model)
- Majority voting (Ensemble model)
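A minimal sketch of the two ensembles with scikit-learn; the base models and their settings here are placeholders, not the tuned models from this write-up:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# LinearSVC has no predict_proba, so wrap it in a calibrator
base = [
    ('nb', MultinomialNB(alpha=0.1)),
    ('lr', LogisticRegression(class_weight='balanced', max_iter=1000)),
    ('svm', CalibratedClassifierCV(LinearSVC())),
]

# Voting ensemble: 'soft' averages class probabilities ('hard' would be
# plain majority voting, but we need probabilities for log loss)
voting = VotingClassifier(estimators=base, voting='soft')

# Stacking ensemble: a logistic regression meta-model on the base outputs
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(max_iter=1000))
```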
We preprocess the clinical text before featurizing it (a sketch follows this list):
- Remove all the stop words
- Convert the text to lower case
- Replace multiple spaces with a single space
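A sketch of these steps, assuming NLTK's English stop-word list:

```python
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def preprocess(text: str) -> str:
    # Keep only alphanumerics and convert the text to lower case
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text).lower()
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove all the stop words
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)
```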
We use two approaches for featurizing data:
- Response coding: Better for Random forest
- One hot encoding: Better for Logistic Regression
- After vectorization, we stack all three features (Gene, Variation, text) to get the complete feature set for each data point (see the sketch below).
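A sketch of the one-hot path, assuming a `train` dataframe with Gene, Variation, and TEXT columns; for response coding, each categorical value would instead be replaced by its per-class response probabilities:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

gene_vec, var_vec = CountVectorizer(), CountVectorizer()
text_vec = CountVectorizer(min_df=3)  # min_df is an assumed cut-off for rare words

# One-hot / bag-of-words encode each feature, fitting on the training split only
X_gene = gene_vec.fit_transform(train['Gene'])
X_var  = var_vec.fit_transform(train['Variation'])
X_text = text_vec.fit_transform(train['TEXT'])

# Stack all three sparse matrices into the complete feature set
X_train = hstack([X_gene, X_var, X_text]).tocsr()
```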
- We split the whole data into a training set, a cross-validation set, and a test set (see the sketch below).
- As this is a multiclass problem, we check that the distribution of classes is consistent across the train, CV, and test sets.
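A sketch of a stratified split; the 64/16/20 proportions are an assumption, not from the write-up:

```python
from sklearn.model_selection import train_test_split

# Stratifying on the class label keeps the nine-class distribution
# consistent across the train, CV, and test sets
X_rest, X_test, y_rest, y_test = train_test_split(
    data, data['Class'], test_size=0.2, stratify=data['Class'], random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42)
```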
We tried building a model on each of the individual features to ascertain whether the feature is stable. If the test and CV errors were not very different from the train error, we concluded that the feature was stable (a sketch of the check follows).
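A sketch of this check for a single feature, using a logistic regression stand-in for the per-feature model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def check_stability(X_train, y_train, X_cv, y_cv, X_test, y_test):
    """Fit on one feature and compare log loss across the three splits."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for name, X, y in [('train', X_train, y_train),
                       ('cv', X_cv, y_cv),
                       ('test', X_test, y_test)]:
        # CV/test losses close to the train loss => the feature is stable
        print(name, 'log loss:', log_loss(y, clf.predict_proba(X)))
```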
Once we have done the EDA and concluded that we will use all the features, we apply our ML models. We use Naive Bayes as our baseline model, as it works well on text data (a baseline sketch follows).
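A minimal baseline sketch on the stacked features; the alpha value is a placeholder, and the calibration wrapper is there so the probabilities fed to log loss are better behaved:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes baseline; calibrate so predict_proba is reliable for log loss
nb = CalibratedClassifierCV(MultinomialNB(alpha=0.1))
nb.fit(X_train, y_train)
print('CV log loss:', log_loss(y_cv, nb.predict_proba(X_cv)))
```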
- Perform hyperparameter tuning to get the best values of the hyperparameters for each model (a tuning sketch follows the hyperparameter list below)
- Use those hyperparameters to perform the prediction
The hyperparameters tuned for each model:
- Naive Bayes: alpha (used in Laplace smoothing)
- KNN: number of neighbours
- Logistic Regression with class balancing: lambda (regularization parameter)
- Logistic Regression without class balancing: lambda (regularization parameter)
- Random Forest: number of estimators (base learners) and maximum depth
- Linear SVM: lambda (regularization parameter)
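A sketch of the tuning loop, shown for the Naive Bayes alpha; the grid values are assumptions:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

best_alpha, best_loss = None, np.inf
for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]:
    clf = CalibratedClassifierCV(MultinomialNB(alpha=alpha))
    clf.fit(X_train, y_train)
    loss = log_loss(y_cv, clf.predict_proba(X_cv))  # select on the CV set
    if loss < best_loss:
        best_alpha, best_loss = alpha, loss

# Refit with the best hyperparameter and evaluate once on the test set
final = CalibratedClassifierCV(MultinomialNB(alpha=best_alpha))
final.fit(X_train, y_train)
print('best alpha:', best_alpha,
      '| test log loss:', log_loss(y_test, final.predict_proba(X_test)))
```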