Kaggle Competition: Titanic - Machine Learning from Disaster


This project is a submission for the Titanic - Machine Learning from Disaster competition hosted on Kaggle. The goal is to predict survival outcomes on the Titanic based on passenger attributes using classical machine learning methods.

Overview

Using a combination of feature engineering, ensemble modeling, and cross-validation, this notebook achieves strong performance on the validation set and test submission.

Modeling Strategy (a code sketch follows this list):

  • Feature extraction from raw data (titles, categorical encodings)
  • Imputation for missing values
  • One-hot encoding for categorical variables
  • Ensemble classification using:
    • RandomForestClassifier
    • XGBRFClassifier
    • Combined via a soft-voting VotingClassifier
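
A minimal sketch of this ensemble, assuming illustrative hyperparameters and synthetic stand-in data (the notebook's exact settings may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBRFClassifier

# Synthetic stand-in for the engineered Titanic features (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=42)   # assumed settings
xgb_rf = XGBRFClassifier(n_estimators=300, random_state=42)      # assumed settings

# voting="soft" averages the predicted class probabilities of both models
ensemble = VotingClassifier(estimators=[("rf", rf), ("xgb", xgb_rf)], voting="soft")
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```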

Brief explanation of the methods used

Random Forest Core Idea: A Random Forest is an ensemble of decision trees trained on random subsets of the data and features. The final prediction aggregates the trees by majority vote (classification) or averaging (regression); a short scikit-learn sketch follows the key concepts below.
Key Concepts:

  • Bagging: Each tree is trained on a random sample (with replacement) of the training data.
  • Feature Randomness: At each split, only a random subset of features is considered, which reduces correlation between trees.
  • Result: Low bias + low variance leads to strong generalization.
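
Both ideas map directly to constructor arguments in scikit-learn; the parameter values here are illustrative, not the notebook's:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # number of trees (illustrative)
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of the rows
    max_features="sqrt",  # feature randomness: each split considers sqrt(n_features) features
    random_state=42,
)
```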

XGBoost Core Idea: XGBoost is a gradient boosting algorithm; trees are built sequentially, each trying to correct the errors of the last.
Key Concepts:

  • Boosting: Unlike Random Forest, XGBoost builds trees one after another. Each tree focuses on where the previous model did poorly.
  • Gradient Descent: Each new tree is fit to the gradient of the loss function, so every boosting round moves the model's predictions in the direction that most reduces the loss.
  • Regularization: XGBoost penalizes overly complex trees (L1/L2 regularization), making it robust against overfitting.
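
Note that the notebook itself uses XGBRFClassifier, XGBoost's random-forest mode. A plain boosted XGBClassifier exposing the boosting and regularization knobs described above might look like this (values are illustrative):

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,   # boosting rounds: trees added one after another
    learning_rate=0.1,  # shrinks each tree's contribution (gradient step size)
    max_depth=4,        # limits tree complexity
    reg_alpha=0.1,      # L1 penalty on leaf weights
    reg_lambda=1.0,     # L2 penalty on leaf weights
)
```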

Features Used

Feature        Description
Pclass         Passenger class (proxy for wealth)
Sex            Binary-encoded gender
Age            Median-imputed age
SibSp, Parch   Number of siblings/spouses and parents/children aboard
Fare           Ticket price (proxy for economic status)
Embarked       Port of embarkation (one-hot encoded)
Title          Honorific extracted from the Name field
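
A sketch of two of these steps, title extraction and median imputation, on a toy frame; the regex is an assumption about how the notebook parses the Name field:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Age": [22.0, None],
})

# Pull the honorific that precedes the period in each name, e.g. "Mr", "Mrs"
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Median-impute missing ages
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df)
```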

Results

Metric                      Value
Cross-Validation Accuracy   83.5% ± 3.9%
Validation Accuracy         83.8%

The model was evaluated with RepeatedStratifiedKFold (10 folds repeated 3 times) to reduce variance in the cross-validation estimate.
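
A self-contained sketch of that evaluation scheme, using stand-in data and a stand-in classifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data
clf = RandomForestClassifier(random_state=42)              # stand-in model

# 10 folds x 3 repeats = 30 accuracy estimates; report mean and spread
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(clf, X, y, scoring="accuracy", cv=cv)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```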


Key Techniques

  • Title extraction from passenger names (Mr, Mrs, Miss, etc.) to improve prediction
  • Ensemble learning to boost performance and reduce overfitting
  • Cross-validation for more reliable model evaluation
  • One-hot encoding with drop_first=True to avoid multicollinearity
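
For example, the drop_first=True behavior on a toy Embarked column:

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Dropping the first level ("C") leaves dummies that are not perfectly
# collinear: a row of all zeros unambiguously means Embarked == "C"
dummies = pd.get_dummies(df, columns=["Embarked"], drop_first=True)
print(list(dummies.columns))  # ['Embarked_Q', 'Embarked_S']
```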
