This project is a submission for the Titanic - Machine Learning from Disaster competition hosted on Kaggle. The goal is to predict survival outcomes on the Titanic from passenger attributes using classical machine learning methods.
Using a combination of feature engineering, ensemble modeling, and cross-validation, this notebook achieves strong performance on the validation set and test submission.
Modeling Strategy:
- Feature extraction from raw data (titles, categorical encodings)
- Imputation for missing values
- One-hot encoding for categorical variables
- Ensemble classification using:
  - RandomForestClassifier
  - XGBRFClassifier
- Combined via soft voting in a VotingClassifier
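The soft-voting step above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline; GradientBoostingClassifier stands in for XGBRFClassifier so the sketch needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Toy data standing in for the engineered Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# The notebook combines RandomForestClassifier and XGBRFClassifier;
# GradientBoostingClassifier is a stand-in for the second model here.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    voting="soft",  # average predicted probabilities instead of hard majority vote
)
ensemble.fit(X, y)
probs = ensemble.predict_proba(X)  # averaged class probabilities, shape (300, 2)
```

With voting="soft", the ensemble averages each member's predicted probabilities, which usually outperforms hard voting when the base models are well calibrated.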
Random Forest
Core Idea:
A Random Forest is an ensemble of decision trees trained on random subsets of the data and features. The final prediction is made by majority vote (classification) or averaging (regression).
Key Concepts:
- Bagging: Each tree is trained on a random sample (with replacement) of the training data.
- Feature Randomness: At each split, only a random subset of features is considered, which reduces correlation between trees.
- Result: Averaging many decorrelated trees keeps bias low while cutting variance, leading to strong generalization.
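The two ideas combine in a few lines. Below is a from-scratch sketch on synthetic data (not the notebook's code): each tree gets a bootstrap sample, max_features="sqrt" supplies the per-split feature randomness, and a majority vote produces the forest prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Bagging: each tree sees a bootstrap sample (drawn with replacement).
# Feature randomness: max_features="sqrt" limits candidate features per split.
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across the 25 trees gives the forest's class prediction.
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```

In practice RandomForestClassifier does exactly this (plus out-of-bag scoring and parallelism), so the sketch is only for intuition.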
XGBoost
Core Idea:
XGBoost is a gradient boosting algorithm: trees are built sequentially, each trying to correct the errors of the last.
Key Concepts:
- Boosting: Unlike Random Forest, XGBoost builds trees one after another; each tree focuses on where the previous model did poorly.
- Gradient Descent: It minimizes a loss function by fitting each new tree to the gradient of that loss.
- Regularization: XGBoost penalizes overly complex trees (L1/L2 regularization), making it robust against overfitting.
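The boosting loop above can be demonstrated from scratch. This toy sketch uses squared loss on a synthetic regression task (not the competition data), where the negative gradient is simply the residual, so each shallow tree is fit to what the ensemble still gets wrong.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Boosting: each shallow tree fits the residual errors of the ensemble so far.
# For squared loss, the negative gradient is simply y - prediction.
pred = np.zeros_like(y)
learning_rate = 0.1  # shrinks each tree's contribution, a simple regularizer
for _ in range(100):
    residual = y - pred
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * stump.predict(X)

mse = np.mean((y - pred) ** 2)  # training error shrinks as trees are added
```

XGBoost follows the same recipe but adds second-order gradient information and the L1/L2 penalties (reg_alpha, reg_lambda) mentioned above.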
| Feature | Description |
|---|---|
| Pclass | Passenger class (proxy for wealth) |
| Sex | Binary-encoded gender |
| Age | Median-imputed age |
| SibSp, Parch | Number of siblings/spouses and parents/children aboard |
| Fare | Ticket price (proxy for economic status) |
| Embarked | Port of embarkation (one-hot encoded) |
| Title | Extracted honorific from the Name field |
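The Title feature can be pulled out with a single regex, since Kaggle's Name column places the honorific between the comma and the first period. The mini-frame below is illustrative, not the real dataset.

```python
import pandas as pd

# Hypothetical rows mirroring the format of Kaggle's Name column.
df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})

# Capture everything between ", " and the first "." as the honorific.
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
```

Rare titles (Dr, Rev, Countess, ...) are usually grouped into a catch-all bucket before encoding, so the model isn't fed near-empty categories.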
| Metric | Value |
|---|---|
| Cross-Validation Accuracy | 83.5% ± 3.9% |
| Validation Accuracy | 83.8% |
The model was trained using RepeatedStratifiedKFold with 10 folds repeated 3 times to reduce variance in cross-validation estimates.
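That evaluation scheme looks like the sketch below, here on synthetic data rather than the Titanic features: 10 folds times 3 repeats yields 30 accuracy estimates, whose mean and standard deviation give the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Toy data in place of the engineered Titanic features.
X, y = make_classification(n_samples=300, random_state=42)

# 10 folds x 3 repeats = 30 out-of-fold accuracy estimates.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)

mean_acc, std_acc = scores.mean(), scores.std()  # reported as mean ± std
```

Stratification keeps the survived/perished ratio consistent across folds, and repeating the split averages out the luck of any single partition.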
- Title extraction from passenger names (Mr, Mrs, Miss, etc.) to improve prediction
- Ensemble learning to boost performance and reduce overfitting
- Cross-validation for more reliable model evaluation
- One-hot encoding with drop_first=True to avoid multicollinearity
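The drop_first=True point is easiest to see on a tiny example. With k categories, keeping only k-1 indicator columns drops one redundant column: the omitted level is implied when all remaining indicators are zero.

```python
import pandas as pd

# Small illustrative frame; the real notebook encodes the full dataset.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Three ports -> two indicator columns; "C" (first alphabetically) is dropped
# and is implied when both Embarked_Q and Embarked_S are 0.
encoded = pd.get_dummies(df, columns=["Embarked"], drop_first=True)
```

Without the drop, the full set of indicators always sums to 1, a perfect linear dependence that can destabilize linear models (tree ensembles are less sensitive, but the leaner encoding costs nothing).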