Predicting the survival of passengers on the Titanic

Phyllis Joy Nabangi
2 min read · Feb 8, 2023
Source: https://unsplash.com/photos/ToRz-jwncrM

This article walks through the whole process of using a machine-learning model to predict the survival of Titanic passengers, from data collection and processing to building and evaluating the model.

Data Preparation and Exploration

Downloaded the three required datasets (train, test, and gender) from Kaggle, loaded them with pandas, and explored their summaries. Filled the missing values in the Age column of the train and test datasets with the mean age.
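As a minimal sketch of the mean-imputation step, the snippet below fills missing ages in a tiny stand-in DataFrame. The sample rows and the helper name `fill_age_with_mean` are made up for illustration; with the real Kaggle files you would load the data with `pd.read_csv` instead. It also assumes each dataset is filled with its own mean, which the write-up above does not spell out.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Kaggle train file (hypothetical rows).
# With the real data: train = pd.read_csv("train.csv")
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, 36.0],
})


def fill_age_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing Age values replaced by the column mean."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    return out


# Quick look at the data before and after imputation.
print(train.isna().sum())
train_filled = fill_age_with_mean(train)
print(train_filled.describe())
```

Here the mean of the three known ages (22, 26, 36) is 28, so the missing entry becomes 28.0.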

Feature Engineering

Extracted the important features (Passenger Id, Age, Sex, Pclass and Survived) needed to describe the data in the train and test datasets.
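Selecting those columns with pandas might look like the sketch below. The sample rows are invented, and the extra columns (Name, Fare) only stand in for the fields that get dropped. Note that in the Kaggle files only the train set carries Survived, so the test set is reduced to the feature columns alone.

```python
import pandas as pd

# Hypothetical rows shaped like the Kaggle train file.
train = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Passenger A", "Passenger B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],
})
# The test file has the same columns except Survived.
test = train.drop(columns=["Survived"])

FEATURES = ["PassengerId", "Age", "Sex", "Pclass"]
TARGET = "Survived"

# Keep only the columns of interest.
train_small = train[FEATURES + [TARGET]]
test_small = test[FEATURES]
```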

Performed label encoding on the categorical data to transform it into numeric values, so that the dataset is represented in a form the model can learn from more effectively.
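One way to do the label encoding is scikit-learn's `LabelEncoder`, sketched below on the Sex column with made-up rows. The write-up does not say which encoder was used, so treat this as an assumption. The encoder is fitted on the train data and reused on the test data so that both share the same mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; the real datasets have many more.
train = pd.DataFrame({"Sex": ["male", "female", "female"]})
test = pd.DataFrame({"Sex": ["female", "male"]})

encoder = LabelEncoder()
# Learn the category-to-integer mapping from the train data...
train["Sex"] = encoder.fit_transform(train["Sex"])
# ...and apply the same mapping to the test data.
test["Sex"] = encoder.transform(test["Sex"])

# Classes are sorted alphabetically, so female maps to 0 and male to 1.
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```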

Merged the test dataset with the gender dataset; the merged result would be used for testing the model.
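The merge can be sketched with `DataFrame.merge`, assuming the two files share a PassengerId key and that the gender dataset holds a Survived column (as Kaggle's gender submission file does). The rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical, already-encoded test rows.
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 3],
    "Sex": [1, 0],
    "Age": [34.5, 47.0],
})
# Hypothetical gender rows: one Survived label per passenger.
gender = pd.DataFrame({
    "PassengerId": [892, 893],
    "Survived": [0, 1],
})

# Join on the shared key; validate guards against duplicated passenger IDs.
merged_test = test.merge(
    gender, on="PassengerId", how="inner", validate="one_to_one"
)
```

The result has one row per passenger, with the features and the Survived label side by side.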

Model Building and Evaluation

Extracted the independent and dependent variables from the train dataset.

Decided to build a Random Forest model to predict the survival of passengers. Used GridSearchCV to find the optimal number of estimators for the Random Forest classifier.
The optimal number of estimators selected was 27, from the range 1 to 30.
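A search like this could be set up roughly as follows. The sketch runs on synthetic data from `make_classification` so that it is self-contained, so the value it selects will generally differ from the 27 reported above. The 5-fold cross-validation, the accuracy scoring, the fixed `random_state`, and treating the range as inclusive of 30 are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features and labels.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    random_state=0,
)

# Candidate forest sizes: 1 to 30 inclusive.
param_grid = {"n_estimators": list(range(1, 31))}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # assumed number of folds
    scoring="accuracy",  # assumed selection metric
)
search.fit(X, y)

best_n = search.best_params_["n_estimators"]
print("best n_estimators:", best_n)
print("cross-validated accuracy:", search.best_score_)
```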

Constructed the Random Forest model with the optimal number of estimators. Fitted the model on the train dataset and used the merged test dataset to perform the survival prediction.
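The fit-and-predict step might look like the sketch below, using 27 estimators as found by the search. The rows are invented, and whether Passenger Id was fed to the model is not stated, so this sketch assumes it is left out of the inputs because it is only an identifier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical, already-encoded train rows.
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Sex": [1, 0, 0, 0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1],
})
# Hypothetical merged test rows (features plus the gender labels).
merged_test = pd.DataFrame({
    "Age": [34.5, 47.0, 62.0],
    "Sex": [1, 0, 1],
    "Pclass": [3, 3, 2],
    "Survived": [0, 1, 0],
})

FEATURES = ["Age", "Sex", "Pclass"]  # assumed model inputs

# Independent variables (X) and dependent variable (y) from the train set.
X_train, y_train = train[FEATURES], train["Survived"]

model = RandomForestClassifier(n_estimators=27, random_state=0)
model.fit(X_train, y_train)

# One 0/1 survival prediction per test passenger.
predictions = model.predict(merged_test[FEATURES])
```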

The model's performance was evaluated using the mean squared error and the accuracy score.
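Both metrics are available in `sklearn.metrics`. The sketch below uses made-up labels and predictions rather than the project's actual results. For 0/1 labels every wrong prediction contributes a squared error of exactly 1, so the mean squared error equals the fraction of misclassified passengers, that is, 1 minus the accuracy.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical true labels and model predictions (one mistake out of five).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)   # share of correct predictions
mse = mean_squared_error(y_true, y_pred)    # share of wrong predictions for 0/1 labels

print("accuracy:", accuracy)
print("mean squared error:", mse)
```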

Conclusion
If you found this interesting, you can join the Kaggle challenge for this prediction task and evaluate how your model performs against the actual survival dataset. You can also go ahead and analyze which groups of passengers were most likely to survive.

You can find the code for this project here.
