Titanic Machine Learning Problem: Solution (for Absolute Beginner)
Titanic: Simple Approach
If you have just started in machine learning, have come across this problem, and are looking for a solution, you are in the right place. This notebook contains a simple approach to the problem. The solution is not the best one, but it is a simple one that should give you intuition for the problem and a baseline you can improve on for better accuracy. We'll be trying to predict a binary class: survived or deceased.
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The Data
Let's start by reading the train.csv and test.csv files into pandas dataframes.
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()
Exploratory Data Analysis
Let's begin with some exploratory data analysis. We'll start by checking for missing data.
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
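The heatmap gives a visual impression of where values are missing; a per-column count makes the same information exact. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not rows from train.csv) to show the pattern.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in two columns (illustrative only)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
# Fraction of missing entries in each column
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

On the real data the same two calls can be run on `train` and `test` directly.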
Drop Unnecessary Columns
# PassengerId is kept in test because it is needed later for the submission
train = train.drop(['PassengerId','Name','Ticket'], axis=1)
test = test.drop(['Name','Ticket'], axis=1)
Converting Categorical Features
Embarked Column
# Get dummy values for both the train and test datasets
# There are 3 values in Embarked: C, Q, S
# drop_first=True drops the C column, which is redundant because it can be inferred from the Q and S columns
embark_train = pd.get_dummies(train['Embarked'], drop_first=True)
embark_test = pd.get_dummies(test['Embarked'], drop_first=True)
# Drop the Embarked column
train.drop(['Embarked'], axis=1, inplace=True)
test.drop(['Embarked'], axis=1, inplace=True)
# Concat the new embark columns onto the respective datasets
train = pd.concat([train, embark_train], axis=1)
test = pd.concat([test, embark_test], axis=1)
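To see what `drop_first=True` does, here is a minimal sketch on a made-up series (the `ports` values are illustrative, not taken from the dataset). The first category in sorted order, C, is dropped, so a row with neither Q nor S set stands for C.

```python
import pandas as pd

# Made-up embarkation ports (illustrative only)
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per category, minus the first one (C)
dummies = pd.get_dummies(ports, drop_first=True)

# Cast to int so the indicators print as 0/1 regardless of pandas version
print(dummies.astype(int))
```

Note that this only yields the same columns for train and test when both contain the same set of categories.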
Cabin Column
# Drop the Cabin column from both datasets
train.drop("Cabin",axis=1,inplace=True)
test.drop("Cabin", axis=1, inplace=True)
Sex Column
The Sex column contains male and female entries. We will replace it with a single column, male (1 for male and 0 for female).
sex_train = pd.get_dummies(train['Sex'],drop_first=True)
sex_test = pd.get_dummies(test['Sex'], drop_first=True)
train.drop(['Sex'],axis=1,inplace=True)
test.drop(['Sex'],axis=1,inplace=True)
train = pd.concat([train,sex_train],axis=1)
test = pd.concat([test, sex_test], axis=1)
Age
In the training data, passengers in the higher classes tend to be older. We'll use the mean age of each Pclass (roughly 37, 29 and 24 for classes 1, 2 and 3) to fill in the missing Age values.
# Function to Impute Age
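def impute_age(cols):
    # cols is one row holding the 'Age' and 'Pclass' values
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        # Missing age: fall back to the mean age of the passenger's class
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age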
# Apply the above function to our training and testing datasets
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
test['Age'] = test[['Age','Pclass']].apply(impute_age,axis=1)
train['Age'] = train['Age'].astype(int)
test['Age'] = test['Age'].astype(int)
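The constants 37, 29 and 24 in `impute_age` are typed in by hand. As an alternative, here is a minimal sketch that computes the per-class means with `groupby` and fills the gaps in one step. The `toy` frame and its values are made up for illustration; this is a different route to the same idea, not the notebook's original code.

```python
import numpy as np
import pandas as pd

# Made-up passengers with some missing ages (illustrative only)
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age': [40.0, 34.0, np.nan, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# Mean age per class, computed from the known ages only
class_means = toy.groupby('Pclass')['Age'].mean()

# Replace each missing age with the mean of that passenger's class
toy['Age'] = toy['Age'].fillna(toy['Pclass'].map(class_means))

print(class_means)
print(toy)
```

When applying this to the real data, it is usually safer to compute `class_means` on the training set and reuse it for the test set, so both are imputed consistently.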
Building a Model
Train-Test Split
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']
X_test = test.drop('PassengerId', axis=1)
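Before fitting, it is worth confirming that no missing values remain in the feature columns, since many scikit-learn estimators have historically raised an error on NaN input. The sketch below uses made-up frames (`X_train_toy` and `X_test_toy` are hypothetical stand-ins for the real features), and filling with the training median is one simple choice, not a step taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up feature frames; the test frame has one missing Fare (illustrative only)
X_train_toy = pd.DataFrame({'Fare': [7.25, 71.28, 8.05, 13.0]})
X_test_toy = pd.DataFrame({'Fare': [7.83, np.nan, 9.69]})

# Count the remaining missing values per column
print(X_test_toy.isnull().sum())

# Fill the gaps with the median taken from the training data
fare_median = X_train_toy['Fare'].median()
X_test_toy['Fare'] = X_test_toy['Fare'].fillna(fare_median)

print(X_test_toy)
```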
Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)
RFC_prediction = random_forest.predict(X_test)
random_forest.score(X_train, y_train)
Accuracy: 0.98092031425364756
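The accuracy above is measured on the same rows the model was fitted on, so it tends to be optimistic. One common way to get a more realistic estimate without uploading anything is k-fold cross-validation. The sketch below runs it on synthetic data from `make_classification` (a stand-in assumption so the example is self-contained); on the real data you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for the Titanic features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: each fold is scored on rows the model did not see
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(scores)
print(scores.mean())
```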
Score Value
The score above is computed on the training dataset, so it will differ from the score on the test dataset. After uploading the result.csv file, the Kaggle score is 0.75, which is a reasonable result for a beginner-level approach.
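The code shown here does not include the step that writes result.csv. A minimal sketch of that step follows, assuming the usual Titanic submission layout of a PassengerId column and a Survived column. The IDs and predictions are made up, and the file goes to a temporary directory so the example runs anywhere; in the notebook you would use `test['PassengerId']` and `RFC_prediction`.

```python
import os
import tempfile

import pandas as pd

# Made-up IDs and predictions (illustrative only)
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': predictions,
})

# index=False keeps the row index out of the file
out_path = os.path.join(tempfile.gettempdir(), 'result.csv')
submission.to_csv(out_path, index=False)

print(pd.read_csv(out_path))
```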
Some suggestions to improve further:
- Extract the titles (Mr, Mrs, Dr, etc.) from the Name feature
- Use the Cabin column as a feature
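As a starting point for the first suggestion, here is a minimal sketch of pulling the title out of names written as "Surname, Title. Given names". The regular expression is one possible pattern, not the only one, and the extraction has to happen before the Name column is dropped.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Capture the text between the comma and the first full stop
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(titles.tolist())
```

The resulting titles could then be converted with `pd.get_dummies`, in the same way as Embarked and Sex.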
For any help, feel free to comment.
For the full working notebook, check out the following link:
https://www.kaggle.com/ajay1216/titanic-solution-for-absolute-beginner