Titanic Machine Learning Problem: Solution (for Absolute Beginner)

in #machinelearning, 7 years ago

Titanic: Simple Approach

If you have just started with machine learning, have come across this problem, and are looking for a solution, you are in the right place. This notebook contains a simple approach to the problem. The solution is not the best one, but it is the simplest, and it should give you the intuition behind the problem, which you can then build on to improve accuracy. We'll be trying to predict a binary classification: survived or deceased.


Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


The Data

Let's start by reading the train.csv and test.csv files into pandas dataframes.

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()


Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
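The heatmap gives a visual impression of where values are missing; the same information can be read as numbers. The following is a minimal sketch on a small made-up dataframe (the `toy` data is an assumption for illustration, not the real Titanic file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()

# Share of missing entries in each column, as a fraction between 0 and 1
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Running `train.isnull().sum()` on the real data should show which columns need imputation or can be dropped.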

Drop unnecessary Columns

# PassengerId is kept in the test set because it is needed for the submission file
train = train.drop(['PassengerId','Name','Ticket'], axis=1)
test = test.drop(['Name','Ticket'], axis=1)

Converting Categorical Features  

Embarked Column

# Get dummy values for both the train and test datasets
# There are 3 values in Embarked: C, Q, S
# drop_first=True: drop the C column, as it is redundant; a passenger with S=0 and Q=0 embarked at C.
embark_train = pd.get_dummies(train['Embarked'],drop_first=True)
embark_test = pd.get_dummies(test['Embarked'], drop_first=True)

# Drop the Embarked column
train.drop(['Embarked'],axis=1,inplace=True)
test.drop(['Embarked'],axis=1,inplace=True)

# Concat the new embark columns to their respective datasets
train = pd.concat([train,embark_train],axis=1)
test = pd.concat([test, embark_test], axis=1)
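To see what drop_first=True actually does, here is a small sketch on a made-up Series of embarkation ports (the `ports` data is an assumption for illustration):

```python
import pandas as pd

# Made-up embarkation ports for four passengers
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One indicator column per category: C, Q, S
all_dummies = pd.get_dummies(ports)

# With drop_first=True the first category (C) is dropped, leaving Q and S
reduced = pd.get_dummies(ports, drop_first=True)

print(list(all_dummies.columns))
print(list(reduced.columns))

# The second passenger embarked at C, so both remaining indicators are off
print(reduced.iloc[1])
```

No information is lost: a row with Q and S both off can only be C.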

Cabin Column

# Drop the Cabin attribute from both datasets
train.drop("Cabin",axis=1,inplace=True)
test.drop("Cabin", axis=1, inplace=True)


Sex Column

The Sex column contains male and female entries. We will replace it with a single column, male (1 for male and 0 for female).

sex_train = pd.get_dummies(train['Sex'],drop_first=True)
sex_test = pd.get_dummies(test['Sex'], drop_first=True)

train.drop(['Sex'],axis=1,inplace=True)
test.drop(['Sex'],axis=1,inplace=True)

train = pd.concat([train,sex_train],axis=1)
test = pd.concat([test, sex_test], axis=1)


Age

Wealthier passengers in the higher classes tend to be older. We'll use a typical age for each Pclass (37, 29, and 24 for classes 1, 2, and 3) to impute the missing Age values.

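# Function to impute Age from Pclass
def impute_age(cols):
    # Access by label; positional access such as cols[0] is deprecated in recent pandas
    Age = cols['Age']
    Pclass = cols['Pclass']

    # Fill a missing age with the typical age of the passenger's class
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
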
# Apply the above function to our training and testing datasets
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
test['Age'] = test[['Age','Pclass']].apply(impute_age,axis=1)

train['Age'] = train['Age'].astype(int)
test['Age'] = test['Age'].astype(int)
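The per-class ages in impute_age are hard-coded. As an alternative sketch, they can be computed from the data itself with groupby. The `toy` dataframe below is made up for illustration, and using the median rather than the mean is an assumption of this sketch:

```python
import numpy as np
import pandas as pd

# Made-up passengers: three classes, some ages missing
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# Median age per class, computed from the known ages only
median_age_by_class = toy.groupby("Pclass")["Age"].median()

# Fill each missing age with the median of that passenger's class
toy["Age"] = toy["Age"].fillna(toy["Pclass"].map(median_age_by_class))

print(median_age_by_class)
print(toy)
```

On the real data, it is usually safer to compute the medians from the training set only and reuse them for the test set, so both are imputed consistently.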

Building a Model

Train-Test Split

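# Separate the features from the target
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']
X_test = test.drop('PassengerId', axis=1)
# test.csv has a missing Fare value; fill it so that predict() does not fail on NaN
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].median())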

Random Forest

from sklearn.ensemble import RandomForestClassifier

# Train a random forest with 100 trees
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)

# Predict survival for the test passengers
RFC_prediction = random_forest.predict(X_test)

# Accuracy on the training data
random_forest.score(X_train, y_train)
Accuracy: 0.98092031425364756
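A score computed on the same rows the model was trained on tends to be optimistic. One common way to get a more honest estimate is to hold out part of the training data for validation. The sketch below uses synthetic data from make_classification as a stand-in for the prepared Titanic features (an assumption, so the numbers it prints say nothing about the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy data standing in for the prepared Titanic features (an assumption)
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.2, random_state=0)

# Keep 25% of the labelled rows aside for validation
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

# Accuracy on rows the model has seen versus rows it has not seen
train_accuracy = model.score(X_fit, y_fit)
val_accuracy = model.score(X_val, y_val)
print(train_accuracy, val_accuracy)
```

With the notebook's variables, the same split could be applied to X_train and y_train before fitting.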

Score Value

The score above is based on the training dataset, so it will differ from the score on the test dataset. After uploading the result.csv file, the score is 0.75, which is notably good for an elementary approach.
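The step that writes result.csv is not shown here. A minimal sketch of it might look like the following; make_submission is a hypothetical helper name, and the passenger IDs and predictions in the demo are made up. Kaggle's Titanic competition expects two columns, PassengerId and Survived.

```python
import os
import tempfile

import pandas as pd

def make_submission(passenger_ids, predictions, path):
    """Write a two-column submission file (hypothetical helper)."""
    submission = pd.DataFrame({
        "PassengerId": passenger_ids,
        "Survived": predictions,
    })
    # index=False keeps the row index out of the file
    submission.to_csv(path, index=False)
    return submission

# Demo with made-up IDs and predictions, written to a temporary location
demo_path = os.path.join(tempfile.gettempdir(), "result_demo.csv")
demo = make_submission([892, 893, 894], [0, 1, 0], demo_path)
print(demo)
```

With the notebook's variables, the call would presumably be make_submission(test['PassengerId'], RFC_prediction, 'result.csv').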

Some suggestions to improve further:

  • Extract the titles (Mr, Mrs, Dr, etc.) from the Name column and use them as a feature
  • Use the Cabin column as a feature instead of dropping it
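Both suggestions can be sketched in a few lines. The names below follow the dataset's "Last, Title. First" format, and the cabin values are made up; the regular expression and the has_cabin indicator are assumptions of this sketch, not part of the original notebook. Note that this has to happen before the Name and Cabin columns are dropped.

```python
import numpy as np
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# The title sits between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(list(titles))

# Made-up cabin values; a simple feature is whether a cabin is recorded at all
cabins = pd.Series([np.nan, "C85", np.nan])
has_cabin = cabins.notnull().astype(int)
print(list(has_cabin))
```

The titles can then be converted with pd.get_dummies in the same way as the Embarked and Sex columns.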

For any help feel free to comment.


For the full working notebook, check out the following link:

https://www.kaggle.com/ajay1216/titanic-solution-for-absolute-beginner

Thank You

Upvote + Follow: @arnav1216