Titanic Machine Learning Problem: Solution (for Absolute Beginner)

arnav1216 (35)in #machinelearning • 8 years ago

Titanic: Simple Approach

If you have just started with machine learning, have come across this problem, and are looking for a solution, you are in the right place. This notebook takes a simple approach to the problem. The solution is not the best one, but it is one of the simplest, and it should give you the intuition behind the problem so that you can improve the accuracy later. We'll be trying to predict a binary class: survived or deceased.

Import Libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

The Data

Let's start by reading the train.csv and test.csv files into pandas dataframes.

train = pd.read_csv('../input/train.csv')

test = pd.read_csv('../input/test.csv')

train.head()

Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
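
The heatmap gives a visual impression of where the gaps are. If you also want exact numbers, a small sketch like the one below may help. The `demo` dataframe and its values are made up for illustration; on the real data you would call the same methods on `train`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic training data (values are made up for illustration)
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing entries per column, largest first
missing_counts = demo.isnull().sum().sort_values(ascending=False)

# Share of missing entries per column, as a fraction between 0 and 1
missing_share = demo.isnull().mean()

print(missing_counts)
print(missing_share)
```

Columns with a large share of missing values are candidates for dropping, while columns with only a few gaps are usually better filled in.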

Drop Unnecessary Columns

train = train.drop(['PassengerId','Name','Ticket'], axis=1)

# PassengerId stays in test for now; it is dropped later when X_test is built
test = test.drop(['Name','Ticket'], axis=1)

Converting Categorical Features

Embarked Column

# Get dummy values for both the train and test datasets

# There are 3 values in Embarked: C, Q, S

# drop_first=True: drop the C column, as it is redundant; the embarked port can be identified from the S and Q columns alone

embark_train = pd.get_dummies(train['Embarked'], drop_first=True)

embark_test = pd.get_dummies(test['Embarked'], drop_first=True)



# Drop the original Embarked column

train.drop(['Embarked'], axis=1, inplace=True)

test.drop(['Embarked'], axis=1, inplace=True)



# Concat the new embark columns onto their respective datasets

train = pd.concat([train, embark_train], axis=1)

test = pd.concat([test, embark_test], axis=1)
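
To see what drop_first actually does, here is a small sketch on a made-up `ports` series (not the real Embarked column). It also shows one thing to be aware of: with the default settings, a missing value produces all zeros, so in the reduced encoding it looks the same as C.

```python
import numpy as np
import pandas as pd

# Made-up example values; the last entry is missing on purpose
ports = pd.Series(["S", "C", "Q", np.nan], name="Embarked")

# Without drop_first: one indicator column per category (C, Q, S)
full = pd.get_dummies(ports)

# With drop_first=True: the first category in sorted order (C) is dropped
reduced = pd.get_dummies(ports, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In `reduced`, the row for C has zeros in both Q and S, which is why the C column carries no extra information.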

Cabin Column

# Drop the Cabin column from both datasets (most of its values are missing)

train.drop("Cabin",axis=1,inplace=True)

test.drop("Cabin", axis=1, inplace=True)

Sex Column

The Sex column contains male and female entries. We will replace it with a single column, male (1 for male, 0 for female).

sex_train = pd.get_dummies(train['Sex'],drop_first=True)

sex_test = pd.get_dummies(test['Sex'], drop_first=True)



train.drop(['Sex'],axis=1,inplace=True)

test.drop(['Sex'],axis=1,inplace=True)



train = pd.concat([train,sex_train],axis=1)

test = pd.concat([test, sex_test], axis=1)

Age

Wealthier passengers in the higher classes tend to be older. We'll fill in each missing Age with an approximate average age for the passenger's Pclass (37, 29 and 24 for classes 1, 2 and 3).

# Function to Impute Age

def impute_age(cols):

    Age = cols['Age']

    Pclass = cols['Pclass']

    

    if pd.isnull(Age):



        if Pclass == 1:

            return 37



        elif Pclass == 2:

            return 29



        else:

            return 24



    else:

        return Age

# Apply the above function to our training and testing datasets

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)

test['Age'] = test[['Age','Pclass']].apply(impute_age,axis=1)



train['Age'] = train['Age'].astype(int)

test['Age'] = test['Age'].astype(int)
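
The function above hard-codes one age per class. As an alternative sketch (not what this notebook does), the per-class mean can be computed from the data itself with groupby and transform. The `demo` dataframe below is made up so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Made-up data: class means are 38 (class 1), 30 (class 2) and 24 (class 3)
demo = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# Mean age within each passenger class, broadcast back to every row
class_mean_age = demo.groupby("Pclass")["Age"].transform("mean")

# Keep known ages; fill only the missing ones with their class mean
demo["Age"] = demo["Age"].fillna(class_mean_age)

print(demo)
```

If you try this on the real data, it is probably safer to compute the class means on train and reuse them for test, so both datasets are filled with the same values.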

Building a Model

Features and Target

X_train = train.drop('Survived', axis=1)

y_train = train['Survived']

X_test = test.drop('PassengerId', axis=1)
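
The steps so far deal with Age, Cabin and Embarked, but nothing checks whether any other column still has gaps. In the Kaggle test.csv the Fare column has a missing entry, and depending on your scikit-learn version the model may refuse to predict on missing values. A quick check like the sketch below can catch this. `X_demo` and its values are made up; filling with the median is one simple option, not the only one.

```python
import numpy as np
import pandas as pd

# Stand-in for X_test (made-up values); one Fare entry is missing
X_demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Fare": [7.83, np.nan, 9.69],
})

# List only the columns that still contain missing values
remaining = X_demo.isnull().sum()
print(remaining[remaining > 0])

# One simple option: fill the gap with the column median
X_demo["Fare"] = X_demo["Fare"].fillna(X_demo["Fare"].median())

print(X_demo)
```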

Random Forest

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100)

random_forest.fit(X_train, y_train)

RFC_prediction = random_forest.predict(X_test)

random_forest.score(X_train, y_train)

Training accuracy: 0.98092031425364756
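
A random forest can score very high on the rows it was trained on, even when there is little to learn. One way to get a more honest estimate before uploading anything is to hold back part of the training data. The sketch below is only an illustration: it uses synthetic data with purely random labels (not the Titanic columns), and the use of train_test_split, the 25% split and the random_state values are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 5 features, labels that are pure noise
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=200)

# Hold back 25% of the rows; the model never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

train_acc = model.score(X_fit, y_fit)  # accuracy on rows used for fitting
val_acc = model.score(X_val, y_val)    # accuracy on held-out rows

print(train_acc, val_acc)
```

Because the labels here are random, the held-out accuracy should hover around chance while the training accuracy stays high. On the Titanic data the gap is smaller, but the same idea applies.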

Score Value

The score above is computed on the training dataset, so it will differ from the score on the test dataset. After uploading the result.csv file, the score is 0.75, which is reasonably good for such an elementary approach.
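
The steps above do not show how result.csv is written. A minimal sketch follows, assuming the usual Titanic submission layout of a PassengerId column and a Survived column. The `passenger_ids` and `predictions` values are made-up stand-ins for test['PassengerId'] and RFC_prediction.

```python
import pandas as pd

# Stand-ins for test['PassengerId'] and RFC_prediction (made-up values)
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = [0, 1, 0]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions,
})

# index=False keeps the dataframe's row index out of the file
submission.to_csv("result.csv", index=False)

print(submission)
```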

Some suggestions to improve further:

Grab the titles (Mr, Mrs, Dr, etc.) from the Name feature
The Cabin column can be used as a feature
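
For the first suggestion, here is a rough sketch of how titles could be pulled out of the names. It assumes the names follow the "Surname, Title. Given names" pattern, and the regular expression is just one possible choice. Note that this notebook drops the Name column early on, so on the real data this would have to run before that step.

```python
import pandas as pd

# A few names in the dataset's "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period after it
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```

The resulting titles could then be turned into indicator columns with pd.get_dummies, the same way the Embarked and Sex columns were handled.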

For any help, feel free to comment.

For the full working notebook, check out the following link:

https://www.kaggle.com/ajay1216/titanic-solution-for-absolute-beginner

Thank You

Upvote + Follow: @arnav1216

#datascience #kaggle #dataset #science
