Exploratory Data Analysis with Titanic Project Part I

Titanic Dataset is a classic Dataset for classfication problem in data science Competition platform: Kaggle. You can reach this introductory competition here Through this project, I will walk through the entire data science pipeline, data collection, data manipulation, data wraggling, data visulization, data modeling and data evaluation. Machine learning techniques include logistic regression, SVM, decisioni trees, random forests and neural networks.

This is the first part of the whole project. The codes is run on Python.

import important packages

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Read titanic traning dataset

titanic = pd.read_csv('train.csv')

Check the first 5 rows of train dataset

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Check the attributes we got

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

From Kaggle’s website, we can clearly see the meaning of each attribute

Get the descriptive statistics of each column (only for numberic variables)

titanic.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

There are some missing values in Age column

Data Visualization

Visualize some columns

Visualize the distributions of Survivial

survive = titanic['Survived']

survive.value_counts()

0    549
1    342
Name: Survived, dtype: int64

More people did not survive

sns.countplot(x='Survived',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x2341656f278>

png

Visualize the distribution of sex

titanic['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

sns.countplot(x='Sex',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x23416696e80>

png

Visualize the distribution of Age

#sns.distplot(titanic['Age'])                  #error because of missing values

Drop null values in age

sns.distplot(titanic['Age'].dropna())

<matplotlib.axes._subplots.AxesSubplot at 0x234168005c0>

png

Use Pandas built-in visualization

titanic['Age'].hist(bins=30)

<matplotlib.axes._subplots.AxesSubplot at 0x23418c44668>

png

titanic['Age'].plot(kind='hist')

<matplotlib.axes._subplots.AxesSubplot at 0x23418b92438>

png

Visualize the distribution of PClass

titanic['Pclass'].value_counts()

  491
  216
  184
Name: Pclass, dtype: int64

Visualize the distribution of Parch # parents/children

sns.countplot(x='Pclass',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x2341eb8aa90>

png

Visualize fares

plt.figure(figsize=(10,3))
titanic['Fare'].hist(bins=30)

<matplotlib.axes._subplots.AxesSubplot at 0x2341ece6b00>

png

sns.distplot(titanic['Fare'])

<matplotlib.axes._subplots.AxesSubplot at 0x2341edcf630>

png

There might be some outliers in the Fare Column

Visualize the distribution of #of Siblings/Spouses

sns.countplot(x='SibSp',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x23420106ba8>

png

sns.countplot(x='Parch',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x23420112fd0>

png

Bi-variate Analysis

Sex with Survival

sns.countplot(x='Survived',hue='Sex',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x2341e085f28>

png

Fare with Survival

sns.boxplot(x='Survived',y='Fare',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x234201d4668>

png

It indicated that people those survived had more expensive fares. Also, in People who survived, there was one cost over 500. It might indicate the outliers.

You can also visualize how sex differs with different fare and different sexes

sns.boxplot(x='Survived',y='Fare',hue='Sex',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x2341ee50f98>

png

We may have to handle fares data

Number of Sibilings/family

sns.countplot(hue='Survived',x='SibSp',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x2341ee50e48>

png

People who have one sibling/spouse survived more likely that the single people

sns.countplot(hue='Survived',x='Parch',data=titanic)

<matplotlib.axes._subplots.AxesSubplot at 0x234200ac2b0>

png

Missing Values

there is missing value in age column and cabin column

titanic.isnull().head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	False	False	False	False	False	False	False	False	False	False	True	False
1	False	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	True	False
3	False	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	True	False

Visualize the missing values

sns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='viridis')

<matplotlib.axes._subplots.AxesSubplot at 0x1b0d47885c0>

png

Cabin has a lot of missing values, we may just drop this colum. There are some missing values in Age column. We may have to handle them well because age might be an important attribute