In [5]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
import warnings
warnings.filterwarnings('ignore')  # Silence seaborn plotting warnings

In [7]:

# Read the Titanic dataset directly from the URL.

train_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/test.csv')


# Alternatively, a Titanic dataset ships with the seaborn package (note that its column names are lowercase).
# train_df = sns.load_dataset("titanic")

train_df.head()

Out[7]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

How many males and females were on the Titanic?

In [8]:

train_df[['Sex', 'Name']].groupby('Sex').count()

Out[8]:

	Name
Sex
female	314
male	577
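For a single column, `value_counts()` is a common shortcut for the same kind of tally. The sketch below is only an illustration: it uses the five rows shown in the `head()` output above rather than the full dataset, and `head_df` is a hypothetical name.

```python
import pandas as pd

# The 'Sex' column of the five rows displayed by train_df.head() above
# (head_df is a hypothetical name used only for this illustration)
head_df = pd.DataFrame({"Sex": ["male", "female", "female", "female", "male"]})

# value_counts() tallies each distinct value, most frequent first
sex_counts = head_df["Sex"].value_counts()
print(sex_counts)
```

On the full `train_df`, the equivalent call would be `train_df['Sex'].value_counts()`.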

How many female passengers were travelling in Pclass 3?

In [9]:

train_df[['Sex', 'Pclass', 'Name']].groupby(['Sex', 'Pclass']).count()

Out[9]:

		Name
Sex	Pclass
female	1	94
	2	76
	3	144
male	1	122
	2	108
	3	347
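A cross-tabulation gives the same two-way count as a wide table, which can make a single cell easier to look up. This is a minimal sketch on the five rows from the `head()` output above, not on the full dataset; `head_df` and `class_by_sex` are hypothetical names.

```python
import pandas as pd

# 'Sex' and 'Pclass' for the five rows displayed by train_df.head() above
head_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
})

# Rows are sexes, columns are passenger classes, cells are row counts
class_by_sex = pd.crosstab(head_df["Sex"], head_df["Pclass"])
print(class_by_sex)

# Look up one cell: females travelling in Pclass 3 (within these five rows)
females_in_third = class_by_sex.loc["female", 3]
print(females_in_third)
```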

Univariate Data Analysis

Univariate data consists of a single variable. Its analysis is the simplest form of analysis, since the information deals with only one quantity that changes. It does not deal with causes or relationships; its main purpose is to describe the data and find patterns within it. Height is an example of univariate data.
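Describing a single variable usually starts with a few summary statistics. As a small sketch, the values below are the same heights that are plotted in the next cell, and `summary` is a hypothetical name.

```python
import pandas as pd

# The same heights used in the distribution plot below
heights = pd.Series([160, 150, 181, 162, 145, 167, 145, 146, 135, 150, 154])

# describe() reports count, mean, standard deviation, min, quartiles and max
summary = heights.describe()
print(summary)
```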

Plotting univariate distributions

Plotting Distribution

In [10]:

heights = [160, 150, 181, 162, 145, 167, 145, 146, 135, 150, 154]
sns.distplot(heights)

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x1086fb748>

Let's explore the Titanic Dataset

Density Graph

In [11]:

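# Drop missing values from 'Age' only; dropna() on the whole frame would
# also discard every row with a missing 'Cabin'
ages = train_df['Age'].dropna()
sns.distplot(ages, bins=10)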

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x107b926a0>
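One thing to keep in mind with `dropna()`: called on a whole DataFrame, it drops every row that has a missing value in any column, not just in the column being plotted. The sketch below shows the difference on the `Age` and `Cabin` values of the five rows from the `head()` output above; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# 'Age' and 'Cabin' for the five rows displayed by train_df.head() above
head_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Whole-frame dropna: rows with a missing Cabin are lost as well
rows_after_frame_dropna = len(head_df.dropna())

# Column-level dropna: only missing ages would be removed
rows_after_age_dropna = len(head_df["Age"].dropna())

print(rows_after_frame_dropna, rows_after_age_dropna)
```

Checking `train_df.isna().sum()` first shows how many values each column is missing, which helps decide which form of `dropna()` is appropriate.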

Count Plot

In [12]:

sns.countplot(x="Sex", data=train_df)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x11eded128>

In [13]:

sns.countplot(x='Pclass', data=train_df)

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x11ee3bba8>

In [14]:

ax = sns.countplot(x='Survived', data=train_df)
# Relabel the 0/1 ticks with readable names
plt.xticks([0, 1], ['Died', 'Survived'])

Out[14]:

([<matplotlib.axis.XTick at 0x11ef0cac8>,
  <matplotlib.axis.XTick at 0x11ef0c3c8>],
 <a list of 2 Text xticklabel objects>)

Bivariate Data Analysis

Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the relationship between them: whether an association exists and how strong it is, or whether the two variables differ and how significant those differences are.

How many male passengers survived the accident?

In [15]:

train_df[['Sex', 'Survived', 'Name']].groupby(['Sex', 'Survived']).count()

Out[15]:

		Name
Sex	Survived
female	0	81
female	1	233
male	0	468
male	1	109
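Raw counts are easier to compare once they are turned into rates. As a rough sketch, the block below copies the four counts from the output above into a small frame (the names `counts`, `totals` and `survival_rate` are hypothetical) and computes the share of each sex that survived.

```python
import pandas as pd

# Counts copied from the grouped output above
counts = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Survived": [0, 1, 0, 1],
    "Count": [81, 233, 468, 109],
})

# Total passengers per sex
totals = counts.groupby("Sex")["Count"].sum()

# Survivors per sex, indexed by sex so the division aligns on the index
survivors = counts[counts["Survived"] == 1].set_index("Sex")["Count"]

survival_rate = survivors / totals
print(survival_rate.round(3))
```

With these counts, roughly 74% of the women and 19% of the men in the training data survived.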

In [16]:

sns.countplot(x="Sex", hue='Survived', data=train_df)

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x11ee3bcc0>

Exercise

Note: SibSp is the number of siblings/spouses aboard. Parch is the number of parents/children aboard.

Plot the count distribution of Parch in the Titanic dataset.
Plot the count of males and females across the different 'Pclass' values.

In [24]:

sns.countplot(x='Parch', data=train_df)

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x11f4e4240>

In [25]:

# Solution to Ex: 2
sns.countplot(x="Pclass", hue='Sex', data=train_df)

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x11f5b6a20>

Visualizing statistical relationships

The "tips" dataset records restaurant visits: the total bill, the tip, and a few details about each party.

In [26]:

tips = sns.load_dataset("tips")
tips.head()

Out[26]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

In [27]:

sns.relplot(x="total_bill", y="tip", data=tips)

Out[27]:

<seaborn.axisgrid.FacetGrid at 0x11f671588>

In [28]:

sns.relplot(x="total_bill", y="tip", data=tips, hue='smoker')

Out[28]:

<seaborn.axisgrid.FacetGrid at 0x11f7bf320>
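A scatter plot shows the shape of a relationship; a correlation coefficient puts a number on its strength. The sketch below is illustrative only: it uses the five rows shown in the `tips.head()` output above, which is far too small a sample to draw conclusions from, and `sample` and `r` are hypothetical names.

```python
import pandas as pd

# 'total_bill' and 'tip' for the five rows displayed by tips.head() above
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Series.corr() computes the Pearson correlation coefficient by default
r = sample["total_bill"].corr(sample["tip"])
print(f"Pearson r = {r:.2f}")
```

On the full dataset, the same call would be `tips['total_bill'].corr(tips['tip'])`.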

Exercise

Plot the number of males and females in the 'tips' dataset.
Plot the distribution of smoker in Males and Females.
Plot the relation graph between tip and day.

In [29]:

# @title Please try it yourself (Solution 3)
sns.countplot(x='sex', data=tips)

Out[29]:

<matplotlib.axes._subplots.AxesSubplot at 0x11f7bfe48>

In [30]:

# @title Please try it yourself (Solution 4)
sns.countplot(x='sex', hue='smoker', data=tips)

Out[30]:

<matplotlib.axes._subplots.AxesSubplot at 0x11f981438>

In [31]:

# @title Please try it yourself (Solution 5)
sns.relplot(x='day', y='tip', data=tips, kind='line')

Out[31]:

<seaborn.axisgrid.FacetGrid at 0x11fad3320>
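With `kind='line'`, seaborn by default aggregates all observations that share an x value into their mean and draws a confidence band around it. The sketch below reproduces just the averaging step with pandas. The rows are made up for illustration and are not taken from the tips dataset; `toy_tips` and `mean_tip_by_day` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the tips dataset
toy_tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "tip": [2.0, 3.0, 1.5, 2.5, 3.5, 4.0],
})

# One mean tip per day: the kind of value a line plot draws at each x position
mean_tip_by_day = toy_tips.groupby("day")["tip"].mean()
print(mean_tip_by_day)
```

Because `day` is categorical, the line joining the points does not imply a trend between days; only the per-day averages are meaningful.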

Plotting with categorical data

Categorical scatterplots

In [32]:

sns.catplot(x="day", y="total_bill", data=tips)

Out[32]:

<seaborn.axisgrid.FacetGrid at 0x11fac9128>

In [34]:

sns.catplot(x="day", y="total_bill", hue='sex', data=tips)

Out[34]:

<seaborn.axisgrid.FacetGrid at 0x11fa42d30>

Count of Male and Female on Friday

In [35]:

# Please try it yourself

Visualizing pairwise relationships in a dataset

In [36]:

# Titanic Dataset
train_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/train.csv')

sns.pairplot(train_df, vars=["Survived", "Fare", "Pclass", "SibSp"])

Out[36]:

<seaborn.axisgrid.PairGrid at 0x11ffdd2b0>
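A correlation matrix is a compact numeric companion to a pair plot: one coefficient per pair of variables. The sketch below computes it for the same four columns, but only on the five rows shown in the `head()` output at the top, so the numbers are illustrative rather than representative. `head_df` and `corr` are hypothetical names.

```python
import pandas as pd

# The four plotted columns for the five rows displayed by train_df.head()
head_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Pclass": [3, 1, 3, 1, 3],
    "SibSp": [1, 1, 0, 1, 0],
})

# DataFrame.corr() returns pairwise Pearson correlations by default
corr = head_df.corr()
print(corr.round(2))
```

On the full data, `train_df[["Survived", "Fare", "Pclass", "SibSp"]].corr()` would give the corresponding matrix.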


Data Visualization Part-1
