import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
import warnings
warnings.filterwarnings('ignore')  # Silence seaborn plotting warnings
# Read the Titanic dataset directly from the URL.
train_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/test.csv')
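# Alternatively, a Titanic dataset ships with the seaborn package.
# Its columns are lowercase (e.g. 'sex', 'pclass') and it has no 'Name' column, so the cells below would need adjusting.
# train_df = sns.load_dataset("titanic")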
train_df.head()
How many males and females were on the Titanic?¶
train_df[['Sex', 'Name']].groupby('Sex').count()
How many female passengers were travelling in Pclass 3?¶
train_df[['Sex', 'Pclass', 'Name']].groupby(['Sex', 'Pclass']).count()
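As a rough sketch, the same kind of counts can also be obtained with `value_counts` and `pd.crosstab`, which do not need a helper column such as 'Name'. The `sample` frame below is a hypothetical mini-sample that only reuses the column names of `train_df`; it is not real Titanic data.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as train_df (not real Titanic rows).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 3, 3, 1],
})

# Number of passengers per sex.
sex_counts = sample["Sex"].value_counts()

# Number of passengers per (Sex, Pclass) combination, laid out as a table.
sex_by_class = pd.crosstab(sample["Sex"], sample["Pclass"])

# Number of female passengers in third class, read from the table.
female_third = sex_by_class.loc["female", 3]

print(sex_counts)
print(sex_by_class)
print("Females in Pclass 3:", female_third)
```

On the real data, replacing `sample` with `train_df` should give the same layout as the groupby tables above.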
Univariate Data Analysis¶
Univariate data consists of a single variable. Its analysis is the simplest form of analysis because only one quantity varies. It does not examine causes or relationships; its main purpose is to describe the data and find patterns within it. Height is an example of univariate data.
Plotting univariate distributions¶
Plotting Distribution¶
heights = [160, 150, 181, 162, 145, 167, 145, 146, 135, 150, 154]
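# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay draws a comparable figure
sns.histplot(heights, kde=True, stat="density")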
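Alongside the plot, a univariate variable can be described numerically. The sketch below uses only Python's standard `statistics` module on the same `heights` values; the `summary` dictionary is an illustrative name, not part of the notebook.

```python
import statistics

# Same values as the heights list above.
heights = [160, 150, 181, 162, 145, 167, 145, 146, 135, 150, 154]

# Basic descriptive statistics for a single variable.
summary = {
    "count": len(heights),
    "min": min(heights),
    "max": max(heights),
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "stdev": statistics.stdev(heights),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With pandas, `pd.Series(heights).describe()` reports a similar summary in one call.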
Let's explore the Titanic Dataset¶
Density Graph¶
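# Drop only the rows with a missing Age; a bare dropna() would also discard every row with a missing Cabin
df = train_df.dropna(subset=['Age'])
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay draws a comparable figure
sns.histplot(df['Age'], bins=10, kde=True, stat="density")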
Count Plot¶
# Recent seaborn versions require the column to be passed by keyword (x=...)
sns.countplot(x='Sex', data=train_df)
sns.countplot(x='Pclass', data=train_df)
ax = sns.countplot(x='Survived', data=train_df)
# Relabel the 0/1 ticks with readable names
plt.xticks([0, 1], ['Died', 'Survived'])
Bivariate Data Analysis¶
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the relationship between them: whether an association exists and how strong it is, or whether the two variables differ and how significant those differences are.
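A minimal sketch of a bivariate summary, assuming a 0/1 outcome column and a categorical grouping column: the mean of the outcome within each group is the share of positive outcomes for that group. The `sample` frame is hypothetical and only reuses the column names 'Sex' and 'Survived'; it is not real Titanic data.

```python
import pandas as pd

# Hypothetical mini-sample (not real Titanic rows).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
})

# Mean of a 0/1 column per group = fraction of survivors in that group.
survival_rate = sample.groupby("Sex")["Survived"].mean()

# The same relationship as a contingency table whose rows each sum to 1.
rates = pd.crosstab(sample["Sex"], sample["Survived"], normalize="index")

print(survival_rate)
print(rates)
```

A large gap between the group rates suggests an association between the two variables; it does not by itself establish statistical significance.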
How many male passengers survived the accident?¶
train_df[['Sex', 'Survived', 'Name']].groupby(['Sex', 'Survived']).count()
sns.countplot(x='Sex', hue='Survived', data=train_df)
Exercise¶
Note: SibSp is the number of siblings/spouses aboard. Parch is the number of parents/children aboard.
- Plot the count distribution of Parch in the Titanic dataset.
- Plot the count distribution of males and females across the different 'Pclass' values.
# Solution to Ex: 1
sns.countplot(x='Parch', data=train_df)
# Solution to Ex: 2
sns.countplot(x='Pclass', hue='Sex', data=train_df)
Visualizing statistical relationships¶
The "tips" dataset contains records of people visiting a restaurant.
tips = sns.load_dataset("tips")
tips.head()
sns.relplot(x="total_bill", y="tip", data=tips)
sns.relplot(x="total_bill", y="tip", data=tips, hue='smoker')
Exercise¶
- Plot the number of males and females in the 'tips' dataset.
- Plot the distribution of smokers among males and females.
- Plot the relation graph between tip and day.
# @title Please try yourself (Solution 3)
sns.countplot(x='sex', data=tips)
# @title Please try yourself (Solution 4)
sns.countplot(x='sex', hue='smoker', data=tips)
# @title Please try yourself (Solution 5)
sns.relplot(x='day', y='tip', data=tips, kind='line')
sns.catplot(x="day", y="total_bill", data=tips)
sns.catplot(x="day", y="total_bill", hue='sex', data=tips)
Count of males and females on Friday¶
# Do it yourself please
Visualizing pairwise relationships in a dataset¶
# Titanic Dataset
train_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/train.csv')
sns.pairplot(train_df, vars=["Survived", "Fare", "Pclass", "SibSp"])
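A pair plot can be complemented with a numeric summary. As a sketch, `DataFrame.corr()` computes the pairwise Pearson correlation between numeric columns. The `sample` frame below is hypothetical and only reuses three of the column names; its values are made up for illustration and are not real Titanic data.

```python
import pandas as pd

# Hypothetical mini-sample (made-up values, not real Titanic rows).
sample = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0],
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr()

print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one. On the real data, `train_df[["Survived", "Fare", "Pclass", "SibSp"]].corr()` would cover the same columns as the pair plot.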