In [0]:

!pip3 install --upgrade seaborn  # Upgrades seaborn in the Colab notebook.
import warnings
warnings.filterwarnings('ignore')

Collecting seaborn
  Downloading https://files.pythonhosted.org/packages/a8/76/220ba4420459d9c4c9c9587c6ce607bf56c25b3d3d2de62056efe482dadc/seaborn-0.9.0-py3-none-any.whl (208kB)
    100% |████████████████████████████████| 215kB 24.1MB/s 
Requirement already satisfied, skipping upgrade: scipy>=0.14.0 in /usr/local/lib/python3.6/dist-packages (from seaborn) (1.1.0)
Requirement already satisfied, skipping upgrade: matplotlib>=1.4.3 in /usr/local/lib/python3.6/dist-packages (from seaborn) (3.0.3)
Requirement already satisfied, skipping upgrade: numpy>=1.9.3 in /usr/local/lib/python3.6/dist-packages (from seaborn) (1.14.6)
Requirement already satisfied, skipping upgrade: pandas>=0.15.2 in /usr/local/lib/python3.6/dist-packages (from seaborn) (0.22.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (2.3.1)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (0.10.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (2.5.3)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (1.0.1)
Requirement already satisfied, skipping upgrade: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas>=0.15.2->seaborn) (2018.9)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib>=1.4.3->seaborn) (1.11.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.6/dist-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (40.8.0)
Installing collected packages: seaborn
  Found existing installation: seaborn 0.7.1
    Uninstalling seaborn-0.7.1:
      Successfully uninstalled seaborn-0.7.1
Successfully installed seaborn-0.9.0

Box plot¶

A box plot, also called a box-and-whisker plot, shows the distribution of values based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.

Median¶

The median is the value that separates the higher half of a data set from the lower half. To calculate it, order your values and find the middle one. For example, for the numbers 1, 3, 4, 7, 8, 8, 9, the median is 7.
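The two steps above can be sketched in plain Python. The variable names here are illustrative only; `statistics.median` from the standard library is used as a cross-check.

```python
from statistics import median

values = [9, 1, 8, 3, 7, 4, 8]  # the example numbers, deliberately unordered

ordered = sorted(values)             # step 1: order the values
middle = ordered[len(ordered) // 2]  # step 2: take the middle one (odd count)

print(ordered)         # [1, 3, 4, 7, 8, 8, 9]
print(middle)          # 7
print(median(values))  # 7
```

With an even number of values there is no single middle value; `statistics.median` then averages the two middle ones.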

First quartile¶

The first quartile is the median of the data values to the left of the median in our ordered values. Exercise: for the numbers 1, 3, 4, 7, 8, 8, 9, what is the first quartile?

Third quartile¶

The third quartile is the median of the data values to the right of the median in our ordered values. Exercise: for the numbers 1, 3, 4, 7, 8, 8, 9, what is the third quartile?

Interquartile Range (IQR)¶

The IQR measures the spread of the middle 50% of the data. It is the third quartile minus the first quartile. Exercise: for the numbers 1, 3, 4, 7, 8, 8, 9, what is the IQR?
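The quartile definitions above (the median of the values on each side of the overall median) can be sketched as follows. `quartiles` is a hypothetical helper written for this illustration, not a library function. Run it to check your answers to the questions above.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    With an odd number of values, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values to the left of the median
    upper = ordered[(n + 1) // 2:]   # values to the right of the median
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([1, 3, 4, 7, 8, 8, 9])
iqr = q3 - q1  # interquartile range: third quartile minus first quartile
print("Q1 =", q1, "median =", q2, "Q3 =", q3, "IQR =", iqr)
```

Library routines such as `numpy.percentile` interpolate between data points by default, so they can report slightly different quartiles than this hand rule for small data sets.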

Outlier¶

An outlier is a data value that lies outside the overall pattern. A commonly used rule says that a value is an outlier if it is less than the first quartile - 1.5 × IQR or higher than the third quartile + 1.5 × IQR.

Maximum and Minimum¶

The minimum and the maximum are the smallest and largest values in our data, with outliers excluded.
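As a rough sketch, the 1.5 × IQR rule and the whisker ends described above might be computed like this. The helper `outlier_fences` and the extra value 30 are assumptions made for illustration; quartiles use the median-of-halves rule from the earlier sections.

```python
from statistics import median

def outlier_fences(values, k=1.5):
    """Return (lower fence, upper fence) for the k * IQR outlier rule."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])       # first quartile
    q3 = median(ordered[(n + 1) // 2:])  # third quartile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [1, 3, 4, 7, 8, 8, 9, 30]  # 30 is a made-up extreme value
low, high = outlier_fences(data)

outliers = [v for v in data if v < low or v > high]
inliers = [v for v in data if low <= v <= high]

# The whiskers end at the smallest and largest values that are not outliers.
whisker_min, whisker_max = min(inliers), max(inliers)

print(low, high)                 # -4.0 16.0
print(outliers)                  # [30]
print(whisker_min, whisker_max)  # 1 9
```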

In [0]:

import seaborn as sns
sns.set(style="whitegrid")

tips = sns.load_dataset("tips")
sns.boxplot(y=tips["total_bill"])

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7ead8ea080>

In [0]:

sns.boxplot(x="day", y="total_bill", data=tips)

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7ead9184a8>

Exercise¶

Draw the box plot above split into smokers and non-smokers.
Draw the box plot above split into males and females.

In [0]:

# @title Please try yourself
# Hint: Remember `hue` from last class?

Use swarmplot() to show the datapoints on top of the boxes:¶

In [0]:

sns.boxplot(x="day", y="total_bill", data=tips)
sns.swarmplot(x="day", y="total_bill", data=tips, color=".25")

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7eab7a8cc0>

In [0]:

sns.catplot(x="day", y="total_bill", data=tips, kind='box')

Out[0]:

<seaborn.axisgrid.FacetGrid at 0x7f7ea987ef60>

In [0]:

What Is Correlation and Why Is It Useful?¶

Correlation is one of the most widely used, and most widely misunderstood, statistical concepts. In this overview, we provide the definitions and intuition behind several types of correlation and illustrate how to calculate correlation using the Python pandas library.

The term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer. Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models.

Why is correlation a useful metric?¶

Correlation can help in predicting one quantity from another
Correlation can (but often does not, as we will see in some examples below) indicate the presence of a causal relationship
Correlation is used as a basic quantity and foundation for many other modeling techniques

Correlation in Pandas¶

df.corr()
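By default, `df.corr()` computes the Pearson correlation coefficient for each pair of numeric columns. As a rough sketch of what that number means, the hypothetical `pearson` helper below (not part of pandas) computes it by hand for made-up values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

bills = [10.0, 20.0, 30.0, 40.0]  # hypothetical bill amounts
tip_up = [1.5, 3.0, 4.5, 6.0]     # rises in step with the bill
tip_down = [6.0, 4.5, 3.0, 1.5]   # falls as the bill rises

print(pearson(bills, tip_up))    # 1.0
print(pearson(bills, tip_down))  # -1.0
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.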

In [0]:

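# Correlate the numeric columns only; recent pandas versions raise an error on non-numeric columns.
tips.select_dtypes(include=["number"]).corr()  # That was easy, right?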

Out[0]:

	total_bill	tip	size
total_bill	1.000000	0.675734	0.598315
tip	0.675734	1.000000	0.489299
size	0.598315	0.489299	1.000000

Exercise¶

For the tips dataset, which features are most correlated?¶

Heat Maps¶

Heatmaps are a great tool for exploratory data analysis when you have a data set with multiple variables. They can reveal general patterns in the data at a glance.

They are especially useful for visualizing how values are concentrated across the two dimensions of a matrix, such as a correlation matrix.

In [0]:

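# Use only the numeric columns so that corr() also works on recent pandas versions.
sns.heatmap(tips.select_dtypes(include=["number"]).corr(), linewidths=.5)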

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7eab72c470>

Titanic Correlation¶

In [0]:

import pandas as pd
titanic_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/train.csv')

In [0]:

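# Use only the numeric columns so that corr() also works on recent pandas versions.
sns.heatmap(titanic_df.select_dtypes(include=["number"]).corr(), linewidths=.5, cmap="RdBu")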

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7ea9913320>

Exercise¶

What are the two most correlated pairs of features in the dataset? (Answer: Pclass and Fare, followed by SibSp and Parch.)
Plot the relationship between the two most correlated features in the Titanic dataset. (Hint: remember sns.relplot?)

Bar Plot¶

A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle, and uses error bars to give some indication of the uncertainty around that estimate.
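In seaborn's `barplot`, the default estimate is the mean and the default error bar is a 95% confidence interval obtained by bootstrapping. The sketch below illustrates that idea with the standard library only. The `bootstrap_ci` helper and the bill values are made up for illustration, and seaborn's own implementation may differ in its details.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Approximate a confidence interval for the mean by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    boot_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    tail = (100 - level) / 2 / 100  # e.g. 0.025 on each side for a 95% interval
    lower = boot_means[int(tail * n_boot)]
    upper = boot_means[int((1 - tail) * n_boot) - 1]
    return lower, upper

bills = [12.5, 18.0, 21.3, 9.8, 30.1, 16.4, 24.7, 14.2]  # hypothetical total bills

bar_height = mean(bills)           # what the height of the bar shows
ci_low, ci_high = bootstrap_ci(bills)  # what the error bar spans
print(bar_height, ci_low, ci_high)
```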

In [0]:

sns.barplot(x="day", y="total_bill", data=tips)

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7ea9f3a278>

Exercise¶

Draw a bar plot of the total bill for males and females across days.

In [0]:

# @title Please try yourself

Let's Explore a New Dataset: Gapminder¶

It provides the average life expectancy, GDP per capita, and population size for more than 100 countries.

In [0]:

data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
gapminder.head(10)

Out[0]:

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106
5	Afghanistan	1977	14880372.0	Asia	38.438	786.113360
6	Afghanistan	1982	12881816.0	Asia	39.854	978.011439
7	Afghanistan	1987	13867957.0	Asia	40.822	852.395945
8	Afghanistan	1992	16317921.0	Asia	41.674	649.341395
9	Afghanistan	1997	22227415.0	Asia	41.763	635.341351

Develop insights about this new dataset¶

Average lifeExp by country
Average gdpPercap by continent
Plot the relationship between gdpPercap and lifeExp

... go on and discover some interesting relations
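As a starting hint, group-wise averages like the ones listed above are usually computed with `groupby` in pandas. The sketch below uses a tiny made-up frame (`toy`, with hypothetical values) that borrows the gapminder column names. Swap in the real `gapminder` frame to explore the actual data.

```python
import pandas as pd

# Tiny made-up frame with gapminder-style columns; all values are hypothetical.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 70.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0, 10000.0, 20000.0],
})

# Average of one column within each group.
life_by_country = toy.groupby("country")["lifeExp"].mean()
gdp_by_continent = toy.groupby("continent")["gdpPercap"].mean()

print(life_by_country)
print(gdp_by_continent)
```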

In [0]:

# @title Please try yourself
# No solution this time.

In [0]:

sns.relplot(x="year", y="gdpPercap", data=gapminder, kind="line")

Out[0]:

<seaborn.axisgrid.FacetGrid at 0x7f7ead9d3e10>

Can we make the plots more informative?¶

Hint: catplot, relplot, countplot, pairplot?

In [0]:

Learn Yourself¶

Browse the seaborn API reference at https://seaborn.pydata.org/api.html, explore interesting features, and try the plots yourself.

In [0]:

Data Visualization Part-2
