Connecting Dots

Data Visualization Part-2

In [0]:
!pip3 install --upgrade seaborn  # Upgrade seaborn in the Colab notebook.
import warnings
warnings.filterwarnings('ignore')
Collecting seaborn
  Downloading https://files.pythonhosted.org/packages/a8/76/220ba4420459d9c4c9c9587c6ce607bf56c25b3d3d2de62056efe482dadc/seaborn-0.9.0-py3-none-any.whl (208kB)
    100% |████████████████████████████████| 215kB 24.1MB/s 
Requirement already satisfied, skipping upgrade: scipy>=0.14.0 in /usr/local/lib/python3.6/dist-packages (from seaborn) (1.1.0)
Requirement already satisfied, skipping upgrade: matplotlib>=1.4.3 in /usr/local/lib/python3.6/dist-packages (from seaborn) (3.0.3)
Requirement already satisfied, skipping upgrade: numpy>=1.9.3 in /usr/local/lib/python3.6/dist-packages (from seaborn) (1.14.6)
Requirement already satisfied, skipping upgrade: pandas>=0.15.2 in /usr/local/lib/python3.6/dist-packages (from seaborn) (0.22.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (2.3.1)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (0.10.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (2.5.3)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=1.4.3->seaborn) (1.0.1)
Requirement already satisfied, skipping upgrade: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas>=0.15.2->seaborn) (2018.9)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib>=1.4.3->seaborn) (1.11.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.6/dist-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (40.8.0)
Installing collected packages: seaborn
  Found existing installation: seaborn 0.7.1
    Uninstalling seaborn-0.7.1:
      Successfully uninstalled seaborn-0.7.1
Successfully installed seaborn-0.9.0

Box plot¶

A box plot, also called a box-and-whisker plot:

  • shows the distribution of values using the five-number summary: minimum, first quartile, median, third quartile, and maximum.


Median¶

The median is the value that separates the higher half of a data set from the lower half. To find it, sort the values and take the middle one. For example, for the numbers 1, 3, 4, 7, 8, 8, 9, the median is 7.

First quartile¶

The first quartile is the median of the values to the left of the median in the ordered data. Example: for the numbers 1, 3, 4, 7, 8, 8, 9, what is the first quartile?

Third quartile¶

The third quartile is the median of the values to the right of the median in the ordered data. Example: for the numbers 1, 3, 4, 7, 8, 8, 9, what is the third quartile?

Interquartile Range (IQR)¶

The IQR measures the spread of the middle 50% of the data. It is the third quartile minus the first quartile. Example: for the numbers 1, 3, 4, 7, 8, 8, 9, what is the IQR?
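The quartile and IQR questions above can be checked with a few lines of plain Python. This is a minimal sketch of the median-of-halves rule described in this section; the function name quartiles is made up for illustration, and it assumes at least two values. Keep in mind that NumPy and pandas interpolate between data points by default, so they can report slightly different quartiles for small samples.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # values left of the median
    upper = ordered[len(ordered) - half:]   # values right of the median
    return median(lower), median(ordered), median(upper)


numbers = [1, 3, 4, 7, 8, 8, 9]
q1, med, q3 = quartiles(numbers)
iqr = q3 - q1
print("Q1:", q1, "median:", med, "Q3:", q3, "IQR:", iqr)
```

Try working out the answers by hand first, then run the cell to compare.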

Outlier¶

An outlier is a data value that lies outside the overall pattern. A commonly used rule says that a value is an outlier if it is less than the first quartile - 1.5 × IQR or higher than the third quartile + 1.5 × IQR.

Maximum and Minimum¶

The minimum and the maximum are the smallest and largest values in the data that are not outliers. They mark the ends of the whiskers.
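The 1.5 × IQR rule and the whisker ends can be sketched in plain Python as well. The function name box_plot_summary and the sample list are hypothetical (the example numbers from this section plus one deliberately extreme value), and quartiles are again computed with the median-of-halves rule, so plotting libraries may place the fences slightly differently.

```python
from statistics import median


def box_plot_summary(values):
    """Split values into whisker ends and outliers with the 1.5 * IQR rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # anything below this is an outlier
    high_fence = q3 + 1.5 * iqr   # anything above this is an outlier
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    # Whiskers end at the most extreme values that are not outliers.
    return {"min": inside[0], "max": inside[-1], "outliers": outliers}


# Hypothetical sample: the section's example numbers plus the extreme value 30.
sample = [1, 3, 4, 7, 8, 8, 9, 30]
summary = box_plot_summary(sample)
print(summary)
```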

In [0]:
import seaborn as sns
sns.set(style="whitegrid")

tips = sns.load_dataset("tips")
sns.boxplot(y=tips["total_bill"])
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ead8ea080>
In [0]:
sns.boxplot(x="day", y="total_bill", data=tips)
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ead9184a8>

Exercise¶

  • Draw the box plot above (total_bill by day) separately for smokers and non-smokers.
  • Draw the same box plot separately for males and females.
In [0]:
# @title Please try yourself
# Hint: Remember `hue` from last class?

Use swarmplot() to show the datapoints on top of the boxes:¶

In [0]:
sns.boxplot(x="day", y="total_bill", data=tips)
sns.swarmplot(x="day", y="total_bill", data=tips, color=".25")
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7eab7a8cc0>
In [0]:
sns.catplot(x="day", y="total_bill", data=tips, kind='box')
Out[0]:
<seaborn.axisgrid.FacetGrid at 0x7f7ea987ef60>

What Is Correlation and Why Is It Useful?¶

Correlation is one of the most widely used — and widely misunderstood — statistical concepts. In this overview, we provide the definitions and intuition behind several types of correlation and illustrate how to calculate correlation using the Python pandas library.

The term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer. Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models.

Why is correlation a useful metric?¶

  • Correlation can help in predicting one quantity from another
  • Correlation can (but often does not, as we will see in some examples below) indicate the presence of a causal relationship
  • Correlation is used as a basic quantity and foundation for many other modeling techniques
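To build intuition for what a correlation coefficient measures, here is a minimal sketch of the Pearson coefficient, which is the default method used by pandas' DataFrame.corr(). The function name pearson_r and the ad_spend and sales numbers are hypothetical, chosen only to echo the advertising example above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical numbers, for illustration only.
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 48, 55]
print(pearson_r(ad_spend, sales))
```

The result always lies between -1 and 1: values near 1 mean the two quantities rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.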


Correlation in Pandas¶

df.corr()

In [0]:
tips.corr()  # That was easy, right? On newer pandas (2.0+), use tips.corr(numeric_only=True).
Out[0]:
            total_bill       tip      size
total_bill    1.000000  0.675734  0.598315
tip           0.675734  1.000000  0.489299
size          0.598315  0.489299  1.000000
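Reading the strongest relationship off a correlation matrix by eye gets harder as the number of columns grows. The sketch below ranks every pair of numeric columns by absolute correlation. The toy DataFrame and its column names are hypothetical, and select_dtypes is used here as one way to skip non-numeric columns before calling corr().

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 72, 80],
    "sleep": [8, 6, 7, 5, 7, 6],
    "group": ["a", "b", "a", "b", "a", "b"],  # non-numeric, excluded below
})

# Keep only numeric columns, then compute the (Pearson) correlation matrix.
corr = df.select_dtypes(include="number").corr()

# Visit each unordered pair of columns once and rank by absolute correlation.
pairs = {(a, b): abs(corr.loc[a, b]) for a, b in combinations(corr.columns, 2)}
ranked = sorted(pairs.items(), key=lambda item: item[1], reverse=True)

for (a, b), strength in ranked:
    print(a, b, round(strength, 3))
```

Ranking by absolute value treats strong negative correlations as just as interesting as strong positive ones.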

Exercise¶

For the tips dataset, which features are most correlated?¶

Heat Maps¶

Visualizing data with heatmaps is a great way to do exploratory data analysis when you have a data set with multiple variables. A heatmap can reveal general patterns in the dataset at a glance.

  • It is especially useful for visualizing the concentration of values across the two dimensions of a matrix.
In [0]:
sns.heatmap(tips.corr(), linewidths=.5)
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7eab72c470>

Titanic Correlation¶

In [0]:
import pandas as pd
titanic_df = pd.read_csv('https://raw.githubusercontent.com/nphardly/titanic/master/data/inputs/train.csv')
In [0]:
sns.heatmap(titanic_df.corr(), linewidths=.5, cmap="RdBu")
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ea9913320>

Exercise¶

  • What are the top two most correlated pairs of features in the dataset? Answer (by absolute correlation): Pclass and Fare; SibSp and Parch.

  • Plot the relationship between the two most correlated features in the Titanic dataset (Hint: remember sns.relplot?)

Bar Plot¶

A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle, and uses error bars to indicate the uncertainty around that estimate.
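By default, sns.barplot uses the mean as its estimate of central tendency and draws a 95% bootstrap confidence interval as the error bar. The bar heights can therefore be reproduced with a plain pandas groupby, as in this small sketch. The bills DataFrame is hypothetical toy data, not the tips dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a few rows of restaurant bills.
bills = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 30.0],
})

# One mean per category: these are the heights barplot would draw by default.
bar_heights = bills.groupby("day")["total_bill"].mean()
print(bar_heights)
```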

In [0]:
sns.barplot(x="day", y="total_bill", data=tips)
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ea9f3a278>

Exercise¶

  • Draw a bar plot of total bill for males and females across days.
In [0]:
# @title Please try yourself

Let's Explore a New Dataset: Gapminder¶

It provides the life expectancy, GDP per capita, and population size for more than 100 countries.

In [0]:
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
gapminder.head(10)
Out[0]:
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
5  Afghanistan  1977  14880372.0      Asia   38.438  786.113360
6  Afghanistan  1982  12881816.0      Asia   39.854  978.011439
7  Afghanistan  1987  13867957.0      Asia   40.822  852.395945
8  Afghanistan  1992  16317921.0      Asia   41.674  649.341395
9  Afghanistan  1997  22227415.0      Asia   41.763  635.341351

Develop insights about this new dataset¶

  • Average lifeExp by country
  • Average gdpPercap by continent
  • Plot the relationship between gdpPercap and lifeExp

... go on and discover some interesting relations

In [0]:
# @title Please try yourself
# No solution this time. 
In [0]:
sns.relplot(x='year', y='gdpPercap', data=gapminder, kind='line')
Out[0]:
<seaborn.axisgrid.FacetGrid at 0x7f7ead9d3e10>

Can we make the plots more informative?¶

Hint: catplot, relplot, countplot, pairplot?

In [0]:

Learn Yourself¶

  • Explore the seaborn API reference and try new plot types yourself: https://seaborn.pydata.org/api.html

Published

Mar 15, 2019

Category

Data Analytics with Python
