Assignment 1: Data Analysis¶
# Imports - Needed for tasks .If some import is missing, please import here
import numpy as np
import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats
Setup¶
Data: the responses collected from a survery of the COGS 108 class.
- There are 417 observations in the data, covering 10 different 'features'.
Research Question: Do students in different majors have different heights?
Background: Physical height has previously shown to correlate with career choice, and career success. More recently it has been demonstrated that these correlations can actually be explained by height in high school, as opposed to height in adulthood (1). It is currently unclear whether height correlates with choice of major in university.
Reference: 1) http://economics.sas.upenn.edu/~apostlew/paper/pdf/short.pdf
Hypothesis: We hypothesize that there will be a relation between height and chosen major.
Notes - Assignment Outline¶
Assignment 1:¶
Parts 1-3 of this assignment are modeled on being a revision of what we have already learnt.
This mimics, and gets you working with, something like what you will need for your final project.
- Reading Data
- Processing Data
- Plotting
- Analysing Data
Assignment 2 (will come later):¶
We will explore some statistical tests on data.
- Part 4: Testing Distribution
- Part 5: Data Analysis - Drawing conclusions!!
===============================================================================================================¶
Part 1: Load & Clean the Data¶
Fixing messy data makes up a large amount of the work of being a Data Scientist.
The real world produces messy measurements and it is your job to find ways to standardize your data such that you can make useful analyses out of it.
In this section, you will learn, and practice, how to successfully deal with unclean data.
1a) Load the data¶
Import datafile 'COGS108_IntroQuestionnaireData.csv' into a DataFrame called 'df'.
# YOUR CODE HERE
assert isinstance(df, pd.DataFrame)
# Check out the data
df.head(5)
Those column names are a bit excessive, so first let's rename them - code provided below to do so.
# Renaming the columns of the dataframe
df.columns = ["timestamp", "year", "major", "age", "gender", "height",
"weight", "eye_color", "born_in_CA", "favorite_icecream"]
Pandas has a very useful function for detecting missing data. This function is called 'isnull()'.
If you have a dataframe called 'df', then calling 'df.isnull()' will return another dataframe of the same size as 'df' where every cell is either True of False.
Each True or False is the answer to the question 'is the data in this cell null?'. So, False, means the cell is not null (and therefore, does have data). True means the cell is null (does not have data).
This function is very useful because it allows us to find missing data very quickly in our dataframe. As an example, consider the code below.
# Check the first few rows of the 'isnull' dataframe
df.isnull().head(5)
If you print out more, and scroll down, you'll see some rows with missing data. For example:
df.isnull().iloc[48:50, :]
Check an example, row 49, in which an entry has missing data
Granted, the example above is not very informative. As you can see, the output of 'isnull()' is a dataframe where the values at each cell is either True or False. Most cells have the value of 'False'. We expect this to be the case since most people gave out answers to each question in our survey.
However, some rows such as row 49 show that some people chose not to answer certain questions. In the case of row 49, it seems that someone did not give out an answer for 'What year (in school) are you?'
However, what if wanted to use 'isnull()' to see all rows where our dataframe 'df' has missing values? In other words, what if we want to see the ACTUAL rows with missing values instead of this dataframe with True or False cells. For that, we need to write the following line of code:
df[df.isnull().any(axis=1)]
1b) Find missing data¶
Find all rows that have missing data in them. Save the ouput, as a dataframe, into a variable called 'rows_to_drop'. In other words, copy over and use the line of code that we gave out in the cell above.
# YOUR CODE HERE
assert isinstance(rows_to_drop, pd.DataFrame)
assert rows_to_drop.shape == (29, 10)
You need to run & read the following couple of cells - but don't have to add any code:
You need to run & read the following couple of cells - but don't have to add any code:
Real world data is messy. As an example of it, we consider the data shown in rows_to_drop (below). If you've done everything correctly so far, you should see an unexpected response with emojis at index 357. These types of responses, although funny, are hard to parse when dealing with big datasets. We'll learn about solutions to these types of problems in the upcoming cells
rows_to_drop
In the cell below, briefly explain below how 'df[df.isnull().any(axis=1)]' works, in a couple sentences.
Include an explanation of what 'any(axis=1)' means and how it affects the code.
1c) Drop the rows with NaN values¶
Drop any rows with missing data, but only for the columns 'major', 'height', 'gender' and 'age'. These will be the data of primary interest for our analyses, so we drop missing data here. Note that there are other missing data (in other rows) but this is fine for our analyses, so we keep them. To do this, use the pandas 'dropna' method, inplace, using the 'subset' arguments to specify columns.
# YOUR CODE HERE
assert df.shape == (404, 10)
Now we have to standardize the data!
Check all different values given for majors. It's a lot!
df["major"].unique()
We'll write a function performing some simple substring checking in order to group many responses together
def standardize_major(string):
string = string.lower()
string = string.strip()
if "cog" in string:
output = "COGSCI"
elif "computer" in string:
output = "COMPSCI"
elif "cs" in string:
output = "COMPSCI"
elif "math" in string:
output = "MATH"
elif "electrical" in string:
output = "ECE"
elif "bio" in string:
output = "BIO"
elif "chem" in string:
output = "CHEM"
# Otherwise, if uncaught - keep as is
else:
output = string
return output
Applying the transformation
df["major"] = df["major"].apply(standardize_major)
Previewing the results of the previous transformation. It looks a lot better, though it's not perfect, but we'll run with this
df["major"].unique()
Next let's check the 'gender' column.
Check the different responses received for gender, including how many of each response we have
df["gender"].value_counts()
standardize_gender
: Function to standardize the gender responses
Note: for the purposes of the following analyses, we will keep self-reported gender for categories in which we have a sizable number of responses, in this case, those which correspond to 'female' and 'male'
def standardize_gender(gender):
gender = gender.lower()
gender = gender.strip()
if gender in ['female', 'f', 'woman', 'women']:
output = 'female'
elif gender in ['male', 'm', 'man', 'men']:
output = 'male'
else:
output = np.nan
return output
Apply the transformation, and drop any rows with missing gender information
df["gender"] = df["gender"].apply(standardize_gender)
df.dropna(subset=['gender'], inplace=True)
# Check the results
df["gender"].unique()
Now you will write some code to standardize some of the other data columns.
1d) Standardize other columns¶
Find, programatically, the number of unique responses to in the 'year' column. Save the result in a variable named 'num_unique_responses'. Hint: you can answer this question using the 'unique' method, used above.
# YOUR CODE HERE
assert num_unique_responses
assert isinstance(num_unique_responses, int)
Print out all the different answers in 'year'
# YOUR CODE HERE
The line of code above shows us the different values we got, to the question 'What year (in school) are you?'.
As you can tell, it is a mess!. For example, if you are a junior student, then you might have answered: 3, three, third, 3rd year, junior, junior year, Junior, etc.
That is an issue. We want to be able to analyze this data and, in order to do this successfully, we need to all answers with the same meaning to be written in the same way. Therefore, we're gonna have to transform answers such as '3, third, 3rd, junior, etc' into a single possible value. We'll do this for all values that mean the same.
In the rest of Part 1, we will work on writing code, organized into functions that will allow us to transform similar respones into the same value. We will call this process: standardizing the data.
The cell below provides an example for the kind of code you will need to write to answer this question. This example is separate from our actual data, and is a potential function we might use to standardize messy data - in this case, hypothetical data to the question 'What is your favourite major python version?'.
Note some things used in this example that you need to use to standardize data:
- string methods, such as 'lower' and 'strip' to transform strings
- the 'replace' string method, to replace a set of characters with something else
- if/else statements that check what's in our string (number, letters, etc)
- type casting, for example using 'int()' to turn a variable into an integer
- using 'np.nan' (which stands for 'not a number') to denote missing or unknown data
def example_standardize_function(str_in):
"""Standardize data to the question 'what is your favourite major python version?'
Parameters
----------
str_in : string
A provided answer.
Returns
-------
int_out : int or np.nan
A standardized integer response.
"""
# Make the input all lowercase
str_in = str_in.lower()
# Drop all whitespace
str_in = str_in.strip()
# Replace things (and then strip again afterwords)
# Note that the 'replace' replaces the first argument, with the second
# The first argument does not need to be present in the string,
# if it's not there 'replace' does nothing (but does not error), so the code moves on.
str_in = str_in.replace('version', '')
str_in = str_in.replace('python', '')
str_in = str_in.strip()
# Cast to integer, if what's left seems appropriate
if str_in.isnumeric() and len(str_in) == 1:
out = int(str_in)
# Otherwise, consider input was probably ill-formed, return nan
else:
out = np.nan
return out
# Check how this function help standardize data:
# Example possible answers to the question 'What is your favourite major version of Python':
print('INPUT', '\t\t-\t', 'OUTPUT')
for inp in ['version 3', '42', '2', 'python 3', 'nonsense-lolz']:
print('{:10s} \t-\t {:1.0f}'.format(inp, example_standardize_function(inp)))
1e) Standardize 'year' column¶
Write a function named 'standardize_year' that takes in as input a string and returns an integer.
The function will do the following (in the order specified): Note that for these detailed instructions, each line corresponds to one line of code you need to write.
- 1) convert all characters of the string into lowercase
2) strip the string of all leading and trailing whitespace
3) replace any occurences of 'first' with '1'
- 4) replace any occurences of 'second' with '2'
- 5) replace any occurences of 'third' with '3'
- 6) replace any occurences of 'fourth' with '4'
- 7) replace any occurences of 'fifth' with '5'
8) replace any occurences of 'sixth' with '6'
9) replace any occurences of 'freshman' with '1'
- 10) replace any occurences of 'sophomore' with '2'
- 11) replace any occurences of 'junior' with '3'
12) replace any occurences of 'senior' with 4'
13) replace any occurences of 'year' with '' (remove it from the string)
- 14) replace any occurences of 'th' with '' (remove it from the string)
- 15) replace any occurences of 'rd' with '' (remove it from the string)
16) replace any occurences of 'nd' with '' (remove it from the string)
17) strip the string of all leading and trailing whitespace (again)
- 18) If the resulting string is a number and it is less than 10, then cast it into an integer and return that value
- 19) Else return np.nan to symbolize that the student's response was not a valid entry
HINTS: you will need to use the functions 'lower()', 'strip()', isnumeric() and 'replace()'
# YOUR CODE HERE
# Implement this method based on above requirements
def standardize_year(str_in):
pass
assert standardize_year('2nd') == 2
assert standardize_year('sophomore') == 2
assert standardize_year('3rd year') == 3
assert standardize_year('5th') == 5
assert standardize_year('7 ') == 7
assert standardize_year('randomText') is np.nan
1f) Transform 'year' column¶
Use 'standardize_year' to transform the data in column 'What year (in school) are you?'. Hint: use the apply function AND remember to save your output inside the dataframe
# YOUR CODE HERE
assert len(df["year"].unique()) == 7
Assuming that all is correct up to this point, the line below should show all values now found in df.
It should look a lot better. With this data, we can now make insightful analyses.
You should see an array with elements 1,2,3,4,5,6 and nan (not necessarily in that order).
Note that if you check the data type of this column, you'll see that pandas converts these numbers to 'float', even though the applied function returns 'int', because 'np.nan' is considered a float. This is fine.
df["year"].unique()
Let's do it again. Let's take a look at the responses in the 'weight' column, and then standardize them.
# First, ensure that all types are consistent, use strings
df["weight"] = df["weight"].astype(str)
# Check all the different answers we received
df["weight"].unique()
1g) Standardize 'weight' column¶
Write a function named 'standardize_weight' that takes in as input a string and returns an integer. The function will do the following (in the order specified):
- 1) convert all characters of the string into lowercase
2) strip the string of all leading and trailing whitespace
3) replace any occurences of 'lbs' with '' (remove it from the string)
- 4) replace any occurences of 'lb' with '' (remove it from the string)
5) replace any occurences of 'pounds' with '' (remove it from the string)
6) If the string contains the substring 'kg', then:
- 6.1) replace 'kg' with ''
- 6.2) cast the string into an integer type using the function 'int()'
- 6.3) multiply the resulting integer by 2 (an approximate conversion of kilograms to pounds)
- 6.4) cast the resulting number back into a string with the function 'str()'
7) Strip the string of its whitespaces (again)
- 8) If the resulting string is numeric: cast it into an integer and return the resulting value
- 9) Else: return np.nan
# YOUR CODE HERE
assert standardize_weight('34 lbs') == 34
assert standardize_weight('101 kg') == 202
1h) Transform 'weight' column¶
Use 'standardize_weight' to transform the data in the 'weight' column. Hint: use the apply function AND remember to save your output inside the dataframe
# YOUR CODE HERE
assert df["weight"].unique().shape == (83,)
Now, let's see the result of our hard work . The code below should output all numbers (or nan).
df["weight"].unique()
So far, you've gotten a taste of what it is like to deal with messy data. It's not easy, as you can tell.
The last variable we need to standardize for the purposes of our analysis is 'height'. We will standardize that one for you.
Do read the code below and try to understand what it is doing.
# First, we'll look at the possible values for height
df["height"].unique()
It seems like we'll have to handle different measurement systems. Ugh, ok... Let's write a function that converts all those values to inches
def standardize_height(string):
orig = string
output = None
# Basic string pre-processing
string = string.lower()
string = string.strip()
string = string.replace("foot", "ft")
string = string.replace("feet", "ft")
string = string.replace("inches", "in")
string = string.replace("inch", "in")
string = string.replace("meters", "m")
string = string.replace("meter", "m")
string = string.replace("centimeters", "cm")
string = string.replace("centimeter", "cm")
string = string.replace(",", "")
string = string.strip()
# CASE 1: string is written in the format FEET <DIVIDER> INCHES
dividers = ["'", "ft", "’", '”', '"','`', "-", "''"]
for divider in dividers:
# Split it into its elements
elements = string.split(divider)
# If the divider creates two elements
if (len(elements) >= 2) and ((len(string) -1) != string.find(divider)):
feet = elements[0]
inch = elements[1] if elements[1] is not '' else '0'
# Cleaning extranious symbols
for symbol in dividers:
feet = feet.replace(symbol, "")
inch = inch.replace(symbol, "")
inch = inch.replace("in","")
# Removing whitespace
feet = feet.strip()
inch = inch.strip()
# By this point, we expect 'feet' and 'inch' to be numeric
# If not...we ignore this case
if feet.replace('.', '').isnumeric() and inch.replace('.', '').isnumeric():
# Converting feet to inches and adding it to the curent inches
output = (float(feet) * 12) + float(inch)
break
# CASE 2: string is written in the format FEET ft INCHES in
if ("ft" in string) and ("in" in string):
# Split it into its elements
elements = string.split("ft")
feet = elements[0]
inch = elements[1]
# Removing extroneous symbols and stripping whitespace
inch = inch.replace("inch", "")
inch = inch.replace("in", "")
feet = feet.strip()
inch = inch.strip()
# By this point, we expect 'feet' and 'inch' to be numeric
# If not...we ignore this case
if feet.replace('.', '').isnumeric() and inch.replace('.', '').isnumeric():
# Converting feet to inches and adding it to the curent inches
output = (float(feet) * 12) + float(inch)
# CASE 3: answer was given ONLY in cm
# Convert to inches: approximately 0.39 inches in a meter
elif "cm" in string:
centimeters = string.replace("cm", "")
centimeters = centimeters.strip()
if centimeters.replace('.', '').isnumeric():
output = float(centimeters) * 0.39
# CASE 4: answer was given ONLY in meters
# Convert to inches: approximately 39 inches in a meter
elif "m" in string:
meters = string.replace("m", "")
meters = meters.strip()
if meters.replace('.', '').isnumeric():
output = float(meters)*39
# CASE 5: answer was given ONLY in feet
elif "ft" in string:
feet = string.replace("ft", "")
feet = feet.strip()
if feet.replace('.', '').isnumeric():
output = float(feet)*12
# CASE 6: answer was given ONLY in inches
elif "in" in string:
inches = string.replace("in", "")
inches = inches.strip()
if inches.replace('.', '').isnumeric():
output = float(inches)
# CASE 7: answer not covered by existing scenarios / was invalid.
# Return NaN
if not output:
output = np.nan
return output
Apply
- the transformation and
- dropping invalid rows
# YOUR CODE HERE
Check the height data, after applying our standardization
df["height"].unique()
Ensuring that the data types are correct - type cast age to int.
df["age"] = df["age"].astype(np.int64)
# Check out the data, after we've cleaned it!
df.head()
Check that the dataframe has the right number of columns. If this doesn't pass - check your code in the section above.
assert len(df) == 367
2a) Scatter Matrix¶
Plot the data, using scatter_matrix (free to use any library you want- seaborn, pandas(yes you can plot using pandas), matplot)
# YOUR CODE HERE
2b) Bar Chart¶
Plot a bar chart showing the number of students in each majors.
Hints: you can use 'value_counts' to get the counts for each major. You can then use the 'plot' method from pandas for plotting - you don't need matplotlib.
Hints: If you remember seaborn method we learned in Visulization
# YOUR CODE HERE
2c) Histogram for COGSCI¶
Plot a histogram of the height data for all students who wrote 'COGSCI' as their major.
# YOUR CODE HERE
2d) Histogram for COMPSCI¶
Plot a histogram of the height data for all students who wrote 'COMPSCI' as their major.
# YOUR CODE HERE
Part 3: Exploring The Data¶
Beyond just plotting the data, we should check some other basic properties of the data. This serves both as a way to get a 'feel' for the data, and to look for any quirks or oddities about the data, that may indicate issues that need resolving. To do this, let's explore that data a bit (not limiting ourselves to only features that we plan to use - exploring the dataset as a whole can help us find any issues).
Notes:
- You answers should NOT be pandas objects (Series or DataFrames), extract answers so the variables are ints, floats or strings (as appropriate).
- You must answer these questions programmatically: do not count / check and hard code particular values.
3a) Number of majors¶
How many different majors are in the dataset? Save this number to a variable 'n_majors'.
# YOUR CODE HERE
assert n_majors
3b) Range of 'age'¶
What is the range (max value - min value) of ages in the dataset? Save this number to a variable 'r_age'
# YOUR CODE HERE
assert r_age
3c) Most popular ice-cream flavour¶
What is the most popular ice cream flavour? Save the ice cream name to the variable 'f_ice', and the number of people who like it to a variable 'n_ice'.
Hint: you can get these values using the 'value_counts' method.
# YOUR CODE HERE
assert n_ice
assert f_ice
3d) Unique favourite ice cream¶
How many people have a unique favourite ice cream? (How many ice cream flavours are only 1 persons favourite?) Save this number to a variable 'u_ice'
# YOUR CODE HERE
assert u_ice