Statistical Case Study for Data Science – Foundational Statistics for Data Science and Machine Learning with Python – Learning By Doing


In this post of the series “Foundational Statistics for Data Science and Machine Learning with Python”, we will go through a statistical case study based on the human body temperature data published here. The case study poses several analytical questions that help polish the statistical skills required for Data Analysis, Data Science and Machine Learning. For each question, we will review the required statistics background and then apply those statistical principles in Python with Pandas.

DATA OVERVIEW:

The data contains three attributes: Temperature, Gender and Heart Rate. You can download the data from here. We will use Pandas DataFrames and import the data directly from the link.

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as st
import numpy as np

>>> df1 = pd.read_csv('https://ww2.amstat.org/publications/jse/datasets/normtemp.dat.txt')
>>> df1.head()
   96.3    1    70
0  96.7    1    71
1  96.9    1    74
2  97.0    1    80
3  97.1    1    73
4  97.1    1    75

DATA CLEANING:

Let’s look at the shape of our DataFrame df1 to see the number of rows and columns:

>>> df1.shape
(129, 1)

If we download the text file and open it in a text editor, we will notice that each record is on its own line and the fields are separated by runs of spaces. Since the default separator for the read_csv function is a comma, the call above imported each line as a single column. It also treated the first record as the header row, which is why only 129 rows were reported.

To handle this, we will pass the sep argument to read_csv, using a regular expression that matches one or more whitespace characters:

>>> df = pd.read_csv('https://ww2.amstat.org/publications/jse/datasets/normtemp.dat.txt', sep=r'\s+', header=None)
>>> df.head()
      0  1   2
0  96.3  1  70
1  96.7  1  71
2  96.9  1  74
3  97.0  1  80
4  97.1  1  73

>>> df.shape
(130, 3)

So now we have 130 records and 3 columns as expected. We used the header=None argument so that Pandas treats the first row of the file as data rather than as column headings, which are not present in the data file.

We will define column names as follows:

>>> df.columns = ['temperature', 'gender', 'heart_rate']
>>> df.head()
   temperature  gender  heart_rate
0         96.3       1          70
1         96.7       1          71
2         96.9       1          74
3         97.0       1          80
4         97.1       1          73
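
The two parsing behaviours described above can be reproduced without downloading anything. The following is a minimal sketch that uses the first five records shown earlier as an inline sample. The variable raw and the use of io.StringIO are illustrative assumptions, not part of the original session. It also passes names= so that the column names are set in the same call.

```python
import io
import pandas as pd

# First five records of the file, copied from the output above (illustrative sample).
raw = """96.3    1    70
96.7    1    71
96.9    1    74
97.0    1    80
97.1    1    73
"""

# Default comma separator: each line becomes one field, and the first
# record is consumed as the header row.
naive = pd.read_csv(io.StringIO(raw))

# Whitespace separator, no header row, column names supplied up front.
parsed = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    header=None,
    names=["temperature", "gender", "heart_rate"],
)

print(naive.shape)   # (4, 1)
print(parsed.shape)  # (5, 3)
print(parsed.dtypes)
```

With the default separator, one record is lost to the header and everything lands in one column. With sep=r"\s+" and header=None, every record is kept and split into three columns.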

CASE STUDY ANALYSIS # 1:

– Is the distribution of temperatures normal?

Normal Distribution Background and Why It’s Important in Training Our Data Models for Data Science:

In probability theory, the normal (aka Gaussian) distribution is a very common class of statistical distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. The normal distribution is sometimes informally called the bell curve.[1]

Normal Distribution

Many Machine Learning models assume that the data fed to them follows a normal distribution[2]. So, once your data is cleaned, you should check what distribution it follows. Some of the Machine Learning and statistical models that assume normally distributed data are:

  • Gaussian naive Bayes
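  • Least-squares-based (regression) models, where the assumption applies to the residuals
  • LDA (Linear Discriminant Analysis)
  • QDA (Quadratic Discriminant Analysis)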

Skewness: The coefficient of skewness measures the degree of asymmetry of a variable’s distribution. A symmetric distribution, such as the normal distribution, has a skewness of zero.

SKEWNESS – SKEWED DISTRIBUTION

 

Python Implementation and Analysis of Normal Distribution:

Let’s plot a histogram to visualize the distribution of the temperature attribute of the data:

>>> axes = df.hist(column='temperature')
>>> plt.show()
Temperature Histogram – Foundational Statistics for Data Science and Machine Learning

In the histogram above, the distribution looks slightly skewed to the right but otherwise roughly follows the bell shape of a normal distribution.
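
A histogram is only a visual check. As a hedged sketch of how the impression of skewness and normality could be quantified, the snippet below computes the sample skewness and runs two normality tests from scipy.stats. To stay self-contained it uses synthetic, normally distributed values. The sample array is an assumption for illustration, not the case-study data. With the real data you would pass df['temperature'] instead.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for the temperature column (NOT the real data).
rng = np.random.default_rng(42)
sample = rng.normal(loc=98.25, scale=0.73, size=130)

# Sample skewness: about 0 for symmetric data, > 0 for a longer right tail,
# < 0 for a longer left tail.
skewness = st.skew(sample)

# D'Agostino-Pearson test and Shapiro-Wilk test; in both, the null
# hypothesis is that the data come from a normal distribution.
k2_stat, k2_p = st.normaltest(sample)
sw_stat, sw_p = st.shapiro(sample)

print(f"skewness = {skewness:.3f}")
print(f"normaltest p-value = {k2_p:.3f}")
print(f"shapiro p-value = {sw_p:.3f}")
```

A large p-value here only means the test found no evidence against normality. It does not prove that the data are normal.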

CASE STUDY ANALYSIS # 2:

– Is the true population mean really 98.6 degrees F?

BACKGROUND:

Some basics to recall before examining the distribution (mean, median, mode, variance and standard deviation):
Mean: The sum of all observations divided by the number of observations.

CALCULATE MEAN

Median: When all the observations are sorted in ascending order, the median is the middle value.
– Median is equal to 50th percentile.
– If the distribution of the data is Normal, then the median is equal to the arithmetic mean (which also equals Mode).
– The median is not sensitive to extreme values/outliers/noise, and therefore it may be a better measure of central tendency than the arithmetic mean.
Variance and Standard Deviation: Standard deviation measures the spread of the data. Variance is the average of the squared differences from the mean, and standard deviation is the square root of the variance.[3][4]
One Sample T-Test: To verify our results, we will perform a one-sample t-test. A one-sample t-test checks whether the mean of a sample differs significantly from a hypothesized population mean.

VARIANCE

 

STANDARD DEVIATION
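
As a small worked example of the definitions above, the sketch below computes the mean, median, variance and standard deviation of the first five temperatures from the data preview. The variable names are illustrative. One point worth stating explicitly: the "average of squared differences" definition is the population variance (ddof=0), while pandas defaults to the sample variance (ddof=1), which divides by n - 1 and is what describe() reports as std.

```python
import pandas as pd

# First five temperatures from the data preview (illustrative subset).
temps = pd.Series([96.3, 96.7, 96.9, 97.0, 97.1])

mean = temps.mean()              # sum of observations / number of observations
median = temps.median()          # middle value of the sorted observations
pop_var = temps.var(ddof=0)      # average squared deviation from the mean
sample_var = temps.var()         # pandas default ddof=1: divides by n - 1
sample_std = temps.std()         # square root of the sample variance

print(mean, median, pop_var, sample_var, sample_std)

# The median barely moves when an extreme value is added; the mean does.
with_outlier = pd.concat([temps, pd.Series([110.0])], ignore_index=True)
print(with_outlier.mean(), with_outlier.median())
```

Adding the single extreme value 110.0 pulls the mean from about 96.8 up to about 99.0, while the median only shifts from 96.9 to 96.95.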

 

MEAN, MEDIAN, STANDARD DEVIATION and ONE SAMPLE T-Test in PYTHON:

As mentioned in the referenced site, in the gender column the value 1 corresponds to male while 2 corresponds to female. Let’s replace the corresponding values in our DataFrame:

>>> clean_ups = {"gender": {1: "male", 2: "female"}}
>>> df.replace(clean_ups, inplace=True)
>>> df.head()
   temperature gender  heart_rate
0         96.3   male          70
1         96.7   male          71
2         96.9   male          74
3         97.0   male          80
4         97.1   male          73

To get some common statistics for our DataFrame, let’s use the pandas describe function, which reports the means and standard deviations of the numeric columns:

>>> df.describe()
       temperature  heart_rate
count   130.000000  130.000000
mean     98.249231   73.761538
std       0.733183    7.062077
min      96.300000   57.000000
25%      97.800000   69.000000
50%      98.300000   74.000000
75%      98.700000   79.000000
max     100.800000   89.000000


>>> df.median(numeric_only=True)
temperature    98.3
heart_rate     74.0
dtype: float64

To verify our results, we will perform one sample t-test as explained earlier:

>>> one_sample = st.ttest_1samp(df['temperature'], popmean=98.6)
>>> one_sample
Ttest_1sampResult(statistic=-5.4548232923640771, pvalue=2.4106320415610081e-07)

The t-statistic is -5.455 and the p-value is 0.0000002411. Such a low p-value means that, if the population’s mean temperature really were 98.6 (as posed in Case Study Analysis # 2’s question), a sample mean as far from it as ours would be extremely unlikely. So we can reject the null hypothesis that the population mean is 98.6 degrees F.
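
To make the test less of a black box, here is a rough sketch that recomputes the t-statistic by hand from the summary values reported by describe() above, using t = (sample mean - hypothesized mean) / (s / sqrt(n)). The variable names are illustrative, and the two-sided p-value is taken from the Student t distribution with n - 1 degrees of freedom.

```python
import math
import scipy.stats as st

# Summary statistics copied from the describe() output above.
n = 130
sample_mean = 98.249231
sample_std = 0.733183      # pandas std uses ddof=1
hypothesized_mean = 98.6

# Standard error of the mean, then the t-statistic.
standard_error = sample_std / math.sqrt(n)
t_stat = (sample_mean - hypothesized_mean) / standard_error

# Two-sided p-value: probability of a |t| at least this large under the null.
p_value = 2 * st.t.sf(abs(t_stat), df=n - 1)

print(round(t_stat, 3))   # -5.455
print(p_value)
```

The hand-computed statistic agrees with the value returned by st.ttest_1samp, up to the rounding of the summary statistics.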

CASE STUDY ANALYSIS # 3:

Is There a Significant Difference Between Males and Females in Normal Temperature?

BACKGROUND:

To analyze the statistical significance of the difference between the mean temperatures of males and females, we will use a t-test and its t-statistic. A t-test’s statistical significance indicates whether or not the difference between two groups’ averages most likely reflects a “real” difference in the population from which the groups were sampled.

A one-sample t-test checks whether the mean of a sample differs significantly from a hypothesized population mean.
A two-sample t-test investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same.

In our case, since we want to compare two groups, Male and Female, we will use a two-sample t-test.

T-tests are interpreted through the p-value. A p-value is used in hypothesis testing to help you decide whether to reject the null hypothesis. The p-value measures the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence that you should reject it.

P-values are expressed as decimals, although they may be easier to read as percentages. For example, a p-value of 0.0254 is 2.54%. This means that, if the null hypothesis were true, there would be only a 2.54% chance of seeing results at least as extreme as yours purely by chance. That’s pretty small. On the other hand, a large p-value of 0.9 (90%) means that results like yours would be very common under the null hypothesis, so they give no evidence of a real effect. Therefore, the smaller the p-value, the more statistically “significant” your results.[5]
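
One way to build intuition for what a p-value threshold means is a small simulation, sketched below under stated assumptions. We repeatedly draw two samples from the same normal distribution (so the null hypothesis is true by construction), run a two-sample t-test each time, and count how often the p-value falls below 0.05. The distribution parameters, seed and number of trials are arbitrary illustrative choices.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
n_trials = 2000
alpha = 0.05

false_alarms = 0
for _ in range(n_trials):
    # Both groups come from the SAME distribution: the null hypothesis holds.
    group_a = rng.normal(loc=98.2, scale=0.7, size=65)
    group_b = rng.normal(loc=98.2, scale=0.7, size=65)
    _, p_value = st.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_alarms += 1

false_alarm_rate = false_alarms / n_trials
print(f"share of p-values below {alpha}: {false_alarm_rate:.3f}")
```

The share should come out close to 5%. Using 0.05 as the cut-off therefore means accepting roughly one false alarm in twenty when there is no real difference.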

T-test and T-statistics for Statistical Relationship in Data with Implementation in PYTHON

First let’s make two individual DataFrames for Male and Female Temperatures:

>>> male = df[df['gender'] == 'male']
>>> female = df[df['gender'] == 'female']

Let’s plot a histogram of the temperatures in both DataFrames:

>>> plt.hist(female['temperature'])
(array([  3.,   2.,   4.,  12.,  15.,  20.,   6.,   1.,   1.,   1.]), array([  96.4 ,   96.84,   97.28,   97.72,   98.16,   98.6 ,   99.04,
         99.48,   99.92,  100.36,  100.8 ]), <a list of 10 Patch objects>)
>>> plt.hist(male['temperature'])
(array([  1.,   2.,   5.,   7.,   8.,  14.,   8.,  11.,   5.,   4.]), array([ 96.3 ,  96.62,  96.94,  97.26,  97.58,  97.9 ,  98.22,  98.54,
        98.86,  99.18,  99.5 ]), <a list of 10 Patch objects>)
>>> plt.show()
Temperature Histogram for Male and Female

Now to get the size of DataFrames and their means:

>>> len(male)
65
>>> len(female)
65
>>> 
>>> male['temperature'].mean()
98.1046153846154
>>> female['temperature'].mean()
98.39384615384616

We have 65 records each for Male and Female, and there is a difference in the means. Now we will perform a two-sample t-test to check whether this difference is statistically significant:

>>> two_sample = st.ttest_ind(male['temperature'], female['temperature'])
>>> two_sample

The t-statistic is -2.285 and the p-value is 0.024, i.e. 2.4%. This means that, if males and females had the same mean temperature, a difference at least this large would occur by chance only about 2.4% of the time. That is below the conventional 5% significance level, so we reject the null hypothesis that the mean temperatures of Male and Female are equal.
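
For readers who want to see what ttest_ind computes, the sketch below reproduces the pooled-variance (Student) two-sample t-statistic by hand and compares it with SciPy. It runs on synthetic groups, so the numbers will differ from those above. The group_a and group_b arrays are assumptions for illustration, not the case-study data. It also shows equal_var=False, which switches to Welch's t-test and does not assume equal variances.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for the two temperature groups (NOT the real data).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=98.1, scale=0.7, size=65)
group_b = rng.normal(loc=98.4, scale=0.7, size=65)

n_a, n_b = len(group_a), len(group_b)
var_a = group_a.var(ddof=1)   # sample variances
var_b = group_b.var(ddof=1)

# Pooled variance, then the t-statistic with n_a + n_b - 2 degrees of freedom.
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * st.t.sf(abs(t_manual), df=n_a + n_b - 2)

student = st.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = st.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test

print(t_manual, student.statistic)
print(p_manual, student.pvalue)
print(welch.statistic, welch.pvalue)
```

With the real data, male['temperature'] and female['temperature'] would take the place of the two synthetic arrays.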

[1] https://en.wikipedia.org/wiki/Normal_distribution
[2] http://rishy.github.io/stats/2015/07/21/normal-distributions/
[3] https://www.khanacademy.org/math/probability/data-distributions-a1/summarizing-spread-distributions/a/calculating-standard-deviation-step-by-step
[4] https://tekmarathon.com/2015/11/13/importance-of-data-distribution-in-training-machine-learning-models/
[5] http://www.statisticshowto.com/support-or-reject-null-hypothesis/

Electronics Engineer by book, Software Architect and Technopreneur by passion, Open Source Enthusiast, Problem Hacker, Enabler, Do-Tank, Blogger, Autodidact, Yogi and an avid Reader. Involved in Building Products. Having loads of experience and technical expertise in areas ranging from Full Stack Web Application Development to Big Data Analysis, Modeling, Processing and Visualization, he is currently involved in working on Python, Django, Javascript, SQL, Bootstrap, PostgreSQL, RRD (Round Robin Database), MySQL, MonetDB, LevelDB, BerkeleyDB, Redis, Apache Spark, Pandas, SciPy, NumPy etc.

Ali Raza received his Masters Degree in Electronics Engineering which involved Research focused on Machine Learning. He is currently working as a Chief Technical Officer at BitWits (Pvt) Limited, CEO & Founder at DataLysis.io and CEO & Founder at LearningByDoing.io.
