In this post of the series **“Foundational Statistics for Data Science and Machine Learning with Python”**, we will go through Statistical Case Study presented for data of human body’s Stats published here. The Case Study poses some Analytical Cases which will help in polishing statistical skills required for Data Analysis, Data Science and Machine Learning. For each Case Study Analysis, we will overview the required Statistics background and then implement those statistical principles in Python with Pandas.

## DATA OVERVIEW:

The data contains three attributes, namely Temperature, Gender and Heart Rate. You can download the data from here. We will use Pandas DataFrames and import the data directly from the link.

```python
>>> import pandas as pd
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> import scipy.stats as st
>>> import numpy as np
>>> df1 = pd.read_csv('https://ww2.amstat.org/publications/jse/datasets/normtemp.dat.txt')
>>> df1.head()
   96.3    1   70
0  96.7    1   71
1  96.9    1   74
2  97.0    1   80
3  97.1    1   73
4  97.1    1   75
```

## DATA CLEANING:

Let’s look at the shape of our DataFrame df1 to see number of rows and columns:

```python
>>> df1.shape
(129, 1)
```

Note that if we download and open the text file in a text editor, we will see that each record is on a new line and each field is separated by spaces. Since the default separator for the read_csv function is a comma, the code above imported our data into a single column (and consumed the first record as a header).

To handle this, we will use separator argument for read_csv as follows:

```python
>>> df = pd.read_csv('https://ww2.amstat.org/publications/jse/datasets/normtemp.dat.txt',
...                  sep='\s+', header=None)
>>> df.head()
      0  1   2
0  96.3  1  70
1  96.7  1  71
2  96.9  1  74
3  97.0  1  80
4  97.1  1  73
>>> df.shape
(130, 3)
```

So now we have 130 records and 3 columns, as expected. We used the **header=None** argument so that Pandas treats the first row of the file as data rather than as column headings, which are not present in the data file.

We will define column names as follows:

```python
>>> df.columns = ['temperature', 'gender', 'heart_rate']
```

```python
>>> df.head()
   temperature  gender  heart_rate
0         96.3       1          70
1         96.7       1          71
2         96.9       1          74
3         97.0       1          80
4         97.1       1          73
```

**CASE STUDY ANALYSIS # 1:**

*– Is the distribution of temperatures normal?*

**Normal Distribution Background and Why It's Important in Training our Data Models for Data Science:**


In probability theory, the normal (aka Gaussian) distribution is a very common class of statistical distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. The normal distribution is sometimes informally called the bell curve.[1]

Many **Machine Learning Models** expect that the data fed to them follows a normal distribution[2]. So, after you have cleaned your data, you should definitely check what distribution it follows. Some of the **Machine Learning and Statistical Models** which assume normally distributed input data are:

- Gaussian Naive Bayes
- Least Squares based (regression) models
- LDA (Linear Discriminant Analysis)
- QDA (Quadratic Discriminant Analysis)

**Skewness:** The coefficient of skewness measures the degree of asymmetry in a variable's distribution: it is near zero for a symmetric distribution, positive when the right tail is longer, and negative when the left tail is longer.
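To make the definition concrete, here is a small sketch (on synthetic samples, not the case-study data) computing the skewness coefficient with `scipy.stats.skew` for a symmetric and a right-tailed sample:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=98.6, scale=0.7, size=1000)  # bell-shaped sample
right_tailed = rng.exponential(scale=1.0, size=1000)    # long right tail

# Skewness is ~0 for the symmetric sample and clearly positive
# for the right-tailed one.
print(round(st.skew(symmetric), 2))
print(round(st.skew(right_tailed), 2))
```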

**Python Implementation and Analysis of Normal Distribution:**


Let’s plot a histogram to visualize the distribution of *temperature* attribute of the data:

```python
>>> pd.DataFrame.hist(df, column='temperature')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x...>]], dtype=object)
>>> plt.show()
```

Here we see that the distribution looks slightly skewed to the right in the histogram plot above, but broadly takes the form of a normal distribution.
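Beyond the visual check, scipy also offers formal normality tests. The sketch below uses `scipy.stats.normaltest` (the D'Agostino-Pearson test) on a synthetic sample standing in for the temperature column, since the real data requires a download; the null hypothesis is that the data come from a normal distribution:

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for df['temperature']: 130 draws with roughly the
# sample mean and standard deviation that describe() reports for the data.
rng = np.random.default_rng(42)
sample = rng.normal(loc=98.25, scale=0.73, size=130)

# D'Agostino-Pearson test; small p-values argue against normality.
stat, p = st.normaltest(sample)
print("reject normality" if p < 0.05 else "consistent with normality")
```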

**CASE STUDY ANALYSIS # 2:**

*– Is the true population mean really 98.6 degrees F?*

*BACKGROUND*:


Some basics to recollect before going through the distribution: mean, median, mode, variance and standard deviation.

**Mean:** The sum of all observations divided by the number of observations.

**Median:** When all the observations are sorted in the ascending order, the median is exactly the middle value.

– Median is equal to 50th percentile.

– If the distribution of the data is Normal, then the median is equal to the arithmetic mean (which also equals **Mode**).

– The median is not sensitive to extreme values/outliers/noise, and therefore it may be a better measure of central tendency than the arithmetic mean.
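A quick illustration of this last point, using made-up temperature readings rather than the case-study data:

```python
import numpy as np

# Five plausible body temperatures, then the same readings plus one
# extreme (feverish) outlier.
temps = [97.0, 97.8, 98.2, 98.6, 98.9]
with_outlier = temps + [104.0]

# The outlier pulls the mean up by almost a degree,
# while the median barely moves.
print(round(np.mean(temps), 2), round(np.median(temps), 2))                # 98.1 98.2
print(round(np.mean(with_outlier), 2), round(np.median(with_outlier), 2))  # 99.08 98.4
```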

**Variance and Standard Deviation:** Standard deviation measures the spread of the data. The average of squared differences from the mean is the **variance**, and the square root of the variance is the standard deviation.[3][4]
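A sketch of these definitions on made-up numbers. One subtlety worth knowing: pandas' `describe()` and `std()` use the sample formula (`ddof=1`), dividing by n − 1, while the plain definition above divides by n:

```python
import numpy as np

data = np.array([96.3, 97.0, 98.2, 98.6, 99.5])  # made-up readings

# Variance: the average of squared differences from the mean (divides by n).
mean = data.mean()
var_population = ((data - mean) ** 2).mean()
std_population = var_population ** 0.5

# pandas' describe()/std() divide by n - 1 instead (sample std, ddof=1),
# so the two values differ slightly.
print(round(std_population, 4))
print(round(data.std(ddof=1), 4))
```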

**One Sample T-Test:** To verify our results, we will perform a one-sample t-test. A **one-sample t-test** checks whether a sample mean differs from a hypothesized population mean.

**MEAN, MEDIAN, STANDARD DEVIATION and ONE SAMPLE T-Test in PYTHON:**


As mentioned in the referenced site, in the gender column the value 1 corresponds to male while 2 corresponds to female. Let's replace the corresponding values in our DataFrame:

```python
>>> clean_ups = {"gender": {1: "male", 2: "female"}}
>>> df.replace(clean_ups, inplace=True)
>>> df.head()
   temperature gender  heart_rate
0         96.3   male          70
1         96.7   male          71
2         96.9   male          74
3         97.0   male          80
4         97.1   male          73
```

To get some common stats for our DataFrame, let's use pandas' describe function to obtain the means and standard deviations:

```python
>>> df.describe()
       temperature  heart_rate
count   130.000000  130.000000
mean     98.249231   73.761538
std       0.733183    7.062077
min      96.300000   57.000000
25%      97.800000   69.000000
50%      98.300000   74.000000
75%      98.700000   79.000000
max     100.800000   89.000000
>>> df.median()
temperature    98.3
heart_rate     74.0
dtype: float64
```

To verify our results, we will perform one sample t-test as explained earlier:

```python
>>> one_sample = st.ttest_1samp(df['temperature'], popmean=98.6)
>>> one_sample
Ttest_1sampResult(statistic=-5.4548232923640771, pvalue=2.4106320415610081e-07)
```

The t-statistic is -5.455 and the p-value is 0.0000002411. Since the p-value is very low, it is highly unlikely that the population's mean temperature is 98.6 (as asked in Case Study Analysis # 2's question). So we can reject the null hypothesis posed in this Case Study, that the population mean is really 98.6 degrees F.
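As a sanity check, we can recompute the t-statistic by hand from the summary statistics reported by `describe()` above, using t = (x̄ − μ) / (s / √n):

```python
import math

# Summary statistics from describe() above.
sample_mean = 98.249231
sample_std = 0.733183
n = 130

# One-sample t-statistic against the hypothesized mean of 98.6.
t = (sample_mean - 98.6) / (sample_std / math.sqrt(n))
print(round(t, 4))  # matches the statistic reported by ttest_1samp
```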

**CASE STUDY ANALYSIS # 3:**

Is There a Significant Difference Between Males and Females in Normal Temperature?

*BACKGROUND*:


To analyze statistical significance of the difference between the Mean Temperatures of Male and Female, we will use **t-statistic** with **t-test**. A t-test’s statistical significance indicates whether or not the difference between two groups’ averages most likely reflects a “real” difference in the population from which the groups were sampled.

A **one-sample t-test** checks whether a sample mean differs from the population(whole data) mean.

A **two-sample t-test** investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same.

In our case, since we want to compare two groups, male and female temperatures, we will use the **two-sample t-test**.

T-tests are supported by the **p-value**. A p-value is used in hypothesis testing to help you support or reject the null hypothesis: it quantifies the evidence against the null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

**P-values** are expressed as decimals, although it may be easier to understand them as percentages. For example, a p-value of 0.0254 is 2.54%. This means that, if the null hypothesis were true, there would only be a 2.54% chance of observing results at least this extreme by chance. That's pretty tiny. On the other hand, a large p-value of 0.9 (90%) means results like yours would be entirely unsurprising under the null hypothesis. Therefore, the smaller the p-value, the more important ("significant") your results.[5]
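To tie the p-value back to the t-statistic, the sketch below recomputes the two-sided p-value of the one-sample test from Case Study # 2, using the t distribution's survival function with degrees of freedom n − 1 = 129:

```python
import scipy.stats as st

t_stat = -5.4548   # one-sample t-statistic from Case Study # 2
dof = 129          # n - 1 for the 130 temperature readings

# Two-sided p-value: probability of a statistic at least this extreme,
# in either direction, if the null hypothesis were true.
p = 2 * st.t.sf(abs(t_stat), df=dof)
print(p)  # on the order of 1e-7, matching ttest_1samp
```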

**T-test and T-statistics for Statistical Relationship in Data with Implementation in PYTHON**


First let’s make two individual DataFrames for Male and Female Temperatures:

```python
>>> male = df[df['gender'] == 'male']
>>> female = df[df['gender'] == 'female']
```

Let's plot a histogram of the temperatures in both DataFrames:

```python
>>> bins = np.linspace(97, 99, 1000)
>>> plt.hist(female['temperature'])
(array([  3.,   2.,   4.,  12.,  15.,  20.,   6.,   1.,   1.,   1.]),
 array([  96.4 ,   96.84,   97.28,   97.72,   98.16,   98.6 ,   99.04,
          99.48,   99.92,  100.36,  100.8 ]),
 <a list of 10 Patch objects>)
>>> plt.hist(male['temperature'])
(array([  1.,   2.,   5.,   7.,   8.,  14.,   8.,  11.,   5.,   4.]),
 array([ 96.3 ,  96.62,  96.94,  97.26,  97.58,  97.9 ,  98.22,  98.54,
         98.86,  99.18,  99.5 ]),
 <a list of 10 Patch objects>)
>>> plt.show()
```

Now to get the size of DataFrames and their means:

```python
>>> len(male)
65
>>> len(female)
65
>>> male['temperature'].mean()
98.1046153846154
>>> female['temperature'].mean()
98.39384615384616
```

We have 65 records each for males and females, and there is a difference in the means. Now we will perform a two-sample t-test to check whether this difference is statistically significant:

```python
>>> two_sample = st.ttest_ind(male['temperature'], female['temperature'])
>>> two_sample
```

The t-statistic is -2.285 and the p-value is 0.024. As explained earlier, the p-value of 0.024 corresponds to 2.4%. This means there is only a 2.4% chance we would see a difference this large if the means were actually equal. That's pretty low, and therefore it's highly unlikely that the mean temperatures of males and females are equal; we reject the null hypothesis.
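One caveat worth noting: by default `ttest_ind` assumes the two groups have equal variances (Student's t-test); passing `equal_var=False` gives Welch's t-test, which drops that assumption. A sketch on synthetic samples (drawn around the reported group means, not the real downloaded data):

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for the two groups: 65 draws each, centered
# roughly on the group means reported above.
rng = np.random.default_rng(1)
male = rng.normal(loc=98.10, scale=0.70, size=65)
female = rng.normal(loc=98.39, scale=0.74, size=65)

t_student, p_student = st.ttest_ind(male, female)               # equal variances assumed
t_welch, p_welch = st.ttest_ind(male, female, equal_var=False)  # Welch's t-test
print(round(t_student, 3), round(t_welch, 3))
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ, so Welch's test is a cheap robustness check here.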

[1] https://en.wikipedia.org/wiki/Normal_distribution

[2] http://rishy.github.io/stats/2015/07/21/normal-distributions/

[3] https://www.khanacademy.org/math/probability/data-distributions-a1/summarizing-spread-distributions/a/calculating-standard-deviation-step-by-step

[4] https://tekmarathon.com/2015/11/13/importance-of-data-distribution-in-training-machine-learning-models/

[5] http://www.statisticshowto.com/support-or-reject-null-hypothesis/

Electronics Engineer by book, Software Architect and Technopreneur by passion, Open Source Enthusiast, Problem Hacker, Enabler, Do-Tank, Blogger, Autodidact, Yogi and an avid Reader. Involved in Building Products. Having loads of experience and technical expertise in areas ranging from Full Stack Web Application Development to Big Data Analysis, Modeling, Processing and Visualization, he is currently involved in working on Python, Django, Javascript, SQL, Bootstrap, PostgreSQL, RRD (Round Robin Database), MySQL, MonetDB, LevelDB, BerkeleyDB, Redis, Apache Spark, Pandas, SciPy, NumPy etc.

Ali Raza received his Masters Degree in Electronics Engineering which involved Research focused on Machine Learning. He is currently working as a Chief Technical Officer at BitWits (Pvt) Limited, CEO & Founder at DataLysis.io and CEO & Founder at LearningByDoing.io.
