The post Statistical Case Study for Data Science – Foundational Statistics for Data Science and Machine Learning with Python – Learning By Doing appeared first on Learning By Doing - A Hands On Approach To Professional Development.

The data contains three attributes: Temperature, Gender, and Heart Rate. You can download the data from here. We will be using Pandas DataFrames and will import the data directly from the link.

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as st
import numpy as np

>>> df1 = pd.read_csv('https://ww2.amstat.org/publications/jse/datasets/normtemp.dat.txt')
>>> df1.head()
   96.3 1 70
0  96.7 1 71
1  96.9 1 74
2  97.0 1 80
3  97.1 1 73
4  97.1 1 75

Let’s look at the shape of our DataFrame df1 to see number of rows and columns:

>>> df1.shape
(129, 1)

Note that when we download and open the text file in a text editor, we can see that each record is on a new line and each field is separated by some spaces. Since the default separator for the read_csv function is a comma, the code above imported our data into a single column.

To handle this, we will use the separator argument of read_csv as follows:

df = pd.read_csv('https://ww2.amstat.org/publications/jse/datasets/normtemp.dat.txt', sep='\s+', header=None)
>>> df.head()
      0  1   2
0  96.3  1  70
1  96.7  1  71
2  96.9  1  74
3  97.0  1  80
4  97.1  1  73
>>> df.shape
(130, 3)

So now we have 130 records and 3 columns as expected. We used the **header=None** argument so that Pandas treats the first row of the data file as data rather than as column headings, which are not present in the file.

We will define column names as follows:

df.columns = ['temperature', 'gender','heart_rate']

>>> df.head()
   temperature gender  heart_rate
0         96.3      1          70
1         96.7      1          71
2         96.9      1          74
3         97.0      1          80
4         97.1      1          73

*– Is the distribution of temperatures normal?*

In probability theory, the normal (aka Gaussian) distribution is a very common class of statistical distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. The normal distribution is sometimes informally called the bell curve.[1]

Many **Machine Learning models** expect that the data fed to them follows a normal distribution[2]. So, after you have cleaned your data, you should definitely check what distribution it follows. Some of the models that assume normality are:

- Gaussian naive Bayes
- Least Squares based (regression)models
- LDA
- QDA

**Skewness:** The coefficient of skewness is a measure of the degree of symmetry in a variable's distribution.
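As a quick numeric check, pandas exposes this coefficient directly via Series.skew(); a minimal sketch on a small made-up sample (not the temperature data):

```python
import pandas as pd

# A symmetric sample should have skewness near 0,
# while a long right tail pushes the coefficient positive.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])

print(symmetric.skew())     # near zero for a symmetric sample
print(right_skewed.skew())  # positive for a right-skewed sample
```

A positive coefficient indicates a right-skewed distribution, a negative one a left-skewed distribution.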

Let’s plot a histogram to visualize the distribution of *temperature* attribute of the data:

>>> pd.DataFrame.hist(df, column='temperature')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x...>]], dtype=object)
>>> plt.show()

Here we see that the distribution looks slightly right-skewed in the histogram above, but it still takes the approximate form of a normal (bell-shaped) distribution.

*– Is the true population mean really 98.6 degrees F?*

Some basics to recall before we go through the distribution: mean, median, mode, variance, and standard deviation.

**Mean:** The sum of all observations divided by the number of observations.

**Median:** When all the observations are sorted in the ascending order, the median is exactly the middle value.

– The median is equal to the 50th percentile.

– If the distribution of the data is Normal, then the median is equal to the arithmetic mean (which also equals **Mode**).

– The median is not sensitive to extreme values/outliers/noise, and therefore it may be a better measure of central tendency than the arithmetic mean.

**Variance and Standard Deviation:** The standard deviation measures the spread of the data. The average of the squared differences from the mean is the **variance**, and the square root of the variance is the standard deviation.[3][4]
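These definitions map directly onto pandas methods; a sketch with a toy sample (not the temperature data), using the population variance to match the "average of squared differences" definition above:

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # toy sample

mean = s.mean()           # sum of observations / count = 40 / 8 = 5.0
median = s.median()       # middle value of the sorted data = 4.5
variance = s.var(ddof=0)  # average squared difference from the mean = 4.0
std_dev = s.std(ddof=0)   # square root of the variance = 2.0

print(mean, median, variance, std_dev)
```

Note that pandas defaults to ddof=1 (the sample variance, dividing by n − 1), which is what describe() reports further below.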

**One Sample T-Test:** To verify our results, we will perform a one-sample t-test. A **one-sample t-test** checks whether a sample mean differs from the population (whole data) mean.

As mentioned on the referenced site, in the gender column the value 1 corresponds to male while 2 corresponds to female. Let's replace the corresponding values in our DataFrame:

>>> clean_ups = {"gender": {1: "male", 2: "female"}}
>>> df.replace(clean_ups, inplace=True)
>>> df.head()
   temperature gender  heart_rate
0         96.3   male          70
1         96.7   male          71
2         96.9   male          74
3         97.0   male          80
4         97.1   male          73

To get some common stats for our DataFrame, let's use Pandas' describe function to obtain our means and standard deviations:

>>> df.describe()
       temperature  heart_rate
count   130.000000  130.000000
mean     98.249231   73.761538
std       0.733183    7.062077
min      96.300000   57.000000
25%      97.800000   69.000000
50%      98.300000   74.000000
75%      98.700000   79.000000
max     100.800000   89.000000
>>> df.median()
temperature    98.3
heart_rate     74.0
dtype: float64

To verify our results, we will perform one sample t-test as explained earlier:

>>> one_sample = st.ttest_1samp(df['temperature'], popmean=98.6)
>>> one_sample
Ttest_1sampResult(statistic=-5.4548232923640771, pvalue=2.4106320415610081e-07)

The t-statistic is -5.455 and the p-value is 0.0000002411. Since the p-value is very low, it is highly unlikely that the population's mean temperature is 98.6 (as given in this Case Study's question). So we can confidently reject the null hypothesis posed in this Case Study that the population mean is really 98.6 degrees F.
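The t-statistic can also be reproduced by hand from its formula, t = (x̄ − μ0) / (s / √n). A sketch using the summary numbers reported by describe() above (mean ≈ 98.249, std ≈ 0.733, n = 130):

```python
import math

# Summary statistics taken from df.describe() above
n = 130
sample_mean = 98.249231
sample_std = 0.733183   # sample standard deviation (ddof=1)
pop_mean = 98.6         # hypothesized population mean

t_stat = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))
print(round(t_stat, 3))  # ≈ -5.455, matching ttest_1samp above
```

The denominator, s / √n, is the standard error of the mean; the t-statistic counts how many standard errors the sample mean sits from the hypothesized mean.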

Is There a Significant Difference Between Males and Females in Normal Temperature?

To analyze the statistical significance of the difference between the mean temperatures of males and females, we will use the **t-statistic** from a **t-test**. A t-test's statistical significance indicates whether or not the difference between two groups' averages most likely reflects a "real" difference in the population from which the groups were sampled.

A **one-sample t-test** checks whether a sample mean differs from the population (whole data) mean.

A **two-sample t-test** investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same.

In our case, since we want to compare two groups, Male and Female, we will use the **two-sample t-test**.

T-tests are supported by the **p-value**. A p-value is used in hypothesis testing to help you support or reject the null hypothesis: it quantifies the evidence against the null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

**P-values** are expressed as decimals, although it may be easier to understand them as percentages. For example, a p-value of 0.0254 is 2.54%: roughly speaking, there is a 2.54% chance of seeing results at least this extreme if the null hypothesis were true. That's pretty small. On the other hand, a large p-value of 0.9 (90%) means the observed results are entirely consistent with chance and not due to anything in your experiment. Therefore, the smaller the p-value, the more important ("significant") your results.[5]
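In code, this boils down to comparing the p-value against a pre-chosen significance level, commonly α = 0.05. A minimal sketch (the decide function is a hypothetical helper, not part of scipy):

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.0254))  # small p-value -> reject
print(decide(0.9))     # large p-value -> fail to reject
```

Note that a large p-value never "proves" the null hypothesis; it only means we lack evidence against it.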

First let’s make two individual DataFrames for Male and Female Temperatures:

>>> male = df[df['gender'] == 'male']
>>> female = df[df['gender'] == 'female']

Let's plot a histogram of the temperatures in both DataFrames:

>>> plt.hist(female['temperature'])
(array([  3.,   2.,   4.,  12.,  15.,  20.,   6.,   1.,   1.,   1.]),
 array([  96.4 ,   96.84,   97.28,   97.72,   98.16,   98.6 ,   99.04,
          99.48,   99.92,  100.36,  100.8 ]),
 <a list of 10 Patch objects>)
>>> plt.hist(male['temperature'])
(array([  1.,   2.,   5.,   7.,   8.,  14.,   8.,  11.,   5.,   4.]),
 array([ 96.3 ,  96.62,  96.94,  97.26,  97.58,  97.9 ,  98.22,  98.54,
         98.86,  99.18,  99.5 ]),
 <a list of 10 Patch objects>)
>>> plt.show()

Now let's get the size of the DataFrames and their means:

>>> len(male)
65
>>> len(female)
65
>>> male['temperature'].mean()
98.1046153846154
>>> female['temperature'].mean()
98.39384615384616

We have 65 records each for males and females, and there is a difference in the means. Now we will perform a two-sample t-test to check whether this difference is statistically significant:

>>> two_sample = st.ttest_ind(male['temperature'], female['temperature'])
>>> two_sample
Ttest_indResult(statistic=-2.2854345381656103, pvalue=0.02393188312239561)

The t-statistic is -2.285 and the p-value is 0.024. As explained earlier, a p-value of 0.024 corresponds to 2.4%. This means there is only a 2.4% chance that our result could have arisen by chance. That's pretty low, so it's highly unlikely that the mean temperatures of males and females are equal; at the usual 5% significance level, we reject the null hypothesis that they are the same.

[1] https://en.wikipedia.org/wiki/Normal_distribution

[2] http://rishy.github.io/stats/2015/07/21/normal-distributions/

[3] https://www.khanacademy.org/math/probability/data-distributions-a1/summarizing-spread-distributions/a/calculating-standard-deviation-step-by-step

[4] https://tekmarathon.com/2015/11/13/importance-of-data-distribution-in-training-machine-learning-models/

[5] http://www.statisticshowto.com/support-or-reject-null-hypothesis/


The post Pivoting Data with Pandas in Python – Data Analytics in Python – Learning By Doing appeared first on Learning By Doing - A Hands On Approach To Professional Development.

In this section of the Data Analytics in Python series, we will go through pivot tables in Pandas, a handy technique for exploring data from different dimensions and extracting insights from it. Pivoting is one of the main techniques used in Business Intelligence solutions and Data Science for slicing and dicing data.

Let's consider the initial sales data of 5 regions for the years 2016, 2017 and 2018.

Year | Central | East | North | South | West
2016 | 300     | 150  | 500   | 325   | 200
2017 | 200     | 300  | 450   | 300   | 200
2018 | 250     | 225  | 150   | 375   | 150

**Table 1: Initial Sales Data Table**

Now suppose a report needs to be generated in which the values of Year are shown as columns, with a summary for each year, something like the following:

Year    | 2016 | 2017 | 2018
Central | 300  | 200  | 250
East    | 150  | 300  | 225
North   | 500  | 450  | 150
South   | 325  | 300  | 375
West    | 200  | 200  | 150

**Table 2: Preview of the Desired Result **

To generate the required output illustrated above, we will need to pivot our data. Let's dive into Python and first make a Pandas DataFrame out of the initial data presented in **Table 1**:

import pandas as pd
df = pd.DataFrame({'Year': ['2016', '2017', '2018'],
                   'North': [500, 450, 150],
                   'East': [150, 300, 225],
                   'South': [325, 300, 375],
                   'West': [200, 200, 150],
                   'Central': [300, 200, 250]})
>>> df
   Central  East  North  South  West  Year
0      300   150    500    325   200  2016
1      200   300    450    300   200  2017
2      250   225    150    375   150  2018

To generate the required output illustrated in **Table 2**, we will use Pandas DataFrame’s pivot_table:

>>> df.pivot_table(columns='Year')
Year     2016  2017  2018
Central   300   200   250
East      150   300   225
North     500   450   150
South     325   300   375
West      200   200   150

A more readable output of the above function is as follows:

Year    | 2016 | 2017 | 2018
Central | 300  | 200  | 250
East    | 150  | 300  | 225
North   | 500  | 450  | 150
South   | 325  | 300  | 375
West    | 200  | 200  | 150

**Table 3: Desired Result Generated with Pandas Pivot Table **

In this way, we can change the dimensions of our data and generate different insights out of it.
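pivot_table becomes even more useful when the data is in long format with repeated keys, because it then aggregates the duplicates. A sketch with made-up long-format sales records (loosely echoing Table 1, where North's 2016 total is 500):

```python
import pandas as pd

# Long-format sales records with repeated (Year, Region) pairs,
# so the aggregation actually has something to combine.
sales = pd.DataFrame({'Year': ['2016', '2016', '2016', '2017', '2017'],
                      'Region': ['North', 'North', 'East', 'North', 'East'],
                      'Amount': [300, 200, 150, 450, 300]})

# aggfunc='sum' totals each (Region, Year) cell;
# by default pivot_table would take the mean instead.
pivoted = sales.pivot_table(index='Region', columns='Year',
                            values='Amount', aggfunc='sum')
print(pivoted)
```

Here the two North/2016 records (300 and 200) collapse into a single cell of 500.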


The post Applying Arbitrary Functions for Grouping Data with Pandas – Data Analytics in Python – Learning By Doing appeared first on Learning By Doing - A Hands On Approach To Professional Development.

>>> import pandas as pd
>>> df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                       'Age': [18, 19, 18, 19],
                       'Math': [75, 82, 89, 85],
                       'Science': [65, 75, 86, 90],
                       'Teacher': ['William', 'William', 'Robert', 'Robert']})

Just to get an idea how our data looks, we can print the records as a Table:

>>> df.head()
   Age  Math  Science Student  Teacher
0   18    75       65    Beth  William
1   19    82       75    Alex  William
2   18    89       86   Diana   Robert
3   19    85       90  Adrian   Robert

Consider the following max function applied on GroupBy Teacher:

>>> df.groupby('Teacher').max()
         Age  Math  Science Student
Teacher
Robert    19    89       90   Diana
William   19    82       75    Beth

The pre-defined max function can also be used in the following way:

>>> df.groupby('Teacher').apply(max)
         Age  Math  Science Student  Teacher
Teacher
Robert    19    89       90   Diana   Robert
William   19    82       75    Beth  William

In the code above, we passed the built-in max function as an argument to the apply function. Notice that in the same way we can also pass custom-defined functions and get our desired results. Let's define a function which finds the best teacher in our case:

def best_teacher(group_dframe):
    return pd.DataFrame({'Math': [group_dframe.loc[group_dframe.Math.idxmax()].Teacher],
                         'Science': [group_dframe.loc[group_dframe.Science.idxmax()].Teacher]})

The function above takes a Pandas grouped DataFrame as an argument and returns a DataFrame with the teacher's name corresponding to each subject's maximum score.

Let's examine the function more closely. Consider the list which is being passed as the value for the key 'Math' in the dictionary defined in the function above:

[group_dframe.loc[group_dframe.Math.idxmax()].Teacher]

Let's dissect the above list step by step for a better understanding of what's going on.

group_dframe.Math.idxmax()

The above line returns the index of the maximum value for Math.

group_dframe.loc[group_dframe.Math.idxmax()]

Now, using the .loc function, we fetch the row at the previously obtained index of the maximum value for Math. For more on .loc, you can see my post How to use .loc, .iloc, .ix in Pandas.

Now finally:

group_dframe.loc[group_dframe.Math.idxmax()].Teacher

The line above fetches the Teacher from the row extracted in the previous step. Since that row holds the maximum score for Math, the teacher returned here is the one whose student got the maximum marks in Math.
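Running the three steps above on the full DataFrame (instead of a group) makes each intermediate value visible:

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

idx = df.Math.idxmax()  # index label of the highest Math score
row = df.loc[idx]       # the full row at that index
teacher = row.Teacher   # the Teacher on that row

print(idx, row.Student, teacher)  # Diana (index 2) has the top Math score; her teacher is Robert
```

Inside best_teacher, the same chain runs once per group, so each age group reports its own top-scoring teacher.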

Now let's group the DataFrame by Age and apply our function:

>>> group_dframe = df.groupby('Age')
>>> group_dframe.apply(best_teacher)
          Math Science
Age
18  0   Robert  Robert
19  0   Robert  Robert

In this way, we fetched the best teacher for each subject within each age group, based on the maximum scores.


The post Introduction to Grouping Data with Pandas – Data Analytics in Python – Learning By Doing appeared first on Learning By Doing - A Hands On Approach To Professional Development.

In this tutorial from the Data Analytics in Python series, we will use Python with Pandas to generate simple insights from data using some grouping techniques. Grouping of data is basically aggregation of data on the basis of some columns or attributes: groupby splits the data into different groups depending on the columns provided.

Let's consider simple data of students and their marks in two subjects, along with their respective teachers:

>>> import pandas as pd
>>> df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                       'Age': [18, 19, 18, 19],
                       'Math': [75, 82, 89, 85],
                       'Science': [65, 75, 86, 90],
                       'Teacher': ['William', 'William', 'Robert', 'Robert']})

Just to get an idea how our data looks, we can print the records as a Table:

>>> df.head()
   Age  Math  Science Student  Teacher
0   18    75       65    Beth  William
1   19    82       75    Alex  William
2   18    89       86   Diana   Robert
3   19    85       90  Adrian   Robert

Now we will try to extract some basic insights from our Pandas DataFrame using the groupby function:

>>> df.groupby('Teacher').describe()
                      Age       Math    Science
Teacher
Robert  count   2.000000   2.000000   2.000000
        mean   18.500000  87.000000  88.000000
        std     0.707107   2.828427   2.828427
        min    18.000000  85.000000  86.000000
        25%    18.250000  86.000000  87.000000
        50%    18.500000  87.000000  88.000000
        75%    18.750000  88.000000  89.000000
        max    19.000000  89.000000  90.000000
William count   2.000000   2.000000   2.000000
        mean   18.500000  78.500000  70.000000
        std     0.707107   4.949747   7.071068
        min    18.000000  75.000000  65.000000
        25%    18.250000  76.750000  67.500000
        50%    18.500000  78.500000  70.000000
        75%    18.750000  80.250000  72.500000
        max    19.000000  82.000000  75.000000

Here we can already see some direct insights about the teachers. For instance, Robert's students are performing better than William's, judging by the mean values produced above. This might suggest that Robert is a more effective teacher than William, or simply that he has stronger students. We can filter Robert's data from the DataFrame as follows to validate this insight:

>>> df[df['Teacher'] == 'Robert']
   Age  Math  Science Student Teacher
2   18    89       86   Diana  Robert
3   19    85       90  Adrian  Robert

For more Data Filtering Techniques using Pandas, visit Data Analytics in Python – Data Filtering with Pandas – Learning By Doing

We can go further by getting the medians from our Pandas DataFrame:

>>> df.groupby('Teacher').median()
          Age  Math  Science
Teacher
Robert   18.5  87.0     88.0
William  18.5  78.5     70.0

And we can extract further insights on the basis of teachers and their students' age by grouping by two columns and taking the median, in the following way:

>>> df.groupby(['Teacher', 'Age']).median()
              Math  Science
Teacher Age
Robert  18      89       86
        19      85       90
William 18      75       65
        19      82       75
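Beyond describe() and median(), groupby also accepts several aggregations at once via agg; a sketch on the same student data:

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Compute several summary statistics per teacher in one call
summary = df.groupby('Teacher')[['Math', 'Science']].agg(['mean', 'min', 'max'])
print(summary)
```

The result carries a two-level column index (subject, statistic), so for example `summary.loc['Robert', ('Math', 'mean')]` retrieves Robert's mean Math score of 87.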

In the next post, we will look at how we can apply Arbitrary functions while using group by in Pandas.


The post Data Analytics in Python – Data Filtering with Pandas – Learning By Doing appeared first on Learning By Doing - A Hands On Approach To Professional Development.


Machine Learning and Data Analytics initially involve going through the data. In this Hands On tutorial, we will use Python with Pandas for data filtering. Just as we filter data in SQL, Pandas provides several ways to filter data so that analysis can be performed on a specific subset.

For this Hands On tutorial for Machine Learning, we will use the Iris data from the UCI Machine Learning Repository.

Let's fetch the data and define it as a Pandas DataFrame:

import pandas as pd
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True)  # drops the empty line at file-end

You may print and check the data by:

df.head()
df.tail()

Filter Results:

# Select rows where df.petal_len is greater than 4.5
df[df['petal_len'] > 4.5]

FILTER WITH ‘AND’ LOGICAL OPERATOR in PANDAS

# Select rows where df.petal_len is greater than 4.5 AND less than 5.5
df[(df['petal_len'] > 4.5) & (df['petal_len'] < 5.5)]

FILTER WITH ‘OR’ LOGICAL OPERATOR in PANDAS

# Select rows where df.petal_len is greater than 5.5 OR less than 2.0
df[(df['petal_len'] > 5.5) | (df['petal_len'] < 2.0)]

FILTER WITH ‘NOT’ OPERATOR in PANDAS

# Select all the classes (Iris flower types) except Iris-virginica
df[~(df['class'] == 'Iris-virginica')]
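Another filter worth knowing is isin, which matches a column against a list of values. A sketch using a tiny made-up stand-in for the Iris DataFrame (same column names as above), so it runs without downloading the data:

```python
import pandas as pd

# A tiny stand-in for the Iris DataFrame defined above
df = pd.DataFrame({'petal_len': [1.4, 4.7, 5.6, 1.3],
                   'class': ['Iris-setosa', 'Iris-versicolor',
                             'Iris-virginica', 'Iris-setosa']})

# Keep only the rows whose class is in the given list
subset = df[df['class'].isin(['Iris-setosa', 'Iris-versicolor'])]
print(subset)
```

Combined with `~`, the same call also expresses "NOT in this list": `df[~df['class'].isin([...])]`.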


The post Data Analytics in Python – How to use .loc, .iloc, .ix in Pandas – Learning by Doing appeared first on Learning By Doing - A Hands On Approach To Professional Development.

In the field of Data Science and Machine Learning, the very first thing to do after getting access to data is to analyze it. Data analysis is the most important part of extracting any valuable information from the data.

Before applying any Machine Learning model or technique, it is necessary to get to know the data's attributes and dimensions in order to treat it accordingly. In this tutorial, we will take a Hands On approach to analyzing an actual dataset used for Machine Learning. We will use Python and Pandas for this purpose, in particular .loc, .iloc, and .ix. We will start by loading the data and defining its labels and classes as per the data description in the Machine Learning Data Repository.

import pandas as pd
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True)  # drops the empty line at file-end
df.head()
df.tail()
df = df.set_index('class')

SELECTING A COLUMN IN PANDAS:

df['petal_len']

SELECTING MULTIPLE COLUMNS IN PANDAS:

df[['petal_len', 'petal_wid']]

SELECTING ALL ROWS BY INDEX LABEL:

# Select all rows with class 'Iris-virginica' df.loc['Iris-virginica']

SELECTING ROWS IN PANDAS

# Select the first four rows (positions 0 through 3)
df.iloc[:4]
# Select the fourth row only (position 3)
df.iloc[3:4]
# Select every row from the fifth row onward
df.iloc[4:]

SELECTING COLUMNS IN PANDAS

# Select the first 2 columns
df.iloc[:, :2]
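iloc can also slice rows and columns in the same indexing call. A sketch with a small made-up stand-in DataFrame (a few Iris-like values, so it runs without downloading the data):

```python
import pandas as pd

# A tiny stand-in for the Iris DataFrame used above
df = pd.DataFrame({'sepal_len': [5.1, 4.9, 4.7, 4.6],
                   'sepal_wid': [3.5, 3.0, 3.2, 3.1],
                   'petal_len': [1.4, 1.4, 1.3, 1.5],
                   'petal_wid': [0.2, 0.2, 0.2, 0.2]})

# First two rows AND first two columns in one call: [row slice, column slice]
block = df.iloc[:2, :2]
print(block)
```

The same row/column pattern works with .loc using labels instead of positions, e.g. `df.loc[:, ['sepal_len', 'sepal_wid']]`.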

