Introduction to Grouping Data with Pandas – Data Analytics in Python – Learning By Doing


In this tutorial in the Data Analytics in Python series, we will use Python with Pandas to generate simple insights from data using some grouping techniques. Grouping means aggregating data on the basis of one or more columns or attributes: groupby splits the data into groups according to the columns provided, so that each group can be summarized separately.

Let's consider a simple dataset of students, their marks in two subjects, and their respective teachers:

>>> import pandas as pd

>>> df = pd.DataFrame({'Student':['Beth', 'Alex', 'Diana', 'Adrian'],
                  'Age': [18, 19, 18, 19],
                  'Math': [75, 82, 89, 85],
                  'Science': [65, 75, 86, 90],
                  'Teacher': ['William', 'William', 'Robert', 'Robert']})

To get an idea of how our data looks, we can print the records as a table:

>>> df.head()
   Age  Math  Science Student  Teacher
0   18    75       65    Beth  William
1   19    82       75    Alex  William
2   18    89       86   Diana   Robert
3   19    85       90  Adrian   Robert
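Before aggregating anything, it can help to look at the split step on its own. The following is a minimal sketch, assuming the same sample data as above; the variable name `groups` is only illustrative. Iterating over a GroupBy object yields one (key, sub-DataFrame) pair per group:

```python
import pandas as pd

# Same sample data as in the tutorial
df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Iterating over a GroupBy object yields (group key, sub-DataFrame) pairs;
# collect them in a dict keyed by teacher name (illustrative helper variable)
groups = {teacher: group for teacher, group in df.groupby('Teacher')}

for teacher, group in groups.items():
    # Show which students ended up in each teacher's group
    print(teacher, list(group['Student']))
```

Each sub-DataFrame holds only the rows for one teacher, and these per-teacher pieces are what the aggregation functions used in the rest of this post operate on.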

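Now we will try to extract some basic insights from our Pandas DataFrame using the groupby function. (Recent Pandas versions lay out this summary with one row per teacher and the statistics as columns, but the numbers are the same.)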

>>> df.groupby('Teacher').describe()
                     Age       Math    Science
Teacher                                       
Robert  count   2.000000   2.000000   2.000000
        mean   18.500000  87.000000  88.000000
        std     0.707107   2.828427   2.828427
        min    18.000000  85.000000  86.000000
        25%    18.250000  86.000000  87.000000
        50%    18.500000  87.000000  88.000000
        75%    18.750000  88.000000  89.000000
        max    19.000000  89.000000  90.000000
William count   2.000000   2.000000   2.000000
        mean   18.500000  78.500000  70.000000
        std     0.707107   4.949747   7.071068
        min    18.000000  75.000000  65.000000
        25%    18.250000  76.750000  67.500000
        50%    18.500000  78.500000  70.000000
        75%    18.750000  80.250000  72.500000
        max    19.000000  82.000000  75.000000

Here we can see some direct insights about the teachers. For instance, Robert’s students are performing better than William’s, judging by the mean values above (87.0 vs. 78.5 in Math and 88.0 vs. 70.0 in Science). This may suggest that Robert is a better teacher than William, or that he has stronger students, though with only two students per teacher it is no more than a hint. We can filter Robert’s data from the DataFrame as follows to validate our insights:

>>> df[df['Teacher']=='Robert']
   Age  Math  Science Student Teacher
2   18    89       86   Diana  Robert
3   19    85       90  Adrian  Robert
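The same rows can also be pulled out of the grouped object itself. This is a small sketch, assuming the sample data from above, that uses `get_group` and then recomputes the means reported by `describe()`; the names `robert` and `robert_means` are illustrative only:

```python
import pandas as pd

# Same sample data as in the tutorial
df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# get_group returns the rows belonging to a single group key
robert = df.groupby('Teacher').get_group('Robert')

# Recompute the per-subject means for Robert's students
robert_means = robert[['Math', 'Science']].mean()
print(robert_means)
```

For a single teacher, the boolean filter shown above is usually simpler; `get_group` is mainly convenient when a grouped object already exists.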

For more Data Filtering Techniques using Pandas, visit Data Analytics in Python – Data Filtering with Pandas – Learning By Doing

We can go further by getting the medians for each group in our Pandas DataFrame:

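>>> # numeric_only=True skips the text column 'Student' (needed on recent Pandas versions)
>>> df.groupby('Teacher').median(numeric_only=True)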
          Age  Math  Science
Teacher                     
Robert   18.5  87.0     88.0
William  18.5  78.5     70.0
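If more than one statistic is of interest, they can be requested in a single call instead of running `describe()` or `median()` separately. The sketch below assumes the sample data from above and selects only the two subject columns; the name `summary` is illustrative:

```python
import pandas as pd

# Same sample data as in the tutorial
df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Select the numeric columns of interest, then compute several
# statistics per teacher in one pass
summary = df.groupby('Teacher')[['Math', 'Science']].agg(['mean', 'median', 'max'])
print(summary)

# The result has two-level column labels: (subject, statistic)
print(summary.loc['Robert', ('Math', 'mean')])
```

Selecting the columns before aggregating also avoids any trouble with non-numeric columns such as `Student`.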

We can extract further insights on the basis of teachers and their students’ ages by grouping on two columns and taking the median in the following way:

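>>> # numeric_only=True skips the text column 'Student' (needed on recent Pandas versions)
>>> df.groupby(['Teacher', 'Age']).median(numeric_only=True)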
             Math  Science
Teacher Age               
Robert  18     89       86
        19     85       90
William 18     75       65
        19     82       75

In the next post, we will look at how to apply arbitrary functions while using groupby in Pandas.

Electronics Engineer by book, Software Architect and Technopreneur by passion, Open Source Enthusiast, Problem Hacker, Enabler, Do-Tank, Blogger, Autodidact, Yogi and an avid Reader. Involved in Building Products. Having loads of experience and technical expertise in areas ranging from Full Stack Web Application Development to Big Data Analysis, Modeling, Processing and Visualization, he is currently involved in working on Python, Django, Javascript, SQL, Bootstrap, PostgreSQL, RRD (Round Robin Database), MySQL, MonetDB, LevelDB, BerkeleyDB, Redis, Apache Spark, Pandas, SciPy, NumPy etc.

Ali Raza received his Masters Degree in Electronics Engineering which involved Research focused on Machine Learning. He is currently working as a Chief Technical Officer at BitWits (Pvt) Limited, CEO & Founder at DataLysis.io and CEO & Founder at LearningByDoing.io.
