Applying Arbitrary Functions for Grouping Data with Pandas – Data Analytics in Python – Learning By Doing


In this tutorial, we will explore how to apply arbitrary functions to groupings in Pandas. It's a handy technique for data analytics with Python and Pandas: the GroupBy apply method is a higher-order function, so we can pass it our own function to compute custom aggregations.

>>> import pandas as pd
>>> df = pd.DataFrame({'Student':['Beth', 'Alex', 'Diana', 'Adrian'],
                  'Age': [18, 19, 18, 19],
                  'Math': [75, 82, 89, 85],
                  'Science': [65, 75, 86, 90],
                  'Teacher': ['William', 'William', 'Robert', 'Robert']})

To get an idea of how our data looks, we can print the records as a table:

>>> df.head()
   Age  Math  Science Student  Teacher
0   18    75       65    Beth  William
1   19    82       75    Alex  William
2   18    89       86   Diana   Robert
3   19    85       90  Adrian   Robert

Consider the max aggregation applied after grouping by Teacher. Within each group, the maximum of every column is computed independently:

>>> df.groupby('Teacher').max()
         Age  Math  Science Student
Teacher                            
Robert    19    89       90   Diana
William   19    82       75    Beth
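Because each column is reduced on its own, a row of this result need not describe any single student. The following is a minimal sketch using the same data; the variable names `per_teacher_max`, `robert` and `science_of_diana` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Column-wise maximum within each Teacher group.
per_teacher_max = df.groupby('Teacher').max()
robert = per_teacher_max.loc['Robert']

# 'Diana' is the alphabetically largest name among Robert's students,
# while 90 is the highest Science score in the group (it belongs to Adrian).
print(robert['Student'], robert['Science'])

# Diana's actual Science score, for comparison.
science_of_diana = df.loc[df.Student == 'Diana', 'Science'].iloc[0]
print(science_of_diana)
```

So the Robert row combines Diana's name with Adrian's Science score, which is worth keeping in mind when reading such a table.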

The pre-defined max function can also be passed to apply:

>>> df.groupby('Teacher').apply(max)
         Age  Math  Science Student  Teacher
Teacher                                     
Robert    19    89       90   Diana   Robert
William   19    82       75    Beth  William

In the code above, we passed a function as an argument to apply. (Newer Pandas releases may warn when a Python builtin such as max is passed this way, and the exact columns shown can vary by version.) In the same way we can pass our own functions and get the results we want. Let's define a function which finds the best teacher in our case:

def best_teacher(group_dframe):
    return pd.DataFrame({'Math': [group_dframe.loc[group_dframe.Math.idxmax()].Teacher],
                        'Science': [group_dframe.loc[group_dframe.Science.idxmax()].Teacher]})

The function above takes the DataFrame for a single group as its argument and returns a one-row DataFrame holding, for each subject, the name of the teacher whose student has the highest score.
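To see what such a function receives, it can help to iterate over the groups directly. This is a small sketch, assuming the same data as above and a grouping by Age; the dictionary name `students_by_age` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

students_by_age = {}
# Iterating a GroupBy yields (key, sub-DataFrame) pairs; apply hands
# each of these sub-DataFrames to the function it is given.
for age, group in df.groupby('Age'):
    print(age, type(group).__name__, list(group['Student']))
    students_by_age[int(age)] = list(group['Student'])
```

Each group is an ordinary DataFrame that keeps the original row labels, which is why .loc and idxmax can be used inside the function.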

Let's examine the function more closely. Consider the list which is passed as the value for the key ‘Math’ in the dictionary defined in the function above:

[group_dframe.loc[group_dframe.Math.idxmax()].Teacher]

Let's dissect this list step by step to better understand what's going on.

group_dframe.Math.idxmax()

The line above returns the index label of the row holding the maximum value for Math.

group_dframe.loc[group_dframe.Math.idxmax()]

Using the .loc indexer, we fetch the row at the index label found in the previous step. For more on .loc, you can see my post How to use .loc, .iloc, .ix in Pandas.

Now finally:

group_dframe.loc[group_dframe.Math.idxmax()].Teacher

The line above fetches the Teacher from the row extracted in the previous step. Since that row holds the maximum score for Math, the Teacher returned here is the one whose student got the highest marks in Math.
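The three steps above can be traced on one concrete group. The sketch below assumes the same data and picks the Age 18 group with get_group; the names `group_18`, `best_idx`, `best_row` and `teacher` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# The sub-DataFrame for students aged 18 (Beth and Diana).
group_18 = df.groupby('Age').get_group(18)

# Step 1: index label of the highest Math score in the group.
best_idx = group_18.Math.idxmax()
print(best_idx)          # 2

# Step 2: the full row at that label.
best_row = group_18.loc[best_idx]
print(best_row.Student)  # Diana

# Step 3: the Teacher stored in that row.
teacher = best_row.Teacher
print(teacher)           # Robert
```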

Now let's group the DataFrame by Age and apply our function:

>>> group_dframe = df.groupby('Age')
>>> group_dframe.apply(best_teacher)
         Math Science
Age                  
18  0  Robert  Robert
19  0  Robert  Robert

In this way, we fetched, for each age group, the teacher of the top-scoring student in each subject. The extra 0 in the index comes from the one-row DataFrame that best_teacher returns for each group.
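If the extra index level is not wanted, one possible variation is to return a Series instead of a one-row DataFrame. This is a sketch rather than the approach used above: `best_teacher_series` is a hypothetical name, and the needed columns are selected explicitly before apply so that the function sees the same columns regardless of how a given Pandas version treats the grouping column.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

def best_teacher_series(group):
    # One Series per group: subject -> teacher of the top-scoring student.
    return pd.Series({
        'Math': group.loc[group.Math.idxmax(), 'Teacher'],
        'Science': group.loc[group.Science.idxmax(), 'Teacher'],
    })

# Select only the columns the function uses, then apply it per Age group.
result = df.groupby('Age')[['Math', 'Science', 'Teacher']].apply(best_teacher_series)
print(result)
```

The result is indexed by Age alone, with one row per age group and one column per subject.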

Electronics Engineer by book, Software Architect and Technopreneur by passion, Open Source Enthusiast, Problem Hacker, Enabler, Do-Tank, Blogger, Autodidact, Yogi and an avid Reader. Involved in Building Products. Having loads of experience and technical expertise in areas ranging from Full Stack Web Application Development to Big Data Analysis, Modeling, Processing and Visualization, he is currently involved in working on Python, Django, Javascript, SQL, Bootstrap, PostgreSQL, RRD (Round Robin Database), MySQL, MonetDB, LevelDB, BerkeleyDB, Redis, Apache Spark, Pandas, SciPy, NumPy etc.

Ali Raza received his Masters Degree in Electronics Engineering which involved Research focused on Machine Learning. He is currently working as a Chief Technical Officer at BitWits (Pvt) Limited, CEO & Founder at DataLysis.io and CEO & Founder at LearningByDoing.io.
