In this tutorial, we will explore how to apply arbitrary functions to groups in Pandas. It's a handy technique when analyzing data with Python and Pandas. Because GroupBy's apply is a higher-order function (it takes another function as its argument), we can use it to run custom aggregations.

>>> import pandas as pd
>>> df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
...                    'Age': [18, 19, 18, 19],
...                    'Math': [75, 82, 89, 85],
...                    'Science': [65, 75, 86, 90],
...                    'Teacher': ['William', 'William', 'Robert', 'Robert']})

To get an idea of how our data looks, we can print the records as a table:

>>> df.head()
   Age  Math  Science Student  Teacher
0   18    75       65    Beth  William
1   19    82       75    Alex  William
2   18    89       86   Diana   Robert
3   19    85       90  Adrian   Robert

Consider the max function applied to the data grouped by Teacher:

>>> df.groupby('Teacher').max()
         Age  Math  Science Student
Teacher
Robert    19    89       90   Diana
William   19    82       75    Beth

In the pandas version used for this post, Python's built-in max function can also be passed to apply, which gives the same values plus the Teacher column:

>>> df.groupby('Teacher').apply(max)
         Age  Math  Science Student  Teacher
Teacher
Robert    19    89       90   Diana   Robert
William   19    82       75    Beth  William

In the code above, we passed a function as an argument to apply. In the same way, we can pass our own custom functions and get the results we want. Let's define a function that finds the best teacher in our case:

def best_teacher(group_dframe):
    return pd.DataFrame({
        'Math': [group_dframe.loc[group_dframe.Math.idxmax()].Teacher],
        'Science': [group_dframe.loc[group_dframe.Science.idxmax()].Teacher]})

The function above takes the DataFrame for a single group as its argument and returns a one-row DataFrame holding, for each subject, the name of the teacher on the row with the maximum score.
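A function passed to apply does not have to return a DataFrame. As a minimal sketch, the example below computes the spread of Math scores per teacher and returns one number per group. The helper name score_range is hypothetical and is not part of the original example; it is applied to the Math column only, so each call receives a Series rather than a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})


def score_range(scores):
    # `scores` is the Series of Math marks for one teacher's group
    # (hypothetical helper, used only for illustration)
    return scores.max() - scores.min()


# Select the Math column first, then apply the custom function per group
math_range = df.groupby('Teacher')['Math'].apply(score_range)
print(math_range)
```

The result is a Series indexed by Teacher, with one value for each group.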

Let's examine best_teacher more closely. Consider the list that is passed as the value for the key 'Math' in the dictionary inside the function:

[group_dframe.loc[group_dframe.Math.idxmax()].Teacher]

Let's dissect this list step by step to better understand what's going on.

group_dframe.Math.idxmax()

The line above returns the index label of the row with the maximum Math score.
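Note that idxmax returns an index label, not an integer position. The difference is easier to see in a small sketch where, as an assumption made only for this illustration, the Math scores are indexed by student name instead of the default 0, 1, 2, 3:

```python
import pandas as pd

# Same Math scores as in the tutorial, but indexed by student name
# (this indexing is an assumption for the illustration)
math = pd.Series([75, 82, 89, 85],
                 index=['Beth', 'Alex', 'Diana', 'Adrian'])

label = math.idxmax()      # index label of the largest value
position = math.argmax()   # integer position of the largest value
print(label, position)
```

Because the result is a label, it can be passed directly to .loc.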

group_dframe.loc[group_dframe.Math.idxmax()]

Now, using the .loc indexer, we fetch the row at the index label returned in the previous step. For more on .loc, you can see my post How to use .loc, .iloc, .ix in Pandas.

Now finally:

group_dframe.loc[group_dframe.Math.idxmax()].Teacher

The line above fetches the Teacher from the row extracted in the previous step. Since that row holds the maximum Math score, the Teacher returned here is the one whose student got the highest marks in Math.
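The three steps above can be tried on a single group before applying the function to every group. The sketch below uses get_group to pull out the rows for age 18; the variable names group_18, top_label, top_row and top_teacher are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Rows for the 18-year-old students (Beth and Diana)
group_18 = df.groupby('Age').get_group(18)

top_label = group_18.Math.idxmax()   # index label of the highest Math score
top_row = group_18.loc[top_label]    # the full row at that label, as a Series
top_teacher = top_row.Teacher        # the teacher recorded on that row
print(top_label, top_teacher)
```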

Now let's group the DataFrame by Age and apply our function:

>>> group_dframe = df.groupby('Age')
>>> group_dframe.apply(best_teacher)
         Math Science
Age
18  0  Robert  Robert
19  0  Robert  Robert

In this way, we fetched, for each age group, the teacher whose student has the highest score in each subject.
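For comparison, the same lookup can be sketched without a custom function by calling idxmax on the grouped columns and then mapping the returned row labels to teacher names. This is an alternative shown for illustration, not the approach used in this post; the names idx and best are chosen here.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# For each age group, the row label of the top score in each subject
idx = df.groupby('Age')[['Math', 'Science']].idxmax()

# Replace each row label with the teacher recorded on that row
best = idx.apply(lambda labels: labels.map(df['Teacher']))
print(best)
```

The result is indexed by Age only, without the extra inner index level that the apply-based version produces.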


Ali Raza received his Master's degree in Electronics Engineering, which involved research focused on machine learning. He is currently working as Chief Technical Officer at BitWits (Pvt) Limited, and is CEO & Founder at DataLysis.io and at LearningByDoing.io.