Groupby, Agg and Transform Tutorial

Groupby, agreegation and transform function is frequently used in data analysis with Pandas. Let’s see their application with some examples.

import pandas as pd
sales = pd.read_excel('D:/lj2/Machine-Learning/Exploratory Analysis/sales_transactions.xlsx')
sales.head()
account name order sku quantity unit price ext price
0 383080 Will LLC 10001 B1-20000 7 33.69 235.83
1 383080 Will LLC 10001 S1-27722 11 21.12 232.32
2 383080 Will LLC 10001 B1-86481 3 35.99 107.97
3 412290 Jerde-Hilpert 10005 S1-06532 48 55.82 2679.36
4 412290 Jerde-Hilpert 10005 S1-82801 21 13.62 286.02

Task 1. Group by get sum price for each order

Reference

sales.groupby('order')['ext price'].sum()
order
10001     576.12
10005    8185.49
10006    3724.49
Name: ext price, dtype: float64

Task 2. Calculate proportion of cost of each item in each order

transform() function is used

sales['TotalPrice']=sales.groupby('order')['ext price'].transform(sum)
sales['Portion']=sales['ext price']/sales['TotalPrice']
sales['Portion']=sales['Portion'].apply(lambda x: str(round(x*100,3))+"%")
sales
account name order sku quantity unit price ext price TotalPrice Portion
0 383080 Will LLC 10001 B1-20000 7 33.69 235.83 576.12 40.934%
1 383080 Will LLC 10001 S1-27722 11 21.12 232.32 576.12 40.325%
2 383080 Will LLC 10001 B1-86481 3 35.99 107.97 576.12 18.741%
3 412290 Jerde-Hilpert 10005 S1-06532 48 55.82 2679.36 8185.49 32.733%
4 412290 Jerde-Hilpert 10005 S1-82801 21 13.62 286.02 8185.49 3.494%
5 412290 Jerde-Hilpert 10005 S1-06532 9 92.55 832.95 8185.49 10.176%
6 412290 Jerde-Hilpert 10005 S1-47412 44 78.91 3472.04 8185.49 42.417%
7 412290 Jerde-Hilpert 10005 S1-27722 36 25.42 915.12 8185.49 11.18%
8 218895 Kulas Inc 10006 S1-27722 32 95.66 3061.12 3724.49 82.189%
9 218895 Kulas Inc 10006 B1-33087 23 22.55 518.65 3724.49 13.925%
10 218895 Kulas Inc 10006 B1-33364 3 72.30 216.90 3724.49 5.824%
11 218895 Kulas Inc 10006 B1-20000 -1 72.18 -72.18 3724.49 -1.938%

Task 3. Multi-index groupby

Item    Price  Minimum Most_Common_Price
0 Coffee  1      1       2
1 Coffee  2      1       2
2 Coffee  2      1       2
3 Tea     3      3       4
4 Tea     4      3       4
5 Tea     4      3       4
item = pd.DataFrame({'Item':['Coffee','Coffee','Coffee','Coffee','Coffee','Coffee','Coffee','Coffee',
                             'Tea','Tea','Tea','Tea','Tea','Tea','Tea'],
                     'Brand':['Star','Star','Moon','Moon','Star','Moon','Star','Moon',
                              'Garden','Garden','Park','Garden','Garden','Park','Park'],
                     'Price':[1,1,2,2,3,3,3,4,4,5,5,5,5,6,6]})
item
Brand Item Price
0 Star Coffee 1
1 Star Coffee 1
2 Moon Coffee 2
3 Moon Coffee 2
4 Star Coffee 3
5 Moon Coffee 3
6 Star Coffee 3
7 Moon Coffee 4
8 Garden Tea 4
9 Garden Tea 5
10 Park Tea 5
11 Garden Tea 5
12 Garden Tea 5
13 Park Tea 6
14 Park Tea 6
item.groupby('Item')['Brand'].value_counts()
Item    Brand 
Coffee  Moon      4
        Star      4
Tea     Garden    4
        Park      3
Name: Brand, dtype: int64

It is a multi-index series

Task 3.1. Multi-index groupby with relative portion

Python Data Science

Method 1.

item.groupby('Item')['Brand'].value_counts()/item.groupby('Item')['Brand'].count()
Item    Brand 
Coffee  Moon      0.500000
        Star      0.500000
Tea     Garden    0.571429
        Park      0.428571
Name: Brand, dtype: float64

Method 2 (works better for more complex dataframe).

grouper = item.groupby('Item')['Brand'].value_counts()
grouper/grouper.groupby(level=[0]).transform(sum)
Item    Brand 
Coffee  Moon      0.500000
        Star      0.500000
Tea     Garden    0.571429
        Park      0.428571
Name: Brand, dtype: float64

Task 3.2. Get item in multi-index series

Grouper is a special dataframe. Let’s study it.

grouper
Item    Brand 
Coffee  Moon      4
        Star      4
Tea     Garden    4
        Park      3
Name: Brand, dtype: int64
grouper.Coffee
Brand
Moon    4
Star    4
Name: Brand, dtype: int64
grouper['Coffee','Moon']
4
## create a pivot table
grouper.unstack()
Brand Garden Moon Park Star
Item
Coffee NaN 4.0 NaN 4.0
Tea 4.0 NaN 3.0 NaN
grouper.unstack().stack()
Item    Brand 
Coffee  Moon      4.0
        Star      4.0
Tea     Garden    4.0
        Park      3.0
dtype: float64

Task 3.3. More levels get proportions

# add one more column
item['Flavor']=['Black','Black','Black','Latte','Black','Latte','Black','Latte',
                 'Fruit','Milk','Fruit','Milk','Milk','Milk','Fruit']
item
Brand Item Price Flavor
0 Star Coffee 1 Black
1 Star Coffee 1 Black
2 Moon Coffee 2 Black
3 Moon Coffee 2 Latte
4 Star Coffee 3 Black
5 Moon Coffee 3 Latte
6 Star Coffee 3 Black
7 Moon Coffee 4 Latte
8 Garden Tea 4 Fruit
9 Garden Tea 5 Milk
10 Park Tea 5 Fruit
11 Garden Tea 5 Milk
12 Garden Tea 5 Milk
13 Park Tea 6 Milk
14 Park Tea 6 Fruit
item.groupby(['Item','Flavor'])['Brand'].value_counts()
Item    Flavor  Brand 
Coffee  Black   Star      4
                Moon      1
        Latte   Moon      3
Tea     Fruit   Park      2
                Garden    1
        Milk    Garden    3
                Park      1
Name: Brand, dtype: int64
item.groupby(['Item','Flavor'])['Brand'].count()
Item    Flavor
Coffee  Black     5
        Latte     3
Tea     Fruit     3
        Milk      4
Name: Brand, dtype: int64

Task 3.3.1 We still need to calculate the proportion of each group

# it will give us a problem
#item.groupby(['Item','Flavor'])['Brand'].value_counts()/item.groupby(['Item','Flavor'])['Brand'].count()

grouper1=item.groupby(['Item','Flavor'])['Brand'].value_counts()
grouper1/grouper1.groupby(level=[0,1]).transform(sum)
Item    Flavor  Brand 
Coffee  Black   Star      0.800000
                Moon      0.200000
        Latte   Moon      1.000000
Tea     Fruit   Park      0.666667
                Garden    0.333333
        Milk    Garden    0.750000
                Park      0.250000
Name: Brand, dtype: float64

Task 3.3.2. get most frequent combination for each item.

grouper1
Item    Flavor  Brand 
Coffee  Black   Star      4
                Moon      1
        Latte   Moon      3
Tea     Fruit   Park      2
                Garden    1
        Milk    Garden    3
                Park      1
Name: Brand, dtype: int64
grouper1[grouper1==grouper1.groupby(level=[0,1]).transform(max)]
Item    Flavor  Brand 
Coffee  Black   Star      4
        Latte   Moon      3
Tea     Fruit   Park      2
        Milk    Garden    3
Name: Brand, dtype: int64

Task 3.3.3 get the most frequent price

Inspired by Stackflow problem

grouper2=item.groupby((['Item','Brand'])).Price
grouper2.value_counts()
Item    Brand   Price
Coffee  Moon    2        2
                3        1
                4        1
        Star    1        2
                3        2
Tea     Garden  5        3
                4        1
        Park    6        2
                5        1
Name: Price, dtype: int64
item['Most Frequent']=grouper2.transform(lambda x: x.mode()[0])
item
Brand Item Price Flavor Most Frequent
0 Star Coffee 1 Black 1
1 Star Coffee 1 Black 1
2 Moon Coffee 2 Black 2
3 Moon Coffee 2 Latte 2
4 Star Coffee 3 Black 1
5 Moon Coffee 3 Latte 2
6 Star Coffee 3 Black 1
7 Moon Coffee 4 Latte 2
8 Garden Tea 4 Fruit 5
9 Garden Tea 5 Milk 5
10 Park Tea 5 Fruit 6
11 Garden Tea 5 Milk 5
12 Garden Tea 5 Milk 5
13 Park Tea 6 Milk 6
14 Park Tea 6 Fruit 6

For future study

Python Data Science