Groupby, Agg and Transform Tutorial
Groupby, agreegation and transform function is frequently used in data analysis with Pandas. Let’s see their application with some examples.
import pandas as pd
sales = pd.read_excel('D:/lj2/Machine-Learning/Exploratory Analysis/sales_transactions.xlsx')
sales.head()
account | name | order | sku | quantity | unit price | ext price | |
---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 |
1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 |
2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 |
3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 |
4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 |
Task 1. Group by get sum price for each order
sales.groupby('order')['ext price'].sum()
order
10001 576.12
10005 8185.49
10006 3724.49
Name: ext price, dtype: float64
Task 2. Calculate proportion of cost of each item in each order
transform() function is used
sales['TotalPrice']=sales.groupby('order')['ext price'].transform(sum)
sales['Portion']=sales['ext price']/sales['TotalPrice']
sales['Portion']=sales['Portion'].apply(lambda x: str(round(x*100,3))+"%")
sales
account | name | order | sku | quantity | unit price | ext price | TotalPrice | Portion | |
---|---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 40.934% |
1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 | 576.12 | 40.325% |
2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 18.741% |
3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 32.733% |
4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 3.494% |
5 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 10.176% |
6 | 412290 | Jerde-Hilpert | 10005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 42.417% |
7 | 412290 | Jerde-Hilpert | 10005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 11.18% |
8 | 218895 | Kulas Inc | 10006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 82.189% |
9 | 218895 | Kulas Inc | 10006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 13.925% |
10 | 218895 | Kulas Inc | 10006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 5.824% |
11 | 218895 | Kulas Inc | 10006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | -1.938% |
Task 3. Multi-index groupby
Item Price Minimum Most_Common_Price
0 Coffee 1 1 2
1 Coffee 2 1 2
2 Coffee 2 1 2
3 Tea 3 3 4
4 Tea 4 3 4
5 Tea 4 3 4
item = pd.DataFrame({'Item':['Coffee','Coffee','Coffee','Coffee','Coffee','Coffee','Coffee','Coffee',
'Tea','Tea','Tea','Tea','Tea','Tea','Tea'],
'Brand':['Star','Star','Moon','Moon','Star','Moon','Star','Moon',
'Garden','Garden','Park','Garden','Garden','Park','Park'],
'Price':[1,1,2,2,3,3,3,4,4,5,5,5,5,6,6]})
item
Brand | Item | Price | |
---|---|---|---|
0 | Star | Coffee | 1 |
1 | Star | Coffee | 1 |
2 | Moon | Coffee | 2 |
3 | Moon | Coffee | 2 |
4 | Star | Coffee | 3 |
5 | Moon | Coffee | 3 |
6 | Star | Coffee | 3 |
7 | Moon | Coffee | 4 |
8 | Garden | Tea | 4 |
9 | Garden | Tea | 5 |
10 | Park | Tea | 5 |
11 | Garden | Tea | 5 |
12 | Garden | Tea | 5 |
13 | Park | Tea | 6 |
14 | Park | Tea | 6 |
item.groupby('Item')['Brand'].value_counts()
Item Brand
Coffee Moon 4
Star 4
Tea Garden 4
Park 3
Name: Brand, dtype: int64
It is a multi-index series
Task 3.1. Multi-index groupby with relative portion
Method 1.
item.groupby('Item')['Brand'].value_counts()/item.groupby('Item')['Brand'].count()
Item Brand
Coffee Moon 0.500000
Star 0.500000
Tea Garden 0.571429
Park 0.428571
Name: Brand, dtype: float64
Method 2 (works better for more complex dataframe).
grouper = item.groupby('Item')['Brand'].value_counts()
grouper/grouper.groupby(level=[0]).transform(sum)
Item Brand
Coffee Moon 0.500000
Star 0.500000
Tea Garden 0.571429
Park 0.428571
Name: Brand, dtype: float64
Task 3.2. Get item in multi-index series
Grouper is a special dataframe. Let’s study it.
grouper
Item Brand
Coffee Moon 4
Star 4
Tea Garden 4
Park 3
Name: Brand, dtype: int64
grouper.Coffee
Brand
Moon 4
Star 4
Name: Brand, dtype: int64
grouper['Coffee','Moon']
4
## create a pivot table
grouper.unstack()
Brand | Garden | Moon | Park | Star |
---|---|---|---|---|
Item | ||||
Coffee | NaN | 4.0 | NaN | 4.0 |
Tea | 4.0 | NaN | 3.0 | NaN |
grouper.unstack().stack()
Item Brand
Coffee Moon 4.0
Star 4.0
Tea Garden 4.0
Park 3.0
dtype: float64
Task 3.3. More levels get proportions
# add one more column
item['Flavor']=['Black','Black','Black','Latte','Black','Latte','Black','Latte',
'Fruit','Milk','Fruit','Milk','Milk','Milk','Fruit']
item
Brand | Item | Price | Flavor | |
---|---|---|---|---|
0 | Star | Coffee | 1 | Black |
1 | Star | Coffee | 1 | Black |
2 | Moon | Coffee | 2 | Black |
3 | Moon | Coffee | 2 | Latte |
4 | Star | Coffee | 3 | Black |
5 | Moon | Coffee | 3 | Latte |
6 | Star | Coffee | 3 | Black |
7 | Moon | Coffee | 4 | Latte |
8 | Garden | Tea | 4 | Fruit |
9 | Garden | Tea | 5 | Milk |
10 | Park | Tea | 5 | Fruit |
11 | Garden | Tea | 5 | Milk |
12 | Garden | Tea | 5 | Milk |
13 | Park | Tea | 6 | Milk |
14 | Park | Tea | 6 | Fruit |
item.groupby(['Item','Flavor'])['Brand'].value_counts()
Item Flavor Brand
Coffee Black Star 4
Moon 1
Latte Moon 3
Tea Fruit Park 2
Garden 1
Milk Garden 3
Park 1
Name: Brand, dtype: int64
item.groupby(['Item','Flavor'])['Brand'].count()
Item Flavor
Coffee Black 5
Latte 3
Tea Fruit 3
Milk 4
Name: Brand, dtype: int64
Task 3.3.1 We still need to calculate the proportion of each group
# it will give us a problem
#item.groupby(['Item','Flavor'])['Brand'].value_counts()/item.groupby(['Item','Flavor'])['Brand'].count()
grouper1=item.groupby(['Item','Flavor'])['Brand'].value_counts()
grouper1/grouper1.groupby(level=[0,1]).transform(sum)
Item Flavor Brand
Coffee Black Star 0.800000
Moon 0.200000
Latte Moon 1.000000
Tea Fruit Park 0.666667
Garden 0.333333
Milk Garden 0.750000
Park 0.250000
Name: Brand, dtype: float64
Task 3.3.2. get most frequent combination for each item.
grouper1
Item Flavor Brand
Coffee Black Star 4
Moon 1
Latte Moon 3
Tea Fruit Park 2
Garden 1
Milk Garden 3
Park 1
Name: Brand, dtype: int64
grouper1[grouper1==grouper1.groupby(level=[0,1]).transform(max)]
Item Flavor Brand
Coffee Black Star 4
Latte Moon 3
Tea Fruit Park 2
Milk Garden 3
Name: Brand, dtype: int64
Task 3.3.3 get the most frequent price
Inspired by Stackflow problem
grouper2=item.groupby((['Item','Brand'])).Price
grouper2.value_counts()
Item Brand Price
Coffee Moon 2 2
3 1
4 1
Star 1 2
3 2
Tea Garden 5 3
4 1
Park 6 2
5 1
Name: Price, dtype: int64
item['Most Frequent']=grouper2.transform(lambda x: x.mode()[0])
item
Brand | Item | Price | Flavor | Most Frequent | |
---|---|---|---|---|---|
0 | Star | Coffee | 1 | Black | 1 |
1 | Star | Coffee | 1 | Black | 1 |
2 | Moon | Coffee | 2 | Black | 2 |
3 | Moon | Coffee | 2 | Latte | 2 |
4 | Star | Coffee | 3 | Black | 1 |
5 | Moon | Coffee | 3 | Latte | 2 |
6 | Star | Coffee | 3 | Black | 1 |
7 | Moon | Coffee | 4 | Latte | 2 |
8 | Garden | Tea | 4 | Fruit | 5 |
9 | Garden | Tea | 5 | Milk | 5 |
10 | Park | Tea | 5 | Fruit | 6 |
11 | Garden | Tea | 5 | Milk | 5 |
12 | Garden | Tea | 5 | Milk | 5 |
13 | Park | Tea | 6 | Milk | 6 |
14 | Park | Tea | 6 | Fruit | 6 |