Groupby, Agg and Transform Tutorial

Groupby, agreegation and transform function is frequently used in data analysis with Pandas. Let’s see their application with some examples.

import pandas as pd
sales = pd.read_excel('D:/lj2/Machine-Learning/Exploratory Analysis/sales_transactions.xlsx')
sales.head()

	account	name	order	sku	quantity	unit price	ext price
0	383080	Will LLC	10001	B1-20000	7	33.69	235.83
1	383080	Will LLC	10001	S1-27722	11	21.12	232.32
2	383080	Will LLC	10001	B1-86481	3	35.99	107.97
3	412290	Jerde-Hilpert	10005	S1-06532	48	55.82	2679.36
4	412290	Jerde-Hilpert	10005	S1-82801	21	13.62	286.02

Task 1. Group by get sum price for each order

Reference

sales.groupby('order')['ext price'].sum()

order
10001     576.12
10005    8185.49
10006    3724.49
Name: ext price, dtype: float64

Task 2. Calculate proportion of cost of each item in each order

transform() function is used

sales['TotalPrice']=sales.groupby('order')['ext price'].transform(sum)
sales['Portion']=sales['ext price']/sales['TotalPrice']
sales['Portion']=sales['Portion'].apply(lambda x: str(round(x*100,3))+"%")
sales

	account	name	order	sku	quantity	unit price	ext price	TotalPrice	Portion
0	383080	Will LLC	10001	B1-20000	7	33.69	235.83	576.12	40.934%
1	383080	Will LLC	10001	S1-27722	11	21.12	232.32	576.12	40.325%
2	383080	Will LLC	10001	B1-86481	3	35.99	107.97	576.12	18.741%
3	412290	Jerde-Hilpert	10005	S1-06532	48	55.82	2679.36	8185.49	32.733%
4	412290	Jerde-Hilpert	10005	S1-82801	21	13.62	286.02	8185.49	3.494%
5	412290	Jerde-Hilpert	10005	S1-06532	9	92.55	832.95	8185.49	10.176%
6	412290	Jerde-Hilpert	10005	S1-47412	44	78.91	3472.04	8185.49	42.417%
7	412290	Jerde-Hilpert	10005	S1-27722	36	25.42	915.12	8185.49	11.18%
8	218895	Kulas Inc	10006	S1-27722	32	95.66	3061.12	3724.49	82.189%
9	218895	Kulas Inc	10006	B1-33087	23	22.55	518.65	3724.49	13.925%
10	218895	Kulas Inc	10006	B1-33364	3	72.30	216.90	3724.49	5.824%
11	218895	Kulas Inc	10006	B1-20000	-1	72.18	-72.18	3724.49	-1.938%

Task 3. Multi-index groupby

Item    Price  Minimum Most_Common_Price
Coffee  1      1       2
Coffee  2      1       2
Coffee  2      1       2
Tea     3      3       4
Tea     4      3       4
Tea     4      3       4

item = pd.DataFrame({'Item':['Coffee','Coffee','Coffee','Coffee','Coffee','Coffee','Coffee','Coffee',
                             'Tea','Tea','Tea','Tea','Tea','Tea','Tea'],
                     'Brand':['Star','Star','Moon','Moon','Star','Moon','Star','Moon',
                              'Garden','Garden','Park','Garden','Garden','Park','Park'],
                     'Price':[1,1,2,2,3,3,3,4,4,5,5,5,5,6,6]})

item

	Brand	Item	Price
0	Star	Coffee	1
1	Star	Coffee	1
2	Moon	Coffee	2
3	Moon	Coffee	2
4	Star	Coffee	3
5	Moon	Coffee	3
6	Star	Coffee	3
7	Moon	Coffee	4
8	Garden	Tea	4
9	Garden	Tea	5
10	Park	Tea	5
11	Garden	Tea	5
12	Garden	Tea	5
13	Park	Tea	6
14	Park	Tea	6

item.groupby('Item')['Brand'].value_counts()

Item    Brand 
Coffee  Moon      4
        Star      4
Tea     Garden    4
        Park      3
Name: Brand, dtype: int64

It is a multi-index series

Task 3.1. Multi-index groupby with relative portion

Python Data Science

Method 1.

item.groupby('Item')['Brand'].value_counts()/item.groupby('Item')['Brand'].count()

Item    Brand 
Coffee  Moon      0.500000
        Star      0.500000
Tea     Garden    0.571429
        Park      0.428571
Name: Brand, dtype: float64

Method 2 (works better for more complex dataframe).

grouper = item.groupby('Item')['Brand'].value_counts()
grouper/grouper.groupby(level=[0]).transform(sum)

Item    Brand 
Coffee  Moon      0.500000
        Star      0.500000
Tea     Garden    0.571429
        Park      0.428571
Name: Brand, dtype: float64

Task 3.2. Get item in multi-index series

Grouper is a special dataframe. Let’s study it.

grouper

Item    Brand 
Coffee  Moon      4
        Star      4
Tea     Garden    4
        Park      3
Name: Brand, dtype: int64

grouper.Coffee

Brand
Moon    4
Star    4
Name: Brand, dtype: int64

grouper['Coffee','Moon']

## create a pivot table
grouper.unstack()

Brand	Garden	Moon	Park	Star
Item
Coffee	NaN	4.0	NaN	4.0
Tea	4.0	NaN	3.0	NaN

grouper.unstack().stack()

Item    Brand 
Coffee  Moon      4.0
        Star      4.0
Tea     Garden    4.0
        Park      3.0
dtype: float64

Task 3.3. More levels get proportions

# add one more column
item['Flavor']=['Black','Black','Black','Latte','Black','Latte','Black','Latte',
                 'Fruit','Milk','Fruit','Milk','Milk','Milk','Fruit']

item

	Brand	Item	Price	Flavor
0	Star	Coffee	1	Black
1	Star	Coffee	1	Black
2	Moon	Coffee	2	Black
3	Moon	Coffee	2	Latte
4	Star	Coffee	3	Black
5	Moon	Coffee	3	Latte
6	Star	Coffee	3	Black
7	Moon	Coffee	4	Latte
8	Garden	Tea	4	Fruit
9	Garden	Tea	5	Milk
10	Park	Tea	5	Fruit
11	Garden	Tea	5	Milk
12	Garden	Tea	5	Milk
13	Park	Tea	6	Milk
14	Park	Tea	6	Fruit

item.groupby(['Item','Flavor'])['Brand'].value_counts()

Item    Flavor  Brand 
Coffee  Black   Star      4
                Moon      1
        Latte   Moon      3
Tea     Fruit   Park      2
                Garden    1
        Milk    Garden    3
                Park      1
Name: Brand, dtype: int64

item.groupby(['Item','Flavor'])['Brand'].count()

Item    Flavor
Coffee  Black     5
        Latte     3
Tea     Fruit     3
        Milk      4
Name: Brand, dtype: int64

Task 3.3.1 We still need to calculate the proportion of each group

# it will give us a problem
#item.groupby(['Item','Flavor'])['Brand'].value_counts()/item.groupby(['Item','Flavor'])['Brand'].count()

grouper1=item.groupby(['Item','Flavor'])['Brand'].value_counts()
grouper1/grouper1.groupby(level=[0,1]).transform(sum)

Item    Flavor  Brand 
Coffee  Black   Star      0.800000
                Moon      0.200000
        Latte   Moon      1.000000
Tea     Fruit   Park      0.666667
                Garden    0.333333
        Milk    Garden    0.750000
                Park      0.250000
Name: Brand, dtype: float64

Task 3.3.2. get most frequent combination for each item.

grouper1

Item    Flavor  Brand 
Coffee  Black   Star      4
                Moon      1
        Latte   Moon      3
Tea     Fruit   Park      2
                Garden    1
        Milk    Garden    3
                Park      1
Name: Brand, dtype: int64

grouper1[grouper1==grouper1.groupby(level=[0,1]).transform(max)]

Item    Flavor  Brand 
Coffee  Black   Star      4
        Latte   Moon      3
Tea     Fruit   Park      2
        Milk    Garden    3
Name: Brand, dtype: int64

Task 3.3.3 get the most frequent price

Inspired by Stackflow problem

grouper2=item.groupby((['Item','Brand'])).Price

grouper2.value_counts()

Item    Brand   Price
Coffee  Moon    2        2
                3        1
                4        1
        Star    1        2
                3        2
Tea     Garden  5        3
                4        1
        Park    6        2
                5        1
Name: Price, dtype: int64

item['Most Frequent']=grouper2.transform(lambda x: x.mode()[0])

item

	Brand	Item	Price	Flavor	Most Frequent
0	Star	Coffee	1	Black	1
1	Star	Coffee	1	Black	1
2	Moon	Coffee	2	Black	2
3	Moon	Coffee	2	Latte	2
4	Star	Coffee	3	Black	1
5	Moon	Coffee	3	Latte	2
6	Star	Coffee	3	Black	1
7	Moon	Coffee	4	Latte	2
8	Garden	Tea	4	Fruit	5
9	Garden	Tea	5	Milk	5
10	Park	Tea	5	Fruit	6
11	Garden	Tea	5	Milk	5
12	Garden	Tea	5	Milk	5
13	Park	Tea	6	Milk	6
14	Park	Tea	6	Fruit	6

For future study