Clustering Grocery Items
Project Description
This project is the 9th problem in the Take-Home Data Science Challenge.
Company XYZ is an online grocery store. In the current version of the website, they have manually grouped the items into a few categories based on their experience. However, they now have a lot of data about user purchase history, and they would like to put that data to use!
This is what they asked you to do:
- The company founder wants to meet some of the best customers and run a focus group with them. You are asked to send the founder the IDs of the following customers:
- the customer who bought the most items overall in her lifetime
- for each item, the customer who bought that item the most
- Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal is to replace the old, manually created categories with these new ones. Each item can belong to just one cluster.
Data Loading
import pandas as pd
purchase = pd.read_csv("grocery/purchase_history.csv")  # user_id plus comma-separated item ids
items = pd.read_csv("grocery/item_to_id.csv")            # mapping between item names and item ids
purchase.head()
user_id | id | |
---|---|---|
0 | 222087 | 27,26 |
1 | 1343649 | 6,47,17 |
2 | 404134 | 18,12,23,22,27,43,38,20,35,1 |
3 | 1110200 | 9,23,2,20,26,47,37 |
4 | 224107 | 31,18,5,13,1,21,48,16,26,2,44,32,20,37,42,35,4... |
Data Manipulation
The dataset we have right now maps each user_id to the ids of the items they bought. However, each user's item record is aggregated into a comma-separated string. So first, we need to extract the information from it. The best way is to generate a table with one row per user and one column per product.
I have detailed the related technique in a previous blog post: Long to Wide.
If it is not clear why I do this, please refer to that post.
# Collect every unique item id that appears across all purchase records
unique_items = set([int(item) for rows in purchase.id.str.split(",") for item in rows])
def frequency(df):
    # Count how many times each item appears in this user's purchase rows
    ids = df.id.str.split(",").sum()
    items_dict = dict.fromkeys(unique_items, 0)
    for i in ids:
        items_dict[int(i)] = items_dict.get(int(i), 0) + 1
    return pd.Series(items_dict)
users = purchase.groupby("user_id").apply(frequency)
users.head()
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
47 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
68 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
113 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
123 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
223 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 48 columns
users.columns.name = "Items"
users.head()
Items | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
47 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
68 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
113 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
123 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
223 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 48 columns
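As an aside, the same user-by-item table can be built without an explicit Python loop, by exploding the comma-separated ids into long format and cross-tabulating. A minimal sketch (users_alt is just an illustrative name; the result should match users up to the column name):
long_purchases = purchase.assign(id=purchase.id.str.split(",")).explode("id")
long_purchases["id"] = long_purchases["id"].astype(int)
users_alt = pd.crosstab(long_purchases.user_id, long_purchases.id)  # counts per (user, item)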
Which customer buys the most?
# Total number of items each user bought over their lifetime, sorted descending
users.sum(axis=1).sort_values(ascending=False).head()
user_id
269335 72
367872 70
599172 64
397623 64
377284 63
dtype: int64
Customer 269335 bought the most items overall (72).
For each item, which customer buys the most?
# For each item (a row of users.T), take the largest purchase count and the user who achieved it
most_purchases = pd.concat([users.T.max(axis=1), users.T.idxmax(axis=1)], axis=1)
most_purchases.columns = ['Count', 'UserID']
most_purchases
Count | UserID | |
---|---|---|
Items | ||
1 | 4 | 31625 |
2 | 5 | 31625 |
3 | 4 | 154960 |
4 | 3 | 5289 |
5 | 3 | 217277 |
6 | 4 | 334664 |
7 | 3 | 175865 |
8 | 3 | 151926 |
9 | 4 | 269335 |
10 | 4 | 618914 |
11 | 3 | 367872 |
12 | 3 | 557904 |
13 | 4 | 653800 |
14 | 3 | 172120 |
15 | 3 | 143741 |
16 | 3 | 73071 |
17 | 4 | 366155 |
18 | 5 | 917199 |
19 | 3 | 31625 |
20 | 4 | 885474 |
21 | 4 | 884172 |
22 | 4 | 1199670 |
23 | 5 | 920002 |
24 | 3 | 189913 |
25 | 4 | 68282 |
26 | 4 | 967573 |
27 | 4 | 956666 |
28 | 4 | 204624 |
29 | 4 | 394348 |
30 | 2 | 21779 |
31 | 3 | 289360 |
32 | 4 | 109578 |
33 | 3 | 1310207 |
34 | 4 | 305916 |
35 | 3 | 450482 |
36 | 4 | 269335 |
37 | 4 | 46757 |
38 | 4 | 255546 |
39 | 5 | 599172 |
40 | 4 | 38872 |
41 | 4 | 133355 |
42 | 4 | 80215 |
43 | 4 | 996380 |
44 | 4 | 31625 |
45 | 5 | 1198106 |
46 | 4 | 1218645 |
47 | 4 | 384935 |
48 | 3 | 335841 |
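As a side note, the same table can be built without the double transpose by reducing over the user axis directly; a sketch (most_purchases_alt is just an illustrative name):
# max() and idxmax() default to axis=0, i.e. they reduce over users for each item column
most_purchases_alt = pd.DataFrame({"Count": users.max(), "UserID": users.idxmax()})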
First, apply association rules to estimate the product-wise probability of being bought together. The support of an item pair (A, B) is simply the fraction of users whose purchase history contains both A and B.
# Binarize the counts: 1 if the user ever bought the item, 0 otherwise (apriori expects 0/1 data)
binary = users.applymap(lambda x: 0 if x == 0 else 1)
binary.head()
Items | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
47 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
68 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
113 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
123 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
223 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 48 columns
Another project that applies association rules can be found here: click me
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Frequent item sets bought together by at least 2% of users
frequent_itemsets = apriori(binary, min_support=0.02, use_colnames=True)
# Only the pairwise support is needed here, not confidence or lift
rules = association_rules(frequent_itemsets, support_only=True, min_threshold=0.01)
rules = rules[["antecedents", "consequents", "support"]]
# Keep only single-item -> single-item rules and unwrap the frozensets into plain item ids
rules = rules[(rules['antecedents'].apply(len) == 1) & (rules['consequents'].apply(len) == 1)]
rules.antecedents = rules.antecedents.apply(lambda x: list(x)[0])
rules.consequents = rules.consequents.apply(lambda x: list(x)[0])
# Flip support into a distance-like value: frequently co-purchased pairs get small values
rules.support = rules.support.apply(lambda x: 1 - x)
rules.head()
antecedents | consequents | support | |
---|---|---|---|
0 | 1 | 2 | 0.802090 |
1 | 2 | 1 | 0.802090 |
2 | 1 | 3 | 0.878843 |
3 | 3 | 1 | 0.878843 |
4 | 1 | 4 | 0.940526 |
rules.support.describe()
count 2182.000000
mean 0.924455
std 0.036978
min 0.792968
25% 0.905636
50% 0.925779
75% 0.956470
max 0.979988
Name: support, dtype: float64
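Since support has been flipped into 1 − support, the smallest values now mark the pairs most often bought together. A quick way to peek at the strongest co-purchase pairs (a sketch using the rules frame above):
# Pairs with the smallest transformed support are the most frequently co-purchased
rules.sort_values("support").head(10)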
Using a pivot table to convert the rules into a pairwise matrix and trying different clustering methods. Since support was flipped to 1 − support, a small value means a pair is bought together often, so the matrix behaves like a distance matrix. (A silhouette-based check on the number of clusters is sketched after the clustering results below.)
- AgglomerativeClustering is a hierarchical method.
- DBSCAN is a density-based clustering method.
- Spectral Clustering works on a similarity (affinity) matrix.
- KMeans is the most commonly used partition-based clustering method.
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.cluster import SpectralClustering
from sklearn.cluster import KMeans
# Pivot the pairwise rules into a 48 x 48 item-by-item matrix; missing pairs are filled with 0
data_matrix = rules.pivot_table(index="antecedents", columns="consequents", values="support").fillna(0)
#model = AgglomerativeClustering(affinity='precomputed', n_clusters=10, linkage='average').fit(data_matrix)
#model = DBSCAN(metric="precomputed", eps=0.1).fit(data_matrix)
#model = SpectralClustering(n_clusters=11, affinity='precomputed').fit(data_matrix)
model = KMeans(n_clusters=10).fit(data_matrix)
# Map each item id (the pivot index is sorted 1..48) to its cluster label and attach the item names
labels = pd.DataFrame({"label": pd.Series(model.labels_), "Item_id": list(range(1, 49))})
results = items.merge(labels, on='Item_id').sort_values("label")
results['Index'] = results.groupby('label').cumcount() + 1  # position of each item within its cluster
results
Item_name | Item_id | label | Index | |
---|---|---|---|---|
12 | eggs | 14 | 0 | 1 |
15 | butter | 8 | 0 | 2 |
14 | yogurt | 48 | 0 | 3 |
13 | milk | 16 | 0 | 4 |
11 | cheeses | 21 | 0 | 5 |
41 | sandwich bags | 7 | 1 | 1 |
40 | aluminum foil | 15 | 1 | 2 |
39 | toilet paper | 33 | 1 | 3 |
38 | paper towels | 24 | 1 | 4 |
20 | waffles | 5 | 2 | 1 |
19 | pasta | 31 | 2 | 2 |
0 | coffee | 43 | 3 | 1 |
36 | laundry detergent | 18 | 3 | 2 |
37 | dishwashing | 27 | 3 | 3 |
3 | soda | 9 | 3 | 4 |
2 | juice | 38 | 3 | 5 |
1 | tea | 23 | 3 | 6 |
28 | cherries | 25 | 4 | 1 |
34 | cucumbers | 42 | 4 | 2 |
33 | cauliflower | 45 | 4 | 3 |
32 | carrots | 10 | 4 | 4 |
31 | broccoli | 44 | 4 | 5 |
30 | apples | 32 | 4 | 6 |
29 | grapefruit | 20 | 4 | 7 |
35 | lettuce | 2 | 4 | 8 |
27 | berries | 40 | 4 | 9 |
23 | poultry | 6 | 4 | 10 |
25 | pork | 47 | 4 | 11 |
4 | sandwich loaves | 39 | 4 | 12 |
5 | dinner rolls | 37 | 4 | 13 |
6 | tortillas | 34 | 4 | 14 |
7 | bagels | 13 | 4 | 15 |
26 | bananas | 46 | 4 | 16 |
47 | pet items | 3 | 4 | 17 |
18 | sugar | 1 | 4 | 18 |
24 | beef | 17 | 4 | 19 |
46 | baby items | 4 | 5 | 1 |
16 | cereals | 11 | 6 | 1 |
17 | flour | 30 | 6 | 2 |
10 | ketchup | 41 | 7 | 1 |
9 | spaghetti sauce | 26 | 7 | 2 |
8 | canned vegetables | 28 | 7 | 3 |
42 | shampoo | 12 | 8 | 1 |
43 | soap | 35 | 8 | 2 |
44 | hand soap | 29 | 8 | 3 |
45 | shaving cream | 19 | 8 | 4 |
21 | frozen vegetables | 22 | 9 | 1 |
22 | ice cream | 36 | 9 | 2 |
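To make the choice of n_clusters less arbitrary, one could sweep a few values and compare silhouette scores; a minimal sketch, using data_matrix from above and treating each row as that item's feature vector (exactly how KMeans consumes it):
from sklearn.metrics import silhouette_score
# Higher silhouette means tighter, better-separated clusters
for k in range(5, 15):
    km = KMeans(n_clusters=k, random_state=0).fit(data_matrix)
    print(k, round(silhouette_score(data_matrix, km.labels_), 3))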
Visualizing the clustering
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(20, 8))
ax.scatter(results.label, results.Index, c=results.label)
ax.set_title("KMeans Clustering Results")
# Annotate each point with its item name
for i, txt in enumerate(results.Item_name):
    ax.annotate(txt, (results.label.iloc[i], results.Index.iloc[i]))
fig, ax = plt.subplots(figsize=(20, 8))
ax.scatter(results.label, results.Index, c=results.label)
ax.set_title("Spectral Clustering Results")
for i, txt in enumerate(results.Item_name):
    ax.annotate(txt, (results.label.iloc[i], results.Index.iloc[i]))
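The two plotting cells above differ only in the title, so if this gets reused, a small helper avoids the duplication (a sketch; plot_clusters is just an illustrative name):
def plot_clusters(results, title):
    # Scatter items by (cluster label, within-cluster index) and annotate with item names
    fig, ax = plt.subplots(figsize=(20, 8))
    ax.scatter(results.label, results.Index, c=results.label)
    ax.set_title(title)
    for i, txt in enumerate(results.Item_name):
        ax.annotate(txt, (results.label.iloc[i], results.Index.iloc[i]))
    return ax
plot_clusters(results, "KMeans Clustering Results")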
Conclusion
**It turns out that KMeans and Spectral Clustering work better.**
We can suggest placing some frequently co-purchased items near each other for customers' convenience:
- coffee+bagels
- broccoli+beef+berries
- soda+dinner rolls
- pork+cherries+cauliflower
This suggests that items belonging to the same dishes can be placed together. For example, people are likely to cook broccoli and beef together, so putting them close to each other makes customers more likely to grab both.