This is my solution for homework 4 of 95-851 Making Products Count: Data Science for Product Management. This is a good practice for data manipulation and KMeans clustering in sklearn.

The goal of this project is to group bike shops with similar bikes sales pattern. Based on the assumption that bikes are selected by its model, price, and other features, customers may have different preferences towards bike shops. We’ll find this pattern using K means Clustering.

We are given three datasets: bikes, order, bikeshops. The first step is to merge them and aggregate bikes order for all 30 bike shops. Then, to scale the quantity, we transform the quantity to the relative quantity(percentage). Last, we choose the right number of clusters to perform K-means Clustering.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

Step 1: Read Datasets

bike = pd.ExcelFile("bikes.xlsx").parse("Sheet1")
shop = pd.ExcelFile("bikeshops.xlsx").parse("Sheet1")
order = pd.ExcelFile("orders.xlsx").parse("Sheet1")
order.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15644 entries, 1 to 15644
Data columns (total 6 columns):
order.id       15644 non-null int64
order.line     15644 non-null int64
order.date     15644 non-null datetime64[ns]
customer.id    15644 non-null int64
product.id     15644 non-null int64
quantity       15644 non-null int64
dtypes: datetime64[ns](1), int64(5)
memory usage: 855.5 KB
bike.head()
bike.id model category1 category2 frame price
0 1 Supersix Evo Black Inc. Road Elite Road Carbon 12790
1 2 Supersix Evo Hi-Mod Team Road Elite Road Carbon 10660
2 3 Supersix Evo Hi-Mod Dura Ace 1 Road Elite Road Carbon 7990
3 4 Supersix Evo Hi-Mod Dura Ace 2 Road Elite Road Carbon 5330
4 5 Supersix Evo Hi-Mod Utegra Road Elite Road Carbon 4260
shop.head()
bikeshop.id bikeshop.name bikeshop.city bikeshop.state latitude longitude
0 1 Pittsburgh Mountain Machines Pittsburgh PA 40.440625 -79.995886
1 2 Ithaca Mountain Climbers Ithaca NY 42.443961 -76.501881
2 3 Columbus Race Equipment Columbus OH 39.961176 -82.998794
3 4 Detroit Cycles Detroit MI 42.331427 -83.045754
4 5 Cincinnati Speed Cincinnati OH 39.103118 -84.512020
order.head()
order.id order.line order.date customer.id product.id quantity
1 1 1 2011-01-07 2 48 1
2 1 2 2011-01-07 2 52 1
3 2 1 2011-01-10 10 76 1
4 2 2 2011-01-10 10 52 1
5 3 1 2011-01-10 6 2 1

Step 2: Merge three datasets

order_shop=pd.merge(left=order, right=shop, how='outer', left_on="customer.id", right_on="bikeshop.id")
order_final = pd.merge(left=order_shop,right = bike,left_on = "product.id",right_on="bike.id")
order_final.head()
order.id order.line order.date customer.id product.id quantity bikeshop.id bikeshop.name bikeshop.city bikeshop.state latitude longitude bike.id model category1 category2 frame price
0 1 1 2011-01-07 2 48 1 2 Ithaca Mountain Climbers Ithaca NY 42.443961 -76.501881 48 Jekyll Carbon 2 Mountain Over Mountain Carbon 6070
1 132 6 2011-05-13 2 48 1 2 Ithaca Mountain Climbers Ithaca NY 42.443961 -76.501881 48 Jekyll Carbon 2 Mountain Over Mountain Carbon 6070
2 507 2 2012-06-26 2 48 1 2 Ithaca Mountain Climbers Ithaca NY 42.443961 -76.501881 48 Jekyll Carbon 2 Mountain Over Mountain Carbon 6070
3 528 18 2012-07-16 2 48 1 2 Ithaca Mountain Climbers Ithaca NY 42.443961 -76.501881 48 Jekyll Carbon 2 Mountain Over Mountain Carbon 6070
4 691 13 2013-02-05 2 48 1 2 Ithaca Mountain Climbers Ithaca NY 42.443961 -76.501881 48 Jekyll Carbon 2 Mountain Over Mountain Carbon 6070

Step 3: Convert Unit Price

#order_final.groupby(['model','category1','category2','frame'])['price'].median().median()
#order_final.price.median()
#order_final.groupby(['category1','category2','frame'])['price'].unique().apply(np.median)
order_final.groupby('model').price.unique().apply(lambda x:x[0]).median()
#np.sort(order_final.price.unique())
#np.sort(order_final.price.unique())
3200.0

Median Price for all the models are 3200. So we categorize price lower than and eqaul to 3200 as low, otherwise high.

#order_final["category median"]=order_final.groupby(['category1','category2','frame'])['price'].transform(lambda x:np.median(x.unique()))
#order_final['unitPrice']=(order_final['price']>order_final["category median"]).apply(lambda x: "high" if x==True else "low")
order_final['unitPrice']=order_final['price'].apply(lambda x: 'high' if x>3200 else 'low')

Step 4: Obtain pivot table

summary=order_final.groupby(['bikeshop.name', 'model', 'category1', 'category2', 'frame', 'price','unitPrice']).quantity.sum().reset_index()
customers = summary[["bikeshop.name","model","quantity"]].pivot(index="bikeshop.name",columns = "model",values="quantity").fillna(0).reset_index().rename_axis(None,1)
summary.head()
bikeshop.name model category1 category2 frame price unitPrice quantity
0 Albuquerque Cycles Bad Habit 1 Mountain Trail Aluminum 3200 low 5
1 Albuquerque Cycles Bad Habit 2 Mountain Trail Aluminum 2660 low 2
2 Albuquerque Cycles Beast of the East 1 Mountain Trail Aluminum 2770 low 3
3 Albuquerque Cycles Beast of the East 2 Mountain Trail Aluminum 2130 low 3
4 Albuquerque Cycles Beast of the East 3 Mountain Trail Aluminum 1620 low 1

Get ratio

scaling

customers=customers.set_index("bikeshop.name")
customers['sum']=customers.sum(axis=1)
customers=customers.div(customers["sum"], axis=0).drop('sum',axis=1)
customers.head()
Bad Habit 1 Bad Habit 2 Beast of the East 1 Beast of the East 2 Beast of the East 3 CAAD Disc Ultegra CAAD12 105 CAAD12 Black Inc CAAD12 Disc 105 CAAD12 Disc Dura Ace ... Synapse Sora Trail 1 Trail 2 Trail 3 Trail 4 Trail 5 Trigger Carbon 1 Trigger Carbon 2 Trigger Carbon 3 Trigger Carbon 4
bikeshop.name
Albuquerque Cycles 0.017483 0.006993 0.01049 0.010490 0.003497 0.013986 0.006993 0.000000 0.013986 0.048951 ... 0.000000 0.000000 0.003497 0.006993 0.017483 0.010490 0.006993 0.003497 0.006993 0.006993
Ann Arbor Speed 0.006645 0.009967 0.01495 0.009967 0.003322 0.026578 0.014950 0.016611 0.014950 0.008306 ... 0.009967 0.009967 0.014950 0.009967 0.003322 0.011628 0.000000 0.000000 0.000000 0.011628
Austin Cruisers 0.008130 0.004065 0.00813 0.008130 0.000000 0.020325 0.020325 0.004065 0.024390 0.008130 ... 0.020325 0.016260 0.016260 0.016260 0.008130 0.008130 0.000000 0.000000 0.000000 0.016260
Cincinnati Speed 0.005115 0.000000 0.00000 0.000000 0.000000 0.015345 0.010230 0.015345 0.007673 0.017903 ... 0.012788 0.000000 0.002558 0.002558 0.002558 0.000000 0.010230 0.007673 0.010230 0.020460
Columbus Race Equipment 0.010152 0.000000 0.00000 0.005076 0.002538 0.010152 0.027919 0.027919 0.025381 0.012690 ... 0.015228 0.002538 0.002538 0.005076 0.000000 0.000000 0.010152 0.005076 0.017766 0.005076

5 rows × 97 columns

Step 5: KMeans Clustering

I used elbow method and silhouette analysis to determine the number of clusters. It is hard to tell from elbow method but silhouette analysis clearly suggests K=5.

from sklearn.cluster import KMeans
inertias = []
for i in range(4,9):
    kmean = KMeans(n_clusters=i,random_state=20,n_init=50)
    kmean.fit(customers)
    inertias.append(kmean.inertia_)

import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
plt.plot(range(4,9),inertias)
    
[<matplotlib.lines.Line2D at 0x148f94dc0b8>]

png

from sklearn.metrics import silhouette_samples, silhouette_score
silhouette = []
for i in range(4,9):
    clusterer = KMeans(n_clusters=i, random_state=20,n_init=50)
    cluster_labels = clusterer.fit_predict(customers)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(customers, cluster_labels)
    silhouette.append(silhouette_avg)
    #print("For n_clusters =",i,
    #      "The average silhouette_score is :", silhouette_avg)

import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
plt.plot(range(4,9),silhouette)
[<matplotlib.lines.Line2D at 0x148f9c42080>]

png

Step 6: Cluster Analysis

clusterer = KMeans(n_clusters=5, random_state=20,n_init=50)
cluster_labels = clusterer.fit_predict(customers)
customers['group'] = cluster_labels
import seaborn as sns
plt.figure(figsize=(10,10))
sns.heatmap(customers.groupby("group").mean().T)
#plt.figure(figsize=(10,10))
#sns.heatmap(customers.groupby("group").mean())
#customers.groupby("group").mean()
<matplotlib.axes._subplots.AxesSubplot at 0x148f5f982e8>

png

pd.set_option('max_colwidth', 800)
pd.DataFrame(customers.reset_index().groupby("group")["bikeshop.name"].unique())
bikeshop.name
group
0 [Philadelphia Bike Shop, San Antonio Bike Shop]
1 [Ann Arbor Speed, Austin Cruisers, Indianapolis Velocipedes, Miami Race Equipment, Nashville Cruisers, New Orleans Velocipedes, Oklahoma City Race Equipment, Seattle Race Equipment]
2 [Ithaca Mountain Climbers, Pittsburgh Mountain Machines, Tampa 29ers]
3 [Cincinnati Speed, Columbus Race Equipment, Las Vegas Cycles, Louisville Race Equipment, San Francisco Cruisers, Wichita Speed]
4 [Albuquerque Cycles, Dallas Cycles, Denver Bike Shop, Detroit Cycles, Kansas City 29ers, Los Angeles Cycles, Minneapolis Bike Shop, New York Cycles, Phoenix Bi-peds, Portland Bi-peds, Providence Bi-peds]
ratio = lambda x: x.div(x.sum(axis=1),axis=0)
groups = customers.reset_index()[['bikeshop.name','group']]
final=pd.merge(left = summary,right=groups,on="bikeshop.name")
bike_info = summary[['model','category1','category2','frame','unitPrice']].drop_duplicates()

Segmentation Analysis

Now that we get four segmented groups, we need find out the distinctions between them. It is a good way to validate our clustering results.

Since one shop has sold lots of different models of bikes, we just study on the top 10 bikes sold in each cluster.

def top10(groupN):
    return pd.DataFrame(customers[customers.group==groupN].mean().sort_values(ascending=False).iloc[1:]).reset_index().rename(columns={'index': 'model',0:'ratio'}).head(10).merge(bike_info,on='model')
top10(groupN=0)
model ratio category1 category2 frame unitPrice
0 Trigger Carbon 4 0.030415 Mountain Over Mountain Carbon low
1 CAAD12 105 0.027079 Road Elite Road Aluminum low
2 Beast of the East 3 0.026333 Mountain Trail Aluminum low
3 Trail 1 0.024940 Mountain Sport Aluminum low
4 Bad Habit 1 0.022998 Mountain Trail Aluminum low
5 F-Si Carbon 4 0.022449 Mountain Cross Country Race Carbon low
6 CAAD12 Disc 105 0.020957 Road Elite Road Aluminum low
7 Synapse Disc 105 0.020957 Road Endurance Road Aluminum low
8 Trail 2 0.020309 Mountain Sport Aluminum low
9 Synapse Carbon Disc 105 0.019662 Road Endurance Road Carbon low

Cluster 0

Features of bikes: relatively low price, mixed category1 and frames

top10(groupN=1)
model ratio category1 category2 frame unitPrice
0 Synapse Disc Tiagra 0.024608 Road Endurance Road Aluminum low
1 CAAD12 Red 0.022889 Road Elite Road Aluminum low
2 Synapse Sora 0.022305 Road Endurance Road Aluminum low
3 Slice Ultegra 0.022077 Road Triathalon Carbon low
4 Supersix Evo Ultegra 3 0.021857 Road Elite Road Carbon low
5 Synapse Disc 105 0.021122 Road Endurance Road Aluminum low
6 Synapse Carbon Ultegra 4 0.020372 Road Endurance Road Carbon low
7 CAAD12 Ultegra 0.020073 Road Elite Road Aluminum low
8 Slice Ultegra D12 0.019531 Road Triathalon Carbon low
9 Supersix Evo Ultegra 4 0.019435 Road Elite Road Carbon low

Cluster 1

Features of bikes: relatively low price, Road bikes and mixed-type frames

top10(groupN=2)
model ratio category1 category2 frame unitPrice
0 Scalpel-Si Carbon 3 0.034269 Mountain Cross Country Race Carbon high
1 Jekyll Carbon 4 0.030282 Mountain Over Mountain Carbon low
2 Scalpel 29 Carbon Race 0.028039 Mountain Cross Country Race Carbon high
3 Trigger Carbon 3 0.025935 Mountain Over Mountain Carbon high
4 Habit Carbon 2 0.023375 Mountain Trail Carbon high
5 Trigger Carbon 4 0.023261 Mountain Over Mountain Carbon low
6 Catalyst 4 0.021564 Mountain Sport Aluminum low
7 Jekyll Carbon 2 0.021079 Mountain Over Mountain Carbon high
8 Supersix Evo Hi-Mod Dura Ace 2 0.021058 Road Elite Road Carbon high
9 Trigger Carbon 2 0.021043 Mountain Over Mountain Carbon high

Cluster 2

Features of bikes: mixed level price, mountain bikes with Carbon frames

top10(groupN=3)
model ratio category1 category2 frame unitPrice
0 Synapse Hi-Mod Disc Red 0.024675 Road Endurance Road Carbon high
1 Slice Hi-Mod Black Inc. 0.023494 Road Triathalon Carbon high
2 Supersix Evo Hi-Mod Dura Ace 1 0.023053 Road Elite Road Carbon high
3 Slice Hi-Mod Dura Ace D12 0.022961 Road Triathalon Carbon high
4 Synapse Hi-Mod Dura Ace 0.021672 Road Endurance Road Carbon high
5 CAAD12 Red 0.021196 Road Elite Road Aluminum low
6 Synapse Carbon Disc Ultegra 0.020239 Road Endurance Road Carbon high
7 Supersix Evo Ultegra 3 0.020187 Road Elite Road Carbon low
8 Supersix Evo Hi-Mod Utegra 0.019819 Road Elite Road Carbon high
9 Synapse Hi-Mod Disc Black Inc. 0.019755 Road Endurance Road Carbon high

Cluster 3

Features of bikes: relatively high price, Road bikes and carbon frames

top10(groupN=4)
model ratio category1 category2 frame unitPrice
0 F-Si 2 0.021705 Mountain Cross Country Race Aluminum low
1 Catalyst 3 0.019743 Mountain Sport Aluminum low
2 F-Si Carbon 4 0.017716 Mountain Cross Country Race Carbon low
3 Trail 5 0.016265 Mountain Sport Aluminum low
4 CAAD8 Sora 0.016141 Road Elite Road Aluminum low
5 CAAD12 Disc 105 0.015987 Road Elite Road Aluminum low
6 CAAD8 105 0.015796 Road Elite Road Aluminum low
7 Habit 4 0.015531 Mountain Trail Aluminum low
8 Synapse Carbon Disc 105 0.015407 Road Endurance Road Carbon low
9 Trail 3 0.014985 Mountain Sport Aluminum low

Cluster 4

Features of bikes: relatively low price, mixed-type bikes and Aluminum frames

Trying other methods of clustering

Using hierachical clustering

customers = customers.drop('group',axis=1)
customers.head()
Bad Habit 1 Bad Habit 2 Beast of the East 1 Beast of the East 2 Beast of the East 3 CAAD Disc Ultegra CAAD12 105 CAAD12 Black Inc CAAD12 Disc 105 CAAD12 Disc Dura Ace ... Synapse Sora Trail 1 Trail 2 Trail 3 Trail 4 Trail 5 Trigger Carbon 1 Trigger Carbon 2 Trigger Carbon 3 Trigger Carbon 4
bikeshop.name
Albuquerque Cycles 0.017483 0.006993 0.01049 0.010490 0.003497 0.013986 0.006993 0.000000 0.013986 0.048951 ... 0.000000 0.000000 0.003497 0.006993 0.017483 0.010490 0.006993 0.003497 0.006993 0.006993
Ann Arbor Speed 0.006645 0.009967 0.01495 0.009967 0.003322 0.026578 0.014950 0.016611 0.014950 0.008306 ... 0.009967 0.009967 0.014950 0.009967 0.003322 0.011628 0.000000 0.000000 0.000000 0.011628
Austin Cruisers 0.008130 0.004065 0.00813 0.008130 0.000000 0.020325 0.020325 0.004065 0.024390 0.008130 ... 0.020325 0.016260 0.016260 0.016260 0.008130 0.008130 0.000000 0.000000 0.000000 0.016260
Cincinnati Speed 0.005115 0.000000 0.00000 0.000000 0.000000 0.015345 0.010230 0.015345 0.007673 0.017903 ... 0.012788 0.000000 0.002558 0.002558 0.002558 0.000000 0.010230 0.007673 0.010230 0.020460
Columbus Race Equipment 0.010152 0.000000 0.00000 0.005076 0.002538 0.010152 0.027919 0.027919 0.025381 0.012690 ... 0.015228 0.002538 0.002538 0.005076 0.000000 0.000000 0.010152 0.005076 0.017766 0.005076

5 rows × 97 columns

from scipy.cluster.hierarchy import dendrogram, linkage  
from matplotlib import pyplot as plt

linked = linkage(customers, 'ward')

#labelList = customers.index
labelList = range(0,30)

plt.figure(figsize=(20, 10))  
dendrogram(linked,  
            orientation='top',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=True)

plt.axhline(y=0.16, color='r', linestyle='-')

plt.show()  

png

4 clusters are chosen here.

newclusters=[]
group1=[2,8,27,16,1,17,14,19]
group2 = [20,25,0,5,12,6,10,18,7,24,15,21,23]
group3 = [3,13,11,29,4,26]
group4 = [28,9,22]

newclusters.append([['group0'],customers.index[group1].tolist()])
newclusters.append([['group1'],customers.index[group2].tolist()])
newclusters.append([['group2'],customers.index[group3].tolist()])
newclusters.append([['group3'],customers.index[group4].tolist()])

pd.DataFrame(np.array(newclusters))
0 1
0 [group0] [Austin Cruisers, Indianapolis Velocipedes, Seattle Race Equipment, Nashville Cruisers, Ann Arbor Speed, New Orleans Velocipedes, Miami Race Equipment, Oklahoma City Race Equipment]
1 [group1] [Philadelphia Bike Shop, San Antonio Bike Shop, Albuquerque Cycles, Dallas Cycles, Los Angeles Cycles, Denver Bike Shop, Kansas City 29ers, New York Cycles, Detroit Cycles, Providence Bi-peds, Minneapolis Bike Shop, Phoenix Bi-peds, Portland Bi-peds]
2 [group2] [Cincinnati Speed, Louisville Race Equipment, Las Vegas Cycles, Wichita Speed, Columbus Race Equipment, San Francisco Cruisers]
3 [group3] [Tampa 29ers, Ithaca Mountain Climbers, Pittsburgh Mountain Machines]

above is new clustering grouping

below is previous clsutering grouping

Compared to the results from KMeans, it was found cluster 0 and 4 were combined.

above group0 –> below group 1

above group1 –> below group0+below group4

above group2 –> below group3

above group3 –> below group2

pd.DataFrame(customers.reset_index().groupby("group")["bikeshop.name"].unique())
bikeshop.name
group
0 [Philadelphia Bike Shop, San Antonio Bike Shop]
1 [Ann Arbor Speed, Austin Cruisers, Indianapolis Velocipedes, Miami Race Equipment, Nashville Cruisers, New Orleans Velocipedes, Oklahoma City Race Equipment, Seattle Race Equipment]
2 [Ithaca Mountain Climbers, Pittsburgh Mountain Machines, Tampa 29ers]
3 [Cincinnati Speed, Columbus Race Equipment, Las Vegas Cycles, Louisville Race Equipment, San Francisco Cruisers, Wichita Speed]
4 [Albuquerque Cycles, Dallas Cycles, Denver Bike Shop, Detroit Cycles, Kansas City 29ers, Los Angeles Cycles, Minneapolis Bike Shop, New York Cycles, Phoenix Bi-peds, Portland Bi-peds, Providence Bi-peds]

Cluster 0 and 4

Cluster 0 –> Features of bikes: relatively low price, mixed-type bikes and frames Cluster 4 –> Features of bikes: relatively low price, mixed-type bikes and Aluminum frames

Cluster 0 and 4 are very similary and they can be combined into big groups: relatively low price, mixed-type bikes and mixed-type frames.

In hierachical clustering, with different criteria to cut tree, the results can be different. Thus, in this sense, cluster 0 and cluster 4 can be combined because of high similarity/closeness.

———————————————— The End ———————————————————————-