These days, the need to understand paying users has been growing noticeably among companies, especially those operating online solutions. As a result, the business intelligence unit is expected to provide an accurate and insightful segmentation analysis to business units. In this article, I will walk you through step-by-step instructions on how to conduct a segmentation analysis using K-means clustering.

Figure 1: Photo by John Lockwood on Unsplash

First of all, I would like to mention some of the possible analyses that may fit the task of customer segmentation:

  • Recency-Frequency-Monetary Model (RFM) is a well-known segmentation technique.

  • Conversion Funnel segmentation, grouping customers into different groups based on their last step in the funnel, is another good example.

  • Pure financial-based segmentation, such as dividing users into groups based on their LTV, is a good solution, too.

  • In my opinion, though, the most commonly mentioned technique is behavioral segmentation, especially segmentation based on customers' purchase history, which is the key topic of this article.

But how is this relevant to K-means clustering, and how do I know when to use K-means?

There may not be a general rule of thumb; it all starts with the raw data at hand. You have to examine the data you have and consider which technique is a good candidate for your problem, or even try out several techniques and algorithms to decide which fits best. In the context of my problem, I was trying to understand buying tendency through purchasing patterns and spending intensity, which means the dataset was basically numeric, so distance-based clustering algorithms seemed to be good candidates.

The drawbacks of distance-based clustering algorithms:

  • The technique is simple to carry out, so it may not provide an extremely accurate clustering result.

  • It provides directional clues and shows the main characteristics of each segment, not the rationale behind those traits.

  • With a large dataset (e.g. millions of observations), it will be computationally expensive to run this kind of algorithm.

That's enough context; below is the tutorial you need :D

1. Data Collection and Processing

a. Descriptive Summary

The metrics collected include the number of orders requested (cnt_order), the number of transactions that occurred (cnt_trans), the average order value (avg_aov), the number of different services used (cnt_ser), the average price of all services in the month (avg_ser_price), and the count of different services bought on the platform. Please note that the raw data is aggregated on a monthly basis for each customer (acc_id).

Descriptive Summary of Dataset

We can see that most of the variables are highly right-skewed; therefore, it is necessary to remove outliers in our analysis. In addition, the dataset contains nearly 144k user-by-month observations, and a smaller sample is enough to extract insights while reducing the computational burden.

b. Outlier Removal & Data Sampling

It is reasonable to sample the dataset with stratified sampling on the average order value variable avg_aov. By using this method, we preserve the nature of the different user segments based on their basket value, that is, the amount of money they spend on each order throughout the month.

# stratified sampling
set.seed(1506)
split <- initial_split(summ_service_by_seller_elt, prop = 0.05, 
                       strata = "avg_aov")
summ_ser_elt_samp  <- training(split)

summary(summ_ser_elt_samp)

A quantile-based cutoff is a feasible approach in this situation: avg_aov is the key measure, and observations above its 99th percentile are treated as outliers.

# indexing outliers: 99th percentile of avg_aov
outs <- quantile(summ_ser_elt_samp$avg_aov, 0.99)

# remove outliers and drop columns not used in the analysis
summ_ser_elt_samp_no_outs <- summ_ser_elt_samp[
  summ_ser_elt_samp$avg_aov <= outs,
  -which(names(summ_ser_elt_samp) %in% c("cnt_trans", "cnt_cupid"))
]

summary(summ_ser_elt_samp_no_outs)
Dataset after processing

Please note that the count of transactions (cnt_trans) is also removed due to its high correlation with the number of orders.
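As a quick sanity check, you can inspect that correlation directly. This is a minimal sketch, run on the sampled data before the column is removed (it assumes cnt_trans is still present at that point):

# correlation between order count and transaction count on the sampled data
cor(summ_ser_elt_samp$cnt_order, summ_ser_elt_samp$cnt_trans)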

2. Explore the Data

A correlation matrix is useful for evaluating the relationships between variables and exploring the dataset.

# correlation matrix with histograms (chart.Correlation from PerformanceAnalytics)
summ_ser_elt_samp_no_outs %>% 
  select(-c("acc_id", "month")) %>% 
  chart.Correlation(histogram = TRUE, pch = 19)
Correlation Matrix

It seems there are notable pairs of highly correlated variables, such as cnt_order and cnt_ser, cnt_bump and cnt_all_bump, or cnt_bump and cnt_ser.

Knowing that those variables represent two basic factors of consumption, the quantity consumed and the price of the product, we should examine how the data are distributed across those two dimensions.

# relationship between avg_aov and cnt_order
plot1 <- 
  summ_ser_elt_samp_no_outs %>% 
  # dplyr::filter(avg_ser_price < 50000, cnt_ser < 200) %>%
  ggplot(aes(x = avg_aov, y = cnt_order)) +
  geom_point() +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Distribution of paying users' purchases",
       x = "Average Order Value (VND)",
       y = "Number of orders") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# add marginal density plots (ggExtra)
ggMarginal(plot1, type = "density")

From the plot, we can see that users' willingness to pay varies more than the number of orders they request. The reason could be that users have a limited number of items/products or ads to insert, which puts a cap on the number of purchases they can make per month.

The plot also indicates that we lack "big whale" users, the ones who place a large number of orders at a high price, say 100k VND and above. Most of our paying users fall into the category of "paying a little in some cases".

NOTE:
In a real case, you may encounter many other variables that represent different aspects of customer purchase behavior. Therefore, you need to explore each of those variables further to fully understand your customers.

3. Build the Model

a. Input Standardization

Before diving into the algorithm, it is necessary to standardize the input for K-means. In this case, I subtract the mean from each measure and divide by its standard deviation (z-score standardization).
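For each variable x, this is the standard z-score transformation (a generic formulation, not specific to this dataset):

$$z = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ is the sample mean and $s$ the sample standard deviation of that variable; this is exactly what R's scale() computes with its default settings.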

# split the sample data
set.seed(1506)
split2 <- initial_split(summ_ser_elt_samp_no_outs, prop = 0.7, 
                       strata = "avg_aov")
train_dt <- training(split2)
test_dt  <- testing(split2)
# scale the input
train_dt_z <- as.data.frame(cbind(train_dt[1:2], lapply(train_dt[3:13], scale)))
head(train_dt_z)
Standardized Inputs

b. Understanding the algorithm

At the heart of every algorithm is its cost function; in the case of k-means, we use the Within-Cluster Sum of Squares (WCSS) to evaluate the error k-means makes and to decide which k is the best fit. Below is the formal equation of WCSS:

WCSS Formula
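For reference, a standard way to write it (generic notation, not tied to this dataset) is:

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $C_k$ is the set of observations assigned to cluster $k$ and $\mu_k$ is the centroid of that cluster. In R, kmeans() returns this quantity as tot.withinss, which is what the Elbow method below plots against k.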

c. Finding optimal K

The key to optimizing k-means clustering is finding the right number of clusters k. We apply the Elbow method for this.

# function returning the total within-cluster sum of squares for a given k
wss <- function(mydata, k) {
  kmeans(mydata, k, nstart = 10)$tot.withinss
}

# compute and plot wss for k = 3 to k = 25
k.values <- 3:25

# extract wss for each candidate k
wss_values <- map_dbl(k.values, function(x) wss(train_dt_z[3:13], x))

plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

Given the trade-off between optimizing the algorithm and keeping the result interpretable, I chose 9 as the optimal k.

# examine on test data
set.seed(1506)
# note: deselect `cnt_timer_bump` if it is all zeros in your sample (depends on the data)
test_dt_z <- as.data.frame(cbind(test_dt[1:2], lapply(test_dt[3:13], scale)))

test_kmean <- kmeans(test_dt_z[3:13], 9, iter.max = 15, nstart = 25)

test_kmean$centers

After one round of validation on the test set, the result of the model is shown in the next part.

4. Interpret the Result

a. Cluster Centers

The result of the k-means algorithm with 9 clusters is shown as follows.

# final k-means model on the standardized training data with k = 9
set.seed(1506)
final_kmean <- kmeans(train_dt_z[3:13], 9, nstart = 25)

final_kmean$centers
Summary of clusters

Based on the result of k-means clustering, we can pick out remarkable groups with differentiated natures as follows (a sketch for profiling these clusters on the original scale follows the list):

  • The Pro Shops: those who spend a medium to high amount on the platform across all types of services, especially SA (group 8).

  • The Bumpers: users who heavily spend on bumps only (group 9).

  • The Heavy Listors: those who pay mostly for listing fees rather than bumps (group 6).

  • The Light Listors: the same as the Heavy Listors but with less intensive consumption.

  • The Value Seekers: those who buy a small number of newad services but are willing to pay high prices (group 1).

  • The Controllers: those who focus on timer bumps rather than other bump services (group 5).

  • The Shops: those who spend heavily on shop-related services (group 7).
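To turn the standardized centers into readable profiles, one option is to attach the cluster labels back to the unscaled training data and summarise each group. This is a minimal sketch, assuming final_kmean and train_dt from the earlier steps are still in memory:

# attach cluster labels to the unscaled training data
train_profiled <- train_dt %>%
  mutate(cluster = final_kmean$cluster)

# per-cluster averages on the original scale, plus segment sizes
train_profiled %>%
  select(-acc_id, -month) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean), n_users = n())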

b. Visual Interpretation

Commoner Separation
Listor Separation
Controller Separation
Pro Shop Separation
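The separation views above are essentially scatter plots of pairs of input variables coloured by cluster assignment. A minimal sketch of that idea, assuming train_dt_z and final_kmean from above (the variable pair shown here is just an example), could look like this:

# colour the standardized training data by cluster to eyeball separation
train_dt_z %>%
  mutate(cluster = factor(final_kmean$cluster)) %>%
  ggplot(aes(x = avg_aov, y = cnt_order, colour = cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "Cluster separation: avg_aov vs cnt_order",
       x = "Average Order Value (standardized)",
       y = "Number of orders (standardized)") +
  theme_minimal()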

Conclusion

K-means is one of the first methods that comes to mind when clustering is mentioned. It is easy to run and provides a general picture of your dataset. However, it comes at the cost of being computationally expensive on large data and lacking qualitative insights.

Therefore, it is important to note that after producing a detailed segmentation, we need to conduct deep-dive analyses and run experiments (A/B tests) in order to understand why and how those segments react to different approaches.

References

Elbow Method: https://en.wikipedia.org/wiki/Elbow_method_(clustering)

K-means Clustering: https://en.wikipedia.org/wiki/K-means_clustering

Data Mining Textbook: https://www.amazon.com/Data-Mining-Business-Analytics-Applications/dp/1118729277

R-Bloggers: https://www.r-bloggers.com/

Configuration

# load libraries
library(readr)
library(dplyr)
library(rsample)
library(ggplot2)
library(ggExtra)
library(corrplot)
library(PerformanceAnalytics)
library(GGally)
library(purrr)
library(scales)