Cluster analysis is a common statistical technique used in many fields such as marketing, healthcare (Loftus et al. 2022), and sports science (Anıl Duman, Sennaroğlu, and Tuzkaya 2021).
At its core, cluster analysis is a mathematical method for processing data, organizing items into groups or clusters based on how closely associated they are. For example, if you gathered shells from the beach, noted their attributes of size, shape, colour, and sorted similar shells into piles, you would be performing a ‘physical’ cluster analysis. Each pile of shells would be a ‘cluster’ (Romesburg 2004).
Instead of physically sorting objects, a mathematical cluster analysis sorts objects described as data, whereby similar digital descriptions of the objects are mathematically gathered into the same cluster (Romesburg 2004).
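To make the idea concrete, here is a minimal sketch in base R. The `shells` data frame and its measurements are invented for illustration, and the sketch uses simple hierarchical clustering (`hclust`) on pairwise distances rather than the model-based method used in the rest of this article.

```r
# Hypothetical shell measurements, invented for illustration only
shells <- data.frame(
  size_mm  = c(10, 11, 12, 40, 42, 41),
  width_mm = c(5, 6, 5, 20, 22, 21)
)

# Pairwise Euclidean distances: a small distance means two shells are similar
d <- dist(shells)

# Merge the closest shells step by step, then cut the tree into two "piles"
piles <- cutree(hclust(d), k = 2)
piles
```

Each element of `piles` is the pile (cluster) number assigned to the corresponding shell.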
The following article steps through a cluster analysis example in R using the banknote data set and the mclust package.
Load required packages:
if (!require(pacman)) install.packages("pacman")
library(pacman)
# patchwork is called below via patchwork::, so make sure it is installed too
p_load(
  tidyverse, gt, mclust, ggpubr, factoextra, patchwork
)
The banknote data set contains six measurements made on 100 genuine and 100 counterfeit old-Swiss 1000-franc bank notes.
Importantly, the data set includes a variable called Status, which labels each banknote as either genuine or counterfeit:
banknote %>%
  arrange(Length) %>%
  head(5) %>%
  gt::gt() %>%
  gt::tab_options(
    column_labels.font.weight = "bold",
    table.width = pct(80)
  )
Status | Length | Left | Right | Bottom | Top | Diagonal |
---|---|---|---|---|---|---|
genuine | 213.8 | 129.8 | 129.5 | 8.4 | 11.1 | 140.9 |
genuine | 213.9 | 130.3 | 129.0 | 8.1 | 9.7 | 141.3 |
counterfeit | 213.9 | 130.7 | 130.5 | 8.7 | 11.5 | 137.8 |
genuine | 214.1 | 129.6 | 129.3 | 7.6 | 10.7 | 141.7 |
counterfeit | 214.2 | 130.0 | 130.2 | 11.0 | 11.2 | 139.5 |
Here is a summary of the banknote data:
banknote_names <- names(banknote %>% select(-Status))

box_plot <- function(df, yvar, fill_var, ...) {
  plot <- df %>%
    ggpubr::ggboxplot(
      y = yvar,
      fill = fill_var
    ) +
    scale_fill_manual(values = c("#E64B35", "white")) +
    theme(
      axis.ticks = element_blank(),
      axis.text.x = element_blank(),
      axis.title.x = element_blank()
    )
  return(plot)
}

plot_list <- purrr::map(banknote_names, ~ box_plot(banknote, yvar = ., fill_var = "Status"))

original_plot <- patchwork::wrap_plots(plot_list) +
  patchwork::plot_layout(guides = "collect") &
  theme(legend.position = "top")

original_plot
For certain measurements, such as the diagonal, there is a clear difference between genuine and counterfeit banknotes. However, for other measurements, like length, the difference is not as obvious. In a scenario where the status of the banknotes is unknown, cluster analysis can be employed to classify the banknotes as real or fake based on the six available measurements.
mclust is an R package for model-based clustering, classification, and density estimation based on finite Gaussian mixture modelling (Scrucca et al. 2016).
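As a rough illustration of what a Gaussian mixture model does, the sketch below simulates one-dimensional data from two normal distributions and asks `Mclust()` to recover them. The simulated data, the seed, and the variable names are assumptions made for this example only; they are not part of the banknote analysis.

```r
library(mclust)

set.seed(1)

# Simulated data (invented for illustration): two Gaussian components
x <- c(
  rnorm(100, mean = 0, sd = 1),
  rnorm(100, mean = 8, sd = 1)
)

# Fit a two-component Gaussian mixture
fit <- Mclust(x, G = 2, verbose = FALSE)

fit$parameters$mean  # estimated component means
fit$parameters$pro   # estimated mixing proportions
```

If the fit behaves as intended, the estimated means should sit close to the values used in the simulation, and each component should account for roughly half of the observations.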
Let’s use mclust to run cluster analysis on the banknote data set (with the Status variable removed).
First we create the model with the Mclust function:
note <- banknote %>%
  select(-Status)

mod1 <- mclust::Mclust(note, verbose = FALSE)
Next we look at the model selection:
factoextra::fviz_mclust(mod1, "BIC", palette = "npg")
As you can see, the winning model based on the Bayesian Information Criterion (BIC), which mclust defines so that higher values are better, is the VVE model with three clusters (red vertical dotted line). Thus, the Mclust function automatically performed the analysis based on three clusters.
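The selection can also be inspected numerically rather than read off the plot. The following is a small sketch, assuming the banknote data shipped with mclust; the object names `note` and `mod` are local to this example.

```r
library(mclust)

data("banknote", package = "mclust")

# Drop the known label so that only the six measurements are clustered
note <- banknote[, names(banknote) != "Status"]

mod <- Mclust(note, verbose = FALSE)

mod$modelName     # covariance structure of the selected model
mod$G             # number of clusters in the selected model
summary(mod$BIC)  # best-ranked models by BIC
```

`mod$modelName` and `mod$G` describe the single model that was kept, while `summary(mod$BIC)` lists the closest competitors, which helps judge how decisive the choice was.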
The cluster plot shows how each banknote was clustered:
factoextra::fviz_mclust(mod1, "classification", palette = "npg")
Let’s take a look at how the model performed based on the known banknote status:
table(banknote$Status, mod1$classification)
               1  2  3
  counterfeit 16  0 84
  genuine      2 98  0
The model performed well in identifying genuine banknotes by grouping the majority of them into cluster 2. However, the counterfeit notes were distributed across two clusters, with 16 notes in cluster 1 and 84 notes in cluster 3. This outcome is sub-optimal, given that we know that there are only two types of banknotes.
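A cross-tabulation like this can be condensed into a single agreement score. One option is the adjusted Rand index, available in mclust as `adjustedRandIndex()`; a value of 1 means the clustering matches the known labels exactly, and values near 0 mean agreement no better than chance. The sketch below refits the default model so that it runs on its own; `note`, `mod`, and `ari` are names chosen for this example.

```r
library(mclust)

data("banknote", package = "mclust")
note <- banknote[, names(banknote) != "Status"]

# Default fit: the number of clusters is chosen by BIC
mod <- Mclust(note, verbose = FALSE)

# Agreement between the known status and the cluster assignment
ari <- adjustedRandIndex(banknote$Status, mod$classification)
ari
```

Because the index does not depend on how clusters are numbered, it can be used to compare models with different numbers of clusters.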
It is important to stress that the model selected by BIC is not always the most useful one. With prior knowledge, the cluster analysis can be constrained to a specific number of clusters. In this case, we know that banknotes can only be genuine or fake, which justifies fitting a model with two clusters.
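To see how much BIC is given up by constraining the model, the BIC of the unconstrained fit can be compared with that of a fit restricted to two clusters. This is only a sketch; `fit_auto` and `fit_two` are names introduced here for illustration.

```r
library(mclust)

data("banknote", package = "mclust")
note <- banknote[, names(banknote) != "Status"]

# Unconstrained search over the default range of cluster numbers
fit_auto <- Mclust(note, verbose = FALSE)

# Search restricted to two clusters
fit_two <- Mclust(note, G = 2, verbose = FALSE)

# BIC of the best model found by each search (higher is better in mclust)
c(auto = fit_auto$bic, two = fit_two$bic)
```

A small gap between the two values suggests that the constrained model is a reasonable description of the data, even though it is not the top-ranked one.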
Let’s tweak our model parameters to force the banknotes to be grouped into two clusters and plot the results:
mod2 <- mclust::Mclust(note, G = 2, verbose = FALSE)

factoextra::fviz_mclust(mod2, "classification", palette = "npg")
Now let’s check how this model performs based on the known banknote status:
table(banknote$Status, mod2$classification)
                1   2
  counterfeit 100   0
  genuine       1  99
Incredibly, the two-cluster model has almost perfectly recovered the original known status of the banknotes, with only one incorrect classification!
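The misclassified note can be located with `classError()` from mclust, which matches clusters to the known classes and reports the error rate along with the indices of the mismatched observations. The sketch below refits the two-cluster model so that it is self-contained; `err` is a name chosen for this example.

```r
library(mclust)

data("banknote", package = "mclust")
note <- banknote[, names(banknote) != "Status"]

mod2 <- Mclust(note, G = 2, verbose = FALSE)

# Compare the cluster assignment with the known status
err <- classError(mod2$classification, banknote$Status)

err$errorRate      # proportion of notes placed in the "wrong" cluster
err$misclassified  # row indices of those notes

# How uncertain was the model about those notes?
mod2$uncertainty[err$misclassified]
```

Looking at the uncertainty for the mismatched rows shows whether the model was confidently wrong or whether the note sat near the boundary between the two clusters.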
We can double-check the results by plotting the data grouped by the model’s clusters and comparing it to the original data:
note_with_cluster <- cbind(Cluster = mod2$classification, note) %>%
  as_tibble() %>%
  mutate(Cluster = ifelse(Cluster == 2, "genuine", "counterfeit"))

plot_list_cluster <- purrr::map(banknote_names, ~ box_plot(note_with_cluster, yvar = ., fill_var = "Cluster"))

model_plot <- patchwork::wrap_plots(plot_list_cluster) +
  patchwork::plot_annotation(title = "Model Data") +
  patchwork::plot_layout(guides = "collect") &
  theme(legend.position = "top")

model_plot

original_plot +
  patchwork::plot_annotation(title = "Original Data")
As expected, the model data is almost identical to the original!
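The relabelling step above hard-codes cluster 2 as "genuine", which works for this fit but relies on the order in which the clusters happen to be numbered. When the true labels are available for checking, a slightly more defensive sketch is to name each cluster after the status that dominates it. The names `tab`, `label_map`, and `predicted` are introduced here for illustration.

```r
library(mclust)

data("banknote", package = "mclust")
note <- banknote[, names(banknote) != "Status"]

mod2 <- Mclust(note, G = 2, verbose = FALSE)

# Rows are clusters, columns are known statuses
tab <- table(mod2$classification, banknote$Status)

# For each cluster, pick the status with the most members
label_map <- colnames(tab)[apply(tab, 1, which.max)]
names(label_map) <- rownames(tab)

# Translate cluster numbers into status labels
predicted <- unname(label_map[as.character(mod2$classification)])
table(predicted)
```

This only helps when validating against known labels; with truly unlabelled data, the clusters would have to be interpreted from their measurements instead.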
Cluster analysis is a powerful technique for uncovering insights from data by grouping similar observations. The banknote data set serves as a practical example of how cluster analysis can be applied to distinguish between genuine and counterfeit notes. The insights gained from this analysis demonstrate the potential of cluster analysis in real-world applications.
When performing cluster analysis, it is important to consider models with different numbers of clusters, not just the one that BIC ranks best. Additionally, prior knowledge can be valuable when deciding how many clusters to fit.