Learn clusters from data
KMeans is an object that facilitates learning clusters from data. Given a collection of points and a user-specified number of clusters, KMeans partitions those points into clusters, assigning a discrete membership to each one. To do this, the algorithm iteratively calculates the positions of a set of centroids: 1 for each cluster. The centroid is an indication of where a mass of points exists, ideally positioning itself at the centre of this mass. Points are then spatially related to these centroids, and this defines which cluster they belong to.
In the widget below you can select from several different datasets. If you’re on a slower computer or phone then selecting one of the (Small) or (Tiny) datasets will be more friendly to your processor. You can specify the number of clusters that you want the algorithm to learn in the number input. Each cluster centroid will be displayed as a large circle overlayed on the data and assigned a unique colour. Each point in the dataset will be clustered and belong to one of these centroids denoted by its matching colour. We encourage you to experiment with a variety of the datasets that are offered and to see how changing the number of clusters effects the result.
KMeans wants to choose centroids that minimise a measurement called inertia. Inertia can be thought of as an indication for how well the data is clustered around the centroids. It is derived by measuring the distance between each data point and the cluster centroid it belongs to, squaring this distance and summing each of these calculations for every point in each cluster. In theory, this means a “good” clustering will have clusters in which all of the points that belong to it are as close as possible to the centroid and far away from the other centroids. If you have an appetite for more low-level information, then this scikit-learn article is a good place to investigate further. There are two key messages that should be emphasised here though:
Inertia makes the assumption that clusters are convex and isotropic. In short, if the distribution of the data is not spherical in nature then the cluster assignments it makes may not be meaningful. If you experimented with the interactive widget above then you may have already observed this problem manifesting with the small or tiny datasets and perhaps even more prominently with the synthetic line.
Inertia is not a normalised metric and exists on a scale of 0 to infinity. The only assumption we can make is that lower inertia values are better and 0 is ideal. In high-dimensional spaces, the curse of dimensionality comes into play and the Euclidian distance values can become inflated, thus effecting the value of the inertia. If you are dealing with a high number of dimensions and KMeans is not producing a useful clustering, then preprocessing the data with a dimension reduction algorithm such as UMAP, MDS or PCA can be valuable.
SciKit Learn Reference on KMeans
A good explanation of the general algorithm.https://scikit-learn.org/stable/modules/clustering.html#k-means