K-means clustering is a popular unsupervised machine learning algorithm that is used to group similar data points together. The algorithm works by iteratively partitioning data points into K clusters based on their similarity, where K is a pre-defined number of clusters that the algorithm aims to create.
Introduction to Clustering
Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.Many clustering algorithms are available in Scikit-Learn and elsewhere, but perhaps the simplest to understand is an algorithm known as k-means clustering, which is implemented in sklearn.cluster.KMeans .
The *k-*means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:
The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
Each point is closer to its own cluster center than to other cluster centers.
Those two assumptions are the basis of the _k_means model
K-means Algorithm
Expectation–maximization (E–M) is a powerful algorithm that comes up in a variety of contexts within data science. _k_-means is a particularly simple and easy-to-understand application of the algorithm.
In short, the expectation–maximization approach here consists of the following procedure:
Guess some cluster centers
Repeat until converged
E-Step: assign points to the nearest cluster center
M-Step: set the cluster centers to the mean
Here the "E-step" or "Expectation step" is so-named because it involves updating our expectation of which cluster each point belongs to.
The "M-step" or "Maximization step" is so-named because it involves maximizing some fitness function that defines the location of the cluster centers — in this case, that maximization is accomplished by taking a simple mean of the data in each cluster.
The literature about this algorithm is vast, but can be summarized as follows: under typical circumstances, each repetition of the E-step and M-step will always result in a better estimate of the cluster characteristics.
Steps followed in K-means clustering
Here are the basic steps involved in K-means clustering:
Initialize K centroids: The algorithm begins by randomly selecting K data points to serve as the initial centroids of the K clusters.
Assign data points to clusters: Each data point is then assigned to the cluster whose centroid is closest to it. This is done using a distance metric, typically the Euclidean distance.
Update centroids: After all data points have been assigned to a cluster, the centroid of each cluster is updated to be the mean of all the data points assigned to that cluster.
Repeat steps 2 and 3 until convergence: Steps 2 and 3 are repeated until the centroids no longer change, or until some other stopping criterion is met.
Output final clusters: Once the algorithm converges, the final clusters are formed by assigning each data point to the cluster whose centroid is closest to it.
Caveats of expectation–maximization
There are a few issues to be aware of when using the expectation–maximization algorithm.
The globally optimal result may not be achieved: First, although the E–M procedure is guaranteed to improve the result in each step, there is no assurance that it will lead to the global best solution.
The number of clusters must be selected beforehand: Another common challenge with _k_-means is that you must tell it how many clusters you expect: it cannot learn the number of clusters from the data.
Selecting the number of clusters with silhouette analysis on KMeans clustering
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].
Silhouette coefficients (as these values are referred to as) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster.
Calculation of Silhouette score
Silhouette score is used to evaluate the quality of clusters created using clustering algorithms such as K-Means in terms of how well samples are clustered with other samples that are similar to each other. The Silhouette score is calculated for each sample of different clusters. To calculate the Silhouette score for each observation/data point, the following distances need to be found out for each observations belonging to all the clusters:
Mean distance between the observation and all other data points in the same cluster. This distance can also be called a mean intra-cluster distance. The mean distance is denoted by a.
Mean distance between the observation and all other data points of the next nearest cluster. This distance can also be called a mean nearest-cluster distance. The mean distance is denoted by b.
The Silhouette Coefficient for a sample is
????=(????−????)????????????(????,????)
Implementation of K-means
Loading...
Conclusion
K-means clustering is a widely used unsupervised machine learning algorithm that groups similar data points together based on their similarity. It involves iteratively partitioning data points into K clusters, where K is a pre-defined number of clusters. However, there are some caveats to be aware of, such as the need to select the number of clusters beforehand, and the possibility of not achieving the globally optimal result. Silhouette analysis can be used to assess the separation distance between resulting clusters and determine the optimal number of clusters.
Key Takeaways
Clustering algorithms are used to group similar data points together by learning from the properties of data an optimal division or discrete labeling of groups of points.
K-means clustering is a popular unsupervised machine learning algorithm for partitioning data points into K clusters based on their similarity, where K is a pre-defined number of clusters that the algorithm aims to create.
The K-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like.
The algorithm starts with the initial K centroids, assigns data points to clusters based on the closest centroid, updates the centroid of each cluster to be the mean of all the data points assigned to that cluster, and repeats the process until the centroids no longer change or a stopping criterion is met.
Expectation–maximization (E–M) is a powerful algorithm that comes up in a variety of contexts within data science, and K-means is a particularly simple and easy-to-understand application of the algorithm.
Two issues to be aware of when using the expectation–maximization algorithm are that the globally optimal result may not be achieved, and the number of clusters must be selected beforehand.
Silhouette analysis can be used to visually assess the parameters like the number of clusters. It provides a measure of how close each point in one cluster is to points in the neighboring clusters and helps identify negative values that indicate that those samples might have been assigned to the wrong cluster.
Quiz
What is k-means clustering?
A supervised machine learning algorithm
An unsupervised machine learning algorithm used to group similar data points together
A data visualization technique
A regression algorithm
Answer: b. An unsupervised machine learning algorithm used to group similar data points together
What is the expectation-maximization approach in k-means clustering?
It involves setting the cluster centers to the mean
It involves assigning points to the nearest cluster center
It involves updating our expectation of which cluster each point belongs to
It involves randomly selecting K data points to serve as the initial centroids of the K clusters
Answer: c. It involves updating our expectation of which cluster each point belongs to
What is the stopping criterion for k-means clustering?
When the algorithm has gone through all the data points
When the centroids no longer change
When the data is perfectly clustered
When the algorithm has reached the maximum number of iterations
Answer: b. When the centroids no longer change
What is the range of the silhouette coefficients used in silhouette analysis?
[-∞, ∞]
[0, ∞]
[-1, 1]
[0, 1]
Answer: c. [-1, 1]
Can k-means clustering learn the number of clusters from the data?
Yes
No
Answer: b. No. The number of clusters must be selected beforehand.