K-Means
K means is clustering algorithm where k denotes the number of clusters which we define before hand. The algorithm works as follows:
- Randomly initialize k cluster centers (select k points).
- Assign each point to the closest cluster center.
- Recompute the cluster centers by taking the mean of all the points in the new cluster.
- Repeat steps 2 and 3 until convergence (When we don’t see a lot of points change clusters).
The above solution is a local optimum and not a global optimum. Mathamatically best cluster centers are the ones that minimize the following objective function:
\[\underset{c}{min} \sum_{i=1}^{k}\sum_{x \in c_i}^{n} \left \| x - \mu_i \right \|^{2}\]Where \(c\) is the set of clusters, \(c_i\) is the \(i^{th}\) cluster, \(\mu_i\) is the mean of the \(i^{th}\) cluster and \(n\) is the number of points in the \(i^{th}\) cluster. This is a NP hard problem and hence we use the above algorithm to find a local optimum.
Generally we run the above algorithm multiple times with different initializations and choose the one with the lowest objective function value.
Important Points
-
K means is sensitive to the initialization of the cluster centers hence we have kmeans++ in which we first pick a centroid and then the points are assigned a probability of begin selected as centroid based on the distance for the current point.
-
K means is sensitive to outliers as the mean is sensitive to outliers to address this we have k medoids in which we use the median instead of the mean.
-
K means is sensitive to the number of clusters we have in the data. To address this we have silhouette score which is a measure of how similar a point is to its own cluster compared to other clusters. The silhouette score is calculated as follows: \(s = \frac{b - a}{max(a,b)}\) Where \(a\) is the mean distance between a point and all other points in the same cluster and \(b\) is the mean distance between a point and all other points in the next nearest cluster. The silhouette score is between -1 and 1. A score of 1 means that the point is very similar to its own cluster and very dissimilar to other clusters. A score of 0 means that the point is on the boundary of two clusters. A score of -1 means that the point is assigned to the wrong cluster.
-
If the densities of the clusters are different then k means will not work well.
-
Non-globular clusters are clusters that are not spherical in shape. K means will not work well on non globular clusters. DBSCAN is a good algorithm for non globular clusters.
Interveiw Questions
-
What are the cases where K-Means clustering fail to give good results?
- When the clusters are not globular in shape.
- When the clusters have different densities.
- Data with outliers.
- When the number of clusters is not known.
-
How to find the optimal number of clusters in K-Means?
1. Elbow method: Plot the number of clusters vs the objective function value. The point where the curve starts to flatten is the optimal number of clusters. 2. Silhouette score: Plot the silhouette score vs the number of clusters. The point where the curve is maximum is the optimal number of clusters.
-
How to initialize the cluster centers in K-Means?
-
Random initialization : We randomly select k points as the cluster centers.
-
Kmeans++ initialization : We first select a point randomly and then we select the next point with a probability proportional to the distance from the current point. This is done until we have k points.
-
K medoids initialization : Insted of using the mean as the cluster center we use the median as the cluster center. This is done to make the algorithm robust to outliers.
-