K-Means

K means is clustering algorithm where k denotes the number of clusters which we define before hand. The algorithm works as follows:

  1. Randomly initialize k cluster centers (select k points).
  2. Assign each point to the closest cluster center.
  3. Recompute the cluster centers by taking the mean of all the points in the new cluster.
  4. Repeat steps 2 and 3 until convergence (When we don’t see a lot of points change clusters).

The above solution is a local optimum and not a global optimum. Mathamatically best cluster centers are the ones that minimize the following objective function:

\[\underset{c}{min} \sum_{i=1}^{k}\sum_{x \in c_i}^{n} \left \| x - \mu_i \right \|^{2}\]

Where \(c\) is the set of clusters, \(c_i\) is the \(i^{th}\) cluster, \(\mu_i\) is the mean of the \(i^{th}\) cluster and \(n\) is the number of points in the \(i^{th}\) cluster. This is a NP hard problem and hence we use the above algorithm to find a local optimum.

Generally we run the above algorithm multiple times with different initializations and choose the one with the lowest objective function value.

Important Points

  1. K means is sensitive to the initialization of the cluster centers hence we have kmeans++ in which we first pick a centroid and then the points are assigned a probability of begin selected as centroid based on the distance for the current point.

  2. K means is sensitive to outliers as the mean is sensitive to outliers to address this we have k medoids in which we use the median instead of the mean.

  3. K means is sensitive to the number of clusters we have in the data. To address this we have silhouette score which is a measure of how similar a point is to its own cluster compared to other clusters. The silhouette score is calculated as follows: \(s = \frac{b - a}{max(a,b)}\) Where \(a\) is the mean distance between a point and all other points in the same cluster and \(b\) is the mean distance between a point and all other points in the next nearest cluster. The silhouette score is between -1 and 1. A score of 1 means that the point is very similar to its own cluster and very dissimilar to other clusters. A score of 0 means that the point is on the boundary of two clusters. A score of -1 means that the point is assigned to the wrong cluster.

  4. If the densities of the clusters are different then k means will not work well.

  5. Non-globular clusters are clusters that are not spherical in shape. K means will not work well on non globular clusters. DBSCAN is a good algorithm for non globular clusters.

Interveiw Questions

  1. What are the cases where K-Means clustering fail to give good results?

    1. When the clusters are not globular in shape.
    2. When the clusters have different densities.
    3. Data with outliers.
    4. When the number of clusters is not known.
  2. How to find the optimal number of clusters in K-Means?

     1. Elbow method: Plot the number of clusters vs the objective function value. The point where the curve starts to flatten is the optimal number of clusters.
     2. Silhouette score: Plot the silhouette score vs the number of clusters. The point where the curve is maximum is the optimal number of clusters.
    
  3. How to initialize the cluster centers in K-Means?

    1. Random initialization : We randomly select k points as the cluster centers.

    2. Kmeans++ initialization : We first select a point randomly and then we select the next point with a probability proportional to the distance from the current point. This is done until we have k points.

    3. K medoids initialization : Insted of using the mean as the cluster center we use the median as the cluster center. This is done to make the algorithm robust to outliers.

results matching ""

    No results matching ""