(You are required to install the newest Java Runtime Environment for executing the applet)

The k-means algorithm is designed for solving the clustering problem among chaos data in the vector space. I introduce this algorithm on purpose for preliminary study because of its simplicity and efficiency.

In the beginning, it is worth to discuss that why do we need to perform clustering? For instance, imagine we may encounter a situation that we want to order our mess, scattered, and numerous books into an empty bookcase. However, this work is pretty tough since we cannot be informed what categories of these books we have; so that we even don’t know how to classify them appropriately.

To handle such problem, as an intelligent mortal, one may picks up each book and rearrange them iteratively till one is conscious of what categories should be extracted, and then, makes all books classified explicitly.

Here we should distinguish between the implication of “classification” and “clustering”. If we already know the “kinds” these books may belong to, and the grids of bookcase are pasted on corresponding labels, this problem comes up to “classification”. What we have to do is merely judge each book and categorize it by the labels. In the “clustering” problem, things become a little bit complicated since we don’t have such prior labels. The clustering algorithm has to deal with unknown data, and cluster them automatically.

k-means can deal with such clustering problem efficiently, moreover, the algorithm is simple enough to be comprehended. But it has drawbacks as well, or rather, it performs awfully in some particular conditions. It only satisfies local optimum clustering, and, is not robust. The meaning of ‘not robust’ intends that results from identical dataset and number of cluster are not always the same. In addition, the performance of k-means is highly depend on the number of k, the number of clusters we choose, and it is also regarded as a heuristic process to discover the proper k.

The algorithm is also termed as Lloyd’s Algorithm. I am interested to the source, therefore search the internet to find the original paper. Google and Wikipedia are omnipotent. :)  After a glance at the paper, I found the algorithm is at first designed for PCM(pulse-code modulation) problem in signal process, 1982. What it desires is to find the local minimun value of noise power in each subset of signal. Details are omitted here since yet I am not striving toward it. Surprisingly, the diagram named Voronoi tessellation can be drawn by this algorithm. An applet demo to simulate 2-dimension clustering problem is implemented for performing.

Figure 1. Voronoi tessellation, click it to see details.

The formula of this algorithm is fairly simple, as below:

where S is the set of x which represents the data, and k is the number of centroids. Gi is the clustering groups with its centroid c. As the formula implies, we can find a minimun N(S) when all terms ||xj - ci|| are minimun.

The abstract of k-means algorithm:

  1. Presume S represents the number of data vectors. Decide the number k of centroids and randomly distribute centroids in vector space or make the existing data the centroids. Intuitively, it is supposed to be S > k
  2. Determine the distance of each datum xj to the centroids ci using Euclidean distance, and designate datum xj to belong a centroid ci with the minimum distance
  3. Calculate the new centroid coordinate for each clustering group Gi
  4. Iterate step 2 to 3 untill no more datum  xj change the centroid ci

Tips:

  • I think this demo is quite simple. All you have to do is that either ramdonize them and start or draw them yourself, then, start.
  • You can try some critcal situations such like in the beginning, three centroids are nearby each other and located at the corner, which would lead this algorithm failed.

Bibliography: