Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Number of Clusters

Choosing the number of clusters is often quite a tricky exercise. Sometimes it is simply a matter of trying a value and seeing what results. Other times we have heuristics that help us decide. Rattle provides an iterative approach. There is no definitive statistical answer to this issue.

In deciding on a size for a robust cluster we need to note that the larger the number of clusters relative to the size of the sample, the smaller our clusters will be. Perhaps there is a cluster size below which we don't want to go.
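The trade-off between the number of clusters and their size is simple arithmetic, which a minimal Python sketch can make concrete. The function names, the sample size of 2000 and the minimum size of 30 are illustrative assumptions, not anything provided by Rattle.

```python
from collections import Counter

def average_cluster_size(n_obs, k):
    """Average number of observations per cluster for n_obs observations and k clusters."""
    return n_obs / k

def max_clusters_for_min_size(n_obs, min_size):
    """Largest k for which the average cluster size is still at least min_size."""
    return n_obs // min_size

def smallest_cluster(labels):
    """Size of the smallest cluster, given one cluster label per observation."""
    return min(Counter(labels).values())

if __name__ == "__main__":
    n_obs = 2000   # hypothetical sample size
    min_size = 30  # hypothetical smallest cluster we are willing to accept
    for k in (5, 10, 50, 100):
        print(k, average_cluster_size(n_obs, k))
    print(max_clusters_for_min_size(n_obs, min_size))
```

The average is only an upper bound on the smallest cluster: individual clusters can be much smaller than the average, so it is worth checking the actual sizes (as smallest_cluster does) once a clustering has been built.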

Different clustering algorithms (and even different random seeds) result in different clusters, and how much those clusters differ is a measure of cluster stability.
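One way to put a number on how much two clusterings differ is the Rand index: the fraction of pairs of observations on which the two clusterings agree (both place the pair together, or both place it apart). The text does not name a particular measure, so the Rand index is an assumption chosen here for illustration, and the function name is hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two clusterings agree.

    labels_a and labels_b hold one cluster label per observation, in the
    same order. The label values themselves need not match between the
    two clusterings, only the grouping matters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same observations")
    if len(labels_a) < 2:
        raise ValueError("need at least two observations")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

if __name__ == "__main__":
    run_1 = [0, 0, 1, 1]  # hypothetical labels from one random seed
    run_2 = [1, 1, 0, 0]  # same grouping, different label names
    print(rand_index(run_1, run_2))
```

A value of 1 means the two runs produced the same grouping. Values well below 1 across seeds or algorithms suggest the clusters are not stable.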

One approach to identifying a good number of clusters is to iterate over a range of cluster numbers and observe, for each, the sum of the within-cluster sums of squares. Rattle supports this with the Iterate Clusters option (see Figure 9.1), which also always generates a plot (see Figure 9.2). A heuristic is to choose the number of clusters where we see the largest drop in the sum of the within-cluster sums of squares. In Figure 9.2 we might choose 12, 17 or perhaps even 26.
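The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Rattle's implementation: it uses a plain Lloyd-style k-means written from scratch, the function names are hypothetical, and the demo data is synthetic.

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_centers(points, k, seed=0, iters=100):
    """Plain Lloyd-style k-means; returns the list of k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    for _ in range(iters):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sqdist(p, centers[i]))
            groups[j].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def total_withinss(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sqdist(p, c) for c in centers) for p in points)

def withinss_by_k(points, max_k, seed=0):
    """Total within-cluster sum of squares for k = 1 .. max_k."""
    return {
        k: total_withinss(points, kmeans_centers(points, k, seed))
        for k in range(1, max_k + 1)
    }

def largest_drop(wss):
    """The k whose total within sum of squares fell most relative to the previous k."""
    ks = sorted(wss)
    return max(zip(ks, ks[1:]), key=lambda pair: wss[pair[0]] - wss[pair[1]])[1]

def make_demo_points(seed=42):
    """Synthetic data: three 2-D blobs of 30 points each (an assumption for the demo)."""
    rng = random.Random(seed)
    blobs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    return [
        (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in blobs
        for _ in range(30)
    ]

if __name__ == "__main__":
    wss = withinss_by_k(make_demo_points(), max_k=6)
    for k in sorted(wss):
        print(k, round(wss[k], 1))
    print("largest drop at k =", largest_drop(wss))
```

Because k-means depends on its random starting centers, the curve can change from seed to seed, so the largest drop is best read as a suggestion to examine rather than a definitive answer.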

Figure 9.1: KMeans Iteration Interface
Image rattle_kmeans_iterate

Figure 9.2: KMeans Iteration Plot
Image rattle_kmeans_iterate_plot

Copyright © 2004-2010 Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010