Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Histogram


Image rattle-audit-explore-dist-hist-income
A histogram provides a quick and useful graphical view of the spread of the data. A histogram plot in Rattle includes three components. The first is the set of coloured vertical bars. The continuous data in the example here (Distribution of Income) has been partitioned into ranges, and the frequency of each range is displayed as a bar. R automatically chooses both the partitioning and the x-axis labels, here showing x-axis points at 0, 100,000 (written in scientific notation as 1e+05, which means 1 x 10^5, or 100,000), and so on. We can see that the most frequent range of values is the $0 to $50,000 partition. However, each partition spans quite a large range ($50,000).
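The partitioning step can be inspected directly in R. The following is a minimal sketch using base R's hist() on simulated data: the `income` vector and its log-normal parameters are assumptions made for illustration (this is not the audit dataset), and Rattle's own plotting code may choose its breaks differently.

```r
# Simulated, right-skewed "income" values (an assumption; not the audit data).
set.seed(42)
income <- rlnorm(2000, meanlog = 11, sdlog = 0.7)

# Let R choose the partitioning; plot = FALSE returns the bins without drawing.
h <- hist(income, plot = FALSE)

# One row per bin: lower edge, upper edge, and frequency (the bar height).
bins <- data.frame(lower = head(h$breaks, -1),
                   upper = tail(h$breaks, -1),
                   count = h$counts)
print(bins)

# The most frequent partition, and the width shared by the bins.
print(bins[which.max(bins$count), ])
bin_width <- diff(h$breaks)[1]
print(bin_width)

# An axis label such as 1e+05 is scientific notation for 1 x 10^5.
print(format(1e+05, big.mark = ",", scientific = FALSE))
```

Printing the breaks makes it easy to see how wide each bar is before drawing any conclusion from the tallest one.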

The plot also includes a line showing the so-called density estimate, which is a more accurate display of the actual (or at least the estimated true) distribution of the data (the values of Income). It allows us to see that, rather than values occurring frequently across the whole $0 to $50,000 range, there is a much narrower range (perhaps $30,000 to $50,000) that occurs very frequently.
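To illustrate how a density estimate can reveal structure that wide bars hide, here is a small sketch that overlays base R's density() on a histogram. The data are again simulated (an assumption, not the audit dataset), and the Gaussian kernel with default bandwidth is simply R's default rather than a setting taken from Rattle.

```r
# Simulated, right-skewed "income" values (an assumption; not the audit data).
set.seed(42)
income <- rlnorm(2000, meanlog = 11, sdlog = 0.7)

# Draw the histogram on the density scale so bars and curve share a y-axis.
h <- hist(income, freq = FALSE, col = "lightblue",
          main = "Distribution of Income (simulated)", xlab = "Income")

# Kernel density estimate (Gaussian kernel, default bandwidth).
d <- density(income)
lines(d, lwd = 2)

# Compare where the estimated density peaks with the range of the tallest bar.
peak <- d$x[which.max(d$y)]
tallest <- which.max(h$counts)
cat("Density peak near:", round(peak), "\n")
cat("Tallest bar spans:", h$breaks[tallest], "to", h$breaks[tallest + 1], "\n")
```

The peak of the curve is a single location, whereas the tallest bar only says that many values fall somewhere within its whole range.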

The third element of the plot is the so-called rug along the bottom of the plot. The rug is a one-dimensional plot of the data along the number line. It is useful for seeing exactly where the data points lie. For large collections of data with a relatively even spread of values the rug ends up being quite black, as is the case here up to about $250,000. Above about $350,000 there is only a scattering of entities with such values. In fact, from the Summary option, using the Describe check box, we can see that the highest values are actually $361,092.60, $380,018.10, $391,436.70, $404,420.70, and $421,362.70.
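A rug can be added to any base R plot with rug(), and the largest values can be listed directly rather than read off the plot. The sketch below uses simulated data (an assumption, not the audit dataset), so the values it prints will not match those reported by Rattle's Summary option.

```r
# Simulated, right-skewed "income" values (an assumption; not the audit data).
set.seed(42)
income <- rlnorm(2000, meanlog = 11, sdlog = 0.7)

hist(income, freq = FALSE,
     main = "Income with rug (simulated)", xlab = "Income")
lines(density(income))

# One tick per observation along the x-axis; dense regions appear solid black,
# while isolated ticks mark the sparse upper tail.
rug(income)

# The five largest values, listed in increasing order.
top5 <- tail(sort(income), 5)
print(format(round(top5, 2), big.mark = ",", nsmall = 2))
```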

Image rattle-audit-explore-dist-hist-hours
This second plot, showing the distribution of the variable Hours, illustrates a more normal distribution. Roughly speaking, this is a distribution with a peak in the middle and frequencies diminishing on both sides. The density plot shows that the distribution is only approximately normal, and the rug plot indicates that the data take on very distinct values (suggesting that they are integers, as is confirmed by viewing the textual summaries in the Summary option).
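Both observations, approximate normality and whole-number values, can be checked with a few lines of base R. This sketch simulates an Hours-like variable as rounded normal draws (an assumption, not the audit dataset) and overlays a fitted normal curve for comparison; the overlay is an illustrative addition and not something the Rattle plot itself draws.

```r
# Simulated weekly hours: rounded normal values (an assumption; not the audit data).
set.seed(42)
hours <- pmax(1, round(rnorm(2000, mean = 40, sd = 12)))

# Are all values whole numbers? Distinct, evenly spaced rug ticks suggest so.
all_integer <- all(hours == round(hours))
print(all_integer)
print(length(unique(hours)))

# Histogram, density estimate, and rug, as in the plot described above.
hist(hours, freq = FALSE, main = "Distribution of Hours (simulated)", xlab = "Hours")
lines(density(hours), lwd = 2)
rug(hours)

# Dashed reference: a normal curve with the sample mean and standard deviation.
curve(dnorm(x, mean = mean(hours), sd = sd(hours)), add = TRUE, lty = 2)
```

The closer the solid density line follows the dashed reference curve, the more nearly normal the distribution is.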

Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010