DATA MINING
Desktop Survival Guide
by Graham Williams

# Exploring Distributions

It is usually a good idea to review the distributions of the values of each of the variables in your dataset. The Distributions option allows you to visually explore the distributions for specific variables.

Using graphical tools to visually investigate the data's characteristics can help our understanding the data, in error correction, and in variable selection and variable transformation.

Graphical presentations are more effective for most people, and Rattle provides a graphical summary of the distribution of the data with the Distribution option of the Explore tab.

Visualising data has been an area of study within statistics for many years. A vast array of tools are available within R for presenting data visually and the topic is covered in detail in books in their own right, including () and Tufte.

By choosing the Distributions radio button you can select specific variables of interest, and display various distribution plots. Selecting many variables will lead to many plots being displayed, and so it may be useful to display multiple plots per page (i.e., per window) by setting the appropriate value in the interface. By default, four plots will be displayed per page or window, but you can change this to anywhere from 1 plot per page to 9 plots per page. If we are actually generating fewer plots than our selected number of plots per page, then we will see that the plots will fill up the window rather than leaving empty places. Thus the Plots per Page is really a maximum number of plots per page.

The Annotate check box can be checked to obtain additional annotations on the plots. For a Box Plot, for example, this will add the actual numeric values for the median and quartiles.

Here we illustrate a window with the default four plots. Four plots per page are useful, for example, to display each of the four different types of plots for a single continuous variable. Clockwise, they are the Box Plot, the Histogram, a Cumulative Function Plot, and a Benford's Law Plot. Because we have identified a target variable the plots include the distributions for each subset of entities associated with each value of the target variable, wherever this makes sense to do so (e.g., not for the histogram).

The box and whiskers plot identifies the median and mean of the variable, the spread from the first quartile to the third, and indicates the outliers. The histogram splits the range of values of the variable into segments and shows the number of entities in each segment. The cumulative plot shows the percentage of entities below any particular value of the variable. And the Benford's Law plot compares the distribution of the first digit of the numbers against that which is expected according to Benford's Law. Each of the plots shown here is explained in more detail in the following sections.

For categoric variables three types of plots are supported, more as alternatives than adding extra information: the Bar Plot, the Dot Plot, and the Mosaic Plot. Each plot shows, in its own way, the number of entities that have a particular value for the chosen variable. The bar and dot plots sort the categoric values from the most frequent to the least frequent value. In the example here, we can see that the value Private of the variable Employment is the most frequent, occurring over 1,400 times in this dataset.

A bar plot uses vertical bars while the dot plot uses dots placed horizontally. The dot plot has more of a chance to list the actual values for the variable, whilst a bar plot will have trouble listing all of the values (as illustrated here). The mosaic plot is similar to the bar plot but each ``bar'' splits the values between the distinct values of the target variable (if one is chosen).

Each of the plots is explained in more detail in the following sections.

Subsections