Outlier Analysis

Rare, unusual, or just plain infrequent events are of interest in data mining in many contexts, including fraud detection in income tax, insurance, and online banking, as well as in marketing. We classify analyses that focus on the discovery of such data items as outlier analysis. Hawkins (1980) captures the concept of an outlier as:

an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

Outlier detection algorithms generally fall into one of four categories: distance-based, density-based, projection-based, and distribution-based methods.

A general approach to identifying outliers is to assume a known distribution for the data and to examine the deviation of each individual observation from that distribution. Such approaches are common in statistics (Barnett & Lewis, 1994), but they do not scale well to large datasets.
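
To make the idea concrete, the following R sketch assumes the data follow a normal distribution and flags observations more than three standard deviations from the mean; the threshold of 3 and the simulated data are illustrative choices rather than part of the original discussion.

  x <- c(rnorm(100), 8)            # 100 typical values plus one obvious outlier
  z <- (x - mean(x)) / sd(x)       # standardised scores under the assumed normal
  outliers <- which(abs(z) > 3)    # indices of suspect observations
  x[outliers]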

Distance-based methods are common in data mining, where the measure of an entity's outlierness is based on its distance to nearby entities. The number of nearby entities and the minimum distance are the two parameters (Knorr & Ng, 1998).

Density-based approaches originate with Breunig et al. (2000), who score each observation with a local outlier factor (LOF). See also Jin et al. (2001).

The early work on outliers was carried out from a statistical viewpoint, where outliers are data points that deviate significantly from the identified underlying distribution of the data. Hawkins (1980) is a good textbook on outliers, and Barnett & Lewis (1994) provide another overview of statistical approaches.

Distance-based approaches have been developed by Knorr & Ng (1998, 1999) and Ramaswamy et al. (2000). Such approaches usually explore some neighbourhood of each data point and do not rely on an assumed underlying distribution.

Knorr & Ng (1998) identify outliers by counting the number of neighbours within a specified radius of a data point. The radius $q$ and the threshold number of points $n$ are the only two parameters of the approach. The approach is simple but inadequate for data of uneven density, where $q$ and $n$ might need to vary across the dataset. Ramaswamy et al. (2000) take a similar approach, ranking data points by the sum of their distances to their $k$ nearest neighbours.
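
The test can be sketched in a few lines of R, as below: an observation is flagged when fewer than $n$ other observations lie within radius $q$ of it. The data and the parameter values are invented for the example.

  x <- rbind(matrix(rnorm(200), ncol = 2), c(5, 5))  # 100 points plus an outlier
  q <- 1.0                             # neighbourhood radius
  n <- 5                               # required number of neighbours
  d <- as.matrix(dist(x))              # pairwise Euclidean distances
  neighbours <- rowSums(d <= q) - 1    # neighbour counts, excluding the point itself
  outliers <- which(neighbours < n)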

Breunig et al. (1999) and then Breunig et al. (2000) introduce a density-based approach that scores data points with a local outlier factor (LOF). Jin et al. (2001) introduce a heuristic to more efficiently identify the top outliers using the LOF.
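
A minimal base R sketch of the LOF computation follows. It ignores ties in the $k$-distance for simplicity, and the data, the function name, and the choice of $k$ are illustrative; this is a sketch of the technique rather than a faithful reimplementation of the published algorithm.

  lof <- function(x, k = 5)
  {
    d  <- as.matrix(dist(x))                               # pairwise distances
    nn <- t(apply(d, 1, function(r) order(r)[2:(k + 1)]))  # k nearest neighbours
    kdist <- d[cbind(1:nrow(x), nn[, k])]                  # distance to kth neighbour
    # Local reachability density: inverse mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd <- sapply(1:nrow(x), function(p)
      1 / mean(pmax(kdist[nn[p, ]], d[p, nn[p, ]])))
    # LOF: average ratio of each neighbour's density to the point's own.
    sapply(1:nrow(x), function(p) mean(lrd[nn[p, ]]) / lrd[p])
  }

  x <- rbind(matrix(rnorm(200), ncol = 2), c(5, 5))
  scores <- lof(x, k = 5)
  order(scores, decreasing = TRUE)[1:3]                    # top three candidates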

Yamanishi et al. (2000) build mixture models as data becomes available and identify outliers as those data items causing the most perturbation to the model.
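
The flavour of the approach can be sketched in R by drastically simplifying the model: below, a single online Gaussian is maintained with exponential discounting, and each arriving value is scored by its surprise (negative log-likelihood) before the model is updated. The discounting rate r, the data, and the univariate model are invented for the illustration; the published method builds mixture models.

  r <- 0.05                        # discounting (forgetting) rate
  mu <- 0; v <- 1                  # initial mean and variance
  score <- function(x) -dnorm(x, mu, sqrt(v), log = TRUE)

  stream <- c(rnorm(100), 8)       # a data stream ending in an obvious outlier
  scores <- numeric(length(stream))
  for (i in seq_along(stream)) {
    scores[i] <- score(stream[i])                  # surprise under current model
    mu <- (1 - r) * mu + r * stream[i]             # discounted update of the mean
    v  <- (1 - r) * v  + r * (stream[i] - mu)^2    # and of the variance
  }
  which.max(scores)                # position of the most surprising item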

Aggarwal & Yu (2001) explore the issue of outliers in high-dimensional space, where data tends to be sparse and consequently all data points tend to be equidistant from one another (Beyer et al., 1999). They suggest an algorithm in which the high-dimensional space is projected to lower-dimensional subspaces having unusually low density, and propose an evolutionary algorithm to generate the candidate subspaces in which outliers are to be searched for.
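
The sparsity idea can be sketched in R by replacing the evolutionary search with a brute-force scan of all two-attribute projections, as below. Each attribute is cut into phi equi-depth ranges, and each cell's count is compared with what uniform occupancy would predict; strongly negative coefficients mark unusually sparse cells whose occupants are candidate outliers. The function name, phi, and the restriction to two dimensions are all invented for the illustration.

  sparsity <- function(x, phi = 5)
  {
    N <- nrow(x); f <- 1 / phi                # expected fraction per range
    # Equi-depth discretisation: phi equal-count ranges per attribute.
    disc <- apply(x, 2, function(col)
      cut(col, quantile(col, 0:phi / phi), include.lowest = TRUE, labels = FALSE))
    best <- NULL
    for (i in 1:(ncol(x) - 1)) for (j in (i + 1):ncol(x)) {
      counts <- table(factor(disc[, i], 1:phi), factor(disc[, j], 1:phi))
      # Standardised deviation of each cell count from uniform occupancy.
      s <- (counts - N * f^2) / sqrt(N * f^2 * (1 - f^2))
      if (is.null(best) || min(s) < best$coefficient)
        best <- list(coefficient = min(s), attributes = c(i, j))
    }
    best                                      # sparsest cell over all projections
  }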

SmartSifter
