DATA MINING
Desktop Survival Guide
by Graham Williams
R for the Data Miner
Subsections
R: The Language
Obtaining and Installing R
Installing on Debian GNU/Linux
Installing on MS/Windows
Install MS/Windows Version Under GNU/Linux
Interacting With R
Basic Command Line
Emacs and ESS
Windows, Icons, Mouse, Pointer (WIMP)
Evaluation
Help
Assignment
Libraries and Packages
Searching for Objects
Package Management
Information About a Package
Testing Package Availability
Packages and Namespaces
Basic Programming in R
Principles
Folders and Files
Flow Control
Functions
Apply
Methods
Objects
System
Running System Commands
System Parameters
Misc
Internet
Memory Management
Memory Usage
Garbage Collection
Errors
Frivolous
Sudoku
Further Resources
Using R
Specific Purposes
Survey Analysis
Data
Data Types
Numbers
Strings
Building Strings
Splitting Strings
Substitution
Trim Whitespace
Evaluating Strings
Logical
Dates and Times
Space
Data Structures
Vectors
Arrays
Lists
Sets
Matrices
Data Frames
Accessing Columns
Removing Columns
General Manipulation
Factors
Elements
Rows and Columns
Finding Index of Elements
Partitions
Head and Tail
Reverse a List
Sorting
Unique Values
Loading Data
Interactive Responses
Interactive Data Entry
Available Datasets
The Iris Dataset
CSV Data Used In The Book
The Wine Dataset
The Cardiac Arrhythmia Dataset
The Adult Survey Dataset
Foreign Formats
Stata Data
Conversions
Saving Data
Reading Direct from URL
Formatted Output
Automatically Generate Filenames
Reading a Large File
Using SQLite
ODBC Data
Database Connection
Excel
Access
Clipboard Data
Map Data
Other Data Formats
Fixed Width Data
Global Positioning System
Documenting a Dataset
Common Data Problems
Graphics in R
Basic Plot
Controlling Axes
Arrow Axes
Legends and Points
Tables Within Plots
Colour
Using GGPlot
Symbols
Multiple Plots
Other Graphic Elements
Maths in Labels
Making an Animation
Animated Mandelbrot
Adding a Logo to a Graphic
Graphics Devices Setup
Screen Devices
Multiple Devices
File Devices
Multiple Plots
Copy and Print Devices
Graphics Parameters
Plotting Region
Locating Points on a Plot
Scientific Notation and Plots
Understanding Data
Single Variable Overviews
Textual Summaries
Multiple Line Plots
Separate Line Plots
Pie Chart
Fan Plot
Stem and Leaf Plots
Histogram
Barplot
Trellis Histogram
Histogram Uneven Distribution
Density Plot
Basic Histogram
Basic Histogram with Density Curve
Practical Histogram
Multiple Variable Overviews
Pivot Tables
Scatterplot
Scatterplot with Marginal Histograms
Multi-Dimension Scatterplot
Correlation Plot
Colourful Correlations
Projection Pursuit
RADVIZ
Parallel Coordinates
Measuring Data Distributions
Textual Summaries
Boxplot
Multiple Boxplots
Boxplot by Class
Tuning a Boxplot
Boxplot From ggplot
Violin Plot
What Distribution
Labelling Outliers
Miscellaneous Plots
Line and Point Plots
Matrix Data
Multiple Plots
Aligned Plots
Probability Scale
Network Plot
Sunflower Plot
Stairs Plot
Graphing Means and Error Bars
Bar Charts With Segments
Bar Plot With Means
Multi-Line Title
Mathematics
Plots for Normality
Basic Bar Chart
Bar Chart Displays
Multiple Dot Plots
Alternative Multiple Dot Plots
3D Plot
Box and Whisker Plot
Box and Whisker Plot: With Means
Clustered Box Plot
Perspective Plots
Star Plot
Residuals Plot
Dates and Times
Simple Time Series
Multiple Time Series
Plot Time Series
Plot Time Series with Axis Labels
Grouping Time Series for Box Plot
Using gGobi
Quality Plots Using R
Textual Summaries
Stem and Leaf Plots
Histogram
Barplot
Density Plot
Basic Histogram
Basic Histogram with Density Curve
Practical Histogram
Correlation Plot
Colourful Correlations
Measuring Data Distributions
Textual Summaries
Boxplot
Multiple Boxplots
Boxplot by Class
Box and Whisker Plot
Box and Whisker Plot: With Means
Clustered Box Plot
Further Resources
Map Displays
Further Resources
Preparing Data
Data Selection and Extraction
Training and Test Datasets
Data Cleaning
Review Data
Removing Duplicates
Selectively Changing Vector Values
Replace Indices By Names
Missing Values
Remove Levels from a Factor
Removing Outliers
Variable Manipulations
Remove Columns
Reorder Columns
Remove Non-Numeric Columns
Remove Variables with no Variance
Cleaning the Wine Dataset
Cleaning the Cardiac Dataset
Cleaning the Survey Dataset
Imputation
Nearest Neighbours
Multiple Imputation
Data Linking
Simple Linking
Record Linkage
Data Transformation
Aggregation
Sum of Columns
Normalising Data
Binning
Interpolation
Outlier Detection
Variable Selection
Descriptive and Predictive Analytics
Building a Model
Cluster Analysis: K-Means
Summary
Clusters
Basic Clustering
Hot Spots
Alternative Clustering
Other Cluster Examples
Association Analysis: Apriori
Summary
Overview
Algorithm
Usage
Read Transactions
file
format
sep
cols
rm.duplicates
Summary
Apriori
data
parameter
appearance
control
Inspect
Examples
Video Marketing: Transactions From File
Survey Data: Data Preparation
Other Examples
Resources and Further Reading
Classification: Decision Trees
Summary
Overview
Examples
Simple Example
Convert Tree to Rules
Predicting Wine Type
Predicting Salary Group
Predicting Fraud: Underrepresented Classes
Alternatives and Enhancements
Resources and Further Reading
Classification: Boosting
Summary
Overview
AdaBoost Algorithm
Examples
Step by Step
Using gbm
Extensions and Variations
Alternating Decision Tree
Resources and Further Reading
Classification: Random Forests
Summary
Overview
Algorithm
Usage
Random Forest
importance
classwt
Examples
Resources and Further Reading
Issues
Incremental or Online Modelling
Model Tuning
Tuning rpart
Unbalanced Classification
Building Models
Outlier Analysis
Temporal Analysis
Survival Analysis
Evaluation
Basics
Basic Measures
Cross Validation
Graphical Performance Measures
Lift
The ROC Curve
Other Examples
10 Fold Cross Validation
Area Under Curve
Calibration Curves
Reporting
Cluster Analysis
Copyright © 2004-2008 Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is properly formatted and forms a comprehensive book (a draft of over 600 pages).
Brought to you by Togaware.