DATA MINING
Desktop Survival Guide
by Graham Williams

Text Mining with R

See the ttda and tm packages.

Text mining begins with feature extraction. Techniques include:

Keyword extraction
Bag of words
Term weighting
Co-occurrence of words
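
As a rough illustration of two of these techniques, the following base R sketch builds a bag of words and a document-level co-occurrence matrix. The three toy documents and all variable names are invented for illustration and are not part of the crude dataset or the tm package.

```r
# Toy documents (hypothetical, for illustration only)
docs <- c(d1 = "Crude oil prices rose",
          d2 = "OPEC cut crude oil output",
          d3 = "Oil output rose")

# Tokenise: lower-case, then split on anything that is not a letter
tokens <- lapply(tolower(docs), function(d) {
  words <- strsplit(d, "[^a-z]+")[[1]]
  words[nchar(words) > 0]
})

# Bag of words: one row per document, one column per term, cells are counts
vocab <- sort(unique(unlist(tokens)))
bow <- t(vapply(tokens,
                function(w) tabulate(match(w, vocab), nbins = length(vocab)),
                integer(length(vocab))))
colnames(bow) <- vocab

# Co-occurrence: number of documents in which both terms of a pair appear
present <- (bow > 0) * 1
cooc <- crossprod(present)
```

The diagonal of cooc holds each term's document frequency, and an off-diagonal cell such as cooc["crude", "oil"] counts the documents that contain both terms.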

Here is a simple example using tm. The crude dataset contains 20 news articles dealing with crude oil. Its class is TextDocCol, a text document collection. We can create our own text document collections using functions provided by the tm package, which read a collection of source documents from a specified directory and process them into a TextDocCol. Given a TextDocCol, TermDocMatrix generates a weighted count of the terms in each document (omit the weighting argument if you just want plain term counts).

The following session shows the data and the resulting term-document matrices, first weighted and then as plain counts:

> library(tm)
> vignette("tm")
> data(crude)
> class(crude)
[1] "TextDocCol"
attr(,"package")
[1] "tm"
> crude
A text document collection with 20 text documents
> crude@.Data
[[1]]
[1] "Diamond Shamrock Corp said that \neffective [...]"

[[2]]
[1] "OPEC may be forced to meet before a \nscheduled [...]"

[...]

[[20]]
[1] "Argentine crude oil production was \ndown 10.8 pct [...]"

> tdm <- TermDocMatrix(crude, weighting = "tf-idf", stopwords = TRUE)
> tdm
An object of class "TermDocMatrix"
Slot "Data":
20 x 859 sparse Matrix of class "dgCMatrix"
   [[ suppressing 859 column names 'barrel', 'brings', 'citing' ... ]]

127 2 2.321928 4.321928 2.736966 2 4.643856 4.321928 2.736966
144 . . . 2.736966 . . . .
[...]

> tdm <- TermDocMatrix(crude, stopwords = TRUE)
> tdm
An object of class "TermDocMatrix"
Slot "Data":
20 x 859 sparse Matrix of class "dgCMatrix"
   [[ suppressing 859 column names 'barrel', 'brings', 'citing' ... ]]

127 2 1 1 1 1 2 1 1 2 2 1 2 2 1 1 1 1 1 1 1 5 2 2 3 1 2
144 . . . 1 . . . . . . . . . . . . . . 4 1 12 . 1 5 . .
191 1 1 . . 1 1 . . 2 . . . 1 1 . . 1 . . . 2 1 2 . . .
194 1 1 . . 1 1 . . 3 . . . 2 1 . 1 . . . . 1 1 2 . . .
[...]
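
The weighted values shown above are consistent with multiplying each term count by log2(N / df), where N is the number of documents and df is the number of documents containing the term: with N = 20, log2(20 / 1) is 4.321928 and log2(20 / 4) is 2.321928. The text does not state the formula, so treat this as an inference from the output. The base R sketch below applies that weighting to a small invented count matrix. The matrix and all names are hypothetical and do not come from tm.

```r
# Hypothetical term counts: rows are documents, columns are terms
counts <- matrix(c(1, 0, 2,
                   1, 1, 0,
                   1, 0, 0,
                   1, 0, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("oil", "opec", "barrel")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)     # number of documents containing each term
idf      <- log2(n_docs / doc_freq) # rarer terms receive a larger weight

# Multiply each column of counts by that term's idf
tfidf <- sweep(counts, 2, idf, `*`)
```

A term that occurs in every document, such as "oil" here, gets a weight of zero, which is why tf-idf weighting tends to suppress terms that do not distinguish one document from another.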

We can convert tdm into an ordinary matrix, either to save the word counts or to compute various measures such as the Euclidean distance between documents:

> x <- as.matrix(tdm@Data)
> write.csv(x, "crude_words.csv")
> dist(x, method = "euclidean")
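
Note that dist computes distances between the rows of its argument, so for a matrix with one row per document the result is the pairwise distance between documents. The following sketch shows this on a small invented matrix. The data and names are hypothetical and only base R is used.

```r
# Hypothetical count matrix: rows are documents, columns are terms
x <- matrix(c(2, 0, 1,
              2, 0, 1,
              0, 3, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("docA", "docB", "docC"),
                            c("oil", "opec", "barrel")))

d  <- dist(x, method = "euclidean") # distances between rows (documents)
dm <- as.matrix(d)                  # full symmetric matrix, easier to index

# Distances between columns (terms) need the transpose
d_terms <- dist(t(x))
```

Here docA and docB have identical counts, so dm["docA", "docB"] is zero, while docC differs from both in every term.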

Brought to you by Togaware. This page generated: Sunday, 22 August 2010