Desktop Survival Guide
by Graham Williams
A very convenient, simple, and universally recognised data format is the plain text file with one record of data per line and each line containing comma separated fields. Such a format is referred to as comma separated values (or CSV for short). Such a simple format offers many advantages over proprietary formats, including the straightforward ability to share the data easily amongst many applications. Also, for many processing tasks where all of the data is touched, access to such a simple format is often considerably faster than through, for example, database querying. While the sophisticated database administrator can certainly explore, tune, and index a database to provide targeted, efficient, and fast access for particular queries, simple progression through a CSV file requires much less sophistication, generally without sacrificing performance, and often with improved performance.
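As a minimal sketch of the format just described, the following reads three records, one per line with comma separated fields, directly from a character string. The values and the column names (name, age, job) are invented purely for illustration and do not come from any dataset used in the book.

```r
# Three illustrative records, one per line, fields separated by commas.
# The values are made up for this sketch.
csv_text <- "alice,34,clerk
bob,45,manager
carol,29,analyst"

# header = FALSE because the text carries no header line; the column
# names supplied here are hypothetical.
people <- read.csv(text = csv_text, header = FALSE,
                   col.names = c("name", "age", "job"))

# Each line has become one row and each field one column.
print(dim(people))
str(people)
```

Because the file is just text, the same three lines could equally be opened in a text editor, a spreadsheet, or any other tool that understands CSV.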
Another advantage is that the steps written to process a CSV data file, using R to implement the processing, can simply and freely be transferred from platform to platform, whether it be GNU/Linux or MS/Windows. Thus the investment in processing and delivering results from the data is enhanced.
Comma Separated Value (CSV) files can be read and written in R using read.csv and write.table. Our first example obtains a small (12K) CSV file from the Internet using the download.file function. The data is then loaded into R, with the appropriate column names added (since the dataset doesn't come with the names). We then save the dataset to a new CSV file (with the right headers) using write.table, as well as to a binary R data file using save. The resulting data files will be used in the examples throughout the book.
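The sequence of steps might be sketched as follows. This is not the book's own example. The address of the 12K file is not given here, so a tiny local file reached through a file:// URL stands in for the remote one so that the sketch runs without network access. The file names, the three column names, and the sample values are all assumptions made for illustration. Replace source_url with the real address and col_names with the dataset's actual column names.

```r
# Stand-in for the remote file: a small headerless CSV written locally.
# The contents are made up for this sketch.
source_file <- file.path(tempdir(), "source.csv")
writeLines(c("1,3.5,a", "2,4.1,b", "3,2.8,c"), source_file)

# Hypothetical URL. Substitute the real http(s) address of the dataset.
source_url <- paste0("file:///",
                     sub("^/", "", normalizePath(source_file, winslash = "/")))

# Step 1: fetch the file.
raw_file <- file.path(tempdir(), "raw.csv")
download.file(source_url, raw_file, quiet = TRUE)

# Step 2: load it, supplying the column names the file itself lacks.
col_names <- c("id", "value", "label")  # hypothetical names
dataset <- read.csv(raw_file, header = FALSE, col.names = col_names)

# Step 3: write a new CSV file that does include a header line.
csv_out <- file.path(tempdir(), "dataset.csv")
write.table(dataset, csv_out, sep = ",", row.names = FALSE)

# Step 4: also save a binary R data file for fast reloading with load().
rdata_out <- file.path(tempdir(), "dataset.RData")
save(dataset, file = rdata_out)
```

In a later session the saved dataset can be restored with load(rdata_out), or the new CSV file can be read with read.csv(csv_out), which now picks up the column names from the header line.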
These datasets will be used throughout the book to illustrate various techniques and approaches to data cleaning, variable selection, and modelling.