Togaware

Data Mining Resources

Rattle is a free and open source data mining toolkit written in the statistical language R using the Gnome graphical interface. It runs under GNU/Linux, Macintosh OS X, and MS/Windows. Rattle is being used in business and for teaching data mining in Australia and internationally.

The free and open source book, The Data Mining Desktop Survival Guide (ISBN 0-9757109-2-3), explains the otherwise complex algorithms and concepts of data mining in simple terms, with examples that illustrate each algorithm using the statistical language R. The book is being written by Dr Graham Williams, based on his 20 years of research and consulting experience in machine learning and data mining. An electronic PDF version is available for a small fee from Togaware ($40AUD/$35USD to cover costs and ongoing development).

Other Resources

Using R for Data Mining

The open source statistical programming language R (based on S) is in daily use in academia, business, and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.

R holds the data it works on in main memory, so on 32-bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000, depending on what you are doing). Deploying R on 64-bit multiple-CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.
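To get a feel for why memory limits dataset size, here is a rough back-of-the-envelope sketch in Python. It assumes every value is stored as an 8-byte double (as R does for numeric vectors) and that an analysis keeps a few working copies of the data. The column count of 200, the copy factor of 3, and the 2 GiB usable address space for a 32-bit process are illustrative assumptions, not measurements.

```python
BYTES_PER_DOUBLE = 8   # R stores numeric values as 8-byte doubles
GIB = 1024 ** 3

# Assumed limits, for illustration only.
LIMIT_32BIT = 2 * GIB    # assumed usable address space of a 32-bit process
LIMIT_SERVER = 32 * GIB  # main memory of the 64-bit server described above


def dataset_bytes(rows, cols, copies=1):
    """Approximate memory for a purely numeric table, times working copies."""
    return rows * cols * BYTES_PER_DOUBLE * copies


def fits(rows, cols, copies, limit_bytes):
    """True if the estimated footprint is within the given memory limit."""
    return dataset_bytes(rows, cols, copies) <= limit_bytes


if __name__ == "__main__":
    for rows in (100_000, 1_000_000, 10_000_000):
        need = dataset_bytes(rows, cols=200, copies=3)
        print(
            f"{rows:>10,} rows: {need / GIB:6.2f} GiB  "
            f"32-bit ok: {fits(rows, 200, 3, LIMIT_32BIT)}  "
            f"32GB server ok: {fits(rows, 200, 3, LIMIT_SERVER)}"
        )
```

Under these assumptions the smallest table fits comfortably in a 32-bit process, the middle one needs the 64-bit server, and the largest exceeds even 32GB. The real figures depend heavily on the column types and on how many copies a given algorithm makes.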

R is open source, which assures us that there will always be the opportunity to fix and tune things to suit our specific needs, rather than having to convince a vendor to fix or tune their product for us.

Also, because R is open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappeared (e.g., IBM's Intelligent Miner).

Open standards are important for users, but vendors resist them for obvious reasons and would prefer to lock you in to their products. A number of commercial tools claim to support, for example, the open standard PMML for interoperability (sharing models between applications). But the support is patchy and not worth the effort. We have started a PMML effort in R to attempt to address the desire for interoperability.
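To make the interoperability idea concrete: PMML is an XML format, so a model exported by one tool can in principle be read by any other tool that has an XML parser. The sketch below uses only Python's standard library to read a tiny, hand-written PMML-style regression model and score one record. The model, its field names (`age`, `income`, `score`), and its coefficients are invented for illustration, and the namespace assumes PMML 4.4. Real exports carry far more detail, and this is not how the R PMML effort mentioned above is implemented.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for PMML 4.4; other versions use a different URI.
NS = {"pmml": "http://www.dmg.org/PMML-4_4"}

# A toy, hand-written model (hypothetical fields and coefficients).
PMML_DOC = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="toy linear model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.25"/>
      <NumericPredictor name="income" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(text):
    """Extract the intercept and per-field coefficients from the XML."""
    root = ET.fromstring(text.strip())
    table = root.find(".//pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefs


def score(model, record):
    """Apply the linear model: intercept + sum(coefficient * value)."""
    intercept, coefs = model
    return intercept + sum(c * record[name] for name, c in coefs.items())


model = load_linear_model(PMML_DOC)
prediction = score(model, {"age": 40.0, "income": 3.0})
print(prediction)  # 1.5 + 0.25*40 + 2.0*3 = 17.5
```

The point of the exercise is that nothing here depends on the tool that produced the model. Patchy support in practice usually means a tool writes or reads only part of the standard, which is exactly what breaks this kind of exchange.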

Some commercial statistical products excel at handling very large datasets, but they are limited in the analytic algorithms they provide. Commercial vendors, naturally, need to be convinced of the usefulness of a new algorithm before implementing it. In R, on the other hand, a vast selection of algorithms has long been available for deployment.