This was originally shared as a Revolution Analytics Blog Post on 25th October 2016.

Programming is an art and a way we express ourselves. As we write our programs we should keep in mind that someone else is very likely to read them. We can facilitate the accessibility of our programs through a clear presentation of the messages we are sharing.

As data scientists we also practise this art of programming. Even more so, we aim to share the narrative of our discoveries, living and breathing the data through the programs we write over it. Writing programs so that others understand why and how we analysed our data is crucial. Data science is much more than building black-box analyses and models; we should seek to expose and share the process, and particularly the knowledge that is discovered from the data.

Style is important in making the code we share readily accessible. Dictating a style to others is a sensitive issue: we thrive on our freedom to innovate and to express ourselves how we want, yet we also need consistency, and a style guide supports that. A style guide also helps us journey through a new language, providing a foundation for developing, over time, our own style in that language.

Through a style guide we share the tips and tricks for communicating clearly through our programs. We communicate through the language — a language that also happens to be executable by a computer. In this language we follow precisely specified syntax to develop sentences, paragraphs, and whole stories. Whilst there is infinite leeway in how we express ourselves in any language we can share a common set of principles as our style guide.

Over the years, styles for many different languages have evolved together with the media through which we interact with computers. I have a style guide for R that presents my personal and current choices. This is the style guide I suggest (even require) for projects I lead.

I hope the guide might be useful to others. It augments the other R style guides out there by providing the rationale for my choices. Irrespective of whether specific style suggestions suit you or not, choose your own and use them consistently. Focus first on communicating with others and only secondarily on the execution of your code (critical though that is). Think of writing programs as writing narratives for others to read, to enjoy, to learn from, and to build upon. It is a creative act to communicate well with our colleagues, so be creative with style.

Hands On Data Science: Sharing R Code — With Style

The featured image comes from https://blog.codinghorror.com/new-programming-jargon/ where the concept of Egyptian Brackets is explained.

Graham @ Microsoft

I had the privilege of joining a panel in 2014 that explored big data opportunities and challenges. Together, coordinated by Professor Zhi-Hua Zhou, we captured our thoughts in a paper published in the IEEE Computational Intelligence Magazine (Volume 9, Number 4).

It is an honour to learn that we have received a 2017 IEEE Outstanding Paper Award. The paper is:

Zhi-Hua Zhou, Nitesh V. Chawla, Yaochu Jin, Graham J. Williams, “Big data opportunities and challenges: Discussions from data analytics perspectives”, IEEE Computational Intelligence Magazine, vol. 9, no. 4, pp. 62-74, November 2014.

The paper includes a discussion of taking ensemble concepts to the extreme, reflecting on the need for the pendulum to swing back toward protecting privacy, and the resulting focus on massively ensembled models, each “model” modelling an individual across extensive populations. The award was bestowed in November 2017.

I have released an alpha version of Rattle with two significant updates.

Eugene Dubossarsky and his team have been working on a Shiny interface for generating ggplot2 graphics interactively, a package called ggraptR. It is now available through Rattle’s Explore tab by choosing the Interactive option.

[Screenshot: ggraptR interactive plot builder in Rattle’s Explore tab]

In line with Rattle’s philosophy of teaching the programming of data by exposing all code through Rattle’s Log tab, ggraptR has a button to generate the code for the current plot. Click the Generate Plot Code button, copy the resulting code, and paste it into the R console, a knitr document, or a Jupyter notebook. Execute the code to regenerate the plot, and then fine-tune it further if you like.
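
The generated code is plain ggplot2 that can be run and extended anywhere. The exact code ggraptR emits will differ; the following is merely a sketch of the kind of output to expect, assuming Rattle’s sample weather dataset.

library(ggplot2)
library(rattle)   # provides the sample weather dataset

# A generated-style plot: MaxTemp against MinTemp, coloured by the target.
ggplot(weather, aes(x=MinTemp, y=MaxTemp, colour=RainTomorrow)) +
  geom_point(alpha=0.5) +
  labs(title="MaxTemp versus MinTemp")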

The current alpha version has a few niggles that are being sorted out but it is already worth giving it a try.

The second major update is initial support for Microsoft R Server, so that Rattle can now handle datasets of any size. From Rattle’s Data tab choose an XDF file to load.

[Screenshot: loading an XDF file from Rattle’s Data tab]

A sample of the full (generally big) dataset is actually loaded into memory, but many of the usual operations are performed on the XDF dataset on disk. For example, build a decision tree and Rattle will automatically choose rxDTree() for the XDF dataset instead of rpart().
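
Underneath, this corresponds to pointing RevoScaleR at the file on disk rather than reading the data into memory. A minimal sketch (the file name and formula here are hypothetical):

library(RevoScaleR)   # shipped with Microsoft R Server/Client

# Reference the XDF file on disk; the data stays out of memory.
weatherXdf <- RxXdfData("weather.xdf")

# Build the decision tree directly over the on-disk dataset.
model <- rxDTree(RainTomorrow ~ MinTemp + MaxTemp + Humidity3pm,
                 data=weatherXdf)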

[Screenshot: building a decision tree with rxDTree()]

Visualise the tree as usual.
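
One way to do this by hand, assuming the rxDTree model from the sketch above, is to add rpart inheritance to the model so the usual rpart plotting tools apply:

library(rattle)

# rxAddInheritance() lets the rxDTree model be treated as an rpart model.
fancyRpartPlot(rxAddInheritance(model))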

[Screenshot: visualising the decision tree]

Performance evaluation is also currently supported.

[Screenshot: performance evaluation with an error matrix and risk chart]

Do check the Log tab to review the commands that were executed underneath.

This is an initial release and there is still plenty of functionality to expose. Currently implemented for binary classification:

  • Data: Load XDF file;
  • Explore: Subset the dataset for interactive exploration;
  • Models: rxDTree, rxDForest;
  • Evaluate: Error Matrix, Risk Chart.

Still to come:

  • Data: Import CSV;
  • Models: boosting, neural network, SVM.

You can try this new version using Microsoft R Client on MS/Windows, or by firing up an Azure Linux Data Science Virtual Machine, which comes with the developer version of Microsoft R Server installed. Then upgrade the pre-installed Rattle to this new release:

> install.packages(c("rattle", "devtools"))
> devtools::install_bitbucket("kayontoga/rattle")

Graham @ Microsoft

Data Scientists have access to a grammar for preparing data (Hadley Wickham’s tidyr package in R), a grammar for data wrangling (dplyr), and a grammar for graphics (ggplot2).

At an R event hosted by CSIRO in Canberra in 2011, Hadley noted that we are missing a grammar for machine learning. At the time I doodled some ideas but never developed them. I repeat those doodles here. The ideas are really just that: ideas, a starting point. Experimental code is implemented in the graml package for R, which refines the concepts first explored in the experimental containers package.

A grammar of machine learning can follow the ggplot2 concept of building layer upon layer to define the final model we build. I prefer this concept to that of a data flow for model building, where a dataset is piped (in R using magrittr’s %>% operator) from one data wrangling step to the next. Hadley’s tidyr and dplyr do this really well.
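
For contrast, a typical data-flow pipeline looks like the following (the dataset and column names here are hypothetical):

library(dplyr)

# Each step transforms the dataset and pipes the result onward.
ds %>%
  filter(!is.na(target)) %>%
  group_by(target) %>%
  summarise(count=n())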

The concept of a grammar of machine learning begins with recognising that we want to train a model:

train(ds, formula(target ~ .))

Put simply, we want to train a model using some dataset ds, where one of the columns of the dataset is named target, and we expect to model this variable based on the other variables within the dataset (signified by the ~ .).

Generally in machine learning and statistical model building we split our dataset into a training dataset, a validation dataset, and a testing dataset. Some use only two datasets. Let’s add this in as the next “layer” for our model build.

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3)

That is, we ask for a random 70% of the data to train the model, with the remaining 30% held out.
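
In base R terms, the dataPartition() layer corresponds to something like the following random split (a sketch; the variable names are illustrative):

# A random 70/30 split of the rows of ds.
set.seed(42)
train_rows <- sample(nrow(ds), 0.7*nrow(ds))
train_ds   <- ds[train_rows, ]
test_ds    <- ds[-train_rows, ]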

We will have already performed our data preparation steps and let’s say that we know in ds the target variable has only two distinct values, yes and no. Thus a binary classification model is called for.

In R we have a tremendous variety of model building algorithms that support binary classification. My favourite has been randomForest, so let’s add in our request to train a model using randomForest().

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest)

Now we might want to do a parameter sweep over the mtry parameter of the randomForest() function, which is the number of variables randomly sampled as candidates at each split as each decision tree is built.

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1))
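
As a concrete equivalent of the tuneSweep() layer, written directly against randomForest (nvars is the number of input variables; this sketch assumes the train_ds split from above and a factor target for classification):

library(randomForest)

# Sweep mtry from 5 up to the number of input variables, recording
# the final out-of-bag error rate for each setting.
nvars <- ncol(train_ds) - 1
oob <- sapply(seq(5, nvars), function(m) {
  rf <- randomForest(target ~ ., data=train_ds, mtry=m)
  rf$err.rate[rf$ntree, "OOB"]
})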

Finally, we report on the evaluation of the model using the area under the ROC curve (AUC).

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1)) +
  evaluate_auc()
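
Again as a sketch, evaluate_auc() might correspond to scoring the holdout data and computing the AUC, here using the ROCR package (the model and variable names follow the earlier sketches, and the target is assumed to take the values yes and no):

library(ROCR)

# Score the 30% holdout and compute the area under the ROC curve.
rf   <- randomForest(target ~ ., data=train_ds, mtry=4)
prob <- predict(rf, test_ds, type="prob")[, "yes"]
auc  <- performance(prediction(prob, test_ds$target), "auc")@y.values[[1]]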

The object returned is a trained model incorporating the additional information requested. Other operations can be performed on this model object, including its deployment into a production system!

We can provide parameters to the model builder, so the grammar acts as a lightweight layer above other model building packages, with little or no effort required to move to a new model builder:

  model(randomForest::randomForest, 
        ntree=100, 
        mtry=4, 
        importance=TRUE,
        replace=FALSE, 
        na.action=randomForest::na.roughfix)

(Image from http://grammar.ccc.commnet.edu/grammar/)

Graham @ Microsoft

A 5-video series called Data Science for Beginners has been released by Microsoft. It introduces practical data science concepts to a non-technical audience, keeping the language clear and simple to make data science accessible as an entry point to understanding the field.


http://aka.ms/data-science-for-beginners-1
http://aka.ms/data-science-for-beginners-2
http://aka.ms/data-science-for-beginners-3
http://aka.ms/data-science-for-beginners-4
http://aka.ms/data-science-for-beginners-5

Graham @ Microsoft