The fully open source software stack of the Ubuntu Data Science Virtual Machine (DSVM) hosted on Azure is a great place to support an R workshop or laboratory session or R training. I record  here the simple steps to set up a Linux Data Science Virtual Machine (in the main so I can remember how to do it each time).  Workshop attendees will have their own laptop computers and can certainly install R themselves but with the Ubuntu Data Science Virtual Machine we have a shared and uniformly configured platform which avoids the traditional idiosyncrasies and frustrations that plague a large class installing software on multiple platforms themselves. Instead of speding the first trouble filled hour of a class setting up everyone’s computer we can use a local browser to access either Jupyter Notebooks or RStudio Server running on the DSVM.

Jupyter Notebooks on JupyterHub

We illustrate the session with both Jupyter Notebook supporting multiple users under JupyterHub and as a backup running RStudio Server (for those environments where a secure connection through https is not permitted). Both can be accessed via browsers. JupyterHub uses https (encrypted) which may be blocked by firewalls within organisations. In that case an RStudio Server over http is presented as a backup.

WARNING: Jupyter Notebook has been able to render my laptop computer (under both Linux and Windows, Firefox and IE) unusable after a period of extensive usage when the browser freezes and the machine becomes completely unresponsive.

Jupyter Notebook provides a browser interface with basic literate programming capability. I’ve been a fan of literate programming since my early days as a programmer in the 1980’s when I first came across the concept from Donald Knuth. I now encourage literate data science and it is a delight to see others engaged is urging this approach to data science. Jupyter Notebooks are great for self paced learning intermixing a narrative with actual R code. The R code can be executed in place with results displayed in place as the student works through the material. Jupyter Notebooks are not such a great development environment though. Other environments excel there.

JupyterHub supports multiple users on the one platform, each with their own R/Jupyter process. The Linux Data Science Virtual Machine running on Azure provides these open source environments out of the box.  Access to JupyterHub is through port 8000.

Getting Started – Create a Ubuntu Data Science Virtual Machine

To begin we need to deploy a Ubuntu Data Science Virtual Machine. See the first two steps on my blog post. A DS14 server (or D14_V2 for a SSD based server) having 16 cores and 112 GB of RAM seems a good size (about $40 per day).

We may want to add a disk for user home folders as they can sometimes get quite large during training. To do so follow the Azure instructions:

  1. In the Portal click in the virtual machine.
  2. Click on Disks and Attach New.
  3. Choose the Size. 1000GB is probably okay for a class of 100.
  4. Click OK (takes about 2 minutes).
  5. Now log in to the server through ssh:
  6. The disk is visible as /dev/sdd
    • $ dmesg | grep SCSI
  7. Format the disk
    • $ sudo fdisk /dev/sdd
    • Type
      • n (new partition)
      • p (primary)
      • <enter> (1)
      • <enter> (2048)
      • <enter> (last sector)
      • p (create partition)
      • w (write partition)
    • $ sudo mkfs -t ext4 /dev/sdd1
  8. Create a temporary mount point and mount
    • $ sudo mkdir /mnt/tmp
    • $ sudo mount /dev/sdd1 /mnt/tmp
    • $ mount | grep /sdd1
  9. We will use this disk to mount as /home by default, so set that up
    • Check how much disk is used for /home
      • $ sudo du -sh /home
    • Synchronise /home to the new disk
      • $ sudo rsync -avzh /home/ /mnt/tmp/
    • Identify the unique identifier for the disk
      • $ sudo -i blkid | grep sdd1
    • Tell the system to mount the new disk as /home
      • $ sudo emacs /etc/fstab
      • Add the following single line with the appropriate UUID
        UUID=f395b783-31da-4916-a3a9-8fb56fd7a068 /home ext4 defaults,nofail,discard 1 2
    • Now mount the new disk as /home
      • $ sudo mount /home
    • No longer need the temporary mount so unmount
      • $ sudo umount /mnt/tmp
    • Move to the new version of home and ensure ssh can access
      • $ cd ~
      • $ df -h .
      • $ sudo restorecon -r /home

Connecting to JupyterHub

If you set up a DNS name label dsvmxyz01 and the location is southeastasia then visit:

First time you connect to the site you will be presented with a warning from the browser that the connection is insecure. It is using a self signed certificate to assure the encryption between your browser and the server. That is fine though a little disconcerting. As the user you could simply click through to allow the connection and add an exception. This often involves clicking on Advanced and then Add Exception… and then Confirm Security Exception. It is safe to provide an exception for now. However, best to install a proper certificate!

Install a LetsEncrypt Certificate

We can instead install a free Let’s Encrypt certificate from letsencrypt to have a valid non-self-signed certificate. To do so we first need to allow connection through the https: port (443) through the Azure portal for the dsvm. Then log on to the server and do the following:

$ ssh
$ sudo yum install epel-release
$ sudo yum install httpd mod_ssl python-certbot-apache
$ sudo emacs /etc/httpd/conf.d/ssl.conf
  Within the Virtual Host entry add
    # SSLProtocol all -SSLv2
$ sudo systemctl restart httpd
$ sudo systemctl status httpd
$ sudo certbot --apache -d
$ sudo systemctl start httpd

You should be able to connect now without the certificate warning.

You are presented with a Jupyter Hub Sign in page.

Screenshot from 2016-07-29 09:14:41

Creating User Accounts

Log in to the server. This will depend on whether you set up a ssh-key or a username and password. We assume the latter for this post. On a terminal (or using Putty on Windows), connect as:

$ ssh

You will be prompted for a password.

We can then create user accounts for each user in our workshop. The user accounts are created on the Linux DSVM. Here we create 40 user accounts and record their random usernames and passwords into the file usersinfo.csv on the server:

for i in {1..40}; do 
  u=`openssl rand -hex 2`
  sudo adduser user$u --gecos "" --disabled-password
  p=`openssl rand -hex 5`
  echo "user$u:$p" | sudo chpasswd
  echo user$u:$p >> 'usersinfo.csv'

If the process has issues and you need to start the account creation again then delete the users:

for i in $(cut -d ":" -f1 usersinfo.csv); do 
  sudo deluser --remove-home $i; 

# Check it has been done

tail /etc/passwd
ls /home/

Provide a username/passwd to each participant of the workshop, one line only to each user. The file will begin something like:


Now go back to and Sign in with the Username userce81 and Password d0dfac5a30 (using the username and password from your own usersinfo.csv file.)

Once logged in Jupyter will display a file browser.


Notice a number of notebooks are available. Click the IntroTurorialInR.ipynb for a basic introduction to R.

Screenshot from 2016-07-29 09:15:49

Backup Option – RStudio

JupyterHub requires https and so won’t run internally within a customer site if they have a firewall blocking all SSL (encrypted) communications. In this case RStudio server is a backup option. It is pre-installed on the server and if you followed my instructions above for deploying a DSVM you will hav updated to the latest version too.

Connect to the RStudio server:

Sign in to RStudio with the same Username and Password as above.


Running Rattle through an X2Go Desktop

If you followed my DSVM deployment guide then you will have also set up X2Go on your local computer to support a desktop connection across to the DSVM. This is very convenient in terms of running desktop apparitions, like Rattle,  on the DSVM. Every student in the class gets the same environment.

Shortcuts to the Services

The URLs are rather long and so we can set up either or shortcuts. Visiting the latter we set up two short URLs: as as

We can now use the short URLs to refer to the long URLs.

REMEMBER: Deploy-Compute-Destroy for a cost effective hardware platform for Data Science. Deallocate (Stop) your server when it is not required.

Graham @ Microsoft

Data Scientists have access to a grammar for preparing data (Hadley Wickham’s tidyr package in R), a grammar for data wrangling (dplyr), and a grammar for graphics (ggplot2).

At an R event hosted by CSIRO in Canberra in 2011 Hadley  noted that we are missing a grammar for machine learning. At the time I doodled some ideas but never developed. I repeat those doodles here. The idea’s are really just that – ideas as a starting point. Experimental code is implemented in the graml package for R which is refining the concepts first explored in the experimental containers package.

A grammar of machine learning can follow the ggplot2 concept of building layer upon layer to define the final model that we build. I like this concept rather than the concept of a data flow for model building. With a data flow a dataset is piped (in R using magrittr’s %>% operator) from one data wrangling step to the next data wrangling step. Hadley’s tidyr and dplyr do this really well.

The concept of a grammar of machine learning begins with recognising that we want to train a model:

train(ds, form(target ~ .))

Simply we want to train a model using some dataset ds where one of the columns of the dataset is named target and we expect to model this variable based on the other variables within the dataset (signified by the ~ .).

Generally in machine learning and statistical model building we split our dataset into a training dataset, a validation dataset, and a testing dataset. Some use only two datasets. Let’s add this in as the next “layer” for our model build.

train(ds, form(target ~ .)) +
  dataPartition(0.7, 0.3)

That is, we ask for 70% of the data randomly sampled to train the model.

We will have already performed our data preparation steps and let’s say that we know in ds the target variable has only two distinct values, yes and no. Thus a binary classification model is called for.

In R we have a tremendous variety of model building algorithms that support binary classification. My favourite has been randomForest so let’s add in our request to train a model using randomForest().

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +

Now we might want to do a parameter sweep over the mtry parameter to the randomForest() function which is the number of variables to randomly sample as we build each decision tree.

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1))

Finally to report on the evaluation of the model using the area under the curve (AUC).

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1)) +

The object returned is a trained model incorporating the additional information requested. Other operations can be performed on this model object, including its deployment into a production system!

We can provide parameters as a lightweight layer above other model building packages with no or minimal effort required to move to a new model builder.


(Image from

Graham @ Microsoft

I saw a demo of a package for Rapid and Pretty Things in R earlier in the year when it was a work in progress. It is now live on GitHub (but not yet CRAN). It allows you to very quickly visualise data in R using a Shiny GUI to generate ggplot2 underneath. A nice app for some visual analytics.


Screenshot-raptR - Mozilla Firefox

This is a nice example of the power of multiple APIs working together to deliver a solution.

The app uses R’s Shiny to control a map built using the open source JavaScript Leaflet based on public data displaying map tiles generated by  Stamen Design on Open Street Map data.

Thanks to colleague and R guru Hugh Parsonage for pointing to this one on twitter


Screenshot-Mozilla Firefox-1