Install

DeltaRho is available as a local stack, installed on a single workstation, and as a full stack, installed on a cluster. For each, we provide options to get your DeltaRho environment up and running.

The local stack simply consists of the following:

  • R
  • datadr R package
  • Trelliscope R package

The full stack has the above components as well as the following on each node of the cluster:

  • Hadoop
  • RHIPE Hadoop connector R package
  • RStudio Server (namenode only)
  • Shiny Server (namenode only)

Hadoop and RHIPE can also be replaced by other distributed computing technologies and connectors, such as Spark and SparkR. Regardless, your code stays virtually the same.
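
As a rough illustration of how the code stays the same across back ends, the sketch below runs one analysis function against data divided in memory and against the same data divided onto local disk. It assumes datadr is already installed (see the installation steps in the following sections) and uses R's built-in iris data. The names meanSepal, byMemory, byDisk, and the directory name are made up for this example. A cluster back end would follow the same pattern with an HDFS connection, which is not run here.

```r
# Sketch only: runs when datadr is installed, otherwise prints a note and skips.
if (requireNamespace("datadr", quietly = TRUE)) {
  library(datadr)

  # the analysis step is written once, independent of where the data lives
  meanSepal <- function(x) mean(x$Sepal.Length)

  # back end 1: subsets held in memory
  byMemory <- divide(iris, by = "Species")

  # back end 2: subsets written to local disk (directory name is arbitrary)
  diskConn <- localDiskConn(file.path(tempdir(), "irisBySpecies"),
    autoYes = TRUE)
  byDisk <- divide(iris, by = "Species", output = diskConn, update = TRUE)

  # on a cluster, an hdfsConn() connection would take the place of
  # localDiskConn(); that requires Hadoop and RHIPE and is not run here

  # the same transform and recombine calls are applied to both objects
  print(recombine(addTransform(byMemory, meanSepal), combRbind))
  print(recombine(addTransform(byDisk, meanSepal), combRbind))
} else {
  message("datadr is not installed; skipping the back-end sketch")
}
```

Only the connection passed to divide() differs between the two runs; the transformation and recombination code is unchanged.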

Local Stack on Workstation

After installing R, simply launch R and install the base DeltaRho packages from CRAN:

install.packages("datadr")
install.packages("trelliscope")
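
To confirm that the installation succeeded, a quick check along the following lines may help. This is a minimal sketch in base R, not part of the DeltaRho tooling; the variable names pkgs, installed, and missingPkgs are chosen for illustration.

```r
# check which of the base DeltaRho packages can be found by R
pkgs <- c("datadr", "trelliscope")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# report anything that still needs to be installed
missingPkgs <- pkgs[!installed]
if (length(missingPkgs) > 0) {
  message("Still missing: ", paste(missingPkgs, collapse = ", "))
} else {
  message("All base DeltaRho packages are installed")
}
```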

Now you are ready to try out the quickstart code or begin working through the tutorials, and your environment is suitable for analyzing small to moderate (low gigabyte) data.

Full Stack on Vagrant VM

To give you a feel for running in a large-scale DeltaRho environment, we provide a Vagrant setup that, with a few simple commands, provisions a virtual machine on your workstation running the full DeltaRho stack.

The Vagrant script and instructions are available on GitHub.

Full Stack on Amazon Web Services

We have provided an easy way to get going with DeltaRho in a large-scale environment through a simple set of scripts that provision the DeltaRho environment on Amazon's Elastic MapReduce (EMR). This allows you to spin up virtual clusters on demand. An Amazon account is required.

This environment comes with RStudio Server running on the master node, so all you need is a web browser to access RStudio, a fantastic R IDE, backed by your own Hadoop cluster.

The EMR scripts and instructions are available on GitHub.

Full Stack on Your Cluster

Setting up and installing all of the DeltaRho components on your own cluster will require more commitment in terms of hardware, installation, configuration, and administration. We have put together an installation manual that is available here.

Try It

Here is a simple example to get a feel for DeltaRho usage. Commentary about the example is available in the datadr tutorial here.

# install package with housing data
install.packages("housingData")
library(housingData)
library(datadr)
library(trelliscope)
# lattice provides xyplot(), used below
library(lattice)

# look at housing data
head(housing)

# divide by county and state
byCounty <- divide(housing,
  by = c("county", "state"), update = TRUE)

# look at summaries
summary(byCounty)

# look at overall distribution of median list price
priceQ <- drQuantile(byCounty,
  var = "medListPriceSqft")
xyplot(q ~ fval, data = priceQ,
  scales = list(y = list(log = 10)))

# slope of fitted line of list price for each county
lmCoef <- function(x)
  coef(lm(medListPriceSqft ~ time, data = x))[2]
# apply lmCoef to each subset
byCountySlope <- addTransform(byCounty, lmCoef)

# look at a subset of transformed data
byCountySlope[[1]]

# recombine all slopes into a single data frame
countySlopes <- recombine(byCountySlope, combRbind)
plot(sort(countySlopes$val))

# make a time series trelliscope display:
# first, connect to a visualization database (vdb) in a temporary directory
vdbConn(tempfile("vdb"), autoYes = TRUE)

# make and test panel function
timePanel <- function(x)
  xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
    data = x, auto.key = TRUE, ylab = "$ / Sq. Ft.")
timePanel(byCounty[[1]]$value)

# make and test cognostics function
priceCog <- function(x) {
  list(
    slope = cog(lmCoef(x), desc = "list price slope"),
    meanList = cogMean(x$medListPriceSqft),
    listRange = cogRange(x$medListPriceSqft),
    nObs = cog(sum(!is.na(x$medListPriceSqft)),
      desc = "number of non-NA list prices")
  )
}
priceCog(byCounty[[1]]$value)

# add display panel and cog function to vdb
makeDisplay(byCounty,
  name = "list_sold_vs_time",
  desc = "List and sold price over time",
  panelFn = timePanel, cogFn = priceCog,
  width = 400, height = 400,
  lims = list(x = "same"))

# view the display
view()

You can view a variant of this display here.
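
To see the divide, transform, and recombine pattern from the example without installing any packages, it can be imitated in base R. The sketch below is not how datadr works internally. It is a small analogue on made-up toy data, and the names trueSlopes, toy, byCountyToy, slopeFn, and slopesToy are hypothetical.

```r
# toy data, made up for illustration: three "counties" whose price
# changes linearly over time with a known slope
trueSlopes <- c(A = 2, B = -1, C = 0.5)
toy <- data.frame(
  county = rep(names(trueSlopes), each = 10),
  time = rep(1:10, times = 3),
  stringsAsFactors = FALSE
)
toy$price <- 100 + unname(trueSlopes[toy$county]) * toy$time

# divide: one data frame per county
byCountyToy <- split(toy, toy$county)

# transform: slope of a fitted line of price against time for each subset
slopeFn <- function(x) unname(coef(lm(price ~ time, data = x))[2])
slopeList <- lapply(byCountyToy, slopeFn)

# recombine: bind the per-county slopes into a single data frame
slopesToy <- data.frame(
  county = names(slopeList),
  slope = unlist(slopeList),
  row.names = NULL
)
print(slopesToy)
```

Here split() plays the role of divide(), lapply() the role of addTransform(), and building the final data frame the role of recombine() with combRbind. What the base R version lacks is the ability to run the same steps against data stored on disk or on a cluster.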