Research in the DeltaRho project collectively covers all of the intellectual areas of data science, from cluster hardware design for big data to theoretical statistics. The fundamental approach behind the research is Divide & Recombine (D&R). In D&R, the data are divided into subsets, analytic methods are applied to each subset independently without communication between subsets, and the subsets’ outputs for each method are recombined. Research in statistical theory seeks division methods and recombination methods to optimize the statistical accuracy of D&R results. While D&R is a statistical approach, its goal is simple, fast, scalable computation.
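As a rough illustration of the divide, apply, and recombine steps described above, here is a minimal Python sketch. It is not DeltaRho code: the function names (`divide`, `fit_line`, `recombine`, `divide_and_recombine`) are hypothetical, the data are synthetic, and the choices of random division and recombination by averaging are assumptions made for the example, not methods this page prescribes.

```python
import random
from statistics import fmean


def divide(rows, n_subsets, seed=0):
    """Division step (assumed here: random assignment of rows to subsets)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # Deal the shuffled rows round-robin into n_subsets disjoint subsets.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Analytic method applied to one subset: least-squares line fit.

    Returns (intercept, slope) for the (x, y) pairs in the subset.
    """
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return (y_bar - slope * x_bar, slope)


def recombine(outputs):
    """Recombination step (assumed here: coordinate-wise average)."""
    return tuple(fmean(column) for column in zip(*outputs))


def divide_and_recombine(rows, n_subsets):
    subsets = divide(rows, n_subsets)
    # Each subset is processed on its own, with no communication between
    # subsets, so this loop could be distributed across cluster nodes.
    outputs = [fit_line(subset) for subset in subsets]
    return recombine(outputs)


# Synthetic data for illustration only: y = 2 + 3x + Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 + 3.0 * x + rng.gauss(0, 1)))

intercept, slope = divide_and_recombine(data, n_subsets=20)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Because each subset is fitted independently, the per-subset work parallelizes trivially. The averaged estimate is generally not identical to the fit on the full data, which is why the choice of division and recombination methods is itself a subject of statistical research.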
The following publications describe research relevant to DeltaRho in more detail and illustrate the principles of D&R in various applications:
- Divide and Recombine (D&R): Data Science for Large Complex Data, Statistical Analysis and Data Mining, 2014
- Large Complex Data: Divide and Recombine (D&R) with RHIPE, ISI Stat, 2012
- Trelliscope: A System for Detailed Visualization in the Deep Analysis of Large Complex Data, Large Data Analysis and Visualization (LDAV), 2013
- Computing Environment for the Statistical Analysis of Large and Complex Data, Ph.D. Thesis, Saptarshi Guha, 2010
- Visualization Databases for the Analysis of Large Complex Datasets, Journal of Machine Learning Research, 2009
- Statistical Analysis and Modeling of Internet VoIP Traffic for Network Engineering, Electronic Journal of Statistics, 2010
- A Streaming Statistical Algorithm for Detection of SSH Keystroke Packets in TCP Connections, INFORMS Computing Society Conference, 2011
- STL and Local Regression for Modeling Disease Surveillance Counts, BMC Medical Informatics and Decision Making, 2009
Our team consists of faculty, students, and technical staff at the Purdue University Department of Statistics, statisticians and computer scientists at Pacific Northwest National Laboratory, and several independent contributors.