AGRIOS


    The amount of digital data collected has increased substantially over the last several decades. With this data comes the desire to transform it, through analysis, into useful information. Traditional analytic tools fail when datasets grow large, and general-purpose database systems that manage large collections of data cannot perform the sophisticated analyses required. The analyst who wants to work with large datasets therefore faces a dilemma when choosing a tool.

    Our work resolves this tension with Agrios, a hybrid system that integrates R and SciDB. R is a powerful data analysis tool, and SciDB is an array database management system. The integration automates the movement of data between the two systems in order to improve performance. Contributions include semantic mappings between the two languages, a cost-based interaction model, a start-to-finish system implementation, and test results quantifying the performance of the hybrid approach.
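
    To make the idea of cost-based data movement concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from the Agrios implementation: the names `Operand`, `movement_cost`, and `choose_site` are hypothetical, and the cost model is assumed to count nothing but the bytes that must be transferred between the two systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    """Hypothetical description of one input to an operation."""
    name: str
    size_bytes: int
    site: str  # "R" or "SciDB": where the data currently resides


def movement_cost(operands, target_site):
    """Bytes that must be transferred to run an operation at target_site."""
    return sum(op.size_bytes for op in operands if op.site != target_site)


def choose_site(operands, sites=("R", "SciDB")):
    """Pick the site that minimizes bytes moved; ties go to the first site listed."""
    return min(sites, key=lambda site: movement_cost(operands, site))


# A large array stored in SciDB combined with a small vector held in R.
operands = [
    Operand("A", 8_000_000_000, "SciDB"),
    Operand("v", 80_000, "R"),
]
best = choose_site(operands)
print(best, movement_cost(operands, best))
```

    Under these assumptions the sketch ships the small vector to SciDB instead of pulling the large array into R. A fuller model would presumably also weigh execution cost at each site and the size of intermediate results.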

People

Patrick Leyshock, PhD candidate in Computer Science, Portland State University
David Maier, Maseeh Professor of Emerging Technologies, Portland State University
Kristin Tufte, Research Assistant Professor, Portland State University

Work products

My doctoral dissertation: "Optimizing Data Movement in Hybrid Analytic Systems".

"Minimizing Data Movement through Query Transformation", from the 2014 International Conference on Big Data.

"Data Movement in Hybrid Analytic Systems: A Case for Automation", from the 2014 International Conference on Scientific and Statistical Database Management.

"Agrios: A Hybrid Approach to Big Array Analytics", from the IEEE's 2013 International Conference on Big Data, and the accompanying slides.

Poster from Intel's "Big Data" ISTC meeting, January 2013.

My PhD dissertation proposal from July 2012, and the accompanying slides.

Abstract of lightning talk and poster presentation from XLDB 2012.

Sponsors

National Science Foundation Award #1110917, and Intel's Big Data Science and Technology Center