Virtual Infrastructure for Data Intensive Analysis (VIDIA)
Steven M. Gallo
- Matthew D. Jones, University at Buffalo
- Cynthia D. Cornelius, University at Buffalo
- Jeanette M. Sperhac, University at Buffalo
- Brian M. Lowe, SUNY Oneonta
- Gregory Fulkerson, SUNY Oneonta
- William Wilkerson, SUNY Oneonta
- Brett Heindl, SUNY Oneonta
- Achim Koeddermann, SUNY Oneonta
- James Greenberg, SUNY Oneonta
University at Buffalo
Large datasets culled from social media such as Facebook and Twitter can quickly grow beyond the capacity of commonly used software tools to store and analyze them within an acceptable amount of time. Twitter, for example, generates 7 terabytes (TB) of data per day, resulting in a raw dataset of over 210 TB per month (not including intermediate data generated during analysis).
Primarily Undergraduate Institutions (PUIs) typically lack the computing and networking infrastructure and the support personnel that faculty and students need to create, manipulate, and analyze these large multi-terabyte datasets. SUNY Oneonta’s existing IT infrastructure is typical of the SUNY comprehensive colleges: the total storage shared by its ~6,000 students and 900 faculty/staff for personal and research data is currently 4 TB. Available software is limited to the standard set of Windows and Macintosh applications, including SPSS, R, Minitab, Atlas.ti, and SAS. Because SUNY Oneonta has no high-performance computing (HPC) capability, the visualization tools typically found at large research facilities are not available, which severely limits the types of analysis that can be carried out.
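To make the scale mismatch concrete, the figures quoted above can be combined in a few lines of arithmetic. This is only an illustrative sketch: the 30-day month and the decimal conversion of 1 TB = 1,000 GB are assumptions, and the student count is the approximate value given in the text.

```python
# Figures quoted in the text above.
TWITTER_TB_PER_DAY = 7      # raw Twitter data generated per day
ONEONTA_STORAGE_TB = 4      # total shared storage at SUNY Oneonta
STUDENTS = 6_000            # approximate ("~6,000" in the text)
FACULTY_STAFF = 900

# Assumptions made for this sketch (not stated in the text).
DAYS_PER_MONTH = 30
GB_PER_TB = 1_000

# Raw data volume accumulated over one month.
monthly_tb = TWITTER_TB_PER_DAY * DAYS_PER_MONTH

# Average share of the existing storage per user.
users = STUDENTS + FACULTY_STAFF
gb_per_user = ONEONTA_STORAGE_TB * GB_PER_TB / users

# How many hours of the Twitter stream would fill the existing storage.
hours_to_fill = ONEONTA_STORAGE_TB / TWITTER_TB_PER_DAY * 24

print(f"Raw Twitter data per month: {monthly_tb} TB")
print(f"Shared storage per user: {gb_per_user:.2f} GB")
print(f"Hours of Twitter data that fit in {ONEONTA_STORAGE_TB} TB: {hours_to_fill:.1f}")
```

Under these assumptions, the campus's entire shared storage would hold well under one day of the raw Twitter stream, which is the gap the proposed environment is meant to close.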
The SUNY Research Centers, therefore, have an active role to play in supporting data-intensive computing education and analysis at SUNY’s PUIs. Accordingly, to provide the tools necessary to expose students to state-of-the-art data-intensive computing and analysis techniques, the Center for Computational Research (CCR) at the University at Buffalo (UB) and SUNY Oneonta will partner to pilot the establishment of a collaborative virtual community, focusing initially on data-intensive computing education in the social sciences. Because PUIs lack the necessary infrastructure, the formation of a virtual community is vitally important to this goal. The size of datasets culled from social media is expected to grow exponentially over time, making the co-location of data with analysis software and HPC resources increasingly important. The virtual community to be created will provide a collaborative environment where open-source analysis tools, storage space, and costs can be shared among multiple campuses. From the perspective of a virtual portal user, data will appear to be stored locally and analysis will appear to be carried out locally. This is the great utility and power of a virtual environment: it removes the constraints of distance from and access to computing resources.
CCR, a leading academic supercomputing facility, maintains over 8,000 processing cores and 500 TB of storage. It has extensive experience both with the development of virtual organizations and analysis tools for HPC users (through its involvement with grid computing in projects such as XSEDE and Open Science Grid), and with collaborative virtual community building (through the VHub and HPC2 projects). VHub (https://www.vhub.org) provides cyberinfrastructure to the global volcanology community, specializing in volcano eruption and hazards modeling. HPC2 (https://hpc2.org) is a partnership between NYSERNet, Rensselaer Polytechnic Institute, Stony Brook University, Brookhaven National Laboratory, and UB. In both of these projects, an engaged community actively collaborates, contributing educational materials, datasets, and software tools. VHub has been particularly successful, with 1,196 registered users across the globe, 348 of whom ran 7,315 online simulations in 2012. HPC2 will be used for the 2013 Eric Pitman High School Workshop (http://ccr.buffalo.edu/outreach/k-12-outreach/summer-workshop.html) at CCR.
With support from a 2012 Tier 2 IITG grant, SUNY Oneonta explored and evaluated numerous data capture software packages (including Discovertext, GNIP, Trackur, ContenSeer and X1), eventually deploying Trackur because of its cost, ease of use, and ability to download datasets into Excel format. During the fall 2012 and spring 2013 semesters, SUNY Oneonta integrated use of Trackur into courses in Sociology, Political Science and Philosophy, centered upon a social-scientific examination of how moral claims and discourses are created, sustained, altered and challenged within the electronic sphere. Faculty members captured and analyzed data using locally available software (Excel, SPSS), and also wrote code to access the Twitter Application Programming Interface (API). Because Trackur provides access to only 3,000 records at once, capturing enough data to monitor social media trends was problematic. In spring 2013, the College will expand its capacity for data capture and analysis (using both “home-grown” software and a donation by IBM of its MAP suite) to a Senior Sociology Seminar and two Political Science courses. However, having overcome the initial hurdle of lack of tools and training for dataset analysis, faculty are now faced with the next hurdle: the limitations of the College’s current technology infrastructure. The College has developed a strong working relationship with UB, allowing this proposal to leverage both CCR’s infrastructure and Oneonta’s IITG project to create a collaborative environment where PUIs can conduct intensive data analysis not otherwise possible.
This project will develop a scalable, community-driven infrastructure to expose students and faculty at SUNY Oneonta (and eventually other PUIs) to data-intensive computing and analysis techniques. The environment, not typically available to PUIs, will include an initial set of open-source data analysis tools, storage space, and seamless access to UB’s HPC facilities for analysis. In addition to deploying the environment, UB will train SUNY Oneonta educators in utilizing the platform. While we will deploy an initial set of high-priority tools, tested by SUNY Oneonta social science faculty, eventually, the broader educational community will be encouraged to provide tools it has developed, to provide content, and to utilize the environment for courses and workshops. In CCR’s experience, particularly with VHub, this has proven an effective methodology. We will identify and train a “Campus Champion” on each participating campus (beginning with SUNY Oneonta), to serve as a source of local expertise about the use of the environment by educators and students – an approach used successfully by the NSF XSEDE project.
HUBzero (http://www.hubzero.org), the open-source scientific collaboration platform developed by Purdue University, is currently used by VHub, HPC2, and several other projects at CCR; thus, this project is well positioned to leverage this technology. HUBzero gives communities a proven and continually improving platform for collaboration, where they can provide content, form ad-hoc groups, publish materials, and use and deploy analysis tools via a web browser (no web-programming experience is required). The storage capability of HUBzero will be extended to support large datasets by using iRODS (http://www.irods.org) to manage distributed collections of data and present them to users as a single virtual filesystem in the hub. This will allow us to co-locate data repositories with HPC resources, reducing the need to transfer data over the network (an iRODS prototype was implemented as part of VHub). The infrastructure will initially be deployed using a single HUBzero server, together with eight dedicated Dell R620 data analysis nodes and 5 TB of iRODS storage to meet the initial data analysis needs of the community; these resources can be scaled as needed. CCR will provide in-kind matching through the contribution of scalable CPU cycles for the project (see attached letter of support). This will allow users to access HPC resources as needed via the HUBzero software.
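The idea of presenting distributed collections as a single virtual filesystem can be illustrated with a toy catalog that maps logical paths to physical storage locations. This is a conceptual sketch only: it does not use or reproduce the iRODS API, and the class name, server names, and paths are all hypothetical.

```python
class VirtualCatalog:
    """Toy logical-to-physical path catalog (illustrative; not the iRODS API)."""

    def __init__(self):
        # Maps a logical path to a (server, physical_path) pair.
        self._entries = {}

    def register(self, logical_path, server, physical_path):
        """Record where the data behind a logical path is physically stored."""
        self._entries[logical_path] = (server, physical_path)

    def resolve(self, logical_path):
        """Return the (server, physical_path) pair for a logical path."""
        return self._entries[logical_path]

    def listdir(self, logical_dir):
        """List logical paths under a directory, regardless of storage server."""
        prefix = logical_dir.rstrip("/") + "/"
        return sorted(p for p in self._entries if p.startswith(prefix))


# Hypothetical datasets held on two different storage servers.
catalog = VirtualCatalog()
catalog.register("/hub/tweets/2013-01.csv", "storage-a", "/vol1/raw/2013-01.csv")
catalog.register("/hub/tweets/2013-02.csv", "storage-b", "/data/raw/2013-02.csv")

# Users see one directory; the catalog hides which server holds each file.
listing = catalog.listdir("/hub/tweets")
print(listing)
print(catalog.resolve("/hub/tweets/2013-02.csv"))
```

The point of the indirection is that users and tools address data by logical path only, so storage can be added or relocated next to the HPC resources without changing what users see.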
With the successful completion of this project, the team will engage additional SUNY campuses to grow the community, and explore long-term sustainability by discussing cost-recovery models with SUNY ITEC (also located in Buffalo). This work will also serve as a foundation for the pursuit of external NSF funding in Data Intensive Computing. The project can be substantially scaled back to a proof of concept; a Tier 2 project would still provide a collaborative environment but would rely entirely on opportunistic access to the shared HPC resources at CCR, and is not preferred. In addition, under Tier 2, we would not be able to engage and train Campus Champions, the number of tools and datasets deployed would be substantially reduced, and the likelihood of successful external funding to continue our work would be reduced.