3Qs: Big Data project reaches for the ‘cloud’

May 12, 2014

Northeastern University is a key partner in the Massachusetts Open Cloud Project, a university-industry collaboration designed to create a new public cloud computing infrastructure to spur big data innovation. Last month, at the Massachusetts Green High Performance Computing Center in Holyoke, Massachusetts (which counts Northeastern as a partner institution), Gov. Deval Patrick announced a $3 million investment to get the project up and running. Peter Desnoyers, an assistant professor in the College of Computer and Information Science at Northeastern, helped bring the initiative to fruition. Here, he explains what the project means for both the future of Big Data and the university.

The project is a great example of the synergies between industry and academia in an area such as Boston. It’s the brainchild of Orran Krieger, a computer science professor at Boston University, whom I worked with before we each came to academia. The proposal to the state has been a collaborative effort between the two universities, and the preliminary research and development has been done by a collaborative group of Northeastern and BU students.

A primary goal of the Massachusetts Open Cloud is to enable fundamental research in cloud computing at Northeastern and the other partner universities. This is a rapidly growing and innovating field in which universities could bring much to the table, but it’s dominated by a few providers such as Amazon, Microsoft, and Google, which are the only organizations with the access needed to do research in this area. By supporting a variety of cloud implementations, ranging from experimental and pilot services implemented by computer science researchers to production systems for researchers in other fields, the MOC will enable academic research in this field, helping establish us as a center of excellence in cloud computing research.

Big Data refers to the processing of data sets that are far larger than individual computers can handle, requiring the coordination of hundreds or thousands of systems. The term is most commonly applied to applications that handle unstructured text data, as opposed to high-performance computing, which typically works with large, but highly structured, numeric data sets.

Big Data techniques are the basis of much of the Internet economy, from recommendation systems on sites such as Amazon to indexing Web data for search engines such as Google. In addition, the recent availability of large data sets in areas ranging from historical documents to cell phone usage has allowed computational techniques based on Big Data analysis to be applied in transformative ways to areas of research ranging from political science to medicine and beyond.

In the next year you will be seeing research proposals and early research from several Northeastern faculty including myself; Gene Cooperman, professor in the College of Computer and Information Science; and David Kaeli, professor in the College of Engineering. I will also be leading the effort to develop novel infrastructure to allow both experimental and reliable systems to coexist and interoperate within the open cloud, much as they do in today’s Internet.