A Northeastern University undergraduate is leading the development of a new process that will make it possible for certain supercomputers to save their data midway through a computation, preventing the loss of progress due to a computer crash or bug that would otherwise require the computation to be restarted from the beginning.
“Computers are like a car engine — the more complicated they are, the more likely they are to break,” said Greg Kerr, a sophomore computer science major. Kerr said that his protocol applies to high-performance machines known as InfiniBand supercomputers.
Next month, he will present his research at REcon, a computer science conference held annually in Montreal, Canada. He has been selected to give an hour-long talk on the first day of the conference, an honor for an undergraduate, said Gene Cooperman, a professor in the College of Computer and Information Science, where Kerr is a research assistant.
“If you give your talk on the first day, it means everyone who is there for the conference knows who you are and can talk about your work in the later days,” said Kerr. “It shows that the organizers believe this work is very important and will generate a lot of interest among the attendees.”
InfiniBand is a relatively new networking technology that has made high-performance computing more open and accessible since it was developed and released in the early 2000s. Because it is scalable, it can be used on systems ranging from small computer clusters to some of the world’s largest and most advanced supercomputers.
“This is the networking technology behind some of the world’s largest computers, and yet the number of people who understand the internals of the InfiniBand technology is very small, largely because it is relatively new,” said Cooperman, who urged Kerr to reach out to some of the top InfiniBand experts in the world as he began developing his new process.
No one has been able to restart an InfiniBand process midstream. This new work would allow scientists to more efficiently complete massive calculations on expensive computers in high demand.
This summer, Cooperman and several of his doctoral students are working at the Oak Ridge National Laboratory, where some of the nation’s most advanced supercomputers are located, and Kerr believes his work will soon be ready to be applied to those computations.
“I think we’re close,” Kerr said. “We’ve got the main points proven and now we need the summer to iron everything out and work out the bugs.”
Photo by Mary Knox Merrill