440 Huntington Avenue
336 West Village H
Boston, MA 02115
ATTN: Gene Cooperman, 202 WVH
360 Huntington Avenue
Boston, MA 02115
- Fault tolerance and transparent checkpointing
- Supercomputing, parallel computing, cloud computing
- The Internet of Things
- Formal verification
- PhD, Brown University
- BS in mathematics and physics, University of Michigan
Gene Cooperman received his bachelor’s degree in 1974 from the University of Michigan, and his PhD from Brown University in 1978. Prior to joining Northeastern, he was a principal member of technical staff (MTS) at GTE Laboratories from 1980 to 1986. He leads the High Performance Computing Laboratory at Northeastern University, and he currently co-leads an Inria associate team in a 3-year project called “FogRein: Steering Efficiency for Distributed Applications.”
In the past, he also held a 5-year IDEX Chair of Attractivity at the University of Toulouse in France, as well as visiting research positions at Concordia University, CERN, and Inria. As a result of the work at CERN, he joined the Geant4 Collaboration and contributed to the foundational paper “GEANT4 – A Simulation Toolkit,” which currently has approximately 25,000 citations and is the most widely cited paper in high energy physics.
Cooperman has worked in a series of interdisciplinary research areas, including applied mathematics, computational and symbolic algebra, numerical analysis, computing in high energy physics, bioinformatics, high performance computing, and computer systems. He has co-authored more than 100 refereed publications, advised PhD students, and personally led several open source software projects:
- Task-Oriented Parallel C/C++: a model for writing parallel software easily
- Roomy: a middleware for big data that uses the many disks of a cluster to simulate many terabytes of RAM, used to show that 26 moves suffice for Rubik’s Cube, a record at its time
- ParGeant4: distributed parallelism for the CERN-based Geant4 software for Monte Carlo particle-matter interaction in high energy physics
Two further open source software projects have developed a life of their own, Geant4-MT and DMTCP. The first project, Geant4 Multithreaded, culminated in January 2014 with the incorporation of Geant4-MT into the Geant4 version 10.0 release, and is now maintained at CERN. In the 15 years prior to this, Geant4 had grown purely as a single-threaded package of almost a million lines of code. Hence, retroactively adding multi-threading to the Geant4 production software was both a major research effort (described in the PhD thesis of Xin Dong) and a major implementation effort that included five members of the Geant4 collaboration.
The second ongoing software project is DMTCP, or Distributed MultiThreaded Checkpointing. The DMTCP approach emphasizes transparent checkpointing or snapshots with no modification to the target application binary, and transparent extensibility to external hardware/software environments like GPUs. The roots of this project began in 2004, and it is now in its third generation, incorporating results from a series of PhD theses and other student work.
The project provides both a production platform and a research platform. It is available in all major Linux distributions and has been used by independent researchers in more than 100 refereed research publications. Examples of research areas using DMTCP include circuit verification, formal verification, CPU chip design by Intel and others, VLSI circuit simulators, formalization of mathematics, bioinformatics, network simulation, high energy physics, cyber-security, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and high performance computing.
Enhancement and Support of DMTCP for Adaptive, Extensible Checkpoint-Restart
This project will support a plugin architecture for transparent checkpoint-restart.
Society’s increasingly complex cyberinfrastructure creates a concern for software robustness and reliability. Yet, this same complex infrastructure is threatening the continued use of fault tolerance. Consider when a single application or hardware device crashes. Today, in order to resume that application from the point where it crashed, one must also consider the complex subsystem to which it belongs. While in the past, many developers would write application-specific code to support fault tolerance for a single application, this strategy is no longer feasible when restarting the many inter-connected applications of a complex subsystem. This project will support a plugin architecture for transparent checkpoint-restart. Transparency implies that the software developer does not need to write any application-specific code. The plugin architecture implies that each software developer writes the necessary plugins only once. Each plugin takes responsibility for resuming any interrupted sessions for just one particular component. At a higher level, the checkpoint-restart system employs an ensemble of autonomous plugins operating on all of the applications of a complex subsystem, without any need for application-specific code.
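The ensemble-of-plugins idea above can be sketched in miniature. The following Python sketch is purely illustrative (the class names and the open-file example are assumptions for exposition, not DMTCP's actual C/C++ plugin API): each plugin takes sole responsibility for restoring one kind of component, and a coordinator simply broadcasts checkpoint and restart events to all plugins, with no application-specific code.

```python
# Illustrative sketch (assumed names, not DMTCP's real API) of a plugin
# architecture for checkpoint-restart.

class Plugin:
    """Base class: a plugin saves and restores one external resource."""
    def on_checkpoint(self, state): pass
    def on_restart(self, state): pass

class FilePlugin(Plugin):
    # Hypothetical example plugin: record open-file paths and offsets at
    # checkpoint time; re-establish those sessions at restart time.
    def __init__(self):
        self.table = {}          # fd -> (path, offset)
    def on_checkpoint(self, state):
        state["files"] = dict(self.table)
    def on_restart(self, state):
        self.table = dict(state["files"])   # resume the interrupted sessions

class Coordinator:
    """Broadcasts checkpoint/restart events to the whole plugin ensemble."""
    def __init__(self, plugins):
        self.plugins = plugins
    def checkpoint(self):
        state = {}
        for p in self.plugins:
            p.on_checkpoint(state)
        return state
    def restart(self, state):
        for p in self.plugins:
            p.on_restart(state)

fp = FilePlugin()
fp.table[3] = ("/tmp/data.log", 4096)
coord = Coordinator([fp])
snapshot = coord.checkpoint()

fp.table.clear()                  # simulate a fresh process after a crash
coord.restart(snapshot)
print(fp.table)                   # the file session is re-established
```

The point of the sketch is the division of labor: the application writes no fault-tolerance code, and each plugin is written once and reused across applications.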
The plugin architecture is part of a more general approach called process virtualization, in which all subsystems external to a process are virtualized. It will be built on top of the DMTCP checkpoint-restart system. One simple example of process virtualization is virtualization of ids. A plugin maintains a virtualization table and arranges for the application code of the process to see only virtual ids, while the outside world sees the real id. Any system calls and library calls using this real id are extended to translate between real and virtual id. On restart, the real ids are updated with the latest value, and the process memory remains unmodified, since it contains only virtual ids. Other techniques employing process virtualization include shadow device drivers, record-replay logs, and protocol virtualization. Some targets of the research include transparent checkpoint-restart support for the InfiniBand network, for programmable GPUs (including shaders), for networks of virtual machines, for big data systems such as Hadoop, and for mobile computing platforms such as Android.
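The id-virtualization technique described above can be made concrete with a small sketch. This is a hedged, simplified model of the idea (the class and method names are assumptions, not DMTCP internals): the application stores only virtual ids, wrapped system calls translate virtual to real, and on restart only the translation table is updated while process memory stays unmodified.

```python
# Simplified model of id virtualization (assumed names, for illustration).

class IdTable:
    def __init__(self):
        self.virt_to_real = {}
        self.next_virt = 1000
    def register(self, real_id):
        # Hand the application a virtual id; only this value enters
        # process memory.
        virt = self.next_virt
        self.next_virt += 1
        self.virt_to_real[virt] = real_id
        return virt
    def real(self, virt_id):
        # Wrapped system calls and library calls translate here,
        # so the outside world sees only real ids.
        return self.virt_to_real[virt_id]
    def update_on_restart(self, new_real_ids):
        # After restart the kernel assigns new real ids; updating the
        # table suffices, since memory holds only virtual ids.
        for virt, real in new_real_ids.items():
            self.virt_to_real[virt] = real

table = IdTable()
vpid = table.register(4242)              # real pid before checkpoint
assert table.real(vpid) == 4242

table.update_on_restart({vpid: 9999})    # restarted process got a new pid
assert table.real(vpid) == 9999          # same virtual id still resolves
```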
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Kapil Arya and Gene Cooperman. “DMTCP: Bringing Interactive Checkpoint-Restart to Python,” Computational Science & Discovery, v.8, 2015, 16 pages. doi:10.1088/issn.1749-4699
Jiajun Cao, Matthieu Simoni, Gene Cooperman, and Christine Morin. “Checkpointing as a Service in Heterogeneous Cloud Environments,” Proc. of 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’15), 2015, pp. 61–70. doi:10.1109/CCGrid.2015.160
"Functional Classification of Protein Structures by Local Structure Matching in Graph Representation", Caitlyn L. Mills, Rohan Garg, Joslynn S. Lee, Liang Tian, Alexandru Suciu, Gene Cooperman, Penny J. Beuning, Mary Jo Ondrechen, Protein Science 27(6), pp. 1125--1135, 2018
As a result of high-throughput protein structure initiatives, over 14,400 protein structures have been solved by Structural Genomics (SG) centers and participating research groups. While the totality of SG data represents a tremendous contribution to genomics and structural biology, reliable functional information for these proteins is generally lacking. Better functional predictions for SG proteins will add substantial value to the structural information already obtained. Our method described herein, Graph Representation of Active Sites for Prediction of Function (GRASP-Func), predicts quickly and accurately the biochemical function of proteins by representing residues at the predicted local active site as graphs rather than in Cartesian coordinates. We compare the GRASP-Func method to our previously reported method, Structurally Aligned Local Sites of Activity (SALSA), using the Ribulose Phosphate Binding Barrel (RPBB), 6-Hairpin Glycosidase (6-HG), and Concanavalin A-like Lectins/Glucanase (CAL/G) superfamilies as test cases. In each of the superfamilies, SALSA and the much faster method GRASP-Func yield similar correct classification of previously characterized proteins, providing a validated benchmark for the new method. In addition, we analyzed SG proteins using our SALSA and GRASP-Func methods to predict function. Forty-one SG proteins in the RPBB superfamily, nine SG proteins in the 6-HG superfamily, and one SG protein in the CAL/G superfamily were successfully classified into one of the functional families in their respective superfamily by both methods. This improved, faster, validated computational method can yield more reliable predictions of function that can be used for a wide variety of applications by the community.
"MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing", Rohan Garg, Gregory Price, and Gene Cooperman, Proc. of 28th Int. Symp. on High Performance Parallel and Distributed Computing, Phoenix, AZ, USA, ACM, pp. 49--60, June, 2019
Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel split-process approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster.
"Transparently Checkpointing Software Test Benches to Improve Productivity of SoC Verification in an Emulation Environment", Ankit Garg, Suresh Krishnamurthy, Gene Cooperman, Rohan Garg, and Jeff Evans, 2018 Design and Verification Conference and Exhibition (DVCON-US 2018)
Traditionally, hardware emulation has been used in in-circuit emulation (ICE) mode, where the design under test (DUT) executes inside the emulator and is connected to the real target, which acts as a testbench. Over time, software testbench-based emulation environments have become very popular, since users can control their operation remotely from their desktops. This makes emulators an enterprise resource, accessible to a multitude of users spread across continents and multiple time zones. Since emulators are expensive resources, it is important to utilize them efficiently. Full checkpoint save/restore capability of emulation jobs helps the utilization by enabling flexible job scheduling, shortening of jobs by jumping ahead to interesting points for debug, carrying out what-if analysis, etc. Emulators have native save/restore capabilities for the model on the emulator. The software testbenches can be complex in multiple dimensions. For example, they may be using C/C++, SystemC, SystemVerilog, etc. They may be multithreaded, span multiple processes, use IPC, and so on. It then becomes a challenge to save the states of such sophisticated software testbenches both transparently and through a uniform, reliable mechanism. Since the objective is to solve the problem at an enterprise level, it is critical to find a uniform solution for a diverse set of software testbenches throughout the enterprise. The DMTCP (Distributed MultiThreaded Checkpointing) package supports such a uniform solution. This paper describes the integration of DMTCP with a virtual testbench-based emulation. This brings large benefits to a real life environment that includes multiple emulators within the larger set of enterprise resources. There are other, additional applications of full checkpoint save/restore for emulation jobs that become apparent only after having gained experience with its use in job management.
For example, after having observed the behavior of an application prior to checkpoint, additional triggers can be inserted at the time of restore, to enhance debugging of an exception or other unusual behavior.
"CRUM: Checkpoint-Restart Support for CUDA's Unified Memory", Rohan Garg, Apoorve Mohan, Michael Sullivan, Gene Cooperman, Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'18), pp. 302--313, 2018
Unified Virtual Memory (UVM) was recently introduced on NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings are GPU-accelerated, with a current trend of ten additional GPU-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.
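The general idea behind forked checkpointing, mentioned in the abstract above, can be sketched in a few lines. This is a hedged simplification of the generic technique (not CRUM's CUDA-specific implementation): the process forks, the child's copy-on-write memory is a frozen snapshot, the child writes the checkpoint image to stable storage, and the parent resumes computing immediately, overlapping I/O with computation.

```python
# Hedged sketch of forked checkpointing: the child persists a snapshot
# while the parent keeps computing. (Requires a Unix-like OS for os.fork.)
import json
import os
import tempfile

state = {"step": 41, "data": [1, 2, 3]}
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")

pid = os.fork()
if pid == 0:
    # Child: its copy-on-write memory is a frozen snapshot of the parent
    # at fork time; write it out and exit.
    with open(ckpt_path, "w") as f:
        json.dump(state, f)
    os._exit(0)

# Parent: resumes computation immediately, overlapping with checkpoint I/O.
state["step"] += 1
os.waitpid(pid, 0)

with open(ckpt_path) as f:
    saved = json.load(f)
print(saved["step"])   # 41: the snapshot predates the parent's update
```

The design choice is the same one the abstract describes: the cost of writing the image to stable storage is hidden behind continued computation, so the visible pause is only the cost of the fork itself.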
"System-level Checkpoint-Restart for Petascale Computing", Jiajun Cao, Kapil Arya, Gene Cooperman, Rohan Garg, Khaled Hamidouche, Shawn Matott, D.K. Panda, Jonathan Perkins, Hari Subramoni, Jerome Vienne. IEEE International Conference on Parallel and Distributed Systems (ICPADS'16). Wuhan, China.
"Recent Developments in Geant4", J. Allison et al. (99 co-authors in total, including G. Cooperman), Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 835, pp. 186--225, Nov 1, 2016
Geant4 is a software toolkit for the simulation of the passage of particles through matter. It is used by a large number of experiments and projects in a variety of application domains, including high energy physics, astrophysics and space science, medical physics and radiation protection. Over the past several years, major changes have been made to the toolkit in order to accommodate the needs of these user communities, and to efficiently exploit the growth of computing power made available by advances in technology. The adaptation of Geant4 to multithreading, advances in physics, detector modeling and visualization, extensions to the toolkit, including biasing and reverse Monte Carlo, and tools for physics and release validation are discussed here.
"Transparent Checkpoint-Restart over InfiniBand", Jiajun Cao, Gregory Kerr, Kapil Arya and Gene Cooperman, ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'14), pp. 13--24, ACM Press, 2014.
Transparently saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. The lack of a solution has forced typical MPI implementations to include custom checkpoint-restart services that “tear down” the network, checkpoint each node in isolation, and then re-connect the network again. This work presents the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach simplifies current practice by avoiding the need for a privileged kernel module. The generality of this approach is demonstrated by applying it both to MPI and to Berkeley UPC (Unified Parallel C), in its native mode (without MPI). Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). The run-time overhead varies between 0.8% and 1.7%. While checkpoint times dominate, the network-only portion of the implementation is shown to require less than 100 milliseconds (not including the time to locally write application memory to stable storage).
"A Bit-Compatible Parallelization for ILU(k) Preconditioning", Xin Dong and Gene Cooperman, Proc. of Euro-Par 2011 --- Parallel Processing (Part 2), Springer-Verlag, Lecture Notes in Computer Science 6853, Springer, 2011, pp. 66--77
ILU(k) is a commonly used preconditioner for iterative linear solvers for sparse, non-symmetric systems. It is often preferred for the sake of its stability. We present TPILU(k), the first efficiently parallelized ILU(k) preconditioner that maintains this important stability property. Even better, TPILU(k) preconditioning produces an answer that is bit-compatible with the sequential ILU(k) preconditioning. In terms of performance, the TPILU(k) preconditioning is shown to run faster whenever more cores are made available to it — while continuing to be as stable as sequential ILU(k). This is in contrast to some competing methods that may become unstable if the degree of thread parallelism is raised too far. Where Block Jacobi ILU(k) fails in an application, it can be replaced by TPILU(k) in order to maintain good performance, while also achieving full stability. As a further optimization, TPILU(k) offers an optional level-based incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. Although this enhancement is not bit-compatible with classical ILU(k), it is bit-compatible with the output from the single-threaded version of the same algorithm. In experiments on a 16-core computer, the enhanced TPILU(k)-based iterative linear solver performed up to 9 times faster. As we approach an era of many-core computing, the ability to efficiently take advantage of many cores will become ever more important.
"Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software", Xin Dong, Gene Cooperman and John Apostolakis, Proc. of Euro-Par 2010 -- Parallel Processing, Lecture Notes in Computer Science 6272, Springer, 2010, pp. 287--30
This work presents an application case study. Geant4 is a 750,000 line toolkit first designed in the mid-1990s and originally intended only for sequential computation. Intel’s promise of an 80-core CPU meant that Geant4 users would have to struggle in the future with 80 processes on one CPU chip, each one having a gigabyte memory footprint. Thread parallelism would be desirable. A semiautomatic methodology to parallelize the Geant4 code is presented in this work. Our experimental tests demonstrate linear speedup in a range from one thread to 24 on a 24-core computer. To achieve this performance, we needed to write a custom, thread-private memory allocator, and to detect and eliminate excessive cache misses. Without these improvements, there was almost no performance improvement when going beyond eight cores. Finally, in order to guarantee the run-time correctness of the transformed code, a dynamic method was developed to capture possible bugs and either immediately generate a fault, or optionally recover from the fault.
"Harnessing Parallel Disks to Solve Rubik's Cube", Daniel Kunkle and Gene Cooperman, J. Symbolic Computation 44(7), 2009, pp. 872--890
The number of moves required to solve any configuration of Rubik’s cube has held a fascination for over 25 years. A new upper bound of 26 is produced. More important, a new methodology is described for finding upper bounds. The novelty is two-fold. First, parallel disks are employed. This allows 1.4×10^12 states representing symmetrized cosets to be enumerated in seven terabytes. Second, a faster table-based multiplication is described for symmetrized cosets that attempts to keep most tables in the CPU cache. This enables the product of a symmetrized coset by a generator at a rate of 10 million moves per second.
"DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop", Jason Ansel, Kapil Arya and Gene Cooperman, Proc. of IEEE International Parallel and Distributed Processing Symposium, IPDPS-09, IEEE Press, 2009, 12 pages
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
"Geant4: A Simulation Toolkit", S. Agostinelli et al., Nuclear Instruments and Methods in Physics Research Section A 506(3), 2003, pp. 250--303 (over 100 authors, incl. G. Cooperman)
Geant4 is a toolkit for simulating the passage of particles through matter. It includes a complete range of functionality including tracking, geometry, physics models and hits. The physics processes offered cover a comprehensive range, including electromagnetic, hadronic and optical processes, a large set of long-lived particles, materials and elements, over a wide energy range starting, in some cases, from 250 eV and extending in others to the TeV energy range. It has been designed and constructed to expose the physics models utilised, to handle complex geometries, and to enable its easy adaptation for optimal use in different sets of applications. The toolkit is the result of a worldwide collaboration of physicists and software engineers. It has been created exploiting software engineering and object-oriented technology and implemented in the C++ programming language. It has been used in applications in particle physics, nuclear physics, accelerator design, space engineering and medical physics.
"TOP-C: A Task-Oriented Parallel C Interface", G. Cooperman, 5th International Symposium on High Performance Distributed Computing (HPDC-5), IEEE Press, 1996, pp. 141-150
The “holy grail” of parallel software systems is a parallel programming language that will be as easy to use as a sequential one, while maintaining most of the potential efficiency of the underlying parallel hardware. TOP-C (Task-Oriented Parallel C) attempts such a model by presenting a task abstraction that hides much of the details of the underlying hardware. DSM (Distributed Shared Memory) also attempts such a model, but along an orthogonal direction. By presenting a shared memory model of memory, it hides much of the details of message-passing required by the underlying hardware. This article reviews the TOP-C model and then presents ongoing research on combining the advantages of both models in a single system.
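The task abstraction described above can be sketched as a master/worker pattern. This is an illustrative Python sketch in the spirit of the TOP-C model, not the actual TOP-C C interface (the function names here are assumptions): the master generates task inputs, workers compute each task independently, and the master checks and absorbs each completed result.

```python
# Hedged sketch of a task-oriented master/worker model: the task
# abstraction hides the parallel hardware from the application writer.
from multiprocessing import Pool

def do_task(n):
    # Runs on a worker; should depend only on its input.
    return n * n

def check_task_result(total, result):
    # Master-side: decide what to do with a completed task's output;
    # here we simply accumulate it.
    return total + result

if __name__ == "__main__":
    inputs = range(10)               # master generates the task inputs
    with Pool(4) as pool:            # workers execute tasks in parallel
        results = pool.map(do_task, inputs)
    total = 0
    for r in results:
        total = check_task_result(total, r)
    print(total)                     # sum of squares 0..9 = 285
```

The application writer supplies only the sequential-looking task functions; the distribution of tasks to workers is the runtime's concern, which is the ease-of-use goal the abstract describes.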
"Nearly Linear Time Algorithms for Permutation Groups with a Small Base", L. Babai, G. Cooperman, L. Finkelstein and A. Seress, Proc. of the 1991 International Symposium on Symbolic and Algebraic Computation (ISSAC '91), Bonn, pp. 200-209, July, 1991.
A base of a permutation group G is a subset B of the permutation domain such that only the identity of G fixes B pointwise. The permutation representations of important classes of groups, including all finite simple groups other than the alternating groups, admit O(log n) size bases, where n is the size of the permutation domain. Groups with very small bases dominate the work on permutation groups in much of computational group theory. A series of new combinatorial results allows us to present Monte Carlo algorithms achieving O(n log^c n) (c a constant) time and space performance for such groups with respect to the fundamental operations of finding order and testing membership. (The input is a list of generators of the group.) Previous methods have achieved similar space performance only at the expense of increased time performance. Adaptations of a “cube-doubling” technique [BSZ] and a local expansion property of groups [Ba3] (cf. [Ba4]) are the key to theoretically reducing the time complexity to O(n log^c n). The shared principal novelty of the new ideas is in their ability to build and manipulate certain chains of subsets of a group, which are not themselves subgroups, in order to build the point stabilizer subgroup chain. Further combinatorial ideas are used to lower the constant c. Comparative timing estimates, based on asymptotic worst-case analysis, lead us to expect a new implementation to be faster than previous implementations for groups of high degree.
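The definition of a base in the first sentence of the abstract can be checked directly on a small example. The sketch below is illustrative only (brute-force enumeration, nothing like the paper's nearly-linear-time algorithms): it enumerates a small permutation group from its generators and verifies that a candidate subset B is fixed pointwise only by the identity.

```python
# Brute-force check of the definition of a base, on S_4 acting on
# {0, 1, 2, 3}. Permutations are tuples: p maps i to p[i].

def compose(p, q):
    # (p * q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(len(p)))

def generate_group(gens):
    """Enumerate the group generated by gens via breadth-first closure."""
    n = len(gens[0])
    identity = tuple(range(n))
    group = {identity}
    frontier = [identity]
    while frontier:
        new = []
        for g in frontier:
            for s in gens:
                h = compose(s, g)
                if h not in group:
                    group.add(h)
                    new.append(h)
        frontier = new
    return group

def is_base(group, B):
    """B is a base iff only the identity fixes every point of B."""
    n = len(next(iter(group)))
    identity = tuple(range(n))
    for g in group:
        if g != identity and all(g[b] == b for b in B):
            return False
    return True

# S_4, generated by a transposition and a 4-cycle.
gens = [(1, 0, 2, 3), (1, 2, 3, 0)]
G = generate_group(gens)
print(len(G))                 # 24
print(is_base(G, [0, 1, 2]))  # True: fixing 0, 1, 2 forces fixing 3
print(is_base(G, [0, 1]))     # False: the transposition (2 3) fixes 0 and 1
```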