This list is for guidance only. Feel free to change the scope/application
of any of these projects or find one on your own.
CS5600 (Fall'15) project page
BU/NU Cloud Computing Course project page (Spring'16)
Fault-tolerant DMTCP coordinator
Currently, DMTCP has a centralized coordinator that communicates with
all worker nodes. This centralized design makes the coordinator a single
point of failure for the computation. One way to address this is to use
distributed coordinator processes with leader election. These coordinators
can share the current state using a built-in consensus protocol or
an external tool such as ZooKeeper. One can further extend the
idea to a tree of coordinators that handles tens of thousands of
worker nodes simultaneously. One can also make the coordinator
multi-threaded to improve scalability further.
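As a starting point for the leader-election idea, here is a minimal sketch of one possible election rule: the live coordinator with the lowest ID becomes leader. The `Coordinator` class, the `elect_leader` function, and the heartbeat timeout are hypothetical names chosen for illustration and are not part of DMTCP. A real design would still need a consensus protocol (or a tool such as ZooKeeper) so that all coordinators agree on who is alive.

```python
from dataclasses import dataclass


@dataclass
class Coordinator:
    ident: int             # hypothetical unique coordinator ID
    last_heartbeat: float  # time (in seconds) of the last heartbeat seen


def elect_leader(coordinators, now, timeout=5.0):
    """Return the lowest ID among coordinators with a recent heartbeat.

    Returns None if no coordinator is considered alive.
    """
    live = [c.ident for c in coordinators if now - c.last_heartbeat <= timeout]
    return min(live) if live else None


# Coordinator 1 has been silent for 10 seconds, so it is skipped.
peers = [Coordinator(1, 100.0), Coordinator(2, 108.0), Coordinator(3, 109.0)]
leader = elect_leader(peers, now=110.0)
print("leader:", leader)
```

The rule is deterministic, so every coordinator that sees the same set of heartbeats picks the same leader; making sure they do see the same set is the hard part of the project.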
Add checkpoint capabilities to Mesos
Apache Mesos is a two-level scheduler that can enhance resource
utilization in a datacenter. The majority of Mesos workloads are
stateless, so tasks can be killed without a high performance
penalty when a node has to be taken down. With stateful
applications, however, the situation is different: simply killing a task
to relocate it results in a heavy performance penalty. This problem can be
solved by adding checkpoint-restart capabilities to Mesos for doing
Checkpointing valgrind (valgrind attach)
Valgrind is a widely used tool that excels at finding memory
leaks. Its usage is simple: valgrind a.out args. Because running under
valgrind is slower than native execution (e.g., 10 times slower or
worse), many users have hoped for a "valgrind attach" feature. This is
probably impossible, since valgrind runs the executable in software
that emulates the underlying assembly language. So, a next-best
option is to run valgrind under DMTCP (or other checkpointing tool)
until the interesting point. Then, one checkpoints. Finally, one can
restart many times, and direct the executable to choose different
execution paths (e.g., different application options) on each restart.
While a VM snapshot could checkpoint valgrind, that is a heavyweight
option. The goal of this project is to use a standard checkpointing
package (or your own custom one) to checkpoint valgrind.
Checkpointing Hadoop (Big Data)
Hadoop was the first full-featured open source version
of the MapReduce software from Google. Its architecture
typically assumes back-end disk nodes with large files,
and a front-end compute node on which resides the
Hadoop executable, and a Hadoop scheduler for the back end.
Checkpointing would be very useful, in order to put aside
a currently running job, when a newer, high-priority job
arrives. Since the files on the back end are large,
the intention is to copy the back end files to a temporary
region as part of the checkpoint, and then to copy them
back as part of the restart.
We have access to some software from INRIA that will manage
the back-end files. The goal of this project is to write
the front-end, including a DMTCP plugin, that will take
special actions at checkpoint and restart to save the
front-end Hadoop application and later restore it. We will
apply this only to the simpler Hadoop, version 1.
Checkpoint support for Docker/Appc Containers
Docker is sometimes called a
lightweight virtual machine, although it does not include a separate
"guest" Linux kernel. It uses the underlying Linux kernel.
Nevertheless, it has gained popularity in many domains where virtual
machines are also used.
Virtual machines have snapshots. The goal
of this project is to checkpoint Docker using DMTCP. (An alternate
checkpointing package that currently works on Docker is
While Docker is normally compiled as a statically linked executable
with gc (the standard Go compiler), there is also a dynamically linked
executable for Docker built with GNU gccgo.
(See The Go
Blog for more information.) In principle, this should make it
easy for DMTCP to checkpoint Docker. However, DMTCP must be extended
to support Linux cgroups and pid namespaces.
There is already a partial implementation of checkpointing of Docker
within the DMTCP team. This will be made available to a team that
tackles this project.
Docker typically runs just a single process.
If time permits, the effort should be extended to support the
Supervisor package. Alternatively, the team may prefer a
different extension: the use of plugins to integrate with the Docker
daemon on checkpoint and restart.
Use CRIU as the single-process checkpointer in DMTCP. This will allow
one to quickly checkpoint a set of containers across the network.
Security: Multi-architecture Checkpoint-Restart
In defending against malware, it is useful to present
a dynamically shifting "attack surface" against attackers.
One such technique is multi-architecture checkpoint-restart.
An example of such work (as execution migration) is:
Execution Migration in a Heterogeneous-ISA Chip
The goal of this project is to checkpoint under one CPU instruction
set (e.g., Intel), and to restart under a different
CPU instruction set (e.g., ARM).
We will assume that we fully control the target application.
For example, we can compile it under both CPU architectures.
We can also compile it with research compilers such
LLVM is the foundation for the well-known
LLVM allows you to easily modify the compiler
to emit additional code, such as "landmarks" in the prolog
and epilog of a function, where it is acceptable to checkpoint.
Thus, one can checkpoint at one of these landmarks,
and replace the text segment with the text segment of the
other CPU architecture, and then restart at the corresponding
landmark in the alternative text segment. With a little luck,
we can persuade LLVM to emit an almost identical data segment under the
two CPU architectures. The remaining task is then to translate
the call frames of the stack from one CPU architecture to
If a team takes on this project, we will provide additional
lectures on how to modify the LLVM compiler.
Investigate Linux CFS scheduling algorithm for datacenters
The goal is to investigate and recommend best practices for resource
sharing and performance isolation for datacenter workloads managed by
the Linux CFS scheduling algorithm. In particular, we want to avoid
undesired and unpredictable scheduling delays for latency-critical
workloads with sub-millisecond SLOs, and to avoid underutilizing cores by
assigning them exclusively to latency-critical workloads even during
periods of low load. Our specific target is to identify the right
CFS settings and CPU bandwidth controls for scenarios where
different latency-critical and batch workloads execute
concurrently on the same server.
Large-scale cluster management at Google with Borg
CPU Bandwidth Control for CFS
Reconciling High Server Utilization and Sub-millisecond
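To make the CPU bandwidth knobs concrete, here is a small sketch of the arithmetic behind CFS bandwidth control. In the cgroup v1 cpu controller, a group may use at most cpu.cfs_quota_us microseconds of CPU time in every cpu.cfs_period_us microseconds, so quota divided by period is the number of CPUs the group can consume. The function name `cfs_quota_us`, the 16-core server, and the 4 reserved cores are assumptions made up for this example, not recommended settings.

```python
def cfs_quota_us(cpu_limit, period_us=100_000):
    """Quota in microseconds per period for a limit of `cpu_limit` CPUs.

    period_us defaults to 100 ms, the usual default of cpu.cfs_period_us.
    """
    if cpu_limit <= 0:
        raise ValueError("cpu_limit must be positive")
    return int(round(cpu_limit * period_us))


# Hypothetical scenario: a 16-core server where batch jobs are capped so
# that 4 cores' worth of CPU time stays free for latency-critical work.
total_cpus = 16
reserved_for_latency_critical = 4
batch_quota = cfs_quota_us(total_cpus - reserved_for_latency_critical)
print("batch cpu.cfs_quota_us =", batch_quota)
```

A cap like this bounds the batch group's CPU time per period, but it does not by itself guarantee low scheduling delay for the latency-critical group; measuring that gap is the point of the project.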
Mesos: Load-balanced scheduling with multiple active masters
Apache Mesos operates in a master-slave hierarchy with one master node
communicating with many slave nodes. In case the leading master fails,
one of the optional standby masters will take over. While this is
crucial for the high availability and fault tolerance of a
Mesos cluster, the full load and all traffic always go to the leading
master. In cases where a master fails due to overload or network
congestion, new masters are prone to be overloaded and fail in similar
The idea is to explore the possibility of having multiple
active masters sharing the workload in large-scale Apache Mesos
clusters. The work involves analysis of current data flow and shared
state, implementing a prototype which lets frameworks be serviced by
multiple masters and evaluating its viability for real-world use.
Build a basic scalable distributed NAS using multi-node communication.
Nodes communicate with each other using a quorum protocol
(standard or custom) to elect a master, which serves as the actual NFS
server endpoint. (In a real solution, nodes can also implement RAID-level
striping and IP takeover for seamless interaction with clients.)
For simplicity, each machine is connected to the same physical storage
(either NFS or disks) and all nodes are running as a thread on the
Client-side SSD Cache
Optimize the performance of a NAS client using a client-side SSD cache.
(Very open-ended; multiple caching algorithms can be implemented.)
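As one example of a caching algorithm a team might start from, here is a minimal sketch of a least-recently-used (LRU) block cache. LRU is only one candidate policy, and the `LRUBlockCache` class, its `backend_read` callback, and the block IDs are hypothetical names used for illustration; an in-memory dictionary stands in for the SSD.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Keep up to `capacity` blocks and evict the least recently used one."""

    def __init__(self, capacity, backend_read):
        self.capacity = capacity
        self.backend_read = backend_read  # hypothetical: fetches from the NAS
        self.blocks = OrderedDict()       # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.backend_read(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return data


cache = LRUBlockCache(capacity=2, backend_read=lambda b: f"data-{b}")
for block in [1, 2, 1, 3, 2]:
    cache.read(block)
print("hits:", cache.hits, "misses:", cache.misses)
```

Counting hits and misses as above gives a simple way to compare this policy against alternatives on the same access trace.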
Server-side SSD Cache
Optimize the performance of a NAS server using a server-side SSD cache.
(Very open-ended; multiple caching algorithms can be implemented.)
Implement a simple file backup application using cloud storage (Box,
Dropbox, etc.). Add-on bonus: implement stretchable backup using
multiple cloud storage providers to utilize space efficiently.
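One way to think about the stretchable backup is as a placement problem: split the backup into chunks and decide which provider stores each chunk. The sketch below uses a simple greedy rule, sending each chunk to the provider with the most free space. The `place_chunks` function, the provider names, and the sizes are assumptions for illustration only; no real cloud storage API is called.

```python
def place_chunks(chunk_sizes, free_space):
    """Assign each chunk to the provider with the most remaining space.

    chunk_sizes: list of chunk sizes (any consistent unit).
    free_space:  dict mapping provider name to free space in the same unit.
    Returns (placement, remaining), where placement[i] is the provider
    chosen for chunk i.
    """
    remaining = dict(free_space)  # copy so the caller's dict is not modified
    placement = []
    for size in chunk_sizes:
        provider = max(remaining, key=remaining.get)
        if remaining[provider] < size:
            raise RuntimeError("no provider has room for this chunk")
        remaining[provider] -= size
        placement.append(provider)
    return placement, remaining


placement, remaining = place_chunks([4, 4, 4], {"box": 10, "dropbox": 7})
print(placement, remaining)
```

The greedy rule keeps free space balanced across providers, but other policies (for example, filling one provider completely before using the next) may suit other goals.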