Project Ideas
This list is for guidance only. Feel free to change the scope/application
of any of these projects or find one on your own.
See
CS7680 (Spring'17) project page and
BU/NU Cloud Computing Course project page (Spring'17) for more
ideas.
-
Fault-tolerant DMTCP coordinator
Currently, DMTCP has a centralized coordinator that communicates with
all worker nodes. The centralized nature makes it a single point of
failure for the computation. One way to address this is to use
distributed coordinator processes with leader election. These
coordinators can share the current state using a built-in consensus
protocol or an external tool such as ZooKeeper. One can further extend
the idea to a tree of coordinators that handles tens of thousands of
worker nodes simultaneously. One can also make the coordinator
multi-threaded to further the scalability goals.
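The leader-election idea can be illustrated with a toy sketch. It assumes every coordinator already agrees on which peers are alive, which is exactly the part a consensus protocol or ZooKeeper would have to provide in practice. The `Coordinator` class and `elect_leader` function are hypothetical names for illustration and are not part of DMTCP.

```python
from dataclasses import dataclass


@dataclass
class Coordinator:
    """Hypothetical stand-in for one coordinator process."""
    node_id: int
    alive: bool = True


def elect_leader(coordinators):
    """Pick the live coordinator with the highest id (bully-style rule).

    Returns None when no coordinator is alive.
    """
    live = [c for c in coordinators if c.alive]
    return max(live, key=lambda c: c.node_id) if live else None


cluster = [Coordinator(1), Coordinator(2), Coordinator(3)]
leader = elect_leader(cluster)      # coordinator 3 leads initially

leader.alive = False                # simulate a crash of the leader
new_leader = elect_leader(cluster)  # the survivors fall back to coordinator 2
```

The hard part of the project is what this sketch leaves out: detecting failures reliably and keeping the checkpoint state consistent across the surviving coordinators.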
-
Checkpoint XPRA-based application
XPRA is the "screen" for X11 and allows
for forwarding X11 graphics between nodes. The core idea here is to
checkpoint a graphical application using XPRA.
-
Enhance checkpoint of mutexes in DMTCP
Mutexes often store the owner TID in their internal data structures.
On restart, threads receive new TIDs from the kernel, so if a thread
tries to release a lock it held at checkpoint time, its current TID
won't match the TID stored in the data structure and the unlock will
fail. The goal of this project is to enhance the DMTCP checkpointing
system so that mutexes keep working after restart. This could be done
using the Linux PID namespace or some other mechanism.
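The failure mode, and one possible "other mechanism", can be sketched as a simulation. The sketch assumes a translation table that lets the mutex store a stable virtual TID while the real TID changes across restart. `OwnerCheckedMutex`, `TidTable`, and the TID values are all hypothetical, and nothing here is DMTCP's actual implementation.

```python
class OwnerCheckedMutex:
    """Toy mutex that records its owner's TID and checks it on unlock."""

    def __init__(self):
        self.owner_tid = None

    def lock(self, tid):
        if self.owner_tid is not None:
            raise RuntimeError("mutex is already locked")
        self.owner_tid = tid

    def unlock(self, tid):
        if self.owner_tid != tid:
            raise PermissionError(f"TID {tid} does not own the mutex")
        self.owner_tid = None


class TidTable:
    """Maps the kernel's real TIDs to stable virtual TIDs (assumed design)."""

    def __init__(self):
        self._real_to_virtual = {}

    def bind(self, virtual_tid, real_tid):
        # Drop any stale binding for this virtual TID, then record the new one.
        self._real_to_virtual = {
            r: v for r, v in self._real_to_virtual.items() if v != virtual_tid
        }
        self._real_to_virtual[real_tid] = virtual_tid

    def to_virtual(self, real_tid):
        return self._real_to_virtual[real_tid]


table = TidTable()
mutex = OwnerCheckedMutex()

# Before checkpoint: the thread has real TID 4101 and takes the lock.
table.bind(virtual_tid=4101, real_tid=4101)
mutex.lock(table.to_virtual(4101))

# After restart: the kernel hands the same thread a new real TID.
table.bind(virtual_tid=4101, real_tid=9377)

try:
    mutex.unlock(9377)              # raw TID no longer matches the owner
    raw_unlock_failed = False
except PermissionError:
    raw_unlock_failed = True

mutex.unlock(table.to_virtual(9377))  # translated TID matches again
unlocked = mutex.owner_tid is None
```

A PID-namespace approach would instead try to give the restarted threads their original TIDs, so that no translation is needed.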
-
Add checkpoint capabilities to Mesos
Apache Mesos is a two-level scheduler that can enhance resource
utilization in a datacenter. The majority of workloads for Mesos are
stateless, and thus one can kill tasks without a high performance
penalty if one has to take down a node. However, with stateful
applications, the situation is different. Simply killing a task to
relocate it results in a heavy performance penalty. This problem can
be solved by adding checkpoint-restart capabilities to Mesos for doing
task migration.
-
Add checkpoint capabilities to GDB (PTRACE)
The goal is to checkpoint a GDB session with DMTCP to allow reversible
debugging. You can find more information about a previous
implementation here: Fast
Reversible Debugger.
-
Checkpoint support for Docker/Appc Containers
Docker is sometimes called a
lightweight virtual machine, although it does not include a separate
"guest" Linux kernel. It uses the underlying Linux kernel.
Nevertheless, it has gained popularity in many domains where virtual
machines are also used.
Virtual machines have snapshots. The goal
of this project is to checkpoint Docker using DMTCP. (An alternate
checkpointing package that currently works on Docker is
CRIU.)
While Docker is normally compiled as a statically linked executable
under gc (the standard Go compiler), there is also a dynamically
linked executable for Docker built with GNU gccgo.
(See The Go
Blog for more information.) In principle, this should make it
easy for DMTCP to checkpoint Docker. However, DMTCP must be extended
to support Linux cgroups and pid namespaces.
There is already a partial implementation of checkpointing of Docker
within the DMTCP team. This will be made available to a team that
tackles this project.
Docker typically runs just a single process.
If time permits, the effort should be extended to support
Docker's
Supervisor package. Alternatively, the team may prefer a
different extension: the use of plugins to integrate with the Docker
daemon on checkpoint and restart.
-
DMTCP/CRIU Integration
Use CRIU as the single-process checkpointer in DMTCP. This will allow
one to quickly checkpoint a set of containers distributed across the
network.
-
Security: Multi-architecture Checkpoint-Restart
In defending against malware, it is useful to present
a dynamically shifting "attack surface" against attackers.
One such technique is multi-architecture checkpoint-restart.
An example of such work (as execution migration) is:
Execution Migration in a Heterogeneous-ISA Chip
Multiprocessor.
The goal of this project is to checkpoint under one CPU instruction
set (e.g., Intel), and to restart under a different
CPU instruction set (e.g., ARM).
We will assume that we fully control the target application.
For example, we can compile it under both CPU architectures.
We can also compile it with research compilers such
as LLVM.
LLVM is the foundation for the well-known
clang compiler.
LLVM allows you to easily modify the compiler
to emit additional code, such as "landmarks" in the prolog
and epilog of a function, where it is acceptable to checkpoint.
Thus, one can checkpoint at one of these landmarks,
and replace the text segment with the text segment of the
other CPU architecture, and then restart at the corresponding
landmark in the alternative text segment. With a little luck,
we can persuade LLVM to emit an almost identical data segment under the
two CPU architectures. The remaining task is then to translate
the call frames of the stack from one CPU architecture to
another.
If a team takes on this project, we will provide additional
lectures on how to modify the LLVM compiler.
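The landmark idea amounts to a lookup from a program counter in one text segment to the matching program counter in the other. The following is a minimal sketch under that assumption. The landmark names, the addresses, and the `translate_pc` helper are invented for illustration; a real table would be emitted by the modified compiler.

```python
# Hypothetical landmark tables: landmark name -> address in each text segment.
LANDMARKS = {
    "x86_64": {
        "main:prolog": 0x401000,
        "solve:prolog": 0x401136,
        "solve:epilog": 0x4011F0,
    },
    "aarch64": {
        "main:prolog": 0x400900,
        "solve:prolog": 0x400A5C,
        "solve:epilog": 0x400B10,
    },
}


def translate_pc(pc, src_arch, dst_arch):
    """Map a checkpointed program counter to the same landmark on dst_arch.

    Raises ValueError if pc is not exactly at a landmark, since only
    landmarks are safe points for a cross-architecture restart.
    """
    for name, addr in LANDMARKS[src_arch].items():
        if addr == pc:
            return LANDMARKS[dst_arch][name]
    raise ValueError(f"{pc:#x} is not a landmark on {src_arch}")


# Checkpoint taken at solve's prolog on x86_64, restart on aarch64.
resume_pc = translate_pc(0x401136, "x86_64", "aarch64")
```

Translating the program counter is the easy part. The call frames on the stack hold return addresses and saved registers that need the same kind of per-landmark translation.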
Projects from previous semesters
-
GPUDB: In-Memory GPU NoSQL Database
The goal of GPUDB was to create a faster database, particularly for
queries, utilizing the GPU for fast parallel comparisons. It needed to
be both fast and usable, leading to two separate APIs: a driver API to
communicate with the GPU and perform database operations, and a user
API to easily design and execute queries with little overhead or
confusion to the user. The database was designed to be most effective
for the specific use case of medium-sized data sets needing the
fastest possible queries.
-
Designing a Web Server in Rust
The focus of this project was to leverage the features of the
relatively new Rust programming language in order to build a scalable
web server. The implementation began as a very basic and naive web
server. However, the architecture gradually grew in size as a thread
pool was integrated in order to enable parallelization of tasks,
priority scheduling was introduced to prevent large file requests from
clogging the web server traffic, and server-side caching was added to
reduce the number of necessary I/O operations. Logging the incoming
requests and outgoing response statuses was also implemented across
the threads in the thread pool. The ApacheBench tool was used to
measure the performance at the various stages of the web server’s
progression. These benchmarks were compared with those of the Apache
web server in order to determine whether a Rust web server is capable
of matching the scale of an existing web server that is widely used
today. This initial evaluation suggests that the Rust web server
design without the prioritization is comparable to, and perhaps
slightly better than, the Apache web server in performance for smaller
files. The priority scheduling effectively serves files in order of
file size; however, continued work is required to improve the overall
performance. Additional work also includes adding other typical
features found in existing web servers.
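Serving requests in order of file size can be sketched with a priority queue. This is a minimal illustration in Python rather than the project's Rust code, and it assumes a smallest-file-first policy. `SizePriorityQueue`, the paths, and the sizes are hypothetical.

```python
import heapq
import itertools


class SizePriorityQueue:
    """Queue of pending requests, smallest file first (assumed policy)."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal-sized requests are served in arrival order.
        self._arrival = itertools.count()

    def push(self, path, size_bytes):
        heapq.heappush(self._heap, (size_bytes, next(self._arrival), path))

    def pop(self):
        _size, _order, path = heapq.heappop(self._heap)
        return path

    def __len__(self):
        return len(self._heap)


queue = SizePriorityQueue()
queue.push("/video.mp4", 50_000_000)
queue.push("/index.html", 2_000)
queue.push("/logo.png", 40_000)

# Worker threads would pop from the queue; here we simply drain it.
served = [queue.pop() for _ in range(len(queue))]
```

With this policy a large download no longer delays the small requests queued behind it, at the cost of possibly starving large requests under sustained load.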
-
Distributed Decentralized Redundant Storage Network
The goal of the project was to create a means to store data over a
network that does not require a master server, provides redundancy,
and is highly available. Additionally, it should be frictionless to
either join or leave the network.
-
Distributed Shared Memory
This project provides an overview of implementing a Distributed Shared
Memory (DSM) system as an application library in Linux userspace. The
library uses an invalidation protocol to maintain coherence across
nodes in the distributed setup. This DSM implementation provides the
user with a transparent, efficient, and scalable shared memory
programming environment.
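An invalidation protocol can be illustrated with a small in-process simulation. The sketch assumes a central directory that tracks which nodes cache each page, and a write-through policy to keep it short. `Directory` and `Node` are hypothetical names, and the project's actual protocol may differ.

```python
class Directory:
    """Home copy of every page plus the set of nodes caching it."""

    def __init__(self):
        self.memory = {}    # page number -> latest value
        self.sharers = {}   # page number -> set of nodes with a valid copy


class Node:
    def __init__(self, name, directory):
        self.name = name
        self.directory = directory
        self.cache = {}     # page number -> locally cached value

    def read(self, page):
        if page not in self.cache:
            # Miss: fetch from the home copy and register as a sharer.
            self.cache[page] = self.directory.memory.get(page, 0)
            self.directory.sharers.setdefault(page, set()).add(self)
        return self.cache[page]

    def write(self, page, value):
        # Invalidate every other cached copy before the write takes effect.
        for other in list(self.directory.sharers.get(page, set())):
            if other is not self:
                other.cache.pop(page, None)
        self.directory.sharers[page] = {self}
        self.directory.memory[page] = value   # write-through for simplicity
        self.cache[page] = value


directory = Directory()
a, b = Node("a", directory), Node("b", directory)

first_read = (a.read(0), b.read(0))   # both nodes now cache page 0
a.write(0, 42)                        # b's copy is invalidated
b_invalidated = 0 not in b.cache
b_reread = b.read(0)                  # b misses and fetches the new value
```

In a userspace library the cache misses would typically be detected through page protection and fault handling rather than explicit `read` and `write` calls.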
-
ICFS: Integrated Cloud File System
ICFS provides a unified interface for integrating multiple cloud
services into one file system for combined storage. This requires
splitting files into chunks and distributing them across multiple
cloud accounts. It also uploads only the changed chunks, which avoids
re-uploading the whole file. With one-step replication, data
anonymity, and reduced disk dependency, ICFS is a unique solution for
the present cloud scenario.
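Uploading only the changed chunks can be sketched by comparing per-chunk hashes of the old and new file contents. The fixed chunk size, the round-robin placement, and the account names below are assumptions for illustration; the project description does not say how ICFS chunks or places data.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Return the SHA-256 digest of each fixed-size chunk of data."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old, new, chunk_size=CHUNK_SIZE):
    """Indices of chunks in new that are absent from or differ in old."""
    old_h = chunk_hashes(old, chunk_size)
    new_h = chunk_hashes(new, chunk_size)
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or old_h[i] != h]


def assign_account(chunk_index, accounts):
    """Round-robin placement of chunks across cloud accounts (assumed)."""
    return accounts[chunk_index % len(accounts)]


accounts = ["account-a", "account-b", "account-c"]  # hypothetical names
old = b"aaaabbbbcccc"
new = b"aaaaBBBBccccdd"

to_upload = changed_chunks(old, new)
placement = {i: assign_account(i, accounts) for i in to_upload}
```

Fixed-size chunks keep the sketch simple, but an insertion near the start of a file shifts every later chunk boundary, so the whole remainder of the file would be treated as changed.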
-
Kernel Threading in XV6
For our final project, we strove to implement kernel-based software
threading in xv6. This might serve as a valuable teaching tool at a
time when the physical limitations of CPU architecture are pushing
newer systems towards parallelism rather than faster clock rates.
Without a strong understanding of software threading, today’s software
architect will not be able to get the most out of the hardware
platform running their application. Adding a bare-bones
implementation of threading to a teaching tool like MIT’s xv6 would be
a step towards providing a conduit for this increasingly vital
knowledge. By reusing the existing code whenever possible, our group
was able to achieve basic thread support in xv6 while preserving its
all-important simplicity.
-
Decrease Downtime during Live Process Migration
Migrating a running application across distinct nodes can be useful in
clusters, cloud applications, game servers, and distributed systems.
It facilitates load balancing, proactive maintenance, and
uninterrupted services. In this project, we introduce several
approaches to migrating a running process while maintaining the
liveness of that process, and discuss the design of each approach.
Furthermore, we analyze them by comparing the downtime of each
approach and introduce a more efficient approach as future work.
-
Mini Magic-Pocket
Existing distributed file systems (DFS) support uploading or
downloading a file only as a whole, which is inefficient for file
syncing services that require frequent small modifications of files.
In this project, we add another layer to a traditional DFS to
introduce partial file syncing: only the modified parts of files are
uploaded or downloaded.