Project Ideas

This list is for guidance only. Feel free to change the scope or application of any of these projects, or to find one on your own. See the CS7680 (Spring'17) project page and the BU/NU Cloud Computing Course project page (Spring'17) for more ideas.
  • Fault-tolerant DMTCP coordinator

    Currently, DMTCP has a centralized coordinator that communicates with all worker nodes. The centralized nature makes it a single point of failure for the computation. One way to address this is to use distributed coordinator processes with leader election. These coordinators can share the current state using a built-in consensus protocol or an external tool such as Zookeeper. One can further extend the idea to have a tree of coordinators to handle tens of thousands of worker nodes simultaneously. One can also make the coordinator multi-threaded to further the scalability goals.
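    As a rough illustration of the leader-election idea, the sketch below picks the live coordinator with the highest ID as leader and re-elects when that coordinator fails. This is only a toy rule in the spirit of bully-style election, not DMTCP code: the names `Coordinator` and `elect_leader` are hypothetical, and a real design would also need failure detection and agreement on shared state (for example through Zookeeper).

```python
from dataclasses import dataclass


@dataclass
class Coordinator:
    """Hypothetical stand-in for one coordinator process."""
    coord_id: int
    alive: bool = True


def elect_leader(coordinators):
    """Return the id of the live coordinator with the highest id, or None."""
    live_ids = [c.coord_id for c in coordinators if c.alive]
    return max(live_ids) if live_ids else None


coords = [Coordinator(1), Coordinator(2), Coordinator(3)]
first_leader = elect_leader(coords)    # coordinator 3 leads initially

coords[2].alive = False                # simulate failure of the leader
second_leader = elect_leader(coords)   # coordinator 2 takes over
```

    The point of the sketch is only that the computation survives the loss of any single coordinator as long as one remains alive.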
  • Checkpoint XPRA-based application

    XPRA is the "screen" for X11 and allows for forwarding X11 graphics between nodes. The core idea here is to checkpoint a graphical application using XPRA.
  • Enhance checkpoint of mutexes in DMTCP

    Mutexes often store the owner TID in their internal data structures. On restart, if the thread tries to release the lock, its current TID won't match the TID stored in the data structure, and hence unlocking would fail. The goal of this project is to enhance the DMTCP checkpointing system to allow mutexes to work on restart. This could be done using the Linux PID namespace or some other mechanism.
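    One possible "other mechanism" is a translation table: the mutex records a virtual TID that stays fixed across restart, and the table maps it to whatever real TID the kernel assigns. The sketch below simulates this in Python under that assumption. `TidTable`, `OwnerCheckedMutex`, and the TID values are all hypothetical and only model the owner check; they are not DMTCP or pthread APIs.

```python
class TidTable:
    """Maps stable virtual TIDs to the real TIDs currently in use."""

    def __init__(self):
        self._real_to_virt = {}

    def bind(self, virt_tid, real_tid):
        # Drop any stale binding for this virtual TID, then rebind it.
        self._real_to_virt = {
            r: v for r, v in self._real_to_virt.items() if v != virt_tid
        }
        self._real_to_virt[real_tid] = virt_tid

    def virt(self, real_tid):
        return self._real_to_virt[real_tid]


class OwnerCheckedMutex:
    """Toy mutex that stores the owner's virtual TID, not the real one."""

    def __init__(self, table):
        self._table = table
        self.owner = None

    def lock(self, real_tid):
        if self.owner is not None:
            raise RuntimeError("mutex already locked")
        self.owner = self._table.virt(real_tid)

    def unlock(self, real_tid):
        if self._table.virt(real_tid) != self.owner:
            raise PermissionError("caller does not own the mutex")
        self.owner = None


table = TidTable()
table.bind(virt_tid=4001, real_tid=4001)   # before checkpoint
mutex = OwnerCheckedMutex(table)
mutex.lock(real_tid=4001)

# After restart the kernel hands the thread a new real TID (assumed 7310),
# but the virtual TID stored in the mutex is unchanged.
table.bind(virt_tid=4001, real_tid=7310)
mutex.unlock(real_tid=7310)                # owner check passes
```

    Had the mutex stored the real TID 4001 directly, the unlock after restart would have failed the owner check, which is the failure described above.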
  • Add checkpoint capabilities to Mesos

    Apache Mesos is a two-level scheduler that can enhance resource utilization in a datacenter. The majority of workloads for Mesos are stateless, and thus one can kill tasks without a high performance penalty if one has to take down a node. However, with stateful applications, the situation is different. Simply killing a task to relocate it results in a heavy performance penalty. This problem can be solved by adding checkpoint-restart capabilities to Mesos for doing task migration.
  • Add checkpoint capabilities to GDB (PTRACE)

    The goal is to checkpoint a GDB session with DMTCP to allow reversible debugging. You can find more information about a previous implementation here: Fast Reversible Debugger.
  • Checkpoint support for Docker/Appc Containers

    Docker is sometimes called a lightweight virtual machine, although it does not include a separate "guest" Linux kernel. It uses the underlying Linux kernel. Nevertheless, it has gained popularity in many domains where virtual machines are also used.
    Virtual machines have snapshots. The goal of this project is to checkpoint Docker using DMTCP. (An alternate checkpointing package that currently works on Docker is CRIU.)
    While Docker is normally compiled as a statically linked executable under GC, there is also a dynamically linked executable for Docker using GNU GCCGO. (See The Go Blog for more information.) In principle, this should make it easy for DMTCP to checkpoint Docker. However, DMTCP must be extended to support Linux cgroups and pid namespaces.
    There is already a partial implementation of checkpointing of Docker within the DMTCP team. This will be made available to a team that tackles this project.
    Docker typically runs just a single process. If time permits, the effort should be extended to support Docker's Supervisor package. Alternatively, the team may prefer a different extension: the use of plugins to integrate with the Docker daemon on checkpoint and restart.
  • DMTCP/CRIU Integration

    Use CRIU as the single-process checkpointer in DMTCP. This will allow one to quickly checkpoint a set of containers distributed across the network.
  • Security: Multi-architecture Checkpoint-Restart

    In defending against malware, it is useful to present a dynamically shifting "attack surface" against attackers. One such technique is multi-architecture checkpoint-restart. An example of such work (as execution migration) is: Execution Migration in a Heterogeneous-ISA Chip Multiprocessor.
    The goal of this project is to checkpoint under one CPU instruction set (e.g., Intel), and to restart under a different CPU instruction set (e.g., ARM).
    We will assume that we fully control the target application. For example, we can compile it under both CPU architectures. We can also compile it with research compilers such as LLVM. LLVM is the foundation for the well-known clang compiler. LLVM allows you to easily modify the compiler to emit additional code, such as "landmarks" in the prolog and epilog of a function, where it is acceptable to checkpoint. Thus, one can checkpoint at one of these landmarks, and replace the text segment with the text segment of the other CPU architecture, and then restart at the corresponding landmark in the alternative text segment. With a little luck, we can persuade LLVM to emit an almost identical data segment under the two CPU architectures. The remaining task is then to translate the call frames of the stack from one CPU architecture to another.
    If a team takes on this project, we will provide additional lectures on how to modify the LLVM compiler.

Projects from previous semesters

  • GPUDB: In-Memory GPU NoSQL Database

    The goal of GPUDB was to create a faster database, particularly for queries, utilizing the GPU for fast parallel comparisons. It needed to be both fast and usable, leading to two separate APIs: a driver API to communicate with the GPU and perform database operations, and a user API to easily design and execute queries with little overhead or confusion to the user. The database was designed to be most effective for the specific use case of medium-sized data sets needing the fastest possible queries.
  • Designing a Web Server in Rust

    The focus of this project was to leverage the features of the relatively new Rust programming language in order to build a scalable web server. The implementation began as a very basic and naive web server. However, the architecture gradually grew in size as a thread pool was integrated in order to enable parallelization of tasks, priority scheduling was introduced to prevent large file requests from clogging the web server traffic, and server-side caching was added to reduce the number of necessary I/O operations. Logging the incoming requests and outgoing response statuses was also implemented across the threads in the thread pool. The ApacheBench tool was used to measure the performance at the various stages of the web server’s progression. These benchmarks were compared with those of the Apache web server in order to determine whether a Rust web server is capable of matching the scale of an existing web server that is widely used today. This initial evaluation suggests that the Rust web server design without the prioritization is comparable to, and perhaps slightly better than, the Apache web server in performance for smaller files. The priority scheduling effectively serves files in order of file size; however, continued work is required to improve the overall performance. Additional work also includes adding other typical features found in existing web servers.
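    The size-based priority scheduling can be sketched with a min-heap keyed on file size, so that small requests are served before large ones. The sketch is in Python rather than Rust and is not the project's code: the class name, the example paths, and the smallest-first ordering are assumptions made for illustration.

```python
import heapq
import itertools


class SizePriorityQueue:
    """Pending requests ordered by file size, smallest first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-sized requests stay in FIFO order.
        self._counter = itertools.count()

    def push(self, path, size_bytes):
        heapq.heappush(self._heap, (size_bytes, next(self._counter), path))

    def pop(self):
        _size, _seq, path = heapq.heappop(self._heap)
        return path

    def __len__(self):
        return len(self._heap)


queue = SizePriorityQueue()
queue.push("/video.mp4", 50_000_000)
queue.push("/index.html", 2_048)
queue.push("/logo.png", 30_000)

# Worker threads would pop from the queue; here we drain it in one place.
served = [queue.pop() for _ in range(len(queue))]
```

    Here `served` lists the small page first and the large video last, regardless of arrival order. A pure smallest-first policy can starve large requests under sustained load, which may be part of the continued work mentioned above.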
  • Distributed Decentralized Redundant Storage Network

    The goal of the project was to create a means to store data over a network that does not require a master server, provides redundancy, and is highly available. Additionally, it should be frictionless to either join or leave the network.
  • Distributed Shared Memory

    This project provides an overview of implementing a Distributed Shared Memory (DSM) system as an application library in Linux userspace. The library uses an invalidation protocol to maintain coherence across nodes in the distributed setup. This DSM implementation provides the user with a transparent, efficient, and scalable shared-memory programming environment.
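    A write-invalidate protocol of this kind can be sketched as follows: a directory tracks which nodes cache each page, and a write invalidates every other cached copy so that later reads fetch the new value. This is a simplified single-process model with write-through to a home copy; `Directory` and `Node` are hypothetical names, not the project's library interface.

```python
class Directory:
    """Home copy of each page plus the set of nodes caching it."""

    def __init__(self):
        self.memory = {}    # page number -> current value
        self.sharers = {}   # page number -> set of nodes holding a copy

    def fetch(self, page, node):
        self.sharers.setdefault(page, set()).add(node)
        return self.memory.get(page, 0)

    def write(self, page, value, writer):
        # Invalidate every cached copy except the writer's own.
        for node in self.sharers.get(page, set()) - {writer}:
            node.invalidate(page)
        self.sharers[page] = {writer}
        self.memory[page] = value


class Node:
    def __init__(self, directory):
        self.directory = directory
        self.cache = {}

    def read(self, page):
        if page not in self.cache:      # miss: fetch from the home copy
            self.cache[page] = self.directory.fetch(page, self)
        return self.cache[page]

    def write(self, page, value):
        self.directory.write(page, value, self)
        self.cache[page] = value

    def invalidate(self, page):
        self.cache.pop(page, None)


directory = Directory()
a, b = Node(directory), Node(directory)
a.write(0, 1)
first_read = b.read(0)      # b caches page 0 with value 1
a.write(0, 2)               # b's cached copy is invalidated
second_read = b.read(0)     # b misses and refetches the new value
```

    A userspace library would typically detect the reads and writes through page protection and fault handlers rather than explicit method calls; the sketch only shows the coherence bookkeeping.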
  • ICFS: Integrated Cloud File System

    ICFS provides a unified interface for integrating multiple cloud services into one file system for combined storage. This requires splitting files and distributing the pieces across multiple cloud accounts. It also uploads only the changed chunks, avoiding the need to re-upload the whole file. With one-step replication, data anonymity, and reduced disk dependency, ICFS is a unique solution for the present cloud scenario.
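    The chunk-level upload can be illustrated by hashing fixed-size chunks and comparing the hashes of the old and new versions of a file. The sketch below assumes fixed-size chunking, SHA-256 hashes, and round-robin placement across accounts; none of these choices, nor the function and account names, are taken from ICFS itself.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Hash each fixed-size chunk of the data."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old, new, chunk_size=CHUNK_SIZE):
    """Indices of chunks in `new` that are absent from or differ in `old`."""
    old_hashes = chunk_hashes(old, chunk_size)
    new_hashes = chunk_hashes(new, chunk_size)
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]


def assign_accounts(num_chunks, accounts):
    """Spread chunk indices across cloud accounts in round-robin order."""
    return {i: accounts[i % len(accounts)] for i in range(num_chunks)}


old_version = b"aaaabbbbcccc"
new_version = b"aaaaBBBBccccdddd"
to_upload = changed_chunks(old_version, new_version)
placement = assign_accounts(4, ["account-a", "account-b"])
```

    Only chunk 1 (modified) and chunk 3 (newly appended) need to be uploaded here. Fixed-size chunks handle in-place edits well, but an insertion shifts every later chunk, so a real system might prefer content-defined chunking.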
  • Kernel Threading in XV6

    For our final project, we strove to implement kernel-based software threading in xv6. This might serve as a valuable teaching tool at a time when the physical limitations of CPU architecture are pushing newer systems towards parallelism rather than faster clock rates. Without a strong understanding of software threading, today’s software architect will not be able to get the most out of the hardware platform running their application. Adding a bare-bones implementation of threading to a teaching tool like MIT’s xv6 would be a step towards providing a conduit for this increasingly vital knowledge. By reusing the existing code whenever possible, our group was able to achieve basic thread support in xv6 while preserving its all-important simplicity.
  • Decrease Downtime during Live Process Migration

    Migrating a running application across distinct nodes can be useful in clusters, cloud applications, game servers, and distributed systems. It facilitates load balancing, proactive maintenance, and uninterrupted services. In this project, we introduce several approaches to migrate a running process while maintaining the liveness of that process, and discuss the design of each approach. Furthermore, we analyze them by comparing the downtime of each approach, and introduce a more efficient approach as future work.
  • Mini Magic-Pocket

    Existing Distributed File Systems only support uploading or downloading a file as a whole, which is inefficient for file-syncing services that require frequent small modifications to files. In this project, we add another layer to a traditional DFS to introduce partial file syncing: only the modified parts of files are uploaded or downloaded.