BU Cloud Computing Course

Project logistics

Mentors:
- Jason Hennessey
  - Email: henn-at-bu-dot-edu
- Ian Denhardt
  - Email: isd-at-bu-dot-edu
  - Availability: Typically MWF 1:15 - 1:45; busy 2:00 - 3:00 on those same days. Otherwise by appointment on weekday afternoons.
Min-max team size: 4-7
Expected project hours per week (per team member): 6-8
Will the project be open source? Yes. We hope to release this under the Apache license as part of the HaaS, which we hope to make an OpenStack component.

Preferred past experience

Some experience in one or more of the following is preferred in possible team members:

Python
Operating Systems (e.g. CS 552 is recommended)
Linux systems (e.g. CS 410, or having a personal machine running Linux)

What is the HaaS?

We're developing Hardware as a Service to be a new fundamental layer in cloud datacenters. In a HaaS-enabled datacenter, researchers and developers can independently stand up services on bare hardware, with guaranteed isolation between them. They can scale their deployments by allocating and freeing hardware nodes, without having to change any physical setup of the datacenter. HaaS allows bare metal machines to be used similarly to virtual machines in the cloud.

Why do we need Recursive HaaS?

After allocating new nodes, software must be deployed onto them. Currently, the best deployment options we have involve PXE-booting the nodes into an install image, which can take a while. Just the BIOS POST, the first part of this process, can take 5 or 10 minutes. A faster deployment method is key to users scaling their application up more rapidly.

Users that have some level of trust in each other could use kexec to move from a running Linux kernel to a pre-boot environment without going through a full hardware reboot, although this approach is not acceptable in a general environment. As long as the different users of the HaaS agree to leave machines in a state that is ready to kexec, this could be used to quickly deploy to machines allocated with the HaaS. It could allow massively parallel jobs to run for 10 or 20 seconds across thousands of nodes to quickly accomplish tasks like high-resolution rendering or other real-time tasks.
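To make the kexec step concrete, here is a minimal sketch of the two kexec-tools invocations involved: one to stage a new kernel and one to jump into it. The function name, the file paths, and the kernel command line are hypothetical and only for illustration; the sketch builds the commands without running them.

```python
import shlex


def build_kexec_commands(kernel, initrd, cmdline):
    """Return the two kexec-tools invocations for a load-then-execute cycle.

    The commands are only built, never run, so this is safe to call anywhere.
    """
    load = [
        "kexec",
        "-l", kernel,                 # stage the new kernel in memory
        f"--initrd={initrd}",         # initial ramdisk for the new kernel
        f"--command-line={cmdline}",  # command line passed to the new kernel
    ]
    execute = ["kexec", "-e"]         # jump into the staged kernel
    return [load, execute]


if __name__ == "__main__":
    # Hypothetical paths and command line, for illustration only.
    commands = build_kexec_commands(
        "/boot/preboot-vmlinuz",
        "/boot/preboot-initrd.img",
        "console=ttyS0 ip=dhcp",
    )
    for argv in commands:
        print(shlex.join(argv))
```

Keeping the commands as data rather than executing them makes it possible to log or review what a node is about to boot before anything happens. Actually running them requires root, and `kexec -e` switches kernels without an orderly shutdown, so a real tool would need to stop services and sync disks first.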

Since HaaS is a small, trusted layer in the data center that is used for both production and research purposes, HaaS needs to remain stable. Thus, new features need to be well tested before they can be rolled out, similar to the philosophy of Red Hat Enterprise Linux (RHEL). Rather than immediately merging freshly developed functionality like fast provisioning into the stable data center HaaS layer, it would make more sense to have another HaaS into which these changes could be made, similar to how Fedora is the more up-to-date counterpart of RHEL. It's with this in mind that the other major part of this project, Recursive HaaS, comes into play.

Users or groups of users could run their own HaaS on top of the datacenter's base HaaS: the derived HaaS's nodes would be allocated from the base HaaS's nodes, the derived HaaS's networks would be set up by the base HaaS, and so forth. But users of the derived HaaS would have an agreement with each other that users of the base HaaS do not have: when freeing nodes, they will leave them in a state ready to be kexec'd. This allows users of the derived HaaS to scale up their deployments much more quickly.
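The layering described above can be sketched as a toy in-memory model. The class and method names here are assumptions made for illustration, not the HaaS API: the derived HaaS keeps nodes freed by its own users in a kexec-ready pool and only falls through to the base HaaS when that pool is empty.

```python
class BaseHaaS:
    """Stand-in for the datacenter HaaS: hands out nodes that need a cold boot."""

    def __init__(self, nodes):
        self.free = list(nodes)

    def allocate(self):
        if not self.free:
            raise RuntimeError("base HaaS has no free nodes")
        return self.free.pop()

    def release(self, node):
        self.free.append(node)


class DerivedHaaS:
    """Hypothetical derived HaaS layered on top of a base HaaS."""

    def __init__(self, base):
        self.base = base
        self.kexec_ready = []  # nodes freed by our users, left ready to kexec

    def allocate(self):
        """Return (node, method), preferring nodes that can skip the cold boot."""
        if self.kexec_ready:
            return self.kexec_ready.pop(), "kexec"
        return self.base.allocate(), "cold-boot"

    def release(self, node):
        # Per the users' agreement, a freed node stays kexec-ready, so it is
        # kept in the derived pool instead of going back to the base HaaS.
        self.kexec_ready.append(node)

    def shrink(self):
        """Hand all idle nodes back to the base HaaS."""
        while self.kexec_ready:
            self.base.release(self.kexec_ready.pop())


if __name__ == "__main__":
    derived = DerivedHaaS(BaseHaaS(["node-1", "node-2"]))
    node, method = derived.allocate()
    print(node, method)        # first allocation comes from the base HaaS
    derived.release(node)
    print(derived.allocate())  # the same node is reused through kexec
```

A real implementation would also have to pass network setup through to the base HaaS and decide when idle nodes should be returned; `shrink` only hints at that policy.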

Fast provisioning is just one application of recursive HaaS. We see a recursive HaaS implementation enabling a number of other interesting research projects.

Possible Goals

Develop a simple pass-through recursive HaaS implementation.
Add kexec support, and build a pre-boot environment that can be provisioned against. For the pre-boot environment:
- Research existing pre-boot environments
- Try different policies for entering the pre-boot environment (whether the user or the HaaS does this; what kind of checking is done by the HaaS or a HaaS administrator)
- Find a way to communicate the new OS to the pre-boot environment
- Perhaps investigate broadcast communication methods, to speed up provisioning many nodes at once
- Could the pre-boot environment be skipped? Perhaps kexec directly from the old OS to the new one
Demonstrate a piece of software scaling up through kexec
Measure performance of the kexec method, compared to a cold boot
Characterize the level of trust needed between users, and potentially prototype some possible attacks
Investigate allocating additional nodes to the derived HaaS on demand
Study alternatives to kexec
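
For the measurement goal, one way to frame the comparison is the fraction of a node's wall-clock time spent provisioning rather than doing useful work. The sketch below is an assumption about how results might be summarized, with hypothetical function names; the only figures used are the ones quoted earlier (a POST of about 5 minutes and a job of about 20 seconds), and the kexec timings are left as inputs to be measured.

```python
import statistics


def provisioning_overhead(provision_seconds, job_seconds):
    """Fraction of wall-clock time spent provisioning rather than working."""
    if provision_seconds < 0 or job_seconds <= 0:
        raise ValueError("provisioning time must be >= 0 and job time > 0")
    return provision_seconds / (provision_seconds + job_seconds)


def median_speedup(cold_boot_samples, kexec_samples):
    """Ratio of median cold-boot time to median kexec time.

    Both arguments are lists of measured durations in seconds.
    """
    kexec_median = statistics.median(kexec_samples)
    if kexec_median <= 0:
        raise ValueError("kexec timings must be positive")
    return statistics.median(cold_boot_samples) / kexec_median


if __name__ == "__main__":
    # 5-minute POST (300 s) before a 20-second job, as described above.
    overhead = provisioning_overhead(300, 20)
    print(f"cold boot: {overhead:.4f} of the wall-clock time is provisioning")
```

Medians are used rather than means so that a single slow boot does not dominate the comparison; that choice is a suggestion, not a project requirement.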

Some Technologies Expected To Be Learned/Used

kexec
Hardware as a Service
REST, Python...