Project logistics
- Mentor: Tom Moyer (email: tmoyer-at-ll.mit.edu)
- Min-max team size: 2-4
- Expected project hours per week (per team member): 6-8
- Will the project be open source? Yes
Preferred past experience
- Linux (Very important)
- Simulating "production" workloads (Valuable)
- Experience with the "Big Data" set of tools (Rather important)
- A sense of computer security (Valuable)
Project Overview
Data provenance is a record of the history of data as it traverses, or is used by, a system or network of systems. This history can be used to assure the correctness and security of data, and to help understand and protect the system's operations and information. Current provenance systems collect large volumes of information that must be stored and analyzed. We would like to explore the space of available ingest tools and develop a scalable architecture specifically for provenance-based storage and analysis. The architecture should scale with the volume of data and allow flexibility in the choice of storage and analysis backends. The project will begin with a survey of existing streaming ingest architectures to identify candidate architectures. The students will then prototype the candidates on a cloud platform and measure the performance characteristics and scalability of their solution.
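To make the "flexible backends" requirement concrete, here is a minimal Python sketch of a batched ingest loop with a swappable storage backend. Everything in it is an assumption made for illustration: the record shape (ProvenanceRecord), the StorageBackend interface, the in-memory stand-in for a graph database, and the batch size are all hypothetical, and none of it reflects the Linux Provenance Module's actual output format or any particular ingest tool.

```python
import time
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record shape: one edge in a provenance graph."""
    source: str       # e.g. a process identifier
    relation: str     # e.g. "read" or "wrote"
    target: str       # e.g. a file path
    timestamp: float


class StorageBackend(Protocol):
    """Assumed interface that any storage backend would implement."""
    def write_batch(self, records: List[ProvenanceRecord]) -> None: ...


class InMemoryGraphBackend:
    """Stand-in for a graph database: adjacency lists keyed by source node."""

    def __init__(self) -> None:
        self.edges: Dict[str, List[Tuple[str, str]]] = {}
        self.count = 0

    def write_batch(self, records: List[ProvenanceRecord]) -> None:
        for r in records:
            self.edges.setdefault(r.source, []).append((r.relation, r.target))
            self.count += 1


def batches(records: Iterable[ProvenanceRecord],
            size: int) -> Iterator[List[ProvenanceRecord]]:
    """Group a record stream into lists of at most `size` records."""
    batch: List[ProvenanceRecord] = []
    for r in records:
        batch.append(r)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def ingest(records: Iterable[ProvenanceRecord], backend: StorageBackend,
           batch_size: int = 100) -> Tuple[int, float]:
    """Write all records to the backend; return (record count, seconds)."""
    start = time.perf_counter()
    total = 0
    for batch in batches(records, batch_size):
        backend.write_batch(batch)
        total += len(batch)
    return total, time.perf_counter() - start
```

Because ingest depends only on write_batch, a relational or NoSQL backend could be swapped in without changing the ingest loop, and the returned count and elapsed time give a rough records-per-second figure for comparing candidates.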
Some technologies you will learn/use:
- Database knowledge: Graph databases, NoSQL databases, Relational databases
- Graph analysis tools; "Big Data" analysis tools
- Linux Provenance Module