Scalable Data Provenance

Creating the future architecture for storing and processing provenance data

Project logistics

Preferred past experience

Project Overview

Data provenance is a record of the history of data as it traverses, or is used by, a system or network of systems. This history can be used to assure the correctness and security of data, and to help understand and protect the system's operations and information. Current provenance systems collect large volumes of information that must be stored and analyzed. We would like to explore the space of available ingest tools and develop architectures designed specifically for provenance storage and analysis. The architecture should scale with the volume of data and allow flexibility in the choice of storage and analysis backends. The project will begin with a survey of existing streaming ingest architectures to identify candidate architectures. The students will then prototype the candidates on a cloud platform and measure the performance characteristics and scalability of their solutions.

Some technologies you will learn/use: