Replicating Data Science Models in the Cloud

Project logistics

Contact: Sri Krishnamurthy s.krishnamurthy at neu.edu
Min-max team size: 4-5
Expected project hours per week (per team member): 6-8
Will the project be open source? Yes

Preferred past experience

Job handling / concepts of distributed computing
Linux basics
Python, R, MongoDB
Docker and HPC scheduling
Web technologies for analytics

Project Overview

In data science projects, the data is typically given and modelers are expected to build the best models to address the problem. Think of Kaggle, for example. Its contests are typically supervised learning problems in which the training and testing data are provided, along with the required format for the final predictions. Modelers work on various approaches and submit their best outcomes. The approaches, tools, packages and libraries may vary; only the final deliverable must follow the prescribed format. Submissions are ranked by a chosen model evaluation criterion. Sometimes submitters share their approaches through GitHub or other means, and sometimes the contest organizers solicit model descriptions for publication. This isn't always ideal, since not all submitters oblige such requests and collating all the dependencies and models isn't easy. In addition, researchers who want to replicate an experiment typically have to piece together all the dependencies, code and data themselves, which makes replication hard.

In classrooms, when instructors give data science assignments, the challenge is not only to grade the quality of the predictions but also to give credit for the approach, design and experimentation behind them. With the plethora of open source tools and packages available, expecting all students to submit code in a particular language or using specific packages isn't ideal. Even within student teams, working on assignments is a challenge: students have to elect one laptop on which all the configuration is done. I have seen first-hand students swapping laptops during presentations because only one laptop had all the required packages and functional code.

The goal of this project is to enable students to create replicable data science environments using Docker, so that when they submit assignments they can submit not only the outcomes but also a fully functional, replicable data science environment that can be used to replicate and evaluate the submission and its approach. In addition, the environment specification would be atomic, enabling anyone to replicate the experiments and build upon them. This would enable sharing of best practices and lower the barrier to replicating and evaluating different models and approaches. It would also encourage developing with the cloud in mind rather than setting up experiments and environments locally on laptops.
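As a rough illustration of what a replicable environment specification could look like, the sketch below renders a Dockerfile from a pinned list of packages. It is only one possible approach, not the project's design: the function name render_dockerfile, the /submission directory, the train.py script, the base image and the package versions are all assumptions chosen for the example.

```python
import json
from typing import Dict, List


def render_dockerfile(base_image: str, packages: Dict[str, str], entrypoint: List[str]) -> str:
    """Return Dockerfile text that pins every package to an exact version.

    Hypothetical helper for illustration; a real submission might instead
    ship a hand-written Dockerfile or a requirements file.
    """
    # Sort packages so the same specification always yields the same text.
    pins = " ".join(f"{name}=={version}" for name, version in sorted(packages.items()))
    lines = [
        f"FROM {base_image}",
        # /submission is an assumed location for the student's code.
        "WORKDIR /submission",
        f"RUN pip install --no-cache-dir {pins}",
        "COPY . /submission",
        # json.dumps produces the exec (JSON array) form of CMD.
        f"CMD {json.dumps(entrypoint)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only; the image tag, versions and script name are assumptions.
    print(render_dockerfile(
        base_image="python:3.11-slim",
        packages={"scikit-learn": "1.4.2", "pandas": "2.2.2"},
        entrypoint=["python", "train.py"],
    ))
```

Pinning exact versions and sorting the package list means two people who start from the same specification get the same Dockerfile text, which is the property that makes a submitted environment easier to rebuild and evaluate.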

About us

Sri Krishnamurthy is the founder of QuantUniversity.com. QuantUniversity provides training programs in data science, big data and analytics for finance and energy professionals, using a case-study based approach to address industry-specific use cases. He is also a faculty member at Northeastern University, where he teaches data science and cognitive computing classes in the MS in Information Systems program. He loves to give students challenging data science experiments as assignments and has struggled many times to evaluate their submissions when grading.

Some Technologies you will learn/use:

Working with data science algorithms
Docker usage and implementing a real-world system
HPC, Analytics