Replicating Data Science Models in the Cloud
In most data science projects, the data is given and modelers are expected to
build the best models they can to address the problem. Kaggle contests are a
good example.
These contests typically pose supervised learning problems: the training and
testing data are provided, and so is the format for the final predictions.
Modelers work on various approaches and submit their best outcomes. The
approaches, tools, packages and libraries may vary; only the final deliverable
must follow the prescribed format. Submissions are then ranked on a chosen
model evaluation criterion. Sometimes submitters share their approaches
through GitHub or other means, and sometimes the contest organizers solicit
model descriptions for publications. This isn't always ideal, since not all
submitters oblige such requests and collating all the dependencies and models
isn't easy. In addition, researchers who want to replicate experiments
typically have to piece together the dependencies, code and data themselves,
which is hard to get right.
In classrooms, when instructors give data science assignments, the challenge
is not only to grade the quality of the predictions but also to give credit
for the approaches, design and experimentation behind them. With the plethora
of open source tools and packages available, expecting all students to submit
code in a particular language or with specific packages isn't ideal. Student
teams face their own challenges: they have to elect one laptop on which all
the configuration is done. I have seen first-hand students swapping laptops
during presentations because only one laptop had all the required packages
and working code.
The goal of this project is to enable students to create replicable data
science environments using Docker, so that when they submit assignments they
can submit not only the outcomes but also a fully functional, replicable data
science environment that can be used to reproduce and evaluate the submission
and its approaches. In addition, the environment specification would be
atomic, enabling anyone to replicate the experiments and build upon them. This
would encourage the sharing of best practices and lower the barrier to
replicating and evaluating different models and approaches. It would also
encourage developing with the cloud in mind rather than setting up experiments
and environments locally on laptops.
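As a rough illustration of what a self-contained environment specification
could look like, the sketch below renders a Dockerfile from a list of pinned
package versions. Everything specific in it is an assumption made for the
example and is not part of the project itself: the function name
render_dockerfile, the base image python:3.11-slim, the script name train.py,
and the package versions shown.

```python
from typing import Iterable


def render_dockerfile(
    pinned_packages: Iterable[str],
    base_image: str = "python:3.11-slim",  # assumed base image
    entrypoint: str = "train.py",          # hypothetical script name
) -> str:
    """Return Dockerfile text that installs exactly the pinned packages."""
    packages = sorted(pinned_packages)
    for spec in packages:
        # Refuse unpinned packages: an unpinned install can resolve to a
        # different version later, which defeats replicability.
        if "==" not in spec:
            raise ValueError(f"package is not pinned to a version: {spec!r}")
    lines = [
        f"FROM {base_image}",
        "WORKDIR /workspace",
        "RUN pip install --no-cache-dir " + " ".join(packages),
        "COPY . /workspace",
        f'CMD ["python", "{entrypoint}"]',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative package versions only.
    print(render_dockerfile(["pandas==2.2.2", "scikit-learn==1.5.0"]))
```

Saved as a file named Dockerfile next to the assignment code, the output
could then be built and run with the standard docker build and docker run
commands, so a grader gets the same packages the student used.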
About us
Sri Krishnamurthy is the founder of QuantUniversity.com. QuantUniversity
provides training programs in data science, big data and analytics for finance
and energy professionals, using a case-study based approach built around
industry-specific use cases. He is also a faculty member at Northeastern
University, where he teaches data science and cognitive computing classes in
the MS in Information Systems program. He loves to give students challenging
data science experiments as assignments and has struggled many a time to
evaluate the submissions when grading them.
Some Technologies you will learn/use: