Project logistics
- Mentors: Jay Vyas and Pei Chen. Email: jvyas at redhat dot com
- Mentor Trello IDs: jayunit100 and peistation
- Min-max team size: 3-5
- Expected project hours per week (per team member): 8-10
- Will the project be open source? Yes.
- Mentor available times: Jay will meet with students on Saturdays between 10:30 AM and 1:00 PM. Pei will announce office hours.
Preferred past experience
Some experience in one or more of the following is preferred in possible team members:
- Apache Spark Streaming for data ingestion and database caching
- Apache cTAKES/Analytics for NLP and topic extraction (cTAKES-specific, machine learning)
- Bigtop's Spark stack
- OpenStack
- NoSQL (Cassandra, HBase, ...)
Problem
Can we collect and harness the knowledge from large public data sources in real time to monitor and detect adverse drug effects?
Apache cTAKES can extract knowledge from plaintext notes. It can also normalize that knowledge into standard concepts based on known ontologies such as UMLS, accounting for synonyms, acronyms, and attributes such as negation.
In a nutshell the end system will:
- stream in large data such as Twitter and/or Public Q/A forums,
- process them through cTAKES to extract concepts, and
- "cluster" these concepts for analysis in the context of drug safety (for example, correlations between top drug mentions and signs/symptoms).
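The last step above can be sketched as a simple co-occurrence count between drug and symptom concepts. This is a minimal illustration, not the project's design: the triple format `(semantic_type, normalized_name, negated)` and the sample data are assumptions made up for the example, and a real system would take its concepts from cTAKES output rather than from a hand-written list.

```python
from collections import Counter
from itertools import product

# Hypothetical shape of one processed document: the concept extractor is
# assumed to emit (semantic_type, normalized_name, negated) triples.
docs = [
    [("drug", "ibuprofen", False), ("symptom", "nausea", False)],
    [("drug", "ibuprofen", False), ("symptom", "nausea", False),
     ("symptom", "headache", True)],  # negated mention, e.g. "no headache"
    [("drug", "aspirin", False), ("symptom", "headache", False)],
]


def cooccurrence_counts(documents):
    """Count documents in which a drug and a non-negated symptom co-occur."""
    counts = Counter()
    for concepts in documents:
        drugs = {name for kind, name, negated in concepts
                 if kind == "drug" and not negated}
        symptoms = {name for kind, name, negated in concepts
                    if kind == "symptom" and not negated}
        # Each (drug, symptom) pair is counted once per document.
        counts.update(product(drugs, symptoms))
    return counts


pair_counts = cooccurrence_counts(docs)
top_pairs = pair_counts.most_common(2)
```

Raw co-occurrence counts are only a starting point. They favor frequently mentioned drugs, so a later analysis would likely normalize them, for example by how often each drug and symptom appears on its own.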
The system tiers:
- Ingestion tier: Real-time ingestion and storage of tweets into a database. Any Apache solution is acceptable. Evaluate the candidates below against one another, since each has pros and cons, and then pick one: Cassandra, Solr, or HBase.
- Analytics tier: Reads the data stored by the ingestion tier from the database and provides summary statistics. This work will be based on the requirements provided by the cTAKES tier.
- cTAKES tier: Work closely with Pei to determine the requirements of the app from the end user's perspective, and confirm (locally, at small scale) that the cTAKES API properly supports them. If patches are necessary, contribute them directly to Apache cTAKES upstream.
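One way to picture the boundaries between the three tiers is a small storage interface that the ingestion tier writes to and the analytics tier reads from. The sketch below is only an assumption about how the pieces might fit together: `TweetStore`, `InMemoryStore`, `ingest`, `summarize`, and `toy_extract` are hypothetical names, the in-memory store stands in for whichever database is chosen, and the keyword lookup stands in for cTAKES.

```python
from collections import Counter
from typing import Iterable, Protocol


class TweetStore(Protocol):
    """Hypothetical storage interface shared by the ingestion and analytics tiers."""

    def put(self, tweet_id: str, text: str) -> None: ...

    def scan(self) -> Iterable[tuple[str, str]]: ...


class InMemoryStore:
    """Stand-in backend; a Cassandra, Solr, or HBase adapter could expose the same methods."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def put(self, tweet_id: str, text: str) -> None:
        # Writing the same id twice overwrites, so replayed tweets are not double counted.
        self._rows[tweet_id] = text

    def scan(self) -> Iterable[tuple[str, str]]:
        return list(self._rows.items())


def ingest(stream: Iterable[tuple[str, str]], store: TweetStore) -> None:
    """Ingestion tier: persist each (id, text) pair from the stream."""
    for tweet_id, text in stream:
        store.put(tweet_id, text)


def toy_extract(text: str) -> list[str]:
    """Placeholder for the cTAKES tier: plain keyword lookup, not real NLP."""
    vocabulary = {"ibuprofen", "nausea", "headache"}
    return [word for word in text.lower().split() if word in vocabulary]


def summarize(store: TweetStore, extract) -> Counter:
    """Analytics tier: count concept mentions across all stored tweets."""
    mentions = Counter()
    for _tweet_id, text in store.scan():
        mentions.update(extract(text))
    return mentions


store = InMemoryStore()
ingest([("1", "Ibuprofen gave me nausea"),
        ("2", "nausea again today"),
        ("1", "Ibuprofen gave me nausea")],  # replayed tweet
       store)
stats = summarize(store, toy_extract)
```

Keeping the store behind a narrow interface like this would let the team swap backends while comparing them, without touching the analytics code.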