Data Science with Social Media for Epidemiology and Public Health
- Author
- Bodnar, Todd
- Published
- [University Park, Pennsylvania] : Pennsylvania State University, 2015.
- Physical Description
- 1 electronic document
- Additional Creators
- Salathé, Marcel
Access Online
- etda.libraries.psu.edu, Connect to this object online.
- Restrictions on Access
- Open Access.
- Summary
- The emergence of large-scale web data has opened the door to novel research questions in many fields. However, most work has focused on either improving a user's experience with a service or increasing the likelihood that advertisements are clicked, with other fields both showing and receiving less interest in these big data web analytics skills. In this dissertation we apply this type of methodology to epidemiology and public health, a field that, aside from genomics and disease surveillance, tends to focus on traditional methods such as field research, animal experiments, and mathematical modeling. We begin with a review of previous work in this area and its shortcomings. Specifically, we consider modern web-based disease surveillance systems such as Google Flu Trends and related academic work. We find inconsistencies in the literature around how to measure the accuracy of the models, making cross-publication comparisons difficult. To address this, we reproduce several methods on our own datasets. We find inconsistencies in the expected performance, along with unusual results, and resolve to develop a better, more thoroughly validated approach. We note that these papers do not actually focus on validated individual diagnoses, but instead on simply fitting the population's web behavior to the population's disease incidence. We address this by collaborating with a local health provider to obtain medical records of patients who had previously been professionally diagnosed with influenza. We then obtained nearly all Twitter data related to these patients and developed classifiers based on this data that could accurately determine whether a patient had influenza at a given time. We then applied this classifier to a multi-million-user dataset consisting of approximately 10 terabytes of Tweets, covering a four-year span, from users located in the United States.
To scale to this size, we developed a map-reduce version of the classifier implemented with Apache Hadoop and stored the results in Apache Hive. We used this information to (1) build a novel disease surveillance system that works at any arbitrary geographic scale and (2) analyze the geographic spread of influenza. With this information about disease transmission, we then considered a separate public health question: how to efficiently spread accurate information about a disease. To do this, we studied the spread of information on Twitter related to three public health events: the detection of a novel strain of influenza (H7N9), a measles outbreak related to vaccine refusal, and Autism Awareness Month. As with actual diseases, we start with a network analysis of the spread of information. We then extend these traditional spreading models to include information about the content of the Tweet and the type of person spreading it, to develop a more accurate model of information spread. Finally, we conclude with a discussion of future paths that this research could follow, such as employing deep learning artificial neural networks to increase the accuracy of our disease diagnosis system or performing experimental manipulation of Twitter users to encourage healthy activities.
- Dissertation Note
- Ph.D. Pennsylvania State University, 2015.
- Reproduction Note
- Microfilm (positive). 1 reel ; 35 mm. (University Microfilms 107-59886)
- Technical Details
- The full text of the dissertation is available as an Adobe Acrobat .pdf file ; Adobe Acrobat Reader required to view the file.
catkey: 16834444