Event Extraction in Twitter to Support Digital Surveillance of Infectious Diseases


  • Competition Funded Project (Students Worldwide)
This research project is one of a number of projects at this institution and is in competition for funding with one or more of them. Usually, the project that attracts the best applicant is awarded the funding. Applications for this project are welcome from suitably qualified candidates worldwide. Funding may be available only to a limited set of nationalities, so you should read the full department and project details for further information.

Project description

Social media platforms are rich sources of information on current events that directly affect people's lives, such as disease outbreaks. However, they produce an overwhelming amount of textual data. Twitter, for instance, generates around 500 million tweets per day. As a result, the natural language processing (NLP) and text mining research communities have recently taken an interest in the automated analysis of social media-generated text. Whilst many efforts have focussed on analysing the sentiments or opinions conveyed in social media posts such as tweets, only a handful have carried out information extraction tasks such as event extraction, in which fine-grained details pertaining to events of interest are automatically recognised.

This PhD project will leverage social media content as a means of measuring infectious disease incidence, i.e., the number of new cases of a particular disease that emerge during a given period of time. Each new disease case will be captured as an event that encapsulates associations between a disease, its signs or symptoms, and the affected geographical areas. To this end, the student will develop NLP methods for extracting fine-grained information from tweets, specifically for the following tasks: (1) the recognition of mentions pertinent to epidemiology, e.g., names of diseases, signs or symptoms, and geographic locations; (2) the consolidation of related mentions into events; and (3) the detection of cases that are negated or uncertain. Ultimately, disease incidence will be estimated more accurately, since such a fine-grained, event-based approach should yield fewer false positive disease cases than term frequency-based approaches.
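
To make the three tasks concrete, the following is a minimal illustrative sketch in Python. It is not the project's method: the lexicons, cue words, the `DiseaseEvent` class and all function names are hypothetical, and simple dictionary lookup stands in for the NLP models the student would actually develop. It only shows why an event-based count can come out lower than a term frequency count on the same tweets.

```python
import re
from dataclasses import dataclass, field

# Toy lexicons and cue lists, invented purely for illustration.
DISEASES = {"flu", "measles", "dengue"}
SYMPTOMS = {"fever", "cough", "rash"}
LOCATIONS = {"manchester", "london", "leeds"}
NEGATION_CUES = {"not", "no", "never"}
UNCERTAINTY_CUES = {"maybe", "might", "possibly", "think"}


@dataclass
class DiseaseEvent:
    """A candidate disease case: a disease linked to symptoms and locations."""
    disease: str
    symptoms: list = field(default_factory=list)
    locations: list = field(default_factory=list)
    negated: bool = False
    uncertain: bool = False


def tokenize(tweet):
    # Lowercase the tweet and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", tweet.lower())


def extract_event(tweet):
    tokens = tokenize(tweet)
    # Task 1: recognise mentions by (assumed) dictionary lookup.
    diseases = [t for t in tokens if t in DISEASES]
    if not diseases:
        return None
    # Task 2: consolidate the mentions found in one tweet into one event.
    # Task 3: flag the event if any negation or uncertainty cue is present.
    return DiseaseEvent(
        disease=diseases[0],
        symptoms=[t for t in tokens if t in SYMPTOMS],
        locations=[t for t in tokens if t in LOCATIONS],
        negated=any(t in NEGATION_CUES for t in tokens),
        uncertain=any(t in UNCERTAINTY_CUES for t in tokens),
    )


def estimate_incidence(tweets, disease):
    # Count only events that are neither negated nor uncertain.
    events = (extract_event(t) for t in tweets)
    return sum(
        1 for e in events
        if e and e.disease == disease and not e.negated and not e.uncertain
    )


def term_frequency(tweets, disease):
    # Baseline: count every tweet that mentions the disease term at all.
    return sum(1 for t in tweets if disease in tokenize(t))


tweets = [
    "Down with the flu in Manchester, fever and cough all day",
    "Good news, it's not the flu after all",
    "I think I might have the flu, London is freezing",
    "Lovely weather in Leeds today",
]
print(term_frequency(tweets, "flu"))      # 3
print(estimate_incidence(tweets, "flu"))  # 1
```

On these four made-up tweets, the term frequency baseline counts three mentions of "flu", whereas the event-based count keeps only the one case that is neither negated nor uncertain.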
