Natural language processing and text mining
The natural language processing (NLP) and text mining (TM) group is one of the smallest groups in the Department, but over the years it has consistently achieved high-quality research outputs, attracted significant funding and trained outstanding PhD students.
Its roots lie in the pioneering research in NLP conducted between 1980 and 2000 at the Centre for Computational Linguistics of UMIST (one of the two founding universities of The University of Manchester). Since 2004, the Group has focussed its activities on the interplay of NLP and TM. Its pre-eminence in TM was recognised in 2004 by the award of major funding from JISC/BBSRC/EPSRC to set up the world’s first publicly funded National Centre for Text Mining (NaCTeM), which immediately became an international centre of text mining expertise. NaCTeM’s ethos has always been to drive forward the state of the art in research, with results then being fed into the development of tools, services and resources (annotated corpora, computational lexica) of benefit to the wider research community.
NaCTeM researchers have excelled in community shared tasks and challenges, notably in BioCreAtIvE III, IV and V and in BioNLP 2011 and 2013 (for the most complex task of event extraction), and most recently obtained two first places in tasks of the 5th CL-SciSumm Shared Task 2019. Moreover, NaCTeM’s participation in DARPA’s $45m Big Cancer Mechanism initiative, in a consortium led by the University of Chicago, saw it produce in 2015 the top performing system for extracting information to support cancer pathway modelling. NaCTeM’s academic and industrial research projects range over many domains, from biology and biomedicine to biodiversity, toxicology, neuroscience, materials, history, social sciences, insurance, and health and safety in the construction industry. Funding has come from EPSRC, ESRC, MRC, AHRC, Wellcome Trust, NIH, Pacific Life Re, Lloyd’s Register Foundation, AstraZeneca, DARPA, EC Horizon 2020, JST and the cosmetics and extracts industry, among others.
Applications arising from such research include Thalia, a semantic search engine over more than 20m biomedical abstracts; Facta+, to find unsuspected associations in the biomedical literature; HoM, allowing semantic search of historical medical and public health archives; and RobotAnalyst, supporting the hitherto laborious screening stage of systematic reviewing through active learning techniques. NaCTeM also collaborates closely with the Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Japan.
The research group leads the UK healthcare text analytics network (Healtex), is part of the Farr Institute’s Health eResearch Centre (HeRC) and pioneered the creation of the ACL SIGBIOMED special interest group, which has featured the BioNLP workshops since 2002.
Part of the research group also applies text mining to the social sciences. Our work on social media analytics, underpinned by text mining techniques (e.g. text classification, sentiment analysis, topic modelling, named entity recognition), has been providing insights into the social "pulse" on issues ranging from customer satisfaction to fair work and human rights. Additionally, we seek to enhance civic engagement through our work on the text mining-based analysis of Parliamentary data (e.g. UK Hansard archives).