Natural language processing and text mining
The natural language processing and text mining group is one of the smallest groups in the Department, but over the years it has consistently achieved high-quality research outputs, attracted significant funding and trained outstanding PhD students.
Our researchers
Researchers and PhD students from the National Centre for Text Mining (NaCTeM) have excelled in community shared tasks and challenges, notably in BioCreAtIvE III, IV and V and in BioNLP 2011 and 2013. They also took two first places in tasks of the 5th CL-SciSumm Shared Task 2019 and 4th place in n2c2 2022 for medication event extraction. Moreover, through its participation in DARPA’s $45m Big Cancer Mechanism initiative, the centre produced the top-performing system for extracting information to support cancer pathway modelling. NaCTeM’s academic and industrial research projects range over many domains, from biology and biomedicine to biodiversity, toxicology, neuroscience, materials, history, social sciences, life insurance, mental health, law, the working life exposome, and health and safety in the construction industry. Funding comes from BBSRC, EPSRC, MRC, AHRC, Wellcome Trust, NIHR, Innovate UK, NIH, Pacific Life Re, Lloyd’s Register Foundation, AstraZeneca, DARPA, EC Horizon 2020, JST, the cosmetics and extracts industry and the Alan Turing Institute, among others. Since 2002, the centre has also co-organised BioNLP, a leading international workshop on biomedical text mining, with the National Library of Medicine as part of the Association for Computational Linguistics conference.
Applications arising from such research include Thalia, a semantic search engine over more than 20m biomedical abstracts; Facta+, which finds unsuspected associations in the biomedical literature; semantic search of historical medical and public health archives; HSESearch, which provides semantically enhanced search over a collection of workplace accident reports in collaboration with HSE; and a term-based and citation network search system for COVID-19 literature. Tools for creating high-quality labelled datasets include Aplenty and Palladin.
RobotAnalyst, a collaboration with NICE, supports the laborious screening stage of systematic reviewing, which is crucial for evidence-based medicine. NaCTeM also collaborates closely with the Artificial Intelligence Research Center at the National Institute of Advanced Industrial Science and Technology, Japan, on information extraction for cancer, with projects funded by NEDO and Japan Medical Research.
Part of the research group also works on text mining applied to the social sciences and digital humanities. Our work on social media analytics, underpinned by text mining techniques (e.g. text classification, sentiment analysis, topic modelling, named entity recognition), has been providing insights into the social "pulse" on issues ranging from customer satisfaction to fair work and human rights. Additionally, we seek to enhance civic engagement through text mining-based analysis of Parliamentary data (e.g. UK Hansard archives). Furthermore, we are developing methods for linking and searching through digital content contributed by communities, as part of the Our Heritage, Our Stories (OHOS) project under the Towards a National Collection programme of UKRI’s Arts and Humanities Research Council.