Much of the world’s primary scientific knowledge is buried in the prose, figures and reference sections of scientific articles, virtually inaccessible to the computational machinery that's needed today to search, extricate and assimilate their sequestered information. Data-sets are trapped in tables and illustrations; factual assertions are shrouded in complex narratives; and relationships between articles are entombed in convoluted reference/citation formats. Extracting and liberating this knowledge automatically - with any degree of accuracy - is hard; the hypothesis of the Lazarus project is that these artefacts are best uncovered by engaging the collective services of the scientific community.
Convincing busy scientists to perform tasks for which there is no immediate personal gain is also hard. The approach taken in Lazarus is therefore to accumulate knowledge and data as a side-effect of tasks that scientists perform unconsciously during their day-to-day interactions with the literature, and to secure that information within a computer-accessible central repository.
Individual scientists routinely perform ‘micro-tasks’ when they annotate, extract data from and organise their personal collections of papers for their own purposes; in the process, they apply intuition and specialist knowledge that far surpass any current computational technique. However, the results of their labours are lost to the broader community, locked in unstructured notes, spreadsheets or collections of PDFs that only their creators can access or interpret. The Lazarus project will provide tools that make these routine, tedious micro-tasks easier, and will automatically capture the resulting knowledge, disseminating it to the community as machine-readable, citable data-objects that carry provenance associating them with their creators. These, in turn, can be used as the basis of hypothesis generation and verification, to enhance the experience of reading the literature and, ultimately, to enrich online biological databases.
Our objectives are thus to resurrect the data and insights buried in the literature by:
- harnessing the expertise of crowds by harvesting the results of their routine reading activities. Lazarus tools will make tedious ‘micro-tasks’ easier to perform than current manual approaches, pooling and publishing the results for the collective good, and giving users credit for their contribution;
- re-distributing knowledge and credit for the benefit of the crowd. Lazarus' shared resource will contribute to hypothesis generation and the creation of systematic reviews, eventually making it easier to validate claims and assertions found in the biological literature.
The Utopia system
The system consists of two main components:
- Utopia Documents: a PDF viewer optimised for scientific articles that provides the reader with useful features for chasing references, defining terms and extracting objects such as figures and tables,
- Utopia Server: the back-end to the system that accumulates information from readers, exposing the knowledge graph via web services for use in other applications.
When a reader views a scientific article using Utopia Documents, plugin components communicate with external services to analyse the content, depositing the results (with the user’s permission) in the knowledge graph for use by others. Activities such as requesting the definition of a term are used to reinforce the importance of concepts relating to a particular article.
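The flow described above can be illustrated with a minimal sketch. It is not the Utopia API: every name here (`Annotation`, `KnowledgeGraph`, `analyse_article`, `find_uppercase_terms`) is hypothetical, the knowledge graph is an in-memory stand-in, and the toy plugin merely takes the place of a real external service. It shows three ideas only: plugins analyse the article, results are deposited only with the reader's permission, and a definition request reinforces a concept for that article.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Annotation:
    """One machine-readable result produced by a plugin (hypothetical shape)."""
    article_id: str
    plugin: str
    concept: str
    contributor: str  # provenance: the reader whose session produced it


@dataclass
class KnowledgeGraph:
    """In-memory stand-in for the shared store (assumption, not the real back-end)."""
    annotations: List[Annotation] = field(default_factory=list)
    concept_weight: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def deposit(self, annotation: Annotation) -> None:
        self.annotations.append(annotation)

    def reinforce(self, article_id: str, concept: str) -> None:
        # Called when a reader asks for the definition of a term.
        key = (article_id, concept)
        self.concept_weight[key] = self.concept_weight.get(key, 0) + 1


# A plugin maps article text to a list of concepts it recognised.
Plugin = Callable[[str], List[str]]


def find_uppercase_terms(text: str) -> List[str]:
    """Toy plugin standing in for an external analysis service."""
    words = (w.strip(".,;()") for w in text.split())
    return [w for w in words if len(w) > 1 and w.isupper()]


def analyse_article(
    article_id: str,
    text: str,
    plugins: Dict[str, Plugin],
    graph: KnowledgeGraph,
    reader: str,
    share_permitted: bool,
) -> List[Annotation]:
    """Run every plugin; deposit results only if the reader has agreed."""
    results = [
        Annotation(article_id, name, concept, reader)
        for name, plugin in plugins.items()
        for concept in plugin(text)
    ]
    if share_permitted:
        for annotation in results:
            graph.deposit(annotation)
    return results
```

In this sketch the reader always sees the analysis results locally; the `share_permitted` flag only controls whether they reach the shared store.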
For content mining
- Termite: an efficient named-entity recognition system for life-science content,
- OSRA: a system for converting chemical diagrams into machine-readable form,
- Maxtract: a system for converting maths into machine-readable form,
- Tabulator: a heuristic for identifying the structure of tables,
- Text2genome: a system for annotating genes and genomes with DNA sequences extracted from biomedical articles,
- H2C: an algorithm for converting ‘hamburgers back to cows’ by recovering the logical document structure of a PDF (a reimplementation of the concepts used in pdfx).
- Crossmark: identifies whether an article is current, retracted, or has subsequent errata,
- AltMetric: provides information about an article from social media and bibliometric databases,
- SherpaROMEO: provides information about the article’s licensing,
- Crossref, PubMed, PubMed Central, arXiv: provide bibliometric data about the article, and are also used for reference linking.