Unlocking Open Data through Wrapper Generation
Primary supervisor
Other projects with the same supervisor
- Data Wrangling
- Deep Learning Architectures for Complex Data Fusion and Integration
- Data Integration & Exploration on Data Lakes
- Fishing in the Data Lake
- Finding a way through the Fog from the Edge to the Cloud
Funding
- Competition Funded Project (Students Worldwide)
This research project is one of several at this institution and competes with one or more of them for funding. The funding is usually awarded to the project that attracts the best applicant. Applications for this project are welcome from suitably qualified candidates worldwide. Funding may be available only to a limited set of nationalities, so you should read the full department and project details for further information.
Project description
Open data is made available for wider use by public sector bodies, charities and commercial organisations. Open data sets provide information about the environment, the economy, health, etc. The potential value of this data is considerable, but diverse publishers and publication practices can make the data challenging to integrate for analysis.
The aim of this research project is to support the generation of wrappers for open data sources. A wrapper is a program that extracts the data from a file into a target format. Wrapper generation takes some evidence of how the data might usefully be extracted and produces a wrapper from it, most likely expressed in some form of domain-specific language.
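To make the idea of a wrapper concrete, the following is a minimal sketch in Python, not the project's actual language. It assumes a hypothetical mini-DSL in which a wrapper is a list of steps applied to the rows of a CSV file. The operation names (`skip_rows`, `drop_empty`, `select_columns`) and the sample data are invented for illustration.

```python
import csv
import io


def run_wrapper(wrapper, text):
    """Apply each step of a (hypothetical) wrapper DSL to rows parsed from CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    for op, arg in wrapper:
        if op == "skip_rows":
            # Drop the first `arg` rows, e.g. a title banner and a header line.
            rows = rows[arg:]
        elif op == "drop_empty":
            # Remove rows whose cells are all blank.
            rows = [row for row in rows if any(cell.strip() for cell in row)]
        elif op == "select_columns":
            # Keep only the columns at the given indices, in that order.
            rows = [[row[i] for i in arg] for row in rows]
        else:
            raise ValueError(f"unknown operation: {op}")
    return rows


# Invented sample file: a title line, a header, data rows and a blank spacer row.
SAMPLE = "Air quality report,,\nsite,year,no2\nA,2019,31\n,,\nB,2019,27\n"

# The wrapper itself is plain data, which is what makes it possible to generate.
WRAPPER = [("skip_rows", 2), ("drop_empty", None), ("select_columns", [0, 2])]

RESULT = run_wrapper(WRAPPER, SAMPLE)
print(RESULT)
```

Because the wrapper is a small data structure and not hand-written code, a generator can construct and compare many candidate wrappers for the same file.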
The research might be expected to involve the following stages, with some measure of iteration: (i) identify representative open data sets and target formats, looking in particular for recurring patterns; (ii) review work on languages for data extraction and transformation, and manually apply them to the cases from (i); (iii) design and implement a domain-specific language for writing wrappers; (iv) design and implement a search over language expressions, for example using genetic programming, that generates a wrapper given some information about the intended result of extraction; (v) evaluate the result, and iterate.
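Stage (iv) can be illustrated with a small sketch. It deliberately swaps genetic programming for a much simpler exhaustive search over short step sequences, and it assumes that the evidence takes the form of the exact rows the wrapper should output. The DSL operations, the candidate set and the sample data are all hypothetical.

```python
import csv
import io
from itertools import product


def apply_step(rows, step):
    """Apply one step of the hypothetical wrapper DSL."""
    op, arg = step
    if op == "skip_rows":
        return rows[arg:]
    if op == "drop_empty":
        return [row for row in rows if any(cell.strip() for cell in row)]
    if op == "select_columns":
        # Raises IndexError if a row is too short; the search treats that as invalid.
        return [[row[i] for i in arg] for row in rows]
    raise ValueError(f"unknown operation: {op}")


def run_wrapper(wrapper, rows):
    for step in wrapper:
        rows = apply_step(rows, step)
    return rows


def candidate_steps(width):
    """Enumerate a small, fixed vocabulary of steps for a table of the given width."""
    steps = [("drop_empty", None)]
    steps += [("skip_rows", n) for n in (1, 2, 3)]
    steps += [("select_columns", [i, j])
              for i in range(width) for j in range(i + 1, width)]
    return steps


def search_wrapper(rows, expected, max_len=3):
    """Return the first shortest step sequence whose output equals `expected`."""
    steps = candidate_steps(max(len(row) for row in rows))
    for length in range(1, max_len + 1):
        for wrapper in product(steps, repeat=length):
            try:
                output = run_wrapper(wrapper, rows)
            except IndexError:
                continue  # candidate refers to a column that no longer exists
            if output == expected:
                return list(wrapper)
    return None


SAMPLE = "Air quality report,,\nsite,year,no2\nA,2019,31\n,,\nB,2019,27\n"
ROWS = list(csv.reader(io.StringIO(SAMPLE)))
EXPECTED = [["A", "31"], ["B", "27"]]  # the evidence: the rows the user wants

FOUND = search_wrapper(ROWS, EXPECTED)
print(FOUND)
```

Exhaustive enumeration only scales to tiny vocabularies and short wrappers. A guided search such as genetic programming would replace the nested loops with a population of candidates scored by how closely their output matches the evidence.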
This work extends our ongoing work on cost-effective data preparation, for example as reflected in [1, 2].
[1] N. Konstantinou, E. Abel, L. Bellomarini, A. Bogatu, C. Civili, E. Irfanie, M. Koehler, L. Mazilu, E. Sallinger, A. A. A. Fernandes, G. Gottlob, J. A. Keane, N. W. Paton: VADA: an architecture for end user informed data preparation. J. Big Data 6: 74 (2019).
[2] A. Bogatu, N. W. Paton, A. A. A. Fernandes, M. Koehler: Towards Automatic Data Format Transformations: Data Wrangling at Scale. Comput. J. 62(7): 1044-1060 (2019).
Person specification
For information
- Candidates must hold a minimum of an upper Second Class UK Honours degree or international equivalent in a relevant science or engineering discipline.
- Candidates must meet the School's minimum English Language requirement.
- Candidates will be expected to comply with the University's policies and practices of equality, diversity and inclusion.
Essential
Applicants will be required to evidence the following skills and qualifications.
- You must be capable of performing at a very high level.
- You must have a self-driven interest in uncovering and solving unknown problems and be able to work hard and creatively without constant supervision.
Desirable
Applicants will be required to evidence the following skills and qualifications.
- You will have good time management.
- You will possess determination (which is often more important than qualifications) although you'll need a good amount of both.
General
Applicants will be required to address the following.
- Comment on your transcript/predicted degree marks, outlining both strong and weak points.
- Discuss your final-year undergraduate project work and, if appropriate, your MSc project work.
- How well does your previous study prepare you for undertaking Postgraduate Research?
- Why do you believe you are suitable for doing Postgraduate Research?