Research projects

Find a postgraduate research project in your area of interest by exploring the research projects that we offer in the Department of Computer Science.

We have a broad range of research projects for which we are seeking doctoral students. Browse the list of projects on this page or follow the links below to find information on doctoral training opportunities or on applying for a postgraduate research programme.

Alternatively, if you would like to propose your own project then please include a research project proposal and the name of a possible supervisor with your application.

Available projects

Fishing in the Data Lake

Primary supervisor

Norman Paton

Funding

Competition Funded Project (Students Worldwide)

This research project is one of a number of projects at this institution. It is in competition for funding with one or more of these projects. Usually the project which receives the best applicant will be awarded the funding. Applications for this project are welcome from suitably qualified candidates worldwide. Funding may only be available to a limited set of nationalities and you should read the full department and project details for further information.

Project description

A Data Lake is a (potentially huge) repository of data collected for future reuse. In classical data analytics, data was curated for storage in data warehouses, which support the core analysis tasks of an organization. Such architectures have been effective, but are expensive to develop and maintain. More agile uses of data, in which many data sets are combined in innovative ways, are not really the focus of warehouses. In contrast, data lakes aim to bring together diverse data sets, which are combined as needed to produce new insights. This provides a schema-on-read approach, in which a schema is applied to the data when it is to be used, rather than when it is stored.
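As a rough illustration of schema-on-read, the following minimal Python sketch stores raw text untouched and applies a schema only when the data is read. The sample data, the `SCHEMA` mapping and the `read_with_schema` helper are all hypothetical names invented for this example; they are not part of any system described here.

```python
import csv
import io
from datetime import date

# Raw data sits in the lake exactly as it arrived: untyped text.
# (Hypothetical sample data, invented for this sketch.)
RAW_CSV = "id,opened,budget\n1,2020-01-15,1200.50\n2,2021-07-01,980\n"

# Hypothetical schema chosen by one analysis at read time: column -> parser.
SCHEMA = {"id": int, "opened": date.fromisoformat, "budget": float}


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, applying the schema only now, at read time."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        # Only the columns named in the schema are kept and converted.
        rows.append({col: parse(record[col]) for col, parse in schema.items()})
    return rows


rows = read_with_schema(RAW_CSV, SCHEMA)
```

A different analysis could read the same stored text with a different schema, for example keeping `id` as a string, without the stored data changing.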

The schema-on-read model has the effect of deferring the pain of curating the data; finding, selecting, collating and cleaning the data from the data lake are still all likely to be necessary. However, there are also questions such as: What data that might be relevant to my problem is in the data lake? What are the quality problems that might be a barrier to using this data? How are the data sets in the data lake related to each other? How can I search the data lake without being overwhelmed with responses?

In our recent research, we have developed an approach to automating data preparation; given a description of what is required, a data preparation program can be generated that seeks to produce what is needed from the available data sets [1]. However, this assumes that the description of what is required (in the form of a table definition and some example data) is available. In a schema-on-read model, analyses are quite speculative, and may be adapted to what suitable data can be found. We have carried out some research on discovering data in data lakes [2], but this assumes that you know quite a lot about what you are looking for. The proposed research is to develop techniques for offering candidate data to users, along with information on the properties of the different possible data sets, that might then inform the creation of the most suitable data sets to use in practice.

This research might involve the following steps: (a) review existing work on discovering, grouping, annotating and linking data in data lakes; (b) experiment with existing techniques to understand their strengths and limitations; (c) propose new techniques that can be used to automatically index, homogenize, inter-relate or group data in the data lake; (d) evaluate (c) in practice on representative data lakes (for example involving open government data) and iterate.
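To give a flavour of what inter-relating data sets in step (c) could mean, here is a deliberately simple Python sketch that links columns from different tables when their value sets overlap, using Jaccard similarity. This is a common baseline chosen for illustration only; it is not the technique of [1] or [2], and the `related_columns` helper, the 0.5 threshold and the toy `LAKE` data are assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_columns(lake, threshold=0.5):
    """Return (score, (table, column), (table, column)) for column pairs from
    different tables whose value sets overlap at or above the threshold."""
    columns = [
        ((table, col), set(values))
        for table, cols in lake.items()
        for col, values in cols.items()
    ]
    matches = []
    for (name_a, vals_a), (name_b, vals_b) in combinations(columns, 2):
        if name_a[0] == name_b[0]:
            continue  # skip pairs within the same table
        score = jaccard(vals_a, vals_b)
        if score >= threshold:
            matches.append((score, name_a, name_b))
    return sorted(matches, reverse=True)  # highest score first


# Hypothetical toy lake: table name -> column name -> values.
LAKE = {
    "schools": {"town": ["Leeds", "York", "Hull"], "pupils": [310, 120, 95]},
    "clinics": {"city": ["Leeds", "York", "Bath"], "staff": [12, 7, 9]},
}
matches = related_columns(LAKE)
```

In this toy lake the only pair reported is `schools.town` with `clinics.city`, which share two of four distinct values. Comparing every pair of columns does not scale to a large lake, which is one reason indexing techniques are of interest.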

[1] N. Konstantinou, E. Abel, L. Bellomarini, A. Bogatu, C. Civili, E. Irfanie, M. Koehler, L. Mazilu, E. Sallinger, A. A. A. Fernandes, G. Gottlob, J. A. Keane, N. W. Paton: VADA: an architecture for end user informed data preparation. J. Big Data 6: 74 (2019).
[2] A. Bogatu, A. A. A. Fernandes, N. W. Paton, N. Konstantinou: Dataset Discovery in Data Lakes. 36th IEEE International Conference on Data Engineering (2020).

Person specification

For information

Candidates must hold a minimum of an upper Second Class UK Honours degree or international equivalent in a relevant science or engineering discipline.
Candidates must meet the School's minimum English Language requirement.
Candidates will be expected to comply with the University's policies and practices of equality, diversity and inclusion.

Essential

Applicants will be required to evidence the following skills and qualifications.

You must be capable of performing at a very high level.
You must have a self-driven interest in uncovering and solving unknown problems and be able to work hard and creatively without constant supervision.

Desirable

Applicants will be required to evidence the following skills and qualifications.

You will have good time management.
You will possess determination (which is often more important than qualifications), although you'll need a good amount of both.

General

Applicants will be required to address the following.

Comment on your transcript/predicted degree marks, outlining both strong and weak points.
Discuss your final-year undergraduate project work and, if appropriate, your MSc project work.
How well does your previous study prepare you for undertaking Postgraduate Research?
Why do you believe you are suitable for doing Postgraduate Research?