Title: Crowd sourcing and active learning to scale out supervised learning approaches: a case study for web data extraction
- Speaker: Dr Paolo Merialdo (Universita Roma Tre)
- Host: Norman Paton
- 11th December 2013 at 14:00 in Lecture Theatre 1.4
The recent advent of crowdsourcing platforms (such as Amazon Mechanical Turk) is opening new opportunities for approaches based on supervised learning. These platforms provide support for managing and assigning mini-tasks to human workers, and thus can be used to produce massive training datasets. As they facilitate the involvement of a large number of people, we may say that they represent a solution to 'scale out' supervised learning approaches. However, to obtain an efficient and effective process, two main issues need to be addressed. First, since mini-tasks are performed by non-expert, usually unskilled people, they should be extremely simple. Second, since costs are proportional to the effort required of the crowd, the number of mini-tasks should be minimized. To address the latter issue, active learning represents a natural solution. As a proof of concept, we present a system to infer web wrappers that relies on workers recruited by means of a crowdsourcing platform. Our system adopts a supervised approach to wrapper inference, with training data generated by submitting simple queries to a crowdsourcing platform. To address the cost issue, the system selects the queries that most quickly lead to an accurate wrapper, thus minimizing the number of mini-tasks assigned to the crowd.
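The query-selection idea sketched in the abstract can be illustrated with a minimal, hypothetical active-learning loop step: keep a set of candidate wrappers and ask the crowd about the page on which the candidates disagree most, since that answer prunes the most hypotheses. All names and data below are illustrative assumptions, not the speaker's actual system.

```python
# Illustrative sketch of disagreement-based query selection (query-by-committee
# style active learning). The "wrappers" here are toy extraction rules; a real
# wrapper-induction system would use CSS/XPath rules learned from pages.

candidate_wrappers = [
    lambda page: page.get("css_title"),    # hypothetical rule 1 (CSS-based)
    lambda page: page.get("xpath_h1"),     # hypothetical rule 2 (XPath-based)
    lambda page: page.get("meta_title"),   # hypothetical rule 3 (metadata-based)
]

# Toy "pages": each maps a rule's field to the value that rule would extract.
pages = [
    {"id": 1, "css_title": "A", "xpath_h1": "A", "meta_title": "A"},
    {"id": 2, "css_title": "B", "xpath_h1": "B", "meta_title": "X"},
    {"id": 3, "css_title": "C", "xpath_h1": "Y", "meta_title": "Z"},
]

def disagreement(page):
    """Number of distinct answers the candidate wrappers give on a page."""
    return len({w(page) for w in candidate_wrappers})

def next_query(pages):
    """Pick the page where the wrappers disagree most: asking the crowd
    about it is the most informative mini-task."""
    return max(pages, key=disagreement)

print(next_query(pages)["id"])  # page 3: all three rules disagree there
```

The crowd's answer to the selected query would then eliminate every candidate wrapper that extracted a different value, so each mini-task removes as many hypotheses as possible.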