Our seminar series is free and available for anyone to attend. Unless otherwise stated, seminars take place on Wednesday afternoons at 2pm in the Kilburn Building during teaching season.

If you wish to propose a seminar speaker, please contact Antoniu Pop.


Solving the speech jigsaw: A fragment-based approach to noise-robust audio and audio-visual speech recognition

  • Speaker: Dr Jon Barker (University of Sheffield)
  • Host: Neil Lawrence
  • 14th November 2007 at 14:15 in Lecture Theatre 1.4, Kilburn Building
This talk will examine the problem of automatically recognising speech in the presence of competing sound sources. Particular attention is paid to the simultaneous co-channel speech recognition problem, i.e. two talkers speaking simultaneously over a single communication channel. This task is interesting because humans handle it with relative ease, yet it poses a challenge for mainstream automatic speech recognition (ASR) techniques. The talk will focus on work at Sheffield employing the recently recorded Audio-Visual Grid corpus, a data set designed to be suitable for both perceptual studies and small vocabulary ASR experiments.

The talk will be in three parts. The introduction will explain why background sound sources cause a major problem for conventional speech recognition systems. The talk will look to the human speech perception system to motivate potential solutions. The second part of the talk will present a particular human-inspired ASR solution known as 'speech fragment decoding'. This approach, which models the 'scene analysis' account of auditory perception, combines signal-driven and model-driven processes to piece together glimpses of unmasked signal. Finally, the talk will present comparisons of human and machine listening experiments in which co-channel simultaneous speech recognition performance has been measured over a wide range of signal-to-noise ratios. It will be seen that the fragment-based approach is able to account for the effects of both energetic and so-called informational masking observed in the human data. It will also be seen how the fragment-based approach lends a new role to the visual component of the speech signal, potentially boosting the performance of audio-visual speech recognition systems.