Speech and Audio Processing

Spoken language is a prevalent way to exchange information between humans. Allowing machines to act upon this information or to interact with humans in the most natural way thus requires that machines can deduce the meaning of what is being said.
IDLab has expertise in most, if not all, aspects of speech and audio processing. We currently focus on the following three challenges:

  • Speech recognition, i.e. transcribing verbatim what is being said.
    Although the accuracy of speech recognizers has improved steadily, an improvement of more than an order of magnitude is still needed to attain human performance, especially in the presence of noise, reverberation, or dialectal speech. To close this gap, IDLab investigates new dedicated machine learning approaches, new ways of combining the two main information sources (acoustics and linguistics), and various signal processing techniques. Inspiration is frequently found in theories of human speech recognition.
  • Extracting non-verbal information from the audio such as speaker ID (who is speaking), expressed emotion, state of mind, and stress levels in the speech.
    Such paralinguistic information is relevant in itself, e.g. to assess the quality of a company's customer care service, or it may play an indirect role in grasping the full meaning of what is being said.
  • Speech assessment.
    In domains such as (second) language learning, evaluation of the oral skills of “professional speakers” (e.g. interpreters), and evidence-based speech therapy, it is essential to be able to assess the various aspects of speech (such as intelligibility, articulation, or phonation) automatically.

A central point of attention in all these sub-domains is robustness, i.e. finding techniques that not only perform well on selected benchmark tests but also work well in real applications.

Staff

Kris Demuynck, Nilesh Madhu, Jean-Pierre Martens.

Researchers

Catherine Middag, Brecht Desplanques.

Key publications

Speech recognition seems effortless to humans, but it is nevertheless a very complex process. A comparison with handwriting recognition, a process that involves similar processing steps but is learned later in life and not practiced daily by most humans, gives a fairer impression of the complexity.

A semi-automatic subtitling tool developed as a prototype for the VRT (the Flemish public broadcasting company).

The ASISTO web tool (https://asisto.elis.ugent.be/) facilitates evidence-based speech therapy by allowing patients to practice at home.