« Crossroads of Speech and Language »

Special Sessions & Challenges

The Organizing Committee of INTERSPEECH 2019 is proud to announce the following special sessions and challenges.

Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.

Papers must be submitted following the same schedule and procedure as regular papers, and they undergo the same review process by anonymous, independent reviewers.

The Interspeech 2019 Computational Paralinguistics Challenge (ComParE)

Styrian Dialects, Continuous Sleepiness, Baby Sounds & Orca Activity

Interspeech ComParE is an open Challenge dealing with states and traits of speakers as manifested in the properties of their speech signal. For this 11th edition, we introduce four new tasks and Sub-Challenges:

  • Styrian Dialects Recognition in Spoken Language,
  • Continuous Sleepiness Estimation in Speech,
  • Baby Sound Recognition,
  • Orca Activity Detection.

Sub-Challenges allow contributors to use their own features and machine learning algorithms; however, a standard feature set and tools are provided that may be used. Participants have five trials on the test set per Sub-Challenge. Participation has to be accompanied by a paper presenting the results, which undergoes the Interspeech peer review.

Contributions using the provided or equivalent data are sought, including (but not limited to):

  • Participation in a Sub-Challenge
  • Contributions centered on the Challenge topics

Results of the Challenge and Prizes will be presented at Interspeech 2019 in Graz, Austria.

Please visit: http://www.compare.openaudio.eu/compare2019/


  • Björn Schuller (U Augsburg, Germany / Imperial College, UK / audEERING)
  • Anton Batliner (U Augsburg, Germany)
  • Christian Bergler (FAU, Germany)
  • Florian Pokorny (MU Graz, Austria)
  • Jarek Krajewski (U Wuppertal / RUAS Cologne, Germany)
  • Meg Cychosz (UC Berkeley, USA)
The VOiCES from a Distance Challenge

The VOiCES from a Distance Challenge focuses on benchmarking and further improving state-of-the-art technologies for speaker recognition and automatic speech recognition (ASR) on far-field speech. The challenge is based on the recently released corpus Voices Obscured in Complex Environmental Settings (VOiCES), in which noisy speech was recorded in real reverberant rooms with multiple microphones. Noise sources included babble, music, and television. The challenge will have two tracks for speaker recognition and ASR:

  1. Fixed System - Training data is limited to specific datasets
  2. Open System - Participants can use any external datasets they have access to (private or public)

Participating teams will get early access to the VOiCES phase II data, which will form the evaluation set for the challenge. The special session will be dedicated to discussion of the applied technology, its performance, and any issues highlighted as a result of the challenge.

For more information visit: https://voices18.github.io/Interspeech2019-Special-Session/


  • Aaron Lawson (SRI International)
  • Colleen Richey (SRI International)
  • Maria Alejandra Barros (Lab41, In-Q-Tel)
  • Mahesh Kumar Nandwana (SRI International)
  • Julien van Hout (SRI International)
The 2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019)

The INTERSPEECH 2019 special session on Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) will accelerate anti-spoofing research for automatic speaker verification (ASV).

The first challenge, ASVspoof 2015, focused on speech synthesis and voice conversion spoofing attacks. The second challenge, ASVspoof 2017, focused on replay spoofing attacks. ASVspoof 2019, the third in the series, will be the first challenge with a broad focus on all three types of spoofing attack. Continuing the 2015 and 2017 editions, ASVspoof 2019 promotes the development of generalised spoofing countermeasures, namely countermeasures that perform reliably in the face of unpredictable variation in attack types and algorithms.

ASVspoof 2019 has two sub-challenges:

  • Logical access and speech synthesis/voice conversion attacks:
    The data used for ASVspoof 2015 included spoofing attacks generated with text-to-speech (TTS) and voice conversion (VC) systems that were state of the art at the time. Since then, considerable progress has been reported by both the TTS and VC communities. The quality of synthetic speech produced with today’s best technology is now perceptually indistinguishable from bona fide speech. Since these technologies can be used to project convincing speech signals over the telephone, they pose substantial threats to the reliability of ASV. This scenario is referred to as logical access. The assessment of countermeasures, namely automatic systems that can detect non-bona-fide, spoofed speech produced with the latest TTS and VC technologies, is therefore needed urgently.
  • Physical access and replay attack:
    The ASVspoof 2017 database included various types of replayed audio files recorded at several places via many different devices. Progress in the development of countermeasures for replay detection has been rapid, with substantial improvements in performance being reported each year. The 2019 edition of ASVspoof features a distinct physical access and replay attack condition in the form of a far more controlled evaluation setup than that of the 2017 condition. The physical access scenario is relevant not just to ASV, but also to the emerging problem of fake audio detection that is faced in a host of additional applications including voice interaction and authentication with smart objects (e.g. smart-speakers and voice-driven assistants).

In addition, ASVspoof 2019 will adopt a new evaluation metric, the tandem detection cost function (t-DCF), which reflects the impact both of spoofing and of countermeasures on ASV performance.
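As an illustration only, the sketch below computes a simplified weighted detection cost for a countermeasure by sweeping decision thresholds. The official t-DCF derives its cost weights from ASV error rates and class priors; the weights here (`c_miss`, `c_fa`) are arbitrary example values, not the challenge's.

```python
# Illustrative, simplified detection-cost sketch (NOT the official t-DCF):
# the real metric derives c_miss and c_fa from ASV errors and priors.

def error_rates(bona_scores, spoof_scores, threshold):
    """Miss and false-alarm rates of a countermeasure at one threshold.
    Convention: higher score = more likely bona fide speech."""
    p_miss = sum(s < threshold for s in bona_scores) / len(bona_scores)
    p_fa = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    return p_miss, p_fa

def min_cost(bona_scores, spoof_scores, c_miss=1.0, c_fa=10.0):
    """Sweep all candidate thresholds; return the minimum weighted cost."""
    best = float("inf")
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        p_miss, p_fa = error_rates(bona_scores, spoof_scores, t)
        best = min(best, c_miss * p_miss + c_fa * p_fa)
    return best

bona = [0.9, 0.8, 0.7, 0.6]   # made-up countermeasure scores
spoof = [0.4, 0.3, 0.5, 0.2]
print(min_cost(bona, spoof))  # scores are perfectly separable here, so 0.0
```

The real t-DCF additionally accounts for the downstream ASV system's miss and false-alarm behaviour; see the challenge evaluation plan for the exact formulation.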

For more details, please see the challenge site at http://www.asvspoof.org

Organizers (*)

  • Junichi Yamagishi (NII, Japan & Univ. of Edinburgh, UK)
  • Massimiliano Todisco (EURECOM, France)
  • Md Sahidullah (Inria, France)
  • Héctor Delgado (EURECOM, France)
  • Xin Wang (National Institute of Informatics, Japan)
  • Nicholas Evans (EURECOM, France)
  • Tomi Kinnunen (University of Eastern Finland, Finland)
  • Kong Aik Lee (NEC, Japan)
  • Ville Vestman (University of Eastern Finland, Finland)
(*) Equal contribution
The Zero Resource Speech Challenge 2019: TTS without T

Typical speech synthesis systems are built from an annotated corpus consisting of audio from a target voice plus text (and/or aligned phonetic labels). Obtaining such an annotated corpus is costly and does not scale to the thousands of 'low-resource' languages that lack linguistic expertise or a reliable orthography.

The ZeroSpeech 2019 challenge addresses this problem by proposing to build a speech synthesizer without any text or phonetic labels, hence 'TTS without T' (text-to-speech without text). In this challenge, we provide raw audio for the target voice(s) in an unknown language, but no alignment, text, or labels.

Participants will have to rely on automatically discovered subword units and align them to the voice recordings in whatever way works best for synthesizing novel utterances from novel speakers. The task extends previous challenge editions with the requirement to synthesize speech, which provides an additional objective and thereby helps the discovery of acoustic units that are linguistically useful.

For more information please visit: http://www.zerospeech.com/2019/


  • Ewan Dunbar (Laboratoire de Linguistique Formelle, Cognitive Machine Learning [CoML])
  • Emmanuel Dupoux (Cognitive Machine Learning [CoML], Facebook A.I. Research)
  • Robin Algayres (Cognitive Machine Learning [CoML])
  • Sakriani Sakti (Nara Institute of Science and Technology, RIKEN Center for Advanced Intelligence Project)
  • Xuan-Nga Cao (Cognitive Machine Learning [CoML])
  • Mathieu Bernard (Cognitive Machine Learning [CoML])
  • Julien Karadayi (Cognitive Machine Learning [CoML])
  • Juan Benjumea (Cognitive Machine Learning [CoML])
  • Lucas Ondel (Department of Computer Graphics and Multimedia, Brno University of Technology)
  • Alan W. Black (Language Technologies Institute, Carnegie Mellon University)
  • Laurent Besacier (Laboratoire d’Informatique de Grenoble, équipe GETALP)
Spoken Language Processing for Children's Speech

This special session aims to bring together researchers and practitioners from academia and industry working on the challenging task of processing spoken language produced by children.

While recent years have seen dramatic advances in the performance of a wide range of speech processing technologies (such as automatic speech recognition, speaker identification, speech-to-speech machine translation, sentiment analysis, etc.), the performance of these systems often degrades substantially when they are applied to spoken language produced by children. This is partly due to a lack of large-scale data sets containing examples of children's spoken language that can be used to train models, but also because children's speech differs from adult speech at many levels, including the acoustic, prosodic, lexical, morphosyntactic, and pragmatic levels.

We envision that this session will bring together researchers working in the field of processing children's spoken language for a variety of downstream applications to share their experiences about what approaches work best for this challenging population.

For more information please visit: https://sites.google.com/view/wocci/home/interspeech-2019-special-session


  • Keelan Evanini (Educational Testing Service)
  • Maryam Najafian (MIT)
  • Saeid Safavi (University of Surrey)
  • Kay Berkling (Duale Hochschule Baden-Württemberg)
Voice quality characterization for clinical voice assessment: Voice production, acoustics, and auditory perception

The assessment of voice quality is relevant to the clinical care of disordered voices. It contributes to the selection and optimization of clinical treatment as well as to the evaluation of the treatment outcome. Levels of description of voice quality include the biomechanics of the vocal folds and their kinematics, temporal and spectral acoustic features, as well as the auditory scoring of hoarseness, hyper- and hypo-functionality, creakiness, diplophonia, harshness, etc. Broad and fuzzy definitions of terms regarding voice quality are in use, which impede scientific and clinical communication.

The aim of the special session is to contribute to the improvement of the clinical assessment of voice quality via a translational approach, which focuses on quantifying and explaining relationships between several levels of description. The goal is to objectify voice quality via (i) the analysis and simulation of vocal fold vibrations by means of high-speed videolaryngoscopy in combination with kinematic or mechanical modelling, (ii) the synthesis of disordered voices combined with auditory experimentation involving disordered voice stimuli, as well as (iii) the statistical analysis and automatic classification of distinct types of voice quality via video and/or audio features.


  • Philipp Aichinger (philipp.aichinger@meduniwien.ac.at)
  • Abeer Alwan (alwan@ee.ucla.edu)
  • Carlo Drioli (carlo.drioli@uniud.it)
  • Jody Kreiman (jkreiman@ucla.edu)
  • Jean Schoentgen (jschoent@ulb.ac.be)
Dynamics of Emotional Speech Exchanges in Multimodal Communication

Research devoted to understanding the relationship between verbal and nonverbal communication modes, and investigating the perceptual and cognitive processes involved in the coding/decoding of emotional states is particularly relevant in the fields of Human-Human and Human-Computer Interaction.

When it comes to speech, it is unmistakable that the same linguistic expression may be uttered to tease, challenge, stress, support, inquire, or answer, or to express an authentic doubt. The appropriate continuation of the interaction depends on detecting the addresser’s mood.

To progress towards a better understanding of such interactional facets, more accurate solutions are needed for defining the emotional and empathic contents underpinning daily interactional exchanges, developing signal processing algorithms able to capture emotional features from multimodal social signals, and building mathematical models that integrate emotional behaviour into interaction strategies.

The themes of this special session are multidisciplinary in nature and closely connected in their final aims to identify features from realistic dynamics of emotional speech exchanges. Of particular interest are analyses of visual, textual and audio information and corresponding computational efforts to automatically detect and interpret their semantic and pragmatic contents.

A special issue of the journal Computer Speech and Language is foreseen as an outcome of this special session.

Details can be found on the web page: http://www.empathic-project.eu/index.php/ssinterspeech2019/


  • Anna Esposito (iiass.annaesp@tin.it; anna.esposito@unicampania.it)
  • Maria Inés Torres (manes.torres@ehu.eus)
  • Olga Gordeeva (olga.gordeeva@acapela-group.com)
  • Raquel Justo (raquel.justo@ehu.eus)
  • Zoraida Callejas Carrión (zoraida@ugr.es)
  • Kristiina Jokinen (kristiina.jokinen@aist.go.jp)
  • Gennaro Cordasco (gennaro.cordasco@unicampania.it)
  • Björn Schuller (bjoern.schuller@imperial.ac.uk)
  • Carl Vogel (vogel@cs.tcd.ie)
  • Alessandro Vinciarelli (Alessandro.Vinciarelli@glasgow.ac.uk)
  • Gérard Chollet (gerard.chollet@telecom-paristech.fr)
  • Neil Glackin (neil.glackin@intelligentvoice.com)
Vocal Accommodation in Human-Computer Interaction

This special session will bring together a cross-disciplinary array of scientists who study the phenomenon of vocal accommodation with respect to the human-computer interface from different points of view, such as (a) how human vocal behavior accommodates to machines, (b) how machines cope with vocal plasticity as an effect of such accommodation behavior, (c) how accommodation of machines to human voices can influence human voice processing, or (d) how knowledge about human-human accommodation can enhance machine performance.

By now there is a large body of research showing that humans continuously accommodate their vocal characteristics during communication to each other or to particular communicative situations (e.g., clear speech in adverse listening situations or infant-directed speech). It is unclear to what degree such vocal accommodation occurs when humans communicate with machines (e.g., when speech, or the speaker, is not recognized correctly by the machine) and what effects it may have on recognition performance. Accommodation of human voices to each other can also affect speaker verification applications in civil or forensic contexts. Speech synthesis systems are increasingly able to accommodate to individual human voices, which opens opportunities with respect to the degree to which humans trust machine voices, and to therapeutic effects that machine voices can have. But it also poses threats, ranging from vocal fake news and personality theft to spoofing of voice verification access systems. By applying knowledge about human-human accommodation to machines, it will be possible to make speech synthesis more natural and speech recognition more effective.

The overall aim of this special session is thus to gain a better understanding of an emerging field of related problem sets that will be important to understand speech communication in a world in which human-machine interaction is becoming ubiquitous.

More information at: http://tiny.uzh.ch/TA


  • Volker Dellwo (volker.dellwo@uzh.ch) (MA: Trier; PhD: Bonn)
  • Bernd Möbius (moebius@coli.uni-saarland.de) (MA and PhD: Bonn)
  • Elisa Pellegrino (elisa.pellegrino@uzh.ch) (MA and PhD: Naples)
  • Thayabaran Kathiresan (thayabaran.kathiresan@uzh.ch) (MSc: Barcelona and Stockholm)
Privacy in Speech and Audio Interfaces

While service quality of speech and audio interfaces can be improved using interconnected devices and cloud services, it simultaneously increases the likelihood and impact of threats to the users’ privacy. This special session is focused on understanding the privacy issues that appear in speech and audio interfaces, as well as on the methods we have for retaining a level of privacy which is appropriate for the user.

Contributions to this session are especially invited on:

  • Privacy-preserving processing methods for speech and audio
  • De-identification and obfuscation for speech and audio
  • User-interface design for privacy in speech and audio
  • Studies and resources on the experience and perception of privacy in speech and audio signals
  • Detection of attacks on privacy in speech and audio interfaces

More information at http://speechprivacy2019.aalto.fi


  • Tom Bäckström (tom.backstrom@aalto.fi)
  • Stephan Sigg (stephan.sigg@aalto.fi)
  • Rainer Martin (rainer.martin@ruhr-uni-bochum.de)
Speech Technologies for Code-Switching in Multilingual Communities

Speech technologies exist for many high-resource languages, and attempts are being made to reach the next billion users by building resources and systems for many more languages. Multilingual communities pose many challenges for the design and development of speech processing systems. One of these challenges is code-switching, the switching between two or more languages at the conversation, utterance, and sometimes even word level.

Code-switching is found in text in social media, instant messaging and blogs in multilingual communities in addition to conversational speech. Monolingual natural language and speech systems fail when they encounter code-switched speech and text. There is a lack of data and linguistic resources for code-switched speech and text. Code-switching provides various interesting challenges to the speech community, such as language modeling for mixed languages, acoustic modeling of mixed language speech, pronunciation modeling and language identification from speech.

The third edition of the special session on speech technologies for code-switching will span these topics, in addition to discussions about data and resources for building code-switched systems.

Web site: https://www.microsoft.com/en-us/research/event/interspeech-2019-special-session-speech-techologies-for-code-switching-in-multilingual-communities/

Organizing Committee:

  • Kalika Bali (Researcher, Microsoft Research India: kalikab@microsoft.com)
  • Alan W Black (Professor, Language Technologies Institute, Carnegie Mellon University, USA: awb@cs.cmu.edu)
  • Julia Hirschberg (Professor, Computer Science Department, Columbia University, USA: julia@cs.columbia.edu)
  • Sunayana Sitaram (Senior Applied Scientist, Microsoft Research India: sunayana.sitaram@microsoft.com)
  • Thamar Solorio (Associate Professor, Department of Computer Science, University of Houston, USA: solorio@cs.uh.edu)
Speech and Language Technology for Alexa Conversational AI

Amazon's Alexa offers novel advantages and some disadvantages compared with previously available speech technology platforms. On the positive side, it makes available reliable, scalable, high-quality far-field recognition, hands-free operation, and excellent support for children and non-native speakers. On the negative side, there are currently limited options for tuning speech recognition, and it is in general not possible to retrieve logged user speech, which affects evaluation methodologies.

We invite papers describing experiences and methods when using Alexa as a development platform. Possible topics include but are not limited to the following:

  • What speech and language technologies are relevant to developing nontrivial Alexa apps? (Robust parsing, intention recognition, dialogue management, user modelling, sentiment analysis, speech translation, ontologies, linked data...)
  • How can machine learning be used in Alexa app development?
  • What new interaction possibilities are opened up by the increasingly "invisible" and pervasive nature of the hands-free, far-field recognition platform?
  • How can Alexa be used more widely? (Smart homes, elderly care, entertainment/games, human-robot interaction...)
  • What evaluation methodologies are appropriate for Alexa?
  • How can ethical issues be addressed? (Handling abuse, trust, privacy...)

We particularly encourage submissions which include live demonstrations, and will provide support for a demo session.


  • Oliver Lemon (Heriot-Watt University)
  • Manny Rayner (University of Geneva, FTI/TIM)
The FEARLESS STEPS Challenge: Massive Naturalistic Audio

The NASA Apollo program relied on a massive team of dedicated scientists, engineers, and specialists working seamlessly together to accomplish one of humankind’s greatest technological achievements. The Fearless Steps Initiative by UTD-CRSS has led to the digitization of 19,000 hours of analog audio data and the development of algorithms to extract meaningful information from this multichannel naturalistic data. Further exploring the intricate communication characteristics of problem solving on a scale as complex as going to the moon can lead to the development of novel algorithms beneficial for speech processing and conversational understanding in challenging environments. As an initial step to motivate a streamlined and collaborative effort from the speech and language community, we propose The FEARLESS STEPS (FS-1) Challenge.

Most of the data from the Apollo missions is unlabeled and has thus far motivated the development of unsupervised and semi-supervised speech algorithms. The Challenge Tasks for this session encourage the development of such solutions for core speech and language tasks on data with limited ground truth and low resource availability, and serve as a first step towards extracting high-level information from such massive unlabeled corpora.

This edition of the Fearless Steps Challenge will include all or most of the following tasks:

  1. Speech Activity Detection: SAD
  2. Speaker Diarization: SD
  3. Speaker Identification: SID
  4. Automatic Speech Recognition: ASR
  5. Sentiment Detection: SENTIMENT

The necessary ground truth labels and transcripts will be provided for the training/development set data.
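As a toy illustration of the first task above, the sketch below marks frames as speech-active when their short-time energy exceeds a threshold. This is an assumption-laden sketch, not the challenge baseline; the frame length and threshold are arbitrary example values.

```python
# Toy energy-threshold speech activity detector (SAD) — illustrative only,
# NOT the challenge baseline. Frame length and threshold are made up.

def frame_energies(samples, frame_len=160):
    """Short-time energy of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def energy_sad(samples, frame_len=160, threshold=1.0):
    """Return one boolean per frame: True = speech-active."""
    return [e > threshold for e in frame_energies(samples, frame_len)]

quiet = [0.01] * 160   # near-silence frame
loud = [0.5] * 160     # high-energy frame
print(energy_sad(quiet + loud, threshold=1.0))  # [False, True]
```

Real SAD systems for this kind of noisy, degraded channel data would of course use far more robust features and models; this only illustrates the input/output shape of the task.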

For more information, please visit the release website: https://app.exploreapollo.org/


  • John H.L. Hansen (Center for Robust Speech Systems, Univ. of Texas at Dallas, USA)
  • Abhijeet Sangwan (Speetra Inc., Electrical & Computer Engineering Department at Univ. of Texas at Dallas, USA)
  • Lakshmish Kaushik (Staff Researcher, SONY, Foster City, California, USA)
  • Chengzhu Yu (Researcher, Tencent AI Lab, Seattle, USA)
  • Aditya Joglekar (Univ. of Texas at Dallas, USA)
  • Meena M.C. Shekar (Center for Robust Speech Systems, Univ. of Texas at Dallas, USA)
The Second DIHARD Speech Diarization Challenge (DIHARD II)

The Second DIHARD Speech Diarization Challenge (DIHARD II) is an open challenge on speech diarization in challenging acoustic environments, including meeting speech, child language acquisition data, speech in restaurants, and web video. Whereas DIHARD I focused exclusively on diarization from single-channel recordings, DIHARD II, in conjunction with the organizers of the CHiME challenges, will also include tracks focusing on diarization from multichannel recordings of dinner parties.

Submissions are invited from both academia and industry and may use any dataset (publicly available or proprietary) subject to the challenge rules. Additionally, a development set, which may be used for training, and a baseline system will be provided. Performance will be evaluated using diarization error rate (DER) and a modified version of the Jaccard index. If you are interested and wish to be kept informed, please send an email to the organizers at dihardchallenge@gmail.com and visit the website: https://coml.lscp.ens.fr/dihard/.
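For readers unfamiliar with the metric, diarization error rate is the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below only illustrates that definition; the durations are made-up example values, and the challenge's own scoring tooling should be used for actual submissions.

```python
# Illustrative DER computation from error durations (all in seconds).

def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed speech + false-alarm speech + speaker confusion)
    divided by total reference speech time."""
    return (missed + false_alarm + confusion) / total_speech

# e.g. 30 s missed, 10 s false alarm, 20 s confused, 600 s of speech
print(diarization_error_rate(30.0, 10.0, 20.0, 600.0))  # 0.1, i.e. 10% DER
```

Note that DER is computed after an optimal mapping between system and reference speaker labels, a step this sketch deliberately omits.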


  • Neville Ryant (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA)
  • Alejandrina Cristia (Laboratoire de Sciences Cognitives et Psycholinguistique, ENS, Paris, France)
  • Kenneth Church (Baidu Research, Sunnyvale, CA, USA)
  • Christopher Cieri (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA)
  • Jun Du (University of Science and Technology of China, Hefei, China)
  • Sriram Ganapathy (Electrical Engineering Department, Indian Institute of Science, Bangalore, India)
  • Mark Liberman (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA)
The HOLISTIC challenge at Interspeech 2019

The name of the HOLISTIC challenge is a tribute to the seminal paper “How long is the sentence?” (Grosjean, 1983). The Swiss psycholinguist used gating experiments to show that listeners can estimate the remaining duration of a sentence from its initial (gated) part, without the help of sentential priors. He proposed that this information lies in the prosody of the gated signal. These experimental results have since been confirmed with other methods, such as button-press and brain-response studies. A second source of the challenge's name is the fact that determining the end of a contribution in speech is grounded in many fields of linguistics, from prosody to syntax and semantics.

What about machines? Prediction of End-of-Utterance (EoU), End-of-Sentence (EoS), and End-of-Turn (EoT) is an important component of natural speech and language processing for interactive systems, and in particular for incremental systems that must initiate action even before an utterance is finished (e.g., start moving the head of a robot). It is a prerequisite for timely backchanneling and turn-taking. Several such systems have already been proposed for incremental dialog systems.

The HOLISTIC challenge aims to gather speech scientists and technologists around a common analysis and test bed, where the performance of various predictive models on large data sets can be analysed and shared.

URL: http://timobaumann.github.io/HOLISTIC.


  • Timo Baumann (Language Technologies Institute, Carnegie Mellon University)
  • Gérard Bailly (GIPSA-lab, Grenoble)
Speaker Adaptation and Prosody Modeling in Silent Speech Interfaces

During the last several years, there has been significant interest in articulatory-to-acoustic conversion, a research field often referred to as “Silent Speech Interfaces” (SSI). The main idea is to record the soundless articulatory movement (using EMA, EMG, PMA, ultrasound, lip video, etc.) and to automatically generate speech from the movement information while the subject is not producing any sound. Current SSI systems use either the “direct synthesis” or the “recognition-followed-by-synthesis” principle. Direct synthesis has the advantage of a much smaller delay between articulation and speech generation, which enables conversational use and potential research on human-in-the-loop scenarios.

Within this special session, we call for recent results on multi-speaker, multi-session, and prosody-generation-related articulatory data processing for silent speech synthesis and recognition. Processing articulatory signals requires knowledge from fields further away from speech processing, e.g., biosignal-related 2D/3D image processing, multi-dimensional audio signal processing, and audio-visual synchronization; we therefore invite cross-fertilization from other fields. We encourage you to bring demonstrations of working systems to present along with your paper.

URL: http://smartlab.tmit.bme.hu/interspeech2019-special-session-ssi


  • Gábor Gosztolya (MTA-SZTE Research Group on Artificial Intelligence, Szeged, Hungary)
  • Lorenz Diener (Cognitive Systems Lab, University of Bremen, Germany)
  • Tamás Gábor Csapó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary)
  • Tanja Schultz (Cognitive Systems Lab, University of Bremen, Germany)