Comparing human-labeled and AI-labeled speech datasets for TTS

Abstract

As the output quality of neural networks for automatic speech recognition (ASR) and text-to-speech (TTS) continues to improve, new opportunities arise to train models in a weakly supervised fashion, minimizing the manual effort required to annotate new audio data for supervised training. While weak supervision has recently shown very promising results for ASR, it has not yet been thoroughly investigated for speech synthesis, even though TTS training requires the same dataset structure of aligned audio-transcript pairs.
In this work, we compare the performance of TTS models trained on a well-curated, manually labeled dataset with that of models trained on the same audio data but with text labels generated by grapheme- and phoneme-based ASR models. Phoneme-based approaches seem especially promising: even when a phoneme is predicted incorrectly, the resulting word is more likely to sound similar to the originally spoken word than with grapheme-based predictions.
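As a rough illustration of this weak-labeling step, the sketch below transcribes unlabeled audio files with a pretrained ASR model to produce aligned audio-transcript pairs. The Hugging Face pipeline API usage, the checkpoint names, and the directory layout are illustrative assumptions, not the exact setup used in this work.

    # Hedged sketch: weak labeling of audio with a pretrained ASR model.
    # Checkpoint names and directory layout are assumptions for illustration.
    from pathlib import Path
    from transformers import pipeline

    # Grapheme-based CTC model; a phoneme-level checkpoint (e.g.
    # facebook/wav2vec2-lv-60-espeak-cv-ft) could be swapped in to obtain
    # phoneme labels instead.
    asr = pipeline("automatic-speech-recognition",
                   model="facebook/wav2vec2-base-960h")

    for wav in sorted(Path("unlabeled_audio").glob("*.wav")):
        transcript = asr(str(wav))["text"]              # AI-generated (weak) label
        wav.with_suffix(".txt").write_text(transcript)  # store aligned pair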
For evaluation and ranking, we generate synthesized audio from all trained models using input texts sourced from a selection of speech recognition datasets covering a wide range of application domains. These synthesized outputs are then fed into multiple state-of-the-art ASR models, whose text predictions are compared to the original TTS input texts. This comparison enables an objective assessment of the intelligibility of all TTS models' outputs using metrics such as word error rate (WER) and character error rate (CER).
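To make the metric concrete, here is a minimal sketch of the ASR-based WER/CER computation, assuming the jiwer library; synthesize and transcribe are hypothetical stand-ins for the TTS and ASR systems under evaluation.

    # Hedged sketch: ASR-based intelligibility evaluation of a TTS model.
    # `synthesize` and `transcribe` are hypothetical stand-in functions.
    import jiwer

    def intelligibility(input_texts, synthesize, transcribe):
        """Corpus-level WER and CER between TTS inputs and ASR transcripts."""
        hypotheses = [transcribe(synthesize(text)) for text in input_texts]
        return (jiwer.wer(input_texts, hypotheses),
                jiwer.cer(input_texts, hypotheses))

Lower WER/CER on the round-tripped audio indicates higher intelligibility of the TTS output, independent of subjective listening tests.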
Our results show that models trained on weakly supervised, AI-generated labels not only achieve quality comparable to models trained on manually labeled datasets but can even outperform them, including on small, well-curated speech datasets. These findings suggest that creating labeled datasets for supervised TTS training may no longer require manual annotation and can be fully automated.

More about this title

Title: Comparing human-labeled and AI-labeled speech datasets for TTS
Published in: 4th European Conference on the Impact of Artificial Intelligence and Robotics (ICAIR 2024)
Volume: 2024
Authors/editors: Johannes Wirth, Prof. Dr. René Peinl
Publication date: 2024-12-05
Citation: Wirth, Johannes; Peinl, René (2024): Comparing human-labeled and AI-labeled speech datasets for TTS. 4th European Conference on the Impact of Artificial Intelligence and Robotics (ICAIR 2024).