WHO IS TALKING? PREPARING THE STUDENTS FOR A DIGITAL FUTURE PERCEPTIONS ON HOW TO HUMANIZE TEXT-TO-SPEECH VOICES

J. Fernandes; P. Duarte

doi:10.21125/iceri.2023.0454

CEOS.PP, ISCAP, Polytechnic of Porto (PORTUGAL)

About this paper:

Appears in: ICERI2023 Proceedings
Publication year: 2023
Pages: 1398-1402
ISBN: 978-84-09-55942-8
ISSN: 2340-1095
doi: 10.21125/iceri.2023.0454

Conference name: 16th annual International Conference of Education, Research and Innovation
Dates: 13-15 November, 2023
Location: Seville, Spain

Abstract:

Classically, perspectives on phonetic and prosodic communication features describe an anthropocentric view of speech. However, the development of generative artificial intelligence (AI) has accelerated the proliferation of text-to-speech tools and virtual voice assistants/intelligent speakers that increasingly compete with human interlocutors in intelligibility and naturalness. AI-generated speakers vary in performance quality depending on the language spoken, and also in the aspects typically entailed in the concept of tone (politeness, empathy, assertiveness, doubt, pause, engagement). Business communication for a digital future implies new paths for language-assisted tasks. Text-to-speech tools can generate a wide range of professional and entrepreneurial voice-based content, such as the narration of different kinds of institutional or commercial videos or other audio content with advertising and promotional aims. However, these artificial outputs require careful analysis and improvement strategies, similar to other post-editing procedures.

In this paper, we share the results of an exploratory study carried out by a group of students within the subject of Linguistics. These students are undergraduates in Administrative Assistance and Translation and must be prepared to meet challenges concerning the use of AI tools to generate business and institutional messages. With this study, our goal is to evaluate the speech quality of two AI voice models in European Portuguese provided by the text-to-speech feature of the video editing tool Clipchamp (https://app.clipchamp.com/) when performing a voice-over task for a promotional message. We also aim to identify strategies to improve synthetic voices by providing them with anthropomorphic characteristics. Accordingly, instead of following a traditional approach and studying an oral message produced by a natural voice, the students began with an analysis of the desirable phonetic and prosodic properties for the voice-over. They proceeded to produce a standard European Portuguese phonetic transcription using the IPA (International Phonetic Alphabet). Subsequently, they used the tool to obtain an audio product. Using this output, they analyzed a set of parameters: intelligibility, naturalness, the accuracy of sound articulation (vowel, semi-vowel and consonantal), pitch adequacy according to the type of accent (affective or intellectual), intensity accents, quality of the duration parameter with respect to the illocutionary objectives, quality of the melodic curves of the different types of sentences, and adequacy of pauses to the communicative intention of the message. They then sought to develop strategies to optimize the phonetic and prosodic quality of the machine's vocal delivery.

Acknowledgement:
This research is financed by Portuguese national funds through FCT – Fundação para a Ciência e Tecnologia, under the project UIDB/05422/2020.

Keywords:

Prosody, tone–voice, text-to-speech tools, naturalness.
