A CORPUS FOR THE STUDY ON THE ASSESSMENT OF MANDARIN PRONUNCIATION OF TIBETAN SPEAKERS
Northwest Normal University (CHINA)
About this paper:
Appears in: INTED2020 Proceedings
Publication year: 2020
Pages: 7840-7848
ISBN: 978-84-09-17939-8
ISSN: 2340-1079
doi: 10.21125/inted.2020.2135
Conference name: 14th International Technology, Education and Development Conference
Dates: 2-4 March, 2020
Location: Valencia, Spain
Abstract:
When speaking Mandarin, Tibetan speakers consistently make certain fixed types of pronunciation errors that stem from the pronunciation habits of their native language. An assessment system that can detect mispronunciations and measure the overall similarity of syllables or phonemes in the Mandarin of Tibetan speakers is therefore needed to help learners improve their Mandarin. Studying the assessment of Tibetan speakers' Mandarin pronunciation requires a dedicated corpus, but no such corpus currently exists. We create one by integrating the linguistic theory of Tibetan and Chinese with speech signal processing and machine learning. In this work, we record both the non-standard Mandarin of Tibetan students and standard Mandarin. Both sets of recordings use the same text, which was designed by analyzing and comparing the pronunciation characteristics of Tibetan and Chinese. The recordings total 5.5 hours and contain 1000 paragraphs, covering 377 toneless syllables and all phonemes of standard Chinese. We then describe the recording environment and recording equipment. Furthermore, we define rules for annotating the recordings in a hierarchical format using the Praat software: the first layer is the phrase layer, marked with Chinese characters; the second layer is the syllable layer, marked with pinyin; and the third layer is the phoneme layer, labeled with the Speech Assessment Methods Phonetic Alphabet-Tibetan Standard Chinese (SAMPA-TSC), which we designed ourselves. Finally, we evaluate the corpus in four aspects (coverage, completeness, quality, and reusability) and describe the potential applications of the dataset.
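The three-layer annotation described in the abstract can be pictured as nested time-aligned tiers. The following is only a minimal sketch under stated assumptions, not the authors' tooling: the `Interval` and `Tier` classes, the `is_nested` helper, the timestamps, and the example utterance are all hypothetical, and the phoneme labels are plain pinyin-style placeholders rather than actual SAMPA-TSC symbols, which the abstract does not list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interval:
    start: float  # seconds
    end: float    # seconds
    label: str


@dataclass
class Tier:
    name: str
    intervals: List[Interval]


def is_nested(child: Tier, parent: Tier, tol: float = 1e-6) -> bool:
    """Return True if every child interval lies inside some parent interval."""
    return all(
        any(p.start - tol <= c.start and c.end <= p.end + tol
            for p in parent.intervals)
        for c in child.intervals
    )


# Hypothetical example utterance; all times and labels are illustrative only.
phrase = Tier("phrase", [Interval(0.0, 0.9, "你好")])          # Chinese characters
syllable = Tier("syllable", [Interval(0.0, 0.4, "ni"),          # pinyin
                             Interval(0.4, 0.9, "hao")])
phoneme = Tier("phoneme", [Interval(0.0, 0.15, "n"),            # placeholder labels,
                           Interval(0.15, 0.4, "i"),            # NOT real SAMPA-TSC
                           Interval(0.4, 0.6, "h"),
                           Interval(0.6, 0.9, "ao")])

# Check that phonemes sit inside syllables and syllables inside phrases.
hierarchy_ok = is_nested(syllable, phrase) and is_nested(phoneme, syllable)
```

A check of this kind is one way to catch annotation slips, such as a phoneme boundary that crosses a syllable boundary, before the tiers are used for pronunciation assessment.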
Keywords:
Tibetan speaker Mandarin, pronunciation assessment, audio recording dataset, SAMPA-TSC.