ASSESSING MANDARIN PRONUNCIATION FOR TIBETAN LEARNER BY ARTIFICIAL INTELLIGENT SPEECH TECHNOLOGY
Northwest Normal University (CHINA)
About this paper:
Appears in: INTED2021 Proceedings
Publication year: 2021
Pages: 3088-3094
ISBN: 978-84-09-27666-0
ISSN: 2340-1079
doi: 10.21125/inted.2021.0653
Conference name: 15th International Technology, Education and Development Conference
Dates: 8-9 March, 2021
Location: Online Conference
Abstract:
Motivation:
Mandarin is the national standard language of China, so it needs to be popularized in ethnic minority areas. Tibetans are one of the more populous ethnic minorities, and many live in remote areas. A shortage of qualified teachers makes it difficult for Tibetan students to learn Mandarin. Various mispronunciation detection systems based on speech recognition algorithms have been proposed, but none provides personalized feedback on the assessment results in the learner's native language. Therefore, this paper proposes a method to assist Tibetan students' Mandarin pronunciation using artificial intelligence speech technology.

Method:
We used a hybrid Connectionist Temporal Classification (CTC) and self-attention speech recognition model to convert speech features into phonemes for detecting pronunciation errors. Emotional speech synthesis is then used to deliver the results to the learner as feedback. This stage has four steps. First, we trained the Tacotron2 model on a large-scale neutral Mandarin corpus and a small-scale neutral Tibetan corpus. Second, we added a text analyzer at the front end to obtain the emotion label of the input sentence. Third, we fine-tuned the trained Tacotron2 model on a small amount of emotional speech data, with the emotion label added to the Pre-net of the model. Finally, the resulting model can synthesize Tibetan-Mandarin speech with positive and negative emotion.
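As an illustration of how a hybrid CTC and attention recognizer can combine its two branches, the sketch below rescores candidate phoneme sequences with a weighted sum of the two log-probabilities. This is the commonly used joint-decoding interpolation, not necessarily the exact procedure of this paper. The function names, the `ctc_weight` parameter name, the candidate phoneme labels, and the scores are all hypothetical.

```python
from typing import List, Sequence, Tuple

# One hypothesis: (phoneme sequence, CTC log-prob, attention log-prob).
Hypothesis = Tuple[Sequence[str], float, float]


def hybrid_score(ctc_logp: float, att_logp: float, ctc_weight: float) -> float:
    """Interpolate the CTC and attention-decoder log-probabilities.

    ctc_weight = 0 uses the attention decoder only; 1 uses CTC only.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp


def rescore(hypotheses: List[Hypothesis], ctc_weight: float = 0.2) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score."""
    return max(hypotheses, key=lambda h: hybrid_score(h[1], h[2], ctc_weight))


# Made-up candidates and scores, for illustration only.
candidates: List[Hypothesis] = [
    (("n", "i3", "h", "ao3"), -4.0, -2.0),
    (("l", "i3", "h", "ao3"), -1.0, -3.0),
]
best = rescore(candidates, ctc_weight=0.2)
print(" ".join(best[0]))
```

With a small CTC weight the attention score dominates the ranking; raising the weight shifts the decision toward the CTC branch, which is why the weight is usually tuned on held-out data.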

Result:
The word error rate of the hybrid CTC speech recognition model is 26.5% when the CTC decoding weight is 0.2. In the subjective evaluation, the synthesized emotional speech achieves an emotional mean opinion score of 4.0. We synthesized Mandarin utterances containing a few mispronunciations and played them repeatedly to 20 Tibetan students. After listening for a while, the students were asked to reread the original material. We found that the pronunciation error rate decreased and accuracy improved. Further, the 20 Tibetan students were randomly divided into two groups of 10. The first group received personalized feedback in Mandarin and Tibetan after the assessment, while the second group received neutral Mandarin feedback. The number of correct pronunciations in the first group increased by 12.2%.
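For readers unfamiliar with the metric, an error rate of this kind is conventionally computed as the edit distance between the reference and recognized token sequences, divided by the reference length. The sketch below shows that standard calculation. The tokenization into pinyin-like units and the example sequences are assumptions for illustration, not data from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one of four reference units is mispronounced.
reference = "zh ong1 g uo2".split()
recognized = "z ong1 g uo2".split()
print(f"{error_rate(reference, recognized):.2%}")
```

The same alignment can be applied to the canonical and recognized phoneme sequences of a learner's utterance, where each substitution, insertion, or deletion marks a candidate mispronunciation.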

Conclusion:
We built a systematic modeling framework for detecting Mandarin mispronunciation by Tibetan students. It can be applied to Mandarin education for Tibetan learners and can effectively assist Tibetan students in learning Mandarin. The results show that Mandarin pronunciation accuracy improves when Mandarin-Tibetan emotional speech is used for positive and negative feedback.
Keywords:
Mandarin education, emotional speech synthesis, minority language speech synthesis, speech recognition, language learning.