MANDARIN PROSODY BOUNDARY PREDICTION FOR IMPROVING MANDARIN LEARNING OF NON-NATIVE SPEAKERS
Northwest Normal University (CHINA)
About this paper:
Conference name: 12th International Conference on Education and New Learning Technologies
Dates: 6-7 July, 2020
Location: Online Conference
Abstract:
Non-native Mandarin speakers often make characteristic intonation errors when they speak Mandarin, which stem from the pronunciation habits of their native language. Mandarin prosodic structure is what allows learners to speak Chinese sentences with proper cadence. Predicting prosodic structure from sentences can therefore not only help learners improve their Mandarin but is also key to improving the naturalness of Mandarin speech in text-to-speech (TTS) systems. The higher the accuracy of Mandarin prosody boundary prediction, the more accurate the pronunciation of non-native speakers who use a TTS-based language education system. Most existing research uses statistics-based machine learning methods, especially deep learning techniques such as BiLSTM, to predict prosodic word and prosodic phrase boundaries from Chinese sentences. However, the prediction accuracy is not high, so the synthesized Mandarin speech is not fluent enough. In this work, we propose a method based on a sequence-to-sequence model with an attention mechanism (seq2seq+attention) to improve the prediction accuracy of prosodic boundaries in Chinese sentences. First, we collected a large-scale text corpus of 100,000 Chinese sentences as the training corpus, in which the prosodic word and prosodic phrase boundaries were manually labeled under the guidance of a linguistic expert. We then propose a new feature, the syntactic hierarchical number (SHN), to describe the relationship between the syntactic structure and the prosodic structure of Chinese sentences. Finally, we trained the seq2seq+attention model, which consists of an input layer, an embedding layer, a BiLSTM-based encoder layer, a hidden layer, an LSTM-based decoder layer, and an output layer. The input features are the word embedding concatenated with part-of-speech, word length, and SHN.
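As an illustration of the input features described above, the following minimal sketch concatenates a word embedding with part-of-speech, word length, and SHN for each word. It is not the authors' implementation: the POS tag inventory, the one-hot POS encoding, the toy 3-dimensional embeddings, the function names, and the SHN values are all assumptions made for this example. In particular, the SHN values are placeholders, since the abstract does not specify how SHN is computed.

```python
from typing import Dict, List, Sequence

# Hypothetical POS tag inventory; the tag set used in the paper is not stated.
POS_TAGS = ["n", "v", "r", "a", "d", "u"]


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of the given size."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_input_vector(embedding: Sequence[float], pos: str,
                       word: str, shn: int) -> List[float]:
    """Concatenate embedding, one-hot POS, word length, and SHN."""
    pos_vec = one_hot(POS_TAGS.index(pos), len(POS_TAGS))
    return list(embedding) + pos_vec + [float(len(word)), float(shn)]


def build_sentence_features(words: Sequence[str], pos_tags: Sequence[str],
                            shn_values: Sequence[int],
                            embeddings: Dict[str, Sequence[float]],
                            emb_dim: int) -> List[List[float]]:
    """Build one feature vector per word; unknown words get a zero embedding."""
    if not (len(words) == len(pos_tags) == len(shn_values)):
        raise ValueError("words, pos_tags and shn_values must align")
    zero = [0.0] * emb_dim
    return [
        build_input_vector(embeddings.get(w, zero), p, w, s)
        for w, p, s in zip(words, pos_tags, shn_values)
    ]


# Toy data: embeddings and SHN values are made-up placeholders.
toy_embeddings = {
    "我": [0.1, 0.2, 0.3],
    "喜欢": [0.4, 0.5, 0.6],
    "学习": [0.7, 0.8, 0.9],
    "汉语": [0.2, 0.1, 0.0],
}
words = ["我", "喜欢", "学习", "汉语"]
pos_tags = ["r", "v", "v", "n"]
shn_values = [1, 2, 3, 3]  # placeholder values, not derived from a parser

features = build_sentence_features(words, pos_tags, shn_values,
                                   toy_embeddings, emb_dim=3)
for word, vec in zip(words, features):
    print(word, vec)
```

Each resulting vector has the embedding dimension plus the number of POS tags plus two scalar features, and a sequence of such vectors would be what an encoder consumes in this kind of setup.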
The experimental results show that the seq2seq+attention model with the SHN feature achieves F1-scores of 98.14% for prosodic words and 83.12% for prosodic phrases. The prosodic phrase F1-score increases by 0.24% compared with the seq2seq+attention model without SHN and by 7.02% compared with another method. The proposed method can therefore be applied to Mandarin education with artificial intelligence technologies, using speech synthesis to reduce the influence of native-language pronunciation and improve the fluency of spoken Mandarin.
Keywords:
TTS, Mandarin education, prosody boundary, attention, syntactic feature.
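The F1-scores reported in the abstract can be understood through the standard definition of F1 as the harmonic mean of precision and recall. The sketch below computes it per boundary label over aligned gold and predicted label sequences. The label names ("NB" for no boundary, "PW" for prosodic word, "PPH" for prosodic phrase), the one-label-per-word-gap encoding, and the function name are assumptions for this example; the abstract does not state the exact evaluation protocol.

```python
from typing import Sequence, Tuple


def boundary_f1(gold: Sequence[str], pred: Sequence[str],
                label: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one boundary label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy sequences: one label per gap between adjacent words (made-up data).
gold = ["NB", "PW", "NB", "PPH", "PW", "PPH"]
pred = ["NB", "PW", "PW", "PPH", "NB", "PPH"]

pw_scores = boundary_f1(gold, pred, "PW")
pph_scores = boundary_f1(gold, pred, "PPH")
print("PW  (P, R, F1):", pw_scores)
print("PPH (P, R, F1):", pph_scores)
```

In this toy case the prosodic word label has one correct prediction, one false alarm, and one miss, while both prosodic phrase boundaries are predicted correctly.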