DEEP LEARNING-BASED SPEECH-TO-IMAGE CONVERSION FOR SCIENCE COURSE
Northwest Normal University (CHINA)
About this paper:
Appears in: INTED2021 Proceedings
Publication year: 2021
Pages: 2910-2917
ISBN: 978-84-09-27666-0
ISSN: 2340-1079
DOI: 10.21125/inted.2021.0620
Conference name: 15th International Technology, Education and Development Conference
Dates: 8-9 March, 2021
Location: Online Conference
Abstract:
Motivation:
According to the laws of pupils' cognitive development, intuitive images are better suited to classroom learning than abstract descriptions. In elementary science courses, hands-on activities are the best way to ensure learning outcomes. However, schools in remote rural areas face a shortage of experimental equipment and low scientific literacy among teachers. We therefore propose a speech-to-image conversion framework that converts spoken descriptions into images consistent with the speech semantics. It offers students a path from abstract description to intuitive imagery when learning science, stimulates their interest in learning, and improves the implementation of elementary science courses in remote rural areas.

Method:
The speech-to-image conversion framework consists of a speech recognition module and an image generation module. The speech recognition module transcribes speech into text, and the image generation module then generates images that are semantically consistent with the spoken descriptions, based on the recognized text. In the speech recognition module, we first extract acoustic features from the speech signal and then train the acoustic model with a Transformer-based speech recognition method. In the image generation module, we train a deep convolutional generative adversarial network to convert text descriptions into images. Both the generator network and the discriminator network perform feed-forward inference conditioned on the text features.
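As a rough illustration of the text-conditioned generator and discriminator described above, the following PyTorch sketch concatenates a projected sentence embedding with the noise vector (in the generator) and with the image feature map (in the discriminator). The layer widths, the 1024-dimensional text embedding, the 64x64 output resolution, and all class names are illustrative assumptions not specified in the abstract; the Transformer-based speech recognition module is not sketched here.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=1024, proj_dim=128, ngf=64):
        super().__init__()
        # Project the sentence embedding to a compact conditioning vector.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, proj_dim), nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            # (noise + projected text) -> 4x4 feature map, then upsample to 64x64.
            nn.ConvTranspose2d(noise_dim + proj_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, noise, text_emb):
        cond = self.text_proj(text_emb)
        z = torch.cat([noise, cond], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)  # (batch, 3, 64, 64) image

class Discriminator(nn.Module):
    def __init__(self, text_dim=1024, proj_dim=128, ndf=64):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, proj_dim), nn.LeakyReLU(0.2))
        self.conv = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, True),
        )
        # Fuse image features with the replicated text condition before scoring.
        self.score = nn.Conv2d(ndf * 8 + proj_dim, 1, 4, 1, 0, bias=False)

    def forward(self, image, text_emb):
        feat = self.conv(image)                      # (batch, ndf*8, 4, 4)
        cond = self.text_proj(text_emb)
        cond = cond[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        return torch.sigmoid(self.score(torch.cat([feat, cond], dim=1))).view(-1)

A common training choice for such text-conditioned GANs, not specified in this abstract, is to also show the discriminator real images paired with mismatched text so that it learns semantic consistency in addition to realism.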

Result:
This task requires specific training material consisting of paired speech and images. Unfortunately, no existing database provides a sufficient amount of such data. Therefore, we used the thchs30 corpus and the Oxford-102 database with text descriptions for the proof-of-concept experiments reported in the experimental part. The results show that our framework achieves 79.3% accuracy in speech-to-image conversion. We also conducted a teaching experiment with 20 students in an elementary science course. Under the teacher's guidance, students described the features of an object by speech and generated the corresponding feature images. This allows us to explore whether the approach improves students' interest in learning and their enthusiasm in class.

Conclusion:
This paper proposes a speech-to-image conversion framework for pupils' science courses. The results show that this method can effectively stimulate students' interest in learning and help them accept and master the course content more easily. It can promote the development of elementary science courses in remote rural areas.
Keywords:
Learning and Teaching Methodologies, teaching application, speech to image, speech recognition, image generation.