A PROPOSAL TO CREATE A PSEUDO-PARALLEL TEXT CORPUS FOR SIMPLIFYING JAPANESE USING DTW
Kobe University (JAPAN)
About this paper:
Appears in: INTED2023 Proceedings
Publication year: 2023
Pages: 6542-6550
ISBN: 978-84-09-49026-4
ISSN: 2340-1079
doi: 10.21125/inted.2023.1745
Conference name: 17th International Technology, Education and Development Conference
Dates: 6-8 March, 2023
Location: Valencia, Spain
Abstract:
Text simplification is the task of transforming a complex text into a simple one while preserving its meaning. It aids second-language learners and helps children and people with language disabilities comprehend texts.

In recent years, many approaches to text simplification have used natural language processing techniques such as machine translation and text generation. Machine translation requires a large parallel corpus. Because there are many language learners, various parallel text corpora exist between different languages, such as Japanese and English. However, far fewer corpora pair normal and simplified texts in the same language, and creating such a corpus is very expensive. In this study, we propose a method for generating a pseudo-parallel corpus for Japanese text simplification.

As a resource for the corpus, we picked two related news sites. One is a normal news site (https://www3.nhk.or.jp/news/), and the other (https://www3.nhk.or.jp/news/easy/) provides simplified articles translated by hand from the normal news site; we hereafter call it the simple news site. In our study, we generate a parallel text corpus by selecting a sentence from the normal news site and finding the corresponding one on the simple news site.

The precise procedure for generating the corpus is as follows. First, we split the text from each news site into words and convert the words into vectors. The word vectors are computed by a Word2Vec model trained on news articles. To reduce the computational complexity, we compress the 100-dimensional Word2Vec vectors to 30 dimensions using principal component analysis. The cumulative contribution ratio of the 30-dimensional vectors is 94.8%, so their explanatory power is sufficiently high.
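The compression step above can be sketched as follows. This is a minimal illustration using NumPy's SVD to perform PCA, with random vectors standing in for the trained Word2Vec embeddings; the 94.8% figure applies to the paper's actual data, not to this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the Word2Vec word vectors (the paper trains Word2Vec on
# news articles); here we use 500 random 100-dimensional vectors.
vectors = rng.normal(size=(500, 100))

# Principal component analysis via SVD on the mean-centered data.
centered = vectors - vectors.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)

k = 30  # target dimensionality used in the paper
compressed = centered @ vt[:k].T  # project onto the top-k components

# Cumulative contribution (explained-variance) ratio of the top k components.
ratio = (s[:k] ** 2).sum() / (s ** 2).sum()
```

The same projection matrix `vt[:k]` would then be applied to every word vector before alignment, so distances are computed in the 30-dimensional space.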

Then, we apply Dynamic Time Warping (DTW) to find correspondences between sentences. DTW is a method for finding the minimum-distance alignment path between two time series. Here, we treat the sequence of word vectors of a sentence as time-series data, so the resulting path can be regarded as a word-to-word correspondence. Finally, we define the pseudo-parallel text corpus as the set of pairs consisting of a sentence from the normal news site and the corresponding word sequence from the simple news site.
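The alignment step can be sketched as a standard DTW over word-vector sequences. This is a minimal NumPy implementation assuming Euclidean distance between word vectors; the paper does not specify its exact distance function or implementation, so treat this as an illustration.

```python
import numpy as np

def dtw(a, b):
    """Align two sequences of word vectors with Dynamic Time Warping.

    Returns the accumulated distance and the warping path, i.e. the
    word-to-word correspondence between the two sentences.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # word-vector distance
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match both words
                                 cost[i - 1, j],      # extra word in a
                                 cost[i, j - 1])      # extra word in b
    # Backtrack from (n, m) to (1, 1) to recover the alignment path.
    path, (i, j) = [(n - 1, m - 1)], (n, m)
    while (i, j) != (1, 1):
        moves = []
        if i > 1 and j > 1:
            moves.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 1:
            moves.append((cost[i - 1, j], i - 1, j))
        if j > 1:
            moves.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((i - 1, j - 1))
    return cost[n, m], path[::-1]

# Toy example: the second sequence repeats its middle "word", as a
# simplified sentence might; DTW maps both copies to the same word.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
dist, path = dtw(a, b)
```

In the paper's setting, `a` and `b` would be the 30-dimensional vector sequences of a normal-site sentence and a candidate simple-site sentence, and `path` gives the word correspondence used to build the pseudo-parallel pair.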

To confirm that the meanings of the sentence pairs in the corpus are preserved, we measured sentence-pair similarity with both DTW and Word Mover's Distance (WMD) and compared the two. DTW similarity increases as the word order matches, whereas WMD is insensitive to word order: the closer the two vector distributions, the higher the WMD similarity. In short, DTW and WMD are different indices. Nevertheless, the correlation coefficient between them is 0.83, indicating a strong correlation. From these experiments, we conclude that the proposed method of generating a pseudo-parallel text corpus is useful.
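The evaluation amounts to computing a Pearson correlation between the two similarity indices over the sentence pairs. A minimal sketch follows; the scores below are made-up placeholders (a real WMD computation needs an optimal-transport solver, e.g. as in gensim's `wmdistance`, and is omitted), so the correlation here is illustrative and not the paper's 0.83.

```python
import numpy as np

# Hypothetical similarity scores for the same six sentence pairs under
# the two measures (in the paper: DTW-based and WMD-based similarity).
dtw_sim = np.array([0.91, 0.55, 0.78, 0.30, 0.84, 0.62])
wmd_sim = np.array([0.88, 0.60, 0.70, 0.35, 0.90, 0.58])

# Pearson correlation coefficient between the two indices; the paper
# reports 0.83 on its generated corpus.
r = np.corrcoef(dtw_sim, wmd_sim)[0, 1]
```

A strong positive `r` indicates that, despite measuring different properties (order-sensitive vs. distribution-based similarity), the two indices agree on which pairs preserve meaning.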
Keywords:
Text Simplification, Dynamic Time Warping, Word Mover's Distance, Japanese Language Learner.