Applied Sciences, Vol. 13, Pages 10818: Chinese Text De-Colloquialization Technique Based on Back-Translation Strategy and End-to-End Learning



Applied Sciences doi: 10.3390/app131910818

Authors: Hongkai Liu, Zhonglin Ye, Haixing Zhao, Yanlin Yang

With the growth of the Internet, the volume of textual information of all kinds has increased significantly. When people write formal texts, however, they often carry over their colloquial habits, which diminishes the professionalism and formality of the text. Existing research on Chinese text correction focuses primarily on misspelt characters that are visually or phonetically similar and on obvious grammatical errors such as redundancy, omissions, and incorrect word order. Little research addresses text that contains colloquial expressions but no apparent grammatical errors or misspelt characters. This article proposes a novel technique that uses deep learning to transform colloquial expressions directly into formal written expressions. First, a parallel corpus of written and spoken language is constructed using a back-translation strategy. Then, an end-to-end learning mechanism based on neural machine translation is employed, with colloquial text as the source language and written text as the target language, so that the model transforms colloquial text directly into text with a formal style. Finally, the approach is evaluated using the bilingual evaluation understudy (BLEU) metric and manual assessment. The experimental results show that the proposed technique performs well on the de-colloquialization of Chinese texts. The paper makes two contributions. It proposes an automated method for collecting a substitute for manually annotated parallel corpora of spoken and written language, which saves considerable time and reduces the manual cost of constructing the dataset. It also applies end-to-end learning techniques from neural machine translation to de-colloquialization, so that the trained model can flexibly generate written language directly from spoken-language input, offering a novel solution for the de-colloquialization of Chinese text.
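
The two mechanical steps named in the abstract, building a parallel corpus by back-translation and scoring output with BLEU, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not state the pivot language, the translation system, which side of each pair is generated, or the BLEU variant and tokenization. The sketch assumes that formal written sentences are round-tripped through some pivot language and that the round-tripped text serves as the source side, and it assumes character-level, single-reference, unsmoothed BLEU. The names `build_parallel_pairs`, `to_pivot`, `from_pivot`, and `char_bleu` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def build_parallel_pairs(
    written: List[str],
    to_pivot: Callable[[str], str],    # hypothetical: written Chinese -> pivot language
    from_pivot: Callable[[str], str],  # hypothetical: pivot language -> Chinese
) -> List[Tuple[str, str]]:
    """Round-trip each written sentence through a pivot language.

    Returns (source, target) pairs in which the source is the
    round-tripped text and the target is the original written sentence.
    Assumption: pairs whose round trip reproduces the input exactly are
    dropped, since they carry no style difference to learn from.
    """
    pairs = []
    for sentence in written:
        round_tripped = from_pivot(to_pivot(sentence))
        if round_tripped and round_tripped != sentence:
            pairs.append((round_tripped, sentence))
    return pairs


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU over characters, one reference, no smoothing."""
    hyp, ref = list(hypothesis), list(reference)
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

In practice, `to_pivot` and `from_pivot` would wrap whatever translation system is available, and corpus-level BLEU from an established toolkit is usually preferable to a hand-written sentence-level score; the version above is only meant to make the metric's ingredients (clipped n-gram precision and the brevity penalty) concrete.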
