Extracting a Chinese Learner Corpus from the Web: Grammatical Error Correction for Learning Chinese as a Foreign Language with Statistical Machine Translation
Abstract
In this paper, we describe the TMU system for the shared task on Grammatical Error Diagnosis for Learning Chinese as a Foreign Language (CFL) at NLP-TEA1. A main obstacle to grammatical error correction for CFL is the data bottleneck: the Chinese learner corpus at hand (the NTNU learner corpus) contains only 1,208 sentences in total, which is insufficient for supervised learning-based techniques. To overcome this problem, we extract a large-scale Chinese learner corpus from the language exchange site Lang-8, yielding 95,706 sentences (two million words). We use it as a parallel corpus for a phrase-based statistical machine translation (SMT) system, which translates learner sentences into correct sentences.
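As a rough illustration of what "using a learner corpus as a parallel corpus" can look like in practice, the sketch below turns (learner sentence, corrected sentence) pairs into two line-aligned lists, the layout that phrase-based SMT toolkits commonly expect for source and target sides. It is not the system described in the paper: the function name `build_parallel_lines`, the whitespace normalization, the rule of dropping pairs with an empty side, and the example sentences are all assumptions made for illustration only.

```python
def build_parallel_lines(pairs):
    """Turn (learner, corrected) sentence pairs into line-aligned lists.

    Hypothetical helper, not taken from the paper. Line i of the source
    list corresponds to line i of the target list.
    """
    src_lines, tgt_lines = [], []
    for learner, corrected in pairs:
        # Collapse runs of whitespace so each sentence stays on one line.
        learner = " ".join(learner.split())
        corrected = " ".join(corrected.split())
        # Assumption: a pair with an empty side carries no usable signal.
        if not learner or not corrected:
            continue
        src_lines.append(learner)
        tgt_lines.append(corrected)
    return src_lines, tgt_lines


def write_lines(lines, path):
    """Write one sentence per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


# Invented example pairs (not drawn from Lang-8 or the NTNU corpus).
example_pairs = [
    ("我 是 很 高兴", "我 很 高兴"),
    ("   ", "这 是 书"),          # dropped: empty learner side
    ("他 去 学校\n了", "他 去 学校 了"),
]
source_side, target_side = build_parallel_lines(example_pairs)
```

Calling `write_lines(source_side, "train.learner")` and `write_lines(target_side, "train.correct")` would then produce a file pair with matching line counts; the file names here are placeholders.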
Published
2014-11-30
Conference Proceedings Volume
Section
Articles
How to Cite
Extracting a Chinese Learner Corpus from the Web: Grammatical Error Correction for Learning Chinese as a Foreign Language with Statistical Machine Translation. (2014). International Conference on Computers in Education. https://library.apsce.net/index.php/ICCE/article/view/3070