Extracting a Chinese Learner Corpus from the Web: Grammatical Error Correction for Learning Chinese as a Foreign Language with Statistical Machine Translation

Yinchen ZHAO; Mamoru KOMACHI; Hiroshi ISHIKAWA

Authors

Yinchen ZHAO, Graduate School of System Design, Tokyo Metropolitan University, Japan
Mamoru KOMACHI, Graduate School of System Design, Tokyo Metropolitan University, Japan
Hiroshi ISHIKAWA, Graduate School of System Design, Tokyo Metropolitan University, Japan

Abstract

In this paper, we describe the TMU system for the shared task on Grammatical Error Diagnosis for Learning Chinese as a Foreign Language (CFL) at NLP-TEA1. A main obstacle to grammatical error correction for CFL is the data bottleneck: the Chinese learner corpus at hand (the NTNU learner corpus) contains only 1,208 sentences in total, which is insufficient for supervised learning-based techniques. To overcome this problem, we extract a large-scale Chinese learner corpus from the language exchange site Lang-8, yielding 95,706 sentences (two million words). We use it as a parallel corpus for a phrase-based statistical machine translation (SMT) system, which translates learner sentences into correct sentences.
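
As a rough illustration of what "using a learner corpus as a parallel corpus" can look like, the sketch below turns (learner sentence, corrected sentence) pairs into two line-aligned lists, the usual input shape for phrase-based SMT toolkits. It is not the authors' pipeline: the function names `char_tokenize` and `build_parallel_corpus`, the character-level tokenization (a stand-in for a real Chinese word segmenter), and the rule of dropping pairs with an empty side are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def char_tokenize(sentence: str) -> str:
    """Split a sentence into space-separated characters.

    Assumption: character-level tokens stand in for the output of a
    real Chinese word segmenter, which the abstract does not specify.
    """
    return " ".join(ch for ch in sentence if not ch.isspace())


def build_parallel_corpus(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Build line-aligned source (learner) and target (corrected) lists.

    Pairs in which either side is empty after tokenization are skipped
    (a hypothetical filtering rule, chosen only for this sketch).
    """
    source_lines: List[str] = []
    target_lines: List[str] = []
    for learner, corrected in pairs:
        src = char_tokenize(learner)
        tgt = char_tokenize(corrected)
        if not src or not tgt:
            continue  # keep the two sides strictly line-aligned
        source_lines.append(src)
        target_lines.append(tgt)
    return source_lines, target_lines
```

Line `i` of the first list and line `i` of the second list form one training pair; writing each list to its own file, one sentence per line, gives the source and target sides of the corpus.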

Published

2014-11-30

Conference Proceedings Volume

2014: ICCE 2014: The 22nd International Conference on Computers in Education

Section

Articles

How to Cite

Extracting a Chinese Learner Corpus from the Web: Grammatical Error Correction for Learning Chinese as a Foreign Language with Statistical Machine Translation. (2014). International Conference on Computers in Education. https://library.apsce.net/index.php/ICCE/article/view/3070
