Extracting a Chinese Learner Corpus from the Web: Grammatical Error Correction for Learning Chinese as a Foreign Language with Statistical Machine Translation

Authors

  • Yinchen ZHAO, Graduate School of System Design, Tokyo Metropolitan University, Japan
  • Mamoru KOMACHI, Graduate School of System Design, Tokyo Metropolitan University, Japan
  • Hiroshi ISHIKAWA, Graduate School of System Design, Tokyo Metropolitan University, Japan

Abstract

In this paper, we describe the TMU system for the shared task on Grammatical Error Diagnosis for Learning Chinese as a Foreign Language (CFL) at NLP-TEA1. One of the main obstacles in grammatical error correction for CFL is the data bottleneck: the Chinese learner corpus at hand (the NTNU learner corpus) contains only 1,208 sentences in total, which is insufficient for supervised learning-based techniques. To overcome this problem, we extract a large-scale Chinese learner corpus from the language exchange site Lang-8, yielding 95,706 sentences (two million words). We use it as a parallel corpus for a phrase-based statistical machine translation (SMT) system that translates learner sentences into corrected sentences.
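
To make the parallel-corpus idea concrete, the following is a minimal sketch of how learner sentences and their corrections might be arranged as line-aligned source and target sides, the usual input layout for phrase-based SMT training. The abstract does not describe the preprocessing, so everything here is an assumption for illustration: the function names are hypothetical, character-level tokenization stands in for whatever segmentation the system actually uses, and the example pair is invented rather than taken from Lang-8.

```python
from typing import Iterable, List, Tuple


def char_tokenize(sentence: str) -> str:
    """Split a sentence into space-separated characters.

    Assumption: character-level tokenization is used here only because it
    needs no external word segmenter; the abstract does not say which
    segmentation was used.
    """
    return " ".join(ch for ch in sentence if not ch.isspace())


def build_parallel_corpus(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Turn (learner, corrected) pairs into line-aligned source/target lists.

    Pairs with an empty side are skipped so both lists stay aligned.
    """
    source_lines: List[str] = []
    target_lines: List[str] = []
    for learner, corrected in pairs:
        learner, corrected = learner.strip(), corrected.strip()
        if not learner or not corrected:
            continue  # drop incomplete pairs
        source_lines.append(char_tokenize(learner))
        target_lines.append(char_tokenize(corrected))
    return source_lines, target_lines


# Invented example pair (not from the Lang-8 data): a redundant copula error.
example_pairs = [
    ("我是很高兴。", "我很高兴。"),
    ("", "没有原句"),  # skipped: the learner side is empty
]
source_side, target_side = build_parallel_corpus(example_pairs)
```

Writing `source_side` and `target_side` to two files, one sentence per line, gives the aligned layout that phrase-based SMT toolkits generally expect, with the learner text as the source language and the corrected text as the target language.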

Published

2014-11-30

How to Cite

Zhao, Y., Komachi, M., & Ishikawa, H. (2014). Extracting a Chinese Learner Corpus from the Web: Grammatical Error Correction for Learning Chinese as a Foreign Language with Statistical Machine Translation. International Conference on Computers in Education. https://library.apsce.net/index.php/ICCE/article/view/3070