LLM-enhanced Math Text Extracting and Keyword-based Hierarchical Labeling for Digital Learning Infrastructure

Authors

  • Taisei YAMAUCHI, Graduate School of Informatics, Kyoto University, Japan
  • Brendan FLANAGAN, Institute for Liberal Arts and Sciences, Kyoto University, Japan; Academic Center for Computing and Media Studies, Kyoto University, Japan
  • Hiroaki OGATA, Academic Center for Computing and Media Studies, Kyoto University, Japan

Abstract

As education becomes increasingly digitized, organizing digital learning materials into curriculum-based units is essential but often burdensome for educators. This study proposes an automatic method for labeling mathematics problems in PDF format using large language models (LLMs) and curriculum keywords. Using OpenAI's o4-mini, we extracted text from MEXT-approved junior high school textbooks and exercise books with high accuracy (98.9% and 99.7%, respectively). Unit labels were then assigned by combining keyword-based filtering with embedding similarity computed with the text-embedding-3-small model. Compared with a baseline without keyword filtering, expert evaluation favored the keyword-based method (183 vs. 132 cases), confirming that keywords enhance classification accuracy. These results demonstrate that LLM-based extraction is practical for classroom use, requiring only minor manual corrections, and that unit-specific vocabulary contributes to accurate hierarchical labeling. Future work will extend this framework toward content management of LLM-generated materials and unified log analysis to promote personalized learning pathways.
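
The labeling step described in the abstract (keyword-based filtering followed by embedding similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. The unit names, keyword lists, and the `label_problem` helper are hypothetical, and `toy_embed` is a bag-of-words stand-in used so the example runs offline; a real pipeline would presumably replace it with vectors from an embedding model such as text-embedding-3-small. The fallback to all units when no keyword matches is also an assumption.

```python
import math
import re
from collections import Counter

# Hypothetical curriculum units and keywords (illustrative only, not from the paper).
UNIT_KEYWORDS = {
    "Linear equations": ["equation", "solve", "unknown"],
    "Linear functions": ["function", "slope", "graph"],
    "Plane figures": ["triangle", "angle", "congruent"],
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_problem(problem_text, unit_keywords=UNIT_KEYWORDS, embed=toy_embed):
    """Assign one unit label: keyword filter first, then similarity ranking."""
    tokens = set(tokenize(problem_text))

    # Step 1: keep only units that share at least one keyword with the problem.
    candidates = [u for u, kws in unit_keywords.items() if tokens & set(kws)]

    # Assumption: if no keyword matches, fall back to ranking all units.
    if not candidates:
        candidates = list(unit_keywords)

    # Step 2: rank the remaining candidates by embedding similarity.
    problem_vec = embed(problem_text)

    def score(unit):
        unit_text = unit + " " + " ".join(unit_keywords[unit])
        return cosine(problem_vec, embed(unit_text))

    return max(candidates, key=score)
```

For example, `label_problem("Find the slope of the graph of the function y = 2x + 1")` passes only the "Linear functions" unit through the keyword filter, so that unit is returned. The baseline mentioned in the abstract would correspond to skipping the first step and ranking every unit by similarity alone.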

Published

2025-12-01

How to Cite

LLM-enhanced Math Text Extracting and Keyword-based Hierarchical Labeling for Digital Learning Infrastructure. (2025). International Conference on Computers in Education. https://library.apsce.net/index.php/ICCE/article/view/5664