LLM-enhanced Math Text Extracting and Keyword-based Hierarchical Labeling for Digital Learning Infrastructure
Abstract
As education becomes increasingly digitized, organizing digital learning materials into curriculum-based units is essential but often burdensome for educators. This study proposes an automatic method for labeling mathematics problems in PDF format using large language models (LLMs) and curriculum keywords. Using OpenAI's o4-mini, we extracted text from MEXT-approved junior high school textbooks and exercise books with high accuracy (98.9% and 99.7%). Unit labels were then assigned by combining keyword-based filtering with embedding similarity (text-embedding-3-small). Compared with a baseline without keyword filtering, expert evaluation favored the keyword-based method (183 vs. 132 cases), confirming that keywords enhance classification accuracy. These results demonstrate that LLM-based extraction is practical for classroom use, requiring only minor manual corrections, and that unit-specific vocabulary contributes to accurate hierarchical labeling. Future work will extend this framework toward content management of LLM-generated materials and unified log analysis to promote personalized learning pathways.
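The labeling step described in the abstract (keyword-based filtering followed by embedding similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the unit names, keyword lists, and the `toy_embed` function are hypothetical, and `toy_embed` is a bag-of-words stand-in for a real embedding model such as text-embedding-3-small, used so that the example runs without network access.

```python
import math
import re
from collections import Counter

# Hypothetical curriculum units with illustrative keywords (not from the paper).
UNITS = {
    "Linear equations": ["equation", "solve", "unknown"],
    "Linear functions": ["function", "slope", "graph", "intercept"],
    "Probability": ["probability", "dice", "coin", "outcomes"],
}


def toy_embed(text):
    """Assumed stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_problem(problem, units=UNITS, embed=toy_embed):
    """Assign a unit label: filter units by keyword, then rank by similarity."""
    problem_vec = embed(problem)
    words = set(problem_vec)
    # Step 1: keep only units that share at least one keyword with the problem.
    candidates = [u for u, kws in units.items() if words & set(kws)]
    # If no keyword matches, fall back to ranking every unit.
    if not candidates:
        candidates = list(units)
    # Step 2: choose the candidate whose description is most similar.
    return max(
        candidates,
        key=lambda u: cosine(problem_vec, embed(u + " " + " ".join(units[u]))),
    )


print(label_problem("Find the slope of the graph of the function y = 2x + 1."))
```

The baseline without keyword filtering mentioned in the abstract would correspond to skipping step 1 and ranking all units by similarity alone. In practice, `embed` would be replaced by a call to an actual embedding model returning dense vectors, with `cosine` adapted accordingly.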
Published
2025-12-01
How to Cite
LLM-enhanced Math Text Extracting and Keyword-based Hierarchical Labeling for Digital Learning Infrastructure. (2025). International Conference on Computers in Education. https://library.apsce.net/index.php/ICCE/article/view/5664