Utilization of Japanese Public Educational Data by Retrieval Augmented Generation for Policy Research
DOI:
https://doi.org/10.58459/icce.2024.4869
Abstract
Public educational data, including government-conducted national surveys and research cases, are openly available and intended for use in municipal policymaking. However, some of these data have been published only in PDF format and remain underutilized. Therefore, this study leverages new tools from the era of generative AI, namely Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), to process 705 public educational documents in Japanese that were published as PDF files. The process involves extracting text, vectorizing it, and generating responses, and thereby presents a case study of methods for effectively utilizing public educational data. The study found that without RAG, the outputs from GPT-3.5 and GPT-4 were verbose, whereas with RAG the answers were more specific and grounded in the retrieval results. Furthermore, GPT-4 can be used to evaluate the quality of retrieval results. These results demonstrate that LLMs can be applied to local educational knowledge written in local languages such as Japanese, and suggest that previously underutilized educational data can be leveraged to aid in formulating educational policies.
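The three steps named in the abstract (extracting text, vectorizing it, and generating responses) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the PDF text has already been extracted into short chunks, it substitutes character-bigram count vectors for a learned embedding model, and it stops at assembling the prompt that would be sent to an LLM. The sample chunks, the query, and all function names are hypothetical placeholders.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping character bigrams.

    Character n-grams are used here because Japanese is written without
    spaces; this is an assumption of the sketch, not the paper's method.
    """
    cleaned = "".join(text.split())
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def vectorize(text):
    """Map text to a sparse bigram-count vector (stand-in for an embedding)."""
    return Counter(char_bigrams(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, retrieved):
    """Assemble a prompt that asks the model to answer from retrieved context."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical, made-up chunks standing in for text extracted from PDFs.
chunks = [
    "全国学力調査の結果は毎年公表されている。",  # national achievement survey results
    "学校給食の栄養基準について。",              # school lunch nutrition standards
    "不登校児童生徒への支援事例。",              # support cases for non-attending students
]
query = "全国学力調査の結果"  # "results of the national achievement survey"

top = retrieve(query, chunks, top_k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a full pipeline, the prompt would be passed to a chat model, and the count vectors would typically be replaced by dense embeddings stored in a vector index. The retrieval-then-prompt structure stays the same.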