Comparison of Generative AI and Peer Assessment in Essay Evaluation: Preliminary Study

Authors

  • Yasuhisa TAMURA, Faculty of Science and Technology, Sophia University, Japan

Abstract

This paper addresses the timely and critical issue of evaluating students' essays with generative AI, using student peer assessment as the gold standard rather than traditional teacher-centric evaluation. While essay writing is vital for knowledge consolidation and logical thinking, evaluation remains time-consuming and subjective. Recent advances in generative AI offer potential for efficient and objective assessment. Prior research has shown moderate to high agreement between generative AI and human (teacher) graders across various contexts, typically when detailed rubrics and prompts are provided. This study, however, focuses on student peer evaluations, whose reliability is debated but increasingly accepted. The research aims to answer two key questions: (RQ1) Can generative AI, when guided by rubrics, produce evaluation results similar to student peer assessments for natural language outputs? (RQ2) How do evaluations differ across generative AI models? To address these, we will quantitatively analyze the characteristics of both student peer evaluations and multiple generative AI models. We will also investigate the impact of different rubric description formats (conceptual vs. example-inclusive), with attention to "Potemkin understanding" in AI. A preliminary experiment on 117 undergraduate research plans, each evaluated by five peers and five generative AI models (Gemini Flash, Gemini Pro, ChatGPT 4o, ChatGPT o3, Claude Sonnet 4), showed correlation coefficients between peer and AI scores of r = 0.677-0.698 (excluding ChatGPT 4o), below the reliability threshold of r > 0.8 and indicating a need for improved rubric descriptions. This study proposes a shift from the "automated grading with teacher grades as ground truth" paradigm to a social constructivist view that recognizes assessment as part of the learning activity. Academically, it will pioneer a new field of multi-stance learning by using learners' diverse assessment language as "weak teachers" for AI. Practically, it promises to improve the quality of immediate feedback in large lectures and MOOCs, foster assessment literacy, and address regulatory concerns about AI opacity and the absence of human judgment through a hybrid structure of student peer evaluation and generative AI. This approach will also facilitate the development of localized evaluation models that reflect cultural and linguistic diversity.
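
To make the reported correlation analysis concrete, the following is a minimal Python sketch of how peer scores and AI scores for a set of essays could be compared against the r > 0.8 reliability threshold. The 1-5 rubric scale, the averaging of peer scores, and all numeric values below are illustrative assumptions for this sketch, not the study's actual data or procedure.

```python
# Minimal sketch of a peer-vs-AI correlation check (illustrative, not the paper's data).
# Assumptions: essays are scored on a 1-5 rubric scale, the five peer scores per
# essay are averaged, and Pearson's r is computed between the mean peer score and
# one AI model's score.
import numpy as np

rng = np.random.default_rng(0)
n_essays = 117          # number of research plans in the preliminary experiment
n_peers = 5             # peer reviewers per essay

# Hypothetical peer scores: (n_essays, n_peers) matrix on a 1-5 scale.
peer_scores = rng.integers(1, 6, size=(n_essays, n_peers))
mean_peer_score = peer_scores.mean(axis=1)

# Hypothetical scores from one rubric-guided AI grader, loosely tracking the peers.
ai_scores = np.clip(mean_peer_score + rng.normal(0, 0.8, n_essays), 1, 5)

# Pearson correlation between mean peer score and AI score.
r = np.corrcoef(mean_peer_score, ai_scores)[0, 1]
print(f"Pearson r (peer vs. AI) = {r:.3f}")
print("meets r > 0.8 reliability threshold" if r > 0.8 else "below r > 0.8 threshold")
```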

Published

2025-12-01

How to Cite

Tamura, Y. (2025). Comparison of Generative AI and Peer Assessment in Essay Evaluation: Preliminary Study. International Conference on Computers in Education. https://library.apsce.net/index.php/ICCE/article/view/5676