CINXE.COM

Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving

<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving</title> <!--Generated on Tue Mar 18 18:27:46 2025 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2503.14630v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S1" title="In Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S2" title="In Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Related Work</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S3" title="In Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>Benchmark</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S4" title="In Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Evaluation of Model Performance</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S4.SS1" title="In 4 Evaluation of Model Performance ‣ Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Results for RQ1: Accuracy in Correctness Evaluation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S4.SS2" title="In 4 Evaluation of Model Performance ‣ Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Results for RQ2: Error Detection and Feedback Quality</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S5" title="In Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Limitations and Future Directions</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S6" title="In Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Discussion and Conclusion</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line ltx_fleqn"> <div class="ltx_para" id="p1"> <p class="ltx_p" id="p1.1">[1,3]<span class="ltx_ERROR undefined" id="p1.1.1">\fnm</span>Priscylla <span class="ltx_ERROR undefined" id="p1.1.2">\sur</span>Silva</p> </div> <div class="ltx_para" id="p2"> <p class="ltx_p" id="p2.1">[1]<span class="ltx_ERROR undefined" id="p2.1.1">\orgname</span>University of São Paulo, <span class="ltx_ERROR undefined" id="p2.1.2">\orgaddress</span><span class="ltx_ERROR undefined" id="p2.1.3">\city</span>São Carlos, <span class="ltx_ERROR undefined" id="p2.1.4">\state</span>SP, <span class="ltx_ERROR undefined" id="p2.1.5">\country</span>Brazil</p> </div> <div class="ltx_para" id="p3"> <p class="ltx_p" id="p3.1">2]<span class="ltx_ERROR undefined" id="p3.1.1">\orgname</span>Federal University of Alagoas, <span class="ltx_ERROR undefined" id="p3.1.2">\orgaddress</span><span class="ltx_ERROR undefined" id="p3.1.3">\city</span>Maceió, <span class="ltx_ERROR undefined" id="p3.1.4">\state</span>AL, <span class="ltx_ERROR undefined" id="p3.1.5">\country</span>Brazil</p> </div> <div class="ltx_para" id="p4"> <p class="ltx_p" id="p4.1">3]<span class="ltx_ERROR undefined" id="p4.1.1">\orgname</span>Federal Institute of Alagoas, <span class="ltx_ERROR undefined" id="p4.1.2">\orgaddress</span><span class="ltx_ERROR undefined" id="p4.1.3">\city</span>Rio Largo, <span class="ltx_ERROR undefined" id="p4.1.4">\state</span>AL, <span class="ltx_ERROR undefined" id="p4.1.5">\country</span>Brazil</p> </div> <h1 class="ltx_title ltx_title_document">Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_email"><a href="mailto:priscylla.silva@usp.br">priscylla.silva@usp.br</a> </span></span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname"><span class="ltx_ERROR undefined" id="id1.1.id1">\fnm</span>Evandro <span class="ltx_ERROR undefined" id="id2.2.id2">\sur</span>Costa </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_email"><a href="mailto:evandro@ic.ufal.br">evandro@ic.ufal.br</a> </span> <span class="ltx_contact ltx_role_affiliation">* </span> <span class="ltx_contact ltx_role_affiliation">[ </span> <span class="ltx_contact ltx_role_affiliation">[ </span></span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id3.id1">Providing effective feedback is important for student learning in programming problem-solving. In this sense, Large Language Models (LLMs) have emerged as potential tools to automate feedback generation. However, their reliability and ability to identify reasoning errors in student code remain not well understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed the models’ capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight the potential and limitations of LLMs in programming education and underscore the need for improvements to enhance reliability and minimize risks in educational applications.</p> </div> <div class="ltx_classification"> <h6 class="ltx_title ltx_title_classification">keywords: </h6> Programming Education, Large Language Models, Automated Feedback, Automated Assessment </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Despite its foundational role in computing education, computer programming remains a difficult subject for many learners. Programming courses often enroll large numbers of students, making it difficult for instructors to offer timely and personalized feedback. This feedback is crucial for helping learners understand complex concepts, correct misconceptions, and develop problem-solving skills. However, generating such feedback for programming assignments are particularly demanding. Instructors must evaluate whether a solution meets the problem’s requirements, understand the student’s thought process, identify errors, and communicate guidance to foster reflection and learning <cite class="ltx_cite ltx_citemacro_citep">[<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib1" title="">1</a>]</cite>.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">Automated feedback systems have emerged as a solution to address the challenges of assessing student submissions. Traditional tools typically rely on predefined rules or rigid evaluation frameworks <cite class="ltx_cite ltx_citemacro_citep">[<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib2" title="">2</a>]</cite>. While these methods can be effective in specific contexts, they lack the flexibility to provide individualized and context-aware feedback <cite class="ltx_cite ltx_citemacro_citep">[<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib3" title="">3</a>]</cite>. Recent studies have demonstrated that large language models (LLMs) can be used to overcome this limitation by providing natural student feedback that adapts to each student’s specific needs <cite class="ltx_cite ltx_citemacro_citep">[<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib4" title="">4</a>]</cite>.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">However, using LLMs in educational contexts raises significant concerns. <cite class="ltx_cite ltx_citemacro_citet">Tyen et al. [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib5" title="">5</a>]</cite> show that LLMs struggle with reasoning tasks, such as accurately following the step-by-step resolution of arithmetic expressions. These models show less than 50% accuracy in both arithmetic and logical deduction tasks, particularly in identifying and correcting errors, as they often fail to indicate the specific step where a mistake occurred. Furthermore, large language models (LLMs) are susceptible to generating hallucinations or providing misleading information, which may frustrate students and interfere with the learning process <cite class="ltx_cite ltx_citemacro_citep">[<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib6" title="">6</a>]</cite>. Such inaccuracies are particularly concerning in education, where trust and reliability are paramount.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">In this work, we address these challenges by providing a systematic evaluation of four different LLMs (GPT-4o, GPT-4o-mini, GPT-4-Turbo, and Gemini-1.5-pro) on their ability to generate accurate and meaningful feedback for programming assignments. Unlike prior studies that typically focus on a single LLM, Our comparative analysis highlights the different capabilities and limitations of multiple LLMs. To ensure reproducibility, we contribute a benchmark dataset based on real-world student submissions from introductory programming courses, making it publicly available to foster further research in this area.</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.1">In summary, the main contributions of this work are:</p> <ol class="ltx_enumerate" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">1.</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">A evaluation of four LLMs in generating feedback for programming assignments, focusing on their ability to identify and explain student mistakes;</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">2.</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1">A publicly available benchmark dataset of annotated student code submissions, facilitating future research on automated feedback systems.</p> </div> </li> </ol> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2 </span>Related Work</h2> <div class="ltx_para" id="S2.p1"> <p class="ltx_p" id="S2.p1.1"><span class="ltx_text ltx_font_bold" id="S2.p1.1.1">Evaluating Feedback Generated by LLMs in Programming Tasks</span>. <cite class="ltx_cite ltx_citemacro_citet">Pankiewicz and Baker [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib7" title="">7</a>]</cite> used GPT-3.5 to provide hints to students upon request during programming tasks when errors such as compilation failures, runtime issues, or unit test failures were identified. Their approach used prompts that included the problem description, student code, and error details. Feedback usefulness was evaluated based on student opinions using a Likert scale. However, the study did not assess whether the feedback correctly aligned with the problem requirements or addressed issues such as hallucinations or logical inconsistencies.</p> </div> <div class="ltx_para" id="S2.p2"> <p class="ltx_p" id="S2.p2.1"><cite class="ltx_cite ltx_citemacro_citet">Azaiz et al. [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib8" title="">8</a>]</cite> examined GPT-4 Turbo for feedback generation on 55 solutions to two programming assignments. They reported feedback accuracy rates of 75% to 95% in determining the correctness of solutions but noted that 22% of the feedback was inconsistent or incorrect. Similarly, <cite class="ltx_cite ltx_citemacro_citet">Roest et al. [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib9" title="">9</a>]</cite> evaluated GPT-3.5 Turbo for real-time hint generation during students’ problem-solving processes. Experts reviewed 48 hints and found that 33% contained misleading information. <cite class="ltx_cite ltx_citemacro_citet">Jacobs and Jaschke [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib10" title="">10</a>]</cite> employed GPT-4 to generate feedback using task specifications, student code, compiler output, and unit test results. Expert evaluation of 137 feedbacks revealed that 12% contained incorrect suggestions, and 6% exhibited hallucinations.</p> </div> <div class="ltx_para" id="S2.p3"> <p class="ltx_p" id="S2.p3.1">These studies underscore the challenges in ensuring the accuracy and reliability of LLM-generated feedback. Although the proportion of issues in feedback (ranging from 6% to 33%) may appear small, in the context of education, even a small rate of misleading feedback can negatively affect students’ learning, leading to misunderstandings, frustration, and loss of trust in the system. While these studies focus on evaluating individual models and their ability to generate feedback, our research goes further by conducting a systematic comparison of multiple LLMs.</p> </div> <div class="ltx_para" id="S2.p4"> <p class="ltx_p" id="S2.p4.1"><span class="ltx_text ltx_font_bold" id="S2.p4.1.1">Challenges in Evaluating Reasoning with LLMs</span>. Detecting reasoning mistakes in student code requires understanding the underlying logic and intent of the solution, going beyond surface-level syntax or test failures. Recent research has shown that LLMs generally perform poorly in identifying logical mistakes compared to their ability to correct them. <cite class="ltx_cite ltx_citemacro_citet">Tyen et al. [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib5" title="">5</a>]</cite> demonstrated that several LLMs struggle with logical deduction, arithmetic expressions, and word-ordering tasks, often failing to pinpoint where reasoning mistakes occur. Similarly, <cite class="ltx_cite ltx_citemacro_citet">Xia et al. [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib11" title="">11</a>]</cite> observed similar limitations in mathematical reasoning tasks, where LLMs frequently misinterpret problem-solving logic.</p> </div> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3 </span>Benchmark</h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">We created a benchmark dataset based on real-world student submissions to evaluate LLMs’ performance in providing feedback for programming tasks. This dataset comprises 45 Python solutions to 5 introductory programming assignments collected from an online system used in programming courses. These 45 solutions were submitted by 5 students, with some students providing multiple solutions for the same programming assignment. This reflects a range of solution attempts and iterative improvements, capturing diverse solution strategies and common errors. Each assignment was designed to test foundational programming concepts such as loops, conditionals, and functions.</p> </div> <div class="ltx_para" id="S3.p2"> <p class="ltx_p" id="S3.p2.1"><span class="ltx_text ltx_font_bold" id="S3.p2.1.1">Dataset Construction</span>. The dataset was generated from student submissions to an automated grading system, which determined the correctness of each solution based on problem requirements and test cases. The students’ solutions were written in Python 3.11. Table <a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S3.T1" title="Table 1 ‣ 3 Benchmark ‣ Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_tag">1</span></a> shows an overview of the dataset collected. This dataset was used to assess LLMs’ feedback and evaluate their ability to identify mistakes and suggest improvements and corrections.</p> </div> <figure class="ltx_table" id="S3.T1"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S3.T1.2.1.1" style="font-size:90%;">Table 1</span>: </span><span class="ltx_text" id="S3.T1.3.2" style="font-size:90%;">Distribution of students’ solutions across assignments and correctness based on the submission system.</span></figcaption> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S3.T1.4"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S3.T1.4.1.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S3.T1.4.1.1.1"><span class="ltx_text ltx_font_bold" id="S3.T1.4.1.1.1.1">Assignment</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S3.T1.4.1.1.2"><span class="ltx_text ltx_font_bold" id="S3.T1.4.1.1.2.1">Total Solutions</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S3.T1.4.1.1.3"><span class="ltx_text ltx_font_bold" id="S3.T1.4.1.1.3.1">Correct</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S3.T1.4.1.1.4"><span class="ltx_text ltx_font_bold" id="S3.T1.4.1.1.4.1">Incorrect</span></th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S3.T1.4.2.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S3.T1.4.2.1.1">Area of a Circle</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.4.2.1.2">14</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.4.2.1.3">05</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.4.2.1.4">09</td> </tr> <tr class="ltx_tr" id="S3.T1.4.3.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S3.T1.4.3.2.1">Simple Sum</th> <td class="ltx_td ltx_align_center" id="S3.T1.4.3.2.2">06</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.3.2.3">04</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.3.2.4">02</td> </tr> <tr class="ltx_tr" id="S3.T1.4.4.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S3.T1.4.4.3.1">T-Shirts</th> <td class="ltx_td ltx_align_center" id="S3.T1.4.4.3.2">10</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.4.3.3">02</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.4.3.4">08</td> </tr> <tr class="ltx_tr" id="S3.T1.4.5.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S3.T1.4.5.4.1">Huaauhahhuahau</th> <td class="ltx_td ltx_align_center" id="S3.T1.4.5.4.2">08</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.5.4.3">04</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.5.4.4">04</td> </tr> <tr class="ltx_tr" id="S3.T1.4.6.5"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S3.T1.4.6.5.1">Grandpa’s Balance</th> <td class="ltx_td ltx_align_center" id="S3.T1.4.6.5.2">07</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.6.5.3">04</td> <td class="ltx_td ltx_align_center" id="S3.T1.4.6.5.4">03</td> </tr> <tr class="ltx_tr" id="S3.T1.4.7.6"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb ltx_border_t" id="S3.T1.4.7.6.1"><span class="ltx_text ltx_font_bold" id="S3.T1.4.7.6.1.1">Total</span></th> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.4.7.6.2">45</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.4.7.6.3">19</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.4.7.6.4">26</td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="S3.p3"> <p class="ltx_p" id="S3.p3.1"><span class="ltx_text ltx_font_bold" id="S3.p3.1.1">Feedback Generation</span>. Feedback was generated using four LLMs: GPT-4o (version: 2024-08-06), GPT-4o-mini (version: 2024-07-18), GPT-4-Turbo (version: 2024-04-09), and Gemini-1.5-pro. Each model was prompted using the same template to ensure consistency in feedback generation (see the prompt template in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S3.F1" title="Figure 1 ‣ 3 Benchmark ‣ Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_tag">1</span></a>). The models were executed with a temperature setting of 0 (zero) to try to ensure consistent output results. The prompt explicitly instructed the models to act as programming teachers, assess the correctness of the code, and generate JSON-formatted feedback. Feedback included a binary indicator (<em class="ltx_emph ltx_font_italic" id="S3.p3.1.2">is_correct</em>) and a list of hints, with each hint specifying:</p> </div> <div class="ltx_para" id="S3.p4"> <ul class="ltx_itemize" id="S3.I1"> <li class="ltx_item" id="S3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i1.p1.1">The line number of the issue,</p> </div> </li> <li class="ltx_item" id="S3.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i2.p1"> <p class="ltx_p" id="S3.I1.i2.p1.1">The code line containing the issue,</p> </div> </li> <li class="ltx_item" id="S3.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i3.p1"> <p class="ltx_p" id="S3.I1.i3.p1.1">A concise explanation of the problem.</p> </div> </li> </ul> </div> <figure class="ltx_figure" id="S3.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="652" id="S3.F1.g1" src="extracted/6291093/figs/prompt.png" width="598"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S3.F1.2.1.1" style="font-size:90%;">Figure 1</span>: </span><span class="ltx_text" id="S3.F1.3.2" style="font-size:90%;">Prompt used to ask for feedback.</span></figcaption> </figure> <div class="ltx_para" id="S3.p5"> <p class="ltx_p" id="S3.p5.1"><span class="ltx_text ltx_font_bold" id="S3.p5.1.1">Feedback Annotation</span>. Two teaching assistants annotated the feedback generated by the models. The feedback was assessed in two stages:</p> </div> <div class="ltx_para" id="S3.p6"> <ol class="ltx_enumerate" id="S3.I2"> <li class="ltx_item" id="S3.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">1.</span> <div class="ltx_para" id="S3.I2.i1.p1"> <p class="ltx_p" id="S3.I2.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i1.p1.1.1">Automatic Annotation</span>: The correctness of the solutions was determined using the results from the automated submission system.</p> </div> </li> <li class="ltx_item" id="S3.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">2.</span> <div class="ltx_para" id="S3.I2.i2.p1"> <p class="ltx_p" id="S3.I2.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i2.p1.1.1">Human Annotation</span>: Teaching assistants evaluated the feedback and organized it into five categories. Disagreement cases were resolved through discussion and consensus.</p> </div> </li> </ol> </div> <div class="ltx_para" id="S3.p7"> <p class="ltx_p" id="S3.p7.1">Each hint was categorized individually, following an adaptation of the framework by <cite class="ltx_cite ltx_citemacro_citet">Hellas et al. [<a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#bib.bib12" title="">12</a>]</cite>. In our categorization, any error in the hint message makes it incorrect, even if other parts are accurate. The categories are defined as follows:</p> </div> <div class="ltx_para" id="S3.p8"> <ul class="ltx_itemize" id="S3.I3"> <li class="ltx_item" id="S3.I3.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I3.i1.p1"> <p class="ltx_p" id="S3.I3.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i1.p1.1.1">Accurate and Complete (AC)</span>: The feedback correctly identifies the appropriate line of code and explains the issue and/or how to resolve it. This is considered the ideal type of feedback, as it effectively supports students in identifying and understanding their mistakes.</p> </div> </li> <li class="ltx_item" id="S3.I3.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I3.i2.p1"> <p class="ltx_p" id="S3.I3.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i2.p1.1.1">Accurate Line Only (ALO)</span>: The feedback correctly identifies the line containing the issue but provides an incorrect or misleading explanation of the problem. Unlike False Positive (FP), the line identified does indeed contain a mistake, but the model misunderstands the nature of the issue.</p> </div> </li> <li class="ltx_item" id="S3.I3.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I3.i3.p1"> <p class="ltx_p" id="S3.I3.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i3.p1.1.1">Accurate Issue Only (AIO)</span>: The feedback describes the right issue but does not indicate the correct line of code. This misalignment can confuse students, as the explanation may not appear to relate to the line they are working on.</p> </div> </li> <li class="ltx_item" id="S3.I3.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I3.i4.p1"> <p class="ltx_p" id="S3.I3.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i4.p1.1.1">False Positive (FP)</span>: The feedback incorrectly identifies an issue where none exists. There is no actual error on the identified line, and the feedback message is entirely fake (hallucinated). This type of feedback can undermine student trust in the system.</p> </div> </li> <li class="ltx_item" id="S3.I3.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I3.i5.p1"> <p class="ltx_p" id="S3.I3.i5.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i5.p1.1.1">Misleading Suggestions</span> (MS): The feedback points out a real error in the code, but both the line and the suggested fix are incorrect. While related to AIO, MS adds an additional layer of confusion by including a flawed or misleading resolution.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S3.p9"> <p class="ltx_p" id="S3.p9.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S3.T2" title="Table 2 ‣ 3 Benchmark ‣ Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_tag">2</span></a> displays the categorization of the hints produced by each model, indicating the frequency of each category. The analysis focuses only on true negatives, which means that the dataset includes only the feedback for solutions that the models accurately identified as incorrect.</p> </div> <figure class="ltx_table" id="S3.T2"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S3.T2.2.1.1" style="font-size:90%;">Table 2</span>: </span><span class="ltx_text" id="S3.T2.3.2" style="font-size:90%;">Categorization of hints generated by each model.</span></figcaption> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S3.T2.4"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S3.T2.4.1.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S3.T2.4.1.1.1" rowspan="2"><span class="ltx_text ltx_font_bold" id="S3.T2.4.1.1.1.1">Model</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="6" id="S3.T2.4.1.1.2"><span class="ltx_text ltx_font_bold" id="S3.T2.4.1.1.2.1">Categories</span></th> </tr> <tr class="ltx_tr" id="S3.T2.4.2.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.4.2.2.1"><span class="ltx_text ltx_font_bold" id="S3.T2.4.2.2.1.1">AC</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.4.2.2.2"><span class="ltx_text ltx_font_bold" id="S3.T2.4.2.2.2.1">ALO</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.4.2.2.3"><span class="ltx_text ltx_font_bold" id="S3.T2.4.2.2.3.1">AIO</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.4.2.2.4"><span class="ltx_text ltx_font_bold" id="S3.T2.4.2.2.4.1">FP</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.4.2.2.5"><span class="ltx_text ltx_font_bold" id="S3.T2.4.2.2.5.1">MS</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.4.2.2.6"><span class="ltx_text ltx_font_bold" id="S3.T2.4.2.2.6.1">Total</span></th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S3.T2.4.3.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S3.T2.4.3.1.1">GPT-4o-mini</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.3.1.2">36</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.3.1.3">02</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.3.1.4">06</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.3.1.5">03</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.3.1.6">02</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.3.1.7">49</td> </tr> <tr class="ltx_tr" id="S3.T2.4.4.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S3.T2.4.4.2.1">GPT-4o</th> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.2.2">28</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.2.3">10</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.2.4">03</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.2.5">08</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.2.6">00</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.2.7">49</td> </tr> <tr class="ltx_tr" id="S3.T2.4.5.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S3.T2.4.5.3.1">GPT-4-Turbo</th> <td class="ltx_td ltx_align_center" id="S3.T2.4.5.3.2">33</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.5.3.3">06</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.5.3.4">02</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.5.3.5">03</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.5.3.6">00</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.5.3.7">44</td> </tr> <tr class="ltx_tr" id="S3.T2.4.6.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S3.T2.4.6.4.1">Gemini-1.5-pro</th> <td class="ltx_td ltx_align_center" id="S3.T2.4.6.4.2">23</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.6.4.3">09</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.6.4.4">07</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.6.4.5">09</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.6.4.6">00</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.6.4.7">48</td> </tr> <tr class="ltx_tr" id="S3.T2.4.7.5"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb ltx_border_t" id="S3.T2.4.7.5.1"><span class="ltx_text ltx_font_bold" id="S3.T2.4.7.5.1.1">Total</span></th> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.7.5.2">120</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.7.5.3">27</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.7.5.4">18</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.7.5.5">23</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.7.5.6">02</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.7.5.7">190</td> </tr> </tbody> </table> </figure> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4 </span>Evaluation of Model Performance</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">To assess the ability of LLMs to evaluate student code, we use our benchmark dataset to investigate two key research questions:</p> <ul class="ltx_itemize" id="S4.I1"> <li class="ltx_item" id="S4.I1.ix1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S4.I1.ix1.1.1.1">RQ1.</span></span> <div class="ltx_para" id="S4.I1.ix1.p1"> <p class="ltx_p" id="S4.I1.ix1.p1.1">How accurately do LLMs determine whether a student’s solution fulfills the requirements of a programming task?</p> </div> </li> <li class="ltx_item" id="S4.I1.ix2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S4.I1.ix2.1.1.1">RQ2.</span></span> <div class="ltx_para" id="S4.I1.ix2.p1"> <p class="ltx_p" id="S4.I1.ix2.p1.1">What are the common types of errors in student code that LLMs fail to identify?</p> </div> </li> </ul> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1 </span>Results for RQ1: Accuracy in Correctness Evaluation</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1">The models performed similarly in evaluating the correctness of students’ code (Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S4.F2" title="Figure 2 ‣ 4.1 Results for RQ1: Accuracy in Correctness Evaluation ‣ 4 Evaluation of Model Performance ‣ Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_tag">2</span></a> shows a visualization of the confusion matrices for each model). The accuracy metrics for each model are as follows: GPT-4o-mini: 84.44%, Gemini-1.5-pro: 86.67%, GPT-4o: 88.89%, and GPT-4-Turbo: 88.89%.</p> </div> <div class="ltx_para" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p2.1.1">GPT-4o-mini</span> misclassified seven correct solutions as incorrect. These errors were distributed across three tasks: four from the <span class="ltx_text ltx_font_italic" id="S4.SS1.p2.1.2">Simple Sum</span> assignment, two from the <span class="ltx_text ltx_font_italic" id="S4.SS1.p2.1.3">T-shirt</span> assignment, and one from <span class="ltx_text ltx_font_italic" id="S4.SS1.p2.1.4">Grandpa’s Balance</span>. In the <span class="ltx_text ltx_font_italic" id="S4.SS1.p2.1.5">Simple Sum</span> task, the model incorrectly claimed the solutions did not meet the required output format, despite their correctness. Similarly, in the <span class="ltx_text ltx_font_italic" id="S4.SS1.p2.1.6">T-shirt</span> and <span class="ltx_text ltx_font_italic" id="S4.SS1.p2.1.7">Grandpa’s Balance</span> assignments, the model misinterpreted the students’ logical reasoning, falsely identifying errors in the code.</p> </div> <div class="ltx_para" id="S4.SS1.p3"> <p class="ltx_p" id="S4.SS1.p3.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p3.1.1">GPT-4o</span> exhibited similar misclassification patterns, with five correct solutions identified as incorrect. These were a subset of the errors made by GPT-4o-mini, including three <span class="ltx_text ltx_font_italic" id="S4.SS1.p3.1.2">Simple Sum</span> solutions and two <span class="ltx_text ltx_font_italic" id="S4.SS1.p3.1.3">T-shirt</span> solutions. GPT-4o successfully identified that one solution from <span class="ltx_text ltx_font_italic" id="S4.SS1.p3.1.4">Simple Sum</span> met the output requirements, which GPT-4o-mini failed to recognize. However, the model also struggled to interpret logical reasoning in the <span class="ltx_text ltx_font_italic" id="S4.SS1.p3.1.5">T-shirt</span> task.</p> </div> <div class="ltx_para" id="S4.SS1.p4"> <p class="ltx_p" id="S4.SS1.p4.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p4.1.1">GPT-4-Turbo</span> demonstrated comparable accuracy to GPT-4o, misclassifying four correct solutions as incorrect. It improved over GPT-4o-mini in the <span class="ltx_text ltx_font_italic" id="S4.SS1.p4.1.2">Simple Sum</span> assignment, reducing errors to two cases. However, like the others, it struggled with the logical reasoning in two <span class="ltx_text ltx_font_italic" id="S4.SS1.p4.1.3">T-shirt</span> assignment solutions.</p> </div> <div class="ltx_para" id="S4.SS1.p5"> <p class="ltx_p" id="S4.SS1.p5.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p5.1.1">Gemini-1.5-pro</span> exhibited a similar pattern of errors as the GPT models. The misclassified solutions were a subset of those incorrectly labeled by GPT-4o and GPT-4o-mini. Gemini and GPT-4-Turbo were the only ones to classify an incorrect solution as correct incorrectly.</p> </div> <figure class="ltx_figure" id="S4.F2"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F2.2.1.1" style="font-size:90%;">Figure 2</span>: </span><span class="ltx_text" id="S4.F2.3.2" style="font-size:90%;">Confusion matrices for GPT-4o, GPT-4o-mini, and GPT-4-Turbo, showing the models’ performance in classifying correct and incorrect student solutions.</span></figcaption><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="232" id="S4.F2.g1" src="x1.png" width="830"/> </figure> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2 </span>Results for RQ2: Error Detection and Feedback Quality</h3> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.1">To evaluate the quality of feedback generated by the models, we categorized hints associated with correctly identified incorrect solutions into five categories: Accurate and Complete (AC), Accurate Line Only (ALO), Accurate Issue Only (AIO), False Positive (FP), and Misleading Suggestions (MS). Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.14630v1#S4.F3" title="Figure 3 ‣ 4.2 Results for RQ2: Error Detection and Feedback Quality ‣ 4 Evaluation of Model Performance ‣ Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving"><span class="ltx_text ltx_ref_tag">3</span></a> illustrates the distribution of feedback categories across models.</p> </div> <figure class="ltx_figure" id="S4.F3"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F3.2.1.1" style="font-size:90%;">Figure 3</span>: </span><span class="ltx_text" id="S4.F3.3.2" style="font-size:90%;">Comparison of the frequency of feedback categories generated by each model.</span></figcaption><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="247" id="S4.F3.g1" src="extracted/6291093/figs/barchart.png" width="419"/> </figure> <div class="ltx_para" id="S4.SS2.p2"> <p class="ltx_p" id="S4.SS2.p2.1">GPT-4o-mini generated the highest number of <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.1.1">Accurate and Complete</span> hints, followed closely by GPT-4-Turbo. These results indicate the strength of these models in providing both the correct line and the proper explanation of the mistake. Furthermore, GPT-4o and Gemini-1.5-pro generated more <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.1.2">Accurate Line Only</span> hints, suggesting a tendency to identify the line containing the issue but not adequately explaining the problem or assisting the student in fixing it. Across all models, the frequency of <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.1.3">False Positives</span> was relatively low, occurring in approximately 12% of cases. These errors were more frequent in Gemini-1.5-pro and GPT-4o. Similarly, <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.1.4">Misleading Suggestions</span> were rare, with only two instances reported for GPT-4o-mini.</p> </div> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5 </span>Limitations and Future Directions</h2> <div class="ltx_para" id="S5.p1"> <p class="ltx_p" id="S5.p1.1">This study has potential limitations. First, the benchmark dataset comprises 45 solutions from 5 students across five programming assignments. Although it captures diverse solution strategies and common errors, the dataset’s small size and limited participant pool restrict the generalizability of our findings. Furthermore, imbalances in the number of submissions across assignments may introduce bias in the evaluation results. We plan to expand the dataset with more participants, diverse assignments, and augmented data to address these issues.</p> </div> <div class="ltx_para" id="S5.p2"> <p class="ltx_p" id="S5.p2.1">Second, the prompt design may have impacted the models’ performance. In future studies, we will explore alternative prompt designs, such as separating tasks into individual prompts or using dynamic, task-specific prompts. This could provide a more accurate evaluation of LLM capabilities.</p> </div> <div class="ltx_para" id="S5.p3"> <p class="ltx_p" id="S5.p3.1">Finally, our findings are based on the evaluation of four specific LLMs. While these models represent state-of-the-art capabilities, the results may not generalize to other models or future iterations. Additionally, while human annotations were validated to ensure consistency, they may still introduce an element of subjectivity.</p> </div> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6 </span>Discussion and Conclusion</h2> <div class="ltx_para" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">In this work, we evaluate the ability of large language models (LLMs) to provide feedback on programming tasks. We introduce a benchmark dataset of student submissions in introductory programming courses and use it to analyze and compare the performance of GPT-4o, GPT-4o-mini, GPT-4-Turbo, and Gemini-1.5-pro. Our study categorizes feedback generated by these models into five categories, highlighting their strengths and limitations in providing proper feedback. The results show that while these LLMs are effective in generating accurate feedback in many cases, they also exhibit significant challenges, including hallucinated errors, misleading suggestions, and struggles to interpret student logic.</p> </div> <div class="ltx_para" id="S6.p2"> <p class="ltx_p" id="S6.p2.1">Overall, 63% of the feedback hints generated by the models were totally correct, meaning they accurately identified the problematic line and provided an appropriate explanation. However, 37% of the feedback contained some issues, such as pointing to the wrong line, providing an incorrect hint message, or hallucinating non-existent errors.</p> </div> <div class="ltx_para" id="S6.p3"> <p class="ltx_p" id="S6.p3.1">Our findings reveal that although LLMs perform well in detecting syntactic and surface-level issues, they often fail to capture deeper reasoning mistakes in student code. Additionally, the frequency of hallucinations and false positives highlights the need for further refinement of these models to minimize their risks in educational settings. The benchmark, prompts, and supplementary materials can be found at: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/priscylla/Assessing-LMMs-for-Feedback-Generation" title="">https://github.com/priscylla/Assessing-LMMs-for-Feedback-Generation</a>.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.1.1"> <span class="ltx_bibblock"><span class="ltx_ERROR undefined" id="bib.1.1.1.1">\bibcommenthead</span> </span> </li> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Silva et al. [2019]</span> <span class="ltx_bibblock"> Silva, P., Costa, E., Araújo, J.R.: An adaptive approach to provide feedback for students in programming problem solving. In: Coy, A., Hayashi, Y., Chang, M. (eds.) Intelligent Tutoring Systems, pp. 14–23. Springer, Cham (2019) </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Deeva et al. [2021]</span> <span class="ltx_bibblock"> Deeva, G., Bogdanova, D., Serral, E., Snoeck, M., De Weerdt, J.: A review of automated feedback systems for learners: Classification framework, challenges and opportunities. Computers &amp; Education <span class="ltx_text ltx_font_bold" id="bib.bib2.1.1">162</span>, 104094 (2021) <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1016/j.compedu.2020.104094" title="">https://doi.org/10.1016/j.compedu.2020.104094</a> </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Maier and Klotz [2022]</span> <span class="ltx_bibblock"> Maier, U., Klotz, C.: Personalized feedback in digital learning environments: Classification framework and literature review. Computers and Education: Artificial Intelligence <span class="ltx_text ltx_font_bold" id="bib.bib3.1.1">3</span>, 100080 (2022) <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1016/j.caeai.2022.100080" title="">https://doi.org/10.1016/j.caeai.2022.100080</a> </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kiesler et al. [2023]</span> <span class="ltx_bibblock"> Kiesler, N., Lohr, D., Keuning, H.: Exploring the potential of large language models to generate formative programming feedback. In: 2023 IEEE Frontiers in Education Conference (FIE), pp. 1–5 (2023). <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/FIE58773.2023.10343457" title="">https://doi.org/10.1109/FIE58773.2023.10343457</a> </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tyen et al. [2024]</span> <span class="ltx_bibblock"> Tyen, G., Mansoor, H., Carbune, V., Chen, P., Mak, T.: LLMs cannot find reasoning errors, but can correct them given the error location. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguistics: ACL 2024, pp. 13894–13908. Association for Computational Linguistics, Bangkok, Thailand (2024). <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.18653/v1/2024.findings-acl.826" title="">https://doi.org/10.18653/v1/2024.findings-acl.826</a> . <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/2024.findings-acl.826" title="">https://aclanthology.org/2024.findings-acl.826</a> </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jukiewicz [2024]</span> <span class="ltx_bibblock"> Jukiewicz, M.: The future of grading programming assignments in education: The role of chatgpt in automating the assessment and feedback process. Thinking Skills and Creativity <span class="ltx_text ltx_font_bold" id="bib.bib6.1.1">52</span>, 101522 (2024) <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1016/j.tsc.2024.101522" title="">https://doi.org/10.1016/j.tsc.2024.101522</a> </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Pankiewicz and Baker [2023]</span> <span class="ltx_bibblock"> Pankiewicz, M., Baker, R.S.: Large language models (gpt) for automating feedback on programming assignments. In: Shin, J.-L., Kashihara, A., Chen, W., Ogata, H. (eds.) 31st International Conference on Computers in Education Conference Proceedings, Volume I, pp. 68–77. Asia-Pacific Society for Computers in Education (APSCE), ??? (2023) </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Azaiz et al. [2024]</span> <span class="ltx_bibblock"> Azaiz, I., Kiesler, N., Strickroth, S.: Feedback-generation for programming exercises with gpt-4. In: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1. ITiCSE 2024, pp. 31–37. Association for Computing Machinery, New York, NY, USA (2024). <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3649217.3653594" title="">https://doi.org/10.1145/3649217.3653594</a> . <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3649217.3653594" title="">https://doi.org/10.1145/3649217.3653594</a> </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Roest et al. [2024]</span> <span class="ltx_bibblock"> Roest, L., Keuning, H., Jeuring, J.: Next-step hint generation for introductory programming using large language models. In: Proceedings of the 26th Australasian Computing Education Conference. ACE ’24, pp. 144–153. Association for Computing Machinery, New York, NY, USA (2024). <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3636243.3636259" title="">https://doi.org/10.1145/3636243.3636259</a> . <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3636243.3636259" title="">https://doi.org/10.1145/3636243.3636259</a> </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jacobs and Jaschke [2024]</span> <span class="ltx_bibblock"> Jacobs, S., Jaschke, S.: Evaluating the application of large language models to generate feedback in programming education. In: 2024 IEEE Global Engineering Education Conference (EDUCON), pp. 1–5 (2024). <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/EDUCON60312.2024.10578838" title="">https://doi.org/10.1109/EDUCON60312.2024.10578838</a> </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xia et al. [2024]</span> <span class="ltx_bibblock"> Xia, S., Li, X., Liu, Y., Wu, T., Liu, P.: Evaluating mathematical reasoning beyond accuracy. arXiv preprint arXiv:2404.05692 (2024) </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hellas et al. [2023]</span> <span class="ltx_bibblock"> Hellas, A., Leinonen, J., Sarsa, S., Koutcheme, C., Kujanpää, L., Sorva, J.: Exploring the responses of large language models to beginner programmers’ help requests. In: Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1. ICER ’23, pp. 93–105. Association for Computing Machinery, New York, NY, USA (2023). <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3568813.3600139" title="">https://doi.org/10.1145/3568813.3600139</a> . <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3568813.3600139" title="">https://doi.org/10.1145/3568813.3600139</a> </span> </li> </ul> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Tue Mar 18 18:27:46 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>

Pages: 1 2 3 4 5 6 7 8 9 10