From Code to Courtroom: LLMs as the New Software Judges

Junda He (jundahe@smu.edu.sg); Jieke Shi, Singapore Management University, Singapore (jiekeshi@smu.edu.sg); Terry Yue Zhuo, Monash University & CSIRO's Data61, Australia (terry.zhuo@monash.edu); Christoph Treude, Singapore Management University, Singapore (ctreude@smu.edu.sg); Jiamou Sun, CSIRO's Data61, Australia (frank.sun@anu.edu.au); Zhenchang Xing, CSIRO's Data61 & Australian National University, Australia (zhenchang.xing@anu.edu.au); Xiaoning Du, Monash University, Australia (xiaoning.du@monash.edu); David Lo, Singapore Management University, Singapore (davidlo@smu.edu.sg)

20 February 2025

Abstract.
Recently, Large Language Models (LLMs) have been increasingly used to automate software engineering (SE) tasks such as code generation and summarization. However, evaluating the quality of LLM-generated software artifacts remains challenging. Human evaluation, while effective, is costly and time-consuming. Traditional automated metrics such as BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged. Given that LLMs possess strong coding abilities and reasoning skills, they hold promise as cost-effective and scalable surrogates for human evaluators. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed.

This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.

Keywords: Large Language Models, Software Engineering, LLM-as-a-Judge, Research Roadmap

1. Introduction
With significant advances in Artificial Intelligence (AI), particularly in Deep Learning (DL) (LeCun et al., 2015; Goodfellow et al., 2016) and Large Language Models (LLMs) (OpenAI, 2023b; DeepSeek-AI et al., [n. d.]; Anthropic, 2023), various fields have increasingly leveraged LLMs to develop automated solutions, notably within Software Engineering (SE) (Hou et al., 2024; Gao et al., 2025; Fan et al., 2023; Wang et al., 2024; Shi et al., 2024a). Demonstrating exceptional performance on a spectrum of longstanding challenges, ranging from code generation (Zhuo et al., 2024; Liu et al., 2024c) and summarization (Ahmed et al., 2024b; Sun et al., 2024) to program translation (Yuan et al., 2024; Pan et al., 2023) and repair (Jiang et al., 2023; Jin et al., 2023), LLMs have powered a series of new applications and tools that streamline the software development process, such as GitHub Copilot (GitHub, 2025) and the Cursor Code Editor (Anysphere, 2025), among others.

However, these useful innovations have also brought new challenges to SE research, among which a key question is:

How can we comprehensively and scalably evaluate the quality of LLM-generated software artifacts?
Assessing these LLM-generated software artifacts through experienced developers, i.e., human evaluation, is an ideal approach, as software quality should be judged by its effectiveness in real-world scenarios (Tian, 2005). However, rigorous human evaluation poses significant challenges. A survey of SE researchers found that 84% agree that human evaluation is problematic, primarily due to time constraints, the need for specialized practical knowledge, and the high cost involved (Buse et al., 2011). Additionally, human evaluators may experience fatigue and reduced focus, further compromising the quality of the evaluation and delaying the evaluation process (Kumar et al., 2024).

Therefore, researchers have increasingly adopted automated metrics for evaluating LLM-generated software artifacts (Hu et al., 2022; Papineni et al., 2002; Lin, 2004), given their relatively low cost and scalability. However, traditional automated metrics come with their own limitations. For example, Pass@$k$, a widely used metric for measuring code correctness, involves executing the first $k$ generated code snippets against unit tests. Although useful, it still demands significant human effort to design comprehensive test suites and manually configure execution environments. Similarly, the effectiveness of text similarity-based metrics such as BLEU (Papineni et al., 2002) and CodeBLEU (Ren et al., 2020), along with embedding-based metrics such as BERTScore (Zhang et al., 2019) and CodeBERTScore (Zhou et al., 2023), relies heavily on high-quality references that are typically annotated by human experts. Moreover, these methods fail to capture nuanced aspects of software artifacts, such as naturalness, usefulness, and adherence to best practices, in the way humans would, and they often misalign with human judgment, as evidenced in several studies (Hu et al., 2022; Kumar et al., 2024; Roy et al., 2021; Evtikhiev et al., 2023). Consequently, achieving comprehensive evaluation while ensuring scalable automation remains a persistent challenge in assessing the quality of LLM-generated software artifacts.
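To make the Pass@$k$ discussion concrete, the snippet below is a minimal sketch (not taken from this paper) of the unbiased Pass@$k$ estimator popularized by Chen et al. (2021): given $n$ sampled solutions per task, of which $c$ pass all unit tests, Pass@$k$ is estimated as $1 - \binom{n-c}{k}/\binom{n}{k}$. The sample counts used here are purely illustrative, and the estimator still presupposes the hand-written test suites and execution environments discussed above.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples is correct."""
    if n - c < k:  # fewer than k incorrect samples, so any draw of k contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples for one task, 37 of which pass all unit tests.
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88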
Recently, the remarkable success of LLMs has inspired the emergence of the LLM-as-a-Judge (LLM-J) paradigm (Zheng et al., 2024), which extends LLMs' role from content generation to comprehensive content evaluation, positioning them as scalable and cost-effective surrogates for human evaluators. This paradigm is driven by several key attributes of LLMs. First, numerous studies have shown that LLMs exhibit both impressive coding abilities (Liu et al., 2024b; Lozhkov et al., 2024; Li et al., 2023) and human-like reasoning skills (Patil, 2025; Ivanova, 2025; Guo et al., 2025). Additionally, since LLMs are often trained through reinforcement learning from human feedback (RLHF) (Bai et al., 2022), their outputs closely align with human expert judgments (GitHub, 2025; Chen et al., 2021). Moreover, unlike human evaluators, LLMs do not experience fatigue or reduced efficiency from prolonged work, allowing them to maintain consistent performance over extended periods. These attributes collectively make LLMs well suited to content evaluation tasks.

In this paper, we argue that LLM-as-a-Judge systems offer a promising solution to the limitations of both costly human evaluation and traditional automated metrics in SE. Our research community is beginning to recognize the potential of LLM-as-a-Judge, as evidenced by a few empirical studies and emerging methodologies on this topic (Wang et al., 2025; Wu et al., 2024; Ahmed et al., 2024a). Nevertheless, these studies represent only initial progress, and the field remains in its early stages. Many breakthroughs are still needed, and the overall landscape is far from fully mapped, especially as we look to 2030 and beyond. This motivates the present paper, which seeks to redirect the research community's focus by presenting a vision for the future of LLM-as-a-Judge and a roadmap that inspires further progress.

In the following sections, this paper first provides a formal definition of the LLM-as-a-Judge concept (Section 2). We then review key historical milestones and recent advances in LLM-as-a-Judge for software engineering (Section 3). Looking beyond 2030, we envision LLM-as-a-Judge systems as reliable, robust, and scalable human surrogates capable of evaluating a wide range of software artifacts while providing consistent, multi-faceted assessments. To validate this vision, we identify key limitations and research gaps, outlining a roadmap with specific research directions and potential solutions for the SE research community to pursue (Section 4). In summary, our study makes the following key contributions:

• We conduct a comprehensive review of 16 primary studies on LLM-as-a-Judge in software engineering.
• We analyze the limitations of existing research on LLM-as-a-Judge in SE, identifying major challenges and research gaps.
• We present a forward-looking vision for LLM-as-a-Judge in SE by 2030 and beyond, proposing a detailed research roadmap with key opportunities for future exploration.

2. Definition of LLM-as-a-Judge

As a recently emerging topic, LLM-as-a-Judge has been given various definitions that are not yet unified. The definition used in this paper is inspired by the one proposed by Li et al. (Li et al., 2024a).
Informally speaking, LLM-as-a-Judge utilizes LLMs as evaluative tools to assess software artifact based on predefined evaluation criteria. Formally speaking, LLM-as-a-Judge is defined as follows:</p> </div> <div class="ltx_para" id="S2.p2"> <table class="ltx_equation ltx_eqn_table" id="S2.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_left" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_left">(1)</span></td> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="E(\mathcal{T},\mathcal{C},\mathcal{X},\mathcal{R})\rightarrow(\mathcal{Y},% \mathcal{E},\mathcal{F})" class="ltx_Math" display="block" id="S2.E1.m1.7"><semantics id="S2.E1.m1.7a"><mrow id="S2.E1.m1.7.8" xref="S2.E1.m1.7.8.cmml"><mrow id="S2.E1.m1.7.8.2" xref="S2.E1.m1.7.8.2.cmml"><mi id="S2.E1.m1.7.8.2.2" xref="S2.E1.m1.7.8.2.2.cmml">E</mi><mo id="S2.E1.m1.7.8.2.1" xref="S2.E1.m1.7.8.2.1.cmml">⁢</mo><mrow id="S2.E1.m1.7.8.2.3.2" xref="S2.E1.m1.7.8.2.3.1.cmml"><mo id="S2.E1.m1.7.8.2.3.2.1" stretchy="false" xref="S2.E1.m1.7.8.2.3.1.cmml">(</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.1.1" xref="S2.E1.m1.1.1.cmml">𝒯</mi><mo id="S2.E1.m1.7.8.2.3.2.2" xref="S2.E1.m1.7.8.2.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.2.2" xref="S2.E1.m1.2.2.cmml">𝒞</mi><mo id="S2.E1.m1.7.8.2.3.2.3" xref="S2.E1.m1.7.8.2.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.3.3" xref="S2.E1.m1.3.3.cmml">𝒳</mi><mo id="S2.E1.m1.7.8.2.3.2.4" xref="S2.E1.m1.7.8.2.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.4.4" xref="S2.E1.m1.4.4.cmml">ℛ</mi><mo id="S2.E1.m1.7.8.2.3.2.5" stretchy="false" xref="S2.E1.m1.7.8.2.3.1.cmml">)</mo></mrow></mrow><mo id="S2.E1.m1.7.8.1" stretchy="false" xref="S2.E1.m1.7.8.1.cmml">→</mo><mrow id="S2.E1.m1.7.8.3.2" xref="S2.E1.m1.7.8.3.1.cmml"><mo id="S2.E1.m1.7.8.3.2.1" stretchy="false" xref="S2.E1.m1.7.8.3.1.cmml">(</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.5.5" xref="S2.E1.m1.5.5.cmml">𝒴</mi><mo id="S2.E1.m1.7.8.3.2.2" xref="S2.E1.m1.7.8.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.6.6" xref="S2.E1.m1.6.6.cmml">ℰ</mi><mo id="S2.E1.m1.7.8.3.2.3" xref="S2.E1.m1.7.8.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.7.7" xref="S2.E1.m1.7.7.cmml">ℱ</mi><mo id="S2.E1.m1.7.8.3.2.4" stretchy="false" xref="S2.E1.m1.7.8.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E1.m1.7b"><apply id="S2.E1.m1.7.8.cmml" xref="S2.E1.m1.7.8"><ci id="S2.E1.m1.7.8.1.cmml" xref="S2.E1.m1.7.8.1">→</ci><apply id="S2.E1.m1.7.8.2.cmml" xref="S2.E1.m1.7.8.2"><times id="S2.E1.m1.7.8.2.1.cmml" xref="S2.E1.m1.7.8.2.1"></times><ci id="S2.E1.m1.7.8.2.2.cmml" xref="S2.E1.m1.7.8.2.2">𝐸</ci><vector id="S2.E1.m1.7.8.2.3.1.cmml" xref="S2.E1.m1.7.8.2.3.2"><ci id="S2.E1.m1.1.1.cmml" xref="S2.E1.m1.1.1">𝒯</ci><ci id="S2.E1.m1.2.2.cmml" xref="S2.E1.m1.2.2">𝒞</ci><ci id="S2.E1.m1.3.3.cmml" xref="S2.E1.m1.3.3">𝒳</ci><ci id="S2.E1.m1.4.4.cmml" xref="S2.E1.m1.4.4">ℛ</ci></vector></apply><vector id="S2.E1.m1.7.8.3.1.cmml" xref="S2.E1.m1.7.8.3.2"><ci id="S2.E1.m1.5.5.cmml" xref="S2.E1.m1.5.5">𝒴</ci><ci id="S2.E1.m1.6.6.cmml" xref="S2.E1.m1.6.6">ℰ</ci><ci id="S2.E1.m1.7.7.cmml" xref="S2.E1.m1.7.7">ℱ</ci></vector></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E1.m1.7c">E(\mathcal{T},\mathcal{C},\mathcal{X},\mathcal{R})\rightarrow(\mathcal{Y},% 
\mathcal{E},\mathcal{F})</annotation><annotation encoding="application/x-llamapun" id="S2.E1.m1.7d">italic_E ( caligraphic_T , caligraphic_C , caligraphic_X , caligraphic_R ) → ( caligraphic_Y , caligraphic_E , caligraphic_F )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> </div> <div class="ltx_para" id="S2.p3"> <p class="ltx_p" id="S2.p3.1">The function <math alttext="E" class="ltx_Math" display="inline" id="S2.p3.1.m1.1"><semantics id="S2.p3.1.m1.1a"><mi id="S2.p3.1.m1.1.1" xref="S2.p3.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.p3.1.m1.1b"><ci id="S2.p3.1.m1.1.1.cmml" xref="S2.p3.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.p3.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.p3.1.m1.1d">italic_E</annotation></semantics></math> is the core evaluation mechanism powered by LLMs. It takes the following key <span class="ltx_text ltx_font_bold" id="S2.p3.1.1">inputs</span>:</p> </div> <div class="ltx_para" id="S2.p4"> <ul class="ltx_itemize" id="S2.I1"> <li class="ltx_item" id="S2.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i1.p1"> <p class="ltx_p" id="S2.I1.i1.p1.1"><math alttext="\mathcal{T}" class="ltx_Math" display="inline" id="S2.I1.i1.p1.1.m1.1"><semantics id="S2.I1.i1.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i1.p1.1.m1.1.1" xref="S2.I1.i1.p1.1.m1.1.1.cmml">𝒯</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.p1.1.m1.1b"><ci id="S2.I1.i1.p1.1.m1.1.1.cmml" xref="S2.I1.i1.p1.1.m1.1.1">𝒯</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.p1.1.m1.1c">\mathcal{T}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.p1.1.m1.1d">caligraphic_T</annotation></semantics></math> – Evaluation Type. 
It mainly covers three types: <span class="ltx_text ltx_font_italic" id="S2.I1.i1.p1.1.1">point-wise</span>, <span class="ltx_text ltx_font_italic" id="S2.I1.i1.p1.1.2">pairwise</span>, and <span class="ltx_text ltx_font_italic" id="S2.I1.i1.p1.1.3">list-wise</span>.</p> <ul class="ltx_itemize" id="S2.I1.i1.I1"> <li class="ltx_item" id="S2.I1.i1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="S2.I1.i1.I1.i1.p1"> <p class="ltx_p" id="S2.I1.i1.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i1.p1.1.1">Point-wise Evaluation:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I1.i1.I1.i1.p1.1.m1.1"><semantics id="S2.I1.i1.I1.i1.p1.1.m1.1a"><mi id="S2.I1.i1.I1.i1.p1.1.m1.1.1" xref="S2.I1.i1.I1.i1.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.I1.i1.p1.1.m1.1b"><ci id="S2.I1.i1.I1.i1.p1.1.m1.1.1.cmml" xref="S2.I1.i1.I1.i1.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.I1.i1.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.I1.i1.p1.1.m1.1d">italic_E</annotation></semantics></math> evaluates a single candidate content independently and often provides a score or categorical classification.</p> </div> </li> <li class="ltx_item" id="S2.I1.i1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i2.1.1.1">–</span></span> <div class="ltx_para" id="S2.I1.i1.I1.i2.p1"> <p class="ltx_p" id="S2.I1.i1.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i2.p1.1.1">Pair-wise Evaluation:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I1.i1.I1.i2.p1.1.m1.1"><semantics id="S2.I1.i1.I1.i2.p1.1.m1.1a"><mi id="S2.I1.i1.I1.i2.p1.1.m1.1.1" xref="S2.I1.i1.I1.i2.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.I1.i2.p1.1.m1.1b"><ci id="S2.I1.i1.I1.i2.p1.1.m1.1.1.cmml" xref="S2.I1.i1.I1.i2.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.I1.i2.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.I1.i2.p1.1.m1.1d">italic_E</annotation></semantics></math> compares two candidates and determines their relative quality.</p> </div> </li> <li class="ltx_item" id="S2.I1.i1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i3.1.1.1">–</span></span> <div class="ltx_para" id="S2.I1.i1.I1.i3.p1"> <p class="ltx_p" id="S2.I1.i1.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i3.p1.1.1">List-wise Evaluation:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I1.i1.I1.i3.p1.1.m1.1"><semantics id="S2.I1.i1.I1.i3.p1.1.m1.1a"><mi id="S2.I1.i1.I1.i3.p1.1.m1.1.1" xref="S2.I1.i1.I1.i3.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.I1.i3.p1.1.m1.1b"><ci id="S2.I1.i1.I1.i3.p1.1.m1.1.1.cmml" xref="S2.I1.i1.I1.i3.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.I1.i3.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.I1.i3.p1.1.m1.1d">italic_E</annotation></semantics></math> ranks multiple (more than 2) candidates.</p> </div> </li> </ul> </div> </li> <li class="ltx_item" id="S2.I1.i2" style="list-style-type:none;"> <span class="ltx_tag 
ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i2.p1"> <p class="ltx_p" id="S2.I1.i2.p1.1"><math alttext="\mathcal{C}" class="ltx_Math" display="inline" id="S2.I1.i2.p1.1.m1.1"><semantics id="S2.I1.i2.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i2.p1.1.m1.1.1" xref="S2.I1.i2.p1.1.m1.1.1.cmml">𝒞</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i2.p1.1.m1.1b"><ci id="S2.I1.i2.p1.1.m1.1.1.cmml" xref="S2.I1.i2.p1.1.m1.1.1">𝒞</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i2.p1.1.m1.1c">\mathcal{C}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i2.p1.1.m1.1d">caligraphic_C</annotation></semantics></math> – Evaluation Criteria. It describes the assessment aspects, e.g., correctness, helpfulness, readability, etc.</p> </div> </li> <li class="ltx_item" id="S2.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i3.p1"> <p class="ltx_p" id="S2.I1.i3.p1.5"><math alttext="\mathcal{X}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.1.m1.1"><semantics id="S2.I1.i3.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i3.p1.1.m1.1.1" xref="S2.I1.i3.p1.1.m1.1.1.cmml">𝒳</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.1.m1.1b"><ci id="S2.I1.i3.p1.1.m1.1.1.cmml" xref="S2.I1.i3.p1.1.m1.1.1">𝒳</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.1.m1.1c">\mathcal{X}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.1.m1.1d">caligraphic_X</annotation></semantics></math> – Evaluation Item. It is the content to be judged. Note that <math alttext="\mathcal{X}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.2.m2.1"><semantics id="S2.I1.i3.p1.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i3.p1.2.m2.1.1" xref="S2.I1.i3.p1.2.m2.1.1.cmml">𝒳</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.2.m2.1b"><ci id="S2.I1.i3.p1.2.m2.1.1.cmml" xref="S2.I1.i3.p1.2.m2.1.1">𝒳</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.2.m2.1c">\mathcal{X}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.2.m2.1d">caligraphic_X</annotation></semantics></math> can be a single candidate or a set of candidates <math alttext="(x_{1},x_{2},...,x_{n})" class="ltx_Math" display="inline" id="S2.I1.i3.p1.3.m3.4"><semantics id="S2.I1.i3.p1.3.m3.4a"><mrow id="S2.I1.i3.p1.3.m3.4.4.3" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml"><mo id="S2.I1.i3.p1.3.m3.4.4.3.4" stretchy="false" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">(</mo><msub id="S2.I1.i3.p1.3.m3.2.2.1.1" xref="S2.I1.i3.p1.3.m3.2.2.1.1.cmml"><mi id="S2.I1.i3.p1.3.m3.2.2.1.1.2" xref="S2.I1.i3.p1.3.m3.2.2.1.1.2.cmml">x</mi><mn id="S2.I1.i3.p1.3.m3.2.2.1.1.3" xref="S2.I1.i3.p1.3.m3.2.2.1.1.3.cmml">1</mn></msub><mo id="S2.I1.i3.p1.3.m3.4.4.3.5" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I1.i3.p1.3.m3.3.3.2.2" xref="S2.I1.i3.p1.3.m3.3.3.2.2.cmml"><mi id="S2.I1.i3.p1.3.m3.3.3.2.2.2" xref="S2.I1.i3.p1.3.m3.3.3.2.2.2.cmml">x</mi><mn id="S2.I1.i3.p1.3.m3.3.3.2.2.3" xref="S2.I1.i3.p1.3.m3.3.3.2.2.3.cmml">2</mn></msub><mo id="S2.I1.i3.p1.3.m3.4.4.3.6" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><mi id="S2.I1.i3.p1.3.m3.1.1" mathvariant="normal" xref="S2.I1.i3.p1.3.m3.1.1.cmml">…</mi><mo id="S2.I1.i3.p1.3.m3.4.4.3.7" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I1.i3.p1.3.m3.4.4.3.3" xref="S2.I1.i3.p1.3.m3.4.4.3.3.cmml"><mi id="S2.I1.i3.p1.3.m3.4.4.3.3.2" xref="S2.I1.i3.p1.3.m3.4.4.3.3.2.cmml">x</mi><mi 
id="S2.I1.i3.p1.3.m3.4.4.3.3.3" xref="S2.I1.i3.p1.3.m3.4.4.3.3.3.cmml">n</mi></msub><mo id="S2.I1.i3.p1.3.m3.4.4.3.8" stretchy="false" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.3.m3.4b"><vector id="S2.I1.i3.p1.3.m3.4.4.4.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3"><apply id="S2.I1.i3.p1.3.m3.2.2.1.1.cmml" xref="S2.I1.i3.p1.3.m3.2.2.1.1"><csymbol cd="ambiguous" id="S2.I1.i3.p1.3.m3.2.2.1.1.1.cmml" xref="S2.I1.i3.p1.3.m3.2.2.1.1">subscript</csymbol><ci id="S2.I1.i3.p1.3.m3.2.2.1.1.2.cmml" xref="S2.I1.i3.p1.3.m3.2.2.1.1.2">𝑥</ci><cn id="S2.I1.i3.p1.3.m3.2.2.1.1.3.cmml" type="integer" xref="S2.I1.i3.p1.3.m3.2.2.1.1.3">1</cn></apply><apply id="S2.I1.i3.p1.3.m3.3.3.2.2.cmml" xref="S2.I1.i3.p1.3.m3.3.3.2.2"><csymbol cd="ambiguous" id="S2.I1.i3.p1.3.m3.3.3.2.2.1.cmml" xref="S2.I1.i3.p1.3.m3.3.3.2.2">subscript</csymbol><ci id="S2.I1.i3.p1.3.m3.3.3.2.2.2.cmml" xref="S2.I1.i3.p1.3.m3.3.3.2.2.2">𝑥</ci><cn id="S2.I1.i3.p1.3.m3.3.3.2.2.3.cmml" type="integer" xref="S2.I1.i3.p1.3.m3.3.3.2.2.3">2</cn></apply><ci id="S2.I1.i3.p1.3.m3.1.1.cmml" xref="S2.I1.i3.p1.3.m3.1.1">…</ci><apply id="S2.I1.i3.p1.3.m3.4.4.3.3.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3"><csymbol cd="ambiguous" id="S2.I1.i3.p1.3.m3.4.4.3.3.1.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3">subscript</csymbol><ci id="S2.I1.i3.p1.3.m3.4.4.3.3.2.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3.2">𝑥</ci><ci id="S2.I1.i3.p1.3.m3.4.4.3.3.3.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3.3">𝑛</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.3.m3.4c">(x_{1},x_{2},...,x_{n})</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.3.m3.4d">( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )</annotation></semantics></math>, where <math alttext="x_{i}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.4.m4.1"><semantics id="S2.I1.i3.p1.4.m4.1a"><msub id="S2.I1.i3.p1.4.m4.1.1" xref="S2.I1.i3.p1.4.m4.1.1.cmml"><mi id="S2.I1.i3.p1.4.m4.1.1.2" xref="S2.I1.i3.p1.4.m4.1.1.2.cmml">x</mi><mi id="S2.I1.i3.p1.4.m4.1.1.3" xref="S2.I1.i3.p1.4.m4.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.4.m4.1b"><apply id="S2.I1.i3.p1.4.m4.1.1.cmml" xref="S2.I1.i3.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S2.I1.i3.p1.4.m4.1.1.1.cmml" xref="S2.I1.i3.p1.4.m4.1.1">subscript</csymbol><ci id="S2.I1.i3.p1.4.m4.1.1.2.cmml" xref="S2.I1.i3.p1.4.m4.1.1.2">𝑥</ci><ci id="S2.I1.i3.p1.4.m4.1.1.3.cmml" xref="S2.I1.i3.p1.4.m4.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.4.m4.1c">x_{i}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.4.m4.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> is the <math alttext="i^{th}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.5.m5.1"><semantics id="S2.I1.i3.p1.5.m5.1a"><msup id="S2.I1.i3.p1.5.m5.1.1" xref="S2.I1.i3.p1.5.m5.1.1.cmml"><mi id="S2.I1.i3.p1.5.m5.1.1.2" xref="S2.I1.i3.p1.5.m5.1.1.2.cmml">i</mi><mrow id="S2.I1.i3.p1.5.m5.1.1.3" xref="S2.I1.i3.p1.5.m5.1.1.3.cmml"><mi id="S2.I1.i3.p1.5.m5.1.1.3.2" xref="S2.I1.i3.p1.5.m5.1.1.3.2.cmml">t</mi><mo id="S2.I1.i3.p1.5.m5.1.1.3.1" xref="S2.I1.i3.p1.5.m5.1.1.3.1.cmml">⁢</mo><mi id="S2.I1.i3.p1.5.m5.1.1.3.3" xref="S2.I1.i3.p1.5.m5.1.1.3.3.cmml">h</mi></mrow></msup><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.5.m5.1b"><apply 
id="S2.I1.i3.p1.5.m5.1.1.cmml" xref="S2.I1.i3.p1.5.m5.1.1"><csymbol cd="ambiguous" id="S2.I1.i3.p1.5.m5.1.1.1.cmml" xref="S2.I1.i3.p1.5.m5.1.1">superscript</csymbol><ci id="S2.I1.i3.p1.5.m5.1.1.2.cmml" xref="S2.I1.i3.p1.5.m5.1.1.2">𝑖</ci><apply id="S2.I1.i3.p1.5.m5.1.1.3.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3"><times id="S2.I1.i3.p1.5.m5.1.1.3.1.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3.1"></times><ci id="S2.I1.i3.p1.5.m5.1.1.3.2.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3.2">𝑡</ci><ci id="S2.I1.i3.p1.5.m5.1.1.3.3.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3.3">ℎ</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.5.m5.1c">i^{th}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.5.m5.1d">italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT</annotation></semantics></math> candidate to be evaluated.</p> </div> </li> <li class="ltx_item" id="S2.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i4.p1"> <p class="ltx_p" id="S2.I1.i4.p1.1"><math alttext="\mathcal{R}" class="ltx_Math" display="inline" id="S2.I1.i4.p1.1.m1.1"><semantics id="S2.I1.i4.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i4.p1.1.m1.1.1" xref="S2.I1.i4.p1.1.m1.1.1.cmml">ℛ</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i4.p1.1.m1.1b"><ci id="S2.I1.i4.p1.1.m1.1.1.cmml" xref="S2.I1.i4.p1.1.m1.1.1">ℛ</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i4.p1.1.m1.1c">\mathcal{R}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i4.p1.1.m1.1d">caligraphic_R</annotation></semantics></math> – Optional reference information. For instance, a code implementation that successfully fulfills the requirements of a coding task.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S2.p5"> <p class="ltx_p" id="S2.p5.1">Given these inputs, <math alttext="E" class="ltx_Math" display="inline" id="S2.p5.1.m1.1"><semantics id="S2.p5.1.m1.1a"><mi id="S2.p5.1.m1.1.1" xref="S2.p5.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.p5.1.m1.1b"><ci id="S2.p5.1.m1.1.1.cmml" xref="S2.p5.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.p5.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.p5.1.m1.1d">italic_E</annotation></semantics></math> can produce three primary <span class="ltx_text ltx_font_bold" id="S2.p5.1.1">outputs</span>:</p> </div> <div class="ltx_para" id="S2.p6"> <ul class="ltx_itemize" id="S2.I2"> <li class="ltx_item" id="S2.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I2.i1.p1"> <p class="ltx_p" id="S2.I2.i1.p1.2"><math alttext="\mathcal{Y}" class="ltx_Math" display="inline" id="S2.I2.i1.p1.1.m1.1"><semantics id="S2.I2.i1.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i1.p1.1.m1.1.1" xref="S2.I2.i1.p1.1.m1.1.1.cmml">𝒴</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i1.p1.1.m1.1b"><ci id="S2.I2.i1.p1.1.m1.1.1.cmml" xref="S2.I2.i1.p1.1.m1.1.1">𝒴</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.p1.1.m1.1c">\mathcal{Y}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.p1.1.m1.1d">caligraphic_Y</annotation></semantics></math> – Evaluation Result. 
<math alttext="\mathcal{Y}" class="ltx_Math" display="inline" id="S2.I2.i1.p1.2.m2.1"><semantics id="S2.I2.i1.p1.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i1.p1.2.m2.1.1" xref="S2.I2.i1.p1.2.m2.1.1.cmml">𝒴</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i1.p1.2.m2.1b"><ci id="S2.I2.i1.p1.2.m2.1.1.cmml" xref="S2.I2.i1.p1.2.m2.1.1">𝒴</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.p1.2.m2.1c">\mathcal{Y}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.p1.2.m2.1d">caligraphic_Y</annotation></semantics></math> can take one of the following forms:</p> <ul class="ltx_itemize" id="S2.I2.i1.I1"> <li class="ltx_item" id="S2.I2.i1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="S2.I2.i1.I1.i1.p1"> <p class="ltx_p" id="S2.I2.i1.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i1.p1.1.1">Graded Evaluation:</span> Each candidate is assigned a score, either numerical (discrete or continuous) or categorical.</p> </div> </li> <li class="ltx_item" id="S2.I2.i1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i2.1.1.1">–</span></span> <div class="ltx_para" id="S2.I2.i1.I1.i2.p1"> <p class="ltx_p" id="S2.I2.i1.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i2.p1.1.1">Rank Ordering:</span> Candidates <math alttext="(x_{1},x_{2},...,x_{n})" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i2.p1.1.m1.4"><semantics id="S2.I2.i1.I1.i2.p1.1.m1.4a"><mrow id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml"><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.4" stretchy="false" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">(</mo><msub id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.cmml"><mi id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2.cmml">x</mi><mn id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3.cmml">1</mn></msub><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.5" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.cmml"><mi id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2.cmml">x</mi><mn id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3.cmml">2</mn></msub><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.6" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">,</mo><mi id="S2.I2.i1.I1.i2.p1.1.m1.1.1" mathvariant="normal" xref="S2.I2.i1.I1.i2.p1.1.m1.1.1.cmml">…</mi><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.7" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.cmml"><mi id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2.cmml">x</mi><mi id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3.cmml">n</mi></msub><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.8" stretchy="false" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i2.p1.1.m1.4b"><vector id="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3"><apply id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1">subscript</csymbol><ci 
id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2">𝑥</ci><cn id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3.cmml" type="integer" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3">1</cn></apply><apply id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2">subscript</csymbol><ci id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2">𝑥</ci><cn id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3.cmml" type="integer" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3">2</cn></apply><ci id="S2.I2.i1.I1.i2.p1.1.m1.1.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.1.1">…</ci><apply id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3">subscript</csymbol><ci id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2">𝑥</ci><ci id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3">𝑛</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i2.p1.1.m1.4c">(x_{1},x_{2},...,x_{n})</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.I1.i2.p1.1.m1.4d">( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )</annotation></semantics></math> are ranked based on their assessed quality.</p> </div> </li> <li class="ltx_item" id="S2.I2.i1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i3.1.1.1">–</span></span> <div class="ltx_para" id="S2.I2.i1.I1.i3.p1"> <p class="ltx_p" id="S2.I2.i1.I1.i3.p1.3"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i3.p1.3.1">Best-Choice Selection:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i3.p1.1.m1.1"><semantics id="S2.I2.i1.I1.i3.p1.1.m1.1a"><mi id="S2.I2.i1.I1.i3.p1.1.m1.1.1" xref="S2.I2.i1.I1.i3.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i3.p1.1.m1.1b"><ci id="S2.I2.i1.I1.i3.p1.1.m1.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i3.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.I1.i3.p1.1.m1.1d">italic_E</annotation></semantics></math> selects the most suitable candidate (<math alttext="x_{i}" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i3.p1.2.m2.1"><semantics id="S2.I2.i1.I1.i3.p1.2.m2.1a"><msub id="S2.I2.i1.I1.i3.p1.2.m2.1.1" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.cmml"><mi id="S2.I2.i1.I1.i3.p1.2.m2.1.1.2" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.2.cmml">x</mi><mi id="S2.I2.i1.I1.i3.p1.2.m2.1.1.3" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i3.p1.2.m2.1b"><apply id="S2.I2.i1.I1.i3.p1.2.m2.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.2.m2.1.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.2.m2.1.1.2.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.2">𝑥</ci><ci id="S2.I2.i1.I1.i3.p1.2.m2.1.1.3.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i3.p1.2.m2.1c">x_{i}</annotation><annotation encoding="application/x-llamapun" 
id="S2.I2.i1.I1.i3.p1.2.m2.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math>) from the given set <math alttext="(x_{1},x_{2},...,x_{n})" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i3.p1.3.m3.4"><semantics id="S2.I2.i1.I1.i3.p1.3.m3.4a"><mrow id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml"><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.4" stretchy="false" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">(</mo><msub id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.cmml"><mi id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2.cmml">x</mi><mn id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3.cmml">1</mn></msub><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.5" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.cmml"><mi id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2.cmml">x</mi><mn id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3.cmml">2</mn></msub><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.6" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><mi id="S2.I2.i1.I1.i3.p1.3.m3.1.1" mathvariant="normal" xref="S2.I2.i1.I1.i3.p1.3.m3.1.1.cmml">…</mi><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.7" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.cmml"><mi id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2.cmml">x</mi><mi id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3.cmml">n</mi></msub><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.8" stretchy="false" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i3.p1.3.m3.4b"><vector id="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3"><apply id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2">𝑥</ci><cn id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3.cmml" type="integer" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3">1</cn></apply><apply id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2">𝑥</ci><cn id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3.cmml" type="integer" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3">2</cn></apply><ci id="S2.I2.i1.I1.i3.p1.3.m3.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.1.1">…</ci><apply id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2">𝑥</ci><ci id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3">𝑛</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i3.p1.3.m3.4c">(x_{1},x_{2},...,x_{n})</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.I1.i3.p1.3.m3.4d">( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 
end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )</annotation></semantics></math>.</p> </div> </li> </ul> </div> </li> <li class="ltx_item" id="S2.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I2.i2.p1"> <p class="ltx_p" id="S2.I2.i2.p1.1"><math alttext="\mathcal{E}" class="ltx_Math" display="inline" id="S2.I2.i2.p1.1.m1.1"><semantics id="S2.I2.i2.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i2.p1.1.m1.1.1" xref="S2.I2.i2.p1.1.m1.1.1.cmml">ℰ</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i2.p1.1.m1.1b"><ci id="S2.I2.i2.p1.1.m1.1.1.cmml" xref="S2.I2.i2.p1.1.m1.1.1">ℰ</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i2.p1.1.m1.1c">\mathcal{E}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i2.p1.1.m1.1d">caligraphic_E</annotation></semantics></math> – Evaluation Explanation. Optionally provides reasoning and justification for the evaluation.</p> </div> </li> <li class="ltx_item" id="S2.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I2.i3.p1"> <p class="ltx_p" id="S2.I2.i3.p1.2"><math alttext="\mathcal{F}" class="ltx_Math" display="inline" id="S2.I2.i3.p1.1.m1.1"><semantics id="S2.I2.i3.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i3.p1.1.m1.1.1" xref="S2.I2.i3.p1.1.m1.1.1.cmml">ℱ</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i3.p1.1.m1.1b"><ci id="S2.I2.i3.p1.1.m1.1.1.cmml" xref="S2.I2.i3.p1.1.m1.1.1">ℱ</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i3.p1.1.m1.1c">\mathcal{F}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i3.p1.1.m1.1d">caligraphic_F</annotation></semantics></math> – Feedback. Optionally offers constructive suggestions for improving the evaluated item <math alttext="\mathcal{X}" class="ltx_Math" display="inline" id="S2.I2.i3.p1.2.m2.1"><semantics id="S2.I2.i3.p1.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i3.p1.2.m2.1.1" xref="S2.I2.i3.p1.2.m2.1.1.cmml">𝒳</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i3.p1.2.m2.1b"><ci id="S2.I2.i3.p1.2.m2.1.1.cmml" xref="S2.I2.i3.p1.2.m2.1.1">𝒳</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i3.p1.2.m2.1c">\mathcal{X}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i3.p1.2.m2.1d">caligraphic_X</annotation></semantics></math>.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S2.p7"> <p class="ltx_p" id="S2.p7.1"><span class="ltx_text ltx_font_bold" id="S2.p7.1.1">Difference to previous work. </span> Note that this definition of LLM-as-a-Judge is stricter than the definition proposed by Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite>. Wang et al. broadly define LLM-as-a-Judge as any method that leverages any features of LLM to evaluate content quality, which includes embedding-based approaches, such as BERTScore <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib83" title="">2019</a>)</cite> and CodeBERTScore <cite class="ltx_cite ltx_citemacro_citep">(Zhou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib88" title="">2023</a>)</cite>. 
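To make these components concrete, the sketch below shows one possible shape of an LLM-as-a-Judge call that returns a score together with the optional explanation $\mathcal{E}$ and feedback $\mathcal{F}$. It is a minimal illustration under stated assumptions: the call_llm stub, the prompt wording, and the JSON fields are invented here, not an interface prescribed by this paper; the pairwise and best-choice protocols differ only in passing several candidates instead of one.

```python
import json

def call_llm(prompt: str) -> str:
    """Stub for an LLM chat call; replace with your provider's client."""
    raise NotImplementedError

def judge(item: str, criteria: list[str], context: str = "") -> dict:
    """Direct assessment: score a single evaluated item X against criteria C."""
    prompt = (
        "You are an expert software engineering evaluator.\n"
        f"Context: {context}\n"
        f"Item to evaluate:\n{item}\n"
        f"Criteria: {', '.join(criteria)}\n"
        "Return JSON with keys: score (1-5), explanation, feedback."
    )
    verdict = json.loads(call_llm(prompt))
    return {
        "score": verdict["score"],                        # evaluation outcome
        "explanation": verdict.get("explanation", ""),    # optional E
        "feedback": verdict.get("feedback", ""),          # optional F
    }
```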
Difference to previous work. Note that this definition of LLM-as-a-Judge is stricter than the definition proposed by Wang et al. (Wang et al., 2025). Wang et al. broadly define LLM-as-a-Judge as any method that leverages any feature of an LLM to evaluate content quality, which includes embedding-based approaches such as BERTScore (Zhang et al., 2019) and CodeBERTScore (Zhou et al., 2023). Embedding-based methods evaluate content similarity by converting both the evaluated content and the references into numerical representations (LLM embeddings). However, we argue that their definition is better characterized as LLM-based evaluation rather than LLM-as-a-Judge. The key distinction is that we envision LLM-as-a-Judge as a true surrogate for human evaluation. In our view, an LLM-as-a-Judge system should be capable of independently assessing content without requiring a strict reference. Moreover, it should be able to perform a wide spectrum of nuanced assessments (e.g., usefulness, adequacy, and conciseness) and ideally provide justifications or constructive feedback. In contrast, embedding-based methods rely solely on numerical representations to assess content similarity and inherently lack these capabilities. Consequently, we do not include them in our definition and instead regard them as part of the broader LLM-based evaluation covered by previous work.

3. Literature Review

In this section, we provide an overview of studies that evaluate the effectiveness of LLM-as-a-Judge in SE tasks, with Table 1 providing a structured summary that maps each study to its corresponding SE task. Following this, we present a detailed discussion of the evaluation objectives of these studies.

Table 1. Literature Review on the Applications of LLM-as-a-Judge in Software Engineering

Task | Reference
Code Generation | (Wang et al., 2025; Tong and Zhang, 2024; Ahmed et al., 2024a; Patel et al., 2024; Tan et al., 2024; Zhao et al., 2025; Zhuo, 2023; Sollenberger et al., 2024; Weyssow et al., 2024; Xu et al., 2024)
Code Summarization | (Wang et al., 2025; Ahmed et al., 2024a; Farchi et al., 2024; Wu et al., 2024; Weyssow et al., 2024)
Bug Report Summarization | (Kumar et al., 2024)
Code Translation | (Wang et al., 2025)
Question Answering | (Zheng et al., 2024)
Requirements Causality Extraction | (Ahmed et al., 2024a)
Code Patches Generation | (Yadavally et al., 2025; Ahmed et al., 2024a; Li et al., 2024c)
ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite></td> </tr> <tr class="ltx_tr" id="S3.T1.1.6.5"> <td class="ltx_td ltx_align_left" id="S3.T1.1.6.5.1">Question Answering</td> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.6.5.2"><cite class="ltx_cite ltx_citemacro_citep">(Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib86" title="">2024</a>)</cite></td> </tr> <tr class="ltx_tr" id="S3.T1.1.7.6"> <td class="ltx_td ltx_align_left" id="S3.T1.1.7.6.1">Requirements Causality Extraction</td> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.7.6.2"><cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite></td> </tr> <tr class="ltx_tr" id="S3.T1.1.8.7"> <td class="ltx_td ltx_align_left ltx_border_bb" id="S3.T1.1.8.7.1">Code Patches Generation</td> <td class="ltx_td ltx_nopad_r ltx_align_left ltx_border_bb" id="S3.T1.1.8.7.2"><cite class="ltx_cite ltx_citemacro_citep">(Yadavally et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib72" title="">2025</a>; Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>; Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib36" title="">2024c</a>)</cite></td> </tr> </tbody> </table> </figure> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1. </span>Code Generation</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1">Effectively and automatically evaluating the quality of generated code is one of the most critical problems in software engineering. Before the introduction of LLM-as-a-Judge, automated code evaluation primarily relied on test-based method, i.e., executing the code based on pre-defined test cases <cite class="ltx_cite ltx_citemacro_citep">(Zhuo et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib90" title="">2024</a>; Chen et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib9" title="">2021</a>; Hendrycks et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib24" title="">2021</a>)</cite>. However, passing all the test cases does not necessarily mean the quality of the code is good, as the test cases may not comprehensively cover all the edge cases <cite class="ltx_cite ltx_citemacro_citep">(Dou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib12" title="">2024</a>)</cite>. Constructing test cases can also be particularly challenging for certain tasks <cite class="ltx_cite ltx_citemacro_citep">(Tong and Zhang, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib64" title="">2024</a>)</cite>, such as AI model training, web scraping, etc. 
When test cases are unavailable, many studies rely on reference-based metrics <cite class="ltx_cite ltx_citemacro_citep">(Dehaerne et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib11" title="">2022</a>; Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib87" title="">2023</a>)</cite>, such as CodeBLEU <cite class="ltx_cite ltx_citemacro_citep">(Ren et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib52" title="">2020</a>)</cite> and CodeBERTScore <cite class="ltx_cite ltx_citemacro_citep">(Zhou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib88" title="">2023</a>)</cite>. However, the quality of the reference code restricts the effectiveness of reference-based metrics. Prior research <cite class="ltx_cite ltx_citemacro_citep">(Evtikhiev et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib13" title="">2023</a>)</cite> highlighted that reference-based metrics frequently misalign human judgments in code generation.</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <p class="ltx_p" id="S3.SS1.p2.1">Consequently, a substantial body of research has explored the effectiveness of LLMs in evaluating code generation <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>; Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib86" title="">2024</a>; Tong and Zhang, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib64" title="">2024</a>; Patel et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib48" title="">2024</a>; Tan et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib62" title="">2024</a>; Zhao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib85" title="">2025</a>; Zhuo, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib89" title="">2023</a>)</cite>. We summarize three main characteristics of LLM-as-a-Judge that are different from the test-based and reference-based metrics:</p> </div> <div class="ltx_para" id="S3.SS1.p3"> <ul class="ltx_itemize" id="S3.I1"> <li class="ltx_item" id="S3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i1.p1.1.1">Execution-free:</span> It does not require composary execution of the code.</p> </div> </li> <li class="ltx_item" id="S3.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i2.p1"> <p class="ltx_p" id="S3.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i2.p1.1.1">Reference-free:</span> Unlike reference-based metrics, LLM-as-a-Judge does not require a reference code. Even though we can still optionally provide the reference code. However, as demonstrated by Zhuo et al. 
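For contrast with the test-based evaluation described above, the snippet below sketches the conventional approach of executing a candidate solution against pre-defined test cases in a separate interpreter process. The task, candidate code, and tests are invented for illustration, and real harnesses add much stronger isolation; even when such a harness reports success, qualities such as readability and helpfulness remain unmeasured, which is the gap LLM-as-a-Judge targets.

```python
import subprocess
import sys
import tempfile
import textwrap

CANDIDATE = textwrap.dedent("""
    def add(a, b):
        return a + b
""")

TESTS = textwrap.dedent("""
    assert add(1, 2) == 3
    assert add(-1, 1) == 0
    print("ALL_PASSED")
""")

def run_tests(candidate: str, tests: str, timeout: float = 5.0) -> bool:
    """Execute candidate code plus assert-based tests in a fresh interpreter."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return "ALL_PASSED" in proc.stdout
    except subprocess.TimeoutExpired:
        return False

print(run_tests(CANDIDATE, TESTS))  # True, yet says nothing about readability or style
```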
<cite class="ltx_cite ltx_citemacro_citep">(Zhuo, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib89" title="">2023</a>)</cite>, providing the reference does not generally improve the code evaluation performance.</p> </div> </li> <li class="ltx_item" id="S3.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i3.p1"> <p class="ltx_p" id="S3.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i3.p1.1.1">Multi-Facet Evaluation:</span> LLM-as-a-Judge assesses intrinsic code qualities that traditionally required human judgment, such as readability and usefulness.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S3.SS1.p4"> <p class="ltx_p" id="S3.SS1.p4.1">Among existing studies, Zhuo et al. <cite class="ltx_cite ltx_citemacro_citep">(Zhuo, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib89" title="">2023</a>)</cite> and Weyssow et al. <cite class="ltx_cite ltx_citemacro_citep">(Weyssow et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib67" title="">2024</a>)</cite> utilize GPT-3.5 <cite class="ltx_cite ltx_citemacro_citep">(OpenAI, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib44" title="">2023a</a>)</cite>, and Xu et al. <cite class="ltx_cite ltx_citemacro_citep">(Xu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib71" title="">2024</a>)</cite> use GPT-4 to annotate code snippets. CodeJudge <cite class="ltx_cite ltx_citemacro_citep">(Tong and Zhang, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib64" title="">2024</a>)</cite> first leverages a detailed taxonomy of common programming errors to guide the LLM in analyzing the generated code. Then, CodeJudge summarizes the analysis report to produce its final judgment. Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> experiment with a series of general-purpose LLM-as-a-Judge methods from the natural language processing (NLP) domain in code generation evaluation, including BatchEval <cite class="ltx_cite ltx_citemacro_citep">(Yuan et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib80" title="">2023</a>)</cite>, GPTScore <cite class="ltx_cite ltx_citemacro_citep">(Fu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib16" title="">2023</a>)</cite>, and G-Eval <cite class="ltx_cite ltx_citemacro_citep">(Liu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib42" title="">2023</a>)</cite>. Ahmed et al. <cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> investigate LLM’s ability to evaluate variable name-value inconsistency and function similarities. Gu et al. <cite class="ltx_cite ltx_citemacro_citep">(Gu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib22" title="">2024</a>)</cite> focus on analyzing LLM’s judgments on counterfeit code samples, i.e., incorrect programs generated by language models that pass weak but non-trivial correctness checks. Zhao et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Zhao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib85" title="">2025</a>)</cite> empirically evaluate the performance of 12 LLMs on code generation evaluation. Patel et al. <cite class="ltx_cite ltx_citemacro_citep">(Patel et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib48" title="">2024</a>)</cite> hypothesis that a combination of multiple evaluators can approximate the optimal evaluation. Thus, they propose the AIME framework, which employs multiple LLMs to evaluate different aspects of the code, including code correctness, readability, runtime performance, etc.</p> </div> <div class="ltx_para" id="S3.SS1.p5"> <p class="ltx_p" id="S3.SS1.p5.1">Further, we summarize the common evaluation criteria used in the literature for code generation. We categorize them into two main criteria: <span class="ltx_text ltx_font_italic" id="S3.SS1.p5.1.1">Code Functionality</span> and <span class="ltx_text ltx_font_italic" id="S3.SS1.p5.1.2">Code Quality</span>.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS1.p6"> <p class="ltx_p" id="S3.SS1.p6.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p6.1.1">Code Functionality:</span> This criterion assesses the overall effectiveness, maintainability, and clarity of the code beyond its functional correctness. It evaluates intrinsic code attributes. Key aspects include:</p> </div> <div class="ltx_para" id="S3.SS1.p7"> <ul class="ltx_itemize" id="S3.I2"> <li class="ltx_item" id="S3.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I2.i1.p1"> <p class="ltx_p" id="S3.I2.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i1.p1.1.1">Execution Stability:</span> LLM-as-a-Judge’s evaluation process does not mandate code execution, meaning that running the code is not a required step.</p> </div> </li> <li class="ltx_item" id="S3.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I2.i2.p1"> <p class="ltx_p" id="S3.I2.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i2.p1.1.1">Functional Correctness:</span> Verifies whether the code produces the expected output according to the task description.</p> </div> </li> <li class="ltx_item" id="S3.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I2.i3.p1"> <p class="ltx_p" id="S3.I2.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i3.p1.1.1">Fault Tolerance &amp; Error Handling:</span> Evaluates how the code manages edge cases, exceptions, invalid inputs, and unexpected failures.</p> </div> </li> </ul> </div> <div class="ltx_para ltx_noindent" id="S3.SS1.p8"> <p class="ltx_p" id="S3.SS1.p8.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p8.1.1">Code Quality:</span> These criteria evaluate intrinsic attributes such as readability and adherence to best practices.</p> <ul class="ltx_itemize" id="S3.I3"> <li class="ltx_item" id="S3.I3.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i1.p1"> <p class="ltx_p" id="S3.I3.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i1.p1.1.1">Complexity &amp; Efficiency:</span> Analyzes computational complexity, execution time, and memory usage.</p> </div> </li> <li class="ltx_item" id="S3.I3.i2" style="list-style-type:none;"> <span class="ltx_tag 
ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i2.p1"> <p class="ltx_p" id="S3.I3.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i2.p1.1.1">Helpfulness:</span> Evaluates whether the code contributes meaningfully to solving the problem. This includes distinguishing partially correct solutions from completely unuseful and irrelevant code.</p> </div> </li> <li class="ltx_item" id="S3.I3.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i3.p1"> <p class="ltx_p" id="S3.I3.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i3.p1.1.1">Readability:</span> Measures how easily the code can be understood, including aspects like fluency, clarity, and conciseness.</p> </div> </li> <li class="ltx_item" id="S3.I3.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i4.p1"> <p class="ltx_p" id="S3.I3.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i4.p1.1.1">Stylistic Consistency:</span> Ensures adherence to established coding standards, formatting guidelines, and best practices.</p> </div> </li> <li class="ltx_item" id="S3.I3.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i5.p1"> <p class="ltx_p" id="S3.I3.i5.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i5.p1.1.1">Reference Similarity:</span> Assesses how closely the generated code aligns with a given reference implementation.</p> </div> </li> <li class="ltx_item" id="S3.I3.i6" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i6.p1"> <p class="ltx_p" id="S3.I3.i6.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i6.p1.1.1">Minor Anomalies &amp; Warnings:</span> Identifies non-critical issues such as redundant code or unused variables that do not affect execution.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2. </span>Code Change</h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">Beyond code generation, LLM-as-a-Judge has been explored for evaluating code modifications. Ahmed et al. <cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> use LLM-as-a-Judge to determine whether a code patch effectively resolves a static analysis warning. Li et al. <cite class="ltx_cite ltx_citemacro_citep">(Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib36" title="">2024c</a>)</cite> apply LLMs to assess whether a patch successfully fixes a vulnerability, while Yadavally et al. <cite class="ltx_cite ltx_citemacro_citep">(Yadavally et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib72" title="">2025</a>)</cite> leverage LLMs to predict test outcomes of code patches.</p> </div> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.3. </span>Software Documentation Summarization</h3> <div class="ltx_para" id="S3.SS3.p1"> <p class="ltx_p" id="S3.SS3.p1.1">LLM-based judges have also been applied to assess natural language software documentation. Ahmed et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> and Wu et al. <cite class="ltx_cite ltx_citemacro_citep">(Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite> introduce LLM judges to evaluate code summaries. Wu et al. <cite class="ltx_cite ltx_citemacro_citep">(Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite> introduce CODERPE, a framework that utilizes multiple LLMs and assigns LLMs with different roles to evaluate code summaries from various perspectives. These roles include a code reviewer, the original code author, a code editor, and a system analyst. In addition, Kumar et al. <cite class="ltx_cite ltx_citemacro_citep">(Kumar et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib31" title="">2024</a>)</cite> utilized LLMs to evaluate the quality of bug report titles and summaries. For code summarization, the common evaluation criteria include:</p> <ul class="ltx_itemize" id="S3.I4"> <li class="ltx_item" id="S3.I4.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I4.i1.p1"> <p class="ltx_p" id="S3.I4.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I4.i1.p1.1.1">Language-related:</span> Ensure clear, well-structured sentences with precise and appropriate wording.</p> </div> </li> <li class="ltx_item" id="S3.I4.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I4.i2.p1"> <p class="ltx_p" id="S3.I4.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I4.i2.p1.1.1">Content-related:</span> Summarize the core functionality and logic concisely, providing sufficient detail without including unnecessary information.</p> </div> </li> <li class="ltx_item" id="S3.I4.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I4.i3.p1"> <p class="ltx_p" id="S3.I4.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I4.i3.p1.1.1">Effectiveness-related:</span> Evaluate whether the summary is useful for developers and enhances their understanding of the code.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="S3.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.4. </span>Other SE Tasks</h3> <div class="ltx_para" id="S3.SS4.p1"> <p class="ltx_p" id="S3.SS4.p1.1">LLM-as-a-Judge has also been explored in other software engineering tasks. Zheng et al. <cite class="ltx_cite ltx_citemacro_citep">(Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib86" title="">2024</a>)</cite> apply this approach to assess the quality of programming-related question answering. Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> investigate its effectiveness in assessing code translation quality. In addition, Ahmed et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> leverage LLMs to evaluate causal relationships extracted from natural language requirements.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4. </span>The Road Ahead</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">This section outlines a research roadmap for achieving scalable, reliable, and effective LLM-as-a-Judge systems in software engineering. We begin by outlining the key limitations in current approaches and then propose concrete research opportunities and actionable steps that can significantly broaden the applicability of LLM-as-a-Judge across diverse SE contexts, steering us toward our envisioned future.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1. </span>More Empirical Evaluation and Benchmarking</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p1.1.1">Limitation 1: Lack of High-Quality and Large-Scale Human-Annotated Benchmarks. </span> In Section <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#S3" title="3. Literature Review ‣ From Code to Courtroom: LLMs as the New Software Judges"><span class="ltx_text ltx_ref_tag">3</span></a>, we reviewed numerous studies evaluating the effectiveness of LLM-as-a-Judge in software engineering. Benchmarks such as HumanEval <cite class="ltx_cite ltx_citemacro_citep">(Chen et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib9" title="">2021</a>)</cite>, HumanEval-XL <cite class="ltx_cite ltx_citemacro_citep">(Peng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib50" title="">2024</a>)</cite>, and CoNaLa <cite class="ltx_cite ltx_citemacro_citep">(Yin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib79" title="">2018</a>)</cite> provide useful test cases to evaluate the ability of LLMs in assessing code correctness. However, these benchmarks fall short when assessing more nuanced aspects like helpfulness, readability, and alignment with human judgment. A critical limitation in existing LLM-as-a-Judge of SE is their reliance on small-scale datasets to measure human alignment <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>; Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite>, often comprising only a few hundred samples. For example, Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> conducted experiments on three SE tasks, i.e., Code Translation, Code Generation, and Code Summarization, using a total of only 450 samples. Similarly, Ahmed et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> evaluated code summarization using just 420 samples. While these sample sizes may be sufficient for demonstrating statistical significance, larger-scale benchmarks are also essential to ensure generalizability and mitigate threats to external validity.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p2.1.1">Limitation 2: Inconsistent Empirical Findings. </span>Related to these small-scale experiments, another major challenge is the inconsistency in empirical findings. A few works reached varying conclusions due to discrepancies in dataset selection, prompting strategies, and LLM configurations. For instance, Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> found that traditional evaluation metrics (e.g., ROUGE <cite class="ltx_cite ltx_citemacro_citep">(Lin, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib38" title="">2004</a>)</cite>, METEOR <cite class="ltx_cite ltx_citemacro_citep">(Banerjee and Lavie, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib7" title="">2005</a>)</cite>, and ChrF++ <cite class="ltx_cite ltx_citemacro_citep">(Popović, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib51" title="">2017</a>)</cite>) significantly outperformed LLM-as-a-Judge methods when aligning with human judgment for code summarization. In contrast, Wu et al. <cite class="ltx_cite ltx_citemacro_citep">(Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite> reported that the LLM-as-a-Judge method surpassed conventional metrics in code summarization. These conflicting results highlight the need for a standardized, large-scale empirical study to enable fair and meaningful comparisons across studies.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p3"> <p class="ltx_p" id="S4.SS1.p3.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p3.1.1">Limitation 3: Lack of Empirical Findings in Bias of LLM-as-a-Judge systems in SE.</span> Despite the growing adoption of LLM-as-a-Judge in SE, software artifacts and data may contain biases <cite class="ltx_cite ltx_citemacro_citep">(Widyasari et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib68" title="">2022</a>; Yang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib73" title="">2021</a>)</cite>. LLMs are already known to be susceptible to various biases and fairness issues. 
For example, Position bias <cite class="ltx_cite ltx_citemacro_citep">(Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib37" title="">2024b</a>)</cite>, where LLMs may favor responses based on their positioning within the text when making comparison; Verbosity bias <cite class="ltx_cite ltx_citemacro_citep">(Jiao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib29" title="">2024</a>)</cite>, where LLMs show a tendency for longer, verbose responses, regardless of their qualities; Egocentric Bias <cite class="ltx_cite ltx_citemacro_citep">(Ye et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib77" title="">2024</a>)</cite>, where LLMs may overrate AI-generated solutions compared to human-written code. However, currently, there is a lack of thorough empirical investigation into the bias of LLM-as-a-Judge systems in SE.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p4"> <p class="ltx_p" id="S4.SS1.p4.1"><span class="ltx_text ltx_font_bold ltx_font_italic" id="S4.SS1.p4.1.1">Opportunity: Develop Comprehensive and Diverse Benchmarks.</span> Future research should work on creating large-scale, multi-dimensional benchmarks that capture the complexity of real-world software engineering tasks. Key steps include:</p> <ol class="ltx_enumerate" id="S4.I1"> <li class="ltx_item" id="S4.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">(1)</span> <div class="ltx_para" id="S4.I1.i1.p1"> <p class="ltx_p" id="S4.I1.i1.p1.1"><span class="ltx_text ltx_font_italic" id="S4.I1.i1.p1.1.1">Expert Annotations and Quality Control:</span> Engage a diverse pool of expert programmers to provide high-quality annotations. This annotation process should focus more on the nuanced criteria beyond code correctness, such as readability, helpfulness, and maintainability. In the beginning, define clear annotation guidelines that describe these nuanced dimensions to eliminate ambiguity. During the process, cross-validation and consensus mechanisms should be conducted to ensure the reliability of the annotation.</p> </div> </li> <li class="ltx_item" id="S4.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">(2)</span> <div class="ltx_para" id="S4.I1.i2.p1"> <p class="ltx_p" id="S4.I1.i2.p1.1"><span class="ltx_text ltx_font_italic" id="S4.I1.i2.p1.1.1">Dataset Expansion:</span> Expand existing datasets to include a larger volume of high-quality expert-annotated samples. These benchmarks should encompass SE tasks at varying difficulty levels, various programming topics, and multiple programming languages. Additionally, new benchmarks should be extended to evaluate a broader spectrum of SE artifacts. For instance, expert evaluations could be gathered for software requirements documentation, system design, and API documentation. In the long term, we may also consider updating human preferences in the benchmarks to reflect evolving SE practices.</p> </div> </li> </ol> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p5"> <p class="ltx_p" id="S4.SS1.p5.1"><span class="ltx_text ltx_font_bold ltx_font_italic" id="S4.SS1.p5.1.1">Opportunity: Comprehensive Empirical Evaluation in SE Tasks.</span> Ultimately, these benchmarks will lay a solid foundation for understanding the strengths and weaknesses of LLM-as-a-Judge systems in SE. 
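A simple way to start probing position bias empirically is to query the judge twice with the candidate order swapped and count how often the verdict flips, as in the sketch below. The pairwise prompt and the call_llm stub are illustrative assumptions, not a protocol taken from the cited studies.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def pick_better(task: str, a: str, b: str) -> str:
    """Ask the judge which candidate is better; expects 'A' or 'B'."""
    prompt = (f"Task:\n{task}\n\nCandidate A:\n{a}\n\nCandidate B:\n{b}\n\n"
              "Which candidate is better? Answer with exactly 'A' or 'B'.")
    return call_llm(prompt).strip().upper()[:1]

def position_bias_rate(pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of (task, a, b) pairs whose verdict changes when order is swapped."""
    flips = 0
    for task, a, b in pairs:
        first = pick_better(task, a, b)    # 'A' refers to candidate a
        second = pick_better(task, b, a)   # after the swap, 'A' refers to b
        consistent = (first, second) in {("A", "B"), ("B", "A")}
        flips += 0 if consistent else 1
    return flips / max(len(pairs), 1)
```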
Opportunity: Develop Comprehensive and Diverse Benchmarks. Future research should work on creating large-scale, multi-dimensional benchmarks that capture the complexity of real-world software engineering tasks. Key steps include:

(1) Expert Annotations and Quality Control: Engage a diverse pool of expert programmers to provide high-quality annotations. This annotation process should focus on nuanced criteria beyond code correctness, such as readability, helpfulness, and maintainability. At the outset, clear annotation guidelines that describe these nuanced dimensions should be defined to eliminate ambiguity. During the process, cross-validation and consensus mechanisms should be applied to ensure the reliability of the annotations.

(2) Dataset Expansion: Expand existing datasets to include a larger volume of high-quality expert-annotated samples. These benchmarks should encompass SE tasks at varying difficulty levels, various programming topics, and multiple programming languages. Additionally, new benchmarks should be extended to evaluate a broader spectrum of SE artifacts; for instance, expert evaluations could be gathered for software requirements documentation, system design, and API documentation. In the long term, we may also consider updating the human preferences in the benchmarks to reflect evolving SE practices.

Opportunity: Comprehensive Empirical Evaluation in SE Tasks. Ultimately, these benchmarks will lay a solid foundation for understanding the strengths and weaknesses of LLM-as-a-Judge systems in SE. Large-scale experiments should systematically compare LLM-as-a-Judge methods with traditional metrics, while also investigating the impact of design choices such as prompting strategies and LLM configurations. Further, a thorough investigation into the bias and fairness of LLM-as-a-Judge systems in SE should be conducted.
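Once such human-annotated benchmarks exist, alignment between judge scores and expert ratings can be quantified with standard rank correlations, as in the short sketch below using SciPy's spearmanr and kendalltau; the score arrays are placeholders, not data from any cited study.

```python
from scipy.stats import spearmanr, kendalltau

# Placeholder data: one human rating and one LLM-judge rating per sample.
human_scores = [5, 3, 4, 2, 4, 1, 5, 3]
judge_scores = [4, 3, 5, 2, 3, 2, 5, 4]

rho, rho_p = spearmanr(human_scores, judge_scores)
tau, tau_p = kendalltau(human_scores, judge_scores)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3f}), Kendall tau={tau:.2f} (p={tau_p:.3f})")
```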
4.2. Better Judgment From Internal Factors

Limitation 4: Inadequate SE Domain-Specific Expertise in LLMs. Ideally, human evaluations should be conducted by experienced developers. Like human evaluators, an LLM must exhibit a deep understanding of the task and the relevant domain knowledge to provide an accurate evaluation. Recent research highlights that LLMs have remarkable coding abilities, which builds confidence in their ability to assess tasks like code summarization and generation. However, research also indicates that some complex coding tasks are still challenging for LLMs (Zhuo et al., 2024; Zhao et al., 2024). For instance, Zhao et al. (Zhao et al., 2024) observed that current LLMs cannot generate a complete library from scratch, which also implies that LLMs' ability to evaluate library quality is still limited. Similar limitations extend to other critical areas in software engineering, such as software design, formal verification, and distributed systems debugging. Beyond gaps in SE-specific knowledge, LLMs may also struggle to discern subtle differences among alternative solutions. As Zhao et al. (Zhao et al., 2025) highlighted, an LLM's ability to generate correct code does not guarantee that it can effectively evaluate alternative implementations for the same task.

Opportunity: Enhancing SE Expertise in LLMs. Future research can strengthen the SE domain-specific expertise of LLMs. A key approach is training LLMs on richer and higher-quality SE datasets (Liu et al., 2024a). Enhancing SE expertise in LLMs offers several benefits; for instance, LLMs with stronger formal verification capabilities can more accurately assess code correctness. Further, research can expand LLM-as-a-Judge's applicability to a broader range of SE artifacts, such as Dockerfiles, UML diagrams, and formal specifications.

Opportunity: Embedding Expert Tacit Knowledge. Another critical research direction for enhancing judgment involves capturing the tacit and procedural knowledge that human experts develop over years of experience, i.e., their intuition and unspoken decision-making processes. We can employ methods such as structured interviews (Segal et al., 2006), think-aloud protocols (Zhang and Zhang, 2019), and cognitive task analyses (Schraagen et al., 2000), where experts are asked to articulate their reasoning while conducting real-world evaluations. This approach enables the systematic extraction of procedural knowledge that is typically not documented in software artifacts. Once captured, this knowledge can be integrated into LLM training through techniques like reinforcement learning (Bai et al., 2022) or neuro-symbolic methods (Kwon et al., 2023).

4.3. Better Judgment Through External Factors

Limitation 5: Reliance on Internal Evaluation Mechanisms. Current works tend to depend solely on LLMs' own abilities for evaluation. This shortfall underscores the need for additional mechanisms to assist LLM-as-a-Judge in decision-making.

Opportunity: Integrating SE Tools with LLM-as-a-Judge. Just as humans rely on tools to assist their judgment, LLM-as-a-Judge can incorporate outputs from various analysis tools. Integrating tools such as static analyzers, formal verification frameworks, and model checkers into the evaluation pipeline, or conversely embedding LLM-as-a-Judge into other tools like integrated development environments (IDEs) (Shi et al., 2024a, b, 2023), can add additional layers of validation.
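One lightweight way to realize this tool integration is to attach the findings of an off-the-shelf linter to the judge prompt so that the model grounds its verdict in tool evidence. The sketch below shells out to flake8 as an example analyzer (any static analysis tool could be substituted, and flake8 must be installed); the prompt format and the call_llm stub are assumptions.

```python
import subprocess
import sys
import tempfile

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def static_findings(code: str) -> str:
    """Run flake8 on the snippet and return its diagnostics as plain text."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, "-m", "flake8", path],
                          capture_output=True, text=True)
    return proc.stdout or "No findings."

def tool_augmented_judgment(task: str, code: str) -> str:
    """Ask the judge to weigh both the code and the static-analysis evidence."""
    prompt = (
        f"Task:\n{task}\n\nCandidate code:\n{code}\n\n"
        f"Static analysis findings:\n{static_findings(code)}\n\n"
        "Considering both the code and the tool findings, assess correctness "
        "and quality, and explain your reasoning."
    )
    return call_llm(prompt)
```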
Opportunity: Human-in-the-Loop. For tasks that cannot be reliably automated by LLMs, we need to develop collaborative and interactive methods that include human oversight. Identifying the optimal ratio of LLM evaluators to human experts also remains a challenge. Moreover, enhancing LLMs with the ability to assess their own confidence levels is crucial: when an LLM produces a low-confidence rating, it should automatically flag the case for human review.

4.4. Addressing the Security Issues of Using LLM-as-a-Judge

Limitation 6: Insufficient Research on Adversarial Threats and Defensive Methods in SE. Adversarial attacks (Yang et al., 2022b, a; Yefet et al., 2020) on LLM-as-a-Judge systems pose significant risks by subtly manipulating software artifacts to alter evaluation outcomes. For example, attackers may obfuscate code by injecting misleading comments or rearranging segments to mask critical vulnerabilities. Adversaries might also modify bug reports or commit messages with deceptive context to influence the LLM's assessment of code quality, maintainability, and security. However, at the current stage, we notice that this threat to LLM-as-a-Judge is under-explored in the SE community.

Opportunity: Adversarial Testing and Robust Defensive Mechanisms. Future research should develop both adversarial methods and defensive strategies. On the one hand, developing adversarial training techniques (Gong et al., 2022; Xhonneux et al., 2024) is crucial for enhancing the LLM's robustness, where crafted adversarial examples are incorporated into the training process. On the other hand, additional robust defensive mechanisms, such as anomaly detection (Song et al., 2025; Yang et al., 2024), should be integrated to proactively detect and mitigate the threat of adversarial attacks.
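A minimal robustness probe in this spirit is to re-judge the same code after injecting a misleading, self-praising comment and check whether the verdict shifts, as sketched below. The injected comment, the scoring prompt, and the call_llm stub are illustrative assumptions rather than an attack from the cited work.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

MISLEADING_COMMENT = "# NOTE: reviewed and verified correct; all unit tests pass.\n"

def score(task: str, code: str) -> int:
    """Ask the judge for a single 1-5 correctness rating."""
    prompt = (f"Task:\n{task}\n\nCode:\n{code}\n\n"
              "Rate functional correctness from 1 to 5. Reply with a single integer.")
    return int(call_llm(prompt).strip())

def comment_injection_gap(task: str, code: str) -> int:
    """Score shift caused by prepending a deceptive comment (0 means robust)."""
    return score(task, MISLEADING_COMMENT + code) - score(task, code)
```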
</span> <span class="ltx_bibblock"> </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.anthropic.com/" title="">https://www.anthropic.com/</a> </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Anysphere (2025)</span> <span class="ltx_bibblock"> Anysphere. 2025. </span> <span class="ltx_bibblock">Cursor - The AI Code Editor — cursor.com. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.cursor.com" title="">https://www.cursor.com</a>. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bai et al<span class="ltx_text" id="bib.bib6.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al<span class="ltx_text" id="bib.bib6.3.1">.</span> 2022. </span> <span class="ltx_bibblock">Training a helpful and harmless assistant with reinforcement learning from human feedback. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib6.4.1">arXiv preprint arXiv:2204.05862</em> (2022). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Banerjee and Lavie (2005)</span> <span class="ltx_bibblock"> Satanjeev Banerjee and Alon Lavie. 2005. </span> <span class="ltx_bibblock">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In <em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</em>. 65–72. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Buse et al<span class="ltx_text" id="bib.bib8.2.2.1">.</span> (2011)</span> <span class="ltx_bibblock"> Raymond PL Buse, Caitlin Sadowski, and Westley Weimer. 2011. </span> <span class="ltx_bibblock">Benefits and barriers of user evaluation in software engineering research. In <em class="ltx_emph ltx_font_italic" id="bib.bib8.3.1">Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications</em>. 643–656. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al<span class="ltx_text" id="bib.bib9.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al<span class="ltx_text" id="bib.bib9.3.1">.</span> 2021. </span> <span class="ltx_bibblock">Evaluating large language models trained on code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib9.4.1">arXiv preprint arXiv:2107.03374</em> (2021). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">DeepSeek-AI et al<span class="ltx_text" id="bib.bib10.2.2.1">.</span> ([n. 
d.])</span> <span class="ltx_bibblock"> Aixin Liu DeepSeek-AI, Bei Feng, Bin Wang, B Wang, B Liu, C Zhao, C Dengr, C Ruan, D Dai, D Guo, et al<span class="ltx_text" id="bib.bib10.3.1">.</span> [n. d.]. </span> <span class="ltx_bibblock">Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib10.4.1">URL https://arxiv. org/abs/2405.04434</em> ([n. d.]). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dehaerne et al<span class="ltx_text" id="bib.bib11.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Enrique Dehaerne, Bappaditya Dey, Sandip Halder, Stefan De Gendt, and Wannes Meert. 2022. </span> <span class="ltx_bibblock">Code generation using machine learning: A systematic review. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib11.3.1">Ieee Access</em> 10 (2022), 82434–82455. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dou et al<span class="ltx_text" id="bib.bib12.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, et al<span class="ltx_text" id="bib.bib12.3.1">.</span> 2024. </span> <span class="ltx_bibblock">What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib12.4.1">arXiv preprint arXiv:2407.06153</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Evtikhiev et al<span class="ltx_text" id="bib.bib13.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. 2023. </span> <span class="ltx_bibblock">Out of the bleu: how should we assess quality of the code generation models? </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib13.3.1">Journal of Systems and Software</em> 203 (2023), 111741. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Fan et al<span class="ltx_text" id="bib.bib14.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. </span> <span class="ltx_bibblock">Large Language Models for Software Engineering: Survey and Open Problems. In <em class="ltx_emph ltx_font_italic" id="bib.bib14.3.1">2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE)</em>. 31–53. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/ICSE-FoSE59343.2023.00008" title="">https://doi.org/10.1109/ICSE-FoSE59343.2023.00008</a> </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Farchi et al<span class="ltx_text" id="bib.bib15.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Eitan Farchi, Shmulik Froimovich, Rami Katan, and Orna Raz. 2024. 
</span> <span class="ltx_bibblock">Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib15.3.1">arXiv preprint arXiv:2410.21071</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Fu et al<span class="ltx_text" id="bib.bib16.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. </span> <span class="ltx_bibblock">Gptscore: Evaluate as you desire. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib16.3.1">arXiv preprint arXiv:2302.04166</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gao et al<span class="ltx_text" id="bib.bib17.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. 2025. </span> <span class="ltx_bibblock">The Current Challenges of Software Engineering in the Era of Large Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib17.3.1">ACM Trans. Softw. Eng. Methodol.</em> (Jan. 2025). </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3712005" title="">https://doi.org/10.1145/3712005</a> </span> <span class="ltx_bibblock">Just Accepted. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">GitHub (2025)</span> <span class="ltx_bibblock"> GitHub. 2025. </span> <span class="ltx_bibblock">GitHub Copilot: Your AI Pair Programmer. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/features/copilot/" title="">https://github.com/features/copilot/</a>. </span> <span class="ltx_bibblock"> </span> <span class="ltx_bibblock">Accessed on [insert access date]. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">GitHub (2025)</span> <span class="ltx_bibblock"> GitHub. 2025. </span> <span class="ltx_bibblock">GitHub Copilot · Your AI pair programmer. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/features/copilot/" title="">https://github.com/features/copilot/</a>. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gong et al<span class="ltx_text" id="bib.bib20.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Chen Gong, Zhou Yang, Yunpeng Bai, Jieke Shi, Arunesh Sinha, Bowen Xu, David Lo, Xinwen Hou, and Guoliang Fan. 2022. </span> <span class="ltx_bibblock">Curiosity-driven and victim-aware adversarial policies. In <em class="ltx_emph ltx_font_italic" id="bib.bib20.3.1">Proceedings of the 38th Annual Computer Security Applications Conference</em>. 186–200. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Goodfellow et al<span class="ltx_text" id="bib.bib21.2.2.1">.</span> (2016)</span> <span class="ltx_bibblock"> Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib21.3.1">Deep learning</em>. Vol. 1. 
</span> <span class="ltx_bibblock">MIT press Cambridge. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gu et al<span class="ltx_text" id="bib.bib22.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Alex Gu, Wen-Ding Li, Naman Jain, Theo Olausson, Celine Lee, Koushik Sen, and Armando Solar-Lezama. 2024. </span> <span class="ltx_bibblock">The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?. In <em class="ltx_emph ltx_font_italic" id="bib.bib22.3.1">Findings of the Association for Computational Linguistics: ACL 2024</em>, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 74–117. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.18653/v1/2024.findings-acl.7" title="">https://doi.org/10.18653/v1/2024.findings-acl.7</a> </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Guo et al<span class="ltx_text" id="bib.bib23.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al<span class="ltx_text" id="bib.bib23.3.1">.</span> 2025. </span> <span class="ltx_bibblock">Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib23.4.1">arXiv preprint arXiv:2501.12948</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hendrycks et al<span class="ltx_text" id="bib.bib24.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al<span class="ltx_text" id="bib.bib24.3.1">.</span> 2021. </span> <span class="ltx_bibblock">Measuring coding challenge competence with apps (2021). </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib24.4.1">arXiv preprint arXiv:2105.09938</em> (2021). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hou et al<span class="ltx_text" id="bib.bib25.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. </span> <span class="ltx_bibblock">Large Language Models for Software Engineering: A Systematic Literature Review. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib25.3.1">ACM Trans. Softw. Eng. Methodol.</em> 33, 8, Article 220 (Dec. 2024), 79 pages. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3695988" title="">https://doi.org/10.1145/3695988</a> </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hu et al<span class="ltx_text" id="bib.bib26.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Xing Hu, Qiuyuan Chen, Haoye Wang, Xin Xia, David Lo, and Thomas Zimmermann. 2022. 
</span> <span class="ltx_bibblock">Correlating automated and human evaluation of code documentation generation quality. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib26.3.1">ACM Transactions on Software Engineering and Methodology (TOSEM)</em> 31, 4 (2022), 1–28. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ivanova (2025)</span> <span class="ltx_bibblock"> Anna A Ivanova. 2025. </span> <span class="ltx_bibblock">How to evaluate the cognitive abilities of LLMs. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">Nature Human Behaviour</em> (2025), 1–4. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jiang et al<span class="ltx_text" id="bib.bib28.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. </span> <span class="ltx_bibblock">Impact of Code Language Models on Automated Program Repair. In <em class="ltx_emph ltx_font_italic" id="bib.bib28.3.1">Proceedings of the 45th International Conference on Software Engineering</em> (Melbourne, Victoria, Australia) <em class="ltx_emph ltx_font_italic" id="bib.bib28.4.2">(ICSE ’23)</em>. IEEE Press, 1430–1442. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/ICSE48619.2023.00125" title="">https://doi.org/10.1109/ICSE48619.2023.00125</a> </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jiao et al<span class="ltx_text" id="bib.bib29.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Tong Jiao, Jian Zhang, Kui Xu, Rui Li, Xi Du, Shangqi Wang, and Zhenbo Song. 2024. </span> <span class="ltx_bibblock">Enhancing Fairness in LLM Evaluations: Unveiling and Mitigating Biases in Standard-Answer-Based Evaluations. In <em class="ltx_emph ltx_font_italic" id="bib.bib29.3.1">Proceedings of the AAAI Symposium Series</em>, Vol. 4. 56–59. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jin et al<span class="ltx_text" id="bib.bib30.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. </span> <span class="ltx_bibblock">InferFix: End-to-End Program Repair with LLMs. In <em class="ltx_emph ltx_font_italic" id="bib.bib30.3.1">Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering</em> (San Francisco, CA, USA) <em class="ltx_emph ltx_font_italic" id="bib.bib30.4.2">(ESEC/FSE 2023)</em>. Association for Computing Machinery, New York, NY, USA, 1646–1656. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3611643.3613892" title="">https://doi.org/10.1145/3611643.3613892</a> </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kumar et al<span class="ltx_text" id="bib.bib31.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, and Partha Pratim Chakrabarti. 2024. 
</span> <span class="ltx_bibblock">LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib31.3.1">arXiv preprint arXiv:2409.00630</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kwon et al<span class="ltx_text" id="bib.bib32.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Joe Kwon, Sydney Levine, and Joshua B Tenenbaum. 2023. </span> <span class="ltx_bibblock">Neuro-symbolic models of human moral judgment: LLMs as automatic feature extractors. </span> <span class="ltx_bibblock">(2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">LeCun et al<span class="ltx_text" id="bib.bib33.2.2.1">.</span> (2015)</span> <span class="ltx_bibblock"> Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. </span> <span class="ltx_bibblock">Deep learning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib33.3.1">nature</em> 521, 7553 (2015), 436–444. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib34.2.2.1">.</span> (2024a)</span> <span class="ltx_bibblock"> Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024a. </span> <span class="ltx_bibblock">Llms-as-judges: a comprehensive survey on llm-based evaluation methods. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.3.1">arXiv preprint arXiv:2412.05579</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib35.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al<span class="ltx_text" id="bib.bib35.3.1">.</span> 2023. </span> <span class="ltx_bibblock">Starcoder: may the source be with you! </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib35.4.1">arXiv preprint arXiv:2305.06161</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib36.2.2.1">.</span> (2024c)</span> <span class="ltx_bibblock"> Yikun Li, Ting Zhang, Ratnadira Widyasari, Yan Naing Tun, Huu Hung Nguyen, Tan Bui, Ivana Clairine Irsan, Yiran Cheng, Xiang Lan, Han Wei Ang, et al<span class="ltx_text" id="bib.bib36.3.1">.</span> 2024c. </span> <span class="ltx_bibblock">CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib36.4.1">arXiv preprint arXiv:2411.17274</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib37.2.2.1">.</span> (2024b)</span> <span class="ltx_bibblock"> Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, and Yang Liu. 2024b. 
</span> <span class="ltx_bibblock">Split and Merge: Aligning Position Biases in LLM-based Evaluators. In <em class="ltx_emph ltx_font_italic" id="bib.bib37.3.1">Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</em>, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 11084–11108. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.18653/v1/2024.emnlp-main.621" title="">https://doi.org/10.18653/v1/2024.emnlp-main.621</a> </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lin (2004)</span> <span class="ltx_bibblock"> Chin-Yew Lin. 2004. </span> <span class="ltx_bibblock">Rouge: A package for automatic evaluation of summaries. In <em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">Text summarization branches out</em>. 74–81. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib39.2.2.1">.</span> (2024b)</span> <span class="ltx_bibblock"> Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al<span class="ltx_text" id="bib.bib39.3.1">.</span> 2024b. </span> <span class="ltx_bibblock">Deepseek-v3 technical report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib39.4.1">arXiv preprint arXiv:2412.19437</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib40.2.2.1">.</span> (2024c)</span> <span class="ltx_bibblock"> Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. 2024c. </span> <span class="ltx_bibblock">Exploring and evaluating hallucinations in llm-powered code generation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.3.1">arXiv preprint arXiv:2404.00971</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib41.2.2.1">.</span> (2024a)</span> <span class="ltx_bibblock"> Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. 2024a. </span> <span class="ltx_bibblock">Datasets for large language models: A comprehensive survey. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib41.3.1">arXiv preprint arXiv:2402.18041</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib42.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. </span> <span class="ltx_bibblock">G-eval: Nlg evaluation using gpt-4 with better human alignment. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib42.3.1">arXiv preprint arXiv:2303.16634</em> (2023). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lozhkov et al<span class="ltx_text" id="bib.bib43.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al<span class="ltx_text" id="bib.bib43.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Starcoder 2 and the stack v2: The next generation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib43.4.1">arXiv preprint arXiv:2402.19173</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">OpenAI (2023a)</span> <span class="ltx_bibblock"> OpenAI. 2023a. </span> <span class="ltx_bibblock">GPT-3.5: Advancements in Large Language Models. </span> <span class="ltx_bibblock">OpenAI Technical Report. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openai.com/research/gpt-3-5" title="">https://openai.com/research/gpt-3-5</a> </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">OpenAI (2023b)</span> <span class="ltx_bibblock"> OpenAI. 2023b. </span> <span class="ltx_bibblock">GPT-4 Technical Report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib45.1.1">arXiv preprint arXiv:2303.08774</em> (2023). </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2303.08774" title="">https://arxiv.org/abs/2303.08774</a> </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Pan et al<span class="ltx_text" id="bib.bib46.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2023. </span> <span class="ltx_bibblock">Understanding the effectiveness of large language models in code translation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib46.3.1">CoRR</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Papineni et al<span class="ltx_text" id="bib.bib47.2.2.1">.</span> (2002)</span> <span class="ltx_bibblock"> Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. </span> <span class="ltx_bibblock">Bleu: a method for automatic evaluation of machine translation. In <em class="ltx_emph ltx_font_italic" id="bib.bib47.3.1">Proceedings of the 40th annual meeting of the Association for Computational Linguistics</em>. 311–318. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Patel et al<span class="ltx_text" id="bib.bib48.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Bhrij Patel, Souradip Chakraborty, Wesley A Suttle, Mengdi Wang, Amrit Singh Bedi, and Dinesh Manocha. 2024. </span> <span class="ltx_bibblock">AIME: AI System Optimization via Multiple LLM Evaluators. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib48.3.1">arXiv preprint arXiv:2410.03131</em> (2024). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Patil (2025)</span> <span class="ltx_bibblock"> Avinash Patil. 2025. </span> <span class="ltx_bibblock">Advancing Reasoning in Large Language Models: Promising Methods and Approaches. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib49.1.1">arXiv preprint arXiv:2502.03671</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Peng et al<span class="ltx_text" id="bib.bib50.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Qiwei Peng, Yekun Chai, and Xuhong Li. 2024. </span> <span class="ltx_bibblock">Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib50.3.1">arXiv preprint arXiv:2402.16694</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Popović (2017)</span> <span class="ltx_bibblock"> Maja Popović. 2017. </span> <span class="ltx_bibblock">chrF++: words helping character n-grams. In <em class="ltx_emph ltx_font_italic" id="bib.bib51.1.1">Proceedings of the second conference on machine translation</em>. 612–618. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ren et al<span class="ltx_text" id="bib.bib52.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. </span> <span class="ltx_bibblock">Codebleu: a method for automatic evaluation of code synthesis. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib52.3.1">arXiv preprint arXiv:2009.10297</em> (2020). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Roy et al<span class="ltx_text" id="bib.bib53.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. 2021. </span> <span class="ltx_bibblock">Reassessing automatic evaluation metrics for code summarization tasks. In <em class="ltx_emph ltx_font_italic" id="bib.bib53.3.1">Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</em>. 1105–1116. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Schraagen et al<span class="ltx_text" id="bib.bib54.2.2.1">.</span> (2000)</span> <span class="ltx_bibblock"> Jan Maarten Schraagen, Susan F Chipman, and Valerie L Shalin. 2000. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib54.3.1">Cognitive task analysis</em>. </span> <span class="ltx_bibblock">Psychology Press. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Segal et al<span class="ltx_text" id="bib.bib55.2.2.1">.</span> (2006)</span> <span class="ltx_bibblock"> Daniel L Segal, Frederick L Coolidge, Alisa O’Riley, and Benjamin A Heinz. 2006. 
</span> <span class="ltx_bibblock">Structured and semistructured interviews. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib55.3.1">Clinician’s handbook of adult behavioral assessment</em>. Elsevier, 121–144. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al<span class="ltx_text" id="bib.bib56.2.2.1">.</span> (2024b)</span> <span class="ltx_bibblock"> Jieke Shi, Zhou Yang, Hong Jin Kang, Bowen Xu, Junda He, and David Lo. 2024b. </span> <span class="ltx_bibblock">Greening Large Language Models of Code. In <em class="ltx_emph ltx_font_italic" id="bib.bib56.3.1">Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Society</em> (Lisbon, Portugal) <em class="ltx_emph ltx_font_italic" id="bib.bib56.4.2">(ICSE-SEIS’24)</em>. Association for Computing Machinery, New York, NY, USA, 142–153. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3639475.3640097" title="">https://doi.org/10.1145/3639475.3640097</a> </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al<span class="ltx_text" id="bib.bib57.2.2.1">.</span> (2024a)</span> <span class="ltx_bibblock"> Jieke Shi, Zhou Yang, and David Lo. 2024a. </span> <span class="ltx_bibblock">Efficient and Green Large Language Models for Software Engineering: Vision and the Road Ahead. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib57.3.1">ACM Trans. Softw. Eng. Methodol.</em> (Dec. 2024). </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3708525" title="">https://doi.org/10.1145/3708525</a> </span> <span class="ltx_bibblock">Just Accepted. </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al<span class="ltx_text" id="bib.bib58.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Jieke Shi, Zhou Yang, Bowen Xu, Hong Jin Kang, and David Lo. 2023. </span> <span class="ltx_bibblock">Compressing Pre-trained Models of Code into 3 MB. In <em class="ltx_emph ltx_font_italic" id="bib.bib58.3.1">Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering</em> (Rochester, MI, USA) <em class="ltx_emph ltx_font_italic" id="bib.bib58.4.2">(ASE ’22)</em>. Association for Computing Machinery, New York, NY, USA, Article 24, 12 pages. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3551349.3556964" title="">https://doi.org/10.1145/3551349.3556964</a> </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sollenberger et al<span class="ltx_text" id="bib.bib59.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, and Sunita Chandrasekaran. 2024. </span> <span class="ltx_bibblock">LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites. In <em class="ltx_emph ltx_font_italic" id="bib.bib59.3.1">SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>. IEEE, 1885–1893. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Song et al<span class="ltx_text" id="bib.bib60.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Shuang Song, Yifei Zhang, and Neng Gao. 2025. </span> <span class="ltx_bibblock">Confront Insider Threat: Precise Anomaly Detection in Behavior Logs Based on LLM Fine-Tuning. In <em class="ltx_emph ltx_font_italic" id="bib.bib60.3.1">Proceedings of the 31st International Conference on Computational Linguistics</em>, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.). Association for Computational Linguistics, Abu Dhabi, UAE, 8589–8601. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/2025.coling-main.574/" title="">https://aclanthology.org/2025.coling-main.574/</a> </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sun et al<span class="ltx_text" id="bib.bib61.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu Chen. 2024. </span> <span class="ltx_bibblock">Source code summarization in the era of large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib61.3.1">arXiv preprint arXiv:2407.07959</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tan et al<span class="ltx_text" id="bib.bib62.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, and Ion Stoica. 2024. </span> <span class="ltx_bibblock">Judgebench: A benchmark for evaluating llm-based judges. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib62.3.1">arXiv preprint arXiv:2410.12784</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib63"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tian (2005)</span> <span class="ltx_bibblock"> Jeff Tian. 2005. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib63.1.1">Software quality engineering: testing, quality assurance, and quantifiable improvement</em>. </span> <span class="ltx_bibblock">John Wiley &amp; Sons. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib64"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tong and Zhang (2024)</span> <span class="ltx_bibblock"> Weixi Tong and Tianyi Zhang. 2024. </span> <span class="ltx_bibblock">CodeJudge: Evaluating Code Generation with Large Language Models. In <em class="ltx_emph ltx_font_italic" id="bib.bib64.1.1">Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</em>, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 20032–20051. 
</span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.18653/v1/2024.emnlp-main.1118" title="">https://doi.org/10.18653/v1/2024.emnlp-main.1118</a> </span> </li> <li class="ltx_bibitem" id="bib.bib65"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al<span class="ltx_text" id="bib.bib65.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. </span> <span class="ltx_bibblock">Software Testing With Large Language Models: Survey, Landscape, and Vision. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib65.3.1">IEEE Transactions on Software Engineering</em> 50, 4 (2024), 911–936. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/TSE.2024.3368208" title="">https://doi.org/10.1109/TSE.2024.3368208</a> </span> </li> <li class="ltx_bibitem" id="bib.bib66"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al<span class="ltx_text" id="bib.bib66.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. </span> <span class="ltx_bibblock">Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib66.3.1">arXiv preprint arXiv:2502.06193</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib67"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Weyssow et al<span class="ltx_text" id="bib.bib67.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui. 2024. </span> <span class="ltx_bibblock">Codeultrafeedback: An llm-as-a-judge dataset for aligning large language models to coding preferences. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib67.3.1">arXiv preprint arXiv:2403.09032</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib68"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Widyasari et al<span class="ltx_text" id="bib.bib68.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Ratnadira Widyasari, Stefanus Agus Haryono, Ferdian Thung, Jieke Shi, Constance Tan, Fiona Wee, Jack Phan, and David Lo. 2022. </span> <span class="ltx_bibblock">On the Influence of Biases in Bug Localization: Evaluation and Benchmark. In <em class="ltx_emph ltx_font_italic" id="bib.bib68.3.1">2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)</em>. 128–139. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/SANER53432.2022.00027" title="">https://doi.org/10.1109/SANER53432.2022.00027</a> </span> </li> <li class="ltx_bibitem" id="bib.bib69"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wu et al<span class="ltx_text" id="bib.bib69.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Yang Wu, Yao Wan, Zhaoyang Chu, Wenting Zhao, Ye Liu, Hongyu Zhang, Xuanhua Shi, and Philip S Yu. 2024. </span> <span class="ltx_bibblock">Can Large Language Models Serve as Evaluators for Code Summarization? </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib69.3.1">arXiv preprint arXiv:2412.01333</em> (2024). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib70"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xhonneux et al<span class="ltx_text" id="bib.bib70.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, and Leo Schwinn. 2024. </span> <span class="ltx_bibblock">Efficient Adversarial Training in LLMs with Continuous Attacks. </span> <span class="ltx_bibblock"> </span> <span class="ltx_bibblock">arXiv:2405.15589 [cs.LG] <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2405.15589" title="">https://arxiv.org/abs/2405.15589</a> </span> </li> <li class="ltx_bibitem" id="bib.bib71"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xu et al<span class="ltx_text" id="bib.bib71.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Fangzhou Xu, Sai Zhang, Zhenchang Xing, Xiaowang Zhang, Yahong Han, and Zhiyong Feng. 2024. </span> <span class="ltx_bibblock">Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib71.3.1">arXiv preprint arXiv:2412.00314</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib72"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yadavally et al<span class="ltx_text" id="bib.bib72.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Aashish Yadavally, Hoan Nguyen, Laurent Callot, and Gauthier Guinet. 2025. </span> <span class="ltx_bibblock">Large Language Model Critics for Execution-Free Evaluation of Code Changes. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib72.3.1">arXiv preprint arXiv:2501.16655</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib73"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib73.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Zhou Yang, Harshit Jain, Jieke Shi, Muhammad Hilmi Asyrofi, and David Lo. 2021. </span> <span class="ltx_bibblock">BiasHeal: On-the-Fly Black-Box Healing of Bias in Sentiment Analysis Systems. In <em class="ltx_emph ltx_font_italic" id="bib.bib73.3.1">2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)</em>. 644–648. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/ICSME52107.2021.00073" title="">https://doi.org/10.1109/ICSME52107.2021.00073</a> </span> </li> <li class="ltx_bibitem" id="bib.bib74"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib74.2.2.1">.</span> (2022a)</span> <span class="ltx_bibblock"> Zhou Yang, Jieke Shi, Muhammad Hilmi Asyrofi, and David Lo. 2022a. </span> <span class="ltx_bibblock">Revisiting Neuron Coverage Metrics and Quality of Deep Neural Networks. In <em class="ltx_emph ltx_font_italic" id="bib.bib74.3.1">2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)</em>. 408–419. 
</span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/SANER53432.2022.00056" title="">https://doi.org/10.1109/SANER53432.2022.00056</a> </span> </li> <li class="ltx_bibitem" id="bib.bib75"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib75.2.2.1">.</span> (2022b)</span> <span class="ltx_bibblock"> Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022b. </span> <span class="ltx_bibblock">Natural attack for pre-trained models of code. In <em class="ltx_emph ltx_font_italic" id="bib.bib75.3.1">Proceedings of the 44th International Conference on Software Engineering</em>. 1482–1493. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib76"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib76.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Zhou Yang, Bowen Xu, Jie M. Zhang, Hong Jin Kang, Jieke Shi, Junda He, and David Lo. 2024. </span> <span class="ltx_bibblock">Stealthy Backdoor Attack for Code Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib76.3.1">IEEE Transactions on Software Engineering</em> 50, 4 (2024), 721–741. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/TSE.2024.3361661" title="">https://doi.org/10.1109/TSE.2024.3361661</a> </span> </li> <li class="ltx_bibitem" id="bib.bib77"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al<span class="ltx_text" id="bib.bib77.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al<span class="ltx_text" id="bib.bib77.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Justice or prejudice? quantifying biases in llm-as-a-judge. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib77.4.1">arXiv preprint arXiv:2410.02736</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib78"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yefet et al<span class="ltx_text" id="bib.bib78.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Noam Yefet, Uri Alon, and Eran Yahav. 2020. </span> <span class="ltx_bibblock">Adversarial examples for models of code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib78.3.1">Proceedings of the ACM on Programming Languages</em> 4, OOPSLA (2020), 1–30. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib79"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yin et al<span class="ltx_text" id="bib.bib79.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. </span> <span class="ltx_bibblock">Learning to mine aligned code and natural language pairs from stack overflow. In <em class="ltx_emph ltx_font_italic" id="bib.bib79.3.1">Proceedings of the 15th international conference on mining software repositories</em>. 476–486. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib80"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yuan et al<span class="ltx_text" id="bib.bib80.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, and Kan Li. 
2023. </span> <span class="ltx_bibblock">BatchEval: Towards Human-like Text Evaluation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib80.3.1">arXiv preprint arXiv:2401.00437</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib81"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yuan et al<span class="ltx_text" id="bib.bib81.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, and Yiling Lou. 2024. </span> <span class="ltx_bibblock">Transagent: An llm-based multi-agent system for code translation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib81.3.1">arXiv preprint arXiv:2409.19894</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib82"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang and Zhang (2019)</span> <span class="ltx_bibblock"> Lawrence Jun Zhang and Donglan Zhang. 2019. </span> <span class="ltx_bibblock">Think-aloud protocols. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib82.1.1">The Routledge handbook of research methods in applied linguistics</em>. Routledge, 302–311. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib83"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al<span class="ltx_text" id="bib.bib83.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. </span> <span class="ltx_bibblock">Bertscore: Evaluating text generation with bert. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib83.3.1">arXiv preprint arXiv:1904.09675</em> (2019). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib84"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhao et al<span class="ltx_text" id="bib.bib84.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, and Alexander M Rush. 2024. </span> <span class="ltx_bibblock">Commit0: Library Generation from Scratch. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib84.3.1">arXiv preprint arXiv:2412.01769</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib85"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhao et al<span class="ltx_text" id="bib.bib85.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, and Jing Ma. 2025. </span> <span class="ltx_bibblock">CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?. In <em class="ltx_emph ltx_font_italic" id="bib.bib85.3.1">Proceedings of the 31st International Conference on Computational Linguistics</em>, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.). Association for Computational Linguistics, Abu Dhabi, UAE, 73–95. 
</span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/2025.coling-main.7/" title="">https://aclanthology.org/2025.coling-main.7/</a> </span> </li> <li class="ltx_bibitem" id="bib.bib86"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zheng et al<span class="ltx_text" id="bib.bib86.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al<span class="ltx_text" id="bib.bib86.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Judging llm-as-a-judge with mt-bench and chatbot arena. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib86.4.1">Advances in Neural Information Processing Systems</em> 36 (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib87"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zheng et al<span class="ltx_text" id="bib.bib87.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al<span class="ltx_text" id="bib.bib87.3.1">.</span> 2023. </span> <span class="ltx_bibblock">Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In <em class="ltx_emph ltx_font_italic" id="bib.bib87.4.1">Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em>. 5673–5684. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib88"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al<span class="ltx_text" id="bib.bib88.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. 2023. </span> <span class="ltx_bibblock">Codebertscore: Evaluating code generation with pretrained models of code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib88.3.1">arXiv preprint arXiv:2302.05527</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib89"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhuo (2023)</span> <span class="ltx_bibblock"> Terry Yue Zhuo. 2023. </span> <span class="ltx_bibblock">ICE-Score: Instructing Large Language Models to Evaluate Code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib89.1.1">arXiv preprint arXiv:2304.14317</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib90"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhuo et al<span class="ltx_text" id="bib.bib90.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al<span class="ltx_text" id="bib.bib90.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib90.4.1">arXiv preprint arXiv:2406.15877</em> (2024). 
</span> <span class="ltx_bibblock"> </span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Tue Mar 4 03:47:27 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>
