From Code to Courtroom: LLMs as the New Software Judges

Junda He (jundahe@smu.edu.sg); Jieke Shi, Singapore Management University, Singapore (jiekeshi@smu.edu.sg); Terry Yue Zhuo, Monash University & CSIRO's Data61, Australia (terry.zhuo@monash.edu); Christoph Treude, Singapore Management University, Singapore (ctreude@smu.edu.sg); Jiamou Sun, CSIRO's Data61, Australia (frank.sun@anu.edu.au); Zhenchang Xing, CSIRO's Data61 & Australian National University, Australia (zhenchang.xing@anu.edu.au); Xiaoning Du, Monash University, Australia (xiaoning.du@monash.edu); David Lo, Singapore Management University, Singapore (davidlo@smu.edu.sg)

20 February 2025

Abstract.
Recently, Large Language Models (LLMs) have been increasingly used to automate software engineering (SE) tasks such as code generation and summarization. However, evaluating the quality of LLM-generated software artifacts remains challenging. Human evaluation, while effective, is costly and time-consuming. Traditional automated metrics such as BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged. Given that LLMs possess strong coding abilities and reasoning skills, they hold promise as cost-effective and scalable surrogates for human evaluators. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed.

This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.

Keywords: Large Language Models, Software Engineering, LLM-as-a-Judge, Research Roadmap

1. Introduction
With significant advances in Artificial Intelligence (AI), particularly in Deep Learning (DL) (LeCun et al., 2015; Goodfellow et al., 2016) and Large Language Models (LLMs) (OpenAI, 2023b; DeepSeek-AI et al., [n. d.]; Anthropic, 2023), various fields have increasingly leveraged LLMs to develop automated solutions, notably within Software Engineering (SE) (Hou et al., 2024; Gao et al., 2025; Fan et al., 2023; Wang et al., 2024; Shi et al., 2024a). Demonstrating exceptional performance on a spectrum of longstanding challenges, ranging from code generation (Zhuo et al., 2024; Liu et al., 2024c) and summarization (Ahmed et al., 2024b; Sun et al., 2024) to program translation (Yuan et al., 2024; Pan et al., 2023) and repair (Jiang et al., 2023; Jin et al., 2023), LLMs have powered a series of new applications and tools that streamline the software development process, such as GitHub Copilot (GitHub, 2025) and the Cursor Code Editor (Anysphere, 2025), among others.

However, these useful innovations have also brought new challenges to SE research, among which a key question is:

How can we comprehensively and scalably evaluate the quality of LLM-generated software artifacts?
Assessing these LLM-generated software artifacts through experienced developers, i.e., human evaluation, is an ideal approach, as software quality should be judged by its effectiveness in real-world scenarios (Tian, 2005). However, rigorous human evaluation poses significant challenges. A survey of SE researchers found that 84% agree that human evaluation is problematic, primarily due to time constraints, the need for specialized practical knowledge, and the high cost involved (Buse et al., 2011). Additionally, human evaluators may experience fatigue and reduced focus, further compromising the quality of the evaluation and delaying the evaluation process (Kumar et al., 2024).

Therefore, researchers have increasingly adopted automated metrics for evaluating LLM-generated software artifacts (Hu et al., 2022; Papineni et al., 2002; Lin, 2004), given their relatively low cost and scalability. However, traditional automated metrics come with their own limitations. For example, Pass@$k$, a widely used metric for measuring code correctness, involves executing the first $k$ generated code snippets against unit tests. Although useful, it still demands significant human effort to design comprehensive test suites and manually configure execution environments. Similarly, the effectiveness of text similarity-based metrics such as BLEU (Papineni et al., 2002) and CodeBLEU (Ren et al., 2020), along with embedding-based metrics such as BERTScore (Zhang et al., 2019) and CodeBERTScore (Zhou et al., 2023), relies heavily on high-quality references that are typically annotated by human experts. Moreover, these methods fail to capture nuanced aspects of software artifacts, such as naturalness, usefulness, and adherence to best practices, in the way humans would, and they often misalign with human judgment, as evidenced in several studies (Hu et al., 2022; Kumar et al., 2024; Roy et al., 2021; Evtikhiev et al., 2023). Consequently, achieving comprehensive evaluation while ensuring scalable automation remains a persistent challenge in assessing the quality of LLM-generated software artifacts.
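To make the Pass@$k$ discussion concrete, the snippet below is a minimal sketch (not taken from this paper) of the unbiased Pass@$k$ estimator popularized by Chen et al. (2021): given $n$ sampled solutions per task, of which $c$ pass all unit tests, Pass@$k$ is estimated as $1 - \binom{n-c}{k}/\binom{n}{k}$. The sample counts used here are purely illustrative, and the estimator still presupposes the hand-written test suites and execution environments discussed above.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples is correct."""
    if n - c < k:  # fewer than k incorrect samples, so any draw of k contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples for one task, 37 of which pass all unit tests.
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88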
Recently, the remarkable success of LLMs has inspired the emergence of the LLM-as-a-Judge (LLM-J) paradigm (Zheng et al., 2024), which extends LLMs' role from content generation to comprehensive content evaluation, positioning them as scalable and cost-effective surrogates for human evaluators. This paradigm is driven by several key attributes of LLMs. First, numerous studies have shown that LLMs exhibit both impressive coding abilities (Liu et al., 2024b; Lozhkov et al., 2024; Li et al., 2023) and human-like reasoning skills (Patil, 2025; Ivanova, 2025; Guo et al., 2025). Additionally, since LLMs are often trained through reinforcement learning from human feedback (RLHF) (Bai et al., 2022), their outputs closely align with human expert judgments (GitHub, 2025; Chen et al., 2021). Moreover, unlike human evaluators, LLMs do not experience fatigue or reduced efficiency from prolonged work, allowing them to maintain consistent performance over extended periods. These attributes collectively make LLMs well suited to content evaluation tasks.

In this paper, we argue that LLM-as-a-Judge systems offer a promising solution to the limitations of both costly human evaluation and traditional automated metrics in SE. Our research community is beginning to recognize the potential of LLM-as-a-Judge, as evidenced by a few empirical studies and emerging methodologies on this topic (Wang et al., 2025; Wu et al., 2024; Ahmed et al., 2024a). Nevertheless, these studies represent only initial progress, and the field remains in its early stages. Many breakthroughs are still needed, and the overall landscape is far from fully mapped, especially as we look to 2030 and beyond. This motivates the present paper, which seeks to redirect the research community's focus by presenting a vision for the future of LLM-as-a-Judge and a roadmap that inspires further progress.

In the following sections, this paper first provides a formal definition of the LLM-as-a-Judge concept (Section 2). We then review key historical milestones and recent advances in LLM-as-a-Judge for software engineering (Section 3). Looking beyond 2030, we envision LLM-as-a-Judge systems as reliable, robust, and scalable human surrogates capable of evaluating a wide range of software artifacts while providing consistent, multi-faceted assessments. To validate this vision, we identify key limitations and research gaps, outlining a roadmap with specific research directions and potential solutions for the SE research community to pursue (Section 4). In summary, our study makes the following key contributions:

• We conduct a comprehensive review of 16 primary studies on LLM-as-a-Judge in software engineering.
• We analyze the limitations of existing research on LLM-as-a-Judge in SE, identifying major challenges and research gaps.
• We present a forward-looking vision for LLM-as-a-Judge in SE by 2030 and beyond, proposing a detailed research roadmap with key opportunities for future exploration.

2. Definition of LLM-as-a-Judge

As a recently emerging topic, LLM-as-a-Judge has been given various definitions that are not yet unified. The definition used in this paper is inspired by the one proposed by Li et al. (Li et al., 2024a).
Informally speaking, LLM-as-a-Judge utilizes LLMs as evaluative tools to assess software artifact based on predefined evaluation criteria. Formally speaking, LLM-as-a-Judge is defined as follows:</p> </div> <div class="ltx_para" id="S2.p2"> <table class="ltx_equation ltx_eqn_table" id="S2.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_left" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_left">(1)</span></td> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="E(\mathcal{T},\mathcal{C},\mathcal{X},\mathcal{R})\rightarrow(\mathcal{Y},% \mathcal{E},\mathcal{F})" class="ltx_Math" display="block" id="S2.E1.m1.7"><semantics id="S2.E1.m1.7a"><mrow id="S2.E1.m1.7.8" xref="S2.E1.m1.7.8.cmml"><mrow id="S2.E1.m1.7.8.2" xref="S2.E1.m1.7.8.2.cmml"><mi id="S2.E1.m1.7.8.2.2" xref="S2.E1.m1.7.8.2.2.cmml">E</mi><mo id="S2.E1.m1.7.8.2.1" xref="S2.E1.m1.7.8.2.1.cmml">⁢</mo><mrow id="S2.E1.m1.7.8.2.3.2" xref="S2.E1.m1.7.8.2.3.1.cmml"><mo id="S2.E1.m1.7.8.2.3.2.1" stretchy="false" xref="S2.E1.m1.7.8.2.3.1.cmml">(</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.1.1" xref="S2.E1.m1.1.1.cmml">𝒯</mi><mo id="S2.E1.m1.7.8.2.3.2.2" xref="S2.E1.m1.7.8.2.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.2.2" xref="S2.E1.m1.2.2.cmml">𝒞</mi><mo id="S2.E1.m1.7.8.2.3.2.3" xref="S2.E1.m1.7.8.2.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.3.3" xref="S2.E1.m1.3.3.cmml">𝒳</mi><mo id="S2.E1.m1.7.8.2.3.2.4" xref="S2.E1.m1.7.8.2.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.4.4" xref="S2.E1.m1.4.4.cmml">ℛ</mi><mo id="S2.E1.m1.7.8.2.3.2.5" stretchy="false" xref="S2.E1.m1.7.8.2.3.1.cmml">)</mo></mrow></mrow><mo id="S2.E1.m1.7.8.1" stretchy="false" xref="S2.E1.m1.7.8.1.cmml">→</mo><mrow id="S2.E1.m1.7.8.3.2" xref="S2.E1.m1.7.8.3.1.cmml"><mo id="S2.E1.m1.7.8.3.2.1" stretchy="false" xref="S2.E1.m1.7.8.3.1.cmml">(</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.5.5" xref="S2.E1.m1.5.5.cmml">𝒴</mi><mo id="S2.E1.m1.7.8.3.2.2" xref="S2.E1.m1.7.8.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.6.6" xref="S2.E1.m1.6.6.cmml">ℰ</mi><mo id="S2.E1.m1.7.8.3.2.3" xref="S2.E1.m1.7.8.3.1.cmml">,</mo><mi class="ltx_font_mathcaligraphic" id="S2.E1.m1.7.7" xref="S2.E1.m1.7.7.cmml">ℱ</mi><mo id="S2.E1.m1.7.8.3.2.4" stretchy="false" xref="S2.E1.m1.7.8.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E1.m1.7b"><apply id="S2.E1.m1.7.8.cmml" xref="S2.E1.m1.7.8"><ci id="S2.E1.m1.7.8.1.cmml" xref="S2.E1.m1.7.8.1">→</ci><apply id="S2.E1.m1.7.8.2.cmml" xref="S2.E1.m1.7.8.2"><times id="S2.E1.m1.7.8.2.1.cmml" xref="S2.E1.m1.7.8.2.1"></times><ci id="S2.E1.m1.7.8.2.2.cmml" xref="S2.E1.m1.7.8.2.2">𝐸</ci><vector id="S2.E1.m1.7.8.2.3.1.cmml" xref="S2.E1.m1.7.8.2.3.2"><ci id="S2.E1.m1.1.1.cmml" xref="S2.E1.m1.1.1">𝒯</ci><ci id="S2.E1.m1.2.2.cmml" xref="S2.E1.m1.2.2">𝒞</ci><ci id="S2.E1.m1.3.3.cmml" xref="S2.E1.m1.3.3">𝒳</ci><ci id="S2.E1.m1.4.4.cmml" xref="S2.E1.m1.4.4">ℛ</ci></vector></apply><vector id="S2.E1.m1.7.8.3.1.cmml" xref="S2.E1.m1.7.8.3.2"><ci id="S2.E1.m1.5.5.cmml" xref="S2.E1.m1.5.5">𝒴</ci><ci id="S2.E1.m1.6.6.cmml" xref="S2.E1.m1.6.6">ℰ</ci><ci id="S2.E1.m1.7.7.cmml" xref="S2.E1.m1.7.7">ℱ</ci></vector></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E1.m1.7c">E(\mathcal{T},\mathcal{C},\mathcal{X},\mathcal{R})\rightarrow(\mathcal{Y},% 
\mathcal{E},\mathcal{F})</annotation><annotation encoding="application/x-llamapun" id="S2.E1.m1.7d">italic_E ( caligraphic_T , caligraphic_C , caligraphic_X , caligraphic_R ) → ( caligraphic_Y , caligraphic_E , caligraphic_F )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> </div> <div class="ltx_para" id="S2.p3"> <p class="ltx_p" id="S2.p3.1">The function <math alttext="E" class="ltx_Math" display="inline" id="S2.p3.1.m1.1"><semantics id="S2.p3.1.m1.1a"><mi id="S2.p3.1.m1.1.1" xref="S2.p3.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.p3.1.m1.1b"><ci id="S2.p3.1.m1.1.1.cmml" xref="S2.p3.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.p3.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.p3.1.m1.1d">italic_E</annotation></semantics></math> is the core evaluation mechanism powered by LLMs. It takes the following key <span class="ltx_text ltx_font_bold" id="S2.p3.1.1">inputs</span>:</p> </div> <div class="ltx_para" id="S2.p4"> <ul class="ltx_itemize" id="S2.I1"> <li class="ltx_item" id="S2.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i1.p1"> <p class="ltx_p" id="S2.I1.i1.p1.1"><math alttext="\mathcal{T}" class="ltx_Math" display="inline" id="S2.I1.i1.p1.1.m1.1"><semantics id="S2.I1.i1.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i1.p1.1.m1.1.1" xref="S2.I1.i1.p1.1.m1.1.1.cmml">𝒯</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.p1.1.m1.1b"><ci id="S2.I1.i1.p1.1.m1.1.1.cmml" xref="S2.I1.i1.p1.1.m1.1.1">𝒯</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.p1.1.m1.1c">\mathcal{T}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.p1.1.m1.1d">caligraphic_T</annotation></semantics></math> – Evaluation Type. 
It mainly covers three types: <span class="ltx_text ltx_font_italic" id="S2.I1.i1.p1.1.1">point-wise</span>, <span class="ltx_text ltx_font_italic" id="S2.I1.i1.p1.1.2">pairwise</span>, and <span class="ltx_text ltx_font_italic" id="S2.I1.i1.p1.1.3">list-wise</span>.</p> <ul class="ltx_itemize" id="S2.I1.i1.I1"> <li class="ltx_item" id="S2.I1.i1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="S2.I1.i1.I1.i1.p1"> <p class="ltx_p" id="S2.I1.i1.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i1.p1.1.1">Point-wise Evaluation:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I1.i1.I1.i1.p1.1.m1.1"><semantics id="S2.I1.i1.I1.i1.p1.1.m1.1a"><mi id="S2.I1.i1.I1.i1.p1.1.m1.1.1" xref="S2.I1.i1.I1.i1.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.I1.i1.p1.1.m1.1b"><ci id="S2.I1.i1.I1.i1.p1.1.m1.1.1.cmml" xref="S2.I1.i1.I1.i1.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.I1.i1.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.I1.i1.p1.1.m1.1d">italic_E</annotation></semantics></math> evaluates a single candidate content independently and often provides a score or categorical classification.</p> </div> </li> <li class="ltx_item" id="S2.I1.i1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i2.1.1.1">–</span></span> <div class="ltx_para" id="S2.I1.i1.I1.i2.p1"> <p class="ltx_p" id="S2.I1.i1.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i2.p1.1.1">Pair-wise Evaluation:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I1.i1.I1.i2.p1.1.m1.1"><semantics id="S2.I1.i1.I1.i2.p1.1.m1.1a"><mi id="S2.I1.i1.I1.i2.p1.1.m1.1.1" xref="S2.I1.i1.I1.i2.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.I1.i2.p1.1.m1.1b"><ci id="S2.I1.i1.I1.i2.p1.1.m1.1.1.cmml" xref="S2.I1.i1.I1.i2.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.I1.i2.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.I1.i2.p1.1.m1.1d">italic_E</annotation></semantics></math> compares two candidates and determines their relative quality.</p> </div> </li> <li class="ltx_item" id="S2.I1.i1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i3.1.1.1">–</span></span> <div class="ltx_para" id="S2.I1.i1.I1.i3.p1"> <p class="ltx_p" id="S2.I1.i1.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I1.i1.I1.i3.p1.1.1">List-wise Evaluation:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I1.i1.I1.i3.p1.1.m1.1"><semantics id="S2.I1.i1.I1.i3.p1.1.m1.1a"><mi id="S2.I1.i1.I1.i3.p1.1.m1.1.1" xref="S2.I1.i1.I1.i3.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i1.I1.i3.p1.1.m1.1b"><ci id="S2.I1.i1.I1.i3.p1.1.m1.1.1.cmml" xref="S2.I1.i1.I1.i3.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i1.I1.i3.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i1.I1.i3.p1.1.m1.1d">italic_E</annotation></semantics></math> ranks multiple (more than 2) candidates.</p> </div> </li> </ul> </div> </li> <li class="ltx_item" id="S2.I1.i2" style="list-style-type:none;"> <span class="ltx_tag 
ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i2.p1"> <p class="ltx_p" id="S2.I1.i2.p1.1"><math alttext="\mathcal{C}" class="ltx_Math" display="inline" id="S2.I1.i2.p1.1.m1.1"><semantics id="S2.I1.i2.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i2.p1.1.m1.1.1" xref="S2.I1.i2.p1.1.m1.1.1.cmml">𝒞</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i2.p1.1.m1.1b"><ci id="S2.I1.i2.p1.1.m1.1.1.cmml" xref="S2.I1.i2.p1.1.m1.1.1">𝒞</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i2.p1.1.m1.1c">\mathcal{C}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i2.p1.1.m1.1d">caligraphic_C</annotation></semantics></math> – Evaluation Criteria. It describes the assessment aspects, e.g., correctness, helpfulness, readability, etc.</p> </div> </li> <li class="ltx_item" id="S2.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i3.p1"> <p class="ltx_p" id="S2.I1.i3.p1.5"><math alttext="\mathcal{X}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.1.m1.1"><semantics id="S2.I1.i3.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i3.p1.1.m1.1.1" xref="S2.I1.i3.p1.1.m1.1.1.cmml">𝒳</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.1.m1.1b"><ci id="S2.I1.i3.p1.1.m1.1.1.cmml" xref="S2.I1.i3.p1.1.m1.1.1">𝒳</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.1.m1.1c">\mathcal{X}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.1.m1.1d">caligraphic_X</annotation></semantics></math> – Evaluation Item. It is the content to be judged. Note that <math alttext="\mathcal{X}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.2.m2.1"><semantics id="S2.I1.i3.p1.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i3.p1.2.m2.1.1" xref="S2.I1.i3.p1.2.m2.1.1.cmml">𝒳</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.2.m2.1b"><ci id="S2.I1.i3.p1.2.m2.1.1.cmml" xref="S2.I1.i3.p1.2.m2.1.1">𝒳</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.2.m2.1c">\mathcal{X}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.2.m2.1d">caligraphic_X</annotation></semantics></math> can be a single candidate or a set of candidates <math alttext="(x_{1},x_{2},...,x_{n})" class="ltx_Math" display="inline" id="S2.I1.i3.p1.3.m3.4"><semantics id="S2.I1.i3.p1.3.m3.4a"><mrow id="S2.I1.i3.p1.3.m3.4.4.3" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml"><mo id="S2.I1.i3.p1.3.m3.4.4.3.4" stretchy="false" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">(</mo><msub id="S2.I1.i3.p1.3.m3.2.2.1.1" xref="S2.I1.i3.p1.3.m3.2.2.1.1.cmml"><mi id="S2.I1.i3.p1.3.m3.2.2.1.1.2" xref="S2.I1.i3.p1.3.m3.2.2.1.1.2.cmml">x</mi><mn id="S2.I1.i3.p1.3.m3.2.2.1.1.3" xref="S2.I1.i3.p1.3.m3.2.2.1.1.3.cmml">1</mn></msub><mo id="S2.I1.i3.p1.3.m3.4.4.3.5" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I1.i3.p1.3.m3.3.3.2.2" xref="S2.I1.i3.p1.3.m3.3.3.2.2.cmml"><mi id="S2.I1.i3.p1.3.m3.3.3.2.2.2" xref="S2.I1.i3.p1.3.m3.3.3.2.2.2.cmml">x</mi><mn id="S2.I1.i3.p1.3.m3.3.3.2.2.3" xref="S2.I1.i3.p1.3.m3.3.3.2.2.3.cmml">2</mn></msub><mo id="S2.I1.i3.p1.3.m3.4.4.3.6" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><mi id="S2.I1.i3.p1.3.m3.1.1" mathvariant="normal" xref="S2.I1.i3.p1.3.m3.1.1.cmml">…</mi><mo id="S2.I1.i3.p1.3.m3.4.4.3.7" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I1.i3.p1.3.m3.4.4.3.3" xref="S2.I1.i3.p1.3.m3.4.4.3.3.cmml"><mi id="S2.I1.i3.p1.3.m3.4.4.3.3.2" xref="S2.I1.i3.p1.3.m3.4.4.3.3.2.cmml">x</mi><mi 
id="S2.I1.i3.p1.3.m3.4.4.3.3.3" xref="S2.I1.i3.p1.3.m3.4.4.3.3.3.cmml">n</mi></msub><mo id="S2.I1.i3.p1.3.m3.4.4.3.8" stretchy="false" xref="S2.I1.i3.p1.3.m3.4.4.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.3.m3.4b"><vector id="S2.I1.i3.p1.3.m3.4.4.4.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3"><apply id="S2.I1.i3.p1.3.m3.2.2.1.1.cmml" xref="S2.I1.i3.p1.3.m3.2.2.1.1"><csymbol cd="ambiguous" id="S2.I1.i3.p1.3.m3.2.2.1.1.1.cmml" xref="S2.I1.i3.p1.3.m3.2.2.1.1">subscript</csymbol><ci id="S2.I1.i3.p1.3.m3.2.2.1.1.2.cmml" xref="S2.I1.i3.p1.3.m3.2.2.1.1.2">𝑥</ci><cn id="S2.I1.i3.p1.3.m3.2.2.1.1.3.cmml" type="integer" xref="S2.I1.i3.p1.3.m3.2.2.1.1.3">1</cn></apply><apply id="S2.I1.i3.p1.3.m3.3.3.2.2.cmml" xref="S2.I1.i3.p1.3.m3.3.3.2.2"><csymbol cd="ambiguous" id="S2.I1.i3.p1.3.m3.3.3.2.2.1.cmml" xref="S2.I1.i3.p1.3.m3.3.3.2.2">subscript</csymbol><ci id="S2.I1.i3.p1.3.m3.3.3.2.2.2.cmml" xref="S2.I1.i3.p1.3.m3.3.3.2.2.2">𝑥</ci><cn id="S2.I1.i3.p1.3.m3.3.3.2.2.3.cmml" type="integer" xref="S2.I1.i3.p1.3.m3.3.3.2.2.3">2</cn></apply><ci id="S2.I1.i3.p1.3.m3.1.1.cmml" xref="S2.I1.i3.p1.3.m3.1.1">…</ci><apply id="S2.I1.i3.p1.3.m3.4.4.3.3.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3"><csymbol cd="ambiguous" id="S2.I1.i3.p1.3.m3.4.4.3.3.1.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3">subscript</csymbol><ci id="S2.I1.i3.p1.3.m3.4.4.3.3.2.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3.2">𝑥</ci><ci id="S2.I1.i3.p1.3.m3.4.4.3.3.3.cmml" xref="S2.I1.i3.p1.3.m3.4.4.3.3.3">𝑛</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.3.m3.4c">(x_{1},x_{2},...,x_{n})</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.3.m3.4d">( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )</annotation></semantics></math>, where <math alttext="x_{i}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.4.m4.1"><semantics id="S2.I1.i3.p1.4.m4.1a"><msub id="S2.I1.i3.p1.4.m4.1.1" xref="S2.I1.i3.p1.4.m4.1.1.cmml"><mi id="S2.I1.i3.p1.4.m4.1.1.2" xref="S2.I1.i3.p1.4.m4.1.1.2.cmml">x</mi><mi id="S2.I1.i3.p1.4.m4.1.1.3" xref="S2.I1.i3.p1.4.m4.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.4.m4.1b"><apply id="S2.I1.i3.p1.4.m4.1.1.cmml" xref="S2.I1.i3.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S2.I1.i3.p1.4.m4.1.1.1.cmml" xref="S2.I1.i3.p1.4.m4.1.1">subscript</csymbol><ci id="S2.I1.i3.p1.4.m4.1.1.2.cmml" xref="S2.I1.i3.p1.4.m4.1.1.2">𝑥</ci><ci id="S2.I1.i3.p1.4.m4.1.1.3.cmml" xref="S2.I1.i3.p1.4.m4.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.4.m4.1c">x_{i}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.4.m4.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> is the <math alttext="i^{th}" class="ltx_Math" display="inline" id="S2.I1.i3.p1.5.m5.1"><semantics id="S2.I1.i3.p1.5.m5.1a"><msup id="S2.I1.i3.p1.5.m5.1.1" xref="S2.I1.i3.p1.5.m5.1.1.cmml"><mi id="S2.I1.i3.p1.5.m5.1.1.2" xref="S2.I1.i3.p1.5.m5.1.1.2.cmml">i</mi><mrow id="S2.I1.i3.p1.5.m5.1.1.3" xref="S2.I1.i3.p1.5.m5.1.1.3.cmml"><mi id="S2.I1.i3.p1.5.m5.1.1.3.2" xref="S2.I1.i3.p1.5.m5.1.1.3.2.cmml">t</mi><mo id="S2.I1.i3.p1.5.m5.1.1.3.1" xref="S2.I1.i3.p1.5.m5.1.1.3.1.cmml">⁢</mo><mi id="S2.I1.i3.p1.5.m5.1.1.3.3" xref="S2.I1.i3.p1.5.m5.1.1.3.3.cmml">h</mi></mrow></msup><annotation-xml encoding="MathML-Content" id="S2.I1.i3.p1.5.m5.1b"><apply 
id="S2.I1.i3.p1.5.m5.1.1.cmml" xref="S2.I1.i3.p1.5.m5.1.1"><csymbol cd="ambiguous" id="S2.I1.i3.p1.5.m5.1.1.1.cmml" xref="S2.I1.i3.p1.5.m5.1.1">superscript</csymbol><ci id="S2.I1.i3.p1.5.m5.1.1.2.cmml" xref="S2.I1.i3.p1.5.m5.1.1.2">𝑖</ci><apply id="S2.I1.i3.p1.5.m5.1.1.3.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3"><times id="S2.I1.i3.p1.5.m5.1.1.3.1.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3.1"></times><ci id="S2.I1.i3.p1.5.m5.1.1.3.2.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3.2">𝑡</ci><ci id="S2.I1.i3.p1.5.m5.1.1.3.3.cmml" xref="S2.I1.i3.p1.5.m5.1.1.3.3">ℎ</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i3.p1.5.m5.1c">i^{th}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i3.p1.5.m5.1d">italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT</annotation></semantics></math> candidate to be evaluated.</p> </div> </li> <li class="ltx_item" id="S2.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I1.i4.p1"> <p class="ltx_p" id="S2.I1.i4.p1.1"><math alttext="\mathcal{R}" class="ltx_Math" display="inline" id="S2.I1.i4.p1.1.m1.1"><semantics id="S2.I1.i4.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I1.i4.p1.1.m1.1.1" xref="S2.I1.i4.p1.1.m1.1.1.cmml">ℛ</mi><annotation-xml encoding="MathML-Content" id="S2.I1.i4.p1.1.m1.1b"><ci id="S2.I1.i4.p1.1.m1.1.1.cmml" xref="S2.I1.i4.p1.1.m1.1.1">ℛ</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I1.i4.p1.1.m1.1c">\mathcal{R}</annotation><annotation encoding="application/x-llamapun" id="S2.I1.i4.p1.1.m1.1d">caligraphic_R</annotation></semantics></math> – Optional reference information. For instance, a code implementation that successfully fulfills the requirements of a coding task.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S2.p5"> <p class="ltx_p" id="S2.p5.1">Given these inputs, <math alttext="E" class="ltx_Math" display="inline" id="S2.p5.1.m1.1"><semantics id="S2.p5.1.m1.1a"><mi id="S2.p5.1.m1.1.1" xref="S2.p5.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.p5.1.m1.1b"><ci id="S2.p5.1.m1.1.1.cmml" xref="S2.p5.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.p5.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.p5.1.m1.1d">italic_E</annotation></semantics></math> can produce three primary <span class="ltx_text ltx_font_bold" id="S2.p5.1.1">outputs</span>:</p> </div> <div class="ltx_para" id="S2.p6"> <ul class="ltx_itemize" id="S2.I2"> <li class="ltx_item" id="S2.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I2.i1.p1"> <p class="ltx_p" id="S2.I2.i1.p1.2"><math alttext="\mathcal{Y}" class="ltx_Math" display="inline" id="S2.I2.i1.p1.1.m1.1"><semantics id="S2.I2.i1.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i1.p1.1.m1.1.1" xref="S2.I2.i1.p1.1.m1.1.1.cmml">𝒴</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i1.p1.1.m1.1b"><ci id="S2.I2.i1.p1.1.m1.1.1.cmml" xref="S2.I2.i1.p1.1.m1.1.1">𝒴</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.p1.1.m1.1c">\mathcal{Y}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.p1.1.m1.1d">caligraphic_Y</annotation></semantics></math> – Evaluation Result. 
<math alttext="\mathcal{Y}" class="ltx_Math" display="inline" id="S2.I2.i1.p1.2.m2.1"><semantics id="S2.I2.i1.p1.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i1.p1.2.m2.1.1" xref="S2.I2.i1.p1.2.m2.1.1.cmml">𝒴</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i1.p1.2.m2.1b"><ci id="S2.I2.i1.p1.2.m2.1.1.cmml" xref="S2.I2.i1.p1.2.m2.1.1">𝒴</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.p1.2.m2.1c">\mathcal{Y}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.p1.2.m2.1d">caligraphic_Y</annotation></semantics></math> can take one of the following forms:</p> <ul class="ltx_itemize" id="S2.I2.i1.I1"> <li class="ltx_item" id="S2.I2.i1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="S2.I2.i1.I1.i1.p1"> <p class="ltx_p" id="S2.I2.i1.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i1.p1.1.1">Graded Evaluation:</span> Each candidate is assigned a score, either numerical (discrete or continuous) or categorical.</p> </div> </li> <li class="ltx_item" id="S2.I2.i1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i2.1.1.1">–</span></span> <div class="ltx_para" id="S2.I2.i1.I1.i2.p1"> <p class="ltx_p" id="S2.I2.i1.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i2.p1.1.1">Rank Ordering:</span> Candidates <math alttext="(x_{1},x_{2},...,x_{n})" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i2.p1.1.m1.4"><semantics id="S2.I2.i1.I1.i2.p1.1.m1.4a"><mrow id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml"><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.4" stretchy="false" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">(</mo><msub id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.cmml"><mi id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2.cmml">x</mi><mn id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3.cmml">1</mn></msub><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.5" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.cmml"><mi id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2.cmml">x</mi><mn id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3.cmml">2</mn></msub><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.6" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">,</mo><mi id="S2.I2.i1.I1.i2.p1.1.m1.1.1" mathvariant="normal" xref="S2.I2.i1.I1.i2.p1.1.m1.1.1.cmml">…</mi><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.7" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.cmml"><mi id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2.cmml">x</mi><mi id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3.cmml">n</mi></msub><mo id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.8" stretchy="false" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i2.p1.1.m1.4b"><vector id="S2.I2.i1.I1.i2.p1.1.m1.4.4.4.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3"><apply id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1">subscript</csymbol><ci 
id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.2">𝑥</ci><cn id="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3.cmml" type="integer" xref="S2.I2.i1.I1.i2.p1.1.m1.2.2.1.1.3">1</cn></apply><apply id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2">subscript</csymbol><ci id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.2">𝑥</ci><cn id="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3.cmml" type="integer" xref="S2.I2.i1.I1.i2.p1.1.m1.3.3.2.2.3">2</cn></apply><ci id="S2.I2.i1.I1.i2.p1.1.m1.1.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.1.1">…</ci><apply id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.1.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3">subscript</csymbol><ci id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.2">𝑥</ci><ci id="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3.cmml" xref="S2.I2.i1.I1.i2.p1.1.m1.4.4.3.3.3">𝑛</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i2.p1.1.m1.4c">(x_{1},x_{2},...,x_{n})</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.I1.i2.p1.1.m1.4d">( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )</annotation></semantics></math> are ranked based on their assessed quality.</p> </div> </li> <li class="ltx_item" id="S2.I2.i1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i3.1.1.1">–</span></span> <div class="ltx_para" id="S2.I2.i1.I1.i3.p1"> <p class="ltx_p" id="S2.I2.i1.I1.i3.p1.3"><span class="ltx_text ltx_font_bold" id="S2.I2.i1.I1.i3.p1.3.1">Best-Choice Selection:</span> Function <math alttext="E" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i3.p1.1.m1.1"><semantics id="S2.I2.i1.I1.i3.p1.1.m1.1a"><mi id="S2.I2.i1.I1.i3.p1.1.m1.1.1" xref="S2.I2.i1.I1.i3.p1.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i3.p1.1.m1.1b"><ci id="S2.I2.i1.I1.i3.p1.1.m1.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i3.p1.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.I1.i3.p1.1.m1.1d">italic_E</annotation></semantics></math> selects the most suitable candidate (<math alttext="x_{i}" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i3.p1.2.m2.1"><semantics id="S2.I2.i1.I1.i3.p1.2.m2.1a"><msub id="S2.I2.i1.I1.i3.p1.2.m2.1.1" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.cmml"><mi id="S2.I2.i1.I1.i3.p1.2.m2.1.1.2" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.2.cmml">x</mi><mi id="S2.I2.i1.I1.i3.p1.2.m2.1.1.3" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i3.p1.2.m2.1b"><apply id="S2.I2.i1.I1.i3.p1.2.m2.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.2.m2.1.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.2.m2.1.1.2.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.2">𝑥</ci><ci id="S2.I2.i1.I1.i3.p1.2.m2.1.1.3.cmml" xref="S2.I2.i1.I1.i3.p1.2.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i3.p1.2.m2.1c">x_{i}</annotation><annotation encoding="application/x-llamapun" 
id="S2.I2.i1.I1.i3.p1.2.m2.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math>) from the given set <math alttext="(x_{1},x_{2},...,x_{n})" class="ltx_Math" display="inline" id="S2.I2.i1.I1.i3.p1.3.m3.4"><semantics id="S2.I2.i1.I1.i3.p1.3.m3.4a"><mrow id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml"><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.4" stretchy="false" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">(</mo><msub id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.cmml"><mi id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2.cmml">x</mi><mn id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3.cmml">1</mn></msub><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.5" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.cmml"><mi id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2.cmml">x</mi><mn id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3.cmml">2</mn></msub><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.6" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><mi id="S2.I2.i1.I1.i3.p1.3.m3.1.1" mathvariant="normal" xref="S2.I2.i1.I1.i3.p1.3.m3.1.1.cmml">…</mi><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.7" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.cmml"><mi id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2.cmml">x</mi><mi id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3.cmml">n</mi></msub><mo id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.8" stretchy="false" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.I2.i1.I1.i3.p1.3.m3.4b"><vector id="S2.I2.i1.I1.i3.p1.3.m3.4.4.4.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3"><apply id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.2">𝑥</ci><cn id="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3.cmml" type="integer" xref="S2.I2.i1.I1.i3.p1.3.m3.2.2.1.1.3">1</cn></apply><apply id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.2">𝑥</ci><cn id="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3.cmml" type="integer" xref="S2.I2.i1.I1.i3.p1.3.m3.3.3.2.2.3">2</cn></apply><ci id="S2.I2.i1.I1.i3.p1.3.m3.1.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.1.1">…</ci><apply id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3"><csymbol cd="ambiguous" id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.1.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3">subscript</csymbol><ci id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.2">𝑥</ci><ci id="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3.cmml" xref="S2.I2.i1.I1.i3.p1.3.m3.4.4.3.3.3">𝑛</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i1.I1.i3.p1.3.m3.4c">(x_{1},x_{2},...,x_{n})</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i1.I1.i3.p1.3.m3.4d">( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 
end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )</annotation></semantics></math>.</p> </div> </li> </ul> </div> </li> <li class="ltx_item" id="S2.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I2.i2.p1"> <p class="ltx_p" id="S2.I2.i2.p1.1"><math alttext="\mathcal{E}" class="ltx_Math" display="inline" id="S2.I2.i2.p1.1.m1.1"><semantics id="S2.I2.i2.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i2.p1.1.m1.1.1" xref="S2.I2.i2.p1.1.m1.1.1.cmml">ℰ</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i2.p1.1.m1.1b"><ci id="S2.I2.i2.p1.1.m1.1.1.cmml" xref="S2.I2.i2.p1.1.m1.1.1">ℰ</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i2.p1.1.m1.1c">\mathcal{E}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i2.p1.1.m1.1d">caligraphic_E</annotation></semantics></math> – Evaluation Explanation. Optionally provides reasoning and justification for the evaluation.</p> </div> </li> <li class="ltx_item" id="S2.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S2.I2.i3.p1"> <p class="ltx_p" id="S2.I2.i3.p1.2"><math alttext="\mathcal{F}" class="ltx_Math" display="inline" id="S2.I2.i3.p1.1.m1.1"><semantics id="S2.I2.i3.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i3.p1.1.m1.1.1" xref="S2.I2.i3.p1.1.m1.1.1.cmml">ℱ</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i3.p1.1.m1.1b"><ci id="S2.I2.i3.p1.1.m1.1.1.cmml" xref="S2.I2.i3.p1.1.m1.1.1">ℱ</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i3.p1.1.m1.1c">\mathcal{F}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i3.p1.1.m1.1d">caligraphic_F</annotation></semantics></math> – Feedback. Optionally offers constructive suggestions for improving the evaluated item <math alttext="\mathcal{X}" class="ltx_Math" display="inline" id="S2.I2.i3.p1.2.m2.1"><semantics id="S2.I2.i3.p1.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S2.I2.i3.p1.2.m2.1.1" xref="S2.I2.i3.p1.2.m2.1.1.cmml">𝒳</mi><annotation-xml encoding="MathML-Content" id="S2.I2.i3.p1.2.m2.1b"><ci id="S2.I2.i3.p1.2.m2.1.1.cmml" xref="S2.I2.i3.p1.2.m2.1.1">𝒳</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.I2.i3.p1.2.m2.1c">\mathcal{X}</annotation><annotation encoding="application/x-llamapun" id="S2.I2.i3.p1.2.m2.1d">caligraphic_X</annotation></semantics></math>.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S2.p7"> <p class="ltx_p" id="S2.p7.1"><span class="ltx_text ltx_font_bold" id="S2.p7.1.1">Difference to previous work. </span> Note that this definition of LLM-as-a-Judge is stricter than the definition proposed by Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite>. Wang et al. broadly define LLM-as-a-Judge as any method that leverages any features of LLM to evaluate content quality, which includes embedding-based approaches, such as BERTScore <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib83" title="">2019</a>)</cite> and CodeBERTScore <cite class="ltx_cite ltx_citemacro_citep">(Zhou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib88" title="">2023</a>)</cite>. 
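To make these components concrete, the sketch below shows one possible shape of an LLM-as-a-Judge call that returns a score together with the optional explanation $\mathcal{E}$ and feedback $\mathcal{F}$. It is a minimal illustration under stated assumptions: the call_llm stub, the prompt wording, and the JSON fields are invented here, not an interface prescribed by this paper; the pairwise and best-choice protocols differ only in passing several candidates instead of one.

```python
import json

def call_llm(prompt: str) -> str:
    """Stub for an LLM chat call; replace with your provider's client."""
    raise NotImplementedError

def judge(item: str, criteria: list[str], context: str = "") -> dict:
    """Direct assessment: score a single evaluated item X against criteria C."""
    prompt = (
        "You are an expert software engineering evaluator.\n"
        f"Context: {context}\n"
        f"Item to evaluate:\n{item}\n"
        f"Criteria: {', '.join(criteria)}\n"
        "Return JSON with keys: score (1-5), explanation, feedback."
    )
    verdict = json.loads(call_llm(prompt))
    return {
        "score": verdict["score"],                        # evaluation outcome
        "explanation": verdict.get("explanation", ""),    # optional E
        "feedback": verdict.get("feedback", ""),          # optional F
    }
```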
Difference to previous work. Note that this definition of LLM-as-a-Judge is stricter than the definition proposed by Wang et al. (Wang et al., 2025). Wang et al. broadly define LLM-as-a-Judge as any method that leverages any feature of an LLM to evaluate content quality, which includes embedding-based approaches such as BERTScore (Zhang et al., 2019) and CodeBERTScore (Zhou et al., 2023). Embedding-based methods evaluate content similarity by converting both the evaluated content and the references into numerical representations (LLM embeddings). However, we argue that their definition is better characterized as LLM-based evaluation rather than LLM-as-a-Judge. The key distinction is that we envision LLM-as-a-Judge as a true surrogate for human evaluation. In our view, an LLM-as-a-Judge system should be capable of independently assessing content without requiring a strict reference. Moreover, it should be able to perform a wide spectrum of nuanced assessments (e.g., usefulness, adequacy, and conciseness) and ideally provide justifications or constructive feedback. In contrast, embedding-based methods rely solely on numerical representations to assess content similarity and inherently lack these capabilities. Consequently, we do not include them in our definition and instead regard them as part of the broader LLM-based evaluation covered by previous work.

3. Literature Review

In this section, we provide an overview of studies that evaluate the effectiveness of LLM-as-a-Judge in SE tasks, with Table 1 providing a structured summary that maps each study to its corresponding SE task. Following this, we present a detailed discussion of the evaluation objectives of these studies.

Table 1. Literature Review on the Applications of LLM-as-a-Judge in Software Engineering

Task | Reference
Code Generation | (Wang et al., 2025; Tong and Zhang, 2024; Ahmed et al., 2024a; Patel et al., 2024; Tan et al., 2024; Zhao et al., 2025; Zhuo, 2023; Sollenberger et al., 2024; Weyssow et al., 2024; Xu et al., 2024)
Code Summarization | (Wang et al., 2025; Ahmed et al., 2024a; Farchi et al., 2024; Wu et al., 2024; Weyssow et al., 2024)
Bug Report Summarization | (Kumar et al., 2024)
Code Translation | (Wang et al., 2025)
Question Answering | (Zheng et al., 2024)
Requirements Causality Extraction | (Ahmed et al., 2024a)
Code Patches Generation | (Yadavally et al., 2025; Ahmed et al., 2024a; Li et al., 2024c)
ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite></td> </tr> <tr class="ltx_tr" id="S3.T1.1.6.5"> <td class="ltx_td ltx_align_left" id="S3.T1.1.6.5.1">Question Answering</td> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.6.5.2"><cite class="ltx_cite ltx_citemacro_citep">(Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib86" title="">2024</a>)</cite></td> </tr> <tr class="ltx_tr" id="S3.T1.1.7.6"> <td class="ltx_td ltx_align_left" id="S3.T1.1.7.6.1">Requirements Causality Extraction</td> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.7.6.2"><cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite></td> </tr> <tr class="ltx_tr" id="S3.T1.1.8.7"> <td class="ltx_td ltx_align_left ltx_border_bb" id="S3.T1.1.8.7.1">Code Patches Generation</td> <td class="ltx_td ltx_nopad_r ltx_align_left ltx_border_bb" id="S3.T1.1.8.7.2"><cite class="ltx_cite ltx_citemacro_citep">(Yadavally et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib72" title="">2025</a>; Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>; Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib36" title="">2024c</a>)</cite></td> </tr> </tbody> </table> </figure> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1. </span>Code Generation</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1">Effectively and automatically evaluating the quality of generated code is one of the most critical problems in software engineering. Before the introduction of LLM-as-a-Judge, automated code evaluation primarily relied on test-based method, i.e., executing the code based on pre-defined test cases <cite class="ltx_cite ltx_citemacro_citep">(Zhuo et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib90" title="">2024</a>; Chen et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib9" title="">2021</a>; Hendrycks et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib24" title="">2021</a>)</cite>. However, passing all the test cases does not necessarily mean the quality of the code is good, as the test cases may not comprehensively cover all the edge cases <cite class="ltx_cite ltx_citemacro_citep">(Dou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib12" title="">2024</a>)</cite>. Constructing test cases can also be particularly challenging for certain tasks <cite class="ltx_cite ltx_citemacro_citep">(Tong and Zhang, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib64" title="">2024</a>)</cite>, such as AI model training, web scraping, etc. 
When test cases are unavailable, many studies rely on reference-based metrics <cite class="ltx_cite ltx_citemacro_citep">(Dehaerne et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib11" title="">2022</a>; Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib87" title="">2023</a>)</cite>, such as CodeBLEU <cite class="ltx_cite ltx_citemacro_citep">(Ren et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib52" title="">2020</a>)</cite> and CodeBERTScore <cite class="ltx_cite ltx_citemacro_citep">(Zhou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib88" title="">2023</a>)</cite>. However, the quality of the reference code restricts the effectiveness of reference-based metrics. Prior research <cite class="ltx_cite ltx_citemacro_citep">(Evtikhiev et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib13" title="">2023</a>)</cite> highlighted that reference-based metrics frequently misalign human judgments in code generation.</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <p class="ltx_p" id="S3.SS1.p2.1">Consequently, a substantial body of research has explored the effectiveness of LLMs in evaluating code generation <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>; Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib86" title="">2024</a>; Tong and Zhang, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib64" title="">2024</a>; Patel et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib48" title="">2024</a>; Tan et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib62" title="">2024</a>; Zhao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib85" title="">2025</a>; Zhuo, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib89" title="">2023</a>)</cite>. We summarize three main characteristics of LLM-as-a-Judge that are different from the test-based and reference-based metrics:</p> </div> <div class="ltx_para" id="S3.SS1.p3"> <ul class="ltx_itemize" id="S3.I1"> <li class="ltx_item" id="S3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i1.p1.1.1">Execution-free:</span> It does not require composary execution of the code.</p> </div> </li> <li class="ltx_item" id="S3.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i2.p1"> <p class="ltx_p" id="S3.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i2.p1.1.1">Reference-free:</span> Unlike reference-based metrics, LLM-as-a-Judge does not require a reference code. Even though we can still optionally provide the reference code. However, as demonstrated by Zhuo et al. 
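For contrast with the test-based evaluation described above, the snippet below sketches the conventional approach of executing a candidate solution against pre-defined test cases in a separate interpreter process. The task, candidate code, and tests are invented for illustration, and real harnesses add much stronger isolation; even when such a harness reports success, qualities such as readability and helpfulness remain unmeasured, which is the gap LLM-as-a-Judge targets.

```python
import subprocess
import sys
import tempfile
import textwrap

CANDIDATE = textwrap.dedent("""
    def add(a, b):
        return a + b
""")

TESTS = textwrap.dedent("""
    assert add(1, 2) == 3
    assert add(-1, 1) == 0
    print("ALL_PASSED")
""")

def run_tests(candidate: str, tests: str, timeout: float = 5.0) -> bool:
    """Execute candidate code plus assert-based tests in a fresh interpreter."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return "ALL_PASSED" in proc.stdout
    except subprocess.TimeoutExpired:
        return False

print(run_tests(CANDIDATE, TESTS))  # True, yet says nothing about readability or style
```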
<cite class="ltx_cite ltx_citemacro_citep">(Zhuo, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib89" title="">2023</a>)</cite>, providing the reference does not generally improve the code evaluation performance.</p> </div> </li> <li class="ltx_item" id="S3.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i3.p1"> <p class="ltx_p" id="S3.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i3.p1.1.1">Multi-Facet Evaluation:</span> LLM-as-a-Judge assesses intrinsic code qualities that traditionally required human judgment, such as readability and usefulness.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S3.SS1.p4"> <p class="ltx_p" id="S3.SS1.p4.1">Among existing studies, Zhuo et al. <cite class="ltx_cite ltx_citemacro_citep">(Zhuo, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib89" title="">2023</a>)</cite> and Weyssow et al. <cite class="ltx_cite ltx_citemacro_citep">(Weyssow et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib67" title="">2024</a>)</cite> utilize GPT-3.5 <cite class="ltx_cite ltx_citemacro_citep">(OpenAI, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib44" title="">2023a</a>)</cite>, and Xu et al. <cite class="ltx_cite ltx_citemacro_citep">(Xu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib71" title="">2024</a>)</cite> use GPT-4 to annotate code snippets. CodeJudge <cite class="ltx_cite ltx_citemacro_citep">(Tong and Zhang, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib64" title="">2024</a>)</cite> first leverages a detailed taxonomy of common programming errors to guide the LLM in analyzing the generated code. Then, CodeJudge summarizes the analysis report to produce its final judgment. Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> experiment with a series of general-purpose LLM-as-a-Judge methods from the natural language processing (NLP) domain in code generation evaluation, including BatchEval <cite class="ltx_cite ltx_citemacro_citep">(Yuan et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib80" title="">2023</a>)</cite>, GPTScore <cite class="ltx_cite ltx_citemacro_citep">(Fu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib16" title="">2023</a>)</cite>, and G-Eval <cite class="ltx_cite ltx_citemacro_citep">(Liu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib42" title="">2023</a>)</cite>. Ahmed et al. <cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> investigate LLM’s ability to evaluate variable name-value inconsistency and function similarities. Gu et al. <cite class="ltx_cite ltx_citemacro_citep">(Gu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib22" title="">2024</a>)</cite> focus on analyzing LLM’s judgments on counterfeit code samples, i.e., incorrect programs generated by language models that pass weak but non-trivial correctness checks. Zhao et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Zhao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib85" title="">2025</a>)</cite> empirically evaluate the performance of 12 LLMs on code generation evaluation. Patel et al. <cite class="ltx_cite ltx_citemacro_citep">(Patel et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib48" title="">2024</a>)</cite> hypothesis that a combination of multiple evaluators can approximate the optimal evaluation. Thus, they propose the AIME framework, which employs multiple LLMs to evaluate different aspects of the code, including code correctness, readability, runtime performance, etc.</p> </div> <div class="ltx_para" id="S3.SS1.p5"> <p class="ltx_p" id="S3.SS1.p5.1">Further, we summarize the common evaluation criteria used in the literature for code generation. We categorize them into two main criteria: <span class="ltx_text ltx_font_italic" id="S3.SS1.p5.1.1">Code Functionality</span> and <span class="ltx_text ltx_font_italic" id="S3.SS1.p5.1.2">Code Quality</span>.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS1.p6"> <p class="ltx_p" id="S3.SS1.p6.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p6.1.1">Code Functionality:</span> This criterion assesses the overall effectiveness, maintainability, and clarity of the code beyond its functional correctness. It evaluates intrinsic code attributes. Key aspects include:</p> </div> <div class="ltx_para" id="S3.SS1.p7"> <ul class="ltx_itemize" id="S3.I2"> <li class="ltx_item" id="S3.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I2.i1.p1"> <p class="ltx_p" id="S3.I2.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i1.p1.1.1">Execution Stability:</span> LLM-as-a-Judge’s evaluation process does not mandate code execution, meaning that running the code is not a required step.</p> </div> </li> <li class="ltx_item" id="S3.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I2.i2.p1"> <p class="ltx_p" id="S3.I2.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i2.p1.1.1">Functional Correctness:</span> Verifies whether the code produces the expected output according to the task description.</p> </div> </li> <li class="ltx_item" id="S3.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I2.i3.p1"> <p class="ltx_p" id="S3.I2.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i3.p1.1.1">Fault Tolerance &amp; Error Handling:</span> Evaluates how the code manages edge cases, exceptions, invalid inputs, and unexpected failures.</p> </div> </li> </ul> </div> <div class="ltx_para ltx_noindent" id="S3.SS1.p8"> <p class="ltx_p" id="S3.SS1.p8.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p8.1.1">Code Quality:</span> These criteria evaluate intrinsic attributes such as readability and adherence to best practices.</p> <ul class="ltx_itemize" id="S3.I3"> <li class="ltx_item" id="S3.I3.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i1.p1"> <p class="ltx_p" id="S3.I3.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i1.p1.1.1">Complexity &amp; Efficiency:</span> Analyzes computational complexity, execution time, and memory usage.</p> </div> </li> <li class="ltx_item" id="S3.I3.i2" style="list-style-type:none;"> <span class="ltx_tag 
ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i2.p1"> <p class="ltx_p" id="S3.I3.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i2.p1.1.1">Helpfulness:</span> Evaluates whether the code contributes meaningfully to solving the problem. This includes distinguishing partially correct solutions from completely unuseful and irrelevant code.</p> </div> </li> <li class="ltx_item" id="S3.I3.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i3.p1"> <p class="ltx_p" id="S3.I3.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i3.p1.1.1">Readability:</span> Measures how easily the code can be understood, including aspects like fluency, clarity, and conciseness.</p> </div> </li> <li class="ltx_item" id="S3.I3.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i4.p1"> <p class="ltx_p" id="S3.I3.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i4.p1.1.1">Stylistic Consistency:</span> Ensures adherence to established coding standards, formatting guidelines, and best practices.</p> </div> </li> <li class="ltx_item" id="S3.I3.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i5.p1"> <p class="ltx_p" id="S3.I3.i5.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i5.p1.1.1">Reference Similarity:</span> Assesses how closely the generated code aligns with a given reference implementation.</p> </div> </li> <li class="ltx_item" id="S3.I3.i6" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">-</span> <div class="ltx_para" id="S3.I3.i6.p1"> <p class="ltx_p" id="S3.I3.i6.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I3.i6.p1.1.1">Minor Anomalies &amp; Warnings:</span> Identifies non-critical issues such as redundant code or unused variables that do not affect execution.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2. </span>Code Change</h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">Beyond code generation, LLM-as-a-Judge has been explored for evaluating code modifications. Ahmed et al. <cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> use LLM-as-a-Judge to determine whether a code patch effectively resolves a static analysis warning. Li et al. <cite class="ltx_cite ltx_citemacro_citep">(Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib36" title="">2024c</a>)</cite> apply LLMs to assess whether a patch successfully fixes a vulnerability, while Yadavally et al. <cite class="ltx_cite ltx_citemacro_citep">(Yadavally et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib72" title="">2025</a>)</cite> leverage LLMs to predict test outcomes of code patches.</p> </div> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.3. </span>Software Documentation Summarization</h3> <div class="ltx_para" id="S3.SS3.p1"> <p class="ltx_p" id="S3.SS3.p1.1">LLM-based judges have also been applied to assess natural language software documentation. Ahmed et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> and Wu et al. <cite class="ltx_cite ltx_citemacro_citep">(Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite> introduce LLM judges to evaluate code summaries. Wu et al. <cite class="ltx_cite ltx_citemacro_citep">(Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite> introduce CODERPE, a framework that utilizes multiple LLMs and assigns LLMs with different roles to evaluate code summaries from various perspectives. These roles include a code reviewer, the original code author, a code editor, and a system analyst. In addition, Kumar et al. <cite class="ltx_cite ltx_citemacro_citep">(Kumar et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib31" title="">2024</a>)</cite> utilized LLMs to evaluate the quality of bug report titles and summaries. For code summarization, the common evaluation criteria include:</p> <ul class="ltx_itemize" id="S3.I4"> <li class="ltx_item" id="S3.I4.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I4.i1.p1"> <p class="ltx_p" id="S3.I4.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I4.i1.p1.1.1">Language-related:</span> Ensure clear, well-structured sentences with precise and appropriate wording.</p> </div> </li> <li class="ltx_item" id="S3.I4.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I4.i2.p1"> <p class="ltx_p" id="S3.I4.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I4.i2.p1.1.1">Content-related:</span> Summarize the core functionality and logic concisely, providing sufficient detail without including unnecessary information.</p> </div> </li> <li class="ltx_item" id="S3.I4.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I4.i3.p1"> <p class="ltx_p" id="S3.I4.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I4.i3.p1.1.1">Effectiveness-related:</span> Evaluate whether the summary is useful for developers and enhances their understanding of the code.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="S3.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.4. </span>Other SE Tasks</h3> <div class="ltx_para" id="S3.SS4.p1"> <p class="ltx_p" id="S3.SS4.p1.1">LLM-as-a-Judge has also been explored in other software engineering tasks. Zheng et al. <cite class="ltx_cite ltx_citemacro_citep">(Zheng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib86" title="">2024</a>)</cite> apply this approach to assess the quality of programming-related question answering. Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> investigate its effectiveness in assessing code translation quality. In addition, Ahmed et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> leverage LLMs to evaluate causal relationships extracted from natural language requirements.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4. </span>The Road Ahead</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">This section outlines a research roadmap for achieving scalable, reliable, and effective LLM-as-a-Judge systems in software engineering. We begin by outlining the key limitations in current approaches and then propose concrete research opportunities and actionable steps that can significantly broaden the applicability of LLM-as-a-Judge across diverse SE contexts, steering us toward our envisioned future.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1. </span>More Empirical Evaluation and Benchmarking</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p1.1.1">Limitation 1: Lack of High-Quality and Large-Scale Human-Annotated Benchmarks. </span> In Section <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#S3" title="3. Literature Review ‣ From Code to Courtroom: LLMs as the New Software Judges"><span class="ltx_text ltx_ref_tag">3</span></a>, we reviewed numerous studies evaluating the effectiveness of LLM-as-a-Judge in software engineering. Benchmarks such as HumanEval <cite class="ltx_cite ltx_citemacro_citep">(Chen et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib9" title="">2021</a>)</cite>, HumanEval-XL <cite class="ltx_cite ltx_citemacro_citep">(Peng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib50" title="">2024</a>)</cite>, and CoNaLa <cite class="ltx_cite ltx_citemacro_citep">(Yin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib79" title="">2018</a>)</cite> provide useful test cases to evaluate the ability of LLMs in assessing code correctness. However, these benchmarks fall short when assessing more nuanced aspects like helpfulness, readability, and alignment with human judgment. A critical limitation in existing LLM-as-a-Judge of SE is their reliance on small-scale datasets to measure human alignment <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>; Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite>, often comprising only a few hundred samples. For example, Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> conducted experiments on three SE tasks, i.e., Code Translation, Code Generation, and Code Summarization, using a total of only 450 samples. Similarly, Ahmed et al. 
<cite class="ltx_cite ltx_citemacro_citep">(Ahmed et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib2" title="">2024a</a>)</cite> evaluated code summarization using just 420 samples. While these sample sizes may be sufficient for demonstrating statistical significance, larger-scale benchmarks are also essential to ensure generalizability and mitigate threats to external validity.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p2.1.1">Limitation 2: Inconsistent Empirical Findings. </span>Related to these small-scale experiments, another major challenge is the inconsistency in empirical findings. A few works reached varying conclusions due to discrepancies in dataset selection, prompting strategies, and LLM configurations. For instance, Wang et al. <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib66" title="">2025</a>)</cite> found that traditional evaluation metrics (e.g., ROUGE <cite class="ltx_cite ltx_citemacro_citep">(Lin, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib38" title="">2004</a>)</cite>, METEOR <cite class="ltx_cite ltx_citemacro_citep">(Banerjee and Lavie, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib7" title="">2005</a>)</cite>, and ChrF++ <cite class="ltx_cite ltx_citemacro_citep">(Popović, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib51" title="">2017</a>)</cite>) significantly outperformed LLM-as-a-Judge methods when aligning with human judgment for code summarization. In contrast, Wu et al. <cite class="ltx_cite ltx_citemacro_citep">(Wu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib69" title="">2024</a>)</cite> reported that the LLM-as-a-Judge method surpassed conventional metrics in code summarization. These conflicting results highlight the need for a standardized, large-scale empirical study to enable fair and meaningful comparisons across studies.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p3"> <p class="ltx_p" id="S4.SS1.p3.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p3.1.1">Limitation 3: Lack of Empirical Findings in Bias of LLM-as-a-Judge systems in SE.</span> Despite the growing adoption of LLM-as-a-Judge in SE, software artifacts and data may contain biases <cite class="ltx_cite ltx_citemacro_citep">(Widyasari et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib68" title="">2022</a>; Yang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib73" title="">2021</a>)</cite>. LLMs are already known to be susceptible to various biases and fairness issues. 
For example, Position bias <cite class="ltx_cite ltx_citemacro_citep">(Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib37" title="">2024b</a>)</cite>, where LLMs may favor responses based on their positioning within the text when making comparison; Verbosity bias <cite class="ltx_cite ltx_citemacro_citep">(Jiao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib29" title="">2024</a>)</cite>, where LLMs show a tendency for longer, verbose responses, regardless of their qualities; Egocentric Bias <cite class="ltx_cite ltx_citemacro_citep">(Ye et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2503.02246v1#bib.bib77" title="">2024</a>)</cite>, where LLMs may overrate AI-generated solutions compared to human-written code. However, currently, there is a lack of thorough empirical investigation into the bias of LLM-as-a-Judge systems in SE.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p4"> <p class="ltx_p" id="S4.SS1.p4.1"><span class="ltx_text ltx_font_bold ltx_font_italic" id="S4.SS1.p4.1.1">Opportunity: Develop Comprehensive and Diverse Benchmarks.</span> Future research should work on creating large-scale, multi-dimensional benchmarks that capture the complexity of real-world software engineering tasks. Key steps include:</p> <ol class="ltx_enumerate" id="S4.I1"> <li class="ltx_item" id="S4.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">(1)</span> <div class="ltx_para" id="S4.I1.i1.p1"> <p class="ltx_p" id="S4.I1.i1.p1.1"><span class="ltx_text ltx_font_italic" id="S4.I1.i1.p1.1.1">Expert Annotations and Quality Control:</span> Engage a diverse pool of expert programmers to provide high-quality annotations. This annotation process should focus more on the nuanced criteria beyond code correctness, such as readability, helpfulness, and maintainability. In the beginning, define clear annotation guidelines that describe these nuanced dimensions to eliminate ambiguity. During the process, cross-validation and consensus mechanisms should be conducted to ensure the reliability of the annotation.</p> </div> </li> <li class="ltx_item" id="S4.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">(2)</span> <div class="ltx_para" id="S4.I1.i2.p1"> <p class="ltx_p" id="S4.I1.i2.p1.1"><span class="ltx_text ltx_font_italic" id="S4.I1.i2.p1.1.1">Dataset Expansion:</span> Expand existing datasets to include a larger volume of high-quality expert-annotated samples. These benchmarks should encompass SE tasks at varying difficulty levels, various programming topics, and multiple programming languages. Additionally, new benchmarks should be extended to evaluate a broader spectrum of SE artifacts. For instance, expert evaluations could be gathered for software requirements documentation, system design, and API documentation. In the long term, we may also consider updating human preferences in the benchmarks to reflect evolving SE practices.</p> </div> </li> </ol> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p5"> <p class="ltx_p" id="S4.SS1.p5.1"><span class="ltx_text ltx_font_bold ltx_font_italic" id="S4.SS1.p5.1.1">Opportunity: Comprehensive Empirical Evaluation in SE Tasks.</span> Ultimately, these benchmarks will lay a solid foundation for understanding the strengths and weaknesses of LLM-as-a-Judge systems in SE. 
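A simple way to start probing position bias empirically is to query the judge twice with the candidate order swapped and count how often the verdict flips, as in the sketch below. The pairwise prompt and the call_llm stub are illustrative assumptions, not a protocol taken from the cited studies.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def pick_better(task: str, a: str, b: str) -> str:
    """Ask the judge which candidate is better; expects 'A' or 'B'."""
    prompt = (f"Task:\n{task}\n\nCandidate A:\n{a}\n\nCandidate B:\n{b}\n\n"
              "Which candidate is better? Answer with exactly 'A' or 'B'.")
    return call_llm(prompt).strip().upper()[:1]

def position_bias_rate(pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of (task, a, b) pairs whose verdict changes when order is swapped."""
    flips = 0
    for task, a, b in pairs:
        first = pick_better(task, a, b)    # 'A' refers to candidate a
        second = pick_better(task, b, a)   # after the swap, 'A' refers to b
        consistent = (first, second) in {("A", "B"), ("B", "A")}
        flips += 0 if consistent else 1
    return flips / max(len(pairs), 1)
```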
Opportunity: Develop Comprehensive and Diverse Benchmarks. Future research should work on creating large-scale, multi-dimensional benchmarks that capture the complexity of real-world software engineering tasks. Key steps include:

(1) Expert Annotations and Quality Control: Engage a diverse pool of expert programmers to provide high-quality annotations. This annotation process should focus on nuanced criteria beyond code correctness, such as readability, helpfulness, and maintainability. At the outset, clear annotation guidelines that describe these nuanced dimensions should be defined to eliminate ambiguity. During the process, cross-validation and consensus mechanisms should be applied to ensure the reliability of the annotations.

(2) Dataset Expansion: Expand existing datasets to include a larger volume of high-quality expert-annotated samples. These benchmarks should encompass SE tasks at varying difficulty levels, various programming topics, and multiple programming languages. Additionally, new benchmarks should be extended to evaluate a broader spectrum of SE artifacts; for instance, expert evaluations could be gathered for software requirements documentation, system design, and API documentation. In the long term, we may also consider updating the human preferences in the benchmarks to reflect evolving SE practices.

Opportunity: Comprehensive Empirical Evaluation in SE Tasks. Ultimately, these benchmarks will lay a solid foundation for understanding the strengths and weaknesses of LLM-as-a-Judge systems in SE. Large-scale experiments should systematically compare LLM-as-a-Judge methods with traditional metrics, while also investigating the impact of design choices such as prompting strategies and LLM configurations. Further, a thorough investigation into the bias and fairness of LLM-as-a-Judge systems in SE should be conducted.
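Once such human-annotated benchmarks exist, alignment between judge scores and expert ratings can be quantified with standard rank correlations, as in the short sketch below using SciPy's spearmanr and kendalltau; the score arrays are placeholders, not data from any cited study.

```python
from scipy.stats import spearmanr, kendalltau

# Placeholder data: one human rating and one LLM-judge rating per sample.
human_scores = [5, 3, 4, 2, 4, 1, 5, 3]
judge_scores = [4, 3, 5, 2, 3, 2, 5, 4]

rho, rho_p = spearmanr(human_scores, judge_scores)
tau, tau_p = kendalltau(human_scores, judge_scores)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3f}), Kendall tau={tau:.2f} (p={tau_p:.3f})")
```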
4.2. Better Judgment From Internal Factors

Limitation 4: Inadequate SE Domain-Specific Expertise in LLMs. Ideally, human evaluations should be conducted by experienced developers. Like human evaluators, an LLM must exhibit a deep understanding of the task and the relevant domain knowledge to provide an accurate evaluation. Recent research highlights that LLMs have remarkable coding abilities, which builds confidence in their ability to assess tasks like code summarization and generation. However, research also indicates that some complex coding tasks are still challenging for LLMs (Zhuo et al., 2024; Zhao et al., 2024). For instance, Zhao et al. (Zhao et al., 2024) observed that current LLMs cannot generate a complete library from scratch, which also implies that LLMs' ability to evaluate library quality is still limited. Similar limitations extend to other critical areas in software engineering, such as software design, formal verification, and distributed systems debugging. Beyond gaps in SE-specific knowledge, LLMs may also struggle to discern subtle differences among alternative solutions. As Zhao et al. (Zhao et al., 2025) highlighted, an LLM's ability to generate correct code does not guarantee that it can effectively evaluate alternative implementations for the same task.

Opportunity: Enhancing SE Expertise in LLMs. Future research can strengthen the SE domain-specific expertise of LLMs. A key approach is training LLMs on richer and higher-quality SE datasets (Liu et al., 2024a). Enhancing SE expertise in LLMs offers several benefits; for instance, LLMs with stronger formal verification capabilities can more accurately assess code correctness. Further, research can expand LLM-as-a-Judge's applicability to a broader range of SE artifacts, such as Dockerfiles, UML diagrams, and formal specifications.

Opportunity: Embedding Expert Tacit Knowledge. Another critical research direction for enhancing judgment involves capturing the tacit and procedural knowledge that human experts develop over years of experience, i.e., their intuition and unspoken decision-making processes. We can employ methods such as structured interviews (Segal et al., 2006), think-aloud protocols (Zhang and Zhang, 2019), and cognitive task analyses (Schraagen et al., 2000), where experts are asked to articulate their reasoning while conducting real-world evaluations. This approach enables the systematic extraction of procedural knowledge that is typically not documented in software artifacts. Once captured, this knowledge can be integrated into LLM training through techniques like reinforcement learning (Bai et al., 2022) or neuro-symbolic methods (Kwon et al., 2023).

4.3. Better Judgment Through External Factors

Limitation 5: Reliance on Internal Evaluation Mechanisms. Current works tend to depend solely on LLMs' own abilities for evaluation. This shortfall underscores the need for additional mechanisms to assist LLM-as-a-Judge in decision-making.

Opportunity: Integrating SE Tools with LLM-as-a-Judge. Just as humans rely on tools to assist their judgment, LLM-as-a-Judge can incorporate outputs from various analysis tools. Integrating tools such as static analyzers, formal verification frameworks, and model checkers into the evaluation pipeline, or conversely embedding LLM-as-a-Judge into other tools like integrated development environments (IDEs) (Shi et al., 2024a, b, 2023), can add additional layers of validation.
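One lightweight way to realize this tool integration is to attach the findings of an off-the-shelf linter to the judge prompt so that the model grounds its verdict in tool evidence. The sketch below shells out to flake8 as an example analyzer (any static analysis tool could be substituted, and flake8 must be installed); the prompt format and the call_llm stub are assumptions.

```python
import subprocess
import sys
import tempfile

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def static_findings(code: str) -> str:
    """Run flake8 on the snippet and return its diagnostics as plain text."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, "-m", "flake8", path],
                          capture_output=True, text=True)
    return proc.stdout or "No findings."

def tool_augmented_judgment(task: str, code: str) -> str:
    """Ask the judge to weigh both the code and the static-analysis evidence."""
    prompt = (
        f"Task:\n{task}\n\nCandidate code:\n{code}\n\n"
        f"Static analysis findings:\n{static_findings(code)}\n\n"
        "Considering both the code and the tool findings, assess correctness "
        "and quality, and explain your reasoning."
    )
    return call_llm(prompt)
```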
Opportunity: Human-in-the-Loop. For tasks that cannot be reliably automated by LLMs, we need to develop collaborative and interactive methods that include human oversight. Identifying the optimal ratio of LLM evaluators to human experts also remains a challenge. Moreover, enhancing LLMs with the ability to assess their own confidence levels is crucial: when an LLM produces a low-confidence rating, it should automatically flag the case for human review.

4.4. Addressing the Security Issues of Using LLM-as-a-Judge

Limitation 6: Insufficient Research on Adversarial Threats and Defensive Methods in SE. Adversarial attacks (Yang et al., 2022b, a; Yefet et al., 2020) on LLM-as-a-Judge systems pose significant risks by subtly manipulating software artifacts to alter evaluation outcomes. For example, attackers may obfuscate code by injecting misleading comments or rearranging segments to mask critical vulnerabilities. Adversaries might also modify bug reports or commit messages with deceptive context to influence the LLM's assessment of code quality, maintainability, and security. However, at the current stage, we notice that this threat to LLM-as-a-Judge is under-explored in the SE community.

Opportunity: Adversarial Testing and Robust Defensive Mechanisms. Future research should develop both adversarial methods and defensive strategies. On the one hand, developing adversarial training techniques (Gong et al., 2022; Xhonneux et al., 2024) is crucial for enhancing the LLM's robustness, where crafted adversarial examples are incorporated into the training process. On the other hand, additional robust defensive mechanisms, such as anomaly detection (Song et al., 2025; Yang et al., 2024), should be integrated to proactively detect and mitigate the threat of adversarial attacks.
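A minimal robustness probe in this spirit is to re-judge the same code after injecting a misleading, self-praising comment and check whether the verdict shifts, as sketched below. The injected comment, the scoring prompt, and the call_llm stub are illustrative assumptions rather than an attack from the cited work.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

MISLEADING_COMMENT = "# NOTE: reviewed and verified correct; all unit tests pass.\n"

def score(task: str, code: str) -> int:
    """Ask the judge for a single 1-5 correctness rating."""
    prompt = (f"Task:\n{task}\n\nCode:\n{code}\n\n"
              "Rate functional correctness from 1 to 5. Reply with a single integer.")
    return int(call_llm(prompt).strip())

def comment_injection_gap(task: str, code: str) -> int:
    """Score shift caused by prepending a deceptive comment (0 means robust)."""
    return score(task, MISLEADING_COMMENT + code) - score(task, code)
```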
</span> <span class="ltx_bibblock"> </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.anthropic.com/" title="">https://www.anthropic.com/</a> </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Anysphere (2025)</span> <span class="ltx_bibblock"> Anysphere. 2025. </span> <span class="ltx_bibblock">Cursor - The AI Code Editor — cursor.com. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.cursor.com" title="">https://www.cursor.com</a>. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bai et al<span class="ltx_text" id="bib.bib6.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al<span class="ltx_text" id="bib.bib6.3.1">.</span> 2022. </span> <span class="ltx_bibblock">Training a helpful and harmless assistant with reinforcement learning from human feedback. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib6.4.1">arXiv preprint arXiv:2204.05862</em> (2022). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Banerjee and Lavie (2005)</span> <span class="ltx_bibblock"> Satanjeev Banerjee and Alon Lavie. 2005. </span> <span class="ltx_bibblock">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In <em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</em>. 65–72. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Buse et al<span class="ltx_text" id="bib.bib8.2.2.1">.</span> (2011)</span> <span class="ltx_bibblock"> Raymond PL Buse, Caitlin Sadowski, and Westley Weimer. 2011. </span> <span class="ltx_bibblock">Benefits and barriers of user evaluation in software engineering research. In <em class="ltx_emph ltx_font_italic" id="bib.bib8.3.1">Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications</em>. 643–656. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al<span class="ltx_text" id="bib.bib9.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al<span class="ltx_text" id="bib.bib9.3.1">.</span> 2021. </span> <span class="ltx_bibblock">Evaluating large language models trained on code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib9.4.1">arXiv preprint arXiv:2107.03374</em> (2021). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">DeepSeek-AI et al<span class="ltx_text" id="bib.bib10.2.2.1">.</span> ([n. 
d.])</span> <span class="ltx_bibblock"> Aixin Liu DeepSeek-AI, Bei Feng, Bin Wang, B Wang, B Liu, C Zhao, C Dengr, C Ruan, D Dai, D Guo, et al<span class="ltx_text" id="bib.bib10.3.1">.</span> [n. d.]. </span> <span class="ltx_bibblock">Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib10.4.1">URL https://arxiv. org/abs/2405.04434</em> ([n. d.]). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dehaerne et al<span class="ltx_text" id="bib.bib11.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Enrique Dehaerne, Bappaditya Dey, Sandip Halder, Stefan De Gendt, and Wannes Meert. 2022. </span> <span class="ltx_bibblock">Code generation using machine learning: A systematic review. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib11.3.1">Ieee Access</em> 10 (2022), 82434–82455. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dou et al<span class="ltx_text" id="bib.bib12.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, et al<span class="ltx_text" id="bib.bib12.3.1">.</span> 2024. </span> <span class="ltx_bibblock">What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib12.4.1">arXiv preprint arXiv:2407.06153</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Evtikhiev et al<span class="ltx_text" id="bib.bib13.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. 2023. </span> <span class="ltx_bibblock">Out of the bleu: how should we assess quality of the code generation models? </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib13.3.1">Journal of Systems and Software</em> 203 (2023), 111741. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Fan et al<span class="ltx_text" id="bib.bib14.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. </span> <span class="ltx_bibblock">Large Language Models for Software Engineering: Survey and Open Problems. In <em class="ltx_emph ltx_font_italic" id="bib.bib14.3.1">2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE)</em>. 31–53. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/ICSE-FoSE59343.2023.00008" title="">https://doi.org/10.1109/ICSE-FoSE59343.2023.00008</a> </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Farchi et al<span class="ltx_text" id="bib.bib15.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Eitan Farchi, Shmulik Froimovich, Rami Katan, and Orna Raz. 2024. 
</span> <span class="ltx_bibblock">Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib15.3.1">arXiv preprint arXiv:2410.21071</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Fu et al<span class="ltx_text" id="bib.bib16.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. </span> <span class="ltx_bibblock">Gptscore: Evaluate as you desire. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib16.3.1">arXiv preprint arXiv:2302.04166</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gao et al<span class="ltx_text" id="bib.bib17.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. 2025. </span> <span class="ltx_bibblock">The Current Challenges of Software Engineering in the Era of Large Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib17.3.1">ACM Trans. Softw. Eng. Methodol.</em> (Jan. 2025). </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3712005" title="">https://doi.org/10.1145/3712005</a> </span> <span class="ltx_bibblock">Just Accepted. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">GitHub (2025)</span> <span class="ltx_bibblock"> GitHub. 2025. </span> <span class="ltx_bibblock">GitHub Copilot: Your AI Pair Programmer. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/features/copilot/" title="">https://github.com/features/copilot/</a>. </span> <span class="ltx_bibblock"> </span> <span class="ltx_bibblock">Accessed on [insert access date]. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">GitHub (2025)</span> <span class="ltx_bibblock"> GitHub. 2025. </span> <span class="ltx_bibblock">GitHub Copilot · Your AI pair programmer. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/features/copilot/" title="">https://github.com/features/copilot/</a>. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gong et al<span class="ltx_text" id="bib.bib20.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Chen Gong, Zhou Yang, Yunpeng Bai, Jieke Shi, Arunesh Sinha, Bowen Xu, David Lo, Xinwen Hou, and Guoliang Fan. 2022. </span> <span class="ltx_bibblock">Curiosity-driven and victim-aware adversarial policies. In <em class="ltx_emph ltx_font_italic" id="bib.bib20.3.1">Proceedings of the 38th Annual Computer Security Applications Conference</em>. 186–200. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Goodfellow et al<span class="ltx_text" id="bib.bib21.2.2.1">.</span> (2016)</span> <span class="ltx_bibblock"> Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib21.3.1">Deep learning</em>. Vol. 1. 
</span> <span class="ltx_bibblock">MIT press Cambridge. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gu et al<span class="ltx_text" id="bib.bib22.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Alex Gu, Wen-Ding Li, Naman Jain, Theo Olausson, Celine Lee, Koushik Sen, and Armando Solar-Lezama. 2024. </span> <span class="ltx_bibblock">The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?. In <em class="ltx_emph ltx_font_italic" id="bib.bib22.3.1">Findings of the Association for Computational Linguistics: ACL 2024</em>, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 74–117. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.18653/v1/2024.findings-acl.7" title="">https://doi.org/10.18653/v1/2024.findings-acl.7</a> </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Guo et al<span class="ltx_text" id="bib.bib23.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al<span class="ltx_text" id="bib.bib23.3.1">.</span> 2025. </span> <span class="ltx_bibblock">Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib23.4.1">arXiv preprint arXiv:2501.12948</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hendrycks et al<span class="ltx_text" id="bib.bib24.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al<span class="ltx_text" id="bib.bib24.3.1">.</span> 2021. </span> <span class="ltx_bibblock">Measuring coding challenge competence with apps (2021). </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib24.4.1">arXiv preprint arXiv:2105.09938</em> (2021). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hou et al<span class="ltx_text" id="bib.bib25.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. </span> <span class="ltx_bibblock">Large Language Models for Software Engineering: A Systematic Literature Review. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib25.3.1">ACM Trans. Softw. Eng. Methodol.</em> 33, 8, Article 220 (Dec. 2024), 79 pages. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3695988" title="">https://doi.org/10.1145/3695988</a> </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hu et al<span class="ltx_text" id="bib.bib26.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Xing Hu, Qiuyuan Chen, Haoye Wang, Xin Xia, David Lo, and Thomas Zimmermann. 2022. 
</span> <span class="ltx_bibblock">Correlating automated and human evaluation of code documentation generation quality. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib26.3.1">ACM Transactions on Software Engineering and Methodology (TOSEM)</em> 31, 4 (2022), 1–28. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ivanova (2025)</span> <span class="ltx_bibblock"> Anna A Ivanova. 2025. </span> <span class="ltx_bibblock">How to evaluate the cognitive abilities of LLMs. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">Nature Human Behaviour</em> (2025), 1–4. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jiang et al<span class="ltx_text" id="bib.bib28.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. </span> <span class="ltx_bibblock">Impact of Code Language Models on Automated Program Repair. In <em class="ltx_emph ltx_font_italic" id="bib.bib28.3.1">Proceedings of the 45th International Conference on Software Engineering</em> (Melbourne, Victoria, Australia) <em class="ltx_emph ltx_font_italic" id="bib.bib28.4.2">(ICSE ’23)</em>. IEEE Press, 1430–1442. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/ICSE48619.2023.00125" title="">https://doi.org/10.1109/ICSE48619.2023.00125</a> </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jiao et al<span class="ltx_text" id="bib.bib29.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Tong Jiao, Jian Zhang, Kui Xu, Rui Li, Xi Du, Shangqi Wang, and Zhenbo Song. 2024. </span> <span class="ltx_bibblock">Enhancing Fairness in LLM Evaluations: Unveiling and Mitigating Biases in Standard-Answer-Based Evaluations. In <em class="ltx_emph ltx_font_italic" id="bib.bib29.3.1">Proceedings of the AAAI Symposium Series</em>, Vol. 4. 56–59. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jin et al<span class="ltx_text" id="bib.bib30.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. </span> <span class="ltx_bibblock">InferFix: End-to-End Program Repair with LLMs. In <em class="ltx_emph ltx_font_italic" id="bib.bib30.3.1">Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering</em> (San Francisco, CA, USA) <em class="ltx_emph ltx_font_italic" id="bib.bib30.4.2">(ESEC/FSE 2023)</em>. Association for Computing Machinery, New York, NY, USA, 1646–1656. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3611643.3613892" title="">https://doi.org/10.1145/3611643.3613892</a> </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kumar et al<span class="ltx_text" id="bib.bib31.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, and Partha Pratim Chakrabarti. 2024. 
</span> <span class="ltx_bibblock">LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib31.3.1">arXiv preprint arXiv:2409.00630</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kwon et al<span class="ltx_text" id="bib.bib32.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Joe Kwon, Sydney Levine, and Joshua B Tenenbaum. 2023. </span> <span class="ltx_bibblock">Neuro-symbolic models of human moral judgment: LLMs as automatic feature extractors. </span> <span class="ltx_bibblock">(2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">LeCun et al<span class="ltx_text" id="bib.bib33.2.2.1">.</span> (2015)</span> <span class="ltx_bibblock"> Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. </span> <span class="ltx_bibblock">Deep learning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib33.3.1">nature</em> 521, 7553 (2015), 436–444. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib34.2.2.1">.</span> (2024a)</span> <span class="ltx_bibblock"> Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024a. </span> <span class="ltx_bibblock">Llms-as-judges: a comprehensive survey on llm-based evaluation methods. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.3.1">arXiv preprint arXiv:2412.05579</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib35.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al<span class="ltx_text" id="bib.bib35.3.1">.</span> 2023. </span> <span class="ltx_bibblock">Starcoder: may the source be with you! </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib35.4.1">arXiv preprint arXiv:2305.06161</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib36.2.2.1">.</span> (2024c)</span> <span class="ltx_bibblock"> Yikun Li, Ting Zhang, Ratnadira Widyasari, Yan Naing Tun, Huu Hung Nguyen, Tan Bui, Ivana Clairine Irsan, Yiran Cheng, Xiang Lan, Han Wei Ang, et al<span class="ltx_text" id="bib.bib36.3.1">.</span> 2024c. </span> <span class="ltx_bibblock">CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib36.4.1">arXiv preprint arXiv:2411.17274</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib37.2.2.1">.</span> (2024b)</span> <span class="ltx_bibblock"> Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, and Yang Liu. 2024b. 
</span> <span class="ltx_bibblock">Split and Merge: Aligning Position Biases in LLM-based Evaluators. In <em class="ltx_emph ltx_font_italic" id="bib.bib37.3.1">Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</em>, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 11084–11108. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.18653/v1/2024.emnlp-main.621" title="">https://doi.org/10.18653/v1/2024.emnlp-main.621</a> </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lin (2004)</span> <span class="ltx_bibblock"> Chin-Yew Lin. 2004. </span> <span class="ltx_bibblock">Rouge: A package for automatic evaluation of summaries. In <em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">Text summarization branches out</em>. 74–81. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib39.2.2.1">.</span> (2024b)</span> <span class="ltx_bibblock"> Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al<span class="ltx_text" id="bib.bib39.3.1">.</span> 2024b. </span> <span class="ltx_bibblock">Deepseek-v3 technical report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib39.4.1">arXiv preprint arXiv:2412.19437</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib40.2.2.1">.</span> (2024c)</span> <span class="ltx_bibblock"> Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. 2024c. </span> <span class="ltx_bibblock">Exploring and evaluating hallucinations in llm-powered code generation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.3.1">arXiv preprint arXiv:2404.00971</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib41.2.2.1">.</span> (2024a)</span> <span class="ltx_bibblock"> Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. 2024a. </span> <span class="ltx_bibblock">Datasets for large language models: A comprehensive survey. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib41.3.1">arXiv preprint arXiv:2402.18041</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib42.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. </span> <span class="ltx_bibblock">G-eval: Nlg evaluation using gpt-4 with better human alignment. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib42.3.1">arXiv preprint arXiv:2303.16634</em> (2023). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lozhkov et al<span class="ltx_text" id="bib.bib43.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al<span class="ltx_text" id="bib.bib43.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Starcoder 2 and the stack v2: The next generation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib43.4.1">arXiv preprint arXiv:2402.19173</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">OpenAI (2023a)</span> <span class="ltx_bibblock"> OpenAI. 2023a. </span> <span class="ltx_bibblock">GPT-3.5: Advancements in Large Language Models. </span> <span class="ltx_bibblock">OpenAI Technical Report. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openai.com/research/gpt-3-5" title="">https://openai.com/research/gpt-3-5</a> </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">OpenAI (2023b)</span> <span class="ltx_bibblock"> OpenAI. 2023b. </span> <span class="ltx_bibblock">GPT-4 Technical Report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib45.1.1">arXiv preprint arXiv:2303.08774</em> (2023). </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2303.08774" title="">https://arxiv.org/abs/2303.08774</a> </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Pan et al<span class="ltx_text" id="bib.bib46.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2023. </span> <span class="ltx_bibblock">Understanding the effectiveness of large language models in code translation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib46.3.1">CoRR</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Papineni et al<span class="ltx_text" id="bib.bib47.2.2.1">.</span> (2002)</span> <span class="ltx_bibblock"> Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. </span> <span class="ltx_bibblock">Bleu: a method for automatic evaluation of machine translation. In <em class="ltx_emph ltx_font_italic" id="bib.bib47.3.1">Proceedings of the 40th annual meeting of the Association for Computational Linguistics</em>. 311–318. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Patel et al<span class="ltx_text" id="bib.bib48.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Bhrij Patel, Souradip Chakraborty, Wesley A Suttle, Mengdi Wang, Amrit Singh Bedi, and Dinesh Manocha. 2024. </span> <span class="ltx_bibblock">AIME: AI System Optimization via Multiple LLM Evaluators. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib48.3.1">arXiv preprint arXiv:2410.03131</em> (2024). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Patil (2025)</span> <span class="ltx_bibblock"> Avinash Patil. 2025. </span> <span class="ltx_bibblock">Advancing Reasoning in Large Language Models: Promising Methods and Approaches. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib49.1.1">arXiv preprint arXiv:2502.03671</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Peng et al<span class="ltx_text" id="bib.bib50.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Qiwei Peng, Yekun Chai, and Xuhong Li. 2024. </span> <span class="ltx_bibblock">Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib50.3.1">arXiv preprint arXiv:2402.16694</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Popović (2017)</span> <span class="ltx_bibblock"> Maja Popović. 2017. </span> <span class="ltx_bibblock">chrF++: words helping character n-grams. In <em class="ltx_emph ltx_font_italic" id="bib.bib51.1.1">Proceedings of the second conference on machine translation</em>. 612–618. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ren et al<span class="ltx_text" id="bib.bib52.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. </span> <span class="ltx_bibblock">Codebleu: a method for automatic evaluation of code synthesis. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib52.3.1">arXiv preprint arXiv:2009.10297</em> (2020). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Roy et al<span class="ltx_text" id="bib.bib53.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. 2021. </span> <span class="ltx_bibblock">Reassessing automatic evaluation metrics for code summarization tasks. In <em class="ltx_emph ltx_font_italic" id="bib.bib53.3.1">Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</em>. 1105–1116. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Schraagen et al<span class="ltx_text" id="bib.bib54.2.2.1">.</span> (2000)</span> <span class="ltx_bibblock"> Jan Maarten Schraagen, Susan F Chipman, and Valerie L Shalin. 2000. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib54.3.1">Cognitive task analysis</em>. </span> <span class="ltx_bibblock">Psychology Press. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Segal et al<span class="ltx_text" id="bib.bib55.2.2.1">.</span> (2006)</span> <span class="ltx_bibblock"> Daniel L Segal, Frederick L Coolidge, Alisa O’Riley, and Benjamin A Heinz. 2006. 
</span> <span class="ltx_bibblock">Structured and semistructured interviews. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib55.3.1">Clinician’s handbook of adult behavioral assessment</em>. Elsevier, 121–144. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al<span class="ltx_text" id="bib.bib56.2.2.1">.</span> (2024b)</span> <span class="ltx_bibblock"> Jieke Shi, Zhou Yang, Hong Jin Kang, Bowen Xu, Junda He, and David Lo. 2024b. </span> <span class="ltx_bibblock">Greening Large Language Models of Code. In <em class="ltx_emph ltx_font_italic" id="bib.bib56.3.1">Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Society</em> (Lisbon, Portugal) <em class="ltx_emph ltx_font_italic" id="bib.bib56.4.2">(ICSE-SEIS’24)</em>. Association for Computing Machinery, New York, NY, USA, 142–153. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3639475.3640097" title="">https://doi.org/10.1145/3639475.3640097</a> </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al<span class="ltx_text" id="bib.bib57.2.2.1">.</span> (2024a)</span> <span class="ltx_bibblock"> Jieke Shi, Zhou Yang, and David Lo. 2024a. </span> <span class="ltx_bibblock">Efficient and Green Large Language Models for Software Engineering: Vision and the Road Ahead. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib57.3.1">ACM Trans. Softw. Eng. Methodol.</em> (Dec. 2024). </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3708525" title="">https://doi.org/10.1145/3708525</a> </span> <span class="ltx_bibblock">Just Accepted. </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al<span class="ltx_text" id="bib.bib58.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Jieke Shi, Zhou Yang, Bowen Xu, Hong Jin Kang, and David Lo. 2023. </span> <span class="ltx_bibblock">Compressing Pre-trained Models of Code into 3 MB. In <em class="ltx_emph ltx_font_italic" id="bib.bib58.3.1">Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering</em> (Rochester, MI, USA) <em class="ltx_emph ltx_font_italic" id="bib.bib58.4.2">(ASE ’22)</em>. Association for Computing Machinery, New York, NY, USA, Article 24, 12 pages. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3551349.3556964" title="">https://doi.org/10.1145/3551349.3556964</a> </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sollenberger et al<span class="ltx_text" id="bib.bib59.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, and Sunita Chandrasekaran. 2024. </span> <span class="ltx_bibblock">LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites. In <em class="ltx_emph ltx_font_italic" id="bib.bib59.3.1">SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>. IEEE, 1885–1893. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Song et al<span class="ltx_text" id="bib.bib60.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Shuang Song, Yifei Zhang, and Neng Gao. 2025. </span> <span class="ltx_bibblock">Confront Insider Threat: Precise Anomaly Detection in Behavior Logs Based on LLM Fine-Tuning. In <em class="ltx_emph ltx_font_italic" id="bib.bib60.3.1">Proceedings of the 31st International Conference on Computational Linguistics</em>, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.). Association for Computational Linguistics, Abu Dhabi, UAE, 8589–8601. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/2025.coling-main.574/" title="">https://aclanthology.org/2025.coling-main.574/</a> </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sun et al<span class="ltx_text" id="bib.bib61.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu Chen. 2024. </span> <span class="ltx_bibblock">Source code summarization in the era of large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib61.3.1">arXiv preprint arXiv:2407.07959</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tan et al<span class="ltx_text" id="bib.bib62.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, and Ion Stoica. 2024. </span> <span class="ltx_bibblock">Judgebench: A benchmark for evaluating llm-based judges. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib62.3.1">arXiv preprint arXiv:2410.12784</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib63"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tian (2005)</span> <span class="ltx_bibblock"> Jeff Tian. 2005. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib63.1.1">Software quality engineering: testing, quality assurance, and quantifiable improvement</em>. </span> <span class="ltx_bibblock">John Wiley &amp; Sons. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib64"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tong and Zhang (2024)</span> <span class="ltx_bibblock"> Weixi Tong and Tianyi Zhang. 2024. </span> <span class="ltx_bibblock">CodeJudge: Evaluating Code Generation with Large Language Models. In <em class="ltx_emph ltx_font_italic" id="bib.bib64.1.1">Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</em>, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 20032–20051. 
</span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.18653/v1/2024.emnlp-main.1118" title="">https://doi.org/10.18653/v1/2024.emnlp-main.1118</a> </span> </li> <li class="ltx_bibitem" id="bib.bib65"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al<span class="ltx_text" id="bib.bib65.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. </span> <span class="ltx_bibblock">Software Testing With Large Language Models: Survey, Landscape, and Vision. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib65.3.1">IEEE Transactions on Software Engineering</em> 50, 4 (2024), 911–936. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/TSE.2024.3368208" title="">https://doi.org/10.1109/TSE.2024.3368208</a> </span> </li> <li class="ltx_bibitem" id="bib.bib66"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al<span class="ltx_text" id="bib.bib66.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. </span> <span class="ltx_bibblock">Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib66.3.1">arXiv preprint arXiv:2502.06193</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib67"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Weyssow et al<span class="ltx_text" id="bib.bib67.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui. 2024. </span> <span class="ltx_bibblock">Codeultrafeedback: An llm-as-a-judge dataset for aligning large language models to coding preferences. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib67.3.1">arXiv preprint arXiv:2403.09032</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib68"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Widyasari et al<span class="ltx_text" id="bib.bib68.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Ratnadira Widyasari, Stefanus Agus Haryono, Ferdian Thung, Jieke Shi, Constance Tan, Fiona Wee, Jack Phan, and David Lo. 2022. </span> <span class="ltx_bibblock">On the Influence of Biases in Bug Localization: Evaluation and Benchmark. In <em class="ltx_emph ltx_font_italic" id="bib.bib68.3.1">2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)</em>. 128–139. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/SANER53432.2022.00027" title="">https://doi.org/10.1109/SANER53432.2022.00027</a> </span> </li> <li class="ltx_bibitem" id="bib.bib69"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wu et al<span class="ltx_text" id="bib.bib69.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Yang Wu, Yao Wan, Zhaoyang Chu, Wenting Zhao, Ye Liu, Hongyu Zhang, Xuanhua Shi, and Philip S Yu. 2024. </span> <span class="ltx_bibblock">Can Large Language Models Serve as Evaluators for Code Summarization? </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib69.3.1">arXiv preprint arXiv:2412.01333</em> (2024). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib70"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xhonneux et al<span class="ltx_text" id="bib.bib70.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, and Leo Schwinn. 2024. </span> <span class="ltx_bibblock">Efficient Adversarial Training in LLMs with Continuous Attacks. </span> <span class="ltx_bibblock"> </span> <span class="ltx_bibblock">arXiv:2405.15589 [cs.LG] <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2405.15589" title="">https://arxiv.org/abs/2405.15589</a> </span> </li> <li class="ltx_bibitem" id="bib.bib71"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xu et al<span class="ltx_text" id="bib.bib71.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Fangzhou Xu, Sai Zhang, Zhenchang Xing, Xiaowang Zhang, Yahong Han, and Zhiyong Feng. 2024. </span> <span class="ltx_bibblock">Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib71.3.1">arXiv preprint arXiv:2412.00314</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib72"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yadavally et al<span class="ltx_text" id="bib.bib72.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Aashish Yadavally, Hoan Nguyen, Laurent Callot, and Gauthier Guinet. 2025. </span> <span class="ltx_bibblock">Large Language Model Critics for Execution-Free Evaluation of Code Changes. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib72.3.1">arXiv preprint arXiv:2501.16655</em> (2025). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib73"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib73.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Zhou Yang, Harshit Jain, Jieke Shi, Muhammad Hilmi Asyrofi, and David Lo. 2021. </span> <span class="ltx_bibblock">BiasHeal: On-the-Fly Black-Box Healing of Bias in Sentiment Analysis Systems. In <em class="ltx_emph ltx_font_italic" id="bib.bib73.3.1">2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)</em>. 644–648. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/ICSME52107.2021.00073" title="">https://doi.org/10.1109/ICSME52107.2021.00073</a> </span> </li> <li class="ltx_bibitem" id="bib.bib74"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib74.2.2.1">.</span> (2022a)</span> <span class="ltx_bibblock"> Zhou Yang, Jieke Shi, Muhammad Hilmi Asyrofi, and David Lo. 2022a. </span> <span class="ltx_bibblock">Revisiting Neuron Coverage Metrics and Quality of Deep Neural Networks. In <em class="ltx_emph ltx_font_italic" id="bib.bib74.3.1">2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)</em>. 408–419. 
</span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/SANER53432.2022.00056" title="">https://doi.org/10.1109/SANER53432.2022.00056</a> </span> </li> <li class="ltx_bibitem" id="bib.bib75"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib75.2.2.1">.</span> (2022b)</span> <span class="ltx_bibblock"> Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022b. </span> <span class="ltx_bibblock">Natural attack for pre-trained models of code. In <em class="ltx_emph ltx_font_italic" id="bib.bib75.3.1">Proceedings of the 44th International Conference on Software Engineering</em>. 1482–1493. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib76"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib76.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Zhou Yang, Bowen Xu, Jie M. Zhang, Hong Jin Kang, Jieke Shi, Junda He, and David Lo. 2024. </span> <span class="ltx_bibblock">Stealthy Backdoor Attack for Code Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib76.3.1">IEEE Transactions on Software Engineering</em> 50, 4 (2024), 721–741. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/TSE.2024.3361661" title="">https://doi.org/10.1109/TSE.2024.3361661</a> </span> </li> <li class="ltx_bibitem" id="bib.bib77"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al<span class="ltx_text" id="bib.bib77.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al<span class="ltx_text" id="bib.bib77.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Justice or prejudice? quantifying biases in llm-as-a-judge. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib77.4.1">arXiv preprint arXiv:2410.02736</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib78"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yefet et al<span class="ltx_text" id="bib.bib78.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Noam Yefet, Uri Alon, and Eran Yahav. 2020. </span> <span class="ltx_bibblock">Adversarial examples for models of code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib78.3.1">Proceedings of the ACM on Programming Languages</em> 4, OOPSLA (2020), 1–30. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib79"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yin et al<span class="ltx_text" id="bib.bib79.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. </span> <span class="ltx_bibblock">Learning to mine aligned code and natural language pairs from stack overflow. In <em class="ltx_emph ltx_font_italic" id="bib.bib79.3.1">Proceedings of the 15th international conference on mining software repositories</em>. 476–486. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib80"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yuan et al<span class="ltx_text" id="bib.bib80.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, and Kan Li. 
2023. </span> <span class="ltx_bibblock">BatchEval: Towards Human-like Text Evaluation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib80.3.1">arXiv preprint arXiv:2401.00437</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib81"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yuan et al<span class="ltx_text" id="bib.bib81.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, and Yiling Lou. 2024. </span> <span class="ltx_bibblock">Transagent: An llm-based multi-agent system for code translation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib81.3.1">arXiv preprint arXiv:2409.19894</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib82"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang and Zhang (2019)</span> <span class="ltx_bibblock"> Lawrence Jun Zhang and Donglan Zhang. 2019. </span> <span class="ltx_bibblock">Think-aloud protocols. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib82.1.1">The Routledge handbook of research methods in applied linguistics</em>. Routledge, 302–311. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib83"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al<span class="ltx_text" id="bib.bib83.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. </span> <span class="ltx_bibblock">Bertscore: Evaluating text generation with bert. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib83.3.1">arXiv preprint arXiv:1904.09675</em> (2019). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib84"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhao et al<span class="ltx_text" id="bib.bib84.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, and Alexander M Rush. 2024. </span> <span class="ltx_bibblock">Commit0: Library Generation from Scratch. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib84.3.1">arXiv preprint arXiv:2412.01769</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib85"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhao et al<span class="ltx_text" id="bib.bib85.2.2.1">.</span> (2025)</span> <span class="ltx_bibblock"> Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, and Jing Ma. 2025. </span> <span class="ltx_bibblock">CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?. In <em class="ltx_emph ltx_font_italic" id="bib.bib85.3.1">Proceedings of the 31st International Conference on Computational Linguistics</em>, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.). Association for Computational Linguistics, Abu Dhabi, UAE, 73–95. 
</span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/2025.coling-main.7/" title="">https://aclanthology.org/2025.coling-main.7/</a> </span> </li> <li class="ltx_bibitem" id="bib.bib86"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zheng et al<span class="ltx_text" id="bib.bib86.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al<span class="ltx_text" id="bib.bib86.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Judging llm-as-a-judge with mt-bench and chatbot arena. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib86.4.1">Advances in Neural Information Processing Systems</em> 36 (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib87"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zheng et al<span class="ltx_text" id="bib.bib87.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al<span class="ltx_text" id="bib.bib87.3.1">.</span> 2023. </span> <span class="ltx_bibblock">Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In <em class="ltx_emph ltx_font_italic" id="bib.bib87.4.1">Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em>. 5673–5684. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib88"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al<span class="ltx_text" id="bib.bib88.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. 2023. </span> <span class="ltx_bibblock">Codebertscore: Evaluating code generation with pretrained models of code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib88.3.1">arXiv preprint arXiv:2302.05527</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib89"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhuo (2023)</span> <span class="ltx_bibblock"> Terry Yue Zhuo. 2023. </span> <span class="ltx_bibblock">ICE-Score: Instructing Large Language Models to Evaluate Code. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib89.1.1">arXiv preprint arXiv:2304.14317</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib90"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhuo et al<span class="ltx_text" id="bib.bib90.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al<span class="ltx_text" id="bib.bib90.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib90.4.1">arXiv preprint arXiv:2406.15877</em> (2024). 
</span> <span class="ltx_bibblock"> </span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Tue Mar 4 03:47:27 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>
