<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios</title> <!--Generated on Mon Nov 11 14:40:48 2024 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2411.07037v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S1" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S2" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Related 
Work</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S3" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>LIFBench</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S3.SS1" title="In 3 LIFBench ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>Problem Definition</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S3.SS2" title="In 3 LIFBench ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>Data Collection</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S3.SS3" title="In 3 LIFBench ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3 </span>Data Extension</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S4" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>LIFEval</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry 
ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S4.SS1" title="In 4 LIFEval ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Automated Rubric-based Scoring</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S4.SS2" title="In 4 LIFEval ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Score-Capability Mapping</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S4.SS3" title="In 4 LIFEval ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.3 </span>Instruction Following Stability</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S5" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Experiments</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S5.SS1" title="In 5 Experiments ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.1 </span>Experiment Setup</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a 
class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S5.SS2" title="In 5 Experiments ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2 </span>Results on LIFBench</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S6" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Conclusion</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A1" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A </span>Details in data collection</span></a> <ol class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A1.SS1" title="In Appendix A Details in data collection ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.1 </span>List</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A1.SS2" title="In Appendix A Details in data collection ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.2 </span>MultiDoc</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" 
href="https://arxiv.org/html/2411.07037v1#A2" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B </span>Effectiveness of Expression Extension</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A3" title="In LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">C </span>Examples in LIFBench</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document"> <h1 class="ltx_title ltx_title_document"> LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios </h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Xiaodong Wu<sup class="ltx_sup" id="id13.13.id1"><span class="ltx_text ltx_font_italic" id="id13.13.id1.1">†</span></sup> Minhao Wang<sup class="ltx_sup" id="id14.14.id2"><span class="ltx_text ltx_font_italic" id="id14.14.id2.1">†</span></sup> Yichen Liu <sup class="ltx_sup" id="id15.15.id3"><span class="ltx_text ltx_font_italic" id="id15.15.id3.1">†</span></sup> <span class="ltx_text ltx_font_bold" id="id4.4.1"> Xiaoming Shi <sup class="ltx_sup" id="id4.4.1.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="id4.4.1.1.1">†</span></sup> </span> <br class="ltx_break"/><span class="ltx_text ltx_font_bold" id="id5.5.2"> He Yan <sup class="ltx_sup" id="id5.5.2.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="id5.5.2.1.1">§</span></sup> </span> <span class="ltx_text ltx_font_bold" id="id6.6.3"> Xiangju Lu <sup class="ltx_sup" id="id6.6.3.1"><span class="ltx_text ltx_font_medium ltx_font_italic" 
id="id6.6.3.1.1">§</span></sup></span> <span class="ltx_text ltx_font_bold" id="id7.7.4"> Junmin Zhu <sup class="ltx_sup" id="id7.7.4.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="id7.7.4.1.1">§</span></sup></span> <span class="ltx_text ltx_font_bold" id="id16.16.id4"> Wei Zhang</span> <sup class="ltx_sup" id="id17.17.id5"><span class="ltx_text ltx_font_italic" id="id17.17.id5.1">†</span></sup> <br class="ltx_break"/><sup class="ltx_sup" id="id18.18.id6"><span class="ltx_text ltx_font_italic" id="id18.18.id6.1">†</span></sup>East China Normal University <sup class="ltx_sup" id="id19.19.id7"><span class="ltx_text ltx_font_italic" id="id19.19.id7.1">§</span></sup>iQIYI Inc <br class="ltx_break"/><sup class="ltx_sup" id="id20.20.id8"><span class="ltx_text ltx_font_italic" id="id20.20.id8.1">†</span></sup><span class="ltx_text ltx_font_typewriter" id="id12.12.5">{51255901079,51275901104,51275901148}@stu.ecnu.edu.cn xmshi@ir.hit.edu.cn <br class="ltx_break"/> <sup class="ltx_sup" id="id12.12.5.1"><span class="ltx_text ltx_font_serif ltx_font_italic" id="id12.12.5.1.1">§</span></sup>{yanhe, luxiangju, zhujunmin}@qiyi.com <sup class="ltx_sup" id="id12.12.5.2">*</sup>zhangwei.thu2011@gmail.com </span> </span><span class="ltx_author_notes">Corresponding author.</span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id21.id1">As Large Language Models (LLMs) continue to advance in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become crucial for real-world applications. While existing benchmarks assess various LLM capabilities, they rarely focus on instruction-following in long-context scenarios or stability on different inputs. In response, we introduce the Long-context Instruction-Following Benchmark (LIFBench), a scalable dataset designed to evaluate LLMs’ instruction-following capabilities and stability across long contexts. 
LIFBench comprises three long-context scenarios and eleven diverse tasks, supported by 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment framework that provides precise, automated scoring of complex LLM responses without relying on LLM-assisted evaluations or human judgments. This approach facilitates a comprehensive analysis of model performance and stability across various perspectives. We conduct extensive experiments on 20 notable LLMs across six length intervals, analyzing their instruction-following capabilities and stability. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex, long-context settings, providing insights that can inform future LLM development.</p> </div> <div class="ltx_para ltx_noindent" id="p1"> <div class="ltx_block ltx_align_bottom" id="p1.12"> <p class="ltx_p" id="p1.12.13"><span class="ltx_text ltx_font_bold" id="p1.12.13.1">LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios</span></p> <br class="ltx_break ltx_centering"/> <p class="ltx_p ltx_align_center" id="p1.12.12" style="width:433.6pt;"><span class="ltx_text ltx_inline-block" id="p1.12.12.12" style="width:0.0pt;"> <span class="ltx_tabular ltx_align_top" id="p1.12.12.12.12"> <span class="ltx_tr" id="p1.4.4.4.4.4"> <span class="ltx_td ltx_align_center" id="p1.4.4.4.4.4.4"><span class="ltx_text ltx_font_bold" id="p1.4.4.4.4.4.4.4">Xiaodong Wu<sup class="ltx_sup" id="p1.4.4.4.4.4.4.4.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="p1.4.4.4.4.4.4.4.1.1">†</span></sup> Minhao Wang<sup class="ltx_sup" id="p1.4.4.4.4.4.4.4.2"><span class="ltx_text ltx_font_medium ltx_font_italic" id="p1.4.4.4.4.4.4.4.2.1">†</span></sup> Yichen Liu <sup class="ltx_sup" id="p1.4.4.4.4.4.4.4.3"><span class="ltx_text ltx_font_medium 
ltx_font_italic" id="p1.4.4.4.4.4.4.4.3.1">†</span></sup> Xiaoming Shi <sup class="ltx_sup" id="p1.4.4.4.4.4.4.4.4"><span class="ltx_text ltx_font_medium ltx_font_italic" id="p1.4.4.4.4.4.4.4.4.1">†</span></sup></span></span></span> <span class="ltx_tr" id="p1.8.8.8.8.8"> <span class="ltx_td ltx_align_center" id="p1.8.8.8.8.8.4"><span class="ltx_text ltx_font_bold" id="p1.5.5.5.5.5.1.1"> He Yan <sup class="ltx_sup" id="p1.5.5.5.5.5.1.1.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="p1.5.5.5.5.5.1.1.1.1">§</span></sup> </span> <span class="ltx_text ltx_font_bold" id="p1.6.6.6.6.6.2.2"> Xiangju Lu <sup class="ltx_sup" id="p1.6.6.6.6.6.2.2.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="p1.6.6.6.6.6.2.2.1.1">§</span></sup></span> <span class="ltx_text ltx_font_bold" id="p1.7.7.7.7.7.3.3"> Junmin Zhu <sup class="ltx_sup" id="p1.7.7.7.7.7.3.3.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="p1.7.7.7.7.7.3.3.1.1">§</span></sup></span> <span class="ltx_text ltx_font_bold" id="p1.8.8.8.8.8.4.4"> Wei Zhang</span> <sup class="ltx_sup" id="p1.8.8.8.8.8.4.5"><span class="ltx_text ltx_font_italic" id="p1.8.8.8.8.8.4.5.1">†</span></sup><span class="ltx_note ltx_role_thanks" id="p1.8.8.8.8.8.4.6"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">thanks: </span>Corresponding author.</span></span></span></span></span> <span class="ltx_tr" id="p1.10.10.10.10.10"> <span class="ltx_td ltx_align_center" id="p1.10.10.10.10.10.2"><sup class="ltx_sup" id="p1.10.10.10.10.10.2.1"><span class="ltx_text ltx_font_italic" id="p1.10.10.10.10.10.2.1.1">†</span></sup>East China Normal University <sup class="ltx_sup" id="p1.10.10.10.10.10.2.2"><span class="ltx_text ltx_font_italic" id="p1.10.10.10.10.10.2.2.1">§</span></sup>iQIYI Inc</span></span> <span class="ltx_tr" id="p1.11.11.11.11.11"> <span class="ltx_td ltx_align_center" 
id="p1.11.11.11.11.11.1"><sup class="ltx_sup" id="p1.11.11.11.11.11.1.1"><span class="ltx_text ltx_font_italic" id="p1.11.11.11.11.11.1.1.1">†</span></sup><span class="ltx_text ltx_font_typewriter" id="p1.11.11.11.11.11.1.2">{51255901079,51275901104,51275901148}@stu.ecnu.edu.cn xmshi@ir.hit.edu.cn</span></span></span> <span class="ltx_tr" id="p1.12.12.12.12.12"> <span class="ltx_td ltx_align_center" id="p1.12.12.12.12.12.1"><span class="ltx_text ltx_font_typewriter" id="p1.12.12.12.12.12.1.1"> <sup class="ltx_sup" id="p1.12.12.12.12.12.1.1.1"><span class="ltx_text ltx_font_serif ltx_font_italic" id="p1.12.12.12.12.12.1.1.1.1">§</span></sup>{yanhe, luxiangju, zhujunmin}@qiyi.com <sup class="ltx_sup" id="p1.12.12.12.12.12.1.1.2">*</sup>zhangwei.thu2011@gmail.com</span></span></span> </span></span></p> <br class="ltx_break ltx_centering"/> </div> </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2> <figure class="ltx_table" id="S1.T1"> <div class="ltx_flex_figure ltx_flex_table"> <div class="ltx_flex_cell ltx_flex_size_1"> <div class="ltx_inline-block ltx_figure_panel ltx_align_center ltx_transformed_outer" id="S1.T1.1" style="width:433.6pt;height:241.1pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(22.6pt,-12.6pt) scale(1.11637654654324,1.11637654654324) ;"> <table class="ltx_tabular ltx_align_middle" id="S1.T1.1.1"> <tr class="ltx_tr" id="S1.T1.1.1.2" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.2.1"> <span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span> <span class="ltx_text" id="S1.T1.1.1.2.1.1" style="background-color:#E6E6E6;"> <span class="ltx_text ltx_font_bold" id="S1.T1.1.1.2.1.1.1">Benchmark</span></span> </td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.2.2"><span class="ltx_text ltx_font_bold" id="S1.T1.1.1.2.2.1" 
style="background-color:#E6E6E6;">Long.</span></td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.2.3"><span class="ltx_text ltx_font_bold" id="S1.T1.1.1.2.3.1" style="background-color:#E6E6E6;">Inst.</span></td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.2.4"><span class="ltx_text ltx_font_bold" id="S1.T1.1.1.2.4.1" style="background-color:#E6E6E6;">Stab.</span></td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.2.5"><span class="ltx_text ltx_font_bold" id="S1.T1.1.1.2.5.1" style="background-color:#E6E6E6;">Unlim.</span></td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.3"> <td class="ltx_td ltx_align_left ltx_border_t" id="S1.T1.1.1.3.1">ZeroSCROLLS (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib42" title="">2023</a></cite>)</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.3.2">✓</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.3.3">✗</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.3.4">✗</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.3.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.4"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.4.1">BAMBOO (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib12" title="">2024</a></cite>)</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.4.2">✓</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.4.3">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.4.4">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.4.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.5"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.5.1">Longbench (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib4" title="">2023</a></cite>)</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.5.2">✓</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.5.3">✗</td> 
<td class="ltx_td ltx_align_center" id="S1.T1.1.1.5.4">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.5.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.1"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.1.1"> <math alttext="\infty" class="ltx_Math" display="inline" id="S1.T1.1.1.1.1.m1.1"><semantics id="S1.T1.1.1.1.1.m1.1a"><mi id="S1.T1.1.1.1.1.m1.1.1" mathvariant="normal" xref="S1.T1.1.1.1.1.m1.1.1.cmml">∞</mi><annotation-xml encoding="MathML-Content" id="S1.T1.1.1.1.1.m1.1b"><infinity id="S1.T1.1.1.1.1.m1.1.1.cmml" xref="S1.T1.1.1.1.1.m1.1.1"></infinity></annotation-xml><annotation encoding="application/x-tex" id="S1.T1.1.1.1.1.m1.1c">\infty</annotation><annotation encoding="application/x-llamapun" id="S1.T1.1.1.1.1.m1.1d">∞</annotation></semantics></math> Bench (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib52" title="">2024c</a></cite>)</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.1.2">✓</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.1.3">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.1.4">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.1.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.6"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.6.1">RULER (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib18" title="">2024</a></cite>)</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.6.2">✓</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.6.3">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.6.4">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.6.5">✓</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.7"> <td class="ltx_td ltx_align_left ltx_border_t" id="S1.T1.1.1.7.1">IFEval (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib54" title="">2023a</a></cite>)</td> <td class="ltx_td ltx_align_center 
ltx_border_t" id="S1.T1.1.1.7.2">✗</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.7.3">✓</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.7.4">✗</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.7.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.8"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.8.1">FollowBench (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib20" title="">2023</a></cite>)</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.8.2">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.8.3">✓</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.8.4">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.8.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.9"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.9.1">InfoBench (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib40" title="">2024b</a></cite>)</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.9.2">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.9.3">✓</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.9.4">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.9.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.10"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.10.1">CELLO (<cite class="ltx_cite ltx_citemacro_citeyear"><a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib16" title="">2024</a></cite>)</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.10.2">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.10.3">✓</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.10.4">✗</td> <td class="ltx_td ltx_align_center" id="S1.T1.1.1.10.5">✗</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.11"> <td class="ltx_td ltx_align_left ltx_border_t" id="S1.T1.1.1.11.1">LIFBench (Ours)</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.11.2">✓</td> <td 
class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.11.3">✓</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.11.4">✓</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S1.T1.1.1.11.5">✓</td> </tr> <tr class="ltx_tr" id="S1.T1.1.1.12"> <td class="ltx_td ltx_align_left" id="S1.T1.1.1.12.1"><span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span></td> <td class="ltx_td" id="S1.T1.1.1.12.2"></td> <td class="ltx_td" id="S1.T1.1.1.12.3"></td> <td class="ltx_td" id="S1.T1.1.1.12.4"></td> <td class="ltx_td" id="S1.T1.1.1.12.5"></td> </tr> </table> </span></div> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span>A comparison of our LIFBench with some relevant datasets. We summarize their focus, including <span class="ltx_text ltx_framed ltx_framed_underline" id="S1.T1.6.1">Long</span>-context scenarios, <span class="ltx_text ltx_framed ltx_framed_underline" id="S1.T1.7.2">Inst</span>ruction-Following, and model <span class="ltx_text ltx_framed ltx_framed_underline" id="S1.T1.8.3">stab</span>ility. ‘Unlim.’ denotes whether the data length can be <span class="ltx_text ltx_framed ltx_framed_underline" id="S1.T1.9.4">unlim</span>ited. </figcaption><div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <p class="ltx_p ltx_figure_panel ltx_align_center" id="S1.T1.10">.</p> </div> </div> </figure> <figure class="ltx_figure" id="S1.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="417" id="S1.F1.g1" src="x1.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>The framework of LIFBench. 
<span class="ltx_text ltx_font_bold" id="S1.F1.14.1">Bold</span> denotes the scenario description <math alttext="D" class="ltx_Math" display="inline" id="S1.F1.6.m1.1"><semantics id="S1.F1.6.m1.1b"><mi id="S1.F1.6.m1.1.1" xref="S1.F1.6.m1.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S1.F1.6.m1.1c"><ci id="S1.F1.6.m1.1.1.cmml" xref="S1.F1.6.m1.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S1.F1.6.m1.1d">D</annotation><annotation encoding="application/x-llamapun" id="S1.F1.6.m1.1e">italic_D</annotation></semantics></math>; normal denotes the context <math alttext="X" class="ltx_Math" display="inline" id="S1.F1.7.m2.1"><semantics id="S1.F1.7.m2.1b"><mi id="S1.F1.7.m2.1.1" xref="S1.F1.7.m2.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S1.F1.7.m2.1c"><ci id="S1.F1.7.m2.1.1.cmml" xref="S1.F1.7.m2.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S1.F1.7.m2.1d">X</annotation><annotation encoding="application/x-llamapun" id="S1.F1.7.m2.1e">italic_X</annotation></semantics></math>; <span class="ltx_text ltx_font_italic" id="S1.F1.15.2">italics</span> denotes instruction <math alttext="I" class="ltx_Math" display="inline" id="S1.F1.8.m3.1"><semantics id="S1.F1.8.m3.1b"><mi id="S1.F1.8.m3.1.1" xref="S1.F1.8.m3.1.1.cmml">I</mi><annotation-xml encoding="MathML-Content" id="S1.F1.8.m3.1c"><ci id="S1.F1.8.m3.1.1.cmml" xref="S1.F1.8.m3.1.1">𝐼</ci></annotation-xml><annotation encoding="application/x-tex" id="S1.F1.8.m3.1d">I</annotation><annotation encoding="application/x-llamapun" id="S1.F1.8.m3.1e">italic_I</annotation></semantics></math>, where the <span class="ltx_text" id="S1.F1.16.3" style="color:#FF0000;">red</span> indicate the instruction variables <math alttext="var" class="ltx_Math" display="inline" id="S1.F1.9.m4.1"><semantics id="S1.F1.9.m4.1b"><mrow id="S1.F1.9.m4.1.1" xref="S1.F1.9.m4.1.1.cmml"><mi id="S1.F1.9.m4.1.1.2" xref="S1.F1.9.m4.1.1.2.cmml">v</mi><mo id="S1.F1.9.m4.1.1.1" 
xref="S1.F1.9.m4.1.1.1.cmml"></mo><mi id="S1.F1.9.m4.1.1.3" xref="S1.F1.9.m4.1.1.3.cmml">a</mi><mo id="S1.F1.9.m4.1.1.1b" xref="S1.F1.9.m4.1.1.1.cmml"></mo><mi id="S1.F1.9.m4.1.1.4" xref="S1.F1.9.m4.1.1.4.cmml">r</mi></mrow><annotation-xml encoding="MathML-Content" id="S1.F1.9.m4.1c"><apply id="S1.F1.9.m4.1.1.cmml" xref="S1.F1.9.m4.1.1"><times id="S1.F1.9.m4.1.1.1.cmml" xref="S1.F1.9.m4.1.1.1"></times><ci id="S1.F1.9.m4.1.1.2.cmml" xref="S1.F1.9.m4.1.1.2">𝑣</ci><ci id="S1.F1.9.m4.1.1.3.cmml" xref="S1.F1.9.m4.1.1.3">𝑎</ci><ci id="S1.F1.9.m4.1.1.4.cmml" xref="S1.F1.9.m4.1.1.4">𝑟</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S1.F1.9.m4.1d">var</annotation><annotation encoding="application/x-llamapun" id="S1.F1.9.m4.1e">italic_v italic_a italic_r</annotation></semantics></math>, and the remaining black parts correspond to the instruction template <math alttext="tpl" class="ltx_Math" display="inline" id="S1.F1.10.m5.1"><semantics id="S1.F1.10.m5.1b"><mrow id="S1.F1.10.m5.1.1" xref="S1.F1.10.m5.1.1.cmml"><mi id="S1.F1.10.m5.1.1.2" xref="S1.F1.10.m5.1.1.2.cmml">t</mi><mo id="S1.F1.10.m5.1.1.1" xref="S1.F1.10.m5.1.1.1.cmml"></mo><mi id="S1.F1.10.m5.1.1.3" xref="S1.F1.10.m5.1.1.3.cmml">p</mi><mo id="S1.F1.10.m5.1.1.1b" xref="S1.F1.10.m5.1.1.1.cmml"></mo><mi id="S1.F1.10.m5.1.1.4" xref="S1.F1.10.m5.1.1.4.cmml">l</mi></mrow><annotation-xml encoding="MathML-Content" id="S1.F1.10.m5.1c"><apply id="S1.F1.10.m5.1.1.cmml" xref="S1.F1.10.m5.1.1"><times id="S1.F1.10.m5.1.1.1.cmml" xref="S1.F1.10.m5.1.1.1"></times><ci id="S1.F1.10.m5.1.1.2.cmml" xref="S1.F1.10.m5.1.1.2">𝑡</ci><ci id="S1.F1.10.m5.1.1.3.cmml" xref="S1.F1.10.m5.1.1.3">𝑝</ci><ci id="S1.F1.10.m5.1.1.4.cmml" xref="S1.F1.10.m5.1.1.4">𝑙</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S1.F1.10.m5.1d">tpl</annotation><annotation encoding="application/x-llamapun" id="S1.F1.10.m5.1e">italic_t italic_p italic_l</annotation></semantics></math>.</figcaption> </figure> <div 
class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">As Large Language Models (LLMs) continue to expand their impact across practical applications <cite class="ltx_cite ltx_citemacro_cite">Achiam et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib1" title="">2023</a>); Chowdhery et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib9" title="">2023</a>); Brown (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib5" title="">2020</a>)</cite>, their performance in natural language processing (NLP) tasks has reached new heights, spanning text generation <cite class="ltx_cite ltx_citemacro_cite">Que et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib41" title="">2024</a>); Tan et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib44" title="">2024</a>); Zhang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib51" title="">2024b</a>)</cite>, complex reasoning <cite class="ltx_cite ltx_citemacro_cite">Parmar et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib37" title="">2024</a>)</cite>, and problem-solving <cite class="ltx_cite ltx_citemacro_cite">Lu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib31" title="">2023</a>); Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib26" title="">2024a</a>)</cite>. In real-world production scenarios, the ability to follow instructions stably over long-context inputs is crucial, as it enables LLMs to better meet user needs and accurately complete more complex and diverse tasks.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">These needs have driven the creation of a number of benchmarks that aim to accurately assess LLMs’ capabilities. On the one hand, <cite class="ltx_cite ltx_citemacro_cite">Shaham et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib42" title="">2023</a>); An et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib2" title="">2023</a>); Dong et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib12" title="">2024</a>)</cite> collated traditional NLP datasets, forming comprehensive benchmarks containing long data. Subsequently, considering the power of LLMs on open-ended tasks, <cite class="ltx_cite ltx_citemacro_cite">Bai et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib4" title="">2023</a>); Zhang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib52" title="">2024c</a>); Hsieh et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib18" title="">2024</a>)</cite> further introduced synthetic tasks to better assess understanding and problem-solving capabilities in long contexts. On the other hand, some studies have gone beyond the limitations of specific downstream tasks, crafting more elaborate and systematic instruction datasets that aim to assess a model’s ability to follow complex instructions <cite class="ltx_cite ltx_citemacro_cite">Zhou et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib54" title="">2023a</a>); He et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib16" title="">2024</a>); Qin et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib40" title="">2024b</a>)</cite> or satisfy diverse constraints <cite class="ltx_cite ltx_citemacro_cite">Jiang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib20" title="">2023</a>); Wen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib48" title="">2024</a>)</cite>. However, two issues remain unresolved: (1) What is the instruction-following capability of LLMs in long-context scenarios? 
(2) How can we evaluate the impact of input factors (e.g., the expression of the instruction, the variables in the input) on LLMs’ stability?</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">To address these issues, we introduce the <span class="ltx_text ltx_font_bold" id="S1.p3.1.1">L</span>ong-context <span class="ltx_text ltx_font_bold" id="S1.p3.1.2">I</span>nstruction <span class="ltx_text ltx_font_bold" id="S1.p3.1.3">F</span>ollowing <span class="ltx_text ltx_font_bold" id="S1.p3.1.4">Bench</span>mark (LIFBench), a scalable benchmark for evaluating LLM instruction-following capability and stability in long-context scenarios. The framework of our benchmark is shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S1.F1" title="Figure 1 ‣ 1 Introduction ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">1</span></a>. Considering realistic applications of LLMs, we construct three long-context scenarios based on the granularity of the information to be processed. On this basis, we design eleven fine-grained tasks that reflect various dimensions of instruction-following capability. We manually crafted templates for all tasks and introduced an automated instruction expansion method along three dimensions (length, expression, and variables), resulting in a total of 2,766 instructions. The dataset’s scalability allows our benchmark to grow extensively and support testing with arbitrarily long context lengths.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">For evaluation, traditional metrics for downstream tasks are not applicable to complex instruction-following scenarios <cite class="ltx_cite ltx_citemacro_cite">Honovich et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib17" title="">2023</a>)</cite>. 
Moreover, many studies rely on GPT-4 for automated, open-ended assessment, but this approach encounters limitations due to notable gaps between GPT-4 and human experts <cite class="ltx_cite ltx_citemacro_cite">Qin et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib40" title="">2024b</a>); Jiang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib20" title="">2023</a>)</cite>, as well as potential bias problems <cite class="ltx_cite ltx_citemacro_cite">Wang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib47" title="">2023</a>)</cite>. To address these challenges, we propose LIFEval, a framework for evaluating complex LLM responses. Through rubric-based scoring and an automated assessment process, LIFEval enables precise, detailed distinctions in response quality without relying on LLMs or human evaluators, providing insights into models’ fundamental capabilities and stability from various perspectives.</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.1">Overall, our contributions are as follows:</p> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">We introduce LIFBench, a benchmark designed to evaluate instruction-following capabilities in long-context scenarios, containing 11 carefully designed tasks across three scenarios.</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1">We develop methods for dataset expansion across three perspectives, enabling scalable data generation with up to 2,766 instructions, adaptable to arbitrary context lengths.</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag 
ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1">We propose LIFEval, an evaluation framework for accurately, rapidly, and comprehensively assessing the quality and stability of LLMs’ complex responses.</p> </div> </li> <li class="ltx_item" id="S1.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i4.p1"> <p class="ltx_p" id="S1.I1.i4.p1.1">We conduct extensive experiments across six length intervals, evaluating and analyzing the instruction-following capabilities and stability of 20 popular LLMs, including both open-source and closed-source models.</p> </div> </li> </ul> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2 </span>Related Work</h2> <div class="ltx_para" id="S2.p1"> <p class="ltx_p" id="S2.p1.1">Several studies have focused on the performance of LLMs in long contexts. These studies collect data from traditional NLP tasks (e.g., summarization, QA, writing) to form comprehensive datasets containing long data <cite class="ltx_cite ltx_citemacro_cite">Shaham et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib43" title="">2022</a>, <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib42" title="">2023</a>); Dong et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib12" title="">2024</a>); Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib27" title="">2024b</a>)</cite>. Additionally, due to the excellent performance of LLMs on open-ended tasks, some benchmarks also include synthetic tasks to better probe the model’s ability in math, reasoning, and logic <cite class="ltx_cite ltx_citemacro_cite">Bai et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib4" title="">2023</a>); Zhang et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib52" title="">2024c</a>); Kwan et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib22" title="">2023</a>); Wang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib46" title="">2024</a>); Hsieh et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib18" title="">2024</a>)</cite>. For evaluation, most works adopt standard metrics (e.g., accuracy, ROUGE <cite class="ltx_cite ltx_citemacro_cite">Lin (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib28" title="">2004</a>)</cite>, BLEU <cite class="ltx_cite ltx_citemacro_cite">Papineni et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib36" title="">2002</a>)</cite>), which can be computed automatically. However, these metrics are difficult to apply directly to some open-ended tasks, so powerful LLMs, such as GPT-4, are commonly used as evaluators. For example, <cite class="ltx_cite ltx_citemacro_cite">An et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib2" title="">2023</a>); Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib25" title="">2023</a>)</cite> present the model predictions and the ground-truth answers to a GPT-4 evaluator, which is tasked with direct scoring or comparison to evaluate the baseline’s performance on some of the tasks. In contrast to LIFBench, these benchmarks only assess problem-solving capabilities in long-context scenarios, overlooking the challenges of instruction following that arise in real-world applications.</p> </div> <div class="ltx_para" id="S2.p2"> <p class="ltx_p" id="S2.p2.1">Besides, numerous efforts have been made to enhance the evaluation and benchmarking of LLMs’ instruction-following ability. <cite class="ltx_cite ltx_citemacro_cite">Jiang et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib20" title="">2023</a>); Jing et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib21" title="">2023</a>); Zhou et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib54" title="">2023a</a>); Qin et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib39" title="">2024a</a>)</cite> focus on capturing real-world constraints to construct comprehensive evaluation frameworks for LLMs, aiming to assess their understanding and execution of complex instructions. <cite class="ltx_cite ltx_citemacro_cite">Chen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib8" title="">2024b</a>)</cite> explore the robustness of LLMs in following multiple instructions. Given the complexities of evaluating instruction-following abilities, several studies have introduced meaningful innovations in their evaluation methodologies to better capture this multifaceted challenge. <cite class="ltx_cite ltx_citemacro_cite">Cook et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib11" title="">2024</a>); Wen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib48" title="">2024</a>); Qin et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib40" title="">2024b</a>); Zhang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib50" title="">2024a</a>)</cite> decompose instructions into checklists composed of YES/NO questions or PASS/FAIL criteria that answers should satisfy. CELLO <cite class="ltx_cite ltx_citemacro_cite">He et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib16" title="">2024</a>)</cite> defined four criteria covering common errors made by LLMs, and developed corresponding automated evaluation metrics to reflect the ability of models in complex instruction-following scenarios. 
However, the aforementioned studies do not address the performance of models in following instructions within long-context scenarios, nor do they consider the stability of model behavior under input variations during the instruction-following process.</p> </div> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3 </span>LIFBench</h2> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1 </span>Problem Definition</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.4">Based on the examples in Table <span class="ltx_ref ltx_missing_label ltx_ref_self">LABEL:tab:allprompts</span>, we model the instruction-following task in long-context scenarios as follows: Given a prompt consisting of a scenario description (<math alttext="D" class="ltx_Math" display="inline" id="S3.SS1.p1.1.m1.1"><semantics id="S3.SS1.p1.1.m1.1a"><mi id="S3.SS1.p1.1.m1.1.1" xref="S3.SS1.p1.1.m1.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.1.m1.1b"><ci id="S3.SS1.p1.1.m1.1.1.cmml" xref="S3.SS1.p1.1.m1.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.1.m1.1c">D</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.1.m1.1d">italic_D</annotation></semantics></math>), input (<math alttext="X" class="ltx_Math" display="inline" id="S3.SS1.p1.2.m2.1"><semantics id="S3.SS1.p1.2.m2.1a"><mi id="S3.SS1.p1.2.m2.1.1" xref="S3.SS1.p1.2.m2.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.2.m2.1b"><ci id="S3.SS1.p1.2.m2.1.1.cmml" xref="S3.SS1.p1.2.m2.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.2.m2.1c">X</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.2.m2.1d">italic_X</annotation></semantics></math>), and instruction (<math alttext="I" class="ltx_Math" display="inline" id="S3.SS1.p1.3.m3.1"><semantics 
id="S3.SS1.p1.3.m3.1a"><mi id="S3.SS1.p1.3.m3.1.1" xref="S3.SS1.p1.3.m3.1.1.cmml">I</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.3.m3.1b"><ci id="S3.SS1.p1.3.m3.1.1.cmml" xref="S3.SS1.p1.3.m3.1.1">𝐼</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.3.m3.1c">I</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.3.m3.1d">italic_I</annotation></semantics></math>), the model is expected to output an answer (<math alttext="A" class="ltx_Math" display="inline" id="S3.SS1.p1.4.m4.1"><semantics id="S3.SS1.p1.4.m4.1a"><mi id="S3.SS1.p1.4.m4.1.1" xref="S3.SS1.p1.4.m4.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.4.m4.1b"><ci id="S3.SS1.p1.4.m4.1.1.cmml" xref="S3.SS1.p1.4.m4.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.4.m4.1c">A</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.4.m4.1d">italic_A</annotation></semantics></math>). This process can be represented as:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="(D,X_{len},I_{tpl,var})\to A" class="ltx_Math" display="block" id="S3.E1.m1.5"><semantics id="S3.E1.m1.5a"><mrow id="S3.E1.m1.5.5" xref="S3.E1.m1.5.5.cmml"><mrow id="S3.E1.m1.5.5.2.2" xref="S3.E1.m1.5.5.2.3.cmml"><mo id="S3.E1.m1.5.5.2.2.3" stretchy="false" xref="S3.E1.m1.5.5.2.3.cmml">(</mo><mi id="S3.E1.m1.3.3" xref="S3.E1.m1.3.3.cmml">D</mi><mo id="S3.E1.m1.5.5.2.2.4" xref="S3.E1.m1.5.5.2.3.cmml">,</mo><msub id="S3.E1.m1.4.4.1.1.1" xref="S3.E1.m1.4.4.1.1.1.cmml"><mi id="S3.E1.m1.4.4.1.1.1.2" xref="S3.E1.m1.4.4.1.1.1.2.cmml">X</mi><mrow id="S3.E1.m1.4.4.1.1.1.3" xref="S3.E1.m1.4.4.1.1.1.3.cmml"><mi id="S3.E1.m1.4.4.1.1.1.3.2" xref="S3.E1.m1.4.4.1.1.1.3.2.cmml">l</mi><mo id="S3.E1.m1.4.4.1.1.1.3.1" xref="S3.E1.m1.4.4.1.1.1.3.1.cmml"></mo><mi 
id="S3.E1.m1.4.4.1.1.1.3.3" xref="S3.E1.m1.4.4.1.1.1.3.3.cmml">e</mi><mo id="S3.E1.m1.4.4.1.1.1.3.1a" xref="S3.E1.m1.4.4.1.1.1.3.1.cmml"></mo><mi id="S3.E1.m1.4.4.1.1.1.3.4" xref="S3.E1.m1.4.4.1.1.1.3.4.cmml">n</mi></mrow></msub><mo id="S3.E1.m1.5.5.2.2.5" xref="S3.E1.m1.5.5.2.3.cmml">,</mo><msub id="S3.E1.m1.5.5.2.2.2" xref="S3.E1.m1.5.5.2.2.2.cmml"><mi id="S3.E1.m1.5.5.2.2.2.2" xref="S3.E1.m1.5.5.2.2.2.2.cmml">I</mi><mrow id="S3.E1.m1.2.2.2.2" xref="S3.E1.m1.2.2.2.3.cmml"><mrow id="S3.E1.m1.1.1.1.1.1" xref="S3.E1.m1.1.1.1.1.1.cmml"><mi id="S3.E1.m1.1.1.1.1.1.2" xref="S3.E1.m1.1.1.1.1.1.2.cmml">t</mi><mo id="S3.E1.m1.1.1.1.1.1.1" xref="S3.E1.m1.1.1.1.1.1.1.cmml"></mo><mi id="S3.E1.m1.1.1.1.1.1.3" xref="S3.E1.m1.1.1.1.1.1.3.cmml">p</mi><mo id="S3.E1.m1.1.1.1.1.1.1a" xref="S3.E1.m1.1.1.1.1.1.1.cmml"></mo><mi id="S3.E1.m1.1.1.1.1.1.4" xref="S3.E1.m1.1.1.1.1.1.4.cmml">l</mi></mrow><mo id="S3.E1.m1.2.2.2.2.3" xref="S3.E1.m1.2.2.2.3.cmml">,</mo><mrow id="S3.E1.m1.2.2.2.2.2" xref="S3.E1.m1.2.2.2.2.2.cmml"><mi id="S3.E1.m1.2.2.2.2.2.2" xref="S3.E1.m1.2.2.2.2.2.2.cmml">v</mi><mo id="S3.E1.m1.2.2.2.2.2.1" xref="S3.E1.m1.2.2.2.2.2.1.cmml"></mo><mi id="S3.E1.m1.2.2.2.2.2.3" xref="S3.E1.m1.2.2.2.2.2.3.cmml">a</mi><mo id="S3.E1.m1.2.2.2.2.2.1a" xref="S3.E1.m1.2.2.2.2.2.1.cmml"></mo><mi id="S3.E1.m1.2.2.2.2.2.4" xref="S3.E1.m1.2.2.2.2.2.4.cmml">r</mi></mrow></mrow></msub><mo id="S3.E1.m1.5.5.2.2.6" stretchy="false" xref="S3.E1.m1.5.5.2.3.cmml">)</mo></mrow><mo id="S3.E1.m1.5.5.3" stretchy="false" xref="S3.E1.m1.5.5.3.cmml">→</mo><mi id="S3.E1.m1.5.5.4" xref="S3.E1.m1.5.5.4.cmml">A</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.E1.m1.5b"><apply id="S3.E1.m1.5.5.cmml" xref="S3.E1.m1.5.5"><ci id="S3.E1.m1.5.5.3.cmml" xref="S3.E1.m1.5.5.3">→</ci><vector id="S3.E1.m1.5.5.2.3.cmml" xref="S3.E1.m1.5.5.2.2"><ci id="S3.E1.m1.3.3.cmml" xref="S3.E1.m1.3.3">𝐷</ci><apply id="S3.E1.m1.4.4.1.1.1.cmml" xref="S3.E1.m1.4.4.1.1.1"><csymbol cd="ambiguous" id="S3.E1.m1.4.4.1.1.1.1.cmml" 
xref="S3.E1.m1.4.4.1.1.1">subscript</csymbol><ci id="S3.E1.m1.4.4.1.1.1.2.cmml" xref="S3.E1.m1.4.4.1.1.1.2">𝑋</ci><apply id="S3.E1.m1.4.4.1.1.1.3.cmml" xref="S3.E1.m1.4.4.1.1.1.3"><times id="S3.E1.m1.4.4.1.1.1.3.1.cmml" xref="S3.E1.m1.4.4.1.1.1.3.1"></times><ci id="S3.E1.m1.4.4.1.1.1.3.2.cmml" xref="S3.E1.m1.4.4.1.1.1.3.2">𝑙</ci><ci id="S3.E1.m1.4.4.1.1.1.3.3.cmml" xref="S3.E1.m1.4.4.1.1.1.3.3">𝑒</ci><ci id="S3.E1.m1.4.4.1.1.1.3.4.cmml" xref="S3.E1.m1.4.4.1.1.1.3.4">𝑛</ci></apply></apply><apply id="S3.E1.m1.5.5.2.2.2.cmml" xref="S3.E1.m1.5.5.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.5.5.2.2.2.1.cmml" xref="S3.E1.m1.5.5.2.2.2">subscript</csymbol><ci id="S3.E1.m1.5.5.2.2.2.2.cmml" xref="S3.E1.m1.5.5.2.2.2.2">𝐼</ci><list id="S3.E1.m1.2.2.2.3.cmml" xref="S3.E1.m1.2.2.2.2"><apply id="S3.E1.m1.1.1.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1"><times id="S3.E1.m1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1"></times><ci id="S3.E1.m1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.1.1.1.1.1.2">𝑡</ci><ci id="S3.E1.m1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.1.1.1.1.1.3">𝑝</ci><ci id="S3.E1.m1.1.1.1.1.1.4.cmml" xref="S3.E1.m1.1.1.1.1.1.4">𝑙</ci></apply><apply id="S3.E1.m1.2.2.2.2.2.cmml" xref="S3.E1.m1.2.2.2.2.2"><times id="S3.E1.m1.2.2.2.2.2.1.cmml" xref="S3.E1.m1.2.2.2.2.2.1"></times><ci id="S3.E1.m1.2.2.2.2.2.2.cmml" xref="S3.E1.m1.2.2.2.2.2.2">𝑣</ci><ci id="S3.E1.m1.2.2.2.2.2.3.cmml" xref="S3.E1.m1.2.2.2.2.2.3">𝑎</ci><ci id="S3.E1.m1.2.2.2.2.2.4.cmml" xref="S3.E1.m1.2.2.2.2.2.4">𝑟</ci></apply></list></apply></vector><ci id="S3.E1.m1.5.5.4.cmml" xref="S3.E1.m1.5.5.4">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E1.m1.5c">(D,X_{len},I_{tpl,var})\to A</annotation><annotation encoding="application/x-llamapun" id="S3.E1.m1.5d">( italic_D , italic_X start_POSTSUBSCRIPT italic_l italic_e italic_n end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_t italic_p italic_l , italic_v italic_a italic_r end_POSTSUBSCRIPT ) → italic_A</annotation></semantics></math></td> <td 
class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS1.p1.17">In this setup, the scenario description <math alttext="D" class="ltx_Math" display="inline" id="S3.SS1.p1.5.m1.1"><semantics id="S3.SS1.p1.5.m1.1a"><mi id="S3.SS1.p1.5.m1.1.1" xref="S3.SS1.p1.5.m1.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.5.m1.1b"><ci id="S3.SS1.p1.5.m1.1.1.cmml" xref="S3.SS1.p1.5.m1.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.5.m1.1c">D</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.5.m1.1d">italic_D</annotation></semantics></math> provides context at the beginning of the prompt, outlining the task background. All tasks within a scenario share the same <math alttext="D" class="ltx_Math" display="inline" id="S3.SS1.p1.6.m2.1"><semantics id="S3.SS1.p1.6.m2.1a"><mi id="S3.SS1.p1.6.m2.1.1" xref="S3.SS1.p1.6.m2.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.6.m2.1b"><ci id="S3.SS1.p1.6.m2.1.1.cmml" xref="S3.SS1.p1.6.m2.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.6.m2.1c">D</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.6.m2.1d">italic_D</annotation></semantics></math>. 
The input <math alttext="X" class="ltx_Math" display="inline" id="S3.SS1.p1.7.m3.1"><semantics id="S3.SS1.p1.7.m3.1a"><mi id="S3.SS1.p1.7.m3.1.1" xref="S3.SS1.p1.7.m3.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.7.m3.1b"><ci id="S3.SS1.p1.7.m3.1.1.cmml" xref="S3.SS1.p1.7.m3.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.7.m3.1c">X</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.7.m3.1d">italic_X</annotation></semantics></math>, as the main body of the prompt, provides essential information and varies by scenario. For example, in the List scenario, <math alttext="X" class="ltx_Math" display="inline" id="S3.SS1.p1.8.m4.1"><semantics id="S3.SS1.p1.8.m4.1a"><mi id="S3.SS1.p1.8.m4.1.1" xref="S3.SS1.p1.8.m4.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.8.m4.1b"><ci id="S3.SS1.p1.8.m4.1.1.cmml" xref="S3.SS1.p1.8.m4.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.8.m4.1c">X</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.8.m4.1d">italic_X</annotation></semantics></math> is a long list, while in the OneDoc scenario, <math alttext="X" class="ltx_Math" display="inline" id="S3.SS1.p1.9.m5.1"><semantics id="S3.SS1.p1.9.m5.1a"><mi id="S3.SS1.p1.9.m5.1.1" xref="S3.SS1.p1.9.m5.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.9.m5.1b"><ci id="S3.SS1.p1.9.m5.1.1.cmml" xref="S3.SS1.p1.9.m5.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.9.m5.1c">X</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.9.m5.1d">italic_X</annotation></semantics></math> represents a lengthy processed document. 
The parameter <math alttext="len" class="ltx_Math" display="inline" id="S3.SS1.p1.10.m6.1"><semantics id="S3.SS1.p1.10.m6.1a"><mrow id="S3.SS1.p1.10.m6.1.1" xref="S3.SS1.p1.10.m6.1.1.cmml"><mi id="S3.SS1.p1.10.m6.1.1.2" xref="S3.SS1.p1.10.m6.1.1.2.cmml">l</mi><mo id="S3.SS1.p1.10.m6.1.1.1" xref="S3.SS1.p1.10.m6.1.1.1.cmml"></mo><mi id="S3.SS1.p1.10.m6.1.1.3" xref="S3.SS1.p1.10.m6.1.1.3.cmml">e</mi><mo id="S3.SS1.p1.10.m6.1.1.1a" xref="S3.SS1.p1.10.m6.1.1.1.cmml"></mo><mi id="S3.SS1.p1.10.m6.1.1.4" xref="S3.SS1.p1.10.m6.1.1.4.cmml">n</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.10.m6.1b"><apply id="S3.SS1.p1.10.m6.1.1.cmml" xref="S3.SS1.p1.10.m6.1.1"><times id="S3.SS1.p1.10.m6.1.1.1.cmml" xref="S3.SS1.p1.10.m6.1.1.1"></times><ci id="S3.SS1.p1.10.m6.1.1.2.cmml" xref="S3.SS1.p1.10.m6.1.1.2">𝑙</ci><ci id="S3.SS1.p1.10.m6.1.1.3.cmml" xref="S3.SS1.p1.10.m6.1.1.3">𝑒</ci><ci id="S3.SS1.p1.10.m6.1.1.4.cmml" xref="S3.SS1.p1.10.m6.1.1.4">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.10.m6.1c">len</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.10.m6.1d">italic_l italic_e italic_n</annotation></semantics></math> represents the number of tokens in <math alttext="X" class="ltx_Math" display="inline" id="S3.SS1.p1.11.m7.1"><semantics id="S3.SS1.p1.11.m7.1a"><mi id="S3.SS1.p1.11.m7.1.1" xref="S3.SS1.p1.11.m7.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.11.m7.1b"><ci id="S3.SS1.p1.11.m7.1.1.cmml" xref="S3.SS1.p1.11.m7.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.11.m7.1c">X</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.11.m7.1d">italic_X</annotation></semantics></math>. 
The instruction <math alttext="I" class="ltx_Math" display="inline" id="S3.SS1.p1.12.m8.1"><semantics id="S3.SS1.p1.12.m8.1a"><mi id="S3.SS1.p1.12.m8.1.1" xref="S3.SS1.p1.12.m8.1.1.cmml">I</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.12.m8.1b"><ci id="S3.SS1.p1.12.m8.1.1.cmml" xref="S3.SS1.p1.12.m8.1.1">𝐼</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.12.m8.1c">I</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.12.m8.1d">italic_I</annotation></semantics></math>, typically placed at the end of the prompt, specifies the requirements for model response. It consists of two components: (1) the instruction template (<math alttext="tpl" class="ltx_Math" display="inline" id="S3.SS1.p1.13.m9.1"><semantics id="S3.SS1.p1.13.m9.1a"><mrow id="S3.SS1.p1.13.m9.1.1" xref="S3.SS1.p1.13.m9.1.1.cmml"><mi id="S3.SS1.p1.13.m9.1.1.2" xref="S3.SS1.p1.13.m9.1.1.2.cmml">t</mi><mo id="S3.SS1.p1.13.m9.1.1.1" xref="S3.SS1.p1.13.m9.1.1.1.cmml"></mo><mi id="S3.SS1.p1.13.m9.1.1.3" xref="S3.SS1.p1.13.m9.1.1.3.cmml">p</mi><mo id="S3.SS1.p1.13.m9.1.1.1a" xref="S3.SS1.p1.13.m9.1.1.1.cmml"></mo><mi id="S3.SS1.p1.13.m9.1.1.4" xref="S3.SS1.p1.13.m9.1.1.4.cmml">l</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.13.m9.1b"><apply id="S3.SS1.p1.13.m9.1.1.cmml" xref="S3.SS1.p1.13.m9.1.1"><times id="S3.SS1.p1.13.m9.1.1.1.cmml" xref="S3.SS1.p1.13.m9.1.1.1"></times><ci id="S3.SS1.p1.13.m9.1.1.2.cmml" xref="S3.SS1.p1.13.m9.1.1.2">𝑡</ci><ci id="S3.SS1.p1.13.m9.1.1.3.cmml" xref="S3.SS1.p1.13.m9.1.1.3">𝑝</ci><ci id="S3.SS1.p1.13.m9.1.1.4.cmml" xref="S3.SS1.p1.13.m9.1.1.4">𝑙</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.13.m9.1c">tpl</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.13.m9.1d">italic_t italic_p italic_l</annotation></semantics></math>), which outlines the task requirements, and (2) the instruction variable (<math alttext="var" class="ltx_Math" 
display="inline" id="S3.SS1.p1.14.m10.1"><semantics id="S3.SS1.p1.14.m10.1a"><mrow id="S3.SS1.p1.14.m10.1.1" xref="S3.SS1.p1.14.m10.1.1.cmml"><mi id="S3.SS1.p1.14.m10.1.1.2" xref="S3.SS1.p1.14.m10.1.1.2.cmml">v</mi><mo id="S3.SS1.p1.14.m10.1.1.1" xref="S3.SS1.p1.14.m10.1.1.1.cmml"></mo><mi id="S3.SS1.p1.14.m10.1.1.3" xref="S3.SS1.p1.14.m10.1.1.3.cmml">a</mi><mo id="S3.SS1.p1.14.m10.1.1.1a" xref="S3.SS1.p1.14.m10.1.1.1.cmml"></mo><mi id="S3.SS1.p1.14.m10.1.1.4" xref="S3.SS1.p1.14.m10.1.1.4.cmml">r</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.14.m10.1b"><apply id="S3.SS1.p1.14.m10.1.1.cmml" xref="S3.SS1.p1.14.m10.1.1"><times id="S3.SS1.p1.14.m10.1.1.1.cmml" xref="S3.SS1.p1.14.m10.1.1.1"></times><ci id="S3.SS1.p1.14.m10.1.1.2.cmml" xref="S3.SS1.p1.14.m10.1.1.2">𝑣</ci><ci id="S3.SS1.p1.14.m10.1.1.3.cmml" xref="S3.SS1.p1.14.m10.1.1.3">𝑎</ci><ci id="S3.SS1.p1.14.m10.1.1.4.cmml" xref="S3.SS1.p1.14.m10.1.1.4">𝑟</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.14.m10.1c">var</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.14.m10.1d">italic_v italic_a italic_r</annotation></semantics></math>), representing the variable part of the template. 
Generally, in LIFBench, <math alttext="D" class="ltx_Math" display="inline" id="S3.SS1.p1.15.m11.1"><semantics id="S3.SS1.p1.15.m11.1a"><mi id="S3.SS1.p1.15.m11.1.1" xref="S3.SS1.p1.15.m11.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.15.m11.1b"><ci id="S3.SS1.p1.15.m11.1.1.cmml" xref="S3.SS1.p1.15.m11.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.15.m11.1c">D</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.15.m11.1d">italic_D</annotation></semantics></math> and <math alttext="I" class="ltx_Math" display="inline" id="S3.SS1.p1.16.m12.1"><semantics id="S3.SS1.p1.16.m12.1a"><mi id="S3.SS1.p1.16.m12.1.1" xref="S3.SS1.p1.16.m12.1.1.cmml">I</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.16.m12.1b"><ci id="S3.SS1.p1.16.m12.1.1.cmml" xref="S3.SS1.p1.16.m12.1.1">𝐼</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.16.m12.1c">I</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.16.m12.1d">italic_I</annotation></semantics></math> tend to be short, while <math alttext="X" class="ltx_Math" display="inline" id="S3.SS1.p1.17.m13.1"><semantics id="S3.SS1.p1.17.m13.1a"><mi id="S3.SS1.p1.17.m13.1.1" xref="S3.SS1.p1.17.m13.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.17.m13.1b"><ci id="S3.SS1.p1.17.m13.1.1.cmml" xref="S3.SS1.p1.17.m13.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.17.m13.1c">X</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.17.m13.1d">italic_X</annotation></semantics></math> is a long text of thousands of tokens.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2 </span>Data Collection</h3> <figure class="ltx_table" id="S3.T2"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S3.T2.1" 
style="width:433.6pt;height:454.8pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(105.3pt,-110.4pt) scale(1.94351256186382,1.94351256186382) ;"> <table class="ltx_tabular ltx_align_middle" id="S3.T2.1.1"> <tr class="ltx_tr" id="S3.T2.1.1.1"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.1.1"> <span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span><span class="ltx_text" id="S3.T2.1.1.1.1.1" style="background-color:#E6E6E6;"> <span class="ltx_text ltx_font_bold" id="S3.T2.1.1.1.1.1.1">Scenario</span></span> </td> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.1.2"><span class="ltx_text ltx_font_bold" id="S3.T2.1.1.1.2.1" style="background-color:#E6E6E6;">Task</span></td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.1.3"><span class="ltx_text ltx_font_bold" id="S3.T2.1.1.1.3.1" style="background-color:#E6E6E6;">ID</span></td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.1.4"><span class="ltx_text ltx_font_bold" id="S3.T2.1.1.1.4.1" style="background-color:#E6E6E6;">#Samples</span></td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.2"> <td class="ltx_td ltx_align_left ltx_border_t" id="S3.T2.1.1.2.1" rowspan="6"><span class="ltx_text" id="S3.T2.1.1.2.1.1">List</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="S3.T2.1.1.2.2">Single-ID</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.1.2.3">LSI</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.1.2.4">180</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.3"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.3.1">Multi-ID</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.3.2">LMI</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.3.3">150</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.4"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.4.1">Offset-ID</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.4.2">LOI</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.4.3">396</td> </tr> 
<tr class="ltx_tr" id="S3.T2.1.1.5"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.5.1">Offset-Element</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.5.2">LOE</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.5.3">432</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.6"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.6.1">Blur-ID</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.6.2">LBI</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.6.3">396</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.7"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.7.1">Blur-Element</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.7.2">LBE</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.7.3">432</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.8"> <td class="ltx_td ltx_align_left ltx_border_t" id="S3.T2.1.1.8.1" rowspan="2"><span class="ltx_text" id="S3.T2.1.1.8.1.1">MultiDoc</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="S3.T2.1.1.8.2">Batch-label</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.1.8.3">MB</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.1.8.4">150</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.9"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.9.1">Find-dup-doc</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.9.2">MF</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.9.3">150</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.10"> <td class="ltx_td ltx_align_left ltx_border_t" id="S3.T2.1.1.10.1" rowspan="3"><span class="ltx_text" id="S3.T2.1.1.10.1.1">OneDoc</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="S3.T2.1.1.10.2">Repeat</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.1.10.3">OR</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.1.10.4">150</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.11"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.11.1">QA</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.11.2">OQ</td> <td class="ltx_td 
ltx_align_center" id="S3.T2.1.1.11.3">180</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.12"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.12.1">Extract</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.12.2">OE</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.1.12.3">150</td> </tr> <tr class="ltx_tr" id="S3.T2.1.1.13"> <td class="ltx_td ltx_align_left" id="S3.T2.1.1.13.1"><span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span></td> <td class="ltx_td" id="S3.T2.1.1.13.2"></td> <td class="ltx_td" id="S3.T2.1.1.13.3"></td> <td class="ltx_td" id="S3.T2.1.1.13.4"></td> </tr> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span>An overview of the dataset statistics in LIFBench. </figcaption> </figure> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">To accurately simulate real-world application scenarios for LLMs in long-text processing, this study constructs 3 scenarios and 11 corresponding tasks. These tasks replicate the instruction-following challenges that LLMs may encounter in long-text scenarios, enabling an evaluation of model performance in handling complex instructions.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p2.1.1">List</span> The List scenario is designed to assess the ability of LLMs to handle structured batches of short-item information, as seen in applications like retrieval, recommendation, and structured data processing. In this scenario, the input is a lengthy ordered list, and the LLM must identify elements that meet the specified criteria provided in the instructions.</p> </div> <div class="ltx_para" id="S3.SS2.p3"> <p class="ltx_p" id="S3.SS2.p3.1">The scenario consists of six tasks. 
The <span class="ltx_text ltx_font_italic" id="S3.SS2.p3.1.1">Single-ID</span> and <span class="ltx_text ltx_font_italic" id="S3.SS2.p3.1.2">Multi-ID</span> tasks are foundational retrieval tasks that prompt the model to locate specific elements within an ordered list based on provided IDs. Building on these basic tasks, the <span class="ltx_text ltx_font_italic" id="S3.SS2.p3.1.3">Offset-ID</span>, <span class="ltx_text ltx_font_italic" id="S3.SS2.p3.1.4">Offset-element</span>, <span class="ltx_text ltx_font_italic" id="S3.SS2.p3.1.5">Blur-ID</span>, and <span class="ltx_text ltx_font_italic" id="S3.SS2.p3.1.6">Blur-element</span> tasks introduce spatial constraints, adding complexity to the retrieval process. Here, "Offset" and "Blur" represent different types of spatial limitations: "Offset" specifies precise indices, evaluating the model’s fine-grained spatial awareness, while "Blur" provides a broader spatial range, allowing the LLM to exercise some autonomy. The terms "ID" and "Element" refer to the type of reference in the retrieval process, representing either the position number in the ordered list or the list element itself.</p> </div> <div class="ltx_para" id="S3.SS2.p4"> <p class="ltx_p" id="S3.SS2.p4.1">To enhance the realism of the List scenario, we constructed an ordered list combining randomly generated UUIDs with natural language instruction texts from the Alpaca-52k <cite class="ltx_cite ltx_citemacro_cite">Taori et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib45" title="">2023</a>)</cite> dataset. This design introduces semantic noise, challenging the model’s ability to focus on the target instructions. 
For additional details, please refer to Appendix <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A1.SS1" title="A.1 List ‣ Appendix A Details in data collection ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">A.1</span></a>.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS2.p5"> <p class="ltx_p" id="S3.SS2.p5.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p5.1.1">MultiDoc</span> The MultiDoc scenario focuses on tasks where LLMs must process and understand coarser-grained information, such as multi-document summarization, document clustering, and retrieval-augmented generation (RAG). Unlike the List scenario, MultiDoc tasks require models to compare multiple documents, identify differences, and execute batch operations, placing high demands on the models’ ability to follow instructions accurately.</p> </div> <div class="ltx_para" id="S3.SS2.p6"> <p class="ltx_p" id="S3.SS2.p6.1">To construct inputs <math alttext="X" class="ltx_Math" display="inline" id="S3.SS2.p6.1.m1.1"><semantics id="S3.SS2.p6.1.m1.1a"><mi id="S3.SS2.p6.1.m1.1.1" xref="S3.SS2.p6.1.m1.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p6.1.m1.1b"><ci id="S3.SS2.p6.1.m1.1.1.cmml" xref="S3.SS2.p6.1.m1.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p6.1.m1.1c">X</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p6.1.m1.1d">italic_X</annotation></semantics></math> in this scenario, we selected documents from four diverse datasets: GovReport <cite class="ltx_cite ltx_citemacro_cite">Huang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib19" title="">2021</a>)</cite>, XSum <cite class="ltx_cite ltx_citemacro_cite">Narayan et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib33" title="">2018</a>)</cite>, QMSum <cite class="ltx_cite ltx_citemacro_cite">Zhong et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib53" title="">2021</a>)</cite>, and the Paul Graham Essays dataset<span class="ltx_note ltx_role_footnote" id="footnote1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_tag ltx_tag_note">1</span><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://huggingface.co/datasets/sgoel9/paul_graham_essays" title="">https://huggingface.co/datasets/sgoel9/paul_graham_essays</a></span></span></span>. These selected datasets span multiple domains and text forms, offering a high degree of diversity. For each dataset, we sampled documents and truncated the main text to between 300 and 500 tokens, yielding a multi-document collection for task inputs. Each document contains up to six fields (e.g., "text", "id", "iD2", "title", "date", "source"), some of which were manually annotated or modified to increase difficulty and ensure task relevance. Approximately 25% of the documents are duplicates, which adds complexity and leaves room for task design. Further details can be found in Appendix <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A1.SS2" title="A.2 MultiDoc ‣ Appendix A Details in data collection ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">A.2</span></a>.</p> </div> <div class="ltx_para" id="S3.SS2.p7"> <p class="ltx_p" id="S3.SS2.p7.1">We designed two tasks for this scenario. In <span class="ltx_text ltx_font_italic" id="S3.SS2.p7.1.1">Find-dup-doc</span>, LLMs are required to identify and output all duplicate documents in a specified format. 
In <span class="ltx_text ltx_font_italic" id="S3.SS2.p7.1.2">Batch-label</span>, models must read through the entire input and label each document based on specified attributes, which tests whether they follow instructions accurately and comprehensively.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS2.p8"> <p class="ltx_p" id="S3.SS2.p8.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p8.1.1">OneDoc</span> The OneDoc scenario simulates tasks that involve processing a single, very long document, as is typical of information extraction, question answering, and other conventional NLP tasks that require analyzing extensive text. LLMs in this scenario are generally tasked with identifying key information within the document and processing it as specified in the instructions.</p> </div> <div class="ltx_para" id="S3.SS2.p9"> <p class="ltx_p" id="S3.SS2.p9.1">To create the input for OneDoc, we synthesized an extra-long document by concatenating entries from the Paul Graham Essays dataset, following the approach of <cite class="ltx_cite ltx_citemacro_cite">Greg Kamradt (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib15" title="">2023</a>)</cite>. In this document, some sentences are randomly tagged as "key information." Each tag indicates the type (e.g., Topic, Evidence, Concession) and a unique identifier for each piece of key information, to help LLMs recognize and distinguish these elements.</p> </div> <div class="ltx_para" id="S3.SS2.p10"> <p class="ltx_p" id="S3.SS2.p10.1">Within this framework, we designed three distinct tasks based on the tagged information in the synthetic document. The <span class="ltx_text ltx_font_italic" id="S3.SS2.p10.1.1">Repeat</span> task requires the model to repeat a specified amount of key information. The <span class="ltx_text ltx_font_italic" id="S3.SS2.p10.1.2">Extract</span> task instructs the model to extract specific types of key information. 
The <span class="ltx_text ltx_font_italic" id="S3.SS2.p10.1.3">QA</span> task presents the model with a sentence and asks it to identify whether the sentence is labeled as key information.</p> </div> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.3 </span>Data Extension</h3> <div class="ltx_para" id="S3.SS3.p1"> <p class="ltx_p" id="S3.SS3.p1.1">We manually wrote an initial instruction template for each task in Section <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S3.SS2" title="3.2 Data Collection ‣ 3 LIFBench ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">3.2</span></a>. In this section, we build on these templates and expand them along three dimensions (length, expression, and variable) to form a sizeable test dataset.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.p2"> <p class="ltx_p" id="S3.SS3.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.p2.1.1">Length</span> Several studies <cite class="ltx_cite ltx_citemacro_cite">Bai et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib4" title="">2023</a>); Ni et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib34" title="">2024</a>); Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib27" title="">2024b</a>); Levy et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib24" title="">2024</a>)</cite> have found that the length of the input text affects the abilities of LLMs. 
In LIFBench, we therefore vary prompt length to study how context length affects the instruction-following capabilities of LLMs.</p> </div> <div class="ltx_para" id="S3.SS3.p3"> <p class="ltx_p" id="S3.SS3.p3.1">To control prompt length, we adjust the number of tokens in the input, which forms the main body of the prompt. Across the three scenarios (List, MultiDoc, and OneDoc), the approach is similar, with minor adjustments. In the List scenario, we select from a pre-constructed list of 140K+ elements (totaling over 2M tokens) to reach the desired length. In the MultiDoc scenario, we use a collection of documents with 400K+ tokens, adjusting length by the number of documents included. For OneDoc, Paul Graham’s essays are concatenated sequentially until the target length is reached, adding task-specific tags where needed.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.p4"> <p class="ltx_p" id="S3.SS3.p4.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.p4.1.1">Expression</span> In real-world contexts, cultural differences, regional practices, and discrepancies between spoken and written language lead people to describe the same subject in markedly different ways, which challenges the stability of LLMs in following instructions. To assess LLM robustness in this regard, we diversify instruction templates to create multiple expressions for each task with differing wording and syntax.</p> </div> <div class="ltx_para" id="S3.SS3.p5"> <p class="ltx_p" id="S3.SS3.p5.1">Our approach follows a four-step process: "Rewriting-Encoding-Clustering-Sampling." First, we use GPT-4 <cite class="ltx_cite ltx_citemacro_cite">Achiam et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib1" title="">2023</a>)</cite> and Claude <cite class="ltx_cite ltx_citemacro_cite">Anthropic (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib3" title="">2024</a>)</cite> to generate 40+ rewrites of each original instruction template. Next, the rewritten templates are encoded into vector representations for clustering, with the number of clusters set to the target number of rewritten instructions for each task. For instance, if a task requires five rewritten instructions, we create five clusters. Finally, we sample from each cluster, selecting the usable template closest to the cluster center as the final diversified expression.</p> </div> <div class="ltx_para" id="S3.SS3.p6"> <p class="ltx_p" id="S3.SS3.p6.1">To implement this, we chose the BGE-M3 text embedding model <cite class="ltx_cite ltx_citemacro_cite">Chen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib7" title="">2024a</a>)</cite> for encoding and applied K-means clustering <cite class="ltx_cite ltx_citemacro_cite">Macqueen (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib32" title="">1967</a>)</cite>. Additionally, since LLMs can introduce inaccuracies in rewrites due to misinterpretations, we manually filtered out unsuitable rewrites during the sampling phase. 
To validate the effectiveness of this method, we designed experiments, the details of which can be found in Appendix <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A2" title="Appendix B Effectiveness of Expression Extension ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">B</span></a>.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.p7"> <p class="ltx_p" id="S3.SS3.p7.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.p7.1.1">Variable</span> Some placeholders are preset in the instruction templates to indicate the variable parts of instructions, i.e., instruction variables. These variables encompass elements such as query keys (for retrieval tasks), categorization criteria (for classification tasks), and format requirements. By analyzing LLM performance across varying instruction variables, we can evaluate the model’s understanding and consistency in executing instructions. Ideally, models with strong instruction-following abilities execute instructions stably across variable conditions, while weaker models may exhibit inconsistency. Each instruction variable is manually assigned a sampling space and sampling rules; the sampled values are then filled into the designated placeholders within the templates, generating a diverse and comprehensive set of test instructions.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4 </span>LIFEval</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">Due to the high cost and inefficiency of manual evaluation, current benchmarks often use traditional metrics (e.g., ACC, EM) or advanced LLMs like GPT-4 for automated scoring. 
However, these metrics do not fully capture LLMs’ instruction-following abilities, and even the most advanced LLMs struggle with precision in evaluation. To address these gaps, we propose LIFEval, an efficient framework that provides accurate assessments and detailed insights into LLMs’ instruction-following performance and stability.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1 </span>Automated Rubric-based Scoring</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.4">To evaluate the output quality of LLMs on LIFBench tasks, we introduce Automated Rubric-based Scoring (ARS), an accurate and efficient programmatic evaluation method. As shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S1.F1" title="Figure 1 ‣ 1 Introduction ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">1</span></a>, we manually crafted scoring rubrics <math alttext="R" class="ltx_Math" display="inline" id="S4.SS1.p1.1.m1.1"><semantics id="S4.SS1.p1.1.m1.1a"><mi id="S4.SS1.p1.1.m1.1.1" xref="S4.SS1.p1.1.m1.1.1.cmml">R</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.1.m1.1b"><ci id="S4.SS1.p1.1.m1.1.1.cmml" xref="S4.SS1.p1.1.m1.1.1">𝑅</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.1.m1.1c">R</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.1.m1.1d">italic_R</annotation></semantics></math> for each task, each comprising several scoring points <math alttext="s" class="ltx_Math" display="inline" id="S4.SS1.p1.2.m2.1"><semantics id="S4.SS1.p1.2.m2.1a"><mi id="S4.SS1.p1.2.m2.1.1" xref="S4.SS1.p1.2.m2.1.1.cmml">s</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.2.m2.1b"><ci id="S4.SS1.p1.2.m2.1.1.cmml" xref="S4.SS1.p1.2.m2.1.1">𝑠</ci></annotation-xml><annotation 
encoding="application/x-tex" id="S4.SS1.p1.2.m2.1c">s</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.2.m2.1d">italic_s</annotation></semantics></math> with varying weights <math alttext="\tilde{s}" class="ltx_Math" display="inline" id="S4.SS1.p1.3.m3.1"><semantics id="S4.SS1.p1.3.m3.1a"><mover accent="true" id="S4.SS1.p1.3.m3.1.1" xref="S4.SS1.p1.3.m3.1.1.cmml"><mi id="S4.SS1.p1.3.m3.1.1.2" xref="S4.SS1.p1.3.m3.1.1.2.cmml">s</mi><mo id="S4.SS1.p1.3.m3.1.1.1" xref="S4.SS1.p1.3.m3.1.1.1.cmml">~</mo></mover><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.3.m3.1b"><apply id="S4.SS1.p1.3.m3.1.1.cmml" xref="S4.SS1.p1.3.m3.1.1"><ci id="S4.SS1.p1.3.m3.1.1.1.cmml" xref="S4.SS1.p1.3.m3.1.1.1">~</ci><ci id="S4.SS1.p1.3.m3.1.1.2.cmml" xref="S4.SS1.p1.3.m3.1.1.2">𝑠</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.3.m3.1c">\tilde{s}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.3.m3.1d">over~ start_ARG italic_s end_ARG</annotation></semantics></math>, reflecting their importance and complexity. All of these points can be assessed automatically through a program. 
The scoring process on task <math alttext="t" class="ltx_Math" display="inline" id="S4.SS1.p1.4.m4.1"><semantics id="S4.SS1.p1.4.m4.1a"><mi id="S4.SS1.p1.4.m4.1.1" xref="S4.SS1.p1.4.m4.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.4.m4.1b"><ci id="S4.SS1.p1.4.m4.1.1.cmml" xref="S4.SS1.p1.4.m4.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.4.m4.1c">t</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.4.m4.1d">italic_t</annotation></semantics></math> can be shown as follows:</p> <table class="ltx_equation ltx_eqn_table" id="S4.E2"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\begin{split}ARS_{t}(A_{t})=\frac{1}{\tilde{R}_{t}}\sum_{s\in R_{t}}f_{s}(A_{t% }),\tilde{R}_{t}=\sum_{s\in R_{t}}\tilde{s}\end{split}" class="ltx_Math" display="block" id="S4.E2.m1.29"><semantics id="S4.E2.m1.29a"><mtable displaystyle="true" id="S4.E2.m1.29.29.4"><mtr id="S4.E2.m1.29.29.4a"><mtd class="ltx_align_right" columnalign="right" id="S4.E2.m1.29.29.4b"><mrow id="S4.E2.m1.29.29.4.27.27.27.27"><mrow id="S4.E2.m1.28.28.3.26.26.26.26.1"><mrow id="S4.E2.m1.28.28.3.26.26.26.26.1.1"><mi id="S4.E2.m1.1.1.1.1.1.1" xref="S4.E2.m1.1.1.1.1.1.1.cmml">A</mi><mo id="S4.E2.m1.28.28.3.26.26.26.26.1.1.2" xref="S4.E2.m1.27.27.2.3.cmml"></mo><mi id="S4.E2.m1.2.2.2.2.2.2" xref="S4.E2.m1.2.2.2.2.2.2.cmml">R</mi><mo id="S4.E2.m1.28.28.3.26.26.26.26.1.1.2a" xref="S4.E2.m1.27.27.2.3.cmml"></mo><msub id="S4.E2.m1.28.28.3.26.26.26.26.1.1.3"><mi id="S4.E2.m1.3.3.3.3.3.3" xref="S4.E2.m1.3.3.3.3.3.3.cmml">S</mi><mi id="S4.E2.m1.4.4.4.4.4.4.1" xref="S4.E2.m1.4.4.4.4.4.4.1.cmml">t</mi></msub><mo id="S4.E2.m1.28.28.3.26.26.26.26.1.1.2b" xref="S4.E2.m1.27.27.2.3.cmml"></mo><mrow id="S4.E2.m1.28.28.3.26.26.26.26.1.1.1.1"><mo id="S4.E2.m1.5.5.5.5.5.5" stretchy="false" xref="S4.E2.m1.27.27.2.3.cmml">(</mo><msub 
id="S4.E2.m1.28.28.3.26.26.26.26.1.1.1.1.1"><mi id="S4.E2.m1.6.6.6.6.6.6" xref="S4.E2.m1.6.6.6.6.6.6.cmml">A</mi><mi id="S4.E2.m1.7.7.7.7.7.7.1" xref="S4.E2.m1.7.7.7.7.7.7.1.cmml">t</mi></msub><mo id="S4.E2.m1.8.8.8.8.8.8" stretchy="false" xref="S4.E2.m1.27.27.2.3.cmml">)</mo></mrow></mrow><mo id="S4.E2.m1.9.9.9.9.9.9" xref="S4.E2.m1.9.9.9.9.9.9.cmml">=</mo><mrow id="S4.E2.m1.28.28.3.26.26.26.26.1.2"><mfrac id="S4.E2.m1.10.10.10.10.10.10" xref="S4.E2.m1.10.10.10.10.10.10.cmml"><mn id="S4.E2.m1.10.10.10.10.10.10.2" xref="S4.E2.m1.10.10.10.10.10.10.2.cmml">1</mn><msub id="S4.E2.m1.10.10.10.10.10.10.3" xref="S4.E2.m1.10.10.10.10.10.10.3.cmml"><mover accent="true" id="S4.E2.m1.10.10.10.10.10.10.3.2" xref="S4.E2.m1.10.10.10.10.10.10.3.2.cmml"><mi id="S4.E2.m1.10.10.10.10.10.10.3.2.2" xref="S4.E2.m1.10.10.10.10.10.10.3.2.2.cmml">R</mi><mo id="S4.E2.m1.10.10.10.10.10.10.3.2.1" xref="S4.E2.m1.10.10.10.10.10.10.3.2.1.cmml">~</mo></mover><mi id="S4.E2.m1.10.10.10.10.10.10.3.3" xref="S4.E2.m1.10.10.10.10.10.10.3.3.cmml">t</mi></msub></mfrac><mo id="S4.E2.m1.28.28.3.26.26.26.26.1.2.2" xref="S4.E2.m1.27.27.2.3.cmml"></mo><mrow id="S4.E2.m1.28.28.3.26.26.26.26.1.2.1"><munder id="S4.E2.m1.28.28.3.26.26.26.26.1.2.1.2"><mo id="S4.E2.m1.11.11.11.11.11.11" movablelimits="false" xref="S4.E2.m1.11.11.11.11.11.11.cmml">∑</mo><mrow id="S4.E2.m1.12.12.12.12.12.12.1" xref="S4.E2.m1.12.12.12.12.12.12.1.cmml"><mi id="S4.E2.m1.12.12.12.12.12.12.1.2" xref="S4.E2.m1.12.12.12.12.12.12.1.2.cmml">s</mi><mo id="S4.E2.m1.12.12.12.12.12.12.1.1" xref="S4.E2.m1.12.12.12.12.12.12.1.1.cmml">∈</mo><msub id="S4.E2.m1.12.12.12.12.12.12.1.3" xref="S4.E2.m1.12.12.12.12.12.12.1.3.cmml"><mi id="S4.E2.m1.12.12.12.12.12.12.1.3.2" xref="S4.E2.m1.12.12.12.12.12.12.1.3.2.cmml">R</mi><mi id="S4.E2.m1.12.12.12.12.12.12.1.3.3" xref="S4.E2.m1.12.12.12.12.12.12.1.3.3.cmml">t</mi></msub></mrow></munder><mrow id="S4.E2.m1.28.28.3.26.26.26.26.1.2.1.1"><msub id="S4.E2.m1.28.28.3.26.26.26.26.1.2.1.1.3"><mi 
id="S4.E2.m1.13.13.13.13.13.13" xref="S4.E2.m1.13.13.13.13.13.13.cmml">f</mi><mi id="S4.E2.m1.14.14.14.14.14.14.1" xref="S4.E2.m1.14.14.14.14.14.14.1.cmml">s</mi></msub><mo id="S4.E2.m1.28.28.3.26.26.26.26.1.2.1.1.2" xref="S4.E2.m1.27.27.2.3.cmml"></mo><mrow id="S4.E2.m1.28.28.3.26.26.26.26.1.2.1.1.1.1"><mo id="S4.E2.m1.15.15.15.15.15.15" stretchy="false" xref="S4.E2.m1.27.27.2.3.cmml">(</mo><msub id="S4.E2.m1.28.28.3.26.26.26.26.1.2.1.1.1.1.1"><mi id="S4.E2.m1.16.16.16.16.16.16" xref="S4.E2.m1.16.16.16.16.16.16.cmml">A</mi><mi id="S4.E2.m1.17.17.17.17.17.17.1" xref="S4.E2.m1.17.17.17.17.17.17.1.cmml">t</mi></msub><mo id="S4.E2.m1.18.18.18.18.18.18" stretchy="false" xref="S4.E2.m1.27.27.2.3.cmml">)</mo></mrow></mrow></mrow></mrow></mrow><mo id="S4.E2.m1.19.19.19.19.19.19" xref="S4.E2.m1.27.27.2.3.cmml">,</mo><mrow id="S4.E2.m1.29.29.4.27.27.27.27.2"><msub id="S4.E2.m1.29.29.4.27.27.27.27.2.1"><mover accent="true" id="S4.E2.m1.20.20.20.20.20.20" xref="S4.E2.m1.20.20.20.20.20.20.cmml"><mi id="S4.E2.m1.20.20.20.20.20.20.2" xref="S4.E2.m1.20.20.20.20.20.20.2.cmml">R</mi><mo id="S4.E2.m1.20.20.20.20.20.20.1" xref="S4.E2.m1.20.20.20.20.20.20.1.cmml">~</mo></mover><mi id="S4.E2.m1.21.21.21.21.21.21.1" xref="S4.E2.m1.21.21.21.21.21.21.1.cmml">t</mi></msub><mo id="S4.E2.m1.22.22.22.22.22.22" rspace="0.111em" xref="S4.E2.m1.22.22.22.22.22.22.cmml">=</mo><mrow id="S4.E2.m1.29.29.4.27.27.27.27.2.2"><munder id="S4.E2.m1.29.29.4.27.27.27.27.2.2.1"><mo id="S4.E2.m1.23.23.23.23.23.23" movablelimits="false" xref="S4.E2.m1.23.23.23.23.23.23.cmml">∑</mo><mrow id="S4.E2.m1.24.24.24.24.24.24.1" xref="S4.E2.m1.24.24.24.24.24.24.1.cmml"><mi id="S4.E2.m1.24.24.24.24.24.24.1.2" xref="S4.E2.m1.24.24.24.24.24.24.1.2.cmml">s</mi><mo id="S4.E2.m1.24.24.24.24.24.24.1.1" xref="S4.E2.m1.24.24.24.24.24.24.1.1.cmml">∈</mo><msub id="S4.E2.m1.24.24.24.24.24.24.1.3" xref="S4.E2.m1.24.24.24.24.24.24.1.3.cmml"><mi id="S4.E2.m1.24.24.24.24.24.24.1.3.2" 
xref="S4.E2.m1.24.24.24.24.24.24.1.3.2.cmml">R</mi><mi id="S4.E2.m1.24.24.24.24.24.24.1.3.3" xref="S4.E2.m1.24.24.24.24.24.24.1.3.3.cmml">t</mi></msub></mrow></munder><mover accent="true" id="S4.E2.m1.25.25.25.25.25.25" xref="S4.E2.m1.25.25.25.25.25.25.cmml"><mi id="S4.E2.m1.25.25.25.25.25.25.2" xref="S4.E2.m1.25.25.25.25.25.25.2.cmml">s</mi><mo id="S4.E2.m1.25.25.25.25.25.25.1" xref="S4.E2.m1.25.25.25.25.25.25.1.cmml">~</mo></mover></mrow></mrow></mrow></mtd></mtr></mtable><annotation-xml encoding="MathML-Content" id="S4.E2.m1.29b"><apply id="S4.E2.m1.27.27.2.3.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" id="S4.E2.m1.27.27.2.3a.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">formulae-sequence</csymbol><apply id="S4.E2.m1.26.26.1.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><eq id="S4.E2.m1.9.9.9.9.9.9.cmml" xref="S4.E2.m1.9.9.9.9.9.9"></eq><apply id="S4.E2.m1.26.26.1.1.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><times id="S4.E2.m1.26.26.1.1.1.1.2.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"></times><ci id="S4.E2.m1.1.1.1.1.1.1.cmml" xref="S4.E2.m1.1.1.1.1.1.1">𝐴</ci><ci id="S4.E2.m1.2.2.2.2.2.2.cmml" xref="S4.E2.m1.2.2.2.2.2.2">𝑅</ci><apply id="S4.E2.m1.26.26.1.1.1.1.5.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" id="S4.E2.m1.26.26.1.1.1.1.5.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">subscript</csymbol><ci id="S4.E2.m1.3.3.3.3.3.3.cmml" xref="S4.E2.m1.3.3.3.3.3.3">𝑆</ci><ci id="S4.E2.m1.4.4.4.4.4.4.1.cmml" xref="S4.E2.m1.4.4.4.4.4.4.1">𝑡</ci></apply><apply id="S4.E2.m1.26.26.1.1.1.1.1.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" id="S4.E2.m1.26.26.1.1.1.1.1.1.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">subscript</csymbol><ci id="S4.E2.m1.6.6.6.6.6.6.cmml" xref="S4.E2.m1.6.6.6.6.6.6">𝐴</ci><ci id="S4.E2.m1.7.7.7.7.7.7.1.cmml" xref="S4.E2.m1.7.7.7.7.7.7.1">𝑡</ci></apply></apply><apply id="S4.E2.m1.26.26.1.1.1.2.cmml" 
xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><times id="S4.E2.m1.26.26.1.1.1.2.2.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"></times><apply id="S4.E2.m1.10.10.10.10.10.10.cmml" xref="S4.E2.m1.10.10.10.10.10.10"><divide id="S4.E2.m1.10.10.10.10.10.10.1.cmml" xref="S4.E2.m1.10.10.10.10.10.10"></divide><cn id="S4.E2.m1.10.10.10.10.10.10.2.cmml" type="integer" xref="S4.E2.m1.10.10.10.10.10.10.2">1</cn><apply id="S4.E2.m1.10.10.10.10.10.10.3.cmml" xref="S4.E2.m1.10.10.10.10.10.10.3"><csymbol cd="ambiguous" id="S4.E2.m1.10.10.10.10.10.10.3.1.cmml" xref="S4.E2.m1.10.10.10.10.10.10.3">subscript</csymbol><apply id="S4.E2.m1.10.10.10.10.10.10.3.2.cmml" xref="S4.E2.m1.10.10.10.10.10.10.3.2"><ci id="S4.E2.m1.10.10.10.10.10.10.3.2.1.cmml" xref="S4.E2.m1.10.10.10.10.10.10.3.2.1">~</ci><ci id="S4.E2.m1.10.10.10.10.10.10.3.2.2.cmml" xref="S4.E2.m1.10.10.10.10.10.10.3.2.2">𝑅</ci></apply><ci id="S4.E2.m1.10.10.10.10.10.10.3.3.cmml" xref="S4.E2.m1.10.10.10.10.10.10.3.3">𝑡</ci></apply></apply><apply id="S4.E2.m1.26.26.1.1.1.2.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><apply id="S4.E2.m1.26.26.1.1.1.2.1.2.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" id="S4.E2.m1.26.26.1.1.1.2.1.2.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">subscript</csymbol><sum id="S4.E2.m1.11.11.11.11.11.11.cmml" xref="S4.E2.m1.11.11.11.11.11.11"></sum><apply id="S4.E2.m1.12.12.12.12.12.12.1.cmml" xref="S4.E2.m1.12.12.12.12.12.12.1"><in id="S4.E2.m1.12.12.12.12.12.12.1.1.cmml" xref="S4.E2.m1.12.12.12.12.12.12.1.1"></in><ci id="S4.E2.m1.12.12.12.12.12.12.1.2.cmml" xref="S4.E2.m1.12.12.12.12.12.12.1.2">𝑠</ci><apply id="S4.E2.m1.12.12.12.12.12.12.1.3.cmml" xref="S4.E2.m1.12.12.12.12.12.12.1.3"><csymbol cd="ambiguous" id="S4.E2.m1.12.12.12.12.12.12.1.3.1.cmml" xref="S4.E2.m1.12.12.12.12.12.12.1.3">subscript</csymbol><ci id="S4.E2.m1.12.12.12.12.12.12.1.3.2.cmml" xref="S4.E2.m1.12.12.12.12.12.12.1.3.2">𝑅</ci><ci id="S4.E2.m1.12.12.12.12.12.12.1.3.3.cmml" 
xref="S4.E2.m1.12.12.12.12.12.12.1.3.3">𝑡</ci></apply></apply></apply><apply id="S4.E2.m1.26.26.1.1.1.2.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><times id="S4.E2.m1.26.26.1.1.1.2.1.1.2.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"></times><apply id="S4.E2.m1.26.26.1.1.1.2.1.1.3.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" id="S4.E2.m1.26.26.1.1.1.2.1.1.3.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">subscript</csymbol><ci id="S4.E2.m1.13.13.13.13.13.13.cmml" xref="S4.E2.m1.13.13.13.13.13.13">𝑓</ci><ci id="S4.E2.m1.14.14.14.14.14.14.1.cmml" xref="S4.E2.m1.14.14.14.14.14.14.1">𝑠</ci></apply><apply id="S4.E2.m1.26.26.1.1.1.2.1.1.1.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" id="S4.E2.m1.26.26.1.1.1.2.1.1.1.1.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">subscript</csymbol><ci id="S4.E2.m1.16.16.16.16.16.16.cmml" xref="S4.E2.m1.16.16.16.16.16.16">𝐴</ci><ci id="S4.E2.m1.17.17.17.17.17.17.1.cmml" xref="S4.E2.m1.17.17.17.17.17.17.1">𝑡</ci></apply></apply></apply></apply></apply><apply id="S4.E2.m1.27.27.2.2.2.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><eq id="S4.E2.m1.22.22.22.22.22.22.cmml" xref="S4.E2.m1.22.22.22.22.22.22"></eq><apply id="S4.E2.m1.27.27.2.2.2.2.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" id="S4.E2.m1.27.27.2.2.2.2.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">subscript</csymbol><apply id="S4.E2.m1.20.20.20.20.20.20.cmml" xref="S4.E2.m1.20.20.20.20.20.20"><ci id="S4.E2.m1.20.20.20.20.20.20.1.cmml" xref="S4.E2.m1.20.20.20.20.20.20.1">~</ci><ci id="S4.E2.m1.20.20.20.20.20.20.2.cmml" xref="S4.E2.m1.20.20.20.20.20.20.2">𝑅</ci></apply><ci id="S4.E2.m1.21.21.21.21.21.21.1.cmml" xref="S4.E2.m1.21.21.21.21.21.21.1">𝑡</ci></apply><apply id="S4.E2.m1.27.27.2.2.2.3.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><apply id="S4.E2.m1.27.27.2.2.2.3.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2"><csymbol cd="ambiguous" 
id="S4.E2.m1.27.27.2.2.2.3.1.1.cmml" xref="S4.E2.m1.28.28.3.26.26.26.26.1.1.2">subscript</csymbol><sum id="S4.E2.m1.23.23.23.23.23.23.cmml" xref="S4.E2.m1.23.23.23.23.23.23"></sum><apply id="S4.E2.m1.24.24.24.24.24.24.1.cmml" xref="S4.E2.m1.24.24.24.24.24.24.1"><in id="S4.E2.m1.24.24.24.24.24.24.1.1.cmml" xref="S4.E2.m1.24.24.24.24.24.24.1.1"></in><ci id="S4.E2.m1.24.24.24.24.24.24.1.2.cmml" xref="S4.E2.m1.24.24.24.24.24.24.1.2">𝑠</ci><apply id="S4.E2.m1.24.24.24.24.24.24.1.3.cmml" xref="S4.E2.m1.24.24.24.24.24.24.1.3"><csymbol cd="ambiguous" id="S4.E2.m1.24.24.24.24.24.24.1.3.1.cmml" xref="S4.E2.m1.24.24.24.24.24.24.1.3">subscript</csymbol><ci id="S4.E2.m1.24.24.24.24.24.24.1.3.2.cmml" xref="S4.E2.m1.24.24.24.24.24.24.1.3.2">𝑅</ci><ci id="S4.E2.m1.24.24.24.24.24.24.1.3.3.cmml" xref="S4.E2.m1.24.24.24.24.24.24.1.3.3">𝑡</ci></apply></apply></apply><apply id="S4.E2.m1.25.25.25.25.25.25.cmml" xref="S4.E2.m1.25.25.25.25.25.25"><ci id="S4.E2.m1.25.25.25.25.25.25.1.cmml" xref="S4.E2.m1.25.25.25.25.25.25.1">~</ci><ci id="S4.E2.m1.25.25.25.25.25.25.2.cmml" xref="S4.E2.m1.25.25.25.25.25.25.2">𝑠</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E2.m1.29c">\begin{split}ARS_{t}(A_{t})=\frac{1}{\tilde{R}_{t}}\sum_{s\in R_{t}}f_{s}(A_{t% }),\tilde{R}_{t}=\sum_{s\in R_{t}}\tilde{s}\end{split}</annotation><annotation encoding="application/x-llamapun" id="S4.E2.m1.29d">start_ROW start_CELL italic_A italic_R italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG over~ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , over~ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t 
end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_s ∈ italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG end_CELL end_ROW</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS1.p1.10">In this equation, <math alttext="A_{t}" class="ltx_Math" display="inline" id="S4.SS1.p1.5.m1.1"><semantics id="S4.SS1.p1.5.m1.1a"><msub id="S4.SS1.p1.5.m1.1.1" xref="S4.SS1.p1.5.m1.1.1.cmml"><mi id="S4.SS1.p1.5.m1.1.1.2" xref="S4.SS1.p1.5.m1.1.1.2.cmml">A</mi><mi id="S4.SS1.p1.5.m1.1.1.3" xref="S4.SS1.p1.5.m1.1.1.3.cmml">t</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.5.m1.1b"><apply id="S4.SS1.p1.5.m1.1.1.cmml" xref="S4.SS1.p1.5.m1.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.5.m1.1.1.1.cmml" xref="S4.SS1.p1.5.m1.1.1">subscript</csymbol><ci id="S4.SS1.p1.5.m1.1.1.2.cmml" xref="S4.SS1.p1.5.m1.1.1.2">𝐴</ci><ci id="S4.SS1.p1.5.m1.1.1.3.cmml" xref="S4.SS1.p1.5.m1.1.1.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.5.m1.1c">A_{t}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.5.m1.1d">italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT</annotation></semantics></math> represents a model’s outputs on task <math alttext="t" class="ltx_Math" display="inline" id="S4.SS1.p1.6.m2.1"><semantics id="S4.SS1.p1.6.m2.1a"><mi id="S4.SS1.p1.6.m2.1.1" xref="S4.SS1.p1.6.m2.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.6.m2.1b"><ci id="S4.SS1.p1.6.m2.1.1.cmml" xref="S4.SS1.p1.6.m2.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.6.m2.1c">t</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.6.m2.1d">italic_t</annotation></semantics></math>, and 
<math alttext="\tilde{R}_{t}" class="ltx_Math" display="inline" id="S4.SS1.p1.7.m3.1"><semantics id="S4.SS1.p1.7.m3.1a"><msub id="S4.SS1.p1.7.m3.1.1" xref="S4.SS1.p1.7.m3.1.1.cmml"><mover accent="true" id="S4.SS1.p1.7.m3.1.1.2" xref="S4.SS1.p1.7.m3.1.1.2.cmml"><mi id="S4.SS1.p1.7.m3.1.1.2.2" xref="S4.SS1.p1.7.m3.1.1.2.2.cmml">R</mi><mo id="S4.SS1.p1.7.m3.1.1.2.1" xref="S4.SS1.p1.7.m3.1.1.2.1.cmml">~</mo></mover><mi id="S4.SS1.p1.7.m3.1.1.3" xref="S4.SS1.p1.7.m3.1.1.3.cmml">t</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.7.m3.1b"><apply id="S4.SS1.p1.7.m3.1.1.cmml" xref="S4.SS1.p1.7.m3.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.7.m3.1.1.1.cmml" xref="S4.SS1.p1.7.m3.1.1">subscript</csymbol><apply id="S4.SS1.p1.7.m3.1.1.2.cmml" xref="S4.SS1.p1.7.m3.1.1.2"><ci id="S4.SS1.p1.7.m3.1.1.2.1.cmml" xref="S4.SS1.p1.7.m3.1.1.2.1">~</ci><ci id="S4.SS1.p1.7.m3.1.1.2.2.cmml" xref="S4.SS1.p1.7.m3.1.1.2.2">𝑅</ci></apply><ci id="S4.SS1.p1.7.m3.1.1.3.cmml" xref="S4.SS1.p1.7.m3.1.1.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.7.m3.1c">\tilde{R}_{t}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.7.m3.1d">over~ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT</annotation></semantics></math> is the sum of the weights of all scoring points for the task; dividing by it normalizes each task score to the range [0, 1], so that scores are comparable across tasks. 
The function <math alttext="f_{s}(\cdot)\in[0,\tilde{s}]" class="ltx_Math" display="inline" id="S4.SS1.p1.8.m4.3"><semantics id="S4.SS1.p1.8.m4.3a"><mrow id="S4.SS1.p1.8.m4.3.4" xref="S4.SS1.p1.8.m4.3.4.cmml"><mrow id="S4.SS1.p1.8.m4.3.4.2" xref="S4.SS1.p1.8.m4.3.4.2.cmml"><msub id="S4.SS1.p1.8.m4.3.4.2.2" xref="S4.SS1.p1.8.m4.3.4.2.2.cmml"><mi id="S4.SS1.p1.8.m4.3.4.2.2.2" xref="S4.SS1.p1.8.m4.3.4.2.2.2.cmml">f</mi><mi id="S4.SS1.p1.8.m4.3.4.2.2.3" xref="S4.SS1.p1.8.m4.3.4.2.2.3.cmml">s</mi></msub><mo id="S4.SS1.p1.8.m4.3.4.2.1" xref="S4.SS1.p1.8.m4.3.4.2.1.cmml"></mo><mrow id="S4.SS1.p1.8.m4.3.4.2.3.2" xref="S4.SS1.p1.8.m4.3.4.2.cmml"><mo id="S4.SS1.p1.8.m4.3.4.2.3.2.1" stretchy="false" xref="S4.SS1.p1.8.m4.3.4.2.cmml">(</mo><mo id="S4.SS1.p1.8.m4.1.1" lspace="0em" rspace="0em" xref="S4.SS1.p1.8.m4.1.1.cmml">⋅</mo><mo id="S4.SS1.p1.8.m4.3.4.2.3.2.2" stretchy="false" xref="S4.SS1.p1.8.m4.3.4.2.cmml">)</mo></mrow></mrow><mo id="S4.SS1.p1.8.m4.3.4.1" xref="S4.SS1.p1.8.m4.3.4.1.cmml">∈</mo><mrow id="S4.SS1.p1.8.m4.3.4.3.2" xref="S4.SS1.p1.8.m4.3.4.3.1.cmml"><mo id="S4.SS1.p1.8.m4.3.4.3.2.1" stretchy="false" xref="S4.SS1.p1.8.m4.3.4.3.1.cmml">[</mo><mn id="S4.SS1.p1.8.m4.2.2" xref="S4.SS1.p1.8.m4.2.2.cmml">0</mn><mo id="S4.SS1.p1.8.m4.3.4.3.2.2" xref="S4.SS1.p1.8.m4.3.4.3.1.cmml">,</mo><mover accent="true" id="S4.SS1.p1.8.m4.3.3" xref="S4.SS1.p1.8.m4.3.3.cmml"><mi id="S4.SS1.p1.8.m4.3.3.2" xref="S4.SS1.p1.8.m4.3.3.2.cmml">s</mi><mo id="S4.SS1.p1.8.m4.3.3.1" xref="S4.SS1.p1.8.m4.3.3.1.cmml">~</mo></mover><mo id="S4.SS1.p1.8.m4.3.4.3.2.3" stretchy="false" xref="S4.SS1.p1.8.m4.3.4.3.1.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.8.m4.3b"><apply id="S4.SS1.p1.8.m4.3.4.cmml" xref="S4.SS1.p1.8.m4.3.4"><in id="S4.SS1.p1.8.m4.3.4.1.cmml" xref="S4.SS1.p1.8.m4.3.4.1"></in><apply id="S4.SS1.p1.8.m4.3.4.2.cmml" xref="S4.SS1.p1.8.m4.3.4.2"><times id="S4.SS1.p1.8.m4.3.4.2.1.cmml" xref="S4.SS1.p1.8.m4.3.4.2.1"></times><apply 
id="S4.SS1.p1.8.m4.3.4.2.2.cmml" xref="S4.SS1.p1.8.m4.3.4.2.2"><csymbol cd="ambiguous" id="S4.SS1.p1.8.m4.3.4.2.2.1.cmml" xref="S4.SS1.p1.8.m4.3.4.2.2">subscript</csymbol><ci id="S4.SS1.p1.8.m4.3.4.2.2.2.cmml" xref="S4.SS1.p1.8.m4.3.4.2.2.2">𝑓</ci><ci id="S4.SS1.p1.8.m4.3.4.2.2.3.cmml" xref="S4.SS1.p1.8.m4.3.4.2.2.3">𝑠</ci></apply><ci id="S4.SS1.p1.8.m4.1.1.cmml" xref="S4.SS1.p1.8.m4.1.1">⋅</ci></apply><interval closure="closed" id="S4.SS1.p1.8.m4.3.4.3.1.cmml" xref="S4.SS1.p1.8.m4.3.4.3.2"><cn id="S4.SS1.p1.8.m4.2.2.cmml" type="integer" xref="S4.SS1.p1.8.m4.2.2">0</cn><apply id="S4.SS1.p1.8.m4.3.3.cmml" xref="S4.SS1.p1.8.m4.3.3"><ci id="S4.SS1.p1.8.m4.3.3.1.cmml" xref="S4.SS1.p1.8.m4.3.3.1">~</ci><ci id="S4.SS1.p1.8.m4.3.3.2.cmml" xref="S4.SS1.p1.8.m4.3.3.2">𝑠</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.8.m4.3c">f_{s}(\cdot)\in[0,\tilde{s}]</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.8.m4.3d">italic_f start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( ⋅ ) ∈ [ 0 , over~ start_ARG italic_s end_ARG ]</annotation></semantics></math> is the average score, over all of the model’s outputs, that the automated evaluation program assigns for scoring point <math alttext="s" class="ltx_Math" display="inline" id="S4.SS1.p1.9.m5.1"><semantics id="S4.SS1.p1.9.m5.1a"><mi id="S4.SS1.p1.9.m5.1.1" xref="S4.SS1.p1.9.m5.1.1.cmml">s</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.9.m5.1b"><ci id="S4.SS1.p1.9.m5.1.1.cmml" xref="S4.SS1.p1.9.m5.1.1">𝑠</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.9.m5.1c">s</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.9.m5.1d">italic_s</annotation></semantics></math>. 
Naturally, defining <math alttext="T=\{t_{i}\}_{i=1}^{N_{t}}" class="ltx_Math" display="inline" id="S4.SS1.p1.10.m6.1"><semantics id="S4.SS1.p1.10.m6.1a"><mrow id="S4.SS1.p1.10.m6.1.1" xref="S4.SS1.p1.10.m6.1.1.cmml"><mi id="S4.SS1.p1.10.m6.1.1.3" xref="S4.SS1.p1.10.m6.1.1.3.cmml">T</mi><mo id="S4.SS1.p1.10.m6.1.1.2" xref="S4.SS1.p1.10.m6.1.1.2.cmml">=</mo><msubsup id="S4.SS1.p1.10.m6.1.1.1" xref="S4.SS1.p1.10.m6.1.1.1.cmml"><mrow id="S4.SS1.p1.10.m6.1.1.1.1.1.1" xref="S4.SS1.p1.10.m6.1.1.1.1.1.2.cmml"><mo id="S4.SS1.p1.10.m6.1.1.1.1.1.1.2" stretchy="false" xref="S4.SS1.p1.10.m6.1.1.1.1.1.2.cmml">{</mo><msub id="S4.SS1.p1.10.m6.1.1.1.1.1.1.1" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.cmml"><mi id="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.2" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.2.cmml">t</mi><mi id="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.3" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.3.cmml">i</mi></msub><mo id="S4.SS1.p1.10.m6.1.1.1.1.1.1.3" stretchy="false" xref="S4.SS1.p1.10.m6.1.1.1.1.1.2.cmml">}</mo></mrow><mrow id="S4.SS1.p1.10.m6.1.1.1.1.3" xref="S4.SS1.p1.10.m6.1.1.1.1.3.cmml"><mi id="S4.SS1.p1.10.m6.1.1.1.1.3.2" xref="S4.SS1.p1.10.m6.1.1.1.1.3.2.cmml">i</mi><mo id="S4.SS1.p1.10.m6.1.1.1.1.3.1" xref="S4.SS1.p1.10.m6.1.1.1.1.3.1.cmml">=</mo><mn id="S4.SS1.p1.10.m6.1.1.1.1.3.3" xref="S4.SS1.p1.10.m6.1.1.1.1.3.3.cmml">1</mn></mrow><msub id="S4.SS1.p1.10.m6.1.1.1.3" xref="S4.SS1.p1.10.m6.1.1.1.3.cmml"><mi id="S4.SS1.p1.10.m6.1.1.1.3.2" xref="S4.SS1.p1.10.m6.1.1.1.3.2.cmml">N</mi><mi id="S4.SS1.p1.10.m6.1.1.1.3.3" xref="S4.SS1.p1.10.m6.1.1.1.3.3.cmml">t</mi></msub></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.10.m6.1b"><apply id="S4.SS1.p1.10.m6.1.1.cmml" xref="S4.SS1.p1.10.m6.1.1"><eq id="S4.SS1.p1.10.m6.1.1.2.cmml" xref="S4.SS1.p1.10.m6.1.1.2"></eq><ci id="S4.SS1.p1.10.m6.1.1.3.cmml" xref="S4.SS1.p1.10.m6.1.1.3">𝑇</ci><apply id="S4.SS1.p1.10.m6.1.1.1.cmml" xref="S4.SS1.p1.10.m6.1.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.10.m6.1.1.1.2.cmml" 
xref="S4.SS1.p1.10.m6.1.1.1">superscript</csymbol><apply id="S4.SS1.p1.10.m6.1.1.1.1.cmml" xref="S4.SS1.p1.10.m6.1.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.10.m6.1.1.1.1.2.cmml" xref="S4.SS1.p1.10.m6.1.1.1">subscript</csymbol><set id="S4.SS1.p1.10.m6.1.1.1.1.1.2.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1"><apply id="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.1.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1.1">subscript</csymbol><ci id="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.2.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.2">𝑡</ci><ci id="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.3.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.1.1.1.3">𝑖</ci></apply></set><apply id="S4.SS1.p1.10.m6.1.1.1.1.3.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.3"><eq id="S4.SS1.p1.10.m6.1.1.1.1.3.1.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.3.1"></eq><ci id="S4.SS1.p1.10.m6.1.1.1.1.3.2.cmml" xref="S4.SS1.p1.10.m6.1.1.1.1.3.2">𝑖</ci><cn id="S4.SS1.p1.10.m6.1.1.1.1.3.3.cmml" type="integer" xref="S4.SS1.p1.10.m6.1.1.1.1.3.3">1</cn></apply></apply><apply id="S4.SS1.p1.10.m6.1.1.1.3.cmml" xref="S4.SS1.p1.10.m6.1.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.p1.10.m6.1.1.1.3.1.cmml" xref="S4.SS1.p1.10.m6.1.1.1.3">subscript</csymbol><ci id="S4.SS1.p1.10.m6.1.1.1.3.2.cmml" xref="S4.SS1.p1.10.m6.1.1.1.3.2">𝑁</ci><ci id="S4.SS1.p1.10.m6.1.1.1.3.3.cmml" xref="S4.SS1.p1.10.m6.1.1.1.3.3">𝑡</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.10.m6.1c">T=\{t_{i}\}_{i=1}^{N_{t}}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.10.m6.1d">italic_T = { italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT</annotation></semantics></math> as the set of all tasks, the overall score of a model on LIFBench is the average of its per-task scores, with each task weighted by its total scoring-point weight:</p> 
<table class="ltx_equation ltx_eqn_table" id="S4.E3"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="ARS_{overall}=\frac{\sum_{t\in T}\tilde{R}_{t}\cdot ARS_{t}}{\sum_{t\in T}% \tilde{R}_{t}}" class="ltx_Math" display="block" id="S4.E3.m1.1"><semantics id="S4.E3.m1.1a"><mrow id="S4.E3.m1.1.1" xref="S4.E3.m1.1.1.cmml"><mrow id="S4.E3.m1.1.1.2" xref="S4.E3.m1.1.1.2.cmml"><mi id="S4.E3.m1.1.1.2.2" xref="S4.E3.m1.1.1.2.2.cmml">A</mi><mo id="S4.E3.m1.1.1.2.1" xref="S4.E3.m1.1.1.2.1.cmml"></mo><mi id="S4.E3.m1.1.1.2.3" xref="S4.E3.m1.1.1.2.3.cmml">R</mi><mo id="S4.E3.m1.1.1.2.1a" xref="S4.E3.m1.1.1.2.1.cmml"></mo><msub id="S4.E3.m1.1.1.2.4" xref="S4.E3.m1.1.1.2.4.cmml"><mi id="S4.E3.m1.1.1.2.4.2" xref="S4.E3.m1.1.1.2.4.2.cmml">S</mi><mrow id="S4.E3.m1.1.1.2.4.3" xref="S4.E3.m1.1.1.2.4.3.cmml"><mi id="S4.E3.m1.1.1.2.4.3.2" xref="S4.E3.m1.1.1.2.4.3.2.cmml">o</mi><mo id="S4.E3.m1.1.1.2.4.3.1" xref="S4.E3.m1.1.1.2.4.3.1.cmml"></mo><mi id="S4.E3.m1.1.1.2.4.3.3" xref="S4.E3.m1.1.1.2.4.3.3.cmml">v</mi><mo id="S4.E3.m1.1.1.2.4.3.1a" xref="S4.E3.m1.1.1.2.4.3.1.cmml"></mo><mi id="S4.E3.m1.1.1.2.4.3.4" xref="S4.E3.m1.1.1.2.4.3.4.cmml">e</mi><mo id="S4.E3.m1.1.1.2.4.3.1b" xref="S4.E3.m1.1.1.2.4.3.1.cmml"></mo><mi id="S4.E3.m1.1.1.2.4.3.5" xref="S4.E3.m1.1.1.2.4.3.5.cmml">r</mi><mo id="S4.E3.m1.1.1.2.4.3.1c" xref="S4.E3.m1.1.1.2.4.3.1.cmml"></mo><mi id="S4.E3.m1.1.1.2.4.3.6" xref="S4.E3.m1.1.1.2.4.3.6.cmml">a</mi><mo id="S4.E3.m1.1.1.2.4.3.1d" xref="S4.E3.m1.1.1.2.4.3.1.cmml"></mo><mi id="S4.E3.m1.1.1.2.4.3.7" xref="S4.E3.m1.1.1.2.4.3.7.cmml">l</mi><mo id="S4.E3.m1.1.1.2.4.3.1e" xref="S4.E3.m1.1.1.2.4.3.1.cmml"></mo><mi id="S4.E3.m1.1.1.2.4.3.8" xref="S4.E3.m1.1.1.2.4.3.8.cmml">l</mi></mrow></msub></mrow><mo id="S4.E3.m1.1.1.1" xref="S4.E3.m1.1.1.1.cmml">=</mo><mfrac id="S4.E3.m1.1.1.3" xref="S4.E3.m1.1.1.3.cmml"><mrow id="S4.E3.m1.1.1.3.2" 
xref="S4.E3.m1.1.1.3.2.cmml"><msub id="S4.E3.m1.1.1.3.2.1" xref="S4.E3.m1.1.1.3.2.1.cmml"><mo id="S4.E3.m1.1.1.3.2.1.2" xref="S4.E3.m1.1.1.3.2.1.2.cmml">∑</mo><mrow id="S4.E3.m1.1.1.3.2.1.3" xref="S4.E3.m1.1.1.3.2.1.3.cmml"><mi id="S4.E3.m1.1.1.3.2.1.3.2" xref="S4.E3.m1.1.1.3.2.1.3.2.cmml">t</mi><mo id="S4.E3.m1.1.1.3.2.1.3.1" xref="S4.E3.m1.1.1.3.2.1.3.1.cmml">∈</mo><mi id="S4.E3.m1.1.1.3.2.1.3.3" xref="S4.E3.m1.1.1.3.2.1.3.3.cmml">T</mi></mrow></msub><mrow id="S4.E3.m1.1.1.3.2.2" xref="S4.E3.m1.1.1.3.2.2.cmml"><mrow id="S4.E3.m1.1.1.3.2.2.2" xref="S4.E3.m1.1.1.3.2.2.2.cmml"><msub id="S4.E3.m1.1.1.3.2.2.2.2" xref="S4.E3.m1.1.1.3.2.2.2.2.cmml"><mover accent="true" id="S4.E3.m1.1.1.3.2.2.2.2.2" xref="S4.E3.m1.1.1.3.2.2.2.2.2.cmml"><mi id="S4.E3.m1.1.1.3.2.2.2.2.2.2" xref="S4.E3.m1.1.1.3.2.2.2.2.2.2.cmml">R</mi><mo id="S4.E3.m1.1.1.3.2.2.2.2.2.1" xref="S4.E3.m1.1.1.3.2.2.2.2.2.1.cmml">~</mo></mover><mi id="S4.E3.m1.1.1.3.2.2.2.2.3" xref="S4.E3.m1.1.1.3.2.2.2.2.3.cmml">t</mi></msub><mo id="S4.E3.m1.1.1.3.2.2.2.1" lspace="0.222em" rspace="0.222em" xref="S4.E3.m1.1.1.3.2.2.2.1.cmml">⋅</mo><mi id="S4.E3.m1.1.1.3.2.2.2.3" xref="S4.E3.m1.1.1.3.2.2.2.3.cmml">A</mi></mrow><mo id="S4.E3.m1.1.1.3.2.2.1" xref="S4.E3.m1.1.1.3.2.2.1.cmml"></mo><mi id="S4.E3.m1.1.1.3.2.2.3" xref="S4.E3.m1.1.1.3.2.2.3.cmml">R</mi><mo id="S4.E3.m1.1.1.3.2.2.1a" xref="S4.E3.m1.1.1.3.2.2.1.cmml"></mo><msub id="S4.E3.m1.1.1.3.2.2.4" xref="S4.E3.m1.1.1.3.2.2.4.cmml"><mi id="S4.E3.m1.1.1.3.2.2.4.2" xref="S4.E3.m1.1.1.3.2.2.4.2.cmml">S</mi><mi id="S4.E3.m1.1.1.3.2.2.4.3" xref="S4.E3.m1.1.1.3.2.2.4.3.cmml">t</mi></msub></mrow></mrow><mrow id="S4.E3.m1.1.1.3.3" xref="S4.E3.m1.1.1.3.3.cmml"><msub id="S4.E3.m1.1.1.3.3.1" xref="S4.E3.m1.1.1.3.3.1.cmml"><mo id="S4.E3.m1.1.1.3.3.1.2" xref="S4.E3.m1.1.1.3.3.1.2.cmml">∑</mo><mrow id="S4.E3.m1.1.1.3.3.1.3" xref="S4.E3.m1.1.1.3.3.1.3.cmml"><mi id="S4.E3.m1.1.1.3.3.1.3.2" xref="S4.E3.m1.1.1.3.3.1.3.2.cmml">t</mi><mo id="S4.E3.m1.1.1.3.3.1.3.1" 
xref="S4.E3.m1.1.1.3.3.1.3.1.cmml">∈</mo><mi id="S4.E3.m1.1.1.3.3.1.3.3" xref="S4.E3.m1.1.1.3.3.1.3.3.cmml">T</mi></mrow></msub><msub id="S4.E3.m1.1.1.3.3.2" xref="S4.E3.m1.1.1.3.3.2.cmml"><mover accent="true" id="S4.E3.m1.1.1.3.3.2.2" xref="S4.E3.m1.1.1.3.3.2.2.cmml"><mi id="S4.E3.m1.1.1.3.3.2.2.2" xref="S4.E3.m1.1.1.3.3.2.2.2.cmml">R</mi><mo id="S4.E3.m1.1.1.3.3.2.2.1" xref="S4.E3.m1.1.1.3.3.2.2.1.cmml">~</mo></mover><mi id="S4.E3.m1.1.1.3.3.2.3" xref="S4.E3.m1.1.1.3.3.2.3.cmml">t</mi></msub></mrow></mfrac></mrow><annotation-xml encoding="MathML-Content" id="S4.E3.m1.1b"><apply id="S4.E3.m1.1.1.cmml" xref="S4.E3.m1.1.1"><eq id="S4.E3.m1.1.1.1.cmml" xref="S4.E3.m1.1.1.1"></eq><apply id="S4.E3.m1.1.1.2.cmml" xref="S4.E3.m1.1.1.2"><times id="S4.E3.m1.1.1.2.1.cmml" xref="S4.E3.m1.1.1.2.1"></times><ci id="S4.E3.m1.1.1.2.2.cmml" xref="S4.E3.m1.1.1.2.2">𝐴</ci><ci id="S4.E3.m1.1.1.2.3.cmml" xref="S4.E3.m1.1.1.2.3">𝑅</ci><apply id="S4.E3.m1.1.1.2.4.cmml" xref="S4.E3.m1.1.1.2.4"><csymbol cd="ambiguous" id="S4.E3.m1.1.1.2.4.1.cmml" xref="S4.E3.m1.1.1.2.4">subscript</csymbol><ci id="S4.E3.m1.1.1.2.4.2.cmml" xref="S4.E3.m1.1.1.2.4.2">𝑆</ci><apply id="S4.E3.m1.1.1.2.4.3.cmml" xref="S4.E3.m1.1.1.2.4.3"><times id="S4.E3.m1.1.1.2.4.3.1.cmml" xref="S4.E3.m1.1.1.2.4.3.1"></times><ci id="S4.E3.m1.1.1.2.4.3.2.cmml" xref="S4.E3.m1.1.1.2.4.3.2">𝑜</ci><ci id="S4.E3.m1.1.1.2.4.3.3.cmml" xref="S4.E3.m1.1.1.2.4.3.3">𝑣</ci><ci id="S4.E3.m1.1.1.2.4.3.4.cmml" xref="S4.E3.m1.1.1.2.4.3.4">𝑒</ci><ci id="S4.E3.m1.1.1.2.4.3.5.cmml" xref="S4.E3.m1.1.1.2.4.3.5">𝑟</ci><ci id="S4.E3.m1.1.1.2.4.3.6.cmml" xref="S4.E3.m1.1.1.2.4.3.6">𝑎</ci><ci id="S4.E3.m1.1.1.2.4.3.7.cmml" xref="S4.E3.m1.1.1.2.4.3.7">𝑙</ci><ci id="S4.E3.m1.1.1.2.4.3.8.cmml" xref="S4.E3.m1.1.1.2.4.3.8">𝑙</ci></apply></apply></apply><apply id="S4.E3.m1.1.1.3.cmml" xref="S4.E3.m1.1.1.3"><divide id="S4.E3.m1.1.1.3.1.cmml" xref="S4.E3.m1.1.1.3"></divide><apply id="S4.E3.m1.1.1.3.2.cmml" xref="S4.E3.m1.1.1.3.2"><apply 
id="S4.E3.m1.1.1.3.2.1.cmml" xref="S4.E3.m1.1.1.3.2.1"><csymbol cd="ambiguous" id="S4.E3.m1.1.1.3.2.1.1.cmml" xref="S4.E3.m1.1.1.3.2.1">subscript</csymbol><sum id="S4.E3.m1.1.1.3.2.1.2.cmml" xref="S4.E3.m1.1.1.3.2.1.2"></sum><apply id="S4.E3.m1.1.1.3.2.1.3.cmml" xref="S4.E3.m1.1.1.3.2.1.3"><in id="S4.E3.m1.1.1.3.2.1.3.1.cmml" xref="S4.E3.m1.1.1.3.2.1.3.1"></in><ci id="S4.E3.m1.1.1.3.2.1.3.2.cmml" xref="S4.E3.m1.1.1.3.2.1.3.2">𝑡</ci><ci id="S4.E3.m1.1.1.3.2.1.3.3.cmml" xref="S4.E3.m1.1.1.3.2.1.3.3">𝑇</ci></apply></apply><apply id="S4.E3.m1.1.1.3.2.2.cmml" xref="S4.E3.m1.1.1.3.2.2"><times id="S4.E3.m1.1.1.3.2.2.1.cmml" xref="S4.E3.m1.1.1.3.2.2.1"></times><apply id="S4.E3.m1.1.1.3.2.2.2.cmml" xref="S4.E3.m1.1.1.3.2.2.2"><ci id="S4.E3.m1.1.1.3.2.2.2.1.cmml" xref="S4.E3.m1.1.1.3.2.2.2.1">⋅</ci><apply id="S4.E3.m1.1.1.3.2.2.2.2.cmml" xref="S4.E3.m1.1.1.3.2.2.2.2"><csymbol cd="ambiguous" id="S4.E3.m1.1.1.3.2.2.2.2.1.cmml" xref="S4.E3.m1.1.1.3.2.2.2.2">subscript</csymbol><apply id="S4.E3.m1.1.1.3.2.2.2.2.2.cmml" xref="S4.E3.m1.1.1.3.2.2.2.2.2"><ci id="S4.E3.m1.1.1.3.2.2.2.2.2.1.cmml" xref="S4.E3.m1.1.1.3.2.2.2.2.2.1">~</ci><ci id="S4.E3.m1.1.1.3.2.2.2.2.2.2.cmml" xref="S4.E3.m1.1.1.3.2.2.2.2.2.2">𝑅</ci></apply><ci id="S4.E3.m1.1.1.3.2.2.2.2.3.cmml" xref="S4.E3.m1.1.1.3.2.2.2.2.3">𝑡</ci></apply><ci id="S4.E3.m1.1.1.3.2.2.2.3.cmml" xref="S4.E3.m1.1.1.3.2.2.2.3">𝐴</ci></apply><ci id="S4.E3.m1.1.1.3.2.2.3.cmml" xref="S4.E3.m1.1.1.3.2.2.3">𝑅</ci><apply id="S4.E3.m1.1.1.3.2.2.4.cmml" xref="S4.E3.m1.1.1.3.2.2.4"><csymbol cd="ambiguous" id="S4.E3.m1.1.1.3.2.2.4.1.cmml" xref="S4.E3.m1.1.1.3.2.2.4">subscript</csymbol><ci id="S4.E3.m1.1.1.3.2.2.4.2.cmml" xref="S4.E3.m1.1.1.3.2.2.4.2">𝑆</ci><ci id="S4.E3.m1.1.1.3.2.2.4.3.cmml" xref="S4.E3.m1.1.1.3.2.2.4.3">𝑡</ci></apply></apply></apply><apply id="S4.E3.m1.1.1.3.3.cmml" xref="S4.E3.m1.1.1.3.3"><apply id="S4.E3.m1.1.1.3.3.1.cmml" xref="S4.E3.m1.1.1.3.3.1"><csymbol cd="ambiguous" id="S4.E3.m1.1.1.3.3.1.1.cmml" 
xref="S4.E3.m1.1.1.3.3.1">subscript</csymbol><sum id="S4.E3.m1.1.1.3.3.1.2.cmml" xref="S4.E3.m1.1.1.3.3.1.2"></sum><apply id="S4.E3.m1.1.1.3.3.1.3.cmml" xref="S4.E3.m1.1.1.3.3.1.3"><in id="S4.E3.m1.1.1.3.3.1.3.1.cmml" xref="S4.E3.m1.1.1.3.3.1.3.1"></in><ci id="S4.E3.m1.1.1.3.3.1.3.2.cmml" xref="S4.E3.m1.1.1.3.3.1.3.2">𝑡</ci><ci id="S4.E3.m1.1.1.3.3.1.3.3.cmml" xref="S4.E3.m1.1.1.3.3.1.3.3">𝑇</ci></apply></apply><apply id="S4.E3.m1.1.1.3.3.2.cmml" xref="S4.E3.m1.1.1.3.3.2"><csymbol cd="ambiguous" id="S4.E3.m1.1.1.3.3.2.1.cmml" xref="S4.E3.m1.1.1.3.3.2">subscript</csymbol><apply id="S4.E3.m1.1.1.3.3.2.2.cmml" xref="S4.E3.m1.1.1.3.3.2.2"><ci id="S4.E3.m1.1.1.3.3.2.2.1.cmml" xref="S4.E3.m1.1.1.3.3.2.2.1">~</ci><ci id="S4.E3.m1.1.1.3.3.2.2.2.cmml" xref="S4.E3.m1.1.1.3.3.2.2.2">𝑅</ci></apply><ci id="S4.E3.m1.1.1.3.3.2.3.cmml" xref="S4.E3.m1.1.1.3.3.2.3">𝑡</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E3.m1.1c">ARS_{overall}=\frac{\sum_{t\in T}\tilde{R}_{t}\cdot ARS_{t}}{\sum_{t\in T}% \tilde{R}_{t}}</annotation><annotation encoding="application/x-llamapun" id="S4.E3.m1.1d">italic_A italic_R italic_S start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_t ∈ italic_T end_POSTSUBSCRIPT over~ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⋅ italic_A italic_R italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_t ∈ italic_T end_POSTSUBSCRIPT over~ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td> </tr></tbody> </table> </div> <figure class="ltx_table" id="S4.T3"> <div 
class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T3.10" style="width:433.6pt;height:327.3pt;vertical-align:-0.7pt;"><span class="ltx_transformed_inner" style="transform:translate(-93.8pt,70.7pt) scale(0.697928667883007,0.697928667883007) ;"> <table class="ltx_tabular ltx_align_middle" id="S4.T3.10.10"> <tr class="ltx_tr" id="S4.T3.10.10.11" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.11.1"><span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span></td> <td class="ltx_td ltx_align_center" colspan="3" id="S4.T3.10.10.11.2"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.11.2.1">OneDoc</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.11.3"></td> <td class="ltx_td ltx_align_center" colspan="6" id="S4.T3.10.10.11.4"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.11.4.1">List</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.11.5"></td> <td class="ltx_td ltx_align_center" colspan="2" id="S4.T3.10.10.11.6"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.11.6.1">MultiDoc</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.11.7"></td> <td class="ltx_td" id="S4.T3.10.10.11.8"></td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.12"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.12.1"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.1.1" style="background-color:#E6E6E6;">Model</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.2"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.2.1" style="background-color:#E6E6E6;">OR</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.3"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.3.1" style="background-color:#E6E6E6;">OQ</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.4"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.4.1" 
style="background-color:#E6E6E6;">OE</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.12.5"></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.6"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.6.1" style="background-color:#E6E6E6;">LSI</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.7"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.7.1" style="background-color:#E6E6E6;">LMI</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.8"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.8.1" style="background-color:#E6E6E6;">LOI</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.9"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.9.1" style="background-color:#E6E6E6;">LOE</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.10"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.10.1" style="background-color:#E6E6E6;">LBI</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.11"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.11.1" style="background-color:#E6E6E6;">LBE</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.12.12"></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.13"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.13.1" style="background-color:#E6E6E6;">MB</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.10.10.12.14"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.14.1" style="background-color:#E6E6E6;">MF</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.12.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.12.16"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.12.16.1" style="background-color:#E6E6E6;">Overall</span></td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.13"> <td class="ltx_td ltx_align_center ltx_border_t" colspan="16" 
id="S4.T3.10.10.13.1"><span class="ltx_text ltx_font_italic" id="S4.T3.10.10.13.1.1">API Models</span></td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.14"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.14.1">gpt-4o-2024-08-06</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.2"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.14.2.1">0.797</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.3"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.14.3.1">0.885</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.4"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.10.10.14.4.1">0.826</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.14.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.6"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.10.10.14.6.1">0.871</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.7"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.14.7.1">0.788</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.8"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.14.8.1">0.712</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.9"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.14.9.1">0.805</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.10">0.726</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.11">0.805</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.14.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.13">0.719</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.14">0.657</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.14.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.14.16"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.14.16.1">0.764</span></td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.15"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.15.1">gpt-4-0125-preview</td> <td class="ltx_td ltx_align_center" 
id="S4.T3.10.10.15.2">0.707</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.3">0.846</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.4">0.734</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.15.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.6"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.15.6.1">0.878</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.7">0.673</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.8"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.10.10.15.8.1">0.668</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.9">0.753</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.10">0.750</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.11"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.15.11.1">0.833</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.15.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.13">0.600</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.14"><span class="ltx_text ltx_font_bold" id="S4.T3.10.10.15.14.1">0.805</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.15.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.15.16"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.10.10.15.16.1">0.734</span></td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.16"> <td class="ltx_td ltx_align_center ltx_border_t" colspan="16" id="S4.T3.10.10.16.1"><span class="ltx_text ltx_font_italic" id="S4.T3.10.10.16.1.1">Models Larger Than 20B Parameters</span></td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.17"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.17.1">Meta-Llama-3.1-70B-Instruct</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.2">0.730</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.3"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.10.10.17.3.1">0.864</span></td> <td class="ltx_td ltx_align_center" 
id="S4.T3.10.10.17.4">0.706</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.17.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.6">0.794</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.7"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.10.10.17.7.1">0.763</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.8">0.621</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.9"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.10.10.17.9.1">0.780</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.10">0.689</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.11">0.799</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.17.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.13">0.657</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.14">0.572</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.17.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.17.16">0.700</td> </tr> <tr class="ltx_tr" id="S4.T3.1.1.1"> <td class="ltx_td ltx_align_left" id="S4.T3.1.1.1.1">Qwen2.5-72B-Instruct<sup class="ltx_sup" id="S4.T3.1.1.1.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.2"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.1.1.1.2.1">0.759</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.3">0.756</td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.4">0.808</td> <td class="ltx_td ltx_align_top" id="S4.T3.1.1.1.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.6">0.851</td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.7">0.598</td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.8">0.644</td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.9">0.651</td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.10"><span class="ltx_text ltx_font_bold" id="S4.T3.1.1.1.10.1">0.828</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.11"><span 
class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.1.1.1.11.1">0.815</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.1.1.1.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.13">0.609</td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.14">0.614</td> <td class="ltx_td ltx_align_top" id="S4.T3.1.1.1.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.1.1.1.16">0.700</td> </tr> <tr class="ltx_tr" id="S4.T3.2.2.2"> <td class="ltx_td ltx_align_left" id="S4.T3.2.2.2.1">Qwen2.5-32B-Instruct<sup class="ltx_sup" id="S4.T3.2.2.2.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.2">0.702</td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.3">0.780</td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.4"><span class="ltx_text ltx_font_bold" id="S4.T3.2.2.2.4.1">0.841</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.2.2.2.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.6">0.629</td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.7">0.535</td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.8">0.505</td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.9">0.480</td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.10"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.2.2.2.10.1">0.764</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.11">0.764</td> <td class="ltx_td ltx_align_top" id="S4.T3.2.2.2.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.13">0.608</td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.14">0.498</td> <td class="ltx_td ltx_align_top" id="S4.T3.2.2.2.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.2.2.2.16">0.632</td> </tr> <tr class="ltx_tr" id="S4.T3.3.3.3"> <td class="ltx_td ltx_align_left" id="S4.T3.3.3.3.1">c4ai-command-r-08-2024 (32B)<sup class="ltx_sup" id="S4.T3.3.3.3.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.2">0.529</td> <td class="ltx_td ltx_align_center" 
id="S4.T3.3.3.3.3">0.835</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.4">0.530</td> <td class="ltx_td ltx_align_top" id="S4.T3.3.3.3.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.6">0.659</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.7">0.552</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.8">0.455</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.9">0.490</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.10">0.650</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.11">0.594</td> <td class="ltx_td ltx_align_top" id="S4.T3.3.3.3.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.13"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.3.3.3.13.1">0.729</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.14"><span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.3.3.3.14.1">0.699</span></td> <td class="ltx_td ltx_align_top" id="S4.T3.3.3.3.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.16">0.611</td> </tr> <tr class="ltx_tr" id="S4.T3.4.4.4"> <td class="ltx_td ltx_align_left" id="S4.T3.4.4.4.1">c4ai-command-r-v01 (35B)<sup class="ltx_sup" id="S4.T3.4.4.4.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.2">0.495</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.3">0.816</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.4">0.422</td> <td class="ltx_td ltx_align_top" id="S4.T3.4.4.4.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.6">0.614</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.7">0.511</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.8">0.448</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.9">0.461</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.10">0.684</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.11">0.635</td> <td class="ltx_td ltx_align_top" id="S4.T3.4.4.4.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.13">0.721</td> <td 
class="ltx_td ltx_align_center" id="S4.T3.4.4.4.14">0.686</td> <td class="ltx_td ltx_align_top" id="S4.T3.4.4.4.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.16">0.584</td> </tr> <tr class="ltx_tr" id="S4.T3.5.5.5"> <td class="ltx_td ltx_align_left" id="S4.T3.5.5.5.1">Qwen2.5-72B<sup class="ltx_sup" id="S4.T3.5.5.5.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.2">0.402</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.3">0.706</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.4">0.414</td> <td class="ltx_td ltx_align_top" id="S4.T3.5.5.5.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.6">0.557</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.7">0.510</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.8">0.413</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.9">0.449</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.10">0.626</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.11">0.665</td> <td class="ltx_td ltx_align_top" id="S4.T3.5.5.5.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.13">0.726</td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.14">0.586</td> <td class="ltx_td ltx_align_top" id="S4.T3.5.5.5.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.16">0.542</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.18"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.18.1">Meta-Llama-3.1-70B</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.2">0.337</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.3">0.335</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.4">0.114</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.18.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.6">0.624</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.7">0.282</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.8">0.489</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.9">0.538</td> <td 
class="ltx_td ltx_align_center" id="S4.T3.10.10.18.10">0.706</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.11">0.702</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.18.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.13">0.708</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.14">0.380</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.18.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.18.16">0.436</td> </tr> <tr class="ltx_tr" id="S4.T3.6.6.6"> <td class="ltx_td ltx_align_left" id="S4.T3.6.6.6.1">Qwen2.5-32B<sup class="ltx_sup" id="S4.T3.6.6.6.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.2">0.347</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.3">0.541</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.4">0.266</td> <td class="ltx_td ltx_align_top" id="S4.T3.6.6.6.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.6">0.238</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.7">0.286</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.8">0.208</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.9">0.275</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.10">0.318</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.11">0.283</td> <td class="ltx_td ltx_align_top" id="S4.T3.6.6.6.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.13"><span class="ltx_text ltx_font_bold" id="S4.T3.6.6.6.13.1">0.732</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.14">0.417</td> <td class="ltx_td ltx_align_top" id="S4.T3.6.6.6.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.16">0.377</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.19"> <td class="ltx_td ltx_align_center ltx_border_t" colspan="16" id="S4.T3.10.10.19.1"><span class="ltx_text ltx_font_italic" id="S4.T3.10.10.19.1.1">Models with 7-20B Parameters</span></td> </tr> <tr class="ltx_tr" id="S4.T3.7.7.7"> <td class="ltx_td ltx_align_left" 
id="S4.T3.7.7.7.1">Qwen2.5-14B-Instruct<sup class="ltx_sup" id="S4.T3.7.7.7.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.2">0.593</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.3">0.754</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.4">0.634</td> <td class="ltx_td ltx_align_top" id="S4.T3.7.7.7.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.6">0.493</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.7">0.520</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.8">0.408</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.9">0.330</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.10">0.665</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.11">0.548</td> <td class="ltx_td ltx_align_top" id="S4.T3.7.7.7.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.13">0.591</td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.14">0.362</td> <td class="ltx_td ltx_align_top" id="S4.T3.7.7.7.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.16">0.524</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.20"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.20.1">internlm2_5-7b-chat-1m</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.2">0.446</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.3">0.810</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.4">0.384</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.20.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.6">0.568</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.7">0.345</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.8">0.507</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.9">0.600</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.10">0.751</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.11">0.783</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.20.12"></td> <td class="ltx_td ltx_align_center" 
id="S4.T3.10.10.20.13">0.619</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.14">0.479</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.20.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.20.16">0.523</td> </tr> <tr class="ltx_tr" id="S4.T3.8.8.8"> <td class="ltx_td ltx_align_left" id="S4.T3.8.8.8.1">Qwen2.5-7B-Instruct<sup class="ltx_sup" id="S4.T3.8.8.8.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.2">0.507</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.3">0.817</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.4">0.457</td> <td class="ltx_td ltx_align_top" id="S4.T3.8.8.8.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.6">0.593</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.7">0.495</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.8">0.405</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.9">0.493</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.10">0.687</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.11">0.722</td> <td class="ltx_td ltx_align_top" id="S4.T3.8.8.8.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.13">0.445</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.14">0.460</td> <td class="ltx_td ltx_align_top" id="S4.T3.8.8.8.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.16">0.515</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.21"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.21.1">glm-4-9b-chat-1m</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.2">0.484</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.3">0.818</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.4">0.277</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.21.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.6">0.681</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.7">0.472</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.8">0.342</td> <td class="ltx_td 
ltx_align_center" id="S4.T3.10.10.21.9">0.474</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.10">0.679</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.11">0.683</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.21.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.13">0.688</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.14">0.317</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.21.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.21.16">0.492</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.22"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.22.1">Meta-Llama-3.1-8B-Instruct</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.2">0.537</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.3">0.667</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.4">0.390</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.22.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.6">0.668</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.7">0.446</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.8">0.352</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.9">0.494</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.10">0.662</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.11">0.638</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.22.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.13">0.522</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.14">0.367</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.22.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.22.16">0.488</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.23"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.23.1">LWM-Text-Chat-1M (7B)</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.2">0.413</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.3">0.708</td> <td class="ltx_td ltx_align_center" 
id="S4.T3.10.10.23.4">0.074</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.23.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.6">0.579</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.7">0.256</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.8">0.290</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.9">0.570</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.10">0.608</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.11">0.620</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.23.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.13">0.128</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.14">0.688</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.23.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.23.16">0.409</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.24"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.24.1">Meta-Llama-3.1-8B</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.2">0.347</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.3">0.345</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.4">0.042</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.24.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.6">0.556</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.7">0.152</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.8">0.390</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.9">0.515</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.10">0.635</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.11">0.647</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.24.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.13">0.471</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.14">0.572</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.24.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.24.16">0.399</td> </tr> <tr 
class="ltx_tr" id="S4.T3.9.9.9"> <td class="ltx_td ltx_align_left" id="S4.T3.9.9.9.1">Qwen2.5-14B<sup class="ltx_sup" id="S4.T3.9.9.9.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.2">0.273</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.3">0.560</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.4">0.243</td> <td class="ltx_td ltx_align_top" id="S4.T3.9.9.9.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.6">0.313</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.7">0.272</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.8">0.251</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.9">0.255</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.10">0.339</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.11">0.272</td> <td class="ltx_td ltx_align_top" id="S4.T3.9.9.9.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.13">0.697</td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.14">0.451</td> <td class="ltx_td ltx_align_top" id="S4.T3.9.9.9.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.16">0.371</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.25"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.25.1">LWM-Text-1M (7B)</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.2">0.307</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.3">0.302</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.4">0.052</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.25.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.6">0.141</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.7">0.103</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.8">0.094</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.9">0.262</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.10">0.398</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.11">0.441</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.25.12"></td> <td 
class="ltx_td ltx_align_center" id="S4.T3.10.10.25.13">0.112</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.14">0.295</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.25.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.25.16">0.212</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.10"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.10.1">Qwen2.5-7B<sup class="ltx_sup" id="S4.T3.10.10.10.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.2">0.268</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.3">0.497</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.4">0.220</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.10.5"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.6">0.085</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.7">0.086</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.8">0.055</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.9">0.080</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.10">0.109</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.11">0.140</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.10.12"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.13">0.244</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.14">0.213</td> <td class="ltx_td ltx_align_top" id="S4.T3.10.10.10.15"></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.16">0.187</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.26"> <td class="ltx_td ltx_align_left" id="S4.T3.10.10.26.1"><span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span></td> <td class="ltx_td" id="S4.T3.10.10.26.2"></td> <td class="ltx_td" id="S4.T3.10.10.26.3"></td> <td class="ltx_td" id="S4.T3.10.10.26.4"></td> <td class="ltx_td ltx_align_justify ltx_align_top" id="S4.T3.10.10.26.5"></td> <td class="ltx_td" id="S4.T3.10.10.26.6"></td> <td class="ltx_td" id="S4.T3.10.10.26.7"></td> <td class="ltx_td" 
id="S4.T3.10.10.26.8"></td> <td class="ltx_td" id="S4.T3.10.10.26.9"></td> <td class="ltx_td" id="S4.T3.10.10.26.10"></td> <td class="ltx_td" id="S4.T3.10.10.26.11"></td> <td class="ltx_td ltx_align_justify ltx_align_top" id="S4.T3.10.10.26.12"></td> <td class="ltx_td" id="S4.T3.10.10.26.13"></td> <td class="ltx_td" id="S4.T3.10.10.26.14"></td> <td class="ltx_td ltx_align_justify ltx_align_top" id="S4.T3.10.10.26.15"></td> <td class="ltx_td" id="S4.T3.10.10.26.16"></td> </tr> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 3: </span>ARS scores of the models on different tasks. Task abbreviations are listed in Table <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S3.T2" title="Table 2 ‣ 3.2 Data Collection ‣ 3 LIFBench ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">2</span></a>. The overall score is calculated by Eq. <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S4.E3" title="In 4.1 Automated Rubric-based Scoring ‣ 4 LIFEval ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">3</span></a>. <span class="ltx_text ltx_font_bold" id="S4.T3.17.1">Bold</span> and <span class="ltx_text ltx_framed ltx_framed_underline" id="S4.T3.18.2">underlined</span> values denote the first- and second-ranked scores, respectively. 
<math alttext="\dagger" class="ltx_Math" display="inline" id="S4.T3.13.m1.1"><semantics id="S4.T3.13.m1.1b"><mo id="S4.T3.13.m1.1.1" xref="S4.T3.13.m1.1.1.cmml">†</mo><annotation-xml encoding="MathML-Content" id="S4.T3.13.m1.1c"><ci id="S4.T3.13.m1.1.1.cmml" xref="S4.T3.13.m1.1.1">†</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.13.m1.1d">\dagger</annotation><annotation encoding="application/x-llamapun" id="S4.T3.13.m1.1e">†</annotation></semantics></math> indicates that the input <math alttext="X" class="ltx_Math" display="inline" id="S4.T3.14.m2.1"><semantics id="S4.T3.14.m2.1b"><mi id="S4.T3.14.m2.1.1" xref="S4.T3.14.m2.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S4.T3.14.m2.1c"><ci id="S4.T3.14.m2.1.1.cmml" xref="S4.T3.14.m2.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.14.m2.1d">X</annotation><annotation encoding="application/x-llamapun" id="S4.T3.14.m2.1e">italic_X</annotation></semantics></math> on the longest interval is right-truncated.</figcaption> </figure> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2 </span>Score-Capability Mapping</h3> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.1">To further analyze the instruction-following abilities of LLMs, we introduce Score-Capability Mapping. This mechanism maps the individual scoring points in the rubric onto six fundamental instruction-following capabilities. Associating each task’s assessment perspectives with specific capabilities translates the ARS score into meaningful insights into the model’s strengths and weaknesses across different aspects of instruction following. With reference to previous studies <cite class="ltx_cite ltx_citemacro_cite">He et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib16" title="">2024</a>); Zhou et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib55" title="">2023b</a>)</cite>, we define six core capabilities as follows:</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p2"> <p class="ltx_p" id="S4.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p2.1.1">Ori</span> (Original Content): Measures the model’s ability to reproduce the original input accurately.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p3"> <p class="ltx_p" id="S4.SS2.p3.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p3.1.1">Num</span> (Numerical Ability): Assesses the handling of numerical data, such as recognition, counting, and basic arithmetic.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p4"> <p class="ltx_p" id="S4.SS2.p4.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p4.1.1">Spat</span> (Spatial Awareness): Evaluates the understanding of spatial relationships and sequences.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p5"> <p class="ltx_p" id="S4.SS2.p5.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p5.1.1">Fmt</span> (Format): Reflects proficiency in modifying and structuring content according to format rules.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p6"> <p class="ltx_p" id="S4.SS2.p6.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p6.1.1">Logit</span> (Logic Execution): Tests the model’s ability to follow logical conditions and decision branches.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p7"> <p class="ltx_p" id="S4.SS2.p7.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p7.1.1">Recog</span> (Recognition Ability): Captures the capacity to distinguish and focus on key elements of the input.</p> </div> <div class="ltx_para" id="S4.SS2.p8"> <p class="ltx_p" id="S4.SS2.p8.1">To quantitatively assess the performance of LLMs across these six capabilities, we derive a score for each capability from the ARS results. 
The Instruction Following Performance (IFP) for a specific capability <math alttext="c" class="ltx_Math" display="inline" id="S4.SS2.p8.1.m1.1"><semantics id="S4.SS2.p8.1.m1.1a"><mi id="S4.SS2.p8.1.m1.1.1" xref="S4.SS2.p8.1.m1.1.1.cmml">c</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p8.1.m1.1b"><ci id="S4.SS2.p8.1.m1.1.1.cmml" xref="S4.SS2.p8.1.m1.1.1">𝑐</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p8.1.m1.1c">c</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p8.1.m1.1d">italic_c</annotation></semantics></math> can be expressed mathematically as follows:</p> </div> <div class="ltx_para" id="S4.SS2.p9"> <table class="ltx_equation ltx_eqn_table" id="S4.E4"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="IFP_{c}=\frac{\sum_{t\in T}\sum_{s\in R_{t}}\mathbb{I}(s,c)\cdot f_{s}(A)}{% \sum_{t\in T}\sum_{s\in R_{t}}\mathbb{I}(s,c)\cdot\tilde{s}}" class="ltx_Math" display="block" id="S4.E4.m1.5"><semantics id="S4.E4.m1.5a"><mrow id="S4.E4.m1.5.6" xref="S4.E4.m1.5.6.cmml"><mrow id="S4.E4.m1.5.6.2" xref="S4.E4.m1.5.6.2.cmml"><mi id="S4.E4.m1.5.6.2.2" xref="S4.E4.m1.5.6.2.2.cmml">I</mi><mo id="S4.E4.m1.5.6.2.1" xref="S4.E4.m1.5.6.2.1.cmml"></mo><mi id="S4.E4.m1.5.6.2.3" xref="S4.E4.m1.5.6.2.3.cmml">F</mi><mo id="S4.E4.m1.5.6.2.1a" xref="S4.E4.m1.5.6.2.1.cmml"></mo><msub id="S4.E4.m1.5.6.2.4" xref="S4.E4.m1.5.6.2.4.cmml"><mi id="S4.E4.m1.5.6.2.4.2" xref="S4.E4.m1.5.6.2.4.2.cmml">P</mi><mi id="S4.E4.m1.5.6.2.4.3" xref="S4.E4.m1.5.6.2.4.3.cmml">c</mi></msub></mrow><mo id="S4.E4.m1.5.6.1" xref="S4.E4.m1.5.6.1.cmml">=</mo><mfrac id="S4.E4.m1.5.5" xref="S4.E4.m1.5.5.cmml"><mrow id="S4.E4.m1.3.3.3" xref="S4.E4.m1.3.3.3.cmml"><msub id="S4.E4.m1.3.3.3.4" xref="S4.E4.m1.3.3.3.4.cmml"><mo id="S4.E4.m1.3.3.3.4.2" xref="S4.E4.m1.3.3.3.4.2.cmml">∑</mo><mrow id="S4.E4.m1.3.3.3.4.3" 
xref="S4.E4.m1.3.3.3.4.3.cmml"><mi id="S4.E4.m1.3.3.3.4.3.2" xref="S4.E4.m1.3.3.3.4.3.2.cmml">t</mi><mo id="S4.E4.m1.3.3.3.4.3.1" xref="S4.E4.m1.3.3.3.4.3.1.cmml">∈</mo><mi id="S4.E4.m1.3.3.3.4.3.3" xref="S4.E4.m1.3.3.3.4.3.3.cmml">T</mi></mrow></msub><mrow id="S4.E4.m1.3.3.3.5" xref="S4.E4.m1.3.3.3.5.cmml"><msub id="S4.E4.m1.3.3.3.5.1" xref="S4.E4.m1.3.3.3.5.1.cmml"><mo id="S4.E4.m1.3.3.3.5.1.2" lspace="0.167em" xref="S4.E4.m1.3.3.3.5.1.2.cmml">∑</mo><mrow id="S4.E4.m1.3.3.3.5.1.3" xref="S4.E4.m1.3.3.3.5.1.3.cmml"><mi id="S4.E4.m1.3.3.3.5.1.3.2" xref="S4.E4.m1.3.3.3.5.1.3.2.cmml">s</mi><mo id="S4.E4.m1.3.3.3.5.1.3.1" xref="S4.E4.m1.3.3.3.5.1.3.1.cmml">∈</mo><msub id="S4.E4.m1.3.3.3.5.1.3.3" xref="S4.E4.m1.3.3.3.5.1.3.3.cmml"><mi id="S4.E4.m1.3.3.3.5.1.3.3.2" xref="S4.E4.m1.3.3.3.5.1.3.3.2.cmml">R</mi><mi id="S4.E4.m1.3.3.3.5.1.3.3.3" xref="S4.E4.m1.3.3.3.5.1.3.3.3.cmml">t</mi></msub></mrow></msub><mrow id="S4.E4.m1.3.3.3.5.2" xref="S4.E4.m1.3.3.3.5.2.cmml"><mrow id="S4.E4.m1.3.3.3.5.2.2" xref="S4.E4.m1.3.3.3.5.2.2.cmml"><mrow id="S4.E4.m1.3.3.3.5.2.2.2" xref="S4.E4.m1.3.3.3.5.2.2.2.cmml"><mi id="S4.E4.m1.3.3.3.5.2.2.2.2" xref="S4.E4.m1.3.3.3.5.2.2.2.2.cmml">𝕀</mi><mo id="S4.E4.m1.3.3.3.5.2.2.2.1" xref="S4.E4.m1.3.3.3.5.2.2.2.1.cmml"></mo><mrow id="S4.E4.m1.3.3.3.5.2.2.2.3.2" xref="S4.E4.m1.3.3.3.5.2.2.2.3.1.cmml"><mo id="S4.E4.m1.3.3.3.5.2.2.2.3.2.1" stretchy="false" xref="S4.E4.m1.3.3.3.5.2.2.2.3.1.cmml">(</mo><mi id="S4.E4.m1.1.1.1.1" xref="S4.E4.m1.1.1.1.1.cmml">s</mi><mo id="S4.E4.m1.3.3.3.5.2.2.2.3.2.2" xref="S4.E4.m1.3.3.3.5.2.2.2.3.1.cmml">,</mo><mi id="S4.E4.m1.2.2.2.2" xref="S4.E4.m1.2.2.2.2.cmml">c</mi><mo id="S4.E4.m1.3.3.3.5.2.2.2.3.2.3" rspace="0.055em" stretchy="false" xref="S4.E4.m1.3.3.3.5.2.2.2.3.1.cmml">)</mo></mrow></mrow><mo id="S4.E4.m1.3.3.3.5.2.2.1" rspace="0.222em" xref="S4.E4.m1.3.3.3.5.2.2.1.cmml">⋅</mo><msub id="S4.E4.m1.3.3.3.5.2.2.3" xref="S4.E4.m1.3.3.3.5.2.2.3.cmml"><mi id="S4.E4.m1.3.3.3.5.2.2.3.2" 
xref="S4.E4.m1.3.3.3.5.2.2.3.2.cmml">f</mi><mi id="S4.E4.m1.3.3.3.5.2.2.3.3" xref="S4.E4.m1.3.3.3.5.2.2.3.3.cmml">s</mi></msub></mrow><mo id="S4.E4.m1.3.3.3.5.2.1" xref="S4.E4.m1.3.3.3.5.2.1.cmml"></mo><mrow id="S4.E4.m1.3.3.3.5.2.3.2" xref="S4.E4.m1.3.3.3.5.2.cmml"><mo id="S4.E4.m1.3.3.3.5.2.3.2.1" stretchy="false" xref="S4.E4.m1.3.3.3.5.2.cmml">(</mo><mi id="S4.E4.m1.3.3.3.3" xref="S4.E4.m1.3.3.3.3.cmml">A</mi><mo id="S4.E4.m1.3.3.3.5.2.3.2.2" stretchy="false" xref="S4.E4.m1.3.3.3.5.2.cmml">)</mo></mrow></mrow></mrow></mrow><mrow id="S4.E4.m1.5.5.5" xref="S4.E4.m1.5.5.5.cmml"><msub id="S4.E4.m1.5.5.5.3" xref="S4.E4.m1.5.5.5.3.cmml"><mo id="S4.E4.m1.5.5.5.3.2" xref="S4.E4.m1.5.5.5.3.2.cmml">∑</mo><mrow id="S4.E4.m1.5.5.5.3.3" xref="S4.E4.m1.5.5.5.3.3.cmml"><mi id="S4.E4.m1.5.5.5.3.3.2" xref="S4.E4.m1.5.5.5.3.3.2.cmml">t</mi><mo id="S4.E4.m1.5.5.5.3.3.1" xref="S4.E4.m1.5.5.5.3.3.1.cmml">∈</mo><mi id="S4.E4.m1.5.5.5.3.3.3" xref="S4.E4.m1.5.5.5.3.3.3.cmml">T</mi></mrow></msub><mrow id="S4.E4.m1.5.5.5.4" xref="S4.E4.m1.5.5.5.4.cmml"><msub id="S4.E4.m1.5.5.5.4.1" xref="S4.E4.m1.5.5.5.4.1.cmml"><mo id="S4.E4.m1.5.5.5.4.1.2" lspace="0.167em" xref="S4.E4.m1.5.5.5.4.1.2.cmml">∑</mo><mrow id="S4.E4.m1.5.5.5.4.1.3" xref="S4.E4.m1.5.5.5.4.1.3.cmml"><mi id="S4.E4.m1.5.5.5.4.1.3.2" xref="S4.E4.m1.5.5.5.4.1.3.2.cmml">s</mi><mo id="S4.E4.m1.5.5.5.4.1.3.1" xref="S4.E4.m1.5.5.5.4.1.3.1.cmml">∈</mo><msub id="S4.E4.m1.5.5.5.4.1.3.3" xref="S4.E4.m1.5.5.5.4.1.3.3.cmml"><mi id="S4.E4.m1.5.5.5.4.1.3.3.2" xref="S4.E4.m1.5.5.5.4.1.3.3.2.cmml">R</mi><mi id="S4.E4.m1.5.5.5.4.1.3.3.3" xref="S4.E4.m1.5.5.5.4.1.3.3.3.cmml">t</mi></msub></mrow></msub><mrow id="S4.E4.m1.5.5.5.4.2" xref="S4.E4.m1.5.5.5.4.2.cmml"><mrow id="S4.E4.m1.5.5.5.4.2.2" xref="S4.E4.m1.5.5.5.4.2.2.cmml"><mi id="S4.E4.m1.5.5.5.4.2.2.2" xref="S4.E4.m1.5.5.5.4.2.2.2.cmml">𝕀</mi><mo id="S4.E4.m1.5.5.5.4.2.2.1" xref="S4.E4.m1.5.5.5.4.2.2.1.cmml"></mo><mrow id="S4.E4.m1.5.5.5.4.2.2.3.2" xref="S4.E4.m1.5.5.5.4.2.2.3.1.cmml"><mo 
id="S4.E4.m1.5.5.5.4.2.2.3.2.1" stretchy="false" xref="S4.E4.m1.5.5.5.4.2.2.3.1.cmml">(</mo><mi id="S4.E4.m1.4.4.4.1" xref="S4.E4.m1.4.4.4.1.cmml">s</mi><mo id="S4.E4.m1.5.5.5.4.2.2.3.2.2" xref="S4.E4.m1.5.5.5.4.2.2.3.1.cmml">,</mo><mi id="S4.E4.m1.5.5.5.2" xref="S4.E4.m1.5.5.5.2.cmml">c</mi><mo id="S4.E4.m1.5.5.5.4.2.2.3.2.3" rspace="0.055em" stretchy="false" xref="S4.E4.m1.5.5.5.4.2.2.3.1.cmml">)</mo></mrow></mrow><mo id="S4.E4.m1.5.5.5.4.2.1" rspace="0.222em" xref="S4.E4.m1.5.5.5.4.2.1.cmml">⋅</mo><mover accent="true" id="S4.E4.m1.5.5.5.4.2.3" xref="S4.E4.m1.5.5.5.4.2.3.cmml"><mi id="S4.E4.m1.5.5.5.4.2.3.2" xref="S4.E4.m1.5.5.5.4.2.3.2.cmml">s</mi><mo id="S4.E4.m1.5.5.5.4.2.3.1" xref="S4.E4.m1.5.5.5.4.2.3.1.cmml">~</mo></mover></mrow></mrow></mrow></mfrac></mrow><annotation-xml encoding="MathML-Content" id="S4.E4.m1.5b"><apply id="S4.E4.m1.5.6.cmml" xref="S4.E4.m1.5.6"><eq id="S4.E4.m1.5.6.1.cmml" xref="S4.E4.m1.5.6.1"></eq><apply id="S4.E4.m1.5.6.2.cmml" xref="S4.E4.m1.5.6.2"><times id="S4.E4.m1.5.6.2.1.cmml" xref="S4.E4.m1.5.6.2.1"></times><ci id="S4.E4.m1.5.6.2.2.cmml" xref="S4.E4.m1.5.6.2.2">𝐼</ci><ci id="S4.E4.m1.5.6.2.3.cmml" xref="S4.E4.m1.5.6.2.3">𝐹</ci><apply id="S4.E4.m1.5.6.2.4.cmml" xref="S4.E4.m1.5.6.2.4"><csymbol cd="ambiguous" id="S4.E4.m1.5.6.2.4.1.cmml" xref="S4.E4.m1.5.6.2.4">subscript</csymbol><ci id="S4.E4.m1.5.6.2.4.2.cmml" xref="S4.E4.m1.5.6.2.4.2">𝑃</ci><ci id="S4.E4.m1.5.6.2.4.3.cmml" xref="S4.E4.m1.5.6.2.4.3">𝑐</ci></apply></apply><apply id="S4.E4.m1.5.5.cmml" xref="S4.E4.m1.5.5"><divide id="S4.E4.m1.5.5.6.cmml" xref="S4.E4.m1.5.5"></divide><apply id="S4.E4.m1.3.3.3.cmml" xref="S4.E4.m1.3.3.3"><apply id="S4.E4.m1.3.3.3.4.cmml" xref="S4.E4.m1.3.3.3.4"><csymbol cd="ambiguous" id="S4.E4.m1.3.3.3.4.1.cmml" xref="S4.E4.m1.3.3.3.4">subscript</csymbol><sum id="S4.E4.m1.3.3.3.4.2.cmml" xref="S4.E4.m1.3.3.3.4.2"></sum><apply id="S4.E4.m1.3.3.3.4.3.cmml" xref="S4.E4.m1.3.3.3.4.3"><in id="S4.E4.m1.3.3.3.4.3.1.cmml" 
xref="S4.E4.m1.3.3.3.4.3.1"></in><ci id="S4.E4.m1.3.3.3.4.3.2.cmml" xref="S4.E4.m1.3.3.3.4.3.2">𝑡</ci><ci id="S4.E4.m1.3.3.3.4.3.3.cmml" xref="S4.E4.m1.3.3.3.4.3.3">𝑇</ci></apply></apply><apply id="S4.E4.m1.3.3.3.5.cmml" xref="S4.E4.m1.3.3.3.5"><apply id="S4.E4.m1.3.3.3.5.1.cmml" xref="S4.E4.m1.3.3.3.5.1"><csymbol cd="ambiguous" id="S4.E4.m1.3.3.3.5.1.1.cmml" xref="S4.E4.m1.3.3.3.5.1">subscript</csymbol><sum id="S4.E4.m1.3.3.3.5.1.2.cmml" xref="S4.E4.m1.3.3.3.5.1.2"></sum><apply id="S4.E4.m1.3.3.3.5.1.3.cmml" xref="S4.E4.m1.3.3.3.5.1.3"><in id="S4.E4.m1.3.3.3.5.1.3.1.cmml" xref="S4.E4.m1.3.3.3.5.1.3.1"></in><ci id="S4.E4.m1.3.3.3.5.1.3.2.cmml" xref="S4.E4.m1.3.3.3.5.1.3.2">𝑠</ci><apply id="S4.E4.m1.3.3.3.5.1.3.3.cmml" xref="S4.E4.m1.3.3.3.5.1.3.3"><csymbol cd="ambiguous" id="S4.E4.m1.3.3.3.5.1.3.3.1.cmml" xref="S4.E4.m1.3.3.3.5.1.3.3">subscript</csymbol><ci id="S4.E4.m1.3.3.3.5.1.3.3.2.cmml" xref="S4.E4.m1.3.3.3.5.1.3.3.2">𝑅</ci><ci id="S4.E4.m1.3.3.3.5.1.3.3.3.cmml" xref="S4.E4.m1.3.3.3.5.1.3.3.3">𝑡</ci></apply></apply></apply><apply id="S4.E4.m1.3.3.3.5.2.cmml" xref="S4.E4.m1.3.3.3.5.2"><times id="S4.E4.m1.3.3.3.5.2.1.cmml" xref="S4.E4.m1.3.3.3.5.2.1"></times><apply id="S4.E4.m1.3.3.3.5.2.2.cmml" xref="S4.E4.m1.3.3.3.5.2.2"><ci id="S4.E4.m1.3.3.3.5.2.2.1.cmml" xref="S4.E4.m1.3.3.3.5.2.2.1">⋅</ci><apply id="S4.E4.m1.3.3.3.5.2.2.2.cmml" xref="S4.E4.m1.3.3.3.5.2.2.2"><times id="S4.E4.m1.3.3.3.5.2.2.2.1.cmml" xref="S4.E4.m1.3.3.3.5.2.2.2.1"></times><ci id="S4.E4.m1.3.3.3.5.2.2.2.2.cmml" xref="S4.E4.m1.3.3.3.5.2.2.2.2">𝕀</ci><interval closure="open" id="S4.E4.m1.3.3.3.5.2.2.2.3.1.cmml" xref="S4.E4.m1.3.3.3.5.2.2.2.3.2"><ci id="S4.E4.m1.1.1.1.1.cmml" xref="S4.E4.m1.1.1.1.1">𝑠</ci><ci id="S4.E4.m1.2.2.2.2.cmml" xref="S4.E4.m1.2.2.2.2">𝑐</ci></interval></apply><apply id="S4.E4.m1.3.3.3.5.2.2.3.cmml" xref="S4.E4.m1.3.3.3.5.2.2.3"><csymbol cd="ambiguous" id="S4.E4.m1.3.3.3.5.2.2.3.1.cmml" xref="S4.E4.m1.3.3.3.5.2.2.3">subscript</csymbol><ci 
id="S4.E4.m1.3.3.3.5.2.2.3.2.cmml" xref="S4.E4.m1.3.3.3.5.2.2.3.2">𝑓</ci><ci id="S4.E4.m1.3.3.3.5.2.2.3.3.cmml" xref="S4.E4.m1.3.3.3.5.2.2.3.3">𝑠</ci></apply></apply><ci id="S4.E4.m1.3.3.3.3.cmml" xref="S4.E4.m1.3.3.3.3">𝐴</ci></apply></apply></apply><apply id="S4.E4.m1.5.5.5.cmml" xref="S4.E4.m1.5.5.5"><apply id="S4.E4.m1.5.5.5.3.cmml" xref="S4.E4.m1.5.5.5.3"><csymbol cd="ambiguous" id="S4.E4.m1.5.5.5.3.1.cmml" xref="S4.E4.m1.5.5.5.3">subscript</csymbol><sum id="S4.E4.m1.5.5.5.3.2.cmml" xref="S4.E4.m1.5.5.5.3.2"></sum><apply id="S4.E4.m1.5.5.5.3.3.cmml" xref="S4.E4.m1.5.5.5.3.3"><in id="S4.E4.m1.5.5.5.3.3.1.cmml" xref="S4.E4.m1.5.5.5.3.3.1"></in><ci id="S4.E4.m1.5.5.5.3.3.2.cmml" xref="S4.E4.m1.5.5.5.3.3.2">𝑡</ci><ci id="S4.E4.m1.5.5.5.3.3.3.cmml" xref="S4.E4.m1.5.5.5.3.3.3">𝑇</ci></apply></apply><apply id="S4.E4.m1.5.5.5.4.cmml" xref="S4.E4.m1.5.5.5.4"><apply id="S4.E4.m1.5.5.5.4.1.cmml" xref="S4.E4.m1.5.5.5.4.1"><csymbol cd="ambiguous" id="S4.E4.m1.5.5.5.4.1.1.cmml" xref="S4.E4.m1.5.5.5.4.1">subscript</csymbol><sum id="S4.E4.m1.5.5.5.4.1.2.cmml" xref="S4.E4.m1.5.5.5.4.1.2"></sum><apply id="S4.E4.m1.5.5.5.4.1.3.cmml" xref="S4.E4.m1.5.5.5.4.1.3"><in id="S4.E4.m1.5.5.5.4.1.3.1.cmml" xref="S4.E4.m1.5.5.5.4.1.3.1"></in><ci id="S4.E4.m1.5.5.5.4.1.3.2.cmml" xref="S4.E4.m1.5.5.5.4.1.3.2">𝑠</ci><apply id="S4.E4.m1.5.5.5.4.1.3.3.cmml" xref="S4.E4.m1.5.5.5.4.1.3.3"><csymbol cd="ambiguous" id="S4.E4.m1.5.5.5.4.1.3.3.1.cmml" xref="S4.E4.m1.5.5.5.4.1.3.3">subscript</csymbol><ci id="S4.E4.m1.5.5.5.4.1.3.3.2.cmml" xref="S4.E4.m1.5.5.5.4.1.3.3.2">𝑅</ci><ci id="S4.E4.m1.5.5.5.4.1.3.3.3.cmml" xref="S4.E4.m1.5.5.5.4.1.3.3.3">𝑡</ci></apply></apply></apply><apply id="S4.E4.m1.5.5.5.4.2.cmml" xref="S4.E4.m1.5.5.5.4.2"><ci id="S4.E4.m1.5.5.5.4.2.1.cmml" xref="S4.E4.m1.5.5.5.4.2.1">⋅</ci><apply id="S4.E4.m1.5.5.5.4.2.2.cmml" xref="S4.E4.m1.5.5.5.4.2.2"><times id="S4.E4.m1.5.5.5.4.2.2.1.cmml" xref="S4.E4.m1.5.5.5.4.2.2.1"></times><ci id="S4.E4.m1.5.5.5.4.2.2.2.cmml" 
xref="S4.E4.m1.5.5.5.4.2.2.2">𝕀</ci><interval closure="open" id="S4.E4.m1.5.5.5.4.2.2.3.1.cmml" xref="S4.E4.m1.5.5.5.4.2.2.3.2"><ci id="S4.E4.m1.4.4.4.1.cmml" xref="S4.E4.m1.4.4.4.1">𝑠</ci><ci id="S4.E4.m1.5.5.5.2.cmml" xref="S4.E4.m1.5.5.5.2">𝑐</ci></interval></apply><apply id="S4.E4.m1.5.5.5.4.2.3.cmml" xref="S4.E4.m1.5.5.5.4.2.3"><ci id="S4.E4.m1.5.5.5.4.2.3.1.cmml" xref="S4.E4.m1.5.5.5.4.2.3.1">~</ci><ci id="S4.E4.m1.5.5.5.4.2.3.2.cmml" xref="S4.E4.m1.5.5.5.4.2.3.2">𝑠</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E4.m1.5c">IFP_{c}=\frac{\sum_{t\in T}\sum_{s\in R_{t}}\mathbb{I}(s,c)\cdot f_{s}(A)}{% \sum_{t\in T}\sum_{s\in R_{t}}\mathbb{I}(s,c)\cdot\tilde{s}}</annotation><annotation encoding="application/x-llamapun" id="S4.E4.m1.5d">italic_I italic_F italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_t ∈ italic_T end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_s ∈ italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_I ( italic_s , italic_c ) ⋅ italic_f start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_A ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_t ∈ italic_T end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_s ∈ italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_I ( italic_s , italic_c ) ⋅ over~ start_ARG italic_s end_ARG end_ARG</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS2.p9.6">where <math alttext="\mathbb{I}(s,c)" class="ltx_Math" display="inline" id="S4.SS2.p9.1.m1.2"><semantics id="S4.SS2.p9.1.m1.2a"><mrow id="S4.SS2.p9.1.m1.2.3" xref="S4.SS2.p9.1.m1.2.3.cmml"><mi id="S4.SS2.p9.1.m1.2.3.2" 
xref="S4.SS2.p9.1.m1.2.3.2.cmml">𝕀</mi><mo id="S4.SS2.p9.1.m1.2.3.1" xref="S4.SS2.p9.1.m1.2.3.1.cmml"></mo><mrow id="S4.SS2.p9.1.m1.2.3.3.2" xref="S4.SS2.p9.1.m1.2.3.3.1.cmml"><mo id="S4.SS2.p9.1.m1.2.3.3.2.1" stretchy="false" xref="S4.SS2.p9.1.m1.2.3.3.1.cmml">(</mo><mi id="S4.SS2.p9.1.m1.1.1" xref="S4.SS2.p9.1.m1.1.1.cmml">s</mi><mo id="S4.SS2.p9.1.m1.2.3.3.2.2" xref="S4.SS2.p9.1.m1.2.3.3.1.cmml">,</mo><mi id="S4.SS2.p9.1.m1.2.2" xref="S4.SS2.p9.1.m1.2.2.cmml">c</mi><mo id="S4.SS2.p9.1.m1.2.3.3.2.3" stretchy="false" xref="S4.SS2.p9.1.m1.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p9.1.m1.2b"><apply id="S4.SS2.p9.1.m1.2.3.cmml" xref="S4.SS2.p9.1.m1.2.3"><times id="S4.SS2.p9.1.m1.2.3.1.cmml" xref="S4.SS2.p9.1.m1.2.3.1"></times><ci id="S4.SS2.p9.1.m1.2.3.2.cmml" xref="S4.SS2.p9.1.m1.2.3.2">𝕀</ci><interval closure="open" id="S4.SS2.p9.1.m1.2.3.3.1.cmml" xref="S4.SS2.p9.1.m1.2.3.3.2"><ci id="S4.SS2.p9.1.m1.1.1.cmml" xref="S4.SS2.p9.1.m1.1.1">𝑠</ci><ci id="S4.SS2.p9.1.m1.2.2.cmml" xref="S4.SS2.p9.1.m1.2.2">𝑐</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p9.1.m1.2c">\mathbb{I}(s,c)</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p9.1.m1.2d">blackboard_I ( italic_s , italic_c )</annotation></semantics></math> is the indicator function that identifies whether a scoring point <math alttext="s" class="ltx_Math" display="inline" id="S4.SS2.p9.2.m2.1"><semantics id="S4.SS2.p9.2.m2.1a"><mi id="S4.SS2.p9.2.m2.1.1" xref="S4.SS2.p9.2.m2.1.1.cmml">s</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p9.2.m2.1b"><ci id="S4.SS2.p9.2.m2.1.1.cmml" xref="S4.SS2.p9.2.m2.1.1">𝑠</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p9.2.m2.1c">s</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p9.2.m2.1d">italic_s</annotation></semantics></math> is related to the capability <math alttext="c" class="ltx_Math" display="inline" 
id="S4.SS2.p9.3.m3.1"><semantics id="S4.SS2.p9.3.m3.1a"><mi id="S4.SS2.p9.3.m3.1.1" xref="S4.SS2.p9.3.m3.1.1.cmml">c</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p9.3.m3.1b"><ci id="S4.SS2.p9.3.m3.1.1.cmml" xref="S4.SS2.p9.3.m3.1.1">𝑐</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p9.3.m3.1c">c</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p9.3.m3.1d">italic_c</annotation></semantics></math>. We manually crafted this mapping so that when a scoring point <math alttext="s" class="ltx_Math" display="inline" id="S4.SS2.p9.4.m4.1"><semantics id="S4.SS2.p9.4.m4.1a"><mi id="S4.SS2.p9.4.m4.1.1" xref="S4.SS2.p9.4.m4.1.1.cmml">s</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p9.4.m4.1b"><ci id="S4.SS2.p9.4.m4.1.1.cmml" xref="S4.SS2.p9.4.m4.1.1">𝑠</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p9.4.m4.1c">s</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p9.4.m4.1d">italic_s</annotation></semantics></math> is relevant to the capability <math alttext="c" class="ltx_Math" display="inline" id="S4.SS2.p9.5.m5.1"><semantics id="S4.SS2.p9.5.m5.1a"><mi id="S4.SS2.p9.5.m5.1.1" xref="S4.SS2.p9.5.m5.1.1.cmml">c</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p9.5.m5.1b"><ci id="S4.SS2.p9.5.m5.1.1.cmml" xref="S4.SS2.p9.5.m5.1.1">𝑐</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p9.5.m5.1c">c</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p9.5.m5.1d">italic_c</annotation></semantics></math>, the indicator <math alttext="\mathbb{I}(s,c)" class="ltx_Math" display="inline" id="S4.SS2.p9.6.m6.2"><semantics id="S4.SS2.p9.6.m6.2a"><mrow id="S4.SS2.p9.6.m6.2.3" xref="S4.SS2.p9.6.m6.2.3.cmml"><mi id="S4.SS2.p9.6.m6.2.3.2" xref="S4.SS2.p9.6.m6.2.3.2.cmml">𝕀</mi><mo id="S4.SS2.p9.6.m6.2.3.1" xref="S4.SS2.p9.6.m6.2.3.1.cmml"></mo><mrow id="S4.SS2.p9.6.m6.2.3.3.2" xref="S4.SS2.p9.6.m6.2.3.3.1.cmml"><mo 
id="S4.SS2.p9.6.m6.2.3.3.2.1" stretchy="false" xref="S4.SS2.p9.6.m6.2.3.3.1.cmml">(</mo><mi id="S4.SS2.p9.6.m6.1.1" xref="S4.SS2.p9.6.m6.1.1.cmml">s</mi><mo id="S4.SS2.p9.6.m6.2.3.3.2.2" xref="S4.SS2.p9.6.m6.2.3.3.1.cmml">,</mo><mi id="S4.SS2.p9.6.m6.2.2" xref="S4.SS2.p9.6.m6.2.2.cmml">c</mi><mo id="S4.SS2.p9.6.m6.2.3.3.2.3" stretchy="false" xref="S4.SS2.p9.6.m6.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p9.6.m6.2b"><apply id="S4.SS2.p9.6.m6.2.3.cmml" xref="S4.SS2.p9.6.m6.2.3"><times id="S4.SS2.p9.6.m6.2.3.1.cmml" xref="S4.SS2.p9.6.m6.2.3.1"></times><ci id="S4.SS2.p9.6.m6.2.3.2.cmml" xref="S4.SS2.p9.6.m6.2.3.2">𝕀</ci><interval closure="open" id="S4.SS2.p9.6.m6.2.3.3.1.cmml" xref="S4.SS2.p9.6.m6.2.3.3.2"><ci id="S4.SS2.p9.6.m6.1.1.cmml" xref="S4.SS2.p9.6.m6.1.1">𝑠</ci><ci id="S4.SS2.p9.6.m6.2.2.cmml" xref="S4.SS2.p9.6.m6.2.2">𝑐</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p9.6.m6.2c">\mathbb{I}(s,c)</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p9.6.m6.2d">blackboard_I ( italic_s , italic_c )</annotation></semantics></math> equals 1, and 0 otherwise.</p> </div> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.3 </span>Instruction Following Stability</h3> <div class="ltx_para" id="S4.SS3.p1"> <p class="ltx_p" id="S4.SS3.p1.1">To complement the analysis of instruction-following abilities, we introduce a novel metric: Instruction Following Stability (IFS). 
By analyzing fluctuations in the ARS scores from different perspectives, IFS measures how consistently a model adheres to instructions, providing a quantifiable assessment of the reliability and robustness of LLMs.</p> </div> <div class="ltx_para" id="S4.SS3.p2"> <p class="ltx_p" id="S4.SS3.p2.5">We first define the observation perspective of stability, <math alttext="p" class="ltx_Math" display="inline" id="S4.SS3.p2.1.m1.1"><semantics id="S4.SS3.p2.1.m1.1a"><mi id="S4.SS3.p2.1.m1.1.1" xref="S4.SS3.p2.1.m1.1.1.cmml">p</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.1.m1.1b"><ci id="S4.SS3.p2.1.m1.1.1.cmml" xref="S4.SS3.p2.1.m1.1.1">𝑝</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.1.m1.1c">p</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.1.m1.1d">italic_p</annotation></semantics></math>, which can be interpreted as a feature of the test sample. In LIFBench, there are three perspectives: prompt length, expression (the instruction template in the prompt), and instruction variables. 
Subsequently, we partition the set of answers <math alttext="A_{t}" class="ltx_Math" display="inline" id="S4.SS3.p2.2.m2.1"><semantics id="S4.SS3.p2.2.m2.1a"><msub id="S4.SS3.p2.2.m2.1.1" xref="S4.SS3.p2.2.m2.1.1.cmml"><mi id="S4.SS3.p2.2.m2.1.1.2" xref="S4.SS3.p2.2.m2.1.1.2.cmml">A</mi><mi id="S4.SS3.p2.2.m2.1.1.3" xref="S4.SS3.p2.2.m2.1.1.3.cmml">t</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.2.m2.1b"><apply id="S4.SS3.p2.2.m2.1.1.cmml" xref="S4.SS3.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.2.m2.1.1.1.cmml" xref="S4.SS3.p2.2.m2.1.1">subscript</csymbol><ci id="S4.SS3.p2.2.m2.1.1.2.cmml" xref="S4.SS3.p2.2.m2.1.1.2">𝐴</ci><ci id="S4.SS3.p2.2.m2.1.1.3.cmml" xref="S4.SS3.p2.2.m2.1.1.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.2.m2.1c">A_{t}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.2.m2.1d">italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT</annotation></semantics></math> into <math alttext="N_{p}" class="ltx_Math" display="inline" id="S4.SS3.p2.3.m3.1"><semantics id="S4.SS3.p2.3.m3.1a"><msub id="S4.SS3.p2.3.m3.1.1" xref="S4.SS3.p2.3.m3.1.1.cmml"><mi id="S4.SS3.p2.3.m3.1.1.2" xref="S4.SS3.p2.3.m3.1.1.2.cmml">N</mi><mi id="S4.SS3.p2.3.m3.1.1.3" xref="S4.SS3.p2.3.m3.1.1.3.cmml">p</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.3.m3.1b"><apply id="S4.SS3.p2.3.m3.1.1.cmml" xref="S4.SS3.p2.3.m3.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.3.m3.1.1.1.cmml" xref="S4.SS3.p2.3.m3.1.1">subscript</csymbol><ci id="S4.SS3.p2.3.m3.1.1.2.cmml" xref="S4.SS3.p2.3.m3.1.1.2">𝑁</ci><ci id="S4.SS3.p2.3.m3.1.1.3.cmml" xref="S4.SS3.p2.3.m3.1.1.3">𝑝</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.3.m3.1c">N_{p}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.3.m3.1d">italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT</annotation></semantics></math> groups based on the selected perspective, denoted as 
<math alttext="A_{t}^{p}=\{A_{t}^{j}\}_{j=0}^{N_{p}}" class="ltx_Math" display="inline" id="S4.SS3.p2.4.m4.1"><semantics id="S4.SS3.p2.4.m4.1a"><mrow id="S4.SS3.p2.4.m4.1.1" xref="S4.SS3.p2.4.m4.1.1.cmml"><msubsup id="S4.SS3.p2.4.m4.1.1.3" xref="S4.SS3.p2.4.m4.1.1.3.cmml"><mi id="S4.SS3.p2.4.m4.1.1.3.2.2" xref="S4.SS3.p2.4.m4.1.1.3.2.2.cmml">A</mi><mi id="S4.SS3.p2.4.m4.1.1.3.2.3" xref="S4.SS3.p2.4.m4.1.1.3.2.3.cmml">t</mi><mi id="S4.SS3.p2.4.m4.1.1.3.3" xref="S4.SS3.p2.4.m4.1.1.3.3.cmml">p</mi></msubsup><mo id="S4.SS3.p2.4.m4.1.1.2" xref="S4.SS3.p2.4.m4.1.1.2.cmml">=</mo><msubsup id="S4.SS3.p2.4.m4.1.1.1" xref="S4.SS3.p2.4.m4.1.1.1.cmml"><mrow id="S4.SS3.p2.4.m4.1.1.1.1.1.1" xref="S4.SS3.p2.4.m4.1.1.1.1.1.2.cmml"><mo id="S4.SS3.p2.4.m4.1.1.1.1.1.1.2" stretchy="false" xref="S4.SS3.p2.4.m4.1.1.1.1.1.2.cmml">{</mo><msubsup id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.cmml"><mi id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.2" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.2.cmml">A</mi><mi id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.3" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.3.cmml">t</mi><mi id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.3" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.3.cmml">j</mi></msubsup><mo id="S4.SS3.p2.4.m4.1.1.1.1.1.1.3" stretchy="false" xref="S4.SS3.p2.4.m4.1.1.1.1.1.2.cmml">}</mo></mrow><mrow id="S4.SS3.p2.4.m4.1.1.1.1.3" xref="S4.SS3.p2.4.m4.1.1.1.1.3.cmml"><mi id="S4.SS3.p2.4.m4.1.1.1.1.3.2" xref="S4.SS3.p2.4.m4.1.1.1.1.3.2.cmml">j</mi><mo id="S4.SS3.p2.4.m4.1.1.1.1.3.1" xref="S4.SS3.p2.4.m4.1.1.1.1.3.1.cmml">=</mo><mn id="S4.SS3.p2.4.m4.1.1.1.1.3.3" xref="S4.SS3.p2.4.m4.1.1.1.1.3.3.cmml">0</mn></mrow><msub id="S4.SS3.p2.4.m4.1.1.1.3" xref="S4.SS3.p2.4.m4.1.1.1.3.cmml"><mi id="S4.SS3.p2.4.m4.1.1.1.3.2" xref="S4.SS3.p2.4.m4.1.1.1.3.2.cmml">N</mi><mi id="S4.SS3.p2.4.m4.1.1.1.3.3" xref="S4.SS3.p2.4.m4.1.1.1.3.3.cmml">p</mi></msub></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.4.m4.1b"><apply id="S4.SS3.p2.4.m4.1.1.cmml" 
xref="S4.SS3.p2.4.m4.1.1"><eq id="S4.SS3.p2.4.m4.1.1.2.cmml" xref="S4.SS3.p2.4.m4.1.1.2"></eq><apply id="S4.SS3.p2.4.m4.1.1.3.cmml" xref="S4.SS3.p2.4.m4.1.1.3"><csymbol cd="ambiguous" id="S4.SS3.p2.4.m4.1.1.3.1.cmml" xref="S4.SS3.p2.4.m4.1.1.3">superscript</csymbol><apply id="S4.SS3.p2.4.m4.1.1.3.2.cmml" xref="S4.SS3.p2.4.m4.1.1.3"><csymbol cd="ambiguous" id="S4.SS3.p2.4.m4.1.1.3.2.1.cmml" xref="S4.SS3.p2.4.m4.1.1.3">subscript</csymbol><ci id="S4.SS3.p2.4.m4.1.1.3.2.2.cmml" xref="S4.SS3.p2.4.m4.1.1.3.2.2">𝐴</ci><ci id="S4.SS3.p2.4.m4.1.1.3.2.3.cmml" xref="S4.SS3.p2.4.m4.1.1.3.2.3">𝑡</ci></apply><ci id="S4.SS3.p2.4.m4.1.1.3.3.cmml" xref="S4.SS3.p2.4.m4.1.1.3.3">𝑝</ci></apply><apply id="S4.SS3.p2.4.m4.1.1.1.cmml" xref="S4.SS3.p2.4.m4.1.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.4.m4.1.1.1.2.cmml" xref="S4.SS3.p2.4.m4.1.1.1">superscript</csymbol><apply id="S4.SS3.p2.4.m4.1.1.1.1.cmml" xref="S4.SS3.p2.4.m4.1.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.4.m4.1.1.1.1.2.cmml" xref="S4.SS3.p2.4.m4.1.1.1">subscript</csymbol><set id="S4.SS3.p2.4.m4.1.1.1.1.1.2.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1"><apply id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.1.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1">superscript</csymbol><apply id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.1.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1">subscript</csymbol><ci id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.2.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.2">𝐴</ci><ci id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.3.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.2.3">𝑡</ci></apply><ci id="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.3.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.1.1.1.3">𝑗</ci></apply></set><apply id="S4.SS3.p2.4.m4.1.1.1.1.3.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.3"><eq id="S4.SS3.p2.4.m4.1.1.1.1.3.1.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.3.1"></eq><ci 
id="S4.SS3.p2.4.m4.1.1.1.1.3.2.cmml" xref="S4.SS3.p2.4.m4.1.1.1.1.3.2">𝑗</ci><cn id="S4.SS3.p2.4.m4.1.1.1.1.3.3.cmml" type="integer" xref="S4.SS3.p2.4.m4.1.1.1.1.3.3">0</cn></apply></apply><apply id="S4.SS3.p2.4.m4.1.1.1.3.cmml" xref="S4.SS3.p2.4.m4.1.1.1.3"><csymbol cd="ambiguous" id="S4.SS3.p2.4.m4.1.1.1.3.1.cmml" xref="S4.SS3.p2.4.m4.1.1.1.3">subscript</csymbol><ci id="S4.SS3.p2.4.m4.1.1.1.3.2.cmml" xref="S4.SS3.p2.4.m4.1.1.1.3.2">𝑁</ci><ci id="S4.SS3.p2.4.m4.1.1.1.3.3.cmml" xref="S4.SS3.p2.4.m4.1.1.1.3.3">𝑝</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.4.m4.1c">A_{t}^{p}=\{A_{t}^{j}\}_{j=0}^{N_{p}}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.4.m4.1d">italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT = { italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUPERSCRIPT</annotation></semantics></math>. 
For each group of answers, we compute the performance using the ARS method described in Sec. <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S4.SS1" title="4.1 Automated Rubric-based Scoring ‣ 4 LIFEval ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">4.1</span></a>, resulting in a set of performance values <math alttext="Y_{t}^{p}" class="ltx_Math" display="inline" id="S4.SS3.p2.5.m5.1"><semantics id="S4.SS3.p2.5.m5.1a"><msubsup id="S4.SS3.p2.5.m5.1.1" xref="S4.SS3.p2.5.m5.1.1.cmml"><mi id="S4.SS3.p2.5.m5.1.1.2.2" xref="S4.SS3.p2.5.m5.1.1.2.2.cmml">Y</mi><mi id="S4.SS3.p2.5.m5.1.1.2.3" xref="S4.SS3.p2.5.m5.1.1.2.3.cmml">t</mi><mi id="S4.SS3.p2.5.m5.1.1.3" xref="S4.SS3.p2.5.m5.1.1.3.cmml">p</mi></msubsup><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.5.m5.1b"><apply id="S4.SS3.p2.5.m5.1.1.cmml" xref="S4.SS3.p2.5.m5.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.5.m5.1.1.1.cmml" xref="S4.SS3.p2.5.m5.1.1">superscript</csymbol><apply id="S4.SS3.p2.5.m5.1.1.2.cmml" xref="S4.SS3.p2.5.m5.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.5.m5.1.1.2.1.cmml" xref="S4.SS3.p2.5.m5.1.1">subscript</csymbol><ci id="S4.SS3.p2.5.m5.1.1.2.2.cmml" xref="S4.SS3.p2.5.m5.1.1.2.2">𝑌</ci><ci id="S4.SS3.p2.5.m5.1.1.2.3.cmml" xref="S4.SS3.p2.5.m5.1.1.2.3">𝑡</ci></apply><ci id="S4.SS3.p2.5.m5.1.1.3.cmml" xref="S4.SS3.p2.5.m5.1.1.3">𝑝</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.5.m5.1c">Y_{t}^{p}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.5.m5.1d">italic_Y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT</annotation></semantics></math>, which can be expressed as follows:</p> <table class="ltx_equation ltx_eqn_table" id="S4.E5"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td 
class="ltx_eqn_cell ltx_align_center"><math alttext="\begin{split}Y_{t}^{p}=\{ARS_{t}(A_{t}^{j})\}_{j=0}^{N_{p}}\end{split}" class="ltx_Math" display="block" id="S4.E5.m1.19"><semantics id="S4.E5.m1.19a"><mtable displaystyle="true" id="S4.E5.m1.19.19.2" xref="S4.E5.m1.18.18.1.cmml"><mtr id="S4.E5.m1.19.19.2a" xref="S4.E5.m1.18.18.1.cmml"><mtd class="ltx_align_right" columnalign="right" id="S4.E5.m1.19.19.2b" xref="S4.E5.m1.18.18.1.cmml"><mrow id="S4.E5.m1.19.19.2.18.18.18" xref="S4.E5.m1.18.18.1.cmml"><msubsup id="S4.E5.m1.19.19.2.18.18.18.19" xref="S4.E5.m1.18.18.1.cmml"><mi id="S4.E5.m1.1.1.1.1.1.1" xref="S4.E5.m1.1.1.1.1.1.1.cmml">Y</mi><mi id="S4.E5.m1.2.2.2.2.2.2.1" xref="S4.E5.m1.2.2.2.2.2.2.1.cmml">t</mi><mi id="S4.E5.m1.3.3.3.3.3.3.1" xref="S4.E5.m1.3.3.3.3.3.3.1.cmml">p</mi></msubsup><mo id="S4.E5.m1.4.4.4.4.4.4" xref="S4.E5.m1.4.4.4.4.4.4.cmml">=</mo><msubsup id="S4.E5.m1.19.19.2.18.18.18.18" xref="S4.E5.m1.18.18.1.cmml"><mrow id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1" xref="S4.E5.m1.18.18.1.cmml"><mo id="S4.E5.m1.5.5.5.5.5.5" stretchy="false" xref="S4.E5.m1.18.18.1.cmml">{</mo><mrow id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1.1" xref="S4.E5.m1.18.18.1.cmml"><mi id="S4.E5.m1.6.6.6.6.6.6" xref="S4.E5.m1.6.6.6.6.6.6.cmml">A</mi><mo id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1.1.2" xref="S4.E5.m1.18.18.1.cmml"></mo><mi id="S4.E5.m1.7.7.7.7.7.7" xref="S4.E5.m1.7.7.7.7.7.7.cmml">R</mi><mo id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1.1.2a" xref="S4.E5.m1.18.18.1.cmml"></mo><msub id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1.1.3" xref="S4.E5.m1.18.18.1.cmml"><mi id="S4.E5.m1.8.8.8.8.8.8" xref="S4.E5.m1.8.8.8.8.8.8.cmml">S</mi><mi id="S4.E5.m1.9.9.9.9.9.9.1" xref="S4.E5.m1.9.9.9.9.9.9.1.cmml">t</mi></msub><mo id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1.1.2b" xref="S4.E5.m1.18.18.1.cmml"></mo><mrow id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1.1.1.1" xref="S4.E5.m1.18.18.1.cmml"><mo id="S4.E5.m1.10.10.10.10.10.10" stretchy="false" xref="S4.E5.m1.18.18.1.cmml">(</mo><msubsup 
id="S4.E5.m1.19.19.2.18.18.18.18.1.1.1.1.1.1.1" xref="S4.E5.m1.18.18.1.cmml"><mi id="S4.E5.m1.11.11.11.11.11.11" xref="S4.E5.m1.11.11.11.11.11.11.cmml">A</mi><mi id="S4.E5.m1.12.12.12.12.12.12.1" xref="S4.E5.m1.12.12.12.12.12.12.1.cmml">t</mi><mi id="S4.E5.m1.13.13.13.13.13.13.1" xref="S4.E5.m1.13.13.13.13.13.13.1.cmml">j</mi></msubsup><mo id="S4.E5.m1.14.14.14.14.14.14" stretchy="false" xref="S4.E5.m1.18.18.1.cmml">)</mo></mrow></mrow><mo id="S4.E5.m1.15.15.15.15.15.15" stretchy="false" xref="S4.E5.m1.18.18.1.cmml">}</mo></mrow><mrow id="S4.E5.m1.16.16.16.16.16.16.1" xref="S4.E5.m1.16.16.16.16.16.16.1.cmml"><mi id="S4.E5.m1.16.16.16.16.16.16.1.2" xref="S4.E5.m1.16.16.16.16.16.16.1.2.cmml">j</mi><mo id="S4.E5.m1.16.16.16.16.16.16.1.1" xref="S4.E5.m1.16.16.16.16.16.16.1.1.cmml">=</mo><mn id="S4.E5.m1.16.16.16.16.16.16.1.3" xref="S4.E5.m1.16.16.16.16.16.16.1.3.cmml">0</mn></mrow><msub id="S4.E5.m1.17.17.17.17.17.17.1" xref="S4.E5.m1.17.17.17.17.17.17.1.cmml"><mi id="S4.E5.m1.17.17.17.17.17.17.1.2" xref="S4.E5.m1.17.17.17.17.17.17.1.2.cmml">N</mi><mi id="S4.E5.m1.17.17.17.17.17.17.1.3" xref="S4.E5.m1.17.17.17.17.17.17.1.3.cmml">p</mi></msub></msubsup></mrow></mtd></mtr></mtable><annotation-xml encoding="MathML-Content" id="S4.E5.m1.19b"><apply id="S4.E5.m1.18.18.1.cmml" xref="S4.E5.m1.19.19.2"><eq id="S4.E5.m1.4.4.4.4.4.4.cmml" xref="S4.E5.m1.4.4.4.4.4.4"></eq><apply id="S4.E5.m1.18.18.1.3.cmml" xref="S4.E5.m1.19.19.2"><csymbol cd="ambiguous" id="S4.E5.m1.18.18.1.3.1.cmml" xref="S4.E5.m1.19.19.2">superscript</csymbol><apply id="S4.E5.m1.18.18.1.3.2.cmml" xref="S4.E5.m1.19.19.2"><csymbol cd="ambiguous" id="S4.E5.m1.18.18.1.3.2.1.cmml" xref="S4.E5.m1.19.19.2">subscript</csymbol><ci id="S4.E5.m1.1.1.1.1.1.1.cmml" xref="S4.E5.m1.1.1.1.1.1.1">𝑌</ci><ci id="S4.E5.m1.2.2.2.2.2.2.1.cmml" xref="S4.E5.m1.2.2.2.2.2.2.1">𝑡</ci></apply><ci id="S4.E5.m1.3.3.3.3.3.3.1.cmml" xref="S4.E5.m1.3.3.3.3.3.3.1">𝑝</ci></apply><apply id="S4.E5.m1.18.18.1.1.cmml" 
xref="S4.E5.m1.19.19.2"><csymbol cd="ambiguous" id="S4.E5.m1.18.18.1.1.2.cmml" xref="S4.E5.m1.19.19.2">superscript</csymbol><apply id="S4.E5.m1.18.18.1.1.1.cmml" xref="S4.E5.m1.19.19.2"><csymbol cd="ambiguous" id="S4.E5.m1.18.18.1.1.1.2.cmml" xref="S4.E5.m1.19.19.2">subscript</csymbol><set id="S4.E5.m1.18.18.1.1.1.1.2.cmml" xref="S4.E5.m1.19.19.2"><apply id="S4.E5.m1.18.18.1.1.1.1.1.1.cmml" xref="S4.E5.m1.19.19.2"><times id="S4.E5.m1.18.18.1.1.1.1.1.1.2.cmml" xref="S4.E5.m1.19.19.2"></times><ci id="S4.E5.m1.6.6.6.6.6.6.cmml" xref="S4.E5.m1.6.6.6.6.6.6">𝐴</ci><ci id="S4.E5.m1.7.7.7.7.7.7.cmml" xref="S4.E5.m1.7.7.7.7.7.7">𝑅</ci><apply id="S4.E5.m1.18.18.1.1.1.1.1.1.5.cmml" xref="S4.E5.m1.19.19.2"><csymbol cd="ambiguous" id="S4.E5.m1.18.18.1.1.1.1.1.1.5.1.cmml" xref="S4.E5.m1.19.19.2">subscript</csymbol><ci id="S4.E5.m1.8.8.8.8.8.8.cmml" xref="S4.E5.m1.8.8.8.8.8.8">𝑆</ci><ci id="S4.E5.m1.9.9.9.9.9.9.1.cmml" xref="S4.E5.m1.9.9.9.9.9.9.1">𝑡</ci></apply><apply id="S4.E5.m1.18.18.1.1.1.1.1.1.1.1.1.cmml" xref="S4.E5.m1.19.19.2"><csymbol cd="ambiguous" id="S4.E5.m1.18.18.1.1.1.1.1.1.1.1.1.1.cmml" xref="S4.E5.m1.19.19.2">superscript</csymbol><apply id="S4.E5.m1.18.18.1.1.1.1.1.1.1.1.1.2.cmml" xref="S4.E5.m1.19.19.2"><csymbol cd="ambiguous" id="S4.E5.m1.18.18.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S4.E5.m1.19.19.2">subscript</csymbol><ci id="S4.E5.m1.11.11.11.11.11.11.cmml" xref="S4.E5.m1.11.11.11.11.11.11">𝐴</ci><ci id="S4.E5.m1.12.12.12.12.12.12.1.cmml" xref="S4.E5.m1.12.12.12.12.12.12.1">𝑡</ci></apply><ci id="S4.E5.m1.13.13.13.13.13.13.1.cmml" xref="S4.E5.m1.13.13.13.13.13.13.1">𝑗</ci></apply></apply></set><apply id="S4.E5.m1.16.16.16.16.16.16.1.cmml" xref="S4.E5.m1.16.16.16.16.16.16.1"><eq id="S4.E5.m1.16.16.16.16.16.16.1.1.cmml" xref="S4.E5.m1.16.16.16.16.16.16.1.1"></eq><ci id="S4.E5.m1.16.16.16.16.16.16.1.2.cmml" xref="S4.E5.m1.16.16.16.16.16.16.1.2">𝑗</ci><cn id="S4.E5.m1.16.16.16.16.16.16.1.3.cmml" type="integer" 
xref="S4.E5.m1.16.16.16.16.16.16.1.3">0</cn></apply></apply><apply id="S4.E5.m1.17.17.17.17.17.17.1.cmml" xref="S4.E5.m1.17.17.17.17.17.17.1"><csymbol cd="ambiguous" id="S4.E5.m1.17.17.17.17.17.17.1.1.cmml" xref="S4.E5.m1.17.17.17.17.17.17.1">subscript</csymbol><ci id="S4.E5.m1.17.17.17.17.17.17.1.2.cmml" xref="S4.E5.m1.17.17.17.17.17.17.1.2">𝑁</ci><ci id="S4.E5.m1.17.17.17.17.17.17.1.3.cmml" xref="S4.E5.m1.17.17.17.17.17.17.1.3">𝑝</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E5.m1.19c">\begin{split}Y_{t}^{p}=\{ARS_{t}(A_{t}^{j})\}_{j=0}^{N_{p}}\end{split}</annotation><annotation encoding="application/x-llamapun" id="S4.E5.m1.19d">start_ROW start_CELL italic_Y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT = { italic_A italic_R italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_CELL end_ROW</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS3.p2.7">Finally, the Instruction Following Stability (IFS) on task <math alttext="t" class="ltx_Math" display="inline" id="S4.SS3.p2.6.m1.1"><semantics id="S4.SS3.p2.6.m1.1a"><mi id="S4.SS3.p2.6.m1.1.1" xref="S4.SS3.p2.6.m1.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.6.m1.1b"><ci id="S4.SS3.p2.6.m1.1.1.cmml" xref="S4.SS3.p2.6.m1.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.6.m1.1c">t</annotation><annotation encoding="application/x-llamapun" 
id="S4.SS3.p2.6.m1.1d">italic_t</annotation></semantics></math> is calculated as the standard deviation of the performance values <math alttext="Y" class="ltx_Math" display="inline" id="S4.SS3.p2.7.m2.1"><semantics id="S4.SS3.p2.7.m2.1a"><mi id="S4.SS3.p2.7.m2.1.1" xref="S4.SS3.p2.7.m2.1.1.cmml">Y</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.7.m2.1b"><ci id="S4.SS3.p2.7.m2.1.1.cmml" xref="S4.SS3.p2.7.m2.1.1">𝑌</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.7.m2.1c">Y</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.7.m2.1d">italic_Y</annotation></semantics></math> divided by their mean (i.e., the coefficient of variation), formally expressed as:</p> <table class="ltx_equation ltx_eqn_table" id="S4.E6"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\begin{split}IFS_{t}^{p}=\frac{\sigma(Y_{t}^{p})}{\bar{Y}_{t}^{p}}\in[0,+% \infty)\end{split}" class="ltx_Math" display="block" id="S4.E6.m1.16"><semantics id="S4.E6.m1.16a"><mtable displaystyle="true" id="S4.E6.m1.16.16.2" xref="S4.E6.m1.15.15.1.cmml"><mtr id="S4.E6.m1.16.16.2a" xref="S4.E6.m1.15.15.1.cmml"><mtd class="ltx_align_right" columnalign="right" id="S4.E6.m1.16.16.2b" xref="S4.E6.m1.15.15.1.cmml"><mrow id="S4.E6.m1.16.16.2.15.15.15" xref="S4.E6.m1.15.15.1.cmml"><mrow id="S4.E6.m1.16.16.2.15.15.15.17" xref="S4.E6.m1.15.15.1.cmml"><mi id="S4.E6.m1.1.1.1.1.1.1" xref="S4.E6.m1.1.1.1.1.1.1.cmml">I</mi><mo id="S4.E6.m1.16.16.2.15.15.15.17.1" xref="S4.E6.m1.15.15.1a.cmml"></mo><mi id="S4.E6.m1.2.2.2.2.2.2" xref="S4.E6.m1.2.2.2.2.2.2.cmml">F</mi><mo id="S4.E6.m1.16.16.2.15.15.15.17.1a" xref="S4.E6.m1.15.15.1a.cmml"></mo><msubsup id="S4.E6.m1.16.16.2.15.15.15.17.2" xref="S4.E6.m1.15.15.1.cmml"><mi id="S4.E6.m1.3.3.3.3.3.3" xref="S4.E6.m1.3.3.3.3.3.3.cmml">S</mi><mi id="S4.E6.m1.4.4.4.4.4.4.1" xref="S4.E6.m1.4.4.4.4.4.4.1.cmml">t</mi><mi id="S4.E6.m1.5.5.5.5.5.5.1" 
xref="S4.E6.m1.5.5.5.5.5.5.1.cmml">p</mi></msubsup></mrow><mo id="S4.E6.m1.6.6.6.6.6.6" xref="S4.E6.m1.6.6.6.6.6.6.cmml">=</mo><mfrac id="S4.E6.m1.7.7.7.7.7.7" xref="S4.E6.m1.7.7.7.7.7.7.cmml"><mrow id="S4.E6.m1.7.7.7.7.7.7.1" xref="S4.E6.m1.7.7.7.7.7.7.1.cmml"><mi id="S4.E6.m1.7.7.7.7.7.7.1.3" xref="S4.E6.m1.7.7.7.7.7.7.1.3.cmml">σ</mi><mo id="S4.E6.m1.7.7.7.7.7.7.1.2" xref="S4.E6.m1.7.7.7.7.7.7.1.2.cmml"></mo><mrow id="S4.E6.m1.7.7.7.7.7.7.1.1.1" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.cmml"><mo id="S4.E6.m1.7.7.7.7.7.7.1.1.1.2" stretchy="false" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.cmml">(</mo><msubsup id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.cmml"><mi id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.2" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.2.cmml">Y</mi><mi id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.3" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.3.cmml">t</mi><mi id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.3" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.3.cmml">p</mi></msubsup><mo id="S4.E6.m1.7.7.7.7.7.7.1.1.1.3" stretchy="false" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.cmml">)</mo></mrow></mrow><msubsup id="S4.E6.m1.7.7.7.7.7.7.3" xref="S4.E6.m1.7.7.7.7.7.7.3.cmml"><mover accent="true" id="S4.E6.m1.7.7.7.7.7.7.3.2.2" xref="S4.E6.m1.7.7.7.7.7.7.3.2.2.cmml"><mi id="S4.E6.m1.7.7.7.7.7.7.3.2.2.2" xref="S4.E6.m1.7.7.7.7.7.7.3.2.2.2.cmml">Y</mi><mo id="S4.E6.m1.7.7.7.7.7.7.3.2.2.1" xref="S4.E6.m1.7.7.7.7.7.7.3.2.2.1.cmml">¯</mo></mover><mi id="S4.E6.m1.7.7.7.7.7.7.3.2.3" xref="S4.E6.m1.7.7.7.7.7.7.3.2.3.cmml">t</mi><mi id="S4.E6.m1.7.7.7.7.7.7.3.3" xref="S4.E6.m1.7.7.7.7.7.7.3.3.cmml">p</mi></msubsup></mfrac><mo id="S4.E6.m1.8.8.8.8.8.8" xref="S4.E6.m1.8.8.8.8.8.8.cmml">∈</mo><mrow id="S4.E6.m1.16.16.2.15.15.15.15.1" xref="S4.E6.m1.15.15.1.cmml"><mo id="S4.E6.m1.9.9.9.9.9.9" stretchy="false" xref="S4.E6.m1.15.15.1a.cmml">[</mo><mn id="S4.E6.m1.10.10.10.10.10.10" xref="S4.E6.m1.10.10.10.10.10.10.cmml">0</mn><mo id="S4.E6.m1.11.11.11.11.11.11" xref="S4.E6.m1.15.15.1a.cmml">,</mo><mrow 
id="S4.E6.m1.16.16.2.15.15.15.15.1.1" xref="S4.E6.m1.15.15.1.cmml"><mo id="S4.E6.m1.16.16.2.15.15.15.15.1.1a" xref="S4.E6.m1.15.15.1a.cmml">+</mo><mi id="S4.E6.m1.13.13.13.13.13.13" mathvariant="normal" xref="S4.E6.m1.13.13.13.13.13.13.cmml">∞</mi></mrow><mo id="S4.E6.m1.14.14.14.14.14.14" stretchy="false" xref="S4.E6.m1.15.15.1a.cmml">)</mo></mrow></mrow></mtd></mtr></mtable><annotation-xml encoding="MathML-Content" id="S4.E6.m1.16b"><apply id="S4.E6.m1.15.15.1.cmml" xref="S4.E6.m1.16.16.2"><and id="S4.E6.m1.15.15.1a.cmml" xref="S4.E6.m1.16.16.2.15.15.15.17.1"></and><apply id="S4.E6.m1.15.15.1b.cmml" xref="S4.E6.m1.16.16.2"><eq id="S4.E6.m1.6.6.6.6.6.6.cmml" xref="S4.E6.m1.6.6.6.6.6.6"></eq><apply id="S4.E6.m1.15.15.1.3.cmml" xref="S4.E6.m1.16.16.2"><times id="S4.E6.m1.15.15.1.3.1.cmml" xref="S4.E6.m1.16.16.2.15.15.15.17.1"></times><ci id="S4.E6.m1.1.1.1.1.1.1.cmml" xref="S4.E6.m1.1.1.1.1.1.1">𝐼</ci><ci id="S4.E6.m1.2.2.2.2.2.2.cmml" xref="S4.E6.m1.2.2.2.2.2.2">𝐹</ci><apply id="S4.E6.m1.15.15.1.3.4.cmml" xref="S4.E6.m1.16.16.2"><csymbol cd="ambiguous" id="S4.E6.m1.15.15.1.3.4.1.cmml" xref="S4.E6.m1.16.16.2.15.15.15.17.1">superscript</csymbol><apply id="S4.E6.m1.15.15.1.3.4.2.cmml" xref="S4.E6.m1.16.16.2"><csymbol cd="ambiguous" id="S4.E6.m1.15.15.1.3.4.2.1.cmml" xref="S4.E6.m1.16.16.2.15.15.15.17.1">subscript</csymbol><ci id="S4.E6.m1.3.3.3.3.3.3.cmml" xref="S4.E6.m1.3.3.3.3.3.3">𝑆</ci><ci id="S4.E6.m1.4.4.4.4.4.4.1.cmml" xref="S4.E6.m1.4.4.4.4.4.4.1">𝑡</ci></apply><ci id="S4.E6.m1.5.5.5.5.5.5.1.cmml" xref="S4.E6.m1.5.5.5.5.5.5.1">𝑝</ci></apply></apply><apply id="S4.E6.m1.7.7.7.7.7.7.cmml" xref="S4.E6.m1.7.7.7.7.7.7"><divide id="S4.E6.m1.7.7.7.7.7.7.2.cmml" xref="S4.E6.m1.7.7.7.7.7.7"></divide><apply id="S4.E6.m1.7.7.7.7.7.7.1.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1"><times id="S4.E6.m1.7.7.7.7.7.7.1.2.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.2"></times><ci id="S4.E6.m1.7.7.7.7.7.7.1.3.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.3">𝜎</ci><apply id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.cmml" 
xref="S4.E6.m1.7.7.7.7.7.7.1.1.1"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.1.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1">superscript</csymbol><apply id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.1.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1">subscript</csymbol><ci id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.2.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.2">𝑌</ci><ci id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.3.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.2.3">𝑡</ci></apply><ci id="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.3.cmml" xref="S4.E6.m1.7.7.7.7.7.7.1.1.1.1.3">𝑝</ci></apply></apply><apply id="S4.E6.m1.7.7.7.7.7.7.3.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.7.7.7.7.3.1.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3">superscript</csymbol><apply id="S4.E6.m1.7.7.7.7.7.7.3.2.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.7.7.7.7.3.2.1.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3">subscript</csymbol><apply id="S4.E6.m1.7.7.7.7.7.7.3.2.2.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3.2.2"><ci id="S4.E6.m1.7.7.7.7.7.7.3.2.2.1.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3.2.2.1">¯</ci><ci id="S4.E6.m1.7.7.7.7.7.7.3.2.2.2.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3.2.2.2">𝑌</ci></apply><ci id="S4.E6.m1.7.7.7.7.7.7.3.2.3.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3.2.3">𝑡</ci></apply><ci id="S4.E6.m1.7.7.7.7.7.7.3.3.cmml" xref="S4.E6.m1.7.7.7.7.7.7.3.3">𝑝</ci></apply></apply></apply><apply id="S4.E6.m1.15.15.1c.cmml" xref="S4.E6.m1.16.16.2"><in id="S4.E6.m1.8.8.8.8.8.8.cmml" xref="S4.E6.m1.8.8.8.8.8.8"></in><share href="https://arxiv.org/html/2411.07037v1#S4.E6.m1.15.15.1.5.cmml" id="S4.E6.m1.15.15.1d.cmml" xref="S4.E6.m1.16.16.2.15.15.15.17.1"></share><interval closure="closed-open" id="S4.E6.m1.15.15.1.1.2.cmml" xref="S4.E6.m1.16.16.2"><cn id="S4.E6.m1.10.10.10.10.10.10.cmml" type="integer" xref="S4.E6.m1.10.10.10.10.10.10">0</cn><apply id="S4.E6.m1.15.15.1.1.1.1.cmml" xref="S4.E6.m1.16.16.2"><plus 
id="S4.E6.m1.12.12.12.12.12.12.cmml" xref="S4.E6.m1.16.16.2"></plus><infinity id="S4.E6.m1.13.13.13.13.13.13.cmml" xref="S4.E6.m1.13.13.13.13.13.13"></infinity></apply></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E6.m1.16c">\begin{split}IFS_{t}^{p}=\frac{\sigma(Y_{t}^{p})}{\bar{Y}_{t}^{p}}\in[0,+% \infty)\end{split}</annotation><annotation encoding="application/x-llamapun" id="S4.E6.m1.16d">start_ROW start_CELL italic_I italic_F italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT = divide start_ARG italic_σ ( italic_Y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT ) end_ARG start_ARG over¯ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT end_ARG ∈ [ 0 , + ∞ ) end_CELL end_ROW</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS3.p2.8">The overall IFS for the model is calculated as the average IFS across all tasks. 
A lower IFS indicates greater stability in instruction-following under perspective <math alttext="p" class="ltx_Math" display="inline" id="S4.SS3.p2.8.m1.1"><semantics id="S4.SS3.p2.8.m1.1a"><mi id="S4.SS3.p2.8.m1.1.1" xref="S4.SS3.p2.8.m1.1.1.cmml">p</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.8.m1.1b"><ci id="S4.SS3.p2.8.m1.1.1.cmml" xref="S4.SS3.p2.8.m1.1.1">𝑝</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.8.m1.1c">p</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.8.m1.1d">italic_p</annotation></semantics></math>, while a higher IFS suggests less stability.</p> </div> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5 </span>Experiments</h2> <figure class="ltx_figure" id="S5.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="732" id="S5.F2.g1" src="x2.png" width="746"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>The Instruction Following Performance of models in six core capabilities. Lines of the same color in the chart represent models from the same family or series. Dashed lines represent base models, while solid lines represent their fine-tuned variants.</figcaption> </figure> <section class="ltx_subsection" id="S5.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.1 </span>Experiment Setup</h3> <div class="ltx_para" id="S5.SS1.p1"> <p class="ltx_p" id="S5.SS1.p1.1">We evaluate 20 popular LLMs that feature long context capability, including various models from the GPT <cite class="ltx_cite ltx_citemacro_cite">Achiam et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib1" title="">2023</a>)</cite>, Llama <cite class="ltx_cite ltx_citemacro_cite">Dubey et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib13" title="">2024</a>)</cite>, Qwen <cite class="ltx_cite ltx_citemacro_cite">Yang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib49" title="">2024</a>)</cite>, C4AI <cite class="ltx_cite ltx_citemacro_cite">Cohere For AI (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib10" title="">2024</a>)</cite>, LWM <cite class="ltx_cite ltx_citemacro_cite">Liu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib29" title="">2024a</a>)</cite>, InternLM <cite class="ltx_cite ltx_citemacro_cite">Cai et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib6" title="">2024</a>)</cite>, and GLM <cite class="ltx_cite ltx_citemacro_cite">GLM et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib14" title="">2024</a>)</cite> series. All of the models officially claim to support context lengths in excess of 128k tokens. The instruct version of the Qwen2.5 model extends the 32k context length to 128k with YaRN <cite class="ltx_cite ltx_citemacro_cite">Peng et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib38" title="">2023</a>)</cite>. We access the GPT-4 models through the official API and obtain the open-source LLMs from their official GitHub repositories.</p> </div> <div class="ltx_para" id="S5.SS1.p2"> <p class="ltx_p" id="S5.SS1.p2.1">For inference, we deploy all open-source models with vLLM <cite class="ltx_cite ltx_citemacro_cite">Kwon et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib23" title="">2023</a>)</cite>. The temperature is set to 0 to ensure deterministic outputs. To avoid wasting time on endless repetition in the output, we set a task-specific maximum generation length. 
List-scenario tasks that require outputting a single element are limited to 100 tokens; tasks in the MultiDoc scenario require more output space, so their limit is set to 4096 tokens; the output limit for the remaining tasks is set to 512 tokens.</p> </div> <div class="ltx_para" id="S5.SS1.p3"> <p class="ltx_p" id="S5.SS1.p3.1">We conduct experiments across six exponentially increasing length intervals ranging from 4k to 128k tokens, and ensure sufficient space for model outputs within each interval. The token count for all texts was measured with GPT-4’s tokenizer <span class="ltx_note ltx_role_footnote" id="footnote2"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><span class="ltx_tag ltx_tag_note">2</span><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/openai/tiktoken" title="">https://github.com/openai/tiktoken</a></span></span></span>. However, because encoding efficiency varies across the models’ tokenizers, some models could not accommodate the longest inputs. 
To address this, we applied right truncation to bring the input <math alttext="X" class="ltx_Math" display="inline" id="S5.SS1.p3.1.m1.1"><semantics id="S5.SS1.p3.1.m1.1a"><mi id="S5.SS1.p3.1.m1.1.1" xref="S5.SS1.p3.1.m1.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p3.1.m1.1b"><ci id="S5.SS1.p3.1.m1.1.1.cmml" xref="S5.SS1.p3.1.m1.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p3.1.m1.1c">X</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p3.1.m1.1d">italic_X</annotation></semantics></math> to an appropriate length.</p> </div> </section> <section class="ltx_subsection" id="S5.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.2 </span>Results on LIFBench</h3> <div class="ltx_para ltx_noindent" id="S5.SS2.p1"> <p class="ltx_p" id="S5.SS2.p1.1"><span class="ltx_text ltx_font_bold" id="S5.SS2.p1.1.1">Task-categorized Performance</span> The results on all tasks are shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S4.T3" title="Table 3 ‣ 4.1 Automated Rubric-based Scoring ‣ 4 LIFEval ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">3</span></a>. The two highest scores are achieved by closed-source models, though the maximum overall score is only 0.764. This indicates that LLMs still have substantial room to improve their instruction-following capabilities in long-context scenarios. Among models trained under identical configurations, parameter size remains a crucial factor in model performance, with larger models generally achieving better results. However, among models in the same series, the “-Instruct” or “-chat” variants significantly outperform their base counterparts; in some cases, smaller instruction-tuned or chat-tuned models surpass larger base models. 
For instance, internlm2_5-7b-chat-1m scores 0.087 higher than Meta-Llama-3.1-70B (0.523 vs. 0.436 in ARS). This finding highlights that fine-tuning on instruction or conversational data can substantially enhance instruction-following capabilities in LLMs. Across tasks, closed-source models achieve the highest scores in most cases, while open-source models come out ahead in only a few tasks (e.g., <span class="ltx_text ltx_font_italic" id="S5.SS2.p1.1.2">Blur-Element</span> and <span class="ltx_text ltx_font_italic" id="S5.SS2.p1.1.3">Batch-label</span>). Overall, current small and medium open-source models show a notable performance gap compared to large closed-source models like GPT-4.</p> </div> <div class="ltx_para ltx_noindent" id="S5.SS2.p2"> <p class="ltx_p" id="S5.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S5.SS2.p2.1.1">Capability-categorized Performance</span> As shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S5.F2" title="Figure 2 ‣ 5 Experiments ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">2</span></a>, gpt-4o-2024-08-06 takes the lead in almost all dimensions except <span class="ltx_text ltx_font_italic" id="S5.SS2.p2.1.2">Format</span>. For models within the same series, the solid line entirely encloses the dashed line of the same color, indicating that instruction fine-tuning leads to comprehensive performance improvements. Observing the distances between the lines, we find that they are most tightly clustered on the <span class="ltx_text ltx_font_italic" id="S5.SS2.p2.1.3">Format</span> dimension, while the <span class="ltx_text ltx_font_italic" id="S5.SS2.p2.1.4">Num</span> dimension shows pronounced differences. This suggests that the training corpora of most models may contain substantial formatting requirements, resulting in similar performance levels across models. 
Conversely, the collection of data related to numerical cognition likely involved varying degrees of effort among the models.</p> </div> <figure class="ltx_table" id="S5.T4"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S5.T4.110" style="width:433.6pt;height:441.1pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(13.3pt,-13.6pt) scale(1.06552727340628,1.06552727340628) ;"> <table class="ltx_tabular ltx_align_middle" id="S5.T4.110.110"> <tr class="ltx_tr" id="S5.T4.110.110.111" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="S5.T4.110.110.111.1"><span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span></td> <td class="ltx_td ltx_align_center" colspan="3" id="S5.T4.110.110.111.2"><span class="ltx_text ltx_font_bold" id="S5.T4.110.110.111.2.1">IFS</span></td> <td class="ltx_td ltx_align_center" colspan="2" id="S5.T4.110.110.111.3"><span class="ltx_text ltx_font_bold" id="S5.T4.110.110.111.3.1">Overall</span></td> </tr> <tr class="ltx_tr" id="S5.T4.110.110.112"> <td class="ltx_td ltx_align_left" id="S5.T4.110.110.112.1"><span class="ltx_text ltx_font_bold" id="S5.T4.110.110.112.1.1" style="background-color:#E6E6E6;">Model</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.110.110.112.2"><span class="ltx_text ltx_font_bold" id="S5.T4.110.110.112.2.1" style="background-color:#E6E6E6;">Expression</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.110.110.112.3"><span class="ltx_text ltx_font_bold" id="S5.T4.110.110.112.3.1" style="background-color:#E6E6E6;">Variable</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.110.110.112.4"><span class="ltx_text ltx_font_bold" id="S5.T4.110.110.112.4.1" style="background-color:#E6E6E6;">Length</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.110.110.112.5"><span class="ltx_text ltx_font_bold" 
id="S5.T4.110.110.112.5.1" style="background-color:#E6E6E6;">IFS(AVG)</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.110.110.112.6"><span class="ltx_text ltx_font_bold" id="S5.T4.110.110.112.6.1" style="background-color:#E6E6E6;">ARS</span></td> </tr> <tr class="ltx_tr" id="S5.T4.5.5.5"> <td class="ltx_td ltx_align_left ltx_border_t" id="S5.T4.5.5.5.6">gpt-4o-2024-08-06</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.1.1.1.1">0.082 <sub class="ltx_sub" id="S5.T4.1.1.1.1.1"><span class="ltx_text ltx_font_italic" id="S5.T4.1.1.1.1.1.1">(3)</span></sub> </td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.2.2.2.2"> <span class="ltx_text ltx_font_bold" id="S5.T4.2.2.2.2.1">0.061</span> <sub class="ltx_sub" id="S5.T4.2.2.2.2.2"><span class="ltx_text ltx_font_italic" id="S5.T4.2.2.2.2.2.1">(1)</span></sub> </td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.3.3.3.3"> <span class="ltx_text ltx_font_bold" id="S5.T4.3.3.3.3.1">0.089</span> <sub class="ltx_sub" id="S5.T4.3.3.3.3.2"><span class="ltx_text ltx_font_italic" id="S5.T4.3.3.3.3.2.1">(1)</span></sub> </td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.4.4.4.4"> <span class="ltx_text ltx_font_bold" id="S5.T4.4.4.4.4.1">0.078</span> <sub class="ltx_sub" id="S5.T4.4.4.4.4.2"><span class="ltx_text ltx_font_italic" id="S5.T4.4.4.4.4.2.1">(1)</span></sub> </td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T4.5.5.5.5"><span class="ltx_text ltx_font_bold" id="S5.T4.5.5.5.5.1">0.764 <sub class="ltx_sub" id="S5.T4.5.5.5.5.1.1"><span class="ltx_text ltx_font_medium ltx_font_italic" id="S5.T4.5.5.5.5.1.1.1">(1)</span></sub></span></td> </tr> <tr class="ltx_tr" id="S5.T4.11.11.11"> <td class="ltx_td ltx_align_left" id="S5.T4.6.6.6.1">Qwen2.5-72B-Instruct<sup class="ltx_sup" id="S5.T4.6.6.6.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.7.7.7.2"> <span class="ltx_text ltx_font_bold" id="S5.T4.7.7.7.2.1">0.064</span> <sub 
class="ltx_sub" id="S5.T4.7.7.7.2.2"><span class="ltx_text ltx_font_italic" id="S5.T4.7.7.7.2.2.1">(1)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.8.8.8.3">0.075 <sub class="ltx_sub" id="S5.T4.8.8.8.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.8.8.8.3.1.1">(3)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.9.9.9.4">0.173 <sub class="ltx_sub" id="S5.T4.9.9.9.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.9.9.9.4.1.1">(4)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.10.10.10.5"> <span class="ltx_text ltx_framed ltx_framed_underline" id="S5.T4.10.10.10.5.1">0.104</span> <sub class="ltx_sub" id="S5.T4.10.10.10.5.2"><span class="ltx_text ltx_font_italic" id="S5.T4.10.10.10.5.2.1">(2)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.11.11.11.6">0.700 <sub class="ltx_sub" id="S5.T4.11.11.11.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.11.11.11.6.1.1">(4)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.16.16.16"> <td class="ltx_td ltx_align_left" id="S5.T4.16.16.16.6">gpt-4-0125-preview</td> <td class="ltx_td ltx_align_center" id="S5.T4.12.12.12.1"> <span class="ltx_text ltx_framed ltx_framed_underline" id="S5.T4.12.12.12.1.1">0.065</span> <sub class="ltx_sub" id="S5.T4.12.12.12.1.2"><span class="ltx_text ltx_font_italic" id="S5.T4.12.12.12.1.2.1">(2)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.13.13.13.2">0.098 <sub class="ltx_sub" id="S5.T4.13.13.13.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.13.13.13.2.1.1">(8)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.14.14.14.3">0.157 <sub class="ltx_sub" id="S5.T4.14.14.14.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.14.14.14.3.1.1">(3)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.15.15.15.4">0.107 <sub class="ltx_sub" id="S5.T4.15.15.15.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.15.15.15.4.1.1">(3)</span></sub> </td> <td class="ltx_td 
ltx_align_center" id="S5.T4.16.16.16.5"> <span class="ltx_text ltx_framed ltx_framed_underline" id="S5.T4.16.16.16.5.1">0.734</span> <sub class="ltx_sub" id="S5.T4.16.16.16.5.2"><span class="ltx_text ltx_font_italic" id="S5.T4.16.16.16.5.2.1">(2)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.22.22.22"> <td class="ltx_td ltx_align_left" id="S5.T4.17.17.17.1">c4ai-command-r-v01<sup class="ltx_sup" id="S5.T4.17.17.17.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.18.18.18.2">0.142 <sub class="ltx_sub" id="S5.T4.18.18.18.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.18.18.18.2.1.1">(10)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.19.19.19.3">0.079 <sub class="ltx_sub" id="S5.T4.19.19.19.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.19.19.19.3.1.1">(5)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.20.20.20.4"> <span class="ltx_text ltx_framed ltx_framed_underline" id="S5.T4.20.20.20.4.1">0.148</span> <sub class="ltx_sub" id="S5.T4.20.20.20.4.2"><span class="ltx_text ltx_font_italic" id="S5.T4.20.20.20.4.2.1">(2)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.21.21.21.5">0.123 <sub class="ltx_sub" id="S5.T4.21.21.21.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.21.21.21.5.1.1">(4)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.22.22.22.6">0.584 <sub class="ltx_sub" id="S5.T4.22.22.22.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.22.22.22.6.1.1">(7)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.28.28.28"> <td class="ltx_td ltx_align_left" id="S5.T4.23.23.23.1">Qwen2.5-32B-Instruct<sup class="ltx_sup" id="S5.T4.23.23.23.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.24.24.24.2">0.107 <sub class="ltx_sub" id="S5.T4.24.24.24.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.24.24.24.2.1.1">(5)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.25.25.25.3">0.084 <sub class="ltx_sub" 
id="S5.T4.25.25.25.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.25.25.25.3.1.1">(6)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.26.26.26.4">0.194 <sub class="ltx_sub" id="S5.T4.26.26.26.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.26.26.26.4.1.1">(6)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.27.27.27.5">0.129 <sub class="ltx_sub" id="S5.T4.27.27.27.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.27.27.27.5.1.1">(5)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.28.28.28.6">0.632 <sub class="ltx_sub" id="S5.T4.28.28.28.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.28.28.28.6.1.1">(5)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.33.33.33"> <td class="ltx_td ltx_align_left" id="S5.T4.33.33.33.6">Meta-Llama-3.1-70B-Instruct</td> <td class="ltx_td ltx_align_center" id="S5.T4.29.29.29.1">0.097 <sub class="ltx_sub" id="S5.T4.29.29.29.1.1"><span class="ltx_text ltx_font_italic" id="S5.T4.29.29.29.1.1.1">(4)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.30.30.30.2"> <span class="ltx_text ltx_framed ltx_framed_underline" id="S5.T4.30.30.30.2.1">0.074</span> <sub class="ltx_sub" id="S5.T4.30.30.30.2.2"><span class="ltx_text ltx_font_italic" id="S5.T4.30.30.30.2.2.1">(2)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.31.31.31.3">0.265 <sub class="ltx_sub" id="S5.T4.31.31.31.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.31.31.31.3.1.1">(13)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.32.32.32.4">0.146 <sub class="ltx_sub" id="S5.T4.32.32.32.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.32.32.32.4.1.1">(7)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.33.33.33.5">0.700 <sub class="ltx_sub" id="S5.T4.33.33.33.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.33.33.33.5.1.1">(3)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.39.39.39"> <td class="ltx_td ltx_align_left" 
id="S5.T4.34.34.34.1">c4ai-command-r-08-2024<sup class="ltx_sup" id="S5.T4.34.34.34.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.35.35.35.2">0.115 <sub class="ltx_sub" id="S5.T4.35.35.35.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.35.35.35.2.1.1">(7)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.36.36.36.3">0.077 <sub class="ltx_sub" id="S5.T4.36.36.36.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.36.36.36.3.1.1">(4)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.37.37.37.4">0.251 <sub class="ltx_sub" id="S5.T4.37.37.37.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.37.37.37.4.1.1">(11)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.38.38.38.5">0.148 <sub class="ltx_sub" id="S5.T4.38.38.38.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.38.38.38.5.1.1">(8)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.39.39.39.6">0.611 <sub class="ltx_sub" id="S5.T4.39.39.39.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.39.39.39.6.1.1">(6)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.44.44.44"> <td class="ltx_td ltx_align_left" id="S5.T4.44.44.44.6">internlm2_5-7b-chat-1m</td> <td class="ltx_td ltx_align_center" id="S5.T4.40.40.40.1">0.108 <sub class="ltx_sub" id="S5.T4.40.40.40.1.1"><span class="ltx_text ltx_font_italic" id="S5.T4.40.40.40.1.1.1">(6)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.41.41.41.2">0.130 <sub class="ltx_sub" id="S5.T4.41.41.41.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.41.41.41.2.1.1">(15)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.42.42.42.3">0.196 <sub class="ltx_sub" id="S5.T4.42.42.42.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.42.42.42.3.1.1">(7)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.43.43.43.4">0.145 <sub class="ltx_sub" id="S5.T4.43.43.43.4.1"><span class="ltx_text ltx_font_italic" 
id="S5.T4.43.43.43.4.1.1">(6)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.44.44.44.5">0.523 <sub class="ltx_sub" id="S5.T4.44.44.44.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.44.44.44.5.1.1">(10)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.50.50.50"> <td class="ltx_td ltx_align_left" id="S5.T4.45.45.45.1">Qwen2.5-72B<sup class="ltx_sup" id="S5.T4.45.45.45.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.46.46.46.2">0.153 <sub class="ltx_sub" id="S5.T4.46.46.46.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.46.46.46.2.1.1">(11)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.47.47.47.3">0.094 <sub class="ltx_sub" id="S5.T4.47.47.47.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.47.47.47.3.1.1">(7)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.48.48.48.4">0.236 <sub class="ltx_sub" id="S5.T4.48.48.48.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.48.48.48.4.1.1">(10)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.49.49.49.5">0.161 <sub class="ltx_sub" id="S5.T4.49.49.49.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.49.49.49.5.1.1">(10)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.50.50.50.6">0.542 <sub class="ltx_sub" id="S5.T4.50.50.50.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.50.50.50.6.1.1">(8)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.56.56.56"> <td class="ltx_td ltx_align_left" id="S5.T4.51.51.51.1">Qwen2.5-7B-Instruct<sup class="ltx_sup" id="S5.T4.51.51.51.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.52.52.52.2">0.136 <sub class="ltx_sub" id="S5.T4.52.52.52.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.52.52.52.2.1.1">(8)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.53.53.53.3">0.111 <sub class="ltx_sub" id="S5.T4.53.53.53.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.53.53.53.3.1.1">(9)</span></sub> </td> <td class="ltx_td 
ltx_align_center" id="S5.T4.54.54.54.4">0.258 <sub class="ltx_sub" id="S5.T4.54.54.54.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.54.54.54.4.1.1">(12)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.55.55.55.5">0.168 <sub class="ltx_sub" id="S5.T4.55.55.55.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.55.55.55.5.1.1">(12)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.56.56.56.6">0.515 <sub class="ltx_sub" id="S5.T4.56.56.56.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.56.56.56.6.1.1">(11)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.62.62.62"> <td class="ltx_td ltx_align_left" id="S5.T4.57.57.57.1">Qwen2.5-14B-Instruct<sup class="ltx_sup" id="S5.T4.57.57.57.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.58.58.58.2">0.140 <sub class="ltx_sub" id="S5.T4.58.58.58.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.58.58.58.2.1.1">(9)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.59.59.59.3">0.126 <sub class="ltx_sub" id="S5.T4.59.59.59.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.59.59.59.3.1.1">(14)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.60.60.60.4">0.224 <sub class="ltx_sub" id="S5.T4.60.60.60.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.60.60.60.4.1.1">(9)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.61.61.61.5">0.163 <sub class="ltx_sub" id="S5.T4.61.61.61.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.61.61.61.5.1.1">(11)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.62.62.62.6">0.524 <sub class="ltx_sub" id="S5.T4.62.62.62.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.62.62.62.6.1.1">(9)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.67.67.67"> <td class="ltx_td ltx_align_left" id="S5.T4.67.67.67.6">glm-4-9b-chat-1m</td> <td class="ltx_td ltx_align_center" id="S5.T4.63.63.63.1">0.157 <sub class="ltx_sub" id="S5.T4.63.63.63.1.1"><span class="ltx_text 
ltx_font_italic" id="S5.T4.63.63.63.1.1.1">(12)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.64.64.64.2">0.141 <sub class="ltx_sub" id="S5.T4.64.64.64.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.64.64.64.2.1.1">(18)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.65.65.65.3">0.178 <sub class="ltx_sub" id="S5.T4.65.65.65.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.65.65.65.3.1.1">(5)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.66.66.66.4">0.159 <sub class="ltx_sub" id="S5.T4.66.66.66.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.66.66.66.4.1.1">(9)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.67.67.67.5">0.492 <sub class="ltx_sub" id="S5.T4.67.67.67.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.67.67.67.5.1.1">(12)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.72.72.72"> <td class="ltx_td ltx_align_left" id="S5.T4.72.72.72.6">Meta-Llama-3.1-8B-Instruct</td> <td class="ltx_td ltx_align_center" id="S5.T4.68.68.68.1">0.168 <sub class="ltx_sub" id="S5.T4.68.68.68.1.1"><span class="ltx_text ltx_font_italic" id="S5.T4.68.68.68.1.1.1">(13)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.69.69.69.2">0.136 <sub class="ltx_sub" id="S5.T4.69.69.69.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.69.69.69.2.1.1">(17)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.70.70.70.3">0.210 <sub class="ltx_sub" id="S5.T4.70.70.70.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.70.70.70.3.1.1">(8)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.71.71.71.4">0.171 <sub class="ltx_sub" id="S5.T4.71.71.71.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.71.71.71.4.1.1">(13)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.72.72.72.5">0.488 <sub class="ltx_sub" id="S5.T4.72.72.72.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.72.72.72.5.1.1">(13)</span></sub> </td> </tr> <tr 
class="ltx_tr" id="S5.T4.77.77.77"> <td class="ltx_td ltx_align_left" id="S5.T4.77.77.77.6">Meta-Llama-3.1-8B</td> <td class="ltx_td ltx_align_center" id="S5.T4.73.73.73.1">0.207 <sub class="ltx_sub" id="S5.T4.73.73.73.1.1"><span class="ltx_text ltx_font_italic" id="S5.T4.73.73.73.1.1.1">(17)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.74.74.74.2">0.115 <sub class="ltx_sub" id="S5.T4.74.74.74.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.74.74.74.2.1.1">(10)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.75.75.75.3">0.321 <sub class="ltx_sub" id="S5.T4.75.75.75.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.75.75.75.3.1.1">(16)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.76.76.76.4">0.215 <sub class="ltx_sub" id="S5.T4.76.76.76.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.76.76.76.4.1.1">(15)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.77.77.77.5">0.399 <sub class="ltx_sub" id="S5.T4.77.77.77.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.77.77.77.5.1.1">(16)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.83.83.83"> <td class="ltx_td ltx_align_left" id="S5.T4.78.78.78.1">Qwen2.5-14B<sup class="ltx_sup" id="S5.T4.78.78.78.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.79.79.79.2">0.170 <sub class="ltx_sub" id="S5.T4.79.79.79.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.79.79.79.2.1.1">(14)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.80.80.80.3">0.121 <sub class="ltx_sub" id="S5.T4.80.80.80.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.80.80.80.3.1.1">(11)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.81.81.81.4">0.524 <sub class="ltx_sub" id="S5.T4.81.81.81.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.81.81.81.4.1.1">(18)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.82.82.82.5">0.272 <sub class="ltx_sub" id="S5.T4.82.82.82.5.1"><span 
class="ltx_text ltx_font_italic" id="S5.T4.82.82.82.5.1.1">(18)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.83.83.83.6">0.371 <sub class="ltx_sub" id="S5.T4.83.83.83.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.83.83.83.6.1.1">(18)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.88.88.88"> <td class="ltx_td ltx_align_left" id="S5.T4.88.88.88.6">LWM-Text-Chat-1M</td> <td class="ltx_td ltx_align_center" id="S5.T4.84.84.84.1">0.199 <sub class="ltx_sub" id="S5.T4.84.84.84.1.1"><span class="ltx_text ltx_font_italic" id="S5.T4.84.84.84.1.1.1">(15)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.85.85.85.2">0.130 <sub class="ltx_sub" id="S5.T4.85.85.85.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.85.85.85.2.1.1">(16)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.86.86.86.3">0.269 <sub class="ltx_sub" id="S5.T4.86.86.86.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.86.86.86.3.1.1">(14)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.87.87.87.4">0.200 <sub class="ltx_sub" id="S5.T4.87.87.87.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.87.87.87.4.1.1">(14)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.88.88.88.5">0.409 <sub class="ltx_sub" id="S5.T4.88.88.88.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.88.88.88.5.1.1">(15)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.94.94.94"> <td class="ltx_td ltx_align_left" id="S5.T4.89.89.89.1">Qwen2.5-32B<sup class="ltx_sup" id="S5.T4.89.89.89.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.90.90.90.2">0.200 <sub class="ltx_sub" id="S5.T4.90.90.90.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.90.90.90.2.1.1">(16)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.91.91.91.3">0.121 <sub class="ltx_sub" id="S5.T4.91.91.91.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.91.91.91.3.1.1">(12)</span></sub> </td> <td class="ltx_td ltx_align_center" 
id="S5.T4.92.92.92.4">0.689 <sub class="ltx_sub" id="S5.T4.92.92.92.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.92.92.92.4.1.1">(19)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.93.93.93.5">0.337 <sub class="ltx_sub" id="S5.T4.93.93.93.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.93.93.93.5.1.1">(19)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.94.94.94.6">0.377 <sub class="ltx_sub" id="S5.T4.94.94.94.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.94.94.94.6.1.1">(17)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.100.100.100"> <td class="ltx_td ltx_align_left" id="S5.T4.95.95.95.1">Qwen2.5-7B<sup class="ltx_sup" id="S5.T4.95.95.95.1.1">†</sup> </td> <td class="ltx_td ltx_align_center" id="S5.T4.96.96.96.2">0.267 <sub class="ltx_sub" id="S5.T4.96.96.96.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.96.96.96.2.1.1">(20)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.97.97.97.3">0.125 <sub class="ltx_sub" id="S5.T4.97.97.97.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.97.97.97.3.1.1">(13)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.98.98.98.4">0.779 <sub class="ltx_sub" id="S5.T4.98.98.98.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.98.98.98.4.1.1">(20)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.99.99.99.5">0.390 <sub class="ltx_sub" id="S5.T4.99.99.99.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.99.99.99.5.1.1">(20)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.100.100.100.6">0.187 <sub class="ltx_sub" id="S5.T4.100.100.100.6.1"><span class="ltx_text ltx_font_italic" id="S5.T4.100.100.100.6.1.1">(20)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.105.105.105"> <td class="ltx_td ltx_align_left" id="S5.T4.105.105.105.6">Meta-Llama-3.1-70B</td> <td class="ltx_td ltx_align_center" id="S5.T4.101.101.101.1">0.226 <sub class="ltx_sub" id="S5.T4.101.101.101.1.1"><span class="ltx_text 
ltx_font_italic" id="S5.T4.101.101.101.1.1.1">(18)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.102.102.102.2">0.170 <sub class="ltx_sub" id="S5.T4.102.102.102.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.102.102.102.2.1.1">(19)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.103.103.103.3">0.399 <sub class="ltx_sub" id="S5.T4.103.103.103.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.103.103.103.3.1.1">(17)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.104.104.104.4">0.265 <sub class="ltx_sub" id="S5.T4.104.104.104.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.104.104.104.4.1.1">(17)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.105.105.105.5">0.436 <sub class="ltx_sub" id="S5.T4.105.105.105.5.1"><span class="ltx_text ltx_font_italic" id="S5.T4.105.105.105.5.1.1">(14)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.110.110.110"> <td class="ltx_td ltx_align_left" id="S5.T4.110.110.110.6">LWM-Text-1M</td> <td class="ltx_td ltx_align_center" id="S5.T4.106.106.106.1">0.245 <sub class="ltx_sub" id="S5.T4.106.106.106.1.1"><span class="ltx_text ltx_font_italic" id="S5.T4.106.106.106.1.1.1">(19)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.107.107.107.2">0.186 <sub class="ltx_sub" id="S5.T4.107.107.107.2.1"><span class="ltx_text ltx_font_italic" id="S5.T4.107.107.107.2.1.1">(20)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.108.108.108.3">0.290 <sub class="ltx_sub" id="S5.T4.108.108.108.3.1"><span class="ltx_text ltx_font_italic" id="S5.T4.108.108.108.3.1.1">(15)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.109.109.109.4">0.240 <sub class="ltx_sub" id="S5.T4.109.109.109.4.1"><span class="ltx_text ltx_font_italic" id="S5.T4.109.109.109.4.1.1">(16)</span></sub> </td> <td class="ltx_td ltx_align_center" id="S5.T4.110.110.110.5">0.212 <sub class="ltx_sub" id="S5.T4.110.110.110.5.1"><span class="ltx_text 
ltx_font_italic" id="S5.T4.110.110.110.5.1.1">(19)</span></sub> </td> </tr> <tr class="ltx_tr" id="S5.T4.110.110.113"> <td class="ltx_td ltx_align_left" id="S5.T4.110.110.113.1"><span class="ltx_rule" style="width:100%;height:1.0pt;background:black;display:inline-block;"> </span></td> <td class="ltx_td" id="S5.T4.110.110.113.2"></td> <td class="ltx_td" id="S5.T4.110.110.113.3"></td> <td class="ltx_td" id="S5.T4.110.110.113.4"></td> <td class="ltx_td" id="S5.T4.110.110.113.5"></td> <td class="ltx_td" id="S5.T4.110.110.113.6"></td> </tr> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 4: </span>Instruction Following Stability of models from different perspectives. The integers in parentheses represent rankings, where smaller numbers indicate better performance. The overall order is determined by the average IFS rankings from three perspectives.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S5.SS2.p3"> <p class="ltx_p" id="S5.SS2.p3.1"><span class="ltx_text ltx_font_bold" id="S5.SS2.p3.1.1">Stability</span> Table <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S5.T4" title="Table 4 ‣ 5.2 Results on LIFBench ‣ 5 Experiments ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">4</span></a> presents the IFS scores and corresponding rankings of each model from three perspectives. Comparing these scores with ARS, we observe notable discrepancies between model stability and task completion ability. For instance, while Qwen2.5-72B-Instruct underwent truncation, potentially affecting its stability in "Length", it still outperformed GPT-4-0125-preview, which had a higher ARS score. Furthermore, each model demonstrated unique strengths and weaknesses across the different perspectives. 
GPT-4-2024-08-06, for example, showed less stability in "Expression" than in other areas, whereas GPT-4-0125-preview struggled more with the dynamic key parts of instructions, an area where Meta-Llama-3.1-70B-Instruct excelled. Among models of the same series, instruction fine-tuning generally enhanced stability; however, parameter size did not guarantee superiority. For example, Qwen2.5-72B-Instruct outperformed all closed-source models in the "Expression" aspect.</p> </div> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6 </span>Conclusion</h2> <div class="ltx_para" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">In this work, we systematically evaluated the instruction-following capabilities and stability of LLMs in long-context scenarios. We developed LIFBench, comprising three long-context scenarios and a diverse set of 11 tasks, alongside a method for instruction expansion across three distinct perspectives. For evaluation, we introduced LIFEval, an automated rubric-based scoring framework, enabling fast and accurate assessments of model task performance and stability. Additionally, we conducted extensive experiments on 20 prominent LLMs, comparing and analyzing the performance of representative models.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Achiam et al. (2023)</span> <span class="ltx_bibblock"> Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. </span> <span class="ltx_bibblock">Gpt-4 technical report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib1.1.1">arXiv preprint arXiv:2303.08774</em>. 
</span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">An et al. (2023)</span> <span class="ltx_bibblock"> Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2023. </span> <span class="ltx_bibblock">L-eval: Instituting standardized evaluation for long context language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib2.1.1">arXiv preprint arXiv:2307.11088</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Anthropic (2024)</span> <span class="ltx_bibblock"> AI Anthropic. 2024. </span> <span class="ltx_bibblock">The claude 3 model family: Opus, sonnet, haiku. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib3.1.1">Claude-3 Model Card</em>, 1. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bai et al. (2023)</span> <span class="ltx_bibblock"> Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. </span> <span class="ltx_bibblock">Longbench: A bilingual, multitask benchmark for long context understanding. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib4.1.1">arXiv preprint arXiv:2308.14508</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Brown (2020)</span> <span class="ltx_bibblock"> Tom B Brown. 2020. </span> <span class="ltx_bibblock">Language models are few-shot learners. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">arXiv preprint arXiv:2005.14165</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cai et al. 
(2024)</span> <span class="ltx_bibblock"> Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. 2024. </span> <span class="ltx_bibblock">Internlm2 technical report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">arXiv preprint arXiv:2403.17297</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2024a)</span> <span class="ltx_bibblock"> Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024a. </span> <span class="ltx_bibblock">Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">arXiv preprint arXiv:2402.03216</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2024b)</span> <span class="ltx_bibblock"> Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, and Maarten de Rijke. 2024b. </span> <span class="ltx_bibblock">The sifo benchmark: Investigating the sequential instruction following ability of large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib8.1.1">arXiv preprint arXiv:2406.19999</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chowdhery et al. (2023)</span> <span class="ltx_bibblock"> Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. </span> <span class="ltx_bibblock">Palm: Scaling language modeling with pathways. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">Journal of Machine Learning Research</em>, 24(240):1–113. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cohere For AI (2024)</span> <span class="ltx_bibblock"> Cohere For AI. 2024. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_href" href="https://doi.org/10.57967/hf/3134" title="">c4ai-command-r-08-2024</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cook et al. (2024)</span> <span class="ltx_bibblock"> Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, and Alex Wang. 2024. </span> <span class="ltx_bibblock">Ticking all the boxes: Generated checklists improve llm evaluation and generation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">arXiv preprint arXiv:2410.03608</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dong et al. (2024)</span> <span class="ltx_bibblock"> Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. 2024. </span> <span class="ltx_bibblock">Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</em>, pages 2086–2099. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dubey et al. (2024)</span> <span class="ltx_bibblock"> Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. </span> <span class="ltx_bibblock">The llama 3 herd of models. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">arXiv preprint arXiv:2407.21783</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">GLM et al. (2024)</span> <span class="ltx_bibblock"> Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. 2024. </span> <span class="ltx_bibblock">Chatglm: A family of large language models from glm-130b to glm-4 all tools. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">arXiv preprint arXiv:2406.12793</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Greg Kamradt (2023)</span> <span class="ltx_bibblock"> Greg Kamradt. 2023. </span> <span class="ltx_bibblock">LLMTest: Needle in a Haystack. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/gkamradt/LLMTest_NeedleInAHaystack" title="">https://github.com/gkamradt/LLMTest_NeedleInAHaystack</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">He et al. (2024)</span> <span class="ltx_bibblock"> Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Jiaqing Liang, and Yanghua Xiao. 2024. </span> <span class="ltx_bibblock">Can large language models understand real-world complex instructions? </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">Proceedings of the AAAI Conference on Artificial Intelligence</em>, volume 38, pages 18188–18196. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Honovich et al. (2023)</span> <span class="ltx_bibblock"> Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2023. 
</span> <span class="ltx_bibblock">Unnatural instructions: Tuning language models with (almost) no human labor. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, pages 14409–14428. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hsieh et al. (2024)</span> <span class="ltx_bibblock"> Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. 2024. </span> <span class="ltx_bibblock">Ruler: What’s the real context size of your long-context language models? </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">arXiv preprint arXiv:2404.06654</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Huang et al. (2021)</span> <span class="ltx_bibblock"> Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. </span> <span class="ltx_bibblock">Efficient attentions for long document summarization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib19.1.1">arXiv preprint arXiv:2104.02112</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jiang et al. (2023)</span> <span class="ltx_bibblock"> Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, and Wei Wang. 2023. </span> <span class="ltx_bibblock">Followbench: A multi-level fine-grained constraints following benchmark for large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib20.1.1">arXiv preprint arXiv:2310.20410</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jing et al. 
(2023)</span> <span class="ltx_bibblock"> Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng Wang, and Deyi Xiong. 2023. </span> <span class="ltx_bibblock">Followeval: A multi-dimensional benchmark for assessing the instruction-following capability of large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib21.1.1">arXiv preprint arXiv:2311.09829</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kwan et al. (2023)</span> <span class="ltx_bibblock"> Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Lifeng Shang, Qun Liu, and Kam-Fai Wong. 2023. </span> <span class="ltx_bibblock">M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib22.1.1">arXiv preprint arXiv:2310.19240</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kwon et al. (2023)</span> <span class="ltx_bibblock"> Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. </span> <span class="ltx_bibblock">Efficient memory management for large language model serving with pagedattention. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib23.1.1">Proceedings of the 29th Symposium on Operating Systems Principles</em>, pages 611–626. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Levy et al. (2024)</span> <span class="ltx_bibblock"> Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. </span> <span class="ltx_bibblock">Same task, more tokens: the impact of input length on the reasoning performance of large language models. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib24.1.1">arXiv preprint arXiv:2402.14848</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2023)</span> <span class="ltx_bibblock"> Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2023. </span> <span class="ltx_bibblock">Loogle: Can long-context language models understand long contexts? </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib25.1.1">arXiv preprint arXiv:2311.04939</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2024a)</span> <span class="ltx_bibblock"> Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. 2024a. </span> <span class="ltx_bibblock">Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib26.1.1">arXiv preprint arXiv:2402.19255</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2024b)</span> <span class="ltx_bibblock"> Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. 2024b. </span> <span class="ltx_bibblock">Long-context llms struggle with long in-context learning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">arXiv preprint arXiv:2404.02060</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lin (2004)</span> <span class="ltx_bibblock"> Chin-Yew Lin. 2004. </span> <span class="ltx_bibblock">Rouge: A package for automatic evaluation of summaries. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib28.1.1">Text summarization branches out</em>, pages 74–81. 
</span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2024a)</span> <span class="ltx_bibblock"> Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. 2024a. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_href" href="https://api.semanticscholar.org/CorpusID:267637090" title="">World model on million-length video and language with blockwise ringattention</a>. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib29.1.1">ArXiv</em>, abs/2402.08268. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2024b)</span> <span class="ltx_bibblock"> Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024b. </span> <span class="ltx_bibblock">Lost in the middle: How language models use long contexts. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib30.1.1">Transactions of the Association for Computational Linguistics</em>, 12:157–173. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lu et al. (2023)</span> <span class="ltx_bibblock"> Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023. </span> <span class="ltx_bibblock">Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib31.1.1">arXiv e-prints</em>, pages arXiv–2310. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Macqueen (1967)</span> <span class="ltx_bibblock"> J Macqueen. 1967. </span> <span class="ltx_bibblock">Some methods for classification and analysis of multivariate observations. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib32.1.1">Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability/University of California Press</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Narayan et al. (2018)</span> <span class="ltx_bibblock"> Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. </span> <span class="ltx_bibblock">Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib33.1.1">arXiv preprint arXiv:1808.08745</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ni et al. (2024)</span> <span class="ltx_bibblock"> Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, and Piji Li. 2024. </span> <span class="ltx_bibblock">Xl<sup class="ltx_sup" id="bib.bib34.1.1">2</sup> bench: A benchmark for extremely long context understanding with long-range dependencies. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.2.1">arXiv preprint arXiv:2404.05446</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">O’Connor and Andreas (2021)</span> <span class="ltx_bibblock"> Joe O’Connor and Jacob Andreas. 2021. </span> <span class="ltx_bibblock">What context features can transformer language models use? </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib35.1.1">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</em>, pages 851–864. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Papineni et al. 
(2002)</span> <span class="ltx_bibblock"> Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. </span> <span class="ltx_bibblock">Bleu: a method for automatic evaluation of machine translation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib36.1.1">Proceedings of the 40th annual meeting of the Association for Computational Linguistics</em>, pages 311–318. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Parmar et al. (2024)</span> <span class="ltx_bibblock"> Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024. </span> <span class="ltx_bibblock">Logicbench: Towards systematic evaluation of logical reasoning ability of large language models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib37.1.1">Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, pages 13679–13707. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Peng et al. (2023)</span> <span class="ltx_bibblock"> Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_href" href="https://api.semanticscholar.org/CorpusID:261493986" title="">Yarn: Efficient context window extension of large language models</a>. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">ArXiv</em>, abs/2309.00071. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Qin et al. (2024a)</span> <span class="ltx_bibblock"> Yanzhao Qin, Tao Zhang, Yanjun Shen, Wenjing Luo, Haoze Sun, Yan Zhang, Yujing Qiao, Weipeng Chen, Zenan Zhou, Wentao Zhang, et al. 2024a. 
</span> <span class="ltx_bibblock">Sysbench: Can large language models follow system messages? </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib39.1.1">arXiv preprint arXiv:2408.10943</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Qin et al. (2024b)</span> <span class="ltx_bibblock"> Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. 2024b. </span> <span class="ltx_bibblock">Infobench: Evaluating instruction following ability in large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.1.1">arXiv preprint arXiv:2401.03601</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Que et al. (2024)</span> <span class="ltx_bibblock"> Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, et al. 2024. </span> <span class="ltx_bibblock">Hellobench: Evaluating long text generation capabilities of large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib41.1.1">arXiv preprint arXiv:2409.16191</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shaham et al. (2023)</span> <span class="ltx_bibblock"> Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. 2023. </span> <span class="ltx_bibblock">Zeroscrolls: A zero-shot benchmark for long text understanding. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib42.1.1">Findings of the Association for Computational Linguistics: EMNLP 2023</em>, pages 7977–7989. </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shaham et al. 
(2022)</span> <span class="ltx_bibblock"> Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, et al. 2022. </span> <span class="ltx_bibblock">Scrolls: Standardized comparison over long language sequences. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib43.1.1">Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</em>, pages 12007–12021. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tan et al. (2024)</span> <span class="ltx_bibblock"> Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Yunlong Feng, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu, et al. 2024. </span> <span class="ltx_bibblock">Proxyqa: An alternative framework for evaluating long-form text generation with large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib44.1.1">arXiv preprint arXiv:2401.15042</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Taori et al. (2023)</span> <span class="ltx_bibblock"> Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. </span> <span class="ltx_bibblock">Stanford alpaca: An instruction-following llama model. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/tatsu-lab/stanford_alpaca" title="">https://github.com/tatsu-lab/stanford_alpaca</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al. (2024)</span> <span class="ltx_bibblock"> Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. 2024. 
</span> <span class="ltx_bibblock">Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib46.1.1">arXiv preprint arXiv:2406.17419</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al. (2023)</span> <span class="ltx_bibblock"> Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. </span> <span class="ltx_bibblock">Large language models are not fair evaluators. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib47.1.1">arXiv preprint arXiv:2305.17926</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wen et al. (2024)</span> <span class="ltx_bibblock"> Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, et al. 2024. </span> <span class="ltx_bibblock">Benchmarking complex instruction-following with multiple constraints composition. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib48.1.1">arXiv preprint arXiv:2407.03978</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al. (2024)</span> <span class="ltx_bibblock"> An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. </span> <span class="ltx_bibblock">Qwen2 technical report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib49.1.1">arXiv preprint arXiv:2407.10671</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. 
(2024a)</span> <span class="ltx_bibblock"> Tao Zhang, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, et al. 2024a. </span> <span class="ltx_bibblock">Cfbench: A comprehensive constraints-following benchmark for llms. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib50.1.1">arXiv preprint arXiv:2408.01122</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2024b)</span> <span class="ltx_bibblock"> Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen Mckeown, and Tatsunori B Hashimoto. 2024b. </span> <span class="ltx_bibblock">Benchmarking large language models for news summarization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib51.1.1">Transactions of the Association for Computational Linguistics</em>, 11:39–57. </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2024c)</span> <span class="ltx_bibblock"> Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, et al. 2024c. </span> <span class="ltx_bibblock"><math alttext="\infty" class="ltx_Math" display="inline" id="bib.bib52.1.m1.1"><semantics id="bib.bib52.1.m1.1a"><mi id="bib.bib52.1.m1.1.1" mathvariant="normal" xref="bib.bib52.1.m1.1.1.cmml">∞</mi><annotation-xml encoding="MathML-Content" id="bib.bib52.1.m1.1b"><infinity id="bib.bib52.1.m1.1.1.cmml" xref="bib.bib52.1.m1.1.1"></infinity></annotation-xml><annotation encoding="application/x-tex" id="bib.bib52.1.m1.1c">\infty</annotation><annotation encoding="application/x-llamapun" id="bib.bib52.1.m1.1d">∞</annotation></semantics></math> bench: Extending long context evaluation beyond 100k tokens. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib52.2.1">Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, pages 15262–15277. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhong et al. (2021)</span> <span class="ltx_bibblock"> Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. 2021. </span> <span class="ltx_bibblock">Qmsum: A new benchmark for query-based multi-domain meeting summarization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib53.1.1">arXiv preprint arXiv:2104.05938</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al. (2023a)</span> <span class="ltx_bibblock"> Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023a. </span> <span class="ltx_bibblock">Instruction-following evaluation for large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib54.1.1">arXiv preprint arXiv:2311.07911</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al. (2023b)</span> <span class="ltx_bibblock"> Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023b. </span> <span class="ltx_bibblock">Instruction-following evaluation for large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib55.1.1">arXiv preprint arXiv:2311.07911</em>. 
</span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <section class="ltx_appendix" id="A1"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix A </span>Details in data collection</h2> <section class="ltx_subsection" id="A1.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">A.1 </span>List</h3> <div class="ltx_para" id="A1.SS1.p1"> <p class="ltx_p" id="A1.SS1.p1.1">To ensure the quality of tasks, the input for the List scenario is constructed as an ordered list consisting of two types of elements: randomly generated UUIDs and natural language instruction texts. <cite class="ltx_cite ltx_citemacro_cite">Liu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib30" title="">2024b</a>)</cite> used UUIDs to build key-value pairs and explained how large models utilize input context. Inspired by this work, we generated a set of unique 128-bit UUIDs as the first part of the list. However, real-world scenarios often involve some level of semantic noise, and Transformer language models may exhibit varying sensitivities to different linguistic features in their input <cite class="ltx_cite ltx_citemacro_cite">O’Connor and Andreas (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib35" title="">2021</a>)</cite>. Therefore, to enhance complexity and realism, we selected a subset of instruction texts from the Alpaca-52k dataset <cite class="ltx_cite ltx_citemacro_cite">Taori et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib45" title="">2023</a>)</cite>, which serve as the second type of element in the list. We found that when instruction texts are mixed into the retrieval list, the model’s attention tends to be drawn to the embedded instructions, so that it prioritizes following them over the originally assigned task. 
As a result, we chose the "Instructions" part of the Alpaca-52k dataset as list elements. In addition, to keep the text length appropriate, all selected instructions were limited to 5–40 tokens.</p> </div> </section> <section class="ltx_subsection" id="A1.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">A.2 </span>MultiDoc</h3> <div class="ltx_para ltx_noindent" id="A1.SS2.p1"> <p class="ltx_p" id="A1.SS2.p1.1"><span class="ltx_text ltx_font_bold" id="A1.SS2.p1.1.1">Datasets</span> GovReport <cite class="ltx_cite ltx_citemacro_cite">Huang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib19" title="">2021</a>)</cite> is a long-document summarization dataset consisting of reports written by government research agencies; XSum <cite class="ltx_cite ltx_citemacro_cite">Narayan et al. (<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib33" title="">2018</a>)</cite> and QMSum <cite class="ltx_cite ltx_citemacro_cite">Zhong et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#bib.bib53" title="">2021</a>)</cite> are, respectively, a news summarization dataset and a meeting summarization dataset, both of which cover a wide range of domains; and the Paul Graham Essays dataset<span class="ltx_note ltx_role_footnote" id="footnote3"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup><span class="ltx_tag ltx_tag_note">3</span><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://huggingface.co/datasets/sgoel9/paul_graham_essays" title="">https://huggingface.co/datasets/sgoel9/paul_graham_essays</a></span></span></span> collects articles written by Paul Graham in a variety of fields.</p> </div> <div class="ltx_para ltx_noindent" id="A1.SS2.p2"> <p class="ltx_p" id="A1.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="A1.SS2.p2.1.1">Scenario Design</span> For each dataset, we excluded the summaries and proportionally sampled the main texts from the four datasets, processing them into multiple documents to serve as task inputs. To increase the number of documents in the input, we limited each text to between 300 and 500 tokens through filtering and truncation.</p> </div> <div class="ltx_para" id="A1.SS2.p3"> <p class="ltx_p" id="A1.SS2.p3.1">As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A3.T5" title=""><span class="ltx_text ltx_ref_tag">5</span></a>, each document in the input contains up to six fields: "text", "id", "iD2", "title", "date", and "source". The "text" field holds the processed text from the above-mentioned datasets, and the remaining five fields are additional attributes added for task design. Some of them are provided by the dataset itself, while others are manually constructed. Note that each document has unique "id" and "iD2" fields. Not all documents have all six fields: some may lack the "title" field, the "source" field, or both. 
Additionally, to reduce the dependence of LLMs on parametric knowledge, we randomly annotate the "source" field. This removes any correlation between the "source" and "text" fields, ensuring that the model cannot infer the source of the "text" from its pre-trained knowledge and must rely on the constructed input. This creates additional challenges for the model in complying with the instructions in the following tasks. Finally, we create documents with identical "text" content by selecting several "text" values and copying them into other documents. The proportion of these duplicates was maintained at approximately 25%.</p> </div> <div class="ltx_pagination ltx_role_newpage"></div> </section> </section> <section class="ltx_appendix" id="A2"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix B </span>Effectiveness of Expression Extension</h2> <figure class="ltx_figure" id="A2.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="394" id="A2.F3.g1" src="x3.png" width="789"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>IFS Gap Across Three Perspectives with Expression Extension vs. Random Sampling. Positive values indicate an IFS increase with Expression Extension compared to Random Sampling, while negative values indicate a decrease. 
"MEAN" on the x-axis represents the average IFS gap.</figcaption> </figure> <div class="ltx_para" id="A2.p1"> <p class="ltx_p" id="A2.p1.1">To validate the effectiveness of the Expression Extension method proposed in Section <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#S3.SS3" title="3.3 Data Extension ‣ 3 LIFBench ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">3.3</span></a>, we compared it with a random sampling approach, as illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2411.07037v1#A2.F3" title="Figure 3 ‣ Appendix B Effectiveness of Expression Extension ‣ LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios"><span class="ltx_text ltx_ref_tag">3</span></a>. Specifically, after the Rewriting phase, we constructed a dataset by randomly sampling (rather than clustering) the same number of instructions and calculated the IFS values for sixteen models on this dataset.</p> </div> <div class="ltx_para" id="A2.p2"> <p class="ltx_p" id="A2.p2.1">The results show that, compared to random sampling, Expression Extension led to higher expression IFS scores in 75% of the models, outperforming the IFS improvement rates observed for the Variable and Length perspectives, at 37.5% and 62.5%, respectively. Additionally, when comparing the mean IFS values across all models, we observed that Expression Extension improved expression IFS while slightly lowering variable IFS and length IFS. 
This suggests that the Expression Extension method makes it more challenging for models to maintain stability when handling varied expressions, which further supports its effectiveness.</p> <div class="ltx_pagination ltx_role_newpage"></div> </div> <div class="ltx_pagination ltx_role_newpage"></div> </section> <section class="ltx_appendix" id="A3"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix C </span>Examples in LIFBench</h2> <figure class="ltx_table" id="A3.T5"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 5: </span> Prompts under all scenarios and tasks in LIFBench. Bold font denotes the scenario description <math alttext="D" class="ltx_Math" display="inline" id="A3.T5.1.m1.1"><semantics id="A3.T5.1.m1.1b"><mi id="A3.T5.1.m1.1.1" xref="A3.T5.1.m1.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="A3.T5.1.m1.1c"><ci id="A3.T5.1.m1.1.1.cmml" xref="A3.T5.1.m1.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.T5.1.m1.1d">D</annotation><annotation encoding="application/x-llamapun" id="A3.T5.1.m1.1e">italic_D</annotation></semantics></math>; normal font denotes the context <math alttext="X" class="ltx_Math" display="inline" id="A3.T5.2.m2.1"><semantics id="A3.T5.2.m2.1b"><mi id="A3.T5.2.m2.1.1" xref="A3.T5.2.m2.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="A3.T5.2.m2.1c"><ci id="A3.T5.2.m2.1.1.cmml" xref="A3.T5.2.m2.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.T5.2.m2.1d">X</annotation><annotation encoding="application/x-llamapun" id="A3.T5.2.m2.1e">italic_X</annotation></semantics></math>; italics denote the instruction <math alttext="I" class="ltx_Math" display="inline" id="A3.T5.3.m3.1"><semantics id="A3.T5.3.m3.1b"><mi id="A3.T5.3.m3.1.1" xref="A3.T5.3.m3.1.1.cmml">I</mi><annotation-xml encoding="MathML-Content" id="A3.T5.3.m3.1c"><ci id="A3.T5.3.m3.1.1.cmml" xref="A3.T5.3.m3.1.1">𝐼</ci></annotation-xml><annotation 
encoding="application/x-tex" id="A3.T5.3.m3.1d">I</annotation><annotation encoding="application/x-llamapun" id="A3.T5.3.m3.1e">italic_I</annotation></semantics></math>, where the red parts indicate the instruction variables <math alttext="v" class="ltx_Math" display="inline" id="A3.T5.4.m4.1"><semantics id="A3.T5.4.m4.1b"><mi id="A3.T5.4.m4.1.1" xref="A3.T5.4.m4.1.1.cmml">v</mi><annotation-xml encoding="MathML-Content" id="A3.T5.4.m4.1c"><ci id="A3.T5.4.m4.1.1.cmml" xref="A3.T5.4.m4.1.1">𝑣</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.T5.4.m4.1d">v</annotation><annotation encoding="application/x-llamapun" id="A3.T5.4.m4.1e">italic_v</annotation></semantics></math>, and the remaining black parts correspond to the instruction template <math alttext="t" class="ltx_Math" display="inline" id="A3.T5.5.m5.1"><semantics id="A3.T5.5.m5.1b"><mi id="A3.T5.5.m5.1.1" xref="A3.T5.5.m5.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="A3.T5.5.m5.1c"><ci id="A3.T5.5.m5.1.1.cmml" xref="A3.T5.5.m5.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.T5.5.m5.1d">t</annotation><annotation encoding="application/x-llamapun" id="A3.T5.5.m5.1e">italic_t</annotation></semantics></math> </figcaption> <table class="ltx_tabular" id="A3.T5.11"> <thead class="ltx_thead"> <tr class="ltx_tr" id="A3.T5.11.1.1"> <th class="ltx_td ltx_align_justify ltx_align_top ltx_th ltx_th_column ltx_border_t" id="A3.T5.11.1.1.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.1.1.1.1"> <span class="ltx_p" id="A3.T5.11.1.1.1.1.1" style="width:85.4pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.1.1.1.1.1.1">Task Name</span></span> </span> </th> <th class="ltx_td ltx_align_justify ltx_align_top ltx_th ltx_th_column ltx_border_t" id="A3.T5.11.1.1.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.1.1.2.1"> <span class="ltx_p" 
id="A3.T5.11.1.1.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.1.1.2.1.1.1">Test Example</span></span> </span> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A3.T5.11.2.1"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.2.1.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.2.1.1.1"> <span class="ltx_p" id="A3.T5.11.2.1.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.2.1.1.1.1.1">List</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.2.1.1.1.1.2">Single-ID</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.2.1.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.2.1.2.1"> <span class="ltx_p" id="A3.T5.11.2.1.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.2.1.2.1.1.1">You’re a searcher. You need to output the corresponding list elements based on the instructions and the list below. Please follow the instructions directly without anything else.</span> <br class="ltx_break"/>List to be retrieved: <br class="ltx_break"/>1. 4f63efbe7f5111ef8b42581122bf941e <br class="ltx_break"/>2.Summarize the Rome Statute of the International Criminal Court. <br class="ltx_break"/>… <br class="ltx_break"/>153. 4f7ea64c7f5111ef8b42581122bf941e <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.2.1.2.1.1.2">Instruction: Retrieve the entry at position <span class="ltx_text" id="A3.T5.11.2.1.2.1.1.2.1" style="color:#FF0000;">8th</span> in the list. 
Display it immediately.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.3.2"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.3.2.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.3.2.1.1"> <span class="ltx_p" id="A3.T5.11.3.2.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.3.2.1.1.1.1">List</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.3.2.1.1.1.2">Multi-ID</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.3.2.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.3.2.2.1"> <span class="ltx_p" id="A3.T5.11.3.2.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.3.2.2.1.1.1">You’re a searcher. You need to output the corresponding list elements based on the instructions and the list below. Please follow the instructions directly without anything else.</span> <br class="ltx_break"/>List to be retrieved: <br class="ltx_break"/>1. 4f63efbe7f5111ef8b42581122bf941e <br class="ltx_break"/>2.Summarize the Rome Statute of the International Criminal Court. <br class="ltx_break"/>… <br class="ltx_break"/>153. 
4f7ea64c7f5111ef8b42581122bf941e <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.3.2.2.1.1.2">Instruction: Identify the items at the corresponding places in list <span class="ltx_text" id="A3.T5.11.3.2.2.1.1.2.1" style="color:#FF0000;">[1, 5, 7]</span> and deliver the result in JSON list form.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.4.3"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.4.3.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.4.3.1.1"> <span class="ltx_p" id="A3.T5.11.4.3.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.4.3.1.1.1.1">List</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.4.3.1.1.1.2">Offset-ID</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.4.3.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.4.3.2.1"> <span class="ltx_p" id="A3.T5.11.4.3.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.4.3.2.1.1.1">You’re a searcher. You need to output the corresponding list elements based on the instructions and the list below. Please follow the instructions directly without anything else.</span> <br class="ltx_break"/>List to be retrieved: <br class="ltx_break"/>1. 4f63efbe7f5111ef8b42581122bf941e <br class="ltx_break"/>2.Summarize the Rome Statute of the International Criminal Court. <br class="ltx_break"/>… <br class="ltx_break"/>153. 
4f7ea64c7f5111ef8b42581122bf941e <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.4.3.2.1.1.2">Instruction: Please identify <span class="ltx_text" id="A3.T5.11.4.3.2.1.1.2.1" style="color:#FF0000;">the next</span> item to <span class="ltx_text" id="A3.T5.11.4.3.2.1.1.2.2" style="color:#FF0000;">8th</span> in the list described above.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.5.4"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.5.4.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.5.4.1.1"> <span class="ltx_p" id="A3.T5.11.5.4.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.5.4.1.1.1.1">List</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.5.4.1.1.1.2">Offset-Element</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.5.4.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.5.4.2.1"> <span class="ltx_p" id="A3.T5.11.5.4.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.5.4.2.1.1.1">You’re a searcher. You need to output the corresponding list elements based on the instructions and the list below. Please follow the instructions directly without anything else.</span> <br class="ltx_break"/>List to be retrieved: <br class="ltx_break"/>1. 4f63efbe7f5111ef8b42581122bf941e <br class="ltx_break"/>2.Summarize the Rome Statute of the International Criminal Court. <br class="ltx_break"/>… <br class="ltx_break"/>153. 
4f7ea64c7f5111ef8b42581122bf941e <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.5.4.2.1.1.2">Instruction: "Considering the arrangement of the list, what is <span class="ltx_text" id="A3.T5.11.5.4.2.1.1.2.1" style="color:#FF0000;">the next</span> element after <span class="ltx_text" id="A3.T5.11.5.4.2.1.1.2.2" style="color:#FF0000;">"4f78d6f47f5111ef8b42581122bf941e"</span>?</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.6.5"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.6.5.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.6.5.1.1"> <span class="ltx_p" id="A3.T5.11.6.5.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.6.5.1.1.1.1">List</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.6.5.1.1.1.2">Blur-ID</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.6.5.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.6.5.2.1"> <span class="ltx_p" id="A3.T5.11.6.5.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.6.5.2.1.1.1">You’re a searcher. You need to output the corresponding list elements based on the instructions and the list below. Please follow the instructions directly without anything else.</span> <br class="ltx_break"/>List to be retrieved: <br class="ltx_break"/>1. 4f63efbe7f5111ef8b42581122bf941e <br class="ltx_break"/>2.Summarize the Rome Statute of the International Criminal Court. <br class="ltx_break"/>… <br class="ltx_break"/>153. 
4f7ea64c7f5111ef8b42581122bf941e <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.6.5.2.1.1.2">Instruction:Randomly select an item from the list above, <span class="ltx_text" id="A3.T5.11.6.5.2.1.1.2.1" style="color:#FF0000;">after</span> the <span class="ltx_text" id="A3.T5.11.6.5.2.1.1.2.2" style="color:#FF0000;">8th</span> element.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.7.6"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.7.6.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.7.6.1.1"> <span class="ltx_p" id="A3.T5.11.7.6.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.7.6.1.1.1.1">List</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.7.6.1.1.1.2">Blur-Element</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.7.6.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.7.6.2.1"> <span class="ltx_p" id="A3.T5.11.7.6.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.7.6.2.1.1.1">You’re a searcher. You need to output the corresponding list elements based on the instructions and the list below. Please follow the instructions directly without anything else.</span> <br class="ltx_break"/>List to be retrieved: <br class="ltx_break"/>1. 4f63efbe7f5111ef8b42581122bf941e <br class="ltx_break"/>2.Summarize the Rome Statute of the International Criminal Court. <br class="ltx_break"/>… <br class="ltx_break"/>153. 4f7ea64c7f5111ef8b42581122bf941e <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.7.6.2.1.1.2">Instruction: Retrieve the entry at position <span class="ltx_text" id="A3.T5.11.7.6.2.1.1.2.1" style="color:#FF0000;">8th</span> in the list. 
Display it immediately.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.8.7"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.8.7.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.8.7.1.1"> <span class="ltx_p" id="A3.T5.11.8.7.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.8.7.1.1.1.1">MultiDoc</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.8.7.1.1.1.2">Find-dup-doc</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.8.7.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.8.7.2.1"> <span class="ltx_p" id="A3.T5.11.8.7.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.8.7.2.1.1.1">You are a document manager. Here is a collection of documents. Each document includes information such as title, date, source, id, iD2 and specific article content (text). You need to read the documents and follow the instructions to give some information directly, without something else. Also note: <br class="ltx_break"/>1. Some documents may be missing information such as title or source, which may affect the final output. <br class="ltx_break"/>2. Some article conent (i.e. 
values corresponding to the text keyword) may be duplicated.</span> <br class="ltx_break"/>Documents: <br class="ltx_break"/>******************** doc-1 ******************** <br class="ltx_break"/>text: Following discussions with the PSNI, … <br class="ltx_break"/>source: meeting <br class="ltx_break"/>iD2: bec60ab1-35bc-417d-b378-0b2e86dfcd23 <br class="ltx_break"/>id: xWTp7xEyQrqGmGj5p6CdVQ <br class="ltx_break"/>title: 39272756 <br class="ltx_break"/>date: 2016-06-20 <br class="ltx_break"/>******************** doc-2 ******************** <br class="ltx_break"/>id: SOl-GT2FSyKFYJzBhkYzxw <br class="ltx_break"/>text: What did the participants think about using CD’s for backup? … <br class="ltx_break"/>source: news <br class="ltx_break"/>date: 2001-01-15 <br class="ltx_break"/>iD2: 12a0f93d-0f57-46bc-bf28-9ec856fc974a <br class="ltx_break"/>title: tr-sq-167 <br class="ltx_break"/>******************** doc-3 ******************** <br class="ltx_break"/>…. <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.8.7.2.1.1.2">Instruction: Within the supplied documents, certain documents contain replicated content in their ’text’ field, although other fields (such as id, iD2, date, title, source) may be different. Additionally, there could be N sets of documents, each set comprising any number of replicates. Please identify these replicated documents and present <span class="ltx_text" id="A3.T5.11.8.7.2.1.1.2.1" style="color:#FF0000;">iD2</span> in sequence. The output should have N lines, each line symbolizing a set of replicated documents. Format the output for each document set as depicted in the example. Avoid providing explanations. If a document lacks information in a specific field, use ’None’ instead. 
<br class="ltx_break"/>output example: <br class="ltx_break"/><span class="ltx_text" id="A3.T5.11.8.7.2.1.1.2.2" style="color:#FF0000;">[["iD2_1"], ["iD2_2"], ["iD2_3"]] <br class="ltx_break"/>[["iD2_4"], ["iD2_5"]]</span></span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.9.8"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.9.8.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.9.8.1.1"> <span class="ltx_p" id="A3.T5.11.9.8.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.9.8.1.1.1.1">MultiDoc</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.9.8.1.1.1.2">Batch-label</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.9.8.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.9.8.2.1"> <span class="ltx_p" id="A3.T5.11.9.8.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.9.8.2.1.1.1">You are a document manager. Here is a collection of documents. Each document includes information such as title, date, source, id, iD2 and specific article content (text). You need to read the documents and follow the instructions to give some information directly, without something else. Also note: <br class="ltx_break"/>1. Some documents may be missing information such as title or source, which may affect the final output. <br class="ltx_break"/>2. Some article conent (i.e. 
values corresponding to the text keyword) may be duplicated.</span> <br class="ltx_break"/>Documents: <br class="ltx_break"/>******************** doc-1 ******************** <br class="ltx_break"/>text: Following discussions with the PSNI, … <br class="ltx_break"/>source: meeting <br class="ltx_break"/>iD2: bec60ab1-35bc-417d-b378-0b2e86dfcd23 <br class="ltx_break"/>id: xWTp7xEyQrqGmGj5p6CdVQ <br class="ltx_break"/>title: 39272756 <br class="ltx_break"/>date: 2016-06-20 <br class="ltx_break"/>******************** doc-2 ******************** <br class="ltx_break"/>id: SOl-GT2FSyKFYJzBhkYzxw <br class="ltx_break"/>text: What did the participants think about using CD’s for backup? … <br class="ltx_break"/>source: news <br class="ltx_break"/>date: 2001-01-15 <br class="ltx_break"/>iD2: 12a0f93d-0f57-46bc-bf28-9ec856fc974a <br class="ltx_break"/>title: tr-sq-167 <br class="ltx_break"/>******************** doc-3 ******************** <br class="ltx_break"/>…. <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.9.8.2.1.1.2">Instruction: Assign labels to documents in order using the provided list of [’11311’, ’22422’, ’33233’, ’44444’]. The labeling rules to follow are: <br class="ltx_break"/><span class="ltx_text" id="A3.T5.11.9.8.2.1.1.2.1" style="color:#FF0000;"> 1. If the document contains both title and source information, mark it as "11311". <br class="ltx_break"/>2. If the document is missing the source information but not the title, mark it as "22422". <br class="ltx_break"/>3. If the document is missing title information but not source, mark it as "33233". <br class="ltx_break"/>4. 
If the document is missing both title and source information, mark it as "44444".</span> <br class="ltx_break"/>The tags should be output in JSON dictionary format, for example: </span>{<span class="ltx_text ltx_font_italic" id="A3.T5.11.9.8.2.1.1.3">"doc1":"11311","doc2":"22422"</span>}</span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.10.9"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.10.9.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.10.9.1.1"> <span class="ltx_p" id="A3.T5.11.10.9.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.10.9.1.1.1.1">OneDoc</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.10.9.1.1.1.2">Repeat</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.10.9.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.10.9.2.1"> <span class="ltx_p" id="A3.T5.11.10.9.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.10.9.2.1.1.1">There are several different types of KEY SENTENCE in the input text, which are marked by special tags. These special tags a total of six kinds, respectively is <#Topic#>, <@argument@>, "<!Transition!>", "<|Summary|>", "<*Evidence*>", "<-Concession->". Different tags represent different types of key sentence. If a sentence in the text is KEY SENTENCE, we will add a special tag with the same attribute to the beginning and end of the sentence. The head tag also contains id order information in the format <type-id>. For example, the head tag with type ’#Topic#’ and id 1 is <#Topic#-1>. Also note that when the head tag and tail tag attributes are inconsistent, this means that the sentence is a fake KEY SENTENCE. 
Please read the input text carefully and give the answer directly according to the instruction requirements.</span> <br class="ltx_break"/>Input text: Remember the essays you had to write in high school? Topic sentence, introductory paragraph, supporting paragraphs, conclusion. <#Topic#-2>With the result that writing is made to seem boring and pointless.<#Topic#> Who cares about symbolism in Dickens? … <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.10.9.2.1.1.2">Instruction: Provide <span class="ltx_text" id="A3.T5.11.10.9.2.1.1.2.1" style="color:#FF0000;">4</span> KEY SENTENCE and their categories directly. Display the output on <span class="ltx_text" id="A3.T5.11.10.9.2.1.1.2.2" style="color:#FF0000;">4</span> individual lines with each line containing a KEY SENTENCE and its category, separated by <span class="ltx_text" id="A3.T5.11.10.9.2.1.1.2.3" style="color:#FF0000;">|||||</span>. <br class="ltx_break"/>Output example: <br class="ltx_break"/>[KEY_SENTENCE_1] <span class="ltx_text" id="A3.T5.11.10.9.2.1.1.2.4" style="color:#FF0000;">|||||</span> #Topic# <br class="ltx_break"/>[KEY_SENTENCE_2] <span class="ltx_text" id="A3.T5.11.10.9.2.1.1.2.5" style="color:#FF0000;">|||||</span> *Evidence*</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.11.10"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.11.10.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.11.10.1.1"> <span class="ltx_p" id="A3.T5.11.11.10.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.11.10.1.1.1.1">OneDoc</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.11.10.1.1.1.2">Extract</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A3.T5.11.11.10.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" 
id="A3.T5.11.11.10.2.1"> <span class="ltx_p" id="A3.T5.11.11.10.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.11.10.2.1.1.1">There are several different types of KEY SENTENCE in the input text, which are marked by special tags. These special tags a total of six kinds, respectively is <#Topic#>, <@argument@>, "<!Transition!>", "<|Summary|>", "<*Evidence*>", "<-Concession->". Different tags represent different types of key sentence. If a sentence in the text is KEY SENTENCE, we will add a special tag with the same attribute to the beginning and end of the sentence. The head tag also contains id order information in the format <type-id>. For example, the head tag with type ’#Topic#’ and id 1 is <#Topic#-1>. Also note that when the head tag and tail tag attributes are inconsistent, this means that the sentence is a fake KEY SENTENCE. Please read the input text carefully and give the answer directly according to the instruction requirements.</span> <br class="ltx_break"/>Input text: Remember the essays you had to write in high school? Topic sentence, introductory paragraph, supporting paragraphs, conclusion. <#Topic#-2>With the result that writing is made to seem boring and pointless.<#Topic#> Who cares about symbolism in Dickens? … <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.11.10.2.1.1.2">Instruction: Gather every instance of KEY SENTENCE classified as <span class="ltx_text" id="A3.T5.11.11.10.2.1.1.2.1" style="color:#FF0000;">#Topic#</span>. The output should be a Json list arranged by ids. If none are found, provide an empty array. 
<br class="ltx_break"/>Output Example 1: [KEY SENTENCE1, KEY SENTENCE2, KEY SENTENCE3,…] <br class="ltx_break"/>Output Example 2: []</span></span> </span> </td> </tr> <tr class="ltx_tr" id="A3.T5.11.12.11"> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_b ltx_border_t" id="A3.T5.11.12.11.1" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.12.11.1.1"> <span class="ltx_p" id="A3.T5.11.12.11.1.1.1" style="width:85.4pt;">Scenario: <span class="ltx_text ltx_font_italic" id="A3.T5.11.12.11.1.1.1.1">OneDoc</span> <br class="ltx_break"/>Task: <span class="ltx_text ltx_font_italic" id="A3.T5.11.12.11.1.1.1.2">QA</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_b ltx_border_t" id="A3.T5.11.12.11.2" style="padding-top:2.5pt;padding-bottom:2.5pt;"> <span class="ltx_inline-block ltx_align_top" id="A3.T5.11.12.11.2.1"> <span class="ltx_p" id="A3.T5.11.12.11.2.1.1" style="width:327.2pt;"><span class="ltx_text ltx_font_bold" id="A3.T5.11.12.11.2.1.1.1">There are several different types of KEY SENTENCE in the input text, which are marked by special tags. These special tags a total of six kinds, respectively is <#Topic#>, <@argument@>, "<!Transition!>", "<|Summary|>", "<*Evidence*>", "<-Concession->". Different tags represent different types of key sentence. If a sentence in the text is KEY SENTENCE, we will add a special tag with the same attribute to the beginning and end of the sentence. The head tag also contains id order information in the format <type-id>. For example, the head tag with type ’#Topic#’ and id 1 is <#Topic#-1>. Also note that when the head tag and tail tag attributes are inconsistent, this means that the sentence is a fake KEY SENTENCE. Please read the input text carefully and give the answer directly according to the instruction requirements.</span> <br class="ltx_break"/>Input text: Remember the essays you had to write in high school? 
Topic sentence, introductory paragraph, supporting paragraphs, conclusion. <#Topic#-2>With the result that writing is made to seem boring and pointless.<#Topic#> Who cares about symbolism in Dickens? … <br class="ltx_break"/><span class="ltx_text ltx_font_italic" id="A3.T5.11.12.11.2.1.1.2">Instruction: Is <span class="ltx_text" id="A3.T5.11.12.11.2.1.1.2.1" style="color:#FF0000;">"With the result that writing is made to seem boring and pointless."</span> the KEY SENTENCE of the text? Indicate <span class="ltx_text" id="A3.T5.11.12.11.2.1.1.2.2" style="color:#FF0000;">"True"</span> for yes, <span class="ltx_text" id="A3.T5.11.12.11.2.1.1.2.3" style="color:#FF0000;">"False"</span> for no.</span></span> </span> </td> </tr> </tbody> </table> </figure> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Mon Nov 11 14:40:48 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span></a> </div></footer> </div> </body> </html>