aria-label="pagination"> <a href="" class="pagination-previous is-invisible">Previous </a> <a href="/search/?searchtype=author&amp;query=Cohan%2C+A&amp;start=50" class="pagination-next" >Next </a> <ul class="pagination-list"> <li> <a href="/search/?searchtype=author&amp;query=Cohan%2C+A&amp;start=0" class="pagination-link is-current" aria-label="Page 1" aria-current="page">1 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Cohan%2C+A&amp;start=50" class="pagination-link " aria-label="Goto page 2">2 </a> </li> </ul> </nav> <ol class="breathe-horizontal" start="1"> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.16736</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Chemical Physics">physics.chem-ph</span> </div> </div> <p class="title is-5 mathjax"> ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+H">Haochen Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Z">Ziran Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+X">Xiao Han</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+X">Xuanzhi Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Y">Yueqing Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+S">Senhao Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=Jin%2C+D">Di Jin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a 
href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.16736v1-abstract-short" style="display: inline;"> The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmar&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16736v1-abstract-full').style.display = 'inline'; document.getElementById('2411.16736v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.16736v1-abstract-full" style="display: none;"> The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasingly deeper chemical knowledge. Our dataset has more than 30K samples across various chemical materials. We incorporate handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. 
Our automated evaluation framework thoroughly assesses the safety, accuracy, and appropriateness of LLM responses. Extensive experiments with state-of-the-art LLMs reveal notable strengths and critical vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. Our code and dataset are available at Warning: this paper contains discussions on the synthesis of controlled chemicals using AI models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16736v1-abstract-full').style.display = 'none'; document.getElementById('2411.16736v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05764</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Long%2C+Y">Yitao Long</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+Y">Yuru Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+C">Chengye Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+W">Weiyuan Chen</a>, 
<a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Hongjun Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yiming Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+C">Chen Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05764v1-abstract-short" style="display: inline;"> We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs in the context of understanding and analyzing long, hybrid-content financial documents. FinDVer contains 2,400 expert-annotated examples, divided into three subsets: information extraction, numerical reasoning, and knowledge-intensive reasoning, each addressing&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05764v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05764v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05764v1-abstract-full" style="display: none;"> We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs in the context of understanding and analyzing long, hybrid-content financial documents. FinDVer contains 2,400 expert-annotated examples, divided into three subsets: information extraction, numerical reasoning, and knowledge-intensive reasoning, each addressing common scenarios encountered in real-world financial contexts. We assess a broad spectrum of LLMs under long-context and RAG settings. 
Our results show that even the current best-performing system, GPT-4o, still lags behind human experts. We further provide an in-depth analysis of the long-context and RAG settings, Chain-of-Thought reasoning, and model reasoning errors, offering insights to drive future advancements. We believe that FinDVer can serve as a valuable benchmark for evaluating LLMs in claim verification over complex, expert-domain documents. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05764v1-abstract-full').style.display = 'none'; document.getElementById('2411.05764v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EMNLP 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05338</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Singh%2C+S">Shruti Singh</a>, <a href="/search/cs?searchtype=author&amp;query=Sarkar%2C+N">Nandan Sarkar</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2411.05338v1-abstract-short" style="display: inline;"> Scientific literature is typically dense, requiring significant background knowledge and deep comprehension for effective engagement. We introduce SciDQA, a new dataset for reading comprehension that challenges LLMs for a deep understanding of scientific articles, consisting of 2,937 QA pairs. Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and ans&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05338v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05338v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05338v1-abstract-full" style="display: none;"> Scientific literature is typically dense, requiring significant background knowledge and deep comprehension for effective engagement. We introduce SciDQA, a new dataset for reading comprehension that challenges LLMs for a deep understanding of scientific articles, consisting of 2,937 QA pairs. Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers by paper authors, ensuring a thorough examination of the literature. We enhance the dataset&#39;s quality through a process that carefully filters out lower quality questions, decontextualizes the content, tracks the source document across different versions, and incorporates a bibliography for multi-document question-answering. Questions in SciDQA necessitate reasoning across figures, tables, equations, appendices, and supplementary materials, and require multi-document reasoning. We evaluate several open-source and proprietary LLMs across various configurations to explore their capabilities in generating relevant and factual responses. 
Our comprehensive evaluation, based on metrics for surface-level similarity and LLM judgements, highlights notable performance discrepancies. SciDQA represents a rigorously curated, naturally derived scientific QA dataset, designed to facilitate research on complex scientific text understanding. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05338v1-abstract-full').style.display = 'none'; document.getElementById('2411.05338v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">18 pages, Accepted to EMNLP 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.04424</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Bayesian Calibration of Win Rate Estimation with LLM Evaluators </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Gao%2C+Y">Yicheng Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+G">Gonghan Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhe Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: 
<span class="abstract-short has-text-grey-dark mathjax" id="2411.04424v1-abstract-short" style="display: inline;"> Recent advances in large language models (LLMs) show the potential of using LLMs as evaluators for assessing the quality of text generations from LLMs. However, applying LLM evaluators naively to compare or judge between different systems can lead to unreliable results due to the intrinsic win rate estimation bias of LLM evaluators. In order to mitigate this problem, we propose two calibration met&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04424v1-abstract-full').style.display = 'inline'; document.getElementById('2411.04424v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.04424v1-abstract-full" style="display: none;"> Recent advances in large language models (LLMs) show the potential of using LLMs as evaluators for assessing the quality of text generations from LLMs. However, applying LLM evaluators naively to compare or judge between different systems can lead to unreliable results due to the intrinsic win rate estimation bias of LLM evaluators. In order to mitigate this problem, we propose two calibration methods, Bayesian Win Rate Sampling (BWRS) and Bayesian Dawid-Skene, both of which leverage Bayesian inference to more accurately infer the true win rate of generative language models. We empirically validate our methods on six datasets covering story generation, summarization, and instruction following tasks. We show that both our methods are effective in improving the accuracy of win rate estimation using LLMs as evaluators, offering a promising direction for reliable automatic text quality evaluation. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04424v1-abstract-full').style.display = 'none'; document.getElementById('2411.04424v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by EMNLP 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.04075</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chuhan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Shangguan%2C+Z">Ziyao Shangguan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+D">Deyuan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.04075v1-abstract-short" style="display: inline;"> 
Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchma&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04075v1-abstract-full').style.display = 'inline'; document.getElementById('2411.04075v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.04075v1-abstract-full" style="display: none;"> Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04075v1-abstract-full').style.display = 'none'; document.getElementById('2411.04075v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.23463</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> MDCure: A Scalable Pipeline for Multi-Document Instruction-Following </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+G+K">Gabrielle Kaili-May Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+B">Bowen Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Caciularu%2C+A">Avi Caciularu</a>, <a href="/search/cs?searchtype=author&amp;query=Szpektor%2C+I">Idan Szpektor</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.23463v2-abstract-short" style="display: inline;"> Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. 
While LLMs have improved at processing long inputs, MD contexts still present challenges, such as managing inter-document dependencies, redundancy, and incoherent structures. We introduce MDCure, a scalable and effective fine-tuning pipeline to&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23463v2-abstract-full').style.display = 'inline'; document.getElementById('2410.23463v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.23463v2-abstract-full" style="display: none;"> Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present challenges, such as managing inter-document dependencies, redundancy, and incoherent structures. We introduce MDCure, a scalable and effective fine-tuning pipeline to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human annotated data. MDCure is based on generation of high-quality synthetic MD instruction data from sets of related articles via targeted prompts. We further introduce MDCureRM, a multi-objective reward model which filters generated data based on their training utility for MD settings. With MDCure, we fine-tune a variety of LLMs, from the FlanT5, Qwen2, and LLAMA3.1 model families, up to 70B parameters in size. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks show MDCure consistently improves performance over pre-trained baselines and over corresponding base models by up to 75.5%. 
Our code, datasets, and models are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23463v2-abstract-full').style.display = 'none'; document.getElementById('2410.23463v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.23266</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Shangguan%2C+Z">Ziyao Shangguan</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chuhan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Ding%2C+Y">Yuxuan Ding</a>, <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+Y">Yanan Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Fitzgerald%2C+T">Tesca Fitzgerald</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span 
class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.23266v1-abstract-short" style="display: inline;"> Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23266v1-abstract-full').style.display = 'inline'; document.getElementById('2410.23266v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.23266v1-abstract-full" style="display: none;"> Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single, few, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs&#39; temporal reasoning capabilities in video understanding. 
TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape &amp; trend, velocity &amp; frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23266v1-abstract-full').style.display = 'none'; document.getElementById('2410.23266v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.23223</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Science and Game Theory">cs.GT</span> </div> </div> <p class="title is-5 mathjax"> COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Oikonomou%2C+A">Argyris Oikonomou</a>, <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+W">Weiqiang Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+Y">Yang Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.23223v1-abstract-short" style="display: inline;"> Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is insufficient to capture the full range of general human preferences. 
To achieve robust alignment with general preferences, we model the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees a 50% win rate against any comp&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23223v1-abstract-full').style.display = 'inline'; document.getElementById('2410.23223v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.23223v1-abstract-full" style="display: none;"> Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is insufficient to capture the full range of general human preferences. To achieve robust alignment with general preferences, we model the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous algorithms for finding the Nash policy either diverge or converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. Theoretically, we prove that our meta-algorithm converges to an exact Nash policy in the last iterate. Additionally, our meta-algorithm is simple and can be integrated with many existing methods designed for RLHF and preference optimization with minimal changes. Experimental results demonstrate the effectiveness of the proposed framework when combined with existing preference policy optimization methods. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23223v1-abstract-full').style.display = 'none'; document.getElementById('2410.23223v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.09207</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Han%2C+S">Simeng Han</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+A">Aaron Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+R">Rui Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Qi%2C+Z">Zhenting Qi</a>, <a href="/search/cs?searchtype=author&amp;query=Riddell%2C+M">Martin Riddell</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+W">Wenfei Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Qiao%2C+Y">Yujie Qiao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Yavuz%2C+S">Semih Yavuz</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Ye Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Joty%2C+S">Shafiq Joty</a>, <a 
href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yingbo Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+C">Caiming Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Radev%2C+D">Dragomir Radev</a>, <a href="/search/cs?searchtype=author&amp;query=Ying%2C+R">Rex Ying</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.09207v1-abstract-short" style="display: inline;"> Existing methods on understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for proper investigation of model&#39;s capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by human&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.09207v1-abstract-full').style.display = 'inline'; document.getElementById('2410.09207v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.09207v1-abstract-full" style="display: none;"> Existing methods on understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for proper investigation of model&#39;s capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by humans. P-FOLIO is collected with an annotation protocol that facilitates humans to annotate well-structured natural language proofs for first-order logic reasoning problems in a step-by-step manner. 
The number of reasoning steps in P-FOLIO ranges from 0 to 20. We further use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities. We evaluate LLM reasoning capabilities at a fine granularity via single-step inference rule classification, covering a more diverse set of inference rules, at higher levels of complexity, than previous work. Given that a single model-generated reasoning chain could take a completely different path than the human-annotated one, we sample multiple reasoning chains from a model and use pass@k metrics for evaluating the quality of model-generated reasoning chains. We show that human-written reasoning chains significantly boost the logical reasoning capabilities of LLMs via many-shot prompting and fine-tuning. Furthermore, fine-tuning Llama3-7B on P-FOLIO improves the model performance by 10% or more on three other out-of-domain logical reasoning datasets. We also conduct detailed analysis to show where the most powerful LLMs fall short in reasoning. We will release the dataset and code publicly. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.09207v1-abstract-full').style.display = 'none'; document.getElementById('2410.09207v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.07069</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> ReIFE: Re-evaluating Instruction-Following Evaluation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+K">Kejian Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Fabbri%2C+A+R">Alexander R. Fabbri</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+P">Peifeng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+C">Chien-Sheng Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Joty%2C+S">Shafiq Joty</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.07069v1-abstract-short" style="display: inline;"> The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. 
Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently prop&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.07069v1-abstract-full').style.display = 'inline'; document.getElementById('2410.07069v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.07069v1-abstract-full" style="display: none;"> The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently proposed evaluation protocols, on 4 human-annotated datasets, assessing the evaluation accuracy of the LLM-evaluators. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness. Moreover, our large-scale evaluation reveals: (1) Base LLM performance ranking remains largely consistent across evaluation protocols, with less capable LLMs showing greater improvement from protocol enhancements; (2) Robust evaluation of evaluation protocols requires many base LLMs with varying capability levels, as protocol effectiveness can depend on the base LLM used; (3) Evaluation results on different datasets are not always consistent, so a rigorous evaluation requires multiple datasets with distinctive features. We release our meta-evaluation suite ReIFE, which provides the codebase and evaluation result collection for more than 500 LLM-evaluator configurations, to support future research in instruction-following evaluation. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.07069v1-abstract-full').style.display = 'none'; document.getElementById('2410.07069v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">GitHub Repo:, Evaluation Result Collection:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.19381</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> INC-Math: Integrating Natural Language and Code for Enhanced Mathematical Reasoning in Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+X">Xuyuan Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+S">Simeng Han</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Z">Ziyue Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.19381v3-abstract-short" style="display: inline;"> Large Language Models (LLMs) are commonly used to generate solutions for mathematical reasoning problems in the following formats: natural language, code, or a combination of both. 
In this paper, we explore fundamental questions related to solving mathematical reasoning problems using natural language and code with state-of-the-art LLMs, including GPT-4o-mini and LLama-3.1-8b-Turbo. Our findings s&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.19381v3-abstract-full').style.display = 'inline'; document.getElementById('2409.19381v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.19381v3-abstract-full" style="display: none;"> Large Language Models (LLMs) are commonly used to generate solutions for mathematical reasoning problems in the following formats: natural language, code, or a combination of both. In this paper, we explore fundamental questions related to solving mathematical reasoning problems using natural language and code with state-of-the-art LLMs, including GPT-4o-mini and LLama-3.1-8b-Turbo. Our findings show that LLMs are better at reasoning in natural language compared to code. Additionally, although natural language and code serve as complementary forms of reasoning, they can affect each other in a negative way in certain scenarios. These insights motivate our development of a new prompting method, INC-Math, which leverages an LLM to dynamically select the most appropriate reasoning form, resulting in improved performance over comparable baselines with GPT-4o-mini. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.19381v3-abstract-full').style.display = 'none'; document.getElementById('2409.19381v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 28 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.02685</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lee%2C+H">Hyunji Lee</a>, <a href="/search/cs?searchtype=author&amp;query=Soldaini%2C+L">Luca Soldaini</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Seo%2C+M">Minjoon Seo</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.02685v1-abstract-short" style="display: inline;"> Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. 
While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the top&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02685v1-abstract-full').style.display = 'inline'; document.getElementById('2409.02685v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.02685v1-abstract-full" style="display: none;"> Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the topic of combining multiple domain-specific expert retrievers remains unexplored, despite its popularity in language model generation. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts along with a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. 
To our knowledge, RouterRetriever is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models with effective routing over a single, general-purpose embedding model in retrieval tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02685v1-abstract-full').style.display = 'none'; document.getElementById('2409.02685v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.13709</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Understanding Reference Policies in Direct Preference Optimization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+P">Pengfei Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.13709v2-abstract-short" style="display: inline;"> Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). 
In this work, we explore an under-investigated aspect of DPO - its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO&#39;&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.13709v2-abstract-full').style.display = 'inline'; document.getElementById('2407.13709v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.13709v2-abstract-full" style="display: none;"> Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO&#39;s effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of the KL-constraint from the reference policies in DPO by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO&#39;s superiority in this controlled setting. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.13709v2-abstract-full').style.display = 'none'; document.getElementById('2407.13709v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 18 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">GitHub Repo:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.14644</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Deng%2C+C">Chunyuan Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Heng%2C+Y">Yuzhao Heng</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yitong Li</a>, <a href="/search/cs?searchtype=author&amp;query=Cao%2C+J">Jiannan Cao</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2406.14644v1-abstract-short" style="display: inline;"> Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitig&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14644v1-abstract-full').style.display = 'inline'; document.getElementById('2406.14644v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.14644v1-abstract-full" style="display: none;"> Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. 
This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14644v1-abstract-full').style.display = 'none'; document.getElementById('2406.14644v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ACL 2024 Camera-Ready Version</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.14275</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Step-Back Profiling: Distilling User History for Personalized Scientific Writing </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xingyao Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Shao%2C+Y">Yanjun Shao</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+J">Jie Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a 
href="/search/cs?searchtype=author&amp;query=Gong%2C+M">Ming Gong</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+D">Dongmei Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.14275v2-abstract-short" style="display: inline;"> Large language models (LLM) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce STEP-BACK PROFILING to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. To&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14275v2-abstract-full').style.display = 'inline'; document.getElementById('2406.14275v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.14275v2-abstract-full" style="display: none;"> Large language models (LLM) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce STEP-BACK PROFILING to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. To conduct the experiments, we construct a Personalized Scientific Writing (PSW) dataset to study multi-user personalization. PSW requires the models to write scientific papers given specialized author groups with diverse academic backgrounds. As for the results, we demonstrate the effectiveness of capturing user characteristics via STEP-BACK PROFILING for collaborative writing. 
Moreover, our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark (LaMP), including 7 personalization LLM tasks. Our ablation studies validate the contributions of different components in our method and provide insights into our task definition. Our dataset and code are available at \url{}. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14275v2-abstract-full').style.display = 'none'; document.getElementById('2406.14275v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 20 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.07835</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wadden%2C+D">David Wadden</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+K">Kejian Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Morrison%2C+J">Jacob Morrison</a>, <a href="/search/cs?searchtype=author&amp;query=Naik%2C+A">Aakanksha Naik</a>, <a href="/search/cs?searchtype=author&amp;query=Singh%2C+S">Shruti Singh</a>, <a 
href="/search/cs?searchtype=author&amp;query=Barzilay%2C+N">Nitzan Barzilay</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a>, <a href="/search/cs?searchtype=author&amp;query=Hope%2C+T">Tom Hope</a>, <a href="/search/cs?searchtype=author&amp;query=Soldaini%2C+L">Luca Soldaini</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+S+Z">Shannon Zejiang Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Downey%2C+D">Doug Downey</a>, <a href="/search/cs?searchtype=author&amp;query=Hajishirzi%2C+H">Hannaneh Hajishirzi</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.07835v3-abstract-short" style="display: inline;"> We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed t&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.07835v3-abstract-full').style.display = 'inline'; document.getElementById('2406.07835v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.07835v3-abstract-full" style="display: none;"> We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. 
SciRIFF demonstrations are notable for their long input contexts, detailed task specifications, and complex structured outputs. While instruction-following resources are available in specific domains such as clinical medicine and chemistry, SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields. To demonstrate the utility of SciRIFF, we develop a sample-efficient strategy to adapt a general instruction-following model for science by performing additional finetuning on a mix of general-domain and SciRIFF demonstrations. In evaluations on nine held-out scientific tasks, our model -- called SciTulu -- improves over a strong LLM baseline by 28.1% and 6.5% at the 7B and 70B scales respectively, while maintaining general instruction-following performance within 2% of the baseline. We are optimistic that SciRIFF will facilitate the development and evaluation of LLMs to help researchers navigate the ever-growing body of scientific literature. We release our dataset, model checkpoints, and data processing and evaluation code to enable further research. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.07835v3-abstract-full').style.display = 'none'; document.getElementById('2406.07835v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 10 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Submitted to NeurIPS Datasets and Benchmarks 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2404.14662</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Programming Languages">cs.PL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> </div> </div> <p class="title is-5 mathjax"> NExT: Teaching Large Language Models to Reason about Code Execution </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ni%2C+A">Ansong Ni</a>, <a href="/search/cs?searchtype=author&amp;query=Allamanis%2C+M">Miltiadis Allamanis</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Deng%2C+Y">Yinlin Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+K">Kensen Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Sutton%2C+C">Charles Sutton</a>, <a href="/search/cs?searchtype=author&amp;query=Yin%2C+P">Pengcheng Yin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2404.14662v1-abstract-short" style="display: inline;"> A fundamental skill among human developers is the ability to understand and reason about program execution. 
As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (a.k.a. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, thus may lack a semantic understanding of h&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.14662v1-abstract-full').style.display = 'inline'; document.getElementById('2404.14662v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2404.14662v1-abstract-full" style="display: none;"> A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (a.k.a. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, and thus may lack a semantic understanding of how programs execute at run-time. To address this issue, we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation. Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model by 26.1% and 14.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters. Our model can also generalize to scenarios where program traces are absent at test-time. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.14662v1-abstract-full').style.display = 'none'; document.getElementById('2404.14662v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 April, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">35 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2404.04285</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Deng%2C+C">Chunyuan Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Hanming Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Haoran Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+W">Wangchunshu Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2404.04285v1-abstract-short" style="display: inline;"> Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across a wide variety of tasks. However, without specific agent tuning, open-source models like LLaMA currently struggle to match the efficiency of GPT-4, particularly given the scarcity of agent-tuning datasets for fine-tuning. In response, we introduce Mimir&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.04285v1-abstract-full').style.display = 'inline'; document.getElementById('2404.04285v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2404.04285v1-abstract-full" style="display: none;"> Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across a wide variety of tasks. However, without specific agent tuning, open-source models like LLaMA currently struggle to match the efficiency of GPT-4, particularly given the scarcity of agent-tuning datasets for fine-tuning. In response, we introduce Mimir: a streamlined platform offering a customizable pipeline that enables users to leverage both private knowledge and publicly available, legally compliant datasets at scale for personalized agent tuning. Additionally, Mimir supports the generation of general instruction-tuning datasets from the same input. This dual capability ensures that language agents developed through the platform possess both specific agent abilities and general competencies. Mimir integrates these features into a cohesive end-to-end platform, facilitating everything from the uploading of personalized files to one-click agent fine-tuning. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.04285v1-abstract-full').style.display = 'none'; document.getElementById('2404.04285v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 April, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2404.03602</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Evaluating LLMs at Detecting Errors in LLM Responses </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kamoi%2C+R">Ryo Kamoi</a>, <a href="/search/cs?searchtype=author&amp;query=Das%2C+S+S+S">Sarkar Snigdha Sarathi Das</a>, <a href="/search/cs?searchtype=author&amp;query=Lou%2C+R">Renze Lou</a>, <a href="/search/cs?searchtype=author&amp;query=Ahn%2C+J+J">Jihyun Janice Ahn</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+X">Xiaoxin Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+N">Nan Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yusen Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+R+H">Ranran Haoran Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Vummanthala%2C+S+R">Sujeeth Reddy Vummanthala</a>, <a href="/search/cs?searchtype=author&amp;query=Dave%2C+S">Salika Dave</a>, <a href="/search/cs?searchtype=author&amp;query=Qin%2C+S">Shaobo Qin</a>, <a 
href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Yin%2C+W">Wenpeng Yin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+R">Rui Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2404.03602v2-abstract-short" style="display: inline;"> With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.03602v2-abstract-full').style.display = 'inline'; document.getElementById('2404.03602v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2404.03602v2-abstract-full" style="display: none;"> With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g., word sorting) or limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. 
ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show: 1) Top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and all LLM-based error detectors perform much worse than humans. 2) Explanations by LLM-based error detectors lack reliability. 3) LLM-based error detection is sensitive to small changes in prompts but remains challenging to improve. 4) Popular approaches to improving LLMs, including self-consistency and majority vote, do not improve the error detection performance. Our benchmark and code are provided at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.03602v2-abstract-full').style.display = 'none'; document.getElementById('2404.03602v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 4 April, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">COLM 2024, 46 pages, Benchmark and code:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.15246</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Weller%2C+O">Orion Weller</a>, <a href="/search/cs?searchtype=author&amp;query=Chang%2C+B">Benjamin Chang</a>, <a href="/search/cs?searchtype=author&amp;query=MacAvaney%2C+S">Sean MacAvaney</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Van+Durme%2C+B">Benjamin Van Durme</a>, <a href="/search/cs?searchtype=author&amp;query=Lawrie%2C+D">Dawn Lawrie</a>, <a href="/search/cs?searchtype=author&amp;query=Soldaini%2C+L">Luca Soldaini</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.15246v3-abstract-short" style="display: inline;"> Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. 
While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, w&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.15246v3-abstract-full').style.display = 'inline'; document.getElementById('2403.15246v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.15246v3-abstract-full" style="display: none;"> Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions -- also known as narratives -- developed for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. Through this process, we can measure how well IR models follow instructions, through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. 
However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements after fine-tuning on our training set. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.15246v3-abstract-full').style.display = 'none'; document.getElementById('2403.15246v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 22 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.05788</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Flores%2C+L+J+Y">Lorenzo Jaime Yu Flores</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.05788v1-abstract-short" style="display: inline;"> Text summarization and simplification are among the most widely used applications of AI. 
However, models developed for such tasks are often prone to hallucination, which can result from training on unaligned data. One efficient approach to address this issue is Loss Truncation (LT) (Kang and Hashimoto, 2020), an approach to modify the standard log loss to adaptively remove noisy examples during tr&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.05788v1-abstract-full').style.display = 'inline'; document.getElementById('2403.05788v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.05788v1-abstract-full" style="display: none;"> Text summarization and simplification are among the most widely used applications of AI. However, models developed for such tasks are often prone to hallucination, which can result from training on unaligned data. One efficient approach to address this issue is Loss Truncation (LT) (Kang and Hashimoto, 2020), an approach to modify the standard log loss to adaptively remove noisy examples during training. However, we find that LT alone yields a considerable number of hallucinated entities on various datasets. We study the behavior of the underlying losses between factual and non-factual examples, to understand and refine the performance of LT. We demonstrate that LT&#39;s performance is limited when the underlying assumption that noisy targets have higher NLL loss is not satisfied, and find that word-level NLL among entities provides better signal for distinguishing factuality. We then leverage this to propose a fine-grained NLL loss and fine-grained data cleaning strategies, and observe improvements in hallucination reduction across some datasets. 
Our work is available at https:// <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.05788v1-abstract-full').style.display = 'none'; document.getElementById('2403.05788v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EACL 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.04811</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Riddell%2C+M">Martin Riddell</a>, <a href="/search/cs?searchtype=author&amp;query=Ni%2C+A">Ansong Ni</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.04811v1-abstract-short" style="display: inline;"> While large language models have achieved remarkable performance on various code generation benchmarks, there have been 
growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.04811v1-abstract-full').style.display = 'inline'; document.getElementById('2403.04811v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.04811v1-abstract-full" style="display: none;"> While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into how data contamination impacts the evaluation of code generation, which is critical for understanding the robustness and reliability of LLMs in programming contexts. In this work, we perform a comprehensive study of data contamination of popular code generation benchmarks, and precisely quantify their overlap with pretraining corpora through both surface-level and semantic-level matching. In our experiments, we show that there is substantial overlap between popular code generation benchmarks and open training corpora, and models perform significantly better on the subset of the benchmarks where similar solutions are seen during training. We also conduct extensive analysis on the factors that affect model memorization and generalization, such as model size, problem difficulty, and question length. We release all resulting files from our matching pipeline for future research. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.04811v1-abstract-full').style.display = 'none'; document.getElementById('2403.04811v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2402.06544</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Calibrating Long-form Generations from Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Huang%2C+Y">Yukun Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Thirukovalluru%2C+R">Raghuveer Thirukovalluru</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Dhingra%2C+B">Bhuwan Dhingra</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2402.06544v2-abstract-short" style="display: inline;"> To enhance Large Language Models&#39; (LLMs) reliability, calibration is essential -- the model&#39;s assessed confidence scores should align with the actual likelihood of 
its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where a&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.06544v2-abstract-full').style.display = 'inline'; document.getElementById('2402.06544v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2402.06544v2-abstract-full" style="display: none;"> To enhance Large Language Models&#39; (LLMs) reliability, calibration is essential -- the model&#39;s assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework, in which both the correctness of the LLMs&#39; responses and their associated confidence levels are treated as distributions across a range of scores. Within this framework, we develop three metrics to precisely evaluate LLM calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. Our experiments, which include long-form QA and summarization tasks, demonstrate that larger models don&#39;t necessarily guarantee better calibration, that calibration performance is found to be metric-dependent, and that self-consistency methods excel in factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. 
Lastly, we showcase a practical application of our system: selecting and cascading open-source models and ChatGPT to optimize correctness given a limited API budget. This research not only challenges existing notions of LLM calibration but also offers practical methodologies for improving trustworthiness in long-form generation. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.06544v2-abstract-full').style.display = 'none'; document.getElementById('2402.06544v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 9 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2402.04247</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computers and Society">cs.CY</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Jin%2C+Q">Qiao Jin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+K">Kunlun Zhu</a>, <a 
href="/search/cs?searchtype=author&amp;query=Yuan%2C+T">Tongxin Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yichi Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+W">Wangchunshu Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+M">Meng Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+J">Jian Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Z">Zhuosheng Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+Z">Zhiyong Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2402.04247v4-abstract-short" style="display: inline;"> Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents, called scientific LLM agents, also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notab&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.04247v4-abstract-full').style.display = 'inline'; document.getElementById('2402.04247v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2402.04247v4-abstract-full" style="display: none;"> Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. 
While their capabilities are promising, these agents, called scientific LLM agents, also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. This perspective paper fills this gap by conducting a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. We begin by providing a comprehensive overview of the potential risks inherent to scientific LLM agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.04247v4-abstract-full').style.display = 'none'; document.getElementById('2402.04247v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 6 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2402.00838</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> OLMo: Accelerating the Science of Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Groeneveld%2C+D">Dirk Groeneveld</a>, <a href="/search/cs?searchtype=author&amp;query=Beltagy%2C+I">Iz Beltagy</a>, <a href="/search/cs?searchtype=author&amp;query=Walsh%2C+P">Pete Walsh</a>, <a href="/search/cs?searchtype=author&amp;query=Bhagia%2C+A">Akshita Bhagia</a>, <a href="/search/cs?searchtype=author&amp;query=Kinney%2C+R">Rodney Kinney</a>, <a href="/search/cs?searchtype=author&amp;query=Tafjord%2C+O">Oyvind Tafjord</a>, <a href="/search/cs?searchtype=author&amp;query=Jha%2C+A+H">Ananya Harsh Jha</a>, <a href="/search/cs?searchtype=author&amp;query=Ivison%2C+H">Hamish Ivison</a>, <a href="/search/cs?searchtype=author&amp;query=Magnusson%2C+I">Ian Magnusson</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yizhong Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Arora%2C+S">Shane Arora</a>, <a href="/search/cs?searchtype=author&amp;query=Atkinson%2C+D">David Atkinson</a>, <a href="/search/cs?searchtype=author&amp;query=Authur%2C+R">Russell Authur</a>, <a href="/search/cs?searchtype=author&amp;query=Chandu%2C+K+R">Khyathi Raghavi Chandu</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Dumas%2C+J">Jennifer Dumas</a>, <a href="/search/cs?searchtype=author&amp;query=Elazar%2C+Y">Yanai Elazar</a>, <a href="/search/cs?searchtype=author&amp;query=Gu%2C+Y">Yuling Gu</a>, <a 
href="/search/cs?searchtype=author&amp;query=Hessel%2C+J">Jack Hessel</a>, <a href="/search/cs?searchtype=author&amp;query=Khot%2C+T">Tushar Khot</a>, <a href="/search/cs?searchtype=author&amp;query=Merrill%2C+W">William Merrill</a>, <a href="/search/cs?searchtype=author&amp;query=Morrison%2C+J">Jacob Morrison</a>, <a href="/search/cs?searchtype=author&amp;query=Muennighoff%2C+N">Niklas Muennighoff</a>, <a href="/search/cs?searchtype=author&amp;query=Naik%2C+A">Aakanksha Naik</a>, <a href="/search/cs?searchtype=author&amp;query=Nam%2C+C">Crystal Nam</a> , et al. (18 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2402.00838v4-abstract-short" style="display: inline;"> Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.00838v4-abstract-full').style.display = 'inline'; document.getElementById('2402.00838v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2402.00838v4-abstract-full" style="display: none;"> Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. 
Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.00838v4-abstract-full').style.display = 'none'; document.getElementById('2402.00838v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 1 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2312.16291</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Observable Propagation: Uncovering Feature Vectors in Transformers </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Dunefsky%2C+J">Jacob Dunefsky</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2312.16291v2-abstract-short" style="display: inline;"> A key goal of current mechanistic interpretability research in NLP is to find linear features (also called &#34;feature vectors&#34;) for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. 
Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.16291v2-abstract-full').style.display = 'inline'; document.getElementById('2312.16291v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2312.16291v2-abstract-full" style="display: none;"> A key goal of current mechanistic interpretability research in NLP is to find linear features (also called &#34;feature vectors&#34;) for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally expensive to utilize. In this work, we introduce a novel method, called &#34;observable propagation&#34; (in short: ObProp), for finding linear features used by transformer language models in computing a given task -- using almost no data. Our paradigm centers on the concept of &#34;observables&#34;, linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors, including a similarity metric between feature vectors called the coupling coefficient which estimates the degree to which one feature&#39;s output correlates with another&#39;s. We use ObProp to perform extensive qualitative investigations into several tasks, including gendered occupational bias, political party prediction, and programming language detection. Our results suggest that ObProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that ObProp can be used to better understand the mechanisms responsible for bias in large language models. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.16291v2-abstract-full').style.display = 'none'; document.getElementById('2312.16291v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 26 December, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">42 pages, 6 tables, 3 figures. ICML 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.10537</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Zou%2C+A">Anni Zou</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Z">Zhuosheng Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Ziming Li</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xingyao Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a 
href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.10537v4-abstract-short" style="display: inline;"> Large language models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these issues, we propose MedAgents, a novel multi-disciplinary collaboration framework for the medical domain. MedAgent&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.10537v4-abstract-full').style.display = 'inline'; document.getElementById('2311.10537v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.10537v4-abstract-full" style="display: none;"> Large language models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these issues, we propose MedAgents, a novel multi-disciplinary collaboration framework for the medical domain. MedAgents leverages LLM-based agents in a role-playing setting that participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarising these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work focuses on the zero-shot setting, which is applicable in real-world scenarios. 
Experimental results on nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MedAgents framework excels at mining and harnessing the medical expertise within LLMs, as well as extending its reasoning abilities. Our code can be found at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.10537v4-abstract-full').style.display = 'none'; document.getElementById('2311.10537v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.09835</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yuliang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+Z">Zefan Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Shao%2C+Y">Yanjun Shao</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+J">Junjie Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yichi Zhang</a>, <a 
href="/search/cs?searchtype=author&amp;query=Deng%2C+Z">Zexuan Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+H">Helan Hu</a>, <a href="/search/cs?searchtype=author&amp;query=An%2C+K">Kaikai An</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+R">Ruijun Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Si%2C+S">Shuzheng Si</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+S">Sheng Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+H">Haozhe Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+L">Liang Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+T">Tianyu Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+Z">Zhiwei Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Chang%2C+B">Baobao Chang</a>, <a href="/search/cs?searchtype=author&amp;query=Fang%2C+Y">Yin Fang</a>, <a href="/search/cs?searchtype=author&amp;query=Qin%2C+Y">Yujia Qin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+W">Wangchunshu Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.09835v5-abstract-short" style="display: inline;"> Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. 
Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., com&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09835v5-abstract-full').style.display = 'inline'; document.getElementById('2311.09835v5-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.09835v5-abstract-full" style="display: none;"> Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses 9,641 annotated examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs&#39; text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. 
Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09835v5-abstract-full').style.display = 'none'; document.getElementById('2311.09835v5-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.09805</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Long%2C+Y">Yitao Long</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Hongjun Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Kamoi%2C+R">Ryo Kamoi</a>, <a href="/search/cs?searchtype=author&amp;query=Nan%2C+L">Linyong Nan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+L">Lyuhao Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a 
href="/search/cs?searchtype=author&amp;query=Zhang%2C+R">Rui Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.09805v3-abstract-short" style="display: inline;"> Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09805v3-abstract-full').style.display = 'inline'; document.getElementById('2311.09805v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.09805v3-abstract-full" style="display: none;"> Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing specialized documents containing both text and tables. We conduct an extensive evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods, aiming to comprehensively assess the capabilities and limitations of existing LLMs in DocMath-Eval. 
We found that even the current best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe that DocMath-Eval can serve as a valuable benchmark for evaluating LLMs&#39; capabilities in solving challenging numerical reasoning problems within expert domains. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09805v3-abstract-full').style.display = 'none'; document.getElementById('2311.09805v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ACL 2024 Oral</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.09797</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Hongjun Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Long%2C+Y">Yitao Long</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+R">Rui Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+C">Chen 
Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.09797v2-abstract-short" style="display: inline;"> We introduce FinanceMath, a novel benchmark designed to evaluate LLMs&#39; capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09797v2-abstract-full').style.display = 'inline'; document.getElementById('2311.09797v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.09797v2-abstract-full" style="display: none;"> We introduce FinanceMath, a novel benchmark designed to evaluate LLMs&#39; capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 44 LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. 
Our experimental results reveal that the current best-performing system (i.e., GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that FinanceMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving reasoning-intensive tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09797v2-abstract-full').style.display = 'none'; document.getElementById('2311.09797v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ACL 2024 Oral</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.09783</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Investigating Data Contamination in Modern Benchmarks for Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Deng%2C+C">Chunyuan Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.09783v2-abstract-short" style="display: inline;"> Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. 
In this paper we study data contamination by proposing two methods ta&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09783v2-abstract-full').style.display = 'inline'; document.getElementById('2311.09783v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.09783v2-abstract-full" style="display: none;"> Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09783v2-abstract-full').style.display = 'none'; document.getElementById('2311.09783v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 April, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NAACL 2024 Version</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.09765</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lee%2C+H">Hyunji Lee</a>, <a href="/search/cs?searchtype=author&amp;query=Soldaini%2C+L">Luca Soldaini</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Seo%2C+M">Minjoon Seo</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.09765v1-abstract-short" style="display: inline;"> 
Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization capabilities to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base mo&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09765v1-abstract-full').style.display = 'inline'; document.getElementById('2311.09765v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.09765v1-abstract-full" style="display: none;"> Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization capabilities to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base model pretraining, comparatively little investigation has examined whether the training procedures themselves can be improved to yield better generalization capabilities in the resulting models. In this work, we recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives. We validate these recommendations using the BEIR benchmark and find results are persistent across choice of dense encoder and base model size and are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining. 
We hope that this thorough and impartial study around various training techniques, which augments other resource-intensive methods, offers practical insights for developing a dense retrieval model that effectively generalizes, even when trained on a single dataset. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09765v1-abstract-full').style.display = 'none'; document.getElementById('2311.09765v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.09721</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Nan%2C+L">Linyong Nan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+E">Ellen Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zou%2C+W">Weijin Zou</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+W">Wenfei Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.09721v1-abstract-short" style="display: inline;"> This study introduces a 
new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task necessitates LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings high&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09721v1-abstract-full').style.display = 'inline'; document.getElementById('2311.09721v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.09721v1-abstract-full" style="display: none;"> This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task necessitates LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings highlight that this task poses great challenges even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction. A key discovery is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, enhancing the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09721v1-abstract-full').style.display = 'none'; document.getElementById('2311.09721v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.09184</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Fabbri%2C+A+R">Alexander R. 
Fabbri</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jiawen Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+S">Simeng Han</a>, <a href="/search/cs?searchtype=author&amp;query=Joty%2C+S">Shafiq Joty</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+P">Pengfei Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Radev%2C+D">Dragomir Radev</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+C">Chien-Sheng Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.09184v2-abstract-short" style="display: inline;"> While large language models (LLMs) can already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for desired summary characteristi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09184v2-abstract-full').style.display = 'inline'; document.getElementById('2311.09184v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.09184v2-abstract-full" style="display: none;"> While large language models (LLMs) can already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. 
Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluations of five LLM-based systems to assess their instruction-following capabilities in controllable summarization. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) no LLM-based evaluation methods can achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation capabilities. We make our collected benchmark InstruSum publicly available to facilitate future research in this direction. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.09184v2-abstract-full').style.display = 'none'; document.getElementById('2311.09184v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NAACL 2024 Findings, GitHub Repo:, LLM-evaluators Leaderboard:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.11191</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Flores%2C+L+J+Y">Lorenzo Jaime Yu Flores</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+H">Heyuan Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+K">Kejian Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Chheang%2C+S">Sophie Chheang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.11191v2-abstract-short" style="display: inline;"> Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. 
In this work, we explore ways to&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.11191v2-abstract-full').style.display = 'inline'; document.getElementById('2310.11191v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.11191v2-abstract-full" style="display: none;"> Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. In this work, we explore ways to further improve the readability of text simplification in the medical domain. We propose (1) a new unlikelihood loss that encourages generation of simpler terms and (2) a reranked beam search decoding method that optimizes for simplicity, which achieve better performance on readability metrics on three datasets. This study&#39;s findings offer promising avenues for improving text simplification in the medical field. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.11191v2-abstract-full').style.display = 'none'; document.getElementById('2310.11191v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 17 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EMNLP 2023 Findings</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2309.17446</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Programming Languages">cs.PL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> </div> </div> <p class="title is-5 mathjax"> L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ni%2C+A">Ansong Ni</a>, <a href="/search/cs?searchtype=author&amp;query=Yin%2C+P">Pengcheng Yin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Riddell%2C+M">Martin Riddell</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+T">Troy Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+R">Rui Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Yin%2C+S">Stephen Yin</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Ye Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Yavuz%2C+S">Semih Yavuz</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+C">Caiming Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Joty%2C+S">Shafiq Joty</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yingbo Zhou</a>, <a 
href="/search/cs?searchtype=author&amp;query=Radev%2C+D">Dragomir Radev</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2309.17446v2-abstract-short" style="display: inline;"> Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising results, there is a notable lack of a comprehensive evaluation of these models&#39; language-to-code generation capabilities. Existing studies often focus on specific task&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.17446v2-abstract-full').style.display = 'inline'; document.getElementById('2309.17446v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2309.17446v2-abstract-full" style="display: none;"> Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising results, there is a notable lack of a comprehensive evaluation of these models&#39; language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. 
In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs. This enables us to identify and analyze the typical failure modes across various tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We also release the evaluation framework and all model outputs, hoping to lay the groundwork for further future research in this domain. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.17446v2-abstract-full').style.display = 'none'; document.getElementById('2309.17446v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 29 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project Website:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2309.08963</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Zong%2C+Y">Yiming Zong</a>, <a href="/search/cs?searchtype=author&amp;query=Phang%2C+J">Jason Phang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+W">Wangchunshu Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Gerstein%2C+M">Mark Gerstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2309.08963v3-abstract-short" style="display: inline;"> Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs&#39; proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. 
We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.08963v3-abstract-full').style.display = 'inline'; document.getElementById('2309.08963v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2309.08963v3-abstract-full" style="display: none;"> Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs&#39; proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. In-depth error analysis and creating an ability map across six dimensions -- coverage, formatting, reasoning, comprehension, pragmatics, and hallucination -- highlight areas for future enhancements and suggest forthcoming research trajectories. 
Our code and models can be found at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.08963v3-abstract-full').style.display = 'none'; document.getElementById('2309.08963v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 April, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2309.08960</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> ODSum: New Benchmarks for Open Domain Multi-Document Summarization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yijie Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+K">Kejian Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+W">Wencai Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2309.08960v1-abstract-short" style="display: inline;"> Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. 
With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure retrieval performance. We propose a rule-based method to process query-based document summarization datasets into ODMD&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.08960v1-abstract-full').style.display = 'inline'; document.getElementById('2309.08960v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2309.08960v1-abstract-full" style="display: none;"> Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure retrieval performance. We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets. Based on this method, we introduce a novel dataset, ODSum, a sophisticated case with its document index interdependent and often interrelated. We tackle ODMDS with the retrieve-then-summarize method, and the performance of a list of retrievers and summarizers is investigated. Through extensive experiments, we identify variances in evaluation metrics and provide insights into their reliability. We also found that LLMs suffer great performance loss from retrieval errors. We further experimented with methods to improve performance and investigated their robustness against imperfect retrieval. 
We will release our data and code at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.08960v1-abstract-full').style.display = 'none'; document.getElementById('2309.08960v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2309.08541</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> When do Generative Query and Document Expansions Fail? 
A Comprehensive Study Across Methods, Retrievers, and Datasets </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Weller%2C+O">Orion Weller</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a>, <a href="/search/cs?searchtype=author&amp;query=Wadden%2C+D">David Wadden</a>, <a href="/search/cs?searchtype=author&amp;query=Lawrie%2C+D">Dawn Lawrie</a>, <a href="/search/cs?searchtype=author&amp;query=Van+Durme%2C+B">Benjamin Van Durme</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Soldaini%2C+L">Luca Soldaini</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2309.08541v2-abstract-short" style="display: inline;"> Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find t&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.08541v2-abstract-full').style.display = 'inline'; document.getElementById('2309.08541v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2309.08541v2-abstract-full" style="display: none;"> Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. 
To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find that there exists a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they add additional noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset significantly differs from training corpus in format; otherwise, avoid expansions to keep the relevance signal clear. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.08541v2-abstract-full').style.display = 'none'; document.getElementById('2309.08541v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EACL 2024 camera ready</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.15387</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Caciularu%2C+A">Avi Caciularu</a>, <a href="/search/cs?searchtype=author&amp;query=Peters%2C+M+E">Matthew E. Peters</a>, <a href="/search/cs?searchtype=author&amp;query=Goldberger%2C+J">Jacob Goldberger</a>, <a href="/search/cs?searchtype=author&amp;query=Dagan%2C+I">Ido Dagan</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.15387v1-abstract-short" style="display: inline;"> The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. 
To that end, given a set (or cluster) of topically-related documents, we systemati&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.15387v1-abstract-full').style.display = 'inline'; document.getElementById('2305.15387v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.15387v1-abstract-full" style="display: none;"> The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while &#34;peeking&#34; into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). 
Following this scheme, we pre-train our model -- termed QAmden -- and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7% and significantly outperforming zero-shot GPT-3.5 and GPT-4. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.15387v1-abstract-full').style.display = 'none'; document.getElementById('2305.15387v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted at ACL 2023; camera-ready version</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.14987</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+H">Haowei Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Si%2C+S">Shengyun Si</a>, <a href="/search/cs?searchtype=author&amp;query=Nan%2C+L">Linyong Nan</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p 
class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.14987v2-abstract-short" style="display: inline;"> Tabular data is prevalent across various industries, necessitating significant time and effort for users to understand and manipulate for their information-seeking purposes. The advancements in large language models (LLMs) have shown enormous potential to improve user efficiency. However, the adoption of LLMs in real-world applications for table information seeking remains underexplored. In this p&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14987v2-abstract-full').style.display = 'inline'; document.getElementById('2305.14987v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.14987v2-abstract-full" style="display: none;"> Tabular data is prevalent across various industries, necessitating significant time and effort for users to understand and manipulate for their information-seeking purposes. The advancements in large language models (LLMs) have shown enormous potential to improve user efficiency. However, the adoption of LLMs in real-world applications for table information seeking remains underexplored. In this paper, we investigate the table-to-text capabilities of different LLMs using four datasets within two real-world information seeking scenarios. These include the LogicNLG and our newly-constructed LoTNLG datasets for data insight generation, along with the FeTaQA and our newly-constructed F2WTQ datasets for query-based generation. We structure our investigation around three research questions, evaluating the performance of LLMs in table-to-text generation, automated evaluation, and feedback generation, respectively. 
Experimental results indicate that the current high-performing LLM, specifically GPT-4, can effectively serve as a table-to-text generator, evaluator, and feedback generator, facilitating users&#39; information seeking purposes in real-world scenarios. However, a significant performance gap still exists between other open-sourced LLMs (e.g., Tulu and LLaMA-2) and GPT-4 models. Our data and code are publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14987v2-abstract-full').style.display = 'none'; document.getElementById('2305.14987v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Camera-ready version for EMNLP 2023 industry track</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.14772</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Newman%2C+B">Benjamin Newman</a>, <a href="/search/cs?searchtype=author&amp;query=Soldaini%2C+L">Luca Soldaini</a>, <a href="/search/cs?searchtype=author&amp;query=Fok%2C+R">Raymond Fok</a>, <a 
href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.14772v3-abstract-short" style="display: inline;"> Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14772v3-abstract-full').style.display = 'inline'; document.getElementById('2305.14772v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.14772v3-abstract-full" style="display: none;"> Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First, we define the requirements and challenges for this user-facing decontextualization task, such as clarifying where edits occur and handling references to other documents. Second, we propose a framework that decomposes the task into three stages: question generation, question answering, and rewriting. Using this framework, we collect gold decontextualizations from experienced scientific article readers. 
We then conduct a range of experiments across state-of-the-art commercial and open-source language models to identify how to best provide missing-but-relevant information to models for our task. Finally, we develop QaDecontext, a simple prompting strategy inspired by our framework that improves over end-to-end prompting. We conclude with analysis that finds, while rewriting is easy, question generation and answering remain challenging for today&#39;s models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14772v3-abstract-full').style.display = 'none'; document.getElementById('2305.14772v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">19 pages, 2 figures, 8 tables, EMNLP2023</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.14303</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> QTSumm: Query-Focused Summarization over Tabular Data </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Qi%2C+Z">Zhenting Qi</a>, <a href="/search/cs?searchtype=author&amp;query=Nan%2C+L">Linyong Nan</a>, <a href="/search/cs?searchtype=author&amp;query=Mi%2C+B">Boyu Mi</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zou%2C+W">Weijin Zou</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+S">Simeng Han</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+R">Ruizhe Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+X">Xiangru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+Y">Yumo Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Radev%2C+D">Dragomir Radev</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.14303v2-abstract-short" style="display: inline;"> People primarily consult tables to conduct data analysis or answer specific questions. 
Text generation systems that can provide accurate table summaries tailored to users&#39; information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and a&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14303v2-abstract-full').style.display = 'inline'; document.getElementById('2305.14303v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.14303v2-abstract-full" style="display: none;"> People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users&#39; information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and analysis over the given table to generate a tailored summary. We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables covering diverse topics. We investigate a set of strong baselines on QTSumm, including text generation, table-to-text generation, and large language models. Experimental results and manual analysis reveal that the new task presents significant challenges in table-to-text generation for future research. Moreover, we propose a new approach named ReFactor, to retrieve and reason over query-relevant information from tabular data to generate several natural language facts. Experimental results demonstrate that ReFactor can bring improvements to baselines by concatenating the generated facts to the model input. 
Our data and code are publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14303v2-abstract-full').style.display = 'none'; document.getElementById('2305.14303v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 23 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted at EMNLP 2023</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.14239</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> On Learning to Summarize with Large Language Models as References </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yixin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+K">Kejian Shi</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+K+S">Katherine S He</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+L">Longtian Ye</a>, <a href="/search/cs?searchtype=author&amp;query=Fabbri%2C+A+R">Alexander R. 
Fabbri</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+P">Pengfei Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Radev%2C+D">Dragomir Radev</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.14239v3-abstract-short" style="display: inline;"> Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved. To this end, we use LLMs as both oracle&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14239v3-abstract-full').style.display = 'inline'; document.getElementById('2305.14239v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.14239v3-abstract-full" style="display: none;"> Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved. To this end, we use LLMs as both oracle summary generators for standard supervised fine-tuning and oracle summary evaluators for efficient contrastive learning that leverages the LLMs&#39; supervision signals. 
We conduct comprehensive experiments with source news articles and find that (1) summarization models trained under the LLM-as-reference setting achieve significant performance improvement in both LLM and human evaluations; (2) contrastive learning outperforms standard supervised fine-tuning under both low and high resource settings. Our experimental results also enable a meta-analysis of LLMs&#39; summary evaluation capacities under a challenging setting, showing that LLMs are not well-aligned with human evaluators. Particularly, our expert human evaluation reveals remaining nuanced performance gaps between LLMs and our fine-tuned models, which LLMs fail to capture. Thus, we call for further studies into both the potential and challenges of using LLMs in summarization model development. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.14239v3-abstract-full').style.display = 'none'; document.getElementById('2305.14239v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 23 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NAACL 2024, GitHub Repo:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.12586</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Nan%2C+L">Linyong Nan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yilun Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Zou%2C+W">Weijin Zou</a>, <a href="/search/cs?searchtype=author&amp;query=Ri%2C+N">Narutatsu Ri</a>, <a href="/search/cs?searchtype=author&amp;query=Tae%2C+J">Jaesung Tae</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+E">Ellen Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Radev%2C+D">Dragomir Radev</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.12586v1-abstract-short" style="display: inline;"> In-context learning (ICL) has emerged as a new approach to various natural language processing tasks, utilizing large language models (LLMs) to make predictions based on context that has been supplemented with a few examples or task-specific instructions. 
In this paper, we aim to extend this method to question answering tasks that utilize structured knowledge sources, and improve Text-to-SQL syste&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.12586v1-abstract-full').style.display = 'inline'; document.getElementById('2305.12586v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.12586v1-abstract-full" style="display: none;"> In-context learning (ICL) has emerged as a new approach to various natural language processing tasks, utilizing large language models (LLMs) to make predictions based on context that has been supplemented with a few examples or task-specific instructions. In this paper, we aim to extend this method to question answering tasks that utilize structured knowledge sources, and improve Text-to-SQL systems by exploring various prompt design strategies for employing LLMs. We conduct a systematic investigation into different demonstration selection methods and optimal instruction formats for prompting LLMs in the Text-to-SQL task. Our approach involves leveraging the syntactic structure of an example&#39;s SQL query to retrieve demonstrations, and we demonstrate that pursuing both diversity and similarity in demonstration selection leads to enhanced performance. Furthermore, we show that LLMs benefit from database-related knowledge augmentations. Our most effective strategy outperforms the state-of-the-art system by 2.5 points (Execution Accuracy) and the best fine-tuned system by 5.1 points on the Spider dataset. These results highlight the effectiveness of our approach in adapting LLMs to the Text-to-SQL task, and we present an analysis of the factors contributing to the success of our strategy. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.12586v1-abstract-full').style.display = 'none'; document.getElementById('2305.12586v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.11744</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> ReFIT: Relevance Feedback from a Reranker during Inference </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Reddy%2C+R+G">Revanth Gangi Reddy</a>, <a href="/search/cs?searchtype=author&amp;query=Dasigi%2C+P">Pradeep Dasigi</a>, <a href="/search/cs?searchtype=author&amp;query=Sultan%2C+M+A">Md Arafat Sultan</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Sil%2C+A">Avirup Sil</a>, <a href="/search/cs?searchtype=author&amp;query=Ji%2C+H">Heng Ji</a>, <a href="/search/cs?searchtype=author&amp;query=Hajishirzi%2C+H">Hannaneh Hajishirzi</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.11744v2-abstract-short" style="display: inline;"> Retrieve-and-rerank is a prevalent framework in neural information retrieval, wherein a bi-encoder network initially 
retrieves a pre-defined number of candidates (e.g., K=100), which are then reranked by a more powerful cross-encoder model. While the reranker often yields improved candidate scores compared to the retriever, its scope is confined to only the top K retrieved candidates. As a result,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.11744v2-abstract-full').style.display = 'inline'; document.getElementById('2305.11744v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.11744v2-abstract-full" style="display: none;"> Retrieve-and-rerank is a prevalent framework in neural information retrieval, wherein a bi-encoder network initially retrieves a pre-defined number of candidates (e.g., K=100), which are then reranked by a more powerful cross-encoder model. While the reranker often yields improved candidate scores compared to the retriever, its scope is confined to only the top K retrieved candidates. As a result, the reranker cannot improve retrieval performance in terms of Recall@K. In this work, we propose to leverage the reranker to improve recall by making it provide relevance feedback to the retriever at inference time. Specifically, given a test instance during inference, we distill the reranker&#39;s predictions for that instance into the retriever&#39;s query representation using a lightweight update mechanism. The aim of the distillation loss is to align the retriever&#39;s candidate scores more closely with those produced by the reranker. The algorithm then proceeds by executing a second retrieval step using the updated query vector. We empirically demonstrate that this method, applicable to various retrieve-and-rerank frameworks, substantially enhances retrieval recall across multiple domains, languages, and modalities. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.11744v2-abstract-full').style.display = 'none'; document.getElementById('2305.11744v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 19 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Preprint</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.08379</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> TESS: Text-to-Text Self-Conditioned Simplex Diffusion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Mahabadi%2C+R+K">Rabeeh Karimi Mahabadi</a>, <a href="/search/cs?searchtype=author&amp;query=Ivison%2C+H">Hamish Ivison</a>, <a href="/search/cs?searchtype=author&amp;query=Tae%2C+J">Jaesung Tae</a>, <a href="/search/cs?searchtype=author&amp;query=Henderson%2C+J">James Henderson</a>, <a href="/search/cs?searchtype=author&amp;query=Beltagy%2C+I">Iz Beltagy</a>, <a href="/search/cs?searchtype=author&amp;query=Peters%2C+M+E">Matthew E. 
Peters</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.08379v2-abstract-short" style="display: inline;"> Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models to natural language remains challenging due to its discrete nature and the need for a large number of diffusion steps to generate text, making diffusion-based generation expensive. In this work, we propose Text-to-text Self-c&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.08379v2-abstract-full').style.display = 'inline'; document.getElementById('2305.08379v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.08379v2-abstract-full" style="display: none;"> Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models to natural language remains challenging due to its discrete nature and the need for a large number of diffusion steps to generate text, making diffusion-based generation expensive. In this work, we propose Text-to-text Self-conditioned Simplex Diffusion (TESS), a text diffusion model that is fully non-autoregressive, employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the learned embedding space. 
Through extensive experiments on natural language understanding and generation tasks including summarization, text simplification, paraphrase generation, and question generation, we demonstrate that TESS outperforms state-of-the-art non-autoregressive models, requires fewer diffusion steps with minimal drop in performance, and is competitive with pretrained autoregressive sequence-to-sequence models. We publicly release our codebase at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.08379v2-abstract-full').style.display = 'none'; document.getElementById('2305.08379v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EACL 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2301.13298</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Krishna%2C+K">Kalpesh Krishna</a>, <a href="/search/cs?searchtype=author&amp;query=Bransom%2C+E">Erin Bransom</a>, <a href="/search/cs?searchtype=author&amp;query=Kuehl%2C+B">Bailey Kuehl</a>, <a 
href="/search/cs?searchtype=author&amp;query=Iyyer%2C+M">Mohit Iyyer</a>, <a href="/search/cs?searchtype=author&amp;query=Dasigi%2C+P">Pradeep Dasigi</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2301.13298v1-abstract-short" style="display: inline;"> While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of t&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2301.13298v1-abstract-full').style.display = 'inline'; document.getElementById('2301.13298v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2301.13298v1-abstract-full" style="display: none;"> While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). 
Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall&#39;s tau using 50% judgments). We release our human judgments, annotation templates, and our software as a Python library for future research. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2301.13298v1-abstract-full').style.display = 'none'; document.getElementById('2301.13298v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 January, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EACL 2023 camera ready. 
Code and data can be found in</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2301.10140</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Digital Libraries">cs.DL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> The Semantic Scholar Open Data Platform </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kinney%2C+R">Rodney Kinney</a>, <a href="/search/cs?searchtype=author&amp;query=Anastasiades%2C+C">Chloe Anastasiades</a>, <a href="/search/cs?searchtype=author&amp;query=Authur%2C+R">Russell Authur</a>, <a href="/search/cs?searchtype=author&amp;query=Beltagy%2C+I">Iz Beltagy</a>, <a href="/search/cs?searchtype=author&amp;query=Bragg%2C+J">Jonathan Bragg</a>, <a href="/search/cs?searchtype=author&amp;query=Buraczynski%2C+A">Alexandra Buraczynski</a>, <a href="/search/cs?searchtype=author&amp;query=Cachola%2C+I">Isabel Cachola</a>, <a href="/search/cs?searchtype=author&amp;query=Candra%2C+S">Stefan Candra</a>, <a href="/search/cs?searchtype=author&amp;query=Chandrasekhar%2C+Y">Yoganand Chandrasekhar</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a>, <a href="/search/cs?searchtype=author&amp;query=Crawford%2C+M">Miles Crawford</a>, <a href="/search/cs?searchtype=author&amp;query=Downey%2C+D">Doug Downey</a>, <a href="/search/cs?searchtype=author&amp;query=Dunkelberger%2C+J">Jason Dunkelberger</a>, <a href="/search/cs?searchtype=author&amp;query=Etzioni%2C+O">Oren Etzioni</a>, <a href="/search/cs?searchtype=author&amp;query=Evans%2C+R">Rob Evans</a>, <a href="/search/cs?searchtype=author&amp;query=Feldman%2C+S">Sergey Feldman</a>, <a 
href="/search/cs?searchtype=author&amp;query=Gorney%2C+J">Joseph Gorney</a>, <a href="/search/cs?searchtype=author&amp;query=Graham%2C+D">David Graham</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+F">Fangzhou Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Huff%2C+R">Regan Huff</a>, <a href="/search/cs?searchtype=author&amp;query=King%2C+D">Daniel King</a>, <a href="/search/cs?searchtype=author&amp;query=Kohlmeier%2C+S">Sebastian Kohlmeier</a>, <a href="/search/cs?searchtype=author&amp;query=Kuehl%2C+B">Bailey Kuehl</a>, <a href="/search/cs?searchtype=author&amp;query=Langan%2C+M">Michael Langan</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+D">Daniel Lin</a> , et al. (23 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2301.10140v1-abstract-short" style="display: inline;"> The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF conte&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2301.10140v1-abstract-full').style.display = 'inline'; document.getElementById('2301.10140v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2301.10140v1-abstract-full" style="display: none;"> The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. 
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2301.10140v1-abstract-full').style.display = 'none'; document.getElementById('2301.10140v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 January, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">8 pages, 6 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2212.10526</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Giorgi%2C+J">John Giorgi</a>, <a href="/search/cs?searchtype=author&amp;query=Soldaini%2C+L">Luca Soldaini</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+B">Bo Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Bader%2C+G">Gary Bader</a>, <a href="/search/cs?searchtype=author&amp;query=Lo%2C+K">Kyle Lo</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+L+L">Lucy Lu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Cohan%2C+A">Arman Cohan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2212.10526v3-abstract-short" style="display: inline;"> Multi-document summarization (MDS) assumes a set of topic-related documents are provided as input. In practice, this document set is not always available; it would need to be retrieved given an information need, i.e. a question or topic statement, a setting we dub &#34;open-domain&#34; MDS. 
We study this more challenging setting by formalizing the task and bootstrapping it using existing datasets, retriev&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.10526v3-abstract-full').style.display = 'inline'; document.getElementById('2212.10526v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2212.10526v3-abstract-full" style="display: none;"> Multi-document summarization (MDS) assumes a set of topic-related documents are provided as input. In practice, this document set is not always available; it would need to be retrieved given an information need, i.e. a question or topic statement, a setting we dub &#34;open-domain&#34; MDS. We study this more challenging setting by formalizing the task and bootstrapping it using existing datasets, retrievers and summarizers. Via extensive automatic and human evaluation, we determine: (1) state-of-the-art summarizers suffer large reductions in performance when applied to open-domain MDS, (2) additional training in the open-domain setting can reduce this sensitivity to imperfect retrieval, and (3) summarizers are insensitive to the retrieval of duplicate documents and the order of retrieved documents, but highly sensitive to other errors, like the retrieval of irrelevant documents. Based on our results, we provide practical guidelines to enable future work on open-domain MDS, e.g. how to choose the number of retrieved documents to summarize. Our results suggest that new retrieval and summarization methods and annotated resources for training and evaluation are necessary for further progress in the open-domain setting. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.10526v3-abstract-full').style.display = 'none'; document.getElementById('2212.10526v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 20 December, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2022. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to EMNLP Findings 2023</span> </p> </li> </ol> <nav class="pagination is-small is-centered 