arXiv:2404.08680v1 [cs.CL] 8 Apr 2024
AUTOMATING RESEARCH SYNTHESIS WITH DOMAIN-SPECIFIC LARGE LANGUAGE MODEL FINE-TUNING

A PREPRINT

Teo Susnjak* — School of Mathematical and Computational Sciences, Massey University, Albany, New Zealand
Peter Hwang — School of Mathematical and Computational Sciences, Massey University, Albany, New Zealand
Napoleon H. Reyes — School of Mathematical and Computational Sciences, Massey University, Albany, New Zealand
Andre L. C. Barczak — Centre for Data Analytics, Bond University, Gold Coast, Australia
Timothy R. McIntosh — Cyberoo Pty Ltd, Surry Hills, NSW, Australia
Surangika Ranathunga — School of Mathematical and Computational Sciences, Massey University, Albany, New Zealand

April 16, 2024

ABSTRACT

This research pioneers the use of fine-tuned Large Language Models (LLMs) to automate Systematic Literature Reviews (SLRs), presenting a significant and novel contribution to integrating AI into academic research methodologies. Our study employed the latest fine-tuning methodologies together with open-source LLMs, and demonstrated a practical and efficient approach to automating the final execution stages of an SLR process, namely knowledge synthesis. The results maintained high fidelity in the factual accuracy of LLM responses, and were validated through the replication of an existing PRISMA-conforming SLR. Our research proposed solutions for mitigating LLM hallucination and mechanisms for tracking LLM responses back to their sources of information, demonstrating how this approach can meet the rigorous demands of scholarly research. The findings ultimately confirmed the potential of fine-tuned LLMs to streamline various labour-intensive processes of conducting literature reviews. Given the potential of this approach and its applicability across all research domains, this foundational study also advocated for updating PRISMA reporting guidelines to incorporate AI-driven processes, ensuring methodological transparency and reliability in future SLRs. This study broadens the appeal of AI-enhanced tools across academic and research fields, setting a new standard for conducting comprehensive and accurate literature reviews with greater efficiency in the face of ever-increasing volumes of academic studies.

Keywords: LLM Fine-tuning for SLRs · SLR Automation · Retrieval-Augmented Generation for Research · Domain-Specific Model Training · Knowledge Synthesis AI · AI-Driven Research Synthesis · Literature Review Automation · Generative AI · AI-Enhanced Systematic Reviews · PRISMA and AI Integration

* Corresponding author: t.susnjak@massey.ac.nz
1 Introduction

Systematic Literature Reviews (SLRs) serve as the bedrock of academic research, playing a crucial role in the amalgamation, examination, and synthesis of existing scholarly knowledge across various fields [1, 2, 3]. These reviews offer a methodical and replicable approach, ensuring the integrity and thoroughness of research synthesis, especially when combined with reporting guidelines such as PRISMA [4, 5, 6]. Such a foundation is indispensable for advancing both theoretical understanding and practical applications. However, the traditional execution of SLRs is marked by its manual and resource-intensive nature, often stretching over extensive periods, which introduces significant inefficiencies into the research process [7, 8].

The rigorous yet cumbersome character of traditional SLR methodologies presents considerable bottlenecks in the management and synthesis of large datasets of selected studies, which hinge on effective information retrieval [9, 10]. These challenges not only prolong the synthesis (the execution phase of a review) but also hamper the ongoing updating of SLRs with newer findings, risking the timeliness and relevance of the insights gleaned [11, 12]. This scenario underscores the need for innovative, scalable and sustainable solutions that can streamline the extraction of information from findings situated in academic papers, as well as its persistence in suitable information technologies that facilitate the accurate and effective retrieval necessary for executing SLRs [13, 14].

The recent advent of a new class of Artificial Intelligence (AI) systems, Large Language Models (LLMs), heralds a new epoch with the potential to dramatically redefine the SLR landscape by automating information retrieval processes while maintaining high factual fidelity [15, 8]. These models, with their advanced natural language comprehension, text generation and knowledge retention capabilities [16], offer a promising avenue for automating and optimizing various stages of the SLR process [17], in particular the execution phase, which relies on “talking to” individual academic papers via LLMs, as well as simultaneously “talking across” all target papers for synthesising purposes [18]. Despite this potential, the broad generalist pretraining of these models on vast amounts of diverse text data means that LLMs fall short of the domain-specific accuracy and precision in retrieved information that the detailed task of knowledge synthesis across a very specific and narrow set of studies demands. Moreover, their current propensity to hallucinate [19, 20, 21, 22] renders them unable to respond consistently and accurately, which precludes them from providing the reliable responses needed to conduct SLRs with integrity. Additionally, current LLMs have variable abilities to audit and track the sources of their responses [23]. Their inability to reliably ensure that the provenance of LLM responses can be linked to the target studies comprising an SLR corpus represents a serious limitation for using this technology for this purpose. Collectively, these gaps highlight a critical area within AI-assisted SLR processes: enhancing the information retrieval capabilities of LLMs [24]. Indeed, recent research has repeatedly raised these concerns with respect to using LLMs for the purposes of SLRs, especially in the knowledge synthesis stages. While it has been suggested that LLMs could be used to assist evidence synthesis tasks in SLRs via their summarization capabilities, concerns have been raised about their lack of continuous learning capability and temporal reasoning [25, 26]. Qureshi et al. [27] noted that LLMs show promise for aiding systematic review-related tasks, but the authors concluded that the technology is in its infancy and requires significant development for such applications. In their most recent review, Bolanos et al. [28] found that LLM usage for SLRs is hampered by limitations such as reduced efficacy in domain-specific and narrower subject areas, a propensity for hallucinating and generating misleading information, and opaque decision processes that cannot be audited.

To surmount these challenges, this study proposes the creation of fine-tuned open-source LLMs, trained on the corpus of selected academic papers for a target SLR, expanding the generalist knowledge of an LLM with narrower domain-specific expertise. This work devises a novel way to automatically extract information from a set of academic papers in order to create SLR-specific datasets which can be leveraged for fine-tuning LLMs so that they can support question-answering downstream tasks. We also devise mechanisms to mitigate LLM hallucination and to ensure that all LLM responses related to an SLR can be traced to source studies. The ensuing research presents a comprehensive SLR-automation framework with a focus on the knowledge synthesis stage, which aims to revolutionize information retrieval mechanisms with empirical evidence of their effectiveness, thereby expediting and transforming the synthesis of research findings in the context of SLRs.

Contribution

Our study substantially contributes to the field of information retrieval in general, wherever factual recall from arrays of documents is required and needs to be expressed in natural language while retaining the ability to audit responses with respect to the source information. Our proposed framework is specifically applied and demonstrated in the context of SLR automation, where the goal was to validate the proposed framework through an empirical study that replicates a previously published PRISMA-conforming SLR, which serves as a gold standard and a test use case. The contributions of this research can be summarised as:
• Devising a methodical and automated approach for converting selected academic papers into datasets that can be used for fine-tuning LLMs.
• Proposing an effective approach to ensuring factual recall in responses, with a mechanism to audit the source information of the LLM’s responses.
• Developing and validating evaluation metrics for testing the factuality of LLM responses.
• Benchmarking various AI automation methodologies for SLRs, focusing on the comparative effectiveness of different fine-tuning approaches.
• Demonstrating the efficacy of the proposed methodologies by replicating a published PRISMA-conforming SLR.
• Enhancing the scholarly toolkit for SLRs with advanced, efficient, and context-aware AI technologies.
• Setting new standards in academic research for reliability, validity, and ethical AI employment through pioneering AI methodologies.
• Releasing a Python package (https://github.com/peterjhwang/slr-helper) to facilitate data curation for LLM fine-tuning, tailored to the unique requirements of SLRs.

2 Background

This literature review examines the SLR process, especially the knowledge synthesis phase, and focuses on the integration of LLMs and AI to automate its various stages.
It also addresses current efforts in SLR automation, the role of LLMs in enhancing these processes, and the critical challenge of ensuring factual accuracy in the face of LLM hallucinations. The review further explores studies that have considered the potential of fine-tuning domain-specific LLMs to tailor their performance to SLR tasks, aiming to provide a clear, concise overview of the advancements and challenges in employing AI tools for SLR automation.

2.1 Systematic Literature Review Process and Synthesis of Studies

The process of conducting an SLR (Figure 1) begins with the clear articulation of the review’s purpose and intended outcomes. This foundational step ensures that the review’s scope and objectives are explicit to its audience [29]. Following this, the development of a detailed protocol and the training of the review team are essential to guarantee methodological rigor and consistency in the review’s execution. The screening process, both for inclusion and quality appraisal, demands transparency in the criteria used for study selection and exclusion, ensuring that only the most relevant and high-quality studies are considered for synthesis. The comprehensive search for literature, coupled with systematic data extraction from the included studies, lays the groundwork for the subsequent synthesis phase. This phase is where the insights gleaned from individual studies are integrated to form a coherent narrative or to distill new theoretical understandings.

Figure 1: The general process of conducting an SLR as outlined by Okoli [29], denoting which steps should be explicitly reported to the reader, as well as explained to justify the comprehensiveness of the SLR despite exclusion criteria, together with detailed communication of the steps taken to make the SLR reproducible.

2.1.1 Synthesis of Studies in SLRs

Having navigated the initial steps of the SLR process, from purpose identification through to quality appraisal, we arrive at the execution part of the SLR: the synthesis phase (Step 4 in Figure 1). This juncture is where the accumulated evidence is methodically combined to reveal new insights or to forge theoretical advancements. The approach to synthesis is inherently dependent on the nature of the primary studies under consideration: quantitative, qualitative, or a combination thereof [30]. The synthesis phase in SLRs is arguably one of the most challenging yet crucial stages of the literature review process. It demands not only a comprehensive understanding of the included studies but also the demanding task of integrating their findings [31]. This phase is pivotal for transcending the mere aggregation of data and information, aiming instead to distill new insights or theoretical contributions from the collective evidence [32].
The complexity of this task is magnified by the diversity of study designs, methodologies, terminologies, and data types encountered within the corpus of literature being reviewed [33].

2.1.2 Quantitative Synthesis: Meta-Analysis

Meta-analysis (or an integrative review) is a cornerstone of the quantitative synthesis landscape [34], and exemplifies the methodological rigor required to aggregate data across studies. Despite its potential for generating robust evidence, the meta-analytical process is fraught with challenges [35]. These include heterogeneity among study outcomes, variations in study quality, and the potential for publication bias, all of which necessitate sophisticated statistical techniques and critical judgment.
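As a brief, textbook-standard sketch of the aggregation machinery involved (not drawn from this paper), the fixed-effect inverse-variance model pools k study effect sizes y_i with within-study variances v_i, and Cochran’s Q with Higgins’ I² quantifies the heterogeneity just described:

```latex
% Fixed-effect (inverse-variance) pooled effect and its variance
\hat{\theta} = \frac{\sum_{i=1}^{k} w_i \, y_i}{\sum_{i=1}^{k} w_i},
\qquad w_i = \frac{1}{v_i},
\qquad \operatorname{Var}\!\left(\hat{\theta}\right) = \frac{1}{\sum_{i=1}^{k} w_i}

% Heterogeneity: Cochran's Q and Higgins' I^2
Q = \sum_{i=1}^{k} w_i \left(y_i - \hat{\theta}\right)^2,
\qquad
I^2 = \max\!\left(0,\; \frac{Q - (k-1)}{Q}\right)
```

When I² is large, the homogeneity assumption behind the fixed-effect weights is strained, which is precisely where the judgment calls discussed here become unavoidable.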
The method’s reliance on homogeneity and the need for comparable effect sizes across studies further complicate its application, especially in fields characterized by methodological diversity [36].

2.1.3 Qualitative Synthesis: Meta-Synthesis

Meta-synthesis, or an interpretive review, is the qualitative counterpart, which involves synthesizing findings from qualitative studies to uncover overarching themes or conceptual insights [37]. This process is inherently interpretative, requiring deep engagement with the textual data and an ability to discern patterns and meanings across disparate studies. The challenges here lie in maintaining the integrity and context of individual studies while seeking commonalities or divergent themes across the body of evidence [38]. The subjective nature of this synthesis approach underscores the need for transparency and reflexivity in the analytical process, ensuring that the synthesis is both comprehensive and faithful to the original studies [39].
The synthesis phase, irrespective of the methodological approach, is inherently</nobr></div> <div style="position:absolute;top:4348;left:108"><nobr>complex due to the need to balance between the depth of individual studies and the breadth of the review’s scope [<a href="#22">40]</a>.</nobr></div> <div style="position:absolute;top:4364;left:108"><nobr>This balancing act is further complicated by the varying quality of the studies, the diversity of their contexts, and the</nobr></div> <div style="position:absolute;top:4380;left:108"><nobr>potential for conflicting findings<a href="#22"> [41</a>]. As such, the synthesis not only requires methodological expertise but also a</nobr></div> <div style="position:absolute;top:4397;left:108"><nobr>creative and critical approach to integrating knowledge [42<a href="#22">]. </a>It is this phase that truly tests the reviewer’s ability to</nobr></div> <div style="position:absolute;top:4413;left:108"><nobr>construct a coherent narrative or theoretical framework that advances understanding in the field, making it one of the</nobr></div> <div style="position:absolute;top:4430;left:108"><nobr>most intellectually demanding aspects of the SLR process [<a href="#22">43]</a>.</nobr></div> <div style="position:absolute;top:4466;left:108"><nobr>2.2 Current State of AI Tools and Research for Automation in SLRs</nobr></div> <div style="position:absolute;top:4496;left:108"><nobr>Recent progress in AI, Natural Language Processing (NLP), and machine learning has significantly propelled the</nobr></div> <div style="position:absolute;top:4512;left:108"><nobr>development of automation tools for supporting literature reviews. 
Techniques such as text mining and machine</nobr></div> <div style="position:absolute;top:4529;left:108"><nobr>learning algorithms have demonstrated potential in automating various stages of the review process, from literature</nobr></div> <div style="position:absolute;top:4545;left:108"><nobr>retrieval to data extraction<a href="#22"> [</a>44]. These tools range from comprehensive research management platforms like Covidence<font style="font-size:8px"><a href="#4">3</a></font><a href="#4"></a>,</nobr></div> <div style="position:absolute;top:4561;left:107"><nobr>ASReview<a href="#4"></a><font style="font-size:8px"><a href="#4">4</a></font><a href="#4"></a>, Mendeley<font style="font-size:8px"><a href="#4">5</a></font><a href="#4"></a>, and Zotero<font style="font-size:8px"><a href="#4">6</a></font><a href="#4"></a>, to specialized tools aimed at specific literature review stages, such as ’connected</nobr></div> <div style="position:absolute;top:4578;left:108"><nobr>papers’<a href="#4"></a><font style="font-size:8px"><a href="#4">7 </a></font>for literature search and Elicit<a href="#4"></a><font style="font-size:8px"><a href="#4">8 </a></font>for individual paper analysis. Despite the growing availability of these tools,</nobr></div> <div style="position:absolute;top:4594;left:108"><nobr>their adoption remains limited, with researchers citing challenges such as steep learning curves and inadequate support</nobr></div> <div style="position:absolute;top:4610;left:108"><nobr>[<a href="#22">45]</a>. 
More advanced AI tools such as Scite <font style="font-size:8px"><a href="#4">9</a></font><a href="#4"></a> provide citation support, while ScholarAI <font style="font-size:8px"><a href="#4">10 </a></font>offers integration with GPT</nobr></div> <div style="position:absolute;top:4627;left:108"><nobr>models and numerous features for literature search support.</nobr></div> <div style="position:absolute;top:4651;left:108"><nobr>Meanwhile, the body of research dedicated to automating SLR tasks is expanding, with several review papers categorizing</nobr></div> <div style="position:absolute;top:4668;left:108"><nobr>these efforts <a href="#23">[46</a>, 47, 48, 8, 49, 50, 51, 52, 53, 54, 28]. Nonetheless, the focus on automating the knowledge synthesis</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:6px;font-family:Times"> <div style="position:absolute;top:4694;left:127"><nobr>3<a href="http://www.covidence.org"></a><font style="font-size:11px"><a href="http://www.covidence.org">http://www.covidence.org</a></font></nobr></div> <div style="position:absolute;top:4710;left:127"><nobr>4<a href="https://www.asreview.nl"></a><font style="font-size:11px"><a href="https://www.asreview.nl">https://www.asreview.nl</a></font></nobr></div> <div style="position:absolute;top:4726;left:127"><nobr>5<a href="https://www.mendeley.com"></a><font style="font-size:11px"><a href="https://www.mendeley.com">https://www.mendeley.com</a></font></nobr></div> <div style="position:absolute;top:4742;left:127"><nobr>6<a href="https://www.zotero.org"></a><font style="font-size:11px"><a href="https://www.zotero.org">https://www.zotero.org</a></font></nobr></div> <div style="position:absolute;top:4758;left:127"><nobr>7<a href="https://www.connectedpapers.com/"></a><font style="font-size:11px"><a href="https://www.connectedpapers.com/">https://www.connectedpapers.com/</a></font></nobr></div> <div style="position:absolute;top:4775;left:127"><nobr>8<a 
href="https://elicit.com/"></a><font style="font-size:11px"><a href="https://elicit.com/">https://elicit.com/</a></font></nobr></div> <div style="position:absolute;top:4791;left:127"><nobr>9<font style="font-size:11px">https://scite.ai/</font></nobr></div> <div style="position:absolute;top:4807;left:122"><nobr>10<font style="font-size:11px">https://scholarai.io/</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:4854;left:455"><nobr>4</nobr></div> </span></font> <a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:211;width:18;height:13;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:233;width:18;height:13;display:block"></a><a href="#21" style="border:1px solid #0000ff;position:absolute;top:4666;left:256;width:10;height:13;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:271;width:18;height:14;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:293;width:18;height:13;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:316;width:17;height:13;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:338;width:18;height:13;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:360;width:18;height:13;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:4666;left:383;width:18;height:13;display:block"></a><a href="#22" style="border:1px solid #0000ff;position:absolute;top:4666;left:405;width:18;height:13;display:block"></a> <div style="position:absolute;top:4927;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=5><b>Page 5</b></a></font></td></tr></table></div><font size=3 face="Times"><span 
style="font-size:12px;font-family:Times"> <div style="position:absolute;top:4982;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:4983;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:5040;left:108"><nobr>phase of SLRs remains minimal [28<a href="#22">]</a>, which is a reflection of the task’s inherent complexity. Most research in SLR</nobr></div> <div style="position:absolute;top:5056;left:108"><nobr>automation employs NLP and Information Retrieval techniques, with tasks such as data extraction and monitoring often</nobr></div> <div style="position:absolute;top:5073;left:108"><nobr>framed as text summarization challenges [<a href="#21">14,</a> 55].</nobr></div> <div style="position:absolute;top:5108;left:108"><nobr>2.2.1 LLMs for research and SLR automation</nobr></div> <div style="position:absolute;top:5137;left:108"><nobr>LLMs like GPT-3 have brought about transformative possibilities in SLRs. Rooted in the Transformer architecture,</nobr></div> <div style="position:absolute;top:5153;left:108"><nobr>these models are well-suited for processing and synthesizing information from large corpora <a href="#23">[56</a>] that include academic</nobr></div> <div style="position:absolute;top:5169;left:108"><nobr>texts. 
The fine-tuning of LLMs with domain-specific data presents an opportunity to enhance their effectiveness in</nobr></div> <div style="position:absolute;top:5186;left:108"><nobr>generating accurate study summaries, albeit challenges in reliability, validity, and interpretability remain [<a href="#23">57]</a>.</nobr></div> <div style="position:absolute;top:5210;left:107"><nobr>With the popularity of LLMs such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude, SLR researchers</nobr></div> <div style="position:absolute;top:5227;left:108"><nobr>now have a natural tendency to explore the capabilities of these models for SLR development. For example, Gupta</nobr></div> <div style="position:absolute;top:5243;left:108"><nobr>et al.<a href="#23"> [58</a>] used ChatGPT to generate novel SLR ideas and noted that the software was highly accurate. Hill et al. [59]</nobr></div> <div style="position:absolute;top:5259;left:108"><nobr>used Microsoft’s Bing AI to extract study characteristics from research papers as data items. Castillo-Segura et al.</nobr></div> <div style="position:absolute;top:5276;left:108"><nobr><a href="#23">[60</a>] compared six AI-based tools (Forefront, GetGPT, ThebAI, Claude, Bard, and H2O) for the screening stage of</nobr></div> <div style="position:absolute;top:5292;left:108"><nobr>SLRs. Kumar [61<a href="#23">] </a>used ChatGPT to generate small reports on research topics, but noted that the output lacks the rigour</nobr></div> <div style="position:absolute;top:5308;left:108"><nobr>expected in academic writing. Zimmermann et al. 
[62<a href="#23">] </a>used ChatGPT to answer several questions based on the title and</nobr></div> <div style="position:absolute;top:5325;left:108"><nobr>abstract of research papers and reported that ChatGPT has an accuracy of about 70%.<a href="#23"> [63</a>] used ChatGPT to automate</nobr></div> <div style="position:absolute;top:5341;left:108"><nobr>several stages of the SLR development process, including generating relevant keywords and phrases for literature search,</nobr></div> <div style="position:absolute;top:5357;left:108"><nobr>screening and filtering of studies, and extracting and synthesizing information from research papers. Interestingly, in</nobr></div> <div style="position:absolute;top:5374;left:108"><nobr>the synthesizing step, they queried ChatGPT on individual papers, as well as multiple papers. However, they do not</nobr></div> <div style="position:absolute;top:5390;left:108"><nobr>report quantifiable results on the performance of ChatGPT. Najafali et al<a href="#23">. [</a>64] conducted an even bolder experiment</nobr></div> <div style="position:absolute;top:5407;left:108"><nobr>by generating an entire SLR using ChatGPT, but noted that the output had many issues.</nobr></div> <div style="position:absolute;top:5442;left:108"><nobr>2.2.2 Hallucination in LLMs and Factuality in SLRs</nobr></div> <div style="position:absolute;top:5471;left:108"><nobr>The phenomenon of hallucination in LLMs, characterized by the generation of false yet plausible information, presents</nobr></div> <div style="position:absolute;top:5487;left:108"><nobr>a critical challenge in the deployment of LLMs [19<a href="#21">],</a> with natural relevance for SLRs. 
This challenge is particularly</nobr></div> <div style="position:absolute;top:5503;left:108"><nobr>acute given the stringent requirements for accuracy and reliability in SLR research, where any deviation from factual</nobr></div> <div style="position:absolute;top:5520;left:108"><nobr>correctness can significantly undermine the integrity of the review process and its outcomes. Hallucinations can</nobr></div> <div style="position:absolute;top:5536;left:108"><nobr>generally be categorized into two types, namely, open-domain and closed-domain hallucinations [<a href="#21">19]</a>.</nobr></div> <div style="position:absolute;top:5561;left:108"><nobr>Hallucinations in open-domain contexts emerge when LLMs produce outputs not directly anchored in specific instances</nobr></div> <div style="position:absolute;top:5577;left:108"><nobr>of the training data. This type of hallucination is particularly problematic in the context of SLRs, where the veracity</nobr></div> <div style="position:absolute;top:5593;left:108"><nobr>of information is paramount. Strategies to mitigate open-domain hallucinations include enhancing the diversity and</nobr></div> <div style="position:absolute;top:5610;left:108"><nobr>representativeness of training datasets, incorporating mechanisms for better context comprehension, and developing</nobr></div> <div style="position:absolute;top:5626;left:108"><nobr>techniques for the LLM to recognize and flag potential inaccuracies. Meanwhile, closed-domain hallucinations occur</nobr></div> <div style="position:absolute;top:5642;left:107"><nobr>within specific contexts, such as when LLMs generate content that diverges from the source text in tasks like translation</nobr></div> <div style="position:absolute;top:5659;left:108"><nobr>or summarization. 
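As a toy illustration of screening for closed-domain hallucination (our own lexical sketch, not a method from the paper; production systems would use entailment or fact-verification models), one can flag generated sentences that share too few content words with the source text they are supposed to summarize:

```python
# Illustrative sketch: a crude lexical grounding check that flags summary
# sentences with low content-word overlap against the source document.
import re

def content_words(text):
    """Lowercased word set, ignoring short function words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3}

def flag_ungrounded(source, summary, threshold=0.5):
    """Return summary sentences whose content-word overlap with the
    source falls below `threshold` (candidate hallucinations)."""
    src = content_words(source)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & src) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged
```

For example, given a source reporting a trial's enrolment and mortality outcome, a summary sentence about an unrelated patent would be flagged while the faithful sentence passes. The threshold is a tunable assumption.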
For SLRs, where synthesis of existing literature must adhere closely to the original texts, such</nobr></div> <div style="position:absolute;top:5675;left:108"><nobr>hallucinations clearly pose significant risks.</nobr></div> <div style="position:absolute;top:5700;left:108"><nobr>Hallucinations in LLMs are commonly attributed to the presence of false information within the</nobr></div> <div style="position:absolute;top:5716;left:108"><nobr>training data itself<a href="#23"> [65</a>] or to the outdated or temporal nature of the training data, which lacks information on recent</nobr></div> <div style="position:absolute;top:5732;left:108"><nobr>developments [66<a href="#24">].</a> However, these are not the sole contributors to hallucinations. Another factor is the LLMs’ training</nobr></div> <div style="position:absolute;top:5749;left:108"><nobr>approach of generating tokens sequentially, which can result in generating realistic but ultimately incorrect sequences of</nobr></div> <div style="position:absolute;top:5765;left:108"><nobr>text <a href="#24">[67</a>]. Recently, [19] demonstrated that LLMs inherently produce incorrect facts due to statistical principles rather</nobr></div> <div style="position:absolute;top:5782;left:108"><nobr>than their design or data quality, with such errors often corresponding to the rate at which certain unique facts appear in</nobr></div> <div style="position:absolute;top:5798;left:108"><nobr>the training data.</nobr></div> <div style="position:absolute;top:5835;left:108"><nobr>2.3 RAG for Enhanced Factual Accuracy in SLRs</nobr></div> <div style="position:absolute;top:5866;left:108"><nobr>Retrieval-Augmented Generation (RAG) [68<a href="#24">] is</a> a framework that combines the capabilities of large language models</nobr></div> <div style="position:absolute;top:5882;left:108"><nobr>(LLMs) with external knowledge sources through a retrieval mechanism. 
Unlike traditional language models that</nobr></div> <div style="position:absolute;top:5899;left:108"><nobr>generate text based solely on their internal representations, RAG models retrieve relevant information from a</nobr></div> <div style="position:absolute;top:5915;left:108"><nobr>knowledge base (like a database or the internet) and integrate this into the generation process. The core components</nobr></div> <div style="position:absolute;top:5931;left:108"><nobr>of RAG include the retrieval mechanism, which fetches relevant documents or data, and the generative model, which</nobr></div> <div style="position:absolute;top:5948;left:108"><nobr>synthesizes the retrieved information into coherent and contextually relevant responses. The theoretical principles</nobr></div> <div style="position:absolute;top:5964;left:108"><nobr>underpinning RAG stem from the need to enhance language models with the ability to access and utilize external,</nobr></div> <div style="position:absolute;top:5981;left:108"><nobr>structured knowledge [69<a href="#24">].</a> This is in response to the limitations of traditional LLMs that rely solely on their pre-trained</nobr></div> <div style="position:absolute;top:5997;left:108"><nobr>parameters for knowledge, which can be outdated or incomplete. 
The typical architecture of RAG systems involves a</nobr></div> <div style="position:absolute;top:6042;left:455"><nobr>5</nobr></div> </span></font> <a href="#23" style="border:1px solid #0000ff;position:absolute;top:5071;left:381;width:18;height:13;display:block"></a><a href="#23" style="border:1px solid #0000ff;position:absolute;top:5241;left:789;width:18;height:14;display:block"></a><a href="#21" style="border:1px solid #0000ff;position:absolute;top:5764;left:230;width:18;height:13;display:block"></a> <div style="position:absolute;top:6115;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=6><b>Page 6</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:6228;left:108"><nobr>retriever that fetches relevant information from a database and a generator that incorporates this information into the</nobr></div> <div style="position:absolute;top:6244;left:108"><nobr>final output; the integration of these components allows the model to produce contextually enriched and factually</nobr></div> <div style="position:absolute;top:6261;left:108"><nobr>accurate text <a href="#24">[68</a>]. 
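The retriever-plus-generator loop just described can be sketched in a few lines (an illustrative toy only: the word-overlap retriever, the stand-in `generate` function, and the corpus entries are our assumptions; a real system would use dense embeddings and an actual LLM):

```python
# Minimal RAG sketch: rank corpus passages by lexical overlap with the
# query, then compose a grounded answer that cites the retrieved sources.
def retrieve(query, corpus, k=2):
    """Return the top-k corpus entries by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d["text"].lower().split())),
                    reverse=True)
    return scored[:k]

def generate(query, passages):
    """Stand-in generator: anchor the answer in retrieved text and cite ids."""
    context = " ".join(p["text"] for p in passages)
    cites = ", ".join(p["id"] for p in passages)
    return f"Based on [{cites}]: {context}"

corpus = [
    {"id": "paper1", "text": "fine-tuning improves domain accuracy"},
    {"id": "paper2", "text": "retrieval reduces hallucination in summaries"},
    {"id": "paper3", "text": "crop yields depend on rainfall"},
]
answer = generate("reduce hallucination",
                  retrieve("reduce hallucination", corpus, k=1))
```

Because the answer is assembled from retrieved passages with explicit citation ids, every claim can be traced back to a source document, which is the property that matters for SLR fidelity.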
The incorporation of RAG into the workflow of LLMs, therefore, presents a sophisticated approach</nobr></div> <div style="position:absolute;top:6277;left:108"><nobr>to augmenting the model’s knowledge base beyond its pretraining, specifically tailored to the demands of the evolving</nobr></div> <div style="position:absolute;top:6293;left:108"><nobr>nature of studies in SLRs. By allowing the model to access an external corpus of domain-specific literature, the context</nobr></div> <div style="position:absolute;top:6310;left:107"><nobr>within which the LLM operates is enriched, and a critical countermeasure is provided against the model’s</nobr></div> <div style="position:absolute;top:6326;left:108"><nobr>propensity for generating plausible yet factually incorrect information (hallucination). In the realm of</nobr></div> <div style="position:absolute;top:6342;left:108"><nobr>SLRs, where the precision of synthesized knowledge is paramount, RAG’s ability to draw upon relevant information</nobr></div> <div style="position:absolute;top:6359;left:108"><nobr>from a targeted corpus helps ensure that the generative outputs of LLMs are anchored in verifiable data.</nobr></div> <div style="position:absolute;top:6396;left:108"><nobr>2.4 Advancing LLMs for Specialized Domains: From Pretraining to Fine-Tuning</nobr></div> <div style="position:absolute;top:6427;left:108"><nobr>LLMs initially undergo pretraining on extensive, diverse datasets, acquiring a foundational grasp of</nobr></div> <div style="position:absolute;top:6443;left:108"><nobr>both language and knowledge. 
This stage equips LLMs with a generalist understanding; however, the leap from generic</nobr></div> <div style="position:absolute;top:6459;left:108"><nobr>text generation to specialized domain proficiency requires additional training in the form of fine-tuning, a</nobr></div> <div style="position:absolute;top:6476;left:108"><nobr>process where LLMs are further trained on relevant datasets to align their outputs with new requirements [<a href="#24">70]</a>.</nobr></div> <div style="position:absolute;top:6511;left:108"><nobr>2.4.1 PEFT: A Paradigm Shift in Finetuning</nobr></div> <div style="position:absolute;top:6540;left:108"><nobr>The evolution of fine-tuning practices, particularly with the advent of Parameter-Efficient Fine-Tuning (PEFT) techniques</nobr></div> <div style="position:absolute;top:6556;left:108"><nobr>[<a href="#24">71]</a>, marks a significant shift towards more sustainable and effective model optimization. PEFT focuses on updating a</nobr></div> <div style="position:absolute;top:6573;left:108"><nobr>selective subset of the model’s parameters. This approach contrasts with conventional fine-tuning, where a substantial</nobr></div> <div style="position:absolute;top:6589;left:108"><nobr>portion of the parameters is modified, leading to high computational costs. By concentrating updates on strategically</nobr></div> <div style="position:absolute;top:6605;left:108"><nobr>chosen parameters, PEFT techniques enhance the model’s performance on domain-specific tasks without the need for</nobr></div> <div style="position:absolute;top:6622;left:108"><nobr>extensive computational resources, thus making fine-tuning feasible even for the most expansive LLMs [<a href="#24">72,</a> 73]. 
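The scale of PEFT's savings can be made concrete with a small worked example (an illustrative calculation of ours, not from the paper): LoRA replaces the update to a full d × d weight matrix W with two trainable low-rank factors B (d × r) and A (r × d), so the adapted weight is W + (α/r)·BA and only 2·d·r entries are trained.

```python
# Illustrative LoRA parameter arithmetic for a single square weight matrix.
def full_finetune_params(d):
    """Trainable entries when updating the full d x d matrix."""
    return d * d

def lora_params(d, r):
    """Trainable entries for LoRA factors B (d x r) and A (r x d)."""
    return 2 * d * r

d, r = 4096, 8                      # a typical hidden size and a small rank
full = full_finetune_params(d)      # 16,777,216 entries for full fine-tuning
lora = lora_params(d, r)            # 65,536 entries for the LoRA update
reduction = full / lora             # 256x fewer trainable parameters
```

At rank 8 the per-layer trainable footprint drops by a factor of 256, which is why LoRA-style adapters make fine-tuning tractable even for very large models; the hidden size and rank here are assumed values for illustration.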
The</nobr></div> <div style="position:absolute;top:6638;left:108"><nobr>integration of PEFT techniques in the fine-tuning process is instrumental in tailoring LLMs to the narrow and specific</nobr></div> <div style="position:absolute;top:6654;left:108"><nobr>knowledge requirements fo chosen domains such as those encountered in (SLRs), which would not be captured by the</nobr></div> <div style="position:absolute;top:6671;left:108"><nobr>pretrained process of LLMs.</nobr></div> <div style="position:absolute;top:6695;left:107"><nobr>Among the myriad of PEFT techniques, LoRA [74]<a href="#24"> </a>and more recently NEFTune [75], stand out for their contributions</nobr></div> <div style="position:absolute;top:6712;left:108"><nobr>to enhancing LLMs’ capabilities. LoRA (Low-Rank Adaptation), refines the model’s efficiency by optimizing a small,</nobr></div> <div style="position:absolute;top:6728;left:108"><nobr>critical subset of parameters, thus minimizing the computational overhead while maintaining, or even enhancing,</nobr></div> <div style="position:absolute;top:6744;left:108"><nobr>the model’s performance in domain-specific applications <a href="#24">[73</a>]. NEFTune on the other hand introduces an innovative</nobr></div> <div style="position:absolute;top:6761;left:108"><nobr>approach by incorporating random noise into the embedding vectors during fine-tuning, thereby improving the model’s</nobr></div> <div style="position:absolute;top:6777;left:108"><nobr>ability to follow instructions and engage in meaningful conversations by enhancing its generalization capacity [<a href="#24">75]</a>.</nobr></div> <div style="position:absolute;top:6815;left:108"><nobr>2.5 Literature summary and gaps</nobr></div> <div style="position:absolute;top:6845;left:108"><nobr>The literature review underscores the potential and gaps in the intersection of SLRs and LLMs. 
While advancements in</nobr></div> <div style="position:absolute;top:6862;left:108"><nobr>text mining and machine learning herald efficiency in literature retrieval and screening, the synthesis phase remains a</nobr></div> <div style="position:absolute;top:6878;left:108"><nobr>challenge that has not been sufficiently addressed, one made difficult by the phenomenon of LLM hallucinations and</nobr></div> <div style="position:absolute;top:6894;left:108"><nobr>the inability to audit the source information from the LLMs’ responses. The review reveals the potential of fine-tuning</nobr></div> <div style="position:absolute;top:6911;left:108"><nobr>LLMs with SLR-specific datasets to tailor these models for more accurate knowledge synthesis. However, work is</nobr></div> <div style="position:absolute;top:6927;left:108"><nobr>required to ensure that both fidelity of synthesized knowledge and auditing of LLM responses can be achieved. Against</nobr></div> <div style="position:absolute;top:6943;left:108"><nobr>this backdrop, this paper proposes an SLR-automation Framework to bridge this gap by leveraging domain-specific</nobr></div> <div style="position:absolute;top:6960;left:108"><nobr>fine-tuning of LLMs via the most recent advances in PEFT, on selected SLR papers and targeting the knowledge</nobr></div> <div style="position:absolute;top:6976;left:108"><nobr>synthesis stage of SLRs, together with novelties for embedding fidelity and mitigating hallucination in the LLM</nobr></div> <div style="position:absolute;top:6992;left:108"><nobr>responses. 
The viability of the proposed framework is explored and illustrated using a case study that will seek to</nobr></div> <div style="position:absolute;top:7009;left:108"><nobr>replicate a PRISMA-conforming SLR as a ground truth against which the proposed framework can be evaluated.</nobr></div> <div style="position:absolute;top:7033;left:108"><nobr>Based on the research gaps in the literature, this study attempts to answer the following research questions:</nobr></div> <div style="position:absolute;top:7066;left:149"><nobr>• (RQ1) Can we leverage LLMs in combination with finetuning and/or RAG in order to facilitate the synthesis</nobr></div> <div style="position:absolute;top:7082;left:162"><nobr>phase of an SLR process?</nobr></div> <div style="position:absolute;top:7106;left:149"><nobr>• (RQ2) How can the extraction of finetuning datasets from an SLR corpus be effectively automated?</nobr></div> <div style="position:absolute;top:7129;left:149"><nobr>• (RQ3) Can finetuning of LLMs be effectively conducted on relatively small datasets representing a very narrow</nobr></div> <div style="position:absolute;top:7145;left:162"><nobr>domain?</nobr></div> <div style="position:absolute;top:7169;left:149"><nobr>• (RQ4) Can we achieve high fidelity of LLM responses and ensure that we can audit and verify the source</nobr></div> <div style="position:absolute;top:7185;left:162"><nobr>information of LLM responses, and what evaluation metrics are suitable?</nobr></div> <div style="position:absolute;top:7230;left:455"><nobr>6</nobr></div> </span></font> <a href="#24" style="border:1px solid #0000ff;position:absolute;top:6620;left:757;width:18;height:13;display:block"></a><a href="#24" style="border:1px solid #0000ff;position:absolute;top:6694;left:594;width:18;height:13;display:block"></a> <div style="position:absolute;top:7303;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=7><b>Page 
7</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:7416;left:149"><nobr>• (RQ5) Are the proposed SLR-automation approaches viable for replicating a previously published SLR study?</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:7456;left:108"><nobr>3 Proposed SLR-automation Framework</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:7492;left:108"><nobr>Our proposed framework for automating SLRs is a sequential process designed to extract a dataset that encapsulates</nobr></div> <div style="position:absolute;top:7509;left:108"><nobr>necessary information from SLR-selected papers for the purpose of finetuning LLMs in order to facilitate the knowledge</nobr></div> <div style="position:absolute;top:7525;left:108"><nobr>synthesis phase of the SLR process. 
Figure <a href="#7">2 </a>outlines the proposed four-step process.</nobr></div> <div style="position:absolute;top:7883;left:147"><nobr>Figure 2: Overview of the proposed SLR-automation Framework for evidence and knowledge synthesis.</nobr></div> <div style="position:absolute;top:7930;left:108"><nobr>3.1 Step 1 - Paper Selection and Research Question Definition</nobr></div> <div style="position:absolute;top:7960;left:108"><nobr>The framework initiates with a dual focus on SLR paper selection and research question articulation, which is covered</nobr></div> <div style="position:absolute;top:7976;left:108"><nobr>by the planning, selection, and extraction stages of the SLR process (Figure <a href="#3">1 </a>Steps 1 to 3). Once the target papers for</nobr></div> <div style="position:absolute;top:7993;left:108"><nobr>an SLR are identified and the overarching and supporting research questions for the study have been defined, both</nobr></div> <div style="position:absolute;top:8009;left:108"><nobr>then serve as inputs to Step 2 and direct the automation process of data extraction and processing.</nobr></div> <div style="position:absolute;top:8045;left:108"><nobr>3.2 Step 2a - Automated Q&A Data Extraction and Synthesis</nobr></div> <div style="position:absolute;top:8075;left:108"><nobr>In this step, we developed a program <font style="font-size:8px"><a href="#7">11 </a></font>that leverages the downstream summarization and question-answering</nobr></div> <div style="position:absolute;top:8092;left:108"><nobr>capabilities of an advanced LLM to extract a dataset from all the papers. This program systematically</nobr></div> <div style="position:absolute;top:8108;left:108"><nobr>processes each selected paper, using the chosen LLM’s summarization capabilities to extract the core content of</nobr></div> <div style="position:absolute;top:8124;left:108"><nobr>primary paper sections and subsections into concise summaries alongside the papers’ meta-data. 
From these distilled</nobr></div> <div style="position:absolute;top:8141;left:108"><nobr>summaries, the program then uses an LLM to automatically generate a series of domain-relevant questions that are</nobr></div> <div style="position:absolute;top:8157;left:108"><nobr>common denominators and can be answered by all studies. This set of questions is then used to interrogate each paper</nobr></div> <div style="position:absolute;top:8173;left:108"><nobr>and extract answers via a chosen LLM, and the combination of the two forms question and answer (Q&A) pairs that</nobr></div> <div style="position:absolute;top:8190;left:108"><nobr>represent our finetuning dataset. Simultaneously, we enable the customization of domain-specific, researcher-defined</nobr></div> <div style="position:absolute;top:8206;left:108"><nobr>questions informed and motivated by the objectives and aims of an SLR being undertaken. These questions are tailored</nobr></div> <div style="position:absolute;top:8223;left:108"><nobr>to the unique contours of the study’s domain and are formulated to draw out information that is particularly relevant</nobr></div> <div style="position:absolute;top:8239;left:108"><nobr>for the SLR’s purposes. This dual strategy ensures that while our approach remains systematically consistent across</nobr></div> <div style="position:absolute;top:8255;left:108"><nobr>the majority of academic disciplines, it is also flexible enough to accommodate the specialized requirements of the</nobr></div> <div style="position:absolute;top:8272;left:108"><nobr>academic field under investigation.</nobr></div> <div style="position:absolute;top:8296;left:107"><nobr>With both sets of questions, we are then able to automatically generate paper-level Q&A pairs that encapsulate the</nobr></div> <div style="position:absolute;top:8313;left:108"><nobr>insights of individual studies at different levels of detail and granularity. 
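The summarize-then-question-then-answer flow described above can be sketched as follows (hypothetical helper names throughout: `ask_llm` is a stub standing in for a real LLM API call, which is what the released slr-helper tool would make):

```python
# Sketch of Step 2a: summarize each paper, derive questions from the
# distilled summaries, merge in researcher-defined questions, then build
# per-paper Q&A pairs that form the fine-tuning dataset.
def ask_llm(prompt):
    """Stand-in for an LLM call; a real pipeline queries an API here."""
    return f"ANSWER({prompt})"

def build_qa_dataset(papers, researcher_questions):
    # 1. Distill each paper into a summary.
    summaries = {p["id"]: ask_llm(f"Summarize: {p['text']}") for p in papers}
    # 2. Generate common-denominator questions answerable by all studies
    #    (one here for brevity), plus the researcher-defined questions.
    common_qs = [ask_llm(f"Generate a question answerable by all: {s}")
                 for s in summaries.values()][:1]
    questions = common_qs + researcher_questions
    # 3. Interrogate every paper with every question to form Q&A pairs.
    dataset = []
    for p in papers:
        for q in questions:
            dataset.append({"paper": p["id"], "question": q,
                            "answer": ask_llm(f"{q} -- from {p['text']}")})
    return dataset

papers = [{"id": "p1", "text": "study A"}, {"id": "p2", "text": "study B"}]
qa = build_qa_dataset(papers, ["What method was used?"])
```

The resulting list of paper-tagged Q&A dicts mirrors the paper-level pairs described in the text; SLR-level pairs would be built analogously from the collated summaries.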
To achieve a synthesis that spans the corpus</nobr></div> <div style="position:absolute;top:8329;left:108"><nobr>of all studies under investigation, we further collate key summaries from across all papers and from these, generate</nobr></div> <div style="position:absolute;top:8345;left:108"><nobr>additional SLR-level Q&A pairs. These pairs are designed to provide responses that are relevant across the entire body</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:6px;font-family:Times"> <div style="position:absolute;top:8371;left:122"><nobr>11<font style="font-size:11px">The tool is publicly available here: https://github.com/peterjhwang/slr-helper</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:8418;left:455"><nobr>7</nobr></div> </span></font> <div style="position:absolute;top:8491;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=8><b>Page 8</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:8546;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:8547;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:8604;left:108"><nobr>of selected works, effectively offering a synthesized narrative that captures the collective information of the SLR papers.</nobr></div> <div style="position:absolute;top:8620;left:108"><nobr>Through this process, we ensure that the resulting Q&A dataset is not only rich in detail and precision at the level of</nobr></div> <div 
style="position:absolute;top:8637;left:108"><nobr>individual papers but also possesses the breadth and depth required to reflect the evidence synthesis of a systematic</nobr></div> <div style="position:absolute;top:8653;left:108"><nobr>literature review.</nobr></div> <div style="position:absolute;top:8694;left:108"><nobr>3.3 Step 2b - Token Insertion as New Knowledge Markers</nobr></div> <div style="position:absolute;top:8726;left:108"><nobr>The incorporation of distinct knowledge markers, referred to as unique tokens, into the fine-tuning process of LLMs is a</nobr></div> <div style="position:absolute;top:8743;left:108"><nobr>theoretical construct designed to create explicit linkages between the model’s augmented neural network parameters</nobr></div> <div style="position:absolute;top:8759;left:108"><nobr>and the new knowledge encapsulated from the SLR papers. The motivation behind this approach is to achieve a higher</nobr></div> <div style="position:absolute;top:8776;left:108"><nobr>degree of fidelity and precision in the model’s output during inference. When an LLM is fine-tuned with additional data,</nobr></div> <div style="position:absolute;top:8792;left:108"><nobr>it assimilates this new information by adjusting a small percentage of its existing parameters while adding a larger set</nobr></div> <div style="position:absolute;top:8808;left:108"><nobr>of new parameters. Without a mechanism to distinguish between pre-existing knowledge acquired from pretraining and</nobr></div> <div style="position:absolute;top:8825;left:108"><nobr>newly acquired information, the model risks providing responses based on its pretraining rather than the domain-specific</nobr></div> <div style="position:absolute;top:8841;left:108"><nobr>insights gained from the subsequent fine-tuning step. 
Since it is imperative that, during inference, the model does</nobr></div> <div style="position:absolute;top:8857;left:108"><nobr>not access facts from the knowledge base developed during pretraining and instead responds only with facts acquired</nobr></div> <div style="position:absolute;top:8874;left:108"><nobr>during finetuning, an explicit mechanism is needed to ensure this. Without such a mechanism, the results of the</nobr></div> <div style="position:absolute;top:8890;left:108"><nobr>SLR risk being contaminated with knowledge that is outside the scope of the selected papers, thus invalidating the</nobr></div> <div style="position:absolute;top:8906;left:108"><nobr>research. In essence, the aim of the proposed approach is to insulate the LLM’s responses from its pretrained knowledge</nobr></div> <div style="position:absolute;top:8923;left:108"><nobr>but to nonetheless leverage the natural language capabilities it gained during pretraining.</nobr></div> <div style="position:absolute;top:8947;left:108"><nobr>Therefore, these tokens or markers serve as signposts within the neural network, directing the model to access and</nobr></div> <div style="position:absolute;top:8964;left:108"><nobr>utilize the parameters associated with the SLR data and facts when generating responses. This also enables the model</nobr></div> <div style="position:absolute;top:8980;left:108"><nobr>to explicitly state the source information for its responses so that the provenance of information can be audited. 
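Concretely, the marker insertion can be sketched as below. The corpus-wide token (`2023SLR`) and paper-level token (a BibTeX-style key) mirror the dataset format illustrated in Section 4.2.1; the exact wrapper phrasing is an illustrative assumption:

```python
# Sketch of Step 2b: embedding unique knowledge markers into a Q&A pair.
# The corpus token and paper token follow the dataset format shown in
# Section 4.2.1; the surrounding phrasing is illustrative.

def insert_knowledge_markers(question, answer, corpus_token, paper_token):
    """Prefix both fields with the markers and append a source citation."""
    return {
        "instruction": (f"According to the {corpus_token} dataset, "
                        f"in the {paper_token} paper, {question}"),
        "output": (f"In the data used for the {corpus_token}, {answer} "
                   f"Source: {paper_token}"),
    }
```

Every data point thus carries both markers in the instruction and a trailing `Source:` citation in the output, which is what later allows provenance to be audited.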
This</nobr></div> <div style="position:absolute;top:8997;left:108"><nobr>approach provides a method for researchers to verify that the output is indeed based on the recent fine-tuning, enabling a</nobr></div> <div style="position:absolute;top:9013;left:108"><nobr>level of interpretability and trust in the model’s responses that is critical for academic rigor and reliability, and it represents</nobr></div> <div style="position:absolute;top:9029;left:108"><nobr>a strategic advancement in using LLMs for SLR.</nobr></div> <div style="position:absolute;top:9071;left:108"><nobr>3.4 Step 2c - Q&A Permutation</nobr></div> <div style="position:absolute;top:9103;left:107"><nobr>As part of Step 2, to ensure that the LLMs can handle a variety of linguistic expressions during inference time, we</nobr></div> <div style="position:absolute;top:9119;left:108"><nobr>introduce permutations in the Q&A pairs. For each original question, we create several semantically equivalent</nobr></div> <div style="position:absolute;top:9136;left:108"><nobr>variations. Since accurate recall of facts is the primary objective of the system rather than style, linguistic quality, and</nobr></div> <div style="position:absolute;top:9152;left:108"><nobr>variability of the text, we retain the same answer for each permutation of the corresponding questions. 
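A minimal sketch of the permutation step, where `paraphrase` stands in for an LLM call that returns n rewordings of a question (an assumed interface, not the authors' exact one):

```python
# Sketch of Step 2c: pairing several semantically equivalent question
# variants with one unchanged answer. `paraphrase(question, n)` stands in
# for an LLM call returning n rewordings; the interface is an assumption.

def permute_qa(question, answer, paraphrase, n_variants):
    """Return the original Q&A pair plus n_variants reworded copies."""
    variants = [question] + paraphrase(question, n_variants)
    return [{"question": q, "answer": answer} for q in variants]
```

Because the answer is held fixed, the augmented set trains the model to map many surface forms of a question onto one factual response.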
This step is</nobr></div> <div style="position:absolute;top:9168;left:108"><nobr>critical for training the LLMs to recognize and correctly respond to different phrasings of the same question which</nobr></div> <div style="position:absolute;top:9185;left:108"><nobr>different users are likely to generate.</nobr></div> <div style="position:absolute;top:9226;left:108"><nobr>3.5 Step 3 - Finetuning and RAG</nobr></div> <div style="position:absolute;top:9258;left:108"><nobr>In the third step of our SLR-automation framework, we propose specific finetuning methods for LLMs, complemented</nobr></div> <div style="position:absolute;top:9275;left:108"><nobr>by the integration of RAG processes. To enhance the finetuning efficiency, we propose employing PEFT techniques</nobr></div> <div style="position:absolute;top:9291;left:108"><nobr>[<a href="#24">71]</a>, which modify only a subset of the model’s parameters. This is crucial, as fine-tuning all parameters of large</nobr></div> <div style="position:absolute;top:9307;left:108"><nobr>LLMs is computationally prohibitive and may not yield proportionate improvements in performance. PEFT techniques</nobr></div> <div style="position:absolute;top:9324;left:108"><nobr>offer a balance, concentrating the model’s learning capacity on those parameters that most significantly influence the</nobr></div> <div style="position:absolute;top:9340;left:108"><nobr>domain-specific tasks, thereby optimizing both resource expenditure and model proficiency. RAG, on the other hand,</nobr></div> <div style="position:absolute;top:9356;left:108"><nobr>serves as an augmentative mechanism during inference. It dynamically retrieves and incorporates information from the</nobr></div> <div style="position:absolute;top:9373;left:108"><nobr>SLR corpus into the LLM’s generative process, potentially providing an additional layer of contextual support. 
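The optional RAG step can be sketched as follows; for a self-contained illustration, relevance is scored by plain word overlap, whereas a practical implementation would use embedding similarity:

```python
# Minimal sketch of the optional RAG step: retrieve the top-k SLR chunks
# most relevant to a query and prepend them to the prompt. Word overlap is
# used only to keep the illustration self-contained; real systems would
# rank by embedding similarity.

def retrieve(query, chunks, k=2):
    """Rank corpus chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def augment_prompt(query, chunks, k=2):
    """Ground the LLM prompt in retrieved SLR passages."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```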
This</nobr></div> <div style="position:absolute;top:9389;left:108"><nobr>optional step may ensure that the model’s output is not solely reliant on internalized patterns but is also substantiated</nobr></div> <div style="position:absolute;top:9405;left:108"><nobr>by direct data references, thereby potentially enriching the depth and factual accuracy of the synthesized content and</nobr></div> <div style="position:absolute;top:9422;left:108"><nobr>responses.</nobr></div> <div style="position:absolute;top:9463;left:108"><nobr>3.6 Step 4 - Verification</nobr></div> <div style="position:absolute;top:9495;left:108"><nobr>The final phase of the proposed framework verifies that the finetuned model has the ability to accurately generate</nobr></div> <div style="position:absolute;top:9512;left:108"><nobr>responses with respect to the selected SLR papers. The test dataset is extracted from the main Q&A finetuning dataset</nobr></div> <div style="position:absolute;top:9528;left:108"><nobr>and withheld from the finetuning process. 
An automated process is subsequently initiated after finetuning is completed,</nobr></div> <div style="position:absolute;top:9545;left:107"><nobr>which tests the LLM’s ability to answer the selected questions correctly using metrics that are specifically designed for</nobr></div> <div style="position:absolute;top:9561;left:108"><nobr>factual validation.</nobr></div> <div style="position:absolute;top:9606;left:455"><nobr>8</nobr></div> </span></font> <div style="position:absolute;top:9679;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=9><b>Page 9</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:9734;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:9735;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:9790;left:108"><nobr>4 Methodology</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:9829;left:108"><nobr>The methodology covers details of the chosen Gold Standard SLR as a use case for replication and validation of</nobr></div> <div style="position:absolute;top:9846;left:108"><nobr>the proposed framework. It encompasses details on the dataset extraction and preparation for LLM fine-tuning, the</nobr></div> <div style="position:absolute;top:9862;left:108"><nobr>experimental design outlining the setup of LLM technologies, selection of LLM types, detailed RAG implementation,</nobr></div> <div style="position:absolute;top:9878;left:108"><nobr>finetuning hyperparameters, and hardware setup. 
The evaluation measures are also discussed, covering quantitative</nobr></div> <div style="position:absolute;top:9895;left:108"><nobr>metrics and qualitative analyses to assess information recall.</nobr></div> <div style="position:absolute;top:9936;left:108"><nobr>4.1 Use Case: Replication of an Existing SLR</nobr></div> <div style="position:absolute;top:9968;left:107"><nobr>We selected a peer-reviewed and published PRISMA-SLR within the domain of learning analytics as the Gold Standard,</nobr></div> <div style="position:absolute;top:9984;left:107"><nobr>with the goal of investigating whether the proposed SLR-automation framework has the capability to generate responses</nobr></div> <div style="position:absolute;top:10001;left:108"><nobr>from which a manually conducted SLR can be replicated. The target SLR paper is titled “Learning Analytics Dashboard:</nobr></div> <div style="position:absolute;top:10017;left:107"><nobr>A Tool for Providing Actionable Insights to Learners” <a href="#24">[76</a>]. This SLR is contemporary, being aligned with recent</nobr></div> <div style="position:absolute;top:10033;left:108"><nobr>advancements in learning analytics dashboards, and was chosen for its adherence to the PRISMA guidelines as well as</nobr></div> <div style="position:absolute;top:10050;left:108"><nobr>the authors’ familiarity with the work. The SLR itself focused on aspects of advancements in educational technology,</nobr></div> <div style="position:absolute;top:10066;left:108"><nobr>delving into the impacts of learning analytics dashboards (LADs) on university student outcomes, scrutinizing the</nobr></div> <div style="position:absolute;top:10082;left:108"><nobr>technological underpinnings, the analytics methodologies most commonly employed, and the demographics of the</nobr></div> <div style="position:absolute;top:10099;left:108"><nobr>study cohorts used in trials. 
Furthermore, it examined the existing gaps and posited future trajectories for research in</nobr></div> <div style="position:absolute;top:10115;left:108"><nobr>this field. This SLR encompassed 17 papers. Its selection was strategic not only because it aligned with the expertise of</nobr></div> <div style="position:absolute;top:10132;left:108"><nobr>this study’s authors, which assisted in guaranteeing the veracity of LLM responses, but also because it offered insight into the</nobr></div> <div style="position:absolute;top:10148;left:108"><nobr>viability and effectiveness of finetuning LLMs on relatively small datasets.</nobr></div> <div style="position:absolute;top:10189;left:108"><nobr>4.2 Automated Dataset Extraction and Preparation</nobr></div> <div style="position:absolute;top:10221;left:108"><nobr>Step 2 of the proposed framework in Figure 2<a href="#7"> </a>was fully automated in order to generate a finetuning dataset. A software</nobr></div> <div style="position:absolute;top:10237;left:108"><nobr>tool was built using Python to systematically parse the 17 PDF files representing the SLR papers and generate Q&A</nobr></div> <div style="position:absolute;top:10254;left:108"><nobr>pairs for the finetuning dataset. To begin with, this tool extracted the metadata from each paper via OpenAI’s GPT-4</nobr></div> <div style="position:absolute;top:10270;left:107"><nobr>API calls and, before data generation commenced, created a high-level summary of each paper. The summaries of</nobr></div> <div style="position:absolute;top:10287;left:108"><nobr>each paper were then collated, and GPT-4 was tasked with generating a set of questions that were relevant across all</nobr></div> <div style="position:absolute;top:10303;left:108"><nobr>the papers. This produced a vehicle for automatically extracting domain-specific information that was common across</nobr></div> <div style="position:absolute;top:10319;left:108"><nobr>all the studies. 
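The cross-paper question-generation step might be sketched as follows; `ask_llm` is a hypothetical wrapper around the GPT-4 API and the prompt wording is an illustrative assumption:

```python
# Sketch of the cross-paper question-generation step: the collated paper
# summaries are packed into a single prompt asking the LLM for questions
# that every study can answer. `ask_llm` and the prompt wording are
# illustrative assumptions, not the authors' exact implementation.

def generate_common_questions(summaries, ask_llm, n_questions=10):
    """Ask the LLM for questions answerable by all collated summaries."""
    collated = "\n\n".join(f"[{pid}] {s}" for pid, s in summaries.items())
    prompt = (f"Given these paper summaries:\n{collated}\n\n"
              f"Write {n_questions} questions that every paper can answer, "
              f"one per line.")
    return [q.strip() for q in ask_llm(prompt).splitlines() if q.strip()]
```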
These questions were then augmented with a further set of researcher-defined questions that reflected</nobr></div> <div style="position:absolute;top:10336;left:108"><nobr>the narrower aims of the SLR study and were closely aligned with the objectives and overarching questions of the SLR</nobr></div> <div style="position:absolute;top:10352;left:108"><nobr>being conducted. Once this broad set of questions was collected, GPT-4 was used via API calls to generate answers to</nobr></div> <div style="position:absolute;top:10368;left:108"><nobr>each one from every study. These initial Q&A pairs we refer to as paper-summary-level data points. These data points</nobr></div> <div style="position:absolute;top:10385;left:108"><nobr>captured high-level paper summaries that were relevant to both the research domain and the aims of the SLR study.</nobr></div> <div style="position:absolute;top:10409;left:108"><nobr>The next phase involved the generation of another set of Q&A data points based on data from individual studies at</nobr></div> <div style="position:absolute;top:10426;left:108"><nobr>different levels of granularity. In this phase, our processing divided each paper into chunks<a href="#9"></a><font style="font-size:8px"><a href="#9">12</a></font><a href="#9"></a>, and asked GPT-4 to</nobr></div> <div style="position:absolute;top:10442;left:108"><nobr>directly generate Q&A pairs based on the content of each chunk. This method yielded another dimension of information</nobr></div> <div style="position:absolute;top:10458;left:108"><nobr>from each paper that reflected more distinct aspects of each study, such as their expanded and more detailed</nobr></div> <div style="position:absolute;top:10475;left:108"><nobr>contributions, idiosyncratic methodological details, key findings, and future research directions. 
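The chunking at the two granularities might be sketched as follows; splitting on blank lines is an assumption made to keep the example self-contained, since the actual tool parses PDF content:

```python
# Sketch of the two-granularity chunking used for paper-level Q&A pairs:
# coarse chunks (roughly sections) and fine chunks (roughly paragraphs).
# Splitting on blank lines is an assumption; the authors' tool parses PDFs.

def paragraph_chunks(text):
    """Fine-grained chunks: one per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def section_chunks(text, paras_per_chunk=3):
    """Coarse chunks: groups of consecutive paragraphs approximating sections."""
    paras = paragraph_chunks(text)
    return ["\n\n".join(paras[i:i + paras_per_chunk])
            for i in range(0, len(paras), paras_per_chunk)]
```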
This process was then</nobr></div> <div style="position:absolute;top:10491;left:108"><nobr>repeated with even smaller chunks of text which roughly represented individual paper paragraphs. This more detailed</nobr></div> <div style="position:absolute;top:10508;left:108"><nobr>approach allowed for the extraction of even more granular information from each study in the form of Q&A pairs,</nobr></div> <div style="position:absolute;top:10524;left:108"><nobr>revealing yet finer details of the methodologies, analyses, and discussions presented in the studies. By dissecting the</nobr></div> <div style="position:absolute;top:10540;left:108"><nobr>papers at a paragraph level, this method augmented the depth of the overall finetuning dataset with a more detailed set</nobr></div> <div style="position:absolute;top:10557;left:108"><nobr>of Q&A pairs. This layer we collectively refer to as paper-level Q&A pairs, which offer a more granular insight into each</nobr></div> <div style="position:absolute;top:10573;left:108"><nobr>study compared to paper-summary-level data points.</nobr></div> <div style="position:absolute;top:10598;left:107"><nobr>While the above strategy generated data points that would enable researchers to “talk to” the content of each paper, the</nobr></div> <div style="position:absolute;top:10614;left:108"><nobr>subsequent strategy aimed to allow researchers to “talk across” all the studies simultaneously. These Q&A pairs we</nobr></div> <div style="position:absolute;top:10630;left:108"><nobr>refer to as SLR-level data points, which served the aim of facilitating the answering of questions that relate to</nobr></div> <div style="position:absolute;top:10647;left:108"><nobr>the synthesis of knowledge across all the studies under investigation. 
From the implementation perspective, this second</nobr></div> <div style="position:absolute;top:10663;left:108"><nobr>layer of data extraction involved collating answers from each paper that were generated from questions used for creating</nobr></div> <div style="position:absolute;top:10679;left:108"><nobr>paper-summary-level Q&A pairs. Then, Q&A pairs were generated via GPT-4 based on all the collated answers.</nobr></div> <div style="position:absolute;top:10696;left:108"><nobr>By combining paper and SLR-level Q&A pairs, our finetuning dataset not only equipped LLMs with</nobr></div> <div style="position:absolute;top:10712;left:108"><nobr>in-depth facts from singular studies, but also aimed to support prompts and queries requiring</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:6px;font-family:Times"> <div style="position:absolute;top:10747;left:122"><nobr>12<font style="font-size:11px">The chunks usually comprised key sections of each paper.</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:10794;left:455"><nobr>9</nobr></div> </span></font> <div style="position:absolute;top:10867;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=10><b>Page 10</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:10922;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:10923;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:10980;left:108"><nobr>a comprehensive 
synthesis of knowledge and evidence across the entire SLR corpus, achieving both the breadth and</nobr></div> <div style="position:absolute;top:10996;left:108"><nobr>depth needed to support a systematic study of a corpus of literature and to answer specific research questions.</nobr></div> <div style="position:absolute;top:11021;left:108"><nobr>The initial training dataset following the data extraction and Q&A pair generation comprised 1503 samples; of</nobr></div> <div style="position:absolute;top:11037;left:108"><nobr>these, approximately 28% were paper-summary-level, 49% were paper-level extracted from larger chunks, 31% were</nobr></div> <div style="position:absolute;top:11053;left:108"><nobr>paper-level extracted from paragraphs, and 2% from the SLR-level. To augment the robustness of the finetuned LLMs</nobr></div> <div style="position:absolute;top:11070;left:108"><nobr>during inference, we implemented a novel refinement step: for each original question, we generated ten alternative</nobr></div> <div style="position:absolute;top:11086;left:108"><nobr>formulations. These variants were designed to be semantically diverse yet structurally aligned, ensuring that despite the</nobr></div> <div style="position:absolute;top:11103;left:108"><nobr>variability in phrasing, the essence of the questions remained intact, eliciting the same precise response. This resulted in</nobr></div> <div style="position:absolute;top:11119;left:108"><nobr>a final training dataset of 16423 samples. This strategic augmentation served a dual purpose. 
Firstly, it aimed</nobr></div> <div style="position:absolute;top:11135;left:108"><nobr>to significantly enhance the model’s ability to comprehend and respond accurately to a broader spectrum of question</nobr></div> <div style="position:absolute;top:11152;left:108"><nobr>formulations from different users, mirroring the diverse linguistic styles encountered in real-world academic discourse.</nobr></div> <div style="position:absolute;top:11168;left:108"><nobr>Secondly, it aimed at increasing the size of the dataset. For the test dataset, we generated three additional permutations</nobr></div> <div style="position:absolute;top:11184;left:108"><nobr>for each unique question to arrive at a final dataset of 4962 data points, where the distribution of paper-level and</nobr></div> <div style="position:absolute;top:11201;left:108"><nobr>SLR-level questions was the same as that in the training dataset. Table<a href="#10"></a> 1 demonstrates an example of a text summary</nobr></div> <div style="position:absolute;top:11217;left:108"><nobr>(answer), for which 11 different but semantically equivalent questions were generated using GPT-4 for the training</nobr></div> <div style="position:absolute;top:11233;left:108"><nobr>dataset and an additional three for the test dataset.</nobr></div> <div style="position:absolute;top:11269;left:134"><nobr>Answer</nobr></div> <div style="position:absolute;top:11269;left:292"><nobr>Questions</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:11312;left:134"><nobr>The paper’s findings inform</nobr></div> <div style="position:absolute;top:11327;left:134"><nobr>future directions for research</nobr></div> <div style="position:absolute;top:11342;left:134"><nobr>by highlighting the</nobr></div> <div style="position:absolute;top:11357;left:134"><nobr>importance of personalizing</nobr></div> <div style="position:absolute;top:11372;left:134"><nobr>LADs to course 
requirements</nobr></div> <div style="position:absolute;top:11387;left:134"><nobr>and the potential of</nobr></div> <div style="position:absolute;top:11402;left:134"><nobr>student-centered dashboards</nobr></div> <div style="position:absolute;top:11417;left:134"><nobr>to enhance engagement and</nobr></div> <div style="position:absolute;top:11432;left:134"><nobr>performance. It suggests</nobr></div> <div style="position:absolute;top:11447;left:134"><nobr>further exploration of LADs’</nobr></div> <div style="position:absolute;top:11462;left:134"><nobr>role in supporting</nobr></div> <div style="position:absolute;top:11477;left:134"><nobr>self-regulated learning and</nobr></div> <div style="position:absolute;top:11492;left:134"><nobr>the challenges of</nobr></div> <div style="position:absolute;top:11507;left:134"><nobr>personalization.</nobr></div> <div style="position:absolute;top:11295;left:292"><nobr>Training Questions</nobr></div> <div style="position:absolute;top:11312;left:292"><nobr>1. How do the results of the study shape future research and development paths in learning analytics and</nobr></div> <div style="position:absolute;top:11327;left:292"><nobr>educational technology?</nobr></div> <div style="position:absolute;top:11343;left:292"><nobr>2. What implications do the paper’s outcomes have for the future of research and development in</nobr></div> <div style="position:absolute;top:11358;left:292"><nobr>learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11374;left:292"><nobr>3. How can the findings from the paper influence the forthcoming trends in research and development</nobr></div> <div style="position:absolute;top:11389;left:292"><nobr>for learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11406;left:292"><nobr>4. 
What role do the paper’s findings play in setting the course for future research and development in</nobr></div> <div style="position:absolute;top:11420;left:292"><nobr>learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11437;left:292"><nobr>5. How will the paper’s discoveries impact the trajectory of research and development in learning</nobr></div> <div style="position:absolute;top:11452;left:292"><nobr>analytics and educational technology moving forward?</nobr></div> <div style="position:absolute;top:11468;left:292"><nobr>6. What effect do the paper’s conclusions have on the future strategies for research and development in</nobr></div> <div style="position:absolute;top:11483;left:292"><nobr>the domains of learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11499;left:292"><nobr>7. How are the paper’s findings expected to steer future research and development initiatives in learning</nobr></div> <div style="position:absolute;top:11514;left:292"><nobr>analytics and educational technology?</nobr></div> <div style="position:absolute;top:11531;left:292"><nobr>8. In what ways might the paper’s results influence upcoming research and development activities in the</nobr></div> <div style="position:absolute;top:11546;left:292"><nobr>field of learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11562;left:292"><nobr>9. In what ways do the conclusions of the paper guide the next steps for research and development</nobr></div> <div style="position:absolute;top:11577;left:292"><nobr>within the realms of learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11593;left:292"><nobr>10. 
In what manner do the insights from the paper direct future research and development efforts in the</nobr></div> <div style="position:absolute;top:11608;left:292"><nobr>field of learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11625;left:292"><nobr>11. How do the paper’s findings inform future directions for research and development in the field of</nobr></div> <div style="position:absolute;top:11640;left:292"><nobr>learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11656;left:292"><nobr>Test Questions</nobr></div> <div style="position:absolute;top:11672;left:292"><nobr>12. In what capacity do the study’s findings shape the forthcoming strategies for research and</nobr></div> <div style="position:absolute;top:11687;left:292"><nobr>development across learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11704;left:292"><nobr>13. How might the conclusions drawn from the paper steer the future of innovation and inquiry in the</nobr></div> <div style="position:absolute;top:11719;left:292"><nobr>spheres of learning analytics and educational technology?</nobr></div> <div style="position:absolute;top:11735;left:292"><nobr>14. 
What influence do the outcomes of the paper exert on the trajectory of upcoming investigations and</nobr></div> <div style="position:absolute;top:11750;left:292"><nobr>advancements in learning analytics and educational technology?</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:11768;left:108"><nobr>Table 1: Example of the dataset permutation step for questions relating to an answer from one specific paper in the</nobr></div> <div style="position:absolute;top:11784;left:108"><nobr>target SLR study.</nobr></div> <div style="position:absolute;top:11845;left:108"><nobr>4.2.1 Dataset Format and Characteristics</nobr></div> <div style="position:absolute;top:11873;left:107"><nobr>As part of the dataset processing in Step 2 of the proposed framework in Figure 2<a href="#7">, </a>a specific format for the structure of</nobr></div> <div style="position:absolute;top:11890;left:108"><nobr>the Q&A pairs was enforced, and unique tokens were inserted into each Q&A data point. 
For clarity, an illustrative</nobr></div> <div style="position:absolute;top:11906;left:108"><nobr>example of a Q&A data point from the dataset can be seen as follows:</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:11940;left:108"><nobr>{</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:11982;left:452"><nobr>10</nobr></div> </span></font> <div style="position:absolute;top:12055;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=11><b>Page 11</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:12110;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:12111;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:12171;left:133"><nobr>"instruction": "According to the 2023SLR dataset, in the aljohani2019integrated paper,</nobr></div> <div style="position:absolute;top:12186;left:235"><nobr>how does the paper conceptualize Learning Analytics Dashboards (LADs)</nobr></div> <div style="position:absolute;top:12201;left:235"><nobr>within the framework of digital learning environments?",</nobr></div> <div style="position:absolute;top:12215;left:133"><nobr>"output": "In the data used for the 2023SLR, the paper conceptualizes Learning</nobr></div> <div style="position:absolute;top:12230;left:203"><nobr>Analytics Dashboards (LADs) as a knowledge and learning management oriented</nobr></div> <div 
style="position:absolute;top:12245;left:203"><nobr>infrastructure for the end-to-end management and consumption of educational</nobr></div> <div style="position:absolute;top:12260;left:203"><nobr>data within digital learning environments. It emphasizes the integration of</nobr></div> <div style="position:absolute;top:12275;left:203"><nobr>student-centered data as a critical challenge for personalized curricula and</nobr></div> <div style="position:absolute;top:12290;left:203"><nobr>learning experiences. Source: aljohani2019integrated"</nobr></div> <div style="position:absolute;top:12305;left:108"><nobr>}</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:12335;left:108"><nobr>This data point exemplifies the dataset’s innovative use of unique tokens included in the post-processing step that</nobr></div> <div style="position:absolute;top:12352;left:108"><nobr>embeds specific knowledge markers within the fine-tuning dataset, which is crucial for enabling the model’s capacity to</nobr></div> <div style="position:absolute;top:12368;left:108"><nobr>differentiate between newly acquired information and its pre-existing knowledge base from pretraining. The example</nobr></div> <div style="position:absolute;top:12385;left:108"><nobr>token ‘2023SLR‘ serves as a corpus-wide identifier, linking this and all the data points to the broader SLR dataset,</nobr></div> <div style="position:absolute;top:12401;left:107"><nobr>while in this example, the ‘aljohani2019integrated‘ token operates as a granular paper-level identifier, tethering the Q&A</nobr></div> <div style="position:absolute;top:12417;left:108"><nobr>pair to a particular study within the corpus. 
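To make the token scheme concrete, the following is a minimal sketch of how such a Q&A data point could be assembled; the helper function and its name are illustrative, not the authors' actual pipeline code, and only the token values mirror the paper's example.

```python
import json

def make_qa_record(corpus_token: str, paper_token: str,
                   question: str, answer: str) -> dict:
    """Embed a corpus-wide token, a paper-level token, and a trailing
    source citation into one instruction/output fine-tuning pair."""
    return {
        "instruction": (
            f"According to the {corpus_token} dataset, in the "
            f"{paper_token} paper, {question}"
        ),
        "output": (
            f"In the data used for the {corpus_token}, {answer} "
            f"Source: {paper_token}"
        ),
    }

record = make_qa_record(
    "2023SLR", "aljohani2019integrated",
    "how does the paper conceptualize Learning Analytics Dashboards (LADs)?",
    "the paper conceptualizes LADs as a knowledge and learning "
    "management oriented infrastructure.",
)
print(json.dumps(record, indent=2))
```

Keeping the tokens in fixed positions (dataset name first, source citation last) makes both markers trivial to verify in every generated record.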
The inclusion of a source citation at the end of each output (‘Source:</nobr></div> <div style="position:absolute;top:12434;left:108"><nobr>aljohani2019integrated‘) is a deliberate design choice, ensuring that each response generated by the LLM during</nobr></div> <div style="position:absolute;top:12450;left:108"><nobr>inference can be sourced and traced back to its originating study and verified if needed, thus enhancing the accountability</nobr></div> <div style="position:absolute;top:12466;left:108"><nobr>of the model’s outputs.</nobr></div> <div style="position:absolute;top:12505;left:108"><nobr>4.3 LLM Selection and Finetuning Strategy</nobr></div> <div style="position:absolute;top:12535;left:108"><nobr>In the experimentation of the proposed framework, the Mistral-7B base model [77<a href="#24">] </a>was selected as the foundation for</nobr></div> <div style="position:absolute;top:12552;left:108"><nobr>fine-tuning due to its efficiency and high benchmark performance, particularly in English language tasks. This 7.3</nobr></div> <div style="position:absolute;top:12568;left:108"><nobr>billion parameter model leverages advanced techniques such as Grouped-query attention and Sliding Window Attention,</nobr></div> <div style="position:absolute;top:12584;left:108"><nobr>enabling it to outperform larger models across a variety of benchmarks, including commonsense reasoning, making</nobr></div> <div style="position:absolute;top:12601;left:108"><nobr>it an ideal candidate for fine-tuning to suit our specific SLR automation needs. Conversely, the Mistral-7B-Instruct</nobr></div> <div style="position:absolute;top:12617;left:108"><nobr>variant, already fine-tuned for instructional tasks, was employed in conjunction with the RAG approach to leverage its</nobr></div> <div style="position:absolute;top:12634;left:108"><nobr>capabilities in handling Q&A instructional content. 
The Instruct model’s demonstrable adaptability and performance</nobr></div> <div style="position:absolute;top:12650;left:108"><nobr>on tasks like MT-Bench, coupled with its ability to generalize across tasks, provided an appropriate candidate for</nobr></div> <div style="position:absolute;top:12666;left:108"><nobr>generating accurate and contextually relevant responses in our SLR framework experimentation.</nobr></div> <div style="position:absolute;top:12702;left:108"><nobr>4.3.1 Finetuning implementation</nobr></div> <div style="position:absolute;top:12731;left:108"><nobr>In the fine-tuning of the LoRA and NEFTune LLM models, we focused on calibrating a selection of the hyperparameters</nobr></div> <div style="position:absolute;top:12748;left:108"><nobr>that are presented in Table 2<a href="#11">,</a> with all others set to their default values.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:12782;left:398"><nobr>Parameter NEFTune</nobr></div> <div style="position:absolute;top:12782;left:572"><nobr>LoRA</nobr></div> <div style="position:absolute;top:12799;left:367"><nobr>learning_rate</nobr></div> <div style="position:absolute;top:12798;left:477"><nobr>5e-05</nobr></div> <div style="position:absolute;top:12798;left:572"><nobr>5e-05</nobr></div> <div style="position:absolute;top:12814;left:346"><nobr>num_train_epochs</nobr></div> <div style="position:absolute;top:12813;left:477"><nobr>150</nobr></div> <div style="position:absolute;top:12813;left:572"><nobr>150</nobr></div> <div style="position:absolute;top:12829;left:268"><nobr>per_device_train_batch_size</nobr></div> <div style="position:absolute;top:12828;left:477"><nobr>4</nobr></div> <div style="position:absolute;top:12828;left:572"><nobr>4</nobr></div> <div style="position:absolute;top:12844;left:276"><nobr>per_device_eval_batch_size</nobr></div> <div style="position:absolute;top:12843;left:477"><nobr>4</nobr></div> <div 
style="position:absolute;top:12843;left:572"><nobr>4</nobr></div> <div style="position:absolute;top:12859;left:431"><nobr>fp16</nobr></div> <div style="position:absolute;top:12858;left:477"><nobr>Enabled (true) Enabled (true)</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:12873;left:199"><nobr>Table 2: Comparison of Hyperparameters for NEFTune and LoRA Model Fine-Tuning</nobr></div> <div style="position:absolute;top:12935;left:108"><nobr>4.4 RAG implementation details</nobr></div> <div style="position:absolute;top:12965;left:108"><nobr>In our study, we implemented two RAG systems, leveraging the Weaviate vector database (version 1.23.9) and the</nobr></div> <div style="position:absolute;top:12982;left:108"><nobr>OpenAI Embedding model ‘text-embedding-3-large‘<a href="#11"></a><font style="font-size:8px"><a href="#11">13 </a></font>for pre-retrieval processes. For the post-retrieval stage, the</nobr></div> <div style="position:absolute;top:12998;left:108"><nobr>initial model used was OpenAI’s ‘gpt-3.5-turbo‘. The pre-retrieval phase involved optimizing the indexing process</nobr></div> <div style="position:absolute;top:13015;left:108"><nobr>by chunking data using LLMs, focusing on contextually relevant segments. The vector database was structured with</nobr></div> <div style="position:absolute;top:13031;left:108"><nobr>the attributes ‘text‘ and ‘source‘, where ‘text‘ was searchable and ‘source‘ was filterable. During the retrieval phase,</nobr></div> <div style="position:absolute;top:13047;left:107"><nobr>we utilized Langchain’s self-querying retriever for its simplicity<a href="#11"></a><font style="font-size:8px"><a href="#11">14</a></font><a href="#11"></a>. 
In the post-retrieval process, we employed an</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:6px;font-family:Times"> <div style="position:absolute;top:13077;left:122"><nobr>13<font style="font-size:11px">This embedding model was notable for its efficiency and ranked 7th on the HuggingFace leaderboard at the time of writing this</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:13095;left:108"><nobr>paper.</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:6px;font-family:Times"> <div style="position:absolute;top:13108;left:122"><nobr>14<font style="font-size:11px">Future implementation intentions are to upgrade to a hybrid search model for enhanced accuracy and more refined filtering</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:13126;left:108"><nobr>mechanisms.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:13170;left:451"><nobr>11</nobr></div> </span></font> <div style="position:absolute;top:13243;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=12><b>Page 12</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:13356;left:108"><nobr>aggregation prompt template 
designed to integrate the retrieved context into a coherent and concise answer, limited to</nobr></div> <div style="position:absolute;top:13372;left:108"><nobr>three sentences for any given question.</nobr></div> <div style="position:absolute;top:13409;left:108"><nobr>4.5 Experimental Design</nobr></div> <div style="position:absolute;top:13439;left:107"><nobr>We experimented with six different configurations of finetuning and RAG in order to determine which approaches</nobr></div> <div style="position:absolute;top:13455;left:108"><nobr>would be most effective for the SLR-automation process. Our experimental setup investigated the following methodologies:</nobr></div> <div style="position:absolute;top:13487;left:143"><nobr>1. Baseline: Evaluate the Mistral-7B-Instruct on its knowledge and ability to answer SLR-related test dataset</nobr></div> <div style="position:absolute;top:13504;left:162"><nobr>questions.</nobr></div> <div style="position:absolute;top:13526;left:143"><nobr>2. Fine-tuning LLMs using LoRA: Leveraging Low-Rank Adaptation for fast and efficient parameter adjustment.</nobr></div> <div style="position:absolute;top:13549;left:143"><nobr>3. Fine-tuning LLMs using NEFTune: Introducing noise into embedding vectors to investigate effects on</nobr></div> <div style="position:absolute;top:13565;left:162"><nobr>generalization improvements.</nobr></div> <div style="position:absolute;top:13587;left:143"><nobr>4. Instruct LLM + RAG with Raw Articles: Combining LLMs with Retrieval-Augmented Generation, using</nobr></div> <div style="position:absolute;top:13604;left:162"><nobr>unprocessed article text as the retrieval corpus.</nobr></div> <div style="position:absolute;top:13626;left:143"><nobr>5. 
Instruct LLM + RAG with Auto-Extracted Data: Employing RAG with a knowledge base of automatically</nobr></div> <div style="position:absolute;top:13643;left:162"><nobr>extracted data comprising the finetuning dataset for focused information retrieval.</nobr></div> <div style="position:absolute;top:13665;left:143"><nobr>6. Best Finetuned LLMs + Best RAG Solution: Integrating the top-performing fine-tuning and RAG methods to</nobr></div> <div style="position:absolute;top:13681;left:162"><nobr>optimize SLR automation.</nobr></div> <div style="position:absolute;top:13714;left:108"><nobr>Each method, also summarised in Table 3,<a href="#12"> </a>was evaluated for factually correct answers with respect to the SLR dataset.</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:13753;left:172"><nobr>Method</nobr></div> <div style="position:absolute;top:13753;left:349"><nobr>Description</nobr></div> <div style="position:absolute;top:13776;left:172"><nobr>Baseline Mistral-7B-Instruct</nobr></div> <div style="position:absolute;top:13776;left:349"><nobr>Establish the degree of pretraining and instruct-tuning knowledge within the model</nobr></div> <div style="position:absolute;top:13791;left:172"><nobr>LoRA Finetuning</nobr></div> <div style="position:absolute;top:13791;left:349"><nobr>Fine-tuning with Low-Rank Adaptation</nobr></div> <div style="position:absolute;top:13806;left:172"><nobr>NEFTune Finetuning</nobr></div> <div style="position:absolute;top:13806;left:349"><nobr>Fine-tuning with Noise-Enhanced Fine-Tuning</nobr></div> <div style="position:absolute;top:13820;left:172"><nobr>Instruct LLM + RAG (Raw)</nobr></div> <div style="position:absolute;top:13820;left:349"><nobr>Retrieval-Augmented Generation with raw articles</nobr></div> <div style="position:absolute;top:13835;left:172"><nobr>Instruct LLM + RAG (Extracted)</nobr></div> <div style="position:absolute;top:13835;left:349"><nobr>RAG 
with auto-extracted data</nobr></div> <div style="position:absolute;top:13850;left:172"><nobr>Best Combination</nobr></div> <div style="position:absolute;top:13850;left:349"><nobr>Best fine-tuning + RAG approach</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:13868;left:261"><nobr>Table 3: Summary of Experimental Methods for SLR Automation</nobr></div> <div style="position:absolute;top:13927;left:108"><nobr>4.6 Hardware specifications and training runtimes</nobr></div> <div style="position:absolute;top:13957;left:108"><nobr>Fine-tuning and RAG execution were performed on a Linux system powered by an AMD EPYC 7713 64-bit 8-Core</nobr></div> <div style="position:absolute;top:13973;left:108"><nobr>Processor with 259 GB of RAM. For the training, we utilized an NVIDIA A40 GPU, which is built on an 8 nm process</nobr></div> <div style="position:absolute;top:13990;left:108"><nobr>and equipped with 48 GB of GDDR6 memory. This GPU features 10,752 shading units, 336 tensor cores, and 84</nobr></div> <div style="position:absolute;top:14006;left:108"><nobr>ray tracing cores, supporting a base clock speed of 1305 MHz and a boost clock up to 1740 MHz. The total power</nobr></div> <div style="position:absolute;top:14022;left:108"><nobr>consumption of the GPU is 300 W. The training processes for both NEFTune and LoRA models took approximately 70</nobr></div> <div style="position:absolute;top:14039;left:108"><nobr>hours to complete 150 epochs each.</nobr></div> <div style="position:absolute;top:14075;left:108"><nobr>4.7 Evaluation</nobr></div> <div style="position:absolute;top:14106;left:108"><nobr>Our evaluation centered on assessing the factual accuracy of responses generated by the various</nobr></div> <div style="position:absolute;top:14122;left:108"><nobr>methodologies applied to the SLR dataset. 
This assessment was conducted through a combination of quantitative and</nobr></div> <div style="position:absolute;top:14138;left:108"><nobr>qualitative analyses, utilizing test sets.</nobr></div> <div style="position:absolute;top:14173;left:108"><nobr>4.7.1 Test Dataset</nobr></div> <div style="position:absolute;top:14201;left:108"><nobr>For the quantitative analysis, we employed an 80/20 train/test split. The qualitative test set comprised 11 gold standard</nobr></div> <div style="position:absolute;top:14217;left:108"><nobr>answers, which were derived from key SLR-level findings across the 17 papers within the target SLR paper. These</nobr></div> <div style="position:absolute;top:14234;left:108"><nobr>findings represented the core contributions of the SLR study and were articulated into Q&A pairs.</nobr></div> <div style="position:absolute;top:14268;left:108"><nobr>4.7.2 Quantitative Metrics</nobr></div> <div style="position:absolute;top:14297;left:108"><nobr>The quantitative evaluation was grounded in two metrics, each devised for its relevance to the evaluation of factual</nobr></div> <div style="position:absolute;top:14313;left:108"><nobr>accuracy in the context of SLR automation. 
The first was FEVER [78<a href="#24">],</a> whose usage is consistent with other studies</nobr></div> <div style="position:absolute;top:14358;left:451"><nobr>12</nobr></div> </span></font> <div style="position:absolute;top:14431;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=13><b>Page 13</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:14544;left:108"><nobr><a href="#24">[79</a>, 80] concerning factual text evaluations. The next was a variation on FEVER which we refer to as the Consistency</nobr></div> <div style="position:absolute;top:14560;left:108"><nobr>Grading Scale (CGS), which offers a more granular analysis of the alignment between response texts and reference</nobr></div> <div style="position:absolute;top:14577;left:108"><nobr>materials.</nobr></div> <div style="position:absolute;top:14609;left:143"><nobr>1. FEVER (Fact Extraction and VERification): The FEVER metric was a key evaluation framework used in</nobr></div> <div style="position:absolute;top:14625;left:162"><nobr>conjunction with GPT-4 to automate the testing. FEVER scoring is designed to assess the factual accuracy of</nobr></div> <div style="position:absolute;top:14642;left:162"><nobr>generated responses by cross-referencing the LLM-generated responses against ground truth responses, which</nobr></div> <div style="position:absolute;top:14658;left:161"><nobr>were both submitted to GPT-4 to assess. 
The complete GPT-4 prompt can be seen in Appendix<a href="#25"> </a>B. FEVER</nobr></div> <div style="position:absolute;top:14674;left:162"><nobr>scoring determines the veracity of the claims, categorizing them into distinct labels for clarity. The labels used</nobr></div> <div style="position:absolute;top:14691;left:162"><nobr>in the standard FEVER evaluation are:</nobr></div> <div style="position:absolute;top:14713;left:179"><nobr>• SUPPORTED: The claim is directly corroborated by the evidence in the ground truth test set.</nobr></div> <div style="position:absolute;top:14733;left:179"><nobr>• REFUTED: The claim is directly contradicted by the evidence in the ground truth test set.</nobr></div> <div style="position:absolute;top:14752;left:179"><nobr>• NOT ENOUGH INFO: There is insufficient evidence in the ground truth test set to either support or refute</nobr></div> <div style="position:absolute;top:14769;left:192"><nobr>the claim.</nobr></div> <div style="position:absolute;top:14791;left:143"><nobr>2. Consistency Grading Scale (CGS): Building upon the foundational principles of FEVER, CGS introduces a</nobr></div> <div style="position:absolute;top:14808;left:162"><nobr>more granular continuum for evaluating the fidelity of information presented by LLM responses in the context</nobr></div> <div style="position:absolute;top:14824;left:162"><nobr>of systematic literature reviews (SLRs). The CGS was also used in conjunction with GPT-4 in order to derive</nobr></div> <div style="position:absolute;top:14841;left:162"><nobr>the assessments, where the complete prompt can be seen in Appendix B<a href="#25">. </a>This grading scale quantitatively</nobr></div> <div style="position:absolute;top:14857;left:162"><nobr>assesses the alignment of generated responses with verified reference texts, offering a spectrum of consistency</nobr></div> <div style="position:absolute;top:14873;left:162"><nobr>levels ranging from complete contradiction to full support. 
The CGS is defined as follows:</nobr></div> <div style="position:absolute;top:14896;left:179"><nobr>• Fully Consistent (2): The claim in the response text is unambiguously corroborated by the evidence in the</nobr></div> <div style="position:absolute;top:14912;left:192"><nobr>ground truth, with complete agreement in the source information.</nobr></div> <div style="position:absolute;top:14932;left:179"><nobr>• Partially Consistent (1): The response text is generally aligned with the reference material; however,</nobr></div> <div style="position:absolute;top:14948;left:192"><nobr>minor discrepancies or inadequacies in detail, accuracy, or source alignment are evident.</nobr></div> <div style="position:absolute;top:14968;left:179"><nobr>• Not Enough Info (0): The available evidence is insufficient to substantiate or refute the claims made in</nobr></div> <div style="position:absolute;top:14984;left:192"><nobr>the response text, indicating a gap in information.</nobr></div> <div style="position:absolute;top:15004;left:179"><nobr>• Partially Contradictory (-1): While some aspects of the response text may align with the reference</nobr></div> <div style="position:absolute;top:15020;left:192"><nobr>material, significant contradictions or factual inaccuracies are present.</nobr></div> <div style="position:absolute;top:15039;left:179"><nobr>• Contradictory (-2): The response text is in direct and total opposition to the evidence presented in the</nobr></div> <div style="position:absolute;top:15056;left:192"><nobr>ground truth, indicating clear factual inaccuracies.</nobr></div> <div style="position:absolute;top:15088;left:108"><nobr>The FEVER evaluation comprises the calculation of the percentages of responses labelled as either SUPPORTED,</nobr></div> <div style="position:absolute;top:15105;left:108"><nobr>REFUTED, or NOT ENOUGH INFO. 
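Both aggregations reduce to simple computations over the per-response judgements; the following is a minimal sketch, where the label and score lists are illustrative inputs rather than results from the study.

```python
from collections import Counter

def fever_percentages(labels):
    """Percentage of responses per FEVER label."""
    counts = Counter(labels)
    n = len(labels)
    return {label: 100.0 * c / n for label, c in counts.items()}

def cgs_mean(scores):
    """Mean CGS score over all responses, ranging from -2 to 2."""
    return sum(scores) / len(scores)

labels = ["SUPPORTED", "SUPPORTED", "REFUTED", "NOT ENOUGH INFO"]
scores = [2, 2, -1, 0]
print(fever_percentages(labels))  # {'SUPPORTED': 50.0, 'REFUTED': 25.0, 'NOT ENOUGH INFO': 25.0}
print(cgs_mean(scores))           # 0.75
```

The CGS mean compresses the five-point scale into a single score per method, while the FEVER percentages keep the three label categories separate.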
Meanwhile, for the CGS evaluation, a mean is calculated over the</nobr></div> <div style="position:absolute;top:15121;left:108"><nobr>values of every response to derive a value that ranges from -2 (totally inaccurate) to 2 (perfectly accurate). GPT-4</nobr></div> <div style="position:absolute;top:15137;left:107"><nobr>was leveraged for automating the evaluations against the criteria defined above, an approach consistent with</nobr></div> <div style="position:absolute;top:15154;left:108"><nobr>previously published studies<a href="#24"> [81</a>, 82]. Prompts for both CGS and FEVER were empirically developed and refined by</nobr></div> <div style="position:absolute;top:15170;left:108"><nobr>iterative testing against GPT-4 in order to ensure that correct grading responses were being generated. The prompts</nobr></div> <div style="position:absolute;top:15186;left:108"><nobr>can be seen in Appendix B<a href="#25">. </a>In order to establish the validity of the devised evaluation metrics, 100 samples were drawn from the</nobr></div> <div style="position:absolute;top:15203;left:108"><nobr>test dataset and inter-rater reliability was analyzed between the responses of two human evaluators and</nobr></div> <div style="position:absolute;top:15219;left:108"><nobr>GPT-4 for both CGS and FEVER metrics, whose results are presented in the subsequent section.</nobr></div> <div style="position:absolute;top:15254;left:108"><nobr>4.7.3 Qualitative Analysis</nobr></div> <div style="position:absolute;top:15282;left:108"><nobr>The qualitative component of our evaluation framework was designed to gauge the depth and relevance of the synthesis</nobr></div> <div style="position:absolute;top:15299;left:108"><nobr>provided by the LLM-generated responses. 
An author of this study, a domain expert well-versed in</nobr></div> <div style="position:absolute;top:15315;left:108"><nobr>the details of the subject matter, was enlisted to perform a comprehensive review of each response. Their analysis</nobr></div> <div style="position:absolute;top:15331;left:107"><nobr>was centered on determining if the response fell into one of two categories: Supports and Does Not Support.</nobr></div> <div style="position:absolute;top:15348;left:108"><nobr>These categories determined the extent to which the responses could encapsulate key synthesis information from the</nobr></div> <div style="position:absolute;top:15364;left:108"><nobr>aggregated SLR studies, enabling the core conclusions and findings of the benchmark SLR paper to be replicated.</nobr></div> <div style="position:absolute;top:15396;left:149"><nobr>• Supports: This category was reserved for responses that demonstrated an accurate presentation of the SLR</nobr></div> <div style="position:absolute;top:15413;left:162"><nobr>content, effectively summarizing all the key points, methodologies, results, and conclusions. The responses</nobr></div> <div style="position:absolute;top:15429;left:162"><nobr>classified under this category were those that provided comprehensive synthesis information, enabling the</nobr></div> <div style="position:absolute;top:15445;left:162"><nobr>researcher to completely replicate the key findings of the target SLR study as reported in the published paper.</nobr></div> <div style="position:absolute;top:15468;left:149"><nobr>• Does Not Support: Responses falling into this category were those that failed to capture all the details</nobr></div> <div style="position:absolute;top:15485;left:162"><nobr>necessary for the replication of the key findings of the target SLR study. 
These responses may include</nobr></div> <div style="position:absolute;top:15501;left:162"><nobr>inaccuracies or omissions in key details.</nobr></div> <div style="position:absolute;top:15546;left:451"><nobr>13</nobr></div> </span></font> <a href="#24" style="border:1px solid #0000ff;position:absolute;top:14542;left:134;width:18;height:14;display:block"></a><a href="#24" style="border:1px solid #0000ff;position:absolute;top:15152;left:309;width:18;height:13;display:block"></a> <div style="position:absolute;top:15619;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=14><b>Page 14</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:15732;left:108"><nobr>Each response was manually evaluated against these categories, with a subject expert providing detailed feedback on</nobr></div> <div style="position:absolute;top:15748;left:108"><nobr>the fidelity of the responses. 
To ensure the objectivity and reliability of the qualitative analysis, the gold standard Q&A</nobr></div> <div style="position:absolute;top:15765;left:108"><nobr>pairs used as benchmarks were kept confidential from the team responsible for the LLMs’ development and training.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:15805;left:108"><nobr>5 Results</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:15842;left:108"><nobr>Our analysis of results first considers the validity of the CGS and FEVER metrics, and then compares the performances</nobr></div> <div style="position:absolute;top:15858;left:108"><nobr>of all the proposed approaches. A correlation analysis is then provided based on the responses of all methods, followed</nobr></div> <div style="position:absolute;top:15874;left:108"><nobr>by a more detailed analysis of the best-performing approach.</nobr></div> <div style="position:absolute;top:15911;left:108"><nobr>5.1 Validation of evaluation method</nobr></div> <div style="position:absolute;top:15941;left:107"><nobr>We begin first by establishing the validity of the devised evaluation metrics. 
Tables<a href="#14"> </a>4 and 5 show the correlation matrices</nobr></div> <div style="position:absolute;top:15957;left:108"><nobr>analyzing the inter-rater reliability among the two independent human evaluators’ (H1 and H2) scores as well as that of</nobr></div> <div style="position:absolute;top:15974;left:108"><nobr>GPT-4, showing the utility and limitations of both CGS and FEVER within the context of factuality assessment.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:16010;left:186"><nobr>Rater GPT-4</nobr></div> <div style="position:absolute;top:16010;left:294"><nobr>H1</nobr></div> <div style="position:absolute;top:16010;left:335"><nobr>H2</nobr></div> <div style="position:absolute;top:16029;left:180"><nobr>GPT-4</nobr></div> <div style="position:absolute;top:16029;left:243"><nobr>1.0</nobr></div> <div style="position:absolute;top:16044;left:199"><nobr>H1</nobr></div> <div style="position:absolute;top:16044;left:236"><nobr>0.647</nobr></div> <div style="position:absolute;top:16044;left:293"><nobr>1.0</nobr></div> <div style="position:absolute;top:16059;left:199"><nobr>H2</nobr></div> <div style="position:absolute;top:16059;left:236"><nobr>0.739</nobr></div> <div style="position:absolute;top:16059;left:287"><nobr>0.669</nobr></div> <div style="position:absolute;top:16059;left:335"><nobr>1.0</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:16074;left:108"><nobr>Table 4: Inter-rater correlation matrix for the CGS</nobr></div> <div style="position:absolute;top:16090;left:108"><nobr>metric.</nobr></div> <div style="position:absolute;top:16007;left:567"><nobr>Rater GPT-4</nobr></div> <div style="position:absolute;top:16007;left:682"><nobr>H1</nobr></div> <div style="position:absolute;top:16007;left:726"><nobr>H2</nobr></div> <div 
style="position:absolute;top:16027;left:560"><nobr>GPT-4</nobr></div> <div style="position:absolute;top:16027;left:627"><nobr>1.0</nobr></div> <div style="position:absolute;top:16044;left:581"><nobr>H1</nobr></div> <div style="position:absolute;top:16044;left:624"><nobr>0.49</nobr></div> <div style="position:absolute;top:16044;left:682"><nobr>1.0</nobr></div> <div style="position:absolute;top:16060;left:581"><nobr>H2</nobr></div> <div style="position:absolute;top:16060;left:627"><nobr>0.6</nobr></div> <div style="position:absolute;top:16060;left:674"><nobr>0.557 1.0</nobr></div> <div style="position:absolute;top:16077;left:494"><nobr>Table 5: Inter-rater correlation matrix for the FEVER</nobr></div> <div style="position:absolute;top:16093;left:494"><nobr>metric.</nobr></div> <div style="position:absolute;top:16140;left:108"><nobr>The CGS metric displays moderate to strong correlations, particularly with the H2 rater exhibiting a robust 0.74,</nobr></div> <div style="position:absolute;top:16156;left:108"><nobr>suggesting a substantial concurrence with the GPT-4 evaluations. H1 demonstrated a moderate agreement with a</nobr></div> <div style="position:absolute;top:16172;left:108"><nobr>correlation of 0.65 with GPT-4. These figures are noteworthy, given the subjective nature inherent in qualitative</nobr></div> <div style="position:absolute;top:16189;left:108"><nobr>evaluations. 
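The agreement figures in Tables 4 and 5 are plain pairwise correlations between rater score vectors; a minimal sketch of the computation (Pearson's r, with illustrative score vectors rather than the study's actual ratings):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical CGS scores (-2..2) from a human rater and GPT-4
h1 = [2, 1, 0, 2, -1, 2]
gpt4 = [2, 2, 0, 1, -1, 2]
print(round(pearson(h1, gpt4), 3))  # 0.875
```

Computing the coefficient for each rater pair yields the lower-triangular matrices reported above.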
The inter-rater reliability, represented by a correlation of 0.67, although not indicating perfect agreement,</nobr></div> <div style="position:absolute;top:16205;left:108"><nobr>nevertheless reveals a meaningful consensus between the human raters that overall affirms the CGS as a reliable tool</nobr></div> <div style="position:absolute;top:16221;left:107"><nobr>within its application scope.</nobr></div> <div style="position:absolute;top:16246;left:108"><nobr>By contrast, the FEVER metric was characterized by lower correlation values and indicated a more modest agreement</nobr></div> <div style="position:absolute;top:16262;left:108"><nobr>between evaluators, a reflection of the challenges faced in calibrating subjective analytical tools in general.</nobr></div> <div style="position:absolute;top:16279;left:107"><nobr>With H1 at 0.49 and H2 at 0.6 correlation with GPT-4, there was discernible inter-evaluator consonance, albeit less</nobr></div> <div style="position:absolute;top:16295;left:108"><nobr>pronounced than that within the CGS metric. The inter-rater correlation of 0.56, while significant, underscored the need</nobr></div> <div style="position:absolute;top:16311;left:108"><nobr>for further evaluator calibration to mitigate interpretive disparities. Upon a deeper investigation into the divergences,</nobr></div> <div style="position:absolute;top:16328;left:108"><nobr>the results showed that there were significant differences in the manner in which the human raters evaluated the ’NOT</nobr></div> <div style="position:absolute;top:16344;left:108"><nobr>ENOUGH INFO’ category with respect to each other and GPT-4, with both human raters underrating this category</nobr></div> <div style="position:absolute;top:16360;left:108"><nobr>and preferring either the SUPPORTED or REFUTED categories instead. 
Thus, while the FEVER metric offers</nobr></div> <div style="position:absolute;top:16377;left:108"><nobr>meaningful insights into the veracity of claims for the analysis in this study, its application necessitates a more critical</nobr></div> <div style="position:absolute;top:16393;left:108"><nobr>interpretation. For the purposes of this study, CGS can be relied upon to provide a broader spectrum of consistency in</nobr></div> <div style="position:absolute;top:16410;left:108"><nobr>evaluation, with FEVER serving as a barometer for broader factual accuracy.</nobr></div> <div style="position:absolute;top:16446;left:108"><nobr>5.2 Quantitative analysis accuracies</nobr></div> <div style="position:absolute;top:16476;left:108"><nobr>The quantitative outcomes, encapsulated in <a href="#15"></a>Table 6, show the performance across REFUTED, NOT ENOUGH INFO,</nobr></div> <div style="position:absolute;top:16493;left:108"><nobr>and SUPPORTED categories, thereby offering a granular perspective on the factual alignment of each method. All</nobr></div> <div style="position:absolute;top:16509;left:108"><nobr>results are rank-ordered with respect to the SUPPORTED category. We first draw attention to the results of the Baseline</nobr></div> <div style="position:absolute;top:16525;left:108"><nobr>method, which represents responses from the unmodified Mistral-Instruct LLM and indicates the potential existence of</nobr></div> <div style="position:absolute;top:16542;left:108"><nobr>any SLR-relevant prior knowledge arising from its pretraining and instruct-tuning processes, while also serving as a</nobr></div> <div style="position:absolute;top:16558;left:108"><nobr>benchmark for determining the degree to which proposed approaches have succeeded in achieving an improvement. 
The</nobr></div> <div style="position:absolute;top:16574;left:108"><nobr>baseline indicates the ability to correctly respond to 14.5% of the test data, with the majority of the responses being</nobr></div> <div style="position:absolute;top:16591;left:108"><nobr>categorised as NOT ENOUGH INFO. We see next that the NEFTune method emerged as the most effective, with</nobr></div> <div style="position:absolute;top:16607;left:108"><nobr>89.2% of its responses being SUPPORTED, indicative of its strong factual accuracy and alignment with the actual SLR</nobr></div> <div style="position:absolute;top:16623;left:108"><nobr>corpus. LoRA, with an 87.7% ‘SUPPORTED’ rate, also demonstrated significant reliability, albeit marginally lower</nobr></div> <div style="position:absolute;top:16640;left:108"><nobr>than NEFTune. In contrast, the integration of RAG with a fine-tuned dataset (RAG + FD) presents a discernible shift in</nobr></div> <div style="position:absolute;top:16656;left:108"><nobr>distribution across all three categories, with a notable increase in ‘NOT ENOUGH INFO’ outcomes and a decrease in</nobr></div> <div style="position:absolute;top:16673;left:106"><nobr>‘REFUTED’ responses compared to NEFTune and LoRA. The combination of NEFTune and RAG + FD, however,</nobr></div> <div style="position:absolute;top:16689;left:108"><nobr>exhibited a clear reduction in factual accuracy, marked by a SUPPORTED rate of 63.2%. 
The sharply elevated NOT</nobr></div> <div style="position:absolute;top:16734;left:451"><nobr>14</nobr></div> </span></font> <a href="#14" style="border:1px solid #0000ff;position:absolute;top:15939;left:626;width:11;height:17;display:block"></a> <div style="position:absolute;top:16807;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=15><b>Page 15</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:16862;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:16863;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:16920;left:108"><nobr>ENOUGH INFO and REFUTED rates may reflect the inability of the finetuned LLM to integrate additional text</nobr></div> <div style="position:absolute;top:16936;left:108"><nobr>from the RAG process. 
The low RAG (raw) outcomes underscore the challenges of using an unmodified</nobr></div> <div style="position:absolute;top:16953;left:108"><nobr>RAG approach and basic retrieval methodologies to achieve accurate retrieval and synthesis of the SLR-specific</nobr></div> <div style="position:absolute;top:16969;left:108"><nobr>information generated during the data extraction process.</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:17006;left:257"><nobr>Method</nobr></div> <div style="position:absolute;top:17006;left:395"><nobr>REFUTED</nobr></div> <div style="position:absolute;top:17006;left:467"><nobr>NOT ENOUGH INFO</nobr></div> <div style="position:absolute;top:17006;left:593"><nobr>SUPPORTED</nobr></div> <div style="position:absolute;top:17029;left:257"><nobr>NEFTune</nobr></div> <div style="position:absolute;top:17029;left:410"><nobr>6.9%</nobr></div> <div style="position:absolute;top:17029;left:509"><nobr>3.9%</nobr></div> <div style="position:absolute;top:17029;left:612"><nobr>89.2%</nobr></div> <div style="position:absolute;top:17044;left:257"><nobr>LoRA</nobr></div> <div style="position:absolute;top:17044;left:410"><nobr>7.9%</nobr></div> <div style="position:absolute;top:17044;left:509"><nobr>4.4%</nobr></div> <div style="position:absolute;top:17044;left:612"><nobr>87.7%</nobr></div> <div style="position:absolute;top:17059;left:257"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:17059;left:410"><nobr>4.6%</nobr></div> <div style="position:absolute;top:17059;left:506"><nobr>10.7%</nobr></div> <div style="position:absolute;top:17059;left:612"><nobr>84.7%</nobr></div> <div style="position:absolute;top:17074;left:257"><nobr>NEFTune + (RAG + FD)</nobr></div> <div style="position:absolute;top:17074;left:407"><nobr>10.4%</nobr></div> <div style="position:absolute;top:17074;left:504"><nobr>26.4%</nobr></div> <div 
style="position:absolute;top:17074;left:612"><nobr>63.2%</nobr></div> <div style="position:absolute;top:17089;left:257"><nobr>RAG (raw)</nobr></div> <div style="position:absolute;top:17089;left:407"><nobr>29.7%</nobr></div> <div style="position:absolute;top:17089;left:506"><nobr>52.0%</nobr></div> <div style="position:absolute;top:17089;left:612"><nobr>18.3%</nobr></div> <div style="position:absolute;top:17104;left:257"><nobr>Baseline</nobr></div> <div style="position:absolute;top:17104;left:407"><nobr>25.8%</nobr></div> <div style="position:absolute;top:17104;left:506"><nobr>59.7%</nobr></div> <div style="position:absolute;top:17104;left:612"><nobr>14.5%</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:17122;left:108"><nobr>Table 6: FEVER results in rank order according to the ’SUPPORTED’ values. The percentage</nobr></div> <div style="position:absolute;top:17138;left:108"><nobr>of responses in each category is listed.</nobr></div> <div style="position:absolute;top:17183;left:108"><nobr>The CGS results, detailed in Table <a href="#15">7,</a> offer a more fine-grained perspective on the consistency of the responses. This</nobr></div> <div style="position:absolute;top:17199;left:108"><nobr>scale ranges from -2 (Contradictory) to 2 (Fully consistent), providing a more detailed view of the alignment between</nobr></div> <div style="position:absolute;top:17216;left:108"><nobr>the generated responses and the SLR corpus. Similar to the FEVER results, in the CGS analysis NEFTune again</nobr></div> <div style="position:absolute;top:17232;left:108"><nobr>emerged as the leader with the highest percentage of ’Fully consistent’ responses and an overall mean score of 1.75</nobr></div> <div style="position:absolute;top:17248;left:108"><nobr>(87.6%), indicating a high degree of factual fidelity in its responses. 
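The aggregated scores reported for each method can be reproduced directly from the per-category counts in Table 7. A minimal sketch (the helper name cgs_summary is ours; the counts are taken from the NEFTune row of Table 7):

```python
# Reproduce the mean CGS score and its percentage of the maximum score (2)
# from per-category response counts; counts taken from Table 7 (NEFTune row).

def cgs_summary(counts):
    """counts maps each CGS level (-2..2) to its number of responses.
    Returns (mean score, percentage of the highest possible score of 2)."""
    total = sum(counts.values())
    mean = sum(level * n for level, n in counts.items()) / total
    return mean, mean / 2 * 100

neftune = {-2: 18, -1: 205, 0: 57, 1: 432, 2: 4250}
mean, pct = cgs_summary(neftune)
print(round(mean, 2), round(pct, 1))  # 1.75 87.6, matching Table 7
```

The same computation applied to the other rows yields the remaining mean scores and percentages in Table 7.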
LoRA closely followed with an overall</nobr></div> <div style="position:absolute;top:17265;left:108"><nobr>mean score of 1.72 (86.0%). The method combining RAG with a fine-tuned dataset (RAG + FD) achieved a moderately</nobr></div> <div style="position:absolute;top:17281;left:108"><nobr>lower mean score of 1.67 (83.7%), reflecting a competent performance but with room for possible improvement via</nobr></div> <div style="position:absolute;top:17297;left:108"><nobr>the adjustable RAG parameters, particularly in enhancing the proportion of ‘Fully consistent’ responses. NEFTune</nobr></div> <div style="position:absolute;top:17314;left:108"><nobr>combined with RAG + FD exhibited a notable drop in performance, as evidenced by a mean score of 1.14 (57.2%),</nobr></div> <div style="position:absolute;top:17330;left:107"><nobr>which suggests that the integration of RAG inputs with the finetuned models was likely introducing confusion in</nobr></div> <div style="position:absolute;top:17346;left:108"><nobr>response generation. This significant decrease highlights the challenges in combining these two otherwise effective</nobr></div> <div style="position:absolute;top:17363;left:108"><nobr>methodologies without compromising factual integrity. 
RAG (raw) and the baseline methods manifested the lowest</nobr></div> <div style="position:absolute;top:17379;left:108"><nobr>consistency levels in line with previous results, with mean scores of 0.46 (23.2%) and 0.164 (8.2%) respectively.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:17415;left:377"><nobr>Partially</nobr></div> <div style="position:absolute;top:17415;left:460"><nobr>Not enough</nobr></div> <div style="position:absolute;top:17415;left:561"><nobr>Partially</nobr></div> <div style="position:absolute;top:17415;left:661"><nobr>Fully</nobr></div> <div style="position:absolute;top:17430;left:270"><nobr>Contradictory</nobr></div> <div style="position:absolute;top:17430;left:364"><nobr>contradictory</nobr></div> <div style="position:absolute;top:17430;left:481"><nobr>info</nobr></div> <div style="position:absolute;top:17430;left:556"><nobr>consistent</nobr></div> <div style="position:absolute;top:17430;left:648"><nobr>consistent</nobr></div> <div style="position:absolute;top:17445;left:297"><nobr>(-2)</nobr></div> <div style="position:absolute;top:17445;left:389"><nobr>(-1)</nobr></div> <div style="position:absolute;top:17445;left:484"><nobr>(0)</nobr></div> <div style="position:absolute;top:17445;left:576"><nobr>(1)</nobr></div> <div style="position:absolute;top:17445;left:667"><nobr>(2)</nobr></div> <div style="position:absolute;top:17445;left:751"><nobr>Mean</nobr></div> <div style="position:absolute;top:17469;left:117"><nobr>NEFTune</nobr></div> <div style="position:absolute;top:17469;left:280"><nobr>18 (0.4%)</nobr></div> <div style="position:absolute;top:17469;left:369"><nobr>205 (4.1%)</nobr></div> <div style="position:absolute;top:17469;left:465"><nobr>57 (1.1%)</nobr></div> <div style="position:absolute;top:17469;left:553"><nobr>432 (8.7%)</nobr></div> <div style="position:absolute;top:17469;left:638"><nobr>4250 (85.7%) 1.75 (87.6%)</nobr></div> <div 
style="position:absolute;top:17484;left:117"><nobr>LoRA</nobr></div> <div style="position:absolute;top:17484;left:280"><nobr>25 (0.5%)</nobr></div> <div style="position:absolute;top:17484;left:369"><nobr>215 (4.3%)</nobr></div> <div style="position:absolute;top:17484;left:465"><nobr>77 (1.6%)</nobr></div> <div style="position:absolute;top:17484;left:553"><nobr>486 (9.8%)</nobr></div> <div style="position:absolute;top:17484;left:638"><nobr>4159 (83.8%) 1.72 (86.0%)</nobr></div> <div style="position:absolute;top:17498;left:117"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:17498;left:284"><nobr>6 (0.1%)</nobr></div> <div style="position:absolute;top:17498;left:373"><nobr>86 (1.7%)</nobr></div> <div style="position:absolute;top:17498;left:461"><nobr>178 (3.6%)</nobr></div> <div style="position:absolute;top:17498;left:550"><nobr>983 (19.8%)</nobr></div> <div style="position:absolute;top:17498;left:638"><nobr>3709 (74.7%) 1.67 (83.7%)</nobr></div> <div style="position:absolute;top:17513;left:117"><nobr>NEFTune + (RAG + FD)</nobr></div> <div style="position:absolute;top:17513;left:280"><nobr>69 (1.4%)</nobr></div> <div style="position:absolute;top:17513;left:369"><nobr>250 (5.0%)</nobr></div> <div style="position:absolute;top:17513;left:458"><nobr>931 (18.8%)</nobr></div> <div style="position:absolute;top:17513;left:546"><nobr>1356 (27.3%) 2356 (47.5%) 1.14 (57.2%)</nobr></div> <div style="position:absolute;top:17528;left:117"><nobr>RAG (raw)</nobr></div> <div style="position:absolute;top:17528;left:280"><nobr>84 (1.7%)</nobr></div> <div style="position:absolute;top:17528;left:366"><nobr>713 (14.4%)</nobr></div> <div style="position:absolute;top:17528;left:454"><nobr>1286 (25.9%) 2579 (52.0%)</nobr></div> <div style="position:absolute;top:17528;left:645"><nobr>300 (6.0%)</nobr></div> <div style="position:absolute;top:17528;left:731"><nobr>0.46 (23.2%)</nobr></div> <div style="position:absolute;top:17543;left:117"><nobr>Baseline</nobr></div> <div 
style="position:absolute;top:17543;left:277"><nobr>341 (6.9%)</nobr></div> <div style="position:absolute;top:17543;left:363"><nobr>1001 (20.2%) 1397 (28.2%) 1948 (39.3%)</nobr></div> <div style="position:absolute;top:17543;left:645"><nobr>275 (5.5%)</nobr></div> <div style="position:absolute;top:17543;left:731"><nobr>0.164 (8.2%)</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:17562;left:108"><nobr>Table 7: CGS results in rank order according to the ’Fully consistent’ values. The absolute number</nobr></div> <div style="position:absolute;top:17578;left:108"><nobr>and percentage of responses in each category are listed. The scores are aggregated and listed as</nobr></div> <div style="position:absolute;top:17595;left:108"><nobr>means ranging from 2 (Fully consistent) to -2 (Contradictory), alongside the percentage with respect to the highest</nobr></div> <div style="position:absolute;top:17611;left:108"><nobr>possible score.</nobr></div> <div style="position:absolute;top:17667;left:108"><nobr>5.3 Correlation analysis between responses of all approaches</nobr></div> <div style="position:absolute;top:17697;left:108"><nobr>The correlation analysis between the responses of the various LLM methodologies is presented next in order to explore</nobr></div> <div style="position:absolute;top:17713;left:108"><nobr>similarities and divergences in the LLM responses that may offer insights into the complementary nature of each; where</nobr></div> <div style="position:absolute;top:17730;left:108"><nobr>there is evidence that they capture different aspects of the data, there is potential to combine them to achieve</nobr></div> <div style="position:absolute;top:17746;left:108"><nobr>better accuracies. 
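The pairwise k values in this analysis can be sketched as follows. We assume here that they are Kendall-style rank correlations computed over paired per-question ratings; the implementation below is a self-contained tau-b (which corrects for ties, frequent in ordinal data), and the sample ratings are purely illustrative, not the study's data.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length ordinal rating sequences.
    Counts concordant/discordant pairs and corrects for ties in either sequence."""
    assert len(x) == len(y)
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            ties_x += 1
            if dy == 0:
                ties_y += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))

# Illustrative CGS ratings (-2..2) for the same questions from two methods.
method_a = [2, 2, 1, 2, 0, 2, -1, 2, 1, 2]
method_b = [2, 1, 1, 2, 0, 2, 0, 2, 2, 2]
print(round(kendall_tau_b(method_a, method_b), 3))  # positive: similar rankings
```

For real analyses, scipy.stats.kendalltau computes the same tau-b statistic by default.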
The correlation matrix represented in Table <a href="#16">8 </a>offers a refined perspective on the ordinal associations</nobr></div> <div style="position:absolute;top:17762;left:108"><nobr>between different LLM FEVER responses.</nobr></div> <div style="position:absolute;top:17787;left:107"><nobr>A positive k value between NEFTune and LoRA indicates a moderately similar ranking order in their response evalua-</nobr></div> <div style="position:absolute;top:17803;left:108"><nobr>tions. Conversely, a near-zero k value between NEFTune and the baseline method suggests an independent performance</nobr></div> <div style="position:absolute;top:17820;left:108"><nobr>trend, underscoring NEFTune’s distinct methodological improvements over the benchmark. The correlations between</nobr></div> <div style="position:absolute;top:17836;left:108"><nobr>responses of all other methods were not significant.</nobr></div> <div style="position:absolute;top:17861;left:108"><nobr>Similar to the previous table, the CGS correlation Table<a href="#16"> </a>9 again showed a moderately positive correlation between</nobr></div> <div style="position:absolute;top:17877;left:108"><nobr>NEFTune and LoRA (0.342), indicating that these methods exhibited overlapping learning patterns on the</nobr></div> <div style="position:absolute;top:17922;left:451"><nobr>15</nobr></div> </span></font> <div style="position:absolute;top:17995;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=16><b>Page 16</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:18050;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:18051;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> <div 
style="position:absolute;top:18110;left:297"><nobr>NEFTune RAG (raw)</nobr></div> <div style="position:absolute;top:18110;left:448"><nobr>LoRA</nobr></div> <div style="position:absolute;top:18110;left:500"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:18110;left:577"><nobr>NEFTune + (RAG + FD) baseline</nobr></div> <div style="position:absolute;top:18133;left:226"><nobr>NEFTune</nobr></div> <div style="position:absolute;top:18133;left:320"><nobr>1</nobr></div> <div style="position:absolute;top:18148;left:215"><nobr>RAG (raw)</nobr></div> <div style="position:absolute;top:18148;left:308"><nobr>0.013</nobr></div> <div style="position:absolute;top:18148;left:395"><nobr>1</nobr></div> <div style="position:absolute;top:18163;left:245"><nobr>LoRA</nobr></div> <div style="position:absolute;top:18163;left:308"><nobr>0.290</nobr></div> <div style="position:absolute;top:18163;left:384"><nobr>0.014</nobr></div> <div style="position:absolute;top:18163;left:462"><nobr>1</nobr></div> <div style="position:absolute;top:18177;left:219"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:18177;left:308"><nobr>0.128</nobr></div> <div style="position:absolute;top:18177;left:384"><nobr>0.037</nobr></div> <div style="position:absolute;top:18177;left:450"><nobr>0.116</nobr></div> <div style="position:absolute;top:18177;left:526"><nobr>1</nobr></div> <div style="position:absolute;top:18192;left:144"><nobr>NEFTune + (RAG + FD)</nobr></div> <div style="position:absolute;top:18192;left:308"><nobr>0.100</nobr></div> <div style="position:absolute;top:18192;left:384"><nobr>0.013</nobr></div> <div style="position:absolute;top:18192;left:450"><nobr>0.095</nobr></div> <div style="position:absolute;top:18192;left:515"><nobr>0.206</nobr></div> <div style="position:absolute;top:18192;left:641"><nobr>1</nobr></div> <div style="position:absolute;top:18207;left:235"><nobr>baseline</nobr></div> <div style="position:absolute;top:18207;left:305"><nobr>-0.006</nobr></div> <div 
style="position:absolute;top:18207;left:384"><nobr>0.107</nobr></div> <div style="position:absolute;top:18207;left:450"><nobr>0.022</nobr></div> <div style="position:absolute;top:18207;left:515"><nobr>0.037</nobr></div> <div style="position:absolute;top:18207;left:627"><nobr>-0.004</nobr></div> <div style="position:absolute;top:18207;left:749"><nobr>1</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:18226;left:234"><nobr>Table 8: Correlation matrix for FEVER responses between all the methods.</nobr></div> <div style="position:absolute;top:18291;left:108"><nobr>underlying finetuning dataset. The near-zero correlation between NEFTune and the baseline (-0.004) reinforces the</nobr></div> <div style="position:absolute;top:18307;left:108"><nobr>distinct advancement NEFTune contributes. This independence from the baseline method and other approaches</nobr></div> <div style="position:absolute;top:18323;left:108"><nobr>highlights NEFTune’s effectiveness in encoding the SLR fine-tuning dataset.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:18366;left:298"><nobr>NEFTune RAG (raw) LoRA</nobr></div> <div style="position:absolute;top:18366;left:498"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:18366;left:576"><nobr>NEFTune + (RAG + FD) baseline</nobr></div> <div style="position:absolute;top:18389;left:228"><nobr>NEFTune</nobr></div> <div style="position:absolute;top:18389;left:321"><nobr>1</nobr></div> <div style="position:absolute;top:18404;left:220"><nobr>RAG (raw)</nobr></div> <div style="position:absolute;top:18404;left:307"><nobr>-0.012</nobr></div> <div style="position:absolute;top:18404;left:395"><nobr>1</nobr></div> <div style="position:absolute;top:18419;left:247"><nobr>LoRA</nobr></div> <div style="position:absolute;top:18419;left:309"><nobr>0.342</nobr></div> <div 
style="position:absolute;top:18419;left:384"><nobr>0.001</nobr></div> <div style="position:absolute;top:18419;left:460"><nobr>1</nobr></div> <div style="position:absolute;top:18433;left:221"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:18433;left:309"><nobr>0.129</nobr></div> <div style="position:absolute;top:18433;left:384"><nobr>0.084</nobr></div> <div style="position:absolute;top:18433;left:448"><nobr>0.137</nobr></div> <div style="position:absolute;top:18433;left:525"><nobr>1</nobr></div> <div style="position:absolute;top:18448;left:145"><nobr>NEFTune + (RAG + FD)</nobr></div> <div style="position:absolute;top:18448;left:309"><nobr>0.087</nobr></div> <div style="position:absolute;top:18448;left:384"><nobr>0.035</nobr></div> <div style="position:absolute;top:18448;left:448"><nobr>0.091</nobr></div> <div style="position:absolute;top:18448;left:513"><nobr>0.248</nobr></div> <div style="position:absolute;top:18448;left:640"><nobr>1</nobr></div> <div style="position:absolute;top:18463;left:236"><nobr>baseline</nobr></div> <div style="position:absolute;top:18463;left:307"><nobr>-0.004</nobr></div> <div style="position:absolute;top:18463;left:384"><nobr>0.168</nobr></div> <div style="position:absolute;top:18463;left:448"><nobr>0.025</nobr></div> <div style="position:absolute;top:18463;left:513"><nobr>0.040</nobr></div> <div style="position:absolute;top:18463;left:628"><nobr>0.039</nobr></div> <div style="position:absolute;top:18463;left:747"><nobr>1</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:18482;left:382"><nobr>Table 9: Correlation matrix for CGS responses between all the methods.</nobr></div> <div style="position:absolute;top:18533;left:107"><nobr>We next examine and compare responses of the three best-performing approaches on two sample test questions</nobr></div> <div style="position:absolute;top:18549;left:108"><nobr>from FEVER, showcasing an example where each method’s response was classified 
as NOT ENOUGH INFO and</nobr></div> <div style="position:absolute;top:18566;left:108"><nobr>REFUTED. With respect to the NOT ENOUGH INFO responses in Table 10, we<a href="#16"> </a>find variations in detail and emphasis.</nobr></div> <div style="position:absolute;top:18582;left:108"><nobr>In response to the question asking in what manner the dashboard contributes to the advancement of virtual learning,</nobr></div> <div style="position:absolute;top:18598;left:108"><nobr>the reference text highlights the dashboard’s role in managing lab activity schedules and resource usage on a shared</nobr></div> <div style="position:absolute;top:18615;left:107"><nobr>AWS account, focusing on logistical aspects. In contrast, the LoRA and NEFTune responses pivot towards the dashboard’s</nobr></div> <div style="position:absolute;top:18631;left:108"><nobr>educational support, emphasizing feedback, self-regulation, and progress monitoring, with NEFTune slightly touching</nobr></div> <div style="position:absolute;top:18647;left:108"><nobr>on resource management but not to the extent mentioned in the reference. RAG + FD, while partially aligning with the</nobr></div> <div style="position:absolute;top:18664;left:108"><nobr>reference by discussing resource usage management, also diverged by emphasizing feedback and activity completion,</nobr></div> <div style="position:absolute;top:18680;left:108"><nobr>omitting the specific detail about scheduling. Thus, none of the responses fully encapsulated the reference’s emphasis</nobr></div> <div style="position:absolute;top:18696;left:108"><nobr>on time slot management and resource usage, illustrating a shift from logistical to educational functionalities in their</nobr></div> <div style="position:absolute;top:18713;left:108"><nobr>interpretations. 
In these instances the classification of NOT ENOUGH INFO for all responses was correct.</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:18755;left:220"><nobr>Text</nobr></div> <div style="position:absolute;top:18771;left:117"><nobr>Question</nobr></div> <div style="position:absolute;top:18771;left:220"><nobr>According to the 2023SLR dataset, in the naranjo2019visualdashboard paper, in what manner does the dashboard contribute to</nobr></div> <div style="position:absolute;top:18786;left:220"><nobr>the advancement of virtual learning?</nobr></div> <div style="position:absolute;top:18801;left:117"><nobr>Reference text</nobr></div> <div style="position:absolute;top:18801;left:220"><nobr>In the data used for the 2023SLR, for online instruction, the dashboard helps in identifying preferred time slots for students to</nobr></div> <div style="position:absolute;top:18816;left:220"><nobr>engage in lab activities, given the self-paced nature of the instruction. This can aid in managing and preventing soft limits being</nobr></div> <div style="position:absolute;top:18831;left:220"><nobr>reached due to concurrent resource usage in a shared AWS account by multiple students. Source: naranjo2019visualdashboard</nobr></div> <div style="position:absolute;top:18846;left:117"><nobr>NEFTune</nobr></div> <div style="position:absolute;top:18846;left:220"><nobr>In the data used for the 2023SLR, the dashboard contributes to the advancement of online education by providing a visual tool</nobr></div> <div style="position:absolute;top:18861;left:220"><nobr>that helps teachers and students monitor the progress of lab activities and the usage of resources in a hands-on learning</nobr></div> <div style="position:absolute;top:18876;left:220"><nobr>approach. 
It attends specifically to learning metrics, supporting teacher interventions and fostering self-regulation among</nobr></div> <div style="position:absolute;top:18891;left:220"><nobr>students. The dashboard also aids in scaling education by offering insights into student engagement and progress, potentially</nobr></div> <div style="position:absolute;top:18906;left:220"><nobr>enabling the automation of evaluation processes and enhancing the learning experience. Source: naranjo2019visualdashboard</nobr></div> <div style="position:absolute;top:18921;left:117"><nobr>LoRA</nobr></div> <div style="position:absolute;top:18921;left:220"><nobr>In the data used for the 2023SLR, the dashboard contributes to virtual learning by providing timely and targeted feedback to</nobr></div> <div style="position:absolute;top:18936;left:220"><nobr>students on their work, which supports self-regulation and learning about their own learning process. The dashboard’s visual</nobr></div> <div style="position:absolute;top:18951;left:220"><nobr>representation of learning activities and progress helps to engage students and increase their motivation. Additionally, by</nobr></div> <div style="position:absolute;top:18965;left:220"><nobr>offering teachers a centralized platform for monitoring student progress, the dashboard facilitates timely intervention and</nobr></div> <div style="position:absolute;top:18980;left:220"><nobr>personalized support. 
Source: naranjo2019visualdashboard</nobr></div> <div style="position:absolute;top:18995;left:117"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:18995;left:220"><nobr>The dashboard contributes to the advancement of virtual learning by providing timely and targeted feedback on student work,</nobr></div> <div style="position:absolute;top:19010;left:220"><nobr>allowing students to discriminate between completed parts of the hands-on labs and those still pending, fostering</nobr></div> <div style="position:absolute;top:19025;left:220"><nobr>self-regulation and completion of activities. It also helps in managing and preventing soft limits being reached due to</nobr></div> <div style="position:absolute;top:19040;left:220"><nobr>concurrent resource usage in a shared AWS account by multiple students.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:19067;left:160"><nobr>Table 10: Example of ’NOT ENOUGH INFO’ FEVER responses across the top three methods.</nobr></div> <div style="position:absolute;top:19110;left:451"><nobr>16</nobr></div> </span></font> <div style="position:absolute;top:19183;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=17><b>Page 17</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:19238;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:19239;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:19296;left:108"><nobr>Meanwhile, with respect to the examples of REFUTED responses for the 
same question which requested details of</nobr></div> <div style="position:absolute;top:19312;left:108"><nobr>a specific study’s insights on how the learning analytics dashboard (LAD) drove motivation among students in</nobr></div> <div style="position:absolute;top:19329;left:108"><nobr>Table <a href="#17">11</a>, we see that LoRA and NEFTune produced identical responses that closely match the reference text. However,</nobr></div> <div style="position:absolute;top:19345;left:107"><nobr>while they correctly reported most details, they omitted the detail that intrinsic motivation actually decreased in both</nobr></div> <div style="position:absolute;top:19361;left:108"><nobr>groups, as noted in the reference text. RAG + FD introduced a claim not present in the reference text, suggesting that</nobr></div> <div style="position:absolute;top:19378;left:108"><nobr>LADs lead to higher academic performance, evidenced by higher final grades. Similar to LoRA and NEFTune, the</nobr></div> <div style="position:absolute;top:19394;left:108"><nobr>response mentioned the increased extrinsic motivation in the treatment group and the lack of significant difference in</nobr></div> <div style="position:absolute;top:19410;left:108"><nobr>changes in intrinsic motivation between the groups. However, it also omitted that intrinsic motivation decreased in both</nobr></div> <div style="position:absolute;top:19427;left:108"><nobr>groups over time. 
Again, in these instances the FEVER classification of REFUTED derived by GPT-4 was correct for</nobr></div> <div style="position:absolute;top:19443;left:108"><nobr>all responses.</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:19478;left:220"><nobr>Text</nobr></div> <div style="position:absolute;top:19494;left:117"><nobr>Question</nobr></div> <div style="position:absolute;top:19494;left:220"><nobr>According to the 2023SLR dataset, in the fleur2020motivation paper, could you detail the study’s insights on how the LAD</nobr></div> <div style="position:absolute;top:19509;left:220"><nobr>drives motivation among students?</nobr></div> <div style="position:absolute;top:19524;left:117"><nobr>Reference text</nobr></div> <div style="position:absolute;top:19524;left:220"><nobr>In the data used for the 2023SLR, the study found a significant interaction effect between group (treatment/control) and time</nobr></div> <div style="position:absolute;top:19539;left:220"><nobr>(begin/end of the course) for extrinsic motivation, with the treatment group showing increased extrinsic motivation at the end</nobr></div> <div style="position:absolute;top:19554;left:220"><nobr>of the course, while it decreased in the control group. However, intrinsic motivation decreased in both groups over time, with</nobr></div> <div style="position:absolute;top:19569;left:220"><nobr>no significant difference between the groups. 
Source: fleur2020motivation</nobr></div> <div style="position:absolute;top:19583;left:117"><nobr>NEFTune</nobr></div> <div style="position:absolute;top:19583;left:220"><nobr>In the data used for the 2023SLR, the study found a significant interaction effect between group (treatment/control) and time</nobr></div> <div style="position:absolute;top:19598;left:220"><nobr>(beginning/end of the course) for extrinsic motivation, with the treatment group showing increased extrinsic motivation at the</nobr></div> <div style="position:absolute;top:19613;left:220"><nobr>end of the course compared to the control group. However, there was no significant difference between the groups in terms of</nobr></div> <div style="position:absolute;top:19628;left:220"><nobr>changes in intrinsic motivation. Source: fleur2020motivation</nobr></div> <div style="position:absolute;top:19643;left:117"><nobr>LoRA</nobr></div> <div style="position:absolute;top:19643;left:220"><nobr>In the data used for the 2023SLR, the study found a significant interaction effect between group (treatment/control) and time</nobr></div> <div style="position:absolute;top:19658;left:220"><nobr>(beginning/end of the course) for extrinsic motivation, with the treatment group showing increased extrinsic motivation at the</nobr></div> <div style="position:absolute;top:19673;left:220"><nobr>end of the course compared to the control group. However, there was no significant difference between the groups in terms of</nobr></div> <div style="position:absolute;top:19688;left:220"><nobr>changes in intrinsic motivation. Source: fleur2020motivation</nobr></div> <div style="position:absolute;top:19703;left:117"><nobr>RAG + FD</nobr></div> <div style="position:absolute;top:19703;left:220"><nobr>The study by Fleur et al. (2020) found that LADs can significantly increase extrinsic motivation among students, leading to</nobr></div> <div style="position:absolute;top:19718;left:220"><nobr>higher academic performance. 
The treatment group showed increased extrinsic motivation at the end of the course compared</nobr></div> <div style="position:absolute;top:19733;left:220"><nobr>to the control group, as evidenced by higher final grades. However, there was no significant difference between the groups in</nobr></div> <div style="position:absolute;top:19748;left:220"><nobr>terms of changes in intrinsic motivation.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:19775;left:196"><nobr>Table 11: Examples of REFUTED FEVER responses across the top three methods.</nobr></div> <div style="position:absolute;top:19820;left:108"><nobr>5.4 Detailed Analysis</nobr></div> <div style="position:absolute;top:19850;left:108"><nobr>For the best-performing model from the above results, Tables<a href="#17"> 12</a> and 13 offer a vantage point to assess the NEFTune</nobr></div> <div style="position:absolute;top:19867;left:108"><nobr>model’s proficiency across various levels of analysis. The NEFTune model’s capability is particularly observable at</nobr></div> <div style="position:absolute;top:19883;left:108"><nobr>the paper-level Q&As, with 83.0% of responses achieving ’Fully Consistent’ ratings (CGS=2), as shown in Table <a href="#17">12.</a></nobr></div> <div style="position:absolute;top:19900;left:108"><nobr>Likewise, at the paper-summary-level, NEFTune’s ability to respond with high fidelity is further confirmed by a ’Fully</nobr></div> <div style="position:absolute;top:19916;left:108"><nobr>Consistent’ score in 95.0% of cases. When combined with ’Partially consistent’ results, NEFTune’s scores rise to 93%</nobr></div> <div style="position:absolute;top:19932;left:108"><nobr>and 98% for paper-level and paper-summary-level questions, respectively. 
However, for SLR-level questions, NEFTune’s</nobr></div> <div style="position:absolute;top:19949;left:107"><nobr>’Fully Consistent’ rate drops significantly to 52%, but again, in combination with ’Partially consistent’</nobr></div> <div style="position:absolute;top:19965;left:108"><nobr>results, this rises to 84%, still indicating a strong level of dependability.</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:20005;left:423"><nobr>Partially</nobr></div> <div style="position:absolute;top:20005;left:493"><nobr>Not enough</nobr></div> <div style="position:absolute;top:20005;left:577"><nobr>Partially</nobr></div> <div style="position:absolute;top:20005;left:665"><nobr>Fully</nobr></div> <div style="position:absolute;top:20020;left:327"><nobr>Contradictory</nobr></div> <div style="position:absolute;top:20020;left:412"><nobr>contradictory</nobr></div> <div style="position:absolute;top:20020;left:512"><nobr>info</nobr></div> <div style="position:absolute;top:20020;left:573"><nobr>consistent</nobr></div> <div style="position:absolute;top:20020;left:654"><nobr>consistent</nobr></div> <div style="position:absolute;top:20035;left:228"><nobr>Level of analysis</nobr></div> <div style="position:absolute;top:20035;left:351"><nobr>(-2)</nobr></div> <div style="position:absolute;top:20035;left:434"><nobr>(-1)</nobr></div> <div style="position:absolute;top:20035;left:514"><nobr>(0)</nobr></div> <div style="position:absolute;top:20035;left:590"><nobr>(1)</nobr></div> <div style="position:absolute;top:20035;left:671"><nobr>(2)</nobr></div> <div style="position:absolute;top:20057;left:256"><nobr>Paper-level</nobr></div> <div style="position:absolute;top:20057;left:337"><nobr>11 (0.3%)</nobr></div> <div style="position:absolute;top:20057;left:417"><nobr>171 (4.7%)</nobr></div> <div style="position:absolute;top:20057;left:497"><nobr>55 (1.5%)</nobr></div> <div 
style="position:absolute;top:20057;left:567"><nobr>378 (10.4%)</nobr></div> <div style="position:absolute;top:20057;left:645"><nobr>3003 (83.0%)</nobr></div> <div style="position:absolute;top:20072;left:207"><nobr>Paper-summary-level</nobr></div> <div style="position:absolute;top:20072;left:340"><nobr>6 (0.5%)</nobr></div> <div style="position:absolute;top:20072;left:420"><nobr>23 (1.8%)</nobr></div> <div style="position:absolute;top:20072;left:500"><nobr>2 (0.2%)</nobr></div> <div style="position:absolute;top:20072;left:573"><nobr>33 (2.6%)</nobr></div> <div style="position:absolute;top:20072;left:645"><nobr>1211 (95.0%)</nobr></div> <div style="position:absolute;top:20087;left:261"><nobr>SLR-level</nobr></div> <div style="position:absolute;top:20087;left:340"><nobr>1 (1.3%)</nobr></div> <div style="position:absolute;top:20087;left:417"><nobr>11 (14.7%)</nobr></div> <div style="position:absolute;top:20087;left:519"><nobr>-</nobr></div> <div style="position:absolute;top:20087;left:570"><nobr>24 (32.0%)</nobr></div> <div style="position:absolute;top:20087;left:651"><nobr>39 (52.0%)</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:20105;left:303"><nobr>Table 12: Summary of Levels by CGS for NEFTune</nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:20167;left:287"><nobr>Level of analysis</nobr></div> <div style="position:absolute;top:20167;left:386"><nobr>REFUTED</nobr></div> <div style="position:absolute;top:20167;left:458"><nobr>NOT ENOUGH INFO</nobr></div> <div style="position:absolute;top:20167;left:584"><nobr>SUPPORTED</nobr></div> <div style="position:absolute;top:20190;left:315"><nobr>Paper-level</nobr></div> <div style="position:absolute;top:20190;left:386"><nobr>179 (4.9%)</nobr></div> <div style="position:absolute;top:20190;left:485"><nobr>267 (7.4%)</nobr></div> <div 
style="position:absolute;top:20190;left:585"><nobr>3180 (87.7%)</nobr></div> <div style="position:absolute;top:20205;left:266"><nobr>Paper-summary-level</nobr></div> <div style="position:absolute;top:20205;left:389"><nobr>44 (3.5%)</nobr></div> <div style="position:absolute;top:20205;left:488"><nobr>14 (1.1%)</nobr></div> <div style="position:absolute;top:20205;left:585"><nobr>1217 (95.5%)</nobr></div> <div style="position:absolute;top:20220;left:319"><nobr>SLR-level</nobr></div> <div style="position:absolute;top:20220;left:386"><nobr>32 (42.7%)</nobr></div> <div style="position:absolute;top:20220;left:491"><nobr>1 (1.3%)</nobr></div> <div style="position:absolute;top:20220;left:591"><nobr>42 (56.0%)</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:20237;left:205"><nobr>Table 13: Description of FEVER results at different levels of questions for NEFTune</nobr></div> <div style="position:absolute;top:20298;left:451"><nobr>17</nobr></div> </span></font> <a href="#17" style="border:1px solid #0000ff;position:absolute;top:19849;left:538;width:18;height:16;display:block"></a> <div style="position:absolute;top:20371;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=18><b>Page 18</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:20426;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:20427;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:20484;left:108"><nobr>In the FEVER metric results in Table <a href="#17">13,</a> the 
results harmonize with the CGS findings, demonstrating the model’s</nobr></div> <div style="position:absolute;top:20500;left:108"><nobr>robust accuracy with a majority of ’Supported’ classifications at both the paper level (87.7%) and paper-summary level</nobr></div> <div style="position:absolute;top:20517;left:108"><nobr>(95.5%). Again, a divergent scenario unfolds at the SLR-level, where a considerable 42.7% of responses are categorized</nobr></div> <div style="position:absolute;top:20533;left:108"><nobr>as ’Refuted’.</nobr></div> <div style="position:absolute;top:20557;left:108"><nobr>The analyses of CGS and FEVER metrics, taken together, paint a comprehensive picture of NEFTune’s strengths.</nobr></div> <div style="position:absolute;top:20574;left:107"><nobr>While the model’s performance is robust in factual replication at paper-level Q&As, the performance on SLR-level data</nobr></div> <div style="position:absolute;top:20590;left:108"><nobr>points suggests an area for improvement, either in the calibration of the FEVER metric or in optimizing the learning for</nobr></div> <div style="position:absolute;top:20607;left:108"><nobr>these data points by increasing the size of the finetuning dataset for these samples. 
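The combined-consistency figures quoted above can be recomputed directly from the counts reported in Table 12. A minimal sketch (counts transcribed from the table; rating labels follow the table header):

```python
# Recompute CGS rating shares from the counts reported in Table 12.
# Ratings: -2 contradictory, -1 partially contradictory, 0 not enough info,
#           1 partially consistent, 2 fully consistent.
COUNTS = {
    "Paper-level":         {-2: 11, -1: 171, 0: 55, 1: 378, 2: 3003},
    "Paper-summary-level": {-2: 6,  -1: 23,  0: 2,  1: 33,  2: 1211},
    "SLR-level":           {-2: 1,  -1: 11,  0: 0,  1: 24,  2: 39},
}

def share(level, ratings):
    """Percentage of responses at `level` whose CGS rating is in `ratings`."""
    counts = COUNTS[level]
    total = sum(counts.values())
    return round(100 * sum(counts[r] for r in ratings) / total, 1)

for level in COUNTS:
    print(level, share(level, (2,)), share(level, (1, 2)))
```

Combining ’Partially consistent’ with ’Fully consistent’ yields 93.4% at the paper level and 84.0% at the SLR level, matching the rounded figures discussed in the text.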
However, it should be noted that</nobr></div> <div style="position:absolute;top:20623;left:108"><nobr>despite the lower accuracies and a reduced level of confidence in the model’s outputs on the SLR-level questions, the</nobr></div> <div style="position:absolute;top:20639;left:108"><nobr>answers to SLR-level questions can nonetheless be verified and accurately reconstructed from the more accurate paper-summary-level</nobr></div> <div style="position:absolute;top:20656;left:108"><nobr>responses.</nobr></div> <div style="position:absolute;top:20696;left:108"><nobr>5.5 Qualitative analysis accuracies</nobr></div> <div style="position:absolute;top:20728;left:108"><nobr>The qualitative evaluations, conducted manually across the eleven NEFTune LLM responses to questions about key</nobr></div> <div style="position:absolute;top:20744;left:108"><nobr>SLR findings in the target paper, which represents the ground truth, reveal a high level of fidelity and alignment between</nobr></div> <div style="position:absolute;top:20761;left:108"><nobr>the reference and the LLM’s responses. An example of a typical response can be seen in Table <a href="#18">14</a>. The analysis</nobr></div> <div style="position:absolute;top:20777;left:108"><nobr>showed that in eight out of eleven cases, the key findings from the SLR ground truth could be faithfully extracted and</nobr></div> <div style="position:absolute;top:20793;left:108"><nobr>reconstructed from the finetuned LLM’s responses. However, in two instances, a deeper</nobr></div> <div style="position:absolute;top:20810;left:108"><nobr>understanding of certain domain-specific concepts such as prescriptive analytics and model interpretability within the</nobr></div> <div style="position:absolute;top:20826;left:108"><nobr>context of learning analytics was needed to arrive at the ground truth conclusions from the LLM responses. 
These</nobr></div> <div style="position:absolute;top:20842;left:108"><nobr>instances highlighted a divergence in the conceptualization of certain technical terms between the LLM and the SLR</nobr></div> <div style="position:absolute;top:20859;left:108"><nobr>study, necessitating domain expertise to align the LLM’s broader interpretations with the SLR’s findings. Nonetheless,</nobr></div> <div style="position:absolute;top:20875;left:108"><nobr>it can be concluded that in 90% of the cases, the ground truth findings from the SLR study are supported by the LLM’s</nobr></div> <div style="position:absolute;top:20892;left:108"><nobr>responses. However, there was one notable LLM response that introduced inaccuracies concerning</nobr></div> <div style="position:absolute;top:20908;left:108"><nobr>the number of participants across all the studies in 17 papers. It was only in this response that a deviation was observed,</nobr></div> <div style="position:absolute;top:20924;left:108"><nobr>and in this instance, it could be confirmed that the LLM response did not support the ground truth findings in the SLR</nobr></div> <div style="position:absolute;top:20941;left:108"><nobr>study. 
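The headline figure above can be tallied directly from the eleven case outcomes; the labels below are illustrative shorthand, not the study’s own coding scheme:

```python
from collections import Counter

# Outcomes of the eleven manual evaluations described above (illustrative labels).
outcomes = (
    ["faithful"] * 8           # ground truth directly reconstructable
    + ["needs_expertise"] * 2  # reconstructable only with domain expertise
    + ["unsupported"] * 1      # the participant-count response
)

tally = Counter(outcomes)
supported = tally["faithful"] + tally["needs_expertise"]
print(f"{supported}/{len(outcomes)} supported ({100 * supported / len(outcomes):.1f}%)")
```

Ten of the eleven cases (90.9%) support the ground truth, i.e. the roughly 90% quoted above.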
Example responses from the qualitative analysis can be seen in Appendix<a href="#25"> A.</a></nobr></div> </span></font> <font size=2 face="Times"><span style="font-size:9px;font-family:Times"> <div style="position:absolute;top:20978;left:117"><nobr>Qualitative</nobr></div> <div style="position:absolute;top:20993;left:117"><nobr>evaluation</nobr></div> <div style="position:absolute;top:20978;left:205"><nobr>Response and analysis</nobr></div> <div style="position:absolute;top:21009;left:117"><nobr>Question</nobr></div> <div style="position:absolute;top:21009;left:205"><nobr>According to the 2023SLR dataset, in the 2023SLR papers, were the reported dashboards (LADs) developed mostly in a pilot</nobr></div> <div style="position:absolute;top:21024;left:205"><nobr>study state, or a prototype for academic purposes, or were they placed into production?</nobr></div> <div style="position:absolute;top:21047;left:117"><nobr>Ground truth</nobr></div> <div style="position:absolute;top:21047;left:205"><nobr>"Many of the published dashboards (LADs) are only in their prototype phases, and only few in the pilot implementation stages."</nobr></div> <div style="position:absolute;top:21069;left:117"><nobr>LLM</nobr></div> <div style="position:absolute;top:21069;left:205"><nobr>"In the data used for the 2023SLR, the majority of papers discuss the development and evaluation of LADs as a research</nobr></div> <div style="position:absolute;top:21084;left:205"><nobr>exercise, either piloting the effectiveness of the dashboards or using them as a case study to investigate their utility. Papers that</nobr></div> <div style="position:absolute;top:21099;left:205"><nobr>do not specifically report on the development of LADs as a research effort include Aljohani et al. (2019), Bodily et al. (2018),</nobr></div> <div style="position:absolute;top:21114;left:205"><nobr>Karaoglanyilmaz et al. (2021), Fleur et al. (2020), Salehiankia et al. (2020), Han et al. (2020), Valle et al. 
(2021), Chatti et al.</nobr></div> <div style="position:absolute;top:21129;left:205"><nobr>(2020), He et al. (2019), and Naranjo et al. (2019). Other papers, including Ulfa et al. (2019), Liu et al. (2019), Kokoç et al.</nobr></div> <div style="position:absolute;top:21144;left:205"><nobr>(2021), Valle et al. (2021), and Baneres et al. (2019), provide details on the development and implementation of LADs, with</nobr></div> <div style="position:absolute;top:21159;left:205"><nobr>some studying specific dashboards and others discussing the design process. Additionally, Owatari et al. (2020) and Aljohani</nobr></div> <div style="position:absolute;top:21174;left:205"><nobr>et al. (2019) report on the development of LADs as advanced informational systems specifically designed for learning support.</nobr></div> <div style="position:absolute;top:21189;left:205"><nobr>Source: 2023SLR"</nobr></div> <div style="position:absolute;top:21212;left:117"><nobr>Analysis</nobr></div> <div style="position:absolute;top:21212;left:205"><nobr>LLM’s answer can be seen as an expanded version of the ground truth, maintaining the core assertion while adding context</nobr></div> <div style="position:absolute;top:21227;left:205"><nobr>and detail. 
Therefore, the LLM’s response can be used to reconstruct the ground truth for the SLR with a high degree of</nobr></div> <div style="position:absolute;top:21242;left:205"><nobr>completeness and accuracy, as it includes all the essential information of the ground truth, while extending it with specific</nobr></div> <div style="position:absolute;top:21257;left:205"><nobr>examples and a broader examination of the LADs’ development stages.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:21274;left:236"><nobr>Table 14: Comparison of Question and Ground Truth with LLM Response</nobr></div> <div style="position:absolute;top:21324;left:108"><nobr>The analyses consistently demonstrate that the finetuned LLM’s responses tend to provide a broader and more detailed</nobr></div> <div style="position:absolute;top:21341;left:108"><nobr>perspective than the ground truth, which is often more concise and sometimes more definitive in its assertions. Below is a</nobr></div> <div style="position:absolute;top:21357;left:108"><nobr>synthesis of the overarching patterns observed:</nobr></div> <div style="position:absolute;top:21392;left:143"><nobr>1. 
Expansion and Detailing: The LLM frequently expands upon the information provided in the ground truth,</nobr></div> <div style="position:absolute;top:21408;left:162"><nobr>introducing additional context, examples, and details that were not explicitly mentioned in the ground truth.</nobr></div> <div style="position:absolute;top:21425;left:161"><nobr>This pattern is evident in analyses where the LLM’s responses offer a deeper dive into specific studies,</nobr></div> <div style="position:absolute;top:21441;left:162"><nobr>technologies, and outcomes associated with LADs, enhancing the understanding of the subject matter beyond</nobr></div> <div style="position:absolute;top:21486;left:451"><nobr>18</nobr></div> 
This alignment indicates that the finetuned LLM successfully</nobr></div> <div style="position:absolute;top:21744;left:162"><nobr>captures the essence of the ground truth findings, while adding value through elaboration and exemplification.</nobr></div> <div style="position:absolute;top:21767;left:143"><nobr>3. Discrepancies in Emphasis and Scope: Some discrepancies arise primarily from differences in emphasis and</nobr></div> <div style="position:absolute;top:21783;left:162"><nobr>scope. For instance, while the ground truth response might highlight a lack or absence of certain features or</nobr></div> <div style="position:absolute;top:21799;left:162"><nobr>trends within the LADs research, the LLM occasionally points out exceptions or minor trends that counteract</nobr></div> <div style="position:absolute;top:21816;left:162"><nobr>these broad strokes. These discrepancies do not necessarily contradict the ground truth but rather suggest a more</nobr></div> <div style="position:absolute;top:21832;left:161"><nobr>variegated picture, which can give the researcher more scope for expansion in an academic paper.</nobr></div> <div style="position:absolute;top:21855;left:143"><nobr>4. Reconstruction and Completeness: The degree to which the ground truth can be reconstructed from the LLM’s</nobr></div> <div style="position:absolute;top:21871;left:162"><nobr>responses is generally very high, reflecting the LLM’s ability to both mirror and, in many responses, extend the</nobr></div> <div style="position:absolute;top:21888;left:162"><nobr>ground truth in the target paper. 
In cases where the LLM’s detailed accounts align closely with ground truth,</nobr></div> <div style="position:absolute;top:21904;left:162"><nobr>the reconstruction ability is high, indicating a strong corroboration of ground truth by the LLM.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:21945;left:108"><nobr>6 Discussion</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:21982;left:108"><nobr>This study has marked a pivotal shift towards advancing the use of AI and LLMs for automating systematic literature</nobr></div> <div style="position:absolute;top:21999;left:108"><nobr>review methodologies, with a particular focus on the knowledge and evidence synthesis phase. The LLM finetuning</nobr></div> <div style="position:absolute;top:22015;left:108"><nobr>approaches and the novel data extraction processes make a significant contribution to academic research practices, with</nobr></div> <div style="position:absolute;top:22031;left:108"><nobr>even broader applications to other domains that require synthesis across multiple documents while</nobr></div> <div style="position:absolute;top:22048;left:108"><nobr>supporting question-answering tasks. This study has demonstrated that modern PEFT finetuning approaches</nobr></div> <div style="position:absolute;top:22064;left:108"><nobr>like NEFTune and LoRA can indeed be used for finetuning effective LLMs (RQ1), and the proposed automated approach for</nobr></div> <div style="position:absolute;top:22081;left:108"><nobr>extracting finetuning datasets from a corpus of selected academic papers can successfully construct finetuning datasets</nobr></div> <div style="position:absolute;top:22097;left:108"><nobr>(RQ2) that support question-and-answer downstream tasks necessary for executing the final stages of an SLR. 
The</nobr></div> <div style="position:absolute;top:22113;left:108"><nobr>study also affirms that LLMs can be effectively finetuned on relatively small datasets (RQ3) that represent</nobr></div> <div style="position:absolute;top:22130;left:108"><nobr>narrow domains associated with the focused objectives of SLR studies. A significant set of challenges that this study</nobr></div> <div style="position:absolute;top:22146;left:108"><nobr>addressed and provided solutions for concerned the broader problem of LLM hallucination, ensuring that the LLM’s</nobr></div> <div style="position:absolute;top:22162;left:108"><nobr>responses were solely based on the corpus of the selected academic papers, and how response provenance and the</nobr></div> <div style="position:absolute;top:22179;left:108"><nobr>factuality of the responses could be evaluated (RQ4). We devised a novel token-based approach that enabled us to audit</nobr></div> <div style="position:absolute;top:22195;left:108"><nobr>and verify the source information of the LLM responses. We also developed evaluation metrics for factuality and</nobr></div> <div style="position:absolute;top:22211;left:108"><nobr>demonstrated how they can be effectively automated with the assistance of commercial LLMs. Finally,</nobr></div> <div style="position:absolute;top:22228;left:108"><nobr>the entire proposed framework for automating the final stages of an SLR was validated for viability by replicating an</nobr></div> <div style="position:absolute;top:22244;left:108"><nobr>actual published SLR study, using it as a benchmark and source of ground truth. 
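A minimal sketch of the provenance audit that such source tokens enable, assuming (as in the responses shown in Tables 11 and 14) that each finetuned response ends with a "Source:" tag naming a corpus key; the corpus keys and the function itself are hypothetical, not the study's actual implementation:

```python
import re

# Hypothetical bibliography keys of the papers in the finetuning corpus.
CORPUS_KEYS = {"fleur2020motivation", "bodily2018", "2023SLR"}

SOURCE_RE = re.compile(r"Source:\s*(\S+)")

def audit_provenance(response):
    """Return (ok, cited): ok is True when the response cites at least one
    source token and every cited key belongs to the known corpus."""
    cited = SOURCE_RE.findall(response)
    ok = bool(cited) and all(key in CORPUS_KEYS for key in cited)
    return ok, cited

ok, cited = audit_provenance(
    "...no significant difference between the groups. Source: fleur2020motivation"
)
print(ok, cited)
```

Responses with no source tag, or with a tag outside the corpus, would be flagged for manual review rather than accepted.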
Our experiments indeed confirm the</nobr></div> <div style="position:absolute;top:22261;left:108"><nobr>reliability of the proposed finetuning framework and its fidelity in replicating findings from an SLR study (RQ5).</nobr></div> <div style="position:absolute;top:22285;left:108"><nobr>The contributions from this study fill both an existing gap in the literature and a real need in academic research for</nobr></div> <div style="position:absolute;top:22302;left:108"><nobr>automating SLR processes during the knowledge synthesis phase, since most efforts have focused on using AI and</nobr></div> <div style="position:absolute;top:22318;left:108"><nobr>machine learning to solve tasks such as paper searches and filtering. While an increasing number of AI technologies</nobr></div> <div style="position:absolute;top:22334;left:108"><nobr>are emerging to support academic research in the form of features enabling researchers to “talk to” individual papers,</nobr></div> <div style="position:absolute;top:22351;left:108"><nobr>they often have limitations, a key one being the inability to “talk across” multiple papers with high</nobr></div> <div style="position:absolute;top:22367;left:108"><nobr>reliability and the ability to track the sources of all responses. These limitations and the lack of control over the existing</nobr></div> <div style="position:absolute;top:22383;left:108"><nobr>commercial solutions raise issues of trust as well as replicability, which this study overcomes.</nobr></div> <div style="position:absolute;top:22408;left:108"><nobr>The solutions demonstrated in this study are not without their imperfections, but future developments can overcome</nobr></div> <div style="position:absolute;top:22424;left:108"><nobr>them. The first involves the low inter-rater agreement achieved in this study, particularly for the FEVER metric, which</nobr></div> <div style="position:absolute;top:22441;left:108"><nobr>compromised some of its validity. 
This will be addressed in subsequent studies through a thorough training protocol</nobr></div> <div style="position:absolute;top:22457;left:108"><nobr>that gives human evaluators more examples of how to rate responses. The data extraction process can also</nobr></div> <div style="position:absolute;top:22473;left:108"><nobr>be improved by extracting more than just a single Q&A pair per chunk and section of each paper. This can instead</nobr></div> <div style="position:absolute;top:22490;left:108"><nobr>be expanded to extracting multiple Q&A pairs, which will encode more details and information</nobr></div> <div style="position:absolute;top:22506;left:108"><nobr>from each paper and thus have a greater chance of capturing all the important facts from each study. The quality and</nobr></div> <div style="position:absolute;top:22522;left:108"><nobr>capability of the finetuned models is, to a significant degree, a function of the data extraction phase. The model</nobr></div> <div style="position:absolute;top:22539;left:108"><nobr>can only know what it has been taught. Deficiencies and omissions in the data extraction step will hinder the depth of</nobr></div> <div style="position:absolute;top:22555;left:108"><nobr>knowledge synthesis and findings that can be supported for an SLR; therefore, most gains in the quality and factuality</nobr></div> <div style="position:absolute;top:22572;left:108"><nobr>of LLM responses stand to be made through the improvement of this process. 
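The proposed expansion, extracting multiple Q&A pairs per chunk rather than one, could be driven by a prompt loop along the following lines. The chunk size, prompt wording, and `n_pairs` parameter are assumptions for illustration; the generation itself would be delegated to the extraction LLM:

```python
def chunk_text(text, max_words=300):
    """Split a paper section into word-bounded chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def extraction_prompts(section, source_key, n_pairs=3):
    """Build one extraction prompt per chunk, each requesting several Q&A
    pairs so that more facts from every chunk are encoded for finetuning."""
    template = (
        "From the following excerpt of {src}, generate {n} distinct "
        "question-answer pairs covering different facts. End every answer "
        "with 'Source: {src}'.\n\nExcerpt:\n{chunk}"
    )
    return [template.format(src=source_key, n=n_pairs, chunk=chunk)
            for chunk in chunk_text(section)]

# A 650-word dummy section yields three chunks, hence three prompts.
prompts = extraction_prompts(" ".join(["fact"] * 650), "fleur2020motivation")
print(len(prompts))
```

Raising `n_pairs` trades extraction cost for denser coverage of each chunk's facts.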
Likewise, the lower accuracies exhibited</nobr></div> <div style="position:absolute;top:22588;left:108"><nobr>on the SLR-level responses can be improved in this manner, by increasing the number of Q&A pairs in this category.</nobr></div> <div style="position:absolute;top:22613;left:108"><nobr>Finally, given the current advancements in generative AI, and LLMs specifically, these technologies</nobr></div> <div style="position:absolute;top:22629;left:108"><nobr>are destined to play a growing role in supporting and assisting future SLR endeavors. To that end, there is a need for</nobr></div> <div style="position:absolute;top:22674;left:451"><nobr>19</nobr></div> </span></font> <div style="position:absolute;top:22747;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=20><b>Page 20</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:22802;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:22803;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:22860;left:108"><nobr>revisiting and updating the PRISMA reporting guidelines. This needs to be undertaken in such a way that researchers in</nobr></div> <div style="position:absolute;top:22876;left:108"><nobr>the future are provided with guidance on what needs to be reported when conducting PRISMA-conforming SLRs using</nobr></div> <div style="position:absolute;top:22893;left:107"><nobr>AI automation tools in order to guarantee transparency and reproducibility. 
Such efforts have already begun <a href="#21">[15</a>] and</nobr></div> <div style="position:absolute;top:22909;left:108"><nobr>need to be finalised.</nobr></div> <div style="position:absolute;top:22949;left:108"><nobr>Study Limitations and Future Work</nobr></div> <div style="position:absolute;top:22980;left:108"><nobr>This study’s exploration of LLMs in SLRs was constrained by model selection, focusing on Mistral without assessing</nobr></div> <div style="position:absolute;top:22996;left:108"><nobr>models larger than 7 billion parameters, which might offer improved outcomes. Future research should broaden its</nobr></div> <div style="position:absolute;top:23013;left:108"><nobr>scope to include a variety of open-source LLMs, evaluating their efficacy in SLR contexts. Dataset access presents a</nobr></div> <div style="position:absolute;top:23029;left:108"><nobr>significant limitation, as copyright restrictions by publishing houses impede the public sharing of even summarised</nobr></div> <div style="position:absolute;top:23045;left:108"><nobr>content, challenging the reproducibility and validation of our findings. Addressing this issue requires navigating</nobr></div> <div style="position:absolute;top:23062;left:108"><nobr>copyright laws to facilitate wider academic verification and exploration. The use of GPT-4 for evaluation reflects the</nobr></div> <div style="position:absolute;top:23078;left:108"><nobr>current state of technology, yet the rapid advancement in LLMs suggests that future iterations will necessitate updated</nobr></div> <div style="position:absolute;top:23094;left:108"><nobr>prompts and methodologies to maintain relevance and accuracy in assessments. 
Cost is a further limitation of</nobr></div> <div style="position:absolute;top:23111;left:108"><nobr>proprietary models like GPT-4, one that can and should be avoided in the future by running local LLMs for data</nobr></div> <div style="position:absolute;top:23127;left:108"><nobr>extraction and evaluation. The study’s focus on singular-topic questions limits the depth of evaluation regarding the</nobr></div> <div style="position:absolute;top:23144;left:108"><nobr>models’ ability to handle complex queries involving multiple concepts. Enhancing the calibration of the CGS and</nobr></div> <div style="position:absolute;top:23160;left:108"><nobr>FEVER metrics through detailed training could yield more reliable evaluations, a necessity for future studies to consider.</nobr></div> <div style="position:absolute;top:23176;left:108"><nobr>Methodologically, the study leaned towards qualitative meta-synthesis, leaving a gap in quantitative meta-analysis</nobr></div> <div style="position:absolute;top:23193;left:108"><nobr>exploration. Addressing this gap, alongside refining hyperparameters and expanding computational resources to better</nobr></div> <div style="position:absolute;top:23209;left:108"><nobr>support extensive data processing, forms a critical pathway for subsequent research. Hyperparameter tuning was</nobr></div> <div style="position:absolute;top:23225;left:107"><nobr>particularly challenging given the long training runtimes for model finetuning, and it is highly likely that suboptimal</nobr></div> <div style="position:absolute;top:23242;left:108"><nobr>hyperparameters were used in this research as a result. Future work should encompass large-scale replication studies</nobr></div> <div style="position:absolute;top:23258;left:108"><nobr>to confirm and extend these initial findings, incorporating both qualitative and quantitative SLR meta-analyses. 
This</nobr></div> <div style="position:absolute;top:23274;left:108"><nobr>includes investigating LLMs’ multimodal capabilities in extracting information from diverse formats such as tables and</nobr></div> <div style="position:absolute;top:23291;left:108"><nobr>figures.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:23334;left:108"><nobr>7 Conclusion</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:23373;left:108"><nobr>This research introduces an SLR-automation framework leveraging finetuned LLMs, presenting a novel and significant</nobr></div> <div style="position:absolute;top:23389;left:108"><nobr>methodological advancement in employing AI for academic research. Our comprehensive experiments with LLM finetuning</nobr></div> <div style="position:absolute;top:23406;left:108"><nobr>demonstrate that these AI technologies can effectively streamline SLRs, ensuring both efficiency and accuracy in</nobr></div> <div style="position:absolute;top:23422;left:108"><nobr>information retrieval. The framework’s effectiveness was validated by accurately replicating a pre-existing SLR study,</nobr></div> <div style="position:absolute;top:23438;left:108"><nobr>showcasing the practical applicability of our methods.</nobr></div> <div style="position:absolute;top:23463;left:108"><nobr>The study not only underscores the potential of AI to enhance the SLR process through systematic and efficient literature</nobr></div> <div style="position:absolute;top:23479;left:108"><nobr>synthesis but also addresses critical challenges such as LLM hallucination and data provenance. 
By tackling these</nobr></div> <div style="position:absolute;top:23496;left:108"><nobr>issues, we ensure the reliability and verifiability of the synthesized content.</nobr></div> <div style="position:absolute;top:23520;left:108"><nobr>In proposing this SLR-automation framework, we contribute to the broader discourse on integrating AI-driven method-</nobr></div> <div style="position:absolute;top:23537;left:108"><nobr>ologies in academic research, advocating for the update of PRISMA guidelines to encapsulate these advanced techniques.</nobr></div> <div style="position:absolute;top:23553;left:108"><nobr>This ensures methodological transparency and rigor in future SLRs. Our work lays a foundation for further exploration</nobr></div> <div style="position:absolute;top:23569;left:108"><nobr>in this area, highlighting the necessity for continuous development of AI tools to enrich and facilitate scholarly research.</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:23613;left:108"><nobr>References</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:23643;left:115"><nobr>[1] Arsenio Paez. Grey literature: An important resource in systematic reviews. Journal of evidence-based medicine,</nobr></div> <div style="position:absolute;top:23660;left:140"><nobr>2017. doi<a href="https://doi.org/10.1111/jebm.12265">:10.1111/jebm.12265.</a></nobr></div> <div style="position:absolute;top:23685;left:115"><nobr>[2] Yu Xiao and M. Watson. Guidance on conducting a systematic literature review. Journal of Planning Education</nobr></div> <div style="position:absolute;top:23701;left:140"><nobr>and Research, 39:93–112, 2017. doi<a href="https://doi.org/10.1177/0739456X17723971">:10.1177/0739456X17723971.</a></nobr></div> <div style="position:absolute;top:23726;left:115"><nobr>[3] Andy P Siddaway, A. Wood, and L. Hedges. 
How to do a systematic review: A best practice guide for conducting</nobr></div> <div style="position:absolute;top:23743;left:140"><nobr>and reporting narrative reviews, meta-analyses, and meta-syntheses. Annual review of psychology, 70:747–770,</nobr></div> <div style="position:absolute;top:23759;left:140"><nobr>2019. doi<a href="https://doi.org/10.1146/annurev-psych-010418-102803">:10.1146/annurev-psych-010418-102803.</a></nobr></div> <div style="position:absolute;top:23784;left:115"><nobr>[4] Matthew J Page, Joanne E McKenzie, Patrick M Bossuyt, Isabelle Boutron, Tammy C Hoffmann, Cynthia D</nobr></div> <div style="position:absolute;top:23801;left:140"><nobr>Mulrow, Larissa Shamseer, Jennifer M Tetzlaff, Elie A Akl, Sue E Brennan, et al. The PRISMA 2020 statement: an</nobr></div> <div style="position:absolute;top:23817;left:140"><nobr>updated guideline for reporting systematic reviews. International journal of surgery, 88:105906, 2021.</nobr></div> <div style="position:absolute;top:23862;left:452"><nobr>20</nobr></div> </span></font> <div style="position:absolute;top:23935;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=21><b>Page 21</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:24048;left:115"><nobr>[5] A. Liberati, D. Altman, J. Tetzlaff, C. Mulrow, P. Gøtzsche, J. Ioannidis, M. Clarke, P. J. 
Devereaux,</nobr></div> <div style="position:absolute;top:24064;left:140"><nobr>J. Kleijnen, and D. Moher. The PRISMA statement for reporting systematic reviews and meta-analyses of studies</nobr></div> <div style="position:absolute;top:24081;left:140"><nobr>that evaluate health care interventions: explanation and elaboration. Journal of clinical epidemiology, 62(10):</nobr></div> <div style="position:absolute;top:24097;left:140"><nobr>e1–34, 2009. doi:<a href="https://doi.org/10.1016/j.jclinepi.2009.06.006">10.1016/j.jclinepi.2009.06.006.</a></nobr></div> <div style="position:absolute;top:24119;left:115"><nobr>[6] M. Rethlefsen, S. Kirtley, S. Waffenschmidt, A. P. Ayala, D. Moher, M. Page, and Jonathan B. Koffel. PRISMA-S:</nobr></div> <div style="position:absolute;top:24136;left:140"><nobr>an extension to the PRISMA statement for reporting literature searches in systematic reviews. Journal of the Medical</nobr></div> <div style="position:absolute;top:24152;left:140"><nobr>Library Association: JMLA, 109:174–200, 2021. doi<a href="https://doi.org/10.5195/jmla.2021.962">:10.5195/jmla.2021.962.</a></nobr></div> <div style="position:absolute;top:24174;left:115"><nobr>[7] Ralph I. Williams, L. Clark, W. R. Clark, and Deana M. Raffo. Re-examining systematic literature review</nobr></div> <div style="position:absolute;top:24191;left:140"><nobr>in management research: Additional benefits and execution protocols. European Management Journal, 2020.</nobr></div> <div style="position:absolute;top:24207;left:140"><nobr>doi<a href="https://doi.org/10.1016/J.EMJ.2020.09.007">:10.1016/J.EMJ.2020.09.007.</a></nobr></div> <div style="position:absolute;top:24229;left:115"><nobr>[8] José de la Torre-López, Aurora Ramírez, and José Raúl Romero. Artificial intelligence to automate the systematic</nobr></div> <div style="position:absolute;top:24246;left:140"><nobr>review of scientific literature. 
Computing, 105(10):2171–2194, 2023.</nobr></div> <div style="position:absolute;top:24268;left:115"><nobr>[9] Chaomei Chen and Min Song. Visualizing a field of research: A methodology of systematic scientometric reviews.</nobr></div> <div style="position:absolute;top:24284;left:140"><nobr>PLoS ONE, 14, 2019. doi:<a href="https://doi.org/10.1371/journal.pone.0223994">10.1371/journal.pone.0223994.</a></nobr></div> <div style="position:absolute;top:24306;left:108"><nobr>[10] R. Ganann, D. Ciliska, and H. Thomas. Expediting systematic reviews: methods and implications of rapid reviews.</nobr></div> <div style="position:absolute;top:24323;left:140"><nobr>Implementation Science: IS, 5:56, 2010. doi<a href="https://doi.org/10.1186/1748-5908-5-56">:10.1186/1748-5908-5-56.</a></nobr></div> <div style="position:absolute;top:24345;left:108"><nobr>[11] E. McInnes, Lesley Duff, and M. McClarey. Challenges in updating a systematic review. Nursing Times Research,</nobr></div> <div style="position:absolute;top:24361;left:140"><nobr>4:66–71, 1999. doi<a href="https://doi.org/10.1177/136140969900400111">:10.1177/136140969900400111.</a></nobr></div> <div style="position:absolute;top:24384;left:108"><nobr>[12] Quenby Mahood, D. Van Eerd, and E. Irvin. Searching for grey literature for systematic reviews: challenges and</nobr></div> <div style="position:absolute;top:24400;left:140"><nobr>benefits. Research Synthesis Methods, 5:221–234, 2014. doi<a href="https://doi.org/10.1002/jrsm.1106">:10.1002/jrsm.1106.</a></nobr></div> <div style="position:absolute;top:24422;left:108"><nobr>[13] Mohamed B. Elamin, David N. Flynn, D. Bassler, M. Briel, P. Alonso-Coello, P. Karanicolas, G. Guyatt,</nobr></div> <div style="position:absolute;top:24439;left:140"><nobr>G. Málaga, T. Furukawa, R. Kunz, H. Schünemann, M. Murad, C. Barbui, A. Cipriani, and V. Montori. 
Choice</nobr></div> <div style="position:absolute;top:24455;left:140"><nobr>of data extraction tools for systematic reviews depends on resources and review complexity. Journal of clinical</nobr></div> <div style="position:absolute;top:24471;left:140"><nobr>epidemiology, 62(5):506–10, 2009. doi:<a href="https://doi.org/10.1016/j.jclinepi.2008.10.016">10.1016/j.jclinepi.2008.10.016.</a></nobr></div> <div style="position:absolute;top:24494;left:108"><nobr>[14] Duy Duc An Bui, Guilherme Del Fiol, John F Hurdle, and Siddhartha Jonnalagadda. Extractive text summarization</nobr></div> <div style="position:absolute;top:24510;left:140"><nobr>system to aid data extraction from full text in systematic review development. Journal of biomedical informatics,</nobr></div> <div style="position:absolute;top:24526;left:140"><nobr>64:265–272, 2016.</nobr></div> <div style="position:absolute;top:24549;left:108"><nobr>[15] Teo Susnjak. PRISMA-DFLLM: An extension of PRISMA for systematic literature reviews using domain-specific</nobr></div> <div style="position:absolute;top:24565;left:140"><nobr>finetuned large language models. arXiv preprint<a href="http://arxiv.org/abs/2306.14905"> arXiv:2306.14905,</a> 2023.</nobr></div> <div style="position:absolute;top:24587;left:108"><nobr>[16] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre,</nobr></div> <div style="position:absolute;top:24604;left:140"><nobr>Ilana Heintz, and Dan Roth. Recent advances in natural language processing via large pre-trained language</nobr></div> <div style="position:absolute;top:24620;left:140"><nobr>models: A survey. ACM Comput. Surv., 56(2), sep 2023. ISSN 0360-0300. doi<a href="https://doi.org/10.1145/3605943">:10.1145/3605943</a>. 
URL</nobr></div> <div style="position:absolute;top:24637;left:140"><nobr><a href="https://doi.org/10.1145/3605943">https://doi.org/10.1145/3605943.</a></nobr></div> <div style="position:absolute;top:24659;left:108"><nobr>[17] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang</nobr></div> <div style="position:absolute;top:24675;left:140"><nobr>Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on</nobr></div> <div style="position:absolute;top:24691;left:140"><nobr>evaluation of large language models. ACM Trans. Intell. Syst. Technol., 15(3), mar 2024. ISSN 2157-6904.</nobr></div> <div style="position:absolute;top:24708;left:140"><nobr>doi<a href="https://doi.org/10.1145/3641289">:10.1145/3641289</a>. URL https://doi.org/10.1145/3641289.</nobr></div> <div style="position:absolute;top:24730;left:108"><nobr>[18] Xinying Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John C. Grundy,</nobr></div> <div style="position:absolute;top:24746;left:140"><nobr>and Haoyu Wang. Large language models for software engineering: A systematic literature review. ArXiv,</nobr></div> <div style="position:absolute;top:24763;left:140"><nobr>abs/2308.10620, 2023. doi<a href="https://doi.org/10.48550/arXiv.2308.10620">:10.48550/arXiv.2308.10620.</a></nobr></div> <div style="position:absolute;top:24785;left:108"><nobr>[19] Adam Tauman Kalai and Santosh S Vempala. Calibrated language models must hallucinate. arXiv preprint</nobr></div> <div style="position:absolute;top:24801;left:140"><nobr><a href="http://arxiv.org/abs/2311.14648">arXiv:2311.14648,</a> 2023.</nobr></div> <div style="position:absolute;top:24824;left:108"><nobr>[20] Timothy R. McIntosh, Tong Liu, Teo Susnjak, Paul Watters, Alex Ng, and Malka N. Halgamuge. A culturally</nobr></div> <div style="position:absolute;top:24840;left:140"><nobr>sensitive test to evaluate nuanced GPT hallucination. 
IEEE Transactions on Artificial Intelligence, pages 1–13,</nobr></div> <div style="position:absolute;top:24856;left:140"><nobr>2023. doi<a href="https://doi.org/10.1109/TAI.2023.3332837">:10.1109/TAI.2023.3332837.</a></nobr></div> <div style="position:absolute;top:24879;left:108"><nobr>[21] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination</nobr></div> <div style="position:absolute;top:24895;left:140"><nobr>in large vision-language models. ArXiv, pages 292–305, 2023. doi<a href="https://doi.org/10.48550/arXiv.2305.10355">:10.48550/arXiv.2305.10355.</a></nobr></div> <div style="position:absolute;top:24917;left:108"><nobr>[22] A. Nashwan and Jaber H Jaradat. Streamlining systematic reviews: Harnessing large language models for quality</nobr></div> <div style="position:absolute;top:24934;left:140"><nobr>assessment and risk-of-bias evaluation. Cureus, 15, 2023. doi<a href="https://doi.org/10.7759/cureus.43023">:10.7759/cureus.43023.</a></nobr></div> <div style="position:absolute;top:24956;left:108"><nobr>[23] Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch, and Kristin Hadfield. Can large language</nobr></div> <div style="position:absolute;top:24972;left:140"><nobr>models replace humans in the systematic review process? Evaluating GPT-4’s efficacy in screening and ex-</nobr></div> <div style="position:absolute;top:24989;left:140"><nobr>tracting data from peer-reviewed and grey literature in multiple languages. 
ArXiv, abs/2310.17526, 2023.</nobr></div> <div style="position:absolute;top:25005;left:140"><nobr>doi<a href="https://doi.org/10.48550/arXiv.2310.17526">:10.48550/arXiv.2310.17526.</a></nobr></div> <div style="position:absolute;top:25050;left:452"><nobr>21</nobr></div> </span></font> <div style="position:absolute;top:25123;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=22><b>Page 22</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:25236;left:108"><nobr>[24] R. Fernandez, Aaron J. Elmore, M. Franklin, Sanjay Krishnan, and Chenhao Tan. How large language models</nobr></div> <div style="position:absolute;top:25252;left:140"><nobr>will disrupt data management. Proc. VLDB Endow., 16:3302–3309, 2023. doi<a href="https://doi.org/10.14778/3611479.3611527">:10.14778/3611479.3611527.</a></nobr></div> <div style="position:absolute;top:25276;left:108"><nobr>[25] Yifan Peng, Justin F Rousseau, Edward H Shortliffe, and Chunhua Weng. AI-generated text may have a role in</nobr></div> <div style="position:absolute;top:25292;left:140"><nobr>evidence-based medicine. Nature medicine, 29(7):1593–1594, 2023.</nobr></div> <div style="position:absolute;top:25316;left:108"><nobr>[26] Linda C Smith. Reviews and reviewing: Approaches to research synthesis. 
An Annual Review of Information</nobr></div> <div style="position:absolute;top:25332;left:140"><nobr>Science and Technology (ARIST) paper. Journal of the Association for Information Science and Technology, 75(3):</nobr></div> <div style="position:absolute;top:25349;left:140"><nobr>245–267, 2024.</nobr></div> <div style="position:absolute;top:25373;left:108"><nobr>[27] Riaz Qureshi, Daniel Shaughnessy, Kayden AR Gill, Karen A Robinson, Tianjing Li, and Eitan Agai. Are ChatGPT</nobr></div> <div style="position:absolute;top:25389;left:140"><nobr>and large language models “the answer” to bringing us closer to systematic review automation? Systematic</nobr></div> <div style="position:absolute;top:25405;left:140"><nobr>Reviews, 12(1):72, 2023.</nobr></div> <div style="position:absolute;top:25429;left:108"><nobr>[28] Francisco Bolanos, Angelo Salatino, Francesco Osborne, and Enrico Motta. Artificial intelligence for literature</nobr></div> <div style="position:absolute;top:25445;left:140"><nobr>reviews: Opportunities and challenges. arXiv preprint<a href="http://arxiv.org/abs/2402.08565"> arXiv:2402.08565,</a> 2024.</nobr></div> <div style="position:absolute;top:25469;left:108"><nobr>[29] Chitu Okoli. A guide to conducting a standalone systematic literature review. Communications of the Association</nobr></div> <div style="position:absolute;top:25486;left:140"><nobr>for Information Systems, 37, 2015.</nobr></div> <div style="position:absolute;top:25509;left:108"><nobr>[30] Maria J Grant and Andrew Booth. A typology of reviews: an analysis of 14 review types and associated</nobr></div> <div style="position:absolute;top:25526;left:140"><nobr>methodologies. Health information & libraries journal, 26(2):91–108, 2009.</nobr></div> <div style="position:absolute;top:25549;left:108"><nobr>[31] Richard J Torraco. Writing integrative literature reviews: Using the past and present to explore the future. 
Human</nobr></div> <div style="position:absolute;top:25566;left:140"><nobr>resource development review, 15(4):404–428, 2016.</nobr></div> <div style="position:absolute;top:25589;left:108"><nobr>[32] Richard J Torraco. Writing integrative literature reviews: Guidelines and examples. Human resource development</nobr></div> <div style="position:absolute;top:25606;left:140"><nobr>review, 4(3):356–367, 2005.</nobr></div> <div style="position:absolute;top:25630;left:108"><nobr>[33] Mary Dixon-Woods, Shona Agarwal, David Jones, Bridget Young, and Alex Sutton. Synthesising qualitative and</nobr></div> <div style="position:absolute;top:25646;left:140"><nobr>quantitative evidence: a review of possible methods. Journal of health services research & policy, 10(1):45–53,</nobr></div> <div style="position:absolute;top:25662;left:140"><nobr>2005.</nobr></div> <div style="position:absolute;top:25686;left:108"><nobr>[34] J. Gurevitch, J. Koricheva, Shinichi Nakagawa, and G. Stewart. Meta-analysis and the science of research synthesis.</nobr></div> <div style="position:absolute;top:25703;left:140"><nobr>Nature, 555:175–182, 2018. doi<a href="https://doi.org/10.1038/nature25753">:10.1038/nature25753.</a></nobr></div> <div style="position:absolute;top:25726;left:108"><nobr>[35] Neal R Haddaway and T. Rytwinski. Meta-analysis is not an exact science: Call for guidance on quantitative</nobr></div> <div style="position:absolute;top:25743;left:140"><nobr>synthesis decisions. Environment international, 114:357–359, 2018. doi:<a href="https://doi.org/10.1016/j.envint.2018.02.018">10.1016/j.envint.2018.02.018.</a></nobr></div> <div style="position:absolute;top:25766;left:108"><nobr>[36] S. Metelli and A. Chaimani. Challenges in meta-analyses with observational studies. Evidence-Based Mental</nobr></div> <div style="position:absolute;top:25783;left:140"><nobr>Health, 23:83 – 87, 2020. 
doi<a href="https://doi.org/10.1136/ebmental-2019-300129">:10.1136/ebmental-2019-300129.</a></nobr></div> <div style="position:absolute;top:25806;left:108"><nobr>[37] Y. Xu. Methodological issues and challenges in data collection and analysis of qualitative meta-synthesis. Asian</nobr></div> <div style="position:absolute;top:25823;left:140"><nobr>nursing research, 2(3):173–83, 2008. doi<a href="https://doi.org/10.1016/S1976-1317(08)60041-9">:10.1016/S1976-1317(08)60041-9.</a></nobr></div> <div style="position:absolute;top:25847;left:108"><nobr>[38] Mohammed A Mohammed, R. Moles, and T. Chen. Meta-synthesis of qualitative research: the challenges and</nobr></div> <div style="position:absolute;top:25863;left:140"><nobr>opportunities. International Journal of Clinical Pharmacy, 38:695–704, 2016. doi<a href="https://doi.org/10.1007/s11096-016-0289-2">:10.1007/s11096-016-0289-2.</a></nobr></div> <div style="position:absolute;top:25887;left:108"><nobr>[39] Lela V. Zimmer. Qualitative meta-synthesis: a question of dialoguing with texts. Journal of advanced nursing, 53</nobr></div> <div style="position:absolute;top:25903;left:140"><nobr>(3):311–8, 2006. doi<a href="https://doi.org/10.1111/J.1365-2648.2006.03721.X">:10.1111/J.1365-2648.2006.03721.X.</a></nobr></div> <div style="position:absolute;top:25927;left:108"><nobr>[40] Q. Hong, P. Pluye, Mathieu Bujold, and M. Wassef. Convergent and sequential synthesis designs: implications for</nobr></div> <div style="position:absolute;top:25943;left:140"><nobr>conducting and reporting systematic reviews of qualitative and quantitative evidence. Systematic Reviews, 6, 2017.</nobr></div> <div style="position:absolute;top:25960;left:140"><nobr>doi<a href="https://doi.org/10.1186/s13643-017-0454-2">:10.1186/s13643-017-0454-2.</a></nobr></div> <div style="position:absolute;top:25983;left:108"><nobr>[41] G. Brunton, S. Oliver, and James Thomas. 
Innovations in framework synthesis as a systematic review method.</nobr></div> <div style="position:absolute;top:26000;left:140"><nobr>Research Synthesis Methods, 11:316 – 330, 2020. doi<a href="https://doi.org/10.1002/jrsm.1399">:10.1002/jrsm.1399.</a></nobr></div> <div style="position:absolute;top:26023;left:108"><nobr>[42] K. Flemming, A. Booth, R. Garside, Ö. Tunçalp, and J. Noyes. Qualitative evidence synthesis for complex</nobr></div> <div style="position:absolute;top:26040;left:140"><nobr>interventions and guideline development: clarification of the purpose, designs and relevant methods. BMJ Global</nobr></div> <div style="position:absolute;top:26056;left:140"><nobr>Health, 4, 2019. doi<a href="https://doi.org/10.1136/bmjgh-2018-000882">:10.1136/bmjgh-2018-000882.</a></nobr></div> <div style="position:absolute;top:26080;left:108"><nobr>[43] J. Higgins, J. López-López, B. Becker, S. Davies, S. Dawson, J. Grimshaw, L. McGuinness, T. Moore, E. Rehfuess,</nobr></div> <div style="position:absolute;top:26096;left:140"><nobr>James Thomas, and D. Caldwell. Synthesising quantitative evidence in systematic reviews of complex health</nobr></div> <div style="position:absolute;top:26113;left:140"><nobr>interventions. BMJ Global Health, 4, 2019. doi<a href="https://doi.org/10.1136/bmjgh-2018-000858">:10.1136/bmjgh-2018-000858.</a></nobr></div> <div style="position:absolute;top:26136;left:108"><nobr>[44] Siddhartha R Jonnalagadda, Pankaj Goyal, and Mark D Huffman. Automating data extraction in systematic</nobr></div> <div style="position:absolute;top:26153;left:140"><nobr>reviews: a systematic review. Systematic Reviews, 4(1), 2015.</nobr></div> <div style="position:absolute;top:26177;left:108"><nobr>[45] AJ Van Altena, R Spijker, and SD Olabarriaga. Usage of automation tools in systematic reviews. 
Research</nobr></div> <div style="position:absolute;top:26193;left:140"><nobr>synthesis methods, 10(1):72–82, 2019.</nobr></div> <div style="position:absolute;top:26238;left:452"><nobr>22</nobr></div> </span></font> <div style="position:absolute;top:26311;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=23><b>Page 23</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:26424;left:108"><nobr>[46] Luyi Feng, Yin Kia Chiam, and Sin Kuang Lo. Text-mining techniques and tools for systematic literature reviews:</nobr></div> <div style="position:absolute;top:26440;left:140"><nobr>A systematic literature review. In 2017 24th Asia-Pacific Software Engineering Conference (APSEC), pages 41–50.</nobr></div> <div style="position:absolute;top:26457;left:140"><nobr>IEEE, 2017.</nobr></div> <div style="position:absolute;top:26479;left:108"><nobr>[47] Henry Müller, Simran Pachnanda, Felix Pahl, and Christopher Rosenqvist. The application of artificial intelligence</nobr></div> <div style="position:absolute;top:26496;left:140"><nobr>on different types of literature reviews - a comparative study. In 2022 International Conference on Applied Artificial</nobr></div> <div style="position:absolute;top:26512;left:140"><nobr>Intelligence (ICAPAI), pages 1–7. IEEE, 2022.</nobr></div> <div style="position:absolute;top:26535;left:108"><nobr>[48] Raymon van Dinter, Bedir Tekinerdogan, and Cagatay Catal. 
Automation of systematic literature reviews: A</nobr></div> <div style="position:absolute;top:26551;left:140"><nobr>systematic literature review. Information and Software Technology, 136:106589, 2021.</nobr></div> <div style="position:absolute;top:26574;left:108"><nobr>[49] Girish Sundaram and Daniel Berleant. Automating systematic literature reviews with natural language processing</nobr></div> <div style="position:absolute;top:26590;left:140"><nobr>and text mining: A systematic literature review. In International Congress on Information and Communication</nobr></div> <div style="position:absolute;top:26607;left:140"><nobr>Technology, pages 73–92. Springer, 2023.</nobr></div> <div style="position:absolute;top:26630;left:108"><nobr>[50] Eugênio Monteiro da Silva Júnior and Moisés Lima Dutra. A roadmap toward the automatic composition of</nobr></div> <div style="position:absolute;top:26646;left:140"><nobr>systematic literature reviews. Iberoamerican Journal of Science Measurement and Communication, 1(2):1–22,</nobr></div> <div style="position:absolute;top:26662;left:140"><nobr>2021.</nobr></div> <div style="position:absolute;top:26685;left:108"><nobr>[51] Gerit Wagner, Roman Lukyanenko, and Guy Paré. Artificial intelligence and the conduct of literature reviews.</nobr></div> <div style="position:absolute;top:26702;left:140"><nobr>Journal of Information Technology, 37(2):209–226, 2022.</nobr></div> <div style="position:absolute;top:26724;left:108"><nobr>[52] Guy Tsafnat, Paul Glasziou, Miew Keen Choong, Adam Dunn, Filippo Galgani, and Enrico Coiera. Systematic</nobr></div> <div style="position:absolute;top:26741;left:140"><nobr>review automation technologies. Systematic reviews, 3:1–15, 2014.</nobr></div> <div style="position:absolute;top:26763;left:108"><nobr>[53] Iain J Marshall and Byron C Wallace. 
Toward systematic review automation: a practical guide to using machine</nobr></div> <div style="position:absolute;top:26780;left:140"><nobr>learning tools in research synthesis. Systematic reviews, 8:1–10, 2019.</nobr></div> <div style="position:absolute;top:26803;left:108"><nobr>[54] Cameron F Atkinson. Cheap, quick, and rigorous: artificial intelligence and the systematic literature review.</nobr></div> <div style="position:absolute;top:26819;left:140"><nobr>Social Science Computer Review, page 08944393231196281, 2023.</nobr></div> <div style="position:absolute;top:26842;left:108"><nobr>[55] Nouf Ibrahim Altmami and Mohamed El Bachir Menai. Automatic summarization of scientific articles: A survey.</nobr></div> <div style="position:absolute;top:26858;left:140"><nobr>Journal of King Saud University-Computer and Information Sciences, 34(4):1011–1028, 2022.</nobr></div> <div style="position:absolute;top:26881;left:108"><nobr>[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and</nobr></div> <div style="position:absolute;top:26897;left:140"><nobr>Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.</nobr></div> <div style="position:absolute;top:26920;left:108"><nobr>[57] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Nee-</nobr></div> <div style="position:absolute;top:26937;left:140"><nobr>lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv</nobr></div> <div style="position:absolute;top:26953;left:140"><nobr>preprint<a href="http://arxiv.org/abs/2005.14165"> arXiv:2005.14165,</a> 2020.</nobr></div> <div style="position:absolute;top:26976;left:108"><nobr>[58] Rohun Gupta, Isabel Herzog, Joseph Weisberger, John Chao, Kongkrit Chaiyasate, and Edward S Lee. 
Utilization</nobr></div> <div style="position:absolute;top:26992;left:140"><nobr>of ChatGPT for plastic surgery research: friend or foe? Journal of Plastic, Reconstructive & Aesthetic Surgery, 80:</nobr></div> <div style="position:absolute;top:27008;left:139"><nobr>145–147, 2023.</nobr></div> <div style="position:absolute;top:27031;left:108"><nobr>[59] James Edward Hill, Catherine Harris, and Andrew Clegg. Methods for using Bing’s AI-powered search engine for</nobr></div> <div style="position:absolute;top:27048;left:140"><nobr>data extraction for a systematic review. Research Synthesis Methods, 2023.</nobr></div> <div style="position:absolute;top:27070;left:108"><nobr>[60] Pablo Castillo-Segura, Carlos Alario-Hoyos, Carlos Delgado Kloos, and Carmen Fernández Panadero. Leveraging</nobr></div> <div style="position:absolute;top:27087;left:140"><nobr>the potential of generative AI to accelerate systematic literature reviews: An example in the area of educational</nobr></div> <div style="position:absolute;top:27103;left:140"><nobr>technology. In 2023 World Engineering Education Forum-Global Engineering Deans Council (WEEF-GEDC),</nobr></div> <div style="position:absolute;top:27120;left:140"><nobr>pages 1–8. IEEE, 2023.</nobr></div> <div style="position:absolute;top:27142;left:108"><nobr>[61] Arun HS Kumar. Analysis of the ChatGPT tool to assess the potential of its utility for academic writing in biomedical</nobr></div> <div style="position:absolute;top:27159;left:140"><nobr>domain. Biology, Engineering, Medicine and Science Reports, 9(1):24–30, 2023.</nobr></div> <div style="position:absolute;top:27182;left:108"><nobr>[62] Robert Zimmermann, Marina Staab, Mehran Nasseri, and Patrick Brandtner. Leveraging large language models</nobr></div> <div style="position:absolute;top:27198;left:140"><nobr>for literature review tasks - a case study using ChatGPT. 
In International Conference on Advanced Research in</nobr></div> <div style="position:absolute;top:27214;left:140"><nobr>Technologies, Information, Innovation and Sustainability, pages 313–323. Springer, 2024.</nobr></div> <div style="position:absolute;top:27237;left:108"><nobr>[63] Ahmad Alshami, Moustafa Elsayed, Eslam Ali, Abdelrahman EE Eltoukhy, and Tarek Zayed. Harnessing the</nobr></div> <div style="position:absolute;top:27253;left:140"><nobr>power of chatgpt for automating systematic review process: Methodology, case study, limitations, and future</nobr></div> <div style="position:absolute;top:27270;left:140"><nobr>directions. Systems, 11(7):351, 2023.</nobr></div> <div style="position:absolute;top:27293;left:108"><nobr>[64] Daniel Najafali, Justin M Camacho, Erik Reiche, Logan G Galbraith, Shane D Morrison, and Amir H Dorafshar.</nobr></div> <div style="position:absolute;top:27309;left:140"><nobr>Truth or lies? the pitfalls and limitations of chatgpt in systematic review creation. Aesthetic Surgery Journal, 43</nobr></div> <div style="position:absolute;top:27325;left:140"><nobr>(8):NP654–NP655, 2023.</nobr></div> <div style="position:absolute;top:27348;left:108"><nobr>[65] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. 
In</nobr></div> <div style="position:absolute;top:27365;left:140"><nobr>Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting</nobr></div> <div style="position:absolute;top:27381;left:140"><nobr>of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland,</nobr></div> <div style="position:absolute;top:27426;left:452"><nobr>23</nobr></div> </span></font> <div style="position:absolute;top:27499;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=24><b>Page 24</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:27554;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:27555;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:27612;left:140"><nobr>May 2022. Association for Computational Linguistics. doi<a href="https://doi.org/10.18653/v1/2022.acl-long.229">:10.18653/v1/2022.acl-long.</a>229. URL https://</nobr></div> <div style="position:absolute;top:27629;left:140"><nobr><a href="https://aclanthology.org/2022.acl-long.229">aclanthology.org/2022.acl-long.229.</a></nobr></div> <div style="position:absolute;top:27651;left:108"><nobr>[66] Renat Aksitov, Chung-Ching Chang, David Reitter, Siamak Shakeri, and Yunhsuan Sung. 
Characterizing</nobr></div> <div style="position:absolute;top:27667;left:140"><nobr>attribution and fluency tradeoffs for retrieval-augmented large language models, 2023.</nobr></div> <div style="position:absolute;top:27689;left:108"><nobr>[67] Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. How language model hallucinations can</nobr></div> <div style="position:absolute;top:27706;left:140"><nobr>snowball, 2023.</nobr></div> <div style="position:absolute;top:27728;left:108"><nobr>[68] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich</nobr></div> <div style="position:absolute;top:27744;left:140"><nobr>Küttler, M. Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented</nobr></div> <div style="position:absolute;top:27761;left:140"><nobr>generation for knowledge-intensive nlp tasks. ArXiv, abs/2005.11401, 2020.</nobr></div> <div style="position:absolute;top:27783;left:108"><nobr>[69] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve,</nobr></div> <div style="position:absolute;top:27799;left:140"><nobr>generate, and critique through self-reflection. ArXiv, abs/2310.11511, 2023. doi<a href="https://doi.org/10.48550/arXiv.2310.11511">:10.48550/arXiv.2310.11511.</a></nobr></div> <div style="position:absolute;top:27822;left:108"><nobr>[70] Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A</nobr></div> <div style="position:absolute;top:27838;left:140"><nobr>Smith. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint<a href="http://arxiv.org/abs/2004.10964"> arXiv:2004.10964,</a></nobr></div> <div style="position:absolute;top:27854;left:140"><nobr>2020.</nobr></div> <div style="position:absolute;top:27877;left:108"><nobr>[71] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification.
arXiv preprint</nobr></div> <div style="position:absolute;top:27893;left:140"><nobr><a href="http://arxiv.org/abs/1801.06146">arXiv:1801.06146,</a> 2018.</nobr></div> <div style="position:absolute;top:27915;left:108"><nobr>[72] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo,</nobr></div> <div style="position:absolute;top:27932;left:140"><nobr>Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on</nobr></div> <div style="position:absolute;top:27948;left:140"><nobr>Machine Learning, pages 2790–2799. PMLR, 2019.</nobr></div> <div style="position:absolute;top:27971;left:108"><nobr>[73] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized</nobr></div> <div style="position:absolute;top:27987;left:140"><nobr>llms. arXiv preprint<a href="http://arxiv.org/abs/2305.14314"> arXiv:2305.14314,</a> 2023.</nobr></div> <div style="position:absolute;top:28009;left:108"><nobr>[74] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu</nobr></div> <div style="position:absolute;top:28026;left:140"><nobr>Chen. Lora: Low-rank adaptation of large language models, 2021.</nobr></div> <div style="position:absolute;top:28048;left:108"><nobr>[75] Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R</nobr></div> <div style="position:absolute;top:28064;left:140"><nobr>Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, et al. Neftune: Noisy embeddings improve</nobr></div> <div style="position:absolute;top:28081;left:140"><nobr>instruction finetuning. arXiv preprint<a href="http://arxiv.org/abs/2310.05914"> arXiv:2310.05914,</a> 2023.</nobr></div> <div style="position:absolute;top:28103;left:108"><nobr>[76] T. Susnjak, G.S. Ramaswami, and A. Mathrani. 
Learning analytics dashboard: a tool for providing actionable</nobr></div> <div style="position:absolute;top:28119;left:140"><nobr>insights to learners. International Journal of Educational Technology in Higher Education, 19(12), 2022.</nobr></div> <div style="position:absolute;top:28136;left:140"><nobr>doi<a href="https://doi.org/10.1186/s41239-021-00313-7">:10.1186/s41239-021-00313-7.</a></nobr></div> <div style="position:absolute;top:28158;left:108"><nobr>[77] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las</nobr></div> <div style="position:absolute;top:28174;left:140"><nobr>Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne</nobr></div> <div style="position:absolute;top:28191;left:140"><nobr>Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.</nobr></div> <div style="position:absolute;top:28207;left:140"><nobr>Mistral 7b, 2023.</nobr></div> <div style="position:absolute;top:28230;left:108"><nobr>[78] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for</nobr></div> <div style="position:absolute;top:28246;left:140"><nobr>fact extraction and verification. ArXiv, abs/1803.05355, 2018. doi<a href="https://doi.org/10.18653/v1/N18-1074">:10.18653/v1/N18-1074.</a></nobr></div> <div style="position:absolute;top:28268;left:108"><nobr>[79] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, I. Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing</nobr></div> <div style="position:absolute;top:28285;left:140"><nobr>reasoning and acting in language models. ArXiv, abs/2210.03629, 2022.</nobr></div> <div style="position:absolute;top:28307;left:108"><nobr>[80] Giannis Bekoulis, C. Papagiannopoulou, and N. Deligiannis. A review on fact extraction and verification. 
ACM</nobr></div> <div style="position:absolute;top:28323;left:140"><nobr>Computing Surveys (CSUR), 55:1 – 35, 2020. doi<a href="https://doi.org/10.1145/3485127">:10.1145/3485127.</a></nobr></div> <div style="position:absolute;top:28346;left:108"><nobr>[81] Timothy R McIntosh, Teo Susnjak, Tong Liu, Paul Watters, and Malka N Halgamuge. The inadequacy of</nobr></div> <div style="position:absolute;top:28362;left:140"><nobr>reinforcement learning from human feedback-radicalizing large language models via semantic vulnerabilities.</nobr></div> <div style="position:absolute;top:28378;left:140"><nobr>IEEE Transactions on Cognitive and Developmental Systems, 2024.</nobr></div> <div style="position:absolute;top:28401;left:108"><nobr>[82] Timothy McIntosh, Tong Liu, Teo Susnjak, Hooman Alavizadeh, Alex Ng, Raza Nowrozy, and Paul Watters.</nobr></div> <div style="position:absolute;top:28417;left:140"><nobr>Harnessing gpt-4 for generation of cybersecurity grc policies: A focus on ransomware attack mitigation. 
Computers</nobr></div> <div style="position:absolute;top:28434;left:140"><nobr>& security, 134:103424, 2023.</nobr></div> <div style="position:absolute;top:28614;left:452"><nobr>24</nobr></div> </span></font> <a href="https://aclanthology.org/2022.acl-long.229" style="border:1px solid #0000ff;position:absolute;top:27610;left:747;width:64;height:17;display:block"></a> <div style="position:absolute;top:28687;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=25><b>Page 25</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:28742;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:28743;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:28798;left:108"><nobr>A Qualitative Analysis Reference Q&A Examples</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:29195;left:108"><nobr>Figure 3: Example of a correct response. The reference answer from the SLR study is "Predictive modelling has not</nobr></div> <div style="position:absolute;top:29211;left:108"><nobr>featured in a larger percentage of reviewed LAD studies."</nobr></div> <div style="position:absolute;top:29524;left:108"><nobr>Figure 4: Example of a correct response. 
The reference answer from the SLR study is "We find that predictive modeling</nobr></div> <div style="position:absolute;top:29540;left:108"><nobr>functionalities are not used in majority of cases within the reviewed LADs, and examples of interpretability of the</nobr></div> <div style="position:absolute;top:29557;left:108"><nobr>models and the ability to explain their predictions to the learners do not yet exist in published studies."</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:15px;font-family:Times"> <div style="position:absolute;top:29608;left:108"><nobr>B LLM Evaluation Prompts</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:29645;left:108"><nobr>This section illustrates the GPT-4 prompts used to evaluate all responses according to the FEVER and CGS metrics.</nobr></div> <div style="position:absolute;top:29802;left:452"><nobr>25</nobr></div> </span></font> <div style="position:absolute;top:29875;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=26><b>Page 26</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:29930;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:29931;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:30561;left:108"><nobr>Figure 5: Example of an incorrect response. 
The reference answer from the SLR study is "Bodily et al., 2018 (180),</nobr></div> <div style="position:absolute;top:30578;left:108"><nobr>Chen et al., 2019, Aljohani et al., 2019 (86), Ulfa et al., 2019 (67), Majumdar et al., 2019, He et al., 2019 (327), Naranjo</nobr></div> <div style="position:absolute;top:30594;left:108"><nobr>et al., 2019 (64), Baneres et al., 2019 (247), Gras et al., 2020 (127), Karaoglan Yilmaz & Yilmaz, 2020 (81), Fleur et al.,</nobr></div> <div style="position:absolute;top:30610;left:108"><nobr>2020 (79), Chatti et al., 2020 (414), Kia et al., 2020 (449), Owatari et al., 2020 (108), Han et al., 2021 (88), Kokoç &</nobr></div> <div style="position:absolute;top:30627;left:107"><nobr>Altun, 2021 (126), Valle et al., 2021 (179)"</nobr></div> <div style="position:absolute;top:30990;left:452"><nobr>26</nobr></div> </span></font> <div style="position:absolute;top:31063;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=27><b>Page 27</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:31118;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:31119;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:31176;left:108"><nobr>FEVER evaluation prompt:</nobr></div> <div style="position:absolute;top:31202;left:124"><nobr>Your task is to rigorously evaluate the factual accuracy, completeness</nobr></div> <div style="position:absolute;top:31219;left:124"><nobr>and the source information of a given response text against a provided</nobr></div> <div
style="position:absolute;top:31235;left:124"><nobr>ground truth reference text using the FEVER categories from the Natural</nobr></div> <div style="position:absolute;top:31251;left:124"><nobr>Language Processing field. This evaluation is critical for ensuring the</nobr></div> <div style="position:absolute;top:31268;left:124"><nobr>reliability and quality of information synthesis systems. Please follow</nobr></div> <div style="position:absolute;top:31284;left:124"><nobr>these detailed guidelines precisely:</nobr></div> <div style="position:absolute;top:31317;left:124"><nobr>Carefully read and understand the ground truth reference text in its entirety.</nobr></div> <div style="position:absolute;top:31333;left:124"><nobr>Identify all key facts, details, findings, conclusions, and other important</nobr></div> <div style="position:absolute;top:31350;left:124"><nobr>information presented.</nobr></div> <div style="position:absolute;top:31366;left:124"><nobr>Thoroughly read the response text multiple times if needed. Break it down</nobr></div> <div style="position:absolute;top:31382;left:124"><nobr>into distinct claims, statements, or pieces of information.</nobr></div> <div style="position:absolute;top:31399;left:124"><nobr>For each distinct claim or piece of information in the response text, rigorously</nobr></div> <div style="position:absolute;top:31415;left:124"><nobr>assess its factual status against the ground truth reference:</nobr></div> <div style="position:absolute;top:31448;left:124"><nobr>SUPPORTED: The claim and information in the response text are clearly, directly,</nobr></div> <div style="position:absolute;top:31464;left:124"><nobr>completely and comprehensively corroborated and supported by evidence in the</nobr></div> <div style="position:absolute;top:31480;left:124"><nobr>reference text.
Also, the ’Source’ information describing where the response</nobr></div> <div style="position:absolute;top:31497;left:124"><nobr>originated from is identical.</nobr></div> <div style="position:absolute;top:31530;left:124"><nobr>NOT ENOUGH INFO: There is no evidence in the reference text that can either</nobr></div> <div style="position:absolute;top:31546;left:124"><nobr>support or contradict the claims in the response text. The response text lacks</nobr></div> <div style="position:absolute;top:31562;left:124"><nobr>enough information to make a judgement about consistency or contradictions with</nobr></div> <div style="position:absolute;top:31579;left:124"><nobr>respect to the reference text.</nobr></div> <div style="position:absolute;top:31611;left:124"><nobr>REFUTED: Either the claims in the response text are completely, unambiguously</nobr></div> <div style="position:absolute;top:31628;left:124"><nobr>and blatantly contradicted or refuted by evidence in the reference text, and</nobr></div> <div style="position:absolute;top:31644;left:124"><nobr>the details in the response text are inaccurate with respect to the reference</nobr></div> <div style="position:absolute;top:31660;left:124"><nobr>text. Or, the claims are supported between the texts, but the Source information</nobr></div> <div style="position:absolute;top:31677;left:124"><nobr>describing where the response originated from is NOT THE SAME between the two texts.</nobr></div> <div style="position:absolute;top:31710;left:124"><nobr>Evaluate the degree of consistency as well as agreement between the following</nobr></div> <div style="position:absolute;top:31726;left:124"><nobr>texts, and respond with a single FEVER categorical label as defined above. Be</nobr></div> <div style="position:absolute;top:31742;left:124"><nobr>utterly objective, impartial and rigorous in your evaluation.
Base your</nobr></div> <div style="position:absolute;top:31759;left:124"><nobr>assessments solely on the information given, without making unsubstantiated</nobr></div> <div style="position:absolute;top:31775;left:124"><nobr>assumptions or inferences. Check carefully the consistency between the Source</nobr></div> <div style="position:absolute;top:31791;left:124"><nobr>information of the texts.</nobr></div> <div style="position:absolute;top:31824;left:124"><nobr>Reference Text: "<insert text >"</nobr></div> <div style="position:absolute;top:31857;left:124"><nobr>Response Text: "<insert text >"</nobr></div> <div style="position:absolute;top:31890;left:124"><nobr>Respond below with just a single FEVER categorical label.</nobr></div> <div style="position:absolute;top:31922;left:124"><nobr>Evaluation (SUPPORTED or REFUTED or NOT ENOUGH INFO):</nobr></div> <div style="position:absolute;top:32178;left:452"><nobr>27</nobr></div> </span></font> <div style="position:absolute;top:32251;left:0"><hr><table border=0 width=100%><tr><td bgcolor=eeeeee align=right><font face=arial,sans-serif><a name=28><b>Page 28</b></a></font></td></tr></table></div><font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:32306;left:239"><nobr>Automating Research Synthesis with Domain-Specific LLM Fine-Tuning</nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:11px;font-family:Times"> <div style="position:absolute;top:32307;left:739"><nobr>A P<font style="font-size:8px">REPRINT</font></nobr></div> </span></font> <font size=3 face="Times"><span style="font-size:12px;font-family:Times"> <div style="position:absolute;top:32364;left:108"><nobr>Consistency Grading Scale (CGS) quantitative evaluation prompt:</nobr></div> <div style="position:absolute;top:32398;left:124"><nobr>Your task is to rigorously evaluate the factual accuracy, completeness and the source</nobr></div> <div 
style="position:absolute;top:32414;left:124"><nobr>information of a given response text against a provided ground truth reference text.</nobr></div> <div style="position:absolute;top:32430;left:124"><nobr>This evaluation is critical for ensuring the reliability and quality of information</nobr></div> <div style="position:absolute;top:32447;left:124"><nobr>synthesis systems. Please follow these detailed guidelines precisely:</nobr></div> <div style="position:absolute;top:32479;left:124"><nobr>Carefully read and understand the ground truth reference text in its entirety. Identify</nobr></div> <div style="position:absolute;top:32496;left:124"><nobr>all key facts, details, findings, conclusions, and other important information presented.</nobr></div> <div style="position:absolute;top:32512;left:124"><nobr>Thoroughly read the response text multiple times if needed. Break it down into distinct</nobr></div> <div style="position:absolute;top:32528;left:124"><nobr>claims, statements, or pieces of information.</nobr></div> <div style="position:absolute;top:32545;left:124"><nobr>For each distinct claim or piece of information in the response text, rigorously assess</nobr></div> <div style="position:absolute;top:32561;left:124"><nobr>its factual status against the ground truth reference:</nobr></div> <div style="position:absolute;top:32594;left:124"><nobr>FULLY CONSISTENT (2): The claim and information in the response text are clearly,</nobr></div> <div style="position:absolute;top:32610;left:124"><nobr>directly, completely and comprehensively corroborated by evidence in the reference text.</nobr></div> <div style="position:absolute;top:32627;left:124"><nobr>Also, the ’Source’ information describing where the response originated from is</nobr></div> <div style="position:absolute;top:32643;left:124"><nobr>identical.</nobr></div> <div style="position:absolute;top:32676;left:124"><nobr>PARTIALLY CONSISTENT (1): Parts of the response text are supported by the reference
text,</nobr></div> <div style="position:absolute;top:32692;left:124"><nobr>but some details are either somewhat inaccurate, missing, too general or not fully</nobr></div> <div style="position:absolute;top:32708;left:124"><nobr>substantiated from the reference text. Only the Source information describing where the</nobr></div> <div style="position:absolute;top:32725;left:124"><nobr>response originated from is inconsistent.</nobr></div> <div style="position:absolute;top:32758;left:124"><nobr>NOT ENOUGH INFO (0): There is no evidence in the reference that can either support or</nobr></div> <div style="position:absolute;top:32774;left:124"><nobr>contradict the claims in the response text. The response text lacks enough information</nobr></div> <div style="position:absolute;top:32790;left:124"><nobr>to make a judgement about consistency or contradictions with respect to the reference</nobr></div> <div style="position:absolute;top:32807;left:124"><nobr>text.</nobr></div> <div style="position:absolute;top:32839;left:124"><nobr>PARTIALLY CONTRADICTORY (-1): Parts of the response text directly contradict or are</nobr></div> <div style="position:absolute;top:32856;left:124"><nobr>inconsistent with evidence present in the reference text, suggesting factual</nobr></div> <div style="position:absolute;top:32872;left:124"><nobr>inaccuracies.</nobr></div> <div style="position:absolute;top:32905;left:124"><nobr>CONTRADICTORY (-2): The claims in the response text are completely, unambiguously and</nobr></div> <div style="position:absolute;top:32921;left:124"><nobr>blatantly contradicted or refuted by evidence in the reference text. 
The details in the</nobr></div> <div style="position:absolute;top:32938;left:124"><nobr>response text are inaccurate with respect to the reference text.</nobr></div> <div style="position:absolute;top:32970;left:124"><nobr>Evaluate the degree of consistency between the following texts, including the Source</nobr></div> <div style="position:absolute;top:32987;left:124"><nobr>information of the texts, and respond with a single numerical value ranging from -2 to</nobr></div> <div style="position:absolute;top:33003;left:124"><nobr>2 as defined above.</nobr></div> <div style="position:absolute;top:33019;left:124"><nobr>Be utterly objective, impartial and rigorous in your evaluation. Base your assessments</nobr></div> <div style="position:absolute;top:33036;left:124"><nobr>solely on the information given, without making unsubstantiated assumptions or</nobr></div> <div style="position:absolute;top:33052;left:124"><nobr>inferences.</nobr></div> <div style="position:absolute;top:33085;left:124"><nobr>Reference Text: "<insert text >"</nobr></div> <div style="position:absolute;top:33118;left:124"><nobr>Response Text: "<insert text >"</nobr></div> <div style="position:absolute;top:33150;left:124"><nobr>Respond below with just a single number.</nobr></div> <div style="position:absolute;top:33183;left:124"><nobr>Evaluation (-2,-1,0,1,2):</nobr></div> <div style="position:absolute;top:33366;left:452"><nobr>28</nobr></div> </span></font> </body> </html>
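The FEVER prompt above elicits a bare categorical label and the CGS prompt a bare integer on the -2..2 scale, so each raw GPT-4 reply must be parsed and validated before the metrics are aggregated. A minimal sketch of such post-processing, assuming replies arrive as free-form strings; the helper names are illustrative and not taken from the paper's code:

```python
import re

# Labels defined by the FEVER evaluation prompt above.
FEVER_LABELS = ("SUPPORTED", "REFUTED", "NOT ENOUGH INFO")

def parse_fever(reply: str) -> str:
    """Extract the FEVER label from a raw model reply."""
    upper = reply.upper()
    # Check the multi-word label first. A substring match like this would
    # still misread e.g. "NOT SUPPORTED", so real use relies on the prompt
    # constraining the model to emit the label alone.
    for label in ("NOT ENOUGH INFO", "SUPPORTED", "REFUTED"):
        if label in upper:
            return label
    raise ValueError(f"no FEVER label found in: {reply!r}")

def parse_cgs(reply: str) -> int:
    """Extract the CGS score, an integer on the -2..2 scale."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        raise ValueError(f"no score found in: {reply!r}")
    score = int(match.group())
    if not -2 <= score <= 2:
        raise ValueError(f"score {score} is outside the -2..2 scale")
    return score
```

Under this sketch a reply such as "Evaluation: SUPPORTED" maps to the SUPPORTED label and "Evaluation: 1" to a CGS score of 1, while out-of-range or label-free replies are rejected rather than silently counted.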