<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>Do Large Language Models Understand Performance Optimization?</title> <!--Generated on Mon Mar 17 22:26:10 2025 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content="Large languages models, High-performance computing, Performance benchmarking, Code optimization" lang="en" name="keywords"/> <base href="/html/2503.13772v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S1" title="In Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2" title="In Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Related Work</span></a> <ol class="ltx_toclist 
ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2.SS0.SSS0.Px1" title="In 2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">LLMs for Code</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2.SS0.SSS0.Px2" title="In 2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">LLM Benchmarks</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2.SS0.SSS0.Px3" title="In 2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">LLM Agents</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2.SS0.SSS0.Px4" title="In 2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Traditional Performance Analysis Tools</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S3" title="In Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>Benchmark Suite</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4" title="In Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Evaluation</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.SS1" title="In 4. 
Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Experiments Design</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.SS2" title="In 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Experiment 1: Single Serial Optimization</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.SS3" title="In 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.3 </span>Experiment 2: Multiple Serial Optimizations</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.SS4" title="In 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.4 </span>Experiment 3: Parallel Optimization</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.SS5" title="In 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.5 </span>Experiment 4: Time Spent on Applying Optimizations</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.SS6" title="In 4. 
Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.6 </span>Experiment 5: Correctness</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.SS7" title="In 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.7 </span>Experiment 6: HPC Commonsense</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5" title="In Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Case Studies</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS1" title="In 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.1 </span>Performance Optimization Agent</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS2" title="In 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2 </span>Case 1: NPB_CG</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS2.SSS0.Px1" title="In 5.2. Case 1: NPB_CG ‣ 5. 
Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 1</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS2.SSS0.Px2" title="In 5.2. Case 1: NPB_CG ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 2</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS2.SSS0.Px3" title="In 5.2. Case 1: NPB_CG ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 3</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS3" title="In 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.3 </span>Case 2: XSBench</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS3.SSS0.Px1" title="In 5.3. Case 2: XSBench ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 1</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS3.SSS0.Px2" title="In 5.3. Case 2: XSBench ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 2</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS3.SSS0.Px3" title="In 5.3. Case 2: XSBench ‣ 5. 
Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 3</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS4" title="In 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.4 </span>Case 3: LBM D2Q37</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS4.SSS1" title="In 5.4. Case 3: LBM D2Q37 ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.4.1 </span>Single Process</span></a> <ol class="ltx_toclist ltx_toclist_subsubsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS4.SSS1.Px1" title="In 5.4.1. Single Process ‣ 5.4. Case 3: LBM D2Q37 ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 1</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS4.SSS1.Px2" title="In 5.4.1. Single Process ‣ 5.4. Case 3: LBM D2Q37 ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 2</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS4.SSS1.Px3" title="In 5.4.1. Single Process ‣ 5.4. Case 3: LBM D2Q37 ‣ 5. 
Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 3</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS4.SSS2" title="In 5.4. Case 3: LBM D2Q37 ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.4.2 </span>Multiple Processes</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS5" title="In 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.5 </span>Case 4: Minisweep</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS5.SSS0.Px1" title="In 5.5. Case 4: Minisweep ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 1</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS5.SSS0.Px2" title="In 5.5. Case 4: Minisweep ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 2</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.SS5.SSS0.Px3" title="In 5.5. Case 4: Minisweep ‣ 5. 
Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title">Version 3</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S6" title="In Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Discussion</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S7" title="In Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">7 </span>Conclusions</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line ltx_leqno"> <h1 class="ltx_title ltx_title_document">Do Large Language Models Understand Performance Optimization?</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Bowen Cui<sup class="ltx_sup" id="id4.2.id1">∗</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_email"><a href="mailto:bcui2@gmu.edu">bcui2@gmu.edu</a> </span> <span class="ltx_contact ltx_role_affiliation"><span class="ltx_text ltx_affiliation_institution" id="id5.3.id1">George Mason University</span><span class="ltx_text ltx_affiliation_city" id="id6.4.id2">Fairfax</span><span class="ltx_text ltx_affiliation_state" id="id7.5.id3">VA</span><span class="ltx_text ltx_affiliation_country" id="id8.6.id4">USA</span> </span></span></span> <span class="ltx_author_before">, </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Tejas Ramesh<sup class="ltx_sup" id="id9.2.id1">∗</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_email"><a href="mailto:tramesh2@gmu.edu">tramesh2@gmu.edu</a> </span> <span class="ltx_contact 
ltx_role_affiliation"><span class="ltx_text ltx_affiliation_institution" id="id10.3.id1">George Mason University</span><span class="ltx_text ltx_affiliation_city" id="id11.4.id2">Fairfax</span><span class="ltx_text ltx_affiliation_state" id="id12.5.id3">VA</span><span class="ltx_text ltx_affiliation_country" id="id13.6.id4">USA</span> </span></span></span> <span class="ltx_author_before">, </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Oscar Hernandez </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_email"><a href="mailto:oscar@ornl.gov">oscar@ornl.gov</a> </span> <span class="ltx_contact ltx_role_affiliation"><span class="ltx_text ltx_affiliation_institution" id="id14.1.id1">Oak Ridge National Laboratory</span><span class="ltx_text ltx_affiliation_city" id="id15.2.id2">Oak Ridge</span><span class="ltx_text ltx_affiliation_state" id="id16.3.id3">TN</span><span class="ltx_text ltx_affiliation_country" id="id17.4.id4">USA</span> </span></span></span> <span class="ltx_author_before"> and </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Keren Zhou </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_email"><a href="mailto:kzhou6@gmu.edu">kzhou6@gmu.edu</a> </span> <span class="ltx_contact ltx_role_affiliation"><span class="ltx_text ltx_affiliation_institution" id="id18.1.id1">George Mason University</span><span class="ltx_text ltx_affiliation_city" id="id19.2.id2">Fairfax</span><span class="ltx_text ltx_affiliation_state" id="id20.3.id3">VA</span><span class="ltx_text ltx_affiliation_country" id="id21.4.id4">USA</span> </span></span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract.</h6> <p class="ltx_p" id="id22.id1">Large Language Models (LLMs) have emerged as powerful tools for software development tasks such as code completion, translation, and optimization. 
However, their ability to generate efficient and correct code, particularly in complex High-Performance Computing (HPC) contexts, has remained underexplored. To address this gap, this paper presents a comprehensive benchmark suite encompassing multiple critical HPC computational motifs to evaluate the performance of code optimized by state-of-the-art LLMs, including OpenAI o1, Claude-3.5, and Llama-3.2. In addition to analyzing basic computational kernels, we developed an agent system that integrates LLMs to assess their effectiveness in real HPC applications. Our evaluation focused on key criteria such as execution time, correctness, and understanding of HPC-specific concepts. We also compared the results with those achieved using traditional HPC optimization tools. Based on the findings, we recognized the strengths of LLMs in understanding human instructions and performing automated code transformations. However, we also identified significant limitations, including their tendency to generate incorrect code and their challenges in comprehending complex control and data flows in sophisticated HPC code.</p> </div> <div class="ltx_keywords">Large language models, High-performance computing, Performance benchmarking, Code optimization </div> <div class="ltx_acknowledgements"> <sup class="ltx_sup" id="id23.id1">∗</sup>Equal contribution. </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1. 
</span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">The rise of Large Language Models (LLMs) <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib49" title="">chatgpt-4, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib9" title="">claude, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib14" title="">gpt-3, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib2" title="">Llama3.2, </a>)</cite> has introduced advanced capabilities to a wide range of applications, such as translation tools <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib79" title="">Zhu2023MultilingualMT, </a>)</cite>, search engines <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib61" title="">Spatharioti2023ComparingTA, </a>)</cite>, and recommendation systems <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib4" title="">llm_recommendation, </a>)</cite>. 
Notably, the coding capabilities of LLMs <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib18" title="">Evauating_LLMs_Trained_on_Code, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib76" title="">codegeex, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib8" title="">codewhisperer, </a>)</cite> have garnered considerable attention from both academia and industry, particularly in tasks such as code completion <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib55" title="">codellama, </a>)</cite>, summarization <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib6" title="">Ahmed2022FewshotTL, </a>)</cite>, and repair <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib69" title="">Xia2022LessTM, </a>)</cite>. LLMs are trained on vast datasets <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib37" title="">stack, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib30" title="">pile, </a>)</cite> that cover a diverse range of programming languages and application domains, enabling them to generate code that effectively aligns with the intention conveyed through natural language inputs. 
Meanwhile, several studies have proposed benchmarks <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib11" title="">Austin2021ProgramSW, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib33" title="">Hendrycks2021MeasuringCC, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib18" title="">Evauating_LLMs_Trained_on_Code, </a>)</cite> for evaluating the capabilities of LLMs in coding.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">As LLMs gain widespread popularity, the High-Performance Computing (HPC) community has been assessing and exploring their potential to address various HPC challenges <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib25" title="">HPC, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib44" title="">Nichols2024CanLL, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib16" title="">Chen2024OMPGPTAG, </a>)</cite>. Existing benchmarks predominantly examine whether LLMs can solve problems under human guidance with example solutions (i.e., multiple shots in-context learning <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib48" title="">Olsson2022IncontextLA, </a>)</cite>) and tend to evaluate general rather than domain-specific questions. There is a dearth of studies <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib47" title="">niu2024evaluatingefficiencysourcecode, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib53" title="">Qiu2024HowEI, </a>)</cite> investigating the efficiency (i.e., performance) of LLM-generated code, particularly in the context of complex real-world problems. 
Unlike simple benchmarks, HPC code is inherently more complex, requiring not only a deep understanding of hardware, algorithms, and programming languages but also the ability to navigate large code bases and leverage cross-domain knowledge. Optimizations such as vectorization, parallelization, loop transformation, and data prefetching demand intricate program analysis and an understanding of multi-level parallelism models to effectively address performance bottlenecks. Frequently, there exists no straightforward reference solution available for code optimization.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">This complexity poses significant challenges for the application of LLMs in HPC tasks. Existing models often lack the specialized knowledge and program analysis capabilities necessary to handle these intricacies, making it difficult to achieve the desired level of optimization for HPC applications. Traditional performance optimization tools <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib5" title="">hpctookit, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib57" title="">Shende2006TheTP, </a>)</cite> have played a pivotal role within the HPC community. Some of these tools <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib26" title="">MAQAO, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib28" title="">KBS-MAQAO, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib20" title="">Codee, </a>)</cite> use static analysis to propose optimizations and occasionally automate code transformations, demonstrating capabilities similar to LLMs. However, traditional tools face multiple constraints as they rely on heuristic methods, are restricted to limited instances for automatic code transformations, and support only some frontend languages. 
The distinctions between traditional optimization tools and LLMs inspire our investigation into their impact on HPC benchmarks, with the goal of <span class="ltx_text ltx_font_bold" id="S1.p3.1.1">understanding the benefits and limitations of these tools and offering insight into advancing performance optimization by integrating LLMs with traditional performance tools</span>.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">In this research, we curated a benchmark suite comprising 26 representative HPC codes across 11 distinct domains commonly found in HPC problems. The benchmarks are categorized into three levels based on the complexity and the size of the program. We evaluated and compared the performance of code optimized by Codee <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib20" title="">Codee, </a>)</cite> with that optimized by three leading LLMs, OpenAI o1 <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib1" title="">GPT-4, </a>)</cite>, Llama-3.2 <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib2" title="">Llama3.2, </a>)</cite>, and Claude-3.5 <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib9" title="">claude, </a>)</cite>. We selected Codee as the representative traditional performance tool due to its capability to automate performance analysis and optimization, provision of in-depth insights into HPC code optimization, and wide adoption <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib42" title="">codee_nersc, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib54" title="">richardson2024optimizing, </a>)</cite> by industry and multiple research institutions. 
Using our benchmark suite, we evaluated the capabilities of LLMs and Codee from the following perspectives:</p> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S1.I1.i1.p1.1.1">Speedups</span>: Assessing the improvement in execution time achieved through optimization;</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S1.I1.i2.p1.1.1">Scalability (parallelism)</span>: Measuring the effectiveness of optimizations in scaling across multiple CPU cores and compute nodes;</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S1.I1.i3.p1.1.1">Correctness</span>: Checking if optimized code outputs the same results as the original code;</p> </div> </li> <li class="ltx_item" id="S1.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i4.p1"> <p class="ltx_p" id="S1.I1.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S1.I1.i4.p1.1.1">HPC Commonsense</span>: Evaluating the models’ understanding of domain-specific best practices and principles in HPC;</p> </div> </li> <li class="ltx_item" id="S1.I1.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i5.p1"> <p class="ltx_p" id="S1.I1.i5.p1.1"><span class="ltx_text ltx_font_bold" id="S1.I1.i5.p1.1.1">Applicability to Applications</span>: Investigating the models’ ability to handle complex HPC applications</p> </div> </li> 
</ul> <p class="ltx_p" id="S1.p4.2">In contrast to existing work <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib64" title="">ValeroLara2023ComparingLA, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib44" title="">Nichols2024CanLL, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib45" title="">Nichols2024PerformanceAlignedLF, </a>)</cite>, this study is the first to investigate the differences between LLMs and state-of-the-art traditional performance optimization tools.</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.1">Moreover, we propose a performance optimization agent <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib66" title="">wang2024survey, </a>)</cite> for optimizing real HPC applications, integrating insights from both LLMs and traditional performance tools to replicate human optimization of HPC code. This process typically begins with an existing reference implementation and proceeds with the identification of performance hotspots. Subsequently, various optimizations are applied to achieve speedup and scalability. The process is repeated iteratively until the desired speedups are achieved.</p> </div> <div class="ltx_para" id="S1.p6"> <p class="ltx_p" id="S1.p6.1">The rest of the paper is organized as follows. Section <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2" title="2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">2</span></a> presents related topics. Section <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S3" title="3. Benchmark Suite ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">3</span></a> describes our benchmark suite in detail. 
Section <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4" title="4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">4</span></a> introduces the evaluation workflow and discusses the evaluation results. Section <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5" title="5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">5</span></a> describes our prototype agent environment and results from applying it to four HPC applications. We summarize the key findings and rank capabilities of LLMs and Codee in Section <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S6" title="6. Discussion ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">6</span></a>. Section <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S7" title="7. Conclusions ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">7</span></a> concludes and presents future directions.</p> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2. </span>Related Work</h2> <div class="ltx_para" id="S2.p1"> <p class="ltx_p" id="S2.p1.1">This section reviews recent studies in related fields to provide background knowledge and highlight the significance of our research.</p> </div> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">LLMs for Code</h5> <div class="ltx_para" id="S2.SS0.SSS0.Px1.p1"> <p class="ltx_p" id="S2.SS0.SSS0.Px1.p1.1">LLMs can generate code based on human prompts and code snippets. 
Typically, code-specific LLMs are trained on general, large-scale datasets <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib35" title="">jiang2024surveylargelanguagemodels, </a>)</cite> and further fine-tuned using smaller, code-related datasets. These datasets typically consist of prompts describing the intended task, paired with example inputs and outputs, often in diff-based formats. Reinforcement Learning from Human Feedback (RLHF) <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib50" title="">ouyang2022traininglanguagemodelsfollow, </a>)</cite> enhances LLMs’ capability to produce syntactically and functionally correct code, with models such as CodeRL <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib39" title="">NEURIPS2022_8636419d, </a>)</cite> and PPOCoder <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib60" title="">shojaee2023executionbasedcodegenerationusing, </a>)</cite> demonstrating significant advancements. Furthermore, prompt engineering techniques, including Chain-of-Thought (CoT) <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib68" title="">wei2023chainofthoughtpromptingelicitsreasoning, </a>)</cite> and Self-Debugging <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib19" title="">chen2023teachinglargelanguagemodels, </a>)</cite>, facilitate iterative refinements for in-depth code analysis. 
Retrieval-augmented generation (RAG) addresses the challenges of repository-level code generation by incorporating cross-file context <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib74" title="">zhang2023repocoderrepositorylevelcodecompletion, </a>)</cite>. Our study evaluates three state-of-the-art LLMs that achieve exceptional results in existing code-related tasks <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib10" title="">lmarena, </a>)</cite>.</p> </div> </section> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">LLM Benchmarks</h5> <div class="ltx_para" id="S2.SS0.SSS0.Px2.p1"> <p class="ltx_p" id="S2.SS0.SSS0.Px2.p1.1">A range of benchmarks <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib17" title="">chen2021evaluating, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib35" title="">jiang2024surveylargelanguagemodels, </a>)</cite> has been developed to evaluate the code generation capabilities of LLMs. Most studies primarily focus on assessing the correctness of the generated code. Similarity-based metrics such as BLEU <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib51" title="">Papineni2002BleuAM, </a>)</cite>, ROUGE <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib41" title="">lin-2004-rouge, </a>)</cite>, and METEOR <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib12" title="">banerjee-lavie-2005-meteor, </a>)</cite> often fall short in capturing the syntactic and functional correctness of code. 
Consequently, execution-based metrics, such as pass@k <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib18" title="">Evauating_LLMs_Trained_on_Code, </a>)</cite>, have gained traction for their ability to directly evaluate functional validity by running unit tests on generated code. In addition, some studies capture qualitative aspects such as code style and maintainability <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib35" title="">jiang2024surveylargelanguagemodels, </a>)</cite>. Existing benchmarks <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib43" title="">nichols2024can, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib46" title="">nichols2024performance, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib47" title="">niu2024evaluatingefficiencysourcecode, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib29" title="">fang2024towards, </a>)</cite> for evaluating the efficiency of LLM-generated code often focus on narrow aspects, such as the ability to parallelize code, without thoroughly examining the models’ understanding of broader efficiency-related challenges. Furthermore, prior studies mostly analyze small computational kernels, neglecting larger applications. 
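The pass@k metric mentioned above is commonly computed with an unbiased estimator: generate n samples per problem, count the c samples that pass all unit tests, and estimate the probability that at least one of k randomly drawn samples is correct. A minimal sketch (function name is illustrative, not from any specific benchmark harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e., the probability that a random k-subset of the n generated
    samples contains at least one of the c correct ones."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 generations per problem, 40 pass the unit tests
print(pass_at_k(200, 40, 10))  # estimated pass@10
```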
In contrast, our study provides a more comprehensive evaluation by assessing the efficiency, correctness, and HPC-specific expertise of LLMs for both small kernels and large applications.</p> </div> </section> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">LLM Agents</h5> <div class="ltx_para" id="S2.SS0.SSS0.Px3.p1"> <p class="ltx_p" id="S2.SS0.SSS0.Px3.p1.1">An LLM agent <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib66" title="">wang2024survey, </a>)</cite> is a system designed to utilize LLMs for solving problems by generating a <span class="ltx_text ltx_font_italic" id="S2.SS0.SSS0.Px3.p1.1.1">plan</span>, executing the plan through external <span class="ltx_text ltx_font_italic" id="S2.SS0.SSS0.Px3.p1.1.2">tools</span>, gathering feedback from the <span class="ltx_text ltx_font_italic" id="S2.SS0.SSS0.Px3.p1.1.3">environment</span>, and storing information in <span class="ltx_text ltx_font_italic" id="S2.SS0.SSS0.Px3.p1.1.4">memory</span> to refine subsequent steps. It extends the core functionality of LLMs by enabling dynamic interaction with external environments. Several frameworks <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib31" title="">significantgravitas2023autogpt, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib67" title="">wang2024openhands, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib24" title="">langchain, </a>)</cite> have been developed to facilitate the creation of such agents <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib7" title="">cognition2024devin, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib34" title="">cursor, </a>)</cite>. 
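The plan/tools/environment/memory cycle just described can be summarized as a simple control loop. The sketch below uses illustrative placeholder names (query_llm, TOOLS) rather than any particular framework’s API:

```python
def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns the name of a tool to invoke."""
    return "profile"  # e.g., the model plans to profile the code first

# External tools the agent may execute; stubs standing in for real
# profilers, compilers, and test runners.
TOOLS = {
    "profile": lambda: "hotspot: matmul loop, 82% of runtime",
    "compile_and_run": lambda: "speedup: 1.7x, tests passed",
}

def agent_loop(goal: str, max_steps: int = 3) -> list[str]:
    memory: list[str] = []  # feedback accumulated across steps
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nMemory: {memory}\nTools: {list(TOOLS)}"
        action = query_llm(prompt)        # plan: pick the next tool
        observation = TOOLS[action]()     # act: execute via the environment
        memory.append(f"{action} -> {observation}")  # store feedback
    return memory
```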
To enhance solution accuracy, these agents <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib75" title="">zhang2024codeagent, </a>)</cite> often employ advanced reasoning strategies, such as Chain-of-Thought <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib68" title="">wei2023chainofthoughtpromptingelicitsreasoning, </a>)</cite>, Tree-of-Thought <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib71" title="">yao2024tree, </a>)</cite>, and ReAct <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib72" title="">yao2022react, </a>)</cite>, to determine the most effective invocation strategies. Additionally, recent research <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib36" title="">kim2023llm, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib59" title="">shinn2024reflexion, </a>)</cite> has explored optimizing the efficiency and accuracy of the agent workflow. Unlike previous code agents focused on software development, we developed a prototype agent system aimed at evaluating the effectiveness of using LLMs for performance optimization of large applications. </p> </div> </section> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px4"> <h5 class="ltx_title ltx_title_paragraph">Traditional Performance Analysis Tools</h5> <div class="ltx_para" id="S2.SS0.SSS0.Px4.p1"> <p class="ltx_p" id="S2.SS0.SSS0.Px4.p1.1">In the field of HPC, performance analysis tools play a critical role in identifying and resolving bottlenecks to optimize application efficiency across single and multiple compute nodes. These tools are generally classified into two main categories: dynamic and static tools. 
<span class="ltx_text ltx_font_italic" id="S2.SS0.SSS0.Px4.p1.1.1">Dynamic tools</span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib5" title="">hpctookit, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib58" title="">shende2006tau, </a>)</cite> operate at runtime, collecting performance metrics such as execution time, memory usage, and processor utilization and attributing metrics to source files to identify hotspots. <span class="ltx_text ltx_font_italic" id="S2.SS0.SSS0.Px4.p1.1.2">Static tools</span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib27" title="">djoudi2005maqao, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib20" title="">Codee, </a>)</cite> analyze source code and compilation databases to identify inefficiencies without executing the program. Recent advancements <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib77" title="">zhou2021automated, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib78" title="">zhou2021gpa, </a>)</cite> have introduced hybrid tools that combine static and dynamic analysis techniques to offer deeper insights. Traditional performance analysis tools typically rely on heuristics and predefined algorithms to detect performance bottlenecks and inefficiencies. While this approach ensures precision—static tools apply rigorous algorithms, and dynamic tools capture empirical data from real execution—such methods may lack the flexibility needed to address the complex and evolving challenges in modern HPC environments. In contrast, LLMs offer enhanced adaptability by modifying code and incorporating additional information through retrieval-augmented generation (RAG). 
However, it is crucial to carefully validate the outputs of LLMs to ensure correctness and effectiveness in HPC workflows.</p> </div> <figure class="ltx_table" id="S2.T1"> <figcaption class="ltx_caption ltx_centering" style="font-size:70%;"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S2.T1.4.1.1" style="font-size:129%;">Table 1</span>. </span><span class="ltx_text" id="S2.T1.5.2" style="font-size:129%;">Classification of Benchmarks by Computational Motifs</span></figcaption> <table class="ltx_tabular ltx_centering ltx_align_middle" id="S2.T1.6"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S2.T1.6.1.1"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.1.1.1"><span class="ltx_text ltx_font_bold" id="S2.T1.6.1.1.1.1" style="font-size:70%;">Computational Motif</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.1.1.2"><span class="ltx_text ltx_font_bold" id="S2.T1.6.1.1.2.1" style="font-size:70%;">Benchmark</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.1.1.3"><span class="ltx_text ltx_font_bold" id="S2.T1.6.1.1.3.1" style="font-size:70%;">Language</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.1.1.4"><span class="ltx_text ltx_font_bold" id="S2.T1.6.1.1.4.1" style="font-size:70%;">Level</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.1.1.5"><span class="ltx_text ltx_font_bold" id="S2.T1.6.1.1.5.1" style="font-size:70%;">Tokens</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.1.1.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.1.1.6.1"> <span class="ltx_p" id="S2.T1.6.1.1.6.1.1" style="width:199.2pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.6.1.1.6.1.1.1" style="font-size:70%;">Description</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.2.2"> <td class="ltx_td 
ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.2.2.1" rowspan="5"><span class="ltx_text" id="S2.T1.6.2.2.1.1" style="font-size:70%;">Dense Linear Algebra</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.2.2.2"> <span class="ltx_text" id="S2.T1.6.2.2.2.1" style="font-size:70%;">Durbin </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.2.2.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.2.2.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.2.2.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.2.2.3"><span class="ltx_text" id="S2.T1.6.2.2.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.2.2.4"><span class="ltx_text" id="S2.T1.6.2.2.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.2.2.5"><span class="ltx_text" id="S2.T1.6.2.2.5.1" style="font-size:70%;">538</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.2.2.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.2.2.6.1"> <span class="ltx_p" id="S2.T1.6.2.2.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.2.2.6.1.1.1" style="font-size:70%;">Toeplitz system solver, used in signal processing and numerical analysis.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.3.3"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.3.3.1"> <span class="ltx_text" id="S2.T1.6.3.3.1.1" style="font-size:70%;">Doitgen </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.3.3.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" 
href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.3.3.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.3.3.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.3.3.2"><span class="ltx_text" id="S2.T1.6.3.3.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.3.3.3"><span class="ltx_text" id="S2.T1.6.3.3.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.3.3.4"><span class="ltx_text" id="S2.T1.6.3.3.4.1" style="font-size:70%;">667</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.3.3.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.3.3.5.1"> <span class="ltx_p" id="S2.T1.6.3.3.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.3.3.5.1.1.1" style="font-size:70%;">Multi-resolution analysis kernel for wavelet transforms.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.4.4"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.4.4.1"> <span class="ltx_text" id="S2.T1.6.4.4.1.1" style="font-size:70%;">Cholesky </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.4.4.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.4.4.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.4.4.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.4.4.2"><span class="ltx_text" id="S2.T1.6.4.4.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.4.4.3"><span class="ltx_text" id="S2.T1.6.4.4.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center 
ltx_border_r" id="S2.T1.6.4.4.4"><span class="ltx_text" id="S2.T1.6.4.4.4.1" style="font-size:70%;">648</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.4.4.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.4.4.5.1"> <span class="ltx_p" id="S2.T1.6.4.4.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.4.4.5.1.1.1" style="font-size:70%;">Cholesky decomposition, widely used in dense matrix factorizations.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.5.5"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.5.5.1"> <span class="ltx_text" id="S2.T1.6.5.5.1.1" style="font-size:70%;">2mm </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.5.5.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.5.5.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.5.5.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.5.5.2"><span class="ltx_text" id="S2.T1.6.5.5.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.5.5.3"><span class="ltx_text" id="S2.T1.6.5.5.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.5.5.4"><span class="ltx_text" id="S2.T1.6.5.5.4.1" style="font-size:70%;">795</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.5.5.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.5.5.5.1"> <span class="ltx_p" id="S2.T1.6.5.5.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.5.5.5.1.1.1" style="font-size:70%;">Double matrix-matrix multiplications.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.6.6"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.6.6.1"> 
<span class="ltx_text" id="S2.T1.6.6.6.1.1" style="font-size:70%;">Correlation </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.6.6.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.6.6.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.6.6.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.6.6.2"><span class="ltx_text" id="S2.T1.6.6.6.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.6.6.3"><span class="ltx_text" id="S2.T1.6.6.6.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.6.6.4"><span class="ltx_text" id="S2.T1.6.6.6.4.1" style="font-size:70%;">729</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.6.6.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.6.6.5.1"> <span class="ltx_p" id="S2.T1.6.6.6.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.6.6.5.1.1.1" style="font-size:70%;">Mean and standard deviation computation for matrix columns.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.7.7"> <td class="ltx_td ltx_border_l ltx_border_r" id="S2.T1.6.7.7.1"></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.7.7.2"> <span class="ltx_text" id="S2.T1.6.7.7.2.1" style="font-size:70%;">MATMUL </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.7.7.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib22" title="">codee_performance_demos<span class="ltx_text" id="S2.T1.6.7.7.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.7.7.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td 
ltx_align_center ltx_border_r" id="S2.T1.6.7.7.3"><span class="ltx_text" id="S2.T1.6.7.7.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.7.7.4"><span class="ltx_text" id="S2.T1.6.7.7.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.7.7.5"><span class="ltx_text" id="S2.T1.6.7.7.5.1" style="font-size:70%;">764</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.7.7.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.7.7.6.1"> <span class="ltx_p" id="S2.T1.6.7.7.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.7.7.6.1.1.1" style="font-size:70%;">Basic matrix multiplication.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.8.8"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.8.8.1" rowspan="4"><span class="ltx_text" id="S2.T1.6.8.8.1.1" style="font-size:70%;">Sparse Linear Algebra</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.8.8.2"> <span class="ltx_text" id="S2.T1.6.8.8.2.1" style="font-size:70%;">Trisolv </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.8.8.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.8.8.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.8.8.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.8.8.3"><span class="ltx_text" id="S2.T1.6.8.8.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.8.8.4"><span class="ltx_text" id="S2.T1.6.8.8.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.8.8.5"><span 
class="ltx_text" id="S2.T1.6.8.8.5.1" style="font-size:70%;">623</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.8.8.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.8.8.6.1"> <span class="ltx_p" id="S2.T1.6.8.8.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.8.8.6.1.1.1" style="font-size:70%;">Triangular solver, solving sparse triangular systems.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.9.9"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.9.9.1"> <span class="ltx_text" id="S2.T1.6.9.9.1.1" style="font-size:70%;">Bicg </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.9.9.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.9.9.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.9.9.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.9.9.2"><span class="ltx_text" id="S2.T1.6.9.9.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.9.9.3"><span class="ltx_text" id="S2.T1.6.9.9.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.9.9.4"><span class="ltx_text" id="S2.T1.6.9.9.4.1" style="font-size:70%;">649</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.9.9.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.9.9.5.1"> <span class="ltx_p" id="S2.T1.6.9.9.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.9.9.5.1.1.1" style="font-size:70%;">BiCGStab linear solver, an iterative method for sparse systems.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.10.10"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.10.10.1"> <span 
class="ltx_text" id="S2.T1.6.10.10.1.1" style="font-size:70%;">ATMUX </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.10.10.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib22" title="">codee_performance_demos<span class="ltx_text" id="S2.T1.6.10.10.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.10.10.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.10.10.2"><span class="ltx_text" id="S2.T1.6.10.10.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.10.10.3"><span class="ltx_text" id="S2.T1.6.10.10.3.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.10.10.4"><span class="ltx_text" id="S2.T1.6.10.10.4.1" style="font-size:70%;">2185</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.10.10.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.10.10.5.1"> <span class="ltx_p" id="S2.T1.6.10.10.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.10.10.5.1.1.1" style="font-size:70%;">Sparse matrix-vector multiplication, often used in physics simulations.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.11.11"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.11.11.1"> <span class="ltx_text" id="S2.T1.6.11.11.1.1" style="font-size:70%;">NPB_CG </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.11.11.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib22" title="">codee_performance_demos<span class="ltx_text" id="S2.T1.6.11.11.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.11.11.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" 
id="S2.T1.6.11.11.2"><span class="ltx_text" id="S2.T1.6.11.11.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.11.11.3"><span class="ltx_text" id="S2.T1.6.11.11.3.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.11.11.4"><span class="ltx_text" id="S2.T1.6.11.11.4.1" style="font-size:70%;">20060</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.11.11.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.11.11.5.1"> <span class="ltx_p" id="S2.T1.6.11.11.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.11.11.5.1.1.1" style="font-size:70%;">Conjugate gradient solver, an iterative sparse linear algebra algorithm from NAS benchmarks.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.12.12"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.12.12.1"><span class="ltx_text" id="S2.T1.6.12.12.1.1" style="font-size:70%;">Spectral Methods</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.12.12.2"> <span class="ltx_text" id="S2.T1.6.12.12.2.1" style="font-size:70%;">Deriche </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.12.12.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.12.12.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.12.12.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.12.12.3"><span class="ltx_text" id="S2.T1.6.12.12.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.12.12.4"><span class="ltx_text" id="S2.T1.6.12.12.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td 
ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.12.12.5"><span class="ltx_text" id="S2.T1.6.12.12.5.1" style="font-size:70%;">864</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.12.12.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.12.12.6.1"> <span class="ltx_p" id="S2.T1.6.12.12.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.12.12.6.1.1.1" style="font-size:70%;">Edge detection commonly used in image processing.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.13.13"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.13.13.1" rowspan="2"><span class="ltx_text" id="S2.T1.6.13.13.1.1" style="font-size:70%;">Monte Carlo</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.13.13.2"> <span class="ltx_text" id="S2.T1.6.13.13.2.1" style="font-size:70%;">PI </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.13.13.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib22" title="">codee_performance_demos<span class="ltx_text" id="S2.T1.6.13.13.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.13.13.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.13.13.3"><span class="ltx_text" id="S2.T1.6.13.13.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.13.13.4"><span class="ltx_text" id="S2.T1.6.13.13.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.13.13.5"><span class="ltx_text" id="S2.T1.6.13.13.5.1" style="font-size:70%;">262</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.13.13.6"> <span class="ltx_inline-block ltx_align_top" 
id="S2.T1.6.13.13.6.1"> <span class="ltx_p" id="S2.T1.6.13.13.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.13.13.6.1.1.1" style="font-size:70%;">Monte Carlo integration, involves random sampling for numerical simulations.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.14.14"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.14.14.1"> <span class="ltx_text" id="S2.T1.6.14.14.1.1" style="font-size:70%;">XSBench </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.14.14.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib3" title="">hpc2021<span class="ltx_text" id="S2.T1.6.14.14.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.14.14.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.14.14.2"><span class="ltx_text" id="S2.T1.6.14.14.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.14.14.3"><span class="ltx_text" id="S2.T1.6.14.14.3.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.14.14.4"><span class="ltx_text" id="S2.T1.6.14.14.4.1" style="font-size:70%;">9094</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.14.14.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.14.14.5.1"> <span class="ltx_p" id="S2.T1.6.14.14.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.14.14.5.1.1.1" style="font-size:70%;">Monte Carlo Macroscopic Cross Section Lookup Benchmark.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.15.15"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.15.15.1"><span class="ltx_text" id="S2.T1.6.15.15.1.1" style="font-size:70%;">Dynamic Programming</span></td> <td class="ltx_td ltx_align_center ltx_border_r 
ltx_border_t" id="S2.T1.6.15.15.2"> <span class="ltx_text" id="S2.T1.6.15.15.2.1" style="font-size:70%;">Nussinov </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.15.15.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.15.15.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.15.15.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.15.15.3"><span class="ltx_text" id="S2.T1.6.15.15.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.15.15.4"><span class="ltx_text" id="S2.T1.6.15.15.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.15.15.5"><span class="ltx_text" id="S2.T1.6.15.15.5.1" style="font-size:70%;">2096</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.15.15.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.15.15.6.1"> <span class="ltx_p" id="S2.T1.6.15.15.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.15.15.6.1.1.1" style="font-size:70%;">Dynamic programming for RNA secondary structure prediction.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.16.16"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.16.16.1" rowspan="4"><span class="ltx_text" id="S2.T1.6.16.16.1.1" style="font-size:70%;">Structured Grids</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.16.16.2"> <span class="ltx_text" id="S2.T1.6.16.16.2.1" style="font-size:70%;">Adi </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.16.16.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" 
href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.16.16.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.16.16.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.16.16.3"><span class="ltx_text" id="S2.T1.6.16.16.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.16.16.4"><span class="ltx_text" id="S2.T1.6.16.16.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.16.16.5"><span class="ltx_text" id="S2.T1.6.16.16.5.1" style="font-size:70%;">760</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.16.16.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.16.16.6.1"> <span class="ltx_p" id="S2.T1.6.16.16.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.16.16.6.1.1.1" style="font-size:70%;">Alternating Direction Implicit method, common in structured grid-based PDEs.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.17.17"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.17.17.1"> <span class="ltx_text" id="S2.T1.6.17.17.1.1" style="font-size:70%;">Srad </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.17.17.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib73" title="">srad<span class="ltx_text" id="S2.T1.6.17.17.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.17.17.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.17.17.2"><span class="ltx_text" id="S2.T1.6.17.17.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.17.17.3"><span 
class="ltx_text" id="S2.T1.6.17.17.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.17.17.4"><span class="ltx_text" id="S2.T1.6.17.17.4.1" style="font-size:70%;">669</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.17.17.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.17.17.5.1"> <span class="ltx_p" id="S2.T1.6.17.17.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.17.17.5.1.1.1" style="font-size:70%;">Speckle reducing anisotropic diffusion for image processing, utilizing structured grids.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.18.18"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.18.18.1"> <span class="ltx_text" id="S2.T1.6.18.18.1.1" style="font-size:70%;">Hotspot </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.18.18.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib15" title="">che2009rodinia<span class="ltx_text" id="S2.T1.6.18.18.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.18.18.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.18.18.2"><span class="ltx_text" id="S2.T1.6.18.18.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.18.18.3"><span class="ltx_text" id="S2.T1.6.18.18.3.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.18.18.4"><span class="ltx_text" id="S2.T1.6.18.18.4.1" style="font-size:70%;">1384</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.18.18.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.18.18.5.1"> <span class="ltx_p" id="S2.T1.6.18.18.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.18.18.5.1.1.1" 
style="font-size:70%;">2D structured grid for simulating heat distribution on a chip.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.19.19"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.19.19.1"> <span class="ltx_text" id="S2.T1.6.19.19.1.1" style="font-size:70%;">Hotspot3D </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.19.19.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib15" title="">che2009rodinia<span class="ltx_text" id="S2.T1.6.19.19.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.19.19.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.19.19.2"><span class="ltx_text" id="S2.T1.6.19.19.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.19.19.3"><span class="ltx_text" id="S2.T1.6.19.19.3.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.19.19.4"><span class="ltx_text" id="S2.T1.6.19.19.4.1" style="font-size:70%;">1117</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.19.19.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.19.19.5.1"> <span class="ltx_p" id="S2.T1.6.19.19.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.19.19.5.1.1.1" style="font-size:70%;">3D heat distribution simulation over a structured grid.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.20.20"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.20.20.1" rowspan="3"><span class="ltx_text" id="S2.T1.6.20.20.1.1" style="font-size:70%;">N-body Methods</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.20.20.2"> <span class="ltx_text" id="S2.T1.6.20.20.2.1" style="font-size:70%;">COULOMB </span><cite class="ltx_cite 
ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.20.20.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib22" title="">codee_performance_demos<span class="ltx_text" id="S2.T1.6.20.20.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.20.20.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.20.20.3"><span class="ltx_text" id="S2.T1.6.20.20.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.20.20.4"><span class="ltx_text" id="S2.T1.6.20.20.4.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.20.20.5"><span class="ltx_text" id="S2.T1.6.20.20.5.1" style="font-size:70%;">1709</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.20.20.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.20.20.6.1"> <span class="ltx_p" id="S2.T1.6.20.20.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.20.20.6.1.1.1" style="font-size:70%;">Coulomb force simulation, typical of N-body problems in molecular dynamics.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.21.21"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.21.21.1"> <span class="ltx_text" id="S2.T1.6.21.21.1.1" style="font-size:70%;">Particlefilter </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.21.21.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib15" title="">che2009rodinia<span class="ltx_text" id="S2.T1.6.21.21.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.21.21.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.21.21.2"><span class="ltx_text" 
id="S2.T1.6.21.21.2.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.21.21.3"><span class="ltx_text" id="S2.T1.6.21.21.3.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.21.21.4"><span class="ltx_text" id="S2.T1.6.21.21.4.1" style="font-size:70%;">2984</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.21.21.5"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.21.21.5.1"> <span class="ltx_p" id="S2.T1.6.21.21.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.21.21.5.1.1.1" style="font-size:70%;">Particle filter, commonly used in N-body filtering methods in robotics and control.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.22.22"> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.22.22.1"> <span class="ltx_text" id="S2.T1.6.22.22.1.1" style="font-size:70%;">HACCmk </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.22.22.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib22" title="">codee_performance_demos<span class="ltx_text" id="S2.T1.6.22.22.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.22.22.1.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.22.22.2"><span class="ltx_text" id="S2.T1.6.22.22.2.1" style="font-size:70%;">C++</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.22.22.3"><span class="ltx_text" id="S2.T1.6.22.22.3.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.22.22.4"><span class="ltx_text" id="S2.T1.6.22.22.4.1" style="font-size:70%;">3266</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.22.22.5"> <span class="ltx_inline-block ltx_align_top" 
id="S2.T1.6.22.22.5.1"> <span class="ltx_p" id="S2.T1.6.22.22.5.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.22.22.5.1.1.1" style="font-size:70%;">Cosmological simulation using particle-mesh methods for N-body computations.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.23.23"> <td class="ltx_td ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.23.23.1"></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.23.23.2"> <span class="ltx_text" id="S2.T1.6.23.23.2.1" style="font-size:70%;">Jacobi-1d </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.23.23.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib52" title="">PolyBenchC-4.2.1<span class="ltx_text" id="S2.T1.6.23.23.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.23.23.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.23.23.3"><span class="ltx_text" id="S2.T1.6.23.23.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.23.23.4"><span class="ltx_text" id="S2.T1.6.23.23.4.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S2.T1.6.23.23.5"><span class="ltx_text" id="S2.T1.6.23.23.5.1" style="font-size:70%;">564</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t" id="S2.T1.6.23.23.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.23.23.6.1"> <span class="ltx_p" id="S2.T1.6.23.23.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.23.23.6.1.1.1" style="font-size:70%;">1-D Jacobi stencil computation.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.24.24"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r" id="S2.T1.6.24.24.1"><span class="ltx_text" 
id="S2.T1.6.24.24.1.1" style="font-size:70%;">Stencils</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.24.24.2"> <span class="ltx_text" id="S2.T1.6.24.24.2.1" style="font-size:70%;">LBM D2Q37 </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S2.T1.6.24.24.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib3" title="">hpc2021<span class="ltx_text" id="S2.T1.6.24.24.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.24.24.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.24.24.3"><span class="ltx_text" id="S2.T1.6.24.24.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.24.24.4"><span class="ltx_text" id="S2.T1.6.24.24.4.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_r" id="S2.T1.6.24.24.5"><span class="ltx_text" id="S2.T1.6.24.24.5.1" style="font-size:70%;">27364</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_r" id="S2.T1.6.24.24.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.24.24.6.1"> <span class="ltx_p" id="S2.T1.6.24.24.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.24.24.6.1.1.1" style="font-size:70%;">Lattice Boltzmann Method (LBM) for fluid dynamics, uses stencil operations for grid updates.</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S2.T1.6.25.25"> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="S2.T1.6.25.25.1"><span class="ltx_text" id="S2.T1.6.25.25.1.1" style="font-size:70%;">Radiation Transport</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S2.T1.6.25.25.2"> <span class="ltx_text" id="S2.T1.6.25.25.2.1" style="font-size:70%;">Minisweep </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" 
id="S2.T1.6.25.25.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib3" title="">hpc2021<span class="ltx_text" id="S2.T1.6.25.25.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S2.T1.6.25.25.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S2.T1.6.25.25.3"><span class="ltx_text" id="S2.T1.6.25.25.3.1" style="font-size:70%;">C</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S2.T1.6.25.25.4"><span class="ltx_text" id="S2.T1.6.25.25.4.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S2.T1.6.25.25.5"><span class="ltx_text" id="S2.T1.6.25.25.5.1" style="font-size:70%;">10478</span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_b ltx_border_r ltx_border_t" id="S2.T1.6.25.25.6"> <span class="ltx_inline-block ltx_align_top" id="S2.T1.6.25.25.6.1"> <span class="ltx_p" id="S2.T1.6.25.25.6.1.1" style="width:199.2pt;"><span class="ltx_text" id="S2.T1.6.25.25.6.1.1.1" style="font-size:70%;">Solves transport equations for radiation, typically used in nuclear engineering.</span></span> </span> </td> </tr> </tbody> </table> </figure> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3. </span>Benchmark Suite</h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">In this section, we outline the design rationale behind our benchmark suite, summarized in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2.T1" title="Table 1 ‣ Traditional Performance Analysis Tools ‣ 2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">1</span></a>. 
The suite encompasses a diverse set of motifs, including Dense and Sparse Linear Algebra, Spectral Methods, Monte Carlo, Dynamic Programming, Structured Grids, N-body Methods, Stencils, and Radiation Transport, which represent the core computational patterns frequently encountered in HPC applications.</p> </div> <div class="ltx_para" id="S3.p2"> <p class="ltx_p" id="S3.p2.1">Our benchmarks range in scale from small, representative compute kernels, such as basic matrix multiplication (<span class="ltx_text ltx_font_italic" id="S3.p2.1.1">MATMUL</span>) and chain matrix multiplication (<span class="ltx_text ltx_font_italic" id="S3.p2.1.2">2mm</span>), to large-scale applications like <span class="ltx_text ltx_font_italic" id="S3.p2.1.3">NPB_CG</span> and <span class="ltx_text ltx_font_italic" id="S3.p2.1.4">LBM D2Q37</span>. These benchmarks are categorized into three levels of increasing code size and complexity: level 1 represents the simplest benchmarks, while level 3 encompasses the most complex. This hierarchical structure enables a comprehensive evaluation of how effectively LLMs and traditional performance tools can tackle foundational computational tasks as well as domain-specific challenges requiring advanced algorithmic and hardware knowledge. For smaller benchmarks, both LLMs and traditional code analysis tools are capable of analyzing complete code files. However, for larger applications, LLMs may face output token length limitations, and traditional tools may generate many false-positive suggestions. To address these challenges, we employ profilers, such as HPCToolkit <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib5" title="">hpctookit, </a>)</cite>, to identify performance-critical hotspots, focusing optimization efforts on the most computationally intensive code sections. 
The benchmarks span multiple programming languages, primarily C and C++, emphasizing the most common coding environments in the HPC community. To prepare the benchmarks for evaluation, we eliminated any existing OpenMP parallel pragmas. Additionally, we fully expanded C/C++ macros to avoid potential misinterpretations by LLMs and limitations of traditional tools.</p> </div> <div class="ltx_para" id="S3.p3"> <p class="ltx_p" id="S3.p3.1">In summary, this multi-level, multi-domain design enables systematic evaluation of the capabilities of LLMs and traditional tools in addressing detailed optimization challenges in small benchmarks and usability issues in large-scale applications.</p> </div> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4. </span>Evaluation</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">We describe our evaluation framework, illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F1" title="Figure 1 ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">1</span></a>, along with a series of experiments conducted with it on the level 1 and level 2 benchmarks. It begins by instructing four different tools—Codee, OpenAI o1, Llama-3.2, and Claude-3.5—to optimize existing code written in C/C++. Note that all of these tools accept static code and provide performance optimization suggestions without executing it.</p> </div> <figure class="ltx_figure" id="S4.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="361" id="S4.F1.g1" src="x1.png" width="623"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F1.2.1.1" style="font-size:90%;">Figure 1</span>. 
</span><span class="ltx_text" id="S4.F1.3.2" style="font-size:90%;">An Overview of our evaluation framework.</span></figcaption> </figure> <figure class="ltx_figure" id="S4.F2"> <figure class="ltx_float ltx_lstlisting ltx_align_center" id="S4.F2.tab1"> <div class="ltx_listing ltx_lst_language_Python ltx_lstlisting ltx_framed ltx_framed_rectangle ltx_listing" id="S4.F2.tab1.1" style="background-color:#F2F2EB;"> <div class="ltx_listing_data"><a download="codee_commands.tex" href="data:text/plain;base64,Y29kZWUgY2hlY2tzIC0tdmVyYm9zZSAgLS10YXJnZXQtYXJjaCBjcHUgLS0gZ2NjIC1jIG1haW4uYyAtSSBpbmNsdWRlLyAtTzMKRGF0ZTogMjAyNS0wMS0wOSBDb2RlZSB2ZXJzaW9uOiAyMDI0LjQuMiBMaWNlbnNlIHR5cGU6IEZ1bGwKQ29tcGlsZXIgaW52b2NhdGlvbjogZ2NjIC1jIG1haW4uYyAtSSBpbmNsdWRlLyAtTzMKWzEvMV0gbWFpbi5jIC4uLiBEb25lCkNIRUNLUyBSRVBPUlQKbWFpbi5jOjE2OjkgW1BXUjAzOV0gKGxldmVsOiBMMSk6IENvbnNpZGVyIGxvb3AgaW50ZXJjaGFuZ2UgdG8gaW1wcm92ZSB0aGUgbG9jYWxpdHkgb2YgcmVmZXJlbmNlIGFuZCBlbmFibGUgdmVjdG9yaXphdGlvbgogIExvb3BzIHRvIGludGVyY2hhbmdlOgogICAgMTY6ICAgICAgICAgZm9yIChzaXplX3QgaiA9IDA7IGogPCBuOyBqKyspIHsKICAgIDE3OiAgICAgICAgICAgICBmb3IgKHNpemVfdCBrID0gMDsgayA8IHA7IGsrKykgewogIFN1Z2dlc3Rpb246IEludGVyY2hhbmdlIGlubmVyIGFuZCBvdXRlciBsb29wcyBpbiB0aGUgbG9vcCBuZXN0IHRvIGltcHJvdmUgcGVyZm9ybWFuY2UKICBEb2N1bWVudGF0aW9uOiBodHRwczovL2dpdGh1Yi5jb20vY29kZWUtY29tL29wZW4tY2F0YWxvZy90cmVlL21haW4vQ2hlY2tzL1BXUjAzOQogIEF1dG9GaXg6IGNvZGVlIHJld3JpdGUgLS1tZW1vcnkgbG9vcC1pbnRlcmNoYW5nZSAtLWluLXBsYWNlIG1haW4uYzoxNjo5IC0tIGdjYyAtYyBtYWluLmMgLUkgaW5jbHVkZS8gLU8zCg==">⬇</a></div> <div class="ltx_listingline" id="lstnumberx1"> <span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.1" style="font-size:50%;">codee</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.2" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.3" style="font-size:50%;">checks</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.4" 
style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.5" style="font-size:50%;">--</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.6" style="font-size:50%;">verbose</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.7" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.8" style="font-size:50%;">--</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.9" style="font-size:50%;">target</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.10" style="font-size:50%;">-</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.11" style="font-size:50%;">arch</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.12" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.13" style="font-size:50%;">cpu</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.14" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.15" style="font-size:50%;">--</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.16" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.17" style="font-size:50%;">gcc</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.18" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.19" style="font-size:50%;">-</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.20" style="font-size:50%;">c</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.21" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.22" style="font-size:50%;">main</span><span class="ltx_text 
ltx_font_typewriter" id="lstnumberx1.23" style="font-size:50%;">.</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.24" style="font-size:50%;">c</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.25" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.26" style="font-size:50%;">-</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.27" style="font-size:50%;">I</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.28" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.29" style="font-size:50%;">include</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.30" style="font-size:50%;">/</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.31" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.32" style="font-size:50%;">-</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.33" style="font-size:50%;">O3</span> </div> <div class="ltx_listingline" id="lstnumberx2"> <span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx2.1" style="font-size:50%;">Date</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx2.2" style="font-size:50%;">:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx2.3" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx2.4" style="font-size:50%;">2025-01-09</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx2.5" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx2.6" style="font-size:50%;">Codee</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx2.7" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier 
ltx_font_typewriter" id="lstnumberx2.8" style="font-size:50%;">version</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx2.9" style="font-size:50%;">:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx2.10" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx2.11" style="font-size:50%;">2024.4.2</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx2.12" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx2.13" style="font-size:50%;">License</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx2.14" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_keyword ltx_lst_keywords2 ltx_font_typewriter" id="lstnumberx2.15" style="font-size:50%;color:#FF00FF;">type</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx2.16" style="font-size:50%;">:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx2.17" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx2.18" style="font-size:50%;">Full</span> </div> <div class="ltx_listingline" id="lstnumberx3"> <span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.1" style="font-size:50%;">Compiler</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.2" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.3" style="font-size:50%;">invocation</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx3.4" style="font-size:50%;">:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.5" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.6" style="font-size:50%;">gcc</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.7" style="font-size:50%;"> 
</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx3.8" style="font-size:50%;">-</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.9" style="font-size:50%;">c</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.10" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.11" style="font-size:50%;">main</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx3.12" style="font-size:50%;">.</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.13" style="font-size:50%;">c</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.14" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx3.15" style="font-size:50%;">-</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.16" style="font-size:50%;">I</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.17" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.18" style="font-size:50%;">include</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx3.19" style="font-size:50%;">/</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.20" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx3.21" style="font-size:50%;">-</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.22" style="font-size:50%;">O3</span> </div> <div class="ltx_listingline" id="lstnumberx4"> <span class="ltx_text ltx_font_typewriter" id="lstnumberx4.1" style="font-size:50%;">[1/1]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.2" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.3" style="font-size:50%;">main</span><span class="ltx_text 
ltx_font_typewriter" id="lstnumberx4.4" style="font-size:50%;">.</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.5" style="font-size:50%;">c</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.6" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx4.7" style="font-size:50%;">...</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.8" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.9" style="font-size:50%;">Done</span> </div> <div class="ltx_listingline" id="lstnumberx5"> <span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx5.1" style="font-size:50%;">CHECKS</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.2" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx5.3" style="font-size:50%;">REPORT</span> </div> <div class="ltx_listingline" id="lstnumberx6"> <span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.1" style="font-size:50%;">main</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx6.2" style="font-size:50%;">.</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.3" style="font-size:50%;">c</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx6.4" style="font-size:50%;">:16:9</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.5" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx6.6" style="font-size:50%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.7" style="font-size:50%;">PWR039</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx6.8" style="font-size:50%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.9" style="font-size:50%;"> </span><span 
class="ltx_text ltx_font_typewriter" id="lstnumberx6.10" style="font-size:50%;">(</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.11" style="font-size:50%;">level</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx6.12" style="font-size:50%;">:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.13" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.14" style="font-size:50%;">L1</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx6.15" style="font-size:50%;">):</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.16" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.17" style="font-size:50%;">Consider</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.18" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.19" style="font-size:50%;">loop</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.20" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.21" style="font-size:50%;">interchange</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.22" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.23" style="font-size:50%;">to</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.24" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.25" style="font-size:50%;">improve</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.26" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.27" style="font-size:50%;">the</span><span class="ltx_text 
ltx_lst_space ltx_font_typewriter" id="lstnumberx6.28" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.29" style="font-size:50%;">locality</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.30" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.31" style="font-size:50%;">of</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.32" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.33" style="font-size:50%;">reference</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.34" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx6.35" style="font-size:50%;color:#FF00FF;">and</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.36" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.37" style="font-size:50%;">enable</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx6.38" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx6.39" style="font-size:50%;">vectorization</span> </div> <div class="ltx_listingline" id="lstnumberx7"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx7.1" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx7.2" style="font-size:50%;">Loops</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx7.3" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx7.4" style="font-size:50%;">to</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx7.5" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier 
ltx_font_typewriter" id="lstnumberx7.6" style="font-size:50%;">interchange</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx7.7" style="font-size:50%;">:</span> </div> <div class="ltx_listingline" id="lstnumberx8"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.1" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.2" style="font-size:50%;">16:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.3" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx8.4" style="font-size:50%;color:#FF00FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.5" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.6" style="font-size:50%;">(</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx8.7" style="font-size:50%;">size_t</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.8" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx8.9" style="font-size:50%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.10" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.11" style="font-size:50%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.12" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.13" style="font-size:50%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.14" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx8.15" style="font-size:50%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.16" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.17" 
style="font-size:50%;"><</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.18" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx8.19" style="font-size:50%;">n</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.20" style="font-size:50%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.21" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx8.22" style="font-size:50%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.23" style="font-size:50%;">++)</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx8.24" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.25" style="font-size:50%;">{</span> </div> <div class="ltx_listingline" id="lstnumberx9"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.1" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.2" style="font-size:50%;">17:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.3" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx9.4" style="font-size:50%;color:#FF00FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.5" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.6" style="font-size:50%;">(</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx9.7" style="font-size:50%;">size_t</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.8" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx9.9" style="font-size:50%;">k</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.10" style="font-size:50%;"> 
</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.11" style="font-size:50%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.12" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.13" style="font-size:50%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.14" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx9.15" style="font-size:50%;">k</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.16" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.17" style="font-size:50%;"><</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.18" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx9.19" style="font-size:50%;">p</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.20" style="font-size:50%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.21" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx9.22" style="font-size:50%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.23" style="font-size:50%;">++)</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx9.24" style="font-size:50%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx9.25" style="font-size:50%;">{</span> </div> <div class="ltx_listingline" id="lstnumberx10"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.1" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.2" style="font-size:50%;">Suggestion</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.3" style="font-size:50%;">:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" 
id="lstnumberx10.4" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.5" style="font-size:50%;">Interchange</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.6" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.7" style="font-size:50%;">inner</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.8" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx10.9" style="font-size:50%;color:#FF00FF;">and</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.10" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.11" style="font-size:50%;">outer</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.12" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.13" style="font-size:50%;">loops</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.14" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx10.15" style="font-size:50%;color:#FF00FF;">in</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.16" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.17" style="font-size:50%;">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.18" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.19" style="font-size:50%;">loop</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.20" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.21" 
style="font-size:50%;">nest</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.22" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.23" style="font-size:50%;">to</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.24" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.25" style="font-size:50%;">improve</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.26" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx10.27" style="font-size:50%;">performance</span> </div> <div class="ltx_listingline" id="lstnumberx11"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx11.1" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx11.2" style="font-size:50%;">Documentation</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx11.3" style="font-size:50%;">:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx11.4" style="font-size:50%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx11.5" style="font-size:50%;">https</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx11.6" style="font-size:50%;">:</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx11.7" style="font-size:50%;"><span class="ltx_text" id="lstnumberx11.7.1" style="color:#9400D1;">github.com/codee-com/open-catalog/tree/main/Checks/PWR039</span></span> </div> <div class="ltx_listingline" id="lstnumberx12"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.1" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.2" style="font-size:50%;color:#9400D1;">AutoFix:</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" 
id="lstnumberx12.3" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.4" style="font-size:50%;color:#9400D1;">codee</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.5" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.6" style="font-size:50%;color:#9400D1;">rewrite</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.7" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.8" style="font-size:50%;color:#9400D1;">--memory</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.9" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.10" style="font-size:50%;color:#9400D1;">loop-interchange</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.11" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.12" style="font-size:50%;color:#9400D1;">--in-place</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.13" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.14" style="font-size:50%;color:#9400D1;">main.c:16:9</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.15" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.16" style="font-size:50%;color:#9400D1;">--</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.17" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.18" style="font-size:50%;color:#9400D1;">gcc</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.19" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text 
ltx_font_typewriter" id="lstnumberx12.20" style="font-size:50%;color:#9400D1;">-c</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.21" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.22" style="font-size:50%;color:#9400D1;">main.c</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.23" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.24" style="font-size:50%;color:#9400D1;">-I</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.25" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.26" style="font-size:50%;color:#9400D1;">include/</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx12.27" style="font-size:50%;color:#9400D1;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx12.28" style="font-size:50%;color:#9400D1;">-O3</span> </div> </div> </figure> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F2.3.1.1" style="font-size:90%;">Figure 2</span>. </span><span class="ltx_text" id="S4.F2.4.2" style="font-size:90%;">An example of Codee suggesting optimizations for <span class="ltx_text ltx_font_italic" id="S4.F2.4.2.1">MATMUL</span></span></figcaption> </figure> <div class="ltx_para" id="S4.p2"> <p class="ltx_p" id="S4.p2.1">Codee incorporates LLVM <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib38" title="">lattner2004llvm, </a>)</cite> to apply compiler analysis and identify potential optimizations by correlating analysis results with its performance problem catalog <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib23" title="">codee_open_catalog, </a>)</cite>. 
For each problem, Codee explains why these optimizations are necessary and suggests how they might be implemented. Some optimizations may be automatically applied by Codee, while others may require manual adjustments. For example, in Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F2" title="Figure 2 ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">2</span></a>, we show a sample of how Codee suggests an auto-fix enabled loop interchange optimization for <span class="ltx_text ltx_font_italic" id="S4.p2.1.1">MATMUL</span>.</p> </div> <div class="ltx_para" id="S4.p3"> <p class="ltx_p" id="S4.p3.1">For LLMs, we concatenate initial prompts as shown in Box <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4" title="4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">4</span></a> with the related code to guide the optimization process. Note that we do not provide LLMs with any existing optimization examples for a fair comparison with Codee.</p> </div> <div class="ltx_para" id="S4.p4"> <p class="ltx_p" id="S4.p4.1">The optimized code generated by each tool is compiled into executable binaries, which are then tested with representative inputs to assess <span class="ltx_text ltx_font_italic" id="S4.p4.1.1">Code Correctness</span> and <span class="ltx_text ltx_font_italic" id="S4.p4.1.2">Performance Speedup</span>. Moreover, we request each tool to explain the specific optimizations applied, and verify whether the optimization terminology is sensible and consistent with the modifications (i.e., <span class="ltx_text ltx_font_italic" id="S4.p4.1.3">HPC Commonsense</span>).</p> </div> <div class="ltx_para" id="S4.p5"> <p class="ltx_p" id="S4.p5.1">All experiments were performed on a machine equipped with Rocky Linux 8.5 (Green Obsidian) and a single AMD EPYC 7543 32-Core CPU. 
The software we used includes GCC/G++ v14.2.0, CLANG/CLANG++ v19.1.5, and Codee v2024.3.5. Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.T2" title="Table 2 ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">2</span></a> lists the details of the LLMs we used in the experiments.</p> </div> <div class="ltx_para ltx_noindent" id="S4.p6"> <svg class="ltx_picture" height="272.94" id="S4.p6.pic1" overflow="visible" version="1.1" width="600"><g fill="#000000" stroke="#000000" stroke-width="0.4pt" transform="translate(0,272.94) matrix(1 0 0 -1 0 0)"><g fill="#000000" fill-opacity="1.0"><path d="M 0 5.91 L 0 267.03 C 0 270.29 2.64 272.94 5.91 272.94 L 594.09 272.94 C 597.36 272.94 600 270.29 600 267.03 L 600 5.91 C 600 2.64 597.36 0 594.09 0 L 5.91 0 C 2.64 0 0 2.64 0 5.91 Z" style="stroke:none"></path></g><g fill="#F9F9F9" fill-opacity="1.0"><path d="M 1.97 5.91 L 1.97 251.29 L 598.03 251.29 L 598.03 5.91 C 598.03 3.73 596.27 1.97 594.09 1.97 L 5.91 1.97 C 3.73 1.97 1.97 3.73 1.97 5.91 Z" style="stroke:none"></path></g><g fill-opacity="1.0" transform="matrix(1.0 0.0 0.0 1.0 21.65 257.19)"><foreignobject color="#FFFFFF" height="9.84" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="556.69"> <span class="ltx_inline-block ltx_minipage ltx_align_bottom" id="S4.p6.pic1.1.1.1.1.1" style="width:402.3pt;"> <span class="ltx_p" id="S4.p6.pic1.1.1.1.1.1.1"><span class="ltx_text" id="S4.p6.pic1.1.1.1.1.1.1.1" style="font-size:80%;">Prompt template for EX1 & EX2 & EX3</span>.</span> </span></foreignobject></g><g fill-opacity="1.0" transform="matrix(1.0 0.0 0.0 1.0 21.65 13.78)"><foreignobject color="#000000" height="225.7" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="556.69"> <span class="ltx_inline-block ltx_minipage ltx_align_bottom" id="S4.p6.pic1.2.2.2.1.1" style="width:402.3pt;"> <span class="ltx_p" id="S4.p6.pic1.2.2.2.1.1.1"><span class="ltx_text ltx_font_bold"
id="S4.p6.pic1.2.2.2.1.1.1.1" style="font-size:80%;"># System Prompt (Common for all experiments and LLMs)<span class="ltx_text ltx_font_medium" id="S4.p6.pic1.2.2.2.1.1.1.1.1"> <br class="ltx_break"/>You are a code generation/optimization assistant. Given a prompt your output must only be a compilable source code. The computation environment is a Linux system (Rocky Linux 8.5 Green Obsidian) and a single AMD EPYC 7543 32-Core CPU. The C/C++ language compilers available are: GCC/G++ v14.2.0 and CLANG/CLANG++ v19.1.5 <br class="ltx_break"/></span># Prompt for EX1<span class="ltx_text ltx_font_medium" id="S4.p6.pic1.2.2.2.1.1.1.1.2"> <br class="ltx_break"/>Provide the C/C++ code with a single serial optimization without removing any of the existing functions or header files and without adding any new functions or print statements. <br class="ltx_break"/></span># Prompt for EX2<span class="ltx_text ltx_font_medium" id="S4.p6.pic1.2.2.2.1.1.1.1.3"> <br class="ltx_break"/>Propose an additional serial optimization that can be applied without removing any of the existing functions or header files and without adding any new functions or print statements. <br class="ltx_break"/></span># Prompt for EX3<span class="ltx_text ltx_font_medium" id="S4.p6.pic1.2.2.2.1.1.1.1.4"> <br class="ltx_break"/>Based on the original code, provide optimized parallel C/C++ code without removing any of the existing functions or header files and without adding any new functions or print statements.</span></span></span> </span></foreignobject></g></g></svg> </div> <figure class="ltx_table" id="S4.T2"> <figcaption class="ltx_caption ltx_centering" style="font-size:70%;"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S4.T2.6.1.1" style="font-size:129%;">Table 2</span>. 
</span><span class="ltx_text" id="S4.T2.7.2" style="font-size:129%;">Comparison of LLMs<span class="ltx_note ltx_role_footnote" id="footnote2"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text" id="footnote2.1.1.1" style="font-size:111%;">2</span></span><span class="ltx_text" id="footnote2.5" style="font-size:111%;">“Context Window” here refers to the total input + output token limit of the model.</span></span></span></span></span></figcaption> <table class="ltx_tabular ltx_centering ltx_align_middle" id="S4.T2.2"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T2.2.3.1"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S4.T2.2.3.1.1"><span class="ltx_text ltx_font_bold" id="S4.T2.2.3.1.1.1" style="font-size:70%;">Model</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.3.1.2"><span class="ltx_text ltx_font_bold" id="S4.T2.2.3.1.2.1" style="font-size:70%;">Context Window</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.3.1.3"><span class="ltx_text ltx_font_bold" id="S4.T2.2.3.1.3.1" style="font-size:70%;">Max Output</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.3.1.4"><span class="ltx_text ltx_font_bold" id="S4.T2.2.3.1.4.1" style="font-size:70%;">#Parameters</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.3.1.5"><span class="ltx_text ltx_font_bold" id="S4.T2.2.3.1.5.1" style="font-size:70%;">Publish Date</span></td> </tr> <tr class="ltx_tr" id="S4.T2.1.1"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S4.T2.1.1.2"> <span class="ltx_text" id="S4.T2.1.1.2.1" style="font-size:70%;">OpenAI o1 </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T2.1.1.2.2.1" style="font-size:70%;">(</span><a 
class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib1" title="">GPT-4<span class="ltx_text" id="S4.T2.1.1.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S4.T2.1.1.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.1.1.3"><span class="ltx_text" id="S4.T2.1.1.3.1" style="font-size:70%;">128k tokens</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.1.1.4"><span class="ltx_text" id="S4.T2.1.1.4.1" style="font-size:70%;">4096 tokens</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.1.1.1"><math alttext="\sim" class="ltx_Math" display="inline" id="S4.T2.1.1.1.m1.1"><semantics id="S4.T2.1.1.1.m1.1a"><mo id="S4.T2.1.1.1.m1.1.1" mathsize="70%" xref="S4.T2.1.1.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T2.1.1.1.m1.1b"><csymbol cd="latexml" id="S4.T2.1.1.1.m1.1.1.cmml" xref="S4.T2.1.1.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.1.1.1.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T2.1.1.1.m1.1d">∼</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.1.1.5"><span class="ltx_text" id="S4.T2.1.1.5.1" style="font-size:70%;">Dec 5, 2024</span></td> </tr> <tr class="ltx_tr" id="S4.T2.2.4.2"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S4.T2.2.4.2.1"> <span class="ltx_text" id="S4.T2.2.4.2.1.1" style="font-size:70%;">Llama-3.2 </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T2.2.4.2.1.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib2" title="">Llama3.2<span class="ltx_text" id="S4.T2.2.4.2.1.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S4.T2.2.4.2.1.4.3" style="font-size:70%;">)</span></cite> 
</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.4.2.2"><span class="ltx_text" id="S4.T2.2.4.2.2.1" style="font-size:70%;">128k tokens</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.4.2.3"><span class="ltx_text" id="S4.T2.2.4.2.3.1" style="font-size:70%;">4096 tokens</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.4.2.4"><span class="ltx_text" id="S4.T2.2.4.2.4.1" style="font-size:70%;">90B</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.2.4.2.5"><span class="ltx_text" id="S4.T2.2.4.2.5.1" style="font-size:70%;">Sep 25, 2024</span></td> </tr> <tr class="ltx_tr" id="S4.T2.2.2"> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="S4.T2.2.2.2"> <span class="ltx_text" id="S4.T2.2.2.2.1" style="font-size:70%;">Claude-3.5 </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T2.2.2.2.2.1" style="font-size:70%;">(</span><a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib9" title="">claude<span class="ltx_text" id="S4.T2.2.2.2.3.2.1.1" style="font-size:70%;">, </span></a><span class="ltx_text" id="S4.T2.2.2.2.4.3" style="font-size:70%;">)</span></cite> </td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T2.2.2.3"><span class="ltx_text" id="S4.T2.2.2.3.1" style="font-size:70%;">200k tokens</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T2.2.2.4"><span class="ltx_text" id="S4.T2.2.2.4.1" style="font-size:70%;">8192 tokens</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T2.2.2.1"><math alttext="\sim" class="ltx_Math" display="inline" id="S4.T2.2.2.1.m1.1"><semantics id="S4.T2.2.2.1.m1.1a"><mo id="S4.T2.2.2.1.m1.1.1" mathsize="70%" xref="S4.T2.2.2.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" 
id="S4.T2.2.2.1.m1.1b"><csymbol cd="latexml" id="S4.T2.2.2.1.m1.1.1.cmml" xref="S4.T2.2.2.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.2.2.1.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T2.2.2.1.m1.1d">∼</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T2.2.2.5"><span class="ltx_text" id="S4.T2.2.2.5.1" style="font-size:70%;">June 20, 2024</span></td> </tr> </tbody> </table> </figure> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1. </span>Experiments Design</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1">We conducted six experiments to analyze the performance optimization effects of Codee and LLMs. The details of each experiment are described as follows:</p> </div> <div class="ltx_para" id="S4.SS1.p2"> <ul class="ltx_itemize" id="S4.I1"> <li class="ltx_item" id="S4.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i1.p1"> <p class="ltx_p" id="S4.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i1.p1.1.1">Experiment 1 (EX1): Single Serial Optimization.</span> In this experiment, we apply a single serial optimization that does not involve multi-threaded parallelization. Using Codee, we prefer optimizations that can be auto-fixed or require a simple code change (e.g., modifying a single line or adding a few simple annotations). For LLMs, we always instruct them to automatically apply a single optimization. 
The goal is to assess the speedup each tool can achieve through simple optimizations using different compilers.</p> </div> </li> <li class="ltx_item" id="S4.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i2.p1"> <p class="ltx_p" id="S4.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i2.p1.1.1">Experiment 2 (EX2): Multiple Serial Optimizations.</span> In this experiment, we select and apply four additional serial optimizations recommended by Codee based on their importance <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib23" title="">codee_open_catalog, </a>)</cite>. For LLMs, we request them to implement and describe four additional optimizations incrementally. This yields four versions of the optimized code; for each compiler, we report the version that produces correct results with the highest speedup. The goal is to measure the maximum possible speedups each tool can achieve in the single-threaded setting with extensive updates.</p> </div> </li> <li class="ltx_item" id="S4.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i3.p1"> <p class="ltx_p" id="S4.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i3.p1.1.1">Experiment 3 (EX3): Parallel Optimizations.</span> We instruct Codee and LLMs to consider only parallel optimizations (typically OpenMP multi-threaded parallelism). The goal is to check each tool’s capability of writing parallel code. Note that we separate parallel and serial optimizations to develop a more thorough understanding of the effects of individual optimizations. 
Parallel optimizations can overshadow or interact with serial optimizations, making it difficult to isolate and evaluate the effects of serial optimizations independently.</p> </div> </li> <li class="ltx_item" id="S4.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i4.p1"> <p class="ltx_p" id="S4.I1.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i4.p1.1.1">Experiment 4 (EX4): Time Spent on Applying Optimizations.</span> We summarize and compare the human time required to invoke tools, prepare prompts, compile and run code, and validate correctness in EX1, EX2, and EX3 for each tool.</p> </div> </li> <li class="ltx_item" id="S4.I1.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i5.p1"> <p class="ltx_p" id="S4.I1.i5.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i5.p1.1.1">Experiment 5 (EX5): Correctness.</span> We summarize the correctness of code generated by different tools in EX1, EX2, and EX3. Correctness encompasses the following: whether the code is complete, whether it compiles without errors, and whether its outputs match those of the original code.
Note that we adopt the <span class="ltx_text ltx_font_italic" id="S4.I1.i5.p1.1.2">pass@1</span> metric <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib18" title="">Evauating_LLMs_Trained_on_Code, </a>)</cite> to measure the correctness of the LLMs, only considering the top-1 result without prompting the LLMs to produce additional answers.</p> </div> </li> <li class="ltx_item" id="S4.I1.i6" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i6.p1"> <p class="ltx_p" id="S4.I1.i6.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i6.p1.1.1">Experiment 6 (EX6): HPC Commonsense.</span> We define a new term, <span class="ltx_text ltx_font_italic" id="S4.I1.i6.p1.1.2">HPC Commonsense</span>, to understand how the suggested optimizations are relevant to each benchmark and the domain they belong to. Codee provides reasons for optimizations during code inspection. If an LLM does not provide explanations directly, we prompt it with “Give me explanations for the optimizations you made.”</p> </div> </li> </ul> </div> <div class="ltx_para" id="S4.SS1.p3"> <p class="ltx_p" id="S4.SS1.p3.1">All experiments were conducted by two graduate students with a limited background in HPC code optimization. The optimized code and benchmark results have been verified by several experienced performance engineers. We compare the execution time for each optimized code generated in EX1, EX2, and EX3 against the original code, with the time averaged over 10 iterations. For each benchmark, we utilized the built-in input by default and updated certain inputs to ensure that the execution time was sufficiently long. Note that <span class="ltx_text ltx_font_bold" id="S4.SS1.p3.1.1">NA</span> in our figures indicates that the generated code yields incorrect results, fails to compile, or encounters any runtime errors.
For every <span class="ltx_text ltx_font_bold" id="S4.SS1.p3.1.2">NA</span> instance, we revert to the original code to measure runtime. We use the following abbreviations for LLMs: O1 (OpenAI o1), L3.2 (Llama-3.2), and C3.5 (Claude-3.5).</p> </div> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2. </span>Experiment 1: Single Serial Optimization</h3> <figure class="ltx_figure" id="S4.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="316" id="S4.F3.g1" src="x2.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F3.2.1.1" style="font-size:90%;">Figure 3</span>. </span><span class="ltx_text" id="S4.F3.3.2" style="font-size:90%;">Speedups of single- and multiple-round serial optimizations. The Y axis indicates the speedup and the horizontal lines denote a speedup of 1.0. Benchmarks that achieved a speedup ≥5.9 in multiple-round optimization across GCC/G++ and CLANG/CLANG++ compilers have been marked above the respective bars.</span></figcaption> </figure> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.8">Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F3" title="Figure 3 ‣ 4.2. Experiment 1: Single Serial Optimization ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">3</span></a> (a) and (b) present the speedups for each benchmark achieved by optimizing the original code using three different LLMs and Codee, compiled with GCC/G++ and CLANG/CLANG++, respectively.
The mean speedups achieved by Codee are 1.48<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.1.m1.1"><semantics id="S4.SS2.p1.1.m1.1a"><mo id="S4.SS2.p1.1.m1.1.1" xref="S4.SS2.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.1.m1.1b"><times id="S4.SS2.p1.1.m1.1.1.cmml" xref="S4.SS2.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.1.m1.1d">×</annotation></semantics></math> (GCC/G++) and 1.48<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.2.m2.1"><semantics id="S4.SS2.p1.2.m2.1a"><mo id="S4.SS2.p1.2.m2.1.1" xref="S4.SS2.p1.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.2.m2.1b"><times id="S4.SS2.p1.2.m2.1.1.cmml" xref="S4.SS2.p1.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.2.m2.1d">×</annotation></semantics></math> (CLANG/CLANG++). 
In comparison, OpenAI o1 achieves mean speedups of 1.50<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.3.m3.1"><semantics id="S4.SS2.p1.3.m3.1a"><mo id="S4.SS2.p1.3.m3.1.1" xref="S4.SS2.p1.3.m3.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.3.m3.1b"><times id="S4.SS2.p1.3.m3.1.1.cmml" xref="S4.SS2.p1.3.m3.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.3.m3.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.3.m3.1d">×</annotation></semantics></math> and 1.44<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.4.m4.1"><semantics id="S4.SS2.p1.4.m4.1a"><mo id="S4.SS2.p1.4.m4.1.1" xref="S4.SS2.p1.4.m4.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.4.m4.1b"><times id="S4.SS2.p1.4.m4.1.1.cmml" xref="S4.SS2.p1.4.m4.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.4.m4.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.4.m4.1d">×</annotation></semantics></math>, Llama-3.2 achieves mean speedups of 1.24<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.5.m5.1"><semantics id="S4.SS2.p1.5.m5.1a"><mo id="S4.SS2.p1.5.m5.1.1" xref="S4.SS2.p1.5.m5.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.5.m5.1b"><times id="S4.SS2.p1.5.m5.1.1.cmml" xref="S4.SS2.p1.5.m5.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.5.m5.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.5.m5.1d">×</annotation></semantics></math> and 1.19<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.6.m6.1"><semantics id="S4.SS2.p1.6.m6.1a"><mo id="S4.SS2.p1.6.m6.1.1" xref="S4.SS2.p1.6.m6.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.6.m6.1b"><times id="S4.SS2.p1.6.m6.1.1.cmml" 
xref="S4.SS2.p1.6.m6.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.6.m6.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.6.m6.1d">×</annotation></semantics></math>, and Claude-3.5 achieves mean speedups of 1.02<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.7.m7.1"><semantics id="S4.SS2.p1.7.m7.1a"><mo id="S4.SS2.p1.7.m7.1.1" xref="S4.SS2.p1.7.m7.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.7.m7.1b"><times id="S4.SS2.p1.7.m7.1.1.cmml" xref="S4.SS2.p1.7.m7.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.7.m7.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.7.m7.1d">×</annotation></semantics></math> and 1.04<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p1.8.m8.1"><semantics id="S4.SS2.p1.8.m8.1a"><mo id="S4.SS2.p1.8.m8.1.1" xref="S4.SS2.p1.8.m8.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.8.m8.1b"><times id="S4.SS2.p1.8.m8.1.1.cmml" xref="S4.SS2.p1.8.m8.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.8.m8.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.8.m8.1d">×</annotation></semantics></math>, using GCC/G++ and CLANG/CLANG++, respectively. OpenAI o1 typically outperforms Llama-3.2 in both speedup and the number of correctly produced codes, whereas Claude-3.5 does not reach the same level of efficiency and accuracy.</p> </div> <div class="ltx_para" id="S4.SS2.p2"> <p class="ltx_p" id="S4.SS2.p2.2">Owing to Codee’s heuristics and compiler-based code analysis, we observed no errors in the optimized code. In contrast, OpenAI o1, Llama-3.2, and Claude-3.5 produce code that is incomplete or yields incorrect results. Codee matches or outperforms the LLMs in most of the benchmarks.
In the case of <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.2.1">MATMUL</span>, we noticed that two of the three LLMs (OpenAI o1 and Llama-3.2) apply the same loop-interchange serial optimization as Codee (yielding a ~4.5<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p2.1.m1.1"><semantics id="S4.SS2.p2.1.m1.1a"><mo id="S4.SS2.p2.1.m1.1.1" xref="S4.SS2.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.1.m1.1b"><times id="S4.SS2.p2.1.m1.1.1.cmml" xref="S4.SS2.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.1.m1.1d">×</annotation></semantics></math> speedup). We believe that the LLMs’ familiarity with a widely used algorithm like <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.2.2">MATMUL</span> from training results in such a speedup. In contrast, for a more HPC-specific algorithm such as <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.2.3">HACCmk</span>, only OpenAI o1 among the LLMs provides significant speedups comparable to Codee (~5<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS2.p2.2.m2.1"><semantics id="S4.SS2.p2.2.m2.1a"><mo id="S4.SS2.p2.2.m2.1.1" xref="S4.SS2.p2.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.2.m2.1b"><times id="S4.SS2.p2.2.m2.1.1.cmml" xref="S4.SS2.p2.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.2.m2.1d">×</annotation></semantics></math> speedup), while the other LLMs cannot optimize the code well. Llama-3.2 and Claude-3.5 struggle to achieve significant optimization in level 2 benchmarks, such as <span class="ltx_text ltx_font_italic" id="S4.SS2.p2.2.4">Particlefilter</span>.
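The loop-interchange transformation mentioned above can be sketched as follows. This is an illustrative example of the technique only, not the exact code produced by Codee or any of the LLMs:

```c
#include <assert.h>

#define N 64

/* Naive i-j-k order: the innermost loop walks down a COLUMN of b,
 * so for large N nearly every access to b misses the cache. */
static void matmul_ijk(const double a[N][N], const double b[N][N],
                       double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged i-k-j order: the innermost loop now traverses both
 * b and c ROW-wise, improving spatial locality. This is the kind of
 * interchange credited with the serial MATMUL speedup. */
static void matmul_ikj(const double a[N][N], const double b[N][N],
                       double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Both variants accumulate each c[i][j] over k in the same order, so they produce identical results; only the memory-access pattern changes.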
The results with CLANG/CLANG++ are generally consistent with the results obtained with GCC/G++.</p> </div> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.3. </span>Experiment 2: Multiple Serial Optimizations</h3> <div class="ltx_para" id="S4.SS3.p1"> <p class="ltx_p" id="S4.SS3.p1.8">Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F3" title="Figure 3 ‣ 4.2. Experiment 1: Single Serial Optimization ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">3</span></a> (c) and (d) present the optimization results for each benchmark after applying multiple serial optimizations. Codee achieves mean speedups of 1.71<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.1.m1.1"><semantics id="S4.SS3.p1.1.m1.1a"><mo id="S4.SS3.p1.1.m1.1.1" xref="S4.SS3.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.1.m1.1b"><times id="S4.SS3.p1.1.m1.1.1.cmml" xref="S4.SS3.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.1.m1.1d">×</annotation></semantics></math> (GCC/G++) and 1.78<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.2.m2.1"><semantics id="S4.SS3.p1.2.m2.1a"><mo id="S4.SS3.p1.2.m2.1.1" xref="S4.SS3.p1.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.2.m2.1b"><times id="S4.SS3.p1.2.m2.1.1.cmml" xref="S4.SS3.p1.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.2.m2.1d">×</annotation></semantics></math> (CLANG/CLANG++). 
OpenAI o1 achieves mean speedups of 2.07<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.3.m3.1"><semantics id="S4.SS3.p1.3.m3.1a"><mo id="S4.SS3.p1.3.m3.1.1" xref="S4.SS3.p1.3.m3.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.3.m3.1b"><times id="S4.SS3.p1.3.m3.1.1.cmml" xref="S4.SS3.p1.3.m3.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.3.m3.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.3.m3.1d">×</annotation></semantics></math> and 2.13<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.4.m4.1"><semantics id="S4.SS3.p1.4.m4.1a"><mo id="S4.SS3.p1.4.m4.1.1" xref="S4.SS3.p1.4.m4.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.4.m4.1b"><times id="S4.SS3.p1.4.m4.1.1.cmml" xref="S4.SS3.p1.4.m4.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.4.m4.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.4.m4.1d">×</annotation></semantics></math>, Llama-3.2 achieves mean speedups of 1.30<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.5.m5.1"><semantics id="S4.SS3.p1.5.m5.1a"><mo id="S4.SS3.p1.5.m5.1.1" xref="S4.SS3.p1.5.m5.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.5.m5.1b"><times id="S4.SS3.p1.5.m5.1.1.cmml" xref="S4.SS3.p1.5.m5.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.5.m5.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.5.m5.1d">×</annotation></semantics></math> and 1.32<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.6.m6.1"><semantics id="S4.SS3.p1.6.m6.1a"><mo id="S4.SS3.p1.6.m6.1.1" xref="S4.SS3.p1.6.m6.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.6.m6.1b"><times id="S4.SS3.p1.6.m6.1.1.cmml" xref="S4.SS3.p1.6.m6.1.1"></times></annotation-xml><annotation 
encoding="application/x-tex" id="S4.SS3.p1.6.m6.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.6.m6.1d">×</annotation></semantics></math>, and Claude-3.5 achieves mean speedups of 1.31<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.7.m7.1"><semantics id="S4.SS3.p1.7.m7.1a"><mo id="S4.SS3.p1.7.m7.1.1" xref="S4.SS3.p1.7.m7.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.7.m7.1b"><times id="S4.SS3.p1.7.m7.1.1.cmml" xref="S4.SS3.p1.7.m7.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.7.m7.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.7.m7.1d">×</annotation></semantics></math> and 1.31<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p1.8.m8.1"><semantics id="S4.SS3.p1.8.m8.1a"><mo id="S4.SS3.p1.8.m8.1.1" xref="S4.SS3.p1.8.m8.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.8.m8.1b"><times id="S4.SS3.p1.8.m8.1.1.cmml" xref="S4.SS3.p1.8.m8.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.8.m8.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.8.m8.1d">×</annotation></semantics></math>. As EX2 codes are derived from the optimizations implemented in EX1, errors present in EX1 persist and those codes are still identified as <span class="ltx_text ltx_font_bold" id="S4.SS3.p1.8.1">NA</span> (e.g., <span class="ltx_text ltx_font_italic" id="S4.SS3.p1.8.2">Srad</span>).</p> </div> <div class="ltx_para" id="S4.SS3.p2"> <p class="ltx_p" id="S4.SS3.p2.2">In most cases, we observed that higher speedups are achieved after applying multiple optimizations. 
For example, in <span class="ltx_text ltx_font_italic" id="S4.SS3.p2.2.1">Hotspot3D</span>, Codee shows a significant improvement in multiple-round optimization compared to single-round optimization, with a speedup of 1.80<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p2.1.m1.1"><semantics id="S4.SS3.p2.1.m1.1a"><mo id="S4.SS3.p2.1.m1.1.1" xref="S4.SS3.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.1.m1.1b"><times id="S4.SS3.p2.1.m1.1.1.cmml" xref="S4.SS3.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.1.m1.1d">×</annotation></semantics></math> using CLANG/CLANG++. In an interesting scenario for <span class="ltx_text ltx_font_italic" id="S4.SS3.p2.2.2">Particlefilter</span>, apart from suggesting regular canonical optimizations, OpenAI o1 goes a step further by suggesting replacing the existing linear search with a binary search, resulting in a speedup of 9.8<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS3.p2.2.m2.1"><semantics id="S4.SS3.p2.2.m2.1a"><mo id="S4.SS3.p2.2.m2.1.1" xref="S4.SS3.p2.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.2.m2.1b"><times id="S4.SS3.p2.2.m2.1.1.cmml" xref="S4.SS3.p2.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.2.m2.1d">×</annotation></semantics></math> on both compilers. This finding highlights LLMs’ capability to offer adaptive optimizations from the perspective of algorithmic time complexity.</p> </div> </section> <section class="ltx_subsection" id="S4.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.4.
</span>Experiment 3: Parallel Optimization</h3> <figure class="ltx_figure" id="S4.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="923" id="S4.F4.g1" src="x3.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F4.2.1.1" style="font-size:90%;">Figure 4</span>. </span><span class="ltx_text" id="S4.F4.3.2" style="font-size:90%;">Results from parallel optimizations using GCC/G++ compiler. The Y axis indicates the speedup.</span></figcaption> </figure> <figure class="ltx_figure" id="S4.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="917" id="S4.F5.g1" src="x4.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F5.2.1.1" style="font-size:90%;">Figure 5</span>. </span><span class="ltx_text" id="S4.F5.3.2" style="font-size:90%;">Results from parallel optimizations using CLANG/CLANG++ compiler. The Y axis indicates the speedup.</span></figcaption> </figure> <div class="ltx_para" id="S4.SS4.p1"> <p class="ltx_p" id="S4.SS4.p1.10">Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F4" title="Figure 4 ‣ 4.4. Experiment 3: Parallel Optimization ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">4</span></a> and Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F5" title="Figure 5 ‣ 4.4. Experiment 3: Parallel Optimization ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">5</span></a> demonstrate the speedups achieved by parallel optimizations using each tool with 4 to 32 threads, compiled with GCC/G++ and CLANG/CLANG++ respectively. 
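The parallel optimizations evaluated in this experiment are typically OpenMP loop parallelizations of the following shape. This is an illustrative sketch over a hypothetical row-scaling kernel, not the exact output of Codee or any LLM:

```c
#include <assert.h>

#define ROWS 256
#define COLS 256

/* Hypothetical kernel for illustration. The pragma sits on the
 * OUTERMOST loop: each thread processes whole rows, so the per-thread
 * workload is large relative to the fork/join overhead. Placing the
 * pragma on the inner loop instead would fork and synchronize once
 * per row, the pattern behind the slowdowns seen for Durbin and
 * Jacobi-1d. schedule(auto) delegates the iteration-scheduling choice
 * to the compiler/runtime, as in the clause Codee favored over a
 * plain "#pragma omp parallel for". Compile with -fopenmp; without
 * it the pragma is ignored and the loop runs serially with
 * identical results. */
static void scale_rows(double grid[ROWS][COLS], double factor) {
    #pragma omp parallel for schedule(auto)
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] *= factor;
}
```

Each outer-loop iteration touches a disjoint row, so no synchronization is needed beyond the implicit barrier at the end of the worksharing loop.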
For GCC/G++, the mean speedups of Codee, OpenAI o1, Llama-3.2, and Claude-3.5 across 4 to 32 threads are 4.75<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.1.m1.1"><semantics id="S4.SS4.p1.1.m1.1a"><mo id="S4.SS4.p1.1.m1.1.1" xref="S4.SS4.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.1.m1.1b"><times id="S4.SS4.p1.1.m1.1.1.cmml" xref="S4.SS4.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.1.m1.1d">×</annotation></semantics></math>, 5.01<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.2.m2.1"><semantics id="S4.SS4.p1.2.m2.1a"><mo id="S4.SS4.p1.2.m2.1.1" xref="S4.SS4.p1.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.2.m2.1b"><times id="S4.SS4.p1.2.m2.1.1.cmml" xref="S4.SS4.p1.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.2.m2.1d">×</annotation></semantics></math>, 2.89<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.3.m3.1"><semantics id="S4.SS4.p1.3.m3.1a"><mo id="S4.SS4.p1.3.m3.1.1" xref="S4.SS4.p1.3.m3.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.3.m3.1b"><times id="S4.SS4.p1.3.m3.1.1.cmml" xref="S4.SS4.p1.3.m3.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.3.m3.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.3.m3.1d">×</annotation></semantics></math>, and 3.38<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.4.m4.1"><semantics id="S4.SS4.p1.4.m4.1a"><mo id="S4.SS4.p1.4.m4.1.1" xref="S4.SS4.p1.4.m4.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.4.m4.1b"><times id="S4.SS4.p1.4.m4.1.1.cmml" 
xref="S4.SS4.p1.4.m4.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.4.m4.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.4.m4.1d">×</annotation></semantics></math>, respectively. For CLANG/CLANG++, the mean speedups of Codee, OpenAI o1, Llama-3.2, and Claude-3.5 across 4 to 32 threads are 4.12<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.5.m5.1"><semantics id="S4.SS4.p1.5.m5.1a"><mo id="S4.SS4.p1.5.m5.1.1" xref="S4.SS4.p1.5.m5.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.5.m5.1b"><times id="S4.SS4.p1.5.m5.1.1.cmml" xref="S4.SS4.p1.5.m5.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.5.m5.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.5.m5.1d">×</annotation></semantics></math>, 5.04<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.6.m6.1"><semantics id="S4.SS4.p1.6.m6.1a"><mo id="S4.SS4.p1.6.m6.1.1" xref="S4.SS4.p1.6.m6.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.6.m6.1b"><times id="S4.SS4.p1.6.m6.1.1.cmml" xref="S4.SS4.p1.6.m6.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.6.m6.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.6.m6.1d">×</annotation></semantics></math>, 2.46<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.7.m7.1"><semantics id="S4.SS4.p1.7.m7.1a"><mo id="S4.SS4.p1.7.m7.1.1" xref="S4.SS4.p1.7.m7.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.7.m7.1b"><times id="S4.SS4.p1.7.m7.1.1.cmml" xref="S4.SS4.p1.7.m7.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.7.m7.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.7.m7.1d">×</annotation></semantics></math>, and 3.64<math alttext="\times" class="ltx_Math" display="inline" 
id="S4.SS4.p1.8.m8.1"><semantics id="S4.SS4.p1.8.m8.1a"><mo id="S4.SS4.p1.8.m8.1.1" xref="S4.SS4.p1.8.m8.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.8.m8.1b"><times id="S4.SS4.p1.8.m8.1.1.cmml" xref="S4.SS4.p1.8.m8.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.8.m8.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.8.m8.1d">×</annotation></semantics></math>, respectively. OpenAI o1 demonstrates the best performance in recommending strategies for parallel optimization across different compilers. Codee does suggest good parallel optimizations as well, offering auto-fixes for most of the commonly used optimization strategies. Notably, in <span class="ltx_text ltx_font_italic" id="S4.SS4.p1.10.1">HACCmk</span> and <span class="ltx_text ltx_font_italic" id="S4.SS4.p1.10.2">Hotspot3D</span>, Codee’s performance is particularly impressive, achieving 40.00<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.9.m9.1"><semantics id="S4.SS4.p1.9.m9.1a"><mo id="S4.SS4.p1.9.m9.1.1" xref="S4.SS4.p1.9.m9.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.9.m9.1b"><times id="S4.SS4.p1.9.m9.1.1.cmml" xref="S4.SS4.p1.9.m9.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.9.m9.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p1.9.m9.1d">×</annotation></semantics></math> and 21.20<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p1.10.m10.1"><semantics id="S4.SS4.p1.10.m10.1a"><mo id="S4.SS4.p1.10.m10.1.1" xref="S4.SS4.p1.10.m10.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p1.10.m10.1b"><times id="S4.SS4.p1.10.m10.1.1.cmml" xref="S4.SS4.p1.10.m10.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p1.10.m10.1c">\times</annotation><annotation encoding="application/x-llamapun" 
id="S4.SS4.p1.10.m10.1d">×</annotation></semantics></math> at 32 threads, respectively. As for the LLMs, it should be noted that Claude-3.5 fails to optimize 8 out of 20 benchmarks.</p> </div> <div class="ltx_para" id="S4.SS4.p2"> <p class="ltx_p" id="S4.SS4.p2.1">While the LLMs are good at yielding efficient optimizations for <span class="ltx_text ltx_font_italic" id="S4.SS4.p2.1.1">MATMUL</span> as discussed in previous sections, in the case of parallelization, all of them fail to suggest meaningful optimizations that yield speedups. Interestingly, while OpenAI o1 yields code with remarkable speedups for <span class="ltx_text ltx_font_italic" id="S4.SS4.p2.1.2">MATMUL</span>, it applies serial optimizations instead of using OpenMP pragmas, indicating that it does not follow the user’s instructions and breaks the strictly parallel optimization requirement of our experiment. Llama-3.2 and Claude-3.5 end up not only parallelizing the matrix multiplication function, but also parallelizing the outer loop that benchmarks the MATMUL function multiple times, resulting in complex nested parallelism and excessive memory contention. In the case of <span class="ltx_text ltx_font_italic" id="S4.SS4.p2.1.3">Hotspot3D</span>, Codee recommends using <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p2.1.4">#pragma omp for schedule(auto)</span> to achieve a higher speedup than <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p2.1.5">#pragma omp parallel for</span> suggested by the LLMs. A similar scenario also occurs in <span class="ltx_text ltx_font_italic" id="S4.SS4.p2.1.6">HACCmk</span>.</p> </div> <div class="ltx_para" id="S4.SS4.p3"> <p class="ltx_p" id="S4.SS4.p3.1">Interestingly, with <span class="ltx_text ltx_font_italic" id="S4.SS4.p3.1.1">Durbin</span> and <span class="ltx_text ltx_font_italic" id="S4.SS4.p3.1.2">Jacobi-1d</span>, performance declines as the thread count increases.
This occurs as a result of applying OpenMP pragmas to the inner loop instead of the outer loop of the nested loops present in the program. When parallelizing the inner loop, the workload per thread might be too small relative to the overhead of creating and synchronizing threads. Another interesting scenario occurs with <span class="ltx_text ltx_font_italic" id="S4.SS4.p3.1.3">Srad</span>, where Llama-3.2 is the only model that failed to scale using the CLANG/CLANG++ compiler, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F5" title="Figure 5 ‣ 4.4. Experiment 3: Parallel Optimization ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">5</span></a>. However, its GCC/G++ counterpart does achieve non-trivial speedups. Upon investigation, we found that GCC/G++ automatically fused nested loops for better scheduling. CLANG/CLANG++, on the other hand, only parallelizes the outermost loop with <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p3.1.4">#pragma omp parallel for</span>.</p> </div> </section> <section class="ltx_subsection" id="S4.SS5"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.5. </span>Experiment 4: Time Spent on Applying Optimizations</h3> <figure class="ltx_table" id="S4.T3"> <figcaption class="ltx_caption ltx_centering" style="font-size:70%;"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S4.T3.19.1.1" style="font-size:129%;">Table 3</span>.
</span><span class="ltx_text" id="S4.T3.20.2" style="font-size:129%;">Time taken to apply code optimizations</span></figcaption> <table class="ltx_tabular ltx_centering ltx_align_middle" id="S4.T3.15"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T3.15.16.1"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S4.T3.15.16.1.1" rowspan="2"><span class="ltx_text" id="S4.T3.15.16.1.1.1" style="font-size:70%;">Tool</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" colspan="3" id="S4.T3.15.16.1.2"><span class="ltx_text" id="S4.T3.15.16.1.2.1" style="font-size:70%;">Average Time (Per Benchmark)</span></td> </tr> <tr class="ltx_tr" id="S4.T3.15.17.2"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.15.17.2.1"><span class="ltx_text" id="S4.T3.15.17.2.1.1" style="font-size:70%;">EX1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.15.17.2.2"><span class="ltx_text" id="S4.T3.15.17.2.2.1" style="font-size:70%;">EX2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.15.17.2.3"><span class="ltx_text" id="S4.T3.15.17.2.3.1" style="font-size:70%;">EX3</span></td> </tr> <tr class="ltx_tr" id="S4.T3.3.3"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S4.T3.3.3.4"><span class="ltx_text" id="S4.T3.3.3.4.1" style="font-size:70%;">Codee</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.1.1.1"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.1.1.1.m1.1"><semantics id="S4.T3.1.1.1.m1.1a"><mo id="S4.T3.1.1.1.m1.1.1" mathsize="70%" xref="S4.T3.1.1.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.1.1.1.m1.1b"><csymbol cd="latexml" id="S4.T3.1.1.1.m1.1.1.cmml" xref="S4.T3.1.1.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.1.1.1.m1.1c">\sim</annotation><annotation 
encoding="application/x-llamapun" id="S4.T3.1.1.1.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.1.1.1.1" style="font-size:70%;">3 mins</span> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T3.2.2.2"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.2.2.2.m1.1"><semantics id="S4.T3.2.2.2.m1.1a"><mo id="S4.T3.2.2.2.m1.1.1" mathsize="70%" xref="S4.T3.2.2.2.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.2.2.2.m1.1b"><csymbol cd="latexml" id="S4.T3.2.2.2.m1.1.1.cmml" xref="S4.T3.2.2.2.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.2.2.2.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.2.2.2.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.2.2.2.1" style="font-size:70%;">4 mins</span> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.3.3.3"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.3.3.3.m1.1"><semantics id="S4.T3.3.3.3.m1.1a"><mo id="S4.T3.3.3.3.m1.1.1" mathsize="70%" xref="S4.T3.3.3.3.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.3.3.3.m1.1b"><csymbol cd="latexml" id="S4.T3.3.3.3.m1.1.1.cmml" xref="S4.T3.3.3.3.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.3.3.3.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.3.3.3.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.3.3.3.1" style="font-size:70%;">2 mins</span> </td> </tr> <tr class="ltx_tr" id="S4.T3.7.7"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S4.T3.7.7.5"><span class="ltx_text" id="S4.T3.7.7.5.1" style="font-size:70%;">OpenAI o1</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T3.4.4.1"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.4.4.1.m1.1"><semantics 
id="S4.T3.4.4.1.m1.1a"><mo id="S4.T3.4.4.1.m1.1.1" mathsize="70%" xref="S4.T3.4.4.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.4.4.1.m1.1b"><csymbol cd="latexml" id="S4.T3.4.4.1.m1.1.1.cmml" xref="S4.T3.4.4.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.4.4.1.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.4.4.1.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.4.4.1.1" style="font-size:70%;">2 mins</span> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T3.6.6.3"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.5.5.2.m1.1"><semantics id="S4.T3.5.5.2.m1.1a"><mo id="S4.T3.5.5.2.m1.1.1" mathsize="70%" xref="S4.T3.5.5.2.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.5.5.2.m1.1b"><csymbol cd="latexml" id="S4.T3.5.5.2.m1.1.1.cmml" xref="S4.T3.5.5.2.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.5.5.2.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.5.5.2.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.6.6.3.1" style="font-size:70%;">6 mins (</span><math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.6.6.3.m2.1"><semantics id="S4.T3.6.6.3.m2.1a"><mo id="S4.T3.6.6.3.m2.1.1" mathsize="70%" xref="S4.T3.6.6.3.m2.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.6.6.3.m2.1b"><csymbol cd="latexml" id="S4.T3.6.6.3.m2.1.1.cmml" xref="S4.T3.6.6.3.m2.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.6.6.3.m2.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.6.6.3.m2.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.6.6.3.2" style="font-size:70%;">1.5 mins/trial)</span> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.7.7.4"> <math 
alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.7.7.4.m1.1"><semantics id="S4.T3.7.7.4.m1.1a"><mo id="S4.T3.7.7.4.m1.1.1" mathsize="70%" xref="S4.T3.7.7.4.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.7.7.4.m1.1b"><csymbol cd="latexml" id="S4.T3.7.7.4.m1.1.1.cmml" xref="S4.T3.7.7.4.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.7.7.4.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.7.7.4.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.7.7.4.1" style="font-size:70%;">5 mins</span> </td> </tr> <tr class="ltx_tr" id="S4.T3.11.11"> <td class="ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S4.T3.11.11.5"><span class="ltx_text" id="S4.T3.11.11.5.1" style="font-size:70%;">Llama-3.2</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T3.8.8.1"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.8.8.1.m1.1"><semantics id="S4.T3.8.8.1.m1.1a"><mo id="S4.T3.8.8.1.m1.1.1" mathsize="70%" xref="S4.T3.8.8.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.8.8.1.m1.1b"><csymbol cd="latexml" id="S4.T3.8.8.1.m1.1.1.cmml" xref="S4.T3.8.8.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.8.8.1.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.8.8.1.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.8.8.1.1" style="font-size:70%;">2 mins</span> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T3.10.10.3"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.9.9.2.m1.1"><semantics id="S4.T3.9.9.2.m1.1a"><mo id="S4.T3.9.9.2.m1.1.1" mathsize="70%" xref="S4.T3.9.9.2.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.9.9.2.m1.1b"><csymbol cd="latexml" id="S4.T3.9.9.2.m1.1.1.cmml" 
xref="S4.T3.9.9.2.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.9.9.2.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.9.9.2.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.10.10.3.1" style="font-size:70%;">9 mins (</span><math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.10.10.3.m2.1"><semantics id="S4.T3.10.10.3.m2.1a"><mo id="S4.T3.10.10.3.m2.1.1" mathsize="70%" xref="S4.T3.10.10.3.m2.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.10.10.3.m2.1b"><csymbol cd="latexml" id="S4.T3.10.10.3.m2.1.1.cmml" xref="S4.T3.10.10.3.m2.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.10.10.3.m2.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.10.10.3.m2.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.10.10.3.2" style="font-size:70%;">2.25 mins/trial)</span> </td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.11.11.4"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.11.11.4.m1.1"><semantics id="S4.T3.11.11.4.m1.1a"><mo id="S4.T3.11.11.4.m1.1.1" mathsize="70%" xref="S4.T3.11.11.4.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.11.11.4.m1.1b"><csymbol cd="latexml" id="S4.T3.11.11.4.m1.1.1.cmml" xref="S4.T3.11.11.4.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.11.11.4.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.11.11.4.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.11.11.4.1" style="font-size:70%;">5 mins</span> </td> </tr> <tr class="ltx_tr" id="S4.T3.15.15"> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="S4.T3.15.15.5"><span class="ltx_text" id="S4.T3.15.15.5.1" style="font-size:70%;">Claude-3.5</span></td> <td class="ltx_td 
ltx_align_left ltx_border_b ltx_border_r ltx_border_t" id="S4.T3.12.12.1"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.12.12.1.m1.1"><semantics id="S4.T3.12.12.1.m1.1a"><mo id="S4.T3.12.12.1.m1.1.1" mathsize="70%" xref="S4.T3.12.12.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.12.12.1.m1.1b"><csymbol cd="latexml" id="S4.T3.12.12.1.m1.1.1.cmml" xref="S4.T3.12.12.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.12.12.1.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.12.12.1.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.12.12.1.1" style="font-size:70%;">2 mins</span> </td> <td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t" id="S4.T3.14.14.3"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.13.13.2.m1.1"><semantics id="S4.T3.13.13.2.m1.1a"><mo id="S4.T3.13.13.2.m1.1.1" mathsize="70%" xref="S4.T3.13.13.2.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.13.13.2.m1.1b"><csymbol cd="latexml" id="S4.T3.13.13.2.m1.1.1.cmml" xref="S4.T3.13.13.2.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.13.13.2.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.13.13.2.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.14.14.3.1" style="font-size:70%;">6 mins (</span><math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.14.14.3.m2.1"><semantics id="S4.T3.14.14.3.m2.1a"><mo id="S4.T3.14.14.3.m2.1.1" mathsize="70%" xref="S4.T3.14.14.3.m2.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.14.14.3.m2.1b"><csymbol cd="latexml" id="S4.T3.14.14.3.m2.1.1.cmml" xref="S4.T3.14.14.3.m2.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.14.14.3.m2.1c">\sim</annotation><annotation encoding="application/x-llamapun" 
id="S4.T3.14.14.3.m2.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.14.14.3.2" style="font-size:70%;">1.5 mins/trial)</span> </td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T3.15.15.4"> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.T3.15.15.4.m1.1"><semantics id="S4.T3.15.15.4.m1.1a"><mo id="S4.T3.15.15.4.m1.1.1" mathsize="70%" xref="S4.T3.15.15.4.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.T3.15.15.4.m1.1b"><csymbol cd="latexml" id="S4.T3.15.15.4.m1.1.1.cmml" xref="S4.T3.15.15.4.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.15.15.4.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.T3.15.15.4.m1.1d">∼</annotation></semantics></math><span class="ltx_text" id="S4.T3.15.15.4.1" style="font-size:70%;">4 mins</span> </td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="S4.SS5.p1"> <p class="ltx_p" id="S4.SS5.p1.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.T3" title="Table 3 ‣ 4.5. Experiment 4: Time Spent on Applying Optimizations ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">3</span></a> summarizes the time required to apply optimizations using each tool, including prompting the model, waiting for responses, implementing the necessary code changes, and validating the results. Codee includes built-in auto-fixes for specific optimizations, allowing it to transform code automatically regardless of its size. Although many optimizations still require manual code adjustments, Codee reports the file, function, and line number where each optimization should be applied. In comparison, LLMs can often automate code transformations. Nonetheless, Codee is generally faster than the LLMs in EX2 and EX3, for the following reasons. 
First, since the LLMs are invoked on remote servers and the models are large, generating responses can be slow, especially when multiple sequential optimizations are applied and the outputs become lengthy. Notably, querying the Llama-3.2 API via Together AI <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib63" title="">together_ai, </a>)</cite> introduces a significant delay compared to the other models. Furthermore, LLMs are typically limited by output token constraints (e.g., 4k tokens for OpenAI o1), which can prevent them from generating complete code, especially in EX2 with multiple serial optimizations. Moreover, some LLM outputs contain extra text or incomplete code, necessitating manual extraction of the code. Interestingly, OpenAI o1 often produces clean code despite having the longest inference time, which reduces the time required for subsequent processing.</p> </div> </section> <section class="ltx_subsection" id="S4.SS6"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.6. </span>Experiment 5: Correctness</h3> <figure class="ltx_table" id="S4.T4"> <figcaption class="ltx_caption ltx_centering" style="font-size:70%;"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S4.T4.4.1.1" style="font-size:129%;">Table 4</span>. 
</span><span class="ltx_text" id="S4.T4.5.2" style="font-size:129%;">Comparison of correctness of code optimized by Codee and LLMs in EX1, EX2, and EX3.</span></figcaption> <table class="ltx_tabular ltx_centering ltx_align_middle" id="S4.T4.6"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T4.6.1.1"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S4.T4.6.1.1.1" rowspan="2"><span class="ltx_text ltx_font_bold" id="S4.T4.6.1.1.1.1" style="font-size:70%;">Outcome</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" colspan="4" id="S4.T4.6.1.1.2"><span class="ltx_text ltx_font_bold" id="S4.T4.6.1.1.2.1" style="font-size:70%;">EX1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" colspan="4" id="S4.T4.6.1.1.3"><span class="ltx_text ltx_font_bold" id="S4.T4.6.1.1.3.1" style="font-size:70%;">EX2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" colspan="4" id="S4.T4.6.1.1.4"><span class="ltx_text ltx_font_bold" id="S4.T4.6.1.1.4.1" style="font-size:70%;">EX3</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T4.6.1.1.5" rowspan="2"><span class="ltx_text ltx_font_bold" id="S4.T4.6.1.1.5.1" style="font-size:70%;">Total</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.2.2"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.1"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.1.1" style="font-size:70%;">Codee</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.2"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.2.1" style="font-size:70%;">O1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.3"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.3.1" style="font-size:70%;">L3.2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.4"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.4.1" 
style="font-size:70%;">C3.5</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.5"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.5.1" style="font-size:70%;">Codee</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.6"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.6.1" style="font-size:70%;">O1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.7"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.7.1" style="font-size:70%;">L3.2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.8"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.8.1" style="font-size:70%;">C3.5</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.9"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.9.1" style="font-size:70%;">Codee</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.10"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.10.1" style="font-size:70%;">O1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.11"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.11.1" style="font-size:70%;">L3.2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.2.2.12"><span class="ltx_text ltx_font_bold" id="S4.T4.6.2.2.12.1" style="font-size:70%;">C3.5</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.3.3"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S4.T4.6.3.3.1"><span class="ltx_text ltx_font_bold" id="S4.T4.6.3.3.1.1" style="font-size:70%;">Incorrect</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.2"><span class="ltx_text" id="S4.T4.6.3.3.2.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.3"><span class="ltx_text" 
id="S4.T4.6.3.3.3.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.4"><span class="ltx_text" id="S4.T4.6.3.3.4.1" style="font-size:70%;">6</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.5"><span class="ltx_text" id="S4.T4.6.3.3.5.1" style="font-size:70%;">6</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.6"><span class="ltx_text" id="S4.T4.6.3.3.6.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.7"><span class="ltx_text" id="S4.T4.6.3.3.7.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.8"><span class="ltx_text" id="S4.T4.6.3.3.8.1" style="font-size:70%;">8</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.9"><span class="ltx_text" id="S4.T4.6.3.3.9.1" style="font-size:70%;">7</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.10"><span class="ltx_text" id="S4.T4.6.3.3.10.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.11"><span class="ltx_text" id="S4.T4.6.3.3.11.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.12"><span class="ltx_text" id="S4.T4.6.3.3.12.1" style="font-size:70%;">7</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.3.3.13"><span class="ltx_text" id="S4.T4.6.3.3.13.1" style="font-size:70%;">8</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T4.6.3.3.14"><span class="ltx_text" id="S4.T4.6.3.3.14.1" style="font-size:70%;">48</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.4.4"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S4.T4.6.4.4.1"><span 
class="ltx_text" id="S4.T4.6.4.4.1.1" style="font-size:70%;"> Compilation errors</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.2"><span class="ltx_text" id="S4.T4.6.4.4.2.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.3"><span class="ltx_text" id="S4.T4.6.4.4.3.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.4"><span class="ltx_text" id="S4.T4.6.4.4.4.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.5"><span class="ltx_text" id="S4.T4.6.4.4.5.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.6"><span class="ltx_text" id="S4.T4.6.4.4.6.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.7"><span class="ltx_text" id="S4.T4.6.4.4.7.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.8"><span class="ltx_text" id="S4.T4.6.4.4.8.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.9"><span class="ltx_text" id="S4.T4.6.4.4.9.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.10"><span class="ltx_text" id="S4.T4.6.4.4.10.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.11"><span class="ltx_text" id="S4.T4.6.4.4.11.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.12"><span class="ltx_text" id="S4.T4.6.4.4.12.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.4.4.13"><span class="ltx_text" 
id="S4.T4.6.4.4.13.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T4.6.4.4.14"><span class="ltx_text" id="S4.T4.6.4.4.14.1" style="font-size:70%;">3</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.5.5"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r" id="S4.T4.6.5.5.1"><span class="ltx_text" id="S4.T4.6.5.5.1.1" style="font-size:70%;"> No LLM generated code</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.2"><span class="ltx_text" id="S4.T4.6.5.5.2.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.3"><span class="ltx_text" id="S4.T4.6.5.5.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.4"><span class="ltx_text" id="S4.T4.6.5.5.4.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.5"><span class="ltx_text" id="S4.T4.6.5.5.5.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.6"><span class="ltx_text" id="S4.T4.6.5.5.6.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.7"><span class="ltx_text" id="S4.T4.6.5.5.7.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.8"><span class="ltx_text" id="S4.T4.6.5.5.8.1" style="font-size:70%;">6</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.9"><span class="ltx_text" id="S4.T4.6.5.5.9.1" style="font-size:70%;">5</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.10"><span class="ltx_text" id="S4.T4.6.5.5.10.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.11"><span 
class="ltx_text" id="S4.T4.6.5.5.11.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.12"><span class="ltx_text" id="S4.T4.6.5.5.12.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.5.5.13"><span class="ltx_text" id="S4.T4.6.5.5.13.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T4.6.5.5.14"><span class="ltx_text" id="S4.T4.6.5.5.14.1" style="font-size:70%;">19</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.6.6"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r" id="S4.T4.6.6.6.1"><span class="ltx_text" id="S4.T4.6.6.6.1.1" style="font-size:70%;"> Incorrect results - Output mismatch</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.2"><span class="ltx_text" id="S4.T4.6.6.6.2.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.3"><span class="ltx_text" id="S4.T4.6.6.6.3.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.4"><span class="ltx_text" id="S4.T4.6.6.6.4.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.5"><span class="ltx_text" id="S4.T4.6.6.6.5.1" style="font-size:70%;">3</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.6"><span class="ltx_text" id="S4.T4.6.6.6.6.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.7"><span class="ltx_text" id="S4.T4.6.6.6.7.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.8"><span class="ltx_text" id="S4.T4.6.6.6.8.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r 
ltx_border_t" id="S4.T4.6.6.6.9"><span class="ltx_text" id="S4.T4.6.6.6.9.1" style="font-size:70%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.10"><span class="ltx_text" id="S4.T4.6.6.6.10.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.11"><span class="ltx_text" id="S4.T4.6.6.6.11.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.12"><span class="ltx_text" id="S4.T4.6.6.6.12.1" style="font-size:70%;">5</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.6.6.13"><span class="ltx_text" id="S4.T4.6.6.6.13.1" style="font-size:70%;">6</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T4.6.6.6.14"><span class="ltx_text" id="S4.T4.6.6.6.14.1" style="font-size:70%;">24</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.7.7"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r" id="S4.T4.6.7.7.1"><span class="ltx_text" id="S4.T4.6.7.7.1.1" style="font-size:70%;"> LLM - Failed to follow instructions</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.2"><span class="ltx_text" id="S4.T4.6.7.7.2.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.3"><span class="ltx_text" id="S4.T4.6.7.7.3.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.4"><span class="ltx_text" id="S4.T4.6.7.7.4.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.5"><span class="ltx_text" id="S4.T4.6.7.7.5.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.6"><span class="ltx_text" id="S4.T4.6.7.7.6.1" style="font-size:70%;">0</span></td> <td class="ltx_td 
ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.7"><span class="ltx_text" id="S4.T4.6.7.7.7.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.8"><span class="ltx_text" id="S4.T4.6.7.7.8.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.9"><span class="ltx_text" id="S4.T4.6.7.7.9.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.10"><span class="ltx_text" id="S4.T4.6.7.7.10.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.11"><span class="ltx_text" id="S4.T4.6.7.7.11.1" style="font-size:70%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.12"><span class="ltx_text" id="S4.T4.6.7.7.12.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.7.7.13"><span class="ltx_text" id="S4.T4.6.7.7.13.1" style="font-size:70%;">0</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T4.6.7.7.14"><span class="ltx_text" id="S4.T4.6.7.7.14.1" style="font-size:70%;">2</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.8.8"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S4.T4.6.8.8.1"><span class="ltx_text ltx_font_bold" id="S4.T4.6.8.8.1.1" style="font-size:70%;">Correct</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.2"><span class="ltx_text" id="S4.T4.6.8.8.2.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.3"><span class="ltx_text" id="S4.T4.6.8.8.3.1" style="font-size:70%;">18</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.4"><span class="ltx_text" id="S4.T4.6.8.8.4.1" 
style="font-size:70%;">14</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.5"><span class="ltx_text" id="S4.T4.6.8.8.5.1" style="font-size:70%;">14</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.6"><span class="ltx_text" id="S4.T4.6.8.8.6.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.7"><span class="ltx_text" id="S4.T4.6.8.8.7.1" style="font-size:70%;">17</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.8"><span class="ltx_text" id="S4.T4.6.8.8.8.1" style="font-size:70%;">12</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.9"><span class="ltx_text" id="S4.T4.6.8.8.9.1" style="font-size:70%;">13</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.10"><span class="ltx_text" id="S4.T4.6.8.8.10.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.11"><span class="ltx_text" id="S4.T4.6.8.8.11.1" style="font-size:70%;">19</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.12"><span class="ltx_text" id="S4.T4.6.8.8.12.1" style="font-size:70%;">13</span></td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.6.8.8.13"><span class="ltx_text" id="S4.T4.6.8.8.13.1" style="font-size:70%;">12</span></td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S4.T4.6.8.8.14"><span class="ltx_text" id="S4.T4.6.8.8.14.1" style="font-size:70%;">192</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.9.9"> <td class="ltx_td ltx_align_left ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="S4.T4.6.9.9.1"><span class="ltx_text ltx_font_bold" id="S4.T4.6.9.9.1.1" style="font-size:70%;">Total</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" 
id="S4.T4.6.9.9.2"><span class="ltx_text" id="S4.T4.6.9.9.2.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.3"><span class="ltx_text" id="S4.T4.6.9.9.3.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.4"><span class="ltx_text" id="S4.T4.6.9.9.4.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.5"><span class="ltx_text" id="S4.T4.6.9.9.5.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.6"><span class="ltx_text" id="S4.T4.6.9.9.6.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.7"><span class="ltx_text" id="S4.T4.6.9.9.7.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.8"><span class="ltx_text" id="S4.T4.6.9.9.8.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.9"><span class="ltx_text" id="S4.T4.6.9.9.9.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.10"><span class="ltx_text" id="S4.T4.6.9.9.10.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.11"><span class="ltx_text" id="S4.T4.6.9.9.11.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.12"><span class="ltx_text" id="S4.T4.6.9.9.12.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.13"><span class="ltx_text" 
id="S4.T4.6.9.9.13.1" style="font-size:70%;">20</span></td> <td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t" id="S4.T4.6.9.9.14"><span class="ltx_text" id="S4.T4.6.9.9.14.1" style="font-size:70%;">240</span></td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="S4.SS6.p1"> <p class="ltx_p" id="S4.SS6.p1.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.T4" title="Table 4 ‣ 4.6. Experiment 5: Correctness ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">4</span></a> shows that Codee achieves a perfect correctness rate across all benchmarks, outperforming the LLMs, which do not always produce the desired results. Using compiler-based analysis, Codee accurately optimizes code across a wide range of scenarios. Despite their broader applicability, the LLMs often struggle to analyze data flow, cross-iteration dependencies, and function invocations, particularly in the more complex benchmarks. </p> </div> <div class="ltx_para" id="S4.SS6.p2"> <p class="ltx_p" id="S4.SS6.p2.1">In EX1, all LLMs fail to produce code that matches the expected output for the <span class="ltx_text ltx_font_italic" id="S4.SS6.p2.1.1">Srad</span> benchmark. As shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F6" title="Figure 6 ‣ 4.6. Experiment 5: Correctness ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">6</span></a>, the correct logic is to first compute all derivatives from the old <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.2">J</span> and then apply them in a separate loop. 
OpenAI o1 fails because it computes the directional derivatives <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.3">dN</span>, <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.4">dS</span>, <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.5">dW</span>, and <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.6">dE</span> from the image <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.7">J</span> in the same loop that updates <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.8">J</span>. Because <span class="ltx_text ltx_font_typewriter" id="S4.SS6.p2.1.9">J</span> is traversed in row-major order, positions with smaller indices have already been updated while the others have not, so old and new values are mixed. This violates a fundamental requirement of the <span class="ltx_text ltx_font_italic" id="S4.SS6.p2.1.10">Srad</span> algorithm: the directional derivatives for each pixel must be computed from values of the same iteration. 
Llama-3.2 and Claude-3.5 fail for similar reasons, collapsing the two nested loops into one.</p> </div> <figure class="ltx_figure" id="S4.F6"> <div class="ltx_listing ltx_lst_language_C ltx_lstlisting ltx_align_center ltx_framed ltx_framed_rectangle ltx_listing" id="S4.F6.2" style="background-color:#F2F2EB;"> <div class="ltx_listing_data"><a download="" href="data:text/plain;base64,Zm9yIChpbnQgaSA9IDA7IGkgPCByb3dzOyBpKyspCiAgICBmb3IgKGludCBqID0gMDsgaiA8IGNvbHM7IGorKykgewogICAgICAgIC8vIENhbGN1bGF0ZSBkaXJlY3Rpb25hbCBkZXJpdmF0aXZlcwogICAgICAgIGROW2tdID0gSltpTltpXSAqIGNvbHMgKyBqXSAtIEpjOwogICAgICAgIGRTW2tdID0gSltpU1tpXSAqIGNvbHMgKyBqXSAtIEpjOwogICAgICAgIGRXW2tdID0gSltpICogY29scyArIGpXW2pdXSAtIEpjOwogICAgICAgIGRFW2tdID0gSltpICogY29scyArIGpFW2pdXSAtIEpjOwogICAgfQpmb3IgKGludCBpID0gMDsgaSA8IHJvd3M7IGkrKykKICAgIGZvciAoaW50IGogPSAwOyBqIDwgY29sczsgaisrKSB7CiAgICAgICAgLy8gTG9hZCBkTiwgZFMsIGRXLCBhbmQgZEUKICAgICAgICBEID0gY04gKiBkTiArIGNTICogZFMgKyBjVyAqIGRXICsgY0UgKiBkRTsKICAgICAgICAvLyBVcGRhdGUgSgogICAgICAgIEpba10gPSBKW2tdICsgMC4yNSAqIGxhbWJkYSAqIEQ7CiAgICB9">⬇</a></div> <div class="ltx_listingline" id="lstnumberx13"> <span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx13.1" style="font-size:70%;color:#0000FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.2" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx13.3" style="font-size:70%;">(</span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx13.4" style="font-size:70%;color:#0000FF;">int</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.5" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx13.6" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.7" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx13.8" 
style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.9" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx13.10" style="font-size:70%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.11" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx13.12" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.13" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx13.14" style="font-size:70%;"><</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.15" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx13.16" style="font-size:70%;">rows</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx13.17" style="font-size:70%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx13.18" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx13.19" style="font-size:70%;">i</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx13.20" style="font-size:70%;">++)</span> </div> <div class="ltx_listingline" id="lstnumberx14"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx14.2" style="font-size:70%;color:#0000FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.3" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx14.4" style="font-size:70%;">(</span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx14.5" style="font-size:70%;color:#0000FF;">int</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.6" 
style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx14.7" style="font-size:70%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.8" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx14.9" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.10" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx14.11" style="font-size:70%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.12" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx14.13" style="font-size:70%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.14" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx14.15" style="font-size:70%;"><</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.16" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx14.17" style="font-size:70%;">cols</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx14.18" style="font-size:70%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.19" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx14.20" style="font-size:70%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx14.21" style="font-size:70%;">++)</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx14.22" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx14.23" style="font-size:70%;">{</span> </div> <div class="ltx_listingline" id="lstnumberx15"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx15.1" style="font-size:70%;"> </span><span 
class="ltx_text ltx_lst_comment ltx_font_typewriter" id="lstnumberx15.2" style="font-size:70%;color:#808080;">//<span class="ltx_text ltx_lst_space" id="lstnumberx15.2.1"> </span>Calculate<span class="ltx_text ltx_lst_space" id="lstnumberx15.2.2"> </span>directional<span class="ltx_text ltx_lst_space" id="lstnumberx15.2.3"> </span>derivatives</span> </div> <div class="ltx_listingline" id="lstnumberx16"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.2" style="font-size:70%;">dN</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.3" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.4" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.5" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.6" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.7" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.8" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.9" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.10" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.11" style="font-size:70%;">iN</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.12" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.13" style="font-size:70%;">i</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.14" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.15" style="font-size:70%;"> </span><span class="ltx_text 
ltx_font_typewriter" id="lstnumberx16.16" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.17" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.18" style="font-size:70%;">cols</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.19" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.20" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.21" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.22" style="font-size:70%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.23" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.24" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.25" style="font-size:70%;">-</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx16.26" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx16.27" style="font-size:70%;">Jc</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx16.28" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx17"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.2" style="font-size:70%;">dS</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.3" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.4" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.5" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.6" 
style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.7" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.8" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.9" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.10" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.11" style="font-size:70%;">iS</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.12" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.13" style="font-size:70%;">i</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.14" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.15" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.16" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.17" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.18" style="font-size:70%;">cols</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.19" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.20" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.21" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.22" style="font-size:70%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.23" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.24" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.25" 
style="font-size:70%;">-</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx17.26" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx17.27" style="font-size:70%;">Jc</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx17.28" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx18"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.2" style="font-size:70%;">dW</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.3" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.4" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.5" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.6" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.7" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.8" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.9" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.10" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.11" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.12" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.13" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.14" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.15" style="font-size:70%;">cols</span><span 
class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.16" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.17" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.18" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.19" style="font-size:70%;">jW</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.20" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.21" style="font-size:70%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.22" style="font-size:70%;">]]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.23" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.24" style="font-size:70%;">-</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx18.25" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx18.26" style="font-size:70%;">Jc</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx18.27" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx19"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx19.2" style="font-size:70%;">dE</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.3" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx19.4" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.5" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.6" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.7" 
style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.8" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx19.9" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.10" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx19.11" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.12" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.13" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.14" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx19.15" style="font-size:70%;">cols</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.16" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.17" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.18" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx19.19" style="font-size:70%;">jE</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.20" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx19.21" style="font-size:70%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.22" style="font-size:70%;">]]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.23" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.24" style="font-size:70%;">-</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx19.25" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" 
id="lstnumberx19.26" style="font-size:70%;">Jc</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx19.27" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx20"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx20.1" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx20.2" style="font-size:70%;">}</span> </div> <div class="ltx_listingline" id="lstnumberx21"> <span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx21.1" style="font-size:70%;color:#0000FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx21.2" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx21.3" style="font-size:70%;">(</span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx21.4" style="font-size:70%;color:#0000FF;">int</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx21.5" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx21.6" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx21.7" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx21.8" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx21.9" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx21.10" style="font-size:70%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx21.11" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx21.12" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx21.13" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx21.14" style="font-size:70%;"><</span><span class="ltx_text 
ltx_lst_space ltx_font_typewriter" id="lstnumberx21.15" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx21.16" style="font-size:70%;">rows</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx21.17" style="font-size:70%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx21.18" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx21.19" style="font-size:70%;">i</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx21.20" style="font-size:70%;">++)</span> </div> <div class="ltx_listingline" id="lstnumberx22"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx22.2" style="font-size:70%;color:#0000FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.3" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx22.4" style="font-size:70%;">(</span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx22.5" style="font-size:70%;color:#0000FF;">int</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.6" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx22.7" style="font-size:70%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.8" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx22.9" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.10" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx22.11" style="font-size:70%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.12" style="font-size:70%;"> </span><span class="ltx_text 
ltx_lst_identifier ltx_font_typewriter" id="lstnumberx22.13" style="font-size:70%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.14" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx22.15" style="font-size:70%;"><</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.16" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx22.17" style="font-size:70%;">cols</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx22.18" style="font-size:70%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.19" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx22.20" style="font-size:70%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx22.21" style="font-size:70%;">++)</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx22.22" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx22.23" style="font-size:70%;">{</span> </div> <div class="ltx_listingline" id="lstnumberx23"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx23.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_comment ltx_font_typewriter" id="lstnumberx23.2" style="font-size:70%;color:#808080;">//<span class="ltx_text ltx_lst_space" id="lstnumberx23.2.1"> </span>Load<span class="ltx_text ltx_lst_space" id="lstnumberx23.2.2"> </span>dN,<span class="ltx_text ltx_lst_space" id="lstnumberx23.2.3"> </span>dS,<span class="ltx_text ltx_lst_space" id="lstnumberx23.2.4"> </span>dW,<span class="ltx_text ltx_lst_space" id="lstnumberx23.2.5"> </span>and<span class="ltx_text ltx_lst_space" id="lstnumberx23.2.6"> </span>dE</span> </div> <div class="ltx_listingline" id="lstnumberx24"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.1" 
style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.2" style="font-size:70%;">D</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.3" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.4" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.5" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.6" style="font-size:70%;">cN</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.7" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.8" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.9" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.10" style="font-size:70%;">dN</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.11" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.12" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.13" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.14" style="font-size:70%;">cS</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.15" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.16" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.17" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.18" style="font-size:70%;">dS</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.19" style="font-size:70%;"> </span><span class="ltx_text 
ltx_font_typewriter" id="lstnumberx24.20" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.21" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.22" style="font-size:70%;">cW</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.23" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.24" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.25" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.26" style="font-size:70%;">dW</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.27" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.28" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.29" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.30" style="font-size:70%;">cE</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.31" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.32" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx24.33" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx24.34" style="font-size:70%;">dE</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx24.35" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx25"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx25.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_comment ltx_font_typewriter" id="lstnumberx25.2" style="font-size:70%;color:#808080;">//<span class="ltx_text ltx_lst_space" 
id="lstnumberx25.2.1"> </span>Update<span class="ltx_text ltx_lst_space" id="lstnumberx25.2.2"> </span>J</span> </div> <div class="ltx_listingline" id="lstnumberx26"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx26.2" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.3" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx26.4" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.5" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.6" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.7" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.8" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx26.9" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.10" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx26.11" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.12" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.13" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.14" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.15" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.16" style="font-size:70%;">0.25</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.17" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.18" 
style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.19" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx26.20" style="font-size:70%;">lambda</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.21" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.22" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx26.23" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx26.24" style="font-size:70%;">D</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx26.25" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx27"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx27.1" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx27.2" style="font-size:70%;">}</span> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F6.4.1.1" style="font-size:90%;">Figure 6</span>. 
</span><span class="ltx_text" id="S4.F6.5.2" style="font-size:90%;">Original code snippet of <span class="ltx_text ltx_font_italic" id="S4.F6.5.2.1">Srad</span></span></figcaption> </figure> <figure class="ltx_figure" id="S4.F7"> <div class="ltx_listing ltx_lst_language_C ltx_lstlisting ltx_align_center ltx_framed ltx_framed_rectangle ltx_listing" id="S4.F7.2" style="background-color:#F2F2EB;"> <div class="ltx_listing_data"><a download="" href="data:text/plain;base64,Zm9yIChpbnQgaSA9IDA7IGkgPCByb3dzOyBpKyspIHsKICAgIGZvciAoaW50IGogPSAwOyBqIDwgY29sczsgaisrKSB7CiAgICAgICAgLy8gQ2FsY3VsYXRlcyBkaXJlY3Rpb25hbCBkZXJpdmF0aXZlcyBkTiwgZFMsIGRXLCBkRSBiYXNlZCBvbiBKCiAgICAgICAgRCA9IGNOICogZE4gKyBjUyAqIGRTICsgY1cgKiBkVyArIGNFICogZEU7CiAgICAgICAgSltrXSA9IEpba10gKyAwLjI1ICogbGFtYmRhICogRDsKICAgIH0KfQ==">⬇</a></div> <div class="ltx_listingline" id="lstnumberx28"> <span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx28.1" style="font-size:70%;color:#0000FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.2" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx28.3" style="font-size:70%;">(</span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx28.4" style="font-size:70%;color:#0000FF;">int</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.5" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx28.6" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.7" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx28.8" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.9" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx28.10" style="font-size:70%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" 
id="lstnumberx28.11" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx28.12" style="font-size:70%;">i</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.13" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx28.14" style="font-size:70%;"><</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.15" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx28.16" style="font-size:70%;">rows</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx28.17" style="font-size:70%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.18" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx28.19" style="font-size:70%;">i</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx28.20" style="font-size:70%;">++)</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx28.21" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx28.22" style="font-size:70%;">{</span> </div> <div class="ltx_listingline" id="lstnumberx29"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx29.2" style="font-size:70%;color:#0000FF;">for</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.3" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx29.4" style="font-size:70%;">(</span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter" id="lstnumberx29.5" style="font-size:70%;color:#0000FF;">int</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.6" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier 
ltx_font_typewriter" id="lstnumberx29.7" style="font-size:70%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.8" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx29.9" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.10" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx29.11" style="font-size:70%;">0;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.12" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx29.13" style="font-size:70%;">j</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.14" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx29.15" style="font-size:70%;"><</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.16" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx29.17" style="font-size:70%;">cols</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx29.18" style="font-size:70%;">;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.19" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx29.20" style="font-size:70%;">j</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx29.21" style="font-size:70%;">++)</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx29.22" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx29.23" style="font-size:70%;">{</span> </div> <div class="ltx_listingline" id="lstnumberx30"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx30.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_comment ltx_font_typewriter" id="lstnumberx30.2" 
style="font-size:70%;color:#808080;">//<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.1"> </span>Calculates<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.2"> </span>directional<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.3"> </span>derivatives<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.4"> </span>dN,<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.5"> </span>dS,<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.6"> </span>dW,<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.7"> </span>dE<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.8"> </span>based<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.9"> </span>on<span class="ltx_text ltx_lst_space" id="lstnumberx30.2.10"> </span>J</span> </div> <div class="ltx_listingline" id="lstnumberx31"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.2" style="font-size:70%;">D</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.3" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.4" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.5" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.6" style="font-size:70%;">cN</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.7" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.8" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.9" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.10" style="font-size:70%;">dN</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.11" style="font-size:70%;"> 
</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.12" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.13" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.14" style="font-size:70%;">cS</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.15" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.16" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.17" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.18" style="font-size:70%;">dS</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.19" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.20" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.21" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.22" style="font-size:70%;">cW</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.23" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.24" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.25" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.26" style="font-size:70%;">dW</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.27" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.28" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.29" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" 
id="lstnumberx31.30" style="font-size:70%;">cE</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.31" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.32" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx31.33" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx31.34" style="font-size:70%;">dE</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx31.35" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx32"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.1" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx32.2" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.3" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx32.4" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.5" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.6" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.7" style="font-size:70%;">=</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.8" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx32.9" style="font-size:70%;">J</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.10" style="font-size:70%;">[</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx32.11" style="font-size:70%;">k</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.12" style="font-size:70%;">]</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.13" style="font-size:70%;"> </span><span 
class="ltx_text ltx_font_typewriter" id="lstnumberx32.14" style="font-size:70%;">+</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.15" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.16" style="font-size:70%;">0.25</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.17" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.18" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.19" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx32.20" style="font-size:70%;">lambda</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.21" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.22" style="font-size:70%;">*</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx32.23" style="font-size:70%;"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx32.24" style="font-size:70%;">D</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx32.25" style="font-size:70%;">;</span> </div> <div class="ltx_listingline" id="lstnumberx33"> <span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx33.1" style="font-size:70%;"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx33.2" style="font-size:70%;">}</span> </div> <div class="ltx_listingline" id="lstnumberx34"> <span class="ltx_text ltx_font_typewriter" id="lstnumberx34.1" style="font-size:70%;">}</span> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F7.4.1.1" style="font-size:90%;">Figure 7</span>. 
</span><span class="ltx_text" id="S4.F7.5.2" style="font-size:90%;">Optimized code of <span class="ltx_text ltx_font_italic" id="S4.F7.5.2.1">Srad</span> - OpenAI o1</span></figcaption> </figure> <div class="ltx_para" id="S4.SS6.p3"> <p class="ltx_p" id="S4.SS6.p3.1">LLMs also tend to output code that is completely irrelevant to the given context, even when given well-structured prompts. For the <span class="ltx_text ltx_font_italic" id="S4.SS6.p3.1.1">COULOMB</span> example in EX2, Llama-3.2 produces code that is entirely unrelated to the prompt, even though the prompt supplied a code snippet to be optimized, a classic case of <span class="ltx_text ltx_font_italic" id="S4.SS6.p3.1.2">Hallucination</span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib70" title="">xu2024hallucinationinevitableinnatelimitation, </a>)</cite>. In the case of <span class="ltx_text ltx_font_italic" id="S4.SS6.p3.1.3">MATMUL</span> in EX3, when instructed to apply parallel optimizations, OpenAI o1 instead changes the memory access patterns, which is not a parallel optimization strategy; this illustrates how LLMs can ignore instructions and produce code that is not meaningful. LLMs also lack a precise understanding of compiler features: in EX3, for <span class="ltx_text ltx_font_italic" id="S4.SS6.p3.1.4">Cholesky</span>, Claude-3.5 fails to generate code that yields correct results.</p> </div> </section> <section class="ltx_subsection" id="S4.SS7"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.7. 
</span>Experiment 6: HPC Commonsense</h3> <figure class="ltx_figure" id="S4.SS7.2"> <div class="ltx_block" id="S4.SS7.2.3"> <figure class="ltx_figure ltx_figure_panel ltx_align_center ltx_align_bottom" id="S4.F8" style="width:433.6pt;"><img alt="Refer to caption" class="ltx_graphics ltx_figure_panel ltx_img_landscape" height="406" id="S4.SS7.1.1.g1" src="x5.png" width="830"/> <br class="ltx_break ltx_break"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F8.2.1.1" style="font-size:90%;">Figure 8</span>. </span><span class="ltx_text" id="S4.F8.3.2" style="font-size:90%;">Average speedups (Y-axis) achieved by the tools by domain in EX1 and EX2.</span></figcaption> </figure> <figure class="ltx_figure ltx_figure_panel ltx_align_center ltx_align_bottom" id="S4.F9" style="width:433.6pt;"><img alt="Refer to caption" class="ltx_graphics ltx_figure_panel ltx_img_landscape" height="409" id="S4.SS7.2.2.g1" src="x6.png" width="830"/> <br class="ltx_break ltx_break"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F9.2.1.1" style="font-size:90%;">Figure 9</span>. </span><span class="ltx_text" id="S4.F9.3.2" style="font-size:90%;">Average speedups (Y-axis) achieved by the tools by domain in EX3.</span></figcaption> </figure> </div> </figure> <figure class="ltx_figure" id="S4.F10"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="698" id="S4.F10.g1" src="x7.png" width="664"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F10.2.1.1" style="font-size:90%;">Figure 10</span>. </span><span class="ltx_text" id="S4.F10.3.2" style="font-size:90%;">Distribution of optimizations applied by each tool</span></figcaption> </figure> <div class="ltx_para" id="S4.SS7.p1"> <p class="ltx_p" id="S4.SS7.p1.1">We assess HPC commonsense of tools from two perspectives. 
First, we assess the speedup achieved by each tool across all computation motifs in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2.T1" title="Table 1 ‣ Traditional Performance Analysis Tools ‣ 2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">1</span></a>. Second, we analyze the distribution of optimizations applied by each tool, as well as the rationale behind the optimizations.</p> </div> <div class="ltx_para" id="S4.SS7.p2"> <p class="ltx_p" id="S4.SS7.p2.1">From Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F8" title="Figure 8 ‣ 4.7. Experiment 6: HPC Commonsense ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">8</span></a> and Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F9" title="Figure 9 ‣ 4.7. Experiment 6: HPC Commonsense ‣ 4. Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">9</span></a>, we observe that different domains demonstrate varying speedups when employing specific tools. Codee achieves good results in Spectral Methods and Structured Grids using serial optimization strategies, while excelling in Dense Linear Algebra and N-body Methods in a parallel setting. We believe Codee’s familiarity with the computation logic of these domains makes it a better code optimizer in such use cases. For instance, from Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F10" title="Figure 10 ‣ 4.7. Experiment 6: HPC Commonsense ‣ 4. 
Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">10</span></a> it is evident that a large share of the optimizations suggested by Codee revolve around mathematical optimizations, including “Fused multiply-add,” “Change of precision,” and “Mathematical simplification.” Codee, on the other hand, fails to yield speedups in the Dynamic Programming domain, especially in the parallel setting. In comparison, OpenAI o1 achieves a speedup of ~7<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS7.p2.1.m1.1"><semantics id="S4.SS7.p2.1.m1.1a"><mo id="S4.SS7.p2.1.m1.1.1" xref="S4.SS7.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS7.p2.1.m1.1b"><times id="S4.SS7.p2.1.m1.1.1.cmml" xref="S4.SS7.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS7.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS7.p2.1.m1.1d">×</annotation></semantics></math> in the Dynamic Programming domain in EX3. For Dynamic Programming problems, many state transition equations cannot be parallelized in their original form due to interdependencies between successive iterations. However, we found that o1 can change the order in which the state transition table is populated, reducing dependencies so that each column of the table can be updated by an individual thread <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib65" title="">venkatachalam2014faster, </a>)</cite>. This finding highlights the potential of LLMs to understand user code and perform adaptive optimizations.</p> </div> <div class="ltx_para" id="S4.SS7.p3"> <p class="ltx_p" id="S4.SS7.p3.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S4.F10" title="Figure 10 ‣ 4.7. Experiment 6: HPC Commonsense ‣ 4. 
Evaluation ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">10</span></a> illustrates the distribution of optimizations applied by each tool. Codee’s optimizations are straightforward to summarize, as it categorizes inefficiencies using canonical terminologies such as “fused multiply-add,” “loop fission,” and “loop fusion,” following the structured taxonomy of performance issues outlined in <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib21" title="">codee2024opencatalog, </a>)</cite>. Among these, “fused multiply-add” and “OpenMP scoping” are the most frequently applied optimizations. In contrast, LLMs propose a broader spectrum of optimizations, characterized by more flexible transformations. These are often described using general terms such as “pre-computing constants” and “reducing function overhead,” which, while versatile, may lack the specificity required to precisely identify the transformations applied. A notable trend among LLMs is the frequent emphasis on memory optimizations, which constitute a significant portion of their recommendations, as reflected in the distribution chart. For scenarios involving nested loop structures, OpenAI o1, like Codee, commonly suggests “loop reordering” as an optimization strategy. However, this opportunity is not identified by other models, such as Llama-3.2 and Claude-3.5. When analyzing the code provided in the initial prompt, OpenAI o1 distinguishes itself by offering detailed logical reasoning to justify the chosen optimizations. In contrast, other LLMs tend to lack this level of explanation. Codee also generates a comprehensive report of the entire codebase, including line numbers from all dependent files, performance problems with the code, and potential optimizations. 
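As a concrete illustration of the “loop reordering” transformation mentioned above, consider the following sketch. This is an illustrative example of ours, not code taken from the benchmarks; both functions compute the same sum over a row-major matrix, but the interchanged version walks memory contiguously, which typically improves cache behavior.

```c
/* Illustrative loop-interchange sketch (names are ours).
 * Both variants sum a row-major n-by-n matrix stored in a flat array. */
double sum_strided(const double *a, int n) {
    double s = 0.0;
    for (int j = 0; j < n; j++)      /* column-outer: stride-n accesses */
        for (int i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}

double sum_contiguous(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)      /* row-outer: unit-stride accesses */
        for (int j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}
```

Because the two loop nests are independent in each iteration, interchanging them preserves the result while changing only the memory access order.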
However, it is important to note that Codee’s suggestions are based on predefined heuristics and cannot be customized to application-specific characteristics.</p> </div> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5. </span>Case Studies</h2> <div class="ltx_para" id="S5.p1"> <p class="ltx_p" id="S5.p1.1">We describe the evaluation of level 3 benchmarks in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S2.T1" title="Table 1 ‣ Traditional Performance Analysis Tools ‣ 2. Related Work ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">1</span></a> in this section. Experiments were performed on up to eight <span class="ltx_text ltx_font_italic" id="S5.p1.1.1">AMD EPYC 7543</span> processors. We first provide an overview of the agent system we designed to conduct these studies. Subsequently, we discuss our findings on each benchmark.</p> </div> <section class="ltx_subsection" id="S5.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.1. </span>Performance Optimization Agent</h3> <figure class="ltx_figure" id="S5.F11"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="502" id="S5.F11.g1" src="x8.png" width="829"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S5.F11.2.1.1" style="font-size:90%;">Figure 11</span>. </span><span class="ltx_text" id="S5.F11.3.2" style="font-size:90%;">Overview of the Performance Optimization Agent</span></figcaption> </figure> <div class="ltx_para" id="S5.SS1.p1"> <p class="ltx_p" id="S5.SS1.p1.1">Developing and improving HPC applications is an intricate and iterative process that targets efficiency and scalability. 
While traditional tools provide insights through runtime or static inspection, the interpretation and application of these insights rely heavily on human expertise. This process can be time-intensive and error-prone. Advancements in AI-driven code assistants <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib56" title="">rozière2024codellamaopenfoundation, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib32" title="">guo2024deepseekcoderlargelanguagemodel, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib40" title="">li2023starcodersourceyou, </a>)</cite> are bringing a paradigm shift to this domain. However, while LLMs excel at automatically generating small, general-purpose code, they often fail to analyze large HPC codebases due to a lack of understanding of program structure, invocation context, and performance characteristics.</p> </div> <div class="ltx_para" id="S5.SS1.p2"> <p class="ltx_p" id="S5.SS1.p2.1">To address these challenges, we developed a performance optimization agent leveraging OpenAI o1 as the “brain” of the system, as illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S5.F11" title="Figure 11 ‣ 5.1. Performance Optimization Agent ‣ 5. Case Studies ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">11</span></a>. This agent facilitates HPC code development by integrating profiling feedback into the optimization process. It serves as a baseline for evaluating the effectiveness of LLMs in enhancing productivity in real HPC application development. 
Similar to other agent systems, our performance optimization agent comprises core components: <span class="ltx_text ltx_font_italic" id="S5.SS1.p2.1.1">Memory</span>, <span class="ltx_text ltx_font_italic" id="S5.SS1.p2.1.2">Tools</span>, <span class="ltx_text ltx_font_italic" id="S5.SS1.p2.1.3">Planning</span>, and <span class="ltx_text ltx_font_italic" id="S5.SS1.p2.1.4">User Requests</span>. The agent’s workflow begins with time-based profiling using HPCToolkit. Profiling data is then converted into JSON structures—more suitable for LLM processing—using Hatchet <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib13" title="">10.1145/3295500.3356219, </a>)</cite>. In the profiling data, we include information on individual source lines, their invocation contexts, and associated metrics such as total CPU execution time and L1 data cache miss rate. In addition to profiling data, user-provided inputs are incorporated to enrich the execution context. These inputs include environment configurations (e.g., hardware and software resources), the number of threads, ranks, and iterations for the application’s runtime. With this comprehensive input, the LLM generates optimized code and recommends additional metrics for further performance analysis. Following this, the identified hotspot function is replaced with its optimized version. The agent then recompiles the application and measures performance improvements. This optimization loop continues iteratively until a predefined threshold (e.g., three iterations) is reached or the LLM suggests no further optimizations. Importantly, the metrics and program context obtained during each iteration are kept in the agent’s memory to allow the LLM to reference prior iterations for more informed optimization suggestions.</p> </div> </section> <section class="ltx_subsection" id="S5.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.2. 
</span>Case 1: NPB_CG</h3> <div class="ltx_para" id="S5.SS2.p1"> <p class="ltx_p" id="S5.SS2.p1.1">The <span class="ltx_text ltx_font_italic" id="S5.SS2.p1.1.1">Conjugate Gradient (CG)</span> benchmark, part of the <span class="ltx_text ltx_font_italic" id="S5.SS2.p1.1.2">NAS Parallel Benchmarks (NPB)</span> suite, evaluates the performance of parallel supercomputers by simulating the solution of sparse linear systems using the conjugate gradient method.</p> </div> <div class="ltx_para" id="S5.SS2.p2"> <p class="ltx_p" id="S5.SS2.p2.6">The <span class="ltx_text ltx_font_italic" id="S5.SS2.p2.6.1">conj_grad</span> function was identified as the hotspot using HPCToolkit. Codee provided one L2 optimization suggestion, applying multithreading parallelism to the specific forall loop in the hotspot. The original code ran for <math alttext="25.00" class="ltx_Math" display="inline" id="S5.SS2.p2.1.m1.1"><semantics id="S5.SS2.p2.1.m1.1a"><mn id="S5.SS2.p2.1.m1.1.1" xref="S5.SS2.p2.1.m1.1.1.cmml">25.00</mn><annotation-xml encoding="MathML-Content" id="S5.SS2.p2.1.m1.1b"><cn id="S5.SS2.p2.1.m1.1.1.cmml" type="float" xref="S5.SS2.p2.1.m1.1.1">25.00</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p2.1.m1.1c">25.00</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p2.1.m1.1d">25.00</annotation></semantics></math> seconds, of which the hotspot spent <math alttext="22.00" class="ltx_Math" display="inline" id="S5.SS2.p2.2.m2.1"><semantics id="S5.SS2.p2.2.m2.1a"><mn id="S5.SS2.p2.2.m2.1.1" xref="S5.SS2.p2.2.m2.1.1.cmml">22.00</mn><annotation-xml encoding="MathML-Content" id="S5.SS2.p2.2.m2.1b"><cn id="S5.SS2.p2.2.m2.1.1.cmml" type="float" xref="S5.SS2.p2.2.m2.1.1">22.00</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p2.2.m2.1c">22.00</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p2.2.m2.1d">22.00</annotation></semantics></math> seconds (<math alttext="88.1\%" class="ltx_Math" display="inline" 
id="S5.SS2.p2.3.m3.1"><semantics id="S5.SS2.p2.3.m3.1a"><mrow id="S5.SS2.p2.3.m3.1.1" xref="S5.SS2.p2.3.m3.1.1.cmml"><mn id="S5.SS2.p2.3.m3.1.1.2" xref="S5.SS2.p2.3.m3.1.1.2.cmml">88.1</mn><mo id="S5.SS2.p2.3.m3.1.1.1" xref="S5.SS2.p2.3.m3.1.1.1.cmml">%</mo></mrow><annotation-xml encoding="MathML-Content" id="S5.SS2.p2.3.m3.1b"><apply id="S5.SS2.p2.3.m3.1.1.cmml" xref="S5.SS2.p2.3.m3.1.1"><csymbol cd="latexml" id="S5.SS2.p2.3.m3.1.1.1.cmml" xref="S5.SS2.p2.3.m3.1.1.1">percent</csymbol><cn id="S5.SS2.p2.3.m3.1.1.2.cmml" type="float" xref="S5.SS2.p2.3.m3.1.1.2">88.1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p2.3.m3.1c">88.1\%</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p2.3.m3.1d">88.1 %</annotation></semantics></math>). The code optimized by Codee running with 16 threads took <math alttext="4.58" class="ltx_Math" display="inline" id="S5.SS2.p2.4.m4.1"><semantics id="S5.SS2.p2.4.m4.1a"><mn id="S5.SS2.p2.4.m4.1.1" xref="S5.SS2.p2.4.m4.1.1.cmml">4.58</mn><annotation-xml encoding="MathML-Content" id="S5.SS2.p2.4.m4.1b"><cn id="S5.SS2.p2.4.m4.1.1.cmml" type="float" xref="S5.SS2.p2.4.m4.1.1">4.58</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p2.4.m4.1c">4.58</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p2.4.m4.1d">4.58</annotation></semantics></math> seconds (<math alttext="5.46\times" class="ltx_math_unparsed" display="inline" id="S5.SS2.p2.5.m5.1"><semantics id="S5.SS2.p2.5.m5.1a"><mrow id="S5.SS2.p2.5.m5.1b"><mn id="S5.SS2.p2.5.m5.1.1">5.46</mn><mo id="S5.SS2.p2.5.m5.1.2" lspace="0.222em">×</mo></mrow><annotation encoding="application/x-tex" id="S5.SS2.p2.5.m5.1c">5.46\times</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p2.5.m5.1d">5.46 ×</annotation></semantics></math>). 
In comparison, applying the performance optimization agent with multiple iterations using 16 threads yielded a speedup of up to <math alttext="8.22\times" class="ltx_math_unparsed" display="inline" id="S5.SS2.p2.6.m6.1"><semantics id="S5.SS2.p2.6.m6.1a"><mrow id="S5.SS2.p2.6.m6.1b"><mn id="S5.SS2.p2.6.m6.1.1">8.22</mn><mo id="S5.SS2.p2.6.m6.1.2" lspace="0.222em">×</mo></mrow><annotation encoding="application/x-tex" id="S5.SS2.p2.6.m6.1c">8.22\times</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p2.6.m6.1d">8.22 ×</annotation></semantics></math> with no discrepancies in output correctness.</p> </div> <section class="ltx_paragraph" id="S5.SS2.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Version 1</h5> <div class="ltx_para" id="S5.SS2.SSS0.Px1.p1"> <p class="ltx_p" id="S5.SS2.SSS0.Px1.p1.2">The agent optimized the code using <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px1.p1.2.1">#pragma omp parallel</span> to parallelize loops, <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px1.p1.2.2">#pragma unroll</span> to unroll loops, and <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px1.p1.2.3">#pragma omp simd</span> to enable vectorization. Specifically, to preserve the correctness of the CG method, it used <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px1.p1.2.4">#pragma omp parallel for reduction(+: rho)</span> or <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px1.p1.2.5">#pragma omp parallel for reduction(+: sum)</span> on the corresponding loops to avoid data races. 
The optimized code’s runtime decreased to <math alttext="3.04" class="ltx_Math" display="inline" id="S5.SS2.SSS0.Px1.p1.1.m1.1"><semantics id="S5.SS2.SSS0.Px1.p1.1.m1.1a"><mn id="S5.SS2.SSS0.Px1.p1.1.m1.1.1" xref="S5.SS2.SSS0.Px1.p1.1.m1.1.1.cmml">3.04</mn><annotation-xml encoding="MathML-Content" id="S5.SS2.SSS0.Px1.p1.1.m1.1b"><cn id="S5.SS2.SSS0.Px1.p1.1.m1.1.1.cmml" type="float" xref="S5.SS2.SSS0.Px1.p1.1.m1.1.1">3.04</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.SSS0.Px1.p1.1.m1.1c">3.04</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.SSS0.Px1.p1.1.m1.1d">3.04</annotation></semantics></math> seconds (<math alttext="8.22\times" class="ltx_math_unparsed" display="inline" id="S5.SS2.SSS0.Px1.p1.2.m2.1"><semantics id="S5.SS2.SSS0.Px1.p1.2.m2.1a"><mrow id="S5.SS2.SSS0.Px1.p1.2.m2.1b"><mn id="S5.SS2.SSS0.Px1.p1.2.m2.1.1">8.22</mn><mo id="S5.SS2.SSS0.Px1.p1.2.m2.1.2" lspace="0.222em">×</mo></mrow><annotation encoding="application/x-tex" id="S5.SS2.SSS0.Px1.p1.2.m2.1c">8.22\times</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.SSS0.Px1.p1.2.m2.1d">8.22 ×</annotation></semantics></math>).</p> </div> </section> <section class="ltx_paragraph" id="S5.SS2.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Version 2</h5> <div class="ltx_para" id="S5.SS2.SSS0.Px2.p1"> <p class="ltx_p" id="S5.SS2.SSS0.Px2.p1.1">In this version, the agent collected two additional performance metrics, L1 cache misses and the number of floating-point instructions, hinting that more effective data access and better utilization of the CPU vector units could lead to potential speedups. The agent then applied compiler directives (e.g., <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px2.p1.1.1">__restrict</span>), OpenMP reductions, SIMD parallelism, and alignment hints to the specific loop. 
Furthermore, it added OpenMP parallelization and prefetch instructions to the <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px2.p1.1.2">conj_grad</span> function, which was not parallelized in the previous version. However, following the optimizations, the total runtime increased to <math alttext="5.69" class="ltx_Math" display="inline" id="S5.SS2.SSS0.Px2.p1.1.m1.1"><semantics id="S5.SS2.SSS0.Px2.p1.1.m1.1a"><mn id="S5.SS2.SSS0.Px2.p1.1.m1.1.1" xref="S5.SS2.SSS0.Px2.p1.1.m1.1.1.cmml">5.69</mn><annotation-xml encoding="MathML-Content" id="S5.SS2.SSS0.Px2.p1.1.m1.1b"><cn id="S5.SS2.SSS0.Px2.p1.1.m1.1.1.cmml" type="float" xref="S5.SS2.SSS0.Px2.p1.1.m1.1.1">5.69</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.SSS0.Px2.p1.1.m1.1c">5.69</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.SSS0.Px2.p1.1.m1.1d">5.69</annotation></semantics></math> seconds compared with the previous version.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS2.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">Version 3</h5> <div class="ltx_para" id="S5.SS2.SSS0.Px3.p1"> <p class="ltx_p" id="S5.SS2.SSS0.Px3.p1.1">Based on profiling results, L1 cache misses and floating-point instruction counts remained nearly unchanged, and the thread loads across the three indicators were highly imbalanced. To this end, the agent applied advanced optimizations for <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px3.p1.1.1">conj_grad</span>, such as dynamic scheduling (<span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px3.p1.1.2">#pragma omp parallel schedule</span>) for threads and aligning the addresses of arrays with <span class="ltx_text ltx_font_typewriter" id="S5.SS2.SSS0.Px3.p1.1.3">alignas(64)</span>. 
The final runtime decreased to <math alttext="5.52" class="ltx_Math" display="inline" id="S5.SS2.SSS0.Px3.p1.1.m1.1"><semantics id="S5.SS2.SSS0.Px3.p1.1.m1.1a"><mn id="S5.SS2.SSS0.Px3.p1.1.m1.1.1" xref="S5.SS2.SSS0.Px3.p1.1.m1.1.1.cmml">5.52</mn><annotation-xml encoding="MathML-Content" id="S5.SS2.SSS0.Px3.p1.1.m1.1b"><cn id="S5.SS2.SSS0.Px3.p1.1.m1.1.1.cmml" type="float" xref="S5.SS2.SSS0.Px3.p1.1.m1.1.1">5.52</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.SSS0.Px3.p1.1.m1.1c">5.52</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.SSS0.Px3.p1.1.m1.1d">5.52</annotation></semantics></math> seconds, an improvement over version 2, but L1 cache misses remained at the same level as version 2.</p> </div> </section> </section> <section class="ltx_subsection" id="S5.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.3. </span>Case 2: XSBench</h3> <div class="ltx_para" id="S5.SS3.p1"> <p class="ltx_p" id="S5.SS3.p1.1"><span class="ltx_text ltx_font_italic" id="S5.SS3.p1.1.1">XSBench</span> is a compact application that simulates a crucial computation kernel of the Monte Carlo neutron transport algorithm. The hotspot is located in the <span class="ltx_text ltx_font_typewriter" id="S5.SS3.p1.1.2">calculate_macro_xs</span> function, identified using HPCToolkit. The original code was executed in 39.5 seconds, and the <span class="ltx_text ltx_font_typewriter" id="S5.SS3.p1.1.3">calculate_macro_xs</span> function took 28.3 seconds (71.5%). 
Codee did not identify any optimization opportunities for this function.</p> </div> <div class="ltx_para" id="S5.SS3.p2"> <p class="ltx_p" id="S5.SS3.p2.1">Applying the performance optimization agent with three iterations yielded a <math alttext="1.33\times" class="ltx_math_unparsed" display="inline" id="S5.SS3.p2.1.m1.1"><semantics id="S5.SS3.p2.1.m1.1a"><mrow id="S5.SS3.p2.1.m1.1b"><mn id="S5.SS3.p2.1.m1.1.1">1.33</mn><mo id="S5.SS3.p2.1.m1.1.2" lspace="0.222em">×</mo></mrow><annotation encoding="application/x-tex" id="S5.SS3.p2.1.m1.1c">1.33\times</annotation><annotation encoding="application/x-llamapun" id="S5.SS3.p2.1.m1.1d">1.33 ×</annotation></semantics></math> speedup. Below we describe the metrics profiled and the optimizations applied by the agent.</p> </div> <section class="ltx_paragraph" id="S5.SS3.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Version 1</h5> <div class="ltx_para" id="S5.SS3.SSS0.Px1.p1"> <p class="ltx_p" id="S5.SS3.SSS0.Px1.p1.1">In the first iteration, the agent applied SIMD intrinsics from <span class="ltx_text ltx_font_typewriter" id="S5.SS3.SSS0.Px1.p1.1.1"><immintrin.h></span> to optimize the hotspot function, suggesting they would help accelerate larger datasets. Then, it employed compiler hints such as <span class="ltx_text ltx_font_typewriter" id="S5.SS3.SSS0.Px1.p1.1.2">__attribute__((hot,flatten,always_inline))</span> to reduce the function invocation overhead because the hotspot function was called at the innermost level of a nested loop. In addition, the agent annotated certain conditions with <span class="ltx_text ltx_font_typewriter" id="S5.SS3.SSS0.Px1.p1.1.3">__builtin_expect</span>, aiming to improve branch prediction. 
Yet, the overall runtime slightly increased to 40.1 seconds.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS3.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Version 2</h5> <div class="ltx_para" id="S5.SS3.SSS0.Px2.p1"> <p class="ltx_p" id="S5.SS3.SSS0.Px2.p1.2">By investigating L1 data cache load misses as a new profiling metric, the agent pointed out that memory access optimization could lead to potential speedups. The agent reduced the L1 data cache load misses by adopting prefetch instructions (e.g., <span class="ltx_text ltx_font_typewriter" id="S5.SS3.SSS0.Px2.p1.2.1">__builtin_prefetch</span>) for frequently accessed pointers and introducing local variables for frequently accessed positions from the <span class="ltx_text ltx_font_typewriter" id="S5.SS3.SSS0.Px2.p1.2.2">high</span> and <span class="ltx_text ltx_font_typewriter" id="S5.SS3.SSS0.Px2.p1.2.3">low</span> arrays. Moreover, the agent duplicated the calculation of interpolation factors, trading recomputation for fewer memory accesses. 
After these optimizations, the overall runtime decreased to <math alttext="34.6" class="ltx_Math" display="inline" id="S5.SS3.SSS0.Px2.p1.1.m1.1"><semantics id="S5.SS3.SSS0.Px2.p1.1.m1.1a"><mn id="S5.SS3.SSS0.Px2.p1.1.m1.1.1" xref="S5.SS3.SSS0.Px2.p1.1.m1.1.1.cmml">34.6</mn><annotation-xml encoding="MathML-Content" id="S5.SS3.SSS0.Px2.p1.1.m1.1b"><cn id="S5.SS3.SSS0.Px2.p1.1.m1.1.1.cmml" type="float" xref="S5.SS3.SSS0.Px2.p1.1.m1.1.1">34.6</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS3.SSS0.Px2.p1.1.m1.1c">34.6</annotation><annotation encoding="application/x-llamapun" id="S5.SS3.SSS0.Px2.p1.1.m1.1d">34.6</annotation></semantics></math> seconds, and the hotspot runtime decreased to <math alttext="20.2" class="ltx_Math" display="inline" id="S5.SS3.SSS0.Px2.p1.2.m2.1"><semantics id="S5.SS3.SSS0.Px2.p1.2.m2.1a"><mn id="S5.SS3.SSS0.Px2.p1.2.m2.1.1" xref="S5.SS3.SSS0.Px2.p1.2.m2.1.1.cmml">20.2</mn><annotation-xml encoding="MathML-Content" id="S5.SS3.SSS0.Px2.p1.2.m2.1b"><cn id="S5.SS3.SSS0.Px2.p1.2.m2.1.1.cmml" type="float" xref="S5.SS3.SSS0.Px2.p1.2.m2.1.1">20.2</cn></annotation-xml><annotation encoding="application/x-tex" id="S5.SS3.SSS0.Px2.p1.2.m2.1c">20.2</annotation><annotation encoding="application/x-llamapun" id="S5.SS3.SSS0.Px2.p1.2.m2.1d">20.2</annotation></semantics></math> seconds.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS3.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">Version 3</h5> <div class="ltx_para" id="S5.SS3.SSS0.Px3.p1"> <p class="ltx_p" id="S5.SS3.SSS0.Px3.p1.1">Lastly, the agent focused on leveraging SIMD intrinsics to further reduce L1 data cache load misses and improve performance, implemented by adopting <span class="ltx_text ltx_font_typewriter" id="S5.SS3.SSS0.Px3.p1.1.1">__m256d</span>-based intrinsics for vectorization. It also unrolled the loops around the hotspot function. 
Applying the above optimizations achieved the best result, with a runtime of 29.8 seconds.</p> </div> </section> </section> <section class="ltx_subsection" id="S5.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.4. </span>Case 3: LBM D2Q37</h3> <div class="ltx_para" id="S5.SS4.p1"> <p class="ltx_p" id="S5.SS4.p1.1"><span class="ltx_text ltx_font_italic" id="S5.SS4.p1.1.1">LBM D2Q37</span> is a Computational Fluid Dynamics (CFD) code that simulates 2D fluids using the “Lattice Boltzmann Method” <span class="ltx_text ltx_font_italic" id="S5.SS4.p1.1.2">(LBM)</span> with 37 velocity components. The computation is performed in double precision, featuring two key kernels: the memory-bound <span class="ltx_text ltx_font_typewriter" id="S5.SS4.p1.1.3">propagate</span> kernel and the compute-bound <span class="ltx_text ltx_font_typewriter" id="S5.SS4.p1.1.4">collide</span> kernel. We applied our performance agent to optimize LBM in single and multi-process environments using the “Test” dataset.</p> </div> <section class="ltx_subsubsection" id="S5.SS4.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">5.4.1. </span>Single Process</h4> <div class="ltx_para" id="S5.SS4.SSS1.p1"> <p class="ltx_p" id="S5.SS4.SSS1.p1.1">The hotspot was identified in the <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.p1.1.1">collide</span> kernel based on profiling data obtained using HPCToolkit. Codee suggested only one optimization opportunity for the hotspot, indicating limited potential for speedup. It suggested optimizations such as a “change of memory access pattern” from column-major to row-major order, and either “loop fission”, “loop tiling”, or “loop fusion” to avoid non-consecutive/indirect array access. 
The original code was executed in 98.4 seconds, while the code with loop tiling ran slightly longer, taking 106 seconds, achieving a speedup below 1.</p> </div> <div class="ltx_para" id="S5.SS4.SSS1.p2"> <p class="ltx_p" id="S5.SS4.SSS1.p2.1">Applying the performance optimization agent with multiple iterations also yielded no improvements in performance, with speedups of approximately 1 and no discrepancies in output correctness.</p> </div> <section class="ltx_paragraph" id="S5.SS4.SSS1.Px1"> <h5 class="ltx_title ltx_title_paragraph">Version 1</h5> <div class="ltx_para" id="S5.SS4.SSS1.Px1.p1"> <p class="ltx_p" id="S5.SS4.SSS1.Px1.p1.1">The first set of optimizations applied by the agent used OpenMP <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px1.p1.1.1">#pragma omp simd</span> directives to enable vectorization in the innermost loops over the hotspot. These directives were added to loops that computed values such as <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px1.p1.1.2">rho</span>, <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px1.p1.1.3">u</span>, <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px1.p1.1.4">v</span>, <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px1.p1.1.5">Hermite_projections</span>, <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px1.p1.1.6">Collisional_updates</span>, and the final projections written to <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px1.p1.1.7">nxt</span>. The goal was to leverage data parallelism in these loops. However, the total runtime only slightly decreased to 98.3 seconds.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS4.SSS1.Px2"> <h5 class="ltx_title ltx_title_paragraph">Version 2</h5> <div class="ltx_para" id="S5.SS4.SSS1.Px2.p1"> <p class="ltx_p" id="S5.SS4.SSS1.Px2.p1.1">The agent requested new profiling metrics to inspect L1 cache inefficiencies, suggesting memory access optimizations could yield potential speedups. 
Based on the new metrics, the agent refined the placement of <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px2.p1.1.1">#pragma omp simd</span>, restricting it to specific computational loops. The runtime marginally increased to 98.6 seconds, with profiling metrics showing an increase in L1 data cache misses and stalled cycles, indicating that the earlier assumption, that the placement of SIMD instructions was causing the inefficient memory access patterns, did not hold.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS4.SSS1.Px3"> <h5 class="ltx_title ltx_title_paragraph">Version 3</h5> <div class="ltx_para" id="S5.SS4.SSS1.Px3.p1"> <p class="ltx_p" id="S5.SS4.SSS1.Px3.p1.1">Finally, to address the memory bottlenecks, the agent applied more aggressive optimizations, including compiler-specific directives such as <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px3.p1.1.1">#pragma GCC ivdep</span> to ignore assumed dependencies and loop unrolling to improve throughput. Additionally, <span class="ltx_text ltx_font_typewriter" id="S5.SS4.SSS1.Px3.p1.1.2">__builtin_prefetch</span> was used to prefetch data into the cache, aiming to reduce L1 cache miss latency. While these advanced optimizations were better aligned with profiling insights, they only achieved a runtime of 98.5 seconds, again yielding a speedup close to 1.</p> </div> </section> </section> <section class="ltx_subsubsection" id="S5.SS4.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">5.4.2. </span>Multiple Processes</h4> <div class="ltx_para" id="S5.SS4.SSS2.p1"> <p class="ltx_p" id="S5.SS4.SSS2.p1.1">The benchmark was profiled using eight MPI ranks. Per-rank performance metrics were obtained by the agent. Observing that all processes spent a long time in the same hotspot loop, the agent increased parallelism by using four threads per rank instead of the two in the original code. 
However, the total runtime rose from 9.7 seconds to 10.8 seconds. The hotspot loop still consumed the majority of the total compute time, though it averaged 85.7% of total runtime instead of 91.7%, indicating that some computational overhead outside of the hotspot had grown. The profiling data for each rank showed a spike in front-end stalls, often over 80%, which implied the CPU pipeline was waiting for instruction fetch or decode. This can occur when additional threads compete for the same instruction stream. Another possible factor is synchronization overhead and increased thread scheduling complexity.</p> </div> </section> </section> <section class="ltx_subsection" id="S5.SS5"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.5. </span>Case 4: Minisweep</h3> <div class="ltx_para" id="S5.SS5.p1"> <p class="ltx_p" id="S5.SS5.p1.1">Minisweep is a radiation transport benchmark that reproduces the computational pattern of the sweep kernel of the Denovo Sn radiation transport code <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#bib.bib62" title="">Evans01082010, </a>)</cite>. The sweep kernel, responsible for most of Denovo’s computational cost, models radiation transport for nuclear reactors. We applied the performance agent to optimize Minisweep in a single-process environment using the “Test” dataset.</p> </div> <div class="ltx_para" id="S5.SS5.p2"> <p class="ltx_p" id="S5.SS5.p2.1">The hotspot was identified in the <span class="ltx_text ltx_font_typewriter" id="S5.SS5.p2.1.1">sweeper_kba_c_kernels.h</span> file based on profiling data obtained using HPCToolkit. Codee failed to suggest meaningful optimizations for the hotspot itself. 
Applying the performance optimization agent with multiple iterations also yielded no improvements, resulting in a runtime similar to the original code’s 3.79 seconds.</p> </div> <section class="ltx_paragraph" id="S5.SS5.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Version 1</h5> <div class="ltx_para" id="S5.SS5.SSS0.Px1.p1"> <p class="ltx_p" id="S5.SS5.SSS0.Px1.p1.1">The first set of optimizations applied <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px1.p1.1.1">__attribute__((always_inline))</span> to enable full inlining of the <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px1.p1.1.2">Sweeper_sweep_cell</span> function and inserted OpenMP <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px1.p1.1.3">#pragma omp simd</span> directives to enable vectorization in the innermost loops over the hotspot. Additionally, <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px1.p1.1.4">#pragma unroll</span> and <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px1.p1.1.5">#pragma ivdep</span> were used to improve throughput and allow for more aggressive compiler optimizations. These changes aimed to leverage data parallelism. Despite this, the runtime only slightly decreased to 3.77 seconds.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS5.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Version 2</h5> <div class="ltx_para" id="S5.SS5.SSS0.Px2.p1"> <p class="ltx_p" id="S5.SS5.SSS0.Px2.p1.1">In the second iteration, new profiling metrics, including measurements of frontend and backend stalled cycles, were collected. The profiling data showed that the hotspot loop accounted for 55.5% of the computation time, with 43.8% stalled frontend cycles and 39.1% stalled backend cycles. 
Based on the new metrics, the optimized code refined the placement of <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px2.p1.1.1">#pragma GCC ivdep</span> to ignore assumed dependencies, <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px2.p1.1.2">#pragma unroll</span> to improve throughput, and <span class="ltx_text ltx_font_typewriter" id="S5.SS5.SSS0.Px2.p1.1.3">#pragma omp simd</span> to enhance vectorization. These changes aimed to reduce frontend stalls by improving instruction flow and backend stalls by enhancing parallel execution efficiency. However, the new version resulted in a slightly increased runtime of 3.79 seconds.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS5.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">Version 3</h5> <div class="ltx_para" id="S5.SS5.SSS0.Px3.p1"> <p class="ltx_p" id="S5.SS5.SSS0.Px3.p1.1">In the third iteration, the agent applied the same optimizations as the previous version but refined the application of the optimization directives. The compiler-specific directives were inserted around the hotspot. These changes led to a slightly increased runtime of 3.88 seconds. Profiling metrics showed a significant decrease in backend stalls, to 30.9%, indicating that while some improvements were made, the root causes of the frontend instruction dependencies persisted.</p> </div> </section> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6. 
</span>Discussion</h2> <figure class="ltx_table" id="S6.T5"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="S6.T5.2"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S6.T5.2.1.1"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S6.T5.2.1.1.1" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.1.1.1.1" style="font-size:80%;">Tool</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.1.1.2" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.1.1.2.1" style="font-size:80%;">Speedup</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.1.1.3" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.1.1.3.1" style="font-size:80%;">Correctness</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.1.1.4" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.1.1.4.1" style="font-size:80%;">Commonsense</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.1.1.5" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.1.1.5.1" style="font-size:80%;">Latency</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.1.1.6" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.1.1.6.1" style="font-size:80%;">Applicability</span></td> </tr> <tr class="ltx_tr" id="S6.T5.2.2.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S6.T5.2.2.2.1" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.2.2.1.1" style="font-size:80%;">Codee</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" 
id="S6.T5.2.2.2.2" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.2.2.2.1" style="font-size:80%;color:#008000;">★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.2.2.3" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.2.2.3.1" style="font-size:80%;color:#008000;">★★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.2.2.4" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.2.2.4.1" style="font-size:80%;color:#008000;">★★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.2.2.5" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.2.2.5.1" style="font-size:80%;color:#008000;">★★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.2.2.6" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.2.2.6.1" style="font-size:80%;color:#008000;">★</span></td> </tr> <tr class="ltx_tr" id="S6.T5.2.3.3"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S6.T5.2.3.3.1" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.3.3.1.1" style="font-size:80%;">OpenAI o1</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.3.3.2" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.3.3.2.1" style="font-size:80%;color:#008000;">★★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.3.3.3" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.3.3.3.1" style="font-size:80%;color:#008000;">★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.3.3.4" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.3.3.4.1" 
style="font-size:80%;color:#008000;">★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.3.3.5" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.3.3.5.1" style="font-size:80%;color:#008000;">★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.3.3.6" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.3.3.6.1" style="font-size:80%;color:#008000;">★★</span></td> </tr> <tr class="ltx_tr" id="S6.T5.2.4.4"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_l ltx_border_r ltx_border_t" id="S6.T5.2.4.4.1" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.4.4.1.1" style="font-size:80%;">Claude-3.5</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.4.4.2" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.4.4.2.1" style="font-size:80%;color:#008000;">★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.4.4.3" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.4.4.3.1" style="font-size:80%;color:#008000;">★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.4.4.4" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.4.4.4.1" style="font-size:80%;color:#008000;">★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.4.4.5" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.4.4.5.1" style="font-size:80%;color:#008000;">★★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S6.T5.2.4.4.6" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.4.4.6.1" style="font-size:80%;">NA</span></td> </tr> <tr class="ltx_tr" id="S6.T5.2.5.5"> <td 
class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="S6.T5.2.5.5.1" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.5.5.1.1" style="font-size:80%;">Llama-3.2</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S6.T5.2.5.5.2" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.5.5.2.1" style="font-size:80%;color:#008000;">★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S6.T5.2.5.5.3" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.5.5.3.1" style="font-size:80%;color:#008000;">★★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S6.T5.2.5.5.4" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.5.5.4.1" style="font-size:80%;color:#008000;">★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S6.T5.2.5.5.5" style="padding:0.8pt 1.8pt;"><span class="ltx_text" id="S6.T5.2.5.5.5.1" style="font-size:80%;color:#008000;">★</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="S6.T5.2.5.5.6" style="padding:0.8pt 1.8pt;"><span class="ltx_text ltx_font_bold" id="S6.T5.2.5.5.6.1" style="font-size:80%;">NA</span></td> </tr> </tbody> </table> <figcaption class="ltx_caption ltx_centering" style="font-size:80%;"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S6.T5.5.1.1" style="font-size:113%;">Table 5</span>. 
</span><span class="ltx_text" id="S6.T5.6.2" style="font-size:113%;">Overall tool performance ranked by the total star count</span></figcaption> </figure> <div class="ltx_para" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">We assess the overall performance of the tools employed in the studies by defining and evaluating five key metrics:</p> </div> <div class="ltx_para" id="S6.p2"> <ul class="ltx_itemize" id="S6.I1"> <li class="ltx_item" id="S6.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S6.I1.i1.p1"> <p class="ltx_p" id="S6.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S6.I1.i1.p1.1.1">Speedup:</span> Measures the average speed improvement achieved by the tools across various experiments. Higher rankings indicate greater speedups.</p> </div> </li> <li class="ltx_item" id="S6.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S6.I1.i2.p1"> <p class="ltx_p" id="S6.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S6.I1.i2.p1.1.1">Correctness:</span> Evaluates the number of benchmarks for which the tools produced accurate results, with higher rankings reflecting greater reliability.</p> </div> </li> <li class="ltx_item" id="S6.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S6.I1.i3.p1"> <p class="ltx_p" id="S6.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S6.I1.i3.p1.1.1">Commonsense:</span> Assesses the tools’ effectiveness across multiple domains. A higher ranking indicates that a tool leads in more domains in terms of average speedup.</p> </div> </li> <li class="ltx_item" id="S6.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S6.I1.i4.p1"> <p class="ltx_p" id="S6.I1.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S6.I1.i4.p1.1.1">Latency:</span> Measures the time required by each tool to apply optimizations. 
Higher rankings indicate a lower latency.</p> </div> </li> <li class="ltx_item" id="S6.I1.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S6.I1.i5.p1"> <p class="ltx_p" id="S6.I1.i5.p1.1"><span class="ltx_text ltx_font_bold" id="S6.I1.i5.p1.1.1">Applicability:</span> Examines the tools’ ability to optimize the larger benchmarks featured in the case studies.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S6.p3"> <p class="ltx_p" id="S6.p3.1">Each tool is rated on a scale of 1-4 stars (4 being the best and 1 being the worst) for each of these metrics, as shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13772v1#S6.T5" title="Table 5 ‣ 6. Discussion ‣ Do Large Language Models Understand Performance Optimization?"><span class="ltx_text ltx_ref_tag">5</span></a>. From our evaluation, Codee ranks as the best overall in optimizing HPC codes, with OpenAI o1 being a close second. Claude-3.5 ranks third, while Llama-3.2 ranks last. It is noteworthy that both Llama-3.2 and Claude-3.5 generate a large amount of incorrect code across experiments, although Claude-3.5 generally achieves a higher speedup. Codee’s strength lies in its capability to invoke reliable and accurate compiler-based analysis, exhibit low response latency, and avoid context length limitations. Nonetheless, its primary limitation is that its optimization patterns are predefined and cannot be customized to user-specific code, and it is incapable of offering algorithm-related enhancements. Although LLMs generally fall short in generating correct code, we observed that they have great potential in suggesting optimizations adapted to user code. Notably, OpenAI o1, as a reasoning model, performs quite well in optimizing code, especially in suggesting algorithm-related optimizations, yielding speedups even better than Codee. 
It is also worth noting that neither Codee nor LLMs have proven to be mature tools for automatically optimizing large-scale codebases.</p> </div> </section> <section class="ltx_section" id="S7"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">7. </span>Conclusions</h2> <div class="ltx_para" id="S7.p1"> <p class="ltx_p" id="S7.p1.1">In this study, we designed and conducted a series of experiments using HPC-focused benchmarks to compare the impact of LLMs on performance optimization against state-of-the-art compiler-based tools. Our evaluation reveals that state-of-the-art LLMs lag behind in several critical aspects, especially correctness. However, we recognize the potential for integrating LLMs with traditional tools, as traditional tools typically offer only a subset of possible transformations. Our performance-optimization agent system also demonstrates the potential of integrating LLMs with profilers to circumvent their lack of access to the execution environment. In the future, we plan to extend our performance evaluation to hardware beyond CPUs. Additionally, we aim to enhance our agent system through structured inputs, more concise prompts, and static code analysis.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[1]</span> <span class="ltx_bibblock"> GPT-4 documentation. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4" title="">https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[2]</span> <span class="ltx_bibblock"> Llama 3.2 documentation. 
</span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/meta-llama/llama-models?tab=readme-ov-file" title="">https://github.com/meta-llama/llama-models?tab=readme-ov-file</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[3]</span> <span class="ltx_bibblock"> Spechpc™ 2021 benchmark suites, 2024. </span> <span class="ltx_bibblock">Accessed: 2024-10-25. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[4]</span> <span class="ltx_bibblock"> Arkadeep Acharya, Brijraj Singh, and Naoyuki Onoe. </span> <span class="ltx_bibblock">Llm based generation of item-description for recommendation system. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib4.1.1">Proceedings of the 17th ACM Conference on Recommender Systems</span>, RecSys ’23, page 1204–1207, New York, NY, USA, 2023. Association for Computing Machinery. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[5]</span> <span class="ltx_bibblock"> L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. </span> <span class="ltx_bibblock">Hpctoolkit: tools for performance analysis of optimized parallel programs http://hpctoolkit.org. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib5.1.1">Concurr. Comput.: Pract. Exper.</span>, 22(6):685–701, apr 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[6]</span> <span class="ltx_bibblock"> Toufique Ahmed and Prem Devanbu. </span> <span class="ltx_bibblock">Few-shot training llms for project-specific code-summarization. 
</span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib6.1.1">Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[7]</span> <span class="ltx_bibblock"> Cognition AI. </span> <span class="ltx_bibblock">Devin ai: World’s first ai software engineer, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[8]</span> <span class="ltx_bibblock"> Amazon. </span> <span class="ltx_bibblock">What is codewhisperer?, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[9]</span> <span class="ltx_bibblock"> Anthropic. </span> <span class="ltx_bibblock">Introducing the next generation of claude, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[10]</span> <span class="ltx_bibblock"> LM Arena. </span> <span class="ltx_bibblock">Lm arena: Benchmarking and evaluating language models. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://lmarena.ai/" title="">https://lmarena.ai/</a>, 2025. </span> <span class="ltx_bibblock">Accessed: 2025-01-07. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. </span> <span class="ltx_bibblock">Program synthesis with large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib11.1.1">ArXiv</span>, abs/2108.07732, 2021. 
</span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> Satanjeev Banerjee and Alon Lavie. </span> <span class="ltx_bibblock">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. </span> <span class="ltx_bibblock">In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, <span class="ltx_text ltx_font_italic" id="bib.bib12.1.1">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</span>, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> Abhinav Bhatele, Stephanie Brink, and Todd Gamblin. </span> <span class="ltx_bibblock">Hatchet: pruning the overgrowth in parallel profiles. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib13.1.1">Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</span>, SC ’19, New York, NY, USA, 2019. Association for Computing Machinery. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. </span> <span class="ltx_bibblock">Language models are few-shot learners. 
</span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib14.1.1">Proceedings of the 34th International Conference on Neural Information Processing Systems</span>, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. </span> <span class="ltx_bibblock">Rodinia: A benchmark suite for heterogeneous computing. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib15.1.1">2009 IEEE international symposium on workload characterization (IISWC)</span>, pages 44–54. Ieee, 2009. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> Le Chen, Arijit Bhattacharjee, Nesreen K. Ahmed, Niranjan Hasabnis, Gal Oren, Vy A. Vo, and Ali Jannesari. </span> <span class="ltx_bibblock">Ompgpt: A generative pre-trained transformer model for openmp. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib16.1.1">ArXiv</span>, abs/2401.16445, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. </span> <span class="ltx_bibblock">Evaluating large language models trained on code. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib17.1.1">arXiv preprint arXiv:2107.03374</span>, 2021. 
</span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. </span> <span class="ltx_bibblock">Evaluating large language models trained on code. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib18.1.1">ArXiv</span>, abs/2107.03374, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[19]</span> <span class="ltx_bibblock"> Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. </span> <span class="ltx_bibblock">Teaching large language models to self-debug, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[20]</span> <span class="ltx_bibblock"> Codee. </span> <span class="ltx_bibblock">Codee: Automated code performance tuning for c/c++, 2024. </span> <span class="ltx_bibblock">Accessed: 2024-08-30. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[21]</span> <span class="ltx_bibblock"> Codee. </span> <span class="ltx_bibblock">Open catalog, 2024. 
</span> <span class="ltx_bibblock">Accessed: 2024-08-27. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[22]</span> <span class="ltx_bibblock"> Codee. </span> <span class="ltx_bibblock">Performance demos repository, 2024. </span> <span class="ltx_bibblock">Accessed: 2024-08-23. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[23]</span> <span class="ltx_bibblock"> Codee Community. </span> <span class="ltx_bibblock">Open catalog, 2025. </span> <span class="ltx_bibblock">Accessed: 2025-01-08. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[24]</span> <span class="ltx_bibblock"> LangChain Contributors. </span> <span class="ltx_bibblock">Langchain, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[25]</span> <span class="ltx_bibblock"> Xianzhong Ding, Le Chen, Murali Emani, Chunhua Liao, Pei-Hung Lin, Tristan Vanderbruggen, Zhen Xie, Alberto Cerpa, and Wan Du. </span> <span class="ltx_bibblock">Hpc-gpt: Integrating large language model for high-performance computing. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib25.1.1">Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis</span>, SC-W 2023. ACM, November 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[26]</span> <span class="ltx_bibblock"> Lamia Djoudi, Denis Barthou, Patrick Carribault, Christophe Lemuet, Jean-Thomas Acquaviva, and William Jalby. </span> <span class="ltx_bibblock">Maqao : Modular assembler quality analyzer and optimizer for itanium 2, 2005. 
</span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[27]</span> <span class="ltx_bibblock"> Lamia Djoudi, Denis Barthou, Patrick Carribault, Christophe Lemuet, Jean-Thomas Acquaviva, William Jalby, et al. </span> <span class="ltx_bibblock">Maqao: Modular assembler quality analyzer and optimizer for itanium 2. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib27.1.1">The 4th Workshop on EPIC architectures and compiler technology, San Jose</span>, volume 200, 2005. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[28]</span> <span class="ltx_bibblock"> Lamia Djoudi, Vasil Khachidze, and William Jalby. </span> <span class="ltx_bibblock">Kbs-maqao: A knowledge based system for maqao tool. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib28.1.1">2009 11th IEEE International Conference on High Performance Computing and Communications</span>, pages 571–578, 2009. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[29]</span> <span class="ltx_bibblock"> Xiangxin Fang and Lev Mukhanov. </span> <span class="ltx_bibblock">Towards llm-based optimization compilers. can llms learn how to apply a single peephole optimization? reasoning is all llms need! </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib29.1.1">arXiv preprint arXiv:2412.12163</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[30]</span> <span class="ltx_bibblock"> Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. </span> <span class="ltx_bibblock">The pile: An 800gb dataset of diverse text for language modeling, 2020. 
</span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[31]</span> <span class="ltx_bibblock"> Significant Gravitas. </span> <span class="ltx_bibblock">Autogpt, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[32]</span> <span class="ltx_bibblock"> Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. </span> <span class="ltx_bibblock">Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[33]</span> <span class="ltx_bibblock"> Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Xiaodong Song, and Jacob Steinhardt. </span> <span class="ltx_bibblock">Measuring coding challenge competence with apps. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib33.1.1">ArXiv</span>, abs/2105.09938, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[34]</span> <span class="ltx_bibblock"> Anysphere Inc. </span> <span class="ltx_bibblock">Cursor: The ai code editor, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[35]</span> <span class="ltx_bibblock"> Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. </span> <span class="ltx_bibblock">A survey on large language models for code generation, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[36]</span> <span class="ltx_bibblock"> Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 
</span> <span class="ltx_bibblock">An llm compiler for parallel function calling. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib36.1.1">arXiv preprint arXiv:2312.04511</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[37]</span> <span class="ltx_bibblock"> Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. </span> <span class="ltx_bibblock">The stack: 3 tb of permissively licensed source code, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[38]</span> <span class="ltx_bibblock"> Chris Lattner and Vikram Adve. </span> <span class="ltx_bibblock">Llvm: A compilation framework for lifelong program analysis & transformation. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib38.1.1">International symposium on code generation and optimization, 2004. CGO 2004.</span>, pages 75–86. IEEE, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[39]</span> <span class="ltx_bibblock"> Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. </span> <span class="ltx_bibblock">Coderl: Mastering code generation through pretrained models and deep reinforcement learning. </span> <span class="ltx_bibblock">In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, <span class="ltx_text ltx_font_italic" id="bib.bib39.1.1">Advances in Neural Information Processing Systems</span>, volume 35, pages 21314–21328. Curran Associates, Inc., 2022. 
</span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[40]</span> <span class="ltx_bibblock"> Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. </span> <span class="ltx_bibblock">Starcoder: may the source be with you!, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[41]</span> <span class="ltx_bibblock"> Chin-Yew Lin. </span> <span class="ltx_bibblock">ROUGE: A package for automatic evaluation of summaries. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib41.1.1">Text Summarization Branches Out</span>, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[42]</span> <span class="ltx_bibblock"> NERSC. </span> <span class="ltx_bibblock">Codee documentation. 
</span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://docs.nersc.gov/tools/performance/codee/" title="">https://docs.nersc.gov/tools/performance/codee/</a>, 2025. </span> <span class="ltx_bibblock">Accessed: 2025-01-04. </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[43]</span> <span class="ltx_bibblock"> Daniel Nichols, Joshua H Davis, Zhaojun Xie, Arjun Rajaram, and Abhinav Bhatele. </span> <span class="ltx_bibblock">Can large language models write parallel code? </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib43.1.1">Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing</span>, pages 281–294, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[44]</span> <span class="ltx_bibblock"> Daniel Nichols, Joshua Hoke Davis, Zhaojun Xie, Arjun Rajaram, and Abhinav Bhatele. </span> <span class="ltx_bibblock">Can large language models write parallel code? </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib44.1.1">ArXiv</span>, abs/2401.12554, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[45]</span> <span class="ltx_bibblock"> Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele. </span> <span class="ltx_bibblock">Performance-aligned llms for generating fast code. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib45.1.1">ArXiv</span>, abs/2404.18864, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[46]</span> <span class="ltx_bibblock"> Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele. 
</span> <span class="ltx_bibblock">Performance-aligned llms for generating fast code. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib46.1.1">arXiv preprint arXiv:2404.18864</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[47]</span> <span class="ltx_bibblock"> Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, and Vincent Ng. </span> <span class="ltx_bibblock">On evaluating the efficiency of source code generated by llms, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[48]</span> <span class="ltx_bibblock"> Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, Tom Henighan, Benjamin Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, John Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Christopher Olah. </span> <span class="ltx_bibblock">In-context learning and induction heads. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib48.1.1">ArXiv</span>, abs/2209.11895, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[49]</span> <span class="ltx_bibblock"> OpenAI. </span> <span class="ltx_bibblock">Gpt-4 technical report, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[50]</span> <span class="ltx_bibblock"> Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 
</span> <span class="ltx_bibblock">Training language models to follow instructions with human feedback, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[51]</span> <span class="ltx_bibblock"> Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. </span> <span class="ltx_bibblock">Bleu: a method for automatic evaluation of machine translation. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib51.1.1">Annual Meeting of the Association for Computational Linguistics</span>, 2002. </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[52]</span> <span class="ltx_bibblock"> Louis-Noel Pouchet and Tomofumi Yuki. </span> <span class="ltx_bibblock">Polybenchc-4.2.1, 2024. </span> <span class="ltx_bibblock">Accessed: 2024-9-23. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[53]</span> <span class="ltx_bibblock"> Ruizhong Qiu, Weiliang Will Zeng, Hanghang Tong, James Ezick, and Christopher Lott. </span> <span class="ltx_bibblock">How efficient is llm-generated code? a rigorous & high-standard benchmark. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib53.1.1">ArXiv</span>, abs/2406.06647, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[54]</span> <span class="ltx_bibblock"> Brad Richardson, Koichi Sakaguchi, Manuel Arenaz, William I Gustafson Jr, Jacob Shpund, Ulises Costi Blanco, Alvaro Goldar Dieste, et al. </span> <span class="ltx_bibblock">Optimizing the weather research and forecasting model with openmp offload and codee. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib54.1.1">arXiv preprint arXiv:2409.07232</span>, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[55]</span> <span class="ltx_bibblock"> Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. </span> <span class="ltx_bibblock">Code llama: Open foundation models for code, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[56]</span> <span class="ltx_bibblock"> Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. </span> <span class="ltx_bibblock">Code llama: Open foundation models for code, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[57]</span> <span class="ltx_bibblock"> Sameer S. Shende and Allen D. Malony. </span> <span class="ltx_bibblock">The tau parallel performance system. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib57.1.1">The International Journal of High Performance Computing Applications</span>, 20:287 – 311, 2006. </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[58]</span> <span class="ltx_bibblock"> Sameer S Shende and Allen D Malony. </span> <span class="ltx_bibblock">The tau parallel performance system. 
</span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib58.1.1">The International Journal of High Performance Computing Applications</span>, 20(2):287–311, 2006. </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[59]</span> <span class="ltx_bibblock"> Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. </span> <span class="ltx_bibblock">Reflexion: Language agents with verbal reinforcement learning. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib59.1.1">Advances in Neural Information Processing Systems</span>, 36, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[60]</span> <span class="ltx_bibblock"> Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K. Reddy. </span> <span class="ltx_bibblock">Execution-based code generation using deep reinforcement learning, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[61]</span> <span class="ltx_bibblock"> Sofia Eleni Spatharioti, David M. Rothschild, Daniel G. Goldstein, and Jake M. Hofman. </span> <span class="ltx_bibblock">Comparing traditional and llm-based search for consumer choice: A randomized experiment. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib61.1.1">ArXiv</span>, abs/2307.03744, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[62]</span> <span class="ltx_bibblock"> Rachel N. Slaybaugh Thomas M. Evans, Alissa S. Stafford and Kevin T. Clarno. </span> <span class="ltx_bibblock">Denovo: A new three-dimensional parallel discrete ordinates code in scale. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib62.1.1">Nuclear Technology</span>, 171(2):171–200, 2010. 
</span> </li> <li class="ltx_bibitem" id="bib.bib63"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[63]</span> <span class="ltx_bibblock"> Together.ai. </span> <span class="ltx_bibblock">Together.ai - advancing collaborative ai development, 2025. </span> <span class="ltx_bibblock">Accessed: 2025-01-08. </span> </li> <li class="ltx_bibitem" id="bib.bib64"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[64]</span> <span class="ltx_bibblock"> Pedro Valero-Lara, Alexis Huante, Mustafa Al Lail, William F. Godoy, Keita Teranishi, Prasanna Balaprakash, and Jeffrey S. Vetter. </span> <span class="ltx_bibblock">Comparing llama-2 and gpt-3 llms for hpc kernels generation. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib64.1.1">ArXiv</span>, abs/2309.07103, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib65"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[65]</span> <span class="ltx_bibblock"> Balaji Venkatachalam, Dan Gusfield, and Yelena Frid. </span> <span class="ltx_bibblock">Faster algorithms for rna-folding using the four-russians method. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib65.1.1">Algorithms for Molecular Biology</span>, 9:1–12, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib66"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[66]</span> <span class="ltx_bibblock"> Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. </span> <span class="ltx_bibblock">A survey on large language model based autonomous agents. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib66.1.1">Frontiers of Computer Science</span>, 18(6):186345, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib67"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[67]</span> <span class="ltx_bibblock"> Xingyao Wang, Boxuan Li, Yufan Song, Frank F. 
Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. </span> <span class="ltx_bibblock">OpenHands: An open platform for AI software developers as generalist agents, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib68"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[68]</span> <span class="ltx_bibblock"> Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. </span> <span class="ltx_bibblock">Chain-of-thought prompting elicits reasoning in large language models, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib69"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[69]</span> <span class="ltx_bibblock"> Chunqiu Steven Xia and Lingming Zhang. </span> <span class="ltx_bibblock">Less training, more repairing please: revisiting automated program repair via zero-shot learning. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib69.1.1">Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib70"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[70]</span> <span class="ltx_bibblock"> Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. </span> <span class="ltx_bibblock">Hallucination is inevitable: An innate limitation of large language models, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib71"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[71]</span> <span class="ltx_bibblock"> Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. </span> <span class="ltx_bibblock">Tree of thoughts: Deliberate problem solving with large language models. 
</span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib71.1.1">Advances in Neural Information Processing Systems</span>, 36, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib72"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[72]</span> <span class="ltx_bibblock"> Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. </span> <span class="ltx_bibblock">ReAct: Synergizing reasoning and acting in language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib72.1.1">arXiv preprint arXiv:2210.03629</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib73"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[73]</span> <span class="ltx_bibblock"> Yongjian Yu and S.T. Acton. </span> <span class="ltx_bibblock">Speckle reducing anisotropic diffusion. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib73.1.1">IEEE Transactions on Image Processing</span>, 11(11):1260–1270, 2002. </span> </li> <li class="ltx_bibitem" id="bib.bib74"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[74]</span> <span class="ltx_bibblock"> Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. </span> <span class="ltx_bibblock">RepoCoder: Repository-level code completion through iterative retrieval and generation, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib75"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[75]</span> <span class="ltx_bibblock"> Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. </span> <span class="ltx_bibblock">CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib75.1.1">arXiv preprint arXiv:2401.07339</span>, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib76"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[76]</span> <span class="ltx_bibblock"> Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. </span> <span class="ltx_bibblock">CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib77"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[77]</span> <span class="ltx_bibblock"> Keren Zhou, Xiaozhu Meng, Ryuichi Sai, Dejan Grubisic, and John Mellor-Crummey. </span> <span class="ltx_bibblock">An automated tool for analysis and tuning of GPU-accelerated code in HPC applications. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib77.1.1">IEEE Transactions on Parallel and Distributed Systems</span>, 33(4):854–865, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib78"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[78]</span> <span class="ltx_bibblock"> Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey. </span> <span class="ltx_bibblock">GPA: A GPU performance advisor based on instruction sampling. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib78.1.1">2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)</span>, pages 115–125. IEEE, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib79"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[79]</span> <span class="ltx_bibblock"> Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Jiajun Chen, Lei Li, and Shujian Huang. </span> <span class="ltx_bibblock">Multilingual machine translation with large language models: Empirical results and analysis. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib79.1.1">ArXiv</span>, abs/2304.04675, 2023. 
</span> </li> </ul> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Mon Mar 17 22:26:10 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/">LaTeXML</a> </div></footer> </div> </body> </html>