NVIDIA HPC Compilers User's Guide — NVIDIA HPC Compilers User's Guide 25.1 documentation

<!DOCTYPE html> <html class="writer-html5" lang="en" > <head> <meta charset="utf-8" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>NVIDIA HPC Compilers User&#39;s Guide &mdash; NVIDIA HPC Compilers User&#39;s Guide 25.1 documentation</title> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> <link rel="stylesheet" href="_static/copybutton.css" type="text/css" /> <link rel="stylesheet" href="_static/design-style.b7bb847fb20b106c3d81b95245e65545.min.css" type="text/css" /> <link rel="stylesheet" href="_static/omni-style.css" type="text/css" /> <link rel="stylesheet" href="_static/api-styles.css" type="text/css" /> <link rel="shortcut icon" href="_static/favicon.ico"/> <!--[if lt IE 9]> <script src="_static/js/html5shiv.min.js"></script> <![endif]--> <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script> <script src="_static/jquery.js"></script> <script src="_static/underscore.js"></script> <script src="_static/doctools.js"></script> <script src="_static/mermaid-init.js"></script> <script src="_static/clipboard.min.js"></script> <script src="_static/copybutton.js"></script> <script src="_static/design-tabs.js"></script> <script src="_static/geoip/geoip.js"></script> <script src="_static/js/theme.js"></script> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <link rel="prev" title="Contents" href="contents.html" /> <script src="//assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js"></script> </head> <body class="wy-body-for-nav"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side"> <div class="wy-side-scroll"> <div class="wy-side-nav-search" > <a href="contents.html"> <img 
src="_static/acc-comp-logo.png" class="logo" alt="Logo"/> </a> <div role="search"> <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu"> <ul class="current"> <li class="toctree-l1 current"><a class="current reference internal" href="#">1. Getting Started</a><ul> <li class="toctree-l2"><a class="reference internal" href="#overview">1.1. Overview</a></li> <li class="toctree-l2"><a class="reference internal" href="#creating-an-example">1.2. Creating an Example</a></li> <li class="toctree-l2"><a class="reference internal" href="#invoking-the-command-level-nvidia-hpc-compilers">1.3. Invoking the Command-level NVIDIA HPC Compilers</a><ul> <li class="toctree-l3"><a class="reference internal" href="#command-line-syntax">1.3.1. Command-line Syntax</a></li> <li class="toctree-l3"><a class="reference internal" href="#command-line-options">1.3.2. Command-line Options</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#filename-conventions">1.4. Filename Conventions</a><ul> <li class="toctree-l3"><a class="reference internal" href="#input-files">1.4.1. Input Files</a></li> <li class="toctree-l3"><a class="reference internal" href="#output-files">1.4.2. Output Files</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#fortran-c-and-c-data-types">1.5. Fortran, C++ and C Data Types</a></li> <li class="toctree-l2"><a class="reference internal" href="#platform-specific-considerations">1.6. Platform-specific considerations</a><ul> <li class="toctree-l3"><a class="reference internal" href="#using-the-nvidia-hpc-compilers-on-linux">1.6.1. 
Using the NVIDIA HPC Compilers on Linux</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#site-specific-customization-of-the-compilers">1.7. Site-Specific Customization of the Compilers</a><ul> <li class="toctree-l3"><a class="reference internal" href="#use-siterc-files">1.7.1. Use siterc Files</a></li> <li class="toctree-l3"><a class="reference internal" href="#using-user-rc-files">1.7.2. Using User rc Files</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#common-development-tasks">1.8. Common Development Tasks</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#use-command-line-options">2. Use Command-line Options</a><ul> <li class="toctree-l2"><a class="reference internal" href="#command-line-option-overview">2.1. Command-line Option Overview</a><ul> <li class="toctree-l3"><a class="reference internal" href="#command-line-options-syntax">2.1.1. Command-line Options Syntax</a></li> <li class="toctree-l3"><a class="reference internal" href="#command-line-suboptions">2.1.2. Command-line Suboptions</a></li> <li class="toctree-l3"><a class="reference internal" href="#command-line-conflicting-options">2.1.3. Command-line Conflicting Options</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#help-with-command-line-options">2.2. Help with Command-line Options</a></li> <li class="toctree-l2"><a class="reference internal" href="#getting-started-with-performance">2.3. Getting Started with Performance</a><ul> <li class="toctree-l3"><a class="reference internal" href="#using-fast">2.3.1. Using -fast</a></li> <li class="toctree-l3"><a class="reference internal" href="#other-performance-related-options">2.3.2. Other Performance-Related Options</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#frequently-used-options">2.4. 
Frequently-used Options</a></li> <li class="toctree-l2"><a class="reference internal" href="#floating-point-subnormal">2.5. Floating-point Subnormal</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#multicore-cpu-optimization">3. Multicore CPU Optimization</a><ul> <li class="toctree-l2"><a class="reference internal" href="#overview-of-optimization">3.1. Overview of Optimization</a><ul> <li class="toctree-l3"><a class="reference internal" href="#local-optimization">3.1.1. Local Optimization</a></li> <li class="toctree-l3"><a class="reference internal" href="#global-optimization">3.1.2. Global Optimization</a></li> <li class="toctree-l3"><a class="reference internal" href="#loop-optimization-unrolling-vectorization-and-parallelization">3.1.3. Loop Optimization: Unrolling, Vectorization and Parallelization</a></li> <li class="toctree-l3"><a class="reference internal" href="#interprocedural-analysis-ipa-and-optimization">3.1.4. Interprocedural Analysis (IPA) and Optimization</a></li> <li class="toctree-l3"><a class="reference internal" href="#function-inlining">3.1.5. Function Inlining</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#getting-started-with-optimization">3.2. Getting Started with Optimization</a><ul> <li class="toctree-l3"><a class="reference internal" href="#help">3.2.1. -help</a></li> <li class="toctree-l3"><a class="reference internal" href="#minfo">3.2.2. -Minfo</a></li> <li class="toctree-l3"><a class="reference internal" href="#mneginfo">3.2.3. -Mneginfo</a></li> <li class="toctree-l3"><a class="reference internal" href="#dryrun">3.2.4. -dryrun</a></li> <li class="toctree-l3"><a class="reference internal" href="#v">3.2.5. -v</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#local-and-global-optimization">3.3. Local and Global Optimization</a><ul> <li class="toctree-l3"><a class="reference internal" href="#msafeptr">3.3.1. 
-Msafeptr</a></li> <li class="toctree-l3"><a class="reference internal" href="#o">3.3.2. -O</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#loop-unrolling-using-munroll">3.4. Loop Unrolling using -Munroll</a></li> <li class="toctree-l2"><a class="reference internal" href="#vectorization-using-mvect">3.5. Vectorization using -Mvect</a><ul> <li class="toctree-l3"><a class="reference internal" href="#vectorization-sub-options">3.5.1. Vectorization Sub-options</a></li> <li class="toctree-l3"><a class="reference internal" href="#vectorization-example-using-simd-instructions">3.5.2. Vectorization Example Using SIMD Instructions</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#interprocedural-analysis-and-optimization-using-mipa">3.6. Interprocedural Analysis and Optimization using -Mipa</a><ul> <li class="toctree-l3"><a class="reference internal" href="#building-a-program-without-ipa-single-step">3.6.1. Building a Program Without IPA – Single Step</a></li> <li class="toctree-l3"><a class="reference internal" href="#building-a-program-without-ipa-several-steps">3.6.2. Building a Program Without IPA – Several Steps</a></li> <li class="toctree-l3"><a class="reference internal" href="#building-a-program-without-ipa-using-make">3.6.3. Building a Program Without IPA Using Make</a></li> <li class="toctree-l3"><a class="reference internal" href="#building-a-program-with-ipa">3.6.4. Building a Program with IPA</a></li> <li class="toctree-l3"><a class="reference internal" href="#building-a-program-with-ipa-single-step">3.6.5. Building a Program with IPA – Single Step</a></li> <li class="toctree-l3"><a class="reference internal" href="#building-a-program-with-ipa-several-steps">3.6.6. Building a Program with IPA – Several Steps</a></li> <li class="toctree-l3"><a class="reference internal" href="#building-a-program-with-ipa-using-make">3.6.7. 
Building a Program with IPA Using Make</a></li> <li class="toctree-l3"><a class="reference internal" href="#questions-about-ipa">3.6.8. Questions about IPA</a></li> </ul> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#using-function-inlining">4. Using Function Inlining</a><ul> <li class="toctree-l2"><a class="reference internal" href="#automatic-function-inlining-in-c-and-c">4.1. Automatic function inlining in C++ and C</a></li> <li class="toctree-l2"><a class="reference internal" href="#invoking-procedure-inlining">4.2. Invoking Procedure Inlining</a></li> <li class="toctree-l2"><a class="reference internal" href="#using-an-inline-library">4.3. Using an Inline Library</a></li> <li class="toctree-l2"><a class="reference internal" href="#creating-an-inline-library">4.4. Creating an Inline Library</a><ul> <li class="toctree-l3"><a class="reference internal" href="#working-with-inline-libraries">4.4.1. Working with Inline Libraries</a></li> <li class="toctree-l3"><a class="reference internal" href="#dependencies">4.4.2. Dependencies</a></li> <li class="toctree-l3"><a class="reference internal" href="#updating-inline-libraries-makefiles">4.4.3. Updating Inline Libraries – Makefiles</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#error-detection-during-inlining">4.5. Error Detection during Inlining</a></li> <li class="toctree-l2"><a class="reference internal" href="#examples">4.6. Examples</a></li> <li class="toctree-l2"><a class="reference internal" href="#restrictions-on-inlining">4.7. Restrictions on Inlining</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#using-gpus">5. Using GPUs</a><ul> <li class="toctree-l2"><a class="reference internal" href="#id1">5.1. Overview</a></li> <li class="toctree-l2"><a class="reference internal" href="#terminology">5.2. Terminology</a></li> <li class="toctree-l2"><a class="reference internal" href="#execution-model">5.3. 
Execution Model</a><ul> <li class="toctree-l3"><a class="reference internal" href="#host-functions">5.3.1. Host Functions</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#memory-model">5.4. Memory Model</a><ul> <li class="toctree-l3"><a class="reference internal" href="#separate-host-and-accelerator-memory-considerations">5.4.1. Separate Host and Accelerator Memory Considerations</a><ul> <li class="toctree-l4"><a class="reference internal" href="#accelerator-memory">5.4.1.1. Accelerator Memory</a></li> <li class="toctree-l4"><a class="reference internal" href="#staging-memory-buffer">5.4.1.2. Staging Memory Buffer</a></li> <li class="toctree-l4"><a class="reference internal" href="#cache-management">5.4.1.3. Cache Management</a></li> <li class="toctree-l4"><a class="reference internal" href="#environment-variables-controlling-device-memory-management">5.4.1.4. Environment Variables Controlling Device Memory Management</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#managed-and-unified-memory-modes">5.4.2. Managed and Unified Memory Modes</a><ul> <li class="toctree-l4"><a class="reference internal" href="#managed-memory-mode">5.4.2.1. Managed Memory Mode</a></li> <li class="toctree-l4"><a class="reference internal" href="#unified-memory-mode">5.4.2.2. Unified Memory Mode</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#memory-pool-allocator">5.4.3. Memory Pool Allocator</a></li> <li class="toctree-l3"><a class="reference internal" href="#interception-of-deallocations">5.4.4. Interception of Deallocations</a></li> <li class="toctree-l3"><a class="reference internal" href="#command-line-options-selecting-compiler-memory-modes">5.4.5. Command-line Options Selecting Compiler Memory Modes</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#fortran-pointers-in-device-code">5.5. 
Fortran pointers in device code</a></li> <li class="toctree-l2"><a class="reference internal" href="#calling-routines-in-a-compute-kernel">5.6. Calling routines in a compute kernel</a></li> <li class="toctree-l2"><a class="reference internal" href="#supported-processors-and-gpus">5.7. Supported Processors and GPUs</a></li> <li class="toctree-l2"><a class="reference internal" href="#cuda-versions">5.8. CUDA Versions</a></li> <li class="toctree-l2"><a class="reference internal" href="#compute-capability">5.9. Compute Capability</a></li> <li class="toctree-l2"><a class="reference internal" href="#ptx-jit-compilation">5.10. PTX JIT Compilation</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#using-openacc">6. Using OpenACC</a><ul> <li class="toctree-l2"><a class="reference internal" href="#openacc-programming-model">6.1. OpenACC Programming Model</a><ul> <li class="toctree-l3"><a class="reference internal" href="#levels-of-parallelism">6.1.1. Levels of Parallelism</a></li> <li class="toctree-l3"><a class="reference internal" href="#enable-openacc-directives">6.1.2. Enable OpenACC Directives</a></li> <li class="toctree-l3"><a class="reference internal" href="#openacc-support">6.1.3. OpenACC Support</a></li> <li class="toctree-l3"><a class="reference internal" href="#openacc-extensions">6.1.4. OpenACC Extensions</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#compiling-an-openacc-program">6.2. Compiling an OpenACC Program</a><ul> <li class="toctree-l3"><a class="reference internal" href="#no-acc">6.2.1. -[no]acc</a></li> <li class="toctree-l3"><a class="reference internal" href="#gpu">6.2.2. -gpu</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#openacc-for-multicore-cpus">6.3. OpenACC for Multicore CPUs</a></li> <li class="toctree-l2"><a class="reference internal" href="#openacc-with-cuda-unified-memory">6.4. 
OpenACC with CUDA Unified Memory</a></li> <li class="toctree-l2"><a class="reference internal" href="#openacc-error-handling">6.5. OpenACC Error Handling</a></li> <li class="toctree-l2"><a class="reference internal" href="#openacc-and-cuda-graphs">6.6. OpenACC and CUDA Graphs</a></li> <li class="toctree-l2"><a class="reference internal" href="#host-and-device-trip-count-options">6.7. Host and Device Trip Count Options</a><ul> <li class="toctree-l3"><a class="reference internal" href="#when-to-use-gpu-tripcount-device-or-gpu-tripcount-host">6.7.1. When to Use <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:device</span></code> or <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:host</span></code></a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#environment-variables">6.8. Environment Variables</a></li> <li class="toctree-l2"><a class="reference internal" href="#profiling-accelerator-kernels">6.9. Profiling Accelerator Kernels</a></li> <li class="toctree-l2"><a class="reference internal" href="#openacc-runtime-libraries">6.10. OpenACC Runtime Libraries</a><ul> <li class="toctree-l3"><a class="reference internal" href="#runtime-library-definitions">6.10.1. Runtime Library Definitions</a></li> <li class="toctree-l3"><a class="reference internal" href="#runtime-library-routines">6.10.2. Runtime Library Routines</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#supported-intrinsics">6.11. Supported Intrinsics</a><ul> <li class="toctree-l3"><a class="reference internal" href="#supported-fortran-intrinsics-summary-table">6.11.1. Supported Fortran Intrinsics Summary Table</a></li> <li class="toctree-l3"><a class="reference internal" href="#supported-c-intrinsics-summary-table">6.11.2. Supported C Intrinsics Summary Table</a></li> </ul> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#using-openmp">7. 
Using OpenMP</a><ul> <li class="toctree-l2"><a class="reference internal" href="#id2">7.1. Environment Variables</a></li> <li class="toctree-l2"><a class="reference internal" href="#fallback-mode">7.2. Fallback Mode</a></li> <li class="toctree-l2"><a class="reference internal" href="#loop">7.3. Loop</a></li> <li class="toctree-l2"><a class="reference internal" href="#openmp-subset">7.4. OpenMP Subset</a></li> <li class="toctree-l2"><a class="reference internal" href="#using-metadirective">7.5. Using metadirective</a></li> <li class="toctree-l2"><a class="reference internal" href="#mapping-target-constructs-to-cuda-streams">7.6. Mapping target constructs to CUDA streams</a></li> <li class="toctree-l2"><a class="reference internal" href="#noncontiguous-array-sections">7.7. Noncontiguous Array Sections</a></li> <li class="toctree-l2"><a class="reference internal" href="#openmp-with-cuda-unified-memory">7.8. OpenMP with CUDA Unified Memory</a></li> <li class="toctree-l2"><a class="reference internal" href="#multiple-device-support">7.9. Multiple Device Support</a></li> <li class="toctree-l2"><a class="reference internal" href="#interoperability-with-cuda">7.10. Interoperability with CUDA</a></li> <li class="toctree-l2"><a class="reference internal" href="#interoperability-with-other-openmp-compilers">7.11. Interoperability with Other OpenMP Compilers</a></li> <li class="toctree-l2"><a class="reference internal" href="#gnu-stl">7.12. GNU STL</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#using-stdpar">8. Using Stdpar</a><ul> <li class="toctree-l2"><a class="reference internal" href="#gpu-memory-modes">8.1. GPU Memory Modes</a></li> <li class="toctree-l2"><a class="reference internal" href="#stdpar-c">8.2. Stdpar C++</a><ul> <li class="toctree-l3"><a class="reference internal" href="#introduction-to-stdpar-c">8.2.1. 
Introduction to Stdpar C++</a></li> <li class="toctree-l3"><a class="reference internal" href="#nvc-compiler-parallel-algorithms-support">8.2.2. NVC++ Compiler Parallel Algorithms Support</a><ul> <li class="toctree-l4"><a class="reference internal" href="#enabling-parallel-algorithms-with-the-stdpar-option">8.2.2.1. Enabling Parallel Algorithms with the -stdpar Option</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#stdpar-c-simple-example">8.2.3. Stdpar C++ Simple Example</a></li> <li class="toctree-l3"><a class="reference internal" href="#openacc-implementation-of-parallel-algorithms">8.2.4. OpenACC Implementation of Parallel Algorithms</a></li> <li class="toctree-l3"><a class="reference internal" href="#coding-guidelines-for-gpu-accelerating-parallel-algorithms">8.2.5. Coding Guidelines for GPU-accelerating Parallel Algorithms</a><ul> <li class="toctree-l4"><a class="reference internal" href="#parallel-algorithms-and-device-function-annotations">8.2.5.1. Parallel Algorithms and Device Function Annotations</a></li> <li class="toctree-l4"><a class="reference internal" href="#data-management-in-parallel-algorithms">8.2.5.2. Data Management in Parallel Algorithms</a></li> <li class="toctree-l4"><a class="reference internal" href="#parallel-algorithms-and-function-pointers">8.2.5.3. Parallel Algorithms and Function Pointers</a></li> <li class="toctree-l4"><a class="reference internal" href="#random-access-iterators">8.2.5.4. Random Access Iterators</a></li> <li class="toctree-l4"><a class="reference internal" href="#interoperability-with-the-c-standard-library">8.2.5.5. Interoperability with the C++ Standard Library</a></li> <li class="toctree-l4"><a class="reference internal" href="#no-exceptions-in-gpu-code">8.2.5.6. No Exceptions in GPU Code</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#nvc-experimental-features">8.2.6. 
NVC++ Experimental Features</a><ul> <li class="toctree-l4"><a class="reference internal" href="#multi-dimensional-spans">8.2.6.1. Multi-dimensional Spans</a></li> <li class="toctree-l4"><a class="reference internal" href="#senders-and-receivers">8.2.6.2. Senders and Receivers</a></li> <li class="toctree-l4"><a class="reference internal" href="#linear-algebra">8.2.6.3. Linear Algebra</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#stdpar-c-larger-example-lulesh">8.2.7. Stdpar C++ Larger Example: LULESH</a></li> <li class="toctree-l3"><a class="reference internal" href="#interoperability-with-openacc">8.2.8. Interoperability with OpenACC</a><ul> <li class="toctree-l4"><a class="reference internal" href="#data-management-directives">8.2.8.1. Data Management Directives</a></li> <li class="toctree-l4"><a class="reference internal" href="#external-device-function-annotations">8.2.8.2. External Device Function Annotations</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#getting-started-with-parallel-algorithms-for-gpus">8.2.9. Getting Started with Parallel Algorithms for GPUs</a><ul> <li class="toctree-l4"><a class="reference internal" href="#supported-nvidia-gpus">8.2.9.1. Supported NVIDIA GPUs</a></li> <li class="toctree-l4"><a class="reference internal" href="#supported-cuda-versions">8.2.9.2. Supported CUDA Versions</a></li> </ul> </li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#stdpar-fortran">8.3. Stdpar Fortran</a><ul> <li class="toctree-l3"><a class="reference internal" href="#calling-routines-in-do-concurrent-on-the-gpu">8.3.1. Calling Routines in DO CONCURRENT on the GPU</a></li> <li class="toctree-l3"><a class="reference internal" href="#gpu-data-management">8.3.2. GPU Data Management</a></li> <li class="toctree-l3"><a class="reference internal" href="#id3">8.3.3. 
Interoperability with OpenACC</a></li> <li class="toctree-l3"><a class="reference internal" href="#interoperability-with-cuda-fortran">8.3.4. Interoperability with CUDA Fortran</a></li> </ul> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#pcast">9. PCAST</a><ul> <li class="toctree-l2"><a class="reference internal" href="#id5">9.1. Overview</a></li> <li class="toctree-l2"><a class="reference internal" href="#pcast-with-a-golden-file">9.2. PCAST with a “Golden” File</a></li> <li class="toctree-l2"><a class="reference internal" href="#pcast-with-openacc">9.3. PCAST with OpenACC</a></li> <li class="toctree-l2"><a class="reference internal" href="#limitations">9.4. Limitations</a></li> <li class="toctree-l2"><a class="reference internal" href="#id6">9.5. Environment Variables</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#using-mpi">10. Using MPI</a><ul> <li class="toctree-l2"><a class="reference internal" href="#using-open-mpi-on-linux">10.1. Using Open MPI on Linux</a></li> <li class="toctree-l2"><a class="reference internal" href="#using-mpi-compiler-wrappers">10.2. Using MPI Compiler Wrappers</a></li> <li class="toctree-l2"><a class="reference internal" href="#testing-and-benchmarking">10.3. Testing and Benchmarking</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#creating-and-using-libraries">11. Creating and Using Libraries</a><ul> <li class="toctree-l2"><a class="reference internal" href="#using-builtin-math-functions-in-c-and-c">11.1. Using builtin Math Functions in C++ and C</a></li> <li class="toctree-l2"><a class="reference internal" href="#using-system-library-routines">11.2. Using System Library Routines</a></li> <li class="toctree-l2"><a class="reference internal" href="#creating-and-using-shared-object-files-on-linux">11.3. 
Creating and Using Shared Object Files on Linux</a><ul> <li class="toctree-l3"><a class="reference internal" href="#procedure-to-create-a-use-a-shared-object-file">11.3.1. Procedure to create and use a shared object file</a></li> <li class="toctree-l3"><a class="reference internal" href="#ldd-command">11.3.2. ldd Command</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#using-lib3f">11.4. Using LIB3F</a></li> <li class="toctree-l2"><a class="reference internal" href="#lapack-blas-and-ffts">11.5. LAPACK, BLAS and FFTs</a></li> <li class="toctree-l2"><a class="reference internal" href="#linking-with-scalapack">11.6. Linking with ScaLAPACK</a></li> <li class="toctree-l2"><a class="reference internal" href="#the-c-standard-template-library">11.7. The C++ Standard Template Library</a></li> <li class="toctree-l2"><a class="reference internal" href="#nvidia-performance-libraries-nvpl">11.8. NVIDIA Performance Libraries (NVPL)</a></li> <li class="toctree-l2"><a class="reference internal" href="#linking-with-the-nvmalloc-library">11.9. Linking with the nvmalloc Library</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#id7">12. Environment Variables</a><ul> <li class="toctree-l2"><a class="reference internal" href="#setting-environment-variables">12.1. Setting Environment Variables</a><ul> <li class="toctree-l3"><a class="reference internal" href="#setting-environment-variables-on-linux">12.1.1. Setting Environment Variables on Linux</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#hpc-compiler-related-environment-variables">12.2. HPC Compiler Related Environment Variables</a></li> <li class="toctree-l2"><a class="reference internal" href="#hpc-compilers-environment-variables">12.3. HPC Compilers Environment Variables</a><ul> <li class="toctree-l3"><a class="reference internal" href="#fortranopt">12.3.1. 
FORTRANOPT</a></li> <li class="toctree-l3"><a class="reference internal" href="#fort-fmt-recl">12.3.2. FORT_FMT_RECL</a></li> <li class="toctree-l3"><a class="reference internal" href="#gmon-out-prefix">12.3.3. GMON_OUT_PREFIX</a></li> <li class="toctree-l3"><a class="reference internal" href="#ld-library-path">12.3.4. LD_LIBRARY_PATH</a></li> <li class="toctree-l3"><a class="reference internal" href="#manpath">12.3.5. MANPATH</a></li> <li class="toctree-l3"><a class="reference internal" href="#no-stop-message">12.3.6. NO_STOP_MESSAGE</a></li> <li class="toctree-l3"><a class="reference internal" href="#path">12.3.7. PATH</a></li> <li class="toctree-l3"><a class="reference internal" href="#nvcompiler-fpu-state">12.3.8. NVCOMPILER_FPU_STATE</a></li> <li class="toctree-l3"><a class="reference internal" href="#nvcompiler-term">12.3.9. NVCOMPILER_TERM</a></li> <li class="toctree-l3"><a class="reference internal" href="#nvcompiler-term-debug">12.3.10. NVCOMPILER_TERM_DEBUG</a></li> <li class="toctree-l3"><a class="reference internal" href="#pwd">12.3.11. PWD</a></li> <li class="toctree-l3"><a class="reference internal" href="#static-random-seed">12.3.12. STATIC_RANDOM_SEED</a></li> <li class="toctree-l3"><a class="reference internal" href="#tmp">12.3.13. TMP</a></li> <li class="toctree-l3"><a class="reference internal" href="#tmpdir">12.3.14. TMPDIR</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#using-environment-modules-on-linux">12.4. Using Environment Modules on Linux</a></li> <li class="toctree-l2"><a class="reference internal" href="#stack-traceback-and-jit-debugging">12.5. Stack Traceback and JIT Debugging</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#distributing-files-deployment">13. Distributing Files - Deployment</a><ul> <li class="toctree-l2"><a class="reference internal" href="#deploying-applications-on-linux">13.1. 
Deploying Applications on Linux</a><ul> <li class="toctree-l3"><a class="reference internal" href="#runtime-library-considerations">13.1.1. Runtime Library Considerations</a></li> <li class="toctree-l3"><a class="reference internal" href="#bit-linux-considerations">13.1.2. 64-bit Linux Considerations</a></li> <li class="toctree-l3"><a class="reference internal" href="#linux-redistributable-files">13.1.3. Linux Redistributable Files</a></li> <li class="toctree-l3"><a class="reference internal" href="#restrictions-on-linux-portability">13.1.4. Restrictions on Linux Portability</a></li> <li class="toctree-l3"><a class="reference internal" href="#licensing-for-redistributable-redist-files">13.1.5. Licensing for Redistributable (REDIST) Files</a></li> </ul> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#inter-language-calling">14. Inter-language Calling</a><ul> <li class="toctree-l2"><a class="reference internal" href="#overview-of-calling-conventions">14.1. Overview of Calling Conventions</a></li> <li class="toctree-l2"><a class="reference internal" href="#inter-language-calling-considerations">14.2. Inter-language Calling Considerations</a></li> <li class="toctree-l2"><a class="reference internal" href="#functions-and-subroutines">14.3. Functions and Subroutines</a></li> <li class="toctree-l2"><a class="reference internal" href="#upper-and-lower-case-conventions-underscores">14.4. Upper and Lower Case Conventions, Underscores</a></li> <li class="toctree-l2"><a class="reference internal" href="#compatible-data-types">14.5. Compatible Data Types</a><ul> <li class="toctree-l3"><a class="reference internal" href="#fortran-named-common-blocks">14.5.1. Fortran Named Common Blocks</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#argument-passing-and-return-values">14.6. Argument Passing and Return Values</a><ul> <li class="toctree-l3"><a class="reference internal" href="#passing-by-value-val">14.6.1. 
Passing by Value (%VAL)</a></li> <li class="toctree-l3"><a class="reference internal" href="#character-return-values">14.6.2. Character Return Values</a></li> <li class="toctree-l3"><a class="reference internal" href="#complex-return-values">14.6.3. Complex Return Values</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#array-indices">14.7. Array Indices</a></li> <li class="toctree-l2"><a class="reference internal" href="#id8">14.8. Examples</a><ul> <li class="toctree-l3"><a class="reference internal" href="#example-fortran-calling-c">14.8.1. Example – Fortran Calling C</a></li> <li class="toctree-l3"><a class="reference internal" href="#example-c-calling-fortran">14.8.2. Example – C Calling Fortran</a></li> <li class="toctree-l3"><a class="reference internal" href="#example-c-calling-c">14.8.3. Example – C++ Calling C</a></li> <li class="toctree-l3"><a class="reference internal" href="#id9">14.8.4. Example – C Calling C++</a></li> <li class="toctree-l3"><a class="reference internal" href="#id10">14.8.5. Example – Fortran Calling C++</a></li> <li class="toctree-l3"><a class="reference internal" href="#id11">14.8.6. Example – C++ Calling Fortran</a></li> </ul> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#programming-considerations-for-64-bit-environments">15. Programming Considerations for 64-Bit Environments</a><ul> <li class="toctree-l2"><a class="reference internal" href="#data-types-in-the-64-bit-environment">15.1. Data Types in the 64-Bit Environment</a><ul> <li class="toctree-l3"><a class="reference internal" href="#c-and-c-data-types">15.1.1. C++ and C Data Types</a></li> <li class="toctree-l3"><a class="reference internal" href="#fortran-data-types">15.1.2. Fortran Data Types</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#large-static-data-in-linux">15.2. 
Large Static Data in Linux</a></li> <li class="toctree-l2"><a class="reference internal" href="#large-dynamically-allocated-data">15.3. Large Dynamically Allocated Data</a></li> <li class="toctree-l2"><a class="reference internal" href="#bit-array-indexing">15.4. 64-Bit Array Indexing</a></li> <li class="toctree-l2"><a class="reference internal" href="#compiler-options-for-64-bit-programming">15.5. Compiler Options for 64-bit Programming</a></li> <li class="toctree-l2"><a class="reference internal" href="#practical-limitations-of-large-array-programming">15.6. Practical Limitations of Large Array Programming</a></li> <li class="toctree-l2"><a class="reference internal" href="#medium-memory-model-and-large-array-in-c">15.7. Medium Memory Model and Large Array in C</a></li> <li class="toctree-l2"><a class="reference internal" href="#medium-memory-model-and-large-array-in-fortran">15.8. Medium Memory Model and Large Array in Fortran</a></li> <li class="toctree-l2"><a class="reference internal" href="#large-array-and-small-memory-model-in-fortran">15.9. Large Array and Small Memory Model in Fortran</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#c-and-c-inline-assembly-and-intrinsics">16. C++ and C Inline Assembly and Intrinsics</a><ul> <li class="toctree-l2"><a class="reference internal" href="#inline-assembly">16.1. Inline Assembly</a></li> <li class="toctree-l2"><a class="reference internal" href="#extended-inline-assembly">16.2. Extended Inline Assembly</a><ul> <li class="toctree-l3"><a class="reference internal" href="#output-operands">16.2.1. Output Operands</a></li> <li class="toctree-l3"><a class="reference internal" href="#input-operands">16.2.2. Input Operands</a></li> <li class="toctree-l3"><a class="reference internal" href="#clobber-list">16.2.3. Clobber List</a></li> <li class="toctree-l3"><a class="reference internal" href="#additional-constraints">16.2.4. 
Additional Constraints</a></li> <li class="toctree-l3"><a class="reference internal" href="#simple-constraints">16.2.5. Simple Constraints</a></li> <li class="toctree-l3"><a class="reference internal" href="#machine-constraints">16.2.6. Machine Constraints</a></li> <li class="toctree-l3"><a class="reference internal" href="#multiple-alternative-constraints">16.2.7. Multiple Alternative Constraints</a></li> <li class="toctree-l3"><a class="reference internal" href="#constraint-modifiers">16.2.8. Constraint Modifiers</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#operand-aliases">16.3. Operand Aliases</a></li> <li class="toctree-l2"><a class="reference internal" href="#assembly-string-modifiers">16.4. Assembly String Modifiers</a></li> <li class="toctree-l2"><a class="reference internal" href="#extended-asm-macros">16.5. Extended Asm Macros</a></li> <li class="toctree-l2"><a class="reference internal" href="#intrinsics">16.6. Intrinsics</a></li> </ul> </li> </ul> </div> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" > <i data-toggle="wy-nav-top" class="fa fa-bars"></i> <a href="contents.html">NVIDIA HPC Compilers User's Guide</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="Page navigation"> <ul class="wy-breadcrumbs"> <li><a href="contents.html" class="icon icon-home"></a> &raquo;</li> <li>NVIDIA HPC Compilers User's Guide</li> <li class="wy-breadcrumbs-aside"> </li> <li class="wy-breadcrumbs-aside"> <span>v25.1 |</span> <a href="../../pdf/hpc251ug.pdf" class="reference external">PDF</a> <span>|</span> <a href="https://docs.nvidia.com/hpc-sdk/archive/index.html" class="reference external">Archive</a> <span>&nbsp;</span> </li> </ul> <hr/> </div> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> <span class="target" 
id="nvc"></span><p class="rubric-h1 rubric">NVIDIA HPC Compilers User’s Guide</p> <p class="rubric-h1 rubric">Preface</p> <p>This guide is part of a set of manuals that describe how to use the NVIDIA HPC Fortran, C++ and C compilers. These compilers include the <em>NVFORTRAN</em>, <em>NVC++</em> and <em>NVC</em> compilers. They work in conjunction with an assembler, linker, libraries and header files on your target system, and include a CUDA toolchain, libraries and header files for GPU computing. You can use the NVIDIA HPC compilers to develop, optimize and parallelize applications for NVIDIA GPUs and x86-64 and Arm Server multicore CPUs.</p> <p>The <em>NVIDIA HPC Compilers User’s Guide</em> provides operating instructions for the NVIDIA HPC compilers command-level development environment. The <em>NVIDIA HPC Compilers Reference Manual</em> contains details concerning the NVIDIA compilers’ interpretation of the Fortran, C++ and C language standards, implementation of language extensions, and command-level compilation. Users are expected to have previous experience with or knowledge of the Fortran, C++ and C programming languages. These guides do not teach the Fortran, C++ or C programming languages.</p> <p class="rubric-h2 rubric">Audience Description</p> <p>This manual is intended for scientists and engineers using the NVIDIA HPC compilers. To use these compilers, you should be aware of the role of high-level languages, such as Fortran, C++ and C as well as parallel programming models such as CUDA, OpenACC and OpenMP in the software development process, and you should have some level of understanding of programming. The NVIDIA HPC compilers are available on a variety of NVIDIA GPUs and x86-64 and Arm CPU-based platforms and operating systems. 
You need to be familiar with the basic commands available on your system.</p> <p class="rubric-h2 rubric" id="standards">Compatibility and Conformance to Standards</p> <p>Your system needs to be running a properly installed and configured version of the NVIDIA HPC compilers. For information on installing NVIDIA HPC compilers, refer to the Release Notes and Installation Guide included with your software.</p> <p>For further information, refer to the following:</p> <ul class="simple"> <li><p><em>American National Standard Programming Language FORTRAN</em>, ANSI X3.9-1978 (1978).</p></li> <li><p><em>ISO/IEC 1539-1 : 1991, Information technology – Programming Languages – Fortran</em>, Geneva, 1991 (Fortran 90).</p></li> <li><p><em>ISO/IEC 1539-1 : 1997, Information technology – Programming Languages – Fortran</em>, Geneva, 1997 (Fortran 95).</p></li> <li><p><em>ISO/IEC 1539-1 : 2004, Information technology – Programming Languages – Fortran</em>, Geneva, 2004 (Fortran 2003).</p></li> <li><p><em>ISO/IEC 1539-1 : 2010, Information technology – Programming Languages – Fortran</em>, Geneva, 2010 (Fortran 2008).</p></li> <li><p><em>ISO/IEC 1539-1 : 2018, Information technology – Programming Languages – Fortran</em>, Geneva, 2018 (Fortran 2018).</p></li> <li><p><em>Fortran 95 Handbook Complete ISO/ANSI Reference</em>, Adams et al, The MIT Press, Cambridge, Mass, 1997.</p></li> <li><p><em>The Fortran 2003 Handbook</em>, Adams et al, Springer, 2009.</p></li> <li><p><em>OpenACC Application Program Interface</em>, Version 2.7, November 2018, <a class="reference external" href="http://www.openacc.org">http://www.openacc.org</a>.</p></li> <li><p><em>OpenMP Application Program Interface</em>, Version 5.0, November 2018, <a class="reference external" href="http://www.openmp.org">http://www.openmp.org</a>.</p></li> <li><p><em>Programming in VAX Fortran</em>, Version 4.0, Digital Equipment Corporation (September, 1984).</p></li> <li><p><em>IBM VS Fortran</em>, IBM Corporation, Rev.
GC26-4119.</p></li> <li><p>Military Standard, Fortran, DOD Supplement to American National Standard Programming Language Fortran, ANSI X3.9-1978, MIL-STD-1753 (November 9, 1978).</p></li> <li><p><em>American National Standard Programming Language C</em>, ANSI X3.159-1989.</p></li> <li><p>ISO/IEC 9899:1990, Information technology – Programming Languages – C, Geneva, 1990 (C90).</p></li> <li><p>ISO/IEC 9899:1999, Information technology – Programming Languages – C, Geneva, 1999 (C99).</p></li> <li><p>ISO/IEC 9899:2011, Information Technology – Programming Languages – C, Geneva, 2011 (C11).</p></li> <li><p>ISO/IEC 14882:2011, Information Technology – Programming Languages – C++, Geneva, 2011 (C++11).</p></li> <li><p>ISO/IEC 14882:2014, Information Technology – Programming Languages – C++, Geneva, 2014 (C++14).</p></li> <li><p>ISO/IEC 14882:2017, Information Technology – Programming Languages – C++, Geneva, 2017 (C++17).</p></li> </ul> <p class="rubric-h2 rubric" id="nv-ug-organization">Organization</p> <p>This guide contains the essential information on how to use the NVIDIA HPC compilers and is divided into these sections:</p> <p><a class="reference internal" href="#gs-nv"><span class="std std-ref">Getting Started</span></a> provides an introduction to the NVIDIA HPC compilers and describes their use and overall features.</p> <p><a class="reference internal" href="#cmdln-options-use"><span class="std std-ref">Use Command-line Options</span></a> provides an overview of the command-line options as well as task-related lists of options.</p> <p><a class="reference internal" href="#opt-parallel"><span class="std std-ref">Multicore CPU Optimization</span></a> describes multicore CPU optimizations and related compiler options.</p> <p><a class="reference internal" href="#fn-inline-use"><span class="std std-ref">Using Function Inlining</span></a> describes how to use function inlining and shows how to create an inline library.</p> <p><a class="reference internal"
href="#openmp-use"><span class="std std-ref">Using OpenMP</span></a> describes how to use OpenMP for multicore CPU programming.</p> <p><a class="reference internal" href="#acc-use"><span class="std std-ref">Using OpenACC</span></a> describes how to use an NVIDIA GPU and gives an introduction to using OpenACC.</p> <p><a class="reference internal" href="#stdpar-use"><span class="std std-ref">Using Stdpar</span></a> describes how to use C++/Fortran Standard Language Parallelism for programming an NVIDIA GPU or multicore CPU.</p> <p><a class="reference internal" href="#pcast"><span class="std std-ref">PCAST</span></a> describes how to use the Parallel Compiler Assisted Testing features of the HPC Compilers.</p> <p><a class="reference internal" href="#mpi-use"><span class="std std-ref">Using MPI</span></a> describes how to use MPI with the NVIDIA HPC compilers.</p> <p><a class="reference internal" href="#lib-create-use"><span class="std std-ref">Creating and Using Libraries</span></a> discusses NVIDIA HPC compiler support libraries, shared object files, and environment variables that affect the behavior of the compilers.</p> <p><a class="reference internal" href="#env-vars-use"><span class="std std-ref">Environment Variables</span></a> describes the environment variables that affect the behavior of the NVIDIA HPC compilers.</p> <p><a class="reference internal" href="#deploy-dist-files"><span class="std std-ref">Distributing Files – Deployment</span></a> describes the deployment of your files once you have built, debugged and compiled them successfully.</p> <p><a class="reference internal" href="#intr-lang-call"><span class="std std-ref">Inter-language Calling</span></a> provides examples showing how to place C language calls in a Fortran program and Fortran language calls in a C program.</p> <p><a class="reference internal" href="#prog-64bits"><span class="std std-ref">Programming Considerations for 64-Bit Environments</span></a> discusses issues of which programmers 
should be aware when targeting 64-bit processors.</p> <p><a class="reference internal" href="#inline-asm-intrin-c-cpp"><span class="std std-ref">C++ and C Inline Assembly and Intrinsics</span></a> describes how to use inline assembly code in C++ and C programs, as well as how to use intrinsic functions that map directly to assembly machine instructions.</p> <p class="rubric-h2 rubric" id="hw-sw-constraints">Hardware and Software Constraints</p> <p>This guide describes versions of the NVIDIA HPC compilers that target NVIDIA GPUs and x86-64 and Arm CPUs. Details concerning environment-specific values and defaults and system-specific features or limitations are presented in the release notes delivered with the NVIDIA HPC compilers.</p> <p class="rubric-h2 rubric" id="doc-conventions">Conventions</p> <p>This guide uses the following conventions:</p> <dl class="simple"> <dt><em>italic</em></dt><dd><p>is used for emphasis.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">Constant</span> <span class="pre">Width</span></code></dt><dd><p>is used for filenames, directories, arguments, options, examples, and for language statements in the text, including assembly language statements.</p> </dd> <dt><strong>Bold</strong></dt><dd><p>is used for commands.</p> </dd> <dt>[ item1 ]</dt><dd><p>in general, square brackets indicate optional items. In this case item1 is optional. In the context of p/t-sets, square brackets are required to specify a p/t-set.</p> </dd> <dt>{ item2 | item3 }</dt><dd><p>braces indicate that a selection is required. In this case, you must select either item2 or item3.</p> </dd> <dt>filename …</dt><dd><p>an ellipsis indicates a repetition. Zero or more of the preceding item may occur.
In this example, multiple filenames are allowed.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">FORTRAN</span></code></dt><dd><p>Fortran language statements are shown in the text of this guide using a reduced fixed point size.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">C++</span> <span class="pre">and</span> <span class="pre">C</span></code></dt><dd><p>C++ and C language statements are shown in the text of this guide using a reduced fixed point size.</p> </dd> </dl> <p class="rubric-h2 rubric" id="doc-conventions-more">Terms</p> <p>A number of terms related to systems, processors, compilers and tools are used throughout this guide. For example:</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 24%" /> <col style="width: 30%" /> <col style="width: 24%" /> <col style="width: 22%" /> </colgroup> <tbody> <tr class="row-odd"><td><p>accelerator</p></td> <td><p>FMA</p></td> <td><p>-mcmodel=medium</p></td> <td><p>shared library</p></td> </tr> <tr class="row-even"><td><p>AVX</p></td> <td><p>host</p></td> <td><p>-mcmodel=small</p></td> <td><p>SIMD</p></td> </tr> <tr class="row-odd"><td><p>CUDA</p></td> <td><p>hyperthreading (HT)</p></td> <td><p>MPI</p></td> <td><p>SSE</p></td> </tr> <tr class="row-even"><td><p>device</p></td> <td><p>large arrays</p></td> <td><p>MPICH</p></td> <td><p>static linking</p></td> </tr> <tr class="row-odd"><td><p>driver</p></td> <td><p>linux86-64</p></td> <td><p>NUMA</p></td> <td><p>x86-64</p></td> </tr> <tr class="row-even"><td><p>DWARF</p></td> <td><p>LLVM</p></td> <td><p>OpenPOWER</p></td> <td><p>Arm</p></td> </tr> <tr class="row-odd"><td><p>dynamic library</p></td> <td><p>multicore</p></td> <td><p>ppc64le</p></td> <td><p>Aarch64</p></td> </tr> </tbody> </table> <p>The following table lists the NVIDIA HPC compilers and their corresponding commands:</p> <table class="table-no-stripes docutils align-default" id="id12"> <caption><span
class="caption-text">Table 1. NVIDIA HPC Compilers and Commands</span><a class="headerlink" href="#id12" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 26%" /> <col style="width: 60%" /> <col style="width: 15%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Compiler or Tool</p></th> <th class="head"><p>Language or Function</p></th> <th class="head"><p>Command</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>NVFORTRAN</p></td> <td><p>ISO/ANSI Fortran 2003</p></td> <td><p>nvfortran</p></td> </tr> <tr class="row-odd"><td><p>NVC++</p></td> <td><p>ISO/ANSI C++17 with GNU compatibility</p></td> <td><p>nvc++</p></td> </tr> <tr class="row-even"><td><p>NVC</p></td> <td><p>ISO/ANSI C11</p></td> <td><p>nvc</p></td> </tr> </tbody> </table> <p>In general, the designation <em>NVFORTRAN</em> is used to refer to the NVIDIA Fortran compiler, and <em>nvfortran</em> is used to refer to the command that invokes the compiler. A similar convention is used for each of the NVIDIA HPC compilers.</p> <p>For simplicity, examples of command-line invocation of the compilers generally reference the <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code> command, and most source code examples are written in Fortran. Use of <em>NVC++</em> and <em>NVC</em> is consistent with <em>NVFORTRAN</em>, though there are command-line options and features of these compilers that do not apply to <em>NVFORTRAN</em>, and vice versa.</p> <p>There are a wide variety of x86-64 CPUs in use. Most of these CPUs are forward-compatible, but not backward-compatible, meaning that code compiled to target a given processor will not necessarily execute correctly on a previous-generation processor.</p> <p>A table listing the processor options that NVIDIA HPC compilers support is available in the Release Notes. 
The table also includes the features utilized by the compilers that distinguish them from a compatibility standpoint.</p> <p>In this manual, the convention is to use “x86-64” to specify the group of CPUs that are x86-compatible, 64-bit enabled, and run a 64-bit operating system. x86-64 processors can differ in terms of their support for various prefetch, SSE and AVX instructions. Where such distinctions are important with respect to a given compiler option or feature, it is explicitly noted in this manual.</p> <p class="rubric-h2 rubric">Related Publications</p> <p>The following documents contain additional information related to the NVIDIA HPC compilers.</p> <ul class="simple"> <li><p><em>System V Application Binary Interface Processor Supplement</em> by AT&amp;T UNIX System Laboratories, Inc. (Prentice Hall, Inc.).</p></li> <li><p><em>System V Application Binary Interface X86-64 Architecture Processor Supplement</em>.</p></li> <li><p><em>Fortran 95 Handbook Complete ISO/ANSI Reference</em>, Adams et al, The MIT Press, Cambridge, Mass, 1997.</p></li> <li><p><em>Programming in VAX Fortran, Version 4.0</em>, Digital Equipment Corporation (September, 1984).</p></li> <li><p><em>IBM VS Fortran</em>, IBM Corporation, Rev. GC26-4119.</p></li> <li><p><em>The C Programming Language</em> by Kernighan and Ritchie (Prentice Hall).</p></li> <li><p><em>C: A Reference Manual</em> by Samuel P. Harbison and Guy L. Steele Jr. (Prentice Hall, 1987).</p></li> <li><p><em>The Annotated C++ Reference Manual</em> by Margaret Ellis and Bjarne Stroustrup, AT&amp;T Bell Laboratories, Inc. (Addison-Wesley Publishing Co., 1990).</p></li> </ul> <section id="getting-started"> <span id="gs-nv"></span><h1><span class="section-number">1. </span>Getting Started<a class="headerlink" href="#getting-started" title="Permalink to this headline"></a></h1> <p>This section describes how to use the NVIDIA HPC compilers.</p> <section id="overview"> <h2><span class="section-number">1.1. 
</span>Overview<a class="headerlink" href="#overview" title="Permalink to this headline"></a></h2> <p>The command used to invoke a compiler, such as the nvfortran command, is called a <em>compiler driver</em>. The compiler driver controls the following phases of compilation: preprocessing, compiling, assembling, and linking. Once a file is compiled and an executable file is produced, you can execute, debug, or profile the program on your system.</p> <p>In general, using an NVIDIA HPC compiler involves three steps:</p> <ol class="arabic simple"> <li><p>Produce program source code in a file containing a .f extension or another appropriate extension, as described in <a class="reference internal" href="#fn-conv-input"><span class="std std-ref">Input Files</span></a>. This program may be one that you have written or one that you are modifying.</p></li> <li><p>Compile the program using the appropriate compiler command.</p></li> <li><p>Execute, debug, or profile the executable file on your system.</p></li> </ol> <p>You might also want to deploy your application, though this is not a required step.</p> <p>The NVIDIA HPC compilers allow many variations on these general program development steps. These variations include the following:</p> <ul class="simple"> <li><p>Stop the compilation after preprocessing, compiling or assembling to save and examine intermediate results.</p></li> <li><p>Provide options to the driver that control compiler optimization or that specify various features or limitations.</p></li> <li><p>Include as input intermediate files such as preprocessor output, compiler output, or assembler output.</p></li> </ul> <span class="target" id="example-hello"></span></section> <section id="creating-an-example"> <h2><span class="section-number">1.2. 
</span>Creating an Example<a class="headerlink" href="#creating-an-example" title="Permalink to this headline"></a></h2> <p>Let’s look at a simple example of using the NVIDIA Fortran compiler to create, compile, and execute a program that prints:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>hello </pre></div> </div> <ol class="arabic"> <li><p>Create your program. For this example, suppose you enter the following simple Fortran program in the file <code class="docutils literal notranslate"><span class="pre">hello.f</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>print *, &quot;hello&quot; end </pre></div> </div> </li> <li><p>Compile the program. When you created your program, you called it <code class="docutils literal notranslate"><span class="pre">hello.f</span></code>. In this example, we compile it from a shell command prompt using the default <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code> driver option. Use the following syntax:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran hello.f </pre></div> </div> <p>By default, the executable output is placed in the file <code class="docutils literal notranslate"><span class="pre">a.out</span></code>. However, you can specify an output file name by using the <code class="docutils literal notranslate"><span class="pre">-o</span></code> option.</p> <p>To place the executable output in the file hello, use this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -o hello hello.f </pre></div> </div> </li> <li><p>Execute the program.
To execute the resulting hello program, simply type the filename at the command prompt and press the <strong>Return</strong> or <strong>Enter</strong> key on your keyboard:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ hello </pre></div> </div> <p>Below is the expected output:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>hello </pre></div> </div> </li> </ol> </section> <section id="invoking-the-command-level-nvidia-hpc-compilers"> <h2><span class="section-number">1.3. </span>Invoking the Command-level NVIDIA HPC Compilers<a class="headerlink" href="#invoking-the-command-level-nvidia-hpc-compilers" title="Permalink to this headline"></a></h2> <p>To translate and link a Fortran, C, or C++ program, the <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code>, <code class="docutils literal notranslate"><span class="pre">nvc</span></code> and <code class="docutils literal notranslate"><span class="pre">nvc++</span></code> commands do the following:</p> <ol class="arabic simple"> <li><p>Preprocess the source text file.</p></li> <li><p>Check the syntax of the source text.</p></li> <li><p>Generate an assembly language file.</p></li> <li><p>Pass control to the subsequent assembly and linking steps.</p></li> </ol> <section id="command-line-syntax"> <h3><span class="section-number">1.3.1. </span>Command-line Syntax<a class="headerlink" href="#command-line-syntax" title="Permalink to this headline"></a></h3> <p>The compiler command-line syntax, using nvfortran as an example, is:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvfortran [options] [path]filename [...] 
</pre></div> </div> <p>Where:</p> <dl class="simple"> <dt>options</dt><dd><p>is one or more command-line options, all of which are described in detail in <a class="reference internal" href="#cmdln-options-use"><span class="std std-ref">Use Command-line Options</span></a>.</p> </dd> <dt>path</dt><dd><p>is the pathname to the directory containing the file named by filename. If you do not specify the path for a filename, the compiler uses the current directory. You must specify the path separately for each filename not in the current directory.</p> </dd> <dt>filename</dt><dd><p>is the name of a source file, preprocessed source file, assembly-language file, object file, or library to be processed by the compilation system. You can specify more than one [path]filename.</p> </dd> </dl> </section> <section id="command-line-options"> <h3><span class="section-number">1.3.2. </span>Command-line Options<a class="headerlink" href="#command-line-options" title="Permalink to this headline"></a></h3> <p>The command-line options control various aspects of the compilation process. For a complete alphabetical listing and a description of all the command-line options, refer to <a class="reference internal" href="#cmdln-options-use"><span class="std std-ref">Use Command-Line Options</span></a>.</p> <p>The following list provides important information about proper use of command-line options.</p> <ul> <li><p>Command-line options and their arguments are case sensitive.</p></li> <li><p>The compiler drivers recognize characters preceded by a hyphen (-) as command-line options. 
For example, the <code class="docutils literal notranslate"><span class="pre">-Mlist</span></code> option specifies that the compiler creates a listing file.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The convention for the text of this manual is to show command-line options using a dash instead of a hyphen; for example, you see <code class="docutils literal notranslate"><span class="pre">-Mlist</span></code>.</p> </div> </li> <li><p>The order of options and the filename is flexible. That is, you can place options before and after the filename argument on the command line. However, the placement of some options is significant, such as the <code class="docutils literal notranslate"><span class="pre">-l</span></code> option, in which the order of the filenames determines the search order.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>If two or more options contradict each other, the last one in the command line takes precedence.</p> </div> </li> <li><p>You may write linker options into a text file and pass that file to the compiler as an option by prefixing the filename with the ‘&#64;’ symbol, e.g. <code class="docutils literal notranslate"><span class="pre">&#64;file</span></code>. The contents of <code class="docutils literal notranslate"><span class="pre">&#64;file</span></code> are passed to the linker.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ echo &quot;foo.o bar.o&quot; &gt; ./option_file.rsp
$ nvc++ @./option_file.rsp </pre></div> </div> <p>The above will pass “foo.o bar.o” to the compiler as linker arguments.</p> </li> </ul> </section> </section> <section id="filename-conventions"> <h2><span class="section-number">1.4. </span>Filename Conventions<a class="headerlink" href="#filename-conventions" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC compilers use the filenames that you specify on the command line to find and to create input and output files.
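</p> <p>For instance, a typical compile-then-link sequence derives each output file name from the corresponding input file name (the file names shown here are illustrative):</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -c myprog.f          # writes the object file myprog.o
$ nvfortran myprog.o -o myprog   # links myprog.o into the executable myprog </pre></div> </div> <p>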
This section describes the input and output filename conventions for the phases of the compilation process.</p> <span class="target" id="fn-conv-input"></span><section id="input-files"> <h3><span class="section-number">1.4.1. </span>Input Files<a class="headerlink" href="#input-files" title="Permalink to this headline"></a></h3> <p>You can specify assembly-language files, preprocessed source files, Fortran/C/C++ source files, object files, and libraries as inputs on the command line. The compiler driver determines the type of each input file by examining the filename extensions.</p> <p>The drivers use the following conventions:</p> <dl class="simple"> <dt><code class="docutils literal notranslate"><span class="pre">filename.f</span></code></dt><dd><p>indicates a Fortran source file.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.F</span></code></dt><dd><p>indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.FOR</span></code></dt><dd><p>indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.F90</span></code></dt><dd><p>indicates a Fortran 90/95 source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.F95</span></code></dt><dd><p>indicates a Fortran 90/95 source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.f90</span></code></dt><dd><p>indicates a Fortran 90/95 source file that is in freeform format.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.f95</span></code></dt><dd><p>indicates a Fortran 90/95 
source file that is in freeform format.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.cuf</span></code></dt><dd><p>indicates a Fortran 90/95 source file in free format with CUDA Fortran extensions.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.CUF</span></code></dt><dd><p>indicates a Fortran 90/95 source file in free format with CUDA Fortran extensions and that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.c</span></code></dt><dd><p>indicates a C source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.C</span></code></dt><dd><p>indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.i</span></code></dt><dd><p>indicates a preprocessed C or C++ source file.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.cc</span></code></dt><dd><p>indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.cpp</span></code></dt><dd><p>indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.s</span></code></dt><dd><p>indicates an assembly-language file.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.o</span></code></dt><dd><p>(Linux) indicates an object file.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.a</span></code></dt><dd><p>(Linux) indicates a library of object files.</p> </dd> <dt><code 
class="docutils literal notranslate"><span class="pre">filename.so</span></code></dt><dd><p>(Linux only) indicates a library of shared object files.</p> </dd> </dl> <p>The driver passes files with <code class="docutils literal notranslate"><span class="pre">.s</span></code> extensions to the assembler and files with <code class="docutils literal notranslate"><span class="pre">.o</span></code>, <code class="docutils literal notranslate"><span class="pre">.so</span></code> and <code class="docutils literal notranslate"><span class="pre">.a</span></code> extensions to the linker. Input files with unrecognized extensions, or no extension, are also passed to the linker.</p> <p>Files with a <code class="docutils literal notranslate"><span class="pre">.F</span></code> (Capital F) or <code class="docutils literal notranslate"><span class="pre">.FOR</span></code> suffix are first preprocessed by the Fortran compilers and the output is passed to the compilation phase. The Fortran preprocessor functions like cpp for C programs, but is built in to the Fortran compilers rather than implemented through an invocation of cpp. This design ensures consistency in the preprocessing step regardless of the type or revision of operating system under which you are compiling.</p> <p>Any input files not needed for a particular phase of processing are not processed. For example, if on the command line you specify an assembly-language file (<code class="docutils literal notranslate"><span class="pre">filename.s</span></code>) and the <code class="docutils literal notranslate"><span class="pre">-S</span></code> option to stop before the assembly phase, the compiler takes no action on the assembly language file. Processing stops after compilation and the assembler does not run. In this scenario, the compilation must have been completed in a previous pass which created the <code class="docutils literal notranslate"><span class="pre">.s</span></code> file. 
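 
</p> <p>Such a two-pass sequence might look like the following, where the first invocation writes <code class="docutils literal notranslate"><span class="pre">hello.s</span></code> and a later invocation assembles and links it (file names illustrative):</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -S hello.f           # stop before the assembly phase; writes hello.s
$ nvfortran hello.s -o hello     # pass hello.s to the assembler and linker </pre></div> </div> <p>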
For a complete description of the <code class="docutils literal notranslate"><span class="pre">-S</span></code> option, refer to <a class="reference internal" href="#fn-conv-output"><span class="std std-ref">Output Files</span></a>.</p> <p>In addition to specifying primary input files on the command line, you can compile code from other files as include files, using the INCLUDE statement in a Fortran source file or the preprocessor #include directive in C and C++ source files and in Fortran source files that use a <code class="docutils literal notranslate"><span class="pre">.F</span></code> extension.</p> <p>When linking a program with a library, the linker extracts only those library components that the program needs. The compiler drivers link in several libraries by default. For more information about libraries, refer to <a class="reference internal" href="#lib-create-use"><span class="std std-ref">Create and Use Libraries</span></a>.</p> <span class="target" id="fn-conv-output"></span></section> <section id="output-files"> <h3><span class="section-number">1.4.2. </span>Output Files<a class="headerlink" href="#output-files" title="Permalink to this headline"></a></h3> <p>By default, an executable output file produced by one of the NVIDIA HPC compilers is placed in the file <code class="docutils literal notranslate"><span class="pre">a.out</span></code>. 
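</p> <p>For example (the source file name <code class="docutils literal notranslate"><span class="pre">hello.f90</span></code> is hypothetical):</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran hello.f90            # executable is written to a.out
$ nvfortran -o hello hello.f90   # executable is written to hello
</pre></div> </div> <p>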
As the <a class="reference internal" href="#example-hello"><span class="std std-ref">Hello example</span></a> shows, you can use the <code class="docutils literal notranslate"><span class="pre">-o</span></code> option to specify the output file name.</p> <p>If you use option <code class="docutils literal notranslate"><span class="pre">-F</span></code> (Fortran only), <code class="docutils literal notranslate"><span class="pre">-P</span></code> (C/C++ only), <code class="docutils literal notranslate"><span class="pre">-S</span></code> or <code class="docutils literal notranslate"><span class="pre">-c</span></code>, the compiler produces a file containing the output of the last completed phase for each input file, as specified by the option supplied.</p> <p>The output file is a preprocessed source file, an assembly-language file, or an unlinked object file, respectively. In contrast, the <code class="docutils literal notranslate"><span class="pre">-E</span></code> option does not produce a file, but displays the preprocessed source file on the standard output. When you use any of these options, the <code class="docutils literal notranslate"><span class="pre">-o</span></code> option is valid only if you specify a single input file. If no errors occur during processing, you can use the files created by these options as input to a future invocation of any of the NVIDIA compiler drivers.</p> <p>The following table lists the stop-after options and the output files that the compilers create when you use these options. It also indicates the accepted input files.</p> <table class="table-no-stripes docutils align-default" id="id13"> <caption><span class="caption-text">Table 2. 
Option Descriptions</span><a class="headerlink" href="#id13" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 8%" /> <col style="width: 9%" /> <col style="width: 56%" /> <col style="width: 27%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Option</p></th> <th class="head"><p>Stop After</p></th> <th class="head"><p>Input</p></th> <th class="head"><p>Output</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-E</span></code></p></td> <td><p>preprocessing</p></td> <td><p>Source files</p></td> <td><p>preprocessed file to standard out</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-F</span></code></p></td> <td><p>preprocessing</p></td> <td><p>Source files. This option is not valid for nvc or nvc++.</p></td> <td><p>preprocessed file (<code class="docutils literal notranslate"><span class="pre">.f</span></code>)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-P</span></code></p></td> <td><p>preprocessing</p></td> <td><p>Source files. 
This option is not valid for nvfortran.</p></td> <td><p>preprocessed file (<code class="docutils literal notranslate"><span class="pre">.i</span></code>)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-S</span></code></p></td> <td><p>compilation</p></td> <td><p>Source files or preprocessed files</p></td> <td><p>assembly-language file (<code class="docutils literal notranslate"><span class="pre">.s</span></code>)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-c</span></code></p></td> <td><p>assembly</p></td> <td><p>Source files, or preprocessed files, or assembly-language files</p></td> <td><p>unlinked object file (<code class="docutils literal notranslate"><span class="pre">.o</span></code> or <code class="docutils literal notranslate"><span class="pre">.obj</span></code>)</p></td> </tr> <tr class="row-odd"><td><p>none</p></td> <td><p>linking</p></td> <td><p>Source files, or preprocessed files, assembly-language files, object files, or libraries</p></td> <td><p>executable file (<code class="docutils literal notranslate"><span class="pre">a.out</span></code>)</p></td> </tr> </tbody> </table> <p>If you specify multiple input files or do not specify an object filename, the compiler uses the input filenames to derive corresponding default output filenames of the following form, where <em>filename</em> is the input filename without its extension:</p> <dl class="simple"> <dt><code class="docutils literal notranslate"><span class="pre">filename.f</span></code></dt><dd><p>indicates a preprocessed file, if you compiled a Fortran file using the <code class="docutils literal notranslate"><span class="pre">-⁠F</span></code> option.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.i</span></code></dt><dd><p>indicates a preprocessed file, if you compiled using the <code class="docutils literal notranslate"><span 
class="pre">-⁠P</span></code> option.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.lst</span></code></dt><dd><p>indicates a listing file from the <code class="docutils literal notranslate"><span class="pre">-⁠Mlist</span></code> option.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.o</span></code> or <code class="docutils literal notranslate"><span class="pre">filename.obj</span></code></dt><dd><p>indicates an object file from the <code class="docutils literal notranslate"><span class="pre">-⁠c</span></code> option.</p> </dd> <dt><code class="docutils literal notranslate"><span class="pre">filename.s</span></code></dt><dd><p>indicates an assembly-language file from the <code class="docutils literal notranslate"><span class="pre">-⁠S</span></code> option.</p> </dd> </dl> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Unless you specify otherwise, the destination directory for any output file is the current working directory. If the file exists in the destination directory, the compiler overwrites it.</p> </div> <p>The following example demonstrates the use of output filename extensions.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -c proto.f proto1.F </pre></div> </div> <p>This produces the output files <code class="docutils literal notranslate"><span class="pre">proto.o</span></code> and <code class="docutils literal notranslate"><span class="pre">proto1.o</span></code>, which are binary object files. Prior to compilation, the file <code class="docutils literal notranslate"><span class="pre">proto1.F</span></code> is preprocessed because it has a <code class="docutils literal notranslate"><span class="pre">.F</span></code> filename extension.</p> </section> </section> <section id="fortran-c-and-c-data-types"> <h2><span class="section-number">1.5. 
</span>Fortran, C++ and C Data Types<a class="headerlink" href="#fortran-c-and-c-data-types" title="Permalink to this headline"></a></h2> <p>The NVIDIA Fortran, C++ and C compilers recognize scalar and aggregate data types. A scalar data type holds a single value, such as the integer value 42 or the real value 112.6. An aggregate data type consists of one or more scalar data type objects, such as an array of integer values.</p> </section> <section id="platform-specific-considerations"> <h2><span class="section-number">1.6. </span>Platform-specific considerations<a class="headerlink" href="#platform-specific-considerations" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC Compilers are supported on x86-64 and 64-bit Arm multicore CPUs running Linux.</p> <section id="using-the-nvidia-hpc-compilers-on-linux"> <h3><span class="section-number">1.6.1. </span>Using the NVIDIA HPC Compilers on Linux<a class="headerlink" href="#using-the-nvidia-hpc-compilers-on-linux" title="Permalink to this headline"></a></h3> <p><strong>Linux Header Files</strong></p> <p>The Linux system header files contain many GNU gcc extensions. The NVIDIA HPC C++ and C compilers support many of these extensions and can compile most programs that the GNU compilers can compile. A few header files not interoperable with the NVIDIA compilers have been rewritten.</p> <p>If you are using the NVIDIA HPC C++ or C compilers, please make sure that the supplied versions of these include files are found before the system versions. This search order is the default unless you explicitly add a -I option that references one of the system <code class="docutils literal notranslate"><span class="pre">include</span></code> directories.</p> </section> </section> <section id="site-specific-customization-of-the-compilers"> <h2><span class="section-number">1.7. 
</span>Site-Specific Customization of the Compilers<a class="headerlink" href="#site-specific-customization-of-the-compilers" title="Permalink to this headline"></a></h2> <p>If you are using the NVIDIA HPC Compilers and want all your users to have access to specific libraries or other files, there are special files that allow you to customize the compilers for your site.</p> <section id="use-siterc-files"> <h3><span class="section-number">1.7.1. </span>Use siterc Files<a class="headerlink" href="#use-siterc-files" title="Permalink to this headline"></a></h3> <p>The NVIDIA HPC Compiler command-level drivers utilize a file named <code class="docutils literal notranslate"><span class="pre">siterc</span></code> to enable site-specific customization of the behavior of the NVIDIA compilers. The <code class="docutils literal notranslate"><span class="pre">siterc</span></code> file is located in the <code class="docutils literal notranslate"><span class="pre">bin</span></code> subdirectory of the NVIDIA HPC Compilers installation directory. Using <code class="docutils literal notranslate"><span class="pre">siterc</span></code>, you can control how the compiler drivers invoke the various components in the compilation tool chain.</p> </section> <section id="using-user-rc-files"> <h3><span class="section-number">1.7.2. </span>Using User rc Files<a class="headerlink" href="#using-user-rc-files" title="Permalink to this headline"></a></h3> <p>In addition to the siterc file, user <code class="docutils literal notranslate"><span class="pre">rc</span></code> files can reside in a given user’s home directory, as specified by the user’s HOME environment variable. You can use these files to control the respective NVIDIA HPC Compilers. 
All of these files are optional.</p> <p>On Linux, these files are named <code class="docutils literal notranslate"><span class="pre">.mynvfortranrc</span></code>, <code class="docutils literal notranslate"><span class="pre">.mynvcrc</span></code>, and <code class="docutils literal notranslate"><span class="pre">.mynvc++rc</span></code>.</p> <p>The following examples show how you can use these rc files to tailor a given installation for a particular purpose on <code class="docutils literal notranslate"><span class="pre">Linux_x86_64</span></code> targets. The process is similar with obvious substitutions for <code class="docutils literal notranslate"><span class="pre">aarch64</span></code> targets.</p> <table class="table-no-stripes docutils align-default" id="id14"> <caption><span class="caption-text">Table 3. Examples of Using siterc and User rc Files</span><a class="headerlink" href="#id14" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 46%" /> <col style="width: 54%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>To do this…</p></th> <th class="head"><p>Add the line shown to the indicated file(s)</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>Make available to all linux compilations the libraries found in /opt/newlibs/64</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">set</span> <span class="pre">SITELIB=/opt/newlibs/64;</span></code> to /opt/nv/Linux_x86_64/25.1/compilers/bin/siterc</p></td> </tr> <tr class="row-odd"><td><p>Add to all linux compilations a new library path: /opt/local/fast</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">append</span> <span class="pre">SITELIB=/opt/local/fast;</span></code> to /opt/nv/Linux_x86_64/25.1/compilers/bin/siterc</p></td> </tr> <tr class="row-even"><td><p>With linux compilations, change -Mmpi to link in /opt/mympi/64/libmpix.a</p></td> <td><p><code class="docutils literal notranslate"><span 
class="pre">set</span> <span class="pre">MPILIBDIR=/opt/mympi/64;</span> <span class="pre">set</span> <span class="pre">MPILIBNAME=mpix;</span></code> to /opt/nv/Linux_x86_64/25.1/compilers/bin/siterc</p></td> </tr> <tr class="row-odd"><td><p>Build a Fortran executable for linux that resolves shared objects in the relative directory ./REDIST</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">set</span> <span class="pre">RPATH=./REDIST;</span></code> to ~/.mynvfortranrc</p></td> </tr> </tbody> </table> </section> </section> <section id="common-development-tasks"> <h2><span class="section-number">1.8. </span>Common Development Tasks<a class="headerlink" href="#common-development-tasks" title="Permalink to this headline"></a></h2> <p>Now that you have a brief introduction to the compiler, let’s look at some common development tasks that you might wish to perform.</p> <ul class="simple"> <li><p>When you compile code you can specify a number of options on the command line that define specific characteristics related to how the program is compiled and linked, typically enhancing or overriding the default behavior of the compiler. For a list of the most common command line options and information on all the command line options, refer to <a class="reference internal" href="#cmdln-options-use"><span class="std std-ref">Use Command-line Options</span></a>.</p></li> <li><p>Code optimization for multicore CPUs allows the compiler to organize your code for efficient execution. While possibly increasing compilation time and making the code more difficult to debug, these techniques typically produce code that runs significantly faster than code that does not use them. 
For more information on optimization refer to <a class="reference internal" href="#opt-parallel"><span class="std std-ref">Multicore CPU Optimization</span></a>.</p></li> <li><p>Function inlining, a special type of optimization, replaces a call to a function or a subroutine with the body of the function or subroutine. This process can speed up execution by eliminating parameter passing and the function or subroutine call and return overhead. In addition, function inlining allows the compiler to optimize the function with the rest of the code. However, function inlining may also result in much larger code size with no increase in execution speed. For more information on function inlining, refer to <a class="reference internal" href="#fn-inline-use"><span class="std std-ref">Using Function Inlining</span></a>.</p></li> <li><p>A library is a collection of functions or subprograms used to develop software. Libraries contain “helper” code and data, which provide services to independent programs, allowing code and data to be shared and changed in a modular fashion. The functions and programs in a library are grouped for ease of use and linking. When creating your programs, it is often useful to incorporate standard libraries or proprietary ones. For more information on this topic, refer to <a class="reference internal" href="#lib-create-use"><span class="std std-ref">Creating and Using Libraries</span></a>.</p></li> <li><p>Environment variables define a set of dynamic values that can affect the way running processes behave on a computer. It is often useful to use these variables to set and pass information that alters the default behavior of the NVIDIA HPC Compilers and the executables which they generate. 
For more information on these variables, refer to <a class="reference internal" href="#env-vars-use"><span class="std std-ref">Environment Variables</span></a>.</p></li> <li><p>Deployment, though possibly an infrequent task, can present some unique issues related to concerns of porting the code to other systems. Deployment, in this context, involves distribution of a specific file or set of files that are already compiled and configured. The distribution must occur in such a way that the application executes accurately on another system which may not be configured exactly the same as the system on which the code was created. For more information on what you might need to know to successfully deploy your code, refer to <a class="reference internal" href="#deploy-dist-files"><span class="std std-ref">Distributing Files – Deployment</span></a>.</p></li> <li><p>An intrinsic is a function available in a given language whose implementation is handled specially by the compiler. Intrinsics make using processor-specific enhancements easier because they provide a C++ and C language interface to assembly instructions. In doing so, the compiler manages details that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data.</p></li> </ul> </section> </section> <section id="use-command-line-options"> <span id="cmdln-options-use"></span><h1><span class="section-number">2. </span>Use Command-line Options<a class="headerlink" href="#use-command-line-options" title="Permalink to this headline"></a></h1> <p>A command line option allows you to control specific behavior when a program is compiled and linked. This section describes the syntax for properly using command-line options and provides a brief overview of a few of the more common options.</p> <section id="command-line-option-overview"> <h2><span class="section-number">2.1. 
</span>Command-line Option Overview<a class="headerlink" href="#command-line-option-overview" title="Permalink to this headline"></a></h2> <p>Before looking at all the command-line options, first become familiar with the syntax for these options. There are a large number of options available to you, yet most users only use a few of them. So, start simple and progress into using the more advanced options.</p> <p>By default, the NVIDIA HPC Compilers generate code that is optimized for the type of processor on which compilation is performed, the compilation host. Before adding options to your command-line, review <a class="reference internal" href="#cmdln-options-help"><span class="std std-ref">Help with Command-line Options</span></a> and <a class="reference internal" href="#freq-used-options"><span class="std std-ref">Frequently-used Options</span></a>.</p> <section id="command-line-options-syntax"> <h3><span class="section-number">2.1.1. </span>Command-line Options Syntax<a class="headerlink" href="#command-line-options-syntax" title="Permalink to this headline"></a></h3> <p>On a command-line, options need to be preceded by a hyphen (-). If the compiler does not recognize an option, you get an unknown switch error. The error can be downgraded to a warning by adding the <code class="docutils literal notranslate"><span class="pre">-noswitcherror</span></code> option.</p> <p>This document uses the following notation when describing options:</p> <dl class="simple"> <dt>[item]</dt><dd><p>Square brackets indicate that the enclosed item is optional.</p> </dd> <dt>{item | item}</dt><dd><p>Braces indicate that you must select one and only one of the enclosed items. A vertical bar (|) separates the choices.</p> </dd> </dl> <dl class="simple"> <dt>…</dt><dd><p>Horizontal ellipses indicate that zero or more instances of the preceding item are valid.</p> </dd> </dl> </section> <section id="command-line-suboptions"> <h3><span class="section-number">2.1.2. 
</span>Command-line Suboptions<a class="headerlink" href="#command-line-suboptions" title="Permalink to this headline"></a></h3> <p>Some options accept several suboptions. You can specify these suboptions either by using the full option statement multiple times or by using a comma-separated list for the suboptions.</p> <p>The following two command lines are equivalent:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvfortran -Mvect=simd -Mvect=noaltcode </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvfortran -Mvect=simd,noaltcode </pre></div> </div> </section> <section id="command-line-conflicting-options"> <h3><span class="section-number">2.1.3. </span>Command-line Conflicting Options<a class="headerlink" href="#command-line-conflicting-options" title="Permalink to this headline"></a></h3> <p>Some options have an opposite or negated counterpart. For example, both <code class="docutils literal notranslate"><span class="pre">-Mvect</span></code> and <code class="docutils literal notranslate"><span class="pre">-Mnovect</span></code> are available. <code class="docutils literal notranslate"><span class="pre">-Mvect</span></code> enables vectorization and <code class="docutils literal notranslate"><span class="pre">-Mnovect</span></code> disables it. If you used both of these commands on a command line, they would conflict.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>When you use conflicting options on a command line, the last encountered option takes precedence over any previous one.</p> </div> <p>The conflicting options rule is important for a number of reasons.</p> <ul class="simple"> <li><p>Some options, such as <code class="docutils literal notranslate"><span class="pre">-fast</span></code>, include other options. 
Therefore, it is possible for you to be unaware that you have conflicting options.</p></li> <li><p>You can use this rule to create makefiles that apply specific flags to a set of files, as shown in the following example.</p></li> </ul> <p><strong>Example: Makefiles with Options</strong></p> <p>In this makefile fragment, CCFLAGS uses vectorization. CCNOVECTFLAGS uses the flags defined for CCFLAGS but disables vectorization.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>CCFLAGS=-c -Mvect=simd
CCNOVECTFLAGS=$(CCFLAGS) -Mnovect
</pre></div> </div> <span class="target" id="cmdln-options-help"></span></section> </section> <section id="help-with-command-line-options"> <h2><span class="section-number">2.2. </span>Help with Command-line Options<a class="headerlink" href="#help-with-command-line-options" title="Permalink to this headline"></a></h2> <p>If you are just getting started with the NVIDIA HPC Compilers, it is helpful to know which options are available, when to use them, and which options most users find effective.</p> <p><strong>Using -help</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">-help</span></code> option is useful because it provides information about all options supported by a given compiler.</p> <p>You can use <code class="docutils literal notranslate"><span class="pre">-help</span></code> in one of three ways:</p> <ul> <li><p>Use <code class="docutils literal notranslate"><span class="pre">-help</span></code> with no parameters to obtain a list of all the available options with a brief one-line description of each.</p></li> <li><p>Add a parameter to <code class="docutils literal notranslate"><span class="pre">-help</span></code> to restrict the output to information about a specific option. 
The syntax for this usage is:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>-help &lt;command line option&gt; </pre></div> </div> <p>Suppose you use the following command to restrict the output to information about the -fast option:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -help -fast </pre></div> </div> <p>The output you see is similar to:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>-fast Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre </pre></div> </div> <p>In the following example, we add the <code class="docutils literal notranslate"><span class="pre">-help</span></code> parameter to restrict the output to information about the help command. The usage information for <code class="docutils literal notranslate"><span class="pre">-help</span></code> shows how groups of options can be listed or examined according to function.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -help -help -help[=groups|asm|debug|language|linker|opt|other|overall|phase|prepro| suffix|switch|target|variable] </pre></div> </div> </li> <li><p>Add a parameter to <code class="docutils literal notranslate"><span class="pre">-help</span></code> to restrict the output to a specific set of options or to a building process. The syntax for this usage is this:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>-help=&lt;subgroup&gt; </pre></div> </div> </li> </ul> </section> <section id="getting-started-with-performance"> <h2><span class="section-number">2.3. 
</span>Getting Started with Performance<a class="headerlink" href="#getting-started-with-performance" title="Permalink to this headline"></a></h2> <p>This section provides a quick overview of a few of the command-line options that are useful in improving multicore CPU performance.</p> <section id="using-fast"> <h3><span class="section-number">2.3.1. </span>Using -fast<a class="headerlink" href="#using-fast" title="Permalink to this headline"></a></h3> <p>The NVIDIA HPC Compilers implement a wide range of options that give users a fine degree of control over each optimization phase. When it comes to optimizing code, the quickest way to start is to use the option <code class="docutils literal notranslate"><span class="pre">-fast</span></code>, which creates a generally optimal set of flags for 64-bit targets. It enables vectorization with vector streaming SIMD instructions, cache alignment, and flush-to-zero mode.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The contents of the <code class="docutils literal notranslate"><span class="pre">-fast</span></code> option are host-dependent. Further, you should use this option on both the compile and link command lines.</p> </div> <p>The following table shows the typical <code class="docutils literal notranslate"><span class="pre">-fast</span></code> options.</p> <table class="table-no-stripes docutils align-default" id="id15"> <caption><span class="caption-text">Table 4. 
Typical -fast Options</span><a class="headerlink" href="#id15" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 19%" /> <col style="width: 81%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Use this option…</p></th> <th class="head"><p>To do this…</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-O2</span></code></p></td> <td><p>Specifies a code optimization level of 2.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Munroll=c:1</span></code></p></td> <td><p>Unrolls loops, executing multiple instances of the original loop during each iteration.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mnoframe</span></code></p></td> <td><p>Does not generate code to set up a stack frame. <strong>Note:</strong> With this option, a stack trace does not work.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mlre</span></code></p></td> <td><p>Enables loop-carried redundancy elimination.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mpre</span></code></p></td> <td><p>Enables partial redundancy elimination.</p></td> </tr> </tbody> </table> <p>On most modern CPUs, the <code class="docutils literal notranslate"><span class="pre">-fast</span></code> option also includes the options shown in this table:</p> <table class="table-no-stripes docutils align-default" id="id16"> <caption><span class="caption-text">Table 5. 
Additional -fast Options</span><a class="headerlink" href="#id16" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 29%" /> <col style="width: 71%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Use this option…</p></th> <th class="head"><p>To do this…</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code></p></td> <td><p>Generates packed SIMD instructions.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mcache_align</span></code></p></td> <td><p>Aligns long objects on cache-line boundaries.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mflushz</span></code></p></td> <td><p>Sets flush-to-zero mode.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-M[no]vect</span></code></p></td> <td><p>Controls automatic vector pipelining.</p></td> </tr> </tbody> </table> <p>To see the specific behavior of <code class="docutils literal notranslate"><span class="pre">-fast</span></code> for your target, use the following command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -help -fast </pre></div> </div> </section> <section id="other-performance-related-options"> <h3><span class="section-number">2.3.2. </span>Other Performance-Related Options<a class="headerlink" href="#other-performance-related-options" title="Permalink to this headline"></a></h3> <p>While <code class="docutils literal notranslate"><span class="pre">-fast</span></code> is designed to be the quickest route to best performance, it is limited to routine boundaries. Depending on the nature and writing style of the source code, the compiler often can perform further optimization by knowing the global context of usage of a given routine. 
For instance, determining the possible value range of actual parameters of a routine could enable a loop to be vectorized; similarly, determining static occurrence of calls helps to decide which routines are beneficial to inline.</p> <p>These types of global optimizations are under the control of Interprocedural Analysis (IPA) in NVIDIA HPC Compilers. The option <code class="docutils literal notranslate"><span class="pre">-Mipa</span></code> enables Interprocedural Analysis. <code class="docutils literal notranslate"><span class="pre">-Mipa=fast</span></code> is the recommended option for the best performance from global optimization. You can also add the suboption <code class="docutils literal notranslate"><span class="pre">inline</span></code> to enable automatic global inlining across files. You might consider using <code class="docutils literal notranslate"><span class="pre">-Mipa=fast,inline</span></code>; this combination of interprocedural analysis and global inlining can further improve performance.</p> <p>For more information on optimization, refer to <a class="reference internal" href="#opt-parallel"><span class="std std-ref">Multicore CPU Optimization</span></a>. For specific information about these options, refer to the ‘Optimization Controls’ section of the <a class="reference external" href="../hpc-compilers-ref-guide/index.html">HPC Compilers Reference Guide</a>.</p> <span class="target" id="freq-used-options"></span></section> </section> <section id="frequently-used-options"> <h2><span class="section-number">2.4. </span>Frequently-used Options<a class="headerlink" href="#frequently-used-options" title="Permalink to this headline"></a></h2> <p>In addition to overall performance, there are a number of other options that many users find useful when getting started. The following table provides a brief summary of these options.</p> <table class="table-no-stripes docutils align-default" id="id17"> <caption><span class="caption-text">Table 6. 
Commonly Used Command-Line Options</span><a class="headerlink" href="#id17" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 8%" /> <col style="width: 92%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Use this option…</p></th> <th class="head"><p>To do this…</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-acc</span></code></p></td> <td><p>Enable parallelization using OpenACC directives. By default the compilers will parallelize and offload OpenACC regions to an NVIDIA GPU. Use <code class="docutils literal notranslate"><span class="pre">-acc=multicore</span></code> to parallelize OpenACC regions for execution on all the cores of a multicore CPU.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-fast</span></code></p></td> <td><p>This option creates a generally optimal set of flags for targets that support SIMD capability. It incorporates optimization options to enable use of vector streaming SIMD instructions, cache alignment and flushz.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-g</span></code></p></td> <td><p>Instructs the compiler to include symbolic debugging information in the object module; sets the optimization level to zero unless a <code class="docutils literal notranslate"><span class="pre">-O</span></code> option is present on the command line. 
Conversely, to prevent the generation of DWARF information, use the <code class="docutils literal notranslate"><span class="pre">-Mnodwarf</span></code> option.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-gopt</span></code></p></td> <td><p>Instructs the compiler to include symbolic debugging information in the object file, and to generate optimized code identical to that generated when <code class="docutils literal notranslate"><span class="pre">-g</span></code> is not specified.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu</span></code></p></td> <td><p>Control the type of GPU for which code is generated, the version of CUDA to be targeted, and several other aspects of GPU code generation.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-help</span></code></p></td> <td><p>Provides information about available options.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code></p></td> <td><p>Enables medium model code generation for 64-bit targets, which is useful when the data space of the program exceeds 4GB.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-mp</span></code></p></td> <td><p>Enable parallelization using OpenMP directives. By default the compilers will parallelize OpenMP regions for execution on all the cores of a multicore CPU. Use <code class="docutils literal notranslate"><span class="pre">-mp=gpu</span></code> to parallelize OpenMP regions for offload to an NVIDIA GPU.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mconcur</span></code></p></td> <td><p>Instructs the compiler to enable auto-concurrentization of loops. 
If specified, the compiler uses multiple CPU cores to execute loops that it determines to be parallelizable; thus, loop iterations are split to execute optimally in a multithreaded execution context.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Minfo</span></code></p></td> <td><p>Instructs the compiler to produce information on standard error.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Minline</span></code></p></td> <td><p>Enables function inlining.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mipa=fast,inline</span></code></p></td> <td><p>Enables interprocedural analysis and optimization. Also enables automatic procedure inlining.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mkeepasm</span></code></p></td> <td><p>Keeps the generated assembly files.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Munroll</span></code></p></td> <td><p>Invokes the loop unroller to unroll loops, executing multiple instances of the loop during each iteration. This also sets the optimization level to 2 if the level is set to less than 2, or if no -O or -g options are supplied.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-M[no]vect</span></code></p></td> <td><p>Enables [Disables] the code vectorizer.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">--[no_]exceptions</span></code></p></td> <td><p>Removes exception handling from user code. 
For C++, declares that the functions in this file generate no C++ exceptions, allowing more optimal code generation.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-o</span></code></p></td> <td><p>Names the output file.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-O</span> <span class="pre">&lt;level&gt;</span></code></p></td> <td><p>Specifies code optimization level where &lt;level&gt; is 0, 1, 2, 3, or 4.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-stdpar</span></code></p></td> <td><p>Enable parallelization and offloading of Standard C++ and Fortran parallel constructs to NVIDIA GPUs; default is -stdpar=gpu.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-tp</span> <span class="pre">&lt;target&gt;</span></code></p></td> <td><p>Specify a CPU target other than the compilation host CPU.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Wl,</span> <span class="pre">&lt;option&gt;</span></code></p></td> <td><p>Compiler driver passes the specified options to the linker.</p></td> </tr> </tbody> </table> </section> <section id="floating-point-subnormal"> <h2><span class="section-number">2.5. </span>Floating-point Subnormal<a class="headerlink" href="#floating-point-subnormal" title="Permalink to this headline"></a></h2> <p>Starting with the 22.7 release of the NVIDIA HPC SDK the default setting of how floating-point denormal (IEEE 754 terminology “subnormal”) values are processed at runtime across both x86_64 and aarch64 processors has been changed to be more consistent.</p> <p>Denormal values can be both operands to, and results of, floating-point operations. 
The x86_64 ISA differentiates between the two categories, operands and results, using the terminology “daz” (denormals are zero) for operands and “flushz” (flush to zero) for results. The Arm v8 ISA as defined can also differentiate between the two categories, but the processors currently supported by the NVIDIA HPC SDK have only a single setting, “fz” in the floating-point status and control register, which covers both operands and results.</p> <p>The NVIDIA HPC SDK C, C++, and Fortran compilers have the command line switches <code class="docutils literal notranslate"><span class="pre">-M[no]daz</span></code> and <code class="docutils literal notranslate"><span class="pre">-M[no]flushz</span></code>, which, when specified for the C/C++ main function or the Fortran main program, affect how denormals are handled by the processor at runtime. The values of these two command line switches are passed to the runtime library to configure the floating-point status and control register at program startup.</p> <p>The NVIDIA HPC SDK supports x86_64 processors from both Intel and AMD, and Arm v8.1 and later processors. The following table summarizes the default settings of the <code class="docutils literal notranslate"><span class="pre">-Mdaz</span></code> and <code class="docutils literal notranslate"><span class="pre">-Mflushz</span></code> command line switches before and after the 22.7 release.</p> <table class="table-no-stripes docutils align-default" id="id18"> <caption><span class="caption-text">Table 7. 
Default settings of <code class="docutils literal notranslate"><span class="pre">-Mdaz</span></code> and <code class="docutils literal notranslate"><span class="pre">-Mflushz</span></code></span><a class="headerlink" href="#id18" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 33%" /> <col style="width: 33%" /> <col style="width: 33%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"></th> <th class="head"><p>Pre 22.7 defaults</p></th> <th class="head"><p>22.7 defaults</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>Intel</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-Mdaz</span></code></p> <p><code class="docutils literal notranslate"><span class="pre">-Mnoflushz</span></code></p> </td> <td><p><code class="docutils literal notranslate"><span class="pre">-Mdaz</span></code></p> <p><code class="docutils literal notranslate"><span class="pre">-Mflushz</span></code></p> </td> </tr> <tr class="row-odd"><td><p>AMD</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-Mnodaz</span></code></p> <p><code class="docutils literal notranslate"><span class="pre">-Mnoflushz</span></code></p> </td> <td><p><code class="docutils literal notranslate"><span class="pre">-Mdaz</span></code></p> <p><code class="docutils literal notranslate"><span class="pre">-Mflushz</span></code></p> </td> </tr> <tr class="row-even"><td><p>Arm processors</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-Mnodaz</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-Mdaz</span></code></p></td> </tr> </tbody> </table> <p>With the NVIDIA HPC SDK 22.7 release, the default handling of denormal operands and results is to treat them as zero, as if the main function/program were compiled with <code class="docutils literal notranslate"><span class="pre">-Mdaz</span> <span class="pre">-Mflushz</span></code>. 
Consequently, these changes can potentially affect applications that are dependent on subnormal values being non-zero.</p> <p>Along with the change to the default treatment of denormal values, users now have the ability to configure the floating-point status and control register through the <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_FPU_STATE</span></code> environment variable — effectively overriding how the program was originally compiled. For further information, see the description of the <a class="reference internal" href="#env-vars-nv-fpu-state"><span class="std std-ref">NVCOMPILER_FPU_STATE</span></a> environment variable.</p> <span class="target" id="opt-parallel"></span></section> </section> <section id="multicore-cpu-optimization"> <h1><span class="section-number">3. </span>Multicore CPU Optimization<a class="headerlink" href="#multicore-cpu-optimization" title="Permalink to this headline"></a></h1> <p>Source code that is readable, maintainable, and produces correct results is not always organized for efficient execution. Normally, the first step in the program development process involves producing code that executes and produces the correct results. This first step usually involves compiling without much worry about optimization. After code is compiled and debugged, code optimization and parallelization become an issue.</p> <p>Invoking one of the NVIDIA HPC Compiler commands with certain options instructs the compiler to generate optimized code. Optimization is not always performed since it increases compilation time and may make debugging difficult. However, optimization produces more efficient code that usually runs significantly faster than code that is not optimized.</p> <p>The compilers optimize code according to the specified optimization level. 
You can use a number of options to specify the optimization levels, including <code class="docutils literal notranslate"><span class="pre">-⁠O</span></code>, <code class="docutils literal notranslate"><span class="pre">-⁠Mvect</span></code>, <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> and <code class="docutils literal notranslate"><span class="pre">-⁠Mconcur</span></code>. In addition, you can use several of the <code class="docutils literal notranslate"><span class="pre">-⁠M&lt;nvflag&gt;</span></code> switches to control specific types of optimization.</p> <p>This chapter describes the overall effect of the optimization options supported by the NVIDIA HPC Compilers, and basic usage of several options.</p> <section id="overview-of-optimization"> <h2><span class="section-number">3.1. </span>Overview of Optimization<a class="headerlink" href="#overview-of-optimization" title="Permalink to this headline"></a></h2> <p>In general, optimization involves using transformations and replacements that generate more efficient code. 
This is done by the compiler and involves replacements that are independent of the particular target processor’s architecture as well as replacements that take advantage of the architecture, instruction set and registers.</p> <p>For discussion purposes, we categorize optimization:</p> <ul class="simple"> <li><p><a class="reference internal" href="#opt-local"><span class="std std-ref">Local Optimization</span></a></p></li> <li><p><a class="reference internal" href="#opt-global"><span class="std std-ref">Global Optimization</span></a></p></li> <li><p><a class="reference internal" href="#opt-loop"><span class="std std-ref">Loop Optimization</span></a></p></li> <li><p><a class="reference internal" href="#opt-ipa"><span class="std std-ref">Interprocedural Analysis (IPA) and Optimization</span></a></p></li> <li><p><a class="reference internal" href="#opt-func-inline"><span class="std std-ref">Optimization Through Function Inlining</span></a></p></li> </ul> <section id="local-optimization"> <h3><span class="section-number">3.1.1. </span>Local Optimization<a class="headerlink" href="#local-optimization" title="Permalink to this headline"></a></h3> <p>A basic block is a sequence of statements in which the flow of control enters at the beginning and leaves at the end without the possibility of branching, except at the end. Local optimization is performed on a block-by-block basis within a program’s basic blocks.</p> <p>The NVIDIA HPC Compilers perform many types of local optimization including: algebraic identity removal, constant folding, common sub-expression elimination, redundant load and store elimination, scheduling, strength reduction, and peephole optimizations.</p> <span class="target" id="opt-global"></span></section> <section id="global-optimization"> <h3><span class="section-number">3.1.2. 
</span>Global Optimization<a class="headerlink" href="#global-optimization" title="Permalink to this headline"></a></h3> <p>This optimization is performed on a subprogram/function over all its basic blocks. The optimizer performs control-flow and data-flow analysis for an entire program unit. All loops, including those formed by ad hoc branches such as IFs or GOTOs, are detected and optimized.</p> <p>Global optimization includes: constant propagation, copy propagation, dead store elimination, global register allocation, invariant code motion, and induction variable elimination.</p> <span class="target" id="opt-loop"></span></section> <section id="loop-optimization-unrolling-vectorization-and-parallelization"> <h3><span class="section-number">3.1.3. </span>Loop Optimization: Unrolling, Vectorization and Parallelization<a class="headerlink" href="#loop-optimization-unrolling-vectorization-and-parallelization" title="Permalink to this headline"></a></h3> <p>The performance of certain classes of loops may be improved through vectorization or unrolling options. Vectorization transforms loops to improve memory access performance and make use of packed SSE vector instructions, which perform the same operation on multiple data items concurrently. Unrolling replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization, vectorization and scheduling of instructions. Performance for loops on systems with multiple processors may also improve using the parallelization features of the NVIDIA HPC Compilers.</p> <span class="target" id="opt-ipa"></span></section> <section id="interprocedural-analysis-ipa-and-optimization"> <h3><span class="section-number">3.1.4. 
</span>Interprocedural Analysis (IPA) and Optimization<a class="headerlink" href="#interprocedural-analysis-ipa-and-optimization" title="Permalink to this headline"></a></h3> <p>Interprocedural analysis (IPA) allows use of information across function call boundaries to perform optimizations that would otherwise be unavailable. For example, if the actual argument to a function is in fact a constant in the caller, it may be possible to propagate that constant into the callee and perform optimizations that are not valid if the dummy argument is treated as a variable. A wide range of optimizations are enabled or improved by using IPA, including but not limited to data alignment optimizations, argument removal, constant propagation, pointer disambiguation, pure function detection, F90/F95 array shape propagation, data placement, empty function removal, automatic function inlining, inlining of functions from pre-compiled libraries, and interprocedural optimization of functions from pre-compiled libraries.</p> <span class="target" id="opt-func-inline"></span></section> <section id="function-inlining"> <h3><span class="section-number">3.1.5. </span>Function Inlining<a class="headerlink" href="#function-inlining" title="Permalink to this headline"></a></h3> <p>This optimization allows a call to a function to be replaced by a copy of the body of that function. This optimization will sometimes speed up execution by eliminating the function call and return overhead. Function inlining may also create opportunities for other types of optimization. Function inlining is not always beneficial. When used improperly it may increase code size and generate less efficient code.</p> </section> </section> <section id="getting-started-with-optimization"> <h2><span class="section-number">3.2. 
</span>Getting Started with Optimization<a class="headerlink" href="#getting-started-with-optimization" title="Permalink to this headline"></a></h2> <p>The first concern should be getting the program to execute and produce correct results. To get the program running, start by compiling and linking without optimization. Add -O0 to the compile line to select no optimization; or add -g to debug the program easily and isolate any coding errors exposed during porting.</p> <p>To get started quickly with optimization, a good set of options to use with any of the NVIDIA HPC compilers is -fast. For example:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -fast -Mipa=fast,inline prog.f </pre></div> </div> <p>For all of the NVIDIA HPC Fortran, C++ and C compilers, the <code class="docutils literal notranslate"><span class="pre">-⁠fast</span> <span class="pre">-⁠Mipa=fast,inline</span></code> options generally produce code that is well-optimized without the possibility of significant slowdowns due to pathological cases.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code> option is an aggregate option that includes a number of individual NVIDIA compiler options; which compiler options are included depends on the target for which compilation is performed.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">-⁠Mipa=fast,inline</span></code> option invokes interprocedural analysis (IPA), including several IPA suboptions. The inline suboption enables automatic inlining with IPA. 
If you do not wish to use automatic inlining, you can compile with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa=fast</span></code> and use several IPA suboptions without inlining.</p></li> </ul> <p>These aggregate options incorporate a generally optimal set of flags for targets that support SIMD capability, including vectorization with SIMD instructions, cache alignment, and flushz.</p> <p>The following table shows the typical <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code> options.</p> <table class="table-no-stripes docutils align-default" id="id19"> <caption><span class="caption-text">Table 8. Typical <code class="docutils literal notranslate"><span class="pre">-fast</span></code> Options</span><a class="headerlink" href="#id19" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 15%" /> <col style="width: 85%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Use this option…</p></th> <th class="head"><p>To do this…</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-O2</span></code></p></td> <td><p>Specifies a code optimization level of 2 and -Mvect=SIMD.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Munroll=c:1</span></code></p></td> <td><p>Unrolls loops, executing multiple instances of the original loop during each iteration.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mnoframe</span></code></p></td> <td><p>Indicates to not generate code to set up a stack frame. 
<strong>Note</strong> With this option, a stack trace does not work.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mlre</span></code></p></td> <td><p>Indicates loop-carried redundancy elimination.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mautoinline</span></code></p></td> <td><p>Enables automatic function inlining in C &amp; C++.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mpre</span></code></p></td> <td><p>Indicates partial redundancy elimination.</p></td> </tr> </tbody> </table> <p>On modern multicore CPUs, the <code class="docutils literal notranslate"><span class="pre">-fast</span></code> option also typically includes the options shown in the following table:</p> <table class="table-no-stripes docutils align-default" id="id20"> <caption><span class="caption-text">Table 9. Additional <code class="docutils literal notranslate"><span class="pre">-fast</span></code> Options</span><a class="headerlink" href="#id20" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 29%" /> <col style="width: 71%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Use this option…</p></th> <th class="head"><p>To do this…</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code></p></td> <td><p>Generates packed SSE and AVX instructions.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mcache_align</span></code></p></td> <td><p>Aligns long objects on cache-line boundaries.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mflushz</span></code></p></td> <td><p>Sets flush-to-zero mode.</p></td> </tr> </tbody> </table> <p>By experimenting with individual compiler options on a 
file-by-file basis, further significant performance gains can sometimes be realized. However, depending on the coding style, individual optimizations can sometimes cause slowdowns, and must be used carefully to ensure performance improvements.</p> <p>There are other useful command line options related to optimization and parallelization, such as <a class="reference internal" href="#opt-gs-help"><span class="std std-ref">-help</span></a>, <a class="reference internal" href="#opt-gs-minfo"><span class="std std-ref">-Minfo</span></a>, <a class="reference internal" href="#opt-gs-mneginfo"><span class="std std-ref">-Mneginfo</span></a>, <a class="reference internal" href="#opt-gs-dryrun"><span class="std std-ref">-dryrun</span></a>, and <a class="reference internal" href="#opt-gs-v"><span class="std std-ref">-v</span></a>.</p> <span class="target" id="opt-gs-help"></span><section id="help"> <h3><span class="section-number">3.2.1. </span>-help<a class="headerlink" href="#help" title="Permalink to this headline"></a></h3> <p>As described in <a class="reference internal" href="#cmdln-options-help"><span class="std std-ref">Help with Command-Line Options</span></a>, you can see a specification of any command-line option by invoking any of the NVIDIA HPC Compilers with <code class="docutils literal notranslate"><span class="pre">-help</span></code> in combination with the option in question, without specifying any input files.</p> <p>For example, you might want information on <code class="docutils literal notranslate"><span class="pre">-O</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -help -O </pre></div> </div> <p>The resulting output is similar to this:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>-O Set opt level. 
All -O1 optimizations plus traditional scheduling and global scalar optimizations performed </pre></div> </div> <p>Or you can see the full functionality of <code class="docutils literal notranslate"><span class="pre">-help</span></code> itself, which can return information on either an individual option or groups of options:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -help -help </pre></div> </div> <p>The resulting output is similar to this:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>-help[=groups|asm|debug|language|linker|opt|other|overall| phase|prepro|suffix|switch|target|variable] Show compiler switches </pre></div> </div> <span class="target" id="opt-gs-minfo"></span></section> <section id="minfo"> <h3><span class="section-number">3.2.2. </span>-Minfo<a class="headerlink" href="#minfo" title="Permalink to this headline"></a></h3> <p>You can use the <code class="docutils literal notranslate"><span class="pre">-Minfo</span></code> option to display compile-time optimization listings. When this option is used, the NVIDIA HPC Compilers issue informational messages to standard error (stderr) as compilation proceeds. From these messages, you can determine which loops are optimized using unrolling, SIMD vectorization, parallelization, GPU offloading, interprocedural optimizations and various miscellaneous optimizations. You can also see where and whether functions are inlined.</p> <span class="target" id="opt-gs-mneginfo"></span></section> <section id="mneginfo"> <h3><span class="section-number">3.2.3. 
</span>-Mneginfo<a class="headerlink" href="#mneginfo" title="Permalink to this headline"></a></h3> <p>You can use the <code class="docutils literal notranslate"><span class="pre">-Mneginfo</span></code> option to display informational messages to standard error (stderr) that explain why certain optimizations are inhibited.</p> <span class="target" id="opt-gs-dryrun"></span></section> <section id="dryrun"> <h3><span class="section-number">3.2.4. </span>-dryrun<a class="headerlink" href="#dryrun" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">-⁠dryrun</span></code> option can be useful as a diagnostic tool if you need to see the steps used by the compiler driver to preprocess, compile, assemble and link in the presence of a given set of command line inputs. When you specify the <code class="docutils literal notranslate"><span class="pre">-⁠dryrun</span></code> option, these steps are printed to standard error (stderr) but are not actually performed. For example, you can use this option to inspect the default and user-specified libraries that are searched during the link phase, and the order in which they are searched by the linker.</p> <span class="target" id="opt-gs-v"></span></section> <section id="v"> <h3><span class="section-number">3.2.5. </span>-v<a class="headerlink" href="#v" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">-v</span></code> option is similar to <a class="reference internal" href="#opt-gs-dryrun"><span class="std std-ref">-dryrun</span></a>, except each compilation step is performed and not simply printed.</p> <span class="target" id="opt-local"></span></section> </section> <section id="local-and-global-optimization"> <h2><span class="section-number">3.3. 
</span>Local and Global Optimization<a class="headerlink" href="#local-and-global-optimization" title="Permalink to this headline"></a></h2> <p>This section describes local and global optimization.</p> <section id="msafeptr"> <h3><span class="section-number">3.3.1. </span>-Msafeptr<a class="headerlink" href="#msafeptr" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">-⁠Msafeptr</span></code> option can significantly improve performance of C++ and C programs in which there is known to be no pointer aliasing. For obvious reasons, this command-line option must be used carefully. There are a number of suboptions for <code class="docutils literal notranslate"><span class="pre">-⁠Msafeptr</span></code>:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">-Msafeptr=all</span></code> – All pointers are safe. Equivalent to the default setting: <code class="docutils literal notranslate"><span class="pre">-⁠Msafeptr</span></code>.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Msafeptr=arg</span></code> – Function formal argument pointers are safe. Equivalent to <code class="docutils literal notranslate"><span class="pre">-⁠Msafeptr=dummy</span></code>.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Msafeptr=global</span></code> – Global pointers are safe.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Msafeptr=local</span></code> – Local pointers are safe. 
Equivalent to <code class="docutils literal notranslate"><span class="pre">-⁠Msafeptr=auto</span></code>.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Msafeptr=static</span></code> – Static local pointers are safe.</p></li> </ul> <p>If your C++ or C program has pointer aliasing and you also want automatic inlining, then compiling with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa=fast</span></code> or <code class="docutils literal notranslate"><span class="pre">-⁠Mipa=fast,inline</span></code> includes pointer aliasing optimizations. IPA may be able to optimize some of the alias references in your program and leave intact those that cannot be safely optimized.</p> </section> <section id="o"> <h3><span class="section-number">3.3.2. </span>-O<a class="headerlink" href="#o" title="Permalink to this headline"></a></h3> <p>Using the NVIDIA HPC Compiler commands with the <code class="docutils literal notranslate"><span class="pre">-O</span></code>&lt;level&gt; option (the capital O is for Optimize), you can specify any integer level from 0 to 4.</p> <p><strong>-O0</strong></p> <p>Level zero specifies no optimization. A basic block is generated for each language statement.</p> <p>Performance will almost always be slowest using this optimization level. This level is useful for the initial execution of a program. It is also useful for debugging, since there is a direct correlation between the program text and the code generated. To enable debugging, include <code class="docutils literal notranslate"><span class="pre">-g</span></code> on your compile line.</p> <p><strong>-O1</strong></p> <p>Level one specifies local optimization. Scheduling of basic blocks is performed. 
Register allocation is performed.</p> <p>Local optimization is a good choice when the code is very irregular, such as code that contains many short statements containing IF statements and does not contain loops (DO or DO WHILE statements). Although this case rarely occurs, for certain types of code this optimization level may perform better than level-two (-O2).</p> <p><strong>-O</strong></p> <p>When no level is specified, level-two global optimizations are performed, including traditional scalar optimizations, induction recognition, and loop invariant motion. No SIMD vectorization is enabled.</p> <p><strong>-O2</strong></p> <p>Level two specifies global optimization. This level performs all level-one local optimizations as well as the level-two global optimizations described in <code class="docutils literal notranslate"><span class="pre">-O</span></code>. In addition, more advanced optimizations such as SIMD code generation, cache alignment, and partial redundancy elimination are enabled.</p> <p><strong>-O3</strong></p> <p>Level three specifies aggressive global optimization.
This level performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.</p> <p><strong>-O4</strong></p> <p>Level four performs all level-one, level-two, and level-three optimizations and enables hoisting of guarded invariant floating point expressions.</p> <p><strong>Types of Optimizations</strong></p> <p>The NVIDIA HPC Compilers perform many different types of local optimizations, including but not limited to:</p> <ul class="simple"> <li><p>Algebraic identity removal</p></li> <li><p>Constant folding</p></li> <li><p>Common subexpression elimination</p></li> <li><p>Local register optimization</p></li> <li><p>Peephole optimizations</p></li> <li><p>Redundant load and store elimination</p></li> <li><p>Strength reductions</p></li> </ul> <p>Level-two optimization (<code class="docutils literal notranslate"><span class="pre">-⁠O2</span></code> or <code class="docutils literal notranslate"><span class="pre">-⁠O</span></code>) specifies global optimization. The <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code> option generally specifies global optimization; however, the <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code> switch varies from release to release, depending on a reasonable selection of switches for any one particular release. The <code class="docutils literal notranslate"><span class="pre">-⁠O</span></code> or <code class="docutils literal notranslate"><span class="pre">-⁠O2</span></code> level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. 
This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.</p> <p>The NVIDIA HPC Compilers perform many different types of global optimizations, including but not limited to:</p> <ul class="simple"> <li><p>Branch to branch elimination</p></li> <li><p>Constant propagation</p></li> <li><p>Copy propagation</p></li> <li><p>Dead store elimination</p></li> <li><p>Global register allocation</p></li> <li><p>Induction variable elimination</p></li> <li><p>Invariant code motion</p></li> </ul> <p>You can explicitly select the optimization level on the command line. For example, the following command line specifies level-two optimization which results in global optimization:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -O2 prog.f </pre></div> </div> <p>The default optimization level changes depending on which options you select on the command line. For example, when you select the <code class="docutils literal notranslate"><span class="pre">-⁠g</span></code> debugging option, the default optimization level is set to level-zero (<code class="docutils literal notranslate"><span class="pre">-⁠O0</span></code>). However, if you need to debug optimized code, you can use the <code class="docutils literal notranslate"><span class="pre">-⁠gopt</span></code> option to generate debug information without perturbing optimization. For a description of the default levels, refer to Default Optimization Levels.</p> <p>The <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code> option includes <code class="docutils literal notranslate"><span class="pre">-⁠O2</span></code> on all targets. 
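</p> <p>As a small, hypothetical illustration of one of the global optimizations listed above, invariant code motion, consider the following C sketch (the function and variable names are invented; this is not code from the compiler distribution). The product <code class="docutils literal notranslate"><span class="pre">scale&nbsp;*&nbsp;offset</span></code> does not change across iterations, so at <code class="docutils literal notranslate"><span class="pre">-O2</span></code> the compiler can compute it once outside the loop while the source stays readable.</p>

```c
#include <stddef.h>

/* Hypothetical example of a loop-invariant expression: at -O2 the
 * compiler can hoist scale * offset out of the loop (invariant code
 * motion) so it is computed once instead of n times.  The source is
 * written naturally; the optimizer does the hoisting. */
void apply(float *out, const float *in, int n, float scale, float offset)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] + scale * offset;  /* scale * offset is invariant */
}
```

<p>Compiling a file like this with -Minfo is the natural way to see what the optimizer reports; the exact messages vary by release.</p> <p>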
If you want to override the default for <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code> with <code class="docutils literal notranslate"><span class="pre">-⁠O3</span></code> while maintaining all other elements of <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code>, simply compile as follows:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -fast -O3 prog.f </pre></div> </div> </section> </section> <section id="loop-unrolling-using-munroll"> <h2><span class="section-number">3.4. </span>Loop Unrolling using -Munroll<a class="headerlink" href="#loop-unrolling-using-munroll" title="Permalink to this headline"></a></h2> <p>This optimization unrolls loops, which reduces branch overhead, and can improve execution speed by creating better opportunities for instruction scheduling. A loop with a constant count may be completely unrolled or partially unrolled. A loop with a non-constant count may also be unrolled. A candidate loop must be an innermost loop containing one to four blocks of code.</p> <p>The following example shows the use of the <code class="docutils literal notranslate"><span class="pre">-Munroll</span></code> option:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -Munroll prog.f </pre></div> </div> <p>The <code class="docutils literal notranslate"><span class="pre">-Munroll</span></code> option is included as part of <code class="docutils literal notranslate"><span class="pre">-fast</span></code> on all targets. The loop unroller expands the contents of a loop and reduces the number of times a loop is executed. Branching overhead is reduced when a loop is unrolled two or more times, since each iteration of the unrolled loop corresponds to two or more iterations of the original loop; the number of branch instructions executed is proportionately reduced. 
When a loop is unrolled completely, the loop’s branch overhead is eliminated altogether.</p> <p>Loop unrolling may be beneficial for the instruction scheduler. When a loop is completely unrolled or unrolled two or more times, opportunities for improved scheduling may be presented. The code generator can take advantage of more possibilities for instruction grouping or filling instruction delays found within the loop.</p> <p><strong>Examples Showing Effect of Unrolling</strong></p> <p>The following side-by-side examples show the effect of code unrolling on a segment that computes a dot product.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>This example is only meant to represent how the compiler can transform the loop; it is not meant to imply that the programmer needs to manually change code. In fact, manually unrolling your code can sometimes inhibit the compiler’s analysis and optimization.</p> </div> <table class="table-no-stripes docutils align-default" id="id21"> <caption><span class="caption-text">Table 10. 
Example of Effect of Code Unrolling</span><a class="headerlink" href="#id21" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 50%" /> <col style="width: 50%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Dot Product Code</p></th> <th class="head"><p>Unrolled Dot Product Code</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="kt">REAL</span><span class="o">*</span><span class="mi">4</span><span class="w"> </span><span class="n">A</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w"> </span><span class="n">Z</span><span class="w"></span> <span class="w"> </span><span class="kt">INTEGER </span><span class="n">I</span><span class="w"></span> <span class="w"> </span><span class="k">DO </span><span class="n">I</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="w"></span> <span class="w"> </span><span class="n">Z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Z</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">A</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">END DO</span> <span class="k">END</span><span class="w"></span> </pre></div> </div> </td> <td><div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="kt">REAL</span><span class="o">*</span><span class="mi">4</span><span 
class="w"> </span><span class="n">A</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w"> </span><span class="n">Z</span><span class="w"></span> <span class="w"> </span><span class="kt">INTEGER </span><span class="n">I</span><span class="w"></span> <span class="w"> </span><span class="k">DO </span><span class="n">I</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="w"></span> <span class="w"> </span><span class="n">Z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Z</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">A</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">Z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Z</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">A</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">END DO</span> <span class="k">END</span><span class="w"></span> </pre></div> </div> </td> </tr> </tbody> </table> <p>Using the -Minfo 
option, the compiler informs you when a loop is being unrolled. For example, a message similar to the following, indicating the line number and the number of times the code is unrolled, displays when a loop is unrolled:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>dot:
     5, Loop unrolled 5 times </pre></div> </div> <p>Using the c:&lt;m&gt; and n:&lt;m&gt; sub-options to <code class="docutils literal notranslate"><span class="pre">-⁠Munroll</span></code>, or using <code class="docutils literal notranslate"><span class="pre">-⁠Mnounroll</span></code>, you can control whether and how loops are unrolled on a file-by-file basis. For more information on <code class="docutils literal notranslate"><span class="pre">-⁠Munroll</span></code>, refer to <a class="reference internal" href="#cmdln-options-use"><span class="std std-ref">Use Command-line Options</span></a>.</p> </section> <section id="vectorization-using-mvect"> <h2><span class="section-number">3.5. </span>Vectorization using -Mvect<a class="headerlink" href="#vectorization-using-mvect" title="Permalink to this headline"></a></h2> <p>The <code class="docutils literal notranslate"><span class="pre">-⁠Mvect</span></code> option is included as part of <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code> on all multicore CPU targets. If your program contains computationally intensive loops, the <code class="docutils literal notranslate"><span class="pre">-⁠Mvect</span></code> option may be helpful.
If in addition you specify <code class="docutils literal notranslate"><span class="pre">-⁠Minfo</span></code>, and your code contains loops that can be vectorized, the compiler reports relevant information on the optimizations applied.</p> <p>When an NVIDIA HPC Compiler command is invoked with the <code class="docutils literal notranslate"><span class="pre">-⁠Mvect</span></code> option, the vectorizer scans code searching for loops that are candidates for high-⁠level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). When the vectorizer finds vectorization opportunities, it internally rearranges or replaces sections of loops (the vectorizer changes the code generated; your source code’s loops are not altered). In addition to performing these loop transformations, the vectorizer produces extensive data dependence information for use by other phases of compilation and detects opportunities to use vector or packed SIMD instructions on processors where these are supported.</p> <p>The <code class="docutils literal notranslate"><span class="pre">-⁠Mvect</span></code> option can speed up code which contains well-behaved countable loops which operate on large floating point arrays in Fortran and their C++ and C counterparts. However, it is possible that some codes will show a decrease in performance when compiled with the <code class="docutils literal notranslate"><span class="pre">-⁠Mvect</span></code> option due to the generation of conditionally executed code segments, inability to determine data alignment, and other code generation factors. For this reason, it is recommended that you check carefully whether particular program units or loops show improved performance when compiled with this option enabled.</p> <section id="vectorization-sub-options"> <h3><span class="section-number">3.5.1. 
</span>Vectorization Sub-options<a class="headerlink" href="#vectorization-sub-options" title="Permalink to this headline"></a></h3> <p>The vectorizer performs high-level loop transformations on countable loops. A loop is countable if the number of iterations is set only before loop execution and cannot be modified during loop execution. Some of the vectorizer transformations can be controlled by arguments to the <code class="docutils literal notranslate"><span class="pre">-Mvect</span></code> command line option. The following sections describe the arguments that affect the operation of the vectorizer. In addition, some of these vectorizer operations can be controlled from within code using directives and pragmas.</p> <p>The vectorizer performs the following operations:</p> <ul class="simple"> <li><p>Loop interchange</p></li> <li><p>Loop splitting</p></li> <li><p>Loop fusion</p></li> <li><p>Generation of SIMD instructions on CPUs where these are supported</p></li> <li><p>Generation of prefetch instructions on processors where these are supported</p></li> <li><p>Loop iteration peeling to maximize vector alignment</p></li> <li><p>Alternate code generation</p></li> </ul> <p>The following table lists and briefly describes some of the <code class="docutils literal notranslate"><span class="pre">-Mvect</span></code> suboptions.</p> <table class="table-no-stripes docutils align-default" id="id22"> <caption><span class="caption-text">Table 11. 
-Mvect Suboptions</span><a class="headerlink" href="#id22" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 9%" /> <col style="width: 91%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Use this option …</p></th> <th class="head"><p>To instruct the vectorizer to do this …</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=altcode</span></code></p></td> <td><p>Generate appropriate code for vectorized loops.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=[no]assoc</span></code></p></td> <td><p>Perform [disable] associativity conversions that can change the results of a computation due to round-off error. For example, a typical optimization is to change one arithmetic operation to another arithmetic operation that is mathematically equivalent but computationally different and faster. This option is provided to enable or disable this transformation, since the round-off error for such associativity conversions may produce unacceptable results.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=fuse</span></code></p></td> <td><p>Enable loop fusion.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=gather</span></code></p></td> <td><p>Enable vectorization of indirect array references.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=idiom</span></code></p></td> <td><p>Enable idiom recognition.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=levels:&lt;n&gt;</span></code></p></td> <td><p>Set the maximum nest level of loops to optimize.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=nocond</span></code></p></td> <td><p>Disable vectorization of loops with conditions.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=partial</span></code></p></td> <td><p>Enable partial loop vectorization via inner loop distribution.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=prefetch</span></code></p></td> <td><p>Automatically generate prefetch instructions when vectorizable loops are encountered, even in cases where SIMD instructions are not generated.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=short</span></code></p></td> <td><p>Enable short vector operations.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code></p></td> <td><p>Automatically generate packed SSE (Streaming SIMD Extensions) and prefetch instructions when vectorizable loops are encountered. SSE instructions, first introduced on Pentium III and AthlonXP processors, operate on single-precision floating-point data.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=sizelimit:n</span></code></p></td> <td><p>Limit the size of vectorized loops.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=sse</span></code></p></td> <td><p>Equivalent to <code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code>.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-Mvect=uniform</span></code></p></td> <td><p>Perform consistent optimizations in both vectorized and residual loops. Be aware that this may affect the performance of the residual loop.</p></td> </tr> </tbody> </table> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Inserting <code class="docutils literal notranslate"><span class="pre">no</span></code> in front of an option disables the option. For example, to disable the generation of SIMD instructions, compile with <code class="docutils literal notranslate"><span class="pre">-Mvect=nosimd</span></code>.</p> </div> </section> <section id="vectorization-example-using-simd-instructions"> <h3><span class="section-number">3.5.2. </span>Vectorization Example Using SIMD Instructions<a class="headerlink" href="#vectorization-example-using-simd-instructions" title="Permalink to this headline"></a></h3> <p>One of the most important vectorization options is <code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code>. When you use this option, the compiler automatically generates SIMD vector instructions, where possible, when targeting processors on which these instructions are supported. This process can improve performance by several factors compared with the equivalent scalar code.
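</p> <p>The kind of well-behaved, countable loop this option targets can be sketched in C as well. The following is an illustrative sketch (not code from the guide) that mirrors the Fortran <code class="docutils literal notranslate"><span class="pre">loop</span></code> subroutine used in the example later in this section; <code class="docutils literal notranslate"><span class="pre">restrict</span></code> asserts the arrays do not alias, which helps SIMD code generation.</p>

```c
/* Illustrative C counterpart of a vectorizable Fortran loop: a countable,
 * unit-stride loop over float arrays.  With -Mvect=simd (part of -fast)
 * the compiler can emit packed SIMD instructions for a loop of this form;
 * the source itself is unchanged by the vectorizer. */
void loop(float *restrict a, const float *restrict b,
          const float *restrict c, const float *restrict d,
          float s, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i] - s * d[i];
}
```

<p>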
All of the NVIDIA HPC Fortran, C++ and C compilers support this capability.</p> <p>In the program in <a class="reference internal" href="#vect-exam-simd-vect-use-simd-exam"><span class="std std-ref">Vector operation using SIMD instructions</span></a>, the vectorizer recognizes the vector operation in subroutine ‘loop’ when either compiler switch <code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code> or <code class="docutils literal notranslate"><span class="pre">-fast</span></code> is used. This example shows the compilation, informational messages, and runtime results using SIMD instructions on an Intel Core i7 7800X Skylake system, along with issues that affect SIMD performance.</p> <p>Loops vectorized using SIMD instructions operate much more efficiently when processing vectors that are aligned to a cache-line boundary. You can cause unconstrained data objects of size 16 bytes or greater to be cache-aligned by compiling with the <code class="docutils literal notranslate"><span class="pre">-Mcache_align</span></code> switch. An unconstrained data object is a data object that is not a common block member and not a member of an aggregate data structure.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>For stack-based local variables to be properly aligned, the main program or function must be compiled with -Mcache_align.</p> </div> <p>The <code class="docutils literal notranslate"><span class="pre">-Mcache_align</span></code> switch has no effect on the alignment of Fortran allocatable or automatic arrays. If you have arrays that are constrained, such as vectors that are members of Fortran common blocks, you must specifically pad your data structures to ensure proper cache alignment. 
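</p> <p>For dynamically allocated data in C, which no compile-time switch can align at runtime, one portable approach is to request the alignment yourself. A minimal sketch, assuming a C11 libc that provides <code class="docutils literal notranslate"><span class="pre">aligned_alloc</span></code> (the helper name is ours):</p>

```c
#include <stdlib.h>

/* Sketch: allocate a float array whose starting address falls on a
 * 64-byte boundary, a typical cache-line size.  C11 aligned_alloc
 * requires the size to be a multiple of the alignment, so round the
 * byte count up first. */
float *alloc_cache_aligned(size_t nfloats)
{
    size_t bytes = nfloats * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63;  /* round up to 64 */
    return (float *)aligned_alloc(64, rounded);
}
```

<p>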
Compiling with <code class="docutils literal notranslate"><span class="pre">-Mcache_align</span></code> ensures only that the beginning address of each common block is cache-aligned.</p> <p>The following examples show the results of compiling the sample code in <a class="reference internal" href="#vect-exam-simd-vect-use-simd-exam"><span class="std std-ref">Vector operation using SIMD instructions</span></a> both with and without the option <code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code>.</p> <p class="title sectiontitle rubric" id="vect-exam-simd-vect-use-simd-exam">Vector operation using SIMD instructions</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>program vector_op
  parameter (N = 9999)
  real*4 x(N), y(N), z(N), W(N)
  do i = 1, n
    y(i) = i
    z(i) = 2*i
    w(i) = 4*i
  enddo
  do j = 1, 200000
    call loop(x,y,z,w,1.0e0,N)
  enddo
  print *, x(1),x(771),x(3618),x(6498),x(9999)
end </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>subroutine loop(a,b,c,d,s,n)
  integer i, n
  real*4 a(n), b(n), c(n), d(n),s
  do i = 1, n
    a(i) = b(i) + c(i) - s * d(i)
  enddo
end </pre></div> </div> <p>Assume the preceding program is compiled as follows, where <code class="docutils literal notranslate"><span class="pre">-Mvect=nosimd</span></code> disables SIMD vectorization:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -fast -Mvect=nosimd -Minfo vadd.f -Mfree -o vadd
vector_op:
      4, Loop unrolled 16 times
         Generated 1 prefetches in scalar loop
      9, Loop not vectorized/parallelized: contains call
loop:
     18, Loop unrolled 8 times
         FMA (fused multiply-add) instruction(s) generated </pre></div> </div> <p>The following output shows a sample result if the generated executable is run and timed on an Intel Core i7 7800X Skylake system:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ /bin/time vadd
-1.000000
-771.0000 -3618.000 -6498.000 -9999.000
0.99user 0.01system 0:01.18elapsed 84%CPU (0avgtext+0avgdata 3120maxresident)k
7736inputs+0outputs (4major+834minor)pagefaults 0swaps </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ /bin/time vadd
-1.000000 -771.0000 -3618.000 -6498.000 -9999.000
2.31user 0.00system 0:02.57elapsed 89%CPU (0avgtext+0avgdata 6976maxresident)k
8192inputs+0outputs (4major+149minor)pagefaults 0swaps </pre></div> </div> <p>Now, recompile with vectorization enabled, and you see results similar to these:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -fast -Minfo vadd.f -Mfree -o vadd
vector_op:
      4, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Residual loop unrolled 7 times (completely unrolled)
         Generated 1 prefetches in scalar loop
      9, Loop not vectorized/parallelized: contains call
loop:
     18, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated </pre></div> </div> <p>Notice the informational messages for the loop at line 18. The first line of the message indicates that two alternate versions of the loop were generated. The loop count and alignments of the arrays determine which of these versions is executed.
The next several lines indicate the loop was vectorized and that prefetch instructions have been generated for three loads to minimize latency of data transfers from main memory.</p> <p>Executing again, you should see results similar to the following:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ /bin/time vadd-simd
-1.000000 -771.0000 -3618.000 -6498.000 -9999.000
0.27user 0.00system 0:00.29elapsed 93%CPU (0avgtext+0avgdata 3124maxresident)k
0inputs+0outputs (0major+838minor)pagefaults 0swaps </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ /bin/time vadd-simd
-1.000000 -771.0000 -3618.000 -6498.000 -9999.000
0.62user 0.00system 0:00.65elapsed 95%CPU (0avgtext+0avgdata 6976maxresident)k
0inputs+0outputs (0major+151minor)pagefaults 0swaps </pre></div> </div> <p>The SIMD result is 3.7 times faster than the equivalent non-SIMD version of the program.</p> <p>Speed-up realized by a given loop or program can vary widely based on a number of factors:</p> <ul class="simple"> <li><p>When the vectors of data are resident in the data cache, performance improvement using SIMD instructions is most effective.</p></li> <li><p>If data is aligned properly, performance will be better in general than when using SIMD operations on unaligned data.</p></li> <li><p>If the compiler can guarantee that data is aligned properly, even more efficient sequences of SIMD instructions can be generated.</p></li> <li><p>The efficiency of loops that operate on single-precision data can be higher. SIMD instructions can operate on four single-precision elements concurrently, but only two double-precision elements.</p></li> </ul> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Compiling with <code class="docutils literal notranslate"><span class="pre">-Mvect=simd</span></code> can result in numerical differences from the executables generated with less optimization.
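</p> <p>This order-of-operations sensitivity is easy to demonstrate directly in C (a hypothetical sketch, not code from the guide): the two groupings of the same three single-precision addends produce different results.</p>

```c
/* Reassociating a float sum, as vectorized reductions and
 * -Mvect=assoc-style transformations may do, can change the rounded
 * result: 1.0f is lost entirely when added to -1.0e8f first, because
 * the spacing between adjacent floats near 1e8 is 8. */
float sum_left(float a, float b, float c)  { return (a + b) + c; }
float sum_right(float a, float b, float c) { return a + (b + c); }
```

<p>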
Certain vectorizable operations, for example dot products, are sensitive to order of operations and the associative transformations necessary to enable vectorization (or parallelization).</p> </div> </section> </section> <section id="interprocedural-analysis-and-optimization-using-mipa"> <h2><span class="section-number">3.6. </span>Interprocedural Analysis and Optimization using -Mipa<a class="headerlink" href="#interprocedural-analysis-and-optimization-using-mipa" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC Fortran, C++ and C compilers use interprocedural analysis (IPA) that results in minimal changes to makefiles and the standard edit-build-run application development cycle. Other than adding <code class="docutils literal notranslate"><span class="pre">-Mipa</span></code> to the command line, no other changes are required. For reference and background, the process of building a program without IPA is described later in this section, followed by the minor modifications required to use IPA with the NVIDIA compilers. While the NVC compiler is used here to show how IPA works, similar capabilities apply to each of the NVIDIA HPC Fortran, C++ and C compilers.</p> <section id="building-a-program-without-ipa-single-step"> <h3><span class="section-number">3.6.1. </span>Building a Program Without IPA – Single Step<a class="headerlink" href="#building-a-program-without-ipa-single-step" title="Permalink to this headline"></a></h3> <p>Using the nvc command-level compiler driver, multiple source files can be compiled and linked into a single executable with one command. 
The following example compiles and links three source files:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -o a.out file1.c file2.c file3.c </pre></div> </div> <p>In actuality, the nvc driver executes several steps to produce the assembly code and object files corresponding to each source file, and subsequently to link the object files together into a single executable file. This command is roughly equivalent to the following commands performed individually:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -S -o file1.s file1.c
$ as -o file1.o file1.s
$ nvc -S -o file2.s file2.c
$ as -o file2.o file2.s
$ nvc -S -o file3.s file3.c
$ as -o file3.o file3.s
$ nvc -o a.out file1.o file2.o file3.o </pre></div> </div> <p>If any of the three source files is edited, the executable can be rebuilt with the same command line:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -o a.out file1.c file2.c file3.c </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>This always works as intended, but has the side-effect of recompiling all of the source files, even if only one has changed. For applications with a large number of source files, this can be time-consuming and inefficient.</p> </div> </section> <section id="building-a-program-without-ipa-several-steps"> <h3><span class="section-number">3.6.2.
</span>Building a Program Without IPA – Several Steps<a class="headerlink" href="#building-a-program-without-ipa-several-steps" title="Permalink to this headline"></a></h3> <p>It is also possible to use individual nvc commands to compile each source file into a corresponding object file, and one to link the resulting object files into an executable:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -c file1.c
$ nvc -c file2.c
$ nvc -c file3.c
$ nvc -o a.out file1.o file2.o file3.o </pre></div> </div> <p>The nvc driver invokes the compiler and assembler as required to process each source file, and invokes the linker for the final link command. If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -c file1.c
$ nvc -o a.out file1.o file2.o file3.o </pre></div> </div> </section> <section id="building-a-program-without-ipa-using-make"> <h3><span class="section-number">3.6.3. </span>Building a Program Without IPA Using Make<a class="headerlink" href="#building-a-program-without-ipa-using-make" title="Permalink to this headline"></a></h3> <p>The program compilation and linking process can be simplified greatly using the <code class="docutils literal notranslate"><span class="pre">make</span></code> utility on systems where it is supported.
Suppose you create a <code class="docutils literal notranslate"><span class="pre">makefile</span></code> containing the following lines:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>a.out: file1.o file2.o file3.o
	nvc $(OPT) -o a.out file1.o file2.o file3.o
file1.o: file1.c
	nvc $(OPT) -c file1.c
file2.o: file2.c
	nvc $(OPT) -c file2.c
file3.o: file3.c
	nvc $(OPT) -c file3.c </pre></div> </div> <p>It is then possible to type a single make command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ make </pre></div> </div> <p>The <code class="docutils literal notranslate"><span class="pre">make</span></code> utility determines which object files are out of date with respect to their corresponding source files, and invokes the compiler to recompile only those source files and to relink the executable. If you subsequently edit one or more source files, the executable can be rebuilt with the minimum number of recompilations using the same single <code class="docutils literal notranslate"><span class="pre">make</span></code> command.</p> </section> <section id="building-a-program-with-ipa"> <h3><span class="section-number">3.6.4. </span>Building a Program with IPA<a class="headerlink" href="#building-a-program-with-ipa" title="Permalink to this headline"></a></h3> <p>Interprocedural analysis and optimization (IPA) by the NVIDIA HPC Compilers alters the standard and <code class="docutils literal notranslate"><span class="pre">make</span></code> utility command-level interfaces as little as possible. IPA occurs in three phases:</p> <ul class="simple"> <li><p><strong>Collection</strong>: Create a summary of each function or procedure, collecting the useful information for interprocedural optimizations. 
This is done during the compile step if the <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> switch is present on the command line; summary information is collected and stored in the object file.</p></li> <li><p><strong>Propagation</strong>: Process all the object files to propagate the interprocedural summary information across function and file boundaries. This is done during the link step, when all the object files are combined, if the <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> switch is present on the link command line.</p></li> <li><p><strong>Recompile/Optimization</strong>: Recompile each of the object files with the propagated interprocedural information, producing a specialized object file. This process is also performed during the link step when the <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> switch is present on the link command line.</p></li> </ul> <p>When linking with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>, the NVIDIA HPC Compilers automatically regenerate IPA-optimized versions of each object file, essentially recompiling each file. If there are IPA-optimized objects from a previous build, the compilers will minimize the recompile time by reusing those objects if they are still valid. They will still be valid if the IPA-optimized object is newer than the original object file, and the propagated IPA information for that file has not changed since it was optimized.</p> <p>After each object file has been recompiled, the regular linker is invoked to build the application with the IPA-optimized object files. The IPA-optimized object files are saved in the same directory as the original object files, for use in subsequent program builds.</p> </section> <section id="building-a-program-with-ipa-single-step"> <h3><span class="section-number">3.6.5. 
</span>Building a Program with IPA – Single Step<a class="headerlink" href="#building-a-program-with-ipa-single-step" title="Permalink to this headline"></a></h3> <p>By adding the <code class="docutils literal notranslate"><span class="pre">-Mipa</span></code> command-line switch, several source files can be compiled and linked with interprocedural optimizations using one command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -Mipa=fast -o a.out file1.c file2.c file3.c </pre></div> </div> <p>Just like compiling without <code class="docutils literal notranslate"><span class="pre">-Mipa</span></code>, the driver executes several steps to produce the assembly and object files to create the executable:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -Mipa=fast -S -o file1.s file1.c
$ as -o file1.o file1.s
$ nvc -Mipa=fast -S -o file2.s file2.c
$ as -o file2.o file2.s
$ nvc -Mipa=fast -S -o file3.s file3.c
$ as -o file3.o file3.s
$ nvc -Mipa=fast -o a.out file1.o file2.o file3.o </pre></div> </div> <p>In the last step, an IPA linker is invoked to read all the IPA summary information and perform the interprocedural propagation. The IPA linker reinvokes the compiler on each of the object files to recompile them with interprocedural information. This creates three new objects with mangled names:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>file1_ipa5_a.out.oo.o, file2_ipa5_a.out.oo.o, file3_ipa5_a.out.oo.o </pre></div> </div> <p>The system linker is then invoked to link these IPA-optimized objects into the final executable. 
Later, if one of the three source files is edited, the executable can be rebuilt with the same command line:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -Mipa=fast -o a.out file1.c file2.c file3.c </pre></div> </div> <p>This works, but again has the side-effect of compiling each source file, and recompiling each object file at link time.</p> </section> <section id="building-a-program-with-ipa-several-steps"> <h3><span class="section-number">3.6.6. </span>Building a Program with IPA – Several Steps<a class="headerlink" href="#building-a-program-with-ipa-several-steps" title="Permalink to this headline"></a></h3> <p>Just by adding the <code class="docutils literal notranslate"><span class="pre">-Mipa</span></code> command-line switch, it is possible to use individual nvc commands to compile each source file, followed by a command to link the resulting object files into an executable:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -Mipa=fast -c file1.c
$ nvc -Mipa=fast -c file2.c
$ nvc -Mipa=fast -c file3.c
$ nvc -Mipa=fast -o a.out file1.o file2.o file3.o </pre></div> </div> <p>The nvc driver invokes the compiler and assembler as required to process each source file, and invokes the IPA linker for the final link command. 
If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -Mipa=fast -c file1.c
$ nvc -Mipa=fast -o a.out file1.o file2.o file3.o </pre></div> </div> <p>When the IPA linker is invoked, it determines that the IPA-optimized object for <code class="docutils literal notranslate"><span class="pre">file1.o</span></code> (<code class="docutils literal notranslate"><span class="pre">file1_ipa5_a.out.oo.o</span></code>) is stale because it is older than the newly compiled <code class="docutils literal notranslate"><span class="pre">file1.o</span></code>, so it reinvokes the compiler to regenerate it. In addition, depending on the nature of the changes to the source file <code class="docutils literal notranslate"><span class="pre">file1.c</span></code>, the interprocedural optimizations previously performed for <code class="docutils literal notranslate"><span class="pre">file2</span></code> and <code class="docutils literal notranslate"><span class="pre">file3</span></code> may now be inaccurate. For instance, IPA may have propagated a constant argument value in a call from a function in <code class="docutils literal notranslate"><span class="pre">file1.c</span></code> to a function in <code class="docutils literal notranslate"><span class="pre">file2.c</span></code>; if the value of the argument has changed, any optimizations based on that constant value are invalid. The IPA linker determines which, if any, of the previously created IPA-optimized objects need to be regenerated and, as appropriate, reinvokes the compiler to regenerate them. Only those objects that are stale or that have new or different IPA information are regenerated. This approach saves compile time.</p> </section> <section id="building-a-program-with-ipa-using-make"> <h3><span class="section-number">3.6.7. 
</span>Building a Program with IPA Using Make<a class="headerlink" href="#building-a-program-with-ipa-using-make" title="Permalink to this headline"></a></h3> <p>As shown earlier, programs can be built with IPA using the <code class="docutils literal notranslate"><span class="pre">make</span></code> utility. Just add the command-line switch <code class="docutils literal notranslate"><span class="pre">-Mipa</span></code>, as shown here:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>OPT=-Mipa=fast
a.out: file1.o file2.o file3.o
	nvc $(OPT) -o a.out file1.o file2.o file3.o
file1.o: file1.c
	nvc $(OPT) -c file1.c
file2.o: file2.c
	nvc $(OPT) -c file2.c
file3.o: file3.c
	nvc $(OPT) -c file3.c </pre></div> </div> <p>Using the single make command invokes the compiler to generate any of the object files that are out-of-date, then invokes nvc to link the objects into the executable. At link time, nvc calls the IPA linker to regenerate any stale or invalid IPA-optimized objects.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ make </pre></div> </div> </section> <section id="questions-about-ipa"> <h3><span class="section-number">3.6.8. </span>Questions about IPA<a class="headerlink" href="#questions-about-ipa" title="Permalink to this headline"></a></h3> <p><strong>Question:</strong> Why is the object file so large?</p> <p><strong>Answer:</strong> An object file created with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> contains several additional sections. One is the summary information used to drive the interprocedural analysis. In addition, the object file contains the compiler internal representation of the source file, so the file can be recompiled at link time with interprocedural optimizations. There may be additional information when inlining is enabled. The total size of the object file may be 5-10 times its original size. 
The extra sections are not added to the final executable.</p> <p><strong>Question:</strong> What if I compile with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> and link without <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>?</p> <p><strong>Answer:</strong> The NVIDIA HPC Compilers generate a legal object file, even when the source file is compiled with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>. If you compile with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> and link without <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>, the linker is invoked on the original object files. A legal executable is generated. While this executable does not have the benefit of interprocedural optimizations, any other optimizations do apply.</p> <p><strong>Question:</strong> What if I compile without <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> and link with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>?</p> <p><strong>Answer:</strong> At link time, the IPA linker must have summary information about all the functions or routines used in the program. This information is created only when a file is compiled with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>. If you compile a file without <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> and then try to get interprocedural optimizations by linking with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>, the IPA linker will issue a message that some routines have no IPA summary information, and will proceed to run the system linker using the original object files. 
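</p> <p>For instance, a mixed build of the three example files might look like the following sketch, in which only <code class="docutils literal notranslate"><span class="pre">file2.c</span></code> and <code class="docutils literal notranslate"><span class="pre">file3.c</span></code> carry IPA summary information (file names follow the earlier examples):</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -c file1.c
$ nvc -Mipa=fast -c file2.c
$ nvc -Mipa=fast -c file3.c
$ nvc -Mipa=fast -o a.out file1.o file2.o file3.o </pre></div> </div> <p>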
If some files were compiled with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code> and others were not, it will determine the safest approximation of the IPA summary information for those files not compiled with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>, and use that to recompile the other files using interprocedural optimizations.</p> <p><strong>Question:</strong> Can I build multiple applications in the same directory with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>?</p> <p><strong>Answer:</strong> Yes. Suppose you have three source files: <code class="docutils literal notranslate"><span class="pre">main1.c</span></code>, <code class="docutils literal notranslate"><span class="pre">main2.c</span></code>, and <code class="docutils literal notranslate"><span class="pre">sub.c</span></code>, where <code class="docutils literal notranslate"><span class="pre">sub.c</span></code> is shared between the two applications. 
Suppose you build the first application with <code class="docutils literal notranslate"><span class="pre">-⁠Mipa</span></code>, using this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -Mipa=fast -o app1 main1.c sub.c </pre></div> </div> <p>The IPA linker creates two IPA-optimized object files and uses them to build the first application.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>main1_ipa4_app1.oo
sub_ipa4_app1.oo </pre></div> </div> <p>Now suppose you build the second application using this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -Mipa=fast -o app2 main2.c sub.c </pre></div> </div> <p>The IPA linker creates two more IPA-optimized object files:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>main2_ipa4_app2.oo
sub_ipa4_app2.oo </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>There are now three object files for sub.c: the original sub.o, and two IPA-optimized objects, one for each application in which it appears.</p> </div> <p><strong>Question:</strong> How is the mangled name for the IPA-optimized object files generated?</p> <p><strong>Answer:</strong> The mangled name has ‘_ipa’ appended, followed by the decimal number of the length of the executable file name, followed by an underscore and the executable file name itself. For example, <code class="docutils literal notranslate"><span class="pre">sub.o</span></code> used in building <code class="docutils literal notranslate"><span class="pre">app1</span></code> becomes <code class="docutils literal notranslate"><span class="pre">sub_ipa4_app1.oo</span></code>, since the executable name <code class="docutils literal notranslate"><span class="pre">app1</span></code> is four characters long. The suffix is changed to .oo so that linking *.o does not pull in the IPA-optimized objects. If the IPA linker determines that the file would not benefit from any interprocedural optimizations, it does not have to recompile the file at link time, and uses the original object.</p> <p><strong>Question:</strong> Can I use parallel make environments (e.g., pmake) with IPA?</p> <p><strong>Answer:</strong> No. 
IPA is not compatible with parallel make environments.</p> <span class="target" id="fn-inline-use"></span></section> </section> </section> <section id="using-function-inlining"> <h1><span class="section-number">4. </span>Using Function Inlining<a class="headerlink" href="#using-function-inlining" title="Permalink to this headline"></a></h1> <p>Function inlining replaces a call to a function or a subroutine with the body of the function or subroutine. This can speed up execution by eliminating parameter passing and function/subroutine call and return overhead. It also allows the compiler to optimize the function with the rest of the code. Note that using function inlining indiscriminately can result in much larger code size and no increase in execution speed.</p> <p>The NVIDIA HPC compilers provide two categories of inlining:</p> <ul class="simple"> <li><p><strong>Automatic function inlining</strong> – In C++ and C, you can inline static functions with the <code class="docutils literal notranslate"><span class="pre">inline</span></code> keyword by using the <code class="docutils literal notranslate"><span class="pre">-⁠Mautoinline</span></code> option, which is included with <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code>.</p></li> <li><p><strong>Function inlining</strong> – You can inline functions which were extracted to the inline libraries in Fortran, C++ and C. There are two ways of enabling function inlining: with and without the <code class="docutils literal notranslate"><span class="pre">lib</span></code> suboption. 
For the former, you create inline libraries, for example using the <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code> compiler driver and the <code class="docutils literal notranslate"><span class="pre">-⁠o</span></code> and <code class="docutils literal notranslate"><span class="pre">-⁠Mextract</span></code> options.</p></li> </ul> <p>There are important restrictions on inlining. Inlining only applies to certain types of functions. Refer to <a class="reference internal" href="#fn-inline-restrictions"><span class="std std-ref">Restrictions on Inlining</span></a> for more details on function inlining limitations.</p> <p>This section describes how to use the following options related to function inlining:</p> <div class="line-block"> <div class="line"><code class="docutils literal notranslate"><span class="pre">-Mautoinline</span></code></div> <div class="line"><code class="docutils literal notranslate"><span class="pre">-Mextract</span></code></div> <div class="line"><code class="docutils literal notranslate"><span class="pre">-Minline</span></code></div> <div class="line"><code class="docutils literal notranslate"><span class="pre">-Mnoinline</span></code></div> <div class="line"><code class="docutils literal notranslate"><span class="pre">-Mrecursive</span></code></div> </div> <section id="automatic-function-inlining-in-c-and-c"> <h2><span class="section-number">4.1. </span>Automatic function inlining in C++ and C<a class="headerlink" href="#automatic-function-inlining-in-c-and-c" title="Permalink to this headline"></a></h2> <p>To enable automatic function inlining in C++ and C for static functions with the <code class="docutils literal notranslate"><span class="pre">inline</span></code> keyword, use the <code class="docutils literal notranslate"><span class="pre">-⁠Mautoinline</span></code> option (included in <code class="docutils literal notranslate"><span class="pre">-⁠fast</span></code>). 
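</p> <p>For example, either of the following commands (the source file name is illustrative) enables automatic inlining, the first as part of <code class="docutils literal notranslate"><span class="pre">-fast</span></code> and the second by naming the option directly:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -fast myprog.c
$ nvc -Mautoinline myprog.c </pre></div> </div> <p>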
Use <code class="docutils literal notranslate"><span class="pre">-⁠Mnoautoinline</span></code> to disable it.</p> <p>These <code class="docutils literal notranslate"><span class="pre">-⁠Mautoinline</span></code> suboptions let you determine the selection criteria, where <code class="docutils literal notranslate"><span class="pre">n</span></code> loosely corresponds to the number of lines in the procedure:</p> <dl class="simple"> <dt>maxsize:<code class="docutils literal notranslate"><span class="pre">n</span></code></dt><dd><p>Automatically inline functions of size <code class="docutils literal notranslate"><span class="pre">n</span></code> or less</p> </dd> <dt>totalsize:<code class="docutils literal notranslate"><span class="pre">n</span></code></dt><dd><p>Limit automatic inlining to a total size of <code class="docutils literal notranslate"><span class="pre">n</span></code></p> </dd> </dl> </section> <section id="invoking-procedure-inlining"> <h2><span class="section-number">4.2. </span>Invoking Procedure Inlining<a class="headerlink" href="#invoking-procedure-inlining" title="Permalink to this headline"></a></h2> <p>To invoke the procedure inliner, use the <code class="docutils literal notranslate"><span class="pre">-⁠Minline</span></code> option. If you do not specify an inline library, the compiler performs a special prepass on all source files named on the compiler command line before it compiles any of them. This pass extracts procedures that meet the requirements for inlining and puts them in a temporary inline library for use by the compilation pass.</p> <p>Several <code class="docutils literal notranslate"><span class="pre">-⁠Minline</span></code> suboptions let you determine the selection criteria for procedures to be inlined. 
These suboptions include:</p> <dl> <dt>except:<code class="docutils literal notranslate"><span class="pre">func</span></code></dt><dd><p>Inlines all eligible procedures except <code class="docutils literal notranslate"><span class="pre">func</span></code>, a procedure in the source text. You can use a comma-separated list to specify multiple procedures.</p> </dd> <dt>[name:]<code class="docutils literal notranslate"><span class="pre">func</span></code></dt><dd><p>Inlines all procedures in the source text whose name matches <code class="docutils literal notranslate"><span class="pre">func</span></code>. You can use a comma-separated list to specify multiple procedures.</p> </dd> <dt>[maxsize:]<code class="docutils literal notranslate"><span class="pre">n</span></code></dt><dd><p>A numeric option is assumed to be a size. Procedures of size <code class="docutils literal notranslate"><span class="pre">n</span></code> or less are inlined, where <code class="docutils literal notranslate"><span class="pre">n</span></code> loosely corresponds to the number of lines in the procedure. If both <code class="docutils literal notranslate"><span class="pre">n</span></code> and <code class="docutils literal notranslate"><span class="pre">func</span></code> are specified, then procedures matching the given name(s) or meeting the size requirements are inlined.</p> </dd> <dt>reshape</dt><dd><p>Fortran subprograms with array arguments are not inlined by default if the array shape does not match the shape in the caller. 
Use this option to override the default.</p> </dd> <dt>smallsize:<code class="docutils literal notranslate"><span class="pre">n</span></code></dt><dd><p>Always inline procedures of size smaller than <code class="docutils literal notranslate"><span class="pre">n</span></code> regardless of other size limits.</p> </dd> <dt>totalsize:<code class="docutils literal notranslate"><span class="pre">n</span></code></dt><dd><p>Stop inlining in a procedure when the procedure’s total size with inlining reaches the specified <code class="docutils literal notranslate"><span class="pre">n</span></code>.</p> </dd> <dt>[lib:]<code class="docutils literal notranslate"><span class="pre">file.ext</span></code></dt><dd><p>Instructs the inliner to inline the procedures within the library file <code class="docutils literal notranslate"><span class="pre">file.ext</span></code>. If no inline library is specified, procedures are extracted from a temporary library created during an extract prepass.</p> <div class="admonition tip"> <p class="admonition-title">Tip</p> <p>Create the library file using the <code class="docutils literal notranslate"><span class="pre">-Mextract</span></code> option.</p> </div> </dd> </dl> <p>If you specify both a procedure name and a maxsize n, the compiler inlines procedures that match the procedure name <em>or</em> have n or fewer statements.</p> <p>If a name is used without a keyword, then a name with a period is assumed to be an inline library and a name without a period is assumed to be a procedure name. 
If a number is used without a keyword, the number is assumed to be a size.</p> <p>Inlining can be disabled with <code class="docutils literal notranslate"><span class="pre">-Mnoinline</span></code>.</p> <p>In the following example, the compiler inlines procedures with fewer than approximately 100 statements in the source file <code class="docutils literal notranslate"><span class="pre">myprog.f</span></code> and writes the executable code in the default output file <code class="docutils literal notranslate"><span class="pre">a.out</span></code>.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -Minline=maxsize:100 myprog.f </pre></div> </div> </section> <section id="using-an-inline-library"> <h2><span class="section-number">4.3. </span>Using an Inline Library<a class="headerlink" href="#using-an-inline-library" title="Permalink to this headline"></a></h2> <p>If you specify one or more inline libraries on the command line with the <code class="docutils literal notranslate"><span class="pre">-Minline</span></code> option, the compiler does not perform an initial extract pass. The compiler selects functions to inline from the specified inline library. 
If you also specify a size or function name, all functions in the inline library meeting the selection criteria are selected for inline expansion at points in the source text where they are called.</p> <p>If you do not specify a function name or a size limitation for the <code class="docutils literal notranslate"><span class="pre">-Minline</span></code> option, the compiler tries to inline every function in the inline library that matches a function in the source text.</p> <p>In the following example, the compiler inlines the function <code class="docutils literal notranslate"><span class="pre">proc</span></code> from the inline library <code class="docutils literal notranslate"><span class="pre">lib.il</span></code> and writes the executable code in the default output file <code class="docutils literal notranslate"><span class="pre">a.out</span></code>.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -Minline=name:proc,lib:lib.il myprog.f </pre></div> </div> <p>The following command line is equivalent to the preceding line, except that it does not use the keywords <code class="docutils literal notranslate"><span class="pre">name:</span></code> and <code class="docutils literal notranslate"><span class="pre">lib:</span></code>. You typically use keywords to avoid name conflicts when you use an inline library name that does not contain a period. Otherwise, without the keywords, a period informs the compiler that the file on the command line is an inline library.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -Minline=proc,lib.il myprog.f </pre></div> </div> <span class="target" id="fn-inline-create-lib"></span></section> <section id="creating-an-inline-library"> <h2><span class="section-number">4.4. 
</span>Creating an Inline Library<a class="headerlink" href="#creating-an-inline-library" title="Permalink to this headline"></a></h2> <p>You can create or update an inline library using the <code class="docutils literal notranslate"><span class="pre">-⁠Mextract</span></code> command-line option. If you do not specify selection criteria with the <code class="docutils literal notranslate"><span class="pre">-⁠Mextract</span></code> option, the compiler attempts to extract all procedures.</p> <p>Several <code class="docutils literal notranslate"><span class="pre">-⁠Mextract</span></code> options let you determine the selection criteria for creating or updating an inline library. These selection criteria include:</p> <dl> <dt><code class="docutils literal notranslate"><span class="pre">func</span></code></dt><dd><p>Extracts the procedure <code class="docutils literal notranslate"><span class="pre">func</span></code>. You can use a comma-separated list to specify multiple procedures.</p> </dd> <dt>[name:]<code class="docutils literal notranslate"><span class="pre">func</span></code></dt><dd><p>Extracts the procedure whose name matches <code class="docutils literal notranslate"><span class="pre">func</span></code>, a procedure in the source text.</p> </dd> <dt>[size:]<code class="docutils literal notranslate"><span class="pre">n</span></code></dt><dd><p>Limits the size of the extracted procedures to those with a statement count less than or equal to <code class="docutils literal notranslate"><span class="pre">n</span></code>, the specified size.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The size n may not exactly equal the number of statements in a selected procedure; the size parameter is merely a rough gauge.</p> </div> </dd> <dt>[lib:]<code class="docutils literal notranslate"><span class="pre">ext.lib</span></code></dt><dd><p>Stores the extracted information in the library directory <code class="docutils literal notranslate"><span
class="pre">ext.lib</span></code>.</p> <p>If no inline library is specified, procedures are extracted to a temporary library created during an extract prepass for use during the compilation stage.</p> </dd> </dl> <p>When you use the <code class="docutils literal notranslate"><span class="pre">-⁠Mextract</span></code> option, only the extract phase is performed; the compile and link phases are not performed. The output of an extract pass is a library of procedures available for inlining. This output is placed in the inline library file specified on the command line with the <code class="docutils literal notranslate"><span class="pre">-⁠o</span></code> filename specification. If the library file exists, new information is appended to it. If the file does not exist, it is created. You can use a command similar to the following:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -Mextract=lib:lib.il myfunc.f </pre></div> </div> <p>You can use the <code class="docutils literal notranslate"><span class="pre">-⁠Minline</span></code> option with the <code class="docutils literal notranslate"><span class="pre">-⁠Mextract</span></code> option. In this case, the extracted library of procedures can have other procedures inlined into the library. Using both options enables you to obtain more than one level of inlining. In this situation, if you do not specify a library with the <code class="docutils literal notranslate"><span class="pre">-⁠Minline</span></code> option, the inline process consists of two extract passes. The first pass is a hidden pass implied by the <code class="docutils literal notranslate"><span class="pre">-⁠Minline</span></code> option, during which the compiler extracts procedures and places them into a temporary library. 
The second pass uses the results of the first pass but puts its results into the library that you specify with the <code class="docutils literal notranslate"><span class="pre">-⁠o</span></code> option.</p> <section id="working-with-inline-libraries"> <h3><span class="section-number">4.4.1. </span>Working with Inline Libraries<a class="headerlink" href="#working-with-inline-libraries" title="Permalink to this headline"></a></h3> <p>An inline library is implemented as a directory with each inline function in the library stored as a file using an encoded form of the inlinable function.</p> <p>A special file named <code class="docutils literal notranslate"><span class="pre">TOC</span></code> in the inline library directory serves as a table of contents for the inline library. This is a printable, ASCII file which you can examine to locate information about the library contents, such as names and sizes of functions, the source file from which they were extracted, the version number of the extractor which created the entry, and so on.</p> <p>Libraries and their elements can be manipulated using ordinary system commands.</p> <ul class="simple"> <li><p>Inline libraries can be copied or renamed.</p></li> <li><p>Elements of libraries can be deleted or copied from one library to another.</p></li> <li><p>The <cite>ls</cite> or <cite>dir</cite> command can be used to determine the last-change date of a library entry.</p></li> </ul> </section> <section id="dependencies"> <h3><span class="section-number">4.4.2. </span>Dependencies<a class="headerlink" href="#dependencies" title="Permalink to this headline"></a></h3> <p>When a library is created or updated using one of the NVIDIA HPC compilers, the last-change date of the library directory is updated. 
This allows a library to be listed as a dependency in a makefile and ensures that the necessary compilations are performed when a library is changed.</p> </section> <section id="updating-inline-libraries-makefiles"> <h3><span class="section-number">4.4.3. </span>Updating Inline Libraries – Makefiles<a class="headerlink" href="#updating-inline-libraries-makefiles" title="Permalink to this headline"></a></h3> <p>If you use inline libraries, you must be certain that they remain up-to-date with the source files into which they are inlined. One way to ensure inline libraries are updated is to include them in a makefile.</p> <p>The makefile fragment in the following example assumes the file <code class="docutils literal notranslate"><span class="pre">utils.f</span></code> contains a number of small functions used in the files <code class="docutils literal notranslate"><span class="pre">parser.f</span></code> and <code class="docutils literal notranslate"><span class="pre">alloc.f</span></code>.</p> <p>This portion of the makefile:</p> <ul class="simple"> <li><p>Maintains the inline library <code class="docutils literal notranslate"><span class="pre">utils.il</span></code>.</p></li> <li><p>Updates the library whenever you change <code class="docutils literal notranslate"><span class="pre">utils.f</span></code> or one of the include files it uses.</p></li> <li><p>Compiles <code class="docutils literal notranslate"><span class="pre">parser.f</span></code> and <code class="docutils literal notranslate"><span class="pre">alloc.f</span></code> whenever you update the library.</p></li> </ul> <p><strong>Sample Makefile</strong></p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>SRC = mydir
FC = nvfortran
FFLAGS = -O2

main.o: $(SRC)/main.f $(SRC)/global.h
	$(FC) $(FFLAGS) -c $(SRC)/main.f
utils.o: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
	$(FC) $(FFLAGS) -c $(SRC)/utils.f
utils.il: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
	$(FC) $(FFLAGS) -Mextract=15 -o utils.il $(SRC)/utils.f
parser.o: $(SRC)/parser.f $(SRC)/global.h utils.il
	$(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/parser.f
alloc.o: $(SRC)/alloc.f $(SRC)/global.h utils.il
	$(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/alloc.f
myprog: main.o utils.o parser.o alloc.o
	$(FC) -o myprog main.o utils.o parser.o alloc.o
</pre></div> </div> </section> </section> <section id="error-detection-during-inlining"> <h2><span class="section-number">4.5. </span>Error Detection during Inlining<a class="headerlink" href="#error-detection-during-inlining" title="Permalink to this headline"></a></h2> <p>You can specify the <code class="docutils literal notranslate"><span class="pre">-Minfo=inline</span></code> option to request inlining information from the compiler when you invoke the inliner. For example:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -Minline=mylib.il -Minfo=inline myext.f </pre></div> </div> </section> <section id="examples"> <h2><span class="section-number">4.6. </span>Examples<a class="headerlink" href="#examples" title="Permalink to this headline"></a></h2> <p>Assume the program <code class="docutils literal notranslate"><span class="pre">dhry</span></code> consists of a single source file <code class="docutils literal notranslate"><span class="pre">dhry.f</span></code>. 
The following command line builds an executable file for <code class="docutils literal notranslate"><span class="pre">dhry</span></code> in which proc7 is inlined wherever it is called:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran dhry.f -Minline=proc7 </pre></div> </div> <p>The following command lines build an executable file for <code class="docutils literal notranslate"><span class="pre">dhry</span></code> in which proc7 plus any functions of approximately 10 or fewer statements are inlined (one level only).</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The specified functions are inlined only if they are previously placed in the inline library, <code class="docutils literal notranslate"><span class="pre">temp.il</span></code>, during the extract phase.</p> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran dhry.f -Mextract=lib:temp.il $ nvfortran dhry.f -Minline=10,proc7,temp.il </pre></div> </div> <p>Using the same source file <code class="docutils literal notranslate"><span class="pre">dhry.f</span></code>, the following example builds an executable for <code class="docutils literal notranslate"><span class="pre">dhry</span></code> in which all functions of roughly ten or fewer statements are inlined. Two levels of inlining are performed. This means that if function A calls function B, and B calls C, and both B and C are inlinable, then the version of B which is inlined into A will have had C inlined into it.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran dhry.f -Minline=maxsize:10 </pre></div> </div> <span class="target" id="fn-inline-restrictions"></span></section> <section id="restrictions-on-inlining"> <h2><span class="section-number">4.7. 
</span>Restrictions on Inlining<a class="headerlink" href="#restrictions-on-inlining" title="Permalink to this headline"></a></h2> <p>The following Fortran subprograms cannot be extracted:</p> <ul class="simple"> <li><p>Main or BLOCK DATA programs.</p></li> <li><p>Subprograms containing alternate return, assigned GO TO, DATA, SAVE, or EQUIVALENCE statements.</p></li> <li><p>Subprograms containing FORMAT statements.</p></li> <li><p>Subprograms containing multiple entries.</p></li> </ul> <p>A Fortran subprogram is not inlined if any of the following applies:</p> <ul class="simple"> <li><p>It is referenced in a statement function.</p></li> <li><p>A common block mismatch exists; in other words, the caller must contain all common blocks specified in the callee, and elements of the common blocks must agree in name, order, and type (except that the caller’s common block can have additional members appended to the end of the common block).</p></li> <li><p>An argument mismatch exists; in other words, the number and type (size) of actual and formal parameters must be equal.</p></li> <li><p>A name clash exists, such as a call to subroutine <code class="docutils literal notranslate"><span class="pre">xyz</span></code> in the extracted subprogram and a variable named <code class="docutils literal notranslate"><span class="pre">xyz</span></code> in the caller.</p></li> </ul> <p>The following types of C and C++ functions cannot be inlined:</p> <ul class="simple"> <li><p>Functions which accept a variable number of arguments</p></li> </ul> <p>Certain C/C++ functions can only be inlined into the file that contains their definition:</p> <ul class="simple"> <li><p>Static functions</p></li> <li><p>Functions which call a static function</p></li> <li><p>Functions which reference a static variable</p></li> </ul> </section> </section> <section id="using-gpus"> <h1><span class="section-number">5. 
</span>Using GPUs<a class="headerlink" href="#using-gpus" title="Permalink to this headline"></a></h1> <p>An NVIDIA GPU can be used as an accelerator to which a CPU can offload data and executable kernels to perform compute-intensive calculations. This section gives an overview of options for programming NVIDIA GPUs with NVIDIA’s HPC Compilers and covers topics that affect GPU programming when using one or more of the GPU programming models.</p> <section id="id1"> <h2><span class="section-number">5.1. </span>Overview<a class="headerlink" href="#id1" title="Permalink to this headline"></a></h2> <p>With the NVIDIA HPC Compilers you can program NVIDIA GPUs using certain standard language constructs, OpenACC directives, OpenMP directives, or CUDA Fortran language extensions. GPU programming with standard language constructs or directives allows you to create high-level GPU-accelerated programs without the need to explicitly initialize the GPU, manage data or program transfers between the host and GPU, or initiate GPU startup and shutdown. Rather, all of these details are implicit in the programming model and are managed by the NVIDIA HPC SDK Fortran, C++ and C compilers. GPU programming with CUDA extensions gives you access to all NVIDIA GPU features and full control over data management and offloading of compute-intensive loops and kernels.</p> <p>The NVC++ compiler supports automatic offload of C++17 Parallel Algorithms invocations to NVIDIA GPUs under control of the <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> compiler option. See the Blog post <em>Accelerating Standard C++ with GPUs</em> for details on using this feature. The NVFORTRAN compiler supports automatic offload to NVIDIA GPUs of certain Fortran array intrinsics and patterns of array syntax, including use of Volta and Ampere architecture Tensor Cores for appropriate intrinsics. 
See the Blog post <em>Bringing Tensor Cores to Standard Fortran</em> for details on using this feature.</p> <p>The NVFORTRAN compiler supports CUDA programming in Fortran. See the <em>NVIDIA CUDA Fortran Programming Guide</em> for complete details on how to use CUDA Fortran. The NVCC compiler supports CUDA programming in C and C++ in combination with a host C++ compiler on your system. See the <em>CUDA C++ Programming Guide</em> for an introduction and overview of how to use NVCC and CUDA C++.</p> <p>The NVFORTRAN, NVC++ and NVC compilers all support directive-based programming of NVIDIA GPUs using OpenACC. OpenACC is an accelerator programming model that is portable across operating systems and various host CPUs and types of accelerators, including both NVIDIA GPUs and multicore CPUs. OpenACC directives allow a programmer to migrate applications incrementally to accelerator targets using standards-compliant Fortran, C++ or C that remains completely portable to other compilers and systems. It allows the programmer to augment information available to the compilers, including specification of data local to an accelerator region, guidance on mapping of loops onto an accelerator, and similar performance-related details.</p> <p>The NVFORTRAN, NVC++, and NVC compilers support a subset of the OpenMP Application Program Interface for CPUs and GPUs. OpenMP applications properly structured for GPUs, meaning they expose massive parallelism and have relatively little or no synchronization in GPU-side code segments, should compile and execute with performance on par with or close to equivalent OpenACC. Codes that are not well-structured for GPUs may perform poorly but should execute correctly.</p> <p>In user-directed accelerator programming the user specifies the regions of a host program to be targeted for offloading to an accelerator. 
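</p><p>As a minimal, hedged sketch of such a user-directed region (the routine name and clauses here are illustrative, not taken from this guide), an OpenACC compute construct in C looks like this:</p>

```c
/* SAXPY written as an OpenACC compute region. The pragma asks the
 * compiler to offload the loop and names the arrays to move between
 * host and device memory; a compiler without OpenACC support simply
 * ignores the directive and runs the loop on the host. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

<p>With the NVIDIA compilers such a file would typically be built with something like <code class="docutils literal notranslate"><span class="pre">nvc</span> <span class="pre">-acc</span> <span class="pre">-Minfo=accel</span></code>, which reports how the loop was mapped to the accelerator; because the directive is only a hint layer over standard C, the same source remains portable to other compilers and systems.</p><p>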
The bulk of a user’s program, as well as regions containing constructs that are not supported on the targeted accelerator, are executed on the host.</p> </section> <section id="terminology"> <h2><span class="section-number">5.2. </span>Terminology<a class="headerlink" href="#terminology" title="Permalink to this headline"></a></h2> <p>Clear and consistent terminology is important in describing any programming model. This section provides definitions of the terms required for you to effectively use this section and the associated programming model.</p> <dl class="simple"> <dt>Accelerator</dt><dd><p>a parallel processor, such as a GPU or a CPU running in multicore mode, to which a CPU can offload data and executable kernels to perform compute-intensive calculations.</p> </dd> <dt>Compute intensity</dt><dd><p>for a given loop, region, or program unit, the ratio of the number of arithmetic operations performed on computed data divided by the number of memory transfers required to move that data between two levels of a memory hierarchy.</p> </dd> <dt>Compute region</dt><dd><p>a structured block defined by a compute construct. A <em>compute construct</em> is a structured block containing loops which are compiled for the accelerator. A compute region may require device memory to be allocated and data to be copied from host to device upon region entry, and data to be copied from device to host memory and device memory deallocated upon exit. The dynamic range of a compute construct, including any code in procedures called from within the construct, is the compute region. In this release, compute regions may not contain other compute regions or data regions.</p> </dd> <dt>Construct</dt><dd><p>a structured block identified by the programmer or implicitly defined by the language. Certain actions may occur when program execution reaches the start and end of a construct, such as device memory allocation or data movement between the host and device memory. 
Loops in a compute construct are targeted for execution on the accelerator. The dynamic range of a construct including any code in procedures called from within the construct, is called a <em>region</em>.</p> </dd> <dt>CUDA</dt><dd><p>stands for Compute Unified Device Architecture; CUDA C++ and Fortran language extensions and API calls can be used to explicitly control and program an NVIDIA GPU.</p> </dd> <dt>Data region</dt><dd><p>a region defined by a data construct, or an implicit data region for a function or subroutine containing directives. Data regions typically require device memory to be allocated and data to be copied from host to device memory upon entry, and data to be copied from device to host memory and device memory deallocated upon exit. Data regions may contain other data regions and compute regions.</p> </dd> <dt>Device</dt><dd><p>a general reference to any type of accelerator.</p> </dd> <dt>Device memory</dt><dd><p>memory attached to an accelerator which is physically separate from the host memory.</p> </dd> <dt>Directive</dt><dd><p>in C, a #pragma, or in Fortran, a specially formatted comment statement that is interpreted by a compiler to augment information about or specify the behavior of the program.</p> </dd> <dt>DMA</dt><dd><p>Direct Memory Access, a method to move data between physically separate memories; this is typically performed by a DMA engine, separate from the host CPU, that can access the host physical memory as well as an IO device or GPU physical memory.</p> </dd> <dt>GPU</dt><dd><p>a Graphics Processing Unit; one type of accelerator device.</p> </dd> <dt>Host</dt><dd><p>the main CPU that in this context has an attached accelerator device. 
The host CPU controls the program regions and data loaded into and executed on the device.</p> </dd> <dt>Loop trip count</dt><dd><p>the number of times a particular loop executes.</p> </dd> <dt>Private data</dt><dd><p>with respect to an iterative loop, data which is used only during a particular loop iteration. With respect to a more general region of code, data which is used within the region but is not initialized prior to the region and is re-initialized prior to any use after the region.</p> </dd> <dt>Region</dt><dd><p>the dynamic range of a construct, including any procedures invoked from within the construct.</p> </dd> <dt>Structured block</dt><dd><p>in C++ or C, an executable statement, possibly compound, with a single entry at the top and a single exit at the bottom. In Fortran, a block of executable statements with a single entry at the top and a single exit at the bottom.</p> </dd> <dt>Vector operation</dt><dd><p>a single operation or sequence of operations applied uniformly to each element of an array.</p> </dd> <dt>Visible device copy</dt><dd><p>a copy of a variable, array, or subarray allocated in device memory, that is visible to the program unit being compiled.</p> </dd> </dl> </section> <section id="execution-model"> <h2><span class="section-number">5.3. </span>Execution Model<a class="headerlink" href="#execution-model" title="Permalink to this headline"></a></h2> <p>The execution model targeted by the NVIDIA HPC Compilers is host-directed execution with an attached accelerator device, such as a GPU. The bulk of a user application executes on the host. Compute intensive regions are offloaded to the accelerator device under control of the host. The accelerator device executes kernels, which may be as simple as a tightly-nested loop, or as complex as a subroutine, depending on the accelerator hardware.</p> <section id="host-functions"> <h3><span class="section-number">5.3.1. 
</span>Host Functions<a class="headerlink" href="#host-functions" title="Permalink to this headline"></a></h3> <p>Even in accelerator-targeted regions, the host must orchestrate the execution; it</p> <ul class="simple"> <li><p>allocates memory on the accelerator device</p></li> <li><p>initiates data transfer</p></li> <li><p>sends the kernel code to the accelerator</p></li> <li><p>passes kernel arguments</p></li> <li><p>queues the kernel</p></li> <li><p>waits for completion</p></li> <li><p>transfers results back to the host</p></li> <li><p>deallocates memory</p></li> </ul> <div class="admonition note"> <p class="admonition-title">Note</p> <p>In most cases, the host can queue a sequence of kernels to be executed on the device, one after the other.</p> </div> <span class="target" id="acc-mem-model"></span></section> </section> <section id="memory-model"> <h2><span class="section-number">5.4. </span>Memory Model<a class="headerlink" href="#memory-model" title="Permalink to this headline"></a></h2> <p>The most significant difference between a <em>host-only</em> program and a <em>host+accelerator</em> program is that the memory on the accelerator can be completely separate from host memory, which is the case on many GPUs. For example:</p> <ul class="simple"> <li><p>The host cannot read or write accelerator memory directly because it is not mapped into the virtual memory space of the host.</p></li> <li><p>All data movement between host memory and accelerator memory must be performed by the host through runtime library calls that explicitly move data between the separate memories.</p></li> <li><p>In general it is not valid for the compiler to assume the accelerator can read or write host memory directly. 
This is well-defined starting with the OpenACC 2.7 and OpenMP 5.0 specifications.</p></li> </ul> <p>The systems with the latest GPUs provide a unified single address space between CPU and GPU for some or all memory regions, as detailed in the <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a> subsection below. In these systems data can be accessed from host and accelerator subprograms without the need for explicit data movement.</p> <p>The NVIDIA HPC Compilers support the following system memory modes:</p> <table class="table-no-stripes colwidths-given docutils align-default" id="id23"> <caption><span class="caption-text">Table 12. GPU Memory Modes</span><a class="headerlink" href="#id23" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 20%" /> <col style="width: 60%" /> <col style="width: 20%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Memory Mode</p></th> <th class="head"><p>Description</p></th> <th class="head"><p>Compiler flags</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>Separate</p></td> <td><p>All data accessed in host and accelerator programs are in separate (CPU and GPU) memories. Data in the application need to be physically moved between CPU and GPU memory either by adding explicit annotations or by relying on a compiler to detect and migrate the data.</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate</span></code></p></td> </tr> <tr class="row-odd"><td><p>Managed</p></td> <td><p>Dynamically allocated host data are placed in CUDA Managed Memory which is a unified single address space between host and accelerator programs and can therefore be accessed on device without explicit data movement. 
All other data (host, stack, or global data) remain in separate memory.</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code></p></td> </tr> <tr class="row-even"><td><p>Unified</p></td> <td><p>All host data are placed in a unified single address space between the host and accelerator subprograms; no explicit data movements are required. This mode is intended for targets with full CUDA Unified Memory capability and it may utilize CUDA Managed Memory for dynamic allocations.</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code></p></td> </tr> </tbody> </table> <p>If the memory mode is not selected explicitly by passing one of the above <code class="docutils literal notranslate"><span class="pre">-gpu=mem:*</span></code> options, the compiler selects a default memory mode. The default memory mode for Stdpar is explained in <a class="reference internal" href="#stdpar-use"><span class="std std-ref">Using Stdpar</span></a>. When Stdpar is not enabled, the default memory mode is Separate Memory. Memory modes may have specific semantics in each programming language and the compilers can sometimes implicitly determine the data movement that’s required. More details can be found in the subsections of each programming model.</p> <p>The following options <code class="docutils literal notranslate"><span class="pre">-gpu=[no]managed</span></code>, <code class="docutils literal notranslate"><span class="pre">-gpu=[no]unified</span></code> and <code class="docutils literal notranslate"><span class="pre">-gpu=pinned</span></code> are deprecated but still accepted. 
Refer to <a class="reference internal" href="#gpu-mem-flags"><span class="std std-ref">Command-line Options Selecting Compiler Memory Modes</span></a> for compatibility between the current and deprecated memory-specific flags.</p> <p>The compiler implicitly defines the following macros corresponding to the memory mode it compiles for:</p> <ul class="simple"> <li><p>When the code is compiled for Separate Memory Mode, the compiler defines the <code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_GPU_SEPARATE_MEM</span></code> macro.</p></li> <li><p>When the code is compiled for Managed Memory Mode, the compiler defines the <code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_GPU_MANAGED_MEM</span></code> macro.</p></li> <li><p>When the code is compiled for Unified Memory Mode, the compiler defines the <code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_GPU_UNIFIED_MEM</span></code> macro. If CUDA Managed Memory is utilized, the compiler additionally defines <code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_GPU_MANAGED_MEM</span></code>.</p></li> </ul> <p>When a binary is compiled for one memory mode, it may need to be run on a system with specific memory capabilities as follows:</p> <ul class="simple"> <li><p>Applications compiled for Separate Memory Mode can run on any CUDA platform.</p></li> <li><p>Applications compiled for Managed Memory Mode must be run on platforms with CUDA Managed Memory or full CUDA Unified Memory capabilities.</p></li> <li><p>Applications compiled for Unified Memory Mode must be run on platforms with full CUDA Unified Memory.</p></li> </ul> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Memory allocated in the accelerator subprogram can’t be accessed or deallocated from the host.</p> </div> <section id="separate-host-and-accelerator-memory-considerations"> <h3><span class="section-number">5.4.1. 
</span>Separate Host and Accelerator Memory Considerations<a class="headerlink" href="#separate-host-and-accelerator-memory-considerations" title="Permalink to this headline"></a></h3> <p>The programmer must be aware of the potentially separate memories for many reasons, including but not limited to:</p> <ul class="simple"> <li><p>Memory bandwidth between host memory and accelerator memory determines the compute intensity required to effectively accelerate a given region of code.</p></li> <li><p>Limited size of accelerator memory may prohibit offloading of regions of code that operate on very large amounts of data.</p></li> </ul> <section id="accelerator-memory"> <h4><span class="section-number">5.4.1.1. </span>Accelerator Memory<a class="headerlink" href="#accelerator-memory" title="Permalink to this headline"></a></h4> <p>On the accelerator side, current GPUs implement a weak memory model. In particular, they do not support memory coherence between threads unless those threads are parallel only at the synchronous level and the memory operations are separated by an explicit barrier. Otherwise, if one thread updates a memory location and another reads the same location, or two threads store a value to the same location, the hardware does not guarantee the results. While the results of running such a program might be inconsistent, it is not accurate to say that the results are incorrect. By definition, such programs are defined as being in error. While a compiler can detect some potential errors of this nature, it is nonetheless possible to write an accelerator region that produces inconsistent numerical results.</p> <p>Stack data in accelerator subprograms are allocated per thread. Stack data from one thread are not accessible by the other threads.</p> </section> <section id="staging-memory-buffer"> <h4><span class="section-number">5.4.1.2. 
</span>Staging Memory Buffer<a class="headerlink" href="#staging-memory-buffer" title="Permalink to this headline"></a></h4> <p>Memory transfers between the accelerator and host may not always be asynchronous with respect to the host, even if the chosen programming model (for instance, OpenACC) declares them to be asynchronous. This limitation may be due to the specific GPU and host memory architectures.</p> <p>In order to help the host program proceed while a memory transfer to or from the accelerator is underway, the NVIDIA HPC Compilers Runtime maintains a designated staging memory area, also known as a pinned buffer. This memory area is registered with the CUDA API, which makes it suitable for asynchronous memory transfers between the GPU and the host. When an asynchronous memory transfer is started, the data being transferred is staged through the pinned buffer. Multiple asynchronous operations on the same data can be issued; in that case, the runtime system will operate on the data staged in the pinned buffer, not on the original host memory. When the host program issues an explicit or implicit synchronization request, the data is moved from the pinned buffer to its destination transparently to the application.</p> <p>The runtime has the discretion to enable or disable the pinned buffer depending on the host and GPU memory architecture. Also, the size of the pinned buffer is determined by the runtime system as appropriate. The user can control some of these decisions using environment variables at the start of the application. Please refer to <a class="reference internal" href="#env-vars-memory"><span class="std std-ref">Environment Variables Controlling Device Memory Management</span></a> to learn more.</p> </section> <section id="cache-management"> <h4><span class="section-number">5.4.1.3. 
</span>Cache Management<a class="headerlink" href="#cache-management" title="Permalink to this headline"></a></h4> <p>Some current GPUs have a software-managed cache, some have hardware-managed caches, and most have hardware caches that can be used only in certain situations and are limited to read-only data. In low-level programming models such as CUDA, it is up to the programmer to manage these caches. The OpenACC programming model provides directives the programmer can use as hints to the compiler for cache management.</p> <span class="target" id="env-vars-memory"></span></section> <section id="environment-variables-controlling-device-memory-management"> <h4><span class="section-number">5.4.1.4. </span>Environment Variables Controlling Device Memory Management<a class="headerlink" href="#environment-variables-controlling-device-memory-management" title="Permalink to this headline"></a></h4> <p>This section summarizes the environment variables that the NVIDIA HPC Compilers use to control device memory management.</p> <p>The following table contains the environment variables that are currently supported and provides a brief description of each.</p> <table class="table-no-stripes docutils align-default" id="id24"> <caption><span class="caption-text">Table 13. Memory Management Environment Variables</span><a class="headerlink" href="#id24" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 9%" /> <col style="width: 91%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Environment Variable</p></th> <th class="head"><p>Use</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>NVCOMPILER_ACC_BUFFERSIZE</p></td> <td><p>For NVIDIA CUDA devices, this defines the size of the pinned buffer used to transfer data between host and device.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_CUDA_CTX_SCHED</p></td> <td><p>For NVIDIA CUDA devices, sets flags to be used when creating a new CUDA context. 
By default, the <code class="docutils literal notranslate"><span class="pre">CU_CTX_SCHED_YIELD</span></code> flag is used. Please refer to the <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX">CUDA Toolkit Documentation</a> for the detailed description of the <code class="docutils literal notranslate"><span class="pre">cuCtxCreate</span></code> function and the possible flag values.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_CUDA_HEAPSIZE</p></td> <td><p>For NVIDIA CUDA devices, sets the heap size limit for <code class="docutils literal notranslate"><span class="pre">malloc()</span></code> when called on the device.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_CUDA_MAX_L2_FETCH_GRANULARITY</p></td> <td><p>For NVIDIA CUDA devices, sets the maximum L2 cache fetch granularity size in bytes. A valid value is an integer between 0 and 128.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_CUDA_MEMALLOCASYNC</p></td> <td><p>For NVIDIA CUDA devices, when set to a non-zero integer value, enables CUDA asynchronous memory allocations from the default CUDA memory pool as described in the <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MALLOC__ASYNC.html">CUDA Toolkit Documentation</a>. By default, an internal NVIDIA HPC Runtime memory pool is used instead.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_CUDA_MEMALLOCASYNC_POOLSIZE</p></td> <td><p>For NVIDIA CUDA devices, sets the size of the default CUDA memory pool for asynchronous allocations if the <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_ACC_CUDA_MEMALLOCASYNC</span></code> environment variable is also set to a non-zero integer value.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_CUDA_NOCOPY</p></td> <td><p>Disables the use of the pinned buffer when transferring user data between host and NVIDIA CUDA devices. 
When this variable is set to a non-zero integer value, user data will be transferred directly bypassing the pinned buffer. Asynchronous execution of such data transfers can be limited when this setting is in effect.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_CUDA_PIN</p></td> <td><p>For NVIDIA CUDA devices, enables host memory pinning at data directives. When host memory is pinned, data transfers to and from the device can be asynchronous, which can potentially improve program performance. A non-zero integer value enables this mechanism. A value of <code class="docutils literal notranslate"><span class="pre">2</span></code> or greater additionally disallows unpinning the host data after it is pinned. A value of <code class="docutils literal notranslate"><span class="pre">3</span></code> or greater also enables pinning the whole array referenced in a data directive (provided that the size of the array is known), rather than its subarray specified in the data directive. By default, host data referenced at data directives is not pinned unless directed by the compiler at compile-time; refer to <a class="reference internal" href="#gpu-mem-flags"><span class="std std-ref">Command-line Options Selecting Compiler Memory Modes</span></a> for more information about the compile-time memory modes.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_CUDA_PINSIZE</p></td> <td><p>For NVIDIA CUDA devices, sets the host memory pinning granularity. If host memory pinning is enabled with the <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_ACC_CUDA_PIN</span></code> environment variable, the runtime will attempt to use this setting to pin larger regions of memory at once, thus potentially lowering the cost of pinning memory when the program needs to pin multiple data regions separately. The maximum allowed value is 1 MB. 
By default, single byte pinning granularity is used.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_CUDA_PRINTFIFOSIZE</p></td> <td><p>For NVIDIA CUDA devices, sets the buffer size for formatted output calls on device. In particular, it controls the buffer size for the <code class="docutils literal notranslate"><span class="pre">printf</span></code> C function.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_CUDA_STACKSIZE</p></td> <td><p>For NVIDIA CUDA devices, sets the stack size limit for device threads.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_DEV_MEMORY</p></td> <td><p>For NVIDIA CUDA devices, when set to a valid non-zero size value, enables the use of a device memory pool and sets its size. By default, the device memory pool is not used.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_MEM_MANAGE</p></td> <td><p>For NVIDIA CUDA devices, when set to the integer value 0, disables the use of an internal device memory manager. By default, the device memory manager is enabled. It maintains a list of deallocated chunks of device memory in an attempt to efficiently reuse them for future allocations.</p></td> </tr> </tbody> </table> <span class="target" id="acc-mem-unified"></span></section> </section> <section id="managed-and-unified-memory-modes"> <h3><span class="section-number">5.4.2. </span>Managed and Unified Memory Modes<a class="headerlink" href="#managed-and-unified-memory-modes" title="Permalink to this headline"></a></h3> <p>The NVIDIA HPC Compilers support interoperability with <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming">CUDA Unified Memory</a>. This feature is available with the x86-64 and Arm Server compilers. 
Unified memory provides a single address space for CPU and GPU; data movement between CPU and GPU memories is implicitly handled by the NVIDIA CUDA driver.</p> <p>Whenever data is accessed on the CPU or the GPU, it could trigger a data transfer if the last time it was accessed was not on the same device. In some cases, page thrashing may occur and impact performance. An introduction to CUDA Unified Memory is available on <a class="reference external" href="https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners">Parallel Forall</a>.</p> <section id="managed-memory-mode"> <h4><span class="section-number">5.4.2.1. </span>Managed Memory Mode<a class="headerlink" href="#managed-memory-mode" title="Permalink to this headline"></a></h4> <p>In Managed Memory Mode, all Fortran, C++ and C explicit allocation statements (e.g. <code class="docutils literal notranslate"><span class="pre">allocate</span></code>, <code class="docutils literal notranslate"><span class="pre">new</span></code>, and <code class="docutils literal notranslate"><span class="pre">malloc</span></code>, respectively) in a program unit are replaced by equivalent CUDA managed data allocation calls that place the data in CUDA Managed Memory. The result is that OpenACC and OpenMP data clauses and directives are not needed to manage data movement. They are essentially ignored and can be omitted. For Stdpar this is the minimal required memory mode since there are no specific annotations for data used in the parallel region.</p> <p>To enable Managed Memory Mode, add the option <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> to the compiler and linker command lines.</p> <p>When a program allocates managed memory, it allocates host pinned memory as well as device memory thus making allocate and free operations somewhat more expensive and data transfers somewhat faster. 
A memory pool allocator is used to mitigate the overhead of the allocate and deallocate operations. More details can be found in <a class="reference internal" href="#gpu-mem-poolallocator"><span class="std std-ref">Memory Pool Allocator</span></a>.</p> <p>Managed Memory Mode has the following limitations:</p> <ul class="simple"> <li><p>Use of managed memory applies only to dynamically-allocated data.</p></li> <li><p>Given an allocatable aggregate with a member that points to local, global, or static data, compiling with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> and attempting to access memory through that pointer from the compute kernel will cause a failure at runtime.</p></li> <li><p>C++ virtual functions are not supported.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> compiler option must be used to compile the files in which variables (accessed from GPU) are allocated, even if there is no code to accelerate on the GPU in the source file.</p></li> <li><p>When linking multiple translation units, the application must ensure that all data are deallocated using the scheme corresponding to their allocation. For example if the data are allocated in managed memory the deallocation must be performed using CUDA API calls for managed memory. 
More details, along with the additional compiler support available, are described in <a class="reference internal" href="#gpu-mem-intercept"><span class="std std-ref">Interception of Deallocations</span></a>.</p></li> </ul> <p>Managed Memory Mode has the following additional limitations when used with NVIDIA Kepler GPUs:</p> <ul class="simple"> <li><p>Data motion on Kepler GPUs is achieved through fast pinned asynchronous data transfers; from the program’s perspective, however, the transfers are synchronous.</p></li> <li><p>The NVIDIA HPC Compiler Runtime enforces synchronous execution of kernels when <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> is used on a system with a Kepler GPU. This situation may result in slower performance because of the extra synchronizations and decreased overlap between CPU and GPU.</p></li> <li><p>The total amount of managed memory is limited to the amount of available device memory on Kepler GPUs.</p></li> </ul> <p><strong>Memory Allocations/Deallocations Automatically Changed to Managed Memory</strong></p> <p>When the compiler utilizes the CUDA Managed Memory capability, either with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> or <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>, the following explicit allocations/deallocations are automatically changed into <code class="docutils literal notranslate"><span class="pre">cudaMallocManaged</span></code>/<code class="docutils literal notranslate"><span class="pre">cudaFree</span></code>-type allocations/deallocations:</p> <ul> <li><p>For C++:</p> <ul> <li><p>All calls to global <code class="docutils literal notranslate"><span class="pre">operator</span> <span class="pre">new</span></code> and <code class="docutils literal notranslate"><span class="pre">operator</span> <span class="pre">delete</span></code> that allocate or deallocate memory, such as:</p> <div class="highlight-c++ 
notranslate"><div class="highlight"><pre><span></span><span class="k">operator</span><span class="w"> </span><span class="k">new</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">size</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">new</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">nothrow_t</span><span class="w"> </span><span class="o">&amp;</span><span class="n">nothrow_value</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">new</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="w"> </span><span class="n">align</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">new</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="w"> </span><span class="n">align</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">nothrow_t</span><span class="w"> </span><span 
class="o">&amp;</span><span class="n">nothrow_value</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">delete</span><span class="p">(</span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">p</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">delete</span><span class="p">(</span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">size</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">delete</span><span class="p">(</span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="w"> </span><span class="n">align</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">delete</span><span class="p">(</span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="w"> </span><span class="n">align</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">delete</span><span class="p">(</span><span 
class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">nothrow_t</span><span class="w"> </span><span class="o">&amp;</span><span class="n">nothrow_value</span><span class="p">)</span><span class="w"></span> <span class="k">operator</span><span class="w"> </span><span class="k">delete</span><span class="p">(</span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="w"> </span><span class="n">align</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">nothrow_t</span><span class="w"> </span><span class="o">&amp;</span><span class="n">nothrow_value</span><span class="p">)</span><span class="w"></span> </pre></div> </div> </li> <li><p>All the array forms of the above overloads.</p></li> <li><p>All calls to <code class="docutils literal notranslate"><span class="pre">malloc</span></code>/<code class="docutils literal notranslate"><span class="pre">free</span></code> functions.</p></li> </ul> </li> <li><p>For C: all calls to <code class="docutils literal notranslate"><span class="pre">malloc</span></code>/<code class="docutils literal notranslate"><span class="pre">free</span></code> functions.</p></li> <li><p>For Fortran:</p> <ul class="simple"> <li><p>All allocations of automatic arrays.</p></li> <li><p>all <code class="docutils literal notranslate"><span class="pre">allocate</span></code>/<code class="docutils literal notranslate"><span class="pre">deallocate</span></code> statements with allocatable arrays or pointer variables.</p></li> </ul> </li> </ul> 
</section> <section id="unified-memory-mode"> <h4><span class="section-number">5.4.2.2. </span>Unified Memory Mode<a class="headerlink" href="#unified-memory-mode" title="Permalink to this headline"></a></h4> <p>In Unified Memory Mode, the requirements for the program are further relaxed compared to Managed Memory Mode. Specifically, not only is dynamically allocated system memory accessible on the GPU, but global and local memory are also accessible.</p> <p>To enable this feature, add the option <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> to the compiler and linker command lines.</p> <p>Programs compiled with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> must be run on systems that support full CUDA Unified Memory capability. At this time, full CUDA Unified Memory is supported on NVIDIA Grace Hopper Superchip systems and Linux x86-64 systems running with the Heterogeneous Memory Management (HMM) feature enabled in the Linux kernel. Details about these platforms are available in the following blog posts on the NVIDIA website: <a class="reference external" href="https://developer.nvidia.com/blog/simplifying-gpu-programming-for-hpc-with-the-nvidia-grace-hopper-superchip">Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip</a> and <a class="reference external" href="https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management">Simplifying GPU Application Development with Heterogeneous Memory Management</a>.</p> <p>In Unified Memory Mode, the compiler assumes that any system memory is accessible on the GPU. Even so, the compiler may generate managed memory allocations for explicit data allocations when it considers them beneficial for program performance. 
If you would like to enforce or prohibit the use of managed memory for dynamic allocations pass <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified:[no]managedalloc</span></code> to compilation and linking.</p> <p>Unified Memory Mode has the following limitations:</p> <ul class="simple"> <li><p>Unified memory support for OpenACC, OpenMP and Stdpar Fortran is not mix-and-match; all object files containing OpenACC/OpenMP directives or Fortran <code class="docutils literal notranslate"><span class="pre">DO</span> <span class="pre">CONCURRENT</span></code> constructs must be compiled and linked with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> to ensure correct execution.</p></li> <li><p>C++ virtual functions are not supported.</p></li> </ul> <p><strong>Transitioning to Unified Memory Mode</strong></p> <p>Applications transitioning to architectures that support Unified Memory Mode can be recompiled with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> without any code modifications.</p> <p>The programmer should be aware that in Unified Memory Mode, the whole program state becomes essentially shared between the CPU and the GPU. By implication, modifications to program variables made on the GPU are visible on the CPU. That is, the GPU does not operate on a copy of the data even if the program contains respective directives, but instead the GPU operates directly on the data in system memory. 
To understand the importance of this idea, consider the following OpenACC C program:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">N</span><span class="p">];</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">foo</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="cp">#pragma acc enter data create(x[0:N])</span> <span class="w"> </span><span class="cp">#pragma acc parallel loop</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>When compiled in Separate Memory Mode, in the <code class="docutils literal notranslate"><span class="pre">foo()</span></code> function a copy of the array <code class="docutils literal notranslate"><span class="pre">x</span></code> is created in GPU memory and initialized as 
written in the <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct. When <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> is added, however, the compiler ignores the <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">enter</span> <span class="pre">data</span></code> construct, and the <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct initializes the array <code class="docutils literal notranslate"><span class="pre">x</span></code> in system memory.</p> <p>Another implication to be aware of: <em>asynchronous</em> code execution on the GPU can introduce race conditions over access to program data. More details about code patterns to avoid when writing application sources for Unified Memory Mode can be found in the sections of this guide covering the specific programming models, e.g., OpenACC, OpenMP, or CUDA Fortran.</p> <span class="target" id="gpu-mem-poolallocator"></span></section> </section> <section id="memory-pool-allocator"> <h3><span class="section-number">5.4.3. </span>Memory Pool Allocator<a class="headerlink" href="#memory-pool-allocator" title="Permalink to this headline"></a></h3> <p>Dynamic memory allocations may be made using <code class="docutils literal notranslate"><span class="pre">cudaMallocManaged()</span></code>, a routine that has higher overhead than allocating non-managed memory using <code class="docutils literal notranslate"><span class="pre">cudaMalloc()</span></code>. 
The more calls to <code class="docutils literal notranslate"><span class="pre">cudaMallocManaged()</span></code>, the more significant the impact on performance.</p> <p>To mitigate the overhead of <code class="docutils literal notranslate"><span class="pre">cudaMallocManaged()</span></code> or other CUDA allocation API calls, there is a pool allocator enabled by default in the presence of the <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code>, <code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate:pinnedalloc</span></code>, or <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> compiler options. It can be disabled, or its behavior modified, using these environment variables:</p> <table class="table-no-stripes docutils align-default" id="id25"> <caption><span class="caption-text">Table 14. Pool Allocator Environment Variables</span><a class="headerlink" href="#id25" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 9%" /> <col style="width: 91%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Environment Variable</p></th> <th class="head"><p>Use</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>NVCOMPILER_ACC_POOL_ALLOC</p></td> <td><p>Disable the pool allocator. The pool allocator is enabled by default; to disable it, set NVCOMPILER_ACC_POOL_ALLOC to 0.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_POOL_SIZE</p></td> <td><p>Set the size of the pool. The default size is 1GB, but other sizes (e.g., 2GB, 100MB, or 500KB) can be used. The actual pool size is set to the nearest smaller number in the Fibonacci series relative to the provided or default size. 
If necessary, the pool allocator will add more pools, but only up to the NVCOMPILER_ACC_POOL_THRESHOLD value.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_POOL_ALLOC_MAXSIZE</p></td> <td><p>Set the maximum size for allocations. The default maximum size for allocations is 500MB, but another size (e.g., 100KB, 10MB, or 250MB) can be used as long as it is greater than or equal to 16B.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_POOL_ALLOC_MINSIZE</p></td> <td><p>Set the minimum size for allocation blocks. The default size is 128B, but other sizes can be used. The size must be greater than or equal to 16B.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_POOL_THRESHOLD</p></td> <td><p>Set the percentage of total device memory that the pool allocator can occupy. Values from 0 to 100 are accepted. The default value is 50, corresponding to 50% of device memory.</p></td> </tr> </tbody> </table> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Wherever a size is specified, if the unit suffix (B, KB, MB, or GB) is omitted, the value is interpreted in bytes.</p> </div> <span class="target" id="gpu-mem-intercept"></span></section> <section id="interception-of-deallocations"> <h3><span class="section-number">5.4.4. </span>Interception of Deallocations<a class="headerlink" href="#interception-of-deallocations" title="Permalink to this headline"></a></h3> <p>While the NVIDIA HPC Compilers facilitate the use of managed or pinned memory automatically, the application must ensure that memory is deallocated using the API which “matches” the API used to allocate said memory. 
For example, if <code class="docutils literal notranslate"><span class="pre">cudaMallocManaged</span></code> is used to allocate, then <code class="docutils literal notranslate"><span class="pre">cudaFree</span></code> must be used to deallocate; if <code class="docutils literal notranslate"><span class="pre">cudaMallocHost</span></code> is used for allocations, <code class="docutils literal notranslate"><span class="pre">cudaFreeHost</span></code> must be used for deallocations. Understanding this requirement is particularly important when third-party or standard libraries are used; these libraries may have been compiled without any memory mode settings, in which case the deallocation routines in those libraries may not match the allocations made elsewhere in the application. When data is deallocated with a mismatched API call, the application may exhibit undefined behavior, including crashing. To mitigate this issue, the compiler supports an interception mode in which calls to the standard deallocation function (e.g. <code class="docutils literal notranslate"><span class="pre">free</span></code> in C, <code class="docutils literal notranslate"><span class="pre">delete</span></code> in C++, or <code class="docutils literal notranslate"><span class="pre">deallocate</span></code> in Fortran) are inspected by the runtime and, if the memory is not detected as being system-allocated, the runtime replaces the standard deallocation function with the deallocation API corresponding to the allocation scheme in use. To activate this interception mode, use the <code class="docutils literal notranslate"><span class="pre">-gpu=interceptdeallocations</span></code> compiler flag. The interception is enabled by default for Stdpar in the presence of managed memory allocations. To deactivate the interception, use the <code class="docutils literal notranslate"><span class="pre">-gpu=nointerceptdeallocations</span></code> compiler switch. This interception can incur extra runtime overhead.</p> <span class="target" id="gpu-mem-flags"></span></section> <section id="command-line-options-selecting-compiler-memory-modes"> <h3><span class="section-number">5.4.5. 
</span>Command-line Options Selecting Compiler Memory Modes<a class="headerlink" href="#command-line-options-selecting-compiler-memory-modes" title="Permalink to this headline"></a></h3> <p>The following table maps the new memory model flags to their deprecated equivalents.</p> <table class="table-no-stripes colwidths-given docutils align-default" id="id26"> <caption><span class="caption-text">Table 15. Command-line Options Corresponding to Compiler Memory Modes</span><a class="headerlink" href="#id26" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 29%" /> <col style="width: 29%" /> <col style="width: 43%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Current Flags</p></th> <th class="head"><p>Deprecated Flags</p></th> <th class="head"><p>Brief Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=managed</span></code></p></td> <td><p>Managed Memory Mode</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span> <span class="pre">-stdpar</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=nounified</span> <span class="pre">-stdpar</span></code></p></td> <td><p>Managed Memory Mode</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=unified</span></code></p></td> <td><p>Unified Memory Mode</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified:managedalloc</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=unified,managed</span></code></p></td> 
<td><p>Unified Memory Mode, all dynamically allocated data are implicitly in CUDA Managed Memory.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified:nomanagedalloc</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=unified,nomanaged</span></code></p></td> <td><p>Unified Memory Mode, CUDA Managed Memory is not used implicitly.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=nomanaged</span></code></p></td> <td><p>Separate Memory Mode</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=nounified</span></code></p></td> <td><p>Separate Memory Mode</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=nomanaged,nounified</span></code></p></td> <td><p>Separate Memory Mode</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate:pinnedalloc</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-gpu=pinned</span></code></p></td> <td><p>Separate Memory Mode, dynamically allocated data are in CPU pinned memory implicitly.</p></td> </tr> </tbody> </table> </section> </section> <section id="fortran-pointers-in-device-code"> <h2><span class="section-number">5.5. 
</span>Fortran pointers in device code<a class="headerlink" href="#fortran-pointers-in-device-code" title="Permalink to this headline"></a></h2> <p>A Fortran pointer variable is implemented with a pointer and a descriptor, where the descriptor (often called a “dope vector”) holds the array bounds and strides for each dimension, among other information, such as the size for each element and whether the pointer is associated. A Fortran scalar pointer has no bounds information, but does have a minimal descriptor. In Fortran, referring to the pointer variable always refers to the pointer target. There is no syntax to explicitly refer to the pointer and descriptor that implement the pointer variable.</p> <p>Fortran allocatable arrays and variables are implemented much the same way as pointer arrays and variables. Much of the discussion below applies both to allocatables and pointers.</p> <p>In OpenACC and OpenMP, when a pointer variable reference appears in a data clause, it’s the pointer target that gets allocated or moved to device memory. The pointer and descriptor are neither allocated nor moved.</p> <p>When a pointer variable is declared in a module declaration section and appears in an <code class="docutils literal notranslate"><span class="pre">!$acc</span> <span class="pre">declare</span> <span class="pre">create()</span></code> or <code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">declare</span> <span class="pre">target</span> <span class="pre">to()</span></code> directive, then the pointer and descriptor are statically allocated in device memory. When the pointer variable appears in a data clause, the pointer target is allocated or copied to the device, and the pointer and descriptor are ‘attached’ to the device copy of the data. If the pointer target is already present in device memory, no new memory is allocated or copied, but the pointer and descriptor are still ‘attached’, making the pointer valid in device memory. 
An important side effect of adding <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">create</span></code> in the module declaration section is that when the program executes an ‘allocate’ statement for the pointer (or allocatable), memory is allocated in both CPU and device memory. This means the newly allocated data is already present in device memory. To get values from CPU to device memory or back, you’ll have to use <code class="docutils literal notranslate"><span class="pre">update</span></code> directives.</p> <p>When a pointer variable is used in an OpenACC or OpenMP compute construct, the compiler creates a private copy of the pointer and descriptor for each thread, unless the pointer variable was in a module as described above. The private pointer and descriptor will contain information about the device copy of the pointer target. In the compute construct, the pointer variables may be used pretty much as they can in host code outside a compute construct. However, there are some limitations. The program can do a pointer assignment to the pointer, changing the pointer, but that will only change the private pointer for that thread. The modified pointer in the compute construct will not change the corresponding pointer and descriptor in host memory.</p> </section> <section id="calling-routines-in-a-compute-kernel"> <h2><span class="section-number">5.6. </span>Calling routines in a compute kernel<a class="headerlink" href="#calling-routines-in-a-compute-kernel" title="Permalink to this headline"></a></h2> <p>Using explicit interfaces is a common occurrence when writing Fortran applications. 
Here are some cases where doing so is required for GPU programming.</p> <ul class="simple"> <li><p>Explicit interfaces are required when using OpenACC <code class="docutils literal notranslate"><span class="pre">routine</span> <span class="pre">bind</span></code> or OpenMP <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">variant</span></code>.</p></li> <li><p>Fortran <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">concurrent</span></code> requires routines to be <code class="docutils literal notranslate"><span class="pre">pure</span></code>, which creates the need for an explicit interface.</p></li> </ul> </section> <section id="supported-processors-and-gpus"> <h2><span class="section-number">5.7. </span>Supported Processors and GPUs<a class="headerlink" href="#supported-processors-and-gpus" title="Permalink to this headline"></a></h2> <p>This NVIDIA HPC Compilers release supports x86-64 and Arm Server CPUs. Cross-compilation across the different families of CPUs is not supported, but you can use the <code class="docutils literal notranslate"><span class="pre">-⁠tp=&lt;target&gt;</span></code> flag as documented in the man pages to specify a target processor within a family.</p> <p>To direct the compilers to generate code for NVIDIA GPUs, use the <code class="docutils literal notranslate"><span class="pre">-⁠acc</span></code> flag to enable OpenACC directives, the <code class="docutils literal notranslate"><span class="pre">-⁠mp=gpu</span></code> flag to enable OpenMP directives, the <code class="docutils literal notranslate"><span class="pre">-⁠stdpar</span></code> flag for standard language parallelism, and the <code class="docutils literal notranslate"><span class="pre">-cuda</span></code> flag for CUDA Fortran. Use the <code class="docutils literal notranslate"><span class="pre">-⁠gpu</span></code> flag to select specific options for GPU code generation. 
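</p> <p>For example, each programming model is enabled by its own flag; the following command lines are illustrative, with placeholder source file names:</p>

```text
nvc -acc app.c           # OpenACC in C
nvc -mp=gpu app.c        # OpenMP target offloading in C
nvc++ -stdpar app.cpp    # C++ standard language parallelism
nvfortran -cuda app.cuf  # CUDA Fortran
```

<p>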
You can then use the generated code on any supported system with CUDA installed that has a CUDA-enabled GeForce, Quadro, or Tesla card.</p> <p>For more information on these flags as they relate to accelerator technology, refer to <a class="reference internal" href="#acc-cmdln-opts"><span class="std std-ref">Compiling an OpenACC Program</span></a>.</p> <p>For a complete list of supported CUDA GPUs, refer to the NVIDIA website at: <a class="reference external" href="http://www.nvidia.com/object/cuda_learn_products.html">http://www.nvidia.com/object/cuda_learn_products.html</a></p> <span class="target" id="cuda-toolkit-versions"></span></section> <section id="cuda-versions"> <h2><span class="section-number">5.8. </span>CUDA Versions<a class="headerlink" href="#cuda-versions" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC compilers use components from NVIDIA’s CUDA Toolkit to build programs for execution on an NVIDIA GPU. The NVIDIA HPC SDK puts the CUDA Toolkit components into an HPC SDK installation sub-directory; the HPC SDK currently bundles two versions of recently-released Toolkits.</p> <p>You can compile a program for an NVIDIA GPU on any system supported by the HPC compilers. You will be able to run that program only on a system with an NVIDIA GPU and an installed NVIDIA CUDA driver. NVIDIA HPC SDK products do not contain CUDA device drivers. You must download and install the appropriate <a class="reference external" href="http://www.nvidia.com/cuda">CUDA Driver from NVIDIA</a>.</p> <p>The NVIDIA HPC SDK utility <code class="docutils literal notranslate"><span class="pre">nvaccelinfo</span></code> prints the driver version as its first line of output. 
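</p> <p>For example (the version number shown is illustrative; yours will reflect the installed driver):</p>

```text
$ nvaccelinfo | head -1
CUDA Driver Version: 12040
```

<p>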
You can use it to find out which version of the CUDA Driver is installed on your system.</p> <p>The NVIDIA HPC SDK 25.1 includes components from the following versions of the CUDA Toolkit:</p> <ul class="simple"> <li><p>CUDA 11.8</p></li> <li><p>CUDA 12.4</p></li> </ul> <p>If you are compiling a program for GPU execution on a system <em>without</em> an installed CUDA driver, the compiler selects the version of the CUDA Toolkit to use based on the value of the <code class="docutils literal notranslate"><span class="pre">DEFCUDAVERSION</span></code> variable contained in a file called <code class="docutils literal notranslate"><span class="pre">localrc</span></code>, which is created during installation of the HPC SDK.</p> <p>If you are compiling a program for GPU execution on a system <em>with</em> an installed CUDA driver, the compiler detects the version of the CUDA driver and selects the appropriate CUDA Toolkit version to use from those bundled with the HPC SDK.</p> <p>The compilers look for a CUDA Toolkit version in the /opt/nvidia/hpc_sdk/<em>target</em>/25.1/cuda directory that matches the version of the CUDA Driver installed on the system. If an exact match is not found, the compiler searches for the closest match. For CUDA Driver versions 11.2 through 11.8, the compiler will use the CUDA 11.8 Toolkit. For CUDA Driver versions 12.0 and later, the compiler will use the newest CUDA 12.x Toolkit.</p> <p>You can change the compiler’s default selection of CUDA Toolkit version using a compiler option. Add the <code class="docutils literal notranslate"><span class="pre">cudaX.Y</span></code> sub-option to <code class="docutils literal notranslate"><span class="pre">-gpu</span></code> where <code class="docutils literal notranslate"><span class="pre">X.Y</span></code> denotes the CUDA version. Using a compiler option changes the CUDA Toolkit version for one invocation of the compiler. 
For example, to compile an OpenACC C file with the CUDA 11.8 Toolkit you would use:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc -acc -gpu=cuda11.8 </pre></div> </div> <span class="target" id="compute-cap"></span></section> <section id="compute-capability"> <h2><span class="section-number">5.9. </span>Compute Capability<a class="headerlink" href="#compute-capability" title="Permalink to this headline"></a></h2> <p>The compilers can generate code for NVIDIA GPU compute capabilities 3.5 through 8.6. The compilers construct a default list of compute capabilities that matches the compute capabilities supported by the GPUs found on the system used in compilation. If there are no GPUs detected, the compilers generate code for every supported compute capability.</p> <p>You can override the default by specifying one or more compute capabilities using either command-line options or an <code class="docutils literal notranslate"><span class="pre">rcfile</span></code>.</p> <p>To change the default with a command-line option, provide a comma-separated list of compute capabilities to the <code class="docutils literal notranslate"><span class="pre">-gpu</span></code> option.</p> <p>To change the default with an <code class="docutils literal notranslate"><span class="pre">rcfile</span></code>, set the <code class="docutils literal notranslate"><span class="pre">DEFCOMPUTECAP</span></code> value to a blank-separated list of compute capabilities in the siterc file located in your installation’s bin directory:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>set DEFCOMPUTECAP=60 70; </pre></div> </div> <p>Alternatively, if you don’t have permissions to change the <code class="docutils literal notranslate"><span class="pre">siterc</span></code> file, you can add the <code class="docutils literal notranslate"><span class="pre">DEFCOMPUTECAP</span></code> definition to a separate <code class="docutils 
literal notranslate"><span class="pre">.mynvrc</span></code> file in your home directory.</p> <p>The generation of device code can be time consuming, so you may notice an increase in compile time as the number of compute capabilities increases.</p> </section> <section id="ptx-jit-compilation"> <h2><span class="section-number">5.10. </span>PTX JIT Compilation<a class="headerlink" href="#ptx-jit-compilation" title="Permalink to this headline"></a></h2> <p>As of HPC SDK 22.9, support for PTX JIT compilation is enabled in all compilers for relocatable device code mode. This means that applications built with <code class="docutils literal notranslate"><span class="pre">-gpu=rdc</span></code> (that is, with relocatable device code enabled, which is the default mode) are forward-compatible with newer GPUs thanks to the embedded PTX code. The embedded PTX code is dynamically compiled when the application runs on a GPU architecture newer than the architecture specified at compile time.</p> <p>The support for PTX JIT compilation is enabled automatically, which means that you do not need to change the compiler invocation command lines for your existing projects.</p> <p class="title sectiontitle rubric" id="use-scenarios">Use scenarios</p> <ul class="simple"> <li><p>As an example, you can compile your application targeting the Ampere GPU without having to worry about the Hopper GPU architecture. Once the application runs on a Hopper GPU, it will seamlessly use the embedded PTX code.</p></li> <li><p>In CUDA Fortran, or with the CUDA Interoperability mode enabled, you can mix in object files compiled with the CUDA NVCC compiler containing PTX code. This PTX code from NVCC will be handled by the JIT compiler alongside the PTX code contained in object files produced by the HPC SDK compilers. 
When using the CUDA NVCC compiler, the relocatable device code generation must be enabled explicitly using the NVCC <code class="docutils literal notranslate"><span class="pre">--relocatable-device-code</span></code> true switch, as explained in the <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda">CUDA Compiler Driver guide</a>. More information is available in the <a class="reference internal" href="#openmp-interop-cuda"><span class="std std-ref">Interoperability with CUDA</span></a> section of this guide and in the <a class="reference external" href="../cuda-fortran-prog-guide/index.html">CUDA Fortran Programming Guide</a>.</p></li> </ul> <p>By default, the compiler will choose the compute capability that matches the GPU on the system where the code is being compiled. For code that is going to run on the system where it is compiled, we recommend letting the compiler set the compute capability.</p> <p>When the default won’t work, we recommend compiling applications for a range of compute capabilities that the application is expected to run against, for example, using the <code class="docutils literal notranslate"><span class="pre">-gpu=ccall</span></code> compiler option. When running the application on a system that supports one of those compute capabilities, the CUDA driver minor version is allowed to be less than the version of the CUDA toolkit used at compile time, as covered in section <a class="reference internal" href="#cuda-toolkit-versions"><span class="std std-ref">CUDA Versions</span></a>.</p> <p class="title sectiontitle rubric" id="performance-considerations">Performance considerations</p> <p>PTX JIT compilation, when it occurs, can have a start-up overhead for the application. The JIT compiler keeps a cached copy of the produced device code, which reduces the overhead on subsequent runs. 
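</p> <p>The JIT cache can be controlled through the standard CUDA environment variables; the settings below are examples, not defaults:</p>

```text
export CUDA_CACHE_PATH=/var/tmp/ptxcache   # where compiled kernels are cached
export CUDA_CACHE_MAXSIZE=1073741824       # cache size limit in bytes (1 GiB here)
export CUDA_CACHE_DISABLE=1                # turn the JIT cache off entirely
```

<p>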
Please refer to the <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#just-in-time-compilation">CUDA Programming Guide</a> for detailed information about how the JIT compiler works.</p> <p class="title sectiontitle rubric" id="known-limitations">Known limitations</p> <p>In general, in order for PTX JIT compilation to work, the CUDA driver installed on the deployment system must be at least the version that matches the CUDA toolkit used to compile the application. This requirement is stricter than those explained in section <a class="reference internal" href="#cuda-toolkit-versions"><span class="std std-ref">CUDA Versions</span></a>.</p> <p>For example, as explained in that section, the compilers will use the CUDA 11.8 toolkit that is shipped as part of the HPC SDK toolkit when the CUDA driver installed in the system is at least 11.2. However, while the CUDA 11.2 driver is commonly sufficient to run the application, it will not be able to compile the PTX code produced by the CUDA 11.8 toolkit. This means that any deployment system where the PTX JIT compilation is expected to be used must have at least the CUDA 11.8 driver installed. Please refer to the <a class="reference external" href="https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-considerations">CUDA Compatibility</a> guide for further information about the CUDA Driver compatibility with CUDA Toolkits.</p> <p>When the application is expected to run on a newer GPU architecture than specified at compile time, we recommend having a CUDA driver installed on the deployment system matching the CUDA toolkit used to build the application. One way to achieve that is to use the <code class="docutils literal notranslate"><span class="pre">NVHPC_CUDA_HOME</span></code> environment variable at compile time to provide a specific CUDA toolkit.</p> <p>Below are a few examples of how the PTX version incompatibility can be diagnosed and fixed. 
As a general rule, if the CUDA driver is unable to run the application due to incompatible PTX, the application will terminate with an error message indicating the cause. OpenACC and OpenMP applications will in most cases suggest compiler flags to target the current CUDA installation.</p> <p class="title sectiontitle rubric" id="openacc">OpenACC</p> <p>Consider this program that we will compile for Volta GPU and attempt to run on an Ampere GPU, on a system that has CUDA 11.5 installed:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdio.h&gt;</span><span class="cp"></span> <span class="cp">#define N 1000</span> <span class="kt">int</span><span class="w"> </span><span class="n">array</span><span class="p">[</span><span class="n">N</span><span class="p">];</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="cp">#pragma acc parallel loop copy(array[0:N])</span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">3.0</span><span 
class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;Success!</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>When we build the program, HPC SDK will choose the CUDA 11.8 toolkit that is included as the default. When we attempt to run it, it fails because code generated with 11.8 does not work with the 11.5 driver:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -acc -gpu=cc70 app.c $ ./a.out Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc70 Rebuild this file with -gpu=cc80 to use NVIDIA Tesla GPU 0 File: /tmp/app.c Function: main:3 Line: 3 </pre></div> </div> <p>From the error message it follows that the system is unable to execute the Volta GPU instructions on the current system. The embedded Volta PTX could not be compiled, which implies a CUDA driver incompatibility. A way to fix this is to use the installed CUDA 11.5 toolkit at compile time:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5 $ nvc -acc -gpu=cc70 app.c $ ./a.out Success! 
</pre></div> </div> <p class="title sectiontitle rubric" id="openmp">OpenMP</p> <p>Likewise, an OpenMP program will compile but not run:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>#include &lt;stdio.h&gt; #define N 1000 int array[N]; int main() { #pragma omp target loop for(int i = 0; i &lt; N; i++) { array[i] = 0; } printf(&quot;Success!\n&quot;); } </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -mp=gpu -gpu=cc70 app.c $ ./a.out Accelerator Fatal Error: Failed to find device function &#39;nvkernel_main_F1L3_2&#39;! File was compiled with: -gpu=cc70 Rebuild this file with -gpu=cc80 to use NVIDIA Tesla GPU 0 File: /tmp/app.c Function: main:3 Line: 3 </pre></div> </div> <p>We can also fix it by having <code class="docutils literal notranslate"><span class="pre">NVHPC_CUDA_HOME</span></code> point at the matching CUDA toolkit location:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5 $ nvc -mp=gpu -gpu=cc70 app.c $ ./a.out Success! 
</pre></div> </div> <p class="title sectiontitle rubric" id="c">C++</p> <p>In contrast to OpenACC and OpenMP applications that simply terminate when PTX JIT encounters an insufficient CUDA driver version, C++ applications throw a system exception when there is a PTX incompatibility:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;vector&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;algorithm&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;execution&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;iostream&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;assert.h&gt;</span><span class="cp"></span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">-20</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">result</span><span class="w"> </span><span class="o">=</span><span class="w"> 
</span><span class="n">std</span><span class="o">::</span><span class="n">count</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="mi">-20</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">assert</span><span class="p">(</span><span class="n">result</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;Success!&quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc++ -stdpar -gpu=cc70 app.cpp $ ./a.out terminate called after throwing an instance of &#39;thrust::system::system_error&#39; what(): after reduction step 1: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain. 
Aborted (core dumped) </pre></div> </div> <p>The exception message contains a direct reference to an incompatible PTX, which in turn implies a mismatch between the CUDA toolkit and the CUDA driver version.</p> <p>We can fix it similarly by setting <code class="docutils literal notranslate"><span class="pre">NVHPC_CUDA_HOME</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5 $ nvc++ -stdpar -gpu=cc70 app.cpp $ ./a.out Success! </pre></div> </div> <span class="target" id="acc-use"></span></section> </section> <section id="using-openacc"> <h1><span class="section-number">6. </span>Using OpenACC<a class="headerlink" href="#using-openacc" title="Permalink to this headline"></a></h1> <p>This chapter gives an overview of directive-based OpenACC programming in which compiler directives are used to specify regions of code in Fortran, C, and C++ programs to be offloaded from a <em>host</em> CPU to an NVIDIA GPU. For complete details on using OpenACC with NVIDIA GPUs, see the <a class="reference external" href="../openacc-gs/index.html">OpenACC Getting Started Guide</a>.</p> <section id="openacc-programming-model"> <h2><span class="section-number">6.1. </span>OpenACC Programming Model<a class="headerlink" href="#openacc-programming-model" title="Permalink to this headline"></a></h2> <p>With the emergence of GPU architectures in high performance computing, programmers want the ability to program using a familiar, high-level programming model that provides both high performance and portability to a wide range of computing architectures. OpenACC emerged in 2011 as a programming model that uses high-level compiler directives to expose parallelism in the code and parallelizing compilers to build the code for a variety of parallel accelerators.</p> <p>This chapter will not attempt to describe OpenACC itself. 
For that, please refer to the OpenACC specification on the <a class="reference external" href="http://www.openacc.org">www.openacc.org</a> website. Here, we will discuss differences between the OpenACC specification and its implementation by the NVIDIA HPC Compilers.</p> <p>Other resources to help you with your parallel programming, including video tutorials, course materials, code samples, a best practices guide, and more, are available on the OpenACC website.</p> <section id="levels-of-parallelism"> <h3><span class="section-number">6.1.1. </span>Levels of Parallelism<a class="headerlink" href="#levels-of-parallelism" title="Permalink to this headline"></a></h3> <p>OpenACC supports three levels of parallelism:</p> <ul class="simple"> <li><p>an outer <em>doall</em> (fully parallel) loop level</p></li> <li><p>a <em>workgroup</em> or <em>threadblock</em> (worker parallel) loop level</p></li> <li><p>an inner <em>synchronous</em> (SIMD or vector) loop level</p></li> </ul> <p>Each level can be multidimensional with 2 or 3 dimensions, but the domain must be strictly rectangular. The <em>synchronous</em> level may not be fully implemented with SIMD or vector operations, so explicit synchronization is supported and required across this level. No synchronization is supported between parallel threads across the <em>doall</em> level.</p> <p>The OpenACC execution model on the device side exposes these levels of parallelism, and the programmer is required to understand the difference between, for example, a fully parallel loop and a loop that is vectorizable but requires synchronization across iterations. All fully parallel loops can be scheduled for any of <em>doall</em>, <em>workgroup</em> or <em>synchronous</em> parallel execution, but by definition SIMD vector loops that require synchronization can only be scheduled for synchronous parallel execution.</p> </section> <section id="enable-openacc-directives"> <h3><span class="section-number">6.1.2. 
</span>Enable OpenACC Directives<a class="headerlink" href="#enable-openacc-directives" title="Permalink to this headline"></a></h3> <p>NVIDIA HPC compilers enable OpenACC directives with the <code class="docutils literal notranslate"><span class="pre">-acc</span></code> and <code class="docutils literal notranslate"><span class="pre">-gpu</span></code> command line options. For more information on these options refer to <a class="reference internal" href="#acc-cmdln-opts"><span class="std std-ref">Compiling an OpenACC Program</span></a>.</p> <p><strong>_OPENACC macro</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">_OPENACC</span></code> macro name is defined to have a value <code class="docutils literal notranslate"><span class="pre">yyyymm</span></code> where <cite>yyyy</cite> is the year and <cite>mm</cite> is the month designation of the version of the OpenACC directives supported by the implementation. For example, the version for November, 2017 is 201711. All OpenACC compilers define this macro when OpenACC directives are enabled.</p> </section> <section id="openacc-support"> <h3><span class="section-number">6.1.3. 
</span>OpenACC Support<a class="headerlink" href="#openacc-support" title="Permalink to this headline"></a></h3> <p>The NVIDIA HPC Compilers implement most features of OpenACC 2.7 as defined in <em>The OpenACC Application Programming Interface</em>, Version 2.7, November 2018, <a class="reference external" href="http://www.openacc.org">http://www.openacc.org</a>, with the exception that the following OpenACC 2.7 features are not supported:</p> <ul class="simple"> <li><p>nested parallelism</p></li> <li><p>declare link</p></li> <li><p>enforcement of the <code class="docutils literal notranslate"><span class="pre">cache</span></code> clause restriction that all references to listed variables must lie within the region being cached</p></li> <li><p>subarrays and composite variables in <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clauses</p></li> <li><p>the <code class="docutils literal notranslate"><span class="pre">self</span></code> clause</p></li> <li><p>the <code class="docutils literal notranslate"><span class="pre">default</span></code> clause on data constructs</p></li> </ul> </section> <section id="openacc-extensions"> <h3><span class="section-number">6.1.4. </span>OpenACC Extensions<a class="headerlink" href="#openacc-extensions" title="Permalink to this headline"></a></h3> <p>The NVIDIA Fortran compiler supports an extension to the <code class="docutils literal notranslate"><span class="pre">collapse</span></code> clause on the <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct. 
The OpenACC specification defines <code class="docutils literal notranslate"><span class="pre">collapse</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>collapse(n) </pre></div> </div> <p>NVIDIA Fortran supports the use of the identifier <code class="docutils literal notranslate"><span class="pre">force</span></code> within <code class="docutils literal notranslate"><span class="pre">collapse</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>collapse(force:n) </pre></div> </div> <p>Using <code class="docutils literal notranslate"><span class="pre">collapse(force:n)</span></code> instructs the compiler to enforce collapsing parallel loops that are not perfectly nested.</p> <span class="target" id="acc-cmdln-opts"></span></section> </section> <section id="compiling-an-openacc-program"> <h2><span class="section-number">6.2. </span>Compiling an OpenACC Program<a class="headerlink" href="#compiling-an-openacc-program" title="Permalink to this headline"></a></h2> <p>Several compiler options are applicable specifically when working with OpenACC. These options include <code class="docutils literal notranslate"><span class="pre">-acc</span></code>, <code class="docutils literal notranslate"><span class="pre">-gpu</span></code>, and <code class="docutils literal notranslate"><span class="pre">-Minfo</span></code>.</p> <section id="no-acc"> <h3><span class="section-number">6.2.1. </span>-[no]acc<a class="headerlink" href="#no-acc" title="Permalink to this headline"></a></h3> <p>Enable [disable] OpenACC directives. 
The following sub-options may be used following an equals sign (“=”), with multiple sub-options separated by commas:</p> <dl class="simple"> <dt>gpu</dt><dd><p>OpenACC directives are compiled for GPU execution only.</p> </dd> <dt>host</dt><dd><p>Compile for serial execution on the host CPU.</p> </dd> <dt>multicore</dt><dd><p>Compile for parallel execution on the host CPU.</p> </dd> <dt>legacy</dt><dd><p>Suppress warnings about deprecated NVIDIA accelerator directives.</p> </dd> <dt>[no]autopar</dt><dd><p>Enable [disable] loop autoparallelization within acc parallel. The default is to autoparallelize, that is, to enable loop autoparallelization.</p> </dd> <dt>[no]routineseq</dt><dd><p>Compile every routine for the device. The default behavior is to not treat every routine as a seq directive.</p> </dd> <dt>strict</dt><dd><p>Instructs the compiler to issue warnings for non-OpenACC accelerator directives.</p> </dd> <dt>sync</dt><dd><p>Ignore async clauses.</p> </dd> <dt>verystrict</dt><dd><p>Instructs the compiler to fail with an error for any non-OpenACC accelerator directive.</p> </dd> <dt>[no]wait</dt><dd><p>Wait for each device kernel to finish. Kernel launching is blocked by default unless the async clause is used.</p> </dd> </dl> <p class="title sectiontitle rubric" id="default">Default</p> <p>By default, OpenACC directives are compiled for GPU and sequential CPU host execution (i.e. 
equivalent to explicitly setting <code class="docutils literal notranslate"><span class="pre">-acc=gpu,host</span></code>).</p> <p class="title sectiontitle rubric" id="usage">Usage</p> <p>The following command-line requests that OpenACC directives be enabled and that an error be issued for any non-OpenACC accelerator directive.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -acc=verystrict prog.f </pre></div> </div> <p class="title sectiontitle rubric" id="predefined-macros">Predefined Macros</p> <p>The following macros corresponding to the target compiled for are added implicitly:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_OPENACC_GPU</span></code> when the OpenACC directives are compiled for GPU.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_OPENACC_MULTICORE</span></code> when the OpenACC directives are compiled for multicore CPU.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_OPENACC_HOST</span></code> when the OpenACC directives are compiled for serial execution on CPU.</p></li> </ul> </section> <section id="gpu"> <h3><span class="section-number">6.2.2. </span>-gpu<a class="headerlink" href="#gpu" title="Permalink to this headline"></a></h3> <p>Used in combination with the -acc, -cuda, -mp, and -stdpar flags to specify options for GPU code generation. The following sub-options may be used following an equals sign (“=”), with multiple sub-options separated by commas:</p> <dl class="simple"> <dt>autocompare</dt><dd><p>Automatically compare CPU vs GPU results at execution time: implies redundant</p> </dd> <dt>ccXY</dt><dd><p>Generate code for a device with compute capability X.Y. Multiple compute capabilities can be specified, and one version will be generated for each. By default, the compiler will detect the compute capability for each installed GPU. 
Use -help -gpu to see the valid compute capabilities for your installation.</p> </dd> <dt>ccall</dt><dd><p>Generate code for all compute capabilities supported by this platform and by the selected or default CUDA Toolkit.</p> </dd> <dt>ccall-major</dt><dd><p>Compile for all major supported compute capabilities.</p> </dd> <dt>ccnative</dt><dd><p>Detects the visible GPUs on the system and generates code for them. If no device is available, the compute capability matching NVCC’s default will be used.</p> </dd> <dt>cudaX.Y</dt><dd><p>Use CUDA X.Y Toolkit compatibility, where installed.</p> </dd> <dt>[no]debug</dt><dd><p>Enable [disable] debug information generation in device code.</p> </dd> <dt>deepcopy</dt><dd><p>Enable full deep copy of aggregate data structures in OpenACC; Fortran only.</p> </dd> <dt>fastmath</dt><dd><p>Use routines from the fast math library.</p> </dd> <dt>[no]flushz</dt><dd><p>Enable [disable] flush-to-zero mode for floating point computations on the GPU.</p> </dd> <dt>[no]fma</dt><dd><p>Generate [do not generate] fused multiply-add instructions on the GPU; default at <code class="docutils literal notranslate"><span class="pre">-O1</span></code>. This can be used in conjunction with the global <code class="docutils literal notranslate"><span class="pre">-M[no]fma</span></code> option to explicitly enable/disable FMAs on the CPU or GPU.</p> </dd> <dt>[no]implicitsections</dt><dd><p>Change [do not change] array element references in a data clause into an array section. In C++, the <code class="docutils literal notranslate"><span class="pre">implicitsections</span></code> option will change <code class="docutils literal notranslate"><span class="pre">update</span> <span class="pre">device(a[n])</span></code> to <code class="docutils literal notranslate"><span class="pre">update</span> <span class="pre">device(a[0:n])</span></code>. 
In Fortran, it will change <code class="docutils literal notranslate"><span class="pre">enter</span> <span class="pre">data</span> <span class="pre">copyin(a(n))</span></code> to <code class="docutils literal notranslate"><span class="pre">enter</span> <span class="pre">data</span> <span class="pre">copyin(a(:n))</span></code>. The default behavior, <code class="docutils literal notranslate"><span class="pre">noimplicitsections</span></code>, can also be changed using rcfiles; for example, one could add <code class="docutils literal notranslate"><span class="pre">set</span> <span class="pre">IMPLICITSECTIONS=0;</span></code> to siterc or another rcfile.</p> </dd> <dt>[no]interceptdeallocations</dt><dd><p>Intercept [do not intercept] calls to standard library memory deallocations (e.g. <code class="docutils literal notranslate"><span class="pre">free</span></code>) and call the corresponding CUDA memory deallocation version if the address is in pinned or managed memory, or the regular version otherwise.</p> </dd> <dt>keep</dt><dd><p>Keep the kernel files (.cubin, .ptx, source).</p> </dd> <dt>[no]lineinfo</dt><dd><p>Enable [disable] GPU line information generation.</p> </dd> <dt>loadcache:{L1|L2}</dt><dd><p>Choose which hardware cache level to use for global memory loads; options include the default, <code class="docutils literal notranslate"><span class="pre">L1</span></code>, or <code class="docutils literal notranslate"><span class="pre">L2</span></code>.</p> </dd> <dt>[no]managed</dt><dd><p>Allocate [do not allocate] any dynamically allocated data in CUDA Managed memory. Use <code class="docutils literal notranslate"><span class="pre">-gpu=nomanaged</span></code> with <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> to prevent that flag’s implicit use of <code class="docutils literal notranslate"><span class="pre">-gpu=managed</span></code> when CUDA Managed memory capability is detected. 
This option is deprecated.</p> </dd> <dt>maxregcount:n</dt><dd><p>Specify the maximum number of registers to use on the GPU; leaving this blank indicates no limit.</p> </dd> <dt>mem:{separate|managed|unified}</dt><dd><p>Select the GPU memory mode for the generated binary. This controls which CUDA memory capability is utilized: separate GPU memory only (<code class="docutils literal notranslate"><span class="pre">separate</span></code>), GPU Managed Memory for dynamically allocated data (<code class="docutils literal notranslate"><span class="pre">managed</span></code>), or system memory, also known as full CUDA Unified Memory (<code class="docutils literal notranslate"><span class="pre">unified</span></code>). Use of Managed or Unified Memory facilitates simpler programming by eliminating the need to identify all data that must be copied into and out of the code region executing on the GPU.</p> </dd> <dt>pinned</dt><dd><p>Use CUDA Pinned Memory. This option is deprecated.</p> </dd> <dt>ptxinfo</dt><dd><p>Print PTX info.</p> </dd> <dt>[no]rdc</dt><dd><p>Generate [do not generate] relocatable device code.</p> </dd> <dt>redundant</dt><dd><p>Redundant CPU/GPU execution.</p> </dd> <dt>safecache</dt><dd><p>Allow variable-sized array sections in cache directives; the compiler assumes they fit into CUDA shared memory.</p> </dd> <dt>sm_XY</dt><dd><p>Generate code for a device with compute capability X.Y. Multiple compute capabilities can be specified, and one version will be generated for each. By default, the compiler will detect the compute capability for each installed GPU. Use -help -gpu to see the valid compute capabilities for your installation.</p> </dd> <dt>stacklimit:&lt;l&gt;|nostacklimit</dt><dd><p>Sets the limit (l), in KB, of stack variables in a procedure or kernel. This option is deprecated.</p> </dd> <dt>tripcount:{host|device|[no]check|[no]warn}</dt><dd><p>Determine whether the trip count values for loops in compute constructs are calculated on the host (default) or the device. 
It can also be used to enable [disable] runtime checks and compile-time warnings related to using host vs. device trip count values.</p> </dd> <dt>[no]unified</dt><dd><p>Compile [do not compile] for CUDA Unified memory capability, where system memory is accessible from the GPU. This mode utilizes system and managed memory for dynamically allocated data unless explicit behavior is set through <code class="docutils literal notranslate"><span class="pre">-gpu=[no]managed</span></code>. Use <code class="docutils literal notranslate"><span class="pre">-gpu=nounified</span></code> with <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> to prevent that flag’s implicit use of <code class="docutils literal notranslate"><span class="pre">-gpu=unified</span></code> when CUDA Unified memory capability is detected. This option must appear in both the compile and link lines. This option is deprecated.</p> </dd> <dt>[no]unroll</dt><dd><p>Enable [disable] automatic inner loop unrolling; default at <code class="docutils literal notranslate"><span class="pre">-O3</span></code>.</p> </dd> <dt>zeroinit</dt><dd><p>Initialize allocated device memory with zero.</p> </dd> </dl> <p class="title sectiontitle rubric" id="usage-1">Usage</p> <p>In the following example, the compiler generates code for NVIDIA GPUs with compute capabilities 6.0 and 7.0.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -acc -gpu=cc60,cc70 myprog.f </pre></div> </div> <p>The compiler automatically invokes the necessary software tools to create the kernel code and embeds the kernels in the object file.</p> <p>To link in the appropriate GPU libraries, you must link an OpenACC program with the <code class="docutils literal notranslate"><span class="pre">-acc</span></code> flag, and similarly for -cuda, -mp, or -stdpar.</p> <p class="title sectiontitle rubric" id="dwarf-debugging-formats">DWARF Debugging Formats</p> <p>Use the <code 
class="docutils literal notranslate"><span class="pre">-g</span></code> option to enable generation of full DWARF information on both the host and device; in the absence of other optimization flags, <code class="docutils literal notranslate"><span class="pre">-g</span></code> sets the optimization level to zero. If a <code class="docutils literal notranslate"><span class="pre">-O</span></code> option raises the optimization level to one or higher, only GPU line information is generated in device code even when <code class="docutils literal notranslate"><span class="pre">-g</span></code> is specified. To enforce full DWARF generation for device code at optimization levels above zero, use the <code class="docutils literal notranslate"><span class="pre">debug</span></code> sub-option to <code class="docutils literal notranslate"><span class="pre">-gpu</span></code>. Conversely, to prevent the generation of DWARF information for device code, use the <code class="docutils literal notranslate"><span class="pre">nodebug</span></code> sub-option to <code class="docutils literal notranslate"><span class="pre">-gpu</span></code>. Both <code class="docutils literal notranslate"><span class="pre">debug</span></code> and <code class="docutils literal notranslate"><span class="pre">nodebug</span></code> can be used independently of <code class="docutils literal notranslate"><span class="pre">-g</span></code>.</p> </section> </section> <section id="openacc-for-multicore-cpus"> <h2><span class="section-number">6.3. </span>OpenACC for Multicore CPUs<a class="headerlink" href="#openacc-for-multicore-cpus" title="Permalink to this headline"></a></h2> <p>The NVIDIA OpenACC compilers support the option <code class="docutils literal notranslate"><span class="pre">-acc=multicore</span></code>, to set the target accelerator for OpenACC programs to the host multicore CPU. This will compile OpenACC compute regions for parallel execution across the cores of the host processor or processors. 
The host multicore CPU will be treated as a shared-memory accelerator, so the data clauses (<code class="docutils literal notranslate"><span class="pre">copy</span></code>, <code class="docutils literal notranslate"><span class="pre">copyin</span></code>, <code class="docutils literal notranslate"><span class="pre">copyout</span></code>, <code class="docutils literal notranslate"><span class="pre">create</span></code>) will be ignored and no data copies will be executed.</p> <p>By default, <code class="docutils literal notranslate"><span class="pre">-acc=multicore</span></code> will generate code that will use all the available cores of the processor. If the compute region specifies a value in the <code class="docutils literal notranslate"><span class="pre">num_gangs</span></code> clause, the minimum of the <code class="docutils literal notranslate"><span class="pre">num_gangs</span></code> value and the number of available cores will be used. At runtime, the number of cores can be limited by setting the environment variable <code class="docutils literal notranslate"><span class="pre">ACC_NUM_CORES</span></code> to a constant integer value. The number of cores can also be set with the <code class="docutils literal notranslate"><span class="pre">void</span> <span class="pre">acc_set_num_cores(int</span> <span class="pre">numcores)</span></code> runtime call. If an OpenACC compute construct appears lexically within an OpenMP parallel construct, the OpenACC compute region will generate sequential code. 
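</p> <p>For example, one might compile for the multicore target and then cap the core count at run time through <code class="docutils literal notranslate"><span class="pre">ACC_NUM_CORES</span></code>; the source and executable names below are placeholders:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -acc=multicore myprog.f90
$ ACC_NUM_CORES=8 ./a.out
</pre></div> </div> <p>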
If an OpenACC compute region appears dynamically within an OpenMP region or another OpenACC compute region, the program may generate many more threads than there are cores, and may produce poor performance.</p> <p>The <code class="docutils literal notranslate"><span class="pre">-acc=multicore</span></code> option differs from the <code class="docutils literal notranslate"><span class="pre">-acc=host</span></code> option in that <code class="docutils literal notranslate"><span class="pre">-acc=host</span></code> generates sequential host CPU code for the OpenACC compute regions.</p> <span class="target" id="acc-openacc-unified-mem"></span></section> <section id="openacc-with-cuda-unified-memory"> <h2><span class="section-number">6.4. </span>OpenACC with CUDA Unified Memory<a class="headerlink" href="#openacc-with-cuda-unified-memory" title="Permalink to this headline"></a></h2> <p>When developing OpenACC source for a target supporting CUDA Unified Memory, you can take advantage of a simplified approach to programming: data clauses and directives can be omitted, in full or in part, depending on the exact memory capability the target supports and the compiler options used.</p> <p>The discussion in this section assumes you are familiar with the Separate, Managed, and Unified Memory Modes covered in the <a class="reference internal" href="#acc-mem-model"><span class="std std-ref">Memory Model</span></a> and <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a> sections.</p> <p>In Managed Memory Mode, only dynamically allocated data is implicitly managed by the CUDA runtime; OpenACC data clauses and directives are therefore not needed for movement of this “managed” data. 
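</p> <p>As a minimal sketch of the distinction (hypothetical names; assuming compilation with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code>), the heap-allocated buffer below needs no data clause, while the static array still does:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>static double table[1000];     /* static data: still needs a data clause */

void scale(double *v, int n)   /* v assumed to come from malloc: managed automatically */
{
    #pragma acc parallel loop copyin(table[0:n])
    for (int i = 0; i &lt; n; i++)
        v[i] = v[i] * table[i];
}
</pre></div> </div> <p>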
Data clauses and directives are still required to handle static data (C static and extern variables, Fortran module, common block and save variables) and function local data.</p> <p>In Unified Memory Mode, all data is managed by the CUDA runtime. Explicit data clauses and directives are no longer required to indicate which data should reside in GPU memory. All variables are accessible from the OpenACC compute regions executing on the GPU. The NVHPC compiler implementation closely adheres to the shared memory mode detailed in the OpenACC specification, meaning that <code class="docutils literal notranslate"><span class="pre">copy</span></code>, <code class="docutils literal notranslate"><span class="pre">copyin</span></code>, <code class="docutils literal notranslate"><span class="pre">copyout</span></code>, and <code class="docutils literal notranslate"><span class="pre">create</span></code> clauses will not result in any device allocation or data transfer. The <code class="docutils literal notranslate"><span class="pre">device_resident</span></code> clause is still honored as in discrete memory mode and results in an allocation of data only accessible from device code. Device memory can also be allocated or deallocated in OpenACC programs in Unified Memory Mode by using the <code class="docutils literal notranslate"><span class="pre">acc_malloc</span></code> or <code class="docutils literal notranslate"><span class="pre">acc_free</span></code> API calls.</p> <p><strong>Understanding Data Movement</strong></p> <p>In the absence of visible data clauses or directives, when the compiler encounters a compute construct it attempts to determine what data is required for correct execution of the region on the GPU. 
When the compiler is unable to determine the size and shape of data needing to be accessible on the device, it behaves as follows:</p> <ul class="simple"> <li><p>In Separate Memory Mode, the compiler emits an error requesting an explicit data clause be added to specify size/shape of the data to be copied.</p></li> <li><p>In Managed Memory Mode (<code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code>), the compiler assumes the data is allocated in managed memory and thus is accessible from the device; if this assumption is wrong, for example because the data was defined globally or is located on the CPU stack, the program may fail at runtime.</p></li> <li><p>In Unified Memory Mode (<code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>), all data is accessible from the device, making information about size and shape unnecessary.</p></li> </ul> <p>Take the following example in C:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">set</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">ptr</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">dim</span><span class="p">){</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">idx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dim</span><span class="w"> </span><span 
class="o">+</span><span class="w"> </span><span class="n">j</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">ptr</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">someval</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">fill2d</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">ptr</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">dim</span><span class="p">){</span><span class="w"></span> <span class="cp">#pragma acc parallel loop</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">dim</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> 
</span><span class="o">&lt;</span><span class="w"> </span><span class="n">dim</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="o">++</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">set</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="n">dim</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>In Separate Memory Mode, the only way to guarantee correctness for this example is to change the line with the <code class="docutils literal notranslate"><span class="pre">acc</span></code> directive as follows:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>#pragma acc parallel loop create(ptr[0:dim*dim]) copyout(ptr[0:dim*dim]) </pre></div> </div> <p>This change explicitly instructs the OpenACC implementation about the precise data segment used within the parallel loop.</p> <p>In Unified Memory Mode, that is, by compiling with <code class="docutils literal notranslate"><span class="pre">-acc</span> <span class="pre">-gpu=mem:unified</span></code> and executing on a platform with unified memory capability, the <code class="docutils literal notranslate"><span class="pre">create</span></code> and <code class="docutils literal notranslate"><span class="pre">copyout</span></code> clauses are not required.</p> <p>The next example, in Fortran, illustrates how a global variable can be accessed in an OpenACC routine without requiring any explicit annotation.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">module </span><span class="n">m</span><span class="w"></span> <span class="kt">integer</span><span class="w"> </span><span 
class="kd">::</span><span class="w"> </span><span class="n">globmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1234</span><span class="w"></span> <span class="k">contains</span> <span class="k">subroutine </span><span class="n">findmin</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"></span> <span class="c">!$acc routine seq</span> <span class="w"> </span><span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">a</span><span class="p">(:)</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">.</span><span class="n">lt</span><span class="p">.</span><span class="w"> </span><span class="n">globmin</span><span class="p">)</span><span class="w"> </span><span class="k">then</span> <span class="k"> </span><span class="n">globmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span 
class="k">endif</span> <span class="k"> end do</span> <span class="k">end subroutine</span> <span class="k">end module </span><span class="n">m</span><span class="w"></span> </pre></div> </div> <p>Compile the example above for Unified Memory Mode:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvfortran -acc -gpu=mem:unified example.f90 </pre></div> </div> <p>The source does not need any OpenACC directives to access the module variable <code class="docutils literal notranslate"><span class="pre">globmin</span></code>, to either read or update its value, in the routine whether it is invoked from the CPU or the GPU. Moreover, any access to <code class="docutils literal notranslate"><span class="pre">globmin</span></code> will be made to the exact same instance of the variable from both the CPU and the GPU; its value is synchronized automatically. In Separate or Managed Memory Modes, such behavior can only be achieved with a combination of OpenACC <code class="docutils literal notranslate"><span class="pre">declare</span></code> and <code class="docutils literal notranslate"><span class="pre">update</span></code> directives in the source code.</p> <p>In most cases, migrating existing OpenACC applications written for Separate Memory Mode should be a seamless process requiring no source changes. 
Some data access patterns, however, may lead to different results produced during application execution in Unified Memory Mode.</p> <p>Applications which rely on having separate data copies in GPU memory to conduct temporary computations on the GPU – without maintaining data synchronization with the CPU – pose a challenge for migration to Unified Memory.</p> <p>For the following Fortran example, the value of variable <code class="docutils literal notranslate"><span class="pre">c</span></code> after the last loop will differ depending on whether the example is compiled with or without <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="n">b</span><span class="p">(:)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="w"></span> <span class="c">!$acc kernels copyin(b) copyout(a)</span> <span class="c">!$acc loop</span> <span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="k">end do</span><span class="w"></span> <span class="c">!$acc loop</span> <span class="k">do </span><span class="n">i</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="k">end do</span><span class="w"></span> <span class="c">!$acc end kernels</span> <span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="k">end do</span><span class="w"></span> </pre></div> </div> <p>Without Unified Memory, array <code class="docutils literal notranslate"><span class="pre">b</span></code> is copied into the GPU memory at the beginning of the OpenACC <code class="docutils literal notranslate"><span class="pre">kernels</span></code> region. It is then updated in the GPU memory and used to compute elements of array <code class="docutils literal notranslate"><span class="pre">a</span></code>. 
As instructed by the data clause <code class="docutils literal notranslate"><span class="pre">copyin(b)</span></code>, <code class="docutils literal notranslate"><span class="pre">b</span></code> is not copied back to the CPU memory at the end of the <code class="docutils literal notranslate"><span class="pre">kernels</span></code> region, and therefore its initial value is used in the computation of <code class="docutils literal notranslate"><span class="pre">c</span></code>. With <code class="docutils literal notranslate"><span class="pre">-acc</span> <span class="pre">-gpu=mem:unified</span></code>, the updated value of <code class="docutils literal notranslate"><span class="pre">b</span></code> in the first loop is automatically visible in the last loop, leading to a different value of <code class="docutils literal notranslate"><span class="pre">c</span></code> when that loop completes.</p> <p><strong>Implications of Asynchronous Execution</strong></p> <p>Additional complexities can arise when dealing with asynchronous execution, particularly when CPU-GPU shared data is accessed within <code class="docutils literal notranslate"><span class="pre">async</span></code> compute regions instead of using an independent data copy on the GPU. The programmer should be especially careful about accessing local variables in asynchronous GPU code. Unless the GPU code execution is explicitly synchronized before the end of the scope in which local variables are defined, the GPU can access stale data, resulting in undefined behavior. 
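</p> <p>One way to satisfy this requirement is sketched below (the function name is hypothetical, and <code class="docutils literal notranslate"><span class="pre">N</span></code> is assumed to be a compile-time constant): a <code class="docutils literal notranslate"><span class="pre">wait</span></code> directive synchronizes all outstanding device work before the local array leaves scope:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>void bar_safe() {
    int x[N];
    #pragma acc enter data create(x[0:N]) async
    #pragma acc parallel loop async
    for (int i = 0; i &lt; N; i++)
        x[i] = i;
    #pragma acc exit data delete(x[0:N]) async
    #pragma acc wait   /* device work on x completes before x goes out of scope */
}
</pre></div> </div> <p>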
Consider the following OpenACC C example, where a local array is used to hold temporary data on the GPU:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">bar</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">N</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="cp">#pragma acc enter data create(x[0:N]) async</span> <span class="w"> </span><span class="cp">#pragma acc parallel loop async</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="w"> </span><span class="cp">#pragma acc exit data delete(x[0:N]) async</span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>When compiled for Separate Memory Mode, the <code class="docutils literal notranslate"><span class="pre">bar()</span></code> function creates a copy of the array <code 
class="docutils literal notranslate"><span class="pre">x</span></code> in GPU memory and initializes it as written in the <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct. That copy is eventually deleted. In Unified Memory Mode, however, the compiler ignores the <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">enter</span> <span class="pre">data</span></code> and <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">exit</span> <span class="pre">data</span></code> directives, so the <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct executed on the GPU accesses the array <code class="docutils literal notranslate"><span class="pre">x</span></code> in local CPU memory. Moreover, since all constructs in this example are made asynchronous, the access to <code class="docutils literal notranslate"><span class="pre">x</span></code> on the GPU leads to undefined behavior of the program because the variable <code class="docutils literal notranslate"><span class="pre">x</span></code> goes out of scope once the <code class="docutils literal notranslate"><span class="pre">bar()</span></code> function finishes.</p> <p><strong>Performance Considerations</strong></p> <p>In Unified Memory Mode, the OpenACC runtime may leverage data action information such as <code class="docutils literal notranslate"><span class="pre">create</span></code>/<code class="docutils literal notranslate"><span class="pre">delete</span></code> or <code class="docutils literal notranslate"><span class="pre">copyin</span></code>/<code class="docutils literal notranslate"><span class="pre">copyout</span></code> to communicate preferable data placement to the CUDA runtime by means of memory hint APIs as elaborated in the following blog post on the NVIDIA website: <a class="reference external" 
href="https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management">Simplifying GPU Application Development with Heterogeneous Memory Management</a>. Such actions originate either from explicit data clauses in the source code or from implicit data movement generated by the compiler. This approach can minimize the amount of automatic data migration and may let a developer fine-tune application performance. For the C example above, while adding the data clauses <code class="docutils literal notranslate"><span class="pre">create(ptr[0:dim*dim])</span></code> and <code class="docutils literal notranslate"><span class="pre">copyout(ptr[0:dim*dim])</span></code> becomes optional with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>, using them in the OpenACC <code class="docutils literal notranslate"><span class="pre">parallel</span> <span class="pre">loop</span></code> directive may improve performance.</p> </section> <section id="openacc-error-handling"> <h2><span class="section-number">6.5. </span>OpenACC Error Handling<a class="headerlink" href="#openacc-error-handling" title="Permalink to this headline"></a></h2> <p>The OpenACC specification provides a mechanism that allows you to intercept errors triggered during execution on a GPU and execute a specific routine in response before the program exits. For example, if an MPI process fails while allocating memory on the GPU, the application may want to call <code class="docutils literal notranslate"><span class="pre">MPI_Abort</span></code> to shut down all the other processes. This section explains how to take advantage of this feature.</p> <p>To intercept errors, the application must register a callback routine with the OpenACC runtime. 
To provide the callback, the application calls <code class="docutils literal notranslate"><span class="pre">acc_set_error_routine</span></code> with a pointer to the callback routine.</p> <p>The interface is the following, where <code class="docutils literal notranslate"><span class="pre">err_msg</span></code> contains a description of the error:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">typedef</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="p">(</span><span class="o">*</span><span class="n">exitroutinetype</span><span class="p">)(</span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">err_msg</span><span class="p">);</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">acc_set_error_routine</span><span class="p">(</span><span class="n">exitroutinetype</span><span class="w"> </span><span class="n">callback_routine</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>When the OpenACC runtime detects a runtime error, it will invoke the <code class="docutils literal notranslate"><span class="pre">callback_routine</span></code>.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>This feature is not the same as error recovery. If the callback routine returns to the application, the behavior is decidedly undefined.</p> </div> <p>Let’s look at this feature in more depth using an example.</p> <p>Take the MPI program below and run it with two processes. Process 0 tries to allocate a large array on the GPU, then sends a message to the second process to acknowledge the success of the operation. 
Process 1 waits for the acknowledgment and terminates upon receiving it.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdio.h&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdlib.h&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&quot;mpi.h&quot;</span><span class="cp"></span> <span class="cp">#define N 2147483648</span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">argc</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">**</span><span class="n">argv</span><span class="p">)</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">rank</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">argc</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">argv</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Comm_rank</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">rank</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Comm_size</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span 
class="n">size</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">ack</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">rank</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="p">);</span><span class="w"></span> <span class="cp">#pragma acc enter data create(a[0:N])</span> <span class="cp">#pragma acc parallel loop independent</span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span 
class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">*</span><span class="mf">0.5</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="cp">#pragma acc exit data copyout(a[0:N])</span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;I am process %d, I have initialized a vector of size %ld bytes on the GPU. Sending acknowledgment to process 1.&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">rank</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">ack</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ack</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_INT</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_COMM_WORLD</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">rank</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Recv</span><span 
class="p">(</span><span class="o">&amp;</span><span class="n">ack</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_INT</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;I am process %d, I have received the acknowledgment from process 0 that data in the GPU has been initialized.</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">rank</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">fflush</span><span class="p">(</span><span class="n">stdout</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="c1">// do some more work</span> <span class="w"> </span><span class="n">MPI_Finalize</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>We compile the program with:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpicc -acc -o error_handling_mpi error_handling_mpi.c </pre></div> </div> <p>If we run this program with two MPI processes, the output will look like the following:</p> <div class="highlight-text 
notranslate"><div class="highlight"><pre><span></span>$ mpirun -n 2 ./error_handling_mpi Out of memory allocating -8589934592 bytes of device memory total/free CUDA memory: 11995578368/11919294464 Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.7, threadid=1 ...empty... call to cuMemAlloc returned error 2: Out of memory ------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted. ------------------------------------------------------- -------------------------------------------------------------------------- mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. </pre></div> </div> <p>Process 0 failed while allocating memory on the GPU and terminated unexpectedly with an error. In this case <code class="docutils literal notranslate"><span class="pre">mpirun</span></code> was able to identify that one of the processes failed, so it shut down the remaining process and terminated the application. A simple two-process program like this is straightforward to debug. In a real world application though, with hundreds or thousands of processes, having a process exit prematurely may cause the application to hang indefinitely. Therefore it would be ideal to catch the failure of a process, control the termination of the other processes, and provide a useful error message.</p> <p>We can use the OpenACC error handling feature to improve the previous program and correctly terminate the application in case of failure of an MPI process.</p> <p>In the following sample code, we have added an error handling callback routine that will shut down the other processes if a process encounters an error while executing on the GPU. Process 0 tries to allocate a large array into the GPU and, if the operation is successful, process 0 will send an acknowledgment to process 1. 
Process 0 calls the OpenACC function <code class="docutils literal notranslate"><span class="pre">acc_set_error_routine</span></code> to set the function <code class="docutils literal notranslate"><span class="pre">handle_gpu_errors</span></code> as an error handling callback routine. This routine prints a message and calls <code class="docutils literal notranslate"><span class="pre">MPI_Abort</span></code> to shut down all the MPI processes. If process 0 successfully allocates the array on the GPU, process 1 will receive the acknowledgment. Otherwise, if process 0 fails, it will terminate itself and trigger the call to <code class="docutils literal notranslate"><span class="pre">handle_gpu_errors</span></code>. Process 1 is then terminated by the code executed in the callback routine.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdio.h&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdlib.h&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&quot;mpi.h&quot;</span><span class="cp"></span> <span class="cp">#define N 2147483648</span> <span class="k">typedef</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="p">(</span><span class="o">*</span><span class="n">exitroutinetype</span><span class="p">)(</span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">err_msg</span><span class="p">);</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">acc_set_error_routine</span><span class="p">(</span><span class="n">exitroutinetype</span><span class="w"> </span><span class="n">callback_routine</span><span class="p">);</span><span class="w"></span> <span 
class="kt">void</span><span class="w"> </span><span class="nf">handle_gpu_errors</span><span class="p">(</span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">err_msg</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;GPU Error: %s&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">err_msg</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;Exiting...</span><span class="se">\n\n</span><span class="s">&quot;</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Abort</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">exit</span><span class="p">(</span><span class="mi">-1</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">argc</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">**</span><span class="n">argv</span><span class="p">)</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">rank</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Init</span><span class="p">(</span><span class="o">&amp;</span><span 
class="n">argc</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">argv</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Comm_rank</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">rank</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Comm_size</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">size</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">ack</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">rank</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">acc_set_error_routine</span><span class="p">(</span><span class="o">&amp;</span><span class="n">handle_gpu_errors</span><span class="p">);</span><span class="w"></span> 
<span class="cp">#pragma acc enter data create(a[0:N])</span> <span class="cp">#pragma acc parallel loop independent</span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">*</span><span class="mf">0.5</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="cp">#pragma acc exit data copyout(a[0:N])</span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;I am process %d, I have initialized a vector of size %ld bytes on the GPU. 
Sending acknowledgment to process 1.&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">rank</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">fflush</span><span class="p">(</span><span class="n">stdout</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">ack</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ack</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_INT</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_COMM_WORLD</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">rank</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">MPI_Recv</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ack</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_INT</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span 
class="p">,</span><span class="w"> </span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span><span class="w"> </span><span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;I am process %d, I have received the acknowledgment from process 0 that data in the GPU has been initialized.</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">rank</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">fflush</span><span class="p">(</span><span class="n">stdout</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="c1">// more work</span> <span class="w"> </span><span class="n">MPI_Finalize</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>Again, we compile the program with:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpicc -acc -o error_handling_mpi error_handling_mpi.c </pre></div> </div> <p>We run the program with two MPI processes and obtain the output below:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpirun -n 2 ./error_handling_mpi Out of memory allocating -8589934592 bytes of device memory total/free CUDA memory: 11995578368/11919294464 Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.7, threadid=1 ...empty... GPU Error: call to cuMemAlloc returned error 2: Out of memory Exiting... 
-------------------------------------------------------------------------- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1. </pre></div> </div> <p>This time the error on the GPU was intercepted by the application which managed it with the error handling callback routine. In this case the routine printed some information about the problem and called <code class="docutils literal notranslate"><span class="pre">MPI_Abort</span></code> to terminate the remaining processes and avoid any unexpected behavior from the application.</p> </section> <section id="openacc-and-cuda-graphs"> <h2><span class="section-number">6.6. </span>OpenACC and CUDA Graphs<a class="headerlink" href="#openacc-and-cuda-graphs" title="Permalink to this headline"></a></h2> <p>NVIDIA provides an optimized model for work submission onto GPUs called CUDA Graphs. A graph is a series of operations, such as kernel launches and other stream-oriented tasks, connected by their dependencies. A graph can be defined once, “captured”, then launched repeatedly. This has potential benefits in reducing launch latencies and other overheads associated with kernel setup.</p> <p>A complete write-up explaining CUDA Graphs and the CUDA API for graph definition, instantiation, and execution can be found in Chapter 3 of the CUDA C Programming Guide. In OpenACC, we currently expose just the minimal set of operations to allow capture and replay of a graph containing OpenACC compute regions and data directives. The code executed between a “begin capture” call, <code class="docutils literal notranslate"><span class="pre">accx_begin_capture_async()</span></code>, and the “end capture” call, <code class="docutils literal notranslate"><span class="pre">accx_end_capture_async()</span></code>, is called the capture region.</p> <p>The CUDA graph API captures (or records) all the device work between accx_begin_capture_async and accx_end_capture_async. 
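</p> <p>The following sketch illustrates how a capture region might bracket asynchronous device work and how the resulting graph might be replayed. The exact C prototypes of the <code class="docutils literal notranslate"><span class="pre">accx_*</span></code> routines, the <code class="docutils literal notranslate"><span class="pre">acc_graph_t</span></code> C type, and the graph launch call used in the replay loop are assumptions for illustration only:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>// Illustrative sketch only: the C prototypes, the acc_graph_t C type,
// and the replay routine name are assumed here, not a definitive interface.
acc_graph_t graph;
long queue = 1;                          // async queue used for capture and replay

accx_begin_capture_async(queue);         // begin recording device work
#pragma acc parallel loop async(queue)
for (int i = 0; i &lt; n; i++)
    a[i] = a[i] + b[i];                  // captured once, replayed many times
accx_end_capture_async(queue, &amp;graph);   // end capture; the graph is also instantiated

for (int step = 0; step &lt; nsteps; step++)
    accx_graph_launch(&amp;graph, queue);    // assumed replay routine
#pragma acc wait(queue)
</pre></div> </div> <p>Each replay resubmits the same recorded device work, avoiding per-kernel launch setup costs on subsequent iterations.</p> <p>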
The host code in the capture region will be executed once normally, except that no work is actually executed on the device. Instead, a graph object is created that can be used to replay the captured work multiple times.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Graph capture is similar to the closure concept found in many programming languages, such as lambda functions in C++. In lambda-function terms, CUDA Graphs capture all variables by value. That means that all the FIRSTPRIVATE scalars, array shapes, and the addresses of derived types, arrays, and scalars resident on the GPU are baked into the graph object and cannot be altered. The device data behind those pointers, of course, can be updated normally by graph execution, and by the host between replays.</p> </div> <p>It is important to understand both what can and cannot be captured within a CUDA Graph capture region:</p> <ul> <li><p>Asynchronous data directives, including <code class="docutils literal notranslate"><span class="pre">data</span> <span class="pre">create</span></code>, can be captured. The OpenACC runtime will use the stream-ordered <code class="docutils literal notranslate"><span class="pre">cudaMallocAsync()</span></code> call in the capture region for variables which need allocation in data clauses, an API call allowed in CUDA Graphs.</p></li> <li><p>Asynchronous compute regions, preferably ACC <code class="docutils literal notranslate"><span class="pre">parallel</span></code> regions, can be captured. For ACC <code class="docutils literal notranslate"><span class="pre">kernels</span></code> regions, verify that no work is performed on the host. 
Host compute sections cannot be captured.</p></li> <li><p>Asynchronous ACC <code class="docutils literal notranslate"><span class="pre">update</span> <span class="pre">host</span> <span class="pre">(self)</span></code> and <code class="docutils literal notranslate"><span class="pre">update</span> <span class="pre">device</span></code> directives can be captured. The host and device addresses which are captured must be valid during the graph replay/execution.</p></li> <li><p>Since only the device work is captured and replayed, any data dependencies between the host and device inside the capture region are erroneous. For example, downloading data from the device, processing it on the host, and uploading it back to the device within the capture region is invalid.</p></li> <li><p>Host code, even host code containing conditionals, can occur within a capture region. Note though that the path taken through the host code will be the path captured by the graph, i.e., the conditionals must evaluate consistently during each replay for correct results. Host code which updates host variables, such as <code class="docutils literal notranslate"><span class="pre">i=i+1</span></code>, will not be captured in the graph, which might affect proper indexing into device-side arrays or other kernel arguments.</p></li> <li><p>Similarly, device work initiated in host code loops can be captured in the CUDA Graph. The graph will not contain a notion of looping, just the sequence of device operations submitted to the device during the loop.</p></li> <li><p>Subroutine and function calls within a capture region, which contain further compute regions or other work which runs on the device, are captured. 
Care must be taken that the device data addresses passed to the kernels are valid throughout graph execution, and don’t come and go based on stack addresses or something similar.</p></li> <li><p>Codes which double-buffer, or ping-pong between source and destination arrays that are input on odd iterations and output on even iterations (for example, the snippet shown below), can be accommodated by capturing two graphs: one per even iteration, one per odd iteration.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">src</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">dest</span><span class="p">;</span><span class="w"></span> <span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">err</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">tolerance</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">dest</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span 
class="n">foo</span><span class="p">(</span><span class="n">src</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="n">err</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bar</span><span class="p">(</span><span class="n">dest</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">tmp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dest</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">src</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">src</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tmp</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> </li> <li><p>Many CUDA library calls, such as cuBLAS routines, can occur in a capture region. Setup for the library calls, such as creating handles and computing and allocating workspace, should be done before the capture region.</p></li> <li><p>Graph capturing is thread-safe with respect to each async queue. Host threads can independently capture graphs using different async queues.</p></li> <li><p>When <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:device</span></code> is used, loop trip counts can vary between runs of the same captured graph, as long as the trip count is updated on the device.</p></li> </ul> <p>The OpenACC API follows the basic portion of the CUDA Graph API fairly closely.
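</p> <p>In outline, a capture-and-replay sequence with this API follows the familiar CUDA stream-capture pattern. The following C sketch is illustrative only: the <code class="docutils literal notranslate"><span class="pre">accx_*</span></code> entry points are the ones declared later in this section, while the array <code class="docutils literal notranslate"><span class="pre">a</span></code>, its length <code class="docutils literal notranslate"><span class="pre">n</span></code>, and <code class="docutils literal notranslate"><span class="pre">niters</span></code> are hypothetical names, and the data is assumed to already be present on the device:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre>acc_graph_t graph;

accx_async_begin_capture(1);
#pragma acc parallel loop async(1)      /* device work to be captured */
for (int i = 0; i &lt; n; ++i)
    a[i] = 2.0 * a[i];
accx_async_end_capture(1, &amp;graph);

for (int it = 0; it &lt; niters; ++it)
    accx_graph_launch(&amp;graph, 1);       /* replay the captured device work */
#pragma acc wait(1)

accx_graph_delete(&amp;graph);
</pre></div> </div> <p>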
The major difference is that OpenACC includes the <code class="docutils literal notranslate"><span class="pre">cudaGraphInstantiate()</span></code> call as part of the end-capture function.</p> <p>From Fortran, the graph type is defined in the OpenACC module:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">type</span><span class="p">,</span><span class="w"> </span><span class="k">bind</span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">acc_graph_t</span><span class="w"></span> <span class="w"> </span><span class="k">type</span><span class="p">(</span><span class="kt">c_ptr</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">graph</span><span class="w"></span> <span class="w"> </span><span class="k">type</span><span class="p">(</span><span class="kt">c_ptr</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">graph_exec</span><span class="w"></span> <span class="k">end type </span><span class="n">acc_graph_t</span><span class="w"></span> </pre></div> </div> <p>These subroutines are available in the OpenACC runtime.
Here, pGraph is of type(acc_graph_t) and async is the asynchronous queue value:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">accx_async_begin_capture</span><span class="p">(</span><span class="w"> </span><span class="n">async</span><span class="w"> </span><span class="p">)</span><span class="w"></span> <span class="k">subroutine </span><span class="n">accx_async_end_capture</span><span class="p">(</span><span class="w"> </span><span class="n">async</span><span class="p">,</span><span class="w"> </span><span class="n">pGraph</span><span class="w"> </span><span class="p">)</span><span class="w"></span> <span class="k">subroutine </span><span class="n">accx_graph_launch</span><span class="p">(</span><span class="w"> </span><span class="n">pGraph</span><span class="p">,</span><span class="w"> </span><span class="n">async</span><span class="w"> </span><span class="p">)</span><span class="w"></span> <span class="k">subroutine </span><span class="n">accx_graph_delete</span><span class="p">(</span><span class="w"> </span><span class="n">pGraph</span><span class="w"> </span><span class="p">)</span><span class="w"></span> <span class="k">type</span><span class="p">(</span><span class="kt">c_ptr</span><span class="p">)</span><span class="w"> </span><span class="k">function </span><span class="n">accx_get_graph</span><span class="p">(</span><span class="w"> </span><span class="n">pGraph</span><span class="w"> </span><span class="p">)</span><span class="w"></span> <span class="k">type</span><span class="p">(</span><span class="kt">c_ptr</span><span class="p">)</span><span class="w"> </span><span class="k">function </span><span class="n">accx_get_graph_exec</span><span class="p">(</span><span class="w"> </span><span class="n">pGraph</span><span class="w"> </span><span class="p">)</span><span class="w"></span> </pre></div> </div> <p>From C, the graph type is defined in openacc.h:</p>
<div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">typedef</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">graph</span><span class="p">;</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">graph_exec</span><span class="p">;</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="n">acc_graph_t</span><span class="p">;</span><span class="w"></span> </pre></div> </div> <p>These functions are available in the OpenACC runtime:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">accx_async_begin_capture</span><span class="p">(</span><span class="kt">long</span><span class="w"> </span><span class="n">async</span><span class="p">);</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">accx_async_end_capture</span><span class="p">(</span><span class="kt">long</span><span class="w"> </span><span class="n">async</span><span class="p">,</span><span class="w"> </span><span class="n">acc_graph_t</span><span class="w"> </span><span class="o">*</span><span class="n">pgraph</span><span class="p">);</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">accx_graph_launch</span><span class="p">(</span><span class="n">acc_graph_t</span><span class="w"> </span><span class="o">*</span><span class="n">pgraph</span><span class="p">,</span><span class="w"> </span><span class="kt">long</span><span class="w"> </span><span
class="n">async</span><span class="p">);</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">accx_graph_delete</span><span class="p">(</span><span class="n">acc_graph_t</span><span class="w"> </span><span class="o">*</span><span class="n">pgraph</span><span class="p">);</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="nf">accx_get_graph</span><span class="p">(</span><span class="n">acc_graph_t</span><span class="w"> </span><span class="o">*</span><span class="n">pgraph</span><span class="p">);</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="nf">accx_get_graph_exec</span><span class="p">(</span><span class="n">acc_graph_t</span><span class="w"> </span><span class="o">*</span><span class="n">pgraph</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>We will use a simple Fortran example code which demonstrates some of the modifications needed to use CUDA Graphs from OpenACC. 
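</p> <p>Before turning to that example, note that the double-buffered loop shown earlier maps onto this interface by capturing one graph per phase and alternating between them at launch time. The C sketch below is illustrative, not tested code: the pointer swap is replaced by alternating graphs, since the addresses captured in each graph must remain valid across replays; <code class="docutils literal notranslate"><span class="pre">foo()</span></code> is assumed callable in device code (e.g., declared <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">routine</span></code>); and the convergence check via <code class="docutils literal notranslate"><span class="pre">bar()</span></code> stays on the host, outside both graphs, with any data movement it needs handled elsewhere:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre>acc_graph_t fwd, bwd;

accx_async_begin_capture(1);            /* even phase: src -&gt; dest */
#pragma acc parallel loop async(1)
for (int i = 0; i &lt; N; ++i) dest[i] = foo(src[i]);
accx_async_end_capture(1, &amp;fwd);

accx_async_begin_capture(1);            /* odd phase: dest -&gt; src */
#pragma acc parallel loop async(1)
for (int i = 0; i &lt; N; ++i) src[i] = foo(dest[i]);
accx_async_end_capture(1, &amp;bwd);

int phase = 0;
while (err &gt; tolerance) {
    accx_graph_launch(phase ? &amp;bwd : &amp;fwd, 1);
    #pragma acc wait(1)
    err = bar(phase ? src : dest);      /* host-side convergence check */
    phase = 1 - phase;
}
accx_graph_delete(&amp;fwd);
accx_graph_delete(&amp;bwd);
</pre></div> </div> <p>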
The original serial code for a conjugate gradient iterative solver:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">RunCG</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="p">,</span><span class="w"> </span><span class="n">max_iter</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">implicit none</span> <span class="k"> </span><span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">max_iter</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">),</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">A</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">tol</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">),</span><span class="w"> </span><span 
class="k">intent</span><span class="p">(</span><span class="n">inout</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">rr0</span><span class="p">,</span><span class="w"> </span><span class="n">rr</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">),</span><span class="w"> </span><span class="k">allocatable</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">Ax</span><span class="p">(:),</span><span class="w"> </span><span class="n">r</span><span class="p">(:),</span><span class="w"> </span><span class="n">p</span><span class="p">(:)</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">it</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="w"> </span><span class="k">allocate</span><span class="p">(</span><span class="n">Ax</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">N</span><span class="p">))</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">symmatvec</span><span class="p">(</span><span 
class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Ax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> </span><span class="n">rr0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">it</span><span class="w"> </span><span class="o">=</span><span class="w"> 
</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">max_iter</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">symmatvec</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rr0</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">r</span><span 
class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">Ax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> </span><span class="n">rr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">print</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;Iteration &quot;</span><span class="p">,</span><span class="w"> </span><span class="n">it</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot; residual: &quot;</span><span class="p">,</span><span class="w"> </span><span class="nb">sqrt</span><span class="p">(</span><span class="n">rr</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb">sqrt</span><span class="p">(</span><span class="n">rr</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">tol</span><span class="p">)</span><span class="w"> </span><span class="k">then</span> <span class="k"> deallocate</span><span class="p">(</span><span class="n">Ax</span><span class="p">,</span><span class="w"> </span><span 
class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">return</span> <span class="k"> endif</span> <span class="k"> do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">rr</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rr0</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> </span><span class="n">rr0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rr</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> deallocate</span><span class="p">(</span><span class="n">Ax</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">)</span><span class="w"></span> <span class="k">end subroutine </span><span class="n">RunCG</span><span class="w"></span> </pre></div> </div> <p>For this exercise we wish to put the <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">it</span> <span 
class="pre">=</span> <span class="pre">1,max_iter</span></code> work for each iteration into a CUDA graph. Step one is to port the code to OpenACC, keeping in mind that we want to use asynchronous queues. We annotate the dot function with OpenACC directives like this:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">function </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="k">result</span><span class="p">(</span><span class="n">r</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">),</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">)</span><span class="w"> 
</span><span class="kd">::</span><span class="w"> </span><span class="n">r</span><span class="w"></span> <span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.d0</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc parallel loop present(x, y) reduction(+:r) async(1)</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc wait(1)</span> <span class="k">end function </span><span class="n">dot</span><span class="w"></span> </pre></div> </div> <p>We write the symmetric matrix multiply like this:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">symmatvec</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">AT</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">)</span><span class="w"></span> <span 
class="w"> </span><span class="k">implicit none</span> <span class="k"> </span><span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">),</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">AT</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">),</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">),</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">out</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">Ax</span><span class="p">(</span><span class="n">M</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span 
class="n">s</span><span class="w"></span> <span class="w"> </span><span class="c">! Note: Since A is symmetric, we can use the &quot;transpose&quot;</span> <span class="w"> </span><span class="c">! for better memory access here</span> <span class="w"> </span><span class="c">!$acc parallel loop gang present(AT, x, Ax) async(1)</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="w"></span> <span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.d0</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc loop vector reduction(+:s)</span> <span class="w"> </span><span class="k">do </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">AT</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">end do</span> <span class="k"> </span><span class="n">Ax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="w"></span> <span class="w"> </span><span 
class="k">end do</span> <span class="k">end subroutine</span><span class="w"></span> </pre></div> </div> <p>And now our main loop of the conjugate gradient solver looks like this:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">do </span><span class="n">it</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">max_iter</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">symmatvec</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rr0</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc parallel loop gang vector async(1)</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">Ax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> </span><span class="n">rr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">print</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;Iteration &quot;</span><span class="p">,</span><span class="w"> </span><span class="n">it</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot; residual: &quot;</span><span class="p">,</span><span class="w"> </span><span class="nb">sqrt</span><span class="p">(</span><span class="n">rr</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span 
class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb">sqrt</span><span class="p">(</span><span class="n">rr</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">tol</span><span class="p">)</span><span class="w"> </span><span class="k">exit</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc parallel loop gang vector async(1)</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">rr</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rr0</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> </span><span class="n">rr0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rr</span><span class="w"></span> <span class="k">enddo</span><span class="w"></span> </pre></div> </div> <p>Step 2 is to prepare the code for running under CUDA Graphs. There is a lot of host code executing in the main loop. 
While the <code class="docutils literal notranslate"><span class="pre">dot()</span></code> function runs on the GPU, the rest of the statement <code class="docutils literal notranslate"><span class="pre">alpha</span> <span class="pre">=</span> <span class="pre">rr0</span> <span class="pre">/</span> <span class="pre">dot(...)</span></code> runs on the host. Similarly, the 2nd <code class="docutils literal notranslate"><span class="pre">dot()</span></code> call returns its value to the host. The print statement occurs on the host, as does the residual check. Finally, this iteration’s value for rr is moved to rr0 in the last statement of the loop, on the host.</p> <p>The dot product is tricky. We wish to compute the dot product on the GPU, and leave the result on the GPU, so the reduction variable must be present on the GPU. Here, we change the function call to a subroutine, and remove the initialization which is outside of the parallel region:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">implicit none</span> <span class="k"> </span><span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span class="k">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">),</span><span class="w"> </span><span class="k">intent</span><span 
class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">WP</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">r</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc parallel loop present(x, y, r) reduction(+:r) async(1)</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k">end subroutine </span><span class="n">dot</span><span class="w"></span> </pre></div> </div> <p>We add one serial kernel to do some of the swapping between rr0 and rr, as well as zeroing out the scalar that will hold the dot product reduction, 
and move the print and residual check outside of the GPU capture region, replacing them with an update host operation. The finished loop, complete with graph control, looks like this:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">do </span><span class="n">it</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">max_iter</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">it</span><span class="w"> </span><span class="p">.</span><span class="n">eq</span><span class="p">.</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">then</span><span class="w"> </span><span class="c">! First time capture</span> <span class="w"> </span><span class="k">call </span><span class="n">accx_async_begin_capture</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">symmatvec</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">Ax</span><span class="p">,</span><span class="w"> </span><span class="n">rden</span><span class="p">)</span><span class="w"></span> <span class="w"> 
</span><span class="c">!$acc serial async(1)</span> <span class="w"> </span><span class="n">rr0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rr</span><span class="w"></span> <span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rr0</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rden</span><span class="w"></span> <span class="w"> </span><span class="n">rden</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.0d0</span><span class="w"></span> <span class="w"> </span><span class="n">rr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.0d0</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc end serial</span> <span class="w"> </span><span class="c">!$acc parallel loop gang vector async(1)</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">Ax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> call </span><span class="n">dot</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">rr</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc update host(rr) async(1)</span> <span class="w"> </span><span class="c">!$acc parallel loop gang vector async(1)</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">rr</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rr0</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span 
class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> call </span><span class="n">accx_async_end_capture</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">graph</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">endif</span><span class="w"></span> <span class="w"> </span><span class="c">! Always launch, then wait</span> <span class="w"> </span><span class="k">call </span><span class="n">accx_graph_launch</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="c">!$acc wait(1)</span> <span class="w"> </span><span class="n">rra</span><span class="p">(</span><span class="n">it</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rr</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb">sqrt</span><span class="p">(</span><span class="n">rr</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">tol</span><span class="p">)</span><span class="w"> </span><span class="k">exit</span> <span class="k">enddo</span><span class="w"></span> </pre></div> </div> <p>Step 3 is to compile, run, and profile the result. No special compiler options are needed besides -acc=gpu. When running, you are advised to set the <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_ACC_USE_GRAPH</span></code> environment variable. This is currently necessary to properly set up the OpenACC runtime for graph capture. Failure to abide by the guidelines above may result in wrong answers, which can be hard to debug. 
See the following sections on how to use environment variables to help. A common issue is that during graph playback the pointers passed to the device kernels are the same on every replay; make sure the pointers likewise do not change between iterations in the code before graph capture is added.</p> <p>The Nsight Systems tool has very good support for profiling CUDA graphs. The timeline view will provide information on whether you have reduced the launch overhead gaps between the GPU kernels. <a class="reference internal" href="#openacc-cuda-graphs-nsys-cg-report1-timeline-fig"><span class="std std-ref">Figure 1</span></a> shows a timeline of the iterations of the original OpenACC loop:</p> <figure class="align-center" id="openacc-cuda-graphs-nsys-cg-report1-timeline-fig"> <img alt="_images/nsys-cg-report1-timeline.png" src="_images/nsys-cg-report1-timeline.png" /> <figcaption> <p><span class="caption-text">Figure 1. Nsight Systems Report1 Timeline</span><a class="headerlink" href="#openacc-cuda-graphs-nsys-cg-report1-timeline-fig" title="Permalink to this image"></a></p> </figcaption> </figure> <p><a class="reference internal" href="#openacc-cuda-graphs-nsys-cg-report2-timeline-fig"><span class="std std-ref">Figure 2</span></a> shows a timeline of the iterations when using CUDA Graphs. When the size N is less than a few thousand, launch latency becomes a major contributor to the overall time, and here we can see about a 2x speedup:</p> <figure class="align-center" id="openacc-cuda-graphs-nsys-cg-report2-timeline-fig"> <img alt="_images/nsys-cg-report2-timeline.png" src="_images/nsys-cg-report2-timeline.png" /> <figcaption> <p><span class="caption-text">Figure 2. 
Nsight Systems Report2 Timeline</span><a class="headerlink" href="#openacc-cuda-graphs-nsys-cg-report2-timeline-fig" title="Permalink to this image"></a></p> </figcaption> </figure> <p>You can see a more detailed trace of the CUDA Graph components by adding the <code class="docutils literal notranslate"><span class="pre">--cuda-graph-trace=node</span></code> option to the nsys profile command.</p> <p>The above loop demonstrates several of the guidelines outlined at the top of this section, namely, capturing compute regions, whether at the top level or in subprogram units, capturing data movement, and restructuring code regions to minimize or eliminate the host code within a capture region. It also demonstrates the minimal API needed to begin capture, end capture, and then launch the captured graph.</p> </section> <section id="host-and-device-trip-count-options"> <h2><span class="section-number">6.7. </span>Host and Device Trip Count Options<a class="headerlink" href="#host-and-device-trip-count-options" title="Permalink to this headline"></a></h2> <p>The <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount</span></code> option controls whether the trip counts for loops in a compute construct, such as <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">parallel</span> <span class="pre">loop</span></code>, are calculated on the host or on the device. The default behavior of the NVHPC compilers is to use the values calculated on the host, though the OpenACC specification states that trip count values should be calculated on the device. We have chosen to maintain, as-is, the default behavior so as not to interfere with existing applications that currently depend on it for correctness. To ensure compliance with the specification, please use the <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:device</span></code> option. 
To maintain the default behavior, please use <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:host</span></code> or do not specify a <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount</span></code> option.</p> <p>To emit a warning at compile-time that an OpenACC program may be using host values for trip counts, use <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:warn</span></code>, or use <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:nowarn</span></code> to disable these warnings.</p> <p>To check at runtime whether the host and device values for trip counts are the same, use <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:check</span></code>. Set the environment variable <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_ACC_CHECK_TRIPCOUNT</span></code> to enable reporting of any differences discovered. To disable these checks, use <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:nocheck</span></code>.</p> <section id="when-to-use-gpu-tripcount-device-or-gpu-tripcount-host"> <h3><span class="section-number">6.7.1. 
</span>When to Use <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:device</span></code> or <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:host</span></code><a class="headerlink" href="#when-to-use-gpu-tripcount-device-or-gpu-tripcount-host" title="Permalink to this headline"></a></h3> <p>Consider the following example code snippet:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="kt">real</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="k">array</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">)</span><span class="w"></span> <span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"></span> <span class="c">!$acc data create(n, m) copy(array)</span> <span class="c">!$acc kernels</span> <span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000</span><span class="w"></span> <span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">10</span><span class="w"></span> <span class="c">!$acc end kernels</span> <span class="c">!$acc parallel loop default(none) collapse(2)</span> <span class="k">do </span><span class="n">j</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">m</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span
class="n">n</span><span class="w"></span> <span class="w"> </span><span class="k">array</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="o">+</span><span class="n">j</span><span class="w"></span> <span class="w"> </span><span class="k">end do</span> <span class="k">end do</span><span class="w"></span> </pre></div> </div> <p>The trip count variables <code class="docutils literal notranslate"><span class="pre">n</span></code> and <code class="docutils literal notranslate"><span class="pre">m</span></code> are created on the device, and then their values are set on the device in the <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">kernels</span></code> construct. Their values are not set on the host. Therefore, when the parallel loop is run on the device, if the host values for <code class="docutils literal notranslate"><span class="pre">n</span></code> and <code class="docutils literal notranslate"><span class="pre">m</span></code> are used, the loop will not run for the correct number of iterations. In this and similar cases, to ensure the correctness of the program, <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:device</span></code> should be used.</p> <p>In cases where the values of <code class="docutils literal notranslate"><span class="pre">n</span></code> and <code class="docutils literal notranslate"><span class="pre">m</span></code> are set on the host, it is sufficient to rely on the default behavior or to specify <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:host</span></code>. 
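For the snippet above, where n and m are assigned only on the device, the compile line needs the device option. The following command lines sketch the choices discussed in this section (the source file name is illustrative):

```shell
# n and m are set only on the device, so device-side trip counts are required:
nvfortran -acc=gpu -gpu=tripcount:device example.f90

# If n and m were instead set on the host, the default (host) behavior suffices:
nvfortran -acc=gpu -gpu=tripcount:host example.f90

# Diagnostics: warn at compile time, or compare host/device values at runtime.
nvfortran -acc=gpu -gpu=tripcount:warn  example.f90
nvfortran -acc=gpu -gpu=tripcount:check example.f90
export NVCOMPILER_ACC_CHECK_TRIPCOUNT=1  # report differences found by tripcount:check
```
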
There are two ways to verify whether or not the program’s correctness may be affected by the use of <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:device</span></code> versus <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:host</span></code>. The <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:check</span></code> option can be used to detect discrepancies between host and device values for trip counts at runtime, and the <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:warn</span></code> option can be used to issue compile-time warnings that host values for trip counts may be used.</p> <p>Note: For CUDA Graphs, <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:device</span></code> allows trip counts to vary between runs for captured graphs on the device, as long as the trip count is updated on the device. This behavior can affect the correctness of CUDA Graphs, and some applications may require this option to use CUDA Graphs correctly.</p> <span class="target" id="env-vars"></span></section> </section> <section id="environment-variables"> <h2><span class="section-number">6.8. </span>Environment Variables<a class="headerlink" href="#environment-variables" title="Permalink to this headline"></a></h2> <p>This section summarizes the environment variables that NVIDIA OpenACC supports. These are user-settable variables that control the behavior of accelerator-enabled programs at execution. 
These environment variables must comply with these rules:</p> <ul class="simple"> <li><p>The names of the environment variables must be upper case.</p></li> <li><p>The values of environment variables are case insensitive and may have leading and trailing white space.</p></li> <li><p>The behavior is implementation-defined if the values of the environment variables change after the program has started, even if the program itself modifies the values.</p></li> </ul> <p>The following table contains the environment variables that are currently supported and provides a brief description of each.</p> <table class="table-no-stripes docutils align-default" id="id27"> <caption><span class="caption-text">Table 16. Supported Environment Variables</span><a class="headerlink" href="#id27" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 8%" /> <col style="width: 92%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Use this environment variable…</p></th> <th class="head"><p>To do this…</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>NVCOMPILER_ACC_CHECK_TRIPCOUNT</p></td> <td><p>Enable output for checking differences between host and device trip counts when <code class="docutils literal notranslate"><span class="pre">-gpu=tripcount:check</span></code> is used.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_CUDA_PROFSTOP</p></td> <td><p>Set to 1 (or any positive value) to tell the runtime environment to insert an ‘atexit(cuProfilerStop)’ call upon exit. This behavior may be desired in the case where a profile is incomplete or where a message is issued to call cudaProfilerStop().</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_DEVICE_NUM</p></td> <td><p>Specifies the default device number to use when executing accelerator regions. 
The value of this environment variable must be a nonnegative integer between zero and the number of devices attached to the host.</p></td> </tr> <tr class="row-odd"><td><p>ACC_DEVICE_NUM</p></td> <td><p>Legacy name. Superseded by NVCOMPILER_ACC_DEVICE_NUM.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_DEVICE_TYPE</p></td> <td><p>Specifies which accelerator device type to use when executing accelerator regions when the program has been compiled to target more than one type of device. The value of this environment variable is implementation-defined; in the NVIDIA OpenACC implementation it may be one of the strings NVIDIA, MULTICORE, or HOST.</p></td> </tr> <tr class="row-odd"><td><p>ACC_DEVICE_TYPE</p></td> <td><p>Legacy name. Superseded by NVCOMPILER_ACC_DEVICE_TYPE.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_GANGLIMIT</p></td> <td><p>For NVIDIA CUDA devices, this defines the maximum number of gangs (CUDA thread blocks) that will be launched by a kernel.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_NOTIFY</p></td> <td><p>With no argument, a debug message will be written to stderr for each kernel launch and/or data transfer. 
When set to an integer value, the value is used as a bit mask to print information about:</p> <p>1: kernel launches</p> <p>2: data transfers</p> <p>4: region entry/exit</p> <p>8: wait operations or synchronizations with the device</p> <p>16: device memory allocates and deallocates</p> <p>These values may be combined; for example, setting the variable to 3 reports both kernel launches and data transfers.</p> </td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_PROFLIB</p></td> <td><p>Enables the third-party tools interface using the new profiler dynamic library interface.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_ACC_SYNCHRONOUS</p></td> <td><p>Disables asynchronous launches and data movement.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_ACC_TIME</p></td> <td><p>Enables a lightweight profiler to measure data movement and accelerator kernel execution time and print a summary at the end of program execution.</p></td> </tr> </tbody> </table> </section> <section id="profiling-accelerator-kernels"> <h2><span class="section-number">6.9. </span>Profiling Accelerator Kernels<a class="headerlink" href="#profiling-accelerator-kernels" title="Permalink to this headline"></a></h2> <p><strong>Support for Profiler/Trace Tool Interface</strong></p> <p>The NVIDIA HPC Compilers support the OpenACC Profiler/Trace Tools Interface. 
This is the interface used by the NVIDIA profilers to collect performance measurements of OpenACC programs.</p> <p><strong>Using NVCOMPILER_ACC_TIME</strong></p> <p>Setting the environment variable NVCOMPILER_ACC_TIME to a nonzero value enables collection and printing of simple timing information about the accelerator regions and generated kernels.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Turn off all CUDA Profilers (NVIDIA’s Visual Profiler, NVPROF, CUDA_PROFILE, etc.) when enabling NVCOMPILER_ACC_TIME; they use the same library to gather performance data and cannot be used concurrently.</p> </div> <p><strong>Accelerator Kernel Timing Data</strong></p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>bb04.f90
  s1
    15: region entered 1 times
        time(us): total=1490738 init=1489138 region=1600
                  kernels=155 data=1445
        w/o init: total=1600 max=1600 min=1600 avg=1600
        18: kernel launched 1 times
            time(us): total=155 max=155 min=155 avg=155
</pre></div> </div> <p>In this example, a number of things are occurring:</p> <ul class="simple"> <li><p>For each accelerator region, the file name bb04.f90 and the subroutine or function name s1 are printed, along with the line number of the accelerator region, which in the example is 15.</p></li> <li><p>The library counts how many times the region is entered (1 in the example) and the microseconds spent in the region (in this example 1490738), which is split into initialization time (in this example 1489138) and execution time (in this example 1600).</p></li> <li><p>The execution time is then divided into kernel execution time and data transfer time between the host and GPU.</p></li> <li><p>For each kernel, the line number is given (18 in the example), along with a count of kernel launches, and the total, maximum, minimum, and average time spent in the kernel, all of which are 155 in this example.</p></li> </ul> </section> <section id="openacc-runtime-libraries"> <h2><span
class="section-number">6.10. </span>OpenACC Runtime Libraries<a class="headerlink" href="#openacc-runtime-libraries" title="Permalink to this headline"></a></h2> <p>This section provides an overview of the user-callable functions and library routines that are available for use by programmers to query the accelerator features and to control behavior of accelerator-enabled programs at runtime.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>In Fortran, none of the OpenACC runtime library routines may be called from a PURE or ELEMENTAL procedure.</p> </div> <section id="runtime-library-definitions"> <h3><span class="section-number">6.10.1. </span>Runtime Library Definitions<a class="headerlink" href="#runtime-library-definitions" title="Permalink to this headline"></a></h3> <p>There are separate runtime library files for Fortran, and for C++ and C.</p> <p><strong>C++ and C Runtime Library Files</strong></p> <p>In C++ and C, prototypes for the runtime library routines are available in a header file named <code class="docutils literal notranslate"><span class="pre">accel.h</span></code>. All the library routines are <code class="docutils literal notranslate"><span class="pre">extern</span></code> functions with ‘C’ linkage. This file defines:</p> <ul class="simple"> <li><p>The prototypes of all routines in this section.</p></li> <li><p>Any data types used in those prototypes, including an enumeration type to describe types of accelerators.</p></li> </ul> <p><strong>Fortran Runtime Library Files</strong></p> <p>In Fortran, interface declarations are provided in a Fortran include file named <code class="docutils literal notranslate"><span class="pre">accel_lib.h</span></code> and in a Fortran module named <code class="docutils literal notranslate"><span class="pre">accel_lib</span></code>. 
These files define:</p> <ul class="simple"> <li><p>Interfaces for all routines in this section.</p></li> <li><p>Integer parameters to define integer kinds for arguments to those routines.</p></li> <li><p>Integer parameters to describe types of accelerators.</p></li> </ul> </section> <section id="runtime-library-routines"> <h3><span class="section-number">6.10.2. </span>Runtime Library Routines<a class="headerlink" href="#runtime-library-routines" title="Permalink to this headline"></a></h3> <p><a class="reference internal" href="#acc-openacc-lib-routines-acc-openacc-lib-routines-tbl"><span class="std std-ref">Table 17</span></a> lists and briefly describes the runtime library routines supported by the NVIDIA HPC Compilers in addition to the standard OpenACC runtime API routines.</p> <table class="table-no-stripes docutils align-default" id="acc-openacc-lib-routines-acc-openacc-lib-routines-tbl"> <caption><span class="caption-text">Table 17. Accelerator Runtime Library Routines</span><a class="headerlink" href="#acc-openacc-lib-routines-acc-openacc-lib-routines-tbl" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 21%" /> <col style="width: 79%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>This Runtime Library Routine…</p></th> <th class="head"><p>Does this…</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>acc_allocs</p></td> <td><p>Returns the number of arrays allocated in data or compute regions.</p></td> </tr> <tr class="row-odd"><td><p>acc_bytesalloc</p></td> <td><p>Returns the total bytes allocated by data or compute regions.</p></td> </tr> <tr class="row-even"><td><p>acc_bytesin</p></td> <td><p>Returns the total bytes copied in to the accelerator by data or compute regions.</p></td> </tr> <tr class="row-odd"><td><p>acc_bytesout</p></td> <td><p>Returns the total bytes copied out from the accelerator by data or compute regions.</p></td> </tr> <tr class="row-even"><td><p>acc_copyins</p></td>
<td><p>Returns the number of arrays copied in to the accelerator by data or compute regions.</p></td> </tr> <tr class="row-odd"><td><p>acc_copyouts</p></td> <td><p>Returns the number of arrays copied out from the accelerator by data or compute regions.</p></td> </tr> <tr class="row-even"><td><p>acc_disable_time</p></td> <td><p>Tells the runtime to stop profiling accelerator regions and kernels.</p></td> </tr> <tr class="row-odd"><td><p>acc_enable_time</p></td> <td><p>Tells the runtime to start profiling accelerator regions and kernels, if it is not already doing so.</p></td> </tr> <tr class="row-even"><td><p>acc_exec_time</p></td> <td><p>Returns the number of microseconds spent on the accelerator executing kernels.</p></td> </tr> <tr class="row-odd"><td><p>acc_frees</p></td> <td><p>Returns the number of arrays freed or deallocated in data or compute regions.</p></td> </tr> <tr class="row-even"><td><p>acc_get_device</p></td> <td><p>Returns the type of accelerator device used to run the next accelerator region, if one is selected.</p></td> </tr> <tr class="row-odd"><td><p>acc_get_device_num</p></td> <td><p>Returns the number of the device being used to execute an accelerator region.</p></td> </tr> <tr class="row-even"><td><p>acc_get_free_memory</p></td> <td><p>Returns the total available free memory on the attached accelerator device.</p></td> </tr> <tr class="row-odd"><td><p>acc_get_memory</p></td> <td><p>Returns the total memory on the attached accelerator device.</p></td> </tr> <tr class="row-even"><td><p>acc_get_num_devices</p></td> <td><p>Returns the number of accelerator devices of the given type attached to the host.</p></td> </tr> <tr class="row-odd"><td><p>acc_kernels</p></td> <td><p>Returns the number of accelerator kernels launched since the start of the program.</p></td> </tr> <tr class="row-even"><td><p>acc_present_dump</p></td> <td><p>Summarizes all data present on the current device.</p></td> </tr> <tr 
class="row-odd"><td><p>acc_present_dump_all</p></td> <td><p>Summarizes all data present on all devices.</p></td> </tr> <tr class="row-even"><td><p>acc_regions</p></td> <td><p>Returns the number of accelerator regions entered since the start of the program.</p></td> </tr> <tr class="row-odd"><td><p>acc_total_time</p></td> <td><p>Returns the number of microseconds spent in accelerator compute regions and in moving data for accelerator data regions.</p></td> </tr> </tbody> </table> </section> </section> <section id="supported-intrinsics"> <h2><span class="section-number">6.11. </span>Supported Intrinsics<a class="headerlink" href="#supported-intrinsics" title="Permalink to this headline"></a></h2> <p>An intrinsic is a function available in a given language whose implementation is handled specifically by the compiler. Typically, an intrinsic substitutes a sequence of automatically generated instructions for the original function call. Since the compiler has intimate knowledge of the intrinsic function, it can better integrate it and optimize it for the situation.</p> <p>Intrinsics make the use of processor-specific enhancements easier because they provide a language interface to assembly instructions. In doing so, the compiler manages things that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data.</p> <p>This section contains an overview of the Fortran and C intrinsics that the accelerator supports.</p> <section id="supported-fortran-intrinsics-summary-table"> <h3><span class="section-number">6.11.1. </span>Supported Fortran Intrinsics Summary Table<a class="headerlink" href="#supported-fortran-intrinsics-summary-table" title="Permalink to this headline"></a></h3> <p><a class="reference internal" href="#acc-fort-intrin-sum-acc-fort-intrin-sum-tbl"><span class="std std-ref">Table 18</span></a> is an alphabetical summary of the Fortran intrinsics that the accelerator supports.
These functions are specific to Fortran 90/95 unless otherwise specified.</p> <p>In most cases support is provided for all the data types for which the intrinsic is valid. When support is available for only certain data types, the middle column of the table specifies which ones, using the following codes:</p> <table class="table-no-stripes docutils align-default" id="acc-fort-intrin-sum-acc-fort-intrin-sum-tbl"> <colgroup> <col style="width: 19%" /> <col style="width: 39%" /> <col style="width: 43%" /> </colgroup> <tbody> <tr class="row-odd"><td><p>I for integer</p></td> <td><p>S for single precision real</p> <p>D for double precision real</p></td> <td><p>C for single precision complex</p> <p>Z for double precision complex</p></td> </tr> </tbody> </table> <table class="table-no-stripes docutils align-default" id="id28"> <caption><span class="caption-text">Table 18. Supported Fortran Intrinsics</span><a class="headerlink" href="#id28" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 16%" /> <col style="width: 9%" /> <col style="width: 75%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>This intrinsic</p></th> <th class="head"></th> <th class="head"><p>Return value</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>ABS</p></td> <td><p>I,S,D</p></td> <td><p>absolute value of the argument.</p></td> </tr> <tr class="row-odd"><td><p>ACOS</p></td> <td></td> <td><p>arccosine of the specified argument.</p></td> </tr> <tr class="row-even"><td><p>AINT</p></td> <td></td> <td><p>truncation of the argument to a whole number.</p></td> </tr> <tr class="row-odd"><td><p>ANINT</p></td> <td></td> <td><p>nearest whole number of the real argument.</p></td> </tr> <tr class="row-even"><td><p>ASIN</p></td> <td></td> <td><p>arcsine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>ATAN</p></td> <td></td> <td><p>arctangent of the argument.</p></td> </tr> <tr class="row-even"><td><p>ATAN2</p></td> <td></td> <td><p>angle in radians of the
complex value first-argument + i*second-argument.</p></td> </tr> <tr class="row-odd"><td><p>COS</p></td> <td><p>S,D,C,Z</p></td> <td><p>cosine of the argument.</p></td> </tr> <tr class="row-even"><td><p>COSH</p></td> <td></td> <td><p>hyperbolic cosine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>DBLE</p></td> <td><p>S,D</p></td> <td><p>conversion of the argument to double precision real.</p></td> </tr> <tr class="row-even"><td><p>DPROD</p></td> <td></td> <td><p>double precision product of two single precision arguments.</p></td> </tr> <tr class="row-odd"><td><p>EXP</p></td> <td><p>S,D,C,Z</p></td> <td><p>natural exponential value of the argument.</p></td> </tr> <tr class="row-even"><td><p>IAND</p></td> <td></td> <td><p>result of logical AND of the two integer arguments.</p></td> </tr> <tr class="row-odd"><td><p>IEOR</p></td> <td></td> <td><p>result of the boolean exclusive OR of the two integer arguments.</p></td> </tr> <tr class="row-even"><td><p>INT</p></td> <td><p>I,S,D</p></td> <td><p>conversion of the argument to integer type.</p></td> </tr> <tr class="row-odd"><td><p>IOR</p></td> <td></td> <td><p>result of the boolean inclusive OR of the two integer arguments.</p></td> </tr> <tr class="row-even"><td><p>LOG</p></td> <td><p>S,D,C,Z</p></td> <td><p>base-e (natural logarithm) of the argument.</p></td> </tr> <tr class="row-odd"><td><p>LOG10</p></td> <td></td> <td><p>base-10 logarithm of the argument.</p></td> </tr> <tr class="row-even"><td><p>MAX</p></td> <td></td> <td><p>maximum value of the arguments.</p></td> </tr> <tr class="row-odd"><td><p>MIN</p></td> <td></td> <td><p>minimum value of the arguments.</p></td> </tr> <tr class="row-even"><td><p>MOD</p></td> <td><p>I</p></td> <td><p>remainder of the first argument divided by the second argument.</p></td> </tr> <tr class="row-odd"><td><p>NINT</p></td> <td></td> <td><p>nearest integer of the real argument.</p></td> </tr> <tr class="row-even"><td><p>NOT</p></td> <td></td> <td><p>logical complement of 
the integer argument.</p></td> </tr> <tr class="row-odd"><td><p>REAL</p></td> <td><p>I,S,D</p></td> <td><p>conversion of the argument to real.</p></td> </tr> <tr class="row-even"><td><p>SIGN</p></td> <td></td> <td><p>absolute value of first argument times the sign of second argument.</p></td> </tr> <tr class="row-odd"><td><p>SIN</p></td> <td><p>S,D,C,Z</p></td> <td><p>sine of the argument.</p></td> </tr> <tr class="row-even"><td><p>SINH</p></td> <td></td> <td><p>hyperbolic sine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>SQRT</p></td> <td><p>S,D,C,Z</p></td> <td><p>square root of the argument.</p></td> </tr> <tr class="row-even"><td><p>TAN</p></td> <td></td> <td><p>tangent of the argument.</p></td> </tr> <tr class="row-odd"><td><p>TANH</p></td> <td></td> <td><p>hyperbolic tangent of the argument.</p></td> </tr> </tbody> </table> </section> <section id="supported-c-intrinsics-summary-table"> <h3><span class="section-number">6.11.2. </span>Supported C Intrinsics Summary Table<a class="headerlink" href="#supported-c-intrinsics-summary-table" title="Permalink to this headline"></a></h3> <p>This section contains two alphabetical summaries – one for double functions and a second for float functions. These lists contain only those C intrinsics that the accelerator supports.</p> <table class="table-no-stripes docutils align-default" id="id29"> <caption><span class="caption-text">Table 19. 
Supported C Intrinsic Double Functions</span><a class="headerlink" href="#id29" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 18%" /> <col style="width: 82%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>This intrinsic</p></th> <th class="head"><p>Return value</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>acos</p></td> <td><p>arccosine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>asin</p></td> <td><p>arcsine of the argument.</p></td> </tr> <tr class="row-even"><td><p>atan</p></td> <td><p>arctangent of the argument.</p></td> </tr> <tr class="row-odd"><td><p>atan2</p></td> <td><p>arctangent of y/x, where y is the first argument, x the second.</p></td> </tr> <tr class="row-even"><td><p>cos</p></td> <td><p>cosine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>cosh</p></td> <td><p>hyperbolic cosine of the argument.</p></td> </tr> <tr class="row-even"><td><p>exp</p></td> <td><p>exponential value of the argument.</p></td> </tr> <tr class="row-odd"><td><p>fabs</p></td> <td><p>absolute value of the argument.</p></td> </tr> <tr class="row-even"><td><p>fmax</p></td> <td><p>maximum value of the two arguments</p></td> </tr> <tr class="row-odd"><td><p>fmin</p></td> <td><p>minimum value of the two arguments</p></td> </tr> <tr class="row-even"><td><p>log</p></td> <td><p>natural logarithm of the argument.</p></td> </tr> <tr class="row-odd"><td><p>log10</p></td> <td><p>base-10 logarithm of the argument.</p></td> </tr> <tr class="row-even"><td><p>pow</p></td> <td><p>value of the first argument raised to the power of the second argument.</p></td> </tr> <tr class="row-odd"><td><p>sin</p></td> <td><p>value of the sine of the argument.</p></td> </tr> <tr class="row-even"><td><p>sinh</p></td> <td><p>hyperbolic sine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>sqrt</p></td> <td><p>square root of the argument.</p></td> </tr> <tr class="row-even"><td><p>tan</p></td> <td><p>tangent of the 
argument.</p></td> </tr> <tr class="row-odd"><td><p>tanh</p></td> <td><p>hyperbolic tangent of the argument.</p></td> </tr> </tbody> </table> <table class="table-no-stripes docutils align-default" id="id30"> <caption><span class="caption-text">Table 20. Supported C Intrinsic Float Functions</span><a class="headerlink" href="#id30" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 18%" /> <col style="width: 82%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>This intrinsic</p></th> <th class="head"><p>Return value</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>acosf</p></td> <td><p>arccosine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>asinf</p></td> <td><p>arcsine of the argument.</p></td> </tr> <tr class="row-even"><td><p>atanf</p></td> <td><p>arctangent of the argument.</p></td> </tr> <tr class="row-odd"><td><p>atan2f</p></td> <td><p>arctangent of y/x, where y is the first argument, x the second.</p></td> </tr> <tr class="row-even"><td><p>cosf</p></td> <td><p>cosine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>coshf</p></td> <td><p>hyperbolic cosine of the argument.</p></td> </tr> <tr class="row-even"><td><p>expf</p></td> <td><p>exponential value of the argument.</p></td> </tr> <tr class="row-odd"><td><p>fabsf</p></td> <td><p>absolute value of the argument.</p></td> </tr> <tr class="row-even"><td><p>logf</p></td> <td><p>natural logarithm of the argument.</p></td> </tr> <tr class="row-odd"><td><p>log10f</p></td> <td><p>base-10 logarithm of the argument.</p></td> </tr> <tr class="row-even"><td><p>powf</p></td> <td><p>value of the first argument raised to the power of the second argument.</p></td> </tr> <tr class="row-odd"><td><p>sinf</p></td> <td><p>value of the sine of the argument.</p></td> </tr> <tr class="row-even"><td><p>sinhf</p></td> <td><p>hyperbolic sine of the argument.</p></td> </tr> <tr class="row-odd"><td><p>sqrtf</p></td> <td><p>square root of the argument.</p></td> 
</tr> <tr class="row-even"><td><p>tanf</p></td> <td><p>tangent of the argument.</p></td> </tr> <tr class="row-odd"><td><p>tanhf</p></td> <td><p>hyperbolic tangent of the argument.</p></td> </tr> </tbody> </table> <span class="target" id="openmp-use"></span></section> </section> </section> <section id="using-openmp"> <h1><span class="section-number">7. </span>Using OpenMP<a class="headerlink" href="#using-openmp" title="Permalink to this headline"></a></h1> <p>OpenMP is a specification for a set of compiler directives, an applications programming interface (API), and a set of environment variables that can be used to specify parallel execution in Fortran, C++, and C programs. For general information about using OpenMP and to obtain a copy of the OpenMP specification, refer to the <a class="reference external" href="https://www.openmp.org">OpenMP organization’s website</a>.</p> <p>The NVFORTRAN, NVC++, and NVC compilers support a subset of the OpenMP Application Program Interface for CPUs and GPUs. In defining this subset, we have focused on OpenMP 5.0 features that will enable CPU and GPU targeting for OpenMP applications with a goal of encouraging programming practices that are portable and scalable. For features that are to be avoided, wherever possible, the directives and API calls related to those features are parsed and ignored to maximize portability. Where ignoring such features is not possible, or could result in ambiguous or incorrect execution, the compilers emit appropriate error messages at compile- or run-time.</p> <p>OpenMP applications properly structured for GPUs, meaning they expose massive parallelism and have relatively little or no synchronization in GPU-side code segments, should compile and execute with performance on par with or close to equivalent OpenACC. 
Codes that are not well-structured for GPUs may perform poorly but should execute correctly.</p> <p>Use the <code class="docutils literal notranslate"><span class="pre">-mp</span></code> compiler switch to enable processing of OpenMP directives and pragmas. The most important sub-options to <code class="docutils literal notranslate"><span class="pre">-mp</span></code> are the following:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">gpu</span></code>: OpenMP directives are compiled for GPU execution plus multicore CPU fallback; this feature is supported on NVIDIA V100 or later GPUs.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">multicore</span></code>: OpenMP directives are compiled for multicore CPU execution only; this sub-option is the default.</p></li> </ul> <p class="title sectiontitle rubric" id="predefined-macros-1">Predefined Macros</p> <p>The following macros corresponding to the offload target compiled for are added implicitly:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_OPENMP_GPU</span></code> when OpenMP target directives are compiled for GPU.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_OPENMP_MULTICORE</span></code> when OpenMP target directives are compiled for multicore CPU.</p></li> </ul> <section id="id2"> <h2><span class="section-number">7.1. </span>Environment Variables<a class="headerlink" href="#id2" title="Permalink to this headline"></a></h2> <p>The OpenMP specification includes many environment variables related to program execution.</p> <p class="title sectiontitle rubric" id="thread-affinity">Thread affinity</p> <p>One important environment variable is <code class="docutils literal notranslate"><span class="pre">OMP_PROC_BIND</span></code>. It controls the OpenMP CPU thread affinity policy. 
When thread affinity is disabled, the operating system is free to move threads between the available CPU cores. When thread affinity is enabled, each thread is bound to a subset of the available CPU cores. The environment variable <code class="docutils literal notranslate"><span class="pre">OMP_PLACES</span></code> can be used to specify how a subset of the available CPU cores is determined for each thread. When set to a valid value, this environment variable will enable thread affinity and override the default thread affinity policy.</p> <p>Binding threads to certain CPU cores is often beneficial for application performance, because that can improve the CPU cache hit rate and limit memory transactions between different NUMA nodes. Therefore, it is important to consider enabling thread affinity for your application.</p> <p>The default value of <code class="docutils literal notranslate"><span class="pre">OMP_PROC_BIND</span></code> is <code class="docutils literal notranslate"><span class="pre">false</span></code>. Thus, thread affinity is disabled by default. This is a conservative setting that allows certain classes of applications (such as OpenMP + MPI) to create multiple processes without taking special care of the thread affinity policy to avoid binding threads in different processes to the same CPU cores.</p> <p>The following table explains the simplest possible values of <code class="docutils literal notranslate"><span class="pre">OMP_PROC_BIND</span></code>. 
For the comprehensive explanation of <code class="docutils literal notranslate"><span class="pre">OMP_PROC_BIND</span></code> and <code class="docutils literal notranslate"><span class="pre">OMP_PLACES</span></code>, please refer to the OpenMP specification.</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 9%" /> <col style="width: 91%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Value</p></th> <th class="head"><p>Behavior</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">OMP_PROC_BIND=false</span></code></p></td> <td><p>Thread affinity is disabled unless <code class="docutils literal notranslate"><span class="pre">OMP_PLACES</span></code> is set to a valid value. When thread affinity is disabled, the operating system is free to assign threads to any available CPU core at any time of the application execution. This is the default value.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">OMP_PROC_BIND=true</span></code></p></td> <td><p>Thread affinity is enabled. Unless <code class="docutils literal notranslate"><span class="pre">OMP_PLACES</span></code> is set, the implementation attempts to assign threads optimally to CPU cores to maximize the cache hit rate and minimize the number of memory transactions between NUMA nodes.</p></td> </tr> </tbody> </table> <p class="title sectiontitle rubric" id="device-offload">Device offload</p> <p>Another important environment variable to understand is <code class="docutils literal notranslate"><span class="pre">OMP_TARGET_OFFLOAD</span></code>. Use this environment variable to affect the behavior of execution on host and device including host fallback. 
The following table explains the behavior determined by each of the values to which you can set this environment variable.</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 29%" /> <col style="width: 71%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Value</p></th> <th class="head"><p>Behavior</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">OMP_TARGET_OFFLOAD=DEFAULT</span></code></p></td> <td><p>Try to execute on a GPU; if a supported GPU is not available, fallback to the host</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">OMP_TARGET_OFFLOAD=DISABLED</span></code></p></td> <td><p>Do not execute on the GPU even if one is available; execute on the host</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">OMP_TARGET_OFFLOAD=MANDATORY</span></code></p></td> <td><p>Execute on a GPU or terminate the program</p></td> </tr> </tbody> </table> <p class="title sectiontitle rubric" id="number-of-teams-on-device">Number of teams on device</p> <p>When an application offloads an <code class="docutils literal notranslate"><span class="pre">omp</span> <span class="pre">target</span> <span class="pre">teams</span></code> construct to the GPU, the number of teams is calculated automatically unless the construct has a <code class="docutils literal notranslate"><span class="pre">num_teams</span></code> clause. The automatic setting of the number of teams can be limited to a maximum value provided by the <code class="docutils literal notranslate"><span class="pre">OMP_NUM_TEAMS</span></code> environment variable. 
The same maximum value can also be set by the application at run time with the function <code class="docutils literal notranslate"><span class="pre">omp_set_num_teams</span></code>.</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 44%" /> <col style="width: 56%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Value</p></th> <th class="head"><p>Behavior</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">OMP_NUM_TEAMS=&lt;positive_integer&gt;</span></code></p></td> <td><p>Maximum number of teams on device</p></td> </tr> </tbody> </table> <p>For the comprehensive explanation of <code class="docutils literal notranslate"><span class="pre">OMP_NUM_TEAMS</span></code>, please refer to the OpenMP specification.</p> <p class="title sectiontitle rubric" id="number-of-threads-in-teams">Number of threads in teams</p> <p>An <code class="docutils literal notranslate"><span class="pre">omp</span> <span class="pre">target</span> <span class="pre">teams</span></code> construct offloaded to the GPU creates a league of teams each consisting of a certain number of threads. The number of threads is the same for all teams in the league, and is calculated automatically unless the construct has a <code class="docutils literal notranslate"><span class="pre">thread_limit</span></code> clause.</p> <p>The environment variable <code class="docutils literal notranslate"><span class="pre">OMP_TEAMS_THREAD_LIMIT</span></code> can be used to limit the maximum number of threads in teams. The same maximum value can be set by the application with the runtime function <code class="docutils literal notranslate"><span class="pre">omp_set_teams_thread_limit</span></code>.</p> <p>For NVIDIA GPUs, we recommend using values that are multiples of 32 (which is the size of the GPU thread warp). 
That equally applies to the <code class="docutils literal notranslate"><span class="pre">OMP_TEAMS_THREAD_LIMIT</span></code> environment variable, the <code class="docutils literal notranslate"><span class="pre">omp_set_teams_thread_limit</span></code> function and the <code class="docutils literal notranslate"><span class="pre">thread_limit</span></code> clause. For any other value, the actual limit on the number of threads per team will likely be rounded down to the nearest multiple of 32. The same guidance applies to the <code class="docutils literal notranslate"><span class="pre">num_threads</span></code> clause as well.</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 44%" /> <col style="width: 56%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Value</p></th> <th class="head"><p>Behavior</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">OMP_TEAMS_THREAD_LIMIT=&lt;positive_integer&gt;</span></code></p></td> <td><p>Maximum number of threads in teams</p></td> </tr> </tbody> </table> <p>For the comprehensive explanation of <code class="docutils literal notranslate"><span class="pre">OMP_TEAMS_THREAD_LIMIT</span></code>, please refer to the OpenMP specification.</p> <p class="title sectiontitle rubric" id="forcing-the-number-of-device-teams-and-threads">Forcing the number of device teams and threads</p> <p>In certain situations, for instance for debugging or performance tuning, it may be desirable to specify an exact number of teams and threads on the GPU. While OpenMP offers a number of convenient ways to control that, e.g. 
the <code class="docutils literal notranslate"><span class="pre">num_teams</span></code> and <code class="docutils literal notranslate"><span class="pre">thread_limit</span></code> clauses, as well as the environment variables described above, they do not guarantee an exact teams and threads configuration.</p> <p>The NVIDIA HPC OpenMP Runtime supports the <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_OMP_CUDA_GRID</span></code> environment variable. When set, it requests the runtime to use the exact number of teams and threads per team when running OpenMP compute constructs on the GPU. Essentially, its effect is to use a specific CUDA grid configuration for any kernel, bypassing runtime and compiler guidance.</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 24%" /> <col style="width: 76%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Value</p></th> <th class="head"><p>Behavior</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">NVCOMPILER_OMP_CUDA_GRID=&lt;num_blocks&gt;,&lt;num_threads&gt;</span></code></p></td> <td><p>The <code class="docutils literal notranslate"><span class="pre">&lt;num_blocks&gt;</span></code> and <code class="docutils literal notranslate"><span class="pre">&lt;num_threads&gt;</span></code> must be positive integers. 
They are used to form a CUDA grid when running GPU kernels associated with <code class="docutils literal notranslate"><span class="pre">omp</span> <span class="pre">target</span></code> compute constructs.</p></td> </tr> </tbody> </table> <p>However, even with an exact CUDA grid specified, the runtime may still use a corrected configuration if that is necessary for a successful kernel launch.</p> <p>Please refer to the <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html">CUDA C++ Programming Guide</a> for the detailed explanation of how the CUDA kernel execution configurations work.</p> </section> <section id="fallback-mode"> <h2><span class="section-number">7.2. </span>Fallback Mode<a class="headerlink" href="#fallback-mode" title="Permalink to this headline"></a></h2> <p>The HPC compilers support host fallback of OpenMP <code class="docutils literal notranslate"><span class="pre">target</span></code> regions when no GPU is present or <code class="docutils literal notranslate"><span class="pre">OMP_TARGET_OFFLOAD</span></code> is set to <code class="docutils literal notranslate"><span class="pre">DISABLED</span></code>. Execution should always be correct but the performance of the target region may not always be optimal when run on the host. OpenMP target regions prescriptively structured for optimal execution on GPUs may not perform well when run on the dissimilar architecture of the CPU. 
To provide performance portability between host and device, we recommend use of the <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct.</p> <p><strong>firstprivates with nowait not supported for host execution</strong></p> <p>There is currently a limitation on the use of the <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clause on target regions intended for execution on the host (-mp or -mp=gpu with <code class="docutils literal notranslate"><span class="pre">OMP_TARGET_OFFLOAD=DISABLED</span></code>). If the target region references variables having the <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code> data-sharing attribute, their concurrent updates are not guaranteed to be safe. To work around this limitation, when running on the host, we recommend avoiding the <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clause on such target regions or equivalently using the <code class="docutils literal notranslate"><span class="pre">taskwait</span></code> construct immediately following the region.</p> </section> <section id="loop"> <h2><span class="section-number">7.3. </span>Loop<a class="headerlink" href="#loop" title="Permalink to this headline"></a></h2> <p>The HPC compilers support the <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct with an extension to the default binding thread set mechanism specified by OpenMP in order to allow the compilers the freedom to analyze loops and dependencies to generate highly parallel code for CPU and GPU targets. In other words, the compilers map <code class="docutils literal notranslate"><span class="pre">loop</span></code> to either teams or to threads, as the compiler chooses, unless the user explicitly specifies otherwise. 
The mapping selected is specific to each target architecture even within the same executable (i.e., GPU offload and host fallback) thereby facilitating performance portability.</p> <p>The shape of the parallelism offered by NVIDIA’s GPUs, consisting of thread blocks and three dimensions of threads therein, differs from the multi-threaded vector parallelism of modern CPUs. The following table summarizes the OpenMP mapping to NVIDIA GPUs and multicore CPUs:</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 24%" /> <col style="width: 36%" /> <col style="width: 40%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Construct</p></th> <th class="head"><p>CPU</p></th> <th class="head"><p>GPU</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">target</span></code></p></td> <td></td> <td><p>starts offload</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">teams</span></code></p></td> <td><p>single team</p></td> <td><p>CUDA thread blocks in grid</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">parallel</span></code></p></td> <td><p>CPU threads</p></td> <td><p>CUDA threads within thread block</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">simd</span></code></p></td> <td><p>hint for vector instructions</p></td> <td><p>simdlen(1)</p></td> </tr> </tbody> </table> <p>HPC programs need to leverage all available parallelism to achieve performance. The programmer can attempt to become an expert in the intricacies of each target architecture and use that knowledge to structure programs accordingly. 
This prescriptive model can be successful but tends to increase source code complexity and often requires restructuring for each new target architecture. Here’s an example where a programmer explicitly requests the steps the compiler should take to map parallelism to two targets:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>#ifdef TARGET_GPU
  #pragma omp target teams distribute reduction(max:error)
#else
  #pragma omp parallel for reduction(max:error)
#endif
for( int j = 1; j &lt; n-1; j++) {
#ifdef TARGET_GPU
  #pragma omp parallel for reduction(max:error)
#endif
  for( int i = 1; i &lt; m-1; i++ ) {
    Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                         + A[j-1][i] + A[j+1][i]);
    error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
  }
}
</pre></div> </div> <p>An alternative is for the programmer to focus on exposing parallelism in a program and allowing a compiler to do the mapping onto the target architectures. The HPC compilers’ implementation of <code class="docutils literal notranslate"><span class="pre">loop</span></code> supports this descriptive model. In this example, the programmer specifies the loop regions to be parallelized by the compiler, and the compilers parallelize <code class="docutils literal notranslate"><span class="pre">loop</span></code> across teams and threads:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>#pragma omp target teams loop reduction(max:error)
for( int j = 1; j &lt; n-1; j++) {
  #pragma omp loop reduction(max:error)
  for( int i = 1; i &lt; m-1; i++ ) {
    Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                         + A[j-1][i] + A[j+1][i]);
    error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
  }
}
</pre></div> </div> <p>The programmer’s tuning tool with <code class="docutils literal notranslate"><span class="pre">loop</span></code> is the <code class="docutils literal notranslate"><span class="pre">bind</span></code> clause. The following table extends the previous mapping example:</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 23%" /> <col style="width: 36%" /> <col style="width: 41%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Construct</p></th> <th class="head"><p>CPU</p></th> <th class="head"><p>GPU</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">loop</span> <span class="pre">bind(teams)</span></code></p></td> <td><p>threads</p></td> <td><p>CUDA thread blocks and threads</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">loop</span> <span class="pre">bind(parallel)</span></code></p></td> <td><p>threads</p></td> <td><p>CUDA threads</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">!$omp</span> <span class="pre">loop</span> <span class="pre">bind(thread)</span></code></p></td> <td><p>single thread 
(useful for vector instructions)</p></td> <td><p>single thread</p></td> </tr> </tbody> </table> <p>Orphaned <code class="docutils literal notranslate"><span class="pre">loop</span></code> constructs within a single file are supported; a binding region of either <code class="docutils literal notranslate"><span class="pre">parallel</span></code> or <code class="docutils literal notranslate"><span class="pre">thread</span></code> must be specified with such loops via the <code class="docutils literal notranslate"><span class="pre">bind</span></code> clause. The compilers support <code class="docutils literal notranslate"><span class="pre">loop</span></code> regions containing procedure calls as long as the callee does not contain OpenMP directives.</p> <p>Here are a few additional examples using <code class="docutils literal notranslate"><span class="pre">loop</span></code>. We also show examples of the type of information the compiler would provide when using the <code class="docutils literal notranslate"><span class="pre">-Minfo</span></code> compiler option.</p> <p>Use of <code class="docutils literal notranslate"><span class="pre">loop</span></code> in Fortran:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>!$omp target teams loop
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown
    do ig_blk = 1, ig_blksize
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          !expensive computation codes
        enddo
      enddo
    enddo
  enddo
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp
42, !$omp target teams loop
    42, Generating &quot;nvkernel_MAIN__F1L42_1&quot; GPU kernel
        Generating Tesla code
        43, Loop parallelized across teams ! blockidx%x
        44, Loop run sequentially
        45, Loop run sequentially
        46, Loop run sequentially
        47, Loop parallelized across threads(128) ! threadidx%x
    42, Generating Multicore code
        43, Loop parallelized across threads
</pre></div> </div> <p>Use of <code class="docutils literal notranslate"><span class="pre">loop</span></code>, <code class="docutils literal notranslate"><span class="pre">collapse</span></code>, and <code class="docutils literal notranslate"><span class="pre">bind</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>!$omp target teams loop collapse(3)
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown
    do ig_blk = 1, ig_blksize
      !$omp loop bind(parallel) collapse(2)
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          !expensive computation codes
        enddo
      enddo
    enddo
  enddo
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp
42, !$omp target teams loop
    42, Generating &quot;nvkernel_MAIN__F1L42_1&quot; GPU kernel
        Generating Tesla code
        43, Loop parallelized across teams collapse(3) ! blockidx%x
        44, ! blockidx%x collapsed
        45, ! blockidx%x collapsed
        47, Loop parallelized across threads(128) collapse(2) ! threadidx%x
        48, ! threadidx%x collapsed
    42, Generating Multicore code
        43, Loop parallelized across threads
</pre></div> </div> <p>Use of <code class="docutils literal notranslate"><span class="pre">loop</span></code>, <code class="docutils literal notranslate"><span class="pre">collapse</span></code>, and <code class="docutils literal notranslate"><span class="pre">bind(thread)</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>!$omp target teams loop collapse(3)
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown
    do ig_blk = 1, ig_blksize
      !$omp loop bind(thread) collapse(2)
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          !expensive computation codes
        enddo
      enddo
    enddo
  enddo
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp
42, !$omp target teams loop
    42, Generating &quot;nvkernel_MAIN__F1L42_1&quot; GPU kernel
        Generating Tesla code
        43, Loop parallelized across teams, threads(128) collapse(3) ! blockidx%x threadidx%x
        44, ! blockidx%x threadidx%x collapsed
        45, ! blockidx%x threadidx%x collapsed
        47, Loop run sequentially
        48, collapsed
    42, Generating Multicore code
        43, Loop parallelized across threads
</pre></div> </div> </section> <section id="openmp-subset"> <h2><span class="section-number">7.4. </span>OpenMP Subset<a class="headerlink" href="#openmp-subset" title="Permalink to this headline"></a></h2> <p>This section contains the subset of OpenMP 5.0 features that the HPC compilers support. We have attempted to define this subset of features to be those that enable, where possible, OpenMP-for-GPU application performance that closely mirrors the success NVIDIA has seen with OpenACC. Almost every feature supported on NVIDIA GPUs is also supported on multicore CPUs, although the reverse is not true. Most constructs from OpenMP 3.1 and OpenMP 4.5 that apply to multicore CPUs are supported for CPU targets, and some features from OpenMP 5.0 are supported as well.</p> <p>OpenMP target offload to NVIDIA GPUs is supported on NVIDIA V100 or later GPUs.</p> <p>The section numbers below correspond to the section numbers in the OpenMP Application Programming Interface Version 5.0 November 2018 document.</p> <p><strong>2. 
Directives</strong></p> <p><strong>2.3 Variant Directives</strong></p> <p><strong>2.3.4 Metadirectives</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">target_device</span></code>/<code class="docutils literal notranslate"><span class="pre">device</span></code> context selector is supported with the <code class="docutils literal notranslate"><span class="pre">kind</span></code>(<code class="docutils literal notranslate"><span class="pre">host</span></code>|<code class="docutils literal notranslate"><span class="pre">nohost</span></code>|<code class="docutils literal notranslate"><span class="pre">cpu</span></code>|<code class="docutils literal notranslate"><span class="pre">gpu</span></code>) and <code class="docutils literal notranslate"><span class="pre">arch</span></code>(<code class="docutils literal notranslate"><span class="pre">nvptx</span></code>|<code class="docutils literal notranslate"><span class="pre">nvptx64</span></code>) trait selectors. The <code class="docutils literal notranslate"><span class="pre">arch</span></code> trait property <code class="docutils literal notranslate"><span class="pre">nvptx</span></code> is an alias for <code class="docutils literal notranslate"><span class="pre">nvptx64</span></code>; any other <code class="docutils literal notranslate"><span class="pre">arch</span></code> trait properties are treated as not matching or are ignored. 
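</p> <p>As a sketch of the supported syntax (the loop and variable names are illustrative, not from a specific application), a metadirective can select a GPU offload variant when an NVIDIA device is targeted and a multicore variant otherwise:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>// Illustrative only: offload when the target architecture matches nvptx64,
// otherwise fall back to a host parallel loop.
#pragma omp metadirective \
    when( device={arch(nvptx64)}: target teams loop ) \
    default( parallel for )
for( int i = 0; i &lt; n; i++ )
    y[i] += a * x[i];
</pre></div> </div> <p>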
The <code class="docutils literal notranslate"><span class="pre">isa</span></code> selector is treated as not matching or is ignored; no support is provided to select a context based on NVIDIA GPU compute capability.</p> <p>The <code class="docutils literal notranslate"><span class="pre">implementation</span></code> context selector is supported with the <code class="docutils literal notranslate"><span class="pre">vendor(nvidia)</span></code> trait selector.</p> <p>The <code class="docutils literal notranslate"><span class="pre">user</span></code> context selector is supported with the <code class="docutils literal notranslate"><span class="pre">condition(expression)</span></code> trait selector, including dynamic <code class="docutils literal notranslate"><span class="pre">user</span></code> traits.</p> <p>The syntax <code class="docutils literal notranslate"><span class="pre">begin</span></code>/<code class="docutils literal notranslate"><span class="pre">end</span> <span class="pre">metadirective</span></code> is not supported.</p> <p><strong>2.3.5 Declare Variant Directive</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">device</span></code> context selector is supported with the <code class="docutils literal notranslate"><span class="pre">kind</span></code>(<code class="docutils literal notranslate"><span class="pre">host</span></code>|<code class="docutils literal notranslate"><span class="pre">nohost</span></code>|<code class="docutils literal notranslate"><span class="pre">cpu</span></code>|<code class="docutils literal notranslate"><span class="pre">gpu</span></code>) and <code class="docutils literal notranslate"><span class="pre">arch</span></code>(<code class="docutils literal notranslate"><span class="pre">nvptx</span></code>|<code class="docutils literal notranslate"><span class="pre">nvptx64</span></code>) trait selectors. 
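</p> <p>For example (a sketch; <code class="docutils literal notranslate"><span class="pre">saxpy</span></code> and <code class="docutils literal notranslate"><span class="pre">saxpy_gpu</span></code> are hypothetical names), a GPU-specific variant of a function can be declared as follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>// Illustrative only: saxpy_gpu replaces saxpy in contexts
// where the device kind matches gpu.
void saxpy_gpu(int n, float a, const float *x, float *y);

#pragma omp declare variant( saxpy_gpu ) match( device={kind(gpu)} )
void saxpy(int n, float a, const float *x, float *y)
{
    for( int i = 0; i &lt; n; i++ )
        y[i] += a * x[i];
}
</pre></div> </div> <p>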
The <code class="docutils literal notranslate"><span class="pre">arch</span></code> trait property <code class="docutils literal notranslate"><span class="pre">nvptx</span></code> is an alias for <code class="docutils literal notranslate"><span class="pre">nvptx64</span></code>; any other <code class="docutils literal notranslate"><span class="pre">arch</span></code> trait properties are treated as not matching or are ignored. The <code class="docutils literal notranslate"><span class="pre">isa</span></code> selector is also treated as not matching or is ignored; no support is provided to select a context based on NVIDIA GPU compute capability.</p> <p>The <code class="docutils literal notranslate"><span class="pre">implementation</span></code> context selector is supported with the <code class="docutils literal notranslate"><span class="pre">vendor(nvidia)</span></code> trait selector; all other implementation trait selectors are treated as not matching.</p> <p>The syntax <code class="docutils literal notranslate"><span class="pre">begin</span></code>/<code class="docutils literal notranslate"><span class="pre">end</span> <span class="pre">declare</span> <span class="pre">variant</span></code> is supported for C/C++.</p> <p><strong>2.4 Requires Directive</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">requires</span></code> directive has limited support. The requirement clauses <code class="docutils literal notranslate"><span class="pre">unified_address</span></code> and <code class="docutils literal notranslate"><span class="pre">unified_shared_memory</span></code> are accepted but have no effect. 
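</p> <p>For example, a translation unit can state the requirement as follows; the directive is accepted but, as noted below, does not by itself enable the feature:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>#pragma omp requires unified_shared_memory
</pre></div> </div> <p>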
To activate OpenMP unified shared memory programming, a command-line option must be passed to the compiler (refer to <a class="reference internal" href="#openmp-unified-mem"><span class="std std-ref">OpenMP with CUDA Unified Memory</span></a> for more details).</p> <p><strong>2.5 Internal Control Variables</strong></p> <p>ICV support is as follows.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">dyn-var</span></code>, <code class="docutils literal notranslate"><span class="pre">nthread-var</span></code>, <code class="docutils literal notranslate"><span class="pre">thread-limit-var</span></code>, <code class="docutils literal notranslate"><span class="pre">max-active-levels-var</span></code>, <code class="docutils literal notranslate"><span class="pre">active-levels-var</span></code>, <code class="docutils literal notranslate"><span class="pre">levels-var</span></code>, <code class="docutils literal notranslate"><span class="pre">run-sched-var</span></code>, <code class="docutils literal notranslate"><span class="pre">dyn-sched-var</span></code>, and <code class="docutils literal notranslate"><span class="pre">stacksize-var</span></code> are supported</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">place-partition-var</span></code>, <code class="docutils literal notranslate"><span class="pre">bind-var</span></code>, <code class="docutils literal notranslate"><span class="pre">wait-policy-var</span></code>, <code class="docutils literal notranslate"><span class="pre">display-affinity-var</span></code>, <code class="docutils literal notranslate"><span class="pre">default-device-var</span></code>, and <code class="docutils literal notranslate"><span class="pre">target-offload-var</span></code> are supported only on the CPU</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">affinity-format-var</span></code> is supported only on the CPU; its value is immutable</p></li> <li><p><code 
class="docutils literal notranslate"><span class="pre">max-task-priority-var</span></code> and <code class="docutils literal notranslate"><span class="pre">def-allocator-var</span></code> are not supported</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">cancel-var</span></code> is not supported; it always returns false</p></li> </ul> <p><strong>2.6 Parallel Construct</strong></p> <p>Support for <code class="docutils literal notranslate"><span class="pre">parallel</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">num_threads</span></code>, <code class="docutils literal notranslate"><span class="pre">default</span></code>, <code class="docutils literal notranslate"><span class="pre">private</span></code>, <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code>, and <code class="docutils literal notranslate"><span class="pre">shared</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clause is supported as described in 2.19.5</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code> and <code class="docutils literal notranslate"><span class="pre">copyin</span></code> clauses are supported only for CPU targets; the compiler emits an error for GPU targets</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">proc_bind</span></code> clause is supported only for CPU targets; it is ignored for GPU targets</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">allocate</span></code> clause is ignored</p></li> </ul> <p><strong>2.7 Teams Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">teams</span></code> construct is supported only when nested within a <code class="docutils literal notranslate"><span 
class="pre">target</span></code> construct that contains no statements, declarations, or directives outside the <code class="docutils literal notranslate"><span class="pre">teams</span></code> construct, or as a combined <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">teams</span></code> construct. The <code class="docutils literal notranslate"><span class="pre">teams</span></code> construct is supported for GPU targets. If the <code class="docutils literal notranslate"><span class="pre">target</span></code> construct falls back to CPU mode, the number of teams is one. Support for <code class="docutils literal notranslate"><span class="pre">teams</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">num_teams</span></code>, <code class="docutils literal notranslate"><span class="pre">thread_limit</span></code>, <code class="docutils literal notranslate"><span class="pre">default</span></code>, <code class="docutils literal notranslate"><span class="pre">private</span></code>, and <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clause is supported as described in 2.19.5</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">shared</span></code> clause is supported for CPU targets and is supported for GPU targets in unified-memory mode</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">allocate</span></code> clause is ignored</p></li> </ul> <p><strong>2.8 Worksharing Constructs</strong></p> <p><strong>2.8.1 Sections Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">sections</span></code> construct is supported only for CPU 
targets; the compiler emits an error for GPU targets. Support for <code class="docutils literal notranslate"><span class="pre">sections</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">private</span></code> and <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clause is supported as described in 2.19.5</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> clause is supported; the optional <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> modifier is not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">allocate</span></code> clause is ignored</p></li> </ul> <p><strong>2.8.2 Single Construct</strong></p> <p>Support for <code class="docutils literal notranslate"><span class="pre">single</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">private</span></code>, <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code>, and <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">copyprivate</span></code> clause is supported only for CPU targets; the compiler emits an error for GPU targets</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">allocate</span></code> clause is ignored</p></li> </ul> <p><strong>2.8.3 Workshare Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">workshare</span></code> construct is supported in Fortran only for CPU targets; the compiler emits an 
error for GPU targets.</p> <p><strong>2.9 Loop-Related Constructs</strong></p> <p><strong>2.9.2 Worksharing-Loop Construct (for/do)</strong></p> <p>Support for worksharing <code class="docutils literal notranslate"><span class="pre">for</span></code> and <code class="docutils literal notranslate"><span class="pre">do</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">private</span></code>, <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code>, and <code class="docutils literal notranslate"><span class="pre">collapse</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clause is supported as described in 2.19.5</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">schedule</span></code> clause is supported; the optional modifiers are not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> clause is supported; the optional <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> modifier is not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">ordered</span></code> clause is supported only for CPU targets; the <code class="docutils literal notranslate"><span class="pre">ordered(n)</span></code> clause is not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">linear</span></code> clause is not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">order(concurrent)</span></code> clause is ignored</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">allocate</span></code> clause is ignored</p></li> </ul> <p><strong>2.9.3 SIMD Directives</strong></p> <p>The <code class="docutils literal 
notranslate"><span class="pre">simd</span></code> construct can be used to provide tuning hints for CPU targets; the <code class="docutils literal notranslate"><span class="pre">simd</span></code> construct is ignored for GPU targets. Support for <code class="docutils literal notranslate"><span class="pre">simd</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clause is supported as described in 2.19.5</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> clause is supported; the optional <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> modifier is not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code>, <code class="docutils literal notranslate"><span class="pre">simdlen</span></code>, and <code class="docutils literal notranslate"><span class="pre">linear</span></code> clauses are not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">safelen</span></code>, <code class="docutils literal notranslate"><span class="pre">aligned</span></code>, <code class="docutils literal notranslate"><span class="pre">nontemporal</span></code>, and <code class="docutils literal notranslate"><span class="pre">order(concurrent)</span></code> clauses are ignored</p></li> </ul> <p>The composite <code class="docutils literal notranslate"><span class="pre">for</span></code><code class="docutils literal notranslate"><span class="pre">simd</span></code> and <code class="docutils literal notranslate"><span class="pre">do</span></code><code class="docutils literal notranslate"><span class="pre">simd</span></code> constructs are supported for CPU targets; they are treated as <code class="docutils literal notranslate"><span class="pre">for</span></code> and <code class="docutils literal 
notranslate"><span class="pre">do</span></code> directives for GPU targets. Supported <code class="docutils literal notranslate"><span class="pre">simd</span></code> clauses are accepted on the composite constructs for the CPU. Any <code class="docutils literal notranslate"><span class="pre">simd</span></code> clauses are ignored for GPU targets.</p> <p>The <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">simd</span></code> directive is ignored.</p> <p><strong>2.9.4 Distribute Directives</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">distribute</span></code> construct is supported within a <code class="docutils literal notranslate"><span class="pre">teams</span></code> construct. Support for <code class="docutils literal notranslate"><span class="pre">distribute</span></code> construct clauses is as follows:</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">private</span></code>, <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code>, <code class="docutils literal notranslate"><span class="pre">collapse</span></code>, and <code class="docutils literal notranslate"><span class="pre">dist_schedule(static</span> <span class="pre">[,</span> <span class="pre">chunksize])</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> clause is not supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">allocate</span></code> clause is ignored</p></li> </ul> <p>The <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">simd</span></code> construct is treated as a <code class="docutils literal notranslate"><span class="pre">distribute</span></code> construct and is supported for GPU targets; valid supported <code
class="docutils literal notranslate"><span class="pre">distribute</span></code> clauses are accepted; <code class="docutils literal notranslate"><span class="pre">simd</span></code> clauses are ignored. The <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">simd</span></code> construct is not supported for CPU targets.</p> <p>The <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">for</span></code> or <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">do</span></code> constructs are supported for GPU targets. Valid supported <code class="docutils literal notranslate"><span class="pre">distribute</span></code>, <code class="docutils literal notranslate"><span class="pre">parallel</span></code>, and <code class="docutils literal notranslate"><span class="pre">for</span></code> or <code class="docutils literal notranslate"><span class="pre">do</span></code> clauses are accepted. 
The <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">for</span></code> or <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">do</span></code> constructs are not supported for CPU targets.</p> <p>The <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">for</span> <span class="pre">simd</span></code> or <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">do</span> <span class="pre">simd</span></code> constructs are treated as <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">for</span></code> or <code class="docutils literal notranslate"><span class="pre">distribute</span> <span class="pre">parallel</span> <span class="pre">do</span></code> constructs and are supported for GPU targets. 
These are not supported for CPU targets.</p> <p><strong>2.9.5 Loop Construct</strong></p> <p>Support for <code class="docutils literal notranslate"><span class="pre">loop</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">private</span></code>, <code class="docutils literal notranslate"><span class="pre">bind</span></code>, and <code class="docutils literal notranslate"><span class="pre">collapse</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clause is supported as described in 2.19.5</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">order(concurrent)</span></code> clause is assumed</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">lastprivate</span></code> clause is not supported</p></li> </ul> <p><strong>2.10 Tasking Constructs</strong></p> <p><strong>2.10.1 Task Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">task</span></code> construct is supported for CPU targets. The compiler emits an error when it encounters <code class="docutils literal notranslate"><span class="pre">task</span></code> within a <code class="docutils literal notranslate"><span class="pre">target</span></code> construct. 
Support for <code class="docutils literal notranslate"><span class="pre">task</span></code> construct clauses is as follows:</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code>, <code class="docutils literal notranslate"><span class="pre">final</span></code>, <code class="docutils literal notranslate"><span class="pre">default</span></code>, <code class="docutils literal notranslate"><span class="pre">private</span></code>, <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code>, and <code class="docutils literal notranslate"><span class="pre">shared</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">depend([dependmodifier,]</span> <span class="pre">dependtype</span> <span class="pre">:</span> <span class="pre">list)</span></code> clause is supported as described in 2.17.11</p></li> </ul> <p><strong>2.10.4 Taskyield Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">taskyield</span></code> construct is supported for CPU targets; it is ignored for GPU targets.</p> <p><strong>2.11 Memory Management Directives</strong></p> <p>The memory management allocators, memory management API routines, and memory management directives are not supported.</p> <p><strong>2.12 Device Directives</strong></p> <p><strong>2.12.1 Device Initialization</strong></p> <p>Depending on how the program is compiled and linked, device initialization may occur at the first <code class="docutils literal notranslate"><span class="pre">target</span></code> construct or API routine call, or may occur implicitly at program startup.</p> <p><strong>2.12.2 Target Data Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> construct is supported for GPU targets. 
Support for <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code>, <code class="docutils literal notranslate"><span class="pre">device</span></code>, <code class="docutils literal notranslate"><span class="pre">use_device_ptr</span></code>, and <code class="docutils literal notranslate"><span class="pre">use_device_addr</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">map</span></code> clause is supported as described in 2.19.7</p></li> </ul> <p><strong>2.12.3 Target Enter Data Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">enter</span> <span class="pre">data</span></code> construct is supported for GPU targets. Support for <code class="docutils literal notranslate"><span class="pre">enter</span> <span class="pre">data</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code>, <code class="docutils literal notranslate"><span class="pre">device</span></code>, and <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">map</span></code> clause is supported as described in 2.19.7.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">depend([dependmodifier,]</span> <span class="pre">dependtype</span> <span class="pre">:</span> <span class="pre">list)</span></code> clause is supported as described in 2.17.11</p></li> </ul> <p><strong>2.12.4 Target Exit Data Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">target</span> <span 
class="pre">exit</span> <span class="pre">data</span></code> construct is supported for GPU targets. Support for <code class="docutils literal notranslate"><span class="pre">exit</span> <span class="pre">data</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code>, <code class="docutils literal notranslate"><span class="pre">device</span></code>, and <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">map</span></code> clause is supported as described in 2.19.7.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">depend([dependmodifier,]</span> <span class="pre">dependtype</span> <span class="pre">:</span> <span class="pre">list)</span></code> clause is supported as described in 2.17.11</p></li> </ul> <p><strong>2.12.5 Target Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">target</span></code> construct is supported for GPU targets. If there is no GPU or GPU offload is otherwise disabled, execution falls back to CPU mode. 
Support for <code class="docutils literal notranslate"><span class="pre">target</span></code> construct clauses is as follows:</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code>, <code class="docutils literal notranslate"><span class="pre">private</span></code>, <code class="docutils literal notranslate"><span class="pre">firstprivate</span></code>, <code class="docutils literal notranslate"><span class="pre">is_device_ptr</span></code>, and <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clauses are supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">device</span></code> clause is supported without the device-modifier <code class="docutils literal notranslate"><span class="pre">ancestor</span></code> keyword</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">map</span></code> clause is supported as described in 2.19.7</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">defaultmap</span></code> clause is supported using OpenMP 5.0 semantics</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">depend([dependmodifier,]</span> <span class="pre">dependtype</span> <span class="pre">:</span> <span class="pre">list)</span></code> clause is supported as described in 2.17.11</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">allocate</span></code> and <code class="docutils literal notranslate"><span class="pre">uses_allocators</span></code> clauses are ignored</p></li> </ul> <p><strong>2.12.6 Target Update Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">update</span></code> construct is supported for GPU targets. 
Support for <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">update</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">if</span></code>, <code class="docutils literal notranslate"><span class="pre">device</span></code>, and <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clauses are supported.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">to</span></code> and <code class="docutils literal notranslate"><span class="pre">from</span></code> clauses are supported without <code class="docutils literal notranslate"><span class="pre">mapper</span></code> or <code class="docutils literal notranslate"><span class="pre">mapid</span></code></p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">depend([dependmodifier,]</span> <span class="pre">dependtype</span> <span class="pre">:</span> <span class="pre">list)</span></code> clause is supported as described in 2.17.11</p></li> </ul> <p>Array sections are supported in <code class="docutils literal notranslate"><span class="pre">to</span></code> and <code class="docutils literal notranslate"><span class="pre">from</span></code> clauses, including noncontiguous array sections. Array section strides are not supported. If the array section is noncontiguous, the OpenMP runtime may have to use multiple host-to-device or device-to-host data transfer operations, which increases the overhead. If the host data is in host-pinned memory, then <code class="docutils literal notranslate"><span class="pre">update</span></code> data transfers with the <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clause are asynchronous. 
This means the data transfer for a <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">update</span> <span class="pre">to</span> <span class="pre">nowait</span></code> may not occur immediately or synchronously with the program thread, so any changes the program makes to the data before a synchronizing operation is reached may affect the transfer. Similarly, a <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">update</span> <span class="pre">from</span> <span class="pre">nowait</span></code> may not occur immediately or synchronously with the program thread, and the downloaded data may not be available until a synchronizing operation is reached. If the host data is not in host-pinned memory, then <code class="docutils literal notranslate"><span class="pre">update</span></code> data transfers with the <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clause may require an intermediate pinned buffer managed by the OpenMP runtime library, and a memory copy on the host between the program memory and the pinned buffer may be needed before starting or finishing the transfer operation, which adds overhead and affects performance. 
To learn more about the pinned buffer, please refer to the <cite>Staging Memory Buffer</cite> section.</p> <p><strong>2.12.7 Declare Target Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span></code> construct is supported for GPU targets.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span> <span class="pre">...</span> <span class="pre">end</span> <span class="pre">declare</span> <span class="pre">target</span></code> is supported</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target(list)</span></code> is supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">to(list)</span></code> clause is supported</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">device_type</span></code> clause is supported for C/C++</p></li> </ul> <p>A function or procedure that is referenced in a function or procedure that appears in a <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span> <span class="pre">to</span></code> clause (explicitly or implicitly) is treated as if its name had implicitly appeared in a <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span> <span class="pre">to</span></code> clause.</p> <p><strong>2.13 Combined Constructs</strong></p> <p>Combined constructs are supported to the extent that the component constructs are themselves supported.</p> <p><strong>2.14 Clauses on Combined and Composite Constructs</strong></p> <p>Clauses on combined constructs are supported to the extent that the clauses are supported on the component constructs.</p> <p><strong>2.16 Master Construct</strong></p> <p>The <code class="docutils literal notranslate"><span
class="pre">master</span></code> construct is supported for CPU and GPU targets.</p> <p><strong>2.17 Synchronization Constructs and Clauses</strong></p> <p><strong>2.17.1 Critical Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">critical</span></code> construct is supported only for CPU targets; the compiler emits an error for GPU targets.</p> <p><strong>2.17.2 Barrier Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">barrier</span></code> construct is supported.</p> <p><strong>2.17.3 Implicit Barriers</strong></p> <p>Implicit barriers are implemented.</p> <p><strong>2.17.4 Implementation-Specific Barriers</strong></p> <p>There may be implementation-specific barriers, and they may be different for CPU targets than for GPU targets.</p> <p><strong>2.17.5 Taskwait Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">taskwait</span></code> construct is supported only for CPU targets; it is ignored for GPU targets.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">depend([dependmodifier,]</span> <span class="pre">dependtype</span> <span class="pre">:</span> <span class="pre">list)</span></code> clause is supported as described in 2.17.11</p></li> </ul> <p><strong>2.17.6 Taskgroup Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">taskgroup</span></code> construct is supported only for CPU targets. 
It is ignored for GPU targets.</p> <p><strong>2.17.7 Atomic Construct</strong></p> <p>Support for <code class="docutils literal notranslate"><span class="pre">atomic</span></code> construct clauses is as follows.</p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">read</span></code>, <code class="docutils literal notranslate"><span class="pre">write</span></code>, <code class="docutils literal notranslate"><span class="pre">update</span></code>, and <code class="docutils literal notranslate"><span class="pre">capture</span></code> clauses are supported.</p></li> <li><p>The memory order clauses <code class="docutils literal notranslate"><span class="pre">seq_cst</span></code>, <code class="docutils literal notranslate"><span class="pre">acq_rel</span></code>, <code class="docutils literal notranslate"><span class="pre">release</span></code>, <code class="docutils literal notranslate"><span class="pre">acquire</span></code>, and <code class="docutils literal notranslate"><span class="pre">relaxed</span></code> are not supported.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">hint</span></code> clause is ignored.</p></li> </ul> <p><strong>2.17.8 Flush Construct</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">flush</span></code> construct is supported only for CPU targets.</p> <p><strong>2.17.9 Ordered Construct and Ordered Directive</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">ordered</span></code> block construct is supported only for CPU targets.</p> <p><strong>2.17.11 Depend Clause</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">depend</span></code> clause is supported on CPU targets. It is not supported on GPU targets. 
The dependence types <code class="docutils literal notranslate"><span class="pre">in</span></code>, <code class="docutils literal notranslate"><span class="pre">out</span></code>, and <code class="docutils literal notranslate"><span class="pre">inout</span></code> are supported. The dependence types <code class="docutils literal notranslate"><span class="pre">mutexinoutset</span></code> and <code class="docutils literal notranslate"><span class="pre">depobj</span></code>, dependence modifier <code class="docutils literal notranslate"><span class="pre">iterator(iters)</span></code>, <code class="docutils literal notranslate"><span class="pre">depend(source)</span></code>, and <code class="docutils literal notranslate"><span class="pre">depend(sink:vector)</span></code> are not supported.</p> <p><strong>2.19 Data Environment</strong></p> <p><strong>2.19.2 Threadprivate Directive</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">threadprivate</span></code> directive is supported only for CPU targets. It is not supported for GPU targets; references to <code class="docutils literal notranslate"><span class="pre">threadprivate</span></code> variables in device code are not supported.</p> <p><strong>2.19.5 Reduction Clauses and Directives</strong></p> <p>The <code class="docutils literal notranslate"><span class="pre">reduction</span></code> clause is supported. 
The optional modifier is not supported.</p> <p><strong>2.19.6 Data Copying Clauses</strong></p> <p>The data copying <code class="docutils literal notranslate"><span class="pre">copyin</span></code> and <code class="docutils literal notranslate"><span class="pre">copyprivate</span></code> clauses are supported only for CPU targets; the compiler emits a compile-time error for GPU targets.</p> <p><strong>2.19.7 Data Mapping Attribute Rules, Clauses, and Directives</strong></p> <ul class="simple"> <li><p>The <code class="docutils literal notranslate"><span class="pre">map([[mapmod[,]...]</span> <span class="pre">maptype:]</span> <span class="pre">datalist)</span></code> clause is supported. Of the map-type-modifiers, <code class="docutils literal notranslate"><span class="pre">always</span></code> is supported, <code class="docutils literal notranslate"><span class="pre">close</span></code> is ignored, and <code class="docutils literal notranslate"><span class="pre">mapper(mapid)</span></code> is not supported.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">defaultmap</span></code> clause is supported using OpenMP 5.0 semantics.</p></li> </ul> <p><strong>2.20 Nesting of Regions</strong></p> <p>For constructs supported in this subset, restrictions on nesting of regions are observed. 
Additionally, nested parallel regions on CPU are not supported and nested teams or parallel regions in a target region are not supported.</p> <p><strong>Runtime Library Routines</strong></p> <p><strong>3.2 Execution Environment Routines</strong></p> <p>The following execution environment runtime API routines are supported.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_set_num_threads</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_num_threads</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_max_threads</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_thread_num</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_thread_limit</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_supported_active_levels</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_set_max_active_levels</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_max_active_levels</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_level</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_ancestor_thread_num</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_team_size</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_num_teams</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_team_num</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_is_initial_device</span></code></p></li> </ul> <p>The following execution environment runtime API routines are supported only on the CPU.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_get_num_procs</span></code>, <code class="docutils literal notranslate"><span 
class="pre">omp_set_dynamic</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_dynamic</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_set_schedule</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_schedule</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_in_final</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_proc_bind</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_num_places</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_affinity_format</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_set_default_device</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_default_device</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_num_devices</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_device_num</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_initial_device</span></code></p></li> </ul> <p>The following execution environment runtime API routines have limited support.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_get_cancellation</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_nested</span></code>; supported only on the CPU; the value returned is always false</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">omp_display_affinity</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_capture_affinity</span></code>; supported only on the CPU; the format specifier is ignored</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">omp_set_nested</span></code>; supported only on the CPU; the value is ignored</p></li>
</ul> <p>The following execution environment runtime API routines are not supported.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_get_place_num_procs</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_place_proc_ids</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_place_num</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_partition_num_places</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_partition_place_nums</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_set_affinity_format</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_max_task_priority</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_pause_resource</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_pause_resource_all</span></code></p></li> </ul> <p><strong>3.3 Lock Routines</strong></p> <p>Lock runtime API routines are not supported on the GPU. 
The following lock runtime API routines are supported on the CPU.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_init_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_init_nest_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_destroy_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_destroy_nest_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_set_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_set_nest_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_unset_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_unset_nest_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_test_lock</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_test_nest_lock</span></code></p></li> </ul> <p>The following lock runtime API routines are not supported.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_init_lock_with_hint</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_init_nest_lock_with_hint</span></code></p></li> </ul> <p><strong>3.4 Timing Routines</strong></p> <p>The following timing runtime API routines are supported.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_get_wtime</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_wtick</span></code></p></li> </ul> <p><strong>3.6 Device Memory Routines</strong></p> <p>The following device memory routines are supported only on the CPU.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_target_is_present</span></code>, <code class="docutils literal notranslate"><span 
class="pre">omp_target_associate_ptr</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_target_disassociate_ptr</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">omp_target_memcpy</span></code> and <code class="docutils literal notranslate"><span class="pre">omp_target_memcpy_rect</span></code> are only supported when copying to and from the same device.</p></li> </ul> <p>The following device memory routines are supported on the CPU; we extend OpenMP to support these in target regions on a GPU, but only allocation and deallocation on the same device are supported.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_target_alloc</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_target_free</span></code></p></li> </ul> <p><strong>3.7 Memory Management Routines</strong></p> <p>The following memory management routines are supported.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_alloc</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_free</span></code></p></li> </ul> <p>The following memory management routines are not supported.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">omp_init_allocator</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_destroy_allocator</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_set_default_allocator</span></code>, <code class="docutils literal notranslate"><span class="pre">omp_get_default_allocator</span></code></p></li> </ul> <p><strong>6 Environment Variables</strong></p> <p>The following environment variables have limited support.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">OMP_SCHEDULE</span></code>, <code class="docutils literal notranslate"><span
class="pre">OMP_NUM_THREADS</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_NUM_TEAMS</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_DYNAMIC</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_PROC_BIND</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_PLACES</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_STACKSIZE</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_WAIT_POLICY</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_MAX_ACTIVE_LEVELS</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_NESTED</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_THREAD_LIMIT</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_TEAMS_THREAD_LIMIT</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_DISPLAY_ENV</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_DISPLAY_AFFINITY</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_DEFAULT_DEVICE</span></code>, and <code class="docutils literal notranslate"><span class="pre">OMP_TARGET_OFFLOAD</span></code> are supported on CPU.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">OMP_CANCELLATION</span></code> and <code class="docutils literal notranslate"><span class="pre">OMP_MAX_TASK_PRIORITY</span></code> are ignored.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">OMP_AFFINITY_FORMAT</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_TOOL</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_TOOL_LIBRARIES</span></code>, <code class="docutils literal notranslate"><span class="pre">OMP_DEBUG</span></code>, and <code class="docutils literal 
notranslate"><span class="pre">OMP_ALLOCATOR</span></code> are not supported.</p></li> </ul> </section> <section id="using-metadirective"> <h2><span class="section-number">7.5. </span>Using metadirective<a class="headerlink" href="#using-metadirective" title="Permalink to this headline"></a></h2> <p>This section contains limitations affecting <code class="docutils literal notranslate"><span class="pre">metadirective</span></code> along with a few guidelines for its use.</p> <p>The Fortran compiler does not support variants leading to an OpenMP directive for which a corresponding <code class="docutils literal notranslate"><span class="pre">end</span></code> directive is required.</p> <p>Nesting <code class="docutils literal notranslate"><span class="pre">user</span></code> conditions, while legal, may create situations that the HPC Compilers do not handle gracefully. To avoid potential problems, use <code class="docutils literal notranslate"><span class="pre">device</span></code> traits inside <code class="docutils literal notranslate"><span class="pre">user</span></code> conditions instead. 
The following example illustrates this best practice.</p> <p>Avoid nesting dynamic <code class="docutils literal notranslate"><span class="pre">user</span></code> conditions like this:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp metadirective \</span> <span class="cp"> when( user={condition(use_offload)} : target teams distribute) \</span> <span class="cp"> default( parallel for schedule(static) )</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="cp">#pragma omp metadirective \</span> <span class="cp"> when( user={condition(use_offload)} : parallel for)</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span 
class="w"> </span><span class="p">...</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> </pre></div> </div> <p>Instead, use <code class="docutils literal notranslate"><span class="pre">target_device</span></code> and <code class="docutils literal notranslate"><span class="pre">device</span></code> traits within dynamic <code class="docutils literal notranslate"><span class="pre">user</span></code> conditions like this:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp metadirective \</span> <span class="cp"> when( target_device={kind(gpu)}, user={condition(use_offload)} : target teams distribute) \</span> <span class="cp"> default( parallel for schedule(static) )</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="cp">#pragma omp metadirective \</span> <span class="cp"> when( device={kind(gpu)} : parallel for)</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> 
</span><span class="n">j</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The HPC compilers do not support nesting <code class="docutils literal notranslate"><span class="pre">metadirective</span></code> inside a <code class="docutils literal notranslate"><span class="pre">target</span></code> construct applying to a syntactic block leading to a <code class="docutils literal notranslate"><span class="pre">teams</span></code> variant. Some examples:</p> <p>The compilers will emit an error given the following code:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp target map(to:v1,v2) map(from:v3)</span> <span class="p">{</span><span class="w"></span> <span class="cp">#pragma omp metadirective \</span> <span class="cp">when( device={arch(&quot;nvptx&quot;)} : teams distribute parallel for) \</span> <span class="cp">default( parallel for)</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">v3</span><span 
class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">v1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">v2</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The compilers will always match <code class="docutils literal notranslate"><span class="pre">device={arch(&quot;nvptx&quot;)}</span></code> given the following code:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp target map(to:v1,v2) map(from:v3)</span> <span class="cp">#pragma omp metadirective \</span> <span class="cp">when( device={arch(&quot;nvptx&quot;)} : teams distribute parallel for) \</span> <span class="cp">default( parallel for)</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">v3</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">v1</span><span class="p">[</span><span class="n">i</span><span 
class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">v2</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The compilers match <code class="docutils literal notranslate"><span class="pre">device={arch(&quot;nvptx&quot;)}</span></code> for GPU code, and <code class="docutils literal notranslate"><span class="pre">default</span></code> for host fallback, given the following code:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp target teams distribute map(to:v1,v2) map(from:v3)</span> <span class="k">for</span><span class="w"> </span><span class="p">(...)</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="cp">#pragma omp metadirective \</span> <span class="cp">when( device={arch(&quot;nvptx&quot;)} : parallel for) \</span> <span class="cp">default( simd )</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">v3</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">v1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span 
class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">v2</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> </section> <section id="mapping-target-constructs-to-cuda-streams"> <h2><span class="section-number">7.6. </span>Mapping target constructs to CUDA streams<a class="headerlink" href="#mapping-target-constructs-to-cuda-streams" title="Permalink to this headline"></a></h2> <p>An OpenMP target task generating construct is executed on the GPU in a CUDA stream. The following are target task generating constructs:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">enter</span> <span class="pre">data</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">exit</span> <span class="pre">data</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">update</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">target</span></code></p></li> </ul> <p>This section explains how these target constructs are mapped to CUDA streams. The relationship with the OpenACC queues is also explained below.</p> <p>Keep in mind that the <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> construct does not generate a task and is not necessarily executed in a CUDA stream. It also cannot have the <code class="docutils literal notranslate"><span class="pre">depend</span></code> and <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clauses, thus its behavior cannot be directly controlled by the user application. 
The rest of this section does not cover the behavior of the <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> construct.</p> <p>Any task-generating target construct can have <code class="docutils literal notranslate"><span class="pre">depend</span></code> and <code class="docutils literal notranslate"><span class="pre">nowait</span></code> clauses. The NVIDIA OpenMP Runtime uses these clauses as guidance for how to map the construct to a specific CUDA stream. Below is a breakdown of how the clauses affect the mapping decisions.</p> <p><strong>‘target’ without ‘depend’, without ‘nowait’</strong></p> <p>For these constructs, the per-thread default CUDA stream is normally used. The stream is unique for each host thread, so target regions created by different host threads will execute independently in different streams according to the CUDA rules described in <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html">CUDA Runtime API</a>; see the rules in the “Per-thread default stream” section.</p> <p>The OpenACC queue <code class="docutils literal notranslate"><span class="pre">acc_async_sync</span></code> is initially associated with the same per-thread default CUDA stream. The user can change the association by calling <code class="docutils literal notranslate"><span class="pre">acc_set_cuda_stream(acc_async_sync,</span> <span class="pre">stream)</span></code>. 
This will change accordingly the stream used for <code class="docutils literal notranslate"><span class="pre">target</span></code> without <code class="docutils literal notranslate"><span class="pre">nowait</span></code>.</p> <p>The CUDA stream handle can be directly obtained via the <code class="docutils literal notranslate"><span class="pre">ompx_get_cuda_stream(int</span> <span class="pre">device,</span> <span class="pre">int</span> <span class="pre">nowait)</span></code> function, with the <code class="docutils literal notranslate"><span class="pre">nowait</span></code> parameter set to 0. The per-thread default stream can be obtained with the CUDA handle <code class="docutils literal notranslate"><span class="pre">CU_STREAM_PER_THREAD</span></code> or <code class="docutils literal notranslate"><span class="pre">cudaStreamPerThread</span></code>.</p> <p>Here is an example of how a custom CUDA stream can be used to substitute the default stream:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="n">__global__</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">kernel</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="n">data</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">CUstream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cuStreamCreate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stream</span><span class="p">,</span><span class="w"> </span><span class="n">CU_STREAM_DEFAULT</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">acc_set_cuda_stream</span><span class="p">(</span><span class="n">acc_async_sync</span><span 
class="p">,</span><span class="w"> </span><span class="n">stream</span><span class="p">);</span><span class="w"></span> <span class="cp">#pragma omp target enter data map(to:data[:N])</span> <span class="cp">#pragma omp target data use_device_ptr(data)</span> <span class="w"> </span><span class="n">kernel</span><span class="o">&lt;&lt;&lt;</span><span class="n">N</span><span class="o">/</span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">stream</span><span class="o">&gt;&gt;&gt;</span><span class="p">(</span><span class="n">data</span><span class="p">);</span><span class="w"></span> <span class="cp">#pragma omp target teams distribute parallel for</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="cp">#pragma omp target exit data map(from:data[:N])</span> </pre></div> </div> <p>Note there is no explicit stream synchronization after the CUDA <code class="docutils literal notranslate"><span class="pre">kernel</span></code> is 
launched. The stream is synchronized automatically at the <code class="docutils literal notranslate"><span class="pre">target</span></code> constructs that follow.</p> <p><strong>‘target’ with ‘depend’, without ‘nowait’</strong></p> <p>For this construct, the runtime will block the current thread until all dependencies listed in the <code class="docutils literal notranslate"><span class="pre">depend</span></code> clause are resolved. Then, the <code class="docutils literal notranslate"><span class="pre">target</span></code> construct will be executed in the default per-thread CUDA stream as described in the previous section (that is, as if there were no <code class="docutils literal notranslate"><span class="pre">depend</span></code> clause).</p> <p><strong>‘target’ with ‘nowait’, without ‘depend’</strong></p> <p>By default, the runtime will select a CUDA stream for each new <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> construct. The selected stream may be the same as one used for a prior <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> construct. That is, there is no guarantee of uniqueness of the selected stream.</p> <p>This is different from the OpenACC model, which uses the same CUDA stream associated with the <code class="docutils literal notranslate"><span class="pre">acc_async_noval</span></code> queue for any asynchronous construct with the <code class="docutils literal notranslate"><span class="pre">async</span></code> clause without an argument. To change this behavior, the user can call the <code class="docutils literal notranslate"><span class="pre">ompx_set_cuda_stream_auto(int</span> <span class="pre">enable)</span></code> function with the <code class="docutils literal notranslate"><span class="pre">enable</span></code> parameter set to 0. 
In this case, the CUDA stream associated with the <code class="docutils literal notranslate"><span class="pre">acc_async_noval</span></code> OpenACC queue will be used for all OpenMP <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> constructs. Another way to enable this behavior is to set the environment variable <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_OMP_AUTO_STREAMS</span></code> to <code class="docutils literal notranslate"><span class="pre">FALSE</span></code>.</p> <p>To access the stream used for the next <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> construct, the user can call the <code class="docutils literal notranslate"><span class="pre">ompx_get_cuda_stream(int</span> <span class="pre">device,</span> <span class="pre">int</span> <span class="pre">nowait)</span></code> function, with the <code class="docutils literal notranslate"><span class="pre">nowait</span></code> parameter set to 1.</p> <p><strong>‘target’ with both ‘depend’ and ‘nowait’</strong></p> <p>The decision on which CUDA stream to use in this case relies on previously scheduled target and host tasks sharing a subset of the dependencies listed in the <code class="docutils literal notranslate"><span class="pre">depend</span></code> clause:</p> <ul class="simple"> <li><p>If the target construct has only one dependency, which is of the type <code class="docutils literal notranslate"><span class="pre">inout</span></code> or <code class="docutils literal notranslate"><span class="pre">out</span></code>, and that dependency maps to a previously scheduled <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">depend(...)</span> <span class="pre">nowait</span></code> construct, and the same device is used for both target constructs, then the CUDA stream which the previous target task was scheduled to 
will be used.</p></li> <li><p>Otherwise, a CUDA stream will be selected for this target construct according to the stream selection policy.</p></li> </ul> <p>Note that target constructs with a single <code class="docutils literal notranslate"><span class="pre">in</span></code> dependency can be scheduled on a newly selected CUDA stream. This is to allow parallel execution of multiple <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> constructs that depend on data produced by another previously scheduled <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> construct.</p> <p>Here is a simplified example of how a <code class="docutils literal notranslate"><span class="pre">target</span></code> construct, a CUDA library function and a CUDA kernel can be executed on the GPU in the same stream asynchronously with respect to the host thread:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="n">__global__</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">kernel</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="n">data</span><span class="p">);</span><span class="w"></span> <span class="n">cudaStream_t</span><span class="w"> </span><span class="n">stream</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">cudaStream_t</span><span class="p">)</span><span class="n">ompx_get_cuda_stream</span><span class="p">(</span><span class="n">omp_get_default_device</span><span class="p">(),</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w"></span> <span class="n">cufftSetStream</span><span class="p">(</span><span 
class="n">cufft_plan</span><span class="p">,</span><span class="w"> </span><span class="n">stream</span><span class="p">);</span><span class="w"></span> <span class="cp">#pragma omp target enter data map(to:data[:N]) depend(inout:stream) nowait</span> <span class="cp">#pragma omp target data use_device_ptr(data)</span> <span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">kernel</span><span class="o">&lt;&lt;&lt;</span><span class="n">N</span><span class="o">/</span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">stream</span><span class="o">&gt;&gt;&gt;</span><span class="p">(</span><span class="n">data</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">cufftExecC2C</span><span class="p">(</span><span class="n">cufft_plan</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">CUFFT_FORWARD</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="cp">#pragma omp target teams distribute parallel for depend(inout:stream) nowait</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span 
class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="cp">#pragma omp target exit data map(from:data[:N]) depend(inout:stream) nowait</span> </pre></div> </div> <p>Note that the <code class="docutils literal notranslate"><span class="pre">stream</span></code> variable holds the CUDA stream handle and also serves as the dependency for the <code class="docutils literal notranslate"><span class="pre">target</span></code> constructs. This dependency enforces the order of execution and also guarantees the target constructs are on the same stream that was returned from the <code class="docutils literal notranslate"><span class="pre">ompx_get_cuda_stream</span></code> function call.</p> <p><strong>NVIDIA OpenMP API to access and control CUDA streams</strong></p> <p>NVIDIA OpenMP Runtime provides the following API to access CUDA streams and to control their use.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="nf">ompx_get_cuda_stream</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">device</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">nowait</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>This function returns the handle of the CUDA stream that will be used for the next <code class="docutils literal notranslate"><span class="pre">target</span></code> construct:</p> <ul class="simple"> <li><p>If the <code class="docutils literal 
notranslate"><span class="pre">nowait</span></code> parameter is set to 0, it returns the CUDA stream associated with the OpenACC queue <code class="docutils literal notranslate"><span class="pre">acc_async_sync</span></code>, which is initially mapped to the default per-thread CUDA stream;</p></li> <li><p>Otherwise, it returns a CUDA stream which will be used for the next <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> construct that cannot be mapped to an existing stream according to the rules for the <code class="docutils literal notranslate"><span class="pre">depend</span></code> clause.</p></li> </ul> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">ompx_set_cuda_stream_auto</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">enable</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>This function sets the policy for how CUDA streams are selected for <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> constructs:</p> <ul class="simple"> <li><p>If the <code class="docutils literal notranslate"><span class="pre">enable</span></code> parameter is set to a non-zero value, an internally selected CUDA stream will be used for each <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> construct that follows. This is the default behavior;</p></li> <li><p>Otherwise, the CUDA stream associated with the OpenACC queue <code class="docutils literal notranslate"><span class="pre">acc_async_noval</span></code> will be used for all <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">nowait</span></code> constructs that follow. 
This becomes the default behavior if the environment variable <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_OMP_AUTO_STREAMS</span></code> is set to <code class="docutils literal notranslate"><span class="pre">FALSE</span></code>.</p></li> </ul> <p>The setting is done only for the host thread which calls this function.</p> </section> <section id="noncontiguous-array-sections"> <h2><span class="section-number">7.7. </span>Noncontiguous Array Sections<a class="headerlink" href="#noncontiguous-array-sections" title="Permalink to this headline"></a></h2> <p>Array sections can be used in <code class="docutils literal notranslate"><span class="pre">to</span></code> and <code class="docutils literal notranslate"><span class="pre">from</span></code> clauses, including noncontiguous array sections. The noncontiguous array section must be specified in a single <code class="docutils literal notranslate"><span class="pre">map</span></code> clause; it cannot be split between multiple directives. Although this feature may become a part of a future OpenMP specification, at this time it is an NVIDIA HPC compilers extension.</p> <span class="target" id="openmp-unified-mem"></span></section> <section id="openmp-with-cuda-unified-memory"> <h2><span class="section-number">7.8. </span>OpenMP with CUDA Unified Memory<a class="headerlink" href="#openmp-with-cuda-unified-memory" title="Permalink to this headline"></a></h2> <p>This section will focus on OpenMP unified shared memory programming, and assume users are familiar with Separate, Managed, and Unified Memory Modes explained in the <a class="reference internal" href="#acc-mem-model"><span class="std std-ref">Memory Model</span></a> and <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a> sections. 
OpenMP unified shared memory corresponds to Unified Memory Mode in the NVHPC Compilers and can be enabled with the <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> flag. Source code containing the <code class="docutils literal notranslate"><span class="pre">requires</span> <span class="pre">unified_shared_memory</span></code> directive is accepted, but the <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> flag is still required to activate Unified Memory Mode.</p> <p>In Unified Memory Mode, <code class="docutils literal notranslate"><span class="pre">map</span></code> clauses on <code class="docutils literal notranslate"><span class="pre">target</span></code> constructs are optional. Additionally, <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span></code> directives are optional for variables with static storage duration that are accessed inside functions to which such a directive is applied. OpenMP unified shared memory eases accelerator programming on GPUs by removing the need for explicit data management; only the parallelism in the compute regions needs to be expressed.</p> <p>In Unified Memory Mode, all data is managed by the CUDA runtime. Explicit data <code class="docutils literal notranslate"><span class="pre">map</span></code> clauses, which manage data movement between the host and devices, become optional. All variables are accessible from the OpenMP offload compute regions executing on the GPU. 
The <code class="docutils literal notranslate"><span class="pre">map</span></code> clause with the <code class="docutils literal notranslate"><span class="pre">alloc</span></code>, <code class="docutils literal notranslate"><span class="pre">to</span></code>, <code class="docutils literal notranslate"><span class="pre">from</span></code>, and <code class="docutils literal notranslate"><span class="pre">tofrom</span></code> map types will not result in any device allocation or data transfer. The OpenMP runtime, however, may leverage such clauses to communicate preferable data placement to the CUDA runtime by means of memory hint APIs, as elaborated in the following blog post on the NVIDIA website: <a class="reference external" href="https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management">Simplifying GPU Application Development with Heterogeneous Memory Management</a>. Device memory can be allocated or deallocated in OpenMP programs in Unified Memory Mode by using the <code class="docutils literal notranslate"><span class="pre">omp_target_alloc</span></code> and <code class="docutils literal notranslate"><span class="pre">omp_target_free</span></code> API calls. Note that the memory allocated through <code class="docutils literal notranslate"><span class="pre">omp_target_alloc</span></code> cannot be accessed by the host.</p> <p><strong>Understanding Data Movement</strong></p> <p>When the compiler encounters a compute construct without visible <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> directives or <code class="docutils literal notranslate"><span class="pre">map</span></code> clauses, it attempts to determine what data is required for correct execution of the region on the GPU. 
When the compiler is unable to determine the size and shape of data needing to be accessible on the device, it behaves as follows:</p> <ul class="simple"> <li><p>In Separate Memory Mode, the compiler may not be able to alert you to the need for an explicit data clause specifying size and/or shape of data being copied to/from the GPU. In this case, the default length of one may be used. This may cause illegal memory access errors at runtime on the GPU devices.</p></li> <li><p>In Managed Memory Mode (<code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code>), the compiler assumes the data is allocated in managed memory and thus is accessible from the device; if this assumption is wrong, for example, if the data was defined globally or is located on the CPU stack, the program may fail at runtime.</p></li> <li><p>In Unified Memory Mode (<code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>), all data is accessible from the device making information about size and shape unnecessary.</p></li> </ul> <p>Take the following example in C:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp declare target</span> <span class="kt">void</span><span class="w"> </span><span class="nf">set</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">ptr</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">dim</span><span class="p">){</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">idx</span><span class="w"> 
</span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dim</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">j</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">ptr</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">someval</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="cp">#pragma omp end declare target</span> <span class="kt">void</span><span class="w"> </span><span class="nf">fill2d</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">ptr</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">dim</span><span class="p">){</span><span class="w"></span> <span class="cp">#pragma omp target teams distribute parallel for</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">dim</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span
class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">dim</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="o">++</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">set</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="n">dim</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>In Separate Memory Mode, the only way to guarantee correctness for this example is to specify an array section in the <code class="docutils literal notranslate"><span class="pre">target</span></code> construct as follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp target teams distribute parallel for map(from: ptr[0:dim*dim])</span> </pre></div> </div> <p>This change explicitly instructs the OpenMP implementation about the precise data segment used within the target for loop.</p> <p>In Unified Memory Mode, the <code class="docutils literal notranslate"><span class="pre">map</span></code> clause is not required.</p> <p>The next example, in Fortran, illustrates how a global variable can be accessed in an OpenMP routine without requiring any explicit annotation.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">module</span><span class="w"> </span><span class="n">m</span><span class="w"></span> <span 
class="n">integer</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">globmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1234</span><span class="w"></span> <span class="n">contains</span><span class="w"></span> <span class="n">subroutine</span><span class="w"> </span><span class="n">findmin</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"></span> <span class="o">!</span><span class="n">$omp</span><span class="w"> </span><span class="n">declare</span><span class="w"> </span><span class="n">target</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="p">,</span><span class="w"> </span><span class="n">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="o">:</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">.</span><span class="n">lt</span><span class="p">.</span><span class="w"> </span><span class="n">globmin</span><span class="p">)</span><span class="w"> 
</span><span class="n">then</span><span class="w"></span> <span class="w"> </span><span class="n">globmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">endif</span><span class="w"></span> <span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="n">end</span><span class="w"> </span><span class="n">subroutine</span><span class="w"></span> <span class="n">end</span><span class="w"> </span><span class="k">module</span><span class="w"> </span><span class="n">m</span><span class="w"></span> </pre></div> </div> <p>Compile the example above for Unified Memory Mode:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">nvfortran</span><span class="w"> </span><span class="o">-</span><span class="n">mp</span><span class="o">=</span><span class="n">gpu</span><span class="w"> </span><span class="o">-</span><span class="n">gpu</span><span class="o">=</span><span class="n">mem</span><span class="o">:</span><span class="n">unified</span><span class="w"> </span><span class="n">example</span><span class="p">.</span><span class="n">f90</span><span class="w"></span> </pre></div> </div> <p>The source does not need any OpenMP directives to access module variable <code class="docutils literal notranslate"><span class="pre">globmin</span></code>, to either read or update its value, in the routine invoked from CPU and GPU. Moreover, any access to <code class="docutils literal notranslate"><span class="pre">globmin</span></code> will be made to the same exact instance of the variable from CPU and GPU; its value is synchronized automatically. 
In Separate or Managed Memory Modes, such behavior can only be achieved with a combination of OpenMP <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span></code> and <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">update</span></code> directives in the source code.</p> <p>Migrating existing OpenMP applications written for Separate Memory Mode should, in most cases, be a seamless process requiring no source changes. Some data access patterns, however, may lead to different results produced during application execution in Unified Memory Mode. Applications which rely on having separate data copies in GPU memory to conduct temporary computations on the GPU – without maintaining data synchronization with the CPU – pose a challenge for migration to unified memory. For the following Fortran example, the value of variable <code class="docutils literal notranslate"><span class="pre">c</span></code> after the last loop will differ depending on whether the example is compiled with or without <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">b</span><span class="p">(</span><span class="o">:</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="w"></span> <span class="o">!</span><span class="n">$omp</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="n">to</span><span class="o">:</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span 
class="w"> </span><span class="n">map</span><span class="p">(</span><span class="n">from</span><span class="o">:</span><span class="w"> </span><span class="n">a</span><span class="p">)</span><span class="w"></span> <span class="o">!</span><span class="n">$omp</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="n">teams</span><span class="w"> </span><span class="n">distribute</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="n">end</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="o">!</span><span class="n">$omp</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="n">teams</span><span class="w"> </span><span class="n">distribute</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span
class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="n">end</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="o">!</span><span class="n">$omp</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="n">data</span><span class="w"></span> <span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="n">end</span><span class="w"> </span><span class="k">do</span><span class="w"></span> </pre></div> </div> <p>Without Unified Memory, array <code class="docutils literal notranslate"><span class="pre">b</span></code> is copied into the GPU memory at the beginning of the OpenMP <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> region. 
It is then updated in the GPU memory and used to compute elements of array <code class="docutils literal notranslate"><span class="pre">a</span></code>. As instructed by the data clause <code class="docutils literal notranslate"><span class="pre">map(to:b)</span></code>, <code class="docutils literal notranslate"><span class="pre">b</span></code> is not copied back to the CPU memory at the end of the <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> region, and therefore its initial value is used in the computation of <code class="docutils literal notranslate"><span class="pre">c</span></code>. With <code class="docutils literal notranslate"><span class="pre">-mp=gpu</span> <span class="pre">-gpu=mem:unified</span></code>, the updated value of <code class="docutils literal notranslate"><span class="pre">b</span></code> from the first loop is automatically visible in the last loop, leading to a different final value of <code class="docutils literal notranslate"><span class="pre">c</span></code>.</p> <p>Additional complications may arise with asynchronous execution, since the use of unified shared memory may require extra synchronization to avoid data races.</p> </section> <section id="multiple-device-support"> <h2><span class="section-number">7.9. </span>Multiple Device Support<a class="headerlink" href="#multiple-device-support" title="Permalink to this headline"></a></h2> <p>A program can use multiple devices on a single node.</p> <p>This functionality is supported using the <code class="docutils literal notranslate"><span class="pre">omp_set_default_device</span></code> API call and the <code class="docutils literal notranslate"><span class="pre">device()</span></code> clause on <code class="docutils literal notranslate"><span class="pre">target</span></code> constructs. Our experience is that most programs use MPI parallelism with each MPI rank selecting a single GPU to which to offload. 
Some programs assign multiple MPI ranks to each GPU, in order to keep the GPU fully occupied, though the fixed memory size of the GPU limits how effective this strategy can be. Similarly, other programs use OpenMP thread parallelism on the CPU, with each thread selecting a single GPU to which to offload.</p> <span class="target" id="openmp-interop-cuda"></span></section> <section id="interoperability-with-cuda"> <h2><span class="section-number">7.10. </span>Interoperability with CUDA<a class="headerlink" href="#interoperability-with-cuda" title="Permalink to this headline"></a></h2> <p>The HPC Compilers support interoperability of OpenMP and CUDA to the same extent they support CUDA interoperability with OpenACC.</p> <p>If OpenMP and CUDA code coexist in the same program, the OpenMP runtime and the CUDA runtime use the same CUDA context on each GPU. To enable this coexistence, use the compilation and linking option <code class="docutils literal notranslate"><span class="pre">-cuda</span></code>. CUDA-allocated data is available for use inside OpenMP target regions with the <code class="docutils literal notranslate"><span class="pre">is_device_ptr</span></code> clause, the OpenMP analog of OpenACC’s <code class="docutils literal notranslate"><span class="pre">deviceptr()</span></code> clause.</p> <p>OpenMP-allocated data is available for use inside CUDA kernels directly if the data was allocated with the <code class="docutils literal notranslate"><span class="pre">omp_target_alloc()</span></code> API call; if the OpenMP data was created with a <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span> <span class="pre">map</span></code> clause, it can be made available for use inside CUDA kernels using the <code class="docutils literal notranslate"><span class="pre">use_device_addr</span></code> clause on a <code class="docutils literal notranslate"><span class="pre">target</span> <span class="pre">data</span></code> construct. 
Calling a CUDA device function inside an OpenMP target region is supported, as long as the CUDA function is a scalar function, that is, does not use CUDA shared memory or any inter-thread synchronization. Calling an OpenMP <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span></code> function inside a CUDA kernel is supported as long as the <code class="docutils literal notranslate"><span class="pre">declare</span> <span class="pre">target</span></code> function has no OpenMP constructs or API calls.</p> </section> <section id="interoperability-with-other-openmp-compilers"> <h2><span class="section-number">7.11. </span>Interoperability with Other OpenMP Compilers<a class="headerlink" href="#interoperability-with-other-openmp-compilers" title="Permalink to this headline"></a></h2> <p>OpenMP CPU-parallel object files compiled with NVIDIA’s HPC compilers are interoperable with OpenMP CPU-parallel object files compiled by other compilers using the KMPC OpenMP runtime interface. Compilers supporting KMPC OpenMP include Intel and CLANG. The HPC compilers support a GNU OpenMP interface layer as well which provides OpenMP CPU-parallel interoperability with the GNU compilers.</p> <p>For OpenMP GPU computation, there is no similar formal or informal standard library interface for launching GPU compute constructs or managing GPU memory. There is also no standard way to manage the device context in such a way as to interoperate between multiple offload libraries. The HPC compilers therefore do not support interoperability of device compute offload operations and similar operations generated with another compiler.</p> </section> <section id="gnu-stl"> <h2><span class="section-number">7.12. </span>GNU STL<a class="headerlink" href="#gnu-stl" title="Permalink to this headline"></a></h2> <p>When using nvc++ on Linux, the GNU STL is thread-safe to the extent listed in the GNU documentation as required by the C++11 standard. 
If an STL thread-safety issue is suspected, the suspect code can be run sequentially inside an OpenMP region using <code class="docutils literal notranslate"><span class="pre">#pragma</span> <span class="pre">omp</span> <span class="pre">critical</span></code> sections.</p> <span class="target" id="stdpar-use"></span></section> </section> <section id="using-stdpar"> <h1><span class="section-number">8. </span>Using Stdpar<a class="headerlink" href="#using-stdpar" title="Permalink to this headline"></a></h1> <p>This chapter describes the NVIDIA HPC Compiler support for standard language parallelism, also known as Stdpar:</p> <ul class="simple"> <li><p>ISO C++ standard library parallel algorithms with <code class="docutils literal notranslate"><span class="pre">nvc++</span></code></p></li> <li><p>ISO Fortran <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">concurrent</span></code> loop construct with <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code></p></li> </ul> <p>Use the <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> compiler option to enable parallel execution with standard parallelism. The sub-options to <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> are the following:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">gpu</span></code>: compile for parallel execution on GPU; this sub-option is the default. This feature is supported on the NVIDIA Pascal architecture and newer.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">multicore</span></code>: compile for multicore CPU execution.</p></li> </ul> <p>By default, NVC++ auto-detects and generates GPU code for the type of GPU that is installed on the system on which the compiler is running. 
To generate code for a specific GPU architecture, which may be necessary when the application is compiled and run on different systems, add the <code class="docutils literal notranslate"><span class="pre">-gpu=ccXX</span></code> command-line option. More details can be found in <a class="reference internal" href="#compute-cap"><span class="std std-ref">Compute Capability</span></a>.</p> <p class="title sectiontitle rubric" id="predefined-macros-2">Predefined Macros</p> <p>The following macros corresponding to the parallel execution target compiled for are added implicitly:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_STDPAR_GPU</span></code> for parallel execution on GPU.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_STDPAR_MULTICORE</span></code> for parallel execution on multicore CPU.</p></li> </ul> <section id="gpu-memory-modes"> <h2><span class="section-number">8.1. </span>GPU Memory Modes<a class="headerlink" href="#gpu-memory-modes" title="Permalink to this headline"></a></h2> <p>When compiling for GPU execution, Stdpar utilizes <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a> for managing data accessed from the sequential code running on CPU and from the parallel code on GPU.</p> <p>The compiler detects the memory capability of the system on which the compiler is running and uses that information to enable the correct memory mode as follows:</p> <ul class="simple"> <li><p>When compiled on the platform with full CUDA Unified Memory capability, <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> implies <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>.</p></li> <li><p>When compiled on the platform with CUDA Managed Memory capability only, <code class="docutils literal notranslate"><span 
class="pre">-stdpar</span></code> implies <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code>.</p></li> </ul> <p>To compile code for a specific Memory Mode regardless of the memory capability of the system on which you are compiling, add the desired <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> or <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> option.</p> <p>Stdpar with Separate Memory Mode can only be supported when the data are fully managed through the features of other programming models, e.g., OpenACC.</p> <p>When using Unified Memory Mode, all of the Managed Memory Mode restrictions on variables used on the GPU in standard language parallel code are removed.</p> <p>If the compiler utilizes CUDA Managed Memory automatically, the interception of deallocations is enabled implicitly at runtime. This prevents deallocating data with a mismatched API, which may lead to undefined behavior. The interception incurs some runtime overhead and may be unnecessary if allocations and deallocations for all data in the application are performed using matching APIs. The interception can be disabled using dedicated command-line options detailed in <a class="reference internal" href="#gpu-mem-intercept"><span class="std std-ref">Interception of Deallocations</span></a>. More details about the memory modes supported by the NVIDIA HPC Compilers and dedicated command-line options can be found in <a class="reference internal" href="#acc-mem-model"><span class="std std-ref">Memory Model</span></a>.</p> <span class="target" id="stdpar-cpp"></span></section> <section id="stdpar-c"> <h2><span class="section-number">8.2. 
</span>Stdpar C++<a class="headerlink" href="#stdpar-c" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC C++ compiler, NVC++, supports C++ Standard Language Parallelism (Stdpar) for execution on NVIDIA GPUs and multicore CPUs. As mentioned previously, use the NVC++ command-line option <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> to enable GPU accelerated C++ Parallel Algorithms. The following sections go into more detail about the NVC++ support for the ISO C++ Standard Library Parallel Algorithms.</p> <section id="introduction-to-stdpar-c"> <h3><span class="section-number">8.2.1. </span>Introduction to Stdpar C++<a class="headerlink" href="#introduction-to-stdpar-c" title="Permalink to this headline"></a></h3> <p>The C++17 Standard introduced higher-level parallelism features that allow users to request parallelization of Standard Library algorithms.</p> <p>This higher-level parallelism is expressed by adding an execution policy as the first parameter to any algorithm that supports execution policies. Most of the existing Standard C++ algorithms were enhanced to support execution policies. C++17 defined several new parallel algorithms, including the useful <a class="reference external" href="https://en.cppreference.com/w/cpp/algorithm/reduce">std::reduce</a> and <a class="reference external" href="https://en.cppreference.com/w/cpp/algorithm/transform_reduce">std::transform_reduce</a>.</p> <p>C++17 defines three <a class="reference external" href="https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag">execution policies</a>:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">std::execution::seq:</span></code> Sequential execution. 
No parallelism is allowed.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">std::execution::par:</span></code> Parallel execution on one or more threads.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">std::execution::par_unseq:</span></code> Parallel execution on one or more threads, with each thread possibly vectorized.</p></li> </ul> <p>When you use an execution policy other than <code class="docutils literal notranslate"><span class="pre">std::execution::seq</span></code>, you are communicating two important things to the compiler:</p> <ul class="simple"> <li><p>You prefer but do not require that the algorithm be run in parallel. A conforming C++17 implementation may ignore the hint and run the algorithm sequentially, but a performance-oriented implementation takes the hint and executes in parallel when possible and prudent.</p></li> <li><p>The algorithm is safe to run in parallel. For the <code class="docutils literal notranslate"><span class="pre">std::execution::par</span></code> and <code class="docutils literal notranslate"><span class="pre">std::execution::par_unseq</span></code> policies, any user-provided code—such as iterators, lambdas, or function objects passed into the algorithm—must not introduce data races if run concurrently on separate threads. For the <code class="docutils literal notranslate"><span class="pre">std::execution::par_unseq</span></code> policy, any user-provided code must not introduce data races or deadlocks if multiple calls are interleaved on the same thread, which is what happens when a loop is vectorized. 
For more information about potential deadlocks, see the <a class="reference external" href="https://en.cppreference.com/w/cpp/language/memory_model#Progress_guarantee">forward progress guarantees</a> provided by the parallel policies or watch <a class="reference external" href="https://www.youtube.com/watch?v=FJIn1YhPJJc">CppCon 2018: Bryce Adelstein Lelbach “The C++ Execution Model”</a>.</p></li> </ul> <p>The C++ Standard grants compilers great freedom to choose if, when, and how to execute algorithms in parallel as long as the forward progress guarantees requested by the user are honored. For example, <code class="docutils literal notranslate"><span class="pre">std::execution::par_unseq</span></code> may be implemented with vectorization and <code class="docutils literal notranslate"><span class="pre">std::execution::par</span></code> may be implemented with a CPU thread pool. It is also possible to execute parallel algorithms on a GPU, which is a good choice for invocations with sufficient parallelism to take advantage of the processing power and memory bandwidth of NVIDIA GPU processors.</p> </section> <section id="nvc-compiler-parallel-algorithms-support"> <h3><span class="section-number">8.2.2. </span>NVC++ Compiler Parallel Algorithms Support<a class="headerlink" href="#nvc-compiler-parallel-algorithms-support" title="Permalink to this headline"></a></h3> <p>NVC++ supports C++ Standard Language Parallelism with the parallel execution policies <code class="docutils literal notranslate"><span class="pre">std::execution::par</span></code> or <code class="docutils literal notranslate"><span class="pre">std::execution::par_unseq</span></code> for execution on GPUs or multicore CPUs.</p> <p>Lambdas, including generic lambdas, are fully supported in parallel algorithm invocations. No language extensions or non-standard libraries are required to enable GPU acceleration. 
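</p> <p>As a brief illustrative sketch (the <code class="docutils literal notranslate"><span class="pre">double_all</span></code> helper below is hypothetical, not part of any NVC++ API), the following fragment passes a generic lambda to <code class="docutils literal notranslate"><span class="pre">std::transform</span></code>; it uses the predefined <code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_STDPAR_GPU</span></code> and <code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_STDPAR_MULTICORE</span></code> macros described earlier so the same source falls back to the sequential overload when it is not built with <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code>:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;vector&gt;
#if defined(__NVCOMPILER_STDPAR_GPU) || defined(__NVCOMPILER_STDPAR_MULTICORE)
#include &lt;execution&gt;
#define POLICY std::execution::par,
#else
#define POLICY  // sequential fallback when not compiled with -stdpar
#endif

// Doubles every element; the generic lambda needs no device annotations.
void double_all(std::vector&lt;double&gt;&amp; v) {
  std::transform(POLICY v.begin(), v.end(), v.begin(),
                 [](auto x) { return x * 2.0; });
}

int main() {
  std::vector&lt;double&gt; v(1000, 1.0);
  double_all(v);
  assert(v.front() == 2.0 &amp;&amp; v.back() == 2.0);
  return 0;
}
</pre></div> </div> <p>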
All data movement between host memory and GPU device memory is performed implicitly and automatically under the control of <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a>.</p> <p>It’s straightforward to automatically GPU accelerate C++ Parallel Algorithms with NVC++. However, there are some restrictions and limitations you need to be aware of as explained below.</p> <section id="enabling-parallel-algorithms-with-the-stdpar-option"> <h4><span class="section-number">8.2.2.1. </span>Enabling Parallel Algorithms with the -stdpar Option<a class="headerlink" href="#enabling-parallel-algorithms-with-the-stdpar-option" title="Permalink to this headline"></a></h4> <p>GPU acceleration of C⁠+⁠+ Parallel Algorithms is enabled with the <code class="docutils literal notranslate"><span class="pre">-⁠stdpar=gpu</span></code> command-line option to NVC++. If <code class="docutils literal notranslate"><span class="pre">-⁠stdpar=gpu</span></code> is specified (or <code class="docutils literal notranslate"><span class="pre">-⁠stdpar</span></code> without an argument), almost all algorithms that use a parallel execution policy are compiled for offloading to run in parallel on an NVIDIA GPU:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar=gpu program.cpp -o program </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar program.cpp -o program </pre></div> </div> <p>In addition, the GPU acceleration sub-option can be further specialized using <code class="docutils literal notranslate"><span class="pre">-⁠stdpar=gpu:acc</span></code>. 
This option directs the compiler to use its OpenACC implementation to GPU-accelerate a subset of algorithms with a parallel execution policy:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar=gpu:acc program.cpp -o program </pre></div> </div> <p>More details about the OpenACC support of Stdpar C++ are provided in <a class="reference external" href="index.html#stdpar-cpp-impl-openacc">OpenACC Implementation of Parallel Algorithms</a>.</p> <p>Acceleration of C⁠+⁠+ Parallel Algorithms with multicore CPUs is enabled with the <code class="docutils literal notranslate"><span class="pre">-⁠stdpar=multicore</span></code> command-line option to NVC⁠+⁠+. If <code class="docutils literal notranslate"><span class="pre">-⁠stdpar=multicore</span></code> is specified, almost all algorithms that use a parallel execution policy are compiled to run on a multicore CPU:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar=multicore program.cpp -o program </pre></div> </div> <p>When either the <code class="docutils literal notranslate"><span class="pre">-⁠stdpar=gpu,multicore</span></code> or the <code class="docutils literal notranslate"><span class="pre">-⁠stdpar=gpu:acc,multicore</span></code> command-line option is specified to NVC++, the parallel algorithms code is compiled for both GPU and multicore CPU. When the execution platform has a GPU, the binary executes on the GPU; otherwise, it executes on the multicore CPU.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar=gpu,multicore program.cpp -o program </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar=gpu:acc,multicore program.cpp -o program </pre></div> </div> </section> </section> <section id="stdpar-c-simple-example"> <h3><span class="section-number">8.2.3. 
</span>Stdpar C++ Simple Example<a class="headerlink" href="#stdpar-c-simple-example" title="Permalink to this headline"></a></h3> <p>Here are a few simple examples to get a feel for how the C++ Parallel Algorithms work.</p> <p>From the early days of C++, sorting items stored in an appropriate container has been relatively easy using a single call like the following:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">employees</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="n">CompareByLastName</span><span class="p">());</span><span class="w"></span> </pre></div> </div> <p>Assuming the comparison class <code class="docutils literal notranslate"><span class="pre">CompareByLastName</span></code> is thread-safe, which is true for most comparison functions, parallelizing this sort is simple with C++ Parallel Algorithms. 
Include <code class="docutils literal notranslate"><span class="pre">&lt;execution&gt;</span></code> and add an execution policy to the function call:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="n">CompareByLastName</span><span class="p">());</span><span class="w"></span> </pre></div> </div> <p>Calculating the sum of all the elements in a container is also simple with the <code class="docutils literal notranslate"><span class="pre">std::accumulate</span></code> algorithm. Prior to C++17, transforming the data in some way while taking the sum was somewhat awkward. 
For example, to compute the average age of your employees, you might write the following code:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">ave_age</span><span class="w"> </span><span class="o">=</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">accumulate</span><span class="p">(</span><span class="n">employees</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">[](</span><span class="kt">int</span><span class="w"> </span><span class="n">sum</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">Employee</span><span class="o">&amp;</span><span class="w"> </span><span class="n">emp</span><span class="p">){</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">sum</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">age</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="p">})</span><span class="w"></span> <span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">size</span><span class="p">();</span><span class="w"></span> </pre></div> </div> <p>The <code class="docutils literal notranslate"><span class="pre">std::transform_reduce</span></code> algorithm introduced in C++17 makes it simple to parallelize this code. 
It also results in cleaner code by separating the reduction operation, in this case <code class="docutils literal notranslate"><span class="pre">std::plus</span></code>, from the transformation operation, in this case <code class="docutils literal notranslate"><span class="pre">emp.age():</span></code></p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">ave_age</span><span class="w"> </span><span class="o">=</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">transform_reduce</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">plus</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[](</span><span class="k">const</span><span class="w"> </span><span class="n">Employee</span><span class="o">&amp;</span><span class="w"> </span><span class="n">emp</span><span class="p">){</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">age</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="p">})</span><span class="w"></span> 
<span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">employees</span><span class="p">.</span><span class="n">size</span><span class="p">();</span><span class="w"></span> </pre></div> </div> </section> <section id="openacc-implementation-of-parallel-algorithms"> <h3><span class="section-number">8.2.4. </span>OpenACC Implementation of Parallel Algorithms<a class="headerlink" href="#openacc-implementation-of-parallel-algorithms" title="Permalink to this headline"></a></h3> <p>NVC++ has experimental GPU support for a subset of algorithms with the parallel execution policies <code class="docutils literal notranslate"><span class="pre">std::execution::par</span></code> and <code class="docutils literal notranslate"><span class="pre">std::execution::par_unseq</span></code> accelerated through its OpenACC implementation. This feature, enabled with the <code class="docutils literal notranslate"><span class="pre">-stdpar=gpu:acc</span></code> option, may result in better application performance on the GPU and faster compilation speed.</p> <p>The following subset of algorithms has OpenACC implementation support:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">std::for_each</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">std::for_each_n</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">std::transform</span></code></p></li> </ul> <p>The following algorithms have OpenACC implementation support for scalar data types and the standard <code class="docutils literal notranslate"><span class="pre">std::plus</span></code> reduction operation:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">std::reduce</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">std::transform_reduce</span></code></p></li> </ul> <p>The remainder of the parallel algorithms are parallelized 
using the default GPU implementation as if <code class="docutils literal notranslate"><span class="pre">-stdpar=gpu</span></code> was specified.</p> <p>When the code is compiled for GPU with OpenACC acceleration, the <code class="docutils literal notranslate"><span class="pre">__NVCOMPILER_STDPAR_OPENACC_GPU</span></code> macro is defined implicitly.</p> </section> <section id="coding-guidelines-for-gpu-accelerating-parallel-algorithms"> <h3><span class="section-number">8.2.5. </span>Coding Guidelines for GPU-accelerating Parallel Algorithms<a class="headerlink" href="#coding-guidelines-for-gpu-accelerating-parallel-algorithms" title="Permalink to this headline"></a></h3> <p>GPUs are not simply CPUs with more threads. To effectively take advantage of the massive parallelism and memory bandwidth available on GPUs, it is typical for GPU programming models to put some limitations on code executed on the GPU. The NVC++ implementation of C++ Parallel Algorithms is no exception in this regard. The sections which follow detail the limitations that apply in the current release.</p> <section id="parallel-algorithms-and-device-function-annotations"> <h4><span class="section-number">8.2.5.1. </span>Parallel Algorithms and Device Function Annotations<a class="headerlink" href="#parallel-algorithms-and-device-function-annotations" title="Permalink to this headline"></a></h4> <p>Functions to be executed on the GPU within parallel algorithms do not need any <code class="docutils literal notranslate"><span class="pre">__device__</span></code> annotations or other special markings to be compiled for GPU execution. The NVC++ compiler walks the call graph for each source file and automatically infers which functions must be compiled for GPU execution.</p> <p>However, this only works when the compiler can see the function definition in the same source file where the function is called. 
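</p> <p>For illustration, here is a hedged sketch (the <code class="docutils literal notranslate"><span class="pre">squared</span></code> and <code class="docutils literal notranslate"><span class="pre">square_all</span></code> names are invented for this example): because the called function is defined in the same translation unit as the parallel algorithm invocation, the compiler can compile it for GPU execution without any annotations. The predefined Stdpar macros guard the policy so the sketch also builds sequentially with any C++17 compiler:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;vector&gt;
#if defined(__NVCOMPILER_STDPAR_GPU) || defined(__NVCOMPILER_STDPAR_MULTICORE)
#include &lt;execution&gt;
#define POLICY std::execution::par,
#else
#define POLICY  // sequential fallback when not compiled with -stdpar
#endif

// Defined in this translation unit, so the compiler can infer that it
// must also be compiled for the GPU; no __device__ marking is needed.
inline float squared(float x) { return x * x; }

void square_all(std::vector&lt;float&gt;&amp; v) {
  std::transform(POLICY v.begin(), v.end(), v.begin(),
                 [](float x) { return squared(x); });
}

int main() {
  std::vector&lt;float&gt; v{1.0f, 2.0f, 3.0f};
  square_all(v);
  assert(v[2] == 9.0f);
  return 0;
}
</pre></div> </div> <p>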
This is true for most inline functions and template functions but may fail when functions are defined in a different source file or linked in from an external library. You need to be aware of this when formulating parallel algorithm invocations that you expect to be offloaded and accelerated on NVIDIA GPUs.</p> <p>An externally defined function called from within a parallel algorithm region requires some form of device annotation from another GPU programming model, e.g., the OpenACC routine directive (refer to <a class="reference internal" href="#stdpar-cpp-interop-openacc-routine"><span class="std std-ref">External Device Function Annotations</span></a> for more information).</p> <span class="target" id="stdpar-cpp-unified-memory"></span></section> <section id="data-management-in-parallel-algorithms"> <h4><span class="section-number">8.2.5.2. </span>Data Management in Parallel Algorithms<a class="headerlink" href="#data-management-in-parallel-algorithms" title="Permalink to this headline"></a></h4> <p>When offloading parallel algorithms to a GPU, it’s essential to consider how data is accessed from the parallel region. Some GPUs may not be able to access certain segments of the CPU’s address space. 
Developers targeting platforms without unified shared memory or those seeking to optimize performance must be aware of these memory distinctions, as they may affect the following types of data accessed in parallel algorithm regions:</p> <ul class="simple"> <li><p>Pointer data passed into lambda functions within the parallel algorithm.</p></li> <li><p>Data captured by reference in lambda functions or pointer data captured by value.</p></li> <li><p>Variables with static storage duration referenced inside the parallel algorithm.</p></li> </ul> <p>To avoid memory access violations, developers must ensure that all of the above data is accessible to the GPU before the parallel algorithm is executed.</p> <p>Stdpar C++ only supports <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a>, which allow data to be accessed from both the CPU and the GPU. Through support in both the CUDA device driver and the NVIDIA GPU hardware, the CUDA Unified Memory manager automatically moves some types of data based on usage.</p> <p>Stdpar with Separate Memory Mode can only be supported when the data are fully managed through the OpenACC data directives; refer to <a class="reference internal" href="#stdpar-cpp-interop-openacc"><span class="std std-ref">Interoperability with OpenACC</span></a>.</p> <p>Since object-oriented design is fundamental to C++, special consideration must be given to composite data types with pointer or reference members. The data referenced or pointed to may not be stored contiguously within the composite data type. Moreover, such data might not even be allocated in the same memory segment as the composite type itself. As a result, when accessing both the composite data type and its referenced or pointed-to data from parallel algorithms, the developer must ensure that the member data is also made accessible to the GPU. 
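</p> <p>To make this concrete, here is a hedged sketch (the <code class="docutils literal notranslate"><span class="pre">Grid</span></code> type and <code class="docutils literal notranslate"><span class="pre">fill</span></code> helper are illustrative only): the member pointer refers to storage allocated separately from the object itself, and both allocations must be GPU-accessible, e.g., by coming from the CPU heap under Managed Memory Mode. The predefined Stdpar macros guard the policy so the sketch also builds sequentially:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span>#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#if defined(__NVCOMPILER_STDPAR_GPU) || defined(__NVCOMPILER_STDPAR_MULTICORE)
#include &lt;execution&gt;
#define POLICY std::execution::par,
#else
#define POLICY  // sequential fallback when not compiled with -stdpar
#endif

// Illustrative composite type: `data` points to storage allocated
// separately from the Grid object itself.
struct Grid {
  std::size_t n;
  double* data;  // this storage must also be GPU-accessible
};

void fill(const Grid&amp; g, double value) {
  // Dereferencing g.data inside the parallel region requires the
  // pointed-to buffer, not just g, to live in GPU-accessible memory.
  std::for_each(POLICY g.data, g.data + g.n,
                [=](double&amp; x) { x = value; });
}

int main() {
  Grid g{8, new double[8]};  // heap allocation: managed under -stdpar
  fill(g, 2.5);
  assert(g.data[0] == 2.5 &amp;&amp; g.data[7] == 2.5);
  delete[] g.data;
  return 0;
}
</pre></div> </div> <p>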
These considerations should also be taken into account when standard library containers are used in parallel algorithms, as the containers frequently contain member pointers to their elements.</p> <p>The discussion in this section assumes familiarity with the Managed and Unified Memory Modes covered in <a class="reference internal" href="#acc-mem-model"><span class="std std-ref">Memory Model</span></a> and <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a>. The code executing within the parallel algorithm is referred to as the accelerator subprogram, in contrast to the code executing outside of the parallel algorithm, which is referred to as the host subprogram.</p> <p><strong>Managed Memory Mode</strong></p> <p>When Stdpar code is compiled with Managed Memory Mode (as the default mode or by passing <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code>), only data dynamically allocated on the heap in CPU code can be managed automatically. CPU and GPU automatic storage (stack memory) and static storage (global or static data) cannot be automatically managed. Likewise, data that is dynamically allocated in program units not compiled by <code class="docutils literal notranslate"><span class="pre">nvc++</span></code> with the <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> option is not automatically managed by CUDA Unified Memory even though it is on the CPU heap. The compiler utilizes CUDA Managed Memory for dynamic allocations to make data accessible from CPU and GPU. 
As managed memory allocation calls can incur higher runtime overhead than standard allocator calls, the implementation uses memory pools by default for performance reasons, as detailed in <a class="reference internal" href="#gpu-mem-poolallocator"><span class="std std-ref">Memory Pool Allocator</span></a>.</p> <p>The Managed Memory Mode is intended for binaries run on targets with CUDA Managed Memory capability only. Any pointer that is dereferenced and any C++ object that is referenced within a parallel algorithm invocation must refer to data on the CPU heap that is allocated in a program unit compiled by <code class="docutils literal notranslate"><span class="pre">nvc++</span></code> with <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code>. Dereferencing a pointer to CPU stack memory or to a global object will result in a memory violation in GPU code.</p> <p><strong>Unified Memory Mode</strong></p> <p>When Unified Memory is the default memory mode or is selected explicitly on the command line by passing <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>, there are no restrictions on variables accessed in the parallel algorithms. Therefore, all CPU data (whether residing on the stack, on the heap, or in global storage) is accessible in the parallel algorithm functions. Note that memory dynamically allocated in GPU code is only visible from GPU code and can never be accessed by the CPU regardless of the CUDA Unified Memory capability.</p> <p>When compiling a binary for platforms with full CUDA Unified Memory capability, only those source files using features from the standard parallel algorithms library must be compiled by <code class="docutils literal notranslate"><span class="pre">nvc++</span></code> with the <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> option. 
There is no requirement that the code dynamically allocating memory accessed on GPU is also compiled in such a way.</p> <p>Unified Memory Mode may utilize CUDA Managed Memory for dynamic allocation; more details can be found in <a class="reference internal" href="#acc-mem-unified"><span class="std std-ref">Managed and Unified Memory Modes</span></a>.</p> <p><strong>Summary</strong></p> <p>The following table summarizes the command-line options that select memory modes and the impact of each memory mode on Stdpar features.</p> <table class="table-no-stripes docutils align-default" id="id31"> <caption><span class="caption-text">Table 21. Stdpar C++ Feature Differences for Memory Modes</span><a class="headerlink" href="#id31" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 25%" /> <col style="width: 22%" /> <col style="width: 25%" /> <col style="width: 28%" /> </colgroup> <tbody> <tr class="row-odd"><td><p><strong>Command-line options</strong></p></td> <td><p><strong>Dynamically allocated variables</strong> <strong>outside of parallel algorithm region</strong></p></td> <td><p><strong>Automatic or static storage variables</strong> <strong>outside of parallel algorithm region</strong></p></td> <td><p><strong>Dynamic allocator</strong></p></td> </tr> <tr class="row-even"><td><p>No memory-specific flags passed, compiling on target with CUDA Managed Memory only</p></td> <td><p>Can be accessed within parallel region code</p></td> <td><p>Cannot be accessed within parallel algorithm code</p></td> <td><p>cudaMallocManaged</p></td> </tr> <tr class="row-odd"><td><p>No memory-specific flags passed, compiling on target with full CUDA Unified Memory</p></td> <td><p>Can be accessed within parallel region code</p></td> <td><p>Can be accessed within parallel algorithm code</p></td> <td><p>cudaMallocManaged or system allocators: new/malloc (compiler picks the most suitable allocator)</p></td> </tr> <tr class="row-even"><td><p><code 
class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code></p></td> <td><p>Can be accessed within parallel region code</p></td> <td><p>Cannot be accessed within parallel algorithm code</p></td> <td><p>cudaMallocManaged</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code></p></td> <td><p>Can be accessed within parallel region code</p></td> <td><p>Can be accessed within parallel algorithm code</p></td> <td><p>cudaMallocManaged or system allocators: new/malloc (compiler picks the most suitable allocator)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified:managedalloc</span></code></p></td> <td><p>Can be accessed within parallel region code</p></td> <td><p>Can be accessed within parallel algorithm code</p></td> <td><p>cudaMallocManaged</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified:nomanagedalloc</span></code></p></td> <td><p>Can be accessed within parallel region code</p></td> <td><p>Can be accessed within parallel algorithm code</p></td> <td><p>System allocators: new/malloc</p></td> </tr> </tbody> </table> <p><strong>Examples</strong></p> <p>For example, <code class="docutils literal notranslate"><span class="pre">std::vector</span></code> uses dynamically allocated memory, which is accessible from the GPU when using Stdpar. 
Iterating over the contents of a <code class="docutils literal notranslate"><span class="pre">std::vector</span></code> in a parallel algorithm works as expected when compiling with either <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> or <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="w"> </span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">...;</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">());</span><span class="w"> </span><span class="c1">// Okay, accesses heap memory.</span> </pre></div> </div> <p>On the other hand, <code class="docutils literal notranslate"><span class="pre">std::array</span></code> performs no dynamic allocations. Its contents are stored within the <code class="docutils literal notranslate"><span class="pre">std::array</span></code> object itself, which is often on a CPU stack. 
Iterating over the contents of a <code class="docutils literal notranslate"><span class="pre">std::array</span></code> will not work on systems with only CUDA Managed Memory support unless the <code class="docutils literal notranslate"><span class="pre">std::array</span></code> itself is allocated on the heap and the code is compiled with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code>:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="o">&lt;</span><span class="kt">int</span><span class="p">,</span><span class="w"> </span><span class="mi">1024</span><span class="o">&gt;</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">...;</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">end</span><span class="p">());</span><span class="w"> </span><span class="c1">// Fails on targets with CUDA Managed</span> <span class="w"> </span><span class="c1">// Memory capability only, array is on</span> <span class="w"> </span><span class="c1">// a CPU stack inaccessible from GPU.</span> <span class="w"> </span><span class="c1">// Works correctly on targets with full</span> <span class="w"> </span><span class="c1">// CUDA Unified Memory support.</span> </pre></div> </div> <p>The above example works as expected when run on a target supporting full CUDA Unified 
Memory capability.</p> <p>When executing on targets with CUDA Managed Memory capability only, pay particular attention to lambda captures, especially capturing data objects by reference, which may contain non-obvious pointer dereferences:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">saxpy</span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">transform</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="kt">float</span><span class="w"> </span><span class="n">xi</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> 
</span><span class="n">yi</span><span class="p">){</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">xi</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">yi</span><span class="p">;</span><span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>In the earlier example, the containing function parameter <code class="docutils literal notranslate"><span class="pre">a</span></code> is captured by reference. The code within the body of the lambda, which is running on the GPU, tries to access <code class="docutils literal notranslate"><span class="pre">a</span></code>, which is in the CPU stack memory. This attempt results in a memory violation and undefined behavior. In this case, the problem can easily be fixed by changing the lambda to capture by value:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">saxpy</span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">transform</span><span class="p">(</span><span class="n">std</span><span 
class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="kt">float</span><span class="w"> </span><span class="n">xi</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">yi</span><span class="p">){</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">xi</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">yi</span><span class="p">;</span><span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>With this one-character change, the lambda makes a copy of <code class="docutils literal notranslate"><span class="pre">a</span></code>, which is then copied to the GPU, and there are no attempts to reference CPU stack memory from GPU code. Such code will run correctly without requiring modifications on targets with full CUDA Unified Memory capability.</p> <p>If a <code class="docutils literal notranslate"><span class="pre">std::vector</span></code> is accessed through its subscript operator from the device, the vector object itself must be accessible from the parallel code executing on the GPU. 
This means the <code class="docutils literal notranslate"><span class="pre">std::vector</span></code> must be allocated dynamically to make it accessible from the GPU when compiling for systems with only CUDA Managed Memory support.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="w"> </span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">...;</span><span class="w"></span> <span class="k">auto</span><span class="w"> </span><span class="n">idx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">views</span><span class="o">::</span><span class="n">iota</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">v</span><span class="p">.</span><span class="n">size</span><span class="p">());</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">idx</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">idx</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="p">{</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;});</span><span class="w"> </span><span class="c1">// Fails on targets with CUDA Managed</span> <span class="w"> </span><span class="c1">// Memory capability only, vector object is on</span> <span class="w"> </span><span class="c1">// a CPU stack inaccessible from GPU.</span> <span class="w"> 
</span><span class="c1">// Works correctly on targets with full</span> <span class="w"> </span><span class="c1">// CUDA Unified Memory support.</span> </pre></div> </div> <p>An alternative approach to accessing the contents of the <code class="docutils literal notranslate"><span class="pre">std::vector</span></code> on systems with only CUDA Managed Memory support is to obtain a pointer to its element data with the <code class="docutils literal notranslate"><span class="pre">data()</span></code> member function.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="w"> </span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">...;</span><span class="w"></span> <span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">vdataptr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">data</span><span class="p">();</span><span class="w"></span> <span class="k">auto</span><span class="w"> </span><span class="n">idx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">views</span><span class="o">::</span><span class="n">iota</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">v</span><span class="p">.</span><span class="n">size</span><span class="p">());</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">idx</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">idx</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span
class="w"></span> <span class="w"> </span><span class="p">{</span><span class="n">vdataptr</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;});</span><span class="w"> </span><span class="c1">// Works, vector elements are in heap</span> <span class="w"> </span><span class="c1">// memory</span> </pre></div> </div> <p>Whether <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> is enabled by default or passed explicitly on the command line, parallel algorithms can access global variables, and accesses to global variables from the CPU and the GPU are kept in sync. Extra care should be taken when accessing global variables within parallel algorithms, as simultaneous updates in different iterations running on the GPU can lead to data races. The following example illustrates a safe update of a global variable in a parallel algorithm, since the update occurs in only one iteration.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">globvar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">123</span><span class="p">;</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">foo</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">views</span><span class="o">::</span><span class="n">iota</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span
class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">globvar</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">345</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="w"> </span><span class="c1">// globvar is equal to 468.</span> <span class="p">}</span><span class="w"></span> </pre></div> </div> </section> <section id="parallel-algorithms-and-function-pointers"> <h4><span class="section-number">8.2.5.3. 
</span>Parallel Algorithms and Function Pointers<a class="headerlink" href="#parallel-algorithms-and-function-pointers" title="Permalink to this headline"></a></h4> <p>Functions compiled to run on either the CPU or the GPU must be compiled into two different versions, one with the CPU machine instructions and one with the GPU machine instructions.</p> <p>In the current implementation, a function pointer points to either the CPU or the GPU version of the function. This causes problems if you attempt to pass function pointers between CPU and GPU code. You might inadvertently pass a pointer to the CPU version of the function to GPU code. In the future, it may be possible to automatically and seamlessly support the use of function pointers across CPU and GPU code boundaries, but it is not supported in the current implementation.</p> <p>Function pointers can’t be passed to Parallel Algorithms to be run on the GPU, and functions may not be called through a function pointer within GPU code. For example, the following code example won’t work correctly:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">square</span><span class="p">(</span><span class="kt">int</span><span class="o">&amp;</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">square_all</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span
class="kt">int</span><span class="o">&gt;&amp;</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="o">&amp;</span><span class="n">square</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>It passes a pointer to the CPU version of the function square to a parallel <code class="docutils literal notranslate"><span class="pre">for_each</span></code> algorithm invocation. When the algorithm is parallelized and offloaded to the GPU, the program fails to resolve the function pointer to the GPU version of <code class="docutils literal notranslate"><span class="pre">square</span></code>.</p> <p>You can often solve this issue by using a function object, which is an object with a function call operator. The function object’s call operator is resolved at compile time to the GPU version of the function, instead of being resolved at run time to the incorrect CPU version of the function as in the previous example. 
For example, the following code example works:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">struct</span><span class="w"> </span><span class="nc">squared</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">operator</span><span class="p">()(</span><span class="kt">int</span><span class="o">&amp;</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="p">};</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">square_all</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;&amp;</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span 
class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="n">squared</span><span class="p">{});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>Another possible workaround is to change the function to a lambda, because a lambda is implemented as a nameless function object:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">square_all</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;&amp;</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[](</span><span class="kt">int</span><span class="o">&amp;</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"> </span><span class="p">});</span><span class="w"></span> <span 
class="p">}</span><span class="w"></span> </pre></div> </div> <p>If the function in question is too big to be converted to a function object or a lambda, then it should be possible to wrap the call to the function in a lambda:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">compute</span><span class="p">(</span><span class="kt">int</span><span class="o">&amp;</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="c1">// Assume lots and lots of code here.</span> <span class="p">}</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">compute_all</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;&amp;</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[](</span><span class="kt">int</span><span class="o">&amp;</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span 
class="n">compute</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>No function pointers are used in this example.</p> <p>The restriction on calling a function through a function pointer unfortunately means passing polymorphic objects from CPU code to GPU-accelerated Parallel Algorithms is not currently supported, as virtual tables are implemented using function pointers.</p> </section> <section id="random-access-iterators"> <h4><span class="section-number">8.2.5.4. </span>Random Access Iterators<a class="headerlink" href="#random-access-iterators" title="Permalink to this headline"></a></h4> <p>The C++ Standard requires that the iterators passed to most C++ Parallel Algorithms be forward iterators. However, C++ Parallel Algorithms on GPUs only work with random access iterators. Passing a forward iterator or a bidirectional iterator to a GPU/CPU-accelerated Parallel Algorithm results in a compilation error. Passing raw pointers or Standard Library random access iterators to the algorithms gives the best performance, but most other random access iterators work correctly.</p> </section> <section id="interoperability-with-the-c-standard-library"> <h4><span class="section-number">8.2.5.5. 
</span>Interoperability with the C++ Standard Library<a class="headerlink" href="#interoperability-with-the-c-standard-library" title="Permalink to this headline"></a></h4> <p>Large parts of the C++ Standard Library can be used with stdpar on GPUs.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">std::atomic&lt;T&gt;</span></code> objects within GPU code work provided that <code class="docutils literal notranslate"><span class="pre">T</span></code> is a four-byte or eight-byte integer type.</p></li> <li><p>Math functions that operate on floating-point types—such as <code class="docutils literal notranslate"><span class="pre">sin</span></code>, <code class="docutils literal notranslate"><span class="pre">cos</span></code>, <code class="docutils literal notranslate"><span class="pre">log</span></code>, and most of the other functions declared in <code class="docutils literal notranslate"><span class="pre">&lt;cmath&gt;</span></code>—can be used in GPU code and resolve to the same implementations that are used in CUDA C++ programs.</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">std::complex</span></code>, <code class="docutils literal notranslate"><span class="pre">std::tuple</span></code>, <code class="docutils literal notranslate"><span class="pre">std::pair</span></code>, <code class="docutils literal notranslate"><span class="pre">std::optional</span></code>, <code class="docutils literal notranslate"><span class="pre">std::variant</span></code>, and <code class="docutils literal notranslate"><span class="pre">&lt;type_traits&gt;</span></code> are supported and work as expected in GPU code.</p></li> </ul> <p>The parts of the C++ Standard Library that aren’t supported in GPU code include I/O functions and, in general, any function that accesses the CPU operating system. 
As a special case, basic <code class="docutils literal notranslate"><span class="pre">printf</span></code> calls can be used within GPU code and leverage the same implementation that is used in NVIDIA CUDA C++.</p> </section> <section id="no-exceptions-in-gpu-code"> <h4><span class="section-number">8.2.5.6. </span>No Exceptions in GPU Code<a class="headerlink" href="#no-exceptions-in-gpu-code" title="Permalink to this headline"></a></h4> <p>As with most other GPU programming models, throwing and catching C++ exceptions is not supported within Parallel Algorithm invocations that are offloaded to the GPU.</p> <p>Unlike some other GPU programming models where try/catch blocks and throw expressions are compilation errors, exception code does compile but with non-standard behavior. Catch clauses are ignored, and throw expressions abort the GPU kernel if actually executed. Exceptions in CPU code work without restrictions.</p> </section> </section> <section id="nvc-experimental-features"> <h3><span class="section-number">8.2.6. </span>NVC++ Experimental Features<a class="headerlink" href="#nvc-experimental-features" title="Permalink to this headline"></a></h3> <p>nvc++ experimental features are enabled with the <code class="docutils literal notranslate"><span class="pre">--experimental-stdpar</span></code> compiler flag. Experimental feature headers are exposed via the <code class="docutils literal notranslate"><span class="pre">&lt;experimental/...&gt;</span></code> namespaces, and limited support for these features is available in older C++ versions. Table 22 lists all experimental features available and the minimum language version required to use them.</p> <table class="table-no-stripes docutils align-default" id="id32"> <caption><span class="caption-text">Table 22. 
Experimental features information</span><a class="headerlink" href="#id32" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 31%" /> <col style="width: 9%" /> <col style="width: 11%" /> <col style="width: 24%" /> <col style="width: 25%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Feature</p></th> <th class="head"><p>Recommended</p></th> <th class="head"><p>Limited support</p></th> <th class="head"><p>Standard proposal</p></th> <th class="head"><p>Other notes</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>Multi-dimensional spans (mdspan)</p></td> <td><p>C++23</p></td> <td><p>C++17</p></td> <td><p><a class="reference external" href="http://wg21.link/p0009">P0009</a></p></td> <td><p><a class="reference external" href="https://github.com/NVIDIA/libcudacxx">https://github.com/NVIDIA/libcudacxx</a></p></td> </tr> <tr class="row-odd"><td><p>Slices of multi-dimensional spans (submdspan)</p></td> <td><p>C++23</p></td> <td><p>C++17</p></td> <td><p><a class="reference external" href="http://wg21.link/p2630">P2630</a></p></td> <td><p><a class="reference external" href="https://github.com/NVIDIA/libcudacxx">https://github.com/NVIDIA/libcudacxx</a></p></td> </tr> <tr class="row-even"><td><p>Multi-dimensional arrays (mdarray)</p></td> <td><p>C++23</p></td> <td><p>C++17</p></td> <td><p><a class="reference external" href="http://wg21.link/p1684">P1684</a></p></td> <td><p><a class="reference external" href="https://github.com/kokkos/mdspan">https://github.com/kokkos/mdspan</a></p></td> </tr> <tr class="row-odd"><td><p>Senders and receivers</p></td> <td><p>C++23</p></td> <td><p>C++20</p></td> <td><p><a class="reference external" href="http://wg21.link/p2300">P2300</a></p></td> <td><p><a class="reference external" href="https://github.com/NVIDIA/stdexec">https://github.com/NVIDIA/stdexec</a></p></td> </tr> <tr class="row-even"><td><p>Linear algebra</p></td> <td><p>C++23</p></td> <td><p>C++17</p></td> <td><p><a 
class="reference external" href="http://wg21.link/p1673">P1673</a></p></td> <td><p><a class="reference external" href="https://github.com/kokkos/stdblas">https://github.com/kokkos/stdblas</a></p></td> </tr> </tbody> </table> <section id="multi-dimensional-spans"> <h4><span class="section-number">8.2.6.1. </span>Multi-dimensional Spans<a class="headerlink" href="#multi-dimensional-spans" title="Permalink to this headline"></a></h4> <p>Multi-dimensional spans (<code class="docutils literal notranslate"><span class="pre">std::mdspan</span></code>) enable customizable multi-dimensional access to data. This feature was added to C++23 (see <a class="reference external" href="http://wg21.link/p0009">P0009</a> and follow-on papers). <a class="reference external" href="https://github.com/kokkos/mdspan/wiki/A-Gentle-Introduction-to-mdspan">A Gentle Introduction to mdspan</a> gives a tutorial. The reference mdspan implementation <a class="reference external" href="https://github.com/kokkos/mdspan">https://github.com/kokkos/mdspan</a> also has many useful examples.</p> <p>nvc++ provides an implementation available in the <code class="docutils literal notranslate"><span class="pre">&lt;experimental/mdspan&gt;</span></code> namespace that works with C++17 or newer. It enables applications that are not targeting the C++23 version of the standard to use mdspan.</p> <p>nvc++ also provides the <a class="reference external" href="http://wg21.link/p0009r17">P0009R17</a> version of submdspan, which only works for the mdspan layouts in C++23; that is, it does not implement C++26 submdspan (<a class="reference external" href="http://wg21.link/p2630">P2630</a>) yet.</p> <p>C++23’s mdspan uses <code class="docutils literal notranslate"><span class="pre">operator[]</span></code> for array access. 
For example, if <code class="docutils literal notranslate"><span class="pre">A</span></code> is a rank-2 mdspan, and <code class="docutils literal notranslate"><span class="pre">i</span></code> and <code class="docutils literal notranslate"><span class="pre">j</span></code> are integers, then <code class="docutils literal notranslate"><span class="pre">A[i,</span> <span class="pre">j]</span></code> accesses the element of <code class="docutils literal notranslate"><span class="pre">A</span></code> at row <code class="docutils literal notranslate"><span class="pre">i</span></code> and column <code class="docutils literal notranslate"><span class="pre">j</span></code>. Before C++23, <code class="docutils literal notranslate"><span class="pre">operator[]</span></code> was only allowed to take one argument. C++23 changed the language to permit any number of arguments (zero or more). nvc++ does not support this new language feature. As a result, the implementation of mdspan provided by nvc++ permits use of <code class="docutils literal notranslate"><span class="pre">operator()</span></code> as a fall-back (e.g., <code class="docutils literal notranslate"><span class="pre">A(i,</span> <span class="pre">j)</span></code> instead of <code class="docutils literal notranslate"><span class="pre">A[i,</span> <span class="pre">j]</span></code>). 
Users may enable this fall-back manually, by defining the macro <code class="docutils literal notranslate"><span class="pre">MDSPAN_USE_PAREN_OPERATOR</span></code> to <code class="docutils literal notranslate"><span class="pre">1</span></code> before including any mdspan headers.</p> <p>The following example (<a class="reference external" href="https://godbolt.org/z/6eG93o7YW">godbolt</a>):</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;experimental/mdspan&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;iostream&gt;</span><span class="cp"></span> <span class="k">namespace</span><span class="w"> </span><span class="nn">stdex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nn">std</span><span class="o">::</span><span class="nn">experimental</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="w"> </span><span class="n">d</span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span 
class="mi">7</span><span class="p">,</span><span class="w"> </span><span class="mi">6</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">};</span><span class="w"></span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span class="n">mdspan</span><span class="w"> </span><span class="n">m</span><span class="p">{</span><span class="n">d</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span class="n">extents</span><span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">}};</span><span class="w"></span> <span class="w"> </span><span class="k">static_assert</span><span class="p">(</span><span class="n">m</span><span class="p">.</span><span class="n">rank</span><span class="p">()</span><span class="o">==</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;Rank is two&quot;</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">extent</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span 
class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">extent</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;m(&quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;, &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;) == &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">m</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>is compiled as 
follows:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -std=c++17 -o example example.cpp </pre></div> </div> <p>and outputs:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>m(0, 0) == 0 m(0, 1) == 5 m(0, 2) == 1 m(1, 0) == 3 m(1, 1) == 8 m(1, 2) == 4 m(2, 0) == 2 m(2, 1) == 7 m(2, 2) == 6 </pre></div> </div> </section> <section id="senders-and-receivers"> <h4><span class="section-number">8.2.6.2. </span>Senders and Receivers<a class="headerlink" href="#senders-and-receivers" title="Permalink to this headline"></a></h4> <p><a class="reference external" href="http://wg21.link/p2300">P2300 - std::execution</a> proposes a model of asynchronous programming for adoption into the C++26 Standard. For an introduction to this feature, see the <a class="reference external" href="https://wg21.link/P2300#design-user">Design - user side</a> section of the proposal. The NVIDIA implementation of senders and receivers is <a class="reference external" href="https://github.com/NVIDIA/stdexec">open source</a> and its repository contains many <a class="reference external" href="https://github.com/NVIDIA/stdexec/tree/main/examples">useful examples</a>. nvc++ provides access to the NVIDIA implementation, which works with C++20 or newer. Since the proposal is still evolving, our implementation is not stable. It is experimental in nature and will change to follow the proposal closely without any warning. 
The NVIDIA implementation is structured as follows:</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 19%" /> <col style="width: 13%" /> <col style="width: 67%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Includes</p></th> <th class="head"><p>Namespace</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>&lt;stdexec/…&gt;</p></td> <td><p>::stdexec</p></td> <td><p>Approved for C++ standard</p></td> </tr> <tr class="row-odd"><td><p>&lt;exec/…&gt;</p></td> <td><p>::exec</p></td> <td><p>Generic additions and extensions</p></td> </tr> <tr class="row-even"><td><p>&lt;nvexec/…&gt;</p></td> <td><p>::nvexec</p></td> <td><p>NVIDIA-specific extensions and customizations</p></td> </tr> </tbody> </table> <p>The following example (<a class="reference external" href="https://godbolt.org/z/axbhYs7vj">godbolt</a>) builds a task graph in which two different vectors, v0 and v1, are concurrently modified in bulk, using a CPU thread pool and a GPU stream context, respectively. 
This graph then transfers execution to the CPU thread pool, and adds both vectors into v2 on the CPU, returning the sum of all elements:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="c1">// Declare a pool of 8 worker CPU threads:</span> <span class="w"> </span><span class="n">exec</span><span class="o">::</span><span class="n">static_thread_pool</span><span class="w"> </span><span class="n">pool</span><span class="p">(</span><span class="mi">8</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="c1">// Declare a GPU stream context:</span> <span class="w"> </span><span class="n">nvexec</span><span class="o">::</span><span class="n">stream_context</span><span class="w"> </span><span class="n">stream_ctx</span><span class="p">{};</span><span class="w"></span> <span class="w"> </span><span class="c1">// Get a handle to the thread pool:</span> <span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">cpu_sched</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pool</span><span class="p">.</span><span class="n">get_scheduler</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">gpu_sched</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stream_ctx</span><span class="p">.</span><span class="n">get_scheduler</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="c1">// Declare three dynamic arrays with N elements:</span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> 
</span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="w"> </span><span class="n">v0</span><span class="w"> </span><span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">};</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="w"> </span><span class="n">v1</span><span class="w"> </span><span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">};</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="w"> </span><span class="n">v2</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span 
class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">};</span><span class="w"></span> <span class="w"> </span><span class="c1">// Describe some work:</span> <span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">work</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">when_all</span><span class="p">(</span><span class="w"></span> <span class="w"> </span><span class="c1">// Double v0 on the CPU</span> <span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">just</span><span class="p">()</span><span class="w"></span> <span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">exec</span><span class="o">::</span><span class="n">on</span><span class="p">(</span><span class="n">cpu_sched</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">bulk</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="n">v0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">v0</span><span class="p">.</span><span class="n">data</span><span class="p">()](</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">v0</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*=</span><span class="w"> </span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">})),</span><span class="w"></span> <span 
class="w"> </span><span class="c1">// Triple v1 on the GPU</span> <span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">just</span><span class="p">()</span><span class="w"></span> <span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">exec</span><span class="o">::</span><span class="n">on</span><span class="p">(</span><span class="n">gpu_sched</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">bulk</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="n">v1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">v1</span><span class="p">.</span><span class="n">data</span><span class="p">()](</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">v1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*=</span><span class="w"> </span><span class="mi">3</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}))</span><span class="w"></span> <span class="w"> </span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">transfer</span><span class="p">(</span><span class="n">cpu_sched</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="c1">// Add the two vectors into the output vector v2 = v0 + v1:</span> <span class="w"> </span><span class="o">|</span><span class="w"> </span><span 
class="n">stdexec</span><span class="o">::</span><span class="n">bulk</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">v2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">v0</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">v1</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"> </span><span class="p">})</span><span class="w"></span> <span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">then</span><span class="p">([</span><span class="o">&amp;</span><span class="p">]</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> 
</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">v2</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">r</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="p">[</span><span class="n">sum</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stdexec</span><span class="o">::</span><span class="n">sync_wait</span><span class="p">(</span><span class="n">work</span><span class="p">).</span><span class="n">value</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="c1">// Print the results:</span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;sum = %d</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span 
class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;v0[%d] = %d, v1[%d] = %d, v2[%d] = %d</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">v0</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">v1</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">v2</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>is compiled as follows:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ --stdpar=gpu --experimental-stdpar -std=c++20 -o example example.cpp </pre></div> </div> <p>and outputs:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>sum = 40 v0[0] = 2, v1[0] = 6, v2[0] = 8 v0[1] = 2, v1[1] = 6, v2[1] = 8 v0[2] = 2, v1[2] = 6, v2[2] = 8 v0[3] = 2, v1[3] = 6, v2[3] = 8 v0[4] = 2, v1[4] = 6, v2[4] = 8 </pre></div> </div> </section> <section id="linear-algebra"> <h4><span class="section-number">8.2.6.3. 
</span>Linear Algebra<a class="headerlink" href="#linear-algebra" title="Permalink to this headline"></a></h4> <p><a class="reference external" href="https://wg21.link/p1673">P1673 - A free function linear algebra interface based on the BLAS</a> proposes standardizing an idiomatic C++ interface based on std::mdspan for a subset of the Basic Linear Algebra Subprograms (BLAS) standard. For an introduction to this feature, see <a class="reference external" href="https://youtu.be/n7mBGDqSzlQ">P1673 (C++ linear algebra library) background &amp; motivation</a>. There are many useful examples available in $HPCSDK_HOME/examples/stdpar/stdblas and in the repository of the <a class="reference external" href="https://github.com/kokkos/stdBLAS/tree/main/examples">reference implementation</a>. Detailed documentation is available at $HPCSDK_HOME/compilers/include/experimental/__p1673_bits/README.md. nvc++ provides access to the NVIDIA implementation, which works with C++17 or newer. Since the proposal is still evolving, our implementation is not stable. It is experimental in nature and will change to follow the proposal closely without any warning. To use the linear algebra library facilities, a suitable linear algebra library must be linked: cuBLAS for GPU execution via the -cudalib=cublas flag, and a CPU BLAS library for CPU execution. 
The HPC SDK bundles OpenBLAS which may be linked using the -lblas linker flag.</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 13%" /> <col style="width: 17%" /> <col style="width: 30%" /> <col style="width: 40%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Execution</p></th> <th class="head"><p>BLAS library</p></th> <th class="head"><p>Architectures</p></th> <th class="head"><p>Compiler flags</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>Multicore</p></td> <td><p>OpenBLAS</p></td> <td><p>x86_64, aarch64</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-stdpar=multicore</span> <span class="pre">-lblas</span></code></p></td> </tr> <tr class="row-odd"><td><p>GPU</p></td> <td><p>cuBLAS</p></td> <td><p>All</p></td> <td><p><code class="docutils literal notranslate"><span class="pre">-stdpar=gpu</span> <span class="pre">-cudalib=cublas</span></code></p></td> </tr> </tbody> </table> <p>The following example (<a class="reference external" href="https://godbolt.org/z/8cT537dfo">godbolt</a>):</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;experimental/mdspan&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;experimental/linalg&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;vector&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;array&gt;</span><span class="cp"></span> <span class="k">namespace</span><span class="w"> </span><span class="nn">stdex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nn">std</span><span class="o">::</span><span class="nn">experimental</span><span class="p">;</span><span class="w"></span> <span 
class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">constexpr</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">4</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">constexpr</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">M</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span><span class="w"> </span><span class="n">A_vec</span><span class="p">(</span><span class="n">N</span><span class="o">*</span><span class="n">M</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span><span class="w"> </span><span class="n">x_vec</span><span class="p">(</span><span class="n">M</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="o">&lt;</span><span class="kt">double</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="o">&gt;</span><span class="w"> </span><span class="n">y_vec</span><span class="p">{};</span><span class="w"></span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span
class="n">mdspan</span><span class="w"> </span><span class="n">A</span><span class="p">(</span><span class="n">A_vec</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span class="n">mdspan</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">x_vec</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">M</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span class="n">mdspan</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="n">y_vec</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">N</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">extent</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span 
class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">extent</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">A</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">100.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">j</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">extent</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">j</span><span class="p">;</span><span class="w"></span> <span class="w"> 
</span><span class="k">for</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">y</span><span class="p">.</span><span class="n">extent</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">-1.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">matrix_vector_product</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">);</span><span class="w"> </span><span class="c1">// y = A * x</span> <span class="w"> </span><span class="c1">// y = 0.5 * y + 2 * A * x</span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">matrix_vector_product</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span 
class="n">linalg</span><span class="o">::</span><span class="n">scaled</span><span class="p">(</span><span class="mf">2.0</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="p">),</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">stdex</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">scaled</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="c1">// Print the results:</span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;y[%d] = %f</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="n">i</span><span class="p">));</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span 
class="p">}</span><span class="w"></span> </pre></div> </div> <p>is compiled as follows for GPU execution:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -std=c++17 -stdpar=gpu -cudalib=cublas -o example example.cpp </pre></div> </div> <p>And as follows for CPU execution:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -std=c++17 -stdpar=multicore -o example example.cpp -lblas </pre></div> </div> <p>and produces the same outputs in both cases:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>y[0] = 2.500000 y[1] = 252.500000 y[2] = 502.500000 y[3] = 752.500000 </pre></div> </div> </section> </section> <section id="stdpar-c-larger-example-lulesh"> <h3><span class="section-number">8.2.7. </span>Stdpar C++ Larger Example: LULESH<a class="headerlink" href="#stdpar-c-larger-example-lulesh" title="Permalink to this headline"></a></h3> <p>The <a class="reference external" href="https://github.com/LLNL/LULESH">LULESH hydrodynamics mini-app</a> was developed at Lawrence Livermore National Laboratory to stress test compilers and model performance of hydrodynamics applications. It is about 9,000 lines of C++ code, of which 2,800 lines are the core computation that should be parallelized.</p> <p>We ported LULESH to C++ Parallel Algorithms and made the port available on <a class="reference external" href="https://github.com/LLNL/LULESH/tree/2.0.2-dev/stdpar">LULESH’s GitHub repository</a>. 
To compile it, install the <a class="reference external" href="https://developer.nvidia.com/hpc-sdk">NVIDIA HPC SDK</a>, check out the 2.0.2-dev branch of the LULESH repository, go to the correct directory, and <code class="docutils literal notranslate"><span class="pre">run</span> <span class="pre">make</span></code>.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>git clone --branch 2.0.2-dev https://github.com/LLNL/LULESH.git cd LULESH/stdpar/build make run </pre></div> </div> <p>While LULESH is too large to show the entire source code here, there are some key code sequences that demonstrate the use of stdpar.</p> <p>The LULESH code has many loops with large bodies and no loop-carried dependencies, making them good candidates for parallelization. Most of these were easily converted into calls to <code class="docutils literal notranslate"><span class="pre">std::for_each_n</span></code> with the <code class="docutils literal notranslate"><span class="pre">std::execution::par</span></code> policy, where the body of the lambda passed to <code class="docutils literal notranslate"><span class="pre">std::for_each_n</span></code> is identical to the original loop body.</p> <p>The function <code class="docutils literal notranslate"><span class="pre">CalcMonotonicQRegionForElems</span></code> is an example of this. 
The loop header written for OpenMP looks as follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp parallel for firstprivate(qlc_monoq, qqc_monoq, \</span> <span class="cp"> monoq_limiter_mult, monoq_max_slope, ptiny)</span> <span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="n">Index_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">domain</span><span class="p">.</span><span class="n">regElemSize</span><span class="p">(</span><span class="n">r</span><span class="p">);</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> </pre></div> </div> <p>This loop header in the C++ Parallel Algorithms version becomes <a class="reference external" href="https://github.com/LLNL/LULESH/blob/2.0.2-dev/stdpar/src/lulesh.cc#L1555-L1756">the following</a>:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">for_each_n</span><span class="p">(</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">counting_iterator</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span><span class="w"> </span><span class="n">domain</span><span class="p">.</span><span class="n">regElemSize</span><span class="p">(</span><span 
class="n">r</span><span class="p">),</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">=</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">domain</span><span class="p">](</span><span class="n">Index_t</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> </pre></div> </div> <p>The loop body, which in this case is almost 200 lines long, becomes the body of the lambda but is otherwise unchanged from the OpenMP version.</p> <p>In a number of places, an explicit <code class="docutils literal notranslate"><span class="pre">for</span></code> loop was changed to use C++ Parallel Algorithms that better express the intent of the code, such as the function <code class="docutils literal notranslate"><span class="pre">CalcPressureForElems</span></code>:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma omp parallel for firstprivate(length)</span> <span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">Index_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">Real_t</span><span class="w"> </span><span class="n">c1s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Real_t</span><span class="p">(</span><span class="mf">2.0</span><span 
class="p">)</span><span class="o">/</span><span class="n">Real_t</span><span class="p">(</span><span class="mf">3.0</span><span class="p">)</span><span class="w"> </span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">bvc</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">c1s</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="n">compression</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Real_t</span><span class="p">(</span><span class="mf">1.</span><span class="p">));</span><span class="w"></span> <span class="w"> </span><span class="n">pbvc</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">c1s</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>This function was rewritten <a class="reference external" href="https://github.com/LLNL/LULESH/blob/2.0.2-dev/stdpar/src/lulesh.cc#L1825-L1830">as follows</a>:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">constexpr</span><span class="w"> </span><span class="n">Real_t</span><span class="w"> </span><span class="n">cls</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Real_t</span><span class="p">(</span><span class="mf">2.0</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Real_t</span><span class="p">(</span><span class="mf">3.0</span><span class="p">);</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span
class="n">transform</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">compression</span><span class="p">,</span><span class="w"> </span><span class="n">compression</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">length</span><span class="p">,</span><span class="w"> </span><span class="n">bvc</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">Real_t</span><span class="w"> </span><span class="n">compression_i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">cls</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="n">compression_i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Real_t</span><span class="p">(</span><span class="mf">1.0</span><span class="p">));</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">pbvc</span><span class="p">,</span><span class="w"> </span><span class="n">pbvc</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">length</span><span class="p">,</span><span class="w"> </span><span class="n">cls</span><span class="p">);</span><span class="w"></span> 
</pre></div> </div> <span class="target" id="stdpar-cpp-interop-openacc"></span></section> <section id="interoperability-with-openacc"> <h3><span class="section-number">8.2.8. </span>Interoperability with OpenACC<a class="headerlink" href="#interoperability-with-openacc" title="Permalink to this headline"></a></h3> <p>A subset of OpenACC features can be used when compiling Stdpar code for GPUs. That subset is documented in this section. To enable recognition of OpenACC directives in Stdpar code, add the <code class="docutils literal notranslate"><span class="pre">-acc</span></code> command-line flag to nvc++.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar -acc example.cpp </pre></div> </div> <p>OpenACC functionality is detailed in the OpenACC specification; the NVHPC compiler-specific differences are detailed in <a class="reference internal" href="#acc-use"><span class="std std-ref">Using OpenACC</span></a> in this guide.</p> <p>Combining OpenACC features with Stdpar offers greater flexibility in how code is written. For instance, it allows external functions to be called from within parallel algorithms. Additionally, it provides opportunities for performance tuning, such as through explicit data management.</p> <section id="data-management-directives"> <h4><span class="section-number">8.2.8.1. </span>Data Management Directives<a class="headerlink" href="#data-management-directives" title="Permalink to this headline"></a></h4> <p>C++ parallel algorithms can be offloaded to the GPU when the data accessed in such algorithms is managed through the OpenACC directives.
With data fully managed through the OpenACC directives, Stdpar code can run with all GPU Memory Modes including Separate Memory Mode (compiled with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate</span></code>).</p> <p>The following data directives are supported:</p> <ul class="simple"> <li><p>OpenACC structured data construct directive</p></li> <li><p>OpenACC unstructured enter/exit data directives</p></li> <li><p>OpenACC host_data directive</p></li> <li><p>OpenACC update directive</p></li> </ul> <p>Only data captured by reference, pointer-like data captured by value, and pointer-like data passed as arguments to the parallel algorithm lambdas can be managed through OpenACC. Any non-pointer variables captured by value in the parallel algorithm lambda, and any non-pointer data passed as lambda arguments, are managed by the C++ implementation: a copy of such data is automatically created in memory accessible from the GPU. For additional details refer to <a class="reference internal" href="#stdpar-cpp-unified-memory"><span class="std std-ref">Data Management in Parallel Algorithms</span></a>.</p> <p>OpenACC data management can serve two main purposes:</p> <ul class="simple"> <li><p><strong>Explicit Data Management:</strong> This is necessary for data that cannot be managed implicitly, such as on platforms without full CUDA Unified Memory support and when data is not allocated in the CUDA Managed Memory segment.</p></li> <li><p><strong>Performance Tuning:</strong> Even when data is located in GPU-accessible memory, performance can be optimized via OpenACC features. Many OpenACC data directives and clauses provide hints to the CUDA device driver, which can improve implicit data management.</p></li> </ul> <p>Data management strategies may differ depending on the specific goals being pursued.
These differences are outlined where applicable.</p> <p><strong>General Rules</strong></p> <p>All directives, except <code class="docutils literal notranslate"><span class="pre">host_data</span></code>, can be used for data management tasks such as allocating memory in the GPU and copying data between the CPU and the GPU. These directives can be used to ensure that the data is present on the device during the execution of parallel algorithms. The <code class="docutils literal notranslate"><span class="pre">host_data</span></code> construct, on the other hand, is used for address translation between CPU and GPU address spaces when data is accessed in parallel algorithms.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get_n</span><span class="p">();</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span class="w"></span> <span class="c1">// Data captured by the lambda are managed explicitly with OpenACC</span> <span class="cp">#pragma acc enter data copyin(n, in[0:nelem]) create(out[0:nelem])</span> <span class="cp">#pragma acc host_data use_device(n, in, out)</span> <span class="p">{</span><span class="w"></span> <span class="w"> 
</span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">,</span><span class="n">in</span><span class="p">,</span><span class="n">out</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="cp">#pragma acc exit data copyout(out[0:nelem])</span> </pre></div> </div> <p>In the above example, all data accessed from <code class="docutils literal notranslate"><span class="pre">std::for_each</span></code> through the lambda captures is managed explicitly through the OpenACC data directives. Since data used inside the parallel algorithm is either captured by reference or referenced through captured pointers, the application code must ensure that such data is accessible from the GPU.
To make non-GPU-resident data accessible in the parallel region, the region must be enclosed in a <code class="docutils literal notranslate"><span class="pre">host_data</span></code> construct, with all variables that are managed explicitly by the OpenACC runtime listed in its <code class="docutils literal notranslate"><span class="pre">use_device</span></code> clause. The data must be present on the device (copied or created) when the <code class="docutils literal notranslate"><span class="pre">host_data</span></code> directive is executed at runtime, and it must remain present for the duration of the parallel algorithm's execution. One implication is that lambdas accessing variables listed in <code class="docutils literal notranslate"><span class="pre">use_device</span></code> regions cannot also be invoked from host code (from outside a parallel region executing on the GPU), because the GPU addresses obtained through <code class="docutils literal notranslate"><span class="pre">host_data</span></code> may not be accessible on the CPU.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>If the iterator in the above example were a pointer type, it would require explicit data management in addition to the data captured by the lambda.</p> </div> <p>If the example below is compiled for Separate Memory Mode (<code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate</span></code>), calling <code class="docutils literal notranslate"><span class="pre">fn</span></code> from within a parallel <code class="docutils literal notranslate"><span class="pre">std::for_each</span></code> works fine, but calling it from outside of any parallel algorithm does not, since the GPU-resident data would then have to be accessed from the CPU.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span
class="w"> </span><span class="n">get_n</span><span class="p">();</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span class="w"></span> <span class="cp">#pragma acc enter data copyin(n, in[0:nelem]) create(out[0:nelem])</span> <span class="cp">#pragma acc host_data use_device(n, in, out)</span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">fn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">,</span><span class="n">in</span><span class="p">,</span><span class="n">out</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n</span><span class="p">;};</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span 
class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="n">fn</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="c1">// The following line would not be legal, fn accesses variables in GPU memory</span> <span class="w"> </span><span class="c1">//std::for_each(r.begin(), r.end(), fn);</span> <span class="p">}</span><span class="w"></span> <span class="cp">#pragma acc exit data copyout(out[0:nelem])</span> </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The behavior of using <code class="docutils literal notranslate"><span class="pre">use_device</span></code> with non-pointer data type is such that all occurrences of non-pointer variables inside the <code class="docutils literal notranslate"><span class="pre">host_data</span></code> region are converted to using the addresses of the variable in the GPU address space before accessing that variable. 
This is essentially equivalent to translating each occurrence of such a variable <code class="docutils literal notranslate"><span class="pre">var</span></code> into <code class="docutils literal notranslate"><span class="pre">dvar</span> <span class="pre">=</span> <span class="pre">*acc_deviceptr(&amp;var)</span></code>.</p> </div> <p><strong>Composite Data Types</strong></p> <p>Composite data types with pointer members can also be managed explicitly, but they require an explicit deep copy, including pointer attach/detach, to work correctly.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">struct</span><span class="w"> </span><span class="nc">S</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">ptr</span><span class="p">;</span><span class="w"></span> <span class="p">};</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">idx</span><span class="p">[</span><span class="n">N</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="cm">/*...*/</span><span class="p">};</span><span class="w"></span> <span class="kt">float</span><span class="w"> </span><span class="n">arr</span><span class="p">[</span><span class="n">N</span><span class="p">];</span><span class="w"></span> <span class="n">S</span><span class="w"> </span><span class="n">s</span><span class="p">{</span><span class="n">arr</span><span class="p">};</span><span class="w"></span> <span class="c1">// Deep copying ptr member with OpenACC</span> <span class="cp">#pragma acc enter data copyin(s.ptr[0:N])</span> <span class="cp">#pragma acc enter data copyin(s, idx)</span> <span class="cp">#pragma acc data attach(s.ptr)</span> <span class="cp">#pragma acc host_data use_device(s, idx)</span> <span
class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each_n</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">idx</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">s</span><span class="p">.</span><span class="n">ptr</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mf">5.0</span><span class="p">;</span><span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="cp">#pragma acc exit data copyout(s.ptr[0:N])</span> <span class="cp">#pragma acc exit data copyout(s)</span> </pre></div> </div> <p>When a variable of struct <code class="docutils literal notranslate"><span class="pre">S</span></code> type in the above example is copied to the device, a deep copy is performed, with the content pointed to by <code class="docutils literal notranslate"><span class="pre">S.ptr</span></code> copied separately. Pointer attachment ensures that the pointer value is translated to its device-memory equivalent before it is accessed from the GPU.
Depending on the order of the copies, the pointer <code class="docutils literal notranslate"><span class="pre">attach</span></code> clause may not be required.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>In the above example, the pointer-like iterator <code class="docutils literal notranslate"><span class="pre">idx</span></code> is managed through the OpenACC directives in addition to the data captured by the lambda.</p> </div> <p><strong>Standard Containers</strong></p> <p>If standard containers with contiguous storage must be used in host code with explicit data management of GPU memory, the only viable option is to access the raw data directly through a pointer (e.g. one obtained via the <code class="docutils literal notranslate"><span class="pre">data()</span></code> member of <code class="docutils literal notranslate"><span class="pre">std::vector</span></code>), unless an iterator over the data can be used.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="w"> </span><span class="n">in</span><span class="p">(</span><span class="n">nelem</span><span class="p">);</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="w"> </span><span class="n">out</span><span class="p">(</span><span class="n">nelem</span><span class="p">);</span><span class="w"></span> <span class="n">T</span><span class="w"> </span><span class="o">*</span><span class="n">inptr</span><span class="o">=</span><span class="n">in</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="o">*</span><span class="n">outptr</span><span class="o">=</span><span
class="n">out</span><span class="p">.</span><span class="n">data</span><span class="p">();</span><span class="w"></span> <span class="cp">#pragma acc data copyin(inptr[0:nelem]) copyout(outptr[0:nelem])</span> <span class="cp">#pragma acc host_data use_device(inptr,outptr)</span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">outptr</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inptr</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>In the above example vector elements are accessed through raw pointers to their elements obtained through <code class="docutils literal notranslate"><span class="pre">vector::data()</span></code> member, they are explicitly management through the OpenACC data clauses.</p> <p><strong>Static Storage Data</strong></p> <p>Global or static variables can be made 
accessible in the parallel algorithms using OpenACC data directives similarly to other variables.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">glob_arr</span><span class="p">[</span><span class="n">N</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="cm">/*...*/</span><span class="p">};</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">foo</span><span class="p">(){</span><span class="w"></span> <span class="cp">#pragma acc data copy(glob_arr)</span> <span class="cp">#pragma acc host_data use_device(glob_arr)</span> <span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each_n</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">glob_arr</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">[](</span><span class="kt">int</span><span class="w"> </span><span class="o">&amp;</span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>In the above example the global array <code class="docutils literal notranslate"><span 
class="pre">glob_arr</span></code> is updated on the GPU with help of OpenACC data directives.</p> <p><strong>Member Functions</strong></p> <p>When the data members are managed inside the member functions the implicit object pointer <code class="docutils literal notranslate"><span class="pre">this</span></code> needs to be explicitly managed for correctness as accessing members is always done through the dereference of the object itself.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">struct</span><span class="w"> </span><span class="nc">S</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">ptr</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">update_member</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="cp">#pragma acc data copy(ptr[0:N], this)</span> <span class="cp">#pragma acc host_data use_device(ptr, this)</span> <span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">ptr</span><span class="p">,</span><span class="w"> </span><span class="n">ptr</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="kt">float</span><span class="w"> </span><span 
class="o">&amp;</span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">ptr</span><span class="p">[</span><span class="o">&amp;</span><span class="n">e</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">ptr</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mf">5.0</span><span class="p">;</span><span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="p">};</span><span class="w"></span> </pre></div> </div> <p><strong>GPU Memory Mode Related Differences</strong></p> <p>In Separate Memory Mode all data must be managed explicitly via extra device allocations and <code class="docutils literal notranslate"><span class="pre">memcpy</span></code> between the host and device and the address translations. This also applies to variables with automatic or static storage duration in Managed Memory Mode.</p> <p>In Unified Memory Mode all data is automatically managed by the CUDA device driver. Additionally in Managed Memory Mode all dynamic allocations are managed by the CUDA device driver. Use of data clauses and directives can only propagate memory usage hints to the CUDA device driver which are used to improve the data management performance. 
More details can be found in Memory Model and <a class="reference internal" href="#acc-openacc-unified-mem"><span class="std std-ref">OpenACC with CUDA Unified Memory</span></a>.</p> <p>All data managed by the CUDA device driver can benefit from simplified use of the OpenACC features, in particular:</p> <ul class="simple"> <li><p>Use of the <code class="docutils literal notranslate"><span class="pre">host_data</span></code> directive is not required, since the host and device addresses of data in unified shared memory are identical.</p></li> <li><p>Use of pointer attach or detach is not required, since the host and device pointers in unified shared memory are identical.</p></li> </ul> <p>The following example illustrates simplified data management in Unified Memory Mode, with only an OpenACC data construct enclosing the <code class="docutils literal notranslate"><span class="pre">std::for_each</span></code>.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get_n</span><span class="p">();</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span class="w"></span> <span class="cp">#pragma acc data copyin(in[0:nelem]) copyout(out[0:nelem])</span> 
<span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>In the above example we leverage OpenACC explicit data management construct to indicate how data is used on GPU for the computation executed in <code class="docutils literal notranslate"><span class="pre">std::for_each</span></code>:</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">in</span></code> is moved into the GPU memory;</p></li> <li><p><code class="docutils literal notranslate"><span class="pre">out</span></code> is moved from the GPU memory.</p></li> </ul> <p>Both <code 
class="docutils literal notranslate"><span class="pre">in</span></code> and <code class="docutils literal notranslate"><span class="pre">out</span></code> are captured by reference and therefore their host address is used in the lambda of <code class="docutils literal notranslate"><span class="pre">std::for_each</span></code>. The scalar variable <code class="docutils literal notranslate"><span class="pre">n</span></code> is not managed. The use of <code class="docutils literal notranslate"><span class="pre">host_data</span></code> construct is not required.</p> <p>When standard containers are used in data directives and clauses, the underlying data collection can be managed too. For example, in order to indicate that elements of the <code class="docutils literal notranslate"><span class="pre">std::vector</span></code> are accessed from the GPU the application code must first retrieve the pointer to the array elements using its <code class="docutils literal notranslate"><span class="pre">data()</span></code> member. 
Then such pointers can be used in the regular data directives.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="w"> </span><span class="n">in</span><span class="p">(</span><span class="n">nelem</span><span class="p">);</span><span class="w"></span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="w"> </span><span class="n">out</span><span class="p">(</span><span class="n">nelem</span><span class="p">);</span><span class="w"></span> <span class="n">T</span><span class="w"> </span><span class="o">*</span><span class="n">inptr</span><span class="o">=</span><span class="n">in</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="o">*</span><span class="n">outptr</span><span class="o">=</span><span class="n">out</span><span class="p">.</span><span class="n">data</span><span class="p">();</span><span class="w"></span> <span class="cp">#pragma acc data copyin(inptr[0:nelem]) copyout(outptr[0:nelem])</span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span 
class="o">&amp;</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The above example demonstrates the use of OpenACC data directives with a raw pointer to elements of <code class="docutils literal notranslate"><span class="pre">std::vector</span></code> which can improve memory performance for data in unified memory and the full deep copy of vector content using attach/detach is not required.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get_n</span><span class="p">();</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span class="w"></span> <span class="n">T</span><span class="o">*</span><span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">T</span><span class="p">[</span><span class="n">nelem</span><span class="p">];</span><span 
class="w"></span> <span class="cp">#pragma acc enter data copyin(n)</span> <span class="cp">#pragma acc host_data use_device(n)</span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">&amp;</span><span class="p">,</span><span class="w"> </span><span class="n">in</span><span class="p">,</span><span class="w"> </span><span class="n">out</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="cp">#pragma acc enter data delete(n)</span> </pre></div> </div> <p>In the above example, <code class="docutils literal notranslate"><span class="pre">in</span></code> and <code class="docutils literal notranslate"><span class="pre">out</span></code> are 
dynamically allocated and managed by the CUDA device driver in Managed Memory Mode, while <code class="docutils literal notranslate"><span class="pre">n</span></code> is on the stack and is therefore managed explicitly via OpenACC directives.</p> <span class="target" id="stdpar-cpp-interop-openacc-routine"></span></section> <section id="external-device-function-annotations"> <h4><span class="section-number">8.2.8.2. </span>External Device Function Annotations<a class="headerlink" href="#external-device-function-annotations" title="Permalink to this headline"></a></h4> <p>Annotating external functions with the OpenACC routine directive allows them to be called from device code.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="c1">// In file1.cpp</span> <span class="k">extern</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="nf">foo</span><span class="p">();</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">bar</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">for_each</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_unseq</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"></span> <span class="w"> </span><span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> 
</span><span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="p">});</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="c1">// In file2.cpp</span> <span class="cp">#pragma acc routine</span> <span class="kt">int</span><span class="w"> </span><span class="nf">foo</span><span class="p">(){</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">4</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The above code can be compiled/linked as follows:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvc++ -stdpar -c file1.cpp nvc++ -acc -c file2.cpp nvc++ -stdpar -acc file1.o file2.o </pre></div> </div> </section> </section> <section id="getting-started-with-parallel-algorithms-for-gpus"> <h3><span class="section-number">8.2.9. </span>Getting Started with Parallel Algorithms for GPUs<a class="headerlink" href="#getting-started-with-parallel-algorithms-for-gpus" title="Permalink to this headline"></a></h3> <p>To get started, download and install the <a class="reference external" href="https://developer.nvidia.com/hpc-sdk">NVIDIA HPC SDK</a> on your x86-64 or Arm CPU-based system running a supported version of Linux.</p> <p>The NVIDIA HPC SDK is freely downloadable and includes a perpetual use license for all NVIDIA Registered Developers, including access to future release updates as they are issued. 
After you have the NVIDIA HPC SDK installed on your system, the nvc++ compiler is available under the <code class="docutils literal notranslate"><span class="pre">/opt/nvidia/hpc_sdk</span></code> directory structure.</p> <ul class="simple"> <li><p>To use the compilers including nvc++ on a Linux/x86-64 system, add the directory <code class="docutils literal notranslate"><span class="pre">/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/bin</span></code> to your path.</p></li> <li><p>On an Arm CPU-based system, replace <code class="docutils literal notranslate"><span class="pre">Linux_x86_64</span></code> with <code class="docutils literal notranslate"><span class="pre">Linux_aarch64</span></code>.</p></li> </ul> <section id="supported-nvidia-gpus"> <h4><span class="section-number">8.2.9.1. </span>Supported NVIDIA GPUs<a class="headerlink" href="#supported-nvidia-gpus" title="Permalink to this headline"></a></h4> <p>The NVC++ compiler can automatically offload C++ Parallel Algorithms to NVIDIA GPUs based on the Volta architecture or newer. These architectures include features – such as independent thread scheduling and hardware optimizations for CUDA Unified Memory – that were specifically designed to support high-performance, general-purpose parallel programming models like the C++ Parallel Algorithms.</p> <p>The NVC++ compiler provides limited support for C++ Parallel Algorithms on the Pascal architecture, which does not have the <a class="reference external" href="https://devblogs.nvidia.com/inside-volta/">independent thread scheduling</a> necessary to properly support the <code class="docutils literal notranslate"><span class="pre">std::execution::par</span></code> policy. When compiling for the Pascal architecture (-gpu=cc60), NVC++ compiles algorithms with the <code class="docutils literal notranslate"><span class="pre">std::execution::par</span></code> policy for serial execution on the CPU. 
Only algorithms with the <code class="docutils literal notranslate"><span class="pre">std::execution::par_unseq</span></code> policy will be scheduled to run on Pascal GPUs.</p> </section> <section id="supported-cuda-versions"> <h4><span class="section-number">8.2.9.2. </span>Supported CUDA Versions<a class="headerlink" href="#supported-cuda-versions" title="Permalink to this headline"></a></h4> <p>The NVC++ compiler is built on CUDA libraries and technologies and uses CUDA to accelerate C++ Parallel Algorithms on NVIDIA GPUs. A GPU-accelerated system on which NVC++-compiled applications are to be run must have a CUDA 11.2 or newer device driver installed.</p> <p>The NVIDIA HPC SDK compilers ship with an integrated CUDA toolchain, header files, and libraries to use during compilation, so it is not necessary to have a CUDA Toolkit installed on the system.</p> <p>When <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> is specified, NVC++ compiles using the CUDA toolchain version that best matches the CUDA driver installed on the system on which compilation is performed. To compile using a different version of the CUDA toolchain, use the <code class="docutils literal notranslate"><span class="pre">-gpu=cudaX.Y</span></code> option. For example, use the <code class="docutils literal notranslate"><span class="pre">-gpu=cuda11.8</span></code> option to specify that your program should be compiled for a CUDA 11.8 system using the CUDA 11.8 toolchain.</p> </section> </section> </section> <section id="stdpar-fortran"> <h2><span class="section-number">8.3. </span>Stdpar Fortran<a class="headerlink" href="#stdpar-fortran" title="Permalink to this headline"></a></h2> <p>Fortran 2008 introduced the <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">concurrent</span></code> (DC) loop construct signaling that loop iterations have no interdependencies. 
Such loop iterations will be executed in parallel on the GPU when <code class="docutils literal notranslate"><span class="pre">-stdpar</span></code> (or <code class="docutils literal notranslate"><span class="pre">-stdpar=gpu</span></code>) is passed to <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code>, or using CPU threads when <code class="docutils literal notranslate"><span class="pre">-stdpar=multicore</span></code> is passed to <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code>. More details can be found in the following blog post on the NVIDIA website: <a class="reference external" href="https://developer.nvidia.com/blog/accelerating-fortran-do-concurrent-with-gpus-and-the-nvidia-hpc-sdk">Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK</a>.</p> <section id="calling-routines-in-do-concurrent-on-the-gpu"> <h3><span class="section-number">8.3.1. </span>Calling Routines in DO CONCURRENT on the GPU<a class="headerlink" href="#calling-routines-in-do-concurrent-on-the-gpu" title="Permalink to this headline"></a></h3> <p>When compiling for the GPU, calling routines in the body of a <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">concurrent</span></code> loop is subject to constraints. PURE routines can generally be called inside the <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">concurrent</span></code> loop body. The compiler detects that such routines are to be compiled for the GPU target. 
External routines, however, can’t be called from within the DC loop unless they are explicitly annotated with the OpenACC routine directive (refer to <a class="reference internal" href="#stdpar-fortran-interop-openacc"><span class="std std-ref">Interoperability with OpenACC</span></a>) or the CUDA device attribute (refer to <a class="reference internal" href="#stdpar-fortran-interop-cuf"><span class="std std-ref">Interoperability with CUDA Fortran</span></a>).</p> <p>The following example will compile successfully.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">module </span><span class="n">m</span><span class="w"></span> <span class="k">contains</span> <span class="k">pure subroutine </span><span class="n">foo</span><span class="p">()</span><span class="w"></span> <span class="k">return</span> <span class="k">end subroutine</span> <span class="k">end module </span><span class="n">m</span><span class="w"></span> <span class="k">program </span><span class="n">dc</span><span class="w"></span> <span class="k">use </span><span class="n">m</span><span class="w"></span> <span class="k">implicit none</span> <span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">foo</span><span class="p">()</span><span class="w"></span> <span class="k">enddo</span> <span class="k">end program</span><span class="w"></span> </pre></div> </div> <p>The following example, however, doesn’t compile unless <code class="docutils literal notranslate"><span class="pre">foo</span></code> is either</p> <ul class="simple"> <li><p>annotated with 
<code class="docutils literal notranslate"><span class="pre">!$acc</span> <span class="pre">routine</span></code>,</p></li> <li><p>or attributed with <code class="docutils literal notranslate"><span class="pre">attributes(device)</span></code> and compiled as Stdpar and CUDA Fortran.</p></li> </ul> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">program </span><span class="n">dc</span><span class="w"></span> <span class="k">implicit none</span> <span class="k">interface</span> <span class="k"> pure subroutine </span><span class="n">foo</span><span class="p">()</span><span class="w"></span> <span class="w"> </span><span class="k">end subroutine </span><span class="n">foo</span><span class="w"></span> <span class="k">end interface</span> <span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">foo</span><span class="p">()</span><span class="w"></span> <span class="k">enddo</span> <span class="k">end program</span><span class="w"></span> </pre></div> </div> </section> <section id="gpu-data-management"> <h3><span class="section-number">8.3.2. 
</span>GPU Data Management<a class="headerlink" href="#gpu-data-management" title="Permalink to this headline"></a></h3> <p>If <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> is enabled by default or is explicitly passed on the command line, some data accesses in <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">concurrent</span></code> loops are invalid. For example, accessing global variables in routines called from the <code class="docutils literal notranslate"><span class="pre">do</span> <span class="pre">concurrent</span></code> loop does not produce the expected value updates in the CPU code.</p> <p>Additionally, there are rare instances where the compiler cannot accurately determine variable sizes for implicit data movements between CPU and GPU. As demonstrated in the following example, <code class="docutils literal notranslate"><span class="pre">a</span></code> is an assumed-size array, and its access region inside the DC construct cannot be determined at compile time because the element index positions are taken from another array <code class="docutils literal notranslate"><span class="pre">b</span></code> that is initialized outside of the routine. 
Such code does not update <code class="docutils literal notranslate"><span class="pre">a</span></code> as expected and may result in a memory violation and undefined behavior.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">r</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="kt">integer</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">b</span><span class="p">(:)</span><span class="w"></span>
<span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">size</span><span class="p">(</span><span class="n">b</span><span class="p">))</span><span class="w"></span>
<span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">))</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="w"></span>
<span class="w"> </span><span class="k">enddo</span>
<span class="k">end subroutine</span><span class="w"></span>
</pre></div> </div> <p>There are no limitations on the variables accessed in <code class="docutils literal notranslate"><span class="pre">do</span> <span
class="pre">concurrent</span></code> loops described above when the code is compiled with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code>, whether this option is enabled by default or explicitly via an option on the command line.</p> </section> <section id="id3"> <h3><span class="section-number">8.3.3. </span>Interoperability with OpenACC<a class="headerlink" href="#id3" title="Permalink to this headline"></a></h3> <p>OpenACC features can be used when compiling Stdpar code for GPUs. To activate OpenACC directives recognition with Stdpar code add <code class="docutils literal notranslate"><span class="pre">-acc</span></code> command line flag to <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code>.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvfortran -stdpar -acc example.f90 </pre></div> </div> <p>OpenACC functionality and interoperability with DO-CONCURRENT loop is detailed in the OpenACC specification and the NVIDIA HPC compiler specific differences are detailed in <a class="reference internal" href="#acc-use"><span class="std std-ref">Using OpenACC</span></a> of this guide.</p> <p>Using OpenACC features can enhance functionality of DC-loop for example with the following:</p> <ul class="simple"> <li><p>Explicit data management to improve performance of CPU-GPU implicit data movements or even leverage separate memory compiling on the GPU when compiling with <code class="docutils literal notranslate"><span class="pre">-gpu=mem:separate</span></code> passed in.</p></li> <li><p>Tuning DC-loop execution on the GPU e.g. 
GPU kernels launch configuration.</p></li> <li><p>Executing DC-loops asynchronously.</p></li> <li><p>Calling external routines from within DC-loops.</p></li> <li><p>Atomic operations in DC-loops.</p></li> </ul> <p><strong>Examples</strong></p> <p>Some examples of using OpenACC directives with DC-loops are provided below.</p> <p>The following example demonstrates how the data accessed inside the DC-loop are fully managed in the OpenACC data construct.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="c">!$acc data copyin(b) copyout(a)</span> <span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">K</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">end do</span> <span class="k">end do</span><span class="w"></span> <span class="c">!$acc end data</span> </pre></div> </div> <p>While in the above example the data construct is used for GPU data management, the same effect can be achieved with the use of data clauses on the compute construct enclosing DC-loop.</p> <p>The following example shows how the scheduling of DC loop on the GPU is controlled through the clauses on the compute construct.</p> <div 
class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="c">!$acc parallel loop num_gangs(50000) vector_length(32)</span> <span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">K</span><span class="p">,</span><span class="n">j</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kt">real</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="k">end do</span><span class="w"></span> </pre></div> </div> <p>Use of OpenACC async clause on the compute constructs can be utilised to perform computations in DC-loop asynchronously.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="c">!$acc parallel loop async</span> <span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">j</span><span class="w"></span> <span class="k">end do</span> <span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span 
class="p">()</span><span class="w"></span> <span class="cp">#pragma acc wait</span> <span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">sum</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="w"></span> </pre></div> </div> <p>In the previous example, array <code class="docutils literal notranslate"><span class="pre">a</span></code> is filled in with values asynchronously in DC-loop.</p> <span class="target" id="stdpar-fortran-interop-cuf"></span></section> <section id="interoperability-with-cuda-fortran"> <h3><span class="section-number">8.3.4. </span>Interoperability with CUDA Fortran<a class="headerlink" href="#interoperability-with-cuda-fortran" title="Permalink to this headline"></a></h3> <p>CUDA Fortran features can also be used when compiling Stdpar code for GPUs. To recognize CUDA Fortran features in your source code, compile with the <code class="docutils literal notranslate"><span class="pre">-cuda</span></code> command line flag using <code class="docutils literal notranslate"><span class="pre">nvfortran</span></code>.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>nvfortran -stdpar -cuda example.f90 </pre></div> </div> <p>Using CUDA Fortran extensions can enhance the functionality of a do concurrent (DC) loop and Stdpar program, for several cases:</p> <ul class="simple"> <li><p>Explicit data locality, accessing CUDA Fortran attributed arrays or other data with the device, managed, unified, or constant attributes from within DC-loops.</p></li> <li><p>Tuning DC-loop execution on the GPU e.g. 
controlling the GPU kernels launch configuration.</p></li> <li><p>Executing DC-loops asynchronously using a specific CUDA stream.</p></li> <li><p>Calling external, user-defined CUDA device routines from within DC-loops.</p></li> <li><p>Using CUDA Atomic operations in DC-loops, or other CUDA-specific device-side runtime library calls.</p></li> <li><p>Inserting CUDA Runtime API calls for memory tuning hints outside of DC-loops.</p></li> </ul> <p><strong>Examples</strong></p> <p>Some examples of using CUDA Fortran features with DC-loops are provided below. The following example demonstrates how a DC-loop can access CUDA Fortran device data, run on a specific CUDA stream, call the CUDA Runtime API for creating a stream, and hide non-standard features behind the CUF sentinel for code portability.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="c">!@cuf use cudafor</span> <span class="c">!@cuf integer(kind=cuda_stream_kind) :: istrm</span> <span class="w"> </span><span class="kt">real</span><span class="p">,</span><span class="w"> </span><span class="k">allocatable</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">a</span><span class="p">(:,:),</span><span class="w"> </span><span class="n">b</span><span class="p">(:,:)</span><span class="w"></span> <span class="c">!@cuf attributes(device) :: a ! 
A is device array only, not unified/managed</span> <span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"></span> <span class="c">!@cuf istat = cudaStreamCreate(istrm)</span> <span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(:,:)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.0</span><span class="w"></span> <span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"></span> <span class="c">!$cuf kernel do(1) &lt;&lt;&lt; *, *, stream=istrm&gt;&gt;&gt;</span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">K</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span 
class="n">b</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">end do</span> <span class="k"> end do</span><span class="w"></span> </pre></div> </div> <p>This program demonstrates how to call low-level CUDA device functions from within a DC-loop. The function can be written in either CUDA Fortran or CUDA C++, depending on the interface. The CUDA C function must be compiled for relocatable device code. This can be used for accessing features in CUDA and NVIDIA GPUs not readily available in directive-based models or standard languages.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">module </span><span class="n">mcuda</span><span class="w"></span> <span class="w"> </span><span class="k">contains</span> <span class="k"> </span><span class="n">attributes</span><span class="p">(</span><span class="n">host</span><span class="p">,</span><span class="n">device</span><span class="p">)</span><span class="w"> </span><span class="k">pure </span><span class="kt">integer </span><span class="k">function </span><span class="n">std_dbg</span><span class="p">(</span><span class="n">itype</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span class="k">value</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">itype</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">itype</span><span class="p">.</span><span class="n">eq</span><span class="p">.</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">then</span> <span class="k"> </span><span class="n">std_dbg</span><span class="w"> </span><span class="o">=</span><span 
class="w"> </span><span class="n">threadIdx</span><span class="p">%</span><span class="n">x</span><span class="w"></span> <span class="w"> </span><span class="k">else if</span><span class="w"> </span><span class="p">(</span><span class="n">itype</span><span class="p">.</span><span class="n">eq</span><span class="p">.</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="k">then</span> <span class="k"> </span><span class="n">std_dbg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">blockIdx</span><span class="p">%</span><span class="n">x</span><span class="w"></span> <span class="w"> </span><span class="k">else</span> <span class="k"> </span><span class="n">std_dbg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">blockIdx</span><span class="p">%</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="n">blockDim</span><span class="p">%</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">threadIdx</span><span class="p">%</span><span class="n">x</span><span class="w"></span> <span class="w"> </span><span class="k">end if</span> <span class="k"> end function</span> <span class="k">end module</span> <span class="k">program </span><span class="n">test</span><span class="w"></span> <span class="k">use </span><span class="n">mcuda</span><span class="w"></span> <span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span class="k">parameter</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">2000</span><span class="w"></span> <span class="kt">integer</span><span class="p">,</span><span class="w"> </span><span 
class="k">allocatable</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">a</span><span class="p">(:),</span><span class="w"> </span><span class="n">b</span><span class="p">(:),</span><span class="w"> </span><span class="n">c</span><span class="p">(:)</span><span class="w"></span> <span class="k">allocate</span><span class="p">(</span><span class="n">a</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="n">b</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="n">N</span><span class="p">))</span><span class="w"></span> <span class="k">do</span><span class="w"> </span><span class="k">concurrent</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">std_dbg</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">std_dbg</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">std_dbg</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="w"></span> <span class="k">end do</span> <span 
class="k">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="n">a</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="n">a</span><span class="p">(</span><span class="n">N</span><span class="o">/</span><span class="mi">2</span><span class="p">),</span><span class="n">a</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="k">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="n">b</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="n">b</span><span class="p">(</span><span class="n">N</span><span class="o">/</span><span class="mi">2</span><span class="p">),</span><span class="n">b</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="k">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="n">c</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="n">N</span><span class="o">/</span><span class="mi">2</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="k">end</span><span class="w"></span> </pre></div> </div> <p>Many functions from the CUDA Fortran <code class="docutils literal notranslate"><span class="pre">cudadevice</span></code> module are available within do concurrent loops, not just atomics. This code snippet shows two uses:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>real :: tmp(4), x, y ... 
block; use cudadevice do concurrent (i=1:K,j=1:N) x = real(j) + a(i,j) y = atomicAdd(b(1,j), x) end do do concurrent (j=1:N) x = real(j) tmp(1:4) = __ldca(a(1:4,j)) tmp(1:4) = tmp(1:4) + x call __stwt(b(1:4,j), tmp) end do end block </pre></div> </div> </section> </section> </section> <section id="pcast"> <span id="id4"></span><h1><span class="section-number">9. </span>PCAST<a class="headerlink" href="#pcast" title="Permalink to this headline"></a></h1> <p>Parallel Compiler Assisted Software Testing (PCAST) is a set of API calls and compiler directives useful in testing program correctness. Numerical results produced by a program can diverge when parts of the program are mapped onto a GPU, when new or additional compiler options are used, or when changes are made to the program itself. PCAST can help you determine where these divergences begin, and pinpoint the changes that cause them. It is useful in other situations as well, including when using new libraries, determining whether parallel execution is safe, or porting programs from one ISA or type of processor to another.</p> <section id="id5"> <h2><span class="section-number">9.1. </span>Overview<a class="headerlink" href="#id5" title="Permalink to this headline"></a></h2> <p>PCAST Comparisons can be performed in two ways. The first saves the initial run’s data into a file through the <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> call or directive. Add the calls or directives to your application where you want intermediate results to be compared. Then, execute the program to save the “golden” results where the values are known to be correct. 
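</p> <p>For example, assuming an executable <code class="docutils literal notranslate"><span class="pre">a.out</span></code> built from source instrumented with <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> calls, the overall workflow might look like the following sketch:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ ./a.out    # first run: pcast_compare.dat does not exist, so "golden" values are saved
$ ./a.out    # later runs: computed values are compared against pcast_compare.dat
</pre></div> </div> <p>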
During subsequent runs of the program, the same pcast_compare calls or directives will compare the computed intermediate results to the saved “golden” results and report the differences.</p> <p>The second approach works in conjunction with the NVIDIA OpenACC implementation to compare GPU computation against the same program running on a CPU. In this case, all compute constructs are performed redundantly, both on the CPU and GPU. GPU results are compared against the CPU results, and differences reported. This is essentially like the first case where the CPU-calculated values are treated as the “golden” results. GPU to CPU comparisons can be done implicitly at the end of data regions with the <code class="docutils literal notranslate"><span class="pre">autocompare</span></code> flag or explicitly after kernels with the <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> call or directive.</p> <p>With the autocompare flag, OpenACC regions will run redundantly on the CPU and GPU. On an OpenACC region exit where data is to be downloaded from device to host, PCAST will compare the values calculated on the CPU with those calculated in the GPU. Comparisons done with <code class="docutils literal notranslate"><span class="pre">autocompare</span></code> or <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> are handled in memory and do not write results to an intermediate file.</p> <p>The following table outlines the supported data types that can be used with PCAST. 
Short, integer, long, and half precision data types are not supported with the <code class="docutils literal notranslate"><span class="pre">ABS</span></code>, <code class="docutils literal notranslate"><span class="pre">REL</span></code>, <code class="docutils literal notranslate"><span class="pre">ULP</span></code>, or <code class="docutils literal notranslate"><span class="pre">IEEE</span></code> options; only a bit-for-bit comparison is supported.</p> <p>For floating-point types, PCAST can calculate absolute, relative, and unit-in-the-last-place differences. The absolute difference measures the absolute value of the difference between two values, i.e. <em>abs(A-B)</em>. The relative difference is the ratio of the difference <em>A-B</em> to the first value <em>A</em>, i.e. <em>abs((A-B)/A)</em>. The unit in the last place (ULP) is a measure of the smallest distance between two values <em>A</em> and <em>B</em>. With the <code class="docutils literal notranslate"><span class="pre">ULP</span></code> option set, PCAST reports when the calculated ULP difference between two numbers is greater than some threshold.</p> <table class="table-no-stripes docutils align-default" id="pcast-overview-pcast-supported-types-table"> <caption><span class="caption-text">Table 23. 
Supported Types for Tolerance Measurements</span><a class="headerlink" href="#pcast-overview-pcast-supported-types-table" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 30%" /> <col style="width: 46%" /> <col style="width: 6%" /> <col style="width: 6%" /> <col style="width: 6%" /> <col style="width: 7%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>C/C++ Type</p></th> <th class="head"><p>Fortran Type</p></th> <th class="head"><p>ABS</p></th> <th class="head"><p>REL</p></th> <th class="head"><p>ULP</p></th> <th class="head"><p>IEEE</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>float</p></td> <td><p>real, real(4)</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> </tr> <tr class="row-odd"><td><p>double</p></td> <td><p>double precision, real(8)</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> </tr> <tr class="row-even"><td><p>float _Complex</p></td> <td><p>complex, complex(4)</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> </tr> <tr class="row-odd"><td><p>double _Complex</p></td> <td><p>complex(8)</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> <td><p>Yes</p></td> </tr> <tr class="row-even"><td><p>-</p></td> <td><p>real(2)</p></td> <td><p>No</p></td> <td><p>No</p></td> <td><p>No</p></td> <td><p>No</p></td> </tr> <tr class="row-odd"><td><p>(un)signed short</p></td> <td><p>integer(2)</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> </tr> <tr class="row-even"><td><p>(un)signed int</p></td> <td><p>integer, integer(4)</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> </tr> <tr class="row-odd"><td><p>(un)signed long</p></td> <td><p>integer(8)</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> <td><p>N/A</p></td> </tr> </tbody> </table> </section> <section 
id="pcast-with-a-golden-file"> <h2><span class="section-number">9.2. </span>PCAST with a “Golden” File<a class="headerlink" href="#pcast-with-a-golden-file" title="Permalink to this headline"></a></h2> <p>The run-time call <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> highlights differences between successive program runs. It has two modes of operation, depending on the presence of a data file named <em>pcast_compare.dat</em> by default. If the file does not exist, <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> assumes this is the first “golden” run. It will create the file and fill it with the computed data at each call to <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code>. If the file exists, <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> assumes it is a test run. It will read the file and compare the computed data with the saved data from the file. The default behavior is to consider the first 50 differences to be a reportable error, no matter how small.</p> <p>By default, the <code class="docutils literal notranslate"><span class="pre">pcast_compare.dat</span></code> file is in the same directory as the executable. 
The behavior of <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code>, and other comparison parameters, can be changed at runtime with the PCAST_COMPARE environment variable discussed in the <a class="reference internal" href="#pcast-env-vars"><span class="std std-ref">Environment Variables</span></a> section.</p> <p>The signature of <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> for C++ and C is:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">pcast_compare</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="kt">size_t</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>The signature of <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> for Fortran is:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">subroutine</span><span class="w"> </span><span class="n">pcast_compare</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">datatype</span><span class="p">,</span><span class="w"> </span><span class="n">len</span><span class="p">,</span><span class="w"> </span><span class="n">varname</span><span class="p">,</span><span class="w"> </span><span 
class="n">filename</span><span class="p">,</span><span class="w"> </span><span class="n">funcname</span><span class="p">,</span><span class="w"> </span><span class="n">lineno</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">type</span><span class="p">(</span><span class="o">*</span><span class="p">),</span><span class="w"> </span><span class="n">dimension</span><span class="p">(..)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">a</span><span class="w"></span> <span class="w"> </span><span class="n">character</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">datatype</span><span class="p">,</span><span class="w"> </span><span class="n">varname</span><span class="p">,</span><span class="w"> </span><span class="n">filename</span><span class="p">,</span><span class="w"> </span><span class="n">funcname</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="p">(</span><span class="mi">8</span><span class="p">),</span><span class="n">value</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">len</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="n">value</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">lineno</span><span class="w"></span> </pre></div> </div> <p>The call takes seven arguments:</p> <ol class="arabic simple"> <li><p>The address of the data to be saved or compared.</p></li> <li><p>A string containing the data type.</p></li> <li><p>The number of elements to compare.</p></li> <li><p>A string treated as the variable name.</p></li> <li><p>A string treated as the source file name.</p></li> <li><p>A string treated 
as the function name.</p></li> <li><p>An integer treated as a line number.</p></li> </ol> <p>For example, the <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> runtime call can be invoked like the following:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">pcast_compare</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;float&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;a&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;pcast_compare03.c&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;main&quot;</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">call </span><span class="n">pcast_compare</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;real&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;a&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;pcast_compare1.f90&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;program&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">9</span><span class="p">)</span><span class="w"></span> </pre></div> </div> <p>The caller should give meaningful names to the last four arguments. They can be anything, since they only serve to annotate the report. 
It is imperative that the identifiers are not modified between comparisons; comparisons must be called in the same order for each program run. If, for example, you are calling <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> inside a loop, it is reasonable to set the last argument to be the loop index.</p> <p>There is also a directive form of <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code>, which is functionally the same as the runtime call. It can be used at any point in the program to compare the current value of data to that recorded in the golden file. There are two benefits to using the directive over the API call:</p> <ol class="arabic"> <li><p>The directive syntax is much simpler than the API syntax. Most of the information the compare call needs to report results to the user (the type, variable name, file name, function name, and line number) can be gleaned by the compiler at compile time.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma nvidia compare(a[0:n])</span> </pre></div> </div> <p>as opposed to:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">pcast_compare</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;float&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;a&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;pcast_compare03.c&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;main&quot;</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w"></span> </pre></div> </div> </li> <li><p>The directive is only enabled when the -Mpcast flag is
set, so the source need not be changed when testing is complete. Consider the following usage examples:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma nvidia compare(a[0:N]) </span><span class="c1">// C++ and C</span> <span class="o">!</span><span class="n">$nvf</span><span class="w"> </span><span class="n">compare</span><span class="p">(</span><span class="n">a</span><span class="p">(</span><span class="mi">1</span><span class="o">:</span><span class="n">N</span><span class="p">))</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="n">Fortran</span><span class="w"></span> </pre></div> </div> </li> </ol> <p>The directive interface is given below in C++ or C style, and in Fortran. Note that for Fortran, <code class="docutils literal notranslate"><span class="pre">var-list</span></code> is a variable name, a subarray specification, an array element, or a composite variable member.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma nvidia compare (var-list) </span><span class="c1">// C++ and C</span> <span class="o">!</span><span class="n">$nvf</span><span class="w"> </span><span class="n">compare</span><span class="w"> </span><span class="p">(</span><span class="n">var</span><span class="o">-</span><span class="n">list</span><span class="p">)</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="n">Fortran</span><span class="w"></span> </pre></div> </div> <p>Let’s look at an example of the <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> runtime call:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdlib.h&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;openacc.h&gt;</span><span class="cp"></span> <span class="kt">int</span><span class="w"> </span><span
class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">a1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">a2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="o">*</span><span class="n">size</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="o">*</span><span class="n">size</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span 
class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">size</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">a1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0f</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">a2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">2.0f</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span><span class="w"> </span><span class="n">t</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span 
class="o">&lt;</span><span class="w"> </span><span class="n">size</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">a2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">a1</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="n">pcast_compare</span><span class="p">(</span><span class="n">a2</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;float&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;a2&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;example.c&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;main&quot;</span><span class="p">,</span><span class="w"> </span><span class="mi">23</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>Compile the example using these compiler options:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">$</span><span class="w"> </span><span class="n">nvc</span><span class="w"> </span><span class="o">-</span><span class="n">fast</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span 
class="n">a</span><span class="p">.</span><span class="n">out</span><span class="w"> </span><span class="n">example</span><span class="p">.</span><span class="n">c</span><span class="w"></span> </pre></div> </div> <p>The redundant and autocompare options are not required in order to use pcast_compare. Running the compiled executable with the options below results in the following output:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">$</span><span class="w"> </span><span class="n">PCAST_COMPARE</span><span class="o">=</span><span class="n">summary</span><span class="p">,</span><span class="n">rel</span><span class="o">=</span><span class="mi">1</span><span class="w"> </span><span class="p">.</span><span class="o">/</span><span class="n">a</span><span class="p">.</span><span class="n">out</span><span class="w"></span> <span class="n">datafile</span><span class="w"> </span><span class="n">pcast_compare</span><span class="p">.</span><span class="n">dat</span><span class="w"> </span><span class="n">created</span><span class="w"> </span><span class="n">with</span><span class="w"> </span><span class="mi">5</span><span class="w"> </span><span class="n">blocks</span><span class="p">,</span><span class="w"> </span><span class="mi">5000</span><span class="w"> </span><span class="n">elements</span><span class="p">,</span><span class="w"> </span><span class="mi">20000</span><span class="w"> </span><span class="n">bytes</span><span class="w"></span> <span class="n">$</span><span class="w"> </span><span class="n">PCAST_COMPARE</span><span class="o">=</span><span class="n">summary</span><span class="p">,</span><span class="n">rel</span><span class="o">=</span><span class="mi">1</span><span class="w"> </span><span class="p">.</span><span class="o">/</span><span class="n">a</span><span class="p">.</span><span class="n">out</span><span class="w"></span> <span class="n">datafile</span><span class="w"> </span><span
class="n">pcast_compare</span><span class="p">.</span><span class="n">dat</span><span class="w"> </span><span class="n">compared</span><span class="w"> </span><span class="n">with</span><span class="w"> </span><span class="mi">5</span><span class="w"> </span><span class="n">blocks</span><span class="p">,</span><span class="w"> </span><span class="mi">5000</span><span class="w"> </span><span class="n">elements</span><span class="p">,</span><span class="w"> </span><span class="mi">20000</span><span class="w"> </span><span class="n">bytes</span><span class="w"></span> <span class="n">no</span><span class="w"> </span><span class="n">errors</span><span class="w"> </span><span class="n">found</span><span class="w"></span> <span class="w"> </span><span class="n">relative</span><span class="w"> </span><span class="n">tolerance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.100000</span><span class="p">,</span><span class="w"> </span><span class="n">rel</span><span class="o">=</span><span class="mi">1</span><span class="w"></span> </pre></div> </div> <p>The first time the program runs, the data file “pcast_compare.dat” is created. Subsequent runs compare calculated data against this file. Use the <code class="docutils literal notranslate"><span class="pre">PCAST_COMPARE</span></code> environment variable to set the name of the file, or to force the program to create a new file on disk with <code class="docutils literal notranslate"><span class="pre">PCAST_COMPARE=create</span></code>.</p> <p>The same example above can be written with the compare directive.
Notice how much more concise the directive is compared to the <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> call.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdlib.h&gt;</span><span class="cp"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;openacc.h&gt;</span><span class="cp"></span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">a1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">a2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="o">*</span><span class="n">size</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span
class="n">a2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="o">*</span><span class="n">size</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">size</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">a1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0f</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">a2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">2.0f</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">&lt;</span><span 
class="w"> </span><span class="mi">5</span><span class="p">;</span><span class="w"> </span><span class="n">t</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">size</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">a2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">a1</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="cp">#pragma nvidia compare(a2[0:size])</span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>With the directive, you will want to add “-Mpcast” to the compilation line to enable the directive. Other than that, the output from this program is identical to the runtime example above.</p> </section> <section id="pcast-with-openacc"> <h2><span class="section-number">9.3. 
</span>PCAST with OpenACC<a class="headerlink" href="#pcast-with-openacc" title="Permalink to this headline"></a></h2> <p>PCAST can also be used with the NVIDIA OpenACC implementation to compare GPU computation against the same program running on a CPU. In this case, all compute constructs are performed redundantly on both the CPU and GPU. The CPU results are considered to be the “golden master” copy against which GPU results are compared.</p> <p>There are two ways to perform comparisons with GPU-calculated results. The first is with the explicit call or directive <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code>. To use <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code>, you must compile with <code class="docutils literal notranslate"><span class="pre">-acc</span> <span class="pre">-gpu=redundant</span></code> to force the CPU and GPU to compute results redundantly. Then, insert calls to <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> or put an <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">compare</span></code> directive at points where you want to compare the GPU-computed values against those computed by the CPU.</p> <p>The second approach is to turn on autocompare mode by compiling with <code class="docutils literal notranslate"><span class="pre">-acc</span> <span class="pre">-gpu=autocompare</span></code>. In autocompare mode, PCAST will automatically perform a comparison at each point where data is moved from the device to the host. It does not require the programmer to add any additional directives or runtime calls; it’s a convenient way to do all comparisons at the end of a data region.
If there are multiple compute kernels within a data region, and you’re only interested in one specific kernel, you should use the previously-mentioned <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> to target that kernel. Note that autocompare mode implies <code class="docutils literal notranslate"><span class="pre">-gpu=redundant</span></code>.</p> <p>During redundant execution, the compiler will generate both CPU and GPU code for each compute construct. At runtime, both the CPU and GPU versions will execute redundantly, with the CPU code reading and modifying values in system memory and the GPU reading and modifying values in device memory. Insert calls to <code class="docutils literal notranslate"><span class="pre">acc_compare()</span></code> (or the equivalent <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">compare</span></code> directive) at points where you want to compare the GPU-computed values against CPU-computed values. PCAST treats the values generated by the CPU code as the “golden” values and compares the GPU results against them. Unlike <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code>, <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> takes only two arguments: a pointer to the data to be compared, <em>hostptr</em>, and the number of elements to compare, <em>count</em>. The type can be inferred by the OpenACC runtime, so it doesn’t need to be specified. Also unlike <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code>, <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> does not write to an intermediary file; the comparisons are done in-memory.</p> <p>
The C++ and C interface is given below:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">acc_compare</span><span class="p">(</span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="kt">size_t</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>And in Fortran:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">acc_compare</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"></span> <span class="k">subroutine </span><span class="n">acc_compare</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="nb">len</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">type</span><span class="p">(</span><span class="o">*</span><span class="p">),</span><span class="w"> </span><span class="k">dimension</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="n">a</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="p">(</span><span class="mi">8</span><span class="p">),</span><span class="w"> </span><span class="k">value</span><span class="w"> </span><span class="kd">::</span><span class="w"> </span><span class="nb">len</span><span class="w"></span> </pre></div> </div> <p>You can call <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> on any variable or array that is present in device memory. 
You can also call <code class="docutils literal notranslate"><span class="pre">acc_compare_all</span></code> (no arguments) to compare all values that are present in device memory against the corresponding values in host memory.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="n">acc_compare_all</span><span class="p">()</span><span class="w"></span> </pre></div> </div> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">acc_compare_all</span><span class="p">()</span><span class="w"></span> </pre></div> </div> <p>Directive forms of the <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code> calls exist. They work the same as the API calls and can be used in lieu of them. Similar to PCAST <code class="docutils literal notranslate"><span class="pre">compare</span></code> directives, <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">compare</span></code> directives are ignored when redundant or autocompare modes are not enabled on the compilation line.</p> <p>The <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">compare</span></code> directive takes one or more arguments, or the ‘all’ clause (which corresponds to <code class="docutils literal notranslate"><span class="pre">acc_compare_all()</span></code>). The interfaces are given below in C++ or C, and in Fortran, respectively.
Argument “var-list” can be a variable name, a sub-array specification, an array element, or a composite variable member.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma acc compare [ (var-list) | all ]</span> </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>!$acc compare [ (var-list) | all ] </pre></div> </div> <p>For example:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma acc compare(a[0:N])</span> <span class="cp">#pragma acc compare all</span> <span class="c">!$acc compare(a, b)</span> <span class="c">!$acc compare(a(1:N))</span> <span class="c">!$acc compare all</span> </pre></div> </div> <p>Consider the following OpenACC program that uses the <code class="docutils literal notranslate"><span class="pre">acc_compare()</span></code> API call and an <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">compare</span></code> directive.
This Fortran example uses real*4 and real*8 arrays.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">program</span><span class="w"> </span><span class="n">main</span><span class="w"></span> <span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="n">openacc</span><span class="w"></span> <span class="w"> </span><span class="n">implicit</span><span class="w"> </span><span class="n">none</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="p">,</span><span class="w"> </span><span class="n">parameter</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="o">*</span><span class="mi">4</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">precision</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">d</span><span class="p">(</span><span
class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="o">*</span><span class="mi">8</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">e</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">f</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0</span><span class="n">d0</span><span class="w"></span> <span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.1</span><span class="n">d0</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="n">copyout</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">f</span><span class="p">)</span><span class="w"> </span><span class="n">copyin</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="n">loop</span><span class="w"></span> <span class="w"> </span><span 
class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">2.0</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0.0</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">parallel</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">compare</span><span class="p">(</span><span class="n">a</span><span class="p">(</span><span class="mi">1</span><span class="o">:</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="mi">1</span><span class="o">:</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="mi">1</span><span class="o">:</span><span class="n">N</span><span class="p">))</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> 
</span><span class="n">parallel</span><span class="w"> </span><span class="n">loop</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">f</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">e</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">parallel</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">compare</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="n">loop</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="mf">1.0</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">parallel</span><span class="w"></span> <span class="w"> </span><span class="n">call</span><span class="w"> </span><span class="n">acc_compare</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">call</span><span class="w"> </span><span class="n">acc_compare</span><span class="p">(</span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">call</span><span class="w"> </span><span class="n">acc_compare</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="n">loop</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span 
class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">f</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">1.0</span><span class="n">D0</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">parallel</span><span class="w"></span> <span class="w"> </span><span class="n">call</span><span class="w"> </span><span class="n">acc_compare_all</span><span class="p">()</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="n">loop</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">3.14</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">3.14</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span 
class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">3.14</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">f</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">3.14</span><span class="n">d0</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">parallel</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="n">In</span><span class="w"> </span><span class="n">redundant</span><span class="w"> </span><span class="n">mode</span><span class="p">,</span><span class="w"> </span><span class="n">no</span><span class="w"> </span><span class="n">comparison</span><span class="w"> </span><span class="n">is</span><span class="w"> </span><span class="n">performed</span><span class="w"> </span><span class="n">here</span><span class="p">.</span><span class="w"> </span><span class="n">In</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="n">autocompare</span><span class="w"> </span><span class="n">mode</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="n">comparison</span><span class="w"> </span><span class="n">is</span><span class="w"> </span><span class="n">made</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">,</span><span class="w"> 
</span><span class="n">and</span><span class="w"> </span><span class="n">f</span><span class="w"> </span><span class="p">(</span><span class="n">but</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="n">d</span><span class="p">),</span><span class="w"> </span><span class="n">since</span><span class="w"> </span><span class="n">they</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">copied</span><span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="n">region</span><span class="p">.</span><span class="w"></span> <span class="w"> </span><span class="o">!</span><span class="n">$acc</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">data</span><span class="w"></span> <span class="w"> </span><span class="n">call</span><span class="w"> </span><span class="n">verify</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">f</span><span class="p">)</span><span class="w"></span> <span class="n">end</span><span class="w"> </span><span class="n">program</span><span class="w"></span> <span class="n">subroutine</span><span class="w"> </span><span class="n">verify</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> 
</span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">f</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="p">,</span><span class="w"> </span><span class="n">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="n">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="n">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="w"> </span><span class="n">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">real</span><span class="p">(</span><span class="mi">8</span><span class="p">),</span><span class="w"> </span><span 
class="n">intent</span><span class="p">(</span><span class="n">in</span><span class="p">)</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">f</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="o">::</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">errcnt</span><span class="w"></span> <span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">abs</span><span class="p">(</span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mf">3.14e0</span><span class="p">)</span><span class="w"> </span><span class="p">.</span><span class="n">gt</span><span class="p">.</span><span class="w"> </span><span class="mf">1.0e-06</span><span class="p">)</span><span class="w"> </span><span class="n">then</span><span class="w"></span> <span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="w"></span> <span class="w"> </span><span class="n">endif</span><span class="w"></span> <span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="k">do</span><span 
class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">abs</span><span class="p">(</span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mf">3.14e0</span><span class="p">)</span><span class="w"> </span><span class="p">.</span><span class="n">gt</span><span class="p">.</span><span class="w"> </span><span class="mf">1.0e-06</span><span class="p">)</span><span class="w"> </span><span class="n">then</span><span class="w"></span> <span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="w"></span> <span class="w"> </span><span class="n">endif</span><span class="w"></span> <span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">abs</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mf">3.14e0</span><span class="p">)</span><span class="w"> </span><span class="p">.</span><span class="n">gt</span><span class="p">.</span><span class="w"> </span><span 
class="mf">1.0e-06</span><span class="p">)</span><span class="w"> </span><span class="n">then</span><span class="w"></span> <span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="w"></span> <span class="w"> </span><span class="n">endif</span><span class="w"></span> <span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">abs</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mf">3.14</span><span class="n">d0</span><span class="p">)</span><span class="w"> </span><span class="p">.</span><span class="n">gt</span><span class="p">.</span><span class="w"> </span><span class="mf">1.0</span><span class="n">d</span><span class="mo">-06</span><span class="p">)</span><span class="w"> </span><span class="n">then</span><span class="w"></span> <span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">errcnt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="w"></span> <span class="w"> </span><span class="n">endif</span><span class="w"></span> <span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="k">do</span><span class="w"></span> <span class="w"> </span><span 
class="k">if</span><span class="p">(</span><span class="n">errcnt</span><span class="w"> </span><span class="o">/=</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="n">then</span><span class="w"></span> <span class="w"> </span><span class="n">write</span><span class="w"> </span><span class="p">(</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="s">&quot;FAILED&quot;</span><span class="w"></span> <span class="w"> </span><span class="k">else</span><span class="w"></span> <span class="w"> </span><span class="n">write</span><span class="w"> </span><span class="p">(</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="s">&quot;PASSED&quot;</span><span class="w"></span> <span class="w"> </span><span class="n">endif</span><span class="w"></span> <span class="n">end</span><span class="w"> </span><span class="n">subroutine</span><span class="w"> </span><span class="n">verify</span><span class="w"></span> </pre></div> </div> <p>The program can be compiled with the following command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -fast -acc -gpu=redundant -Minfo=accel example.F90 main: 16, Generating copyout(a(:),b(:)) Generating copyin(e(:)) Generating copyout(f(:),c(:)) Generating copyin(d(:)) 18, Generating Tesla code 19, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 26, Generating acc compare(c(:),b(:),a(:)) 28, Generating Tesla code 29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 34, Generating acc compare(f(:)) 36, Generating Tesla code 37, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 48, Generating Tesla code 49, !$acc loop gang, vector(128) ! 
blockidx%x threadidx%x 56, Generating Tesla code 57, !$acc loop gang, vector(128) ! blockidx%x threadidx%x </pre></div> </div> <p>Here, you can see where the acc compare directives are generated on lines 26 and 34. The program can be run with the following command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ ./a.out PASSED </pre></div> </div> <p>As you can see, no PCAST output is generated when the comparisons match. We can get more information with the summary option:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ PCAST_COMPARE=summary ./a.out PASSED compared 13 blocks, 13000 elements, 68000 bytes no errors found absolute tolerance = 0.00000000000000000e+00, abs=0 </pre></div> </div> <p>There are 13 blocks compared. Let’s count the blocks in the compare calls.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>!$acc compare(a(1:N), b(1:N), c(1:N)) </pre></div> </div> <p>Compares three blocks, one each for a, b, and c.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>!$acc compare(f) </pre></div> </div> <p>Compares one block for f.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>call acc_compare(a, N) call acc_compare(b, N) call acc_compare(c, N) </pre></div> </div> <p>Each call compares one block for their respective array.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>call acc_compare_all() </pre></div> </div> <p>Compares one block for each array present on the device (a, b, c, d, e, and f) for a total of 6 blocks.</p> <p>If the same example is compiled with autocompare, we’ll see four additional comparisons, since the four arrays that are copied out (with the copyout clause) are compared at the end of the data region.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -fast -acc 
-gpu=autocompare example.F90 $ PCAST_COMPARE=summary ./a.out PASSED compared 17 blocks, 17000 elements, 88000 bytes no errors found absolute tolerance = 0.00000000000000000e+00, abs=0 </pre></div> </div> </section> <section id="limitations"> <h2><span class="section-number">9.4. </span>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2> <p>There are currently a few limitations with using PCAST that are worth keeping in mind.</p> <ul class="simple"> <li><p>Comparisons are not thread-safe. If you are using PCAST with multiple threads, ensure that only one thread is doing the comparisons. This is especially true if you are using PCAST with MPI. If you use <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> with MPI, you must make sure that only one thread is writing to the comparison file. Alternatively, use a script to set PCAST_COMPARE so that the comparison file name encodes the MPI rank.</p></li> <li><p>Comparisons must be done with like types; you cannot compare one type with another. It is not possible, for example, to check for differing results after changing from double precision to single. Comparisons are limited to the types listed in <a class="reference internal" href="#pcast-overview-pcast-supported-types-table"><span class="std std-ref">Table 23</span></a>. Currently there is no support for structured or derived types.</p></li> <li><p>The <code class="docutils literal notranslate"><span class="pre">-gpu=mem:managed</span></code> or <code class="docutils literal notranslate"><span class="pre">-gpu=mem:unified</span></code> options are incompatible with autocompare and <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code>. Both the CPU and GPU need to calculate results separately, and to do so each must have its own working memory space.</p></li> <li><p>If you do any data movement on the device, you must account for it on the host. 
For example, if you are using CUDA-aware MPI or GPU-accelerated libraries that modify device data, then you must also make the host aware of the changes. In these cases it is helpful to use the <code class="docutils literal notranslate"><span class="pre">host_data</span></code> clause, which allows you to use device addresses within host code.</p></li> </ul> <span class="target" id="pcast-env-vars"></span></section> <section id="id6"> <h2><span class="section-number">9.5. </span>Environment Variables<a class="headerlink" href="#id6" title="Permalink to this headline"></a></h2> <p>Behavior of PCAST/Autocompare is controlled through the <code class="docutils literal notranslate"><span class="pre">PCAST_COMPARE</span></code> variable. Options can be specified in a comma-separated list: <code class="docutils literal notranslate"><span class="pre">PCAST_COMPARE=&lt;opt1&gt;,&lt;opt2&gt;,...</span></code></p> <p>If no options are specified, the default is to perform comparisons with <em>abs=0</em>. Comparison options are not mutually exclusive. PCAST can compare absolute differences with some <em>n=3</em> and relative differences with a different threshold, e.g. <em>n=5</em>; <em>PCAST_COMPARE=abs=3,rel=5,…</em>.</p> <p>You can specify either an absolute or relative location to be used with the datafile option. The parent directory should be owned by the same user executing the comparisons and the datafile should have the appropriate read/write permissions set.</p> <table class="table-no-stripes docutils align-default" id="id33"> <caption><span class="caption-text">Table 24. 
PCAST_COMPARE Options</span><a class="headerlink" href="#id33" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 14%" /> <col style="width: 86%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Option</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">abs=n</span></code></p></td> <td><p>Compare absolute difference; tolerate differences up to 10^(-n), only applicable to floating point types. Default value is 0</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">create</span></code></p></td> <td><p>Specifies that this is the run that will produce the reference file (<code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> only)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">compare</span></code></p></td> <td><p>Specifies that the current run will be compared with a reference file (<code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> only)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">datafile=&quot;name&quot;</span></code></p></td> <td><p>Name of the file that data will be saved to, or compared against. 
If empty, the default <code class="docutils literal notranslate"><span class="pre">pcast_compare.dat</span></code> is used (<code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code> only)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">disable</span></code></p></td> <td><p>Calls to <code class="docutils literal notranslate"><span class="pre">pcast_compare</span></code>, <code class="docutils literal notranslate"><span class="pre">acc_compare</span></code>, <code class="docutils literal notranslate"><span class="pre">acc_compare_all</span></code>, and the directives (<code class="docutils literal notranslate"><span class="pre">pcast</span> <span class="pre">compare</span></code> and <code class="docutils literal notranslate"><span class="pre">acc</span> <span class="pre">compare</span></code>) all immediately return from the runtime with no effect. Note that this doesn’t disable redundant execution; that will require a recompile.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">ieee</span></code></p></td> <td><p>Perform IEEE NaN checks (only implemented for floats and doubles)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">outputfile=&quot;name&quot;</span></code></p></td> <td><p>Save comparison output to a specific file. 
Default behavior is to output to stderr</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">patch</span></code></p></td> <td><p>Patch errors (outside tolerance) with correct values</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">patchall</span></code></p></td> <td><p>Patch all differences (inside and outside tolerance) with correct values</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">rel=n</span></code></p></td> <td><p>Compare relative difference; tolerate differences up to 10^(-n), only applicable to floating point types. Default value is 0.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">report=n</span></code></p></td> <td><p>Report up to n (default of 50) passes/fails</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">reportall</span></code></p></td> <td><p>Report all passes and fails (overrides limit set in report=n)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">reportpass</span></code></p></td> <td><p>Report passes; respects limit set with report=n</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">silent</span></code></p></td> <td><p>Suppress output - overrides all other output options, including summary and verbose</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">stop</span></code></p></td> <td><p>Stop at the first difference</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">summary</span></code></p></td> <td><p>Print summary of comparisons at end of run</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">ulp=n</span></code></p></td> <td><p>Compare Unit of Least Precision difference (only for floats and doubles)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">verbose</span></code></p></td> <td><p>Outputs more details of comparison (including patches)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">verboseautocompare</span></code></p></td> <td><p>Outputs verbose reporting of what and where the host is comparing (autocompare only)</p></td> </tr> </tbody> </table> <span class="target" id="mpi-use"></span></section> </section> <section id="using-mpi"> <h1><span class="section-number">10. </span>Using MPI<a class="headerlink" href="#using-mpi" title="Permalink to this headline"></a></h1> <p>MPI (Message Passing Interface) is an industry-standard application programming interface designed for rapid data exchange between processors in a distributed-memory environment. It is used in scalable computer systems to allow the processes of a parallel application to communicate with one another.</p> <p>The NVIDIA HPC SDK includes a pre-compiled version of Open MPI. You can build using alternate versions of MPI with the <code class="docutils literal notranslate"><span class="pre">-I</span></code>, <code class="docutils literal notranslate"><span class="pre">-L</span></code>, and <code class="docutils literal notranslate"><span class="pre">-l</span></code> options.</p> <p>This section describes how to use Open MPI with the NVIDIA HPC Compilers.</p> <section id="using-open-mpi-on-linux"> <h2><span class="section-number">10.1. 
</span>Using Open MPI on Linux<a class="headerlink" href="#using-open-mpi-on-linux" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC Compilers for Linux ship with a pre-compiled version of Open MPI that includes everything required to compile, execute and debug MPI programs using Open MPI.</p> <p>To build an application using Open MPI, use the Open MPI compiler wrappers: <code class="docutils literal notranslate"><span class="pre">mpicc</span></code>, <code class="docutils literal notranslate"><span class="pre">mpic⁠+⁠+</span></code> and <code class="docutils literal notranslate"><span class="pre">mpifort</span></code>. These wrappers automatically set up the compiler commands with the correct include file search paths, library directories, and link libraries.</p> <p>The following MPI example program uses Open MPI.</p> <pre class="literal-block">$ cd my_example_dir $ cp -r /opt/nvidia/hpc_sdk/Linux_x86_64/25.1/examples/MPI/samples/mpihello . $ cd mpihello $ export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/mpi/openmpi/bin:$PATH $ mpifort mpihello.f -o mpihello</pre> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpiexec mpihello Hello world! I&#39;m node 0 </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpiexec -np 4 mpihello Hello world! I&#39;m node 0 Hello world! I&#39;m node 2 Hello world! I&#39;m node 1 Hello world! I&#39;m node 3 </pre></div> </div> <p>To build an application using Open MPI for debugging, add -g to the compiler wrapper command line arguments.</p> </section> <section id="using-mpi-compiler-wrappers"> <h2><span class="section-number">10.2. 
</span>Using MPI Compiler Wrappers<a class="headerlink" href="#using-mpi-compiler-wrappers" title="Permalink to this headline"></a></h2> <p>When you use MPI compiler wrappers to build with the <code class="docutils literal notranslate"><span class="pre">-fpic</span></code> or <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code> options, you must specify <code class="docutils literal notranslate"><span class="pre">-fortranlibs</span></code> to link with the correct libraries. Here are a few examples:</p> <p>For a static link to the MPI libraries, use this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpifort hello.f </pre></div> </div> <p>For a dynamic link to the MPI libraries, use this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpifort hello.f -fortranlibs </pre></div> </div> <p>To compile with <code class="docutils literal notranslate"><span class="pre">-fpic</span></code>, which, by default, invokes dynamic linking, use this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpifort -fpic -fortranlibs hello.f </pre></div> </div> <p>To compile with <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code>, use this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ mpifort -mcmodel=medium -fortranlibs hello.f </pre></div> </div> </section> <section id="testing-and-benchmarking"> <h2><span class="section-number">10.3. </span>Testing and Benchmarking<a class="headerlink" href="#testing-and-benchmarking" title="Permalink to this headline"></a></h2> <p>The /opt/nvidia/hpc_sdk/Linux_x86_64/25.1/examples/MPI directory contains various benchmarks and tests. 
Copy this directory into a local working directory by issuing the following command:</p> <pre class="literal-block">% cp -r /opt/nvidia/hpc_sdk/Linux_x86_64/25.1/examples/MPI .</pre> <p>There are several example programs available in this directory.</p> <span class="target" id="lib-create-use"></span></section> </section> <section id="creating-and-using-libraries"> <h1><span class="section-number">11. </span>Creating and Using Libraries<a class="headerlink" href="#creating-and-using-libraries" title="Permalink to this headline"></a></h1> <p>A library is a collection of functions or subprograms that are grouped for reference and ease of linking. This section discusses issues related to NVIDIA-supplied compiler libraries. Specifically, it addresses the use of C++ and C builtin functions in place of the corresponding libc routines, creation of dynamically linked libraries, known as shared objects or shared libraries, and math libraries.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>This section does not duplicate material related to using libraries for inlining, which is described in <a class="reference internal" href="#fn-inline-create-lib"><span class="std std-ref">Creating an Inline Library</span></a>.</p> </div> <p>NVIDIA provides libraries that export C interfaces by using Fortran modules.</p> <section id="using-builtin-math-functions-in-c-and-c"> <h2><span class="section-number">11.1. </span>Using builtin Math Functions in C++ and C<a class="headerlink" href="#using-builtin-math-functions-in-c-and-c" title="Permalink to this headline"></a></h2> <p>The name of the math header file is <code class="docutils literal notranslate"><span class="pre">math.h</span></code>. 
Include the math header file in all of your source files that use a math library routine as in the following example, which calculates the inverse cosine of 0.5.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>#include &lt;math.h&gt;
#include &lt;stdio.h&gt;
#define PI 3.1415926535
int main() {
  double x, y;
  x = PI/3.0;
  y = acos(0.5);
  printf(&quot;%f %f\n&quot;, x, y);
  return 0;
}
</pre></div> </div> <p>Including <code class="docutils literal notranslate"><span class="pre">math.h</span></code> causes the NVIDIA C++ and C compilers to use builtin functions, which are much more efficient than library calls. In particular, if you include <code class="docutils literal notranslate"><span class="pre">math.h</span></code>, the following intrinsic calls are processed using builtins:</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 16%" /> <col style="width: 19%" /> <col style="width: 16%" /> <col style="width: 13%" /> <col style="width: 16%" /> <col style="width: 19%" /> </colgroup> <tbody> <tr class="row-odd"><td><p>abs</p></td> <td><p>acosf</p></td> <td><p>asinf</p></td> <td><p>atan</p></td> <td><p>atan2</p></td> <td><p>atan2f</p></td> </tr> <tr class="row-even"><td><p>atanf</p></td> <td><p>cos</p></td> <td><p>cosf</p></td> <td><p>exp</p></td> <td><p>expf</p></td> <td><p>fabs</p></td> </tr> <tr class="row-odd"><td><p>fabsf</p></td> <td><p>fmax</p></td> <td><p>fmaxf</p></td> <td><p>fmin</p></td> <td><p>fminf</p></td> <td><p>log</p></td> </tr> <tr class="row-even"><td><p>log10</p></td> <td><p>log10f</p></td> <td><p>logf</p></td> <td><p>pow</p></td> <td><p>powf</p></td> <td><p>sin</p></td> </tr> <tr class="row-odd"><td><p>sinf</p></td> <td><p>sqrt</p></td> <td><p>sqrtf</p></td> <td><p>tan</p></td> <td><p>tanf</p></td> <td></td> </tr> </tbody> </table> </section> <section id="using-system-library-routines"> <h2><span class="section-number">11.2. 
</span>Using System Library Routines<a class="headerlink" href="#using-system-library-routines" title="Permalink to this headline"></a></h2> <p>Release 25.1 of the NVIDIA HPC Compilers runtime libraries makes use of Linux system libraries to implement, for example, OpenMP and Fortran I/O. The NVIDIA HPC Compilers runtime libraries make use of several additional system library routines.</p> <p>On 64-bit Linux systems, the system library routines used include these:</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 21%" /> <col style="width: 32%" /> <col style="width: 38%" /> <col style="width: 9%" /> </colgroup> <tbody> <tr class="row-odd"><td><p>aio_error</p></td> <td><p>aio_write</p></td> <td><p>pthread_mutex_init</p></td> <td><p>sleep</p></td> </tr> <tr class="row-even"><td><p>aio_read</p></td> <td><p>calloc</p></td> <td><p>pthread_mutex_lock</p></td> <td></td> </tr> <tr class="row-odd"><td><p>aio_return</p></td> <td><p>getrlimit</p></td> <td><p>pthread_mutex_unlock</p></td> <td></td> </tr> <tr class="row-even"><td><p>aio_suspend</p></td> <td><p>pthread_attr_init</p></td> <td><p>setrlimit</p></td> <td></td> </tr> </tbody> </table> </section> <section id="creating-and-using-shared-object-files-on-linux"> <h2><span class="section-number">11.3. </span>Creating and Using Shared Object Files on Linux<a class="headerlink" href="#creating-and-using-shared-object-files-on-linux" title="Permalink to this headline"></a></h2> <p>All of the NVIDIA HPC Fortran, C++ and C compilers support creation of shared object files. Unlike statically-linked object and library files, shared object files link and resolve references with an executable at runtime via a dynamic linker supplied with your operating system. The NVIDIA HPC Compilers must generate position independent code to support creation of shared objects by the linker. However, this is not the default. 
You must create object files with position independent code and shared object files that will include them.</p> <section id="procedure-to-create-a-use-a-shared-object-file"> <h3><span class="section-number">11.3.1. </span>Procedure to create and use a shared object file<a class="headerlink" href="#procedure-to-create-a-use-a-shared-object-file" title="Permalink to this headline"></a></h3> <p>The following steps describe how to create and use a shared object file.</p> <ol class="arabic"> <li><p>Create an object file with position independent code. To do this, compile your code with the appropriate NVIDIA HPC compiler using the <code class="docutils literal notranslate"><span class="pre">-fpic</span></code> option, or one of the equivalent options, such as <code class="docutils literal notranslate"><span class="pre">-fPIC</span></code>, <code class="docutils literal notranslate"><span class="pre">-Kpic</span></code>, and <code class="docutils literal notranslate"><span class="pre">-KPIC</span></code>, which are supported for compatibility with other systems. For example, use the following command to create an object file with position independent code using nvfortran:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran -c -fpic tobeshared.f </pre></div> </div> </li> <li><p>Produce a shared object file. To do this, use the appropriate NVIDIA HPC compiler to invoke the linker supplied with your system. It is customary to name such files using a <code class="docutils literal notranslate"><span class="pre">.so</span></code> filename extension. 
On Linux, you do this by passing the <code class="docutils literal notranslate"><span class="pre">-shared</span></code> option to the linker:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran -shared -o tobeshared.so tobeshared.o </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Compilation and generation of the shared object can be performed in one step using both the -fpic option and the appropriate option for generation of a shared object file.</p> </div> </li> <li><p>Use a shared object file. To do this, use the appropriate NVIDIA HPC compiler to compile and link the program which will reference functions or subroutines in the shared object file, and list the shared object on the link line, as shown here:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran -o myprog myprog.f tobeshared.so </pre></div> </div> </li> <li><p>Make the executable available.</p> <p>You now have an executable <code class="docutils literal notranslate"><span class="pre">myprog</span></code> which does not include any code from functions or subroutines in <code class="docutils literal notranslate"><span class="pre">tobeshared.so</span></code>, but which can be executed and dynamically linked to that code. By default, when the program is linked to produce <code class="docutils literal notranslate"><span class="pre">myprog</span></code>, no assumptions are made on the location of <code class="docutils literal notranslate"><span class="pre">tobeshared.so</span></code>. Therefore, for <code class="docutils literal notranslate"><span class="pre">myprog</span></code> to execute correctly, you must initialize the environment variable <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code> to include the directory containing <code class="docutils literal notranslate"><span class="pre">tobeshared.so</span></code>. 
If <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code> is already initialized, it is important not to overwrite its contents. If you have placed <code class="docutils literal notranslate"><span class="pre">tobeshared.so</span></code> in directory <code class="docutils literal notranslate"><span class="pre">/home/myusername/bin</span></code>, you can initialize <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code> to include that directory and preserve its existing contents, as shown in the following:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% setenv LD_LIBRARY_PATH &quot;$LD_LIBRARY_PATH&quot;:/home/myusername/bin </pre></div> </div> <p>If you know that tobeshared.so always resides in a specific directory, you can create the executable myprog in a form that assumes this directory by using the -R link-time option. For example, you can link as follows:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran -o myprog myprog.f tobeshared.so -R/home/myusername/bin </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>As with the <code class="docutils literal notranslate"><span class="pre">-L</span></code> option, there is no space between <code class="docutils literal notranslate"><span class="pre">-R</span></code> and the directory name. If the -R option is used, it is not necessary to initialize <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code>.</p> </div> <p>In the previous example, the dynamic linker always looks in <code class="docutils literal notranslate"><span class="pre">/home/myusername/bin</span></code> to resolve references to <code class="docutils literal notranslate"><span class="pre">tobeshared.so</span></code>. 
By default, if the <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code> environment variable is not set, the linker only searches <code class="docutils literal notranslate"><span class="pre">/usr/lib</span></code> and <code class="docutils literal notranslate"><span class="pre">/lib</span></code> for shared objects.</p> </li> </ol> </section> <section id="ldd-command"> <h3><span class="section-number">11.3.2. </span>ldd Command<a class="headerlink" href="#ldd-command" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">ldd</span></code> command is a useful tool when working with shared object files and executables that reference them. When applied to an executable, as shown in the following example, <code class="docutils literal notranslate"><span class="pre">ldd</span></code> lists all shared object files referenced in the executable along with the pathname of the directory from which they will be extracted.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% ldd myprog </pre></div> </div> <p>If the pathname is not hard-coded using the <code class="docutils literal notranslate"><span class="pre">-⁠R</span></code> option, and if <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code> is not initialized, the pathname is listed as “not found”. For more information on <code class="docutils literal notranslate"><span class="pre">ldd</span></code>, its options and usage, see the online man page for <code class="docutils literal notranslate"><span class="pre">ldd</span></code>.</p> </section> </section> <section id="using-lib3f"> <h2><span class="section-number">11.4. </span>Using LIB3F<a class="headerlink" href="#using-lib3f" title="Permalink to this headline"></a></h2> <p>The NVFORTRAN compiler includes support for the de facto standard LIB3F library routines. 
See the Fortran Language Reference manual for a complete list of available routines in the NVIDIA implementation of LIB3F.</p> </section> <section id="lapack-blas-and-ffts"> <h2><span class="section-number">11.5. </span>LAPACK, BLAS and FFTs<a class="headerlink" href="#lapack-blas-and-ffts" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC SDK includes a BLAS and LAPACK library based on the customized OpenBLAS project source and built with the NVIDIA HPC Compilers. The LAPACK library is called <code class="docutils literal notranslate"><span class="pre">liblapack.a</span></code>. The BLAS library is called <code class="docutils literal notranslate"><span class="pre">libblas.a</span></code>.</p> <p>To use these libraries, simply link them in using the <code class="docutils literal notranslate"><span class="pre">-l</span></code> option when linking your main program:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran myprog.f -llapack -lblas </pre></div> </div> </section> <section id="linking-with-scalapack"> <h2><span class="section-number">11.6. </span>Linking with ScaLAPACK<a class="headerlink" href="#linking-with-scalapack" title="Permalink to this headline"></a></h2> <p>The ScaLAPACK libraries are automatically installed with each MPI library version which accompanies an NVIDIA HPC SDK installation. You can link with the ScaLAPACK libraries by specifying <code class="docutils literal notranslate"><span class="pre">-Mscalapack</span></code> on any of the MPI wrapper command lines. For example:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% mpifort myprog.f -Mscalapack </pre></div> </div> <p>A pre-built version of the BLAS library is automatically added when the <code class="docutils literal notranslate"><span class="pre">-⁠Mscalapack</span></code> switch is specified. 
If you wish to use a different BLAS library, and still use the <code class="docutils literal notranslate"><span class="pre">-Mscalapack</span></code> switch, you can list the set of libraries explicitly on your link line.</p> <p>If the <code class="docutils literal notranslate"><span class="pre">-Mnvpl</span></code> switch is also specified in addition to <code class="docutils literal notranslate"><span class="pre">-Mscalapack</span></code>, then the NVPL ScaLAPACK library will be used.</p> </section> <section id="the-c-standard-template-library"> <h2><span class="section-number">11.7. </span>The C++ Standard Template Library<a class="headerlink" href="#the-c-standard-template-library" title="Permalink to this headline"></a></h2> <p>On Linux, the GNU-compatible nvc++ compiler uses the GNU g++ header files and Standard Template Library (STL) directly. The versions used are dependent on the version of the GNU compilers installed on your system, or specified when makelocalrc was run during installation of the NVIDIA HPC Compilers.</p> </section> <section id="nvidia-performance-libraries-nvpl"> <h2><span class="section-number">11.8. </span>NVIDIA Performance Libraries (NVPL)<a class="headerlink" href="#nvidia-performance-libraries-nvpl" title="Permalink to this headline"></a></h2> <p>The NVIDIA Performance Libraries (NVPL) are a suite of high performance mathematical libraries optimized for the NVIDIA Grace Arm architecture. These CPU-only libraries have no dependencies on CUDA or the CUDA Toolkit (CTK), and are drop-in replacements for standard C and Fortran mathematical APIs, allowing HPC applications to achieve maximum performance on the Grace platform. They are available for Arm CPUs only. The NVPL includes the following math libraries: BLAS, FFT, LAPACK, RAND, ScaLAPACK, Sparse, and Tensor. Refer to the <a class="reference external" href="https://docs.nvidia.com/nvpl">NVPL documentation</a> for more information about these math libraries. 
The following section explains how to use them with the NVHPC compilers.</p> <p>To use the NVPL libraries, use the <code class="docutils literal notranslate"><span class="pre">-Mnvpl</span></code> option when linking your main program:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran myprog.f -Mnvpl </pre></div> </div> <p>You can link only the NVPL libraries your application needs using the sub-options to <code class="docutils literal notranslate"><span class="pre">-Mnvpl</span></code>. For example, if you only want the BLAS and FFT libraries from the NVPL, link as follows:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran myprog.f -Mnvpl=blas,fft </pre></div> </div> <p>Refer to the NVIDIA HPC Compilers Reference Guide for a complete list of supported options for the <code class="docutils literal notranslate"><span class="pre">-Mnvpl</span></code> flag.</p> <p><strong>ScaLAPACK</strong></p> <p>Similar to other ScaLAPACK libraries, the NVPL version is designed to be used with MPI. A straightforward way to access the NVPL ScaLAPACK library is to use an MPI wrapper (i.e., <code class="docutils literal notranslate"><span class="pre">mpicc</span></code>, <code class="docutils literal notranslate"><span class="pre">mpic++</span></code>, <code class="docutils literal notranslate"><span class="pre">mpifort</span></code>) and link with both <code class="docutils literal notranslate"><span class="pre">-⁠Mnvpl</span></code> and <code class="docutils literal notranslate"><span class="pre">-⁠Mscalapack</span></code>. 
For example:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% mpic++ myprog.cpp -Mscalapack -Mnvpl </pre></div> </div> <p>If you choose not to use an MPI wrapper, you can satisfy ScaLAPACK’s dependency on libmpi.so by explicitly providing this library at link time.</p> <p>The NVPL ScaLAPACK interfaces are available for the following MPI variants: MPICH, Open MPI 3.x, Open MPI 4.x (including HPC-X), and Open MPI 5.x. The HPC SDK contains builds of Open MPI 3, Open MPI 4, and HPC-X; to take advantage of the NVPL’s ScaLAPACK interfaces for MPICH or Open MPI 5.x, you must supply your own build of these MPI libraries.</p> </section> <section id="linking-with-the-nvmalloc-library"> <h2><span class="section-number">11.9. </span>Linking with the nvmalloc Library<a class="headerlink" href="#linking-with-the-nvmalloc-library" title="Permalink to this headline"></a></h2> <p>The NVIDIA HPC SDK installation includes a custom host (system) memory allocation library based on the jemalloc memory allocator. This library replaces the system malloc(), free(), and other related functions used by the nvc, nvc++, and nvfortran runtime for dynamic heap allocations. You can link with this library by specifying -nvmalloc on any of the compiler command lines used for linking. For example:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvc main.c -nvmalloc </pre></div> </div> <span class="target" id="env-vars-use"></span></section> </section> <section id="id7"> <h1><span class="section-number">12. </span>Environment Variables<a class="headerlink" href="#id7" title="Permalink to this headline"></a></h1> <p>Environment variables allow you to set and pass information that can alter the default behavior of the NVIDIA HPC compilers and the executables which they generate. This section includes explanations of the environment variables specific to the NVIDIA HPC Compilers. 
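As a reminder of the mechanics involved (plain POSIX shell, nothing NVHPC-specific), a variable must be present in a process's environment, either set for a single command or exported, before it can influence the compilers or the executables they generate:

```shell
# Set a variable for a single command invocation only:
FORT_FMT_RECL=132 sh -c 'echo "child sees $FORT_FMT_RECL"'   # prints: child sees 132

# Export it so every subsequent command inherits it:
export FORT_FMT_RECL=132
sh -c 'echo "child sees $FORT_FMT_RECL"'                     # prints: child sees 132

# Unset it to restore the documented default behavior:
unset FORT_FMT_RECL
```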
</p> <ul class="simple"> <li><p>Standard OpenMP environment variables are used to control the behavior of OpenMP programs; these environment variables are described in the OpenMP Specification available online.</p></li> <li><p>Several NVIDIA-specific environment variables can be used to control the behavior of OpenACC programs. OpenACC-related environment variables are described in the OpenACC section: <a class="reference internal" href="#env-vars"><span class="std std-ref">Environment Variables</span></a> and the <a class="reference external" href="../openacc-gs/index.htm">OpenACC Getting Started Guide</a>.</p></li> </ul> <section id="setting-environment-variables"> <h2><span class="section-number">12.1. </span>Setting Environment Variables<a class="headerlink" href="#setting-environment-variables" title="Permalink to this headline"></a></h2> <p>Before looking at the environment variables you might use with the HPC compilers and tools, let’s review how to set them. To illustrate how to set these variables in various environments, let’s look at how a user might initialize a Linux shell environment to enable use of the NVIDIA HPC Compilers.</p> <section id="setting-environment-variables-on-linux"> <h3><span class="section-number">12.1.1. </span>Setting Environment Variables on Linux<a class="headerlink" href="#setting-environment-variables-on-linux" title="Permalink to this headline"></a></h3> <p>Let’s assume that you want access to the NVIDIA products when you log in, and that you installed the NVIDIA HPC SDK in /opt/nvidia/hpc_sdk. 
For access at startup, you can add the following lines to your shell startup files on a Linux_x86_64 system.</p> <p><strong>For csh, use these commands:</strong></p> <pre class="literal-block">$ setenv NVHPCSDK /opt/nvidia/hpc_sdk
$ setenv MANPATH &quot;$MANPATH&quot;:$NVHPCSDK/Linux_x86_64/25.1/compilers/man
$ set path = ($NVHPCSDK/Linux_x86_64/25.1/compilers/bin $path)</pre> <p><strong>For bash, sh, zsh, or ksh, use these commands:</strong></p> <pre class="literal-block">$ NVHPCSDK=/opt/nvidia/hpc_sdk; export NVHPCSDK
$ MANPATH=$MANPATH:$NVHPCSDK/Linux_x86_64/25.1/compilers/man; export MANPATH
$ PATH=$NVHPCSDK/Linux_x86_64/25.1/compilers/bin:$PATH; export PATH</pre> <p>On a Linux/Arm Server system, replace <code class="docutils literal notranslate"><span class="pre">Linux_x86_64</span></code> with <code class="docutils literal notranslate"><span class="pre">Linux_aarch64</span></code>.</p> </section> </section> <section id="hpc-compiler-related-environment-variables"> <h2><span class="section-number">12.2. </span>HPC Compiler Related Environment Variables<a class="headerlink" href="#hpc-compiler-related-environment-variables" title="Permalink to this headline"></a></h2> <p>The following table provides a listing of environment variables that affect the behavior of the NVIDIA HPC Compilers and the executables they generate.</p> <table class="table-no-stripes docutils align-default" id="env-vars-nv-related-env-vars-nv-related-tbl"> <caption><span class="caption-text">Table 25. 
NVIDIA HPC Compilers Environment Variable Summary</span><a class="headerlink" href="#env-vars-nv-related-env-vars-nv-related-tbl" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 14%" /> <col style="width: 86%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Environment Variable</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>FORTRANOPT</p></td> <td><p>Allows the user to specify that the NVIDIA Fortran compiler should use VAX I/O or other custom I/O conventions.</p></td> </tr> <tr class="row-odd"><td><p>FORT_FMT_RECL</p></td> <td><p>Allows the user to change the default Fortran stdout (unit 6) line length before a line break occurs. Default: 80 bytes.</p></td> </tr> <tr class="row-even"><td><p>GMON_OUT_PREFIX</p></td> <td><p>Specifies the name of the output file for programs that are compiled and linked with the -pg option.</p></td> </tr> <tr class="row-odd"><td><p>LD_LIBRARY_PATH</p></td> <td><p>Specifies a colon-separated set of directories where libraries should first be searched, prior to searching the standard set of directories.</p></td> </tr> <tr class="row-even"><td><p>MANPATH</p></td> <td><p>Sets the directories that are searched for manual pages associated with the command that the user types.</p></td> </tr> <tr class="row-odd"><td><p>NO_STOP_MESSAGE</p></td> <td><p>If used, the execution of a plain STOP statement does not produce the message <code class="docutils literal notranslate"><span class="pre">FORTRAN</span> <span class="pre">STOP</span></code>.</p></td> </tr> <tr class="row-even"><td><p>PATH</p></td> <td><p>Determines which locations are searched for commands the user may type.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_FPU_STATE</p></td> <td><p>Manages the initial state of the processor’s floating point control and status register at program startup.</p></td> </tr> <tr class="row-even"><td><p>NVCOMPILER_TERM</p></td> <td><p>Controls the 
stack traceback and just-in-time debugging functionality.</p></td> </tr> <tr class="row-odd"><td><p>NVCOMPILER_TERM_DEBUG</p></td> <td><p>Overrides the default behavior when <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code> is set to <code class="docutils literal notranslate"><span class="pre">debug</span></code>.</p></td> </tr> <tr class="row-even"><td><p>PWD</p></td> <td><p>Allows you to display the current directory.</p></td> </tr> <tr class="row-odd"><td><p>STATIC_RANDOM_SEED</p></td> <td><p>Forces the seed returned by <code class="docutils literal notranslate"><span class="pre">RANDOM_SEED</span></code> to be constant.</p></td> </tr> <tr class="row-even"><td><p>TMP</p></td> <td><p>Sets the directory to use for temporary files created during execution of the HPC compilers and tools; interchangeable with <code class="docutils literal notranslate"><span class="pre">TMPDIR</span></code>.</p></td> </tr> <tr class="row-odd"><td><p>TMPDIR</p></td> <td><p>Sets the directory to use for temporary files created during execution of the HPC compilers and tools.</p></td> </tr> </tbody> </table> </section> <section id="hpc-compilers-environment-variables"> <h2><span class="section-number">12.3. </span>HPC Compilers Environment Variables<a class="headerlink" href="#hpc-compilers-environment-variables" title="Permalink to this headline"></a></h2> <p>Use the environment variables listed in <a class="reference internal" href="#env-vars-nv-related-env-vars-nv-related-tbl"><span class="std std-ref">Table 25</span></a> to alter the default behavior of the NVIDIA HPC Compilers and the executables which they generate. This section provides more detailed descriptions about the variables in this table.</p> <section id="fortranopt"> <h3><span class="section-number">12.3.1. 
</span>FORTRANOPT<a class="headerlink" href="#fortranopt" title="Permalink to this headline"></a></h3> <p><code class="docutils literal notranslate"><span class="pre">FORTRANOPT</span></code> allows the user to adjust the behavior of the NVIDIA Fortran compiler.</p> <ul class="simple"> <li><p>If <code class="docutils literal notranslate"><span class="pre">FORTRANOPT</span></code> exists and contains the value <code class="docutils literal notranslate"><span class="pre">vaxio</span></code>, the record length in the open statement is in units of 4-byte words, and the $ edit descriptor only has an effect for lines beginning with a space or a plus sign (+).</p></li> <li><p>If <code class="docutils literal notranslate"><span class="pre">FORTRANOPT</span></code> exists and contains the value <code class="docutils literal notranslate"><span class="pre">format_relaxed</span></code>, an I/O item corresponding to a numerical edit descriptor (such as F, E, I, and so on) is not required to be of the type implied by the descriptor.</p></li> <li><p>If <code class="docutils literal notranslate"><span class="pre">FORTRANOPT</span></code> exists and contains the value <code class="docutils literal notranslate"><span class="pre">no_minus_zero</span></code>, an I/O item corresponding to a numerical edit descriptor (such as F, E, I, and so on) equal to negative zero will be output as if it were positive zero.</p></li> <li><p>If <code class="docutils literal notranslate"><span class="pre">FORTRANOPT</span></code> exists and contains the value <code class="docutils literal notranslate"><span class="pre">crif</span></code>, a sequential formatted or list-directed record is allowed to be terminated with the character sequence <code class="docutils literal notranslate"><span class="pre">\r\n</span></code> (carriage return, newline).
This approach is useful when reading records from a file produced on a Windows system.</p></li> </ul> <p>The following example causes the NVIDIA Fortran compiler to use VAX I/O conventions:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ setenv FORTRANOPT vaxio </pre></div> </div> </section> <section id="fort-fmt-recl"> <h3><span class="section-number">12.3.2. </span>FORT_FMT_RECL<a class="headerlink" href="#fort-fmt-recl" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">FORT_FMT_RECL</span></code> environment variable specifies the maximum line length in bytes for Fortran formatted output to standard out (unit 6) before a newline will be generated.</p> <p>If the environment variable <code class="docutils literal notranslate"><span class="pre">FORT_FMT_RECL</span></code> is present, the Fortran runtime library will use the value specified as the number of bytes to output before a newline is generated.</p> <p>The default value of <code class="docutils literal notranslate"><span class="pre">FORT_FMT_RECL</span></code> is 80.</p> <ul> <li><p>In csh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ setenv FORT_FMT_RECL length-in-bytes </pre></div> </div> </li> <li><p>In bash, sh, zsh, or ksh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ FORT_FMT_RECL=length-in-bytes
$ export FORT_FMT_RECL </pre></div> </div> </li> </ul> </section> <section id="gmon-out-prefix"> <h3><span class="section-number">12.3.3. </span>GMON_OUT_PREFIX<a class="headerlink" href="#gmon-out-prefix" title="Permalink to this headline"></a></h3> <p><code class="docutils literal notranslate"><span class="pre">GMON_OUT_PREFIX</span></code> specifies the name of the output file for programs that are compiled and linked with the <code class="docutils literal notranslate"><span class="pre">-pg</span></code> option.
The default name is <code class="docutils literal notranslate"><span class="pre">gmon.out</span></code>.</p> <p>If <code class="docutils literal notranslate"><span class="pre">GMON_OUT_PREFIX</span></code> is set, the name of the output file has <code class="docutils literal notranslate"><span class="pre">GMON_OUT_PREFIX</span></code> as a prefix. Further, the suffix is the pid of the running process. The prefix and suffix are separated by a dot. For example, if <code class="docutils literal notranslate"><span class="pre">GMON_OUT_PREFIX</span></code> is set to <code class="docutils literal notranslate"><span class="pre">mygmon</span></code>, the full filename may look something like this: <code class="docutils literal notranslate"><span class="pre">mygmon.0012348567</span></code>.</p> <p>The following example causes <code class="docutils literal notranslate"><span class="pre">nvout</span></code> to be used as the prefix of the output file for programs compiled and linked with the <code class="docutils literal notranslate"><span class="pre">-pg</span></code> option.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ setenv GMON_OUT_PREFIX nvout </pre></div> </div> </section> <section id="ld-library-path"> <h3><span class="section-number">12.3.4. </span>LD_LIBRARY_PATH<a class="headerlink" href="#ld-library-path" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code> variable is a colon-separated set of directories specifying where libraries should first be searched, prior to searching the standard set of directories.
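</p> <p>In bash, sh, zsh, or ksh, the equivalent setting uses <code class="docutils literal notranslate"><span class="pre">export</span></code>; for example, to prepend a hypothetical directory <code class="docutils literal notranslate"><span class="pre">/opt/mylibs</span></code> to the search path:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ export LD_LIBRARY_PATH=/opt/mylibs:&quot;$LD_LIBRARY_PATH&quot; </pre></div> </div> <p>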
This variable is useful when debugging a new library or using a nonstandard library for special purposes.</p> <p>The following csh example adds the current directory to your <code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code> variable.</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ setenv LD_LIBRARY_PATH &quot;$LD_LIBRARY_PATH&quot;:&quot;./&quot; </pre></div> </div> </section> <section id="manpath"> <h3><span class="section-number">12.3.5. </span>MANPATH<a class="headerlink" href="#manpath" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">MANPATH</span></code> variable sets the directories that are searched for manual pages associated with the commands that the user types. When using NVIDIA HPC Compilers, it is important that you set your <code class="docutils literal notranslate"><span class="pre">PATH</span></code> to include the location of the compilers and then set the <code class="docutils literal notranslate"><span class="pre">MANPATH</span></code> variable to include the man pages associated with the products.</p> <p>The following csh example targets the Linux_x86_64 version of the compilers and enables access to the manual pages associated with them. The settings are similar for Linux_aarch64 targets:</p> <pre class="literal-block">$ set path = (/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/bin $path)
$ setenv MANPATH &quot;$MANPATH&quot;:/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/man</pre> </section> <section id="no-stop-message"> <h3><span class="section-number">12.3.6.
</span>NO_STOP_MESSAGE<a class="headerlink" href="#no-stop-message" title="Permalink to this headline"></a></h3> <p>If the <code class="docutils literal notranslate"><span class="pre">NO_STOP_MESSAGE</span></code> variable exists, the execution of a plain <code class="docutils literal notranslate"><span class="pre">STOP</span></code> statement does not produce the message FORTRAN STOP. The default behavior of the NVIDIA Fortran compiler is to issue this message.</p> </section> <section id="path"> <h3><span class="section-number">12.3.7. </span>PATH<a class="headerlink" href="#path" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">PATH</span></code> variable determines the directories that are searched for commands that the user types. When using the NVIDIA HPC compilers, it is important that you set your <code class="docutils literal notranslate"><span class="pre">PATH</span></code> to include the location of the compilers.</p> <p>The following csh example initializes path settings to use the Linux_x86_64 versions of the NVIDIA HPC Compilers. Settings for Linux_aarch64 are done similarly:</p> <pre class="literal-block">$ set path = (/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/bin $path)</pre> <span class="target" id="env-vars-nv-fpu-state"></span></section> <section id="nvcompiler-fpu-state"> <h3><span class="section-number">12.3.8. </span>NVCOMPILER_FPU_STATE<a class="headerlink" href="#nvcompiler-fpu-state" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_FPU_STATE</span></code> environment variable manages the initial state of the processor’s floating point control and status register. 
<code class="docutils literal notranslate"><span class="pre">NVCOMPILER_FPU_STATE</span></code> eliminates the need to compile the main entry point (C/C++/Fortran) of programs with <code class="docutils literal notranslate"><span class="pre">-M[no]daz</span></code>, <code class="docutils literal notranslate"><span class="pre">-M[no]flushz</span></code>, or <code class="docutils literal notranslate"><span class="pre">-Ktrap=</span></code> command line options, as those options can now be specified at runtime.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Linux only</p> <p>If the environment variable <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_FPU_STATE</span></code> is present, all settings from the command line options <code class="docutils literal notranslate"><span class="pre">-M[no]daz</span></code>, <code class="docutils literal notranslate"><span class="pre">-M[no]flushz</span></code>, or <code class="docutils literal notranslate"><span class="pre">-Ktrap=</span></code> are ignored and the FPU is initialized according to the options specified. <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_FPU_STATE</span></code> with no options resets the floating-point control and status register to the system defaults.</p> </div> <p>The value of <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_FPU_STATE</span></code> is a comma-separated list of options. The commands for setting the environment variable follow.</p> <ul> <li><p>In csh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ setenv NVCOMPILER_FPU_STATE option[,option...] </pre></div> </div> </li> <li><p>In bash, sh, zsh, or ksh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ NVCOMPILER_FPU_STATE=option[,option...]
$ export NVCOMPILER_FPU_STATE </pre></div> </div> </li> </ul> <p><a class="reference internal" href="#env-vars-nv-fpu-state-env-vars-nv-fpu-state-vals"><span class="std std-ref">Table 26</span></a> lists the supported values for <code class="docutils literal notranslate"><span class="pre">option</span></code>.</p> <p>By default, these options are taken from the compiler command line options <code class="docutils literal notranslate"><span class="pre">-M[no]daz</span></code>, <code class="docutils literal notranslate"><span class="pre">-M[no]flushz</span></code>, and <code class="docutils literal notranslate"><span class="pre">-Ktrap=</span></code>.</p> <table class="table-no-stripes docutils align-default" id="env-vars-nv-fpu-state-env-vars-nv-fpu-state-vals"> <caption><span class="caption-text">Table 26. Supported NVCOMPILER_FPU_STATE options</span><a class="headerlink" href="#env-vars-nv-fpu-state-env-vars-nv-fpu-state-vals" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 9%" /> <col style="width: 91%" /> </colgroup> <tbody> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">fp</span></code></p></td> <td><p>Shorthand for inv,divz,ovf</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">inv</span></code></p></td> <td><p>Raise exception on floating-point invalid operation (infinity - infinity, infinity / infinity, 0 / 0, …)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">invalid</span></code></p></td> <td><p>Alias for inv</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">denorm</span></code></p></td> <td><p>Raise exception with floating-point denormalized operands (x86_64 only)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">divz</span></code></p></td> <td><p>Raise exception on 
floating-point divide-by-zero</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">zero</span></code></p></td> <td><p>Alias for divz</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">ovf</span></code></p></td> <td><p>Raise exception on floating-point overflow in result</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">overflow</span></code></p></td> <td><p>Alias for ovf</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">unf</span></code></p></td> <td><p>Raise exception on floating-point underflow in result</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">underflow</span></code></p></td> <td><p>Alias for unf</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">inexact</span></code></p></td> <td><p>Raise exception on floating-point inexact result</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">daz</span></code></p></td> <td><p>Convert denormal source operands to zero</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">nodaz</span></code></p></td> <td><p>Do not convert denormal source operands to zero</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">ftz</span></code></p></td> <td><p>Flush underflow results to zero</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">flushz</span></code></p></td> <td><p>Alias for ftz</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">noftz</span></code></p></td> <td><p>Do not flush underflow results to zero</p></td> </tr> <tr class="row-odd"><td><p><code 
class="docutils literal notranslate"><span class="pre">noflushz</span></code></p></td> <td><p>Alias for noftz</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">print</span></code></p></td> <td><p>Print to stderr the state of floating point control and status register before and after processing of environment variable <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_FPU_STATE</span></code></p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">debug</span></code></p></td> <td><p>Alias for print</p></td> </tr> </tbody> </table> <span class="target" id="env-vars-nv-term"></span></section> <section id="nvcompiler-term"> <h3><span class="section-number">12.3.9. </span>NVCOMPILER_TERM<a class="headerlink" href="#nvcompiler-term" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code> environment variable controls the stack traceback and just-in-time debugging functionality. The runtime libraries use the value of <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code> to determine what action to take when a program abnormally terminates.</p> <p>The value of <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code> is a comma-separated list of options. The commands for setting the environment variable follow.</p> <ul> <li><p>In csh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ setenv NVCOMPILER_TERM option[,option...] </pre></div> </div> </li> <li><p>In bash, sh, zsh, or ksh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ NVCOMPILER_TERM=option[,option...] 
$ export NVCOMPILER_TERM </pre></div> </div> </li> </ul> <p>Table 27 lists the supported values for <code class="docutils literal notranslate"><span class="pre">option</span></code>. Following the table is a complete description of each option that indicates specifically how you might apply the option.</p> <p>By default, all of these options are disabled.</p> <table class="table-no-stripes docutils align-default" id="env-vars-nv-term-env-vars-nv-term-vals"> <caption><span class="caption-text">Table 27. Supported NVCOMPILER_TERM Values</span><a class="headerlink" href="#env-vars-nv-term-env-vars-nv-term-vals" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 15%" /> <col style="width: 85%" /> </colgroup> <tbody> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">[no]debug</span></code></p></td> <td><p>Enables/disables just-in-time debugging (debugging invoked on error)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">[no]trace</span></code></p></td> <td><p>Enables/disables stack traceback on error</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">[no]trace-⁠fp</span></code></p></td> <td><p>Enables/disables stack traceback and printing of SIMD registers (ymm/zmm) on error (Linux x86_64 only)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">[no]signal</span></code></p></td> <td><p>Enables/disables establishment of signal handlers for common signals that cause program termination</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">[no]abort</span></code></p></td> <td><p>Enables/disables calling the system termination routine abort()</p></td> </tr> </tbody> </table> <p><strong>[no]debug</strong></p> <p>This enables/disables just-in-time debugging. 
The default is <code class="docutils literal notranslate"><span class="pre">nodebug</span></code>.</p> <p>When <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code> is set to <code class="docutils literal notranslate"><span class="pre">debug</span></code>, the command to which <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM_DEBUG</span></code> is set is invoked on error.</p> <p><strong>[no]trace</strong></p> <p>This enables/disables stack traceback on error.</p> <p><strong>[no]trace-fp</strong></p> <p>This enables/disables stack traceback and printing of SIMD registers (ymm/zmm) on error. (Linux x86_64 only)</p> <p><strong>[no]signal</strong></p> <p>This enables/disables establishing signal handlers for the most common signals that cause program termination. The default is <code class="docutils literal notranslate"><span class="pre">nosignal</span></code>. Setting <code class="docutils literal notranslate"><span class="pre">trace</span></code> and <code class="docutils literal notranslate"><span class="pre">debug</span></code> automatically enables <code class="docutils literal notranslate"><span class="pre">signal</span></code>. Specifically setting <code class="docutils literal notranslate"><span class="pre">nosignal</span></code> allows you to override this behavior.</p> <p><strong>[no]abort</strong></p> <p>This enables/disables calling the system termination routine abort(). The default is <code class="docutils literal notranslate"><span class="pre">noabort</span></code>. 
When <code class="docutils literal notranslate"><span class="pre">noabort</span></code> is in effect the process terminates by calling <code class="docutils literal notranslate"><span class="pre">_exit(127)</span></code>.</p> <p>On Linux, when <code class="docutils literal notranslate"><span class="pre">abort</span></code> is in effect, the abort routine creates a core file and exits with code 127.</p> <p>A few runtime errors just print an error message and call <code class="docutils literal notranslate"><span class="pre">exit(127)</span></code>, regardless of the status of <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code>. These are mainly errors such as specifying an invalid environment variable value where a traceback would not be useful.</p> <p>If it appears that abort() does not generate core files on a Linux system, be sure to unlimit the coredumpsize. You can do this in these ways:</p> <ul> <li><p>Using csh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ limit coredumpsize unlimited
$ setenv NVCOMPILER_TERM abort </pre></div> </div> </li> <li><p>Using bash, sh, zsh, or ksh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ ulimit -c unlimited
$ export NVCOMPILER_TERM=abort </pre></div> </div> </li> </ul> <p>To debug a core file with gdb, invoke gdb with the <code class="docutils literal notranslate"><span class="pre">--core</span></code> option. For example, to view a core file named “core” for a program named “a.out”:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ gdb --core=core a.out </pre></div> </div> <p>For more information on why to use this variable, refer to <a class="reference internal" href="#env-vars-stack-trace-jit-dbg"><span class="std std-ref">Stack Traceback and JIT Debugging</span></a>.</p> </section> <section id="nvcompiler-term-debug"> <h3><span class="section-number">12.3.10.
</span>NVCOMPILER_TERM_DEBUG<a class="headerlink" href="#nvcompiler-term-debug" title="Permalink to this headline"></a></h3> <p>The <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM_DEBUG</span></code> variable may be set to override the default behavior when <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code> is set to <code class="docutils literal notranslate"><span class="pre">debug</span></code>.</p> <p>The value of <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM_DEBUG</span></code> should be set to the command line used to invoke the program. For example:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>gdb --quiet --pid %d </pre></div> </div> <p>The first occurrence of <code class="docutils literal notranslate"><span class="pre">%d</span></code> in the <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM_DEBUG</span></code> string is replaced by the process id. The program named in the <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM_DEBUG</span></code> string must be found on the current <code class="docutils literal notranslate"><span class="pre">PATH</span></code> or specified with a full path name.</p> </section> <section id="pwd"> <h3><span class="section-number">12.3.11. </span>PWD<a class="headerlink" href="#pwd" title="Permalink to this headline"></a></h3> <p>The PWD variable allows you to display the current directory.</p> </section> <section id="static-random-seed"> <h3><span class="section-number">12.3.12. </span>STATIC_RANDOM_SEED<a class="headerlink" href="#static-random-seed" title="Permalink to this headline"></a></h3> <p>You can use <code class="docutils literal notranslate"><span class="pre">STATIC_RANDOM_SEED</span></code> to force the seed returned by the Fortran 90/95 <code class="docutils literal notranslate"><span class="pre">RANDOM_SEED</span></code> intrinsic to be constant.
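</p> <p>For example, to request a repeatable random number sequence in csh:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ setenv STATIC_RANDOM_SEED YES </pre></div> </div> <p>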
The first call to <code class="docutils literal notranslate"><span class="pre">RANDOM_SEED</span></code> without arguments resets the random seed to a default value, then advances the seed by a variable amount based on time. Subsequent calls to <code class="docutils literal notranslate"><span class="pre">RANDOM_SEED</span></code> without arguments reset the random seed to the same initial value as the first call. Unless the time is exactly the same, each time a program is run a different random number sequence is generated. Setting the environment variable <code class="docutils literal notranslate"><span class="pre">STATIC_RANDOM_SEED</span></code> to <code class="docutils literal notranslate"><span class="pre">YES</span></code> forces the seed returned by <code class="docutils literal notranslate"><span class="pre">RANDOM_SEED</span></code> to be constant, thereby generating the same sequence of random numbers at each execution of the program.</p> </section> <section id="tmp"> <h3><span class="section-number">12.3.13. </span>TMP<a class="headerlink" href="#tmp" title="Permalink to this headline"></a></h3> <p>You can use <code class="docutils literal notranslate"><span class="pre">TMP</span></code> to specify the directory to use for placement of any temporary files created during execution of the NVIDIA HPC Compilers. This variable is interchangeable with <code class="docutils literal notranslate"><span class="pre">TMPDIR</span></code>.</p> </section> <section id="tmpdir"> <h3><span class="section-number">12.3.14. </span>TMPDIR<a class="headerlink" href="#tmpdir" title="Permalink to this headline"></a></h3> <p>You can use <code class="docutils literal notranslate"><span class="pre">TMPDIR</span></code> to specify the directory to use for placement of any temporary files created during execution of the NVIDIA HPC Compilers.</p> </section> </section> <section id="using-environment-modules-on-linux"> <h2><span class="section-number">12.4. 
</span>Using Environment Modules on Linux<a class="headerlink" href="#using-environment-modules-on-linux" title="Permalink to this headline"></a></h2> <p>On Linux, if you use the Environment Modules package, that is, the <code class="docutils literal notranslate"><span class="pre">module</span> <span class="pre">load</span></code> command, the NVIDIA HPC Compilers include a script to set up the appropriate module files. The install script will generate environment module files for you as part of the setup process.</p> <p>Assuming your installation base directory is <code class="docutils literal notranslate"><span class="pre">/opt/nvidia/hpc_sdk</span></code>, the environment modules will be installed under <code class="docutils literal notranslate"><span class="pre">/opt/nvidia/hpc_sdk/modulefiles</span></code>. There will be three sets of module files:</p> <ol class="arabic"> <li><p>nvhpc</p> <p>Adds environment variable settings for the NVIDIA HPC Compilers, CUDA libraries, and additional libraries such as MPI, NCCL, and NVSHMEM.</p> </li> <li><p>nvhpc-nompi</p> <p>Adds environment variable settings for the NVIDIA HPC Compilers, CUDA libraries, and additional libraries such as NCCL and NVSHMEM. MPI is not included; use this module if you wish to use an alternate MPI implementation.</p> </li> <li><p>nvhpc-byo-compilers</p> <p>Adds environment variable settings for the CUDA libraries and additional libraries such as NCCL and NVSHMEM.
Neither the NVIDIA HPC Compilers nor MPI is included; use this module if you wish to use alternate compilers and an alternate MPI.</p> </li> </ol> <p>You can load the nvhpc environment module for the 25.1 release as follows:</p> <pre class="literal-block">$ module load nvhpc/25.1</pre> <p>To see what versions of nvhpc are available on this system, use this command:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ module avail nvhpc </pre></div> </div> <p>The <code class="docutils literal notranslate"><span class="pre">module</span> <span class="pre">load</span></code> command sets or modifies the environment variables as indicated in the following table.</p> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 17%" /> <col style="width: 83%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>This Environment Variable…</p></th> <th class="head"><p>Is set or modified by the module load command</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">CC</span></code></p></td> <td><p>Full path to nvc (nvhpc and nvhpc-nompi only)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">CPATH</span></code></p></td> <td><p>Prepends the math libraries include directory, the MPI include directory (nvhpc only), and NCCL and NVSHMEM include directories</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">CPP</span></code></p></td> <td><p>C preprocessor, normally cpp (nvhpc and nvhpc-nompi only)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">CXX</span></code></p></td> <td><p>Full path to nvc++ (nvhpc and nvhpc-nompi only)</p></td>
</tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">FC</span></code></p></td> <td><p>Full path to nvfortran (nvhpc and nvhpc-nompi only)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">F90</span></code></p></td> <td><p>Full path to nvfortran (nvhpc and nvhpc-nompi only)</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">F77</span></code></p></td> <td><p>Full path to nvfortran (nvhpc and nvhpc-nompi only)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">LD_LIBRARY_PATH</span></code></p></td> <td><p>Prepends the CUDA library directory, the NVIDIA HPC Compilers library directory (nvhpc and nvhpc-nompi only), math libraries library directory, MPI library directory (nvhpc only), and NCCL and NVSHMEM library directories</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">MANPATH</span></code></p></td> <td><p>Prepends the NVIDIA HPC Compilers man page directory (nvhpc and nvhpc-nompi only)</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">OPAL_PREFIX</span></code></p></td> <td><p>Full path to the MPI directory (nvhpc only), e.g. /opt/nvidia/hpc_sdk/Linux_x86_64/25.1/comm_libs/mpi</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">PATH</span></code></p></td> <td><p>Prepends the CUDA bin directory, the MPI bin directory (nvhpc only), and the NVIDIA HPC Compilers bin directory (nvhpc and nvhpc-nompi only)</p></td> </tr> </tbody> </table> <div class="admonition note"> <p class="admonition-title">Note</p> <p>NVIDIA does not provide support for the Environment Modules package. 
For more information about the package, go to: <a class="reference external" href="http://modules.sourceforge.net">http://modules.sourceforge.net</a>.</p> </div> <span class="target" id="env-vars-stack-trace-jit-dbg"></span></section> <section id="stack-traceback-and-jit-debugging"> <h2><span class="section-number">12.5. </span>Stack Traceback and JIT Debugging<a class="headerlink" href="#stack-traceback-and-jit-debugging" title="Permalink to this headline"></a></h2> <p>When a programming error results in a runtime error message or an application exception, a program will usually exit, perhaps with an error message. The NVIDIA HPC Compilers runtime library includes a mechanism to override this default action and instead print a stack traceback, start a debugger, or, on Linux, create a core file for post-mortem debugging.</p> <p>The stack traceback and just-in-time debugging functionality is controlled by an environment variable, <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code>, described in <a class="reference internal" href="#env-vars-nv-term"><span class="std std-ref">NVCOMPILER_TERM</span></a>. The runtime libraries use the value of <code class="docutils literal notranslate"><span class="pre">NVCOMPILER_TERM</span></code> to determine what action to take when a program abnormally terminates.</p> <p>When the NVIDIA HPC Compilers runtime library detects an error or catches a signal, it calls the routine <code class="docutils literal notranslate"><span class="pre">nvcompiler_stop_here()</span></code> prior to generating a stack traceback or starting the debugger. The <code class="docutils literal notranslate"><span class="pre">nvcompiler_stop_here()</span></code> routine is a convenient spot to set a breakpoint when debugging a program.</p> <span class="target" id="deploy-dist-files"></span></section> </section> <section id="distributing-files-deployment"> <h1><span class="section-number">13. 
</span>Distributing Files - Deployment<a class="headerlink" href="#distributing-files-deployment" title="Permalink to this headline"></a></h1> <p>Once you have successfully built, debugged, and tuned your application, you may want to distribute it to users who need to run it on a variety of systems. This section addresses how to effectively distribute applications built using NVIDIA HPC Compilers. The application must be installed in such a way that it executes correctly on a system other than the one on which it was built, even if that system is configured differently.</p> <section id="deploying-applications-on-linux"> <h2><span class="section-number">13.1. </span>Deploying Applications on Linux<a class="headerlink" href="#deploying-applications-on-linux" title="Permalink to this headline"></a></h2> <p>To successfully deploy your application on Linux, some of the issues to consider include:</p> <ul class="simple"> <li><p>Runtime Libraries</p></li> <li><p>64-bit Linux Systems</p></li> <li><p>Redistribution of Files</p></li> </ul> <section id="runtime-library-considerations"> <h3><span class="section-number">13.1.1. </span>Runtime Library Considerations<a class="headerlink" href="#runtime-library-considerations" title="Permalink to this headline"></a></h3> <p>On Linux systems, the system runtime libraries can be linked to an application either statically or dynamically. For example, for the C runtime library, <code class="docutils literal notranslate"><span class="pre">libc</span></code>, you can use either the static version <code class="docutils literal notranslate"><span class="pre">libc.a</span></code> or the shared object version <code class="docutils literal notranslate"><span class="pre">libc.so</span></code>. If the application is intended to run on Linux systems other than the one on which it was built, it is generally safer to use the shared object version of the library.
This approach ensures that the application uses a version of the library that is compatible with the system on which the application is running. Further, it works best when the application is linked on a system that has an equivalent or earlier version of the system software than the system on which the application will be run.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Building on a newer system and running the application on an older system may not produce the desired output.</p> </div> <p>To use the shared object version of a library, the application must also link to shared object versions of the NVIDIA HPC Compilers runtime libraries. To execute an application built in such a way on a system on which NVIDIA HPC Compilers are <em>not</em> installed, those shared objects must be available. To build using the shared object versions of the runtime libraries, use the <code class="docutils literal notranslate"><span class="pre">-Bdynamic</span></code> option, as shown here:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">$</span><span class="w"> </span><span class="n">nvfortran</span><span class="w"> </span><span class="o">-</span><span class="n">Bdynamic</span><span class="w"> </span><span class="n">myprog</span><span class="p">.</span><span class="n">f90</span><span class="w"></span> </pre></div> </div> </section> <section id="bit-linux-considerations"> <h3><span class="section-number">13.1.2. </span>64-bit Linux Considerations<a class="headerlink" href="#bit-linux-considerations" title="Permalink to this headline"></a></h3> <p>On 64-bit Linux systems, 64-bit applications that use the <code class="docutils literal notranslate"><span class="pre">-⁠mcmodel=medium</span></code> option sometimes cannot be successfully linked statically.
Therefore, users with executables built with the <code class="docutils literal notranslate"><span class="pre">-⁠mcmodel=medium</span></code> option may need to use shared libraries, linking dynamically. Also, runtime libraries built using the <code class="docutils literal notranslate"><span class="pre">-⁠fpic</span></code> option use 32-bit offsets, so they sometimes need to reside near other runtime <code class="docutils literal notranslate"><span class="pre">libs</span></code> in a shared area of Linux program memory.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>If your application is linked dynamically using shared objects, then the shared object versions of the NVIDIA HPC Compilers runtime are required.</p> </div> </section> <section id="linux-redistributable-files"> <h3><span class="section-number">13.1.3. </span>Linux Redistributable Files<a class="headerlink" href="#linux-redistributable-files" title="Permalink to this headline"></a></h3> <p>The method for installing the shared object versions of the runtime libraries required for applications built with NVIDIA HPC Compilers is manual distribution.</p> <p>When the NVIDIA HPC Compilers are installed, there are directories that have a name that begins with <code class="docutils literal notranslate"><span class="pre">REDIST</span></code>; these directories contain the redistributed shared object libraries. These may be redistributed by licensed NVIDIA HPC Compilers users under the terms of the End-User License Agreement.</p> </section> <section id="restrictions-on-linux-portability"> <h3><span class="section-number">13.1.4. </span>Restrictions on Linux Portability<a class="headerlink" href="#restrictions-on-linux-portability" title="Permalink to this headline"></a></h3> <p>You cannot expect to be able to run an executable on any given Linux machine. 
Portability depends on the system you build on as well as how much your program uses system routines that may have changed from Linux release to Linux release. For example, an area of significant change between some versions of Linux is in <code class="docutils literal notranslate"><span class="pre">libpthread.so</span></code> and <code class="docutils literal notranslate"><span class="pre">libnuma.so</span></code>. NVIDIA HPC Compilers use these dynamically linked libraries for the options <code class="docutils literal notranslate"><span class="pre">-acc</span></code> (OpenACC), <code class="docutils literal notranslate"><span class="pre">-mp</span></code> (OpenMP), and <code class="docutils literal notranslate"><span class="pre">-Mconcur</span></code> (multicore auto-parallel). Statically linking these libraries may not be possible, or may result in failure at execution.</p> <p>Typically, portability is supported for forward execution, meaning running a program on the same or a later version of Linux, but not for backward execution, that is, running on a prior release. For example, a user who compiles and links a program under RHEL 7.2 should not expect the program to run without incident on a RHEL 5.2 system, an earlier Linux version. It <em>may</em> run, but it is less likely to do so. Developers might consider building applications on earlier Linux versions for wider usage. Dynamic linking of Linux and gcc system routines on the platform executing the program can also reduce problems.</p> </section> <section id="licensing-for-redistributable-redist-files"> <h3><span class="section-number">13.1.5.
</span>Licensing for Redistributable (REDIST) Files<a class="headerlink" href="#licensing-for-redistributable-redist-files" title="Permalink to this headline"></a></h3> <p>The files in the REDIST directories may be redistributed under the terms of the End-User License Agreement for the product in which they were included.</p> <span class="target" id="intr-lang-call"></span></section> </section> </section> <section id="inter-language-calling"> <h1><span class="section-number">14. </span>Inter-language Calling<a class="headerlink" href="#inter-language-calling" title="Permalink to this headline"></a></h1> <p>This section describes inter-language calling conventions for C, C++, and Fortran programs using the HPC compilers. Fortran 2003 ISO_C_Binding provides a mechanism to support interoperability with C. This includes the <code class="docutils literal notranslate"><span class="pre">iso_c_binding</span></code> intrinsic module, binding labels, and the BIND attribute. Additional interoperability with C is available with Fortran 2018 and the <code class="docutils literal notranslate"><span class="pre">ISO_Fortran_binding.h</span></code> C header file. nvfortran supports both the <code class="docutils literal notranslate"><span class="pre">iso_c_binding</span></code> module and the <code class="docutils literal notranslate"><span class="pre">ISO_Fortran_binding.h</span></code> header file.
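</p> <p>As a brief sketch of the <code class="docutils literal notranslate"><span class="pre">iso_c_binding</span></code> approach (hypothetical names, not an example from this guide), an interface block with the BIND attribute lets Fortran call a C function directly, with no underscore or case conversion to manage:</p>

```fortran
! Hypothetical C function being bound:  int c_add(int a, int b);
interface
   function c_add(a, b) bind(c, name="c_add") result(r)
      use iso_c_binding, only: c_int
      integer(c_int), value :: a, b   ! VALUE attribute: passed by value, as C expects
      integer(c_int)        :: r
   end function c_add
end interface
```

<p>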
In the absence of these mechanisms, the following sections describe how to call a Fortran function or subroutine from a C or C++ program and how to call a C or C++ function from a Fortran program.</p> <p>This section provides examples that use the following options related to inter-language calling.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">-c</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Mnomain</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Miface</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Mupcase</span></code></p></li> </ul> <section id="overview-of-calling-conventions"> <h2><span class="section-number">14.1. </span>Overview of Calling Conventions<a class="headerlink" href="#overview-of-calling-conventions" title="Permalink to this headline"></a></h2> <p>This section includes information on the following topics:</p> <ul class="simple"> <li><p>Functions and subroutines in Fortran, C, and C++</p></li> <li><p>Naming and case conversion conventions</p></li> <li><p>Compatible data types</p></li> <li><p>Argument passing and special return values</p></li> <li><p>Arrays and indexes</p></li> </ul> <p>The sections <a class="reference internal" href="#intr-lang-call-consider"><span class="std std-ref">Inter-language Calling Considerations</span></a> through <a class="reference internal" href="#intr-lang-exam-cpp-fort"><span class="std std-ref">Example – C++ Calling Fortran</span></a> describe how to perform inter-language calling using the Linux or Win64 convention.</p> <span class="target" id="intr-lang-call-consider"></span></section> <section id="inter-language-calling-considerations"> <h2><span class="section-number">14.2. 
</span>Inter-language Calling Considerations<a class="headerlink" href="#inter-language-calling-considerations" title="Permalink to this headline"></a></h2> <p>In general, when argument data types and function return values agree, you can call a C or C++ function from Fortran as well as call a Fortran function from C or C++. When data types for arguments do not agree, you may need to develop custom mechanisms to handle them. For example, the Fortran <code class="docutils literal notranslate"><span class="pre">COMPLEX</span></code> type has a matching type in C99 but does not have a matching type in C89; however, it is still possible to provide inter-language calls but there are no general calling conventions for such cases.</p> <ul> <li><p>If a C++ function contains objects with constructors and destructors, calling such a function from either C or Fortran is not possible unless the initialization in the main program is performed from a C++ program in which constructors and destructors are properly initialized.</p></li> <li><p>In general, you can call a C or Fortran function from C++ without problems as long as you use the extern “C” keyword to declare the function in the C++ program. This declaration prevents name mangling for the C function name. If you want to call a C++ function from C or Fortran, you also have to use the extern “C” keyword to declare the C++ function. This keeps the C++ compiler from mangling the name of the function.</p></li> <li><p>You can use the __cplusplus macro to allow a program or header file to work for both C and C++. 
For example, the following defines in the header file stdio.h allow this file to work for both C and C++.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#ifndef _STDIO_H</span> <span class="cp">#define _STDIO_H</span> <span class="cp">#ifdef __cplusplus</span> <span class="k">extern</span><span class="w"> </span><span class="s">&quot;C&quot;</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="cp">#endif </span><span class="cm">/* __cplusplus */</span><span class="cp"></span> <span class="p">.</span><span class="w"></span> <span class="p">.</span><span class="w"> </span><span class="cm">/* Functions and data types defined... */</span><span class="w"></span> <span class="p">.</span><span class="w"></span> <span class="cp">#ifdef __cplusplus</span> <span class="p">}</span><span class="w"></span> <span class="cp">#endif </span><span class="cm">/* __cplusplus */</span><span class="cp"></span> <span class="cp">#endif</span> </pre></div> </div> </li> <li><p>C++ member functions cannot be declared <code class="docutils literal notranslate"><span class="pre">extern</span></code>, since their names will always be mangled. Therefore, C++ member functions cannot be called from C or Fortran.</p></li> </ul> <span class="target" id="intr-lang-funcs-subs"></span></section> <section id="functions-and-subroutines"> <h2><span class="section-number">14.3. 
</span>Functions and Subroutines<a class="headerlink" href="#functions-and-subroutines" title="Permalink to this headline"></a></h2> <p>Fortran, C, and C++ define functions and subroutines differently.</p> <p>For a Fortran program calling a C or C++ function, observe the following return value convention:</p> <ul class="simple"> <li><p>When a C or C++ function returns a value, call it from Fortran as a function.</p></li> <li><p>When a C or C++ function does not return a value, call it as a subroutine.</p></li> </ul> <p>For a C/C++ program calling a Fortran function, the call should return a similar type. <a class="reference internal" href="#intr-lang-data-types-intr-lang-data-types-tbl"><span class="std std-ref">Table 28</span></a>, <a class="reference internal" href="#intr-lang-data-types-intr-lang-data-types-tbl"><span class="std std-ref">Fortran and C/C++ Data Type Compatibility</span></a>, lists compatible types. If the call is to a Fortran subroutine, or a Fortran <code class="docutils literal notranslate"><span class="pre">CHARACTER</span></code> function, or a Fortran <code class="docutils literal notranslate"><span class="pre">COMPLEX</span></code> function, call it from C/C++ as a function that returns void. The exception to this convention is when a Fortran subroutine has alternate returns; call such a subroutine from C/C++ as a function returning <code class="docutils literal notranslate"><span class="pre">int</span></code> whose value is the value of the integer expression specified in the alternate <code class="docutils literal notranslate"><span class="pre">RETURN</span></code> statement.</p> </section> <section id="upper-and-lower-case-conventions-underscores"> <h2><span class="section-number">14.4. 
</span>Upper and Lower Case Conventions, Underscores<a class="headerlink" href="#upper-and-lower-case-conventions-underscores" title="Permalink to this headline"></a></h2> <p>By default on Linux and Win64 systems, all Fortran symbol names are converted to lower case. C and C++ are case sensitive, so upper-case function names stay upper-case. When you use inter-language calling, you can either name your C/C++ functions with lower-case names, or invoke the Fortran compiler command with the option <code class="docutils literal notranslate"><span class="pre">-Mupcase</span></code>, in which case it will not convert symbol names to lower-case.</p> <p>When programs are compiled using one of the HPC Fortran compilers on Linux and Win64 systems, an underscore is appended to Fortran global names (names of functions, subroutines and common blocks). This mechanism distinguishes Fortran name space from C/C++ name space. Use these naming conventions:</p> <ul class="simple"> <li><p>If you call a C/C++ function from Fortran, you should rename the C/C++ function by appending an underscore or use <code class="docutils literal notranslate"><span class="pre">bind(c)</span></code> in the Fortran program.</p></li> <li><p>If you call a Fortran function from C/C++, you should append an underscore to the Fortran function name in the calling program.</p></li> </ul> </section> <section id="compatible-data-types"> <h2><span class="section-number">14.5. </span>Compatible Data Types<a class="headerlink" href="#compatible-data-types" title="Permalink to this headline"></a></h2> <p><a class="reference internal" href="#intr-lang-data-types-intr-lang-data-types-tbl"><span class="std std-ref">Table 28</span></a> shows compatible data types between Fortran and C/C++. 
<a class="reference internal" href="#intr-lang-data-types-intr-lang-data-types-repr-tbl"><span class="std std-ref">Table 29</span></a>, <a class="reference internal" href="#intr-lang-data-types-intr-lang-data-types-repr-tbl"><span class="std std-ref">Fortran and C/C++ Representation of the COMPLEX Type</span></a> shows how the Fortran <code class="docutils literal notranslate"><span class="pre">COMPLEX</span></code> type may be represented in C/C++.</p> <div class="admonition tip"> <p class="admonition-title">Tip</p> <p>If you can make your function/subroutine parameters as well as your return values match types, you should be able to use inter-language calling.</p> </div> <table class="table-no-stripes docutils align-default" id="intr-lang-data-types-intr-lang-data-types-tbl"> <caption><span class="caption-text">Table 28. Fortran and C/C++ Data Type Compatibility</span><a class="headerlink" href="#intr-lang-data-types-intr-lang-data-types-tbl" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 43%" /> <col style="width: 36%" /> <col style="width: 21%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Fortran Type (lower case)</p></th> <th class="head"><p>C/C++ Type</p></th> <th class="head"><p>Size (bytes)</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">character</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">char</span> <span class="pre">x</span></code></p></td> <td><p>1</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">character*n</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">char</span> <span class="pre">x[n]</span></code></p></td> <td><p>n</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">real</span> 
<span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">float</span> <span class="pre">x</span></code></p></td> <td><p>4</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">real*4</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">float</span> <span class="pre">x</span></code></p></td> <td><p>4</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">real*8</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">double</span> <span class="pre">x</span></code></p></td> <td><p>8</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">double</span> <span class="pre">precision</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">double</span> <span class="pre">x</span></code></p></td> <td><p>8</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">integer</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">int</span> <span class="pre">x</span></code></p></td> <td><p>4</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">integer*1</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">signed</span> <span class="pre">char</span> <span class="pre">x</span></code></p></td> <td><p>1</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">integer*2</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">short</span> <span class="pre">x</span></code></p></td> <td><p>2</p></td> </tr> 
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">integer*4</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">int</span> <span class="pre">x</span></code></p></td> <td><p>4</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">integer*8</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">long</span> <span class="pre">long</span> <span class="pre">x</span></code></p></td> <td><p>8</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">logical</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">int</span> <span class="pre">x</span></code></p></td> <td><p>4</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">logical*1</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">char</span> <span class="pre">x</span></code></p></td> <td><p>1</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">logical*2</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">short</span> <span class="pre">x</span></code></p></td> <td><p>2</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">logical*4</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">int</span> <span class="pre">x</span></code></p></td> <td><p>4</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">logical*8</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">long</span> <span 
class="pre">x</span></code></p></td> <td><p>8</p></td> </tr> </tbody> </table> <table class="table-no-stripes docutils align-default" id="intr-lang-data-types-intr-lang-data-types-repr-tbl"> <caption><span class="caption-text">Table 29. Fortran and C/C++ Representation of the COMPLEX Type</span><a class="headerlink" href="#intr-lang-data-types-intr-lang-data-types-repr-tbl" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 33%" /> <col style="width: 38%" /> <col style="width: 28%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Fortran Type (lower case)</p></th> <th class="head"><p>C/C++ Type</p></th> <th class="head"><p>Size (bytes)</p></th> </tr> </thead> <tbody> <tr class="row-even"><td rowspan="2"><p><code class="docutils literal notranslate"><span class="pre">complex</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">struct</span> <span class="pre">{float</span> <span class="pre">r,i;}</span> <span class="pre">x;</span></code></p></td> <td><p>8</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">float</span> <span class="pre">complex</span> <span class="pre">x;</span></code></p></td> <td><p>8</p></td> </tr> <tr class="row-even"><td rowspan="2"><p><code class="docutils literal notranslate"><span class="pre">complex*8</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">struct</span> <span class="pre">{float</span> <span class="pre">r,i;}</span> <span class="pre">x;</span></code></p></td> <td><p>8</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">float</span> <span class="pre">complex</span> <span class="pre">x;</span></code></p></td> <td><p>8</p></td> </tr> <tr class="row-even"><td rowspan="2"><p><code class="docutils literal notranslate"><span class="pre">double</span> <span 
class="pre">complex</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">struct</span> <span class="pre">{double</span> <span class="pre">dr,di;}</span> <span class="pre">x;</span></code></p></td> <td><p>16</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">double</span> <span class="pre">complex</span> <span class="pre">x;</span></code></p></td> <td><p>16</p></td> </tr> <tr class="row-even"><td rowspan="2"><p><code class="docutils literal notranslate"><span class="pre">complex</span> <span class="pre">\*16</span> <span class="pre">x</span></code></p></td> <td><p><code class="docutils literal notranslate"><span class="pre">struct</span> <span class="pre">{double</span> <span class="pre">dr,di;}</span> <span class="pre">x;</span></code></p></td> <td><p>16</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">double</span> <span class="pre">complex</span> <span class="pre">x;</span></code></p></td> <td><p>16</p></td> </tr> </tbody> </table> <div class="admonition note"> <p class="admonition-title">Note</p> <p>For C/C++, the <code class="docutils literal notranslate"><span class="pre">complex</span></code> type implies C99 or later.</p> </div> <section id="fortran-named-common-blocks"> <h3><span class="section-number">14.5.1. </span>Fortran Named Common Blocks<a class="headerlink" href="#fortran-named-common-blocks" title="Permalink to this headline"></a></h3> <p>A named Fortran common block can be represented in C/C++ by a structure whose members correspond to the members of the common block. The name of the structure in C/C++ must have the added underscore. 
For example, here is a Fortran common block:</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="kt">INTEGER </span><span class="n">I</span><span class="w"></span> <span class="kt">COMPLEX </span><span class="n">C</span><span class="w"></span> <span class="kt">DOUBLE COMPLEX </span><span class="n">CD</span><span class="w"></span> <span class="kt">DOUBLE PRECISION </span><span class="n">D</span><span class="w"></span> <span class="k">COMMON</span><span class="w"> </span><span class="o">/</span><span class="n">COM</span><span class="o">/</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">cd</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="w"></span> </pre></div> </div> <p>This Fortran Common Block is represented in C with the following equivalent:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="kt">float</span><span class="w"> </span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span><span class="p">;}</span><span class="w"> </span><span class="n">c</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="kt">double</span><span class="w"> </span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span 
class="n">imag</span><span class="p">;}</span><span class="w"> </span><span class="n">cd</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">d</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"> </span><span class="n">com_</span><span class="p">;</span><span class="w"></span> </pre></div> </div> <p>This same Fortran Common Block is represented in C++ with the following equivalent:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="s">&quot;C&quot;</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="kt">float</span><span class="w"> </span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span><span class="p">;}</span><span class="w"> </span><span class="n">c</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="kt">double</span><span class="w"> </span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span><span class="p">;}</span><span class="w"> </span><span class="n">cd</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">d</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"> </span><span class="n">com_</span><span class="p">;</span><span class="w"></span> 
</pre></div> </div> <div class="admonition tip"> <p class="admonition-title">Tip</p> <p>For global or external data sharing, <code class="docutils literal notranslate"><span class="pre">extern</span> <span class="pre">&quot;C&quot;</span></code> is not required.</p> </div> </section> </section> <section id="argument-passing-and-return-values"> <h2><span class="section-number">14.6. </span>Argument Passing and Return Values<a class="headerlink" href="#argument-passing-and-return-values" title="Permalink to this headline"></a></h2> <p>In Fortran, arguments are passed by reference, that is, the address of the argument is passed, rather than the argument itself. In C/C++, arguments are passed by value, except for strings and arrays, which are passed by reference. Due to the flexibility provided in C/C++, you can work around these differences. Solving the parameter passing differences generally involves intelligent use of the <code class="docutils literal notranslate"><span class="pre">&amp;</span></code> and <code class="docutils literal notranslate"><span class="pre">*</span></code> operators in argument passing when C/C++ calls Fortran and in argument declarations when Fortran calls C/C++.</p> <p>For strings declared in Fortran as type <code class="docutils literal notranslate"><span class="pre">CHARACTER</span></code>, an argument representing the length of the string is also passed to a calling function.</p> <p>On Linux systems, the compiler places the length argument(s) at the end of the parameter list, following the other formal arguments.</p> <p>The length argument is passed by value, not by reference.</p> <section id="passing-by-value-val"> <h3><span class="section-number">14.6.1. 
</span>Passing by Value (%VAL)<a class="headerlink" href="#passing-by-value-val" title="Permalink to this headline"></a></h3> <p>When passing parameters from a Fortran subprogram to a C/C++ function, it is possible to pass by value using the <code class="docutils literal notranslate"><span class="pre">%VAL</span></code> function. If you enclose a Fortran parameter with <code class="docutils literal notranslate"><span class="pre">%VAL()</span></code>, the parameter is passed by value. For example, the following call passes the integer <code class="docutils literal notranslate"><span class="pre">i</span></code> and the logical <code class="docutils literal notranslate"><span class="pre">bvar</span></code> by value.</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="kt">integer</span><span class="o">*</span><span class="mi">1</span><span class="w"> </span><span class="n">i</span><span class="w"></span> <span class="kt">logical</span><span class="o">*</span><span class="mi">1</span><span class="w"> </span><span class="n">bvar</span><span class="w"></span> <span class="k">call </span><span class="n">cvalue</span><span class="w"> </span><span class="p">(%</span><span class="n">VAL</span><span class="p">(</span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="p">%</span><span class="n">VAL</span><span class="p">(</span><span class="n">bvar</span><span class="p">))</span><span class="w"></span> </pre></div> </div> </section> <section id="character-return-values"> <h3><span class="section-number">14.6.2. </span>Character Return Values<a class="headerlink" href="#character-return-values" title="Permalink to this headline"></a></h3> <p><a class="reference internal" href="#intr-lang-funcs-subs"><span class="std std-ref">Functions and Subroutines</span></a> describes the general rules for return values for C/C++ and Fortran inter-language calling. 
There is a special return value to consider. When a Fortran function returns a character, two arguments need to be added at the beginning of the C/C++ calling function’s argument list:</p> <ul class="simple"> <li><p>The address of the return character or characters</p></li> <li><p>The length of the return character</p></li> </ul> <p>The following example illustrates the extra parameters, <code class="docutils literal notranslate"><span class="pre">tmp</span></code> and <code class="docutils literal notranslate"><span class="pre">10</span></code>, supplied by the caller:</p> <p class="title sectiontitle rubric" id="character-return-parameters">Character Return Parameters</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="c">! Fortran function returns a character</span> <span class="kt">CHARACTER</span><span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FUNCTION </span><span class="n">CHF</span><span class="p">(</span><span class="n">C1</span><span class="p">,</span><span class="n">I</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">CHARACTER</span><span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="n">C1</span><span class="w"></span> <span class="w"> </span><span class="kt">INTEGER </span><span class="n">I</span><span class="w"></span> <span class="k">END</span><span class="w"></span> </pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cm">/* C declaration of Fortran function */</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">chf_</span><span class="p">();</span><span class="w"></span> <span class="kt">char</span><span class="w"> </span><span 
class="n">tmp</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span><span class="w"></span> <span class="kt">char</span><span class="w"> </span><span class="n">c1</span><span class="p">[</span><span class="mi">9</span><span class="p">];</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="n">chf_</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">,</span><span class="w"> </span><span class="n">c1</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="mi">9</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>If the Fortran function is declared to return a character value of constant length, for example <code class="docutils literal notranslate"><span class="pre">CHARACTER*4</span> <span class="pre">FUNCTION</span> <span class="pre">CHF()</span></code>, the second extra parameter representing the length must still be supplied, but is not used.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The value of the character function is not automatically NULL-terminated.</p> </div> </section> <section id="complex-return-values"> <h3><span class="section-number">14.6.3. </span>Complex Return Values<a class="headerlink" href="#complex-return-values" title="Permalink to this headline"></a></h3> <p>When a Fortran function returns a complex value, an argument needs to be added at the beginning of the C/C++ calling function’s argument list; this argument is the address of the complex return value. 
<a class="reference internal" href="#intr-lang-arg-cmplx-rtn-val-complx-rtn-val-exam"><span class="std std-ref">COMPLEX Return Values</span></a> illustrates the extra parameter, <code class="docutils literal notranslate"><span class="pre">cplx</span></code>, supplied by the caller.</p> <p class="title sectiontitle rubric" id="intr-lang-arg-cmplx-rtn-val-complx-rtn-val-exam">COMPLEX Return Values</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">COMPLEX</span><span class="w"> </span><span class="n">FUNCTION</span><span class="w"> </span><span class="n">CF</span><span class="p">(</span><span class="n">C</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">INTEGER</span><span class="w"> </span><span class="n">I</span><span class="w"></span> <span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">.</span><span class="w"></span> <span class="n">END</span><span class="w"></span> </pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">cf_</span><span class="p">();</span><span class="w"></span> <span class="k">typedef</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="p">{</span><span class="kt">float</span><span class="w"> </span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span><span class="p">;}</span><span class="w"> </span><span class="n">cplx</span><span class="p">;</span><span class="w"></span> <span class="n">cplx</span><span class="w"> </span><span class="n">c1</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> 
</span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="n">cf_</span><span class="p">(</span><span class="o">&amp;</span><span class="n">c1</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">i</span><span class="p">);</span><span class="w"></span> </pre></div> </div> </section> </section> <section id="array-indices"> <h2><span class="section-number">14.7. </span>Array Indices<a class="headerlink" href="#array-indices" title="Permalink to this headline"></a></h2> <p>C/C++ arrays and Fortran arrays use different default initial array index values. By default, arrays in C/C++ start at 0 and arrqays in Fortran start at 1. If you adjust your array comparisons so that a Fortran second element is compared to a C/C++ first element, and adjust similarly for other elements, you should not have problems working with this difference. If this is not satisfactory, you can declare your Fortran arrays to start at zero.</p> <p>Another difference between Fortran and C/C++ arrays is the storage method used. Fortran uses column-major order and C/C++ uses row-major order. For one-dimensional arrays, this poses no problems. For two-dimensional arrays, where there are an equal number of rows and columns, row and column indexes can simply be reversed. For arrays other than single dimensional arrays, and square two-dimensional arrays, inter-language function mixing is not recommended.</p> </section> <section id="id8"> <h2><span class="section-number">14.8. </span>Examples<a class="headerlink" href="#id8" title="Permalink to this headline"></a></h2> <p>This section contains examples that illustrate inter-language calling.</p> <section id="example-fortran-calling-c"> <h3><span class="section-number">14.8.1. 
</span>Example – Fortran Calling C<a class="headerlink" href="#example-fortran-calling-c" title="Permalink to this headline"></a></h3> <div class="admonition note"> <p class="admonition-title">Note</p> <p>There are solutions for calling C from Fortran other than the one presented in this section. For example, you can use the <code class="docutils literal notranslate"><span class="pre">iso_c_binding</span></code> intrinsic module, which NVIDIA supports. For more information on this module and for examples of how to use it, search the web using the keyword iso_c_binding.</p> </div> <p><a class="reference internal" href="#intr-lang-exam-fort-c-intr-lang-fort-c-exam-subr"><span class="std std-ref">C function f2c_func_</span></a> shows a C function that is called by the Fortran main program shown in <a class="reference internal" href="#intr-lang-exam-fort-c-intr-lang-fort-c-exam"><span class="std std-ref">Fortran Main Program f2c_main.f</span></a>. Notice that each argument is defined as a pointer, since Fortran passes by reference. 
Also notice that the C function name uses all lower-case and a trailing “_”.</p> <p class="title sectiontitle rubric" id="intr-lang-exam-fort-c-intr-lang-fort-c-exam">Fortran Main Program f2c_main.f</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="kt">logical</span><span class="o">*</span><span class="mi">1</span><span class="w"> </span><span class="n">bool1</span><span class="w"></span> <span class="w"> </span><span class="kt">character </span><span class="n">letter1</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="o">*</span><span class="mi">4</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="w"></span> <span class="w"> </span><span class="kt">real </span><span class="n">numfloat1</span><span class="w"></span> <span class="w"> </span><span class="kt">double precision </span><span class="n">numdoub1</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="o">*</span><span class="mi">2</span><span class="w"> </span><span class="n">numshor1</span><span class="w"></span> <span class="w"> </span><span class="k">external </span><span class="n">f2c_func</span><span class="w"></span> <span class="w"> </span><span class="k">call </span><span class="n">f2c_func</span><span class="p">(</span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="p">,</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="p">)</span><span class="w"></span> <span 
class="w"> </span><span class="k">write</span><span class="p">(</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;(L2, A2, I5, I5, F6.1, F6.1, I5)&quot;</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="p">,</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="w"></span> <span class="w"> </span><span class="k">end</span><span class="w"></span> </pre></div> </div> <p class="title sectiontitle rubric" id="intr-lang-exam-fort-c-intr-lang-fort-c-exam-subr">C function f2c_func_</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#define TRUE 0xff</span> <span class="cp">#define FALSE 0</span> <span class="kt">void</span><span class="w"> </span><span class="nf">f2c_func_</span><span class="p">(</span><span class="w"> </span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="p">,</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span>\ <span class="w"> </span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="p">,</span><span class="w"> </span><span class="n">len_letter1</span><span class="p">)</span><span class="w"></span> <span 
class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="n">letter1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="n">numint2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">numfloat1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="o">*</span><span class="n">numdoub1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">short</span><span class="w"> </span><span class="o">*</span><span class="n">numshor1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">len_letter1</span><span class="p">;</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">bool1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TRUE</span><span class="p">;</span><span class="w"> </span><span class="o">*</span><span class="n">letter1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="sc">&#39;v&#39;</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">numint1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">11</span><span class="p">;</span><span class="w"> </span><span class="o">*</span><span 
class="n">numint2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">-44</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">numfloat1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">39.6</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">numdoub1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">39.2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">numshor1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">981</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>Compile and execute the program <code class="docutils literal notranslate"><span class="pre">f2c_main.f</span></code> with the call to <code class="docutils literal notranslate"><span class="pre">f2c_func_</span></code> using the following command lines:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -c f2c_func.c $ nvfortran f2c_func.o f2c_main.f </pre></div> </div> <p>Executing the <code class="docutils literal notranslate"><span class="pre">a.out</span></code> file should produce the following output:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>T v 11 -44 39.6 39.2 981 </pre></div> </div> </section> <section id="example-c-calling-fortran"> <h3><span class="section-number">14.8.2. 
</span>Example – C Calling Fortran<a class="headerlink" href="#example-c-calling-fortran" title="Permalink to this headline"></a></h3> <div class="admonition note"> <p class="admonition-title">Note</p> <p>There are solutions for calling Fortran from C other than the one presented in this section. For example, you can use the <code class="docutils literal notranslate"><span class="pre">ISO_Fortran_binding.h</span></code> C header file, which NVIDIA supports. For more information on this header file and for examples of how to use it, search the web using the keyword ISO_Fortran_binding.</p> </div> <p>The example <a class="reference internal" href="#intr-lang-exam-c-fort-intr-lang-c-fort-exam"><span class="std std-ref">C Main Program c2f_main.c</span></a> shows a C main program that calls the Fortran subroutine shown in <a class="reference internal" href="#intr-lang-exam-c-fort-intr-lang-c-fort-exam-sub"><span class="std std-ref">Fortran Subroutine c2f_sub.f</span></a>.</p> <ul class="simple"> <li><p>Each call uses the &amp; operator to pass by reference.</p></li> <li><p>The call to the Fortran subroutine uses all lower-case and a trailing “_”.</p></li> </ul> <p class="title sectiontitle rubric" id="intr-lang-exam-c-fort-intr-lang-c-fort-exam">C Main Program c2f_main.c</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdio.h&gt;</span><span class="cp"></span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="w"> </span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span
class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">short</span><span class="w"> </span><span class="n">numshor1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">c2f_func_</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="n">c2f_func_</span><span class="p">(</span><span class="o">&amp;</span><span class="n">bool1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">letter1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numint1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numint2</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numfloat1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numdoub1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numshor1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot; %s %c %d %d %3.1f %.0f %d</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">bool1</span><span class="o">?</span><span class="s">&quot;TRUE&quot;</span><span class="o">:</span><span class="s">&quot;FALSE&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span
class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p class="title sectiontitle rubric" id="intr-lang-exam-c-fort-intr-lang-c-fort-exam-sub">Fortran Subroutine c2f_sub.f</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="k">subroutine </span><span class="n">c2f_func</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="p">,</span><span class="w"></span> <span class="o">+</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">logical</span><span class="o">*</span><span class="mi">1</span><span class="w"> </span><span class="n">bool1</span><span class="w"></span> <span class="w"> </span><span class="kt">character </span><span class="n">letter1</span><span class="w"></span> <span class="w"> </span><span class="kt">integer </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="w"></span> <span class="w"> </span><span class="kt">double precision </span><span 
class="n">numdoub1</span><span class="w"></span> <span class="w"> </span><span class="kt">real </span><span class="n">numfloat1</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="o">*</span><span class="mi">2</span><span class="w"> </span><span class="n">numshor1</span><span class="w"></span> <span class="w"> </span><span class="n">bool1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">.</span><span class="n">true</span><span class="p">.</span><span class="w"></span> <span class="w"> </span><span class="n">letter1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">&quot;v&quot;</span><span class="w"></span> <span class="w"> </span><span class="n">numint1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">11</span><span class="w"></span> <span class="w"> </span><span class="n">numint2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="mi">44</span><span class="w"></span> <span class="w"> </span><span class="n">numdoub1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">902</span><span class="w"></span> <span class="w"> </span><span class="n">numfloat1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">3</span><span class="mf">9.6</span><span class="w"></span> <span class="w"> </span><span class="n">numshor1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">299</span><span class="w"></span> <span class="w"> </span><span class="k">return</span> <span class="k">end</span><span class="w"></span> </pre></div> </div> <p>To compile this Fortran subroutine and C program, use the following commands:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -c c2f_main.c $ 
nvfortran -Mnomain c2f_main.o c2f_sub.f </pre></div> </div> <p>Executing the resulting <code class="docutils literal notranslate"><span class="pre">a.out</span></code> file should produce the following output:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>TRUE v 11 -44 39.6 902 299 </pre></div> </div> </section> <section id="example-c-calling-c"> <h3><span class="section-number">14.8.3. </span>Example – C++ Calling C<a class="headerlink" href="#example-c-calling-c" title="Permalink to this headline"></a></h3> <p><a class="reference internal" href="#intr-lang-exam-cpp-c-intr-lang-cpp-c-exam"><span class="std std-ref">C++ Main Program cp2c_main.C Calling a C Function</span></a> shows a C++ main program that calls the C function shown in <a class="reference internal" href="#intr-lang-exam-cpp-c-intr-lang-cpp-c-exam-subr"><span class="std std-ref">Simple C Function cp2c_func.c</span></a>.</p> <p class="title sectiontitle rubric" id="intr-lang-exam-cpp-c-intr-lang-cpp-c-exam">C++ Main Program cp2c_main.C Calling a C Function</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="s">&quot;C&quot;</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">cp2c_func</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="n">p</span><span class="p">);</span><span class="w"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;iostream&gt;</span><span class="cp"></span> <span class="k">using</span><span class="w"> </span><span class="k">namespace</span><span class="w"> </span><span class="n">std</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">main</span><span class="p">()</span><span class="w"></span> <span
class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="o">=</span><span class="mi">8</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;main: a = &quot;</span><span class="o">&lt;&lt;</span><span class="n">a</span><span class="o">&lt;&lt;</span><span class="s">&quot; b = &quot;</span><span class="o">&lt;&lt;</span><span class="n">b</span><span class="o">&lt;&lt;</span><span class="s">&quot;ptr c = &quot;</span><span class="o">&lt;&lt;</span><span class="n">hex</span><span class="o">&lt;&lt;&amp;</span><span class="n">c</span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cp2c_func</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="o">&amp;</span><span class="n">c</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;main: res = &quot;</span><span class="o">&lt;&lt;</span><span class="n">c</span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span 
class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> </pre></div> </div> <p class="title sectiontitle rubric" id="intr-lang-exam-cpp-c-intr-lang-cpp-c-exam-subr">Simple C Function cp2c_func.c</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdio.h&gt;</span><span class="cp"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">cp2c_func</span><span class="p">(</span><span class="n">num1</span><span class="p">,</span><span class="w"> </span><span class="n">num2</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="p">)</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">num1</span><span class="p">,</span><span class="w"> </span><span class="n">num2</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="n">res</span><span class="p">;</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;func: a = %d b = %d ptr c = %x</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="n">num1</span><span class="p">,</span><span class="n">num2</span><span class="p">,</span><span class="n">res</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">res</span><span class="o">=</span><span class="n">num1</span><span class="o">/</span><span class="n">num2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;func: res = %d</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="o">*</span><span class="n">res</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>To compile this C 
function and C++ main program, use the following commands:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -c cp2c_func.c
$ nvc++ cp2c_main.C cp2c_func.o
</pre></div> </div> <p>Executing the resulting a.out file should produce the following output:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>main: a = 8 b = 2 ptr c = 0xbffffb94
func: a = 8 b = 2 ptr c = bffffb94
func: res = 4
main: res = 4
</pre></div> </div> </section> <section id="id9"> <h3><span class="section-number">14.8.4. </span>Example – C Calling C++<a class="headerlink" href="#id9" title="Permalink to this headline"></a></h3> <p>The example in <a class="reference internal" href="#intr-lang-exam-c-cpp-intr-lang-c-cpp-exam"><span class="std std-ref">C Main Program c2cp_main.c Calling a C++ Function</span></a> shows a C main program that calls the C++ function shown in <a class="reference internal" href="#intr-lang-exam-c-cpp-intr-lang-c-cpp-extern-exam"><span class="std std-ref">Simple C++ Function c2cp_func.C with Extern C</span></a>.</p> <p class="title sectiontitle rubric" id="intr-lang-exam-c-cpp-intr-lang-c-cpp-exam">C Main Program c2cp_main.c Calling a C++ Function</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">c2cp_func</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="n">c</span><span class="p">);</span><span class="w"></span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdio.h&gt;</span><span class="cp"></span>
<span 
class="n">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="o">=</span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;main: a = %d b = %d ptr c = %x</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="o">&amp;</span><span class="n">c</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">c2cp_func</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="o">&amp;</span><span class="n">c</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;main: res = %d</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="n">c</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> </pre></div> </div> <p class="title sectiontitle rubric" id="intr-lang-exam-c-cpp-intr-lang-c-cpp-extern-exam">Simple C++ Function c2cp_func.C with Extern C</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span 
class="cpf">&lt;iostream&gt;</span><span class="cp"></span> <span class="k">extern</span><span class="w"> </span><span class="s">&quot;C&quot;</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">c2cp_func</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">num1</span><span class="p">,</span><span class="kt">int</span><span class="w"> </span><span class="n">num2</span><span class="p">,</span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="n">res</span><span class="p">)</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;func: a = &quot;</span><span class="o">&lt;&lt;</span><span class="n">num1</span><span class="o">&lt;&lt;</span><span class="s">&quot; b = &quot;</span><span class="o">&lt;&lt;</span><span class="n">num2</span><span class="o">&lt;&lt;</span><span class="s">&quot;ptr c =&quot;</span><span class="o">&lt;&lt;</span><span class="n">res</span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">res</span><span class="o">=</span><span class="n">num1</span><span class="o">/</span><span class="n">num2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;func: res = &quot;</span><span class="o">&lt;&lt;</span><span class="n">res</span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>To compile this C function and C++ main program, use the following commands:</p> 
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc -c c2cp_main.c $ nvc++ c2cp_main.o c2cp_func.C </pre></div> </div> <p>Executing the resulting a.out file should produce the following output:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>main: a = 8 b = 2 ptr c = 0xbffffb94 func: a = 8 b = 2 ptr c = bffffb94 func: res = 4 main: res = 4 </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>You cannot use the extern “C” form of declaration for an object’s member functions.</p> </div> </section> <section id="id10"> <h3><span class="section-number">14.8.5. </span>Example – Fortran Calling C++<a class="headerlink" href="#id10" title="Permalink to this headline"></a></h3> <p>The Fortran main program shown in <a class="reference internal" href="#intr-lang-exam-fort-cpp-lang-fort-cpp-exam"><span class="std std-ref">Fortran Main Program f2cp_main.f calling a C++ function</span></a> calls the C++ function shown in <a class="reference internal" href="#intr-lang-exam-fort-cpp-lang-fort-cpp-exam-subr"><span class="std std-ref">C++ function f2cp_func.C</span></a> .</p> <p>Notice:</p> <ul class="simple"> <li><p>Each argument is defined as a pointer in the C++ function, since Fortran passes by reference.</p></li> <li><p>The C++ function name uses all lower-case and a trailing “_”:</p></li> </ul> <p class="title sectiontitle rubric" id="intr-lang-exam-fort-cpp-lang-fort-cpp-exam">Fortran Main Program f2cp_main.f calling a C++ function</p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="w"> </span><span class="kt">logical</span><span class="o">*</span><span class="mi">1</span><span class="w"> </span><span class="n">bool1</span><span class="w"></span> <span class="w"> </span><span class="kt">character </span><span class="n">letter1</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span 
class="o">*</span><span class="mi">4</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="w"></span> <span class="w"> </span><span class="kt">real </span><span class="n">numfloat1</span><span class="w"></span> <span class="w"> </span><span class="kt">double precision </span><span class="n">numdoub1</span><span class="w"></span> <span class="w"> </span><span class="kt">integer</span><span class="o">*</span><span class="mi">2</span><span class="w"> </span><span class="n">numshor1</span><span class="w"></span> <span class="k">external </span><span class="n">f2cp_func</span><span class="w"></span> <span class="k">call </span><span class="n">f2cp_func</span><span class="w"> </span><span class="p">(</span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"></span> <span class="o">+</span><span class="w"> </span><span class="n">numint2</span><span class="p">,</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">write</span><span class="p">(</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;(L2, A2, I5, I5, F6.1, F6.1, I5)&quot;</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span 
class="n">numint2</span><span class="p">,</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="w"></span> <span class="k">end</span><span class="w"></span> </pre></div> </div> <p class="title sectiontitle rubric" id="intr-lang-exam-fort-cpp-lang-fort-cpp-exam-subr">C++ function f2cp_func.C</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#define TRUE 0xff</span> <span class="cp">#define FALSE 0</span> <span class="k">extern</span><span class="w"> </span><span class="s">&quot;C&quot;</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">f2cp_func_</span><span class="w"> </span><span class="p">(</span><span class="w"></span> <span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="n">letter1</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="n">numint2</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">numfloat1</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="o">*</span><span class="n">numdoub1</span><span class="p">,</span><span class="w"></span> 
<span class="w"> </span><span class="kt">short</span><span class="w"> </span><span class="o">*</span><span class="n">numshort1</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">len_letter1</span><span class="p">)</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">bool1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TRUE</span><span class="p">;</span><span class="w"> </span><span class="o">*</span><span class="n">letter1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="sc">&#39;v&#39;</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">numint1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">11</span><span class="p">;</span><span class="w"> </span><span class="o">*</span><span class="n">numint2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">-44</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="o">*</span><span class="n">numfloat1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">39.6</span><span class="p">;</span><span class="w"> </span><span class="o">*</span><span class="n">numdoub1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">39.2</span><span class="p">;</span><span class="w"> </span><span class="o">*</span><span class="n">numshort1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">981</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> 
<p>Assuming the Fortran program is in the file f2cp_main.f and the C++ function is in the file f2cp_func.C, create an executable using the following commands:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvc++ -c f2cp_func.C
$ nvfortran f2cp_func.o f2cp_main.f -c++libs
</pre></div> </div> <p>Executing the a.out file should produce the following output:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>T v 11 -44 39.6 39.2 981 </pre></div> </div> <span class="target" id="intr-lang-exam-cpp-fort"></span></section> <section id="id11"> <h3><span class="section-number">14.8.6. </span>Example – C++ Calling Fortran<a class="headerlink" href="#id11" title="Permalink to this headline"></a></h3> <p><a class="reference internal" href="#intr-lang-exam-cpp-fort-intr-lang-cpp-fort-exam-subr"><span class="std std-ref">Fortran Subroutine cp2f_func.f</span></a> shows a Fortran subroutine called by the C++ main program shown in <a class="reference internal" href="#intr-lang-exam-cpp-fort-intr-lang-cpp-fort-exam"><span class="std std-ref">C++ main program cp2f_main.C</span></a>. Notice that the call passes each argument by reference using the <code class="docutils literal notranslate"><span class="pre">&amp;</span></code> operator. 
Also notice that the call to the Fortran subroutine uses all lower-case and a trailing “<code class="docutils literal notranslate"><span class="pre">_</span></code>”:</p> <p class="title sectiontitle rubric" id="intr-lang-exam-cpp-fort-intr-lang-cpp-fort-exam">C++ main program cp2f_main.C</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;iostream&gt;</span><span class="cp"></span> <span class="k">extern</span><span class="w"> </span><span class="s">&quot;C&quot;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">extern</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">cp2f_func_</span><span class="p">(</span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="kt">int</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="kt">double</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="kt">short</span><span class="w"> </span><span class="o">*</span><span class="p">);</span><span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="n">main</span><span class="w"> </span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span 
class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">short</span><span class="w"> </span><span class="n">numshor1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cp2f_func_</span><span class="p">(</span><span class="o">&amp;</span><span class="n">bool1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">letter1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numint1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numint2</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numfloat1</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">numdoub1</span><span class="p">,</span><span class="o">&amp;</span><span class="n">numshor1</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot; bool1 = &quot;</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">bool1</span><span class="o">?</span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;TRUE &quot;</span><span class="o">:</span><span class="n">cout</span><span class="w"> </span><span 
class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot;FALSE &quot;</span><span class="p">;</span><span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot; letter1 = &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">letter1</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot; numint1 = &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">numint1</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot; numint2 = &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">numint2</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot; numfloat1 = &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">numfloat1</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span 
class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot; numdoub1 = &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">numdoub1</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">&quot; numshor1 = &quot;</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">numshor1</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="n">endl</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p class="title sectiontitle rubric" id="intr-lang-exam-cpp-fort-intr-lang-cpp-fort-exam-subr">Fortran Subroutine cp2f_func.f</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">subroutine</span><span class="w"> </span><span class="nf">cp2f_func</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="n">bool1</span><span class="p">,</span><span class="w"> </span><span class="n">letter1</span><span class="p">,</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"></span> <span class="o">+</span><span class="w"> </span><span class="n">numint2</span><span class="p">,</span><span class="w"> </span><span class="n">numfloat1</span><span class="p">,</span><span class="w"> </span><span class="n">numdoub1</span><span class="p">,</span><span class="w"> </span><span class="n">numshor1</span><span class="p">)</span><span class="w"></span> <span class="n">logical</span><span class="o">*</span><span class="mi">1</span><span class="w"> </span><span class="n">bool1</span><span class="w"></span> 
<span class="n">character</span><span class="w"> </span><span class="n">letter1</span><span class="w"></span> <span class="n">integer</span><span class="w"> </span><span class="n">numint1</span><span class="p">,</span><span class="w"> </span><span class="n">numint2</span><span class="w"></span> <span class="kt">double</span><span class="w"> </span><span class="n">precision</span><span class="w"> </span><span class="n">numdoub1</span><span class="w"></span> <span class="n">real</span><span class="w"> </span><span class="n">numfloat1</span><span class="w"></span> <span class="n">integer</span><span class="o">*</span><span class="mi">2</span><span class="w"> </span><span class="n">numshor1</span><span class="w"></span> <span class="n">bool1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">.</span><span class="nb">true</span><span class="p">.</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">letter1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">&quot;v&quot;</span><span class="w"></span> <span class="n">numint1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">11</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">numint2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">-44</span><span class="w"></span> <span class="n">numdoub1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">902</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">numfloat1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">39.6</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">numshor1</span><span class="w"> </span><span class="o">=</span><span class="w"> 
</span><span class="mi">299</span><span class="w"></span> <span class="k">return</span><span class="w"></span> <span class="n">end</span><span class="w"></span> </pre></div> </div> <p>To compile this Fortran subroutine and C++ program, use the following command lines:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ nvfortran -c cp2f_func.f $ nvc++ cp2f_func.o cp2f_main.C -fortranlibs </pre></div> </div> <p>Executing this C++ main should produce the following output:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>bool1 = TRUE letter1 = v numint1 = 11 numint2 = -44 numfloat1 = 39.6 numdoub1 = 902 numshor1 = 299 </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>You must explicitly link in the NVFORTRAN runtime support libraries when linking nvfortran-compiled program units into C++ or C main programs.</p> </div> <span class="target" id="prog-64bits"></span></section> </section> </section> <section id="programming-considerations-for-64-bit-environments"> <h1><span class="section-number">15. </span>Programming Considerations for 64-Bit Environments<a class="headerlink" href="#programming-considerations-for-64-bit-environments" title="Permalink to this headline"></a></h1> <p>NVIDIA provides 64-bit compilers for 64-bit Linux operating systems running on x86-64 (Linux_x86_64) and Arm Server (Linux_aarch64) architectures. You can use these compilers to create programs that use 64-bit memory addresses. The GNU toolchain on 64-bit Linux systems implements an option to control 32-bit vs 64-bit code generation, as described in <a class="reference internal" href="#prog-64bits-static-data-linux"><span class="std std-ref">Large Static Data in Linux</span></a>. 
This section describes the specifics of how to use the NVIDIA compilers to make use of 64-bit memory addressing.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The NVIDIA HPC compilers themselves are 64-bit applications which can only run on 64-bit CPUs running 64-bit Operating Systems.</p> </div> <p>This section describes how to use the following options related to 64-bit programming.</p> <ul class="simple"> <li><p><code class="docutils literal notranslate"><span class="pre">-fPIC</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Mlarge_arrays</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-i8</span></code></p></li> <li><p><code class="docutils literal notranslate"><span class="pre">-Mlargeaddressaware</span></code></p></li> </ul> <section id="data-types-in-the-64-bit-environment"> <h2><span class="section-number">15.1. </span>Data Types in the 64-Bit Environment<a class="headerlink" href="#data-types-in-the-64-bit-environment" title="Permalink to this headline"></a></h2> <p>The size of some data types can differ across 64-bit environments. This section describes the major differences.</p> <section id="c-and-c-data-types"> <h3><span class="section-number">15.1.1. </span>C++ and C Data Types<a class="headerlink" href="#c-and-c-data-types" title="Permalink to this headline"></a></h3> <p>On 64-bit Linux operating systems, the size of an int is 4 bytes, a long is 8 bytes, a long long is 8 bytes, and a pointer is 8 bytes.</p> </section> <section id="fortran-data-types"> <h3><span class="section-number">15.1.2. </span>Fortran Data Types<a class="headerlink" href="#fortran-data-types" title="Permalink to this headline"></a></h3> <p>In Fortran, the default size of the INTEGER type is 4 bytes. 
The <code class="docutils literal notranslate"><span class="pre">-i8</span></code> compiler option may be used to make the default size of all INTEGER data in the program 8 bytes.</p> <p>When using the <code class="docutils literal notranslate"><span class="pre">-Mlarge_arrays</span></code> option, described in <a class="reference internal" href="#prog-64bits-array-indexing"><span class="std std-ref">64-Bit Array Indexing</span></a>, any 4-byte INTEGER variables that are used to index arrays are silently promoted by the compiler to 8 bytes. This promotion can lead to unexpected consequences, so 8-byte INTEGER variables are recommended for array indexing when using the option <code class="docutils literal notranslate"><span class="pre">-Mlarge_arrays</span></code>.</p> <span class="target" id="prog-64bits-static-data-linux"></span></section> </section> <section id="large-static-data-in-linux"> <h2><span class="section-number">15.2. </span>Large Static Data in Linux<a class="headerlink" href="#large-static-data-in-linux" title="Permalink to this headline"></a></h2> <p>64-bit Linux operating systems support two different memory models. The default model used by the NVIDIA HPC compilers on Linux_x86_64 and Linux_aarch64 targets is the small memory model, which can be specified using -mcmodel=small. This is the 32-bit model, which limits the size of code plus statically allocated data, including system and user libraries, to 2GB. The medium memory model, specified by -mcmodel=medium, allows combined code and static data areas (.text and .bss sections) larger than 2GB. The <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code> option must be used on both the compile command and the link command in order to take effect.</p> <p>There are implications to using <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code>. 
The generated code requires increased addressing overhead to support the large data range. This can affect performance, though the compilers seek to minimize the added overhead through careful instruction selection and optimization.</p> <p>Linux_aarch64 does not support -mcmodel=medium. If the medium model is specified on the command-line, the compiler driver will automatically select the large model.</p> </section> <section id="large-dynamically-allocated-data"> <h2><span class="section-number">15.3. </span>Large Dynamically Allocated Data<a class="headerlink" href="#large-dynamically-allocated-data" title="Permalink to this headline"></a></h2> <p>Dynamically allocated data objects in programs compiled by the NVIDIA HPC compilers can be larger than 2GB. No special compiler options are required to enable this functionality. The size of the allocation is only limited by the system. However, to correctly access dynamically allocated arrays with more than 2G elements you should use the <code class="docutils literal notranslate"><span class="pre">-Mlarge_arrays</span></code> option, described in the following section.</p> <span class="target" id="prog-64bits-array-indexing"></span></section> <section id="bit-array-indexing"> <h2><span class="section-number">15.4. </span>64-Bit Array Indexing<a class="headerlink" href="#bit-array-indexing" title="Permalink to this headline"></a></h2> <p>The NVIDIA Fortran compilers provide an option, <code class="docutils literal notranslate"><span class="pre">-Mlarge_arrays</span></code>, that enables 64-bit indexing of arrays. 
This means that, as necessary, 64-bit INTEGER constants and variables are used to index arrays.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>In the presence of <code class="docutils literal notranslate"><span class="pre">-Mlarge_arrays</span></code>, the compiler may silently promote 32-bit integers to 64 bits, which can have unexpected side effects.</p> </div> <p>On 64-bit Linux, the <code class="docutils literal notranslate"><span class="pre">-Mlarge_arrays</span></code> option also enables single static data objects larger than 2 GB. This option is the default in the presence of <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code>.</p> </section> <section id="compiler-options-for-64-bit-programming"> <h2><span class="section-number">15.5. </span>Compiler Options for 64-bit Programming<a class="headerlink" href="#compiler-options-for-64-bit-programming" title="Permalink to this headline"></a></h2> <p>The usual switches that apply to 64-bit programmers seeking to increase the data range of their applications are in the following table.</p> <table class="table-no-stripes docutils align-default" id="id34"> <caption><span class="caption-text">Table 30. 64-bit Compiler Options</span><a class="headerlink" href="#id34" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 7%" /> <col style="width: 37%" /> <col style="width: 56%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Option</p></th> <th class="head"><p>Purpose</p></th> <th class="head"><p>Considerations</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-⁠mcmodel=medium</span></code></p></td> <td><p>Allow for data declarations larger than 2GB.</p></td> <td><p>Linux_aarch64 does not support -mcmodel=medium. 
If the medium model is specified on the command-line, the compiler driver will automatically select the large model.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-⁠Mlarge_arrays</span></code></p></td> <td><p>Perform all array-location-to-address calculations using 64-bit integer arithmetic.</p></td> <td><p>Slightly slower execution. Is implicit with <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code>. Can be used with option <code class="docutils literal notranslate"><span class="pre">-mcmodel=small</span></code>.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">-⁠fpic</span></code></p></td> <td><p>Position independent code. Necessary for shared libraries.</p></td> <td><p>Dynamic linking restricted to a 32-bit offset. External symbol references should refer to other shared lib routines, rather than the program calling them.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">-⁠i8</span></code></p></td> <td><p>All INTEGER functions, data, and constants not explicitly declared INTEGER*4 are assumed to be INTEGER*8.</p></td> <td><p>Users should take care to explicitly declare INTEGER functions as INTEGER*4.</p></td> </tr> </tbody> </table> <p>The following table summarizes the limits of these programming models under the specified conditions. The compiler options you use vary by processor.</p> <table class="table-no-stripes docutils align-default" id="id35"> <caption><span class="caption-text">Table 31. 
Effects of Options on Memory and Array Sizes</span><a class="headerlink" href="#id35" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 63%" /> <col style="width: 6%" /> <col style="width: 10%" /> <col style="width: 11%" /> <col style="width: 5%" /> <col style="width: 5%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Condition</p></th> <th class="head" colspan="2"><p>Addr. Math</p></th> <th class="head" colspan="3"><p>Max Size Gbytes</p></th> </tr> </thead> <tbody> <tr class="row-even"><td></td> <td><p>A</p></td> <td><p>I</p></td> <td><p>AS</p></td> <td><p>DS</p></td> <td><p>TS</p></td> </tr> <tr class="row-odd"><td><p>64-bit addr limited by option <code class="docutils literal notranslate"><span class="pre">-mcmodel=small</span></code></p></td> <td><p>64</p></td> <td><p>32</p></td> <td><p>2</p></td> <td><p>2</p></td> <td><p>2</p></td> </tr> <tr class="row-even"><td><p>-fpic <em>incompatible with</em> <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code></p></td> <td><p>64</p></td> <td><p>32</p></td> <td><p>2</p></td> <td><p>2</p></td> <td><p>2</p></td> </tr> <tr class="row-odd"><td><p>Enable full support for 64-bit data addressing</p></td> <td><p>64</p></td> <td><p>64</p></td> <td><p>&gt;2</p></td> <td><p>&gt;2</p></td> <td><p>&gt;2</p></td> </tr> </tbody> </table> <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 6%" /> <col style="width: 94%" /> </colgroup> <tbody> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">A</span></code></p></td> <td><p>Address Type – size in bits of data used for address calculations, 64-bits.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">I</span></code></p></td> <td><p>Index Arithmetic -bit-size of data used to index into arrays and other aggregate data structures. 
If 32-bit, total range of any single data object is limited to 2GB.</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">AS</span></code></p></td> <td><p>Maximum Array Size - the maximum size in gigabytes of any single data object.</p></td> </tr> <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">DS</span></code></p></td> <td><p>Maximum Data Size - max size in gigabytes combined of all data objects in .bss</p></td> </tr> <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">TS</span></code></p></td> <td><p>Maximum Total Size - max size in gigabytes, in aggregate, of all executable code and data objects in a running program.</p></td> </tr> </tbody> </table> </section> <section id="practical-limitations-of-large-array-programming"> <h2><span class="section-number">15.6. </span>Practical Limitations of Large Array Programming<a class="headerlink" href="#practical-limitations-of-large-array-programming" title="Permalink to this headline"></a></h2> <p>The 64-bit addressing capability of 64-bit Linux environments can cause unexpected issues when data sizes are enlarged significantly. The following table describes the most common occurrences of practical limitations of large array programming.</p> <table class="table-no-stripes docutils align-default" id="id36"> <caption><span class="caption-text">Table 32. 64-Bit Limitations</span><a class="headerlink" href="#id36" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 15%" /> <col style="width: 85%" /> </colgroup> <tbody> <tr class="row-odd"><td><p>array initialization</p></td> <td><p>Initializing a large array with a data statement may result in very large assembly and object files, where a line of assembler source is required for each element in the initialized array. Compilation and linking can be very time consuming as well. 
To avoid this issue, consider initializing large arrays in a loop at runtime rather than in a data statement.</p></td> </tr> <tr class="row-even"><td><p>stack space</p></td> <td><p>Stack space can be a problem for data that is stack-based. On Linux, stack size is increased in your shell environment. If setting stacksize to unlimited is not large enough, try setting the size explicitly:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">limit</span><span class="w"> </span><span class="n">stacksize</span><span class="w"> </span><span class="n">new_size</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="n">csh</span><span class="w"></span> </pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">ulimit</span><span class="w"> </span><span class="o">-</span><span class="n">s</span><span class="w"> </span><span class="n">new_size</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="n">bash</span><span class="w"></span> </pre></div> </div> </td> </tr> <tr class="row-odd"><td><p>page swapping</p></td> <td><p>If your executable is much larger than the physical size of memory, page swapping can cause it to run dramatically slower; it may even fail. This is not a compiler problem. Try smaller data sets to determine whether or not a problem is due to page thrashing.</p></td> </tr> <tr class="row-even"><td><p>configured space</p></td> <td><p>Be sure your Linux system is configured with swap space sufficiently large to support the data sets used in your application(s). 
If your memory+swap space is not sufficiently large, your application will likely encounter a segmentation fault at runtime.</p></td> </tr> <tr class="row-odd"><td><p>support for large address offsets in object file format</p></td> <td><p>Arrays that are not dynamically allocated are limited by how the compiler can express the ‘distance’ between them when generating code. A field in the object file stores this ‘distance’ value, which is limited to 32-bits on Linux with <code class="docutils literal notranslate"><span class="pre">-mcmodel=small</span></code>. It is 64-bits on Linux with <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code>.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <p>Without the 64-bit offset support in the object file format, large arrays cannot be declared statically, or locally on the stack.</p> </div> </td> </tr> </tbody> </table> <span class="target" id="prog-64bits-lrg-ary-med-mem-c"></span></section> <section id="medium-memory-model-and-large-array-in-c"> <h2><span class="section-number">15.7. 
</span>Medium Memory Model and Large Array in C<a class="headerlink" href="#medium-memory-model-and-large-array-in-c" title="Permalink to this headline"></a></h2> <p>Consider the following example, where the aggregate size of the arrays exceeds 2GB.</p> <p class="title sectiontitle rubric" id="medium-memory-model-and-large-array-in-c-1">Medium Memory Model and Large Array in C</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="o">%</span><span class="w"> </span><span class="n">cat</span><span class="w"> </span><span class="n">bigadd</span><span class="p">.</span><span class="n">c</span><span class="w"></span> <span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;stdio.h&gt;</span><span class="cp"></span> <span class="cp">#define SIZE 600000000 </span><span class="cm">/* &gt; 2GB/4 */</span><span class="cp"></span> <span class="k">static</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="n">SIZE</span><span class="p">],</span><span class="w"> </span><span class="n">b</span><span class="p">[</span><span class="n">SIZE</span><span class="p">];</span><span class="w"></span> <span class="kt">int</span><span class="w"></span> <span class="nf">main</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">long</span><span class="w"> </span><span class="kt">long</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">c</span><span class="p">[</span><span class="n">SIZE</span><span class="p">];</span><span class="w"> </span><span class="cm">/* goes on 
stack */</span><span class="w"></span> <span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SIZE</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">10000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span 
class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"></span> <span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;a[0]=%g b[0]=%g c[0]=%g</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span><span class="w"></span> <span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">&quot;m=%lld a[%lld]=%g b[%lld]=%gc[%lld]=%g</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="n">m</span><span class="p">,</span><span class="n">m</span><span class="p">,</span><span class="n">a</span><span class="p">[</span><span class="n">m</span><span class="p">],</span><span class="n">m</span><span class="p">,</span><span class="n">b</span><span class="p">[</span><span class="n">m</span><span class="p">],</span><span class="n">m</span><span class="p">,</span><span class="n">c</span><span class="p">[</span><span class="n">m</span><span 
class="p">]);</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvc -mcmodel=medium -o bigadd bigadd.c </pre></div> </div> <p>When SIZE is greater than 2G/4, and the arrays are of type float with 4 bytes per element, the size of each array is greater than 2GB. With nvc, using the -mcmodel=medium switch, a static data object can now be &gt; 2GB in size. If you execute with these settings in your environment, you may see the following:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% bigadd Segmentation fault </pre></div> </div> <p>Execution fails because the stack size is not large enough. You can most likely correct this error by using the <code class="docutils literal notranslate"><span class="pre">limit</span> <span class="pre">stacksize</span></code> command to reset the stack size in your environment:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% limit stacksize 3000M </pre></div> </div> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The command <code class="docutils literal notranslate"><span class="pre">limit</span> <span class="pre">stacksize</span> <span class="pre">unlimited</span></code> probably does not provide as large a stack as we are using in <a class="reference internal" href="#prog-64bits-lrg-ary-med-mem-c"><span class="std std-ref">this example</span></a>.</p> </div> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% bigadd a[0]=1 b[0]=2 c[0]=3 m=599990000 a[599990000]=5.9999e+08 b[599990000]=1.19998e+09 c[599990000]=1.79997e+09 </pre></div> </div> </section> <section id="medium-memory-model-and-large-array-in-fortran"> <h2><span 
class="section-number">15.8. </span>Medium Memory Model and Large Array in Fortran<a class="headerlink" href="#medium-memory-model-and-large-array-in-fortran" title="Permalink to this headline"></a></h2> <p>The following example works with the NVFORTRAN compiler. It uses 64-bit addresses and index arithmetic when the <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code> option is used.</p> <p><strong>Medium Memory Model and Large Array in Fortran</strong></p> <div class="highlight-fortran notranslate"><div class="highlight"><pre><span></span><span class="p">%</span><span class="w"> </span><span class="n">cat</span><span class="w"> </span><span class="n">mat</span><span class="p">.</span><span class="n">f</span><span class="w"></span> <span class="k">program </span><span class="n">mat</span><span class="w"></span> <span class="w"> </span><span class="kt">integer </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">l</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"></span> <span class="w"> </span><span class="k">parameter</span><span class="w"> </span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">16000</span><span class="p">)</span><span class="w"> </span><span class="c">! 
&gt;2GB</span> <span class="w"> </span><span class="k">parameter</span><span class="w"> </span><span class="p">(</span><span class="n">m</span><span class="o">=</span><span class="n">size</span><span class="p">,</span><span class="n">n</span><span class="o">=</span><span class="n">size</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">real</span><span class="o">*</span><span class="mi">8</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">),</span><span class="n">b</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">),</span><span class="n">d</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="o">=</span><span class="mi">1000</span><span class="mf">0.0D0</span><span class="o">*</span><span class="nb">dble</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="nb">dble</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span 
class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="o">=</span><span class="mi">2000</span><span class="mf">0.0D0</span><span class="o">*</span><span class="nb">dble</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="nb">dble</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> enddo</span><span class="w"></span> <span class="w"> </span><span class="c">!$omp parallel</span> <span class="w"> </span><span class="c">!$omp do</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">enddo</span> <span class="k"> 
enddo</span><span class="w"></span> <span class="w"> </span><span class="c">!$omp do</span> <span class="w"> </span><span class="k">do </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">m</span><span class="w"></span> <span class="w"> </span><span class="k">do </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"></span> <span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">3000</span><span class="mf">0.0D0</span><span class="o">*</span><span class="nb">dble</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="nb">dble</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="o">+</span><span class="nb">dble</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">d</span><span class="w"> </span><span class="p">.</span><span class="n">ne</span><span class="p">.</span><span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">))</span><span class="w"> </span><span class="k">then</span> <span class="k"> print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="s2">&quot;err i=&quot;</span><span class="p">,</span><span class="n">i</span><span class="p">,</span><span class="s2">&quot;j=&quot;</span><span class="p">,</span><span class="n">j</span><span class="w"></span> <span class="w"> </span><span class="k">print</span><span class="w"> </span><span class="o">*</span><span 
class="p">,</span><span class="s2">&quot;c(i,j)=&quot;</span><span class="p">,</span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="s2">&quot;d=&quot;</span><span class="p">,</span><span class="n">d</span><span class="w"></span> <span class="w"> </span><span class="k">stop</span> <span class="k"> endif</span> <span class="k"> enddo</span> <span class="k"> enddo</span><span class="w"></span> <span class="w"> </span><span class="c">!$omp end parallel</span> <span class="w"> </span><span class="k">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;M =&quot;</span><span class="p">,</span><span class="n">M</span><span class="p">,</span><span class="s2">&quot;, N =&quot;</span><span class="p">,</span><span class="n">N</span><span class="w"></span> <span class="w"> </span><span class="k">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;c(M,N) = &quot;</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">)</span><span class="w"></span> <span class="k">end</span><span class="w"></span> </pre></div> </div> <p>When compiled with the NVFORTRAN compiler using <code class="docutils literal notranslate"><span class="pre">-mcmodel=medium</span></code>:</p> <div class="highlight-text notranslate"><div class="highlight"><pre><span></span>% nvfortran -Mfree -mp -o mat mat.f -i8 -mcmodel=medium % setenv OMP_NUM_THREADS 2 % mat M = 16000 , N = 16000 c(M,N) = 480032000.0000000 </pre></div> </div> </section> <section 
id="large-array-and-small-memory-model-in-fortran"> <h2><span class="section-number">15.9. </span>Large Array and Small Memory Model in Fortran<a class="headerlink" href="#large-array-and-small-memory-model-in-fortran" title="Permalink to this headline"></a></h2> <p>The following example uses large, dynamically-allocated arrays. The code is divided into a main program and a subroutine so that you could put the subroutine into a shared library. Dynamic allocation of the large arrays reduces the size of the executable and saves time initializing data.</p> <p><strong>Large Array and Small Memory Model in Fortran</strong></p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="o">%</span><span class="w"> </span><span class="n">cat</span><span class="w"> </span><span class="n">mat_allo</span><span class="p">.</span><span class="n">f90</span><span class="w"></span> </pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">program</span><span class="w"> </span><span class="n">mat_allo</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"></span> <span class="w"> </span><span class="n">parameter</span><span class="w"> </span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">16000</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">parameter</span><span class="w"> </span><span class="p">(</span><span class="n">m</span><span class="o">=</span><span class="n">size</span><span 
class="p">,</span><span class="n">n</span><span class="o">=</span><span class="n">size</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">precision</span><span class="p">,</span><span class="w"> </span><span class="n">allocatable</span><span class="o">::</span><span class="n">a</span><span class="p">(</span><span class="o">:</span><span class="p">,</span><span class="o">:</span><span class="p">),</span><span class="n">b</span><span class="p">(</span><span class="o">:</span><span class="p">,</span><span class="o">:</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="o">:</span><span class="p">,</span><span class="o">:</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">allocate</span><span class="p">(</span><span class="n">a</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">))</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span 
class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="w"></span> <span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">10000.0</span><span class="n">D0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dble</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">dble</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">20000.0</span><span class="n">D0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dble</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">dble</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="n">call</span><span class="w"> </span><span class="n">mat_add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">)</span><span 
class="w"></span> <span class="w"> </span><span class="n">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;M =&quot;</span><span class="p">,</span><span class="n">m</span><span class="p">,</span><span class="s">&quot;,N =&quot;</span><span class="p">,</span><span class="n">n</span><span class="w"></span> <span class="w"> </span><span class="n">print</span><span class="w"> </span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;c(M,N) = &quot;</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">)</span><span class="w"></span> <span class="n">end</span><span class="w"></span> </pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="n">subroutine</span><span class="w"> </span><span class="n">mat_add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">precision</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">),</span><span class="n">b</span><span class="p">(</span><span 
class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"></span> <span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"></span> <span class="w"> </span><span class="n">c</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="n">enddo</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"></span> <span class="n">end</span><span class="w"></span> </pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="o">%</span><span class="w"> </span><span class="n">nvfortran</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span 
class="w"> </span><span class="n">mat_allo</span><span class="w"> </span><span class="n">mat_allo</span><span class="p">.</span><span class="n">f90</span><span class="w"> </span><span class="o">-</span><span class="n">i8</span><span class="w"> </span><span class="o">-</span><span class="n">Mlarge_arrays</span><span class="w"> </span><span class="o">-</span><span class="n">mp</span><span class="w"> </span><span class="o">-</span><span class="n">fast</span><span class="w"></span> </pre></div> </div> <span class="target" id="inline-asm-intrin-c-cpp"></span></section> </section> <section id="c-and-c-inline-assembly-and-intrinsics"> <h1><span class="section-number">16. </span>C++ and C Inline Assembly and Intrinsics<a class="headerlink" href="#c-and-c-inline-assembly-and-intrinsics" title="Permalink to this headline"></a></h1> <p>The examples in this section are shown using x86-64 assembly instructions. Inline assembly is supported on Arm Server platforms as well, but is not documented in detail in this section.</p> <span class="target" id="inline-asm"></span><section id="inline-assembly"> <h2><span class="section-number">16.1. </span>Inline Assembly<a class="headerlink" href="#inline-assembly" title="Permalink to this headline"></a></h2> <p>Inline Assembly lets you specify machine instructions inside a “C” function. The format for an inline assembly instruction is this:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="p">{</span><span class="w"> </span><span class="k">asm</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">__asm__</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">(</span><span class="s">&quot;string&quot;</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>The asm statement begins with the <em>asm</em> or <em>__asm__</em> keyword. 
The __asm__ keyword is typically used in header files that may be included in ISO “C” programs.</p> <p><em>string</em> is one or more machine specific instructions separated with a semi-colon (<em>;</em>) or newline (<em>\n</em>) character. These instructions are inserted directly into the compiler’s assembly-language output for the enclosing function.</p> <p>Some simple asm statements are:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">asm</span><span class="w"> </span><span class="p">(</span><span class="s">&quot;cli&quot;</span><span class="p">);</span><span class="w"></span> <span class="k">asm</span><span class="w"> </span><span class="p">(</span><span class="s">&quot;sti&quot;</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>These asm statements disable and enable system interrupts respectively.</p> <p>In the following example, the eax register is set to zero.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;pushl %eax</span><span class="se">\n\t</span><span class="s">&quot;</span><span class="w"> </span><span class="s">&quot;movl $0, %eax</span><span class="se">\n\t</span><span class="s">&quot;</span><span class="w"> </span><span class="s">&quot;popl %eax&quot;</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>Notice that eax is pushed on the stack so that it is not clobbered. When the statement is done with eax, it is restored with the popl instruction.</p> <p>Typically a program uses macros that enclose asm statements. 
The following two examples use the interrupt constructs created previously in this section:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#define disableInt __asm__ (&quot;cli&quot;);</span> <span class="cp">#define enableInt __asm__ (&quot;sti&quot;);</span> </pre></div> </div> </section> <section id="extended-inline-assembly"> <h2><span class="section-number">16.2. </span>Extended Inline Assembly<a class="headerlink" href="#extended-inline-assembly" title="Permalink to this headline"></a></h2> <p><a class="reference internal" href="#inline-asm"><span class="std std-ref">Inline Assembly</span></a> explains how to use inline assembly to specify machine specific instructions inside a “C” function. This approach works well for simple machine operations such as disabling and enabling system interrupts. However, inline assembly has three distinct limitations:</p> <ol class="arabic simple"> <li><p>The programmer must choose the registers required by the inline assembly.</p></li> <li><p>To prevent register clobbering, the inline assembly must include push and pop code for registers that get modified by the inline assembly.</p></li> <li><p>There is no easy way to access stack variables in an inline assembly statement.</p></li> </ol> <p><em>Extended Inline Assembly</em> was created to address these limitations. 
The format for extended inline assembly, also known as <em>extended asm</em>, is as follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="p">{</span><span class="w"> </span><span class="k">asm</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">__asm__</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="k">volatile</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">__volatile__</span><span class="w"> </span><span class="p">]</span><span class="w"></span> <span class="p">(</span><span class="s">&quot;string&quot;</span><span class="w"> </span><span class="p">[</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="n">output</span><span class="w"> </span><span class="n">operands</span><span class="p">]]</span><span class="w"> </span><span class="p">[</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="n">input</span><span class="w"> </span><span class="n">operands</span><span class="p">]]</span><span class="w"> </span><span class="p">[</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="n">clobberlist</span><span class="p">]]);</span><span class="w"></span> </pre></div> </div> <ul> <li><p>Extended asm statements begin with the <em>asm</em> or <em>__asm__</em> keyword. Typically the <em>__asm__</em> keyword is used in header files that may be included by ISO “C” programs.</p></li> <li><p>An optional <em>volatile</em> or <em>__volatile__</em> keyword may appear after the <em>asm</em> keyword. This keyword instructs the compiler not to delete the asm statement, move it significantly, or combine it with any other asm statement. 
Like __asm__, the __volatile__ keyword is typically used with header files that may be included by ISO “C” programs.</p></li> <li><p>“<em>string</em>” is one or more machine specific instructions separated with a semi-colon (<em>;</em>) or newline (<em>\n</em>) character. The string can also contain operands specified in the <em>[output operands]</em>, <em>[input operands]</em>, and <em>[clobber list]</em>. The instructions are inserted directly into the compiler’s assembly-language output for the enclosing function.</p></li> <li><p>The <em>[output operands]</em>, <em>[input operands]</em>, and <em>[clobber list]</em> items each describe the effect of the instruction for the compiler. For example:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl %1, %%eax</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"> </span><span class="s">&quot;movl %%eax, %0&quot;</span><span class="o">:</span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;%eax&quot;</span><span class="w"> </span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>where</p> <ul class="simple"> <li><p>“=r” (x) is an output operand.</p></li> <li><p>“r” (y) is an input operand.</p></li> <li><p>“%eax” is the clobber list consisting of one register, “%eax”.</p></li> </ul> <p>The notation for the output and input operands is a constraint string surrounded by quotes, followed by an expression, and surrounded by parentheses. 
The constraint string describes how the input and output operands are used in the asm “string”. For example, “r” tells the compiler that the operand is a register. The “=” tells the compiler that the operand is write only, which means that a value is stored in an output operand’s expression at the end of the asm statement.</p> <p>Each operand is referenced in the asm “string” by a percent “%” and its number. The first operand is number 0, the second is number 1, the third is number 2, and so on. In the preceding example, “%0” references the output operand, and “%1” references the input operand. The asm “string” also contains “%%eax”, which references machine register “%eax”. Hard coded registers like “%eax” should be specified in the clobber list to prevent conflicts with other instructions in the compiler’s assembly-language output. <em>[output operands]</em>, <em>[input operands]</em>, and <em>[clobber list]</em> items are described in more detail in the following sections.</p> </li> </ul> <span class="target" id="inline-asm-output-operands"></span><section id="output-operands"> <h3><span class="section-number">16.2.1. </span>Output Operands<a class="headerlink" href="#output-operands" title="Permalink to this headline"></a></h3> <p>The <em>[output operands]</em> are an optional list of output constraint and expression pairs that specify the result(s) of the asm statement. An output constraint is a string that specifies how a result is delivered to the expression. For example, “=r” (x) says the output operand is a write-only register that stores its value in the “C” variable x at the end of the asm statement. 
An example follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">example</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl $0, %0&quot;</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The previous example assigns 0 to the “C” variable x. For the function in this example, the compiler produces the following assembly. 
If you want to produce an assembly listing, compile the example with the nvc -S compiler option:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="nl">example</span><span class="p">:</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfb0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">pushq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">movq</span><span class="w"> </span><span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi1</span><span class="o">:</span><span class="w"></span> <span class="p">..</span><span class="n">EN1</span><span class="o">:</span><span class="w"></span> <span class="cp">## lineno: 8</span> <span class="w"> </span><span class="n">movl</span><span class="w"> </span><span class="n">$0</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="w"></span> <span class="w"> </span><span class="n">movl</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">)</span><span class="w"></span> <span class="cp">## lineno: 0</span> <span class="w"> </span><span class="n">popq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="w"> </span><span class="n">ret</span><span class="w"></span> </pre></div> </div> <p>In the generated assembly shown, notice that the compiler generated two 
statements for the asm statement at line number 5. The compiler generated “<em>movl $0, %eax</em>” from the asm “<em>string</em>”. Also notice that <em>%eax</em> appears in place of “<em>%0</em>” because the compiler assigned the <em>%eax</em> register to variable <em>x</em>. Since item 0 is an output operand, the result must be stored in its expression (x).</p> <p>In addition to write-only output operands, there are read/write output operands designated with a “<strong>+</strong>” instead of a “<strong>=</strong>”. For example, “<em>+r</em>” (<em>x</em>) tells the compiler to initialize the output operand with variable x at the beginning of the asm statement.</p> <p>To illustrate this point, the following example increments variable x by 1:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">example2</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;addl $1, %0&quot;</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;+r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>To perform the increment, the output operand must be initialized with variable x. The <em>read/write</em> constraint modifier (“+”) instructs the compiler to initialize the output operand with its expression. 
The compiler generates the following assembly code for the example2() function:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="nl">example2</span><span class="p">:</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfb0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">pushq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">movq</span><span class="w"> </span><span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi1</span><span class="o">:</span><span class="w"></span> <span class="p">..</span><span class="n">EN1</span><span class="o">:</span><span class="w"></span> <span class="cp">## lineno: 5</span> <span class="w"> </span><span class="n">movl</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">),</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="w"></span> <span class="w"> </span><span class="n">addl</span><span class="w"> </span><span class="n">$1</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="w"></span> <span class="w"> </span><span class="n">movl</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">)</span><span class="w"></span> <span class="cp">## lineno: 0</span> <span class="w"> </span><span 
class="n">popq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="w"> </span><span class="n">ret</span><span class="w"></span> </pre></div> </div> <p>From the example2() code, two extraneous moves are generated in the assembly: one movl for initializing the output register and a second movl to write it to variable x. To eliminate these moves, use a memory constraint type instead of a register constraint type, as shown in the following example:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">example2</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;addl $1, %0&quot;</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;+m&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The compiler generates a memory reference in place of a memory constraint. This eliminates the two extraneous moves. Because the assembly uses a memory reference to variable x, it does not have to move x into a register prior to the asm statement; nor does it need to store the result after the asm statement. 
Additional constraint types are found in <a class="reference internal" href="#inline-asm-addl-constraints"><span class="std std-ref">Additional Constraints</span></a>.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="nl">example2</span><span class="p">:</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfb0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">pushq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">movq</span><span class="w"> </span><span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi1</span><span class="o">:</span><span class="w"></span> <span class="p">..</span><span class="n">EN1</span><span class="o">:</span><span class="w"></span> <span class="cp">## lineno: 5</span> <span class="w"> </span><span class="n">addl</span><span class="w"> </span><span class="n">$1</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">)</span><span class="w"></span> <span class="cp">## lineno: 0</span> <span class="w"> </span><span class="n">popq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="w"> </span><span class="n">ret</span><span class="w"></span> </pre></div> </div> <p>The examples thus far have used only one output operand. 
Because extended asm accepts a list of output operands, asm statements can have more than one result, as shown in the following example:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example4</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;addl $1, %1</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"> </span><span class="s">&quot;addl %1, %0&quot;</span><span class="o">:</span><span class="w"> </span><span class="s">&quot;+r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w"> </span><span class="s">&quot;+m&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>This example increments variable <em>y</em> by <em>1</em> then adds it to variable <em>x</em>. Multiple output operands are separated with a comma. The first output operand is item 0 (“%0”) and the second is item 1 (“%1”) in the asm <em>“string”</em>. The resulting values for <em>x</em> and <em>y</em> are <em>4</em> and <em>3</em> respectively.</p> </section> <section id="input-operands"> <h3><span class="section-number">16.2.2. 
</span>Input Operands<a class="headerlink" href="#input-operands" title="Permalink to this headline"></a></h3> <p>The <em>[input operands]</em> are an optional list of input constraint and expression pairs that specify what “C” values are needed by the asm statement. The input constraints specify how the data is delivered to the asm statement. For example, <em>“r” (x)</em> says that the input operand is a register that has a copy of the value stored in “C” variable <em>x</em>. Another example is <em>“m” (x)</em> which says that the input item is the <em>memory</em> location associated with variable <em>x</em>. Other constraint types are discussed in <a class="reference internal" href="#inline-asm-addl-constraints"><span class="std std-ref">Additional Constraints</span></a>. An example follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example5</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">z</span><span class="o">=</span><span class="mi">3</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;addl %2, %1</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"> </span><span class="s">&quot;addl %2, %0&quot;</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span 
class="s">&quot;+r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w"> </span><span class="s">&quot;+m&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The previous example adds variable z, item 2, to variable x and variable y. The resulting values for x and y are 4 and 5 respectively.</p> <p>Another type of input constraint worth mentioning here is the <em>matching constraint</em>. A matching constraint is used to specify an operand that fills both an input as well as an output role. An example follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">example6</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;addl $1, %1&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;0&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span 
class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The previous example is equivalent to the <em>example2()</em> function shown in <a class="reference internal" href="#inline-asm-output-operands"><span class="std std-ref">Output Operands</span></a>. The constraint/expression pair, <em>“0” (x)</em>, tells the compiler to initialize output item <em>0</em> with variable <em>x</em> at the beginning of the <em>asm</em> statement. The resulting value for <em>x</em> is 2. Also note that “<em>%1</em>” in the asm <em>“string”</em> means the same thing as “<em>%0</em>” in this case. That is because there is only one operand with both an input and an output role.</p> <p>Matching constraints are very similar to the <em>read/write</em> output operands mentioned in <a class="reference internal" href="#inline-asm-output-operands"><span class="std std-ref">Output Operands</span></a>. However, there is one key difference between <em>read/write</em> output operands and <em>matching constraints</em>. 
The <em>matching constraint</em> can have an <em>input expression</em> that differs from its <em>output expression</em>.</p> <p>The following example uses different values for the input and output roles:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="kt">void</span><span class="w"> </span><span class="nf">example7</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;addl $1, %1&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;0&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The compiler generates the following assembly for example7():</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="nl">example7</span><span class="p">:</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfb0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">pushq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span 
class="n">Dcfi0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">movq</span><span class="w"> </span><span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi1</span><span class="o">:</span><span class="w"></span> <span class="p">..</span><span class="n">EN1</span><span class="o">:</span><span class="w"></span> <span class="cp">## lineno: 8</span> <span class="w"> </span><span class="n">movl</span><span class="w"> </span><span class="n">y</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">),</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="w"></span> <span class="w"> </span><span class="n">addl</span><span class="w"> </span><span class="n">$1</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="w"></span> <span class="w"> </span><span class="n">movl</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">)</span><span class="w"></span> <span class="cp">## lineno: 0</span> <span class="w"> </span><span class="n">popq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="w"> </span><span class="n">ret</span><span class="w"></span> </pre></div> </div> <p>Variable <em>x</em> gets initialized with the value stored in <em>y</em>, which is <em>2</em>. 
After adding <em>1</em>, the resulting value for variable <em>x</em> is <em>3</em>.</p> <p>Because <em>matching constraints</em> perform an input role for an output operand, it does not make sense for the output operand to have the read/write (”<em>+</em>”) modifier. In fact, the compiler disallows <em>matching constraints</em> with read/write output operands. The output operand must have a write-only (”<em>=</em>”) modifier.</p> </section> <section id="clobber-list"> <h3><span class="section-number">16.2.3. </span>Clobber List<a class="headerlink" href="#clobber-list" title="Permalink to this headline"></a></h3> <p>The <em>[clobber list]</em> is an optional list of strings that hold machine registers used in the asm “<em>string</em>”. Essentially, these strings tell the compiler which registers may be clobbered by the asm statement. By placing registers in this list, the programmer does not have to explicitly save and restore them as required in traditional inline assembly (described in <a class="reference internal" href="#inline-asm"><span class="std std-ref">Inline Assembly</span></a>). The compiler takes care of any required saving and restoring of the registers in this list.</p> <p>Each machine register in the [clobber list] is specified as a string, and the strings are separated by commas. The leading ‘%’ is optional in the register name. For example, “%eax” is equivalent to “eax”. When specifying the register inside the asm “string”, you must include two leading ‘%’ characters in front of the name (for example, “%%eax”). Otherwise, the compiler will behave as if a bad input/output operand was specified and generate an error message. 
An example follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example8</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl %1, %%eax</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;movl %1, %%edx</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;addl %%edx, %%eax</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;addl %%eax, %0&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;0&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;eax&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;edx&quot;</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>This code uses two hard-coded 
registers, eax and edx. It performs the equivalent of 3*y and assigns it to x, producing a result of 6.</p> <p>In addition to machine registers, the clobber list may contain the following special flags:</p> <dl class="simple"> <dt>“cc”</dt><dd><p>The asm statement may alter the condition code register.</p> </dd> <dt>“memory”</dt><dd><p>The asm statement may modify memory in an unpredictable fashion.</p> </dd> </dl> <p>When the “memory” flag is present, the compiler does not keep memory values cached in registers across the asm statement and does not optimize stores or loads to that memory. For example:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="k">asm</span><span class="p">(</span><span class="s">&quot;call MyFunc&quot;</span><span class="o">:::</span><span class="s">&quot;memory&quot;</span><span class="p">);</span><span class="w"></span> </pre></div> </div> <p>This asm statement includes the “memory” flag because it contains a call. Without the flag, the compiler might assume that memory values cached in registers remain valid across the call, even though the callee may modify that memory.</p> <p>The following function uses extended asm and the “cc” flag to compute the exponent of the largest power of 2 that is less than or equal to the input parameter n.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#pragma noinline</span> <span class="kt">int</span><span class="w"> </span><span class="nf">asmDivideConquer</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">ax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">bx</span><span 
class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">asm</span><span class="w"> </span><span class="p">(</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;LogLoop:</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;cmp %2, %1</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;jnle Done</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;inc %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;add %1,%1</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;jmp LogLoop</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;Done:</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;dec %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="s">&quot;+r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">ax</span><span class="p">),</span><span class="w"> </span><span class="s">&quot;+r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">bx</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;cc&quot;</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">ax</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The ‘cc’ flag is used because the asm statement contains some control flow 
that may alter the condition code register. The #pragma noinline statement prevents the compiler from inlining the asmDivideConquer() function. If the compiler inlines asmDivideConquer(), then it may illegally duplicate the labels LogLoop and Done in the generated assembly.</p> <span class="target" id="inline-asm-addl-constraints"></span></section> <section id="additional-constraints"> <h3><span class="section-number">16.2.4. </span>Additional Constraints<a class="headerlink" href="#additional-constraints" title="Permalink to this headline"></a></h3> <p>Operand constraints can be divided into four main categories:</p> <ul class="simple"> <li><p>Simple Constraints</p></li> <li><p>Machine Constraints</p></li> <li><p>Multiple Alternative Constraints</p></li> <li><p>Constraint Modifiers</p></li> </ul> </section> <section id="simple-constraints"> <h3><span class="section-number">16.2.5. </span>Simple Constraints<a class="headerlink" href="#simple-constraints" title="Permalink to this headline"></a></h3> <p>The simplest kind of constraint is a string of letters or characters, known as <em>Simple Constraints</em>, such as the “r” and “m” constraints introduced in <a class="reference internal" href="#inline-asm-output-operands"><span class="std std-ref">Output Operands</span></a>. Table 33 describes these constraints.</p> <table class="table-no-stripes docutils align-default" id="inline-asm-simple-constraints-inline-asm-simple-constraints-tbl"> <caption><span class="caption-text">Table 33. 
Simple Constraints</span><a class="headerlink" href="#inline-asm-simple-constraints-inline-asm-simple-constraints-tbl" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 7%" /> <col style="width: 93%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Constraint</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>whitespace</p></td> <td><p>Whitespace characters are ignored.</p></td> </tr> <tr class="row-odd"><td><p>E</p></td> <td><p>An immediate floating point operand.</p></td> </tr> <tr class="row-even"><td><p>F</p></td> <td><p>Same as “E”.</p></td> </tr> <tr class="row-odd"><td><p>g</p></td> <td><p>Any general purpose register, memory, or immediate integer operand is allowed.</p></td> </tr> <tr class="row-even"><td><p>i</p></td> <td><p>An immediate integer operand.</p></td> </tr> <tr class="row-odd"><td><p>m</p></td> <td><p>A memory operand. Any address supported by the machine is allowed.</p></td> </tr> <tr class="row-even"><td><p>n</p></td> <td><p>Same as “i”.</p></td> </tr> <tr class="row-odd"><td><p>o</p></td> <td><p>Same as “m”.</p></td> </tr> <tr class="row-even"><td><p>p</p></td> <td><p>An operand that is a valid memory address. The expression associated with the constraint is expected to evaluate to an address (for example, “p” (&amp;x) ).</p></td> </tr> <tr class="row-odd"><td><p>r</p></td> <td><p>A general purpose register operand.</p></td> </tr> <tr class="row-even"><td><p>X</p></td> <td><p>Same as “g”.</p></td> </tr> <tr class="row-odd"><td><p>0,1,2,..9</p></td> <td><p>Matching Constraint. 
See <a class="reference internal" href="#inline-asm-output-operands"><span class="std std-ref">Output Operands</span></a> for a description.</p></td> </tr> </tbody> </table> <p>The following example uses the general or “g” constraint, which allows the compiler to pick an appropriate constraint type for the operand; the compiler chooses from a general purpose register, memory, or immediate operand. This code lets the compiler choose the constraint type for “y”.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example9</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="mi">2</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl %1, %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"></span> <span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;g&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>This technique can result in more efficient code. For example, when compiling example9() the compiler replaces the load and store of y with a constant 2. The compiler can then generate an immediate 2 for the y operand in the example. 
The assembly generated by nvc for our example is as follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="nl">example9</span><span class="p">:</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfb0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">pushq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">movq</span><span class="w"> </span><span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi1</span><span class="o">:</span><span class="w"></span> <span class="p">..</span><span class="n">EN1</span><span class="o">:</span><span class="w"></span> <span class="cp">## lineno: 3</span> <span class="w"> </span><span class="n">movl</span><span class="w"> </span><span class="n">$2</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">eax</span><span class="w"></span> <span class="cp">## lineno: 6</span> <span class="w"> </span><span class="n">popq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="w"> </span><span class="n">ret</span><span class="w"></span> </pre></div> </div> <p>In this example, notice the use of $2 for the “y” operand.</p> <p>Of course, if y is always 2, then the immediate value may be used instead of the variable with the “i” constraint, as shown here:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example10</span><span class="p">()</span><span class="w"></span> <span 
class="p">{</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl %1, %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;i&quot;</span><span class="w"> </span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>Compiling example10() with nvc produces assembly similar to that produced for example9().</p> </section> <section id="machine-constraints"> <h3><span class="section-number">16.2.6. </span>Machine Constraints<a class="headerlink" href="#machine-constraints" title="Permalink to this headline"></a></h3> <p>Another category of constraints is <em>Machine Constraints</em>. The x86_64 architecture has several classes of registers. To choose a particular class of register, you can use the x86_64 machine constraints described in <a class="reference internal" href="#inline-asm-machine-constraints-inline-asm-machine-constraints-tbl"><span class="std std-ref">Table 34</span></a>.</p> <table class="table-no-stripes docutils align-default" id="inline-asm-machine-constraints-inline-asm-machine-constraints-tbl"> <caption><span class="caption-text">Table 34. 
x86_64 Machine Constraints</span><a class="headerlink" href="#inline-asm-machine-constraints-inline-asm-machine-constraints-tbl" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 9%" /> <col style="width: 91%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Constraint</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>a</p></td> <td><p>a register (e.g., %al, %ax, %eax, %rax)</p></td> </tr> <tr class="row-odd"><td><p>A</p></td> <td><p>Specifies a or d registers. The d register holds the most significant bits and the a register holds the least significant bits.</p></td> </tr> <tr class="row-even"><td><p>b</p></td> <td><p>b register (e.g., %bl, %bx, %ebx, %rbx)</p></td> </tr> <tr class="row-odd"><td><p>c</p></td> <td><p>c register (e.g., %cl, %cx, %ecx, %rcx)</p></td> </tr> <tr class="row-even"><td><p>C</p></td> <td><p>Not supported.</p></td> </tr> <tr class="row-odd"><td><p>d</p></td> <td><p>d register (e.g., %dl, %dx, %edx, %rdx)</p></td> </tr> <tr class="row-even"><td><p>D</p></td> <td><p>di register (e.g., %dil, %di, %edi, %rdi)</p></td> </tr> <tr class="row-odd"><td><p>e</p></td> <td><p>Constant in range of 0xffffffff to 0x7fffffff</p></td> </tr> <tr class="row-even"><td><p>f</p></td> <td><p>Not supported.</p></td> </tr> <tr class="row-odd"><td><p>G</p></td> <td><p>Floating point constant in range of 0.0 to 1.0.</p></td> </tr> <tr class="row-even"><td><p>I</p></td> <td><p>Constant in range of 0 to 31 (e.g., for 32-bit shifts).</p></td> </tr> <tr class="row-odd"><td><p>J</p></td> <td><p>Constant in range of 0 to 63 (e.g., for 64-bit shifts).</p></td> </tr> <tr class="row-even"><td><p>K</p></td> <td><p>Constant in range of 0 to 127.</p></td> </tr> <tr class="row-odd"><td><p>L</p></td> <td><p>Constant in range of 0 to 65535.</p></td> </tr> <tr class="row-even"><td><p>M</p></td> <td><p>Constant in range of 0 to 3 (e.g., shifts for lea instruction).</p></td> </tr> <tr 
class="row-odd"><td><p>N</p></td> <td><p>Constant in range of 0 to 255 (e.g., for out instruction).</p></td> </tr> <tr class="row-even"><td><p>q</p></td> <td><p>Same as “r” simple constraint.</p></td> </tr> <tr class="row-odd"><td><p>Q</p></td> <td><p>Same as “r” simple constraint.</p></td> </tr> <tr class="row-even"><td><p>R</p></td> <td><p>Same as “r” simple constraint.</p></td> </tr> <tr class="row-odd"><td><p>S</p></td> <td><p>si register (e.g., %sil, %si, %esi, %rsi)</p></td> </tr> <tr class="row-even"><td><p>t</p></td> <td><p>Not supported.</p></td> </tr> <tr class="row-odd"><td><p>u</p></td> <td><p>Not supported.</p></td> </tr> <tr class="row-even"><td><p>x</p></td> <td><p>XMM SSE register</p></td> </tr> <tr class="row-odd"><td><p>y</p></td> <td><p>Not supported.</p></td> </tr> <tr class="row-even"><td><p>Z</p></td> <td><p>Constant in range of 0 to 0x7fffffff.</p></td> </tr> </tbody> </table> <p>The following example uses the “x” or XMM register constraint to subtract c from b and store the result in a.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">double</span><span class="w"> </span><span class="nf">example11</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">a</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">400.99</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">300.98</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span 
class="k">asm</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="s">&quot;subpd %2, %0;&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="s">&quot;=x&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;0&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">b</span><span class="p">),</span><span class="w"> </span><span class="s">&quot;x&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The generated assembly for this example is this:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="nl">example11</span><span class="p">:</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfb0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">pushq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi0</span><span class="o">:</span><span class="w"></span> <span class="w"> </span><span class="n">movq</span><span class="w"> </span><span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="p">..</span><span class="n">Dcfi1</span><span class="o">:</span><span class="w"></span> <span class="p">..</span><span 
class="n">EN1</span><span class="o">:</span><span class="w"></span> <span class="cp">## lineno: 4</span> <span class="w"> </span><span class="n">movsd</span><span class="w"> </span><span class="p">.</span><span class="n">C00128</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">),</span><span class="w"> </span><span class="o">%</span><span class="n">xmm1</span><span class="w"></span> <span class="w"> </span><span class="n">movsd</span><span class="w"> </span><span class="p">.</span><span class="n">C00130</span><span class="p">(</span><span class="o">%</span><span class="n">rip</span><span class="p">),</span><span class="w"> </span><span class="o">%</span><span class="n">xmm2</span><span class="w"></span> <span class="w"> </span><span class="n">movapd</span><span class="w"> </span><span class="o">%</span><span class="n">xmm1</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">xmm0</span><span class="w"></span> <span class="w"> </span><span class="n">subpd</span><span class="w"> </span><span class="o">%</span><span class="n">xmm2</span><span class="p">,</span><span class="w"> </span><span class="o">%</span><span class="n">xmm0</span><span class="p">;</span><span class="w"></span> <span class="cp">## lineno: 10</span> <span class="cp">## lineno: 11</span> <span class="w"> </span><span class="n">popq</span><span class="w"> </span><span class="o">%</span><span class="n">rbp</span><span class="w"></span> <span class="w"> </span><span class="n">ret</span><span class="w"></span> </pre></div> </div> <p>If a specified register is not available, the nvc and nvc++ compilers issue an error message.</p> <span class="target" id="inline-asm-mult-alt-constraints"></span></section> <section id="multiple-alternative-constraints"> <h3><span class="section-number">16.2.7. 
</span>Multiple Alternative Constraints<a class="headerlink" href="#multiple-alternative-constraints" title="Permalink to this headline"></a></h3> <p>Sometimes a single instruction can take a variety of operand types. For example, the x86-64 permits register-to-memory and memory-to-register operations. To allow this flexibility in inline assembly, use <em>multiple alternative constraints</em>. An <em>alternative</em> is a series of constraints for each operand.</p> <p>To specify multiple alternatives, separate each alternative with a comma.</p> <table class="table-no-stripes docutils align-default" id="id37"> <caption><span class="caption-text">Table 35. Multiple Alternative Constraints</span><a class="headerlink" href="#id37" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 16%" /> <col style="width: 84%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Constraint</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>,</p></td> <td><p>Separates each alternative for a particular operand.</p></td> </tr> <tr class="row-odd"><td><p>?</p></td> <td><p>Ignored</p></td> </tr> <tr class="row-even"><td><p>!</p></td> <td><p>Ignored</p></td> </tr> </tbody> </table> <p>The following example uses multiple alternatives for an add operation.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example13</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="k">asm</span><span 
class="p">(</span><span class="w"> </span><span class="s">&quot;addl %1, %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;+ab,cd&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;db,cam&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The preceding <em>example13()</em> has two alternatives for each operand: “ab,cd” for the output operand and “db,cam” for the input operand. Each operand must have the same number of alternatives; however, each alternative can have any number of constraints (for example, the output operand in <em>example13()</em> has two constraints for its second alternative and the input operand has three for its second alternative).</p> <p>The compiler first tries to satisfy the left-most alternative of the first operand (for example, the output operand in <em>example13()</em>). When satisfying the operand, the compiler starts with the left-most constraint. If the compiler cannot satisfy an alternative with this constraint (for example, if the desired register is not available), it tries to use any subsequent constraints. If the compiler runs out of constraints, it moves on to the next alternative. If the compiler runs out of alternatives, it issues an error similar to the one mentioned in <em>example12()</em>. If an alternative is found, the compiler uses the same alternative for subsequent operands. 
For example, if the compiler chooses the “c” register for the output operand in example13(), then it will use either the “a” or “m” constraint for the input operand.</p> </section> <section id="constraint-modifiers"> <h3><span class="section-number">16.2.8. </span>Constraint Modifiers<a class="headerlink" href="#constraint-modifiers" title="Permalink to this headline"></a></h3> <p>Characters that affect the compiler’s interpretation of a constraint are known as <em>Constraint Modifiers</em>. Two constraint modifiers, the “=” and the “+”, were introduced in <a class="reference internal" href="#inline-asm-output-operands"><span class="std std-ref">Output Operands</span></a>. The following table summarizes each constraint modifier.</p> <table class="table-no-stripes docutils align-default" id="id38"> <caption><span class="caption-text">Table 36. Constraint Modifier Characters</span><a class="headerlink" href="#id38" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 4%" /> <col style="width: 96%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Constraint Modifier</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>=</p></td> <td><p>This operand is write-only. It is valid for output operands only. If specified, the “=” must appear as the first character of the constraint string.</p></td> </tr> <tr class="row-odd"><td><p>+</p></td> <td><p>This operand is both read and written by the instruction. It is valid for output operands only. The output operand is initialized with its expression before the first instruction in the asm statement. 
If specified, the “+” must appear as the first character of the constraint string.</p></td> </tr> <tr class="row-even"><td><p>&amp;</p></td> <td><p>A constraint or an alternative constraint, as defined in <a class="reference internal" href="#inline-asm-mult-alt-constraints"><span class="std std-ref">Multiple Alternative Constraints</span></a>, containing an “&amp;” indicates that the output operand is an <em>early clobber operand</em>. This type of operand is an output operand that may be modified before the asm statement finishes using all of the input operands. The compiler will not place this operand in a register that may be used as an input operand or part of any memory address.</p></td> </tr> <tr class="row-odd"><td><p>%</p></td> <td><p>Ignored.</p></td> </tr> <tr class="row-even"><td><p>#</p></td> <td><p>Characters following a “#” up to the first comma (if present) are to be ignored in the constraint.</p></td> </tr> <tr class="row-odd"><td><p>*</p></td> <td><p>The character that follows the “*” is to be ignored in the constraint.</p></td> </tr> </tbody> </table> <p>The “=” and “+” modifiers apply to the operand, regardless of the number of alternatives in the constraint string. For example, the “+” in the output operand of example13() appears once and applies to both alternatives in the constraint string. The “&amp;”, “#”, and “*” modifiers apply only to the alternative in which they appear.</p> <p>Normally, the compiler assumes that input operands are used before assigning results to the output operands. This assumption lets the compiler reuse registers as needed inside the asm statement. However, if the asm statement does not follow this convention, the compiler may indiscriminately clobber a result register with an input operand. To prevent this behavior, apply the early clobber “&amp;” modifier. 
An example follows:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example15</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">w</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">z</span><span class="p">;</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl $1, %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;addl %2, %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;movl %2, %1&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=a&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">w</span><span class="p">),</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The previous code example presents an interesting ambiguity because “w” appears both as an output and as an input operand. So, the value of “z” can be either 1 or 2, depending on whether the compiler uses the same register for operand 0 and operand 2. 
The use of constraint “r” for operand 2 allows the compiler to pick any general purpose register, so it may (or may not) pick register “a” for operand 2. This ambiguity can be eliminated by changing the constraint for operand 2 from “r” to “a” so the value of “z” will be 2, or by adding an early clobber “&amp;” modifier so that “z” will be 1. The following example shows the same function with an early clobber “&amp;” modifier:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example16</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">w</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">z</span><span class="p">;</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl $1, %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;addl %2, %0</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;movl %2, %1&quot;</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;=&amp;a&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">w</span><span class="p">),</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="s">&quot;r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span 
class="w"> </span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>Adding the early clobber “&amp;” forces the compiler not to use the “a” register for anything other than operand 0. Operand 2 will therefore get its own register with its own copy of “w”. The result for “z” in <em>example16()</em> is 1.</p> </section> </section> <section id="operand-aliases"> <h2><span class="section-number">16.3. </span>Operand Aliases<a class="headerlink" href="#operand-aliases" title="Permalink to this headline"></a></h2> <p>Extended asm specifies operands in assembly strings with a percent ‘%’ followed by the operand number. For example, “%0” references operand 0 or the output item “=&amp;a” (w) in function <em>example16()</em> in the previous example. Extended asm also supports operand aliasing, which allows use of a symbolic name instead of a number for specifying operands, as illustrated in this example:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="kt">void</span><span class="w"> </span><span class="nf">example17</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="kt">int</span><span class="w"> </span><span class="n">w</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="k">asm</span><span class="p">(</span><span class="w"> </span><span class="s">&quot;movl $1, %[output1]</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;addl %[input], %[output1]</span><span class="se">\n</span><span class="s">&quot;</span><span class="w"></span> <span class="w"> </span><span class="s">&quot;movl %[input], %[output2]&quot;</span><span 
class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="n">output1</span><span class="p">]</span><span class="w"> </span><span class="s">&quot;=&amp;a&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">w</span><span class="p">),</span><span class="w"> </span><span class="p">[</span><span class="n">output2</span><span class="p">]</span><span class="w"> </span><span class="s">&quot;=r&quot;</span><span class="w"></span> <span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="n">input</span><span class="p">]</span><span class="w"> </span><span class="s">&quot;r&quot;</span><span class="w"> </span><span class="p">(</span><span class="n">w</span><span class="p">));</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>In <em>example17()</em>, “%0” and “%[output1]” both refer to the first output operand.</p> </section> <section id="assembly-string-modifiers"> <h2><span class="section-number">16.4. </span>Assembly String Modifiers<a class="headerlink" href="#assembly-string-modifiers" title="Permalink to this headline"></a></h2> <p>Special character sequences in the assembly string affect the way the assembly is generated by the compiler. For example, the “%” is an escape sequence for specifying an operand, “%%” produces a percent for hard-coded registers, and “\n” specifies a new line. <a class="reference internal" href="#inline-asm-strg-modifiers-inline-asm-strg-modifiers-tbl"><span class="std std-ref">Table 37</span></a> summarizes these modifiers, known as <em>Assembly String Modifiers</em>.</p> <table class="table-no-stripes docutils align-default" id="inline-asm-strg-modifiers-inline-asm-strg-modifiers-tbl"> <caption><span class="caption-text">Table 37. 
Assembly String Modifier Characters</span><a class="headerlink" href="#inline-asm-strg-modifiers-inline-asm-strg-modifiers-tbl" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 21%" /> <col style="width: 79%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Modifier</p></th> <th class="head"><p>Description</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>\</p></td> <td><p>Same as \ in printf format strings.</p></td> </tr> <tr class="row-odd"><td><p>%*</p></td> <td><p>Adds a ‘*’ in the assembly string.</p></td> </tr> <tr class="row-even"><td><p>%%</p></td> <td><p>Adds a ‘%’ in the assembly string.</p></td> </tr> <tr class="row-odd"><td><p>%A</p></td> <td><p>Adds a ‘*’ in front of an operand in the assembly string. (For example, %A0 adds a ‘*’ in front of operand 0 in the assembly output.)</p></td> </tr> <tr class="row-even"><td><p>%B</p></td> <td><p>Produces the byte op code suffix for this operand. (For example, %B0 produces ‘b’ on x86-64.)</p></td> </tr> <tr class="row-odd"><td><p>%L</p></td> <td><p>Produces the word op code suffix for this operand. (For example, %L0 produces ‘l’ on x86-64.)</p></td> </tr> <tr class="row-even"><td><p>%P</p></td> <td><p>If producing Position Independent Code (PIC), the compiler adds the PIC suffix for this operand. (For example, %P0 produces &#64;PLT on x86-64.)</p></td> </tr> <tr class="row-odd"><td><p>%Q</p></td> <td><p>Produces a quad word op code suffix for this operand if it is supported by the target. Otherwise, it produces a word op code suffix. (For example, %Q0 produces ‘q’ on x86-64.)</p></td> </tr> <tr class="row-even"><td><p>%S</p></td> <td><p>Produces ‘s’ suffix for this operand. (For example, %S0 produces ‘s’ on x86-64.)</p></td> </tr> <tr class="row-odd"><td><p>%T</p></td> <td><p>Produces ‘t’ suffix for this operand. 
(For example, %T0 produces ‘t’ on x86-64.)</p></td> </tr> <tr class="row-even"><td><p>%W</p></td> <td><p>Produces the half word op code suffix for this operand. (For example, %W0 produces ‘w’ on x86-64.)</p></td> </tr> <tr class="row-odd"><td><p>%a</p></td> <td><p>Adds open and close parentheses ( ) around the operand.</p></td> </tr> <tr class="row-even"><td><p>%b</p></td> <td><p>Produces the byte register name for an operand. (For example, if operand 0 is in register ‘a’, then %b0 will produce ‘%al’.)</p></td> </tr> <tr class="row-odd"><td><p>%c</p></td> <td><p>Cuts the ‘$’ character from an immediate operand.</p></td> </tr> <tr class="row-even"><td><p>%k</p></td> <td><p>Produces the word register name for an operand. (For example, if operand 0 is in register ‘a’, then %k0 will produce ‘%eax’.)</p></td> </tr> <tr class="row-odd"><td><p>%q</p></td> <td><p>Produces the quad word register name for an operand if the target supports quad word. Otherwise, it produces a word register name. (For example, if operand 0 is in register ‘a’, then %q0 produces %rax on x86-64.)</p></td> </tr> <tr class="row-even"><td><p>%w</p></td> <td><p>Produces the half word register name for an operand. (For example, if operand 0 is in register ‘a’, then %w0 will produce ‘%ax’.)</p></td> </tr> <tr class="row-odd"><td><p>%z</p></td> <td><p>Produces an op code suffix based on the size of an operand. (For example, ‘b’ for byte, ‘w’ for half word, ‘l’ for word, and ‘q’ for quad word.)</p></td> </tr> <tr class="row-even"><td><p>%+ %C %D %F %O %X %f %h %l %n %s %y</p></td> <td><p>Not supported.</p></td> </tr> </tbody> </table> <p>These modifiers begin with either a backslash “\” or a percent “%”.</p> <p>The modifiers that begin with a backslash “\” (e.g., “\n”) have the same effect as they do in a printf format string. 
The modifiers that are preceded with a “%” are used to modify a particular operand.</p> <p>For example, “%b0” means “produce the byte (8-bit) version of operand 0”. If operand 0 is a register, this produces a byte register name such as %al, %bl, or %cl.</p> </section> <section id="extended-asm-macros"> <h2><span class="section-number">16.5. </span>Extended Asm Macros<a class="headerlink" href="#extended-asm-macros" title="Permalink to this headline"></a></h2> <p>As with traditional inline assembly, described in <a class="reference internal" href="#inline-asm"><span class="std std-ref">Inline Assembly</span></a>, extended asm can be used in a macro. For example, you can use the following macro to access the runtime stack pointer.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#define GET_SP(x) \</span> <span class="cp">asm(&quot;mov %%sp, %0&quot;: &quot;=m&quot; (x):: &quot;%sp&quot; );</span> <span class="kt">void</span><span class="w"> </span><span class="nf">example20</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">stack_pointer</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">GET_SP</span><span class="p">(</span><span class="n">stack_pointer</span><span class="p">);</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The GET_SP macro assigns the value of the stack pointer to whatever is inserted in its argument (for example, stack_pointer). 
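</p> <p>The same idea can also be written as a small function rather than a macro. The following sketch is an illustration only, not from this guide: it assumes an x86-64 target with GNU-style extended asm, so it reads the %rsp register (where the generic GET_SP macro above uses %sp).</p>

```c
/* Hypothetical x86-64 counterpart of the GET_SP macro: copy the stack
 * pointer into a local variable. Assumes GNU-style extended asm and
 * the x86-64 %rsp register; "=r" lets the compiler pick any general
 * purpose register for the output operand. */
static void *get_sp(void)
{
    void *p;
    __asm__("movq %%rsp, %0" : "=r" (p));
    return p;
}
```

<p>A caller can then write void * stack_pointer = get_sp();, just as example20() invokes GET_SP(stack_pointer).</p> <p>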
Another C extension known as <em>statement expressions</em> is used to write the GET_SP macro another way:</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#define GET_SP2 ({ \</span> <span class="cp">void *my_stack_ptr; \</span> <span class="cp">asm(&quot;mov %%sp, %0&quot;: &quot;=m&quot; (my_stack_ptr) :: &quot;%sp&quot; ); \</span> <span class="cp">my_stack_ptr; \</span> <span class="cp">})</span> <span class="kt">void</span><span class="w"> </span><span class="nf">example21</span><span class="p">()</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">stack_pointer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GET_SP2</span><span class="p">;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <p>The statement expression allows a body of code to evaluate to a single value. This value is specified as the last instruction in the statement expression. In this case, the value is the result of the asm statement, my_stack_ptr. By writing an asm macro with a statement expression, the asm result may be assigned directly to another variable (for example, void * stack_pointer = GET_SP2) or included in a larger expression, such as: void * stack_pointer = GET_SP2 - sizeof(long).</p> <p>Which style of macro to use depends on the application. If the asm statement needs to be a part of an expression, then a macro with a statement expression is a good approach. Otherwise, a traditional macro, like GET_SP(x), will probably suffice.</p> </section> <section id="intrinsics"> <h2><span class="section-number">16.6. 
</span>Intrinsics<a class="headerlink" href="#intrinsics" title="Permalink to this headline"></a></h2> <p>Inline intrinsic functions map to actual x86-64 machine instructions. Intrinsics are inserted inline to avoid the overhead of a function call. The compiler has special knowledge of intrinsics, so using intrinsics may generate better code than equivalent extended inline assembly.</p> <p>The NVIDIA HPC Compilers intrinsics library implements MMX, SSE, SSE2, SSE3, SSSE3, SSE4a, ABM, and AVX instructions. The intrinsic functions are available to C and C++ programs. Unlike most library functions, intrinsics are implemented internally by the compiler. A program can call the intrinsic functions from C/C++ source code after including the corresponding header file.</p> <p>The intrinsics are divided into header files as follows:</p> <table class="table-no-stripes docutils align-default" id="id39"> <caption><span class="caption-text">Table 38. Intrinsic Header File Organization</span><a class="headerlink" href="#id39" title="Permalink to this table"></a></caption> <colgroup> <col style="width: 26%" /> <col style="width: 23%" /> <col style="width: 2%" /> <col style="width: 26%" /> <col style="width: 23%" /> </colgroup> <thead> <tr class="row-odd"><th class="head"><p>Instructions</p></th> <th class="head"><p>Header File</p></th> <th class="head"></th> <th class="head"><p>Instructions</p></th> <th class="head"><p>Header File</p></th> </tr> </thead> <tbody> <tr class="row-even"><td><p>ABM</p></td> <td><p>intrin.h</p></td> <td></td> <td><p>SSE2</p></td> <td><p>emmintrin.h</p></td> </tr> <tr class="row-odd"><td><p>AVX</p></td> <td><p>immintrin.h</p></td> <td></td> <td><p>SSE3</p></td> <td><p>pmmintrin.h</p></td> </tr> <tr class="row-even"><td><p>MMX</p></td> <td><p>mmintrin.h</p></td> <td></td> <td><p>SSSE3</p></td> <td><p>tmmintrin.h</p></td> </tr> <tr class="row-odd"><td><p>SSE</p></td> <td><p>xmmintrin.h</p></td> <td></td> <td><p>SSE4a</p></td> 
<td><p>ammintrin.h</p></td> </tr> </tbody> </table> <p>The following is a simple example program that calls XMM intrinsics.</p> <div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;xmmintrin.h&gt;</span><span class="cp"></span> <span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">(){</span><span class="w"></span> <span class="w"> </span><span class="kr">__m128</span><span class="w"> </span><span class="n">__A</span><span class="p">,</span><span class="w"> </span><span class="n">__B</span><span class="p">,</span><span class="w"> </span><span class="n">result</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="n">__A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">_mm_set_ps</span><span class="p">(</span><span class="mf">23.3</span><span class="p">,</span><span class="w"> </span><span class="mf">43.7</span><span class="p">,</span><span class="w"> </span><span class="mf">234.234</span><span class="p">,</span><span class="w"> </span><span class="mf">98.746</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">__B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">_mm_set_ps</span><span class="p">(</span><span class="mf">15.4</span><span class="p">,</span><span class="w"> </span><span class="mf">34.3</span><span class="p">,</span><span class="w"> </span><span class="mf">4.1</span><span class="p">,</span><span class="w"> </span><span class="mf">8.6</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="n">result</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">__A</span><span class="p">,</span><span class="n">__B</span><span 
class="p">);</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> </pre></div> </div> <p class="rubric-h1 rubric">Notices</p> <p class="rubric-h2 rubric">Notice</p> <p>ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.</p> <p>Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.</p> <p class="rubric-h2 rubric">Trademarks</p> <p>NVIDIA, the NVIDIA logo, CUDA, CUDA-X, GPUDirect, HPC SDK, NGC, NVIDIA Volta, NVIDIA DGX, NVIDIA Nsight, NVLink, NVSwitch, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. 
Other company and product names may be trademarks of the respective companies with which they are associated.</p> </section> </section> </div> </div> <footer> <hr/> <div role="contentinfo"> <img src="_static/NVIDIA-LogoBlack.svg" class="only-light"/> <img src="_static/NVIDIA-LogoWhite.svg" class="only-dark"/> <p class="notices"> <a href="https://www.nvidia.com/en-us/about-nvidia/privacy-policy/" target="_blank">Privacy Policy</a> | <a href="https://www.nvidia.com/en-us/about-nvidia/privacy-center/" target="_blank">Manage My Privacy</a> | <a href="https://www.nvidia.com/en-us/preferences/start/" target="_blank">Do Not Sell or Share My Data</a> | <a href="https://www.nvidia.com/en-us/about-nvidia/terms-of-service/" target="_blank">Terms of Service</a> | <a href="https://www.nvidia.com/en-us/about-nvidia/accessibility/" target="_blank">Accessibility</a> | <a href="https://www.nvidia.com/en-us/about-nvidia/company-policies/" target="_blank">Corporate Policies</a> | <a href="https://www.nvidia.com/en-us/product-security/" target="_blank">Product Security</a> | <a href="https://www.nvidia.com/en-us/contact/" target="_blank">Contact</a> </p> <p> Copyright &#169; 2013-2025, NVIDIA Corporation &amp; affiliates. All rights reserved. </p> <p> <span class="lastupdated">Last updated on Jan 22, 2025. </span></p> </div> </footer> </div> </div> </section> </div> <script> jQuery(function () { SphinxRtdTheme.Navigation.enable(false); }); </script> <script type="text/javascript">if (typeof _satellite !== "undefined"){_satellite.pageBottom();}</script> </body> </html>
