HPC Compilers User's Guide Version 24.11 for ARM, OpenPower, x86

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-us" xml:lang="en-us"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <meta http-equiv="X-UA-Compatible" content="IE=edge" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <meta name="copyright" content="(C) Copyright 2018 NVIDIA Corp." /> <meta name="DC.rights.owner" content="(C) Copyright 2018 NVIDIA Corp." /> <meta name="DC.Type" content="concept" /> <meta name="DC.Title" content="NVIDIA HPC Compilers User's Guide" /> <meta name="abstract" content="NVIDIA HPC Compilers for NVIDIA CPUs version 24.11 user's guide" /> <meta name="description" content="NVIDIA HPC Compilers for NVIDIA CPUs version 24.11 user's guide" /> <meta name="DC.Coverage" content="HPC Compilers" /> <meta name="DC.subject" content="NVIDIA HPC Compiler, NVIDIA, NVIDIA HPC Compilers manual, NVIDIA HPC Compilers instructions, NVIDIA HPC Compilers command-line options" /> <meta name="keywords" content="NVIDIA HPC Compiler, NVIDIA, NVIDIA HPC Compilers manual, NVIDIA HPC Compilers instructions, NVIDIA HPC Compilers command-line options" /> <meta name="DC.Format" content="XHTML" /> <meta name="DC.Identifier" content="abstract" /> <link rel="stylesheet" type="text/css" href="/hpc-sdk/common/formatting/commonltr.css" /> <title>HPC Compilers User's Guide Version 24.11 for ARM, OpenPower, x86</title> <link rel="canonical" href="hpc-compilers-user-guide/index.html" /> <!--[if lt IE 9]> <script src="{$common-formatting-url}html5shiv-printshiv.min.js"></script> <![endif]--> <link rel="stylesheet" type="text/css" href="/hpc-sdk/common/formatting/consent.css" /> <link rel="stylesheet" type="text/css" href="/hpc-sdk/common/formatting/site.css" /><script type="text/javascript"> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ 
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-59690533-1', 'auto'); ga('send', 'pageview'); </script><script type="text/javascript" charset="utf-8" src="/hpc-sdk/common/formatting/jquery.min.js"></script><script type="text/javascript" charset="utf-8" src="/hpc-sdk/common/formatting/jquery.ba-hashchange.min.js"></script><script type="text/javascript" charset="utf-8" src="/hpc-sdk/common/formatting/jquery.scrollintoview.min.js"></script><script type="text/javascript" src="/hpc-sdk/compilers/search/htmlFileList.js"></script><script type="text/javascript" src="/hpc-sdk/compilers/search/htmlFileInfoList.js"></script><script type="text/javascript" src="/hpc-sdk/compilers/search/nwSearchFnt.min.js"></script><script type="text/javascript" src="/hpc-sdk/compilers/search/stemmers/en_stemmer.min.js"></script><script type="text/javascript" src="/hpc-sdk/compilers/search/index-1.js"></script><script type="text/javascript" src="/hpc-sdk/compilers/search/index-2.js"></script><script type="text/javascript" src="/hpc-sdk/compilers/search/index-3.js"></script><link rel="stylesheet" type="text/css" href="/hpc-sdk/common/formatting/qwcode.highlight.css" /> </head> <body> <div class="main-wrapper"> <header id="header"><span id="company"><a href="https://docs.nvidia.com/"><img src="/hpc-sdk/common/formatting/acc-comp-logo.png" alt="HPC SDK" height="32" /></a></span><span id="site-title"><a href=""><img src="/hpc-sdk/common/formatting/hpc-sdk-documentation.png" alt="HPC SDK Documentation" height="16" /></a></span><form id="search" method="get" action="search"><input type="text" name="search-text" /><fieldset id="search-location"> <legend>Search In:</legend><label><input type="radio" name="search-type" value="site" />Entire Site</label><label><input type="radio" 
name="search-type" value="document" />Just This Document</label></fieldset><button type="reset">clear search</button><button id="submit" type="submit">search</button></form> </header> <div id="site-content"> <nav id="site-nav"> <div class="category closed"><a href="../index.html" title="Parent Directory.">HPC SDK version 24.11</a></div> <div class="category closed"><a href="./index.html" title="HPC Compilers User's Guide">HPC Compilers User's Guide</a></div> <ul> <li> <div class="section-link"><a href="#"></a></div> </li> <li> <div class="section-link"><a href="#"></a></div> </li> <li> <div class="section-link"><a href="#nv-ug-preface">Preface</a></div> <ul> <li> <div class="section-link"><a href="#audience">Audience Description</a></div> </li> <li> <div class="section-link"><a href="#standards">Compatibility and Conformance to Standards</a></div> </li> <li> <div class="section-link"><a href="#nv-ug-organization">Organization</a></div> </li> <li> <div class="section-link"><a href="#hw-sw-constraints">Hardware and Software Constraints</a></div> </li> <li> <div class="section-link"><a href="#doc-conventions">Conventions</a></div> </li> <li> <div class="section-link"><a href="#doc-conventions-more">Terms</a></div> </li> <li> <div class="section-link"><a href="#related-pubs">Related Publications</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#gs-nv">1. Getting Started</a></div> <ul> <li> <div class="section-link"><a href="#gs-nv-overview">1.1. Overview</a></div> </li> <li> <div class="section-link"><a href="#example-hello">1.2. Creating an Example</a></div> </li> <li> <div class="section-link"><a href="#invoke-nv-comp">1.3. Invoking the Command-level NVIDIA HPC Compilers</a></div> <ul> <li> <div class="section-link"><a href="#cmd-line-syntax">1.3.1. Command-line Syntax</a></div> </li> <li> <div class="section-link"><a href="#cmd-line-opts">1.3.2. Command-line Options</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#fn-conv">1.4. 
Filename Conventions</a></div> <ul> <li> <div class="section-link"><a href="#fn-conv-input">1.4.1. Input Files</a></div> </li> <li> <div class="section-link"><a href="#fn-conv-output">1.4.2. Output Files</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#data-types">1.5. Fortran, C++ and C Data Types</a></div> </li> <li> <div class="section-link"><a href="#platform-spec-considerations">1.6. Platform-specific considerations</a></div> <ul> <li> <div class="section-link"><a href="#use-nv-compilers-linux">1.6.1. Using the NVIDIA HPC Compilers on Linux</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#site-spec-comp-cust">1.7. Site-Specific Customization of the Compilers</a></div> <ul> <li> <div class="section-link"><a href="#site-spec-using-siterc">1.7.1. Use siterc Files</a></div> </li> <li> <div class="section-link"><a href="#site-spec-using-user-rc">1.7.2. Using User rc Files</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#common-dev-tasks">1.8. Common Development Tasks</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#cmdln-options-use">2. Use Command-line Options</a></div> <ul> <li> <div class="section-link"><a href="#cmdln-opt-overview">2.1. Command-line Option Overview</a></div> <ul> <li> <div class="section-link"><a href="#cmdln-opt-syntax">2.1.1. Command-line Options Syntax</a></div> </li> <li> <div class="section-link"><a href="#cmdln-subops">2.1.2. Command-line Suboptions</a></div> </li> <li> <div class="section-link"><a href="#cmdln-conflict-ops">2.1.3. Command-line Conflicting Options</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#cmdln-options-help">2.2. Help with Command-line Options</a></div> </li> <li> <div class="section-link"><a href="#gs-perform">2.3. Getting Started with Performance</a></div> <ul> <li> <div class="section-link"><a href="#gs-perform-fast">2.3.1. 
Using -fast</a></div> </li> <li> <div class="section-link"><a href="#gs-perform-other">2.3.2. Other Performance-Related Options</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#freq-used-options">2.4. Frequently-used Options</a></div> </li> <li> <div class="section-link"><a href="#float-point-subnormal">2.5. Floating-point Subnormal</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#opt-parallel">3. Multicore CPU Optimization</a></div> <ul> <li> <div class="section-link"><a href="#opt-overview">3.1. Overview of Optimization</a></div> <ul> <li> <div class="section-link"><a href="#opt-local">3.1.1. Local Optimization</a></div> </li> <li> <div class="section-link"><a href="#opt-global">3.1.2. Global Optimization</a></div> </li> <li> <div class="section-link"><a href="#opt-loop">3.1.3. Loop Optimization: Unrolling, Vectorization and Parallelization</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa">3.1.4. Interprocedural Analysis (IPA) and Optimization</a></div> </li> <li> <div class="section-link"><a href="#opt-func-inline">3.1.5. Function Inlining</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#opt-gs">3.2. Getting Started with Optimization</a></div> <ul> <li> <div class="section-link"><a href="#opt-gs-help">3.2.1. -help</a></div> </li> <li> <div class="section-link"><a href="#opt-gs-minfo">3.2.2. -Minfo</a></div> </li> <li> <div class="section-link"><a href="#opt-gs-mneginfo">3.2.3. -Mneginfo</a></div> </li> <li> <div class="section-link"><a href="#opt-gs-dryrun">3.2.4. -dryrun</a></div> </li> <li> <div class="section-link"><a href="#opt-gs-v">3.2.5. -v</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#opt-local-global">3.3. Local and Global Optimization</a></div> <ul> <li> <div class="section-link"><a href="#opt-loc-glob-msafeptr">3.3.1. -Msafeptr</a></div> </li> <li> <div class="section-link"><a href="#opt-loc-glob-o">3.3.2. 
-O</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#loop-unroll-munroll">3.4. Loop Unrolling using -Munroll</a></div> </li> <li> <div class="section-link"><a href="#vect-mvect">3.5. Vectorization using -Mvect</a></div> <ul> <li> <div class="section-link"><a href="#vect-subopt">3.5.1. Vectorization Sub-options</a></div> </li> <li> <div class="section-link"><a href="#vect-exam-simd">3.5.2. Vectorization Example Using SIMD Instructions</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#opt-ipa-mipa">3.6. Interprocedural Analysis and Optimization using -Mipa</a></div> <ul> <li> <div class="section-link"><a href="#opt-ipa-no-build-1step">3.6.1. Building a Program Without IPA – Single Step</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa-no-build-multi-step">3.6.2. Building a Program Without IPA – Several Steps</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa-no-build-use-make">3.6.3. Building a Program Without IPA Using Make</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa-build-prog">3.6.4. Building a Program with IPA</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa-build-1step">3.6.5. Building a Program with IPA – Single Step</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa-build-multi-step">3.6.6. Building a Program with IPA – Several Steps</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa-build-use-make">3.6.7. Building a Program with IPA Using Make</a></div> </li> <li> <div class="section-link"><a href="#opt-ipa-qa">3.6.8. Questions about IPA</a></div> </li> </ul> </li> </ul> </li> <li> <div class="section-link"><a href="#fn-inline-use">4. Using Function Inlining</a></div> <ul> <li> <div class="section-link"><a href="#fn-inline-auto">4.1. Automatic function inlining in C++ and C</a></div> </li> <li> <div class="section-link"><a href="#fn-inline-invoke">4.2. 
Invoking Procedure Inlining</a></div> </li> <li> <div class="section-link"><a href="#fn-inline-use-lib">4.3. Using an Inline Library</a></div> </li> <li> <div class="section-link"><a href="#fn-inline-create-lib">4.4. Creating an Inline Library</a></div> <ul> <li> <div class="section-link"><a href="#fn-inline-work-lib">4.4.1. Working with Inline Libraries</a></div> </li> <li> <div class="section-link"><a href="#fn-inline-lib-depend">4.4.2. Dependencies</a></div> </li> <li> <div class="section-link"><a href="#fn-inline-upd-lib-makefiles">4.4.3. Updating Inline Libraries – Makefiles</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#fn-inline-errors">4.5. Error Detection during Inlining</a></div> </li> <li> <div class="section-link"><a href="#fn-inline-examples">4.6. Examples</a></div> </li> <li> <div class="section-link"><a href="#fn-inline-restrictions">4.7. Restrictions on Inlining</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#gpus-use">5. Using GPUs</a></div> <ul> <li> <div class="section-link"><a href="#acc-overview">5.1. Overview</a></div> </li> <li> <div class="section-link"><a href="#acc-terms">5.2. Terminology</a></div> </li> <li> <div class="section-link"><a href="#acc-exe-model">5.3. Execution Model</a></div> <ul> <li> <div class="section-link"><a href="#acc-exe-host-funcs">5.3.1. Host Functions</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#acc-mem-model">5.4. Memory Model</a></div> <ul> <li> <div class="section-link"><a href="#acc-mem-sep-host">5.4.1. Separate Host and Accelerator Memory Considerations</a></div> <ul> <li> <div class="section-link"><a href="#acc-mem-model-mem">5.4.1.1. Accelerator Memory</a></div> </li> <li> <div class="section-link"><a href="#acc-mem-pinned-buffer">5.4.1.2. Staging Memory Buffer</a></div> </li> <li> <div class="section-link"><a href="#acc-mem-cache-mgmt">5.4.1.3. 
Cache Management</a></div> </li> <li> <div class="section-link"><a href="#env-vars-memory">5.4.1.4. Environment Variables Controlling Device Memory Management</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#acc-mem-unified">5.4.2. Managed and Unified Memory Modes</a></div> <ul> <li> <div class="section-link"><a href="#gpu-mem-mode-managed">5.4.2.1. Managed Memory Mode</a></div> </li> <li> <div class="section-link"><a href="#gpu-mem-mode-unified">5.4.2.2. Unified Memory Mode</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#gpu-mem-poolallocator">5.4.3. Memory Pool Allocator</a></div> </li> <li> <div class="section-link"><a href="#gpu-mem-intercept">5.4.4. Interception of Deallocations</a></div> </li> <li> <div class="section-link"><a href="#gpu-mem-flags">5.4.5. Command-line Options Selecting Compiler Memory Modes</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#fortran-ptrs-on-gpus">5.5. Fortran pointers in device code</a></div> </li> <li> <div class="section-link"><a href="#calling-routines-on-gpus">5.6. Calling routines in a compute kernel</a></div> </li> <li> <div class="section-link"><a href="#acc-supp-procs-gpus">5.7. Supported Processors and GPUs</a></div> </li> <li> <div class="section-link"><a href="#cuda-toolkit-versions">5.8. CUDA Versions</a></div> </li> <li> <div class="section-link"><a href="#compute-cap">5.9. Compute Capability</a></div> </li> <li> <div class="section-link"><a href="#ptx-jit">5.10. PTX JIT Compilation</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#acc-use">6. Using OpenACC</a></div> <ul> <li> <div class="section-link"><a href="#acc-directives">6.1. OpenACC Programming Model</a></div> <ul> <li> <div class="section-link"><a href="#acc-exe-levl-parallel">6.1.1. Levels of Parallelism</a></div> </li> <li> <div class="section-link"><a href="#acc-dir-enable">6.1.2. 
Enable OpenACC Directives</a></div> </li> <li> <div class="section-link"><a href="#acc-dir-openacc-sum">6.1.3. OpenACC Support</a></div> </li> <li> <div class="section-link"><a href="#acc-dir-clauses">6.1.4. OpenACC Extensions</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#acc-cmdln-opts">6.2. Compiling an OpenACC Program</a></div> <ul> <li> <div class="section-link"><a href="#acc">6.2.1. -[no]acc</a></div> </li> <li> <div class="section-link"><a href="#gpu">6.2.2. -gpu</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#multicore">6.3. OpenACC for Multicore CPUs</a></div> </li> <li> <div class="section-link"><a href="#acc-openacc-unified-mem">6.4. OpenACC with CUDA Unified Memory</a></div> </li> <li> <div class="section-link"><a href="#openacc-error-handling">6.5. OpenACC Error Handling</a></div> </li> <li> <div class="section-link"><a href="#openacc-cuda-graphs">6.6. OpenACC and CUDA Graphs</a></div> </li> <li> <div class="section-link"><a href="#env-vars">6.7. Environment Variables</a></div> </li> <li> <div class="section-link"><a href="#acc-prof-kernels">6.8. Profiling Accelerator Kernels</a></div> </li> <li> <div class="section-link"><a href="#acc-openacc-runtm-lib">6.9. OpenACC Runtime Libraries</a></div> <ul> <li> <div class="section-link"><a href="#acc-openacc-lib-defn">6.9.1. Runtime Library Definitions</a></div> </li> <li> <div class="section-link"><a href="#acc-openacc-lib-routines">6.9.2. Runtime Library Routines</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#acc-supt-intrin">6.10. Supported Intrinsics</a></div> <ul> <li> <div class="section-link"><a href="#acc-fort-intrin-sum">6.10.1. Supported Fortran Intrinsics Summary Table</a></div> </li> <li> <div class="section-link"><a href="#acc-c-intrin-sum">6.10.2. Supported C Intrinsics Summary Table</a></div> </li> </ul> </li> </ul> </li> <li> <div class="section-link"><a href="#openmp-use">7. 
Using OpenMP</a></div> <ul> <li> <div class="section-link"><a href="#openmp-env-vars">7.1. Environment Variables</a></div> </li> <li> <div class="section-link"><a href="#openmp-fallback">7.2. Fallback Mode</a></div> </li> <li> <div class="section-link"><a href="#openmp-loop">7.3. Loop</a></div> </li> <li> <div class="section-link"><a href="#openmp-subset">7.4. OpenMP Subset</a></div> </li> <li> <div class="section-link"><a href="#openmp-metadirective">7.5. Using metadirective</a></div> </li> <li> <div class="section-link"><a href="#openmp-cuda-streams">7.6. Mapping target constructs to CUDA streams</a></div> </li> <li> <div class="section-link"><a href="#openmp-noncontiguous-arrays">7.7. Noncontiguous Array Sections</a></div> </li> <li> <div class="section-link"><a href="#openmp-unified-mem">7.8. OpenMP with CUDA Unified Memory</a></div> </li> <li> <div class="section-link"><a href="#openmp-multi-gpus">7.9. Multiple Device Support</a></div> </li> <li> <div class="section-link"><a href="#openmp-interop-cuda">7.10. Interoperability with CUDA</a></div> </li> <li> <div class="section-link"><a href="#openmp-interop-openmp">7.11. Interoperability with Other OpenMP Compilers</a></div> </li> <li> <div class="section-link"><a href="#openmp-gnu">7.12. GNU STL</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#stdpar-use">8. Using Stdpar</a></div> <ul> <li> <div class="section-link"><a href="#stdpar-mem-modes">8.1. GPU Memory Modes</a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp">8.2. Stdpar C++</a></div> <ul> <li> <div class="section-link"><a href="#stdpar-cpp-intro">8.2.1.  Introduction to Stdpar C++</a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-compiler-support">8.2.2.  NVC++ Compiler Parallel Algorithms Support </a></div> <ul> <li> <div class="section-link"><a href="#stdpar-cpp-stdpar-option">8.2.2.1.  
Enabling Parallel Algorithms with the -⁠stdpar Option </a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#stdpar-cpp-example-1">8.2.3.  Stdpar C++ Simple Example </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-impl-openacc">8.2.4.  OpenACC Implementation of Parallel Algorithms </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-guidelines">8.2.5.  Coding Guidelines for GPU-accelerating Parallel Algorithms </a></div> <ul> <li> <div class="section-link"><a href="#stdpar-cpp-device-functions">8.2.5.1.  Parallel Algorithms and Device Function Annotations </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-unified-memory">8.2.5.2. Data Management in Parallel Algorithms</a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-function-pointers">8.2.5.3.  Parallel Algorithms and Function Pointers </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-iterators">8.2.5.4.  Random Access Iterators </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-interop">8.2.5.5.  Interoperability with the C++ Standard Library </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-exceptions">8.2.5.6.  No Exceptions in GPU Code </a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#stdpar-cpp-experimental">8.2.6.  NVC++ Experimental Features </a></div> <ul> <li> <div class="section-link"><a href="#stdpar-cpp-multi-dimensional-spans">8.2.6.1.  Multi-dimensional Spans </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-senders-and-receivers">8.2.6.2.  Senders and Receivers </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-linear-algebra">8.2.6.3.  Linear Algebra </a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#stdpar-cpp-example-2">8.2.7.  Stdpar C++ Larger Example: LULESH </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-interop-openacc">8.2.8.  
Interoperability with OpenACC </a></div> <ul> <li> <div class="section-link"><a href="#stdpar-cpp-interop-openacc-dataman">8.2.8.1. Data Management Directives</a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-interop-openacc-routine">8.2.8.2. External Device Function Annotations</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#stdpar-cpp-gpu-getting-started">8.2.9.  Getting Started with Parallel Algorithms for GPUs </a></div> <ul> <li> <div class="section-link"><a href="#stdpar-cpp-supported-gpus">8.2.9.1.  Supported NVIDIA GPUs </a></div> </li> <li> <div class="section-link"><a href="#stdpar-cpp-supported-cuda">8.2.9.2.  Supported CUDA Versions </a></div> </li> </ul> </li> </ul> </li> <li> <div class="section-link"><a href="#stdpar-fortran">8.3. Stdpar Fortran</a></div> <ul> <li> <div class="section-link"><a href="#stdpar-fortran-routine">8.3.1. Calling Routines in DO CONCURRENT on the GPU</a></div> </li> <li> <div class="section-link"><a href="#stdpar-fortran-um">8.3.2. GPU Data Management</a></div> </li> <li> <div class="section-link"><a href="#stdpar-fortran-interop-openacc">8.3.3. Interoperability with OpenACC</a></div> </li> <li> <div class="section-link"><a href="#stdpar-fortran-interop-cuf">8.3.4. Interoperability with CUDA Fortran</a></div> </li> </ul> </li> </ul> </li> <li> <div class="section-link"><a href="#pcast">9. PCAST</a></div> <ul> <li> <div class="section-link"><a href="#pcast-overview">9.1. Overview</a></div> </li> <li> <div class="section-link"><a href="#pcast-nv-compare">9.2. PCAST with a "Golden" File</a></div> </li> <li> <div class="section-link"><a href="#pcast-gpu">9.3. PCAST with OpenACC</a></div> </li> <li> <div class="section-link"><a href="#pcast-limitations">9.4. Limitations</a></div> </li> <li> <div class="section-link"><a href="#pcast-env-vars">9.5. Environment Variables</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#mpi-use">10. 
Using MPI</a></div> <ul> <li> <div class="section-link"><a href="#mpi-open-mpi-linux">10.1. Using Open MPI on Linux</a></div> </li> <li> <div class="section-link"><a href="#mpi-compiler-wrappers">10.2. Using MPI Compiler Wrappers</a></div> </li> <li> <div class="section-link"><a href="#mpi-testing-benchmarks">10.3. Testing and Benchmarking</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#lib-create-use">11. Creating and Using Libraries</a></div> <ul> <li> <div class="section-link"><a href="#lib-builtin-c-cpp">11.1. Using builtin Math Functions in C++ and C</a></div> </li> <li> <div class="section-link"><a href="#lib-sys-routines">11.2. Using System Library Routines</a></div> </li> <li> <div class="section-link"><a href="#lib-create-use-shared-obj-linux">11.3. Creating and Using Shared Object Files on Linux</a></div> <ul> <li> <div class="section-link"><a href="#lib-create-use-shared-obj-linux-proc">11.3.1. Procedure to create and use a shared object file</a></div> </li> <li> <div class="section-link"><a href="#lib-create-use-shared-obj-linux-ldd-cmd">11.3.2. ldd Command</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#lib-use-lib3f">11.4. Using LIB3F</a></div> </li> <li> <div class="section-link"><a href="#lib-use-lapack-blas-ffts">11.5. LAPACK, BLAS and FFTs</a></div> </li> <li> <div class="section-link"><a href="#lib-link-scalapack">11.6. Linking with ScaLAPACK</a></div> </li> <li> <div class="section-link"><a href="#lib-std-cpp-template">11.7. The C++ Standard Template Library</a></div> </li> <li> <div class="section-link"><a href="#lib-use-nvpl">11.8. NVIDIA Performance Libraries (NVPL)</a></div> </li> <li> <div class="section-link"><a href="#lib-link-nvmalloc">11.9. Linking with the nvmalloc Library</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#env-vars-use">12. Environment Variables</a></div> <ul> <li> <div class="section-link"><a href="#env-vars-set">12.1. 
Setting Environment Variables</a></div> <ul> <li> <div class="section-link"><a href="#env-vars-set-linux">12.1.1. Setting Environment Variables on Linux</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#env-vars-nv-related">12.2. HPC Compiler Related Environment Variables</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv">12.3. HPC Compilers Environment Variables</a></div> <ul> <li> <div class="section-link"><a href="#env-vars-nv-fortranopt">12.3.1. FORTRANOPT</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-fort_fmt_recl">12.3.2. FORT_FMT_RECL</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-gmon-out-prefix">12.3.3. GMON_OUT_PREFIX</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-ld-library-path">12.3.4. LD_LIBRARY_PATH</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-manpath">12.3.5. MANPATH</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-no-stop-msg">12.3.6. NO_STOP_MESSAGE</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-path">12.3.7. PATH</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-fpu-state">12.3.8. NVCOMPILER_FPU_STATE</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-term">12.3.9. NVCOMPILER_TERM</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-term-debug">12.3.10. NVCOMPILER_TERM_DEBUG</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-pwd">12.3.11. PWD</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-static-random-seed">12.3.12. STATIC_RANDOM_SEED</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-tmp">12.3.13. TMP</a></div> </li> <li> <div class="section-link"><a href="#env-vars-nv-tmpdir">12.3.14. TMPDIR</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#env-vars-use-mod-linux">12.4. 
Using Environment Modules on Linux</a></div> </li> <li> <div class="section-link"><a href="#env-vars-stack-trace-jit-dbg">12.5. Stack Traceback and JIT Debugging</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#deploy-dist-files">13. Distributing Files – Deployment</a></div> <ul> <li> <div class="section-link"><a href="#deploy-app-linux">13.1. Deploying Applications on Linux</a></div> <ul> <li> <div class="section-link"><a href="#deploy-app-linux-runtm-lib">13.1.1. Runtime Library Considerations</a></div> </li> <li> <div class="section-link"><a href="#deploy-app-linux-64bit">13.1.2. 64-bit Linux Considerations</a></div> </li> <li> <div class="section-link"><a href="#deploy-app-linux-redist">13.1.3. Linux Redistributable Files</a></div> </li> <li> <div class="section-link"><a href="#deploy-app-linux-port">13.1.4. Restrictions on Linux Portability</a></div> </li> <li> <div class="section-link"><a href="#deploy-app-linux-license">13.1.5. Licensing for Redistributable (REDIST) Files</a></div> </li> </ul> </li> </ul> </li> <li> <div class="section-link"><a href="#intr-lang-call">14. Inter-language Calling</a></div> <ul> <li> <div class="section-link"><a href="#intr-lang-call-conv-ovr">14.1. Overview of Calling Conventions</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-call-consider">14.2. Inter-language Calling Considerations</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-funcs-subs">14.3. Functions and Subroutines</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-case-conv">14.4. Upper and Lower Case Conventions, Underscores</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-data-types">14.5. Compatible Data Types</a></div> <ul> <li> <div class="section-link"><a href="#intr-lang-data-fort-blks">14.5.1. Fortran Named Common Blocks</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#intr-lang-arg-pass-rtn-val">14.6. 
Argument Passing and Return Values</a></div> <ul> <li> <div class="section-link"><a href="#intr-lang-arg-pass-by-val">14.6.1. Passing by Value (%VAL)</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-arg-char-rtn-val">14.6.2. Character Return Values</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-arg-cmplx-rtn-val">14.6.3. Complex Return Values</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#intr-lang-array-idx">14.7. Array Indices</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-examples">14.8. Examples</a></div> <ul> <li> <div class="section-link"><a href="#intr-lang-exam-fort-c">14.8.1. Example – Fortran Calling C</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-exam-c-fort">14.8.2. Example – C Calling Fortran</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-exam-cpp-c">14.8.3. Example – C++ Calling C</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-exam-c-cpp">14.8.4. Example – C Calling C++</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-exam-fort-cpp">14.8.5. Example – Fortran Calling C++</a></div> </li> <li> <div class="section-link"><a href="#intr-lang-exam-cpp-fort">14.8.6. Example – C++ Calling Fortran</a></div> </li> </ul> </li> </ul> </li> <li> <div class="section-link"><a href="#prog-64bits">15. Programming Considerations for 64-Bit Environments</a></div> <ul> <li> <div class="section-link"><a href="#prog-64bits-data-types">15.1. Data Types in the 64-Bit Environment</a></div> <ul> <li> <div class="section-link"><a href="#prog-64bits-data-types-c">15.1.1. C++ and C Data Types</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-data-types-fort">15.1.2. Fortran Data Types</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#prog-64bits-static-data-linux">15.2. 
Large Static Data in Linux</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-dynamic-alloc-data">15.3. Large Dynamically Allocated Data</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-array-indexing">15.4. 64-Bit Array Indexing</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-comp-opts">15.5. Compiler Options for 64-bit Programming</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-array-limit">15.6. Practical Limitations of Large Array Programming</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-lrg-ary-med-mem-c">15.7. Medium Memory Model and Large Array in C</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-lrg-ary-med-mem-fort">15.8. Medium Memory Model and Large Array in Fortran</a></div> </li> <li> <div class="section-link"><a href="#prog-64bits-lrg-ary-sml-mem-fort">15.9. Large Array and Small Memory Model in Fortran</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#inline-asm-intrin-c-cpp">16. C++ and C Inline Assembly and Intrinsics</a></div> <ul> <li> <div class="section-link"><a href="#inline-asm">16.1. Inline Assembly</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-extended">16.2. Extended Inline Assembly</a></div> <ul> <li> <div class="section-link"><a href="#inline-asm-output-operands">16.2.1. Output Operands</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-input-operands">16.2.2. Input Operands</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-clobber-lst">16.2.3. Clobber List</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-addl-constraints">16.2.4. Additional Constraints</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-simple-constraints">16.2.5. Simple Constraints</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-machine-constraints">16.2.6. 
Machine Constraints</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-mult-alt-constraints">16.2.7. Multiple Alternative Constraints</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-constraint-modifiers">16.2.8. Constraint Modifiers</a></div> </li> </ul> </li> <li> <div class="section-link"><a href="#inline-asm-operand-aliases">16.3. Operand Aliases</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-strg-modifiers">16.4. Assembly String Modifiers</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-extended-macros">16.5. Extended Asm Macros</a></div> </li> <li> <div class="section-link"><a href="#inline-asm-intrinsics">16.6. Intrinsics</a></div> </li> </ul> </li> </ul> </nav> <div id="resize-nav"></div> <nav id="search-results"> <h2>Search Results</h2> <ol></ol> </nav> <div id="contents-container"> <div id="breadcrumbs-container"> <div id="release-info">HPC Compilers User's Guide (<a href="../../pdf/hpc2411ug.pdf">PDF</a>) - HPC SDK version 24.11 (<a href="https://docs.nvidia.com/hpc-sdk/archive/index.html">Documentation Archives</a>) - Last updated November 13, 2024 - <a href="mailto:enterprisesupport@nvidia.com?subject=HPC SDK version 24.11 Feedback: HPC Compilers User's Guide">Send Feedback</a></div> </div> <article id="contents"> <div class="topic nested0" id="abstract"><a name="abstract" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#abstract" name="abstract" shape="rect">NVIDIA HPC Compilers User's Guide</a></h2> <div class="body conbody"></div> </div> <div class="topic concept nested0" id="nv-ug-preface"><a name="nv-ug-preface" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#nv-ug-preface" name="nv-ug-preface" 
shape="rect">Preface</a></h2> <div class="body conbody"> <p class="p"> This guide is part of a set of manuals that describe how to use the NVIDIA HPC Fortran, C++ and C compilers. These compilers include the <em class="ph i">NVFORTRAN</em>, <em class="ph i">NVC++</em> and <em class="ph i">NVC</em> compilers. They work in conjunction with an assembler, linker, libraries and header files on your target system, and include a CUDA toolchain, libraries and header files for GPU computing. You can use the NVIDIA HPC compilers to develop, optimize and parallelize applications for NVIDIA GPUs and x86-64, OpenPOWER and Arm Server multicore CPUs. </p> <p class="p">The <em class="ph i">NVIDIA HPC Compilers User’s Guide</em> provides operating instructions for the NVIDIA HPC compilers command-level development environment. The <em class="ph i">NVIDIA HPC Compilers Reference Manual</em> contains details concerning the NVIDIA compilers' interpretation of the Fortran, C++ and C language standards, implementation of language extensions, and command-level compilation. Users are expected to have previous experience with or knowledge of the Fortran, C++ and C programming languages. These guides do not teach the Fortran, C++ or C programming languages. </p> </div> <div class="topic concept nested1" id="audience"><a name="audience" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#audience" name="audience" shape="rect">Audience Description</a></h3> <div class="body conbody"> <p class="p">This manual is intended for scientists and engineers using the NVIDIA HPC compilers. To use these compilers, you should be aware of the role of high-level languages, such as Fortran, C++ and C as well as parallel programming models such as CUDA, OpenACC and OpenMP in the software development process, and you should have some level of understanding of programming. 
The NVIDIA HPC compilers are available on a variety of NVIDIA GPUs and x86-64, OpenPOWER and Arm CPU-based platforms and operating systems. You need to be familiar with the basic commands available on your system. </p> </div> </div> <div class="topic concept nested1" id="standards"><a name="standards" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#standards" name="standards" shape="rect">Compatibility and Conformance to Standards</a></h3> <div class="body conbody"> <p class="p"> Your system needs to be running a properly installed and configured version of the NVIDIA HPC compilers. For information on installing <span class="ph">NVIDIA HPC compilers</span>, refer to the Release Notes and Installation Guide included with your software. </p> <p class="p">For further information, refer to the following:</p> <ul class="ul"> <li class="li"><cite class="cite">American National Standard Programming Language FORTRAN</cite>, ANSI X3.9-1978 (1978). </li> <li class="li"><cite class="cite">ISO/IEC 1539-1 : 1991, Information technology – Programming Languages – Fortran</cite>, Geneva, 1991 (Fortran 90). </li> <li class="li"><cite class="cite">ISO/IEC 1539-1 : 1997, Information technology – Programming Languages – Fortran</cite>, Geneva, 1997 (Fortran 95). </li> <li class="li"><cite class="cite">ISO/IEC 1539-1 : 2004, Information technology – Programming Languages – Fortran</cite>, Geneva, 2004 (Fortran 2003). </li> <li class="li"><cite class="cite">ISO/IEC 1539-1 : 2010, Information technology – Programming Languages – Fortran</cite>, Geneva, 2010 (Fortran 2008). </li> <li class="li"><cite class="cite">ISO/IEC 1539-1 : 2018, Information technology – Programming Languages – Fortran</cite>, Geneva, 2018 (Fortran 2018). </li> <li class="li"><cite class="cite">Fortran 95 Handbook Complete ISO/ANSI Reference</cite>, Adams et al., The MIT Press, Cambridge, Mass, 1997. 
</li> <li class="li"><cite class="cite">The Fortran 2003 Handbook</cite>, Adams et al., Springer, 2009. </li> <li class="li"><cite class="cite">OpenACC Application Program Interface</cite>, Version 2.7, November 2018, <a class="xref" href="http://www.openacc.org" target="_blank" shape="rect">http://www.openacc.org</a>. </li> <li class="li"><cite class="cite">OpenMP Application Program Interface</cite>, Version 5.0, November 2018, <a class="xref" href="http://www.openmp.org" target="_blank" shape="rect">http://www.openmp.org</a>. </li> <li class="li"><cite class="cite">Programming in VAX Fortran</cite>, Version 4.0, Digital Equipment Corporation (September, 1984). </li> <li class="li"><cite class="cite">IBM VS Fortran</cite>, IBM Corporation, Rev. GC26-4119. </li> <li class="li">Military Standard, Fortran, DOD Supplement to American National Standard Programming Language Fortran, ANSI X3.9-1978, MIL-STD-1753 (November 9, 1978). </li> <li class="li"><cite class="cite">American National Standard Programming Language C</cite>, ANSI X3.159-1989. 
</li> <li class="li">ISO/IEC 9899:1990, Information technology – Programming Languages – C, Geneva, 1990 (C90).</li> <li class="li">ISO/IEC 9899:1999, Information technology – Programming Languages – C, Geneva, 1999 (C99).</li> <li class="li">ISO/IEC 9899:2011, Information Technology – Programming Languages – C, Geneva, 2011 (C11).</li> <li class="li">ISO/IEC 14882:2011, Information Technology – Programming Languages – C++, Geneva, 2011 (C++11).</li> <li class="li">ISO/IEC 14882:2014, Information Technology – Programming Languages – C++, Geneva, 2014 (C++14).</li> <li class="li">ISO/IEC 14882:2017, Information Technology – Programming Languages – C++, Geneva, 2017 (C++17).</li> </ul> </div> </div> <div class="topic concept nested1" id="nv-ug-organization"><a name="nv-ug-organization" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#nv-ug-organization" name="nv-ug-organization" shape="rect">Organization</a></h3> <div class="body conbody"> <p class="p">This guide contains the essential information on how to use the NVIDIA HPC compilers and is divided into these sections:</p> <p class="p"><a class="xref" href="index.html#gs-nv" shape="rect">Getting Started</a> provides an introduction to the NVIDIA HPC compilers and describes their use and overall features. </p> <p class="p"><a class="xref" href="index.html#cmdln-options-use" shape="rect">Use Command-line Options</a> provides an overview of the command-line options as well as task-related lists of options. </p> <p class="p"><a class="xref" href="index.html#opt-parallel" shape="rect">Multicore CPU Optimization</a> describes multicore CPU optimizations and related compiler options. </p> <p class="p"><a class="xref" href="index.html#fn-inline-use" shape="rect">Using Function Inlining</a> describes how to use function inlining and shows how to create an inline library. 
</p> <p class="p"><a class="xref" href="index.html#openmp-use" shape="rect">Using OpenMP</a> describes how to use OpenMP for multicore CPU programming. </p> <p class="p"><a class="xref" href="index.html#acc-use" shape="rect">Using OpenACC</a> describes how to use an NVIDIA GPU and gives an introduction to using OpenACC. </p> <p class="p"><a class="xref" href="index.html#stdpar-use" shape="rect">Using Stdpar</a> describes how to use C++/Fortran Standard Language Parallelism for programming an NVIDIA GPU or multicore CPU. </p> <p class="p"><a class="xref" href="index.html#pcast" shape="rect">PCAST</a> describes how to use the Parallel Compiler Assisted Testing features of the HPC Compilers. </p> <p class="p"><a class="xref" href="index.html#mpi-use" shape="rect">Using MPI</a> describes how to use MPI with the NVIDIA HPC compilers. </p> <p class="p"><a class="xref" href="index.html#lib-create-use" shape="rect">Creating and Using Libraries</a> discusses NVIDIA HPC compiler support libraries, shared object files, and environment variables that affect the behavior of the compilers. </p> <p class="p"><a class="xref" href="index.html#env-vars-use" shape="rect">Environment Variables</a> describes the environment variables that affect the behavior of the NVIDIA HPC compilers. </p> <p class="p"><a class="xref" href="index.html#deploy-dist-files" shape="rect">Distributing Files – Deployment</a> describes the deployment of your files once you have built, debugged and compiled them successfully. </p> <p class="p"><a class="xref" href="index.html#intr-lang-call" shape="rect">Inter-language Calling</a> provides examples showing how to place C language calls in a Fortran program and Fortran language calls in a C program. </p> <p class="p"><a class="xref" href="index.html#prog-64bits" shape="rect">Programming Considerations for 64-Bit Environments</a> discusses issues of which programmers should be aware when targeting 64-bit processors. 
</p> <p class="p"><a class="xref" href="index.html#inline-asm-intrin-c-cpp" shape="rect">C++ and C Inline Assembly and Intrinsics</a> describes how to use inline assembly code in C++ and C programs<span class="ph">, as well as how to use intrinsic functions that map directly to assembly machine instructions</span>. </p> </div> </div> <div class="topic concept nested1" id="hw-sw-constraints"><a name="hw-sw-constraints" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#hw-sw-constraints" name="hw-sw-constraints" shape="rect">Hardware and Software Constraints</a></h3> <div class="body conbody"> <p class="p">This guide describes versions of the NVIDIA HPC compilers that target NVIDIA GPUs and x86-64, OpenPOWER and Arm CPUs. Details concerning environment-specific values and defaults and system-specific features or limitations are presented in the release notes delivered with the NVIDIA HPC compilers. </p> </div> </div> <div class="topic concept nested1" id="doc-conventions"><a name="doc-conventions" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#doc-conventions" name="doc-conventions" shape="rect">Conventions</a></h3> <div class="body conbody"> <p class="p">This guide uses the following conventions:</p> <dl class="dl"> <dt class="dt dlterm"><em class="ph i">italic</em></dt> <dd class="dd">is used for emphasis.</dd> <dt class="dt dlterm"><samp class="ph codeph">Constant Width</samp></dt> <dd class="dd"> is used for filenames, directories, arguments, options, examples, and for language statements in the text, including assembly language statements. </dd> <dt class="dt dlterm"><strong class="ph b">Bold</strong></dt> <dd class="dd">is used for commands.</dd> <dt class="dt dlterm">[ item1 ]</dt> <dd class="dd"> in general, square brackets indicate optional items. In this case item1 is optional. In the context of p/t-sets, square brackets are required to specify a p/t-set. 
</dd> <dt class="dt dlterm">{ item2 | item3 }</dt> <dd class="dd"> braces indicate that a selection is required. In this case, you must select either item2 or item3. </dd> <dt class="dt dlterm">filename ...</dt> <dd class="dd"> an ellipsis indicates repetition. Zero or more of the preceding item may occur. In this example, multiple filenames are allowed. </dd> <dt class="dt dlterm"><samp class="ph codeph">FORTRAN</samp></dt> <dd class="dd"> Fortran language statements are shown in the text of this guide using a reduced fixed point size. </dd> <dt class="dt dlterm"><samp class="ph codeph">C++ and C</samp></dt> <dd class="dd"> C++ and C language statements are shown in the text of this guide using a reduced fixed point size. </dd> </dl> </div> </div> <div class="topic concept nested1" id="doc-conventions-more"><a name="doc-conventions-more" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#doc-conventions-more" name="doc-conventions-more" shape="rect">Terms</a></h3> <div class="body conbody"> <p class="p"> A number of terms related to systems, processors, compilers and tools are used throughout this guide. 
For example: </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">accelerator</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">FMA</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">-mcmodel=medium</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">shared library</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">AVX</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">host</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">-mcmodel=small</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">SIMD</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">CUDA</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">hyperthreading (HT)</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">MPI</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">SSE</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">device</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">large arrays</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">MPICH</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">static linking</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">driver</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">linux86-64</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">NUMA</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">x86-64</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" 
colspan="1">DWARF</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">LLVM</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">OpenPOWER</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Arm</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">dynamic library</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">multicore</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">ppc64le</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Aarch64</td> </tr> </tbody> </table> </div> <p class="p"></p> <p class="p"> The following table lists the NVIDIA HPC compilers and their corresponding commands: </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 1. NVIDIA HPC Compilers and Commands</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e571" rowspan="1" colspan="1">Compiler or Tool</th> <th class="entry" valign="top" width="NaN%" id="d54e574" rowspan="1" colspan="1">Language or Function</th> <th class="entry" valign="top" width="NaN%" id="d54e577" rowspan="1" colspan="1">Command</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e571" rowspan="1" colspan="1">NVFORTRAN </td> <td class="entry" valign="top" width="NaN%" headers="d54e574" rowspan="1" colspan="1">ISO/ANSI Fortran 2003</td> <td class="entry" valign="top" width="NaN%" headers="d54e577" rowspan="1" colspan="1">nvfortran</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e571" rowspan="1" colspan="1">NVC++</td> <td class="entry" valign="top" width="NaN%" headers="d54e574" rowspan="1" colspan="1">ISO/ANSI C++17 with GNU compatibility</td> <td class="entry" valign="top" 
width="NaN%" headers="d54e577" rowspan="1" colspan="1">nvc++</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e571" rowspan="1" colspan="1">NVC</td> <td class="entry" valign="top" width="NaN%" headers="d54e574" rowspan="1" colspan="1">ISO/ANSI C11</td> <td class="entry" valign="top" width="NaN%" headers="d54e577" rowspan="1" colspan="1">nvc</td> </tr> </tbody> </table> </div> <p class="p"> In general, the designation <dfn class="term">NVFORTRAN</dfn> is used to refer to the NVIDIA Fortran compiler, and <dfn class="term">nvfortran</dfn> is used to refer to the command that invokes the compiler. A similar convention is used for each of the NVIDIA HPC compilers. </p> <p class="p"> For simplicity, examples of command-line invocation of the compilers generally reference the <span class="keyword cmdname">nvfortran</span> command, and most source code examples are written in Fortran. <span class="ph">Use of <em class="ph i">NVC⁠+⁠+</em> and <em class="ph i">NVC</em> is consistent with <em class="ph i">NVFORTRAN</em><span class="ph"></span>, though there are command-line options and features of these compilers that do not apply to <em class="ph i">NVFORTRAN</em><span class="ph"></span>, and vice versa.</span></p> <p class="p"> There are a wide variety of x86-64 CPUs in use. Most of these CPUs are forward-compatible, but not backward-compatible, meaning that code compiled to target a given processor will not necessarily execute correctly on a previous-generation processor. </p> <p class="p"> A table listing the processor options that NVIDIA HPC compilers support is available in the Release Notes. The table also includes the features utilized by the compilers that distinguish them from a compatibility standpoint. </p> <p class="p"> In this manual, the convention is to use "x86-64" to specify the group of CPUs that are x86-compatible, 64-bit enabled, and run a 64-bit operating system. 
x86-64 processors can differ in terms of their support for various prefetch, SSE and AVX instructions. Where such distinctions are important with respect to a given compiler option or feature, it is explicitly noted in this manual. </p> </div> </div> <div class="topic concept nested1" id="related-pubs"><a name="related-pubs" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#related-pubs" name="related-pubs" shape="rect">Related Publications</a></h3> <div class="body conbody"> <p class="p"> The following documents contain additional information related to the NVIDIA HPC compilers. </p> <ul class="ul"> <li class="li"><cite class="cite">System V Application Binary Interface Processor Supplement</cite> by AT&amp;T UNIX System Laboratories, Inc. (Prentice Hall, Inc.). </li> <li class="li"><cite class="cite">System V Application Binary Interface X86-64 Architecture Processor Supplement</cite>. </li> <li class="li"><cite class="cite">OpenPOWER ABI for Linux Supplement, Power Architecture 64-Bit ELF V2 ABI Specification</cite>, <a class="xref" href="http://openpowerfoundation.org/wp-content/uploads/2016/03/ABI64BitOpenPOWERv1.1_16July2015_pub4.pdf" target="_blank" shape="rect">http://openpowerfoundation.org/wp-content/uploads/2016/03/ABI64BitOpenPOWERv1.1_16July2015_pub4.pdf</a>. </li> <li class="li"><cite class="cite">Fortran 95 Handbook Complete ISO/ANSI Reference</cite>, Adams et al, The MIT Press, Cambridge, Mass, 1997. </li> <li class="li"><cite class="cite">Programming in VAX Fortran, Version 4.0</cite>, Digital Equipment Corporation (September, 1984). </li> <li class="li"><cite class="cite">IBM VS Fortran</cite>, IBM Corporation, Rev. GC26-4119. </li> <li class="li"><cite class="cite">The C Programming Language</cite> by Kernighan and Ritchie (Prentice Hall). </li> <li class="li"><cite class="cite">C: A Reference Manual</cite> by Samuel P. Harbison and Guy L. Steele Jr. (Prentice Hall, 1987). 
</li> <li class="li"><cite class="cite">The Annotated C++ Reference Manual</cite> by Margaret Ellis and Bjarne Stroustrup, AT&amp;T Bell Laboratories, Inc. (Addison-Wesley Publishing Co., 1990). </li> </ul> </div> </div> </div> <div class="topic concept nested0" id="gs-nv"><a name="gs-nv" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#gs-nv" name="gs-nv" shape="rect">1. Getting Started</a></h2> <div class="body conbody"> <p class="p">This section describes how to use the NVIDIA HPC compilers.</p> </div> <div class="topic concept nested1" id="gs-nv-overview"><a name="gs-nv-overview" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gs-nv-overview" name="gs-nv-overview" shape="rect">1.1. Overview</a></h3> <div class="body conbody"> <p class="p">The command used to invoke a compiler, such as the nvfortran command, is called a <em class="ph i">compiler driver</em>. The compiler driver controls the following phases of compilation: preprocessing, compiling, assembling, and linking. Once a file is compiled and an executable file is produced, you can execute, debug, or profile the program on your system. </p> <p class="p">In general, using an NVIDIA HPC compiler involves three steps:</p> <ol class="ol"> <li class="li">Produce program source code in a file containing a .f extension or another appropriate extension, as described in <a class="xref" href="index.html#fn-conv-input" shape="rect">Input Files</a>. This program may be one that you have written or one that you are modifying. </li> <li class="li">Compile the program using the appropriate compiler command.</li> <li class="li">Execute, debug, or profile the executable file on your system.</li> </ol> <p class="p">You might also want to deploy your application, though this is not a required step.</p> <p class="p">The NVIDIA HPC compilers allow many variations on these general program development steps. 
These variations include the following: </p> <ul class="ul"> <li class="li">Stop the compilation after preprocessing, compiling or assembling to save and examine intermediate results. </li> <li class="li">Provide options to the driver that control compiler optimization or that specify various features or limitations.</li> <li class="li">Include as input intermediate files such as preprocessor output, compiler output, or assembler output.</li> </ul> </div> </div> <div class="topic task nested1" id="example-hello"><a name="example-hello" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#example-hello" name="example-hello" shape="rect">1.2. Creating an Example</a></h3> <div class="body taskbody"> <div class="section context"> <p class="p"> Let's look at a simple example of using the NVIDIA Fortran compiler to create, compile, and execute a program that prints: </p><pre class="pre msgblock" xml:space="preserve">hello</pre></div> <ol class="ol steps"> <li class="li step"><span class="ph cmd">Create your program.</span> For this example, suppose you enter the following simple Fortran program in the file <span class="ph filepath">hello.f</span>: <pre xml:space="preserve">      print *, "hello"
      end
</pre></li> <li class="li step"><span class="ph cmd">Compile the program.</span> When you created your program, you called it <span class="ph filepath">hello.f</span>. In this example, we compile it from a shell command prompt using the <span class="keyword cmdname">nvfortran</span> driver with its default options. Use the following syntax: <pre xml:space="preserve">$ nvfortran hello.f
</pre><p class="p">By default, the executable output is placed in the file <span class="ph filepath">a.out</span>. However, you can specify an output file name by using the <span class="keyword option">-⁠o</span> option. 
</p> <p class="p"> To place the executable output in the file hello, use this command: </p><pre xml:space="preserve">$ nvfortran -o hello hello.f
</pre></li> <li class="li step"><span class="ph cmd">Execute the program.</span> To execute the resulting hello program, type its path at the command prompt and press the <span class="ph uicontrol">Return</span> or <span class="ph uicontrol">Enter</span> key on your keyboard (the leading <span class="ph filepath">./</span> is needed unless the current directory is in your <span class="ph filepath">PATH</span>): <pre class="pre screen" xml:space="preserve">$ ./hello</pre> Below is the expected output: <pre class="pre msgblock" xml:space="preserve">hello</pre></li> </ol> </div> </div> <div class="topic concept nested1" id="invoke-nv-comp"><a name="invoke-nv-comp" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#invoke-nv-comp" name="invoke-nv-comp" shape="rect">1.3. Invoking the Command-level NVIDIA HPC Compilers</a></h3> <div class="body conbody"> <p class="p"> To translate and link a Fortran, C, or C⁠+⁠+ program, the <span class="keyword cmdname">nvfortran</span>, <span class="keyword cmdname">nvc</span> and <span class="keyword cmdname">nvc⁠+⁠+</span> commands do the following: </p> <ol class="ol"> <li class="li">Preprocess the source text file.</li> <li class="li">Check the syntax of the source text.</li> <li class="li">Generate an assembly language file.</li> <li class="li">Pass control to the subsequent assembly and linking steps.</li> </ol> </div> <div class="topic concept nested2" id="cmd-line-syntax"><a name="cmd-line-syntax" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cmd-line-syntax" name="cmd-line-syntax" shape="rect">1.3.1. 
Command-line Syntax</a></h3> <div class="body conbody"> <p class="p">The compiler command-line syntax, using nvfortran as an example, is:</p><pre xml:space="preserve">nvfortran [options] [path]filename [...]</pre><div class="p">Where: <dl class="dl"> <dt class="dt dlterm">options</dt> <dd class="dd">is one or more command-line options, all of which are described in detail in <a class="xref" href="index.html#cmdln-options-use" shape="rect">Use Command-line Options</a>. </dd> <dt class="dt dlterm">path</dt> <dd class="dd">is the pathname to the directory containing the file named by filename. If you do not specify the path for a filename, the compiler uses the current directory. You must specify the path separately for each filename not in the current directory. </dd> <dt class="dt dlterm">filename</dt> <dd class="dd">is the name of a source file, preprocessed source file, assembly-language file, object file, or library to be processed by the compilation system. You can specify more than one [path]filename. </dd> </dl> </div> </div> </div> <div class="topic concept nested2" id="cmd-line-opts"><a name="cmd-line-opts" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cmd-line-opts" name="cmd-line-opts" shape="rect">1.3.2. Command-line Options</a></h3> <div class="body conbody"> <p class="p">The command-line options control various aspects of the compilation process. For a complete alphabetical listing and a description of all the command-line options, refer to <a class="xref" href="index.html#cmdln-options-use" shape="rect">Use Command-Line Options</a>. </p> <div class="p">The following list provides important information about proper use of command-line options. <ul class="ul"> <li class="li">Command-line options and their arguments are case sensitive.</li> <li class="li"> <p class="p">The compiler drivers recognize characters preceded by a hyphen (-⁠) as command-line options. 
For example, the <span class="ph filepath">-⁠Mlist</span> option specifies that the compiler creates a listing file. </p> <div class="note note"><span class="notetitle">Note:</span> The convention for the text of this manual is to show command-line options using a dash instead of a hyphen; for example, you see <span class="ph filepath">-⁠Mlist</span>. </div> </li> <li class="li"> <p class="p">The order of options and the filename is flexible. That is, you can place options before and after the filename argument on the command line. However, the placement of some options is significant, such as the <span class="ph filepath">-⁠l</span> option, in which the order of the filenames determines the search order. </p> <div class="note note"><span class="notetitle">Note:</span> If two or more options contradict each other, the last one in the command line takes precedence. </div> </li> <li class="li"> <p class="p"> You may write linker options into a text file and pass that file to the compiler as an option by prefixing its name with the '@' symbol, e.g. <span class="ph filepath">@file</span>. The contents of the file are passed to the linker. </p><pre xml:space="preserve">$ echo "foo.o bar.o" &gt; ./option_file.rsp
$ nvc++ @./option_file.rsp
</pre><p class="p"> The above will pass "foo.o bar.o" to the compiler as linker arguments. </p> </li> </ul> </div> </div> </div> </div> <div class="topic concept nested1" id="fn-conv"><a name="fn-conv" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-conv" name="fn-conv" shape="rect">1.4. Filename Conventions</a></h3> <div class="body conbody"> <p class="p">The NVIDIA HPC compilers use the filenames that you specify on the command line to find and to create input and output files. This section describes the input and output filename conventions for the phases of the compilation process. 
</p> </div> <div class="topic concept nested2" id="fn-conv-input"><a name="fn-conv-input" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-conv-input" name="fn-conv-input" shape="rect">1.4.1. Input Files</a></h3> <div class="body conbody"> <p class="p"> You can specify assembly-language files, preprocessed source files, Fortran/C/C++ source files, object files, and libraries as inputs on the command line. The compiler driver determines the type of each input file by examining the filename extensions. </p> <p class="p">The drivers use the following conventions:</p> <dl class="dl"> <dt class="dt dlterm"><span class="ph filepath">filename.f</span></dt> <dd class="dd">indicates a Fortran source file.</dd> <dt class="dt dlterm"><span class="ph filepath">filename.F</span></dt> <dd class="dd">indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.FOR</span></dt> <dd class="dd">indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.F90</span></dt> <dd class="dd">indicates a Fortran 90/95 source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.F95</span></dt> <dd class="dd">indicates a Fortran 90/95 source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.f90</span></dt> <dd class="dd">indicates a Fortran 90/95 source file that is in freeform format.</dd> <dt class="dt dlterm"><span class="ph filepath">filename.f95</span></dt> <dd class="dd">indicates a Fortran 90/95 source file that is in freeform format.</dd> <dt class="dt dlterm"><span class="ph filepath">filename.cuf</span></dt> <dd class="dd">indicates a Fortran 90/95 source file in free format 
with CUDA Fortran extensions.</dd> <dt class="dt dlterm"><span class="ph filepath">filename.CUF</span></dt> <dd class="dd">indicates a Fortran 90/95 source file in free format with CUDA Fortran extensions and that can contain macros and preprocessor directives (to be preprocessed). </dd> <dt class="dt dlterm"><span class="ph filepath">filename.c</span></dt> <dd class="dd">indicates a C source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.C</span></dt> <dd class="dd">indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.i</span></dt> <dd class="dd">indicates a preprocessed C or C++ source file.</dd> <dt class="dt dlterm"><span class="ph filepath">filename.cc</span></dt> <dd class="dd">indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.cpp</span></dt> <dd class="dd">indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).</dd> <dt class="dt dlterm"><span class="ph filepath">filename.s</span></dt> <dd class="dd">indicates an assembly-language file.</dd> <dt class="dt dlterm"><span class="ph filepath">filename.o</span></dt> <dd class="dd"><span class="ph">(Linux) </span>indicates an object file. </dd> <dt class="dt dlterm"><span class="ph filepath">filename.a</span></dt> <dd class="dd"><span class="ph">(Linux) </span>indicates a library of object files. </dd> <dt class="dt dlterm"><span class="ph filepath">filename.so</span></dt> <dd class="dd"><span class="ph">(Linux only) </span>indicates a library of shared object files. 
</dd> </dl> <p class="p"> The driver passes files with <span class="ph filepath">.s</span> extensions to the assembler and files with <span class="ph filepath">.o</span>, <span class="ph filepath">.so</span> and <span class="ph filepath">.a</span> extensions to the linker. Input files with unrecognized extensions, or no extension, are also passed to the linker. </p> <p class="p"> Files with a <span class="ph filepath">.F </span>(Capital F) or <span class="ph filepath">.FOR</span> suffix are first preprocessed by the Fortran compilers and the output is passed to the compilation phase. The Fortran preprocessor functions like cpp for C programs, but is built in to the Fortran compilers rather than implemented through an invocation of cpp. This design ensures consistency in the preprocessing step regardless of the type or revision of operating system under which you are compiling. </p> <p class="p"> Any input files not needed for a particular phase of processing are not processed. For example, if on the command line you specify an assembly-language file (<span class="ph filepath">filename.s</span>) and the <span class="ph filepath">-⁠S</span> option to stop before the assembly phase, the compiler takes no action on the assembly language file. Processing stops after compilation and the assembler does not run. In this scenario, the compilation must have been completed in a previous pass which created the <span class="ph filepath">.s</span> file. For a complete description of the <span class="ph filepath">-⁠S</span> option, refer to <a class="xref" href="index.html#fn-conv-output" shape="rect">Output Files</a>. </p> <p class="p"> In addition to specifying primary input files on the command line, code within other files can be compiled as part of include files using the INCLUDE statement in a Fortran source file or the preprocessor #include directive in Fortran source files that use a <span class="ph filepath">.F</span> extension or C++ and C source files. 
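</p> <p class="p"> As a rough illustration of the extension rules in this section, the following command line mixes several input types. The filenames are hypothetical, and the comments paraphrase the conventions described above rather than actual driver output. </p><pre xml:space="preserve"># main.f90 : free-form Fortran source, passed to the compilation phase
# util.F   : Fortran source, preprocessed first because of the .F suffix
# kernel.s : assembly-language file, passed to the assembler
# extra.o  : object file, passed to the linker
$ nvfortran -o app main.f90 util.F kernel.s extra.o
</pre><p class="p"> Because each file is classified by its extension, renaming <span class="ph filepath">util.F</span> to <span class="ph filepath">util.f</span> would, by these conventions, bypass the preprocessing step. 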
</p> <p class="p"> When linking a program with a library, the linker extracts only those library components that the program needs. The compiler drivers link in several libraries by default. For more information about libraries, refer to <a class="xref" href="index.html#lib-create-use" shape="rect">Create and Use Libraries</a>. </p> </div> </div> <div class="topic concept nested2" id="fn-conv-output"><a name="fn-conv-output" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-conv-output" name="fn-conv-output" shape="rect">1.4.2. Output Files</a></h3> <div class="body conbody"> <p class="p"> By default, an executable output file produced by one of the NVIDIA HPC compilers is placed in the file <span class="ph filepath">a.out</span>. As the <a class="xref" href="index.html#example-hello" shape="rect">Hello example</a> shows, you can use the <span class="keyword option">-⁠o</span> option to specify the output file name. </p> <p class="p"> If you use the <span class="keyword option">-⁠F</span> (Fortran only), <span class="keyword option">-⁠P</span> (C/C++ only), <span class="keyword option">-⁠S</span>, or <span class="keyword option">-⁠c</span> option, the compiler produces a file containing the output of the last completed phase for each input file, as specified by the option supplied. </p> <p class="p"> The output file is a preprocessed source file (<span class="keyword option">-⁠F</span> or <span class="keyword option">-⁠P</span>), an assembly-language file (<span class="keyword option">-⁠S</span>), or an unlinked object file (<span class="keyword option">-⁠c</span>), respectively. The <span class="keyword option">-⁠E</span> option, by contrast, does not produce a file; it displays the preprocessed source on standard output. When you use any of these options, the <span class="keyword option">-⁠o</span> option is valid only if you specify a single input file. If no errors occur during processing, you can use the files created by these options as input to a future invocation of any of the NVIDIA compiler drivers. 
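</p> <p class="p"> The stop-after options described above can be sketched with a few commands. The filenames <span class="ph filepath">prog.F</span> and <span class="ph filepath">build.o</span> are hypothetical, and the comments restate the output conventions from this section rather than captured compiler output. </p><pre xml:space="preserve">$ nvfortran -E prog.F             # preprocessed source goes to standard output; no file is written
$ nvfortran -F prog.F             # stop after preprocessing; writes prog.f
$ nvfortran -S prog.F             # stop after compilation; writes prog.s
$ nvfortran -c prog.F             # stop after assembly; writes prog.o
$ nvfortran -c -o build.o prog.F  # single input file, so -o may name the object file
</pre><p class="p"> A file written by one of these commands, such as <span class="ph filepath">prog.s</span>, can then be supplied as input to a later invocation of the driver. 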
</p> <p class="p"> The following table lists the stop-after options and the output files that the compilers create when you use these options. It also indicates the accepted input files. </p> <div class="tablenoborder"><a name="fn-conv-output__opt-desc" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="fn-conv-output__opt-desc" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 2. Option Descriptions</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="16.666666666666664%" id="d54e1440" rowspan="1" colspan="1">Option</th> <th class="entry" valign="top" width="16.666666666666664%" id="d54e1443" rowspan="1" colspan="1">Stop After</th> <th class="entry" valign="top" width="33.33333333333333%" id="d54e1446" rowspan="1" colspan="1">Input</th> <th class="entry" valign="top" width="33.33333333333333%" id="d54e1449" rowspan="1" colspan="1">Output</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1440" rowspan="1" colspan="1"><span class="keyword option">-E</span></td> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1443" rowspan="1" colspan="1">preprocessing</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1446" rowspan="1" colspan="1">Source files</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1449" rowspan="1" colspan="1">preprocessed file to standard out</td> </tr> <tr class="row"> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1440" rowspan="1" colspan="1"><span class="keyword option">-F</span></td> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1443" rowspan="1" colspan="1">preprocessing</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1446" rowspan="1" colspan="1"> Source files. 
This option is not valid for <span class="keyword cmdname">nvc</span> or <span class="keyword cmdname">nvc⁠+⁠+</span>. </td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1449" rowspan="1" colspan="1">preprocessed file (<span class="ph filepath">.f</span>) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1440" rowspan="1" colspan="1"><span class="keyword option">-P</span></td> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1443" rowspan="1" colspan="1">preprocessing</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1446" rowspan="1" colspan="1"> Source files. This option is not valid for <span class="keyword cmdname">nvfortran</span>. </td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1449" rowspan="1" colspan="1">preprocessed file (<span class="ph filepath">.i</span>) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1440" rowspan="1" colspan="1"><span class="keyword option">-S</span></td> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1443" rowspan="1" colspan="1">compilation</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1446" rowspan="1" colspan="1">Source files or preprocessed files</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1449" rowspan="1" colspan="1">assembly-language file (<span class="ph filepath">.s</span>) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1440" rowspan="1" colspan="1"><span class="keyword option">-c</span></td> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1443" rowspan="1" colspan="1">assembly</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1446" rowspan="1" colspan="1"> Source files, or preprocessed files, or assembly-language files </td> <td 
class="entry" valign="top" width="33.33333333333333%" headers="d54e1449" rowspan="1" colspan="1"> unlinked object file (<span class="ph filepath">.o</span><span class="ph"> or <span class="ph filepath">.obj</span></span>) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1440" rowspan="1" colspan="1">none</td> <td class="entry" valign="top" width="16.666666666666664%" headers="d54e1443" rowspan="1" colspan="1">linking</td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1446" rowspan="1" colspan="1"> Source files, or preprocessed files, assembly-language files, object files, or libraries </td> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1449" rowspan="1" colspan="1"> executable file (<span class="ph filepath">a.out</span>) </td> </tr> </tbody> </table> </div> <p class="p"> If you specify multiple input files or do not specify an object filename, the compiler uses the input filenames to derive corresponding default output filenames of the following form, where <em class="ph i">filename</em> is the input filename without its extension: </p> <dl class="dl"> <dt class="dt dlterm"><span class="ph filepath">filename.f</span></dt> <dd class="dd"> indicates a preprocessed file, if you compiled a Fortran file using the <span class="keyword option">-⁠F</span> option. </dd> <dt class="dt dlterm"><span class="ph filepath">filename.i</span></dt> <dd class="dd"> indicates a preprocessed file, if you compiled using the <span class="keyword option">-⁠P</span> option. </dd> <dt class="dt dlterm"><span class="ph filepath">filename.lst</span></dt> <dd class="dd"> indicates a listing file from the <span class="keyword option">-⁠Mlist</span> option. 
</dd> <dt class="dt dlterm"><span class="ph filepath">filename.o</span><span class="ph"> or <span class="ph filepath">filename.obj</span></span></dt> <dd class="dd"> indicates an object file from the <span class="keyword option">-⁠c</span> option. </dd> <dt class="dt dlterm"><span class="ph filepath">filename.s</span></dt> <dd class="dd"> indicates an assembly-language file from the <span class="keyword option">-⁠S</span> option. </dd> </dl> <div class="note note"><span class="notetitle">Note:</span> Unless you specify otherwise, the destination directory for any output file is the current working directory. If a file with the same name already exists in the destination directory, the compiler overwrites it. </div> <p class="p"> The following example demonstrates the use of output filename extensions. </p><pre xml:space="preserve">$ nvfortran -c proto.f proto1.F </pre><p class="p"> This produces the output files <span class="ph filepath">proto.o</span> and <span class="ph filepath">proto1.o</span>, which are binary object files. Prior to compilation, the file <span class="ph filepath">proto1.F</span> is preprocessed because it has a <span class="ph filepath">.F</span> filename extension. </p> </div> </div> </div> <div class="topic concept nested1" id="data-types"><a name="data-types" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#data-types" name="data-types" shape="rect">1.5. Fortran<span class="ph">, C++ and C</span> Data Types</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA Fortran<span class="ph">, C++ and C </span> compilers recognize scalar and aggregate data types. A scalar data type holds a single value, such as the integer value 42 or the real value 112.6. An aggregate data type consists of one or more scalar data type objects, such as an array of integer values. 
</p> </div> </div> <div class="topic concept nested1" id="platform-spec-considerations"><a name="platform-spec-considerations" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#platform-spec-considerations" name="platform-spec-considerations" shape="rect">1.6. Platform-specific considerations</a></h3> <div class="body conbody"> <p class="p">The NVIDIA HPC Compilers are supported on x86-64, OpenPOWER and 64-bit Arm multicore CPUs running Linux.</p> </div> <div class="topic concept nested2" id="use-nv-compilers-linux"><a name="use-nv-compilers-linux" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#use-nv-compilers-linux" name="use-nv-compilers-linux" shape="rect">1.6.1. Using the NVIDIA HPC Compilers on Linux</a></h3> <div class="body conbody"> <div class="section"> <h4 class="title sectiontitle">Linux Header Files </h4> <p class="p">The Linux system header files contain many GNU gcc extensions. The NVIDIA HPC C++ and C compilers support many of these extensions and can compile most programs that the GNU compilers can compile. A few header files not interoperable with the NVIDIA compilers have been rewritten. </p> <p class="p">If you are using the NVIDIA HPC C++ or C compilers, please make sure that the supplied versions of these include files are found before the system versions. This hierarchy happens by default unless you explicitly add a -⁠I option that references one of the system <span class="ph filepath">include</span> directories. </p> </div> </div> </div> </div> <div class="topic concept nested1" id="site-spec-comp-cust"><a name="site-spec-comp-cust" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#site-spec-comp-cust" name="site-spec-comp-cust" shape="rect">1.7. 
Site-Specific Customization of the Compilers</a></h3> <div class="body conbody"> <p class="p">If you are using the NVIDIA HPC Compilers and want all your users to have access to specific libraries or other files, special files allow you to customize the compilers for your site. </p> </div> <div class="topic concept nested2" id="site-spec-using-siterc"><a name="site-spec-using-siterc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#site-spec-using-siterc" name="site-spec-using-siterc" shape="rect">1.7.1. Use siterc Files</a></h3> <div class="body conbody"> <p class="p">The NVIDIA HPC Compiler command-level drivers use a file named <span class="ph filepath">siterc</span> to enable site-specific customization of the behavior of the NVIDIA compilers. The <span class="ph filepath">siterc</span> file is located in the <span class="ph filepath">bin</span> subdirectory of the NVIDIA HPC Compilers installation directory. Using <span class="ph filepath">siterc</span>, you can control how the compiler drivers invoke the various components in the compilation tool chain. </p> </div> </div> <div class="topic concept nested2" id="site-spec-using-user-rc"><a name="site-spec-using-user-rc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#site-spec-using-user-rc" name="site-spec-using-user-rc" shape="rect">1.7.2. Using User rc Files</a></h3> <div class="body conbody"> <p class="p"> In addition to the <span class="ph filepath">siterc</span> file, user <span class="ph filepath">rc</span> files can reside in a given user's home directory, as specified by the user's HOME environment variable. You can use these files to control the respective NVIDIA HPC Compilers. All of these files are optional. </p> <p class="p"> On Linux, these files are named <span class="ph"><span class="ph filepath">.mynvfortranrc</span>, </span><span class="ph filepath">.mynvcrc</span>, and <span class="ph filepath">.mynvc++rc</span>. 
</p> <div class="section"> <p class="p"> The following examples show how you can use these rc files to tailor a given installation for a particular purpose on <span class="ph filepath">Linux_x86_64</span> targets. The process is similar, with the obvious substitutions, for <span class="ph filepath">ppc64le</span> and <span class="ph filepath">aarch64</span> targets. </p> <div class="tablenoborder"><a name="site-spec-using-user-rc__opt-desc" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="site-spec-using-user-rc__opt-desc" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 3. Examples of Using siterc and User rc Files</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="33.33333333333333%" id="d54e1860" rowspan="1" colspan="1">To do this...</th> <th class="entry" valign="top" width="66.66666666666666%" id="d54e1863" rowspan="1" colspan="1">Add the line shown to the indicated file(s)</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1860" rowspan="1" colspan="1"> Make the libraries found in <span class="ph filepath">/⁠opt/newlibs/64</span> available to all Linux compilations</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e1863" rowspan="1" colspan="1"><samp class="ph codeph">set SITELIB=/opt/newlibs/64;</samp> to <span class="ph filepath">/opt/nv/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin/siterc</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1860" rowspan="1" colspan="1"> Add a new library path, <span class="ph filepath">/opt/local/fast</span>, to all Linux compilations</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e1863" rowspan="1" colspan="1"><samp class="ph codeph">append SITELIB=/opt/local/fast;</samp> to <span class="ph filepath">/opt/nv/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin/siterc</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1860" rowspan="1" colspan="1"> With Linux compilations, change <span class="keyword option">-⁠Mmpi</span> to link in <span class="ph filepath">/opt/mympi/64/libmpix.a</span></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e1863" rowspan="1" colspan="1"><samp class="ph codeph">set MPILIBDIR=/opt/mympi/64;</samp> <samp class="ph codeph">set MPILIBNAME=mpix;</samp> to <span class="ph filepath">/opt/nv/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin/siterc</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e1860" rowspan="1" colspan="1"> Build a Fortran executable for Linux that resolves shared objects in the relative directory <span class="ph filepath">./REDIST</span></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e1863" rowspan="1" colspan="1"><samp class="ph codeph">set RPATH=./REDIST;</samp> to <span class="ph filepath">~/.mynvfortranrc</span></td> </tr> </tbody> </table> </div> </div> </div> </div> </div> <div class="topic concept nested1" id="common-dev-tasks"><a name="common-dev-tasks" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#common-dev-tasks" name="common-dev-tasks" shape="rect">1.8. Common Development Tasks</a></h3> <div class="body conbody"> <p class="p"> Now that you have a brief introduction to the compiler, let's look at some common development tasks that you might wish to perform. </p> <ul class="ul"> <li class="li"> When you compile code you can specify a number of options on the command line that define specific characteristics related to how the program is compiled and linked, typically enhancing or overriding the default behavior of the compiler. 
For a list of the most common command line options and information on all the command line options, refer to <a class="xref" href="index.html#cmdln-options-use" shape="rect">Use Command-line Options</a>. </li> <li class="li"> Code optimization for multicore CPUs allows the compiler to organize your code for efficient execution. While possibly increasing compilation time and making the code more difficult to debug, these techniques typically produce code that runs significantly faster than code that does not use them. For more information on optimization refer to <a class="xref" href="index.html#opt-parallel" shape="rect">Multicore CPU Optimization</a>. </li> <li class="li"> Function inlining, a special type of optimization, replaces a call to a function or a subroutine with the body of the function or subroutine. This process can speed up execution by eliminating parameter passing and the function or subroutine call and return overhead. In addition, function inlining allows the compiler to optimize the function with the rest of the code. However, function inlining may also result in much larger code size with no increase in execution speed. For more information on function inlining, refer to <a class="xref" href="index.html#fn-inline-use" shape="rect">Using Function Inlining</a>. </li> <li class="li"> A library is a collection of functions or subprograms used to develop software. Libraries contain "helper" code and data, which provide services to independent programs, allowing code and data to be shared and changed in a modular fashion. The functions and programs in a library are grouped for ease of use and linking. When creating your programs, it is often useful to incorporate standard libraries or proprietary ones. For more information on this topic, refer to <a class="xref" href="index.html#lib-create-use" shape="rect">Creating and Using Libraries</a>. 
</li> <li class="li"> Environment variables define a set of dynamic values that can affect the way running processes behave on a computer. It is often useful to use these variables to set and pass information that alters the default behavior of the NVIDIA HPC Compilers and the executables which they generate. For more information on these variables, refer to <a class="xref" href="index.html#env-vars-use" shape="rect">Environment Variables</a>. </li> <li class="li"> Deployment, though possibly an infrequent task, can present some unique issues related to concerns of porting the code to other systems. Deployment, in this context, involves distribution of a specific file or set of files that are already compiled and configured. The distribution must occur in such a way that the application executes accurately on another system which may not be configured exactly the same as the system on which the code was created. For more information on what you might need to know to successfully deploy your code, refer to <a class="xref" href="index.html#deploy-dist-files" shape="rect">Distributing Files – Deployment</a>. </li> <li class="li"> An intrinsic is a function available in a given language whose implementation is handled specially by the compiler. Intrinsics make using processor-specific enhancements easier because they provide a C++ and C language interface to assembly instructions. In doing so, the compiler manages details that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data. </li> </ul> </div> </div> </div> <div class="topic concept nested0" id="cmdln-options-use"><a name="cmdln-options-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#cmdln-options-use" name="cmdln-options-use" shape="rect">2. Use Command-line Options</a></h2> <div class="body conbody"> <p class="p"> A command line option allows you to control specific behavior when a program is compiled and linked. 
This section describes the syntax for properly using command-line options and provides a brief overview of a few of the more common options. </p> </div> <div class="topic concept nested1" id="cmdln-opt-overview"><a name="cmdln-opt-overview" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cmdln-opt-overview" name="cmdln-opt-overview" shape="rect">2.1. Command-line Option Overview</a></h3> <div class="body conbody"> <p class="p">Before looking at all the command-line options, first become familiar with the syntax for these options. There are a large number of options available to you, yet most users only use a few of them. So, start simple and progress into using the more advanced options. </p> <p class="p">By default, the NVIDIA HPC Compilers generate code that is optimized for the type of processor on which compilation is performed, the compilation host. Before adding options to your command-line, review <a class="xref" href="index.html#cmdln-options-help" shape="rect">Help with Command-line Options</a> and <a class="xref" href="index.html#freq-used-options" shape="rect">Frequently-used Options</a>. </p> </div> <div class="topic concept nested2" id="cmdln-opt-syntax"><a name="cmdln-opt-syntax" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cmdln-opt-syntax" name="cmdln-opt-syntax" shape="rect">2.1.1. Command-line Options Syntax</a></h3> <div class="body conbody"> <p class="p"> On a command-line, options need to be preceded by a hyphen (-⁠). If the compiler does not recognize an option, you get an unknown switch error. The error can be downgraded to a warning by adding the <span class="ph filepath">-noswitcherror</span> option. 
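</p> <p class="p"> As a sketch of this behavior, the commands below use a deliberately invalid option. Both <span class="ph filepath">-Mnotarealoption</span> and <span class="ph filepath">hello.f90</span> are hypothetical names chosen for illustration, and placing <span class="ph filepath">-noswitcherror</span> first is an assumption rather than a documented requirement. </p><pre xml:space="preserve">$ nvfortran -Mnotarealoption hello.f90                  # unknown switch: reported as an error
$ nvfortran -noswitcherror -Mnotarealoption hello.f90   # same switch: reported as a warning instead
</pre><p class="p"> The exact wording of the diagnostic is omitted here because it may differ between compiler releases. 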
</p> <p class="p"> This document uses the following notation when describing options: </p> <dl class="dl"> <dt class="dt dlterm">[item]</dt> <dd class="dd"> <p class="p">Square brackets indicate that the enclosed item is optional.</p> </dd> <dt class="dt dlterm">{item | item}</dt> <dd class="dd"> <p class="p"> Braces indicate that you must select one and only one of the enclosed items. A vertical bar (|) separates the choices. </p> </dd> <dt class="dt dlterm">...</dt> <dd class="dd"> <p class="p"> A horizontal ellipsis indicates that zero or more instances of the preceding item are valid. </p> </dd> </dl> </div> </div> <div class="topic concept nested2" id="cmdln-subops"><a name="cmdln-subops" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cmdln-subops" name="cmdln-subops" shape="rect">2.1.2. Command-line Suboptions</a></h3> <div class="body conbody"> <p class="p">Some options accept several suboptions. You can specify these suboptions either by using the full option statement multiple times or by using a comma-separated list for the suboptions. </p> <p class="p">The following two command lines are equivalent:</p><pre xml:space="preserve">nvfortran -Mvect=simd -Mvect=noaltcode</pre><pre xml:space="preserve">nvfortran -Mvect=simd,noaltcode</pre></div> </div> <div class="topic concept nested2" id="cmdln-conflict-ops"><a name="cmdln-conflict-ops" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cmdln-conflict-ops" name="cmdln-conflict-ops" shape="rect">2.1.3. Command-line Conflicting Options</a></h3> <div class="body conbody"> <p class="p">Some options have an opposite or negated counterpart. For example, both <span class="ph filepath">-⁠Mvect</span> and <span class="ph filepath">-⁠Mnovect</span> are available. <span class="ph filepath">-⁠Mvect</span> enables vectorization and <span class="ph filepath">-⁠Mnovect</span> disables it. If you use both of these options on the same command line, they conflict. 
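</p> <p class="p"> For instance, the following command lines (with a hypothetical source file <span class="ph filepath">prog.f90</span>) supply both options. Because the last conflicting option on a command line takes precedence, the expected effect is the one noted in each comment; the comments describe the rule rather than captured compiler output. </p><pre xml:space="preserve">$ nvfortran -Mvect -Mnovect prog.f90   # -Mnovect appears last, so vectorization is disabled
$ nvfortran -Mnovect -Mvect prog.f90   # -Mvect appears last, so vectorization is enabled
</pre><p class="p"> Only the relative order of the two conflicting options matters in this sketch. 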
</p> <div class="note note"><span class="notetitle">Note:</span> When you use conflicting options on a command line, the last encountered option takes precedence over any previous one. </div> <p class="p">The conflicting options rule is important for a number of reasons.</p> <ul class="ul"> <li class="li">Some options, such as <span class="ph filepath">-⁠fast</span>, include other options. Therefore, it is possible for you to be unaware that you have conflicting options. </li> <li class="li">You can use this rule to create makefiles that apply specific flags to a set of files, as shown in the following example.</li> </ul> <div class="example"> <h4 class="title sectiontitle">Example: Makefiles with Options</h4> <p class="p">In this makefile fragment, CCFLAGS uses vectorization. CCNOVECTFLAGS uses the flags defined for CCFLAGS but disables vectorization.</p><pre xml:space="preserve">CCFLAGS=-c -Mvect=simd
CCNOVECTFLAGS=$(CCFLAGS) -Mnovect</pre></div> </div> </div> </div> <div class="topic concept nested1" id="cmdln-options-help"><a name="cmdln-options-help" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cmdln-options-help" name="cmdln-options-help" shape="rect">2.2. Help with Command-line Options</a></h3> <div class="body conbody"> <p class="p">If you are just getting started with the NVIDIA HPC Compilers, it is helpful to know which options are available, when to use them, and which options most users find effective. </p> <div class="section"> <h3 class="title sectiontitle">Using -help</h3> <p class="p">The <span class="ph filepath">-⁠help</span> option is useful because it provides information about all options supported by a given compiler. </p> <p class="p">You can use <span class="ph filepath">-⁠help</span> in one of three ways: </p> <div class="p"> <ul class="ul"> <li class="li">Use <span class="ph filepath">-⁠help</span> with no parameters to obtain a list of all the available options with a brief one-line description of each. 
</li> <li class="li"> <p class="p">Add a parameter to <span class="ph filepath">-⁠help</span> to restrict the output to information about a specific option. The syntax for this usage is: </p><pre xml:space="preserve">-help &lt;command line option&gt;</pre><p class="p">Suppose you use the following command to restrict the output to information about the <span class="ph filepath">-⁠fast</span> option: </p><pre xml:space="preserve">$ nvfortran -help -fast</pre><p class="p">The output you see is similar to:</p><pre xml:space="preserve">-fast Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre</pre><p class="p">In the following example, we pass <span class="ph filepath">-⁠help</span> as the parameter to restrict the output to information about the <span class="ph filepath">-⁠help</span> option itself. The usage information for <span class="ph filepath">-⁠help</span> shows how groups of options can be listed or examined according to function. </p><pre xml:space="preserve">$ nvfortran -help -help
-help[=groups|asm|debug|language|linker|opt|other|overall|phase|prepro|
       suffix|switch|target|variable]</pre></li> <li class="li"> <p class="p">Add a parameter to <span class="ph filepath">-⁠help</span> to restrict the output to a specific set of options or to a building process. The syntax for this usage is: </p><pre xml:space="preserve">-help=&lt;subgroup&gt;</pre></li> </ul> </div> </div> </div> </div> <div class="topic concept nested1" id="gs-perform"><a name="gs-perform" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gs-perform" name="gs-perform" shape="rect">2.3. 
Getting Started with Performance</a></h3> <div class="body conbody"> <p class="p">This section provides a quick overview of a few of the command-line options that are useful in improving multicore CPU performance.</p> </div> <div class="topic concept nested2" id="gs-perform-fast"><a name="gs-perform-fast" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gs-perform-fast" name="gs-perform-fast" shape="rect">2.3.1. Using -fast</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA HPC Compilers implement a wide range of options that give users a fine degree of control over each optimization phase. When it comes to optimization of code, the quickest way to start is to use the option <span class="keyword option">-⁠fast</span>. This option selects a generally optimal set of flags. It incorporates optimization options to enable use of vector streaming SIMD instructions<span class="ph"> for 64-bit targets</span>. It enables vectorization with SIMD instructions, cache alignment, and flush-to-zero mode. </p> <div class="note note"><span class="notetitle">Note:</span> The contents of the <span class="keyword option">-⁠fast</span> option are host-dependent. Further, you should use this option on both compile and link command lines. </div> <p class="p"> The following table shows the typical <span class="keyword option">-⁠fast</span> options. </p> <div class="tablenoborder"><a name="gs-perform-fast__fast-options" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="gs-perform-fast__fast-options" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 4. 
Typical <span class="keyword option">-⁠fast</span> Options</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="25%" id="d54e2355" rowspan="1" colspan="1">Use this option...</th> <th class="entry" valign="top" width="75%" id="d54e2358" rowspan="1" colspan="1">To do this...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2355" rowspan="1" colspan="1"><span class="keyword option">-O2</span></td> <td class="entry" valign="top" width="75%" headers="d54e2358" rowspan="1" colspan="1">Specifies a code optimization level of 2.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2355" rowspan="1" colspan="1"><span class="keyword option">-Munroll=c:1</span></td> <td class="entry" valign="top" width="75%" headers="d54e2358" rowspan="1" colspan="1"> Unrolls loops, executing multiple instances of the original loop during each iteration. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2355" rowspan="1" colspan="1"><span class="keyword option">-Mnoframe</span></td> <td class="entry" valign="top" width="75%" headers="d54e2358" rowspan="1" colspan="1"> Do not generate code to set up a stack frame. <strong class="ph b">Note:</strong> With this option, a stack trace does not work. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2355" rowspan="1" colspan="1"><span class="keyword option">-Mlre</span></td> <td class="entry" valign="top" width="75%" headers="d54e2358" rowspan="1" colspan="1">Enable loop-carried redundancy elimination.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2355" rowspan="1" colspan="1"><span class="keyword option">-Mpre</span></td> <td class="entry" valign="top" width="75%" headers="d54e2358" rowspan="1" colspan="1">Enable partial redundancy elimination.</td> </tr> </tbody> </table> </div> <p class="p"> On most modern CPUs the <span class="keyword option">-⁠fast</span> option also includes the options shown in the following table: </p> <div class="tablenoborder"><a name="gs-perform-fast__fast-64-options" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="gs-perform-fast__fast-64-options" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 5. 
Additional <span class="keyword option">-⁠fast</span> Options</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="25%" id="d54e2446" rowspan="1" colspan="1">Use this option...</th> <th class="entry" valign="top" width="75%" id="d54e2449" rowspan="1" colspan="1">To do this...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2446" rowspan="1" colspan="1"><span class="keyword option">-Mvect=simd </span></td> <td class="entry" valign="top" width="75%" headers="d54e2449" rowspan="1" colspan="1">Generates packed SIMD instructions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2446" rowspan="1" colspan="1"><span class="keyword option">-Mcache_align</span></td> <td class="entry" valign="top" width="75%" headers="d54e2449" rowspan="1" colspan="1">Aligns long objects on cache-line boundaries.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2446" rowspan="1" colspan="1"><span class="keyword option">-Mflushz </span></td> <td class="entry" valign="top" width="75%" headers="d54e2449" rowspan="1" colspan="1">Sets flush-to-zero mode.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e2446" rowspan="1" colspan="1"><span class="keyword option">-M[no]vect </span></td> <td class="entry" valign="top" width="75%" headers="d54e2449" rowspan="1" colspan="1">Controls automatic vector pipelining.</td> </tr> </tbody> </table> </div> <p class="p"> To see the specific behavior of <span class="keyword option">-⁠fast</span> for your target, use the following command: </p><pre class="pre screen" xml:space="preserve">$ nvfortran -help -fast</pre></div> </div> <div class="topic concept nested2" id="gs-perform-other"><a name="gs-perform-other" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gs-perform-other" name="gs-perform-other" shape="rect">2.3.2. 
Other Performance-Related Options</a></h3> <div class="body conbody"> <p class="p"> While <span class="keyword option">-⁠fast</span> is designed to be the quickest route to the best performance, the optimizations it enables are limited to routine boundaries. Depending on the nature and writing style of the source code, the compiler can often optimize further when it knows the global context in which a given routine is used. For instance, determining the possible value range of the actual parameters of a routine could enable a loop to be vectorized; similarly, determining the static occurrence of calls helps to decide which routines are beneficial to inline. </p> <p class="p"> These types of global optimizations are controlled by Interprocedural Analysis (IPA) in the NVIDIA HPC Compilers. The <span class="keyword option">-⁠Mipa</span> option enables IPA, and <span class="keyword option">-⁠Mipa=fast</span> is the recommended setting for the best performance from global optimization. You can also add the <span class="keyword option">inline</span> suboption to enable automatic global inlining across files, for example <span class="keyword option">-⁠Mipa=fast,inline</span>. </p> <p class="p"> For more information on optimization, refer to <a class="xref" href="index.html#opt-parallel" shape="rect">Multicore CPU Optimization</a>. For specific information about these options, refer to the ‘Optimization Controls’ section of the <span class="ph"><a class="xref" href="../hpc-compilers-ref-guide/index.htm" target="_blank" shape="rect">HPC Compilers Reference Guide</a></span>. </p> </div> </div> </div> <div class="topic concept nested1" id="freq-used-options"><a name="freq-used-options" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#freq-used-options" name="freq-used-options" shape="rect">2.4. 
Frequently-used Options</a></h3> <div class="body conbody"> <p class="p"> In addition to overall performance, there are a number of other options that many users find useful when getting started. The following table provides a brief summary of these options. </p> <div class="tablenoborder"><a name="freq-used-options__opt-desc" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="freq-used-options__opt-desc" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 6. Commonly Used Command-Line Options</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="33.33333333333333%" id="d54e2582" rowspan="1" colspan="1">Use this option...</th> <th class="entry" valign="top" width="66.66666666666666%" id="d54e2585" rowspan="1" colspan="1">To do this...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠acc</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Enable parallelization using OpenACC directives. By default the compilers will parallelize and offload OpenACC regions to an NVIDIA GPU. Use <samp class="ph codeph">-⁠acc=multicore</samp> to parallelize OpenACC regions for execution on all the cores of a multicore CPU. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠fast</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> This option creates a generally optimal set of flags for targets that support SIMD capability. It incorporates optimization options to enable use of vector streaming SIMD instructions, cache alignment and flushz. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠g</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Instructs the compiler to include symbolic debugging information in the object module; sets the optimization level to zero unless a <samp class="ph codeph">-⁠O</samp> option is present on the command line. Conversely, to prevent the generation of DWARF information, use the <samp class="ph codeph">-⁠Mnodwarf</samp> option. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gopt</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Instructs the compiler to include symbolic debugging information in the object file, and to generate optimized code identical to that generated when <samp class="ph codeph">-⁠g</samp> is not specified. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Control the type of GPU for which code is generated, the version of CUDA to be targeted, and several other aspects of GPU code generation. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠help</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Provides information about available options.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠mcmodel=medium</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Enables medium-model code generation for 64-bit targets, which is useful when the data space of the program exceeds 4 GB. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠mp</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Enable parallelization using OpenMP directives. By default the compilers will parallelize OpenMP regions for execution on all the cores of a multicore CPU. Use <samp class="ph codeph">-⁠mp=gpu</samp> to parallelize OpenMP regions for offload to an NVIDIA GPU. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠Mconcur</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Instructs the compiler to enable auto-concurrentization of loops. If specified, the compiler uses multiple CPU cores to execute loops that it determines to be parallelizable; thus, loop iterations are split to execute optimally in a multithreaded execution context. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠Minfo</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Instructs the compiler to produce information on standard error.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠Minline</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Enables function inlining.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠Mipa=fast,inline</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Enables interprocedural analysis and optimization. Also enables automatic procedure inlining. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠Mkeepasm</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Keeps the generated assembly files.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠Munroll</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Invokes the loop unroller to unroll loops, executing multiple instances of the loop during each iteration. This also sets the optimization level to 2 if the level is set to less than 2, or if no -⁠O or -⁠g options are supplied. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠M[no]vect</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Enables [Disables] the code vectorizer.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">--⁠[no_]exceptions</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Removes exception handling from user code. For C⁠+⁠+, declares that the functions in this file generate no C⁠+⁠+ exceptions, allowing more optimal code generation. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠o</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Names the output file.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠O &lt;level&gt;</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Specifies code optimization level where &lt;level&gt; is 0, 1, 2, 3, or 4.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠stdpar</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Enable parallelization and offloading of Standard C⁠+⁠+ and Fortran parallel constructs to NVIDIA GPUs; default is -⁠stdpar=gpu. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠tp &lt;target&gt; </samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1"> Specify a CPU target other than the compilation host CPU. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e2582" rowspan="1" colspan="1"><samp class="ph codeph">-⁠Wl, &lt;option&gt;</samp></td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e2585" rowspan="1" colspan="1">Compiler driver passes the specified options to the linker.</td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="float-point-subnormal"><a name="float-point-subnormal" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#float-point-subnormal" name="float-point-subnormal" shape="rect">2.5. Floating-point Subnormal</a></h3> <div class="body conbody"> <p class="p"> Starting with the 22.7 release of the NV HPC SDK, the default handling of floating-point denormal values (called "subnormal" in IEEE 754 terminology) at runtime has been changed to be more consistent across x86_64 and aarch64 processors. </p> <p class="p">Denormal values can be both operands to, and results of, floating-point operations. The x86_64 ISA differentiates between the two categories: it uses the term "daz" (denormals are zeros) for operands and "flushz" (flush to zero) for results. The Arm V8 ISA as defined can also differentiate between the two categories, but the processors that the NV HPC SDK currently supports have only a single setting for both operands and results, defined as "fz" in the floating-point status and control register. 
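</p> <p class="p">As a hedged sketch of how the two categories map onto the compiler switches described in this section: on an x86_64 host, <span class="keyword option">-⁠Mnodaz</span> addresses operands and <span class="keyword option">-⁠Mnoflushz</span> addresses results, so an application that depends on non-zero subnormal values might build the file containing its main program with both. The file name <span class="ph filepath">main.f90</span> is hypothetical, and whether these switches are appropriate depends on the application: </p><pre xml:space="preserve">$ nvfortran -Mnodaz -Mnoflushz main.f90</pre><p class="p">On the supported Arm processors there is a single setting for operands and results, so the operand and result categories cannot be controlled independently there. 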
</p> <p class="p">The NV HPC SDK C, C++, and Fortran compilers have command line switches <span class="keyword option">-⁠M[no]daz</span> and <span class="keyword option">-⁠M[no]flushz</span>, which, when specified for the C/C++ main function or the Fortran main program, affect how denormals are handled by the processor at runtime. The values of these two command line switches are passed to the runtime library to configure the floating-point status and control register at program startup. </p> <p class="p">The NV HPC SDK supports x86_64 processors from both Intel and AMD, and ArmV8.1 and later processors. The following table summarizes the default settings of the <span class="keyword option">-⁠Mdaz</span> and <span class="keyword option">-⁠Mflushz</span> command line switches before and as of the 22.7 release. </p> <div class="tablenoborder"><a name="float-point-subnormal__default-daz-flushz-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="float-point-subnormal__default-daz-flushz-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 7. 
Default settings of <span class="keyword option">-⁠Mdaz</span> and <span class="keyword option">-⁠Mflushz</span></span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e2881" rowspan="1" colspan="1"> </th> <th class="entry" valign="top" width="NaN%" id="d54e2883" rowspan="1" colspan="1">Pre 22.7 defaults</th> <th class="entry" valign="top" width="NaN%" id="d54e2886" rowspan="1" colspan="1">22.7 defaults</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e2881" rowspan="1" colspan="1">Intel</td> <td class="entry" valign="top" width="NaN%" headers="d54e2883" rowspan="1" colspan="1"> <p class="p">-Mdaz</p> <p class="p">-Mnoflushz</p> </td> <td class="entry" valign="top" width="NaN%" headers="d54e2886" rowspan="1" colspan="1"> <p class="p">-Mdaz</p> <p class="p">-Mflushz</p> </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e2881" rowspan="1" colspan="1">AMD</td> <td class="entry" valign="top" width="NaN%" headers="d54e2883" rowspan="1" colspan="1"> <p class="p">-Mnodaz</p> <p class="p">-Mnoflushz</p> </td> <td class="entry" valign="top" width="NaN%" headers="d54e2886" rowspan="1" colspan="1"> <p class="p">-Mdaz</p> <p class="p">-Mflushz</p> </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e2881" rowspan="1" colspan="1">Arm processors</td> <td class="entry" valign="top" width="NaN%" headers="d54e2883" rowspan="1" colspan="1"> <p class="p">-Mnodaz</p> </td> <td class="entry" valign="top" width="NaN%" headers="d54e2886" rowspan="1" colspan="1"> <p class="p">-Mdaz</p> </td> </tr> </tbody> </table> </div> <p class="p">With the NV HPC SDK 22.7 release, the default handling of denormal operands and results is to treat them as zero, as if the main function/program were compiled with <span class="keyword option">-⁠Mdaz</span> <span class="keyword option">-⁠Mflushz</span>. 
Consequently, these changes can potentially affect applications that are dependent on subnormal values being non-zero. </p> <p class="p">Along with the change to the default treatment of denormal values, users now have the ability to configure the floating-point status and control register through the <span class="ph filepath">NVCOMPILER_FPU_STATE</span> environment variable - effectively overriding how the program was originally compiled. For further information, see the description of the <a class="xref" href="index.html#env-vars-nv-fpu-state" shape="rect"><span class="ph filepath">NVCOMPILER_FPU_STATE</span></a> environment variable. </p> </div> </div> </div> <div class="topic concept nested0" id="opt-parallel"><a name="opt-parallel" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#opt-parallel" name="opt-parallel" shape="rect">3. Multicore CPU Optimization</a></h2> <div class="body conbody"> <p class="p"> Source code that is readable, maintainable, and produces correct results is not always organized for efficient execution. Normally, the first step in the program development process involves producing code that executes and produces the correct results. This first step usually involves compiling without much worry about optimization. After code is compiled and debugged, code optimization and parallelization become an issue. </p> <p class="p"> Invoking one of the NVIDIA HPC Compiler commands with certain options instructs the compiler to generate optimized code. Optimization is not always performed since it increases compilation time and may make debugging difficult. However, optimization produces more efficient code that usually runs significantly faster than code that is not optimized. </p> <p class="p"> The compilers optimize code according to the specified optimization level. 
You can use a number of options to specify the optimization levels, including <span class="keyword option">-⁠O</span>, <span class="keyword option">-⁠Mvect</span>, <span class="keyword option">-⁠Mipa</span> and <span class="keyword option">-⁠Mconcur</span>. In addition, you can use several of the <span class="keyword option">-⁠M&lt;nvflag&gt;</span> switches to control specific types of optimization. </p> <p class="p"> This chapter describes the overall effect of the optimization options supported by the NVIDIA HPC Compilers, and basic usage of several options. </p> </div> <div class="topic concept nested1" id="opt-overview"><a name="opt-overview" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-overview" name="opt-overview" shape="rect">3.1. Overview of Optimization</a></h3> <div class="body conbody"> <p class="p"> In general, optimization involves using transformations and replacements that generate more efficient code. This is done by the compiler and involves replacements that are independent of the particular target processor's architecture as well as replacements that take advantage of the <span class="ph">x86-64</span> or <span class="ph">OpenPOWER</span> architecture, instruction set and registers. 
</p> <p class="p">For discussion purposes, we categorize optimization: </p> <div class="p"> <ul class="sl simple"> <li class="sli"><a class="xref" href="index.html#opt-local" shape="rect">Local Optimization</a></li> <li class="sli"><a class="xref" href="index.html#opt-global" shape="rect">Global Optimization</a></li> <li class="sli"><a class="xref" href="index.html#opt-loop" shape="rect">Loop Optimization</a></li> <li class="sli"><a class="xref" href="index.html#opt-ipa" shape="rect">Interprocedural Analysis (IPA) and Optimization</a></li> <li class="sli"><a class="xref" href="index.html#opt-func-inline" shape="rect">Optimization Through Function Inlining</a></li> </ul> </div> </div> <div class="topic concept nested2" id="opt-local"><a name="opt-local" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-local" name="opt-local" shape="rect">3.1.1. Local Optimization</a></h3> <div class="body conbody"> <p class="p">A basic block is a sequence of statements in which the flow of control enters at the beginning and leaves at the end without the possibility of branching, except at the end. Local optimization is performed on a block-by-block basis within a program’s basic blocks. </p> <p class="p">The NVIDIA HPC Compilers perform many types of local optimization including: algebraic identity removal, constant folding, common sub-expression elimination, redundant load and store elimination, scheduling, strength reduction, and peephole optimizations. </p> </div> </div> <div class="topic concept nested2" id="opt-global"><a name="opt-global" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-global" name="opt-global" shape="rect">3.1.2. Global Optimization</a></h3> <div class="body conbody"> <p class="p">This optimization is performed on a subprogram/function over all its basic blocks. The optimizer performs control-flow and data-flow analysis for an entire program unit. 
All loops, including those formed by ad hoc branches such as IFs or GOTOs, are detected and optimized. </p> <p class="p">Global optimization includes: constant propagation, copy propagation, dead store elimination, global register allocation, invariant code motion, and induction variable elimination. </p> </div> </div> <div class="topic concept nested2" id="opt-loop"><a name="opt-loop" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-loop" name="opt-loop" shape="rect">3.1.3. Loop Optimization: Unrolling, Vectorization and Parallelization</a></h3> <div class="body conbody"> <p class="p"> The performance of certain classes of loops may be improved through vectorization or unrolling options. Vectorization transforms loops to improve memory access performance and make use of packed <span class="ph">vector</span> instructions (such as <span class="ph">SSE</span> on x86-64), which perform the same operation on multiple data items concurrently. Unrolling replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization, vectorization and scheduling of instructions. Performance for loops on systems with multiple processors may also improve using the parallelization features of the NVIDIA HPC Compilers. </p> </div> </div> <div class="topic concept nested2" id="opt-ipa"><a name="opt-ipa" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa" name="opt-ipa" shape="rect">3.1.4. Interprocedural Analysis (IPA) and Optimization</a></h3> <div class="body conbody"> <p class="p">Interprocedural analysis (IPA) allows use of information across function call boundaries to perform optimizations that would otherwise be unavailable. For example, if the actual argument to a function is in fact a constant in the caller, it may be possible to propagate that constant into the callee and perform optimizations that are not valid if the dummy argument is treated as a variable. 
A wide range of optimizations are enabled or improved by using IPA, including but not limited to data alignment optimizations, argument removal, constant propagation, pointer disambiguation, pure function detection, F90/F95 array shape propagation, data placement, empty function removal, automatic function inlining, inlining of functions from pre-compiled libraries, and interprocedural optimization of functions from pre-compiled libraries. </p> </div> </div> <div class="topic concept nested2" id="opt-func-inline"><a name="opt-func-inline" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-func-inline" name="opt-func-inline" shape="rect">3.1.5. Function Inlining</a></h3> <div class="body conbody"> <p class="p">This optimization allows a call to a function to be replaced by a copy of the body of that function. This optimization will sometimes speed up execution by eliminating the function call and return overhead. Function inlining may also create opportunities for other types of optimization. Function inlining is not always beneficial. When used improperly it may increase code size and generate less efficient code. </p> </div> </div> </div> <div class="topic concept nested1" id="opt-gs"><a name="opt-gs" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-gs" name="opt-gs" shape="rect">3.2. Getting Started with Optimization</a></h3> <div class="body conbody"> <p class="p"> The first concern should be getting the program to execute and produce correct results. To get the program running, start by compiling and linking without optimization. Add <span class="keyword option">-⁠O0</span> to the compile line to select no optimization; or add <span class="keyword option">-⁠g</span> to debug the program easily and isolate any coding errors exposed during porting. </p> <p class="p"> To get started quickly with optimization, a good set of options to use with any of the NVIDIA HPC compilers is <span class="keyword option">-⁠fast</span>. 
For example: </p><pre xml:space="preserve">$ nvfortran -fast -Mipa=fast,inline prog.f</pre><p class="p"> For all of the NVIDIA HPC Fortran<span class="ph">, C++ and C</span> compilers, the <span class="keyword option">-⁠fast -⁠Mipa=fast,inline</span> options generally produce code that is well-optimized without the possibility of significant slowdowns due to pathological cases. </p> <ul class="ul"> <li class="li"> The <span class="keyword option">-⁠fast</span> option is an aggregate option that includes a number of individual NVIDIA compiler options; which compiler options are included depends on the target for which compilation is performed. </li> <li class="li"> The <span class="keyword option">-⁠Mipa=fast,inline</span> option invokes interprocedural analysis (IPA), including several IPA suboptions. The inline suboption enables automatic inlining with IPA. If you do not wish to use automatic inlining, you can compile with <span class="keyword option">-⁠Mipa=fast</span> and use several IPA suboptions without inlining. </li> </ul> <p class="p"> These aggregate options incorporate a generally optimal set of flags for targets that support SIMD capability, including vectorization with SIMD instructions, cache alignment, and flushz. </p> <p class="p"> The following table shows the typical <span class="keyword option">-⁠fast</span> options. </p> <div class="tablenoborder"><a name="opt-gs__fast-fastsse-options" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="opt-gs__fast-fastsse-options" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 8. 
Typical <span class="keyword option">-⁠fast</span> Options </span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="25%" id="d54e3231" rowspan="1" colspan="1">Use this option...</th> <th class="entry" valign="top" width="75%" id="d54e3234" rowspan="1" colspan="1">To do this...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3231" rowspan="1" colspan="1"><span class="keyword option">-⁠O2</span></td> <td class="entry" valign="top" width="75%" headers="d54e3234" rowspan="1" colspan="1">Specifies a code optimization level of 2<span class="ph"> and -⁠Mvect=SIMD</span>. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3231" rowspan="1" colspan="1"><span class="keyword option">-⁠Munroll=c:1</span></td> <td class="entry" valign="top" width="75%" headers="d54e3234" rowspan="1" colspan="1">Unrolls loops, executing multiple instances of the original loop during each iteration.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3231" rowspan="1" colspan="1"><span class="keyword option">-⁠Mnoframe</span></td> <td class="entry" valign="top" width="75%" headers="d54e3234" rowspan="1" colspan="1">Does not generate code to set up a stack frame. <strong class="ph b">Note:</strong> With this option, a stack trace does not work. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3231" rowspan="1" colspan="1"><span class="keyword option">-⁠Mlre</span></td> <td class="entry" valign="top" width="75%" headers="d54e3234" rowspan="1" colspan="1">Enables loop-carried redundancy elimination.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3231" rowspan="1" colspan="1"><span class="keyword option">-⁠Mautoinline</span></td> <td class="entry" valign="top" width="75%" headers="d54e3234" rowspan="1" colspan="1">Enables automatic function inlining in C and C++.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3231" rowspan="1" colspan="1"><span class="keyword option">-⁠Mpre</span></td> <td class="entry" valign="top" width="75%" headers="d54e3234" rowspan="1" colspan="1">Enables partial redundancy elimination.</td> </tr> </tbody> </table> </div> <p class="p"> On modern multicore CPUs, the <span class="keyword option">-⁠fast</span> option also typically includes the options shown in the following table: </p> <div class="tablenoborder"><a name="opt-gs__fast-fastsse-64-options" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="opt-gs__fast-fastsse-64-options" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 9. 
Additional <span class="keyword option">-⁠fast</span> Options </span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="25%" id="d54e3335" rowspan="1" colspan="1">Use this option...</th> <th class="entry" valign="top" width="75%" id="d54e3338" rowspan="1" colspan="1">To do this...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3335" rowspan="1" colspan="1"><span class="keyword option">-⁠Mvect=simd</span></td> <td class="entry" valign="top" width="75%" headers="d54e3338" rowspan="1" colspan="1">Generates packed SSE and AVX instructions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3335" rowspan="1" colspan="1"><span class="keyword option">-⁠Mcache_align</span></td> <td class="entry" valign="top" width="75%" headers="d54e3338" rowspan="1" colspan="1">Aligns long objects on cache-line boundaries.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="25%" headers="d54e3335" rowspan="1" colspan="1"><span class="keyword option">-⁠Mflushz </span></td> <td class="entry" valign="top" width="75%" headers="d54e3338" rowspan="1" colspan="1">Sets flush-to-zero mode.</td> </tr> </tbody> </table> </div> <p class="p"> By experimenting with individual compiler options on a file-by-file basis, further significant performance gains can sometimes be realized. However, depending on the coding style, individual optimizations can sometimes cause slowdowns, and must be used carefully to ensure performance improvements. 
</p> <p class="p"> There are other useful command line options related to optimization and parallelization, such as <a class="xref" href="index.html#opt-gs-help" shape="rect"><span class="keyword option">-⁠help</span></a>, <a class="xref" href="index.html#opt-gs-minfo" shape="rect"><span class="keyword option">-⁠Minfo</span></a>, <a class="xref" href="index.html#opt-gs-mneginfo" shape="rect"><span class="keyword option">-⁠Mneginfo</span></a>, <a class="xref" href="index.html#opt-gs-dryrun" shape="rect"><span class="keyword option">-⁠dryrun</span></a>, and <a class="xref" href="index.html#opt-gs-v" shape="rect"><span class="keyword option">-⁠v</span></a>. </p> </div> <div class="topic concept nested2" id="opt-gs-help"><a name="opt-gs-help" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-gs-help" name="opt-gs-help" shape="rect">3.2.1. -help</a></h3> <div class="body conbody"> <p class="p"> As described in <a class="xref" href="index.html#cmdln-options-help" shape="rect">Help with Command-Line Options</a>, you can see a specification of any command-line option by invoking any of the NVIDIA HPC Compilers with <span class="ph filepath">-⁠help</span> in combination with the option in question, without specifying any input files. </p> <p class="p"> For example, you might want information on <span class="ph filepath">-⁠O</span>: </p><pre xml:space="preserve">$ nvfortran -help -O</pre><p class="p"> The resulting output is similar to this: </p><pre xml:space="preserve">-O Set opt level. 
All -O1 optimizations plus traditional scheduling and global scalar optimizations performed</pre><p class="p"> Or you can see the full functionality of <span class="ph filepath">-⁠help</span> itself, which can return information on either an individual option or groups of options: </p><pre xml:space="preserve">$ nvfortran -help -help</pre><p class="p"> The resulting output is similar to this: </p><pre xml:space="preserve">-help[=groups|asm|debug|language|linker|opt|other|overall| phase|prepro|suffix|switch|target|variable] Show compiler switches</pre></div> </div> <div class="topic concept nested2" id="opt-gs-minfo"><a name="opt-gs-minfo" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-gs-minfo" name="opt-gs-minfo" shape="rect">3.2.2. -Minfo</a></h3> <div class="body conbody"> <p class="p"> You can use the <span class="ph filepath">-⁠Minfo</span> option to display compile-time optimization listings. When this option is used, the NVIDIA HPC Compilers issue informational messages to standard error (stderr) as compilation proceeds. From these messages, you can determine which loops are optimized using unrolling, SIMD vectorization, parallelization, GPU offloading, interprocedural optimizations and various miscellaneous optimizations. You can also see where and whether functions are inlined. </p> </div> </div> <div class="topic concept nested2" id="opt-gs-mneginfo"><a name="opt-gs-mneginfo" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-gs-mneginfo" name="opt-gs-mneginfo" shape="rect">3.2.3. -Mneginfo</a></h3> <div class="body conbody"> <p class="p"> You can use the <span class="ph filepath">-⁠Mneginfo</span> option to display informational messages to standard error (stderr) that explain why certain optimizations are inhibited. 
</p> </div> </div> <div class="topic concept nested2" id="opt-gs-dryrun"><a name="opt-gs-dryrun" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-gs-dryrun" name="opt-gs-dryrun" shape="rect">3.2.4. -dryrun</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">-⁠dryrun</span> option can be useful as a diagnostic tool if you need to see the steps used by the compiler driver to preprocess, compile, assemble and link in the presence of a given set of command line inputs. When you specify the <span class="ph filepath">-⁠dryrun</span> option, these steps are printed to standard error (stderr) but are not actually performed. For example, you can use this option to inspect the default and user-specified libraries that are searched during the link phase, and the order in which they are searched by the linker. </p> </div> </div> <div class="topic concept nested2" id="opt-gs-v"><a name="opt-gs-v" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-gs-v" name="opt-gs-v" shape="rect">3.2.5. -v</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">-⁠v</span> option is similar to <a class="xref" href="index.html#opt-gs-dryrun" shape="rect"><span class="ph filepath">-⁠dryrun</span></a>, except each compilation step is performed and not simply printed. </p> </div> </div> </div> <div class="topic concept nested1" id="opt-local-global"><a name="opt-local-global" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-local-global" name="opt-local-global" shape="rect">3.3. Local and Global Optimization</a></h3> <div class="body conbody"> <p class="p">This section describes local and global optimization. </p> </div> <div class="topic concept nested2" id="opt-loc-glob-msafeptr"><a name="opt-loc-glob-msafeptr" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-loc-glob-msafeptr" name="opt-loc-glob-msafeptr" shape="rect">3.3.1. 
-Msafeptr</a></h3> <div class="body conbody"> <p class="p"> The <span class="ph filepath">-⁠Msafeptr</span> option can significantly improve performance of C++ and C programs in which there is known to be no pointer aliasing. Because the compiler then assumes that the affected pointers never alias, applying this option to code that does contain aliasing can produce incorrect results, so it must be used carefully. There are a number of suboptions for <span class="ph filepath">-⁠Msafeptr</span>: </p> <ul class="ul"> <li class="li"><span class="ph filepath">-Msafeptr=all</span> – All pointers are safe. Equivalent to the default setting: <span class="ph filepath">-⁠Msafeptr</span>. </li> <li class="li"><span class="ph filepath">-Msafeptr=arg</span> – Function formal argument pointers are safe. Equivalent to <span class="ph filepath">-⁠Msafeptr=dummy</span>. </li> <li class="li"><span class="ph filepath">-Msafeptr=global</span> – Global pointers are safe. </li> <li class="li"><span class="ph filepath">-Msafeptr=local</span> – Local pointers are safe. Equivalent to <span class="ph filepath">-⁠Msafeptr=auto</span>. </li> <li class="li"><span class="ph filepath">-Msafeptr=static</span> – Static local pointers are safe. </li> </ul> <p class="p"> If your C++ or C program has pointer aliasing and you also want automatic inlining, then compiling with <span class="ph filepath">-⁠Mipa=fast</span> or <span class="ph filepath">-⁠Mipa=fast,inline</span> includes pointer aliasing optimizations. IPA may be able to optimize some of the alias references in your program and leave intact those that cannot be safely optimized. </p> </div> </div> <div class="topic concept nested2" id="opt-loc-glob-o"><a name="opt-loc-glob-o" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-loc-glob-o" name="opt-loc-glob-o" shape="rect">3.3.2. -O</a></h3> <div class="body conbody"> <p class="p">Using the NVIDIA HPC Compiler commands with the <span class="ph filepath">-⁠O</span>&lt;level&gt; option (the capital O is for Optimize), you can specify any integer level from 0 to 4. 
</p> <div class="section"> <h4 class="title sectiontitle">-O0</h4> <p class="p">Level zero specifies no optimization. At this level, the compiler generates a basic block for each language statement. </p> <p class="p">Performance will almost always be slowest using this optimization level. This level is useful for the initial execution of a program. It is also useful for debugging, since there is a direct correlation between the program text and the code generated. To enable debugging, include <span class="ph filepath">-⁠g</span> on your compile line. </p> </div> <div class="section"> <h4 class="title sectiontitle">-O1</h4> <p class="p">Level one specifies local optimization. Scheduling of basic blocks and register allocation are performed.</p> <p class="p">Local optimization is a good choice when the code is very irregular, such as code that contains many short statements containing IF statements and does not contain loops (DO or DO WHILE statements). Although this case rarely occurs, for certain types of code, this optimization level may perform better than level-two (<span class="ph filepath">-⁠O2</span>). </p> </div> <div class="section"> <h4 class="title sectiontitle">-O</h4> <p class="p">When no level is specified, level two global optimizations are performed, including traditional scalar optimizations, induction recognition, and loop invariant motion. No SIMD vectorization is enabled. </p> </div> <div class="section"> <h4 class="title sectiontitle">-O2</h4> <p class="p">Level two specifies global optimization. This level performs all level-one local optimization as well as level two global optimization described in <span class="ph filepath">-⁠O</span>. In addition, more advanced optimizations such as SIMD code generation, cache alignment, and partial redundancy elimination are enabled. 
</p> </div> <div class="section"> <h4 class="title sectiontitle">-O3</h4> <p class="p">Level three specifies aggressive global optimization. This level performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable. </p> </div> <div class="section"> <h4 class="title sectiontitle">-O4</h4> <p class="p">Level four performs all level-one, level-two, and level-three optimizations and enables hoisting of guarded invariant floating point expressions. </p> </div> <div class="section"> <h4 class="title sectiontitle">Types of Optimizations</h4> <p class="p">The NVIDIA HPC Compilers perform many different types of local optimizations, including but not limited to:</p> <ul class="sl simple"> <li class="sli">Algebraic identity removal</li> <li class="sli">Constant folding</li> <li class="sli">Common subexpression elimination</li> <li class="sli">Local register optimization</li> <li class="sli">Peephole optimizations</li> <li class="sli">Redundant load and store elimination</li> <li class="sli">Strength reductions</li> </ul> <p class="p">Level-two optimization (<span class="ph filepath">-⁠O2</span> or <span class="ph filepath">-⁠O</span>) specifies global optimization. The <span class="ph filepath">-⁠fast</span> option generally specifies global optimization; however, the <span class="ph filepath">-⁠fast</span> switch varies from release to release, depending on a reasonable selection of switches for any one particular release. The <span class="ph filepath">-⁠O</span> or <span class="ph filepath">-⁠O2</span> level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular. 
</p> <p class="p">The NVIDIA HPC Compilers perform many different types of global optimizations, including but not limited to:</p> <ul class="sl simple"> <li class="sli">Branch to branch elimination</li> <li class="sli">Constant propagation</li> <li class="sli">Copy propagation</li> <li class="sli">Dead store elimination</li> <li class="sli">Global register allocation</li> <li class="sli">Induction variable elimination</li> <li class="sli">Invariant code motion</li> </ul> <p class="p">You can explicitly select the optimization level on the command line. For example, the following command line specifies level-two optimization which results in global optimization: </p><pre xml:space="preserve">$ nvfortran -O2 prog.f</pre><p class="p">The default optimization level changes depending on which options you select on the command line. For example, when you select the <span class="ph filepath">-⁠g</span> debugging option, the default optimization level is set to level-zero (<span class="ph filepath">-⁠O0</span>). However, if you need to debug optimized code, you can use the <span class="ph filepath">-⁠gopt</span> option to generate debug information without perturbing optimization. For a description of the default levels, refer to Default Optimization Levels. </p> <p class="p">The <span class="ph filepath">-⁠fast</span> option includes <span class="ph filepath">-⁠O2</span> on all targets. If you want to override the default for <span class="ph filepath">-⁠fast</span> with <span class="ph filepath">-⁠O3</span> while maintaining all other elements of <span class="ph filepath">-⁠fast</span>, simply compile as follows: </p><pre xml:space="preserve">$ nvfortran -fast -O3 prog.f</pre></div> </div> </div> </div> <div class="topic concept nested1" id="loop-unroll-munroll"><a name="loop-unroll-munroll" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#loop-unroll-munroll" name="loop-unroll-munroll" shape="rect">3.4. 
Loop Unrolling using -Munroll</a></h3> <div class="body conbody"> <p class="p"> This optimization unrolls loops, which reduces branch overhead, and can improve execution speed by creating better opportunities for instruction scheduling. A loop with a constant count may be completely unrolled or partially unrolled. A loop with a non-constant count may also be unrolled. A candidate loop must be an innermost loop containing one to four blocks of code. </p> <p class="p"> The following example shows the use of the <span class="ph filepath">-⁠Munroll</span> option: </p><pre xml:space="preserve">$ nvfortran -Munroll prog.f</pre><p class="p"> The <span class="ph filepath">-⁠Munroll</span> option is included as part of <span class="ph filepath">-⁠fast</span> on all targets. The loop unroller expands the contents of a loop and reduces the number of times a loop is executed. Branching overhead is reduced when a loop is unrolled two or more times, since each iteration of the unrolled loop corresponds to two or more iterations of the original loop; the number of branch instructions executed is proportionately reduced. When a loop is unrolled completely, the loop’s branch overhead is eliminated altogether. </p> <p class="p"> Loop unrolling may be beneficial for the instruction scheduler. When a loop is completely unrolled or unrolled two or more times, opportunities for improved scheduling may be presented. The code generator can take advantage of more possibilities for instruction grouping or filling instruction delays found within the loop. </p> <div class="section"> <h3 class="title sectiontitle">Examples Showing Effect of Unrolling</h3> <p class="p"> The following side-by-side examples show the effect of code unrolling on a segment that computes a dot product. 
</p> <div class="note note"><span class="notetitle">Note:</span> This example is only meant to represent how the compiler can transform the loop; it is not meant to imply that the programmer needs to manually change code. In fact, manually unrolling your code can sometimes inhibit the compiler’s analysis and optimization. </div> <div class="tablenoborder"><a name="loop-unroll-munroll__opt-desc" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="loop-unroll-munroll__opt-desc" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 10. Example of Effect of Code Unrolling</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="50%" id="d54e3908" rowspan="1" colspan="1">Dot Product Code</th> <th class="entry" valign="top" width="50%" id="d54e3911" rowspan="1" colspan="1">Unrolled Dot Product Code</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e3908" rowspan="1" colspan="1"><pre xml:space="preserve">      REAL*4 A(100), B(100), Z
      INTEGER I
      DO I=1, 100
        Z = Z + A(i) * B(i)
      END DO
      END</pre></td> <td class="entry" valign="top" width="50%" headers="d54e3911" rowspan="1" colspan="1"><pre xml:space="preserve">      REAL*4 A(100), B(100), Z
      INTEGER I
      DO I=1, 100, 2
        Z = Z + A(i) * B(i)
        Z = Z + A(i+1) * B(i+1)
      END DO
      END</pre></td> </tr> </tbody> </table> </div> <p class="p"> Using the <span class="ph filepath">-⁠Minfo</span> option, the compiler informs you when a loop is being unrolled. 
For example, a message similar to the following, indicating the line number, and the number of times the code is unrolled, displays when a loop is unrolled: </p><pre xml:space="preserve">dot: 5, Loop unrolled 5 times</pre><p class="p"> Using the c:&lt;m&gt; and n:&lt;m&gt; sub-options to <span class="ph filepath">-⁠Munroll</span>, or using <span class="ph filepath">-⁠Mnounroll</span>, you can control whether and how loops are unrolled on a file-by-file basis. For more information on <span class="ph filepath">-⁠Munroll</span>, refer to <a class="xref" href="index.html#cmdln-options-use" shape="rect">Use Command-line Options</a>. </p> </div> </div> </div> <div class="topic concept nested1" id="vect-mvect"><a name="vect-mvect" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#vect-mvect" name="vect-mvect" shape="rect">3.5. Vectorization using -Mvect</a></h3> <div class="body conbody"> <p class="p"> The <span class="keyword option">-⁠Mvect</span> option is included as part of <span class="keyword option">-⁠fast</span> on all multicore CPU targets. If your program contains computationally-intensive loops, the <span class="keyword option">-⁠Mvect</span> option may be helpful. If in addition you specify <span class="keyword option">-⁠Minfo</span>, and your code contains loops that can be vectorized, the compiler reports relevant information on the optimizations applied. </p> <p class="p"> When an NVIDIA HPC Compiler command is invoked with the <span class="keyword option">-⁠Mvect</span> option, the vectorizer scans code searching for loops that are candidates for high-⁠level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). 
When the vectorizer finds vectorization opportunities, it internally rearranges or replaces sections of loops (the vectorizer changes the code generated; your source code’s loops are not altered). In addition to performing these loop transformations, the vectorizer produces extensive data dependence information for use by other phases of compilation and detects opportunities to use vector or packed SIMD instructions on processors where these are supported. </p> <p class="p"> The <span class="keyword option">-⁠Mvect</span> option can speed up code which contains well-behaved countable loops which operate on large floating point arrays in Fortran and their C++ and C counterparts. However, it is possible that some codes will show a decrease in performance when compiled with the <span class="keyword option">-⁠Mvect</span> option due to the generation of conditionally executed code segments, inability to determine data alignment, and other code generation factors. For this reason, it is recommended that you check carefully whether particular program units or loops show improved performance when compiled with this option enabled. </p> </div> <div class="topic concept nested2" id="vect-subopt"><a name="vect-subopt" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#vect-subopt" name="vect-subopt" shape="rect">3.5.1. Vectorization Sub-options</a></h3> <div class="body conbody"> <p class="p"> The vectorizer performs high-level loop transformations on countable loops. A loop is countable if the number of iterations is set only before loop execution and cannot be modified during loop execution. Some of the vectorizer transformations can be controlled by arguments to the <span class="keyword option">-⁠Mvect</span> command line option. The following sections describe the arguments that affect the operation of the vectorizer. In addition, some of these vectorizer operations can be controlled from within code using directives and pragmas. 
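</p> <p class="p"> As an illustration of the countable-loop definition above, the following free-form Fortran sketch contrasts a countable loop with one that is not countable. The program and variable names (<samp class="ph codeph">countable_demo</samp>, <samp class="ph codeph">total</samp>) are hypothetical and chosen only for this example; the sketch makes no claim about which transformations the vectorizer actually applies to either loop. </p><pre xml:space="preserve">program countable_demo
  implicit none
  integer, parameter :: n = 8
  real :: a(n), total
  integer :: i

  ! Fill the array with 1.0, 2.0, ..., 8.0.
  do i = 1, n
    a(i) = real(i)
  enddo

  ! Countable: the trip count (n) is fixed before the loop starts
  ! and nothing in the loop body can change it.
  total = 0.0
  do i = 1, n
    total = total + a(i)
  enddo
  print *, 'countable loop sum      =', total

  ! Not countable: the exit test depends on total, which the body
  ! modifies, so the number of iterations is unknown on entry.
  total = 0.0
  i = 1
  do while (i .le. n .and. total .lt. 10.0)
    total = total + a(i)
    i = i + 1
  enddo
  print *, 'data-dependent loop sum =', total
end program countable_demo</pre><p class="p"> In the first summation loop the trip count is known on entry, so it meets the definition. In the second, the exit test depends on <samp class="ph codeph">total</samp>, which the loop body modifies, so the number of iterations cannot be determined before the loop runs. 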
</p> <p class="p">The vectorizer performs the following operations:</p> <div class="p"> <ul class="ul"> <li class="li">Loop interchange</li> <li class="li">Loop splitting</li> <li class="li">Loop fusion</li> <li class="li">Generation of SIMD instructions on CPUs where these are supported</li> <li class="li">Generation of prefetch instructions on processors where these are supported</li> <li class="li">Loop iteration peeling to maximize vector alignment</li> <li class="li">Alternate code generation</li> </ul> </div> <p class="p"> The following table lists and briefly describes some of the <span class="keyword option">-⁠Mvect</span> suboptions. </p> <div class="tablenoborder"><a name="vect-subopt__mvect-subopt-desc" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="vect-subopt__mvect-subopt-desc" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 11. -Mvect Suboptions</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="50%" id="d54e4082" rowspan="1" colspan="1">Use this option ...</th> <th class="entry" valign="top" width="50%" id="d54e4085" rowspan="1" colspan="1">To instruct the vectorizer to do this ...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=altcode</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Generate appropriate code for vectorized loops.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=[no]assoc</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1"> Perform[disable] associativity conversions that can change the results of a computation due to a round-off error. 
For example, a typical optimization is to change one arithmetic operation to another arithmetic operation that is mathematically equivalent, but can be computationally different and generate faster code. This option is provided to enable or disable this transformation, since the round-off error introduced by such associativity conversions may produce unacceptable results. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=fuse</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Enable loop fusion.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=gather</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Enable vectorization of indirect array references.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=idiom</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Enable idiom recognition.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=levels:&lt;n&gt;</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Set the maximum nest level of loops to optimize.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=nocond</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Disable vectorization of loops with conditions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword 
option">-Mvect=partial</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1"> Enable partial loop vectorization via inner loop distribution. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=prefetch</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1"> Automatically generate prefetch instructions when vectorizable loops are encountered, even in cases where <span class="ph">SIMD</span> instructions are not generated. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=short</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Enable short vector operations.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=simd</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1"> Automatically generate packed <span class="ph">SIMD</span> instructions, such as <span class="ph">SSE (Streaming SIMD Extensions)</span>, and prefetch instructions when vectorizable loops are encountered. SSE instructions, first introduced on Pentium III and AthlonXP processors, operate on single-precision floating-point data. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=sizelimit:n</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Limit the size of vectorized loops.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=sse</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1">Equivalent to <span class="keyword option">-⁠Mvect=simd</span>. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e4082" rowspan="1" colspan="1"><span class="keyword option">-Mvect=uniform</span></td> <td class="entry" valign="top" width="50%" headers="d54e4085" rowspan="1" colspan="1"> Perform consistent optimizations in both vectorized and residual loops. Be aware that this may affect the performance of the residual loop. </td> </tr> </tbody> </table> </div> <div class="note note"><span class="notetitle">Note:</span> Inserting <samp class="ph codeph">no</samp> in front of an option disables the option. For example, to disable the generation of SIMD instructions, compile with <span class="keyword option">-⁠Mvect=nosimd</span>. </div> </div> </div> <div class="topic concept nested2" id="vect-exam-simd"><a name="vect-exam-simd" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#vect-exam-simd" name="vect-exam-simd" shape="rect">3.5.2. Vectorization Example Using SIMD Instructions</a></h3> <div class="body conbody"> <p class="p"> One of the most important vectorization options is <span class="keyword option">-Mvect=simd</span>. When you use this option, the compiler automatically generates SIMD vector instructions, where possible, when targeting processors on which these instructions are supported. 
This process can improve performance several-fold compared with the equivalent scalar code. All of the NVIDIA HPC Fortran, C++ and C compilers support this capability. </p> <p class="p"> In the program in <a class="xref" href="index.html#vect-exam-simd__vect-use-simd-exam" shape="rect">Vector operation using SIMD instructions</a>, the vectorizer recognizes the vector operation in subroutine 'loop' when either compiler switch <span class="keyword option">-⁠Mvect=simd</span> or <span class="keyword option">-⁠fast</span> is used. This example shows the compilation, informational messages, and runtime results using SIMD instructions on an Intel Core i7 7800X Skylake system, along with issues that affect SIMD performance. </p> <p class="p"> Loops vectorized using SIMD instructions operate much more efficiently when processing vectors that are aligned to a cache-line boundary. You can cause unconstrained data objects of size 16 bytes or greater to be cache-aligned by compiling with the <span class="keyword option">-⁠Mcache_align</span> switch. An unconstrained data object is a data object that is not a common block member and not a member of an aggregate data structure. </p> <div class="note note"><span class="notetitle">Note:</span> For stack-based local variables to be properly aligned, the main program or function must be compiled with <span class="keyword option">-⁠Mcache_align</span>. </div> <p class="p"> The <span class="keyword option">-⁠Mcache_align</span> switch has no effect on the alignment of Fortran allocatable or automatic arrays. If you have arrays that are constrained, such as vectors that are members of Fortran common blocks, you must specifically pad your data structures to ensure proper cache alignment. For common blocks, <span class="keyword option">-⁠Mcache_align</span> cache-aligns only the beginning address of each block. 
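</p> <p class="p"> The padding of common block members described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed method: the 64-byte cache-line size is assumed (the actual size depends on the processor), and the names <samp class="ph codeph">pad_common</samp>, <samp class="ph codeph">vecs</samp>, <samp class="ph codeph">pad</samp> and <samp class="ph codeph">line_bytes</samp> are hypothetical. </p><pre xml:space="preserve">program pad_common
  implicit none
  ! Assumption for illustration only: 64-byte cache lines.
  integer, parameter :: line_bytes = 64
  integer, parameter :: n = 1000
  ! a occupies 4*n = 4000 bytes, which is 32 bytes past a multiple
  ! of 64, so 8 four-byte padding elements move b to byte offset
  ! 4032 = 63*64 from the start of the common block.
  integer, parameter :: npad = 8
  real*4 a(n), pad(npad), b(n)
  common /vecs/ a, pad, b

  print *, 'byte offset of b within /vecs/  :', 4*(n + npad)
  print *, 'offset modulo assumed line size :', mod(4*(n + npad), line_bytes)
end program pad_common</pre><p class="p"> With this layout, <samp class="ph codeph">b</samp> starts on a cache-line boundary only if the start of the common block is itself cache-aligned and the assumed line size matches the target processor. 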
</p> <p class="p"> The following examples show the results of compiling the sample code in <a class="xref" href="index.html#vect-exam-simd__vect-use-simd-exam" shape="rect">Vector operation using SIMD instructions</a> both with and without the option <span class="keyword option">-⁠Mvect=simd</span>. </p> <div class="example" id="vect-exam-simd__vect-use-simd-exam"><a name="vect-exam-simd__vect-use-simd-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Vector operation using SIMD instructions</h4><pre xml:space="preserve">program vector_op
  parameter (N = 9999)
  real*4 x(N), y(N), z(N), W(N)
  do i = 1, n
    y(i) = i
    z(i) = 2*i
    w(i) = 4*i
  enddo
  do j = 1, 200000
    call loop(x,y,z,w,1.0e0,N)
  enddo
  print *, x(1),x(771),x(3618),x(6498),x(9999)
end</pre><pre xml:space="preserve">subroutine loop(a,b,c,d,s,n)
  integer i, n
  real*4 a(n), b(n), c(n), d(n),s
  do i = 1, n
    a(i) = b(i) + c(i) - s * d(i)
  enddo
end</pre><p class="p"> Assume the preceding program is compiled as follows, where <span class="keyword option">-Mvect=nosimd</span> disables SIMD vectorization: </p><pre xml:space="preserve">% nvfortran -fast -Mvect=nosimd -Minfo vadd.f -Mfree -o vadd
vector_op:
      4, Loop unrolled 16 times
         Generated 1 prefetches in scalar loop
      9, Loop not vectorized/parallelized: contains call
loop:
     18, Loop unrolled 8 times
         FMA (fused multiply-add) instruction(s) generated
</pre><p class="p"> The following output shows a sample result if the generated executable is run and timed on an Intel Core i7 7800X Skylake system: </p><pre xml:space="preserve">$ /bin/time vadd
 -1.000000 -771.0000 -3618.000 -6498.000 -9999.000
0.99user 0.01system 0:01.18elapsed 84%CPU (0avgtext+0avgdata 3120maxresident)k
7736inputs+0outputs (4major+834minor)pagefaults 0swaps
</pre><pre xml:space="preserve">$ /bin/time vadd
 -1.000000 -771.0000 -3618.000 -6498.000 -9999.000
2.31user 0.00system 0:02.57elapsed 89%CPU (0avgtext+0avgdata 6976maxresident)k
8192inputs+0outputs (4major+149minor)pagefaults 0swaps
</pre><p 
class="p"> Now, recompile with vectorization enabled, and you see results similar to these: </p><pre xml:space="preserve">% nvfortran -fast -Minfo vadd.f -Mfree -o vadd-simd
vector_op:
      4, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Residual loop unrolled 7 times (completely unrolled)
         Generated 1 prefetches in scalar loop
      9, Loop not vectorized/parallelized: contains call
loop:
     18, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated
</pre><p class="p"> Notice the informational messages for the loop at line 18. The first line of the message indicates that two alternate versions of the loop were generated. The loop count and alignments of the arrays determine which of these versions is executed. The next several lines indicate the loop was vectorized and that prefetch instructions have been generated for three loads to minimize latency of data transfers from main memory. </p> <p class="p">Executing again, you should see results similar to the following:</p><pre xml:space="preserve">$ /bin/time vadd-simd
 -1.000000 -771.0000 -3618.000 -6498.000 -9999.000
0.27user 0.00system 0:00.29elapsed 93%CPU (0avgtext+0avgdata 3124maxresident)k
0inputs+0outputs (0major+838minor)pagefaults 0swaps
</pre><pre xml:space="preserve">$ /bin/time vadd-simd
 -1.000000 -771.0000 -3618.000 -6498.000 -9999.000
0.62user 0.00system 0:00.65elapsed 95%CPU (0avgtext+0avgdata 6976maxresident)k
0inputs+0outputs (0major+151minor)pagefaults 0swaps
</pre><p class="p"> Based on the user times shown, the SIMD result is about 3.7 times faster than the equivalent non-SIMD version of the program. 
</p> <p class="p"> The speed-up realized by a given loop or program can vary widely based on a number of factors: </p> <ul class="ul"> <li class="li"> SIMD instructions improve performance most when the vectors of data are resident in the data cache. </li> <li class="li"> If data is aligned properly, performance will be better in general than when using SIMD operations on unaligned data. </li> <li class="li"> If the compiler can guarantee that data is aligned properly, even more efficient sequences of SIMD instructions can be generated. </li> <li class="li"> The efficiency of loops that operate on single-precision data can be higher. A SIMD register of a given width holds twice as many single-precision elements as double-precision elements; a 128-bit register, for example, holds four single-precision elements but only two double-precision elements. </li> </ul> <div class="note note"><span class="notetitle">Note:</span> Compiling with <span class="keyword option">-⁠Mvect=simd</span> can result in numerical differences from the executables generated with less optimization. Certain vectorizable operations, for example dot products, are sensitive to order of operations and to the associative transformations necessary to enable vectorization (or parallelization). </div> </div> </div> </div> </div> <div class="topic concept nested1" id="opt-ipa-mipa"><a name="opt-ipa-mipa" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-mipa" name="opt-ipa-mipa" shape="rect">3.6. Interprocedural Analysis and Optimization using -Mipa</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA HPC Fortran, C++ and C compilers use interprocedural analysis (IPA) that results in minimal changes to makefiles and the standard edit-build-run application development cycle. Other than adding <span class="ph filepath">-⁠Mipa</span> to the command line, no other changes are required. 
For reference and background, the process of building a program without IPA is described later in this section, followed by the minor modifications required to use IPA with the NVIDIA compilers. While the NVC compiler is used here to show how IPA works, similar capabilities apply to each of the NVIDIA HPC Fortran, C++ and C compilers. </p> </div> <div class="topic concept nested2" id="opt-ipa-no-build-1step"><a name="opt-ipa-no-build-1step" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-no-build-1step" name="opt-ipa-no-build-1step" shape="rect">3.6.1. Building a Program Without IPA – Single Step</a></h3> <div class="body conbody"> <p class="p">Using the <span class="ph">nvc</span> command-level compiler driver, multiple source files can be compiled and linked into a single executable with one command. The following example compiles and links three source files: </p><pre xml:space="preserve">% nvc -o a.out file1.c file2.c file3.c</pre><p class="p">In actuality, the <span class="ph">nvc</span> driver executes several steps to produce the assembly code and object files corresponding to each source file, and subsequently to link the object files together into a single executable file. This command is roughly equivalent to the following commands performed individually: </p><pre xml:space="preserve">% nvc -S -o file1.s file1.c
% as -o file1.o file1.s
% nvc -S -o file2.s file2.c
% as -o file2.o file2.s
% nvc -S -o file3.s file3.c
% as -o file3.o file3.s
% nvc -o a.out file1.o file2.o file3.o</pre><p class="p">If any of the three source files is edited, the executable can be rebuilt with the same command line:</p><pre xml:space="preserve">% nvc -o a.out file1.c file2.c file3.c</pre><div class="note note"><span class="notetitle">Note:</span> This always works as intended, but has the side-effect of recompiling all of the source files, even if only one has changed. 
For applications with a large number of source files, this can be time-consuming and inefficient. </div> </div> </div> <div class="topic concept nested2" id="opt-ipa-no-build-multi-step"><a name="opt-ipa-no-build-multi-step" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-no-build-multi-step" name="opt-ipa-no-build-multi-step" shape="rect">3.6.2. Building a Program Without IPA – Several Steps</a></h3> <div class="body conbody"> <p class="p">It is also possible to use individual <span class="ph">nvc</span> commands to compile each source file into a corresponding object file, and one to link the resulting object files into an executable: </p><pre xml:space="preserve">% nvc -c file1.c
% nvc -c file2.c
% nvc -c file3.c
% nvc -o a.out file1.o file2.o file3.o</pre><p class="p">The <span class="ph">nvc</span> driver invokes the compiler and assembler as required to process each source file, and invokes the linker for the final link command. If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking: </p><pre xml:space="preserve">% nvc -c file1.c
% nvc -o a.out file1.o file2.o file3.o</pre></div> </div> <div class="topic concept nested2" id="opt-ipa-no-build-use-make"><a name="opt-ipa-no-build-use-make" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-no-build-use-make" name="opt-ipa-no-build-use-make" shape="rect">3.6.3. Building a Program Without IPA Using Make</a></h3> <div class="body conbody"> <p class="p">The program compilation and linking process can be simplified greatly using the <span class="ph filepath">make</span> utility on systems where it is supported. 
Suppose you create a <span class="ph filepath">makefile</span> containing the following lines: </p><pre xml:space="preserve">a.out: file1.o file2.o file3.o
	nvc $(OPT) -o a.out file1.o file2.o file3.o
file1.o: file1.c
	nvc $(OPT) -c file1.c
file2.o: file2.c
	nvc $(OPT) -c file2.c
file3.o: file3.c
	nvc $(OPT) -c file3.c</pre><p class="p">It is then possible to type a single make command:</p><pre xml:space="preserve">% make</pre><p class="p">The <span class="ph filepath">make</span> utility determines which object files are out of date with respect to their corresponding source files, and invokes the compiler to recompile only those source files and to relink the executable. If you subsequently edit one or more source files, the executable can be rebuilt with the minimum number of recompilations using the same single <span class="ph filepath">make</span> command. </p> </div> </div> <div class="topic concept nested2" id="opt-ipa-build-prog"><a name="opt-ipa-build-prog" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-build-prog" name="opt-ipa-build-prog" shape="rect">3.6.4. Building a Program with IPA</a></h3> <div class="body conbody"> <p class="p">Interprocedural analysis and optimization (IPA) by the NVIDIA HPC Compilers alters the standard command-level and <span class="ph filepath">make</span> utility interfaces as little as possible. IPA occurs in three phases: </p> <ul class="ul"> <li class="li"><strong class="ph b">Collection</strong>: Create a summary of each function or procedure, collecting the useful information for interprocedural optimizations. This is done during the compile step if the <span class="ph filepath">-⁠Mipa</span> switch is present on the command line; summary information is collected and stored in the object file. </li> <li class="li"><strong class="ph b">Propagation</strong>: Process all the object files to propagate the interprocedural summary information across function and file boundaries. 
This is done during the link step, when all the object files are combined, if the <span class="ph filepath">-⁠Mipa</span> switch is present on the link command line. </li> <li class="li"><strong class="ph b">Recompile/Optimization</strong>: Recompile each of the object files with the propagated interprocedural information, producing a specialized object file. This process is also performed during the link step when the <span class="ph filepath">-⁠Mipa</span> switch is present on the link command line. </li> </ul> <p class="p">When linking with <span class="ph filepath">-⁠Mipa</span>, the NVIDIA HPC Compilers automatically regenerate IPA-optimized versions of each object file, essentially recompiling each file. If there are IPA-optimized objects from a previous build, the compilers will minimize the recompile time by reusing those objects if they are still valid. They will still be valid if the IPA-optimized object is newer than the original object file, and the propagated IPA information for that file has not changed since it was optimized. </p> <p class="p">After each object file has been recompiled, the regular linker is invoked to build the application with the IPA-optimized object files. The IPA-optimized object files are saved in the same directory as the original object files, for use in subsequent program builds. </p> </div> </div> <div class="topic concept nested2" id="opt-ipa-build-1step"><a name="opt-ipa-build-1step" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-build-1step" name="opt-ipa-build-1step" shape="rect">3.6.5. 
Building a Program with IPA – Single Step</a></h3> <div class="body conbody"> <p class="p">By adding the <span class="ph filepath">-⁠Mipa</span> command line switch, several source files can be compiled and linked with interprocedural optimizations with one command: </p><pre xml:space="preserve">% nvc -Mipa=fast -o a.out file1.c file2.c file3.c</pre><p class="p">Just like compiling without <span class="ph filepath">-⁠Mipa</span>, the driver executes several steps to produce the assembly and object files to create the executable: </p><pre xml:space="preserve">% nvc -Mipa=fast -S -o file1.s file1.c
% as -o file1.o file1.s
% nvc -Mipa=fast -S -o file2.s file2.c
% as -o file2.o file2.s
% nvc -Mipa=fast -S -o file3.s file3.c
% as -o file3.o file3.s
% nvc -Mipa=fast -o a.out file1.o file2.o file3.o</pre><p class="p">In the last step, an IPA linker is invoked to read all the IPA summary information and perform the interprocedural propagation. The IPA linker reinvokes the compiler on each of the object files to recompile them with interprocedural information. This creates three new objects with mangled names: </p><pre xml:space="preserve">file1_ipa5_a.out.oo.o, file2_ipa5_a.out.oo.o, file3_ipa5_a.out.oo.o</pre><p class="p">The system linker is then invoked to link these IPA-optimized objects into the final executable. Later, if one of the three source files is edited, the executable can be rebuilt with the same command line: </p><pre xml:space="preserve">% nvc -Mipa=fast -o a.out file1.c file2.c file3.c</pre><p class="p">This works, but again has the side-effect of compiling each source file, and recompiling each object file at link time.</p> </div> </div> <div class="topic concept nested2" id="opt-ipa-build-multi-step"><a name="opt-ipa-build-multi-step" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-build-multi-step" name="opt-ipa-build-multi-step" shape="rect">3.6.6. 
Building a Program with IPA – Several Steps</a></h3> <div class="body conbody"> <p class="p">Just by adding the <span class="ph filepath">-⁠Mipa</span> command-line switch, it is possible to use individual <span class="ph">nvc</span> commands to compile each source file, followed by a command to link the resulting object files into an executable: </p><pre xml:space="preserve">% nvc -Mipa=fast -c file1.c
% nvc -Mipa=fast -c file2.c
% nvc -Mipa=fast -c file3.c
% nvc -Mipa=fast -o a.out file1.o file2.o file3.o</pre><p class="p">The <span class="ph">nvc</span> driver invokes the compiler and assembler as required to process each source file, and invokes the IPA linker for the final link command. If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking: </p><pre xml:space="preserve">% nvc -Mipa=fast -c file1.c
% nvc -Mipa=fast -o a.out file1.o file2.o file3.o</pre><p class="p">When the IPA linker is invoked, it determines that the IPA-optimized object for <span class="ph filepath">file1.o</span> (<span class="ph filepath">file1_ipa5_a.out.oo.o</span>) is stale because it is older than the object <span class="ph filepath">file1.o</span>, so it reinvokes the compiler to regenerate it. In addition, depending on the nature of the changes to the source file <span class="ph filepath">file1.c</span>, the interprocedural optimizations previously performed for <span class="ph filepath">file2</span> and <span class="ph filepath">file3</span> may now be inaccurate. For instance, IPA may have propagated a constant argument value in a call from a function in <span class="ph filepath">file1.c</span> to a function in <span class="ph filepath">file2.c</span>; if the value of the argument has changed, any optimizations based on that constant value are invalid. 
The IPA linker determines which, if any, of the previously created IPA-optimized objects need to be regenerated; and, as appropriate, reinvokes the compiler to regenerate them. Only those objects that are stale or which have new or different IPA information are regenerated. This approach saves compile time. </p> </div> </div> <div class="topic concept nested2" id="opt-ipa-build-use-make"><a name="opt-ipa-build-use-make" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-build-use-make" name="opt-ipa-build-use-make" shape="rect">3.6.7. Building a Program with IPA Using Make</a></h3> <div class="body conbody"> <p class="p">As shown earlier, programs can be built with IPA using the make utility. Just add the command-line switch <span class="ph filepath">-⁠Mipa</span>, as shown here: </p><pre xml:space="preserve">OPT=-Mipa=fast
a.out: file1.o file2.o file3.o
	nvc $(OPT) -o a.out file1.o file2.o file3.o
file1.o: file1.c
	nvc $(OPT) -c file1.c
file2.o: file2.c
	nvc $(OPT) -c file2.c
file3.o: file3.c
	nvc $(OPT) -c file3.c</pre><p class="p">Using the single <span class="ph filepath">make</span> command invokes the compiler to generate any of the object files that are out-of-date, then invokes <span class="ph">nvc</span> to link the objects into the executable. At link time, <span class="ph">nvc</span> calls the IPA linker to regenerate any stale or invalid IPA-optimized objects. </p><pre xml:space="preserve">% make</pre></div> </div> <div class="topic concept nested2" id="opt-ipa-qa"><a name="opt-ipa-qa" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#opt-ipa-qa" name="opt-ipa-qa" shape="rect">3.6.8. Questions about IPA</a></h3> <div class="body conbody"> <p class="p"><strong class="ph b">Question: </strong>Why is the object file so large? </p> <p class="p"><strong class="ph b">Answer: </strong>An object file created with <span class="ph filepath">-⁠Mipa</span> contains several additional sections. 
One is the summary information used to drive the interprocedural analysis. In addition, the object file contains the compiler internal representation of the source file, so the file can be recompiled at link time with interprocedural optimizations. There may be additional information when inlining is enabled. The total size of the object file may be 5-10 times its original size. The extra sections are not added to the final executable. </p> <p class="p"><strong class="ph b">Question: </strong>What if I compile with <span class="ph filepath">-⁠Mipa</span> and link without <span class="ph filepath">-⁠Mipa</span>? </p> <p class="p"><strong class="ph b">Answer: </strong>The NVIDIA HPC Compilers generate a legal object file, even when the source file is compiled with <span class="ph filepath">-⁠Mipa</span>. If you compile with <span class="ph filepath">-⁠Mipa</span> and link without <span class="ph filepath">-⁠Mipa</span>, the linker is invoked on the original object files. A legal executable is generated. While this executable does not have the benefit of interprocedural optimizations, any other optimizations do apply. </p> <p class="p"><strong class="ph b">Question: </strong>What if I compile without <span class="ph filepath">-⁠Mipa</span> and link with <span class="ph filepath">-⁠Mipa</span>? </p> <p class="p"><strong class="ph b">Answer: </strong>At link time, the IPA linker must have summary information about all the functions or routines used in the program. This information is created only when a file is compiled with <span class="ph filepath">-⁠Mipa</span>. If you compile a file without <span class="ph filepath">-⁠Mipa</span> and then try to get interprocedural optimizations by linking with <span class="ph filepath">-⁠Mipa</span>, the IPA linker will issue a message that some routines have no IPA summary information, and will proceed to run the system linker using the original object files. 
If some files were compiled with <span class="ph filepath">-⁠Mipa</span> and others were not, it will determine the safest approximation of the IPA summary information for those files not compiled with <span class="ph filepath">-⁠Mipa</span>, and use that to recompile the other files using interprocedural optimizations. </p> <p class="p"><strong class="ph b">Question: </strong>Can I build multiple applications in the same directory with <span class="ph filepath">-⁠Mipa</span>? </p> <p class="p"><strong class="ph b">Answer: </strong>Yes. Suppose you have three source files: <span class="ph filepath">main1.c</span>, <span class="ph filepath">main2.c</span>, and <span class="ph filepath">sub.c</span>, where <span class="ph filepath">sub.c</span> is shared between the two applications. Suppose you build the first application with <span class="ph filepath">-⁠Mipa</span>, using this command: </p><pre xml:space="preserve">% nvc -Mipa=fast -o app1 main1.c sub.c</pre><p class="p">The IPA linker creates two IPA-optimized object files and uses them to build the first application.</p><pre xml:space="preserve">main1_ipa4_app1.oo sub_ipa4_app1.oo</pre><p class="p">Now suppose you build the second application using this command:</p><pre xml:space="preserve">% nvc -Mipa=fast -o app2 main2.c sub.c</pre><p class="p">The IPA linker creates two more IPA-optimized object files:</p><pre xml:space="preserve">main2_ipa4_app2.oo sub_ipa4_app2.oo</pre><div class="note note"><span class="notetitle">Note:</span> There are now three object files for <span class="ph filepath">sub.c</span>: the original <span class="ph filepath">sub.o</span>, and two IPA-optimized objects, one for each application in which it appears. </div> <p class="p"><strong class="ph b">Question: </strong>How is the mangled name for the IPA-optimized object files generated? 
</p> <p class="p"><strong class="ph b">Answer: </strong>The mangled name has ‘_ipa’ appended, followed by the decimal number of the length of the executable file name, followed by an underscore and the executable file name itself. The suffix is changed to <span class="ph">.oo</span> so that linking <span class="ph">*.o</span> does not pull in the IPA-optimized objects. If the IPA linker determines that the file would not benefit from any interprocedural optimizations, it does not have to recompile the file at link time, and uses the original object. </p> <p class="p"><strong class="ph b">Question: </strong>Can I use parallel make environments (e.g., pmake) with IPA? </p> <p class="p"><strong class="ph b">Answer: </strong>No. IPA is not compatible with parallel make environments. </p> </div> </div> </div> </div> <div class="topic concept nested0" id="fn-inline-use"><a name="fn-inline-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#fn-inline-use" name="fn-inline-use" shape="rect">4. Using Function Inlining</a></h2> <div class="body conbody"> <p class="p"> Function inlining replaces a call to a function or a subroutine with the body of the function or subroutine. This can speed up execution by eliminating parameter passing and function/subroutine call and return overhead. It also allows the compiler to optimize the function with the rest of the code. Note that using function inlining indiscriminately can result in much larger code size and no increase in execution speed. </p> <p class="p">The NVIDIA HPC compilers provide two categories of inlining:</p> <ul class="ul"> <li class="li"><strong class="ph b">Automatic function inlining</strong> – In C++ and C, you can inline static functions with the <span class="ph filepath">inline</span> keyword by using the <span class="ph filepath">-⁠Mautoinline</span> option, which is included with <span class="ph filepath">-⁠fast</span>. 
</li> <li class="li"><strong class="ph b">Function inlining</strong> – You can inline functions which were extracted to the inline libraries in Fortran, C++ and C. There are two ways of enabling function inlining: with and without the <span class="ph filepath">lib</span> suboption. For the former, you create inline libraries, for example using the <span class="ph filepath">nvfortran</span> compiler driver and the <span class="ph filepath">-⁠o</span> and <span class="ph filepath">-⁠Mextract</span> options. </li> </ul> <p class="p"> There are important restrictions on inlining. Inlining only applies to certain types of functions. Refer to <a class="xref" href="index.html#fn-inline-restrictions" shape="rect">Restrictions on Inlining</a> for more details on function inlining limitations. </p> <p class="p"> This section describes how to use the following options related to function inlining: </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="void" border="0" rules="none"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mautoinline</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mextract</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Minline</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mnoinline</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" 
colspan="1"> </td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mrecursive</span></td> </tr> </tbody> </table> </div> </div> <div class="topic concept nested1" id="fn-inline-auto"><a name="fn-inline-auto" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-auto" name="fn-inline-auto" shape="rect">4.1. Automatic function inlining in C++ and C</a></h3> <div class="body conbody"> <p class="p"> To enable automatic function inlining in C++ and C for static functions with the <span class="ph filepath">inline</span> keyword, use the <span class="ph filepath">-⁠Mautoinline</span> option (included in <span class="ph filepath">-⁠fast</span>). Use <span class="ph filepath">-⁠Mnoautoinline</span> to disable it. </p> <p class="p"> These <span class="ph filepath">-⁠Mautoinline</span> suboptions let you determine the selection criteria, where <span class="ph filepath">n</span> loosely corresponds to the number of lines in the procedure: </p> <dl class="dl"> <dt class="dt dlterm">maxsize:<span class="ph filepath">n</span></dt> <dd class="dd"> Automatically inline functions of size <span class="ph filepath">n</span> or less. </dd> <dt class="dt dlterm">totalsize:<span class="ph filepath">n</span></dt> <dd class="dd"> Limit automatic inlining to a total size of <span class="ph filepath">n</span>. </dd> </dl> </div> </div> <div class="topic concept nested1" id="fn-inline-invoke"><a name="fn-inline-invoke" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-invoke" name="fn-inline-invoke" shape="rect">4.2. Invoking Procedure Inlining</a></h3> <div class="body conbody"> <p class="p"> To invoke the procedure inliner, use the <span class="ph filepath">-⁠Minline</span> option. If you do not specify an inline library, the compiler performs a special prepass on all source files named on the compiler command line before it compiles any of them. 
This pass extracts procedures that meet the requirements for inlining and puts them in a temporary inline library for use by the compilation pass. </p> <p class="p"> Several <span class="ph filepath">-⁠Minline</span> suboptions let you determine the selection criteria for procedures to be inlined. These suboptions include: </p> <dl class="dl"> <dt class="dt dlterm">except:<span class="ph filepath">func</span></dt> <dd class="dd"> Inlines all eligible procedures except <span class="ph filepath">func</span>, a procedure in the source text. You can use a comma-separated list to specify multiple procedures. </dd> <dt class="dt dlterm">[name:]<span class="ph filepath">func</span></dt> <dd class="dd"> Inlines all procedures in the source text whose name matches <span class="ph filepath">func</span>. You can use a comma-separated list to specify multiple procedures. </dd> <dt class="dt dlterm">[maxsize:]<span class="ph filepath">n</span></dt> <dd class="dd"> A numeric option is assumed to be a size. Procedures of size <span class="ph filepath">n</span> or less are inlined, where <span class="ph filepath">n</span> loosely corresponds to the number of lines in the procedure. If both <span class="ph filepath">n</span> and <span class="ph filepath">func</span> are specified, then procedures matching the given name(s) or meeting the size requirements are inlined. </dd> <dt class="dt dlterm">reshape</dt> <dd class="dd"> Fortran subprograms with array arguments are not inlined by default if the array shape does not match the shape in the caller. Use this option to override the default. </dd> <dt class="dt dlterm">smallsize:<span class="ph filepath">n</span></dt> <dd class="dd"> Always inline procedures of size smaller than <span class="ph filepath">n</span> regardless of other size limits. 
</dd> <dt class="dt dlterm">totalsize:<span class="ph filepath">n</span></dt> <dd class="dd"> Stop inlining in a procedure when the procedure's total size with inlining reaches the <span class="ph filepath">n</span> specified. </dd> <dt class="dt dlterm">[lib:]<span class="ph filepath">file.ext</span></dt> <dd class="dd"> Instructs the inliner to inline the procedures within the library file <span class="ph filepath">file.ext</span>. If no inline library is specified, procedures are extracted from a temporary library created during an extract prepass. <div class="note tip"><span class="tiptitle">Tip:</span> Create the library file using the <span class="ph filepath">-⁠Mextract</span> option. </div> </dd> </dl> <p class="p"> If you specify both a procedure name and a maxsize n, the compiler inlines procedures that match the procedure name <em class="ph i">or</em> have n or fewer statements. </p> <p class="p"> If a name is used without a keyword, then a name with a period is assumed to be an inline library and a name without a period is assumed to be a procedure name. If a number is used without a keyword, the number is assumed to be a size. </p> <p class="p"> Inlining can be disabled with <span class="ph filepath">-⁠Mnoinline</span>. </p> <p class="p"> In the following example, the compiler inlines procedures with fewer than approximately 100 statements in the source file <span class="ph filepath">myprog.f</span> and writes the executable code in the default output file <span class="ph filepath">a.out</span>. </p><pre xml:space="preserve"> $ nvfortran -Minline=maxsize:100 myprog.f </pre></div> </div> <div class="topic concept nested1" id="fn-inline-use-lib"><a name="fn-inline-use-lib" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-use-lib" name="fn-inline-use-lib" shape="rect">4.3. 
Using an Inline Library</a></h3> <div class="body conbody"> <p class="p">If you specify one or more inline libraries on the command line with the <span class="ph filepath">-Minline</span> option, the compiler does not perform an initial extract pass. The compiler selects functions to inline from the specified inline library. If you also specify a size or function name, all functions in the inline library meeting the selection criteria are selected for inline expansion at points in the source text where they are called. </p> <p class="p">If you do not specify a function name or a size limitation for the <span class="ph filepath">-Minline</span> option, the compiler tries to inline every function in the inline library that matches a function in the source text. </p> <p class="p">In the following example, the compiler inlines the function <span class="ph filepath">proc</span> from the inline library <span class="ph filepath">lib.il</span> and writes the executable code in the default output file <span class="ph filepath">a.out</span>. </p><pre xml:space="preserve">$ nvfortran -Minline=name:proc,lib:lib.il myprog.f</pre><p class="p">The following command line is equivalent to the preceding one, except that it does not use the keywords <span class="ph filepath">name:</span> and <span class="ph filepath">lib:</span>. You typically use keywords to avoid name conflicts when you use an inline library name that does not contain a period. Otherwise, without the keywords, a period informs the compiler that the file on the command line is an inline library. </p><pre xml:space="preserve">$ nvfortran -Minline=proc,lib.il myprog.f</pre></div> </div> <div class="topic concept nested1" id="fn-inline-create-lib"><a name="fn-inline-create-lib" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-create-lib" name="fn-inline-create-lib" shape="rect">4.4. 
Creating an Inline Library</a></h3> <div class="body conbody"> <p class="p">You can create or update an inline library using the <span class="ph filepath">-⁠Mextract</span> command-line option. If you do not specify selection criteria with the <span class="ph filepath">-⁠Mextract</span> option, the compiler attempts to extract all procedures. </p> <p class="p">Several <span class="ph filepath">-⁠Mextract</span> options let you determine the selection criteria for creating or updating an inline library. These selection criteria include: </p> <dl class="dl"> <dt class="dt dlterm"><span class="ph filepath">func</span></dt> <dd class="dd">Extracts the procedure <span class="ph filepath">func</span>. You can use a comma-separated list to specify multiple procedures. </dd> <dt class="dt dlterm">[name:]<span class="ph filepath">func</span></dt> <dd class="dd">Extracts the procedure whose name matches <span class="ph filepath">func</span>, a procedure in the source text. </dd> <dt class="dt dlterm">[size:]<span class="ph filepath">n</span></dt> <dd class="dd">Limits the size of the extracted procedures to those with a statement count less than or equal to <span class="ph filepath">n</span>, the specified size. <div class="note note"><span class="notetitle">Note:</span> The size n may not exactly equal the number of statements in a selected procedure; the size parameter is merely a rough gauge. </div> </dd> <dt class="dt dlterm">[lib:]<span class="ph filepath">ext.lib</span></dt> <dd class="dd">Stores the extracted information in the library directory <span class="ph filepath">ext.lib</span>. <p class="p">If no inline library is specified, procedures are extracted to a temporary library created during an extract prepass for use during the compilation stage. </p> </dd> </dl> <p class="p">When you use the <span class="ph filepath">-⁠Mextract</span> option, only the extract phase is performed; the compile and link phases are not performed. 
The output of an extract pass is a library of procedures available for inlining. This output is placed in the inline library file specified on the command line with the <span class="ph filepath">-⁠o</span> filename specification. If the library file exists, new information is appended to it. If the file does not exist, it is created. You can use a command similar to the following: </p><pre xml:space="preserve">$ nvfortran -Mextract=lib:lib.il myfunc.f</pre><p class="p">You can use the <span class="ph filepath">-⁠Minline</span> option with the <span class="ph filepath">-⁠Mextract</span> option. In this case, the extracted library of procedures can have other procedures inlined into the library. Using both options enables you to obtain more than one level of inlining. In this situation, if you do not specify a library with the <span class="ph filepath">-⁠Minline</span> option, the inline process consists of two extract passes. The first pass is a hidden pass implied by the <span class="ph filepath">-⁠Minline</span> option, during which the compiler extracts procedures and places them into a temporary library. The second pass uses the results of the first pass but puts its results into the library that you specify with the <span class="ph filepath">-⁠o</span> option. </p> </div> <div class="topic concept nested2" id="fn-inline-work-lib"><a name="fn-inline-work-lib" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-work-lib" name="fn-inline-work-lib" shape="rect">4.4.1. Working with Inline Libraries</a></h3> <div class="body conbody"> <p class="p">An inline library is implemented as a directory with each inline function in the library stored as a file using an encoded form of the inlinable function. </p> <p class="p">A special file named <span class="ph filepath">TOC</span> in the inline library directory serves as a table of contents for the inline library. 
This is a printable ASCII file that you can examine to locate information about the library contents, such as names and sizes of functions, the source file from which they were extracted, the version number of the extractor which created the entry, and so on. </p> <p class="p">Libraries and their elements can be manipulated using ordinary system commands.</p> <ul class="ul"> <li class="li">Inline libraries can be copied or renamed.</li> <li class="li">Elements of libraries can be deleted or copied from one library to another.</li> <li class="li">The <span class="ph filepath">ls</span> or <span class="ph filepath">dir</span> command can be used to determine the last-change date of a library entry. </li> </ul> </div> </div> <div class="topic concept nested2" id="fn-inline-lib-depend"><a name="fn-inline-lib-depend" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-lib-depend" name="fn-inline-lib-depend" shape="rect">4.4.2. Dependencies</a></h3> <div class="body conbody"> <p class="p">When a library is created or updated using one of the NVIDIA HPC compilers, the last-change date of the library directory is updated. This allows a library to be listed as a dependency in a makefile and ensures that the necessary compilations are performed when a library is changed. </p> </div> </div> <div class="topic concept nested2" id="fn-inline-upd-lib-makefiles"><a name="fn-inline-upd-lib-makefiles" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-upd-lib-makefiles" name="fn-inline-upd-lib-makefiles" shape="rect">4.4.3. Updating Inline Libraries – Makefiles</a></h3> <div class="body conbody"> <p class="p">If you use inline libraries, you must be certain that they remain up to date with the source files into which they are inlined. One way to ensure that inline libraries are updated is to include them in a makefile. 
</p> <p class="p">The makefile fragment in the following example assumes the file <span class="ph filepath">utils.f</span> contains a number of small functions used in the files <span class="ph filepath">parser.f</span> and <span class="ph filepath">alloc.f</span>. </p> <p class="p">This portion of the makefile:</p> <ul class="ul"> <li class="li">Maintains the inline library <span class="ph filepath">utils.il</span>. </li> <li class="li">Updates the library whenever you change <span class="ph filepath">utils.f</span> or one of the include files it uses. </li> <li class="li">Compiles <span class="ph filepath">parser.f</span> and <span class="ph filepath">alloc.f</span> whenever you update the library. </li> </ul> <div class="example"> <h4 class="title sectiontitle">Sample Makefile</h4><pre xml:space="preserve">SRC = mydir
FC = nvfortran
FFLAGS = -O2

main.o: $(SRC)/main.f $(SRC)/global.h
	$(FC) $(FFLAGS) -c $(SRC)/main.f

utils.o: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
	$(FC) $(FFLAGS) -c $(SRC)/utils.f

utils.il: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
	$(FC) $(FFLAGS) -Mextract=15 -o utils.il $(SRC)/utils.f

parser.o: $(SRC)/parser.f $(SRC)/global.h utils.il
	$(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/parser.f

alloc.o: $(SRC)/alloc.f $(SRC)/global.h utils.il
	$(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/alloc.f

myprog: main.o utils.o parser.o alloc.o
	$(FC) -o myprog main.o utils.o parser.o alloc.o</pre></div> </div> </div> </div> <div class="topic concept nested1" id="fn-inline-errors"><a name="fn-inline-errors" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-errors" name="fn-inline-errors" shape="rect">4.5. Error Detection during Inlining</a></h3> <div class="body conbody"> <p class="p">You can specify the <span class="ph filepath">-⁠Minfo=inline</span> option to request inlining information from the compiler when you invoke the inliner. 
For example: </p><pre xml:space="preserve">$ nvfortran -Minline=mylib.il -Minfo=inline myext.f</pre></div> </div> <div class="topic concept nested1" id="fn-inline-examples"><a name="fn-inline-examples" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-examples" name="fn-inline-examples" shape="rect">4.6. Examples</a></h3> <div class="body conbody"> <p class="p">Assume the program <span class="ph filepath">dhry</span> consists of a single source file <span class="ph filepath">dhry.f</span>. The following command line builds an executable file for <span class="ph filepath">dhry</span> in which proc7 is inlined wherever it is called: </p><pre xml:space="preserve">$ nvfortran dhry.f -Minline=proc7</pre><p class="p">The following command lines build an executable file for <span class="ph filepath">dhry</span> in which proc7 plus any functions of approximately 10 or fewer statements are inlined (one level only). </p> <div class="note note"><span class="notetitle">Note:</span> The specified functions are inlined only if they are previously placed in the inline library, <span class="ph filepath">temp.il</span>, during the extract phase. </div><pre xml:space="preserve">$ nvfortran dhry.f -Mextract=lib:temp.il
$ nvfortran dhry.f -Minline=10,proc7,temp.il</pre><p class="p">Using the same source file <span class="ph filepath">dhry.f</span>, the following example builds an executable for <span class="ph filepath">dhry</span> in which all functions of roughly ten or fewer statements are inlined. Two levels of inlining are performed. This means that if function A calls function B, and B calls C, and both B and C are inlinable, then the version of B which is inlined into A will have had C inlined into it. 
</p><pre xml:space="preserve">$ nvfortran dhry.f -Minline=maxsize:10</pre></div> </div> <div class="topic concept nested1" id="fn-inline-restrictions"><a name="fn-inline-restrictions" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fn-inline-restrictions" name="fn-inline-restrictions" shape="rect">4.7. Restrictions on Inlining</a></h3> <div class="body conbody"> <p class="p">The following Fortran subprograms cannot be extracted: </p> <ul class="ul"> <li class="li">Main or BLOCK DATA programs.</li> <li class="li">Subprograms containing alternate return, assigned GO TO, DATA, SAVE, or EQUIVALENCE statements.</li> <li class="li">Subprograms containing FORMAT statements.</li> <li class="li">Subprograms containing multiple entries.</li> </ul> <p class="p">A Fortran subprogram is not inlined if any of the following applies: </p> <ul class="ul"> <li class="li">It is referenced in a statement function.</li> <li class="li">A common block mismatch exists; in other words, the caller must contain all common blocks specified in the callee, and elements of the common blocks must agree in name, order, and type (except that the caller's common block can have additional members appended to the end of the common block). </li> <li class="li">An argument mismatch exists; in other words, the number and type (size) of actual and formal parameters must be equal.</li> <li class="li">A name clash exists, such as a call to subroutine <span class="ph filepath">xyz</span> in the extracted subprogram and a variable named <span class="ph filepath">xyz</span> in the caller. 
</li> </ul> <p class="p">The following types of C and C++ functions cannot be inlined: </p> <ul class="ul"> <li class="li">Functions which accept a variable number of arguments</li> </ul> <p class="p">Certain C/C++ functions can only be inlined into the file that contains their definition:</p> <ul class="ul"> <li class="li">Static functions</li> <li class="li">Functions which call a static function</li> <li class="li">Functions which reference a static variable</li> </ul> </div> </div> </div> <div class="topic concept nested0" id="gpus-use"><a name="gpus-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#gpus-use" name="gpus-use" shape="rect">5. Using GPUs</a></h2> <div class="body conbody"> <p class="p"> An NVIDIA GPU can be used as an accelerator to which a CPU can offload data and executable kernels to perform compute-intensive calculations. This section gives an overview of options for programming NVIDIA GPUs with NVIDIA's HPC Compilers and covers topics that affect GPU programming when using one or more of the GPU programming models. </p> </div> <div class="topic concept nested1" id="acc-overview"><a name="acc-overview" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-overview" name="acc-overview" shape="rect">5.1. Overview</a></h3> <div class="body conbody"> <p class="p"> With the NVIDIA HPC Compilers you can program NVIDIA GPUs using certain standard language constructs, OpenACC directives, OpenMP directives, or CUDA Fortran language extensions. GPU programming with standard language constructs or directives allows you to create high-level GPU-accelerated programs without the need to explicitly initialize the GPU, manage data or program transfers between the host and GPU, or initiate GPU startup and shutdown. Rather, all of these details are implicit in the programming model and are managed by the NVIDIA HPC SDK Fortran, C⁠+⁠+ and C compilers. 
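</p> <p class="p">As a rough illustration of programming with standard language constructs, the following minimal sketch expresses a loop with the Fortran <samp class="ph codeph">DO CONCURRENT</samp> construct. The program name, array names, and sizes are hypothetical, and the choice of <samp class="ph codeph">DO CONCURRENT</samp> is an assumption made for illustration, since this overview does not name the specific constructs involved. The source contains no directives, no data-transfer calls, and no device setup; when it is built with a command along the lines of <samp class="ph codeph">nvfortran -stdpar=gpu saxpy.f90</samp>, those details are left to the compiler and runtime. </p><pre xml:space="preserve">program saxpy_dc
  implicit none
  integer, parameter :: n = 1000000   ! hypothetical problem size
  integer :: i
  real :: a
  real, allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  a = 2.0
  x = 1.0
  y = 3.0

  ! Standard Fortran only: the iterations are declared independent,
  ! so the compiler is free to run them in parallel.
  do concurrent (i = 1:n)
     y(i) = a * x(i) + y(i)
  end do

  print *, 'y(1) =', y(1)
  deallocate(x, y)
end program saxpy_dc</pre> <p class="p">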
GPU programming with CUDA extensions gives you access to all NVIDIA GPU features and full control over data management and offloading of compute-intensive loops and kernels. </p> <p class="p"> The NVC⁠+⁠+ compiler supports automatic offload of C⁠+⁠+17 Parallel Algorithms invocations to NVIDIA GPUs under control of the <samp class="ph codeph">-stdpar</samp> compiler option. See the Blog post <em class="ph i">Accelerating Standard C⁠+⁠+ with GPUs</em> for details on using this feature. The NVFORTRAN compiler supports automatic offload to NVIDIA GPUs of certain Fortran array intrinsics and patterns of array syntax, including use of Volta and Ampere architecture Tensor Cores for appropriate intrinsics. See the Blog post <em class="ph i">Bringing Tensor Cores to Standard Fortran</em> for details on using this feature. </p> <p class="p"> The NVFORTRAN compiler supports CUDA programming in Fortran. See the <em class="ph i">NVIDIA CUDA Fortran Programming Guide</em> for complete details on how to use CUDA Fortran. The NVCC compiler supports CUDA programming in C and C⁠+⁠+ in combination with a host C⁠+⁠+ compiler on your system. See the <em class="ph i">CUDA C⁠+⁠+ Programming Guide</em> for an introduction and overview of how to use NVCC and CUDA C⁠+⁠+. </p> <p class="p"> The NVFORTRAN, NVC⁠+⁠+ and NVC compilers all support directive-based programming of NVIDIA GPUs using OpenACC. OpenACC is an accelerator programming model that is portable across operating systems and various host CPUs and types of accelerators, including both NVIDIA GPUs and multicore CPUs. OpenACC directives allow a programmer to migrate applications incrementally to accelerator targets using standards-compliant Fortran, C⁠+⁠+ or C that remains completely portable to other compilers and systems. 
It allows the programmer to augment information available to the compilers, including specification of data local to an accelerator region, guidance on mapping of loops onto an accelerator, and similar performance-related details. </p> <p class="p"> The NVFORTRAN, NVC⁠+⁠+, and NVC compilers support a subset of the OpenMP Application Program Interface for CPUs and GPUs. OpenMP applications properly structured for GPUs, meaning they expose massive parallelism and have relatively little or no synchronization in GPU-side code segments, should compile and execute with performance on par with or close to equivalent OpenACC. Codes that are not well-structured for GPUs may perform poorly but should execute correctly. </p> <p class="p"> In user-directed accelerator programming the user specifies the regions of a host program to be targeted for offloading to an accelerator. The bulk of a user's program, as well as regions containing constructs that are not supported on the targeted accelerator, are executed on the host. </p> </div> </div> <div class="topic concept nested1" id="acc-terms"><a name="acc-terms" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-terms" name="acc-terms" shape="rect">5.2. Terminology</a></h3> <div class="body conbody"> <p class="p">Clear and consistent terminology is important in describing any programming model. This section provides definitions of the terms required for you to effectively use this section and the associated programming model. </p> <dl class="dl"> <dt class="dt dlterm">Accelerator</dt> <dd class="dd">a parallel processor, such as a GPU or a CPU running in multicore mode, to which a CPU can offload data and executable kernels to perform compute-intensive calculations. 
</dd> <dt class="dt dlterm">Compute intensity</dt> <dd class="dd">for a given loop, region, or program unit, the ratio of the number of arithmetic operations performed on computed data divided by the number of memory transfers required to move that data between two levels of a memory hierarchy. </dd> <dt class="dt dlterm">Compute region</dt> <dd class="dd">a structured block defined by a compute construct. A <em class="ph i">compute construct</em> is a structured block containing loops which are compiled for the accelerator. A compute region may require device memory to be allocated and data to be copied from host to device upon region entry, and data to be copied from device to host memory and device memory deallocated upon exit. The dynamic range of a compute construct, including any code in procedures called from within the construct, is the compute region. In this release, compute regions may not contain other compute regions or data regions. </dd> <dt class="dt dlterm">Construct</dt> <dd class="dd">a structured block identified by the programmer or implicitly defined by the language. Certain actions may occur when program execution reaches the start and end of a construct, such as device memory allocation or data movement between the host and device memory. Loops in a compute construct are targeted for execution on the accelerator. The dynamic range of a construct including any code in procedures called from within the construct, is called a <em class="ph i">region</em>. </dd> <dt class="dt dlterm">CUDA</dt> <dd class="dd">stands for Compute Unified Device Architecture; CUDA C++ and Fortran language extensions and API calls can be used to explicitly control and program an NVIDIA GPU. </dd> <dt class="dt dlterm">Data region</dt> <dd class="dd">a region defined by a data construct, or an implicit data region for a function or subroutine containing directives. 
Data regions typically require device memory to be allocated and data to be copied from host to device memory upon entry, and data to be copied from device to host memory and device memory deallocated upon exit. Data regions may contain other data regions and compute regions. </dd> <dt class="dt dlterm">Device</dt> <dd class="dd">a general reference to any type of accelerator.</dd> <dt class="dt dlterm">Device memory </dt> <dd class="dd">memory attached to an accelerator which is physically separate from the host memory.</dd> <dt class="dt dlterm">Directive</dt> <dd class="dd"><span class="ph">in C, a #pragma, or in Fortran, </span>a specially formatted comment statement that is interpreted by a compiler to augment information about or specify the behavior of the program. </dd> <dt class="dt dlterm">DMA</dt> <dd class="dd">Direct Memory Access, a method to move data between physically separate memories; this is typically performed by a DMA engine, separate from the host CPU, that can access the host physical memory as well as an IO device or GPU physical memory. </dd> <dt class="dt dlterm">GPU</dt> <dd class="dd">a Graphics Processing Unit; one type of accelerator device.</dd> <dt class="dt dlterm">Host</dt> <dd class="dd">the main CPU that in this context has an attached accelerator device. The host CPU controls the program regions and data loaded into and executed on the device. </dd> <dt class="dt dlterm">Loop trip count</dt> <dd class="dd">the number of times a particular loop executes.</dd> <dt class="dt dlterm">Private data</dt> <dd class="dd">with respect to an iterative loop, data which is used only during a particular loop iteration. With respect to a more general region of code, data which is used within the region but is not initialized prior to the region and is re-initialized prior to any use after the region. 
</dd> <dt class="dt dlterm">Region </dt> <dd class="dd">the dynamic range of a construct, including any procedures invoked from within the construct.</dd> <dt class="dt dlterm">Structured block </dt> <dd class="dd">in C++ or C, an executable statement, possibly compound, with a single entry at the top and a single exit at the bottom. In Fortran, a block of executable statements with a single entry at the top and a single exit at the bottom. </dd> <dt class="dt dlterm">Vector operation</dt> <dd class="dd">a single operation or sequence of operations applied uniformly to each element of an array.</dd> <dt class="dt dlterm">Visible device copy</dt> <dd class="dd">a copy of a variable, array, or subarray allocated in device memory, that is visible to the program unit being compiled.</dd> </dl> </div> </div> <div class="topic concept nested1" id="acc-exe-model"><a name="acc-exe-model" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-exe-model" name="acc-exe-model" shape="rect">5.3. Execution Model</a></h3> <div class="body conbody"> <p class="p">The execution model targeted by the NVIDIA HPC Compilers is host-directed execution with an attached accelerator device, such as a GPU. The bulk of a user application executes on the host. Compute intensive regions are offloaded to the accelerator device under control of the host. The accelerator device executes kernels, which may be as simple as a tightly-nested loop, or as complex as a subroutine, depending on the accelerator hardware. </p> </div> <div class="topic concept nested2" id="acc-exe-host-funcs"><a name="acc-exe-host-funcs" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-exe-host-funcs" name="acc-exe-host-funcs" shape="rect">5.3.1. 
Host Functions</a></h3> <div class="body conbody"> <p class="p">Even in accelerator-targeted regions, the host must orchestrate the execution; it</p> <ul class="ul"> <li class="li">allocates memory on the accelerator device</li> <li class="li">initiates data transfer</li> <li class="li">sends the kernel code to the accelerator</li> <li class="li">passes kernel arguments</li> <li class="li">queues the kernel</li> <li class="li">waits for completion</li> <li class="li">transfers results back to the host</li> <li class="li">deallocates memory</li> </ul> <div class="note note"><span class="notetitle">Note:</span> In most cases, the host can queue a sequence of kernels to be executed on the device, one after the other. </div> </div> </div> </div> <div class="topic concept nested1" id="acc-mem-model"><a name="acc-mem-model" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-mem-model" name="acc-mem-model" shape="rect">5.4. Memory Model</a></h3> <div class="body conbody"> <p class="p"> The most significant difference between a <em class="ph i">host-only</em> program and a <em class="ph i">host+accelerator</em> program is that the memory on the accelerator can be completely separate from host memory, which is the case on many GPUs. For example: </p> <ul class="ul"> <li class="li"> The host cannot read or write accelerator memory directly because it is not mapped into the virtual memory space of the host. </li> <li class="li"> All data movement between host memory and accelerator memory must be performed by the host through runtime library calls that explicitly move data between the separate memories. </li> <li class="li"> In general it is not valid for the compiler to assume the accelerator can read or write host memory directly. This is well-defined starting with the OpenACC 2.7 and OpenMP 5.0 specifications. 
</li> </ul> <p class="p"> The systems with the latest GPUs provide a unified single address space between CPU and GPU for some or all memory regions, as detailed in the <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a> subsection below. In these systems data can be accessed from host and accelerator subprograms without the need for explicit data movement. </p> <p class="p"> The NVIDIA HPC Compilers support the following system memory modes: </p> <div class="tablenoborder"><a name="acc-mem-model__gpu-memory-modes" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="acc-mem-model__gpu-memory-modes" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 12. GPU Memory Modes</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="14.285714285714285%" id="d54e6151" rowspan="1" colspan="1"> Memory Mode </th> <th class="entry" valign="top" width="57.14285714285714%" id="d54e6154" rowspan="1" colspan="1"> Description </th> <th class="entry" valign="top" width="28.57142857142857%" id="d54e6157" rowspan="1" colspan="1"> Compiler flags </th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="14.285714285714285%" headers="d54e6151" rowspan="1" colspan="1"> Separate </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6154" rowspan="1" colspan="1"> All data accessed in host and accelerator programs are in separate (CPU and GPU) memories. Data in the application need to be physically moved between CPU and GPU memory either by adding explicit annotations or by relying on a compiler to detect and migrate the data. 
</td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e6157" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:separate</samp></td> </tr> <tr class="row"> <td class="entry" valign="top" width="14.285714285714285%" headers="d54e6151" rowspan="1" colspan="1"> Managed </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6154" rowspan="1" colspan="1"> Dynamically allocated host data are placed in CUDA Managed Memory which is a unified single address space between host and accelerator programs and can therefore be accessed on device without explicit data movement. All other data (host, stack, or global data) remain in separate memory. </td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e6157" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:managed</samp></td> </tr> <tr class="row"> <td class="entry" valign="top" width="14.285714285714285%" headers="d54e6151" rowspan="1" colspan="1"> Unified </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6154" rowspan="1" colspan="1"> All host data are placed in a unified single address space between the host and accelerator subprograms; no explicit data movements are required. This mode is intended for targets with full CUDA Unified Memory capability and it may utilize CUDA Managed Memory for dynamic allocations. </td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e6157" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:unified</samp></td> </tr> </tbody> </table> </div> <p class="p"> If the memory mode is not selected explicitly by passing one of the above <samp class="ph codeph">-⁠gpu=mem:*</samp> options, the compiler selects a default memory mode. The default memory mode for Stdpar is explained in <a class="xref" href="index.html#stdpar-use" shape="rect">Using Stdpar</a>. When Stdpar is not enabled, the default memory mode is Separate Memory. 
Memory modes may have specific semantics in each programming language and the compilers can sometimes implicitly determine the data movement that's required. More details can be found in the subsections of each programming model. </p> <p class="p"> The options <samp class="ph codeph">-⁠gpu=[no]managed</samp>, <samp class="ph codeph">-⁠gpu=[no]unified</samp>, and <samp class="ph codeph">-⁠gpu=pinned</samp> are deprecated but still accepted. Refer to <a class="xref" href="index.html#gpu-mem-flags" shape="rect">Command-line Options Selecting Compiler Memory Modes</a> for compatibility between the current and deprecated memory specific flags. </p> <div class="p"> The compiler implicitly defines the following macros corresponding to the memory mode it compiles for: <ul class="ul"> <li class="li"> When the code is compiled for Separate Memory Mode, the compiler defines the <samp class="ph codeph">__NVCOMPILER_GPU_SEPARATE_MEM</samp> macro. </li> <li class="li"> When the code is compiled for Managed Memory Mode, the compiler defines the <samp class="ph codeph">__NVCOMPILER_GPU_MANAGED_MEM</samp> macro. </li> <li class="li"> When the code is compiled for Unified Memory Mode, the compiler defines the <samp class="ph codeph">__NVCOMPILER_GPU_UNIFIED_MEM</samp> macro. If CUDA Managed Memory is utilized, the compiler additionally defines <samp class="ph codeph">__NVCOMPILER_GPU_MANAGED_MEM</samp>. </li> </ul> </div> <div class="p"> When a binary is compiled for one memory mode, it may need to be run on a system with specific memory capabilities as follows: <ul class="ul"> <li class="li"> Applications compiled for Separate Memory Mode can run on any CUDA platform. </li> <li class="li"> Applications compiled for Managed Memory Mode must be run on platforms with CUDA Managed Memory or full CUDA Unified Memory capabilities. </li> <li class="li"> Applications compiled for Unified Memory Mode must be run on platforms with full CUDA Unified Memory. 
</li> </ul> </div> <div class="note note"><span class="notetitle">Note:</span> Memory allocated in the accelerator subprogram cannot be accessed or deallocated from the host. </div> </div> <div class="topic concept nested2" id="acc-mem-sep-host"><a name="acc-mem-sep-host" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-mem-sep-host" name="acc-mem-sep-host" shape="rect">5.4.1. Separate Host and Accelerator Memory Considerations</a></h3> <div class="body conbody"> <p class="p">The programmer must be aware of the potentially separate memories for many reasons, including but not limited to:</p> <ul class="ul"> <li class="li">Memory bandwidth between host memory and accelerator memory determines the compute intensity required to effectively accelerate a given region of code. </li> <li class="li">Limited size of accelerator memory may prohibit offloading of regions of code that operate on very large amounts of data.</li> </ul> </div> <div class="topic concept nested3" id="acc-mem-model-mem"><a name="acc-mem-model-mem" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-mem-model-mem" name="acc-mem-model-mem" shape="rect">5.4.1.1. Accelerator Memory</a></h3> <div class="body conbody"> <p class="p"> On the accelerator side, current GPUs implement a weak memory model. In particular, they do not support memory coherence between threads unless those threads are parallel only at the synchronous level and the memory operations are separated by an explicit barrier. Otherwise, if one thread updates a memory location and another reads the same location, or two threads store a value to the same location, the hardware does not guarantee the results. While the results of running such a program might be inconsistent, it is not accurate to say that the results are incorrect. By definition, such programs are in error. 
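</p> <p class="p">The case of two threads storing to the same location can be sketched as follows. This is a minimal illustration, assuming OpenACC directives in Fortran; the program and variable names (<samp class="ph codeph">histogram</samp>, <samp class="ph codeph">key</samp>, <samp class="ph codeph">counts</samp>) are hypothetical. Many loop iterations update the same element of <samp class="ph codeph">counts</samp>, so without the <samp class="ph codeph">atomic</samp> directive the updates would conflict in the way described above. Making each update atomic is one way, though not the only one, to give the program a defined result. </p><pre xml:space="preserve">program histogram
  implicit none
  integer, parameter :: n = 1000, nbins = 10   ! hypothetical sizes
  integer :: i, bin
  integer :: key(n), counts(nbins)

  ! Assign every element to one of nbins bins.
  do i = 1, n
     key(i) = mod(i, nbins) + 1
  end do
  counts = 0

  !$acc parallel loop copyin(key) copy(counts) private(bin)
  do i = 1, n
     bin = key(i)
     ! Different iterations can target the same bin; the atomic
     ! directive makes each read-modify-write update indivisible.
     !$acc atomic update
     counts(bin) = counts(bin) + 1
  end do

  print *, counts
end program histogram</pre> <p class="p">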
While a compiler can detect some potential errors of this nature, it is nonetheless possible to write an accelerator region that produces inconsistent numerical results. </p> <p class="p"> Stack data in accelerator subprograms are allocated per thread. Stack data from one thread are not accessible to other threads. </p> </div> </div> <div class="topic concept nested3" id="acc-mem-pinned-buffer"><a name="acc-mem-pinned-buffer" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-mem-pinned-buffer" name="acc-mem-pinned-buffer" shape="rect">5.4.1.2. Staging Memory Buffer</a></h3> <div class="body conbody"> <p class="p">Memory transfers between the accelerator and host may not always be asynchronous with respect to the host, even if the chosen programming model (for instance, OpenACC) declares them to be. This limitation may be due to the specific GPU and host memory architectures. </p> <p class="p">In order to help the host program proceed while a memory transfer to or from the accelerator is underway, the NVIDIA HPC Compilers Runtime maintains a designated staging memory area, also known as a pinned buffer. This memory area is registered with the CUDA API, which makes it suitable for asynchronous memory transfers between the GPU and the host. When an asynchronous memory transfer is started, the data being transferred is staged through the pinned buffer. Multiple asynchronous operations on the same data can be issued; in that case, the runtime system will operate on the data staged in the pinned buffer, not on the original host memory. When the host program issues an explicit or implicit synchronization request, the data is moved from the pinned buffer to its destination transparently to the application. </p> <p class="p">The runtime has the discretion to enable or disable the pinned buffer depending on the host and GPU memory architecture. Also, the size of the pinned buffer is determined by the runtime system as appropriate. 
The user can control some of these decisions using environment variables at the start of the application. Please refer to <a class="xref" href="index.html#env-vars-memory" shape="rect">Environment Variables Controlling Device Memory Management</a> to learn more. </p> </div> </div> <div class="topic concept nested3" id="acc-mem-cache-mgmt"><a name="acc-mem-cache-mgmt" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-mem-cache-mgmt" name="acc-mem-cache-mgmt" shape="rect">5.4.1.3. Cache Management</a></h3> <div class="body conbody"> <p class="p"> Some current GPUs have a software-managed cache, some have hardware-managed caches, and most have hardware caches that can be used only in certain situations and are limited to read-only data. In low-level programming models such as CUDA, it is up to the programmer to manage these caches. The OpenACC programming model provides directives the programmer can use as hints to the compiler for cache management. </p> </div> </div> <div class="topic concept nested3" id="env-vars-memory"><a name="env-vars-memory" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-memory" name="env-vars-memory" shape="rect">5.4.1.4. Environment Variables Controlling Device Memory Management</a></h3> <div class="body conbody"> <p class="p">This section summarizes the environment variables that NVIDIA HPC Compilers use to control device memory management.</p> <p class="p">The following table contains the environment variables that are currently supported and provides a brief description of each.</p> <div class="tablenoborder"><a name="env-vars-memory__environment-variables-memory" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="env-vars-memory__environment-variables-memory" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 13. 
Memory Management Environment Variables</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="50%" id="d54e6382" rowspan="1" colspan="1">Environment Variable</th> <th class="entry" valign="top" width="50%" id="d54e6385" rowspan="1" colspan="1">Use</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_BUFFERSIZE</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, this defines the size of the pinned buffer used to transfer data between host and device. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_CTX_SCHED</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, sets flags to be used when creating a new CUDA context. By default, the <samp class="ph codeph">CU_CTX_SCHED_YIELD</samp> flag is used. Please refer to the <a class="xref" href="https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX" target="_blank" shape="rect">CUDA Toolkit Documentation</a> for the detailed description of the <samp class="ph codeph">cuCtxCreate</samp> function and the possible flag values. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_HEAPSIZE</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, sets the heap size limit for <samp class="ph codeph">malloc()</samp> when called on device. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_MAX_L2_FETCH_GRANULARITY</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, sets the maximum L2 cache fetch granularity size in bytes. A valid value is an integer between 0 and 128. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_MEMALLOCASYNC</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, when set to a non-zero integer value, enables CUDA asynchronous memory allocations from the default CUDA memory pool as described in the <a class="xref" href="https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MALLOC__ASYNC.html" target="_blank" shape="rect">CUDA Toolkit Documentation</a>. By default, an internal NVIDIA HPC Runtime memory pool is used instead. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_MEMALLOCASYNC_POOLSIZE</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, sets the size of the default CUDA memory pool for asynchronous allocations if the <samp class="ph codeph">NVCOMPILER_ACC_CUDA_MEMALLOCASYNC</samp> environment variable is also set to a non-zero integer value. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_NOCOPY</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">Disables the use of the pinned buffer when transferring user data between host and NVIDIA CUDA devices. When this variable is set to a non-zero integer value, user data will be transferred directly, bypassing the pinned buffer. 
Asynchronous execution of such data transfers can be limited when this setting is in effect. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_PIN</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, enables host memory pinning at data directives. When host memory is pinned, data transfers to and from the device can be asynchronous, which can potentially improve program performance. A non-zero integer value enables this mechanism. A value of <samp class="ph codeph">2</samp> or greater additionally disallows unpinning the host data after it is pinned. A value of <samp class="ph codeph">3</samp> or greater also enables pinning the whole array referenced in a data directive (provided that the size of the array is known), rather than its subarray specified in the data directive. By default, host data referenced at data directives is not pinned unless directed by the compiler at compile-time; refer to <a class="xref" href="index.html#gpu-mem-flags" shape="rect">Command-line Options Selecting Compiler Memory Modes</a> for more information about the compile-time memory modes. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_PINSIZE</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, sets the host memory pinning granularity. If host memory pinning is enabled with the <samp class="ph codeph">NVCOMPILER_ACC_CUDA_PIN</samp> environment variable, the runtime will attempt to use this setting to pin larger regions of memory at once, thus potentially lowering the cost of pinning memory when the program needs to pin multiple data regions separately. The maximum allowed value is 1 MB. By default, single byte pinning granularity is used. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_PRINTFIFOSIZE</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, sets the buffer size for formatted output calls on device. In particular, it controls the buffer size for the <samp class="ph codeph">printf</samp> C function. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_STACKSIZE</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, sets the stack size limit for device threads.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_DEV_MEMORY</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, when set to a valid non-zero size value, enables the use of a device memory pool and sets its size. By default, the device memory pool is not used. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e6382" rowspan="1" colspan="1">NVCOMPILER_ACC_MEM_MANAGE</td> <td class="entry" valign="top" width="50%" headers="d54e6385" rowspan="1" colspan="1">For NVIDIA CUDA devices, when set to the integer value 0, disables the use of an internal device memory manager. By default, the device memory manager is enabled. It maintains a list of deallocated chunks of device memory in an attempt to efficiently reuse them for future allocations. </td> </tr> </tbody> </table> </div> </div> </div> </div> <div class="topic concept nested2" id="acc-mem-unified"><a name="acc-mem-unified" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-mem-unified" name="acc-mem-unified" shape="rect">5.4.2. 
Managed and Unified Memory Modes</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> The NVIDIA HPC Compilers support interoperability with <a class="xref" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming" target="_blank" shape="rect">CUDA Unified Memory</a>. This feature is available with the x86-64, OpenPOWER and Arm Server compilers. Unified memory provides a single address space for CPU and GPU; data movement between CPU and GPU memories is implicitly handled by the NVIDIA CUDA driver. </p> <p class="p"> Whenever data is accessed on the CPU or the GPU, it could trigger a data transfer if the last time it was accessed was not on the same device. In some cases, page thrashing may occur and impact performance. An introduction to CUDA Unified Memory is available on <a class="xref" href="https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners" target="_blank" shape="rect">Parallel Forall</a>. </p> </div> </div> <div class="topic concept nested3" id="gpu-mem-mode-managed"><a name="gpu-mem-mode-managed" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gpu-mem-mode-managed" name="gpu-mem-mode-managed" shape="rect">5.4.2.1. Managed Memory Mode</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> In Managed Memory Mode, all Fortran, C⁠+⁠+ and C explicit allocation statements (e.g. <samp class="ph codeph">allocate</samp>, <samp class="ph codeph">new</samp>, and <samp class="ph codeph">malloc</samp>, respectively) in a program unit are replaced by equivalent CUDA managed data allocation calls that place the data in CUDA Managed Memory. The result is that OpenACC and OpenMP data clauses and directives are not needed to manage data movement. They are essentially ignored and can be omitted. For Stdpar this is the minimal required memory mode since there are no specific annotations for data used in the parallel region. 
</p> <p class="p"> To enable Managed Memory Mode, add the option <samp class="ph codeph">-⁠gpu=mem:managed</samp> to the compiler and linker command lines. </p> <p class="p"> When a program allocates managed memory, it allocates host pinned memory as well as device memory, thus making allocate and free operations somewhat more expensive and data transfers somewhat faster. A memory pool allocator is used to mitigate the overhead of the allocate and deallocate operations. More details can be found in <a class="xref" href="index.html#gpu-mem-poolallocator" shape="rect">Memory Pool Allocator</a>. </p> <p class="p"> Managed Memory Mode has the following limitations: </p> <ul class="ul"> <li class="li"> Use of managed memory applies only to dynamically-allocated data. </li> <li class="li"> Given an allocatable aggregate with a member that points to local, global, or static data, compiling with <samp class="ph codeph">-⁠gpu=mem:managed</samp> and attempting to access memory through that pointer from the compute kernel will cause a failure at runtime. </li> <li class="li"> C⁠+⁠+ virtual functions are not supported. </li> <li class="li"> The <samp class="ph codeph">-⁠gpu=mem:managed</samp> compiler option must be used to compile the files in which variables (accessed from GPU) are allocated, even if there is no code to accelerate on the GPU in the source file. </li> <li class="li"> When linking multiple translation units, the application must ensure that all data are deallocated using the scheme corresponding to their allocation. For example, if the data are allocated in managed memory, the deallocation must be performed using CUDA API calls for managed memory. More details and additional compiler support are described in <a class="xref" href="index.html#gpu-mem-intercept" shape="rect">Interception of Deallocations</a>. 
</li> </ul> <p class="p"> Managed Memory Mode has the following additional limitations when used with NVIDIA Kepler GPUs: </p> <ul class="ul"> <li class="li"> Data motion on Kepler GPUs is achieved through fast pinned asynchronous data transfers; from the program's perspective, however, the transfers are synchronous. </li> <li class="li"> The NVIDIA HPC Compiler Runtime enforces synchronous execution of kernels when <samp class="ph codeph">-⁠gpu=mem:managed</samp> is used on a system with a Kepler GPU. This situation may result in slower performance because of the extra synchronizations and decreased overlap between CPU and GPU. </li> <li class="li"> The total amount of managed memory is limited to the amount of available device memory on Kepler GPUs. </li> </ul> </div> <div class="section"> <h5 class="title sectiontitle">Memory Allocations/Deallocations Automatically Changed to Managed Memory</h5> <p class="p"> When the compiler utilizes CUDA Managed Memory capability either with <samp class="ph codeph">-⁠gpu=mem:managed</samp> or <samp class="ph codeph">-⁠gpu=mem:unified</samp>, the following explicit allocations/deallocations are automatically changed into <samp class="ph codeph">cudaMallocManaged</samp>/<samp class="ph codeph">cudaFree</samp>-type allocations/deallocations: </p> <ul class="ul"> <li class="li"> For C++: <ul class="ul"> <li class="li"> All calls to global <samp class="ph codeph">operator new</samp> and <samp class="ph codeph">operator delete</samp> that allocate or deallocate memory, such as: <pre xml:space="preserve">
operator new(std::size_t size)
operator new(std::size_t size, const std::nothrow_t &amp;nothrow_value)
operator new(std::size_t size, std::align_val_t align)
operator new(std::size_t size, std::align_val_t align, const std::nothrow_t &amp;nothrow_value)
operator delete(void *p)
operator delete(void *p, std::size_t size)
operator delete(void *p, std::align_val_t align)
operator delete(void *p, std::size_t size, std::align_val_t align)
operator delete(void *p, const std::nothrow_t &amp;nothrow_value)
operator delete(void *p, std::align_val_t align, const std::nothrow_t &amp;nothrow_value)
</pre></li> <li class="li"> All the array forms of the above overloads. </li> <li class="li"> All calls to <samp class="ph codeph">malloc</samp>/<samp class="ph codeph">free</samp> functions. </li> </ul> </li> <li class="li"> For C: all calls to <samp class="ph codeph">malloc</samp>/<samp class="ph codeph">free</samp> functions. </li> <li class="li"> For Fortran: <ul class="ul"> <li class="li"> All allocations of automatic arrays. </li> <li class="li"> All <samp class="ph codeph">allocate</samp>/<samp class="ph codeph">deallocate</samp> statements with allocatable arrays or pointer variables. </li> </ul> </li> </ul> </div> </div> </div> <div class="topic concept nested3" id="gpu-mem-mode-unified"><a name="gpu-mem-mode-unified" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gpu-mem-mode-unified" name="gpu-mem-mode-unified" shape="rect">5.4.2.2. Unified Memory Mode</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> In Unified Memory Mode, the requirements for the program are further relaxed compared to Managed Memory Mode. Specifically, not only is dynamically allocated system memory accessible on the GPU, but global and local memory are also accessible. </p> <p class="p"> To enable this feature, add the option <samp class="ph codeph">-⁠gpu=mem:unified</samp> to the compiler and linker command lines. </p> <p class="p"> Programs compiled with <samp class="ph codeph">-⁠gpu=mem:unified</samp> must be run on systems that support full CUDA Unified Memory capability. At this time, full CUDA Unified Memory is supported on NVIDIA Grace Hopper Superchip systems and Linux x86-64 systems running with the Heterogeneous Memory Management (HMM) feature enabled in the Linux kernel. 
Details about these platforms are available in the following blog posts on the NVIDIA website: <a class="xref" href="https://developer.nvidia.com/blog/simplifying-gpu-programming-for-hpc-with-the-nvidia-grace-hopper-superchip" target="_blank" shape="rect"> Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip</a> and <a class="xref" href="https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management" target="_blank" shape="rect"> Simplifying GPU Application Development with Heterogeneous Memory Management</a>. </p> <p class="p"> In Unified Memory Mode, the compiler assumes that any system memory is accessible on the GPU. Even so, the compiler may generate managed memory allocations for explicit data allocations when it considers them beneficial for program performance. To enforce or prohibit the use of managed memory for dynamic allocations, pass <samp class="ph codeph">-⁠gpu=mem:unified:[no]managedalloc</samp> at both compilation and linking. </p> <p class="p"> Unified Memory Mode has the following limitations: </p> <ul class="ul"> <li class="li"> Unified memory support for OpenACC, OpenMP and Stdpar Fortran is not mix-and-match; all object files containing OpenACC/OpenMP directives or Fortran <samp class="ph codeph">DO CONCURRENT</samp> constructs must be compiled and linked with <samp class="ph codeph">-⁠gpu=mem:unified</samp> to ensure correct execution. </li> <li class="li"> C⁠+⁠+ virtual functions are not supported. </li> </ul> </div> <div class="section"> <h5 class="title sectiontitle">Transitioning to Unified Memory Mode</h5> <p class="p"> Applications transitioning to architectures that support Unified Memory Mode can be recompiled with <samp class="ph codeph">-⁠gpu=mem:unified</samp> without any code modifications. </p> <p class="p"> The programmer should be aware that in Unified Memory Mode, the whole program state becomes essentially shared between the CPU and the GPU. 
By implication, modifications to program variables made on the GPU are visible on the CPU. That is, the GPU does not operate on a copy of the data, even if the program contains the corresponding data directives; instead, the GPU operates directly on the data in system memory. To understand the importance of this idea, consider the following OpenACC C program: </p><pre xml:space="preserve">
int x[N];

void foo() {
  #pragma acc enter data create(x[0:N])
  #pragma acc parallel loop
  for (int i = 0; i &lt; N; i++) {
    x[i] = i;
  }
}
</pre><p class="p"> When compiled in Separate Memory Mode, in the <samp class="ph codeph">foo()</samp> function a copy of the array <samp class="ph codeph">x</samp> is created in GPU memory and initialized as written in the <samp class="ph codeph">loop</samp> construct. When <samp class="ph codeph">-⁠gpu=mem:unified</samp> is added, however, the compiler ignores the <samp class="ph codeph">acc enter data</samp> construct, and the <samp class="ph codeph">loop</samp> construct initializes the array <samp class="ph codeph">x</samp> in system memory. </p> <p class="p"> Another implication to be aware of is that <em class="ph i">asynchronous</em> code execution on the GPU can introduce race conditions over access to program data. More details about code patterns to avoid when writing application sources for Unified Memory Mode can be found in the sections of this guide about specific programming models, e.g., OpenACC, OpenMP, or CUDA Fortran. </p> </div> </div> </div> </div> <div class="topic concept nested2" id="gpu-mem-poolallocator"><a name="gpu-mem-poolallocator" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gpu-mem-poolallocator" name="gpu-mem-poolallocator" shape="rect">5.4.3. 
Memory Pool Allocator</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> Dynamic memory allocations may be made using <samp class="ph codeph">cudaMallocManaged()</samp>, a routine which has higher overhead than allocating non-managed memory using <samp class="ph codeph">cudaMalloc()</samp>. The more calls to <samp class="ph codeph">cudaMallocManaged()</samp>, the more significant the impact on performance. </p> <p class="p"> To mitigate the overhead of <samp class="ph codeph">cudaMallocManaged()</samp> or other CUDA allocation API calls, there is a pool allocator enabled by default in the presence of the <samp class="ph codeph">-⁠gpu=mem:managed</samp>, <samp class="ph codeph">-⁠gpu=mem:separate:pinnedalloc</samp>, or <samp class="ph codeph">-⁠gpu=mem:unified</samp> compiler options. It can be disabled, or its behavior modified, using these environment variables: </p> <div class="tablenoborder"><a name="gpu-mem-poolallocator__pool-allocator-env-vars" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="gpu-mem-poolallocator__pool-allocator-env-vars" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 14. Pool Allocator Environment Variables</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="42.857142857142854%" id="d54e6898" rowspan="1" colspan="1"> Environment Variable </th> <th class="entry" valign="top" width="57.14285714285714%" id="d54e6901" rowspan="1" colspan="1"> Use </th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e6898" rowspan="1" colspan="1"> NVCOMPILER_ACC_POOL_ALLOC </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6901" rowspan="1" colspan="1"> Disable the pool allocator. The pool allocator is enabled by default; to disable it, set NVCOMPILER_ACC_POOL_ALLOC to 0. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e6898" rowspan="1" colspan="1"> NVCOMPILER_ACC_POOL_SIZE </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6901" rowspan="1" colspan="1"> Set the size of the pool. The default size is 1GB but other sizes (e.g., 2GB, 100MB, 500KB) can be used. The actual pool size is set to the nearest smaller number in the Fibonacci series compared to the provided or default size. If necessary, the pool allocator will add more pools but only up to the NVCOMPILER_ACC_POOL_THRESHOLD value. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e6898" rowspan="1" colspan="1"> NVCOMPILER_ACC_POOL_ALLOC_MAXSIZE </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6901" rowspan="1" colspan="1"> Set the maximum size for allocations. The default maximum size for allocations is 500MB but another size (e.g., 100KB, 10MB, 250MB) can be used as long as it is greater than or equal to 16B. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e6898" rowspan="1" colspan="1"> NVCOMPILER_ACC_POOL_ALLOC_MINSIZE </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6901" rowspan="1" colspan="1"> Set the minimum size for allocation blocks. The default size is 128B but other sizes can be used. The size must be greater than or equal to 16B. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e6898" rowspan="1" colspan="1"> NVCOMPILER_ACC_POOL_THRESHOLD </td> <td class="entry" valign="top" width="57.14285714285714%" headers="d54e6901" rowspan="1" colspan="1"> Set the percentage of total device memory that the pool allocator can occupy. Values from 0 to 100 are accepted. The default value is 50, corresponding to 50% of device memory. 
</td> </tr> </tbody> </table> </div> <div class="note note"><span class="notetitle">Note:</span> Where a size is specified, if the unit suffix (B, KB, MB or GB) is omitted, the value is interpreted as bytes. </div> </div> </div> </div> <div class="topic concept nested2" id="gpu-mem-intercept"><a name="gpu-mem-intercept" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gpu-mem-intercept" name="gpu-mem-intercept" shape="rect">5.4.4. Interception of Deallocations</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> While NVIDIA HPC Compilers facilitate the use of managed or pinned memory automatically, the application must ensure that memory is deallocated using the API that matches the API used to allocate it. For example, if <samp class="ph codeph">cudaMallocManaged</samp> is used to allocate, then <samp class="ph codeph">cudaFree</samp> must be used to deallocate; if <samp class="ph codeph">cudaMallocHost</samp> is used for allocations, <samp class="ph codeph">cudaFreeHost</samp> must be used for deallocations. Understanding this requirement is particularly important when third party or standard libraries are used; these libraries may have been compiled without any memory mode settings, in which case the deallocation routines in the libraries may not match the allocations made. When data is deallocated with a mismatched API call, the application may exhibit undefined behavior, including crashing. To mitigate this issue, the compiler supports an interception mode in which calls to the standard deallocation function (e.g. free in C, delete in C++, or deallocate in Fortran) are inspected by the runtime and, if the memory is not detected as being system-allocated, the runtime replaces the standard deallocation function with the deallocation API corresponding to the allocation scheme in use. 
To activate this interception mode, use the <samp class="ph codeph">-⁠gpu=interceptdeallocations</samp> compiler flag. The interception is enabled by default for Stdpar in the presence of managed memory allocations. To deactivate the interception use the <samp class="ph codeph">-⁠gpu=nointerceptdeallocations</samp> compiler switch. This interception can incur extra runtime overhead. </p> </div> </div> </div> <div class="topic concept nested2" id="gpu-mem-flags"><a name="gpu-mem-flags" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gpu-mem-flags" name="gpu-mem-flags" shape="rect">5.4.5. Command-line Options Selecting Compiler Memory Modes</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> The following table maps the new memory model flags to their deprecated equivalents. </p> <div class="tablenoborder"><a name="gpu-mem-flags__mem-mode-cli" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="gpu-mem-flags__mem-mode-cli" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 15. 
Command-line Options Corresponding to Compiler Memory Modes</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="40%" id="d54e7026" rowspan="1" colspan="1"> Current Flags </th> <th class="entry" valign="top" width="30%" id="d54e7029" rowspan="1" colspan="1"> Deprecated Flags </th> <th class="entry" valign="top" width="30%" id="d54e7032" rowspan="1" colspan="1"> Brief Description </th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:managed</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=managed</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Managed Memory Mode </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:managed -stdpar</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=nounified -stdpar</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Managed Memory Mode </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:unified</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=unified</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Unified Memory Mode </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:unified:managedalloc</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp 
class="ph codeph">-⁠gpu=unified,managed</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Unified Memory Mode, all dynamically allocated data are implicitly in CUDA Managed Memory. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:unified:nomanagedalloc</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=unified,nomanaged</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Unified Memory Mode, CUDA Managed Memory is not used implicitly. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:separate</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=nomanaged</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Separate Memory Mode </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:separate</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=nounified</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Separate Memory Mode </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:separate</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=nomanaged,nounified</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Separate Memory Mode </td> </tr> <tr class="row"> 
<td class="entry" valign="top" width="40%" headers="d54e7026" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=mem:separate:pinnedalloc</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7029" rowspan="1" colspan="1"><samp class="ph codeph">-⁠gpu=pinned</samp></td> <td class="entry" valign="top" width="30%" headers="d54e7032" rowspan="1" colspan="1"> Separate Memory Mode, dynamically allocated data are in CPU pinned memory implicitly. </td> </tr> </tbody> </table> </div> </div> </div> </div> </div> <div class="topic concept nested1" id="fortran-ptrs-on-gpus"><a name="fortran-ptrs-on-gpus" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fortran-ptrs-on-gpus" name="fortran-ptrs-on-gpus" shape="rect">5.5. Fortran pointers in device code</a></h3> <div class="body conbody"> <p class="p"> A Fortran pointer variable is implemented with a pointer and a descriptor, where the descriptor (often called a "dope vector") holds the array bounds and strides for each dimension, among other information, such as the size for each element and whether the pointer is associated. A Fortran scalar pointer has no bounds information, but does have a minimal descriptor. In Fortran, referring to the pointer variable always refers to the pointer target. There is no syntax to explicitly refer to the pointer and descriptor that implement the pointer variable. </p> <p class="p"> Fortran allocatable arrays and variables are implemented much the same way as pointer arrays and variables. Much of the discussion below applies both to allocatables and pointers. </p> <p class="p"> In OpenACC and OpenMP, when a pointer variable reference appears in a data clause, it's the pointer target that gets allocated or moved to device memory. The pointer and descriptor are neither allocated nor moved. 
</p> <p class="p"> When a pointer variable is declared in a module declaration section and appears in an <samp class="ph codeph">!$acc declare create()</samp> or <samp class="ph codeph">!$omp declare target to()</samp> directive, then the pointer and descriptor are statically allocated in device memory. When the pointer variable appears in a data clause, the pointer target is allocated or copied to the device, and the pointer and descriptor are 'attached' to the device copy of the data. If the pointer target is already present in device memory, no new memory is allocated or copied, but the pointer and descriptor are still 'attached', making the pointer valid in device memory. An important side effect of adding <samp class="ph codeph">declare create</samp> in the module declaration section is that when the program executes an 'allocate' statement for the pointer (or allocatable), memory is allocated in both CPU and device memory. This means the newly allocated data is already present in device memory. To get values from CPU to device memory or back, you'll have to use <samp class="ph codeph">update</samp> directives. </p> <p class="p"> When a pointer variable is used in an OpenACC or OpenMP compute construct, the compiler creates a private copy of the pointer and descriptor for each thread, unless the pointer variable was in a module as described above. The private pointer and descriptor will contain information about the device copy of the pointer target. In the compute construct, the pointer variables may be used pretty much as they can in host code outside a compute construct. However, there are some limitations. The program can do a pointer assignment to the pointer, changing the pointer, but that will only change the private pointer for that thread. The modified pointer in the compute construct will not change the corresponding pointer and descriptor in host memory. 
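</p> <p class="p"> The module-level case described above can be sketched roughly as follows. This is a minimal illustration rather than a reference implementation: the module name <samp class="ph codeph">field_data</samp>, the array <samp class="ph codeph">field</samp>, the program name, the file name, and the build line in the comment are all hypothetical and chosen only for this example. </p>

```fortran
! Minimal sketch; all names here are hypothetical.
! Assumed build line: nvfortran -acc declare_create.f90
module field_data
  implicit none
  real, allocatable :: field(:)
  ! The descriptor for 'field' is statically allocated in device memory.
  !$acc declare create(field)
end module field_data

program declare_create_demo
  use field_data
  implicit none
  integer, parameter :: n = 1000
  integer :: i

  ! Because of 'declare create', this allocates host and device memory.
  allocate(field(n))

  ! This assignment initializes the host copy only.
  field = 1.0
  ! Copy the host values to the device copy.
  !$acc update device(field)

  ! 'field' is already present on the device, so no data clause is used.
  !$acc parallel loop
  do i = 1, n
    field(i) = 2.0 * field(i)
  end do

  ! Copy the device results back to the host copy.
  !$acc update host(field)
  print *, 'field(1) =', field(1)

  deallocate(field)
end program declare_create_demo
```

<p class="p"> In this sketch the two <samp class="ph codeph">update</samp> directives are the only points where values move between the host and device copies. The <samp class="ph codeph">allocate</samp> statement creates the device copy but does not initialize it.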
</p> </div> </div> <div class="topic concept nested1" id="calling-routines-on-gpus"><a name="calling-routines-on-gpus" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#calling-routines-on-gpus" name="calling-routines-on-gpus" shape="rect">5.6. Calling routines in a compute kernel</a></h3> <div class="body conbody"> <p class="p"> Explicit interfaces are common in Fortran applications. They are required for GPU programming in the following cases. </p> <ul class="ul"> <li class="li"> Explicit interfaces are required when using OpenACC <samp class="ph codeph">routine bind</samp> or OpenMP <samp class="ph codeph">declare variant</samp>. </li> <li class="li"> Fortran <samp class="ph codeph">do concurrent</samp> requires routines to be <samp class="ph codeph">pure</samp>, which creates the need for an explicit interface. </li> </ul> </div> </div> <div class="topic concept nested1" id="acc-supp-procs-gpus"><a name="acc-supp-procs-gpus" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-supp-procs-gpus" name="acc-supp-procs-gpus" shape="rect">5.7. Supported Processors and GPUs</a></h3> <div class="body conbody"> <p class="p"> This NVIDIA HPC Compilers release supports x86-64, OpenPOWER, and Arm Server CPUs. Cross-compilation across the different families of CPUs is not supported, but you can use the <span class="ph filepath">-⁠tp=&lt;target&gt;</span> flag, as documented in the man pages, to specify a target processor within a family. </p> <p class="p"> To direct the compilers to generate code for NVIDIA GPUs, use the <span class="ph filepath">-⁠acc</span> flag to enable OpenACC directives, the <span class="ph filepath">-⁠mp=gpu</span> flag to enable OpenMP directives, the <span class="ph filepath">-⁠stdpar</span> flag for standard language parallelism, and the <span class="ph filepath">-cuda</span> flag for CUDA Fortran. 
Use the <span class="ph filepath">-⁠gpu</span> flag to select specific options for GPU code generation. You can then use the generated code on any supported system with CUDA installed that has a CUDA-enabled GeForce, Quadro, or Tesla card. </p> <p class="p"> For more information on these flags as they relate to accelerator technology, refer to <a class="xref" href="index.html#acc-cmdln-opts" shape="rect">Compiling an OpenACC Program</a>. </p> <p class="p"> For a complete list of supported CUDA GPUs, refer to the NVIDIA website at: <a class="xref" href="http://www.nvidia.com/object/cuda_learn_products.html" target="_blank" shape="rect">http://www.nvidia.com/object/cuda_learn_products.html</a></p> </div> </div> <div class="topic concept nested1" id="cuda-toolkit-versions"><a name="cuda-toolkit-versions" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#cuda-toolkit-versions" name="cuda-toolkit-versions" shape="rect">5.8. CUDA Versions</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA HPC compilers use components from NVIDIA's CUDA Toolkit to build programs for execution on an NVIDIA GPU. The NVIDIA HPC SDK puts the CUDA Toolkit components into an HPC SDK installation sub-directory; the HPC SDK currently bundles two versions of recently-released Toolkits. </p> <p class="p"> You can compile a program for an NVIDIA GPU on any system supported by the HPC compilers. You will be able to run that program only on a system with an NVIDIA GPU and an installed NVIDIA CUDA driver. NVIDIA HPC SDK products do not contain CUDA device drivers. You must download and install the appropriate <a class="xref" href="http://www.nvidia.com/cuda" target="_blank" shape="rect">CUDA Driver from NVIDIA</a>. </p> <p class="p"> The NVIDIA HPC SDK utility <samp class="ph codeph">nvaccelinfo</samp> prints the driver version as its first line of output. You can use it to find out which version of the CUDA Driver is installed on your system. 
</p> <div class="p"> The NVIDIA HPC SDK <span class="keyword">24.11</span> includes components from the following versions of the CUDA Toolkit: <ul class="ul"> <li class="li"> CUDA 11.8 </li> <li class="li"> CUDA 12.4 </li> </ul> </div> <p class="p"> If you are compiling a program for GPU execution on a system <em class="ph i">without</em> an installed CUDA driver, the compiler selects the version of the CUDA Toolkit to use based on the value of the <samp class="ph codeph">DEFCUDAVERSION</samp> variable contained in a file called <samp class="ph codeph">localrc</samp> which is created during installation of the HPC SDK. </p> <p class="p"> If you are compiling a program for GPU execution on a system <em class="ph i">with</em> an installed CUDA driver, the compiler detects the version of the CUDA driver and selects the appropriate CUDA Toolkit version to use from those bundled with the HPC SDK. </p> <p class="p"> The compilers look for a CUDA Toolkit version in the /opt/nvidia/hpc_sdk/<em class="ph i">target</em>/<span class="keyword">24.11</span>/cuda directory that matches the version of the CUDA Driver installed on the system. If an exact match is not found, the compiler searches for the closest match. For CUDA Driver versions 11.2 through 11.8, the compiler will use the CUDA 11.8 Toolkit. For CUDA Driver versions 12.0 and later, the compiler will use the newest CUDA 12.x Toolkit. </p> <div class="p"> You can change the compiler's default selection of CUDA Toolkit version using a compiler option. Add the <samp class="ph codeph">cudaX.Y</samp> sub-option to <samp class="ph codeph">-⁠gpu</samp> where <samp class="ph codeph">X.Y</samp> denotes the CUDA version. Using a compiler option changes the CUDA Toolkit version for one invocation of the compiler. 
For example, to compile an OpenACC C file with the CUDA 11.8 Toolkit you would use: <pre xml:space="preserve">nvc -acc -gpu=cuda11.8</pre></div> </div> </div> <div class="topic concept nested1" id="compute-cap"><a name="compute-cap" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#compute-cap" name="compute-cap" shape="rect">5.9. Compute Capability</a></h3> <div class="body conbody"> <p class="p"> The compilers can generate code for NVIDIA GPU compute capabilities 3.5 through 8.6. The compilers construct a default list of compute capabilities that matches the compute capabilities supported by the GPUs found on the system used in compilation. If there are no GPUs detected, the compilers generate code for every supported compute capability. </p> <p class="p"> You can override the default by specifying one or more compute capabilities using either command-line options or an <span class="ph filepath">rcfile</span>. </p> <p class="p"> To change the default with a command-line option, provide a comma-separated list of compute capabilities to the <span class="ph filepath">-⁠gpu</span> option. </p> <p class="p"> To change the default with an <span class="ph filepath">rcfile</span>, set the <samp class="ph codeph">DEFCOMPUTECAP</samp> value to a blank-separated list of compute capabilities in the siterc file located in your installation's bin directory: </p><pre xml:space="preserve">set DEFCOMPUTECAP=60 70;</pre><p class="p"> Alternatively, if you don't have permissions to change the <span class="ph filepath">siterc</span> file, you can add the <samp class="ph codeph">DEFCOMPUTECAP</samp> definition to a separate <span class="ph filepath">.mynvrc</span> file in your home directory. </p> <p class="p"> The generation of device code can be time consuming, so you may notice an increase in compile time as the number of compute capabilities increases. 
</p> </div> </div> <div class="topic concept nested1" id="ptx-jit"><a name="ptx-jit" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#ptx-jit" name="ptx-jit" shape="rect">5.10. PTX JIT Compilation</a></h3> <div class="body conbody"> <p class="p"> As of HPC SDK 22.9, support for PTX JIT compilation is enabled in all compilers for relocatable device code mode. This means that applications built with <span class="ph filepath">-gpu=rdc</span> (that is, with relocatable device code enabled, which is the default mode) are forward-compatible with newer GPUs thanks to the embedded PTX code. The embedded PTX code is dynamically compiled when the application runs on a GPU architecture newer than the architecture specified at compile time. </p> <p class="p"> The support for PTX JIT compilation is enabled automatically, which means that you do not need to change the compiler invocation command lines for your existing projects. </p> <div class="section" id="ptx-jit__use-scenarios"><a name="ptx-jit__use-scenarios" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Use scenarios</h3> <div class="p"> <ul class="ul"> <li class="li"> As an example, you can compile your application targeting the Ampere GPU without having to worry about the Hopper GPU architecture. Once the application runs on a Hopper GPU, it will seamlessly use the embedded PTX code. </li> <li class="li"> In CUDA Fortran, or with the CUDA Interoperability mode enabled, you can mix in object files compiled with the CUDA NVCC compiler containing PTX code. This PTX code from NVCC will be handled by the JIT compiler alongside the PTX code contained in object files produced by the HPC SDK compilers. 
When using the CUDA NVCC compiler, the relocatable device code generation must be enabled explicitly using the NVCC <span class="ph filepath">--relocatable-device-code true</span> switch, as explained in the <a class="xref" href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda" target="_blank" shape="rect">CUDA Compiler Driver guide</a>. For information about CUDA Interoperability, please refer to <a class="xref" href="https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#openmp-interop-cuda" target="_blank" shape="rect">https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#openmp-interop-cuda</a>. The CUDA Fortran Programming Guide is available here: <a class="xref" href="https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide" target="_blank" shape="rect">https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide</a>. </li> </ul> </div> <p class="p"> By default, the compiler will choose the compute capability that matches the GPU on the system where the code is being compiled. For code that is going to run on the system where it is compiled, we recommend letting the compiler set the compute capability. </p> <p class="p"> When the default won’t work, we recommend compiling applications for a range of compute capabilities that the application is expected to run against, for example, using the <span class="ph filepath">-gpu=ccall</span> compiler option. When running the application on a system that supports one of those compute capabilities, the CUDA driver minor version is allowed to be less than the version of the CUDA toolkit used at compile time, as covered in section <a class="xref" href="index.html#cuda-toolkit-versions" shape="rect">CUDA Versions</a>. 
</p> </div> <div class="section" id="ptx-jit__performance-considerations"><a name="ptx-jit__performance-considerations" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Performance considerations</h3> <p class="p"> PTX JIT compilation, when it occurs, adds start-up overhead to the application. The JIT compiler keeps a cached copy of the produced device code, which reduces the overhead on subsequent runs. Please refer to the <a class="xref" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#just-in-time-compilation" target="_blank" shape="rect">CUDA Programming Guide</a> for detailed information about how the JIT compiler works. </p> </div> <div class="section" id="ptx-jit__known-limitations"><a name="ptx-jit__known-limitations" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Known limitations</h3> <p class="p"> In general, for PTX JIT compilation to work, the CUDA driver installed on the deployment system must be at least as new as the CUDA toolkit used to compile the application. This requirement is stricter than those explained in section <a class="xref" href="index.html#cuda-toolkit-versions" shape="rect">CUDA Versions</a>. </p> <p class="p"> For example, as explained in that section, the compilers will use the CUDA 11.8 toolkit that is shipped as part of the HPC SDK when the CUDA driver installed on the system is at least 11.2. However, while the CUDA 11.2 driver is usually sufficient to run the application, it will not be able to compile the PTX code produced by the CUDA 11.8 toolkit. This means that any deployment system where PTX JIT compilation is expected to be used must have at least the CUDA 11.8 driver installed. 
Please refer to the <a class="xref" href="https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-considerations" target="_blank" shape="rect">CUDA Compatibility</a> guide for further information about CUDA Driver compatibility with CUDA Toolkits. </p> <p class="p"> When the application is expected to run on a newer GPU architecture than specified at compile time, we recommend having a CUDA driver installed on the deployment system matching the CUDA toolkit used to build the application. One way to achieve that is to use the <samp class="ph codeph">NVHPC_CUDA_HOME</samp> environment variable at compile time to provide a specific CUDA toolkit. </p> <p class="p"> Below are a few examples of how the PTX version incompatibility can be diagnosed and fixed. As a general rule, if the CUDA driver is unable to run the application due to incompatible PTX, the application will terminate with an error message indicating the cause. OpenACC and OpenMP applications will in most cases suggest compiler flags to target the current CUDA installation. </p> </div> <div class="section" id="ptx-jit__openacc"><a name="ptx-jit__openacc" shape="rect"> <!-- --></a><h3 class="title sectiontitle">OpenACC</h3> <p class="p"> Consider this program, which we will compile for a Volta GPU and attempt to run on an Ampere GPU, on a system that has CUDA 11.5 installed: </p><pre xml:space="preserve">
#include &lt;stdio.h&gt;
#define N 1000
int array[N];
int main() {
#pragma acc parallel loop copy(array[0:N])
   for(int i = 0; i &lt; N; i++) {
       array[i] = 3;
   }
   printf("Success!\n");
}
</pre> When we build the program, the HPC SDK selects the bundled CUDA 11.8 toolkit by default. 
When we attempt to run it, it fails because code generated with the CUDA 11.8 toolkit does not work with the CUDA 11.5 driver: <pre xml:space="preserve">
$ nvc -acc -gpu=cc70 app.c
$ ./a.out
Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc70
Rebuild this file with -gpu=cc80 to use NVIDIA Tesla GPU 0
 File: /tmp/app.c
 Function: main:3
 Line: 3
</pre> The error message shows that the Volta GPU code cannot be executed on the current system. The embedded Volta PTX could not be compiled, which implies a CUDA driver incompatibility. One way to fix this is to use the installed CUDA 11.5 toolkit at compile time: <pre xml:space="preserve">
$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5
$ nvc -acc -gpu=cc70 app.c
$ ./a.out
Success!
</pre></div> <div class="section" id="ptx-jit__openmp"><a name="ptx-jit__openmp" shape="rect"> <!-- --></a><h3 class="title sectiontitle">OpenMP</h3> Likewise, an OpenMP program will compile but not run: <pre xml:space="preserve">
#include &lt;stdio.h&gt;
#define N 1000
int array[N];
int main() {
#pragma omp target loop
   for(int i = 0; i &lt; N; i++) {
       array[i] = 0;
   }
   printf("Success!\n");
}
</pre><pre xml:space="preserve">
$ nvc -mp=gpu -gpu=cc70 app.c
$ ./a.out
Accelerator Fatal Error: Failed to find device function 'nvkernel_main_F1L3_2'! File was compiled with: -gpu=cc70
Rebuild this file with -gpu=cc80 to use NVIDIA Tesla GPU 0
 File: /tmp/app.c
 Function: main:3
 Line: 3
</pre><p class="p"> We can fix this in the same way, by having <samp class="ph codeph">NVHPC_CUDA_HOME</samp> point at the matching CUDA toolkit location: </p><pre xml:space="preserve">
$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5
$ nvc -mp=gpu -gpu=cc70 app.c
$ ./a.out
Success!
</pre></div> <div class="section" id="ptx-jit__cplusplus"><a name="ptx-jit__cplusplus" shape="rect"> <!-- --></a><h3 class="title sectiontitle">C++</h3> <p class="p"> In contrast to OpenACC and OpenMP applications, which simply terminate when PTX JIT encounters an insufficient CUDA driver version, C++ applications throw a system exception when there is a PTX incompatibility: </p><pre xml:space="preserve">
#include &lt;vector&gt;
#include &lt;algorithm&gt;
#include &lt;execution&gt;
#include &lt;iostream&gt;
#include &lt;assert.h&gt;

int main() {
    std::vector&lt;int&gt; x(1000, 0);
    x[1] = -20;
    auto result = std::count(std::execution::par, x.begin(), x.end(), -20);
    assert(result == 1);
    std::cout &lt;&lt; "Success!" &lt;&lt; std::endl;
}
</pre><pre xml:space="preserve">
$ nvc++ -stdpar -gpu=cc70 app.cpp
$ ./a.out
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  after reduction step 1: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
Aborted (core dumped)
</pre><p class="p"> The exception message refers directly to incompatible PTX, which in turn implies a mismatch between the CUDA toolkit and the CUDA driver version. </p> <p class="p"> We can fix it similarly by setting <samp class="ph codeph">NVHPC_CUDA_HOME</samp>: </p><pre xml:space="preserve">
$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5
$ nvc++ -stdpar -gpu=cc70 app.cpp
$ ./a.out
Success!
</pre></div> </div> </div> </div> <div class="topic concept nested0" id="acc-use"><a name="acc-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#acc-use" name="acc-use" shape="rect">6. Using OpenACC</a></h2> <div class="body conbody"> <p class="p"> This chapter gives an overview of directive-based OpenACC programming, in which compiler directives are used to specify regions of code in Fortran, C, and C++ programs to be offloaded from a <em class="ph i">host</em> CPU to an NVIDIA GPU. 
For complete details on using OpenACC with NVIDIA GPUs, see the <em class="ph i">OpenACC Getting Started Guide</em>. </p> </div> <div class="topic concept nested1" id="acc-directives"><a name="acc-directives" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-directives" name="acc-directives" shape="rect">6.1. OpenACC Programming Model</a></h3> <div class="body conbody"> <p class="p"> With the emergence of GPU architectures in high-performance computing, programmers want the ability to program using a familiar, high-level programming model that provides both high performance and portability to a wide range of computing architectures. OpenACC emerged in 2011 as a programming model that uses high-level compiler directives to expose parallelism in the code and parallelizing compilers to build the code for a variety of parallel accelerators. </p> <p class="p"> This chapter will not attempt to describe OpenACC itself. For that, please refer to the OpenACC specification on the <a class="xref" href="http://www.openacc.org" target="_blank" shape="rect">www.openacc.org</a> website. Here, we will discuss differences between the OpenACC specification and its implementation by the NVIDIA HPC Compilers. </p> <p class="p"> Other resources to help you with your parallel programming, including video tutorials, course materials, code samples, and a best practices guide, are available on the OpenACC website. </p> </div> <div class="topic concept nested2" id="acc-exe-levl-parallel"><a name="acc-exe-levl-parallel" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-exe-levl-parallel" name="acc-exe-levl-parallel" shape="rect">6.1.1. 
Levels of Parallelism</a></h3> <div class="body conbody"> <p class="p">OpenACC supports three levels of parallelism: </p> <ul class="ul"> <li class="li">an outer <em class="ph i">doall</em> (fully parallel) loop level </li> <li class="li">a <em class="ph i">workgroup</em> or <em class="ph i">threadblock</em> (worker parallel) loop level </li> <li class="li">an inner <em class="ph i">synchronous</em> (SIMD or vector) loop level </li> </ul> <p class="p">Each level can be multidimensional with 2 or 3 dimensions, but the domain must be strictly rectangular. The <em class="ph i">synchronous</em> level may not be fully implemented with SIMD or vector operations, so explicit synchronization is supported and required across this level. No synchronization is supported between parallel threads across the <em class="ph i">doall</em> level. </p> <p class="p">The OpenACC execution model on the device side exposes these levels of parallelism and the programmer is required to understand the difference between, for example, a fully parallel loop and a loop that is vectorizable but requires synchronization across iterations. All fully parallel loops can be scheduled for any of <em class="ph i">doall</em>, <em class="ph i">workgroup</em> or <em class="ph i">synchronous</em> parallel execution, but by definition SIMD vector loops that require synchronization can only be scheduled for synchronous parallel execution. </p> </div> </div> <div class="topic concept nested2" id="acc-dir-enable"><a name="acc-dir-enable" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-dir-enable" name="acc-dir-enable" shape="rect">6.1.2. Enable OpenACC Directives</a></h3> <div class="body conbody"> <p class="p">NVIDIA HPC compilers enable OpenACC directives with the <span class="ph filepath">-acc</span> and <span class="ph filepath">-gpu</span> command line options. 
For more information on these options refer to <a class="xref" href="index.html#acc-cmdln-opts" shape="rect">Compiling an OpenACC Program</a>. </p> <div class="section"> <h4 class="title sectiontitle">_OPENACC macro</h4> <p class="p">The <span class="ph filepath">_OPENACC</span> macro name is defined to have a value <span class="ph filepath">yyyymm</span> where <span class="ph filepath">yyyy</span> is the year and <span class="ph filepath">mm</span> is the month designation of the version of the OpenACC directives supported by the implementation. For example, the version for November, 2017 is 201711. All OpenACC compilers define this macro when OpenACC directives are enabled. </p> </div> </div> </div> <div class="topic concept nested2" id="acc-dir-openacc-sum"><a name="acc-dir-openacc-sum" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-dir-openacc-sum" name="acc-dir-openacc-sum" shape="rect">6.1.3. OpenACC Support</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA HPC Compilers implement most features of OpenACC 2.7 as defined in <cite class="cite">The OpenACC Application Programming Interface</cite>, Version 2.7, November 2018, <a class="xref" href="http://www.openacc.org" target="_blank" shape="rect">http://www.openacc.org</a>, with the exception that the following OpenACC 2.7 features are not supported: </p> <ul class="ul"> <li class="li"> nested parallelism </li> <li class="li"> declare link </li> <li class="li"> enforcement of the <samp class="ph codeph">cache</samp> clause restriction that all references to listed variables must lie within the region being cached </li> <li class="li"> Subarrays and composite variables in <samp class="ph codeph">reduction</samp> clauses </li> <li class="li"> The <samp class="ph codeph">self</samp> clause </li> <li class="li"> The <samp class="ph codeph">default</samp> clause on data constructs </li> </ul> </div> </div> <div class="topic concept nested2" id="acc-dir-clauses"><a 
name="acc-dir-clauses" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-dir-clauses" name="acc-dir-clauses" shape="rect">6.1.4. OpenACC Extensions</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA Fortran compiler supports an extension to the <span class="ph filepath">collapse</span> clause on the <span class="ph filepath">loop</span> construct. The OpenACC specification defines <span class="ph filepath">collapse</span>: </p><pre xml:space="preserve">collapse(n)</pre><p class="p"> NVIDIA Fortran supports the use of the identifier <span class="ph filepath">force</span> within <span class="ph filepath">collapse</span>: </p><pre xml:space="preserve">collapse(force:n)</pre><p class="p"> Using <span class="ph filepath">collapse(force:n)</span> instructs the compiler to enforce collapsing parallel loops that are not perfectly nested. </p> </div> </div> </div> <div class="topic concept nested1" id="acc-cmdln-opts"><a name="acc-cmdln-opts" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-cmdln-opts" name="acc-cmdln-opts" shape="rect">6.2. Compiling an OpenACC Program</a></h3> <div class="body conbody"> <p class="p"> Several compiler options are applicable specifically when working with OpenACC. These options include <span class="ph filepath">-acc</span>, <span class="ph filepath">-gpu</span>, and <span class="ph filepath">-Minfo</span>. </p> </div> <div class="topic reference nested2" id="acc"><a name="acc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc" name="acc" shape="rect">6.2.1. -[no]acc</a></h3> <div class="body refbody"> <div class="section"> <p class="p"> Enable [disable] OpenACC directives. The following suboptions may be used following an equals sign ("="), with multiple sub-options separated by commas: </p> <dl class="dl"> <dt class="dt dlterm">gpu</dt> <dd class="dd"> OpenACC directives are compiled for GPU execution only. 
</dd> <dt class="dt dlterm">host</dt> <dd class="dd"> Compile for serial execution on the host CPU. </dd> <dt class="dt dlterm">multicore</dt> <dd class="dd"> Compile for parallel execution on the host CPU. </dd> <dt class="dt dlterm">legacy</dt> <dd class="dd"> Suppress warnings about deprecated NVIDIA accelerator directives. </dd> <dt class="dt dlterm">[no]autopar</dt> <dd class="dd"> Enable [disable] loop autoparallelization within acc parallel. The default is to enable loop autoparallelization. </dd> <dt class="dt dlterm">[no]routineseq</dt> <dd class="dd"> Compile every routine for the device. By default, routines are not treated as if they were marked with a seq routine directive. </dd> <dt class="dt dlterm">strict</dt> <dd class="dd"> Instructs the compiler to issue warnings for non-OpenACC accelerator directives. </dd> <dt class="dt dlterm">sync</dt> <dd class="dd"> Ignore async clauses. </dd> <dt class="dt dlterm">verystrict</dt> <dd class="dd"> Instructs the compiler to fail with an error for any non-OpenACC accelerator directive. </dd> <dt class="dt dlterm">[no]wait</dt> <dd class="dd"> Wait for each device kernel to finish. Kernel launches block by default unless the async clause is used. </dd> </dl> </div> <div class="section" id="acc__default"><a name="acc__default" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Default</h4> <p class="p">By default, OpenACC directives are compiled for GPU and sequential CPU host execution (that is, equivalent to explicitly setting <samp class="ph codeph">-acc=gpu,host</samp>). </p> </div> <div class="section" id="acc__usage"><a name="acc__usage" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Usage</h4> <p class="p"> The following command line requests that OpenACC directives be enabled and that an error be issued for any non-OpenACC accelerator directive. 
</p><pre xml:space="preserve">$ nvfortran -acc=verystrict prog.f</pre></div> <div class="section" id="acc__macros"><a name="acc__macros" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Predefined Macros</h4> <div class="p">The following macros corresponding to the target compiled for are added implicitly: <ul class="ul"> <li class="li"><samp class="ph codeph">__NVCOMPILER_OPENACC_GPU</samp> when the OpenACC directives are compiled for GPU. </li> <li class="li"><samp class="ph codeph">__NVCOMPILER_OPENACC_MULTICORE</samp> when the OpenACC directives are compiled for multicore CPU. </li> <li class="li"><samp class="ph codeph">__NVCOMPILER_OPENACC_HOST</samp> when the OpenACC directives are compiled for serial execution on CPU. </li> </ul> </div> </div> </div> </div> <div class="topic reference nested2" id="gpu"><a name="gpu" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#gpu" name="gpu" shape="rect">6.2.2. -gpu</a></h3> <div class="body refbody"> <div class="section"> <p class="p"> Used in combination with the -⁠acc, -⁠cuda, -⁠mp, and -⁠stdpar flags to specify options for GPU code generation. The following sub-options may be used following an equals sign ("="), with multiple sub-options separated by commas: </p> <dl class="dl"> <dt class="dt dlterm">autocompare</dt> <dd class="dd">Automatically compare CPU vs GPU results at execution time: implies redundant</dd> <dt class="dt dlterm"> ccXY </dt> <dd class="dd"> Generate code for a device with compute capability X.Y. Multiple compute capabilities can be specified, and one version will be generated for each. By default, the compiler will detect the compute capability for each installed GPU. Use -⁠help -⁠gpu to see the valid compute capabilities for your installation. </dd> <dt class="dt dlterm"><span class="keyword option">ccall</span></dt> <dd class="dd"> Generate code for all compute capabilities supported by this platform and by the selected or default CUDA Toolkit. 
</dd> <dt class="dt dlterm"><span class="keyword option">ccall-major</span></dt> <dd class="dd"> Compile for all major supported compute capabilities. </dd> <dt class="dt dlterm"><span class="keyword option">ccnative</span></dt> <dd class="dd"> Detect the visible GPUs on the system and generate code for them. If no device is available, the compute capability matching NVCC's default will be used. </dd> <dt class="dt dlterm">cudaX.Y</dt> <dd class="dd">Use CUDA X.Y Toolkit compatibility, where installed</dd> <dt class="dt dlterm">[no]debug</dt> <dd class="dd">Enable [disable] debug information generation in device code</dd> <dt class="dt dlterm">deepcopy</dt> <dd class="dd">Enable full deep copy of aggregate data structures in OpenACC; Fortran only </dd> <dt class="dt dlterm">fastmath</dt> <dd class="dd">Use routines from the fast math library</dd> <dt class="dt dlterm">[no]flushz</dt> <dd class="dd">Enable [disable] flush-to-zero mode for floating-point computations on the GPU </dd> <dt class="dt dlterm">[no]fma</dt> <dd class="dd">Generate [do not generate] fused multiply-add instructions; default at <samp class="ph codeph">-⁠O3</samp></dd> <dt class="dt dlterm">[no]implicitsections</dt> <dd class="dd"> Change [Do not change] array element references in a data clause into an array section. In C⁠+⁠+, the <samp class="ph codeph">implicitsections</samp> option will change <samp class="ph codeph">update device(a[n])</samp> to <samp class="ph codeph">update device(a[0:n])</samp>. In Fortran, it will change <samp class="ph codeph">enter data copyin(a(n))</samp> to <samp class="ph codeph">enter data copyin(a(:n))</samp>. The default behavior, noimplicitsections, can also be changed using rcfiles; for example, one could add <samp class="ph codeph">set IMPLICITSECTIONS=0;</samp> to siterc or another rcfile. </dd> <dt class="dt dlterm">[no]interceptdeallocations</dt> <dd class="dd"> Intercept [Do not intercept] calls to standard library memory deallocations (e.g. 
<samp class="ph codeph">free</samp>) and call the corresponding CUDA memory deallocation version if address is in pinned or managed memory, regular version otherwise. </dd> <dt class="dt dlterm">keep</dt> <dd class="dd">Keep the kernel files (.cubin, .ptx, source)</dd> <dt class="dt dlterm">[no]lineinfo</dt> <dd class="dd">Enable [disable] GPU line information generation</dd> <dt class="dt dlterm">loadcache:{L1|L2}</dt> <dd class="dd">Choose what hardware level cache to use for global memory loads; options include the default, <samp class="ph codeph">L1</samp>, or <samp class="ph codeph">L2</samp></dd> <dt class="dt dlterm">[no]managed</dt> <dd class="dd"> Allocate [do not allocate] any dynamically allocated data in CUDA Managed memory. Use <samp class="ph codeph">-⁠gpu=nomanaged</samp> with <samp class="ph codeph">-⁠stdpar</samp> to prevent that flag's implicit use of <samp class="ph codeph">-⁠gpu=managed</samp> when CUDA Managed memory capability is detected. This option is deprecated. </dd> <dt class="dt dlterm">maxregcount:n</dt> <dd class="dd">Specify the maximum number of registers to use on the GPU; leaving this blank indicates no limit </dd> <dt class="dt dlterm">mem:{separate|managed|unified}</dt> <dd class="dd"> Select GPU memory mode for the generated binary. This controls CUDA memory capability to be utilised such as separate GPU memory only (<samp class="ph codeph">separate</samp>), GPU Managed Memory for the dynamically allocated data (<samp class="ph codeph">managed</samp>), or system memory aka full CUDA Unified Memory (<samp class="ph codeph">unified</samp>). Use of Managed or Unified Memory facilitates simpler programming by eliminating the need to detect all data to be copied into and outside of the code region executing on the GPU. </dd> <dt class="dt dlterm">pinned</dt> <dd class="dd">Use CUDA Pinned Memory. 
This option is deprecated.</dd> <dt class="dt dlterm">ptxinfo</dt> <dd class="dd">Print PTX info</dd> <dt class="dt dlterm">[no]rdc</dt> <dd class="dd">Generate [do not generate] relocatable device code.</dd> <dt class="dt dlterm">redundant</dt> <dd class="dd">Redundant CPU/GPU execution</dd> <dt class="dt dlterm">safecache</dt> <dd class="dd">Allow variable-sized array sections in cache directives; compiler assumes they fit into CUDA shared memory </dd> <dt class="dt dlterm"> sm_XY </dt> <dd class="dd"> Generate code for a device with compute capability X.Y. Multiple compute capabilities can be specified, and one version will be generated for each. By default, the compiler will detect the compute capability for each installed GPU. Use -⁠help -⁠gpu to see the valid compute capabilities for your installation. </dd> <dt class="dt dlterm">stacklimit:&lt;l&gt;|nostacklimit</dt> <dd class="dd">Sets the limit (l) of stack variables in a procedure or kernel, in KB. This option is deprecated. </dd> <dt class="dt dlterm">[no]unified</dt> <dd class="dd"> Compile [do not compile] for CUDA Unified memory capability, where system memory is accessible from the GPU. This mode utilizes system and managed memory for dynamically allocated data unless explicit behavior is set through <samp class="ph codeph">-⁠gpu=[no]managed</samp>. Use <samp class="ph codeph">-⁠gpu=nounified</samp> with <samp class="ph codeph">-⁠stdpar</samp> to prevent that flag's implicit use of <samp class="ph codeph">-⁠gpu=unified</samp> when CUDA Unified memory capability is detected. This option must appear in both the compile and link lines. This option is deprecated. 
</dd> <dt class="dt dlterm">[no]unroll</dt> <dd class="dd">Enable [disable] automatic inner loop unrolling; default at <samp class="ph codeph">-⁠O3</samp></dd> <dt class="dt dlterm">zeroinit</dt> <dd class="dd">Initialize allocated device memory with zero</dd> </dl> </div> <div class="section" id="gpu__usage"><a name="gpu__usage" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Usage</h4> <p class="p"> In the following example, the compiler generates code for NVIDIA GPUs with compute capabilities 6.0 and 7.0. </p><pre xml:space="preserve">$ nvfortran -acc -gpu=cc60,cc70 myprog.f</pre><p class="p"> The compiler automatically invokes the necessary software tools to create the kernel code and embeds the kernels in the object file. </p> <p class="p"> To link in the appropriate GPU libraries, you must link an OpenACC program with the <samp class="ph codeph">-⁠acc</samp> flag, and similarly for -⁠cuda, -⁠mp, or -⁠stdpar. </p> </div> <div class="section" id="gpu__dwarf-debugging-formats"><a name="gpu__dwarf-debugging-formats" shape="rect"> <!-- --></a><h4 class="title sectiontitle">DWARF Debugging Formats</h4> <p class="p"> Use the <samp class="ph codeph">-⁠g</samp> option to enable generation of full DWARF information on both the host and device; in the absence of other optimization flags, <samp class="ph codeph">-⁠g</samp> sets the optimization level to zero. If a <samp class="ph codeph">-⁠O</samp> option raises the optimization level to one or higher, only GPU line information is generated in device code even when <samp class="ph codeph">-⁠g</samp> is specified. To enforce full DWARF generation for device code at optimization levels above zero, use the <samp class="ph codeph">debug</samp> sub-option to <samp class="ph codeph">-⁠gpu</samp>. Conversely, to prevent the generation of DWARF information for device code, use the <samp class="ph codeph">nodebug</samp> sub-option to <samp class="ph codeph">-⁠gpu</samp>. 
Both <samp class="ph codeph">debug</samp> and <samp class="ph codeph">nodebug</samp> can be used independently of <samp class="ph codeph">-⁠g</samp>. </p> </div> </div> </div> </div> <div class="topic concept nested1" id="multicore"><a name="multicore" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#multicore" name="multicore" shape="rect">6.3. OpenACC for Multicore CPUs</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> The NVIDIA OpenACC compilers support the option <span class="ph filepath">-acc=multicore</span>, to set the target accelerator for OpenACC programs to the host multicore CPU. This will compile OpenACC compute regions for parallel execution across the cores of the host processor or processors. The host multicore CPU will be treated as a shared-memory accelerator, so the data clauses (<samp class="ph codeph">copy</samp>, <samp class="ph codeph">copyin</samp>, <samp class="ph codeph">copyout</samp>, <samp class="ph codeph">create</samp>) will be ignored and no data copies will be executed. </p> <p class="p"> By default, <span class="ph filepath">-acc=multicore</span> will generate code that will use all the available cores of the processor. If the compute region specifies a value in the <samp class="ph codeph">num_gangs</samp> clause, the minimum of the <samp class="ph codeph">num_gangs</samp> value and the number of available cores will be used. At runtime, the number of cores can be limited by setting the environment variable <samp class="ph codeph">ACC_NUM_CORES</samp> to a constant integer value. The number of cores can also be set with the <samp class="ph codeph">void acc_set_num_cores(int numcores)</samp> runtime call. If an OpenACC compute construct appears lexically within an OpenMP parallel construct, the OpenACC compute region will generate sequential code. 
If an OpenACC compute region appears dynamically within an OpenMP region or another OpenACC compute region, the program may generate many more threads than there are cores, and may produce poor performance. </p> <p class="p"> The <span class="ph filepath">-acc=multicore</span> option differs from the <span class="ph filepath">-acc=host</span> option in that <span class="ph filepath">-acc=host</span> generates sequential host CPU code for the OpenACC compute regions. </p> </div> </div> </div> <div class="topic concept nested1" id="acc-openacc-unified-mem"><a name="acc-openacc-unified-mem" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-openacc-unified-mem" name="acc-openacc-unified-mem" shape="rect">6.4. OpenACC with CUDA Unified Memory</a></h3> <div class="body conbody"> <p class="p"> When developing OpenACC source for a target supporting CUDA Unified Memory, you can take advantage of a simplified approach to programming because there is no need for data clauses and directives, either in full or in part, depending on the exact memory capability the target supports and the compiler options used. </p> <p class="p"> The discussion in this section assumes you have become familiar with the Separate, Managed, and Unified Memory Modes covered in the <a class="xref" href="index.html#acc-mem-model" shape="rect">Memory Model</a> and <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a> sections. </p> <p class="p"> In Managed Memory Mode, only dynamically-allocated data are implicitly managed by the CUDA runtime; OpenACC data clauses and directives are therefore not needed for movement of this "managed" data. Data clauses and directives are still required to handle static data (C static and extern variables, Fortran module, common block and save variables) and function local data. </p> <p class="p"> In Unified Memory Mode, all data is managed by the CUDA runtime. 
Explicit data clauses and directives are no longer required to indicate which data should reside in GPU memory. All variables are accessible from the OpenACC compute regions executing on the GPU. The NVHPC compiler implementation closely adheres to the shared memory mode detailed in the OpenACC specification, meaning that <samp class="ph codeph">copy</samp>, <samp class="ph codeph">copyin</samp>, <samp class="ph codeph">copyout</samp>, and <samp class="ph codeph">create</samp> clauses will not result in any device allocation or data transfer. The <samp class="ph codeph">device_resident</samp> clause is still honored as in Separate Memory Mode and results in an allocation of data only accessible from device code. Device memory can also be allocated or deallocated in OpenACC programs in Unified Memory Mode by using the <samp class="ph codeph">acc_malloc</samp> or <samp class="ph codeph">acc_free</samp> API calls. </p> <div class="section"> <h3 class="title sectiontitle">Understanding Data Movement</h3> <p class="p"> In the absence of visible data clauses or directives, when the compiler encounters a compute construct it attempts to determine what data is required for correct execution of the region on the GPU. When the compiler is unable to determine the size and shape of data needing to be accessible on the device, it behaves as follows: </p> <ul class="ul"> <li class="li"> In Separate Memory Mode, the compiler emits an error requesting an explicit data clause be added to specify size/shape of the data to be copied. </li> <li class="li"> In Managed Memory Mode (<samp class="ph codeph">-⁠gpu=mem:managed</samp>), the compiler assumes the data is allocated in managed memory and thus is accessible from the device; if this assumption is wrong, for example because the data was defined globally or is located on the CPU stack, the program may fail at runtime. 
</li> <li class="li"> In Unified Memory Mode (<samp class="ph codeph">-⁠gpu=mem:unified</samp>), all data is accessible from the device, making information about size and shape unnecessary. </li> </ul> <p class="p"> Take the following example in C: </p><pre xml:space="preserve">
void set(int* ptr, int i, int j, int dim){
  int idx = i * dim + j;
  ptr[idx] = someval(i, j);
}

void fill2d(int* ptr, int dim){
#pragma acc parallel loop
  for (int i = 0; i &lt; dim; i++)
    for (int j = 0; j &lt; dim; j++)
      set(ptr, i, j, dim);
}
</pre><p class="p"> In Separate Memory Mode, the only way to guarantee correctness for this example is to change the line with the <samp class="ph codeph">acc</samp> directive as follows: </p><pre xml:space="preserve">
#pragma acc parallel loop create(ptr[0:dim*dim]) copyout(ptr[0:dim*dim])
</pre><p class="p"> This change explicitly instructs the OpenACC implementation about the precise data segment used within the parallel loop. </p> <p class="p"> In Unified Memory Mode, that is, by compiling with <samp class="ph codeph">-⁠acc -⁠gpu=mem:unified</samp> and executing on a platform with unified memory capability, the <samp class="ph codeph">create</samp> and <samp class="ph codeph">copyout</samp> clauses are not required. </p> <p class="p"> The next example, in Fortran, illustrates how a global variable can be accessed in an OpenACC routine without requiring any explicit annotation. </p><pre xml:space="preserve">
module m
integer :: globmin = 1234
contains
subroutine findmin(a)
!$acc routine seq
  integer, intent(in) :: a(:)
  integer :: i
  do i = 1, size(a)
    if (a(i) .lt. globmin) then
      globmin = a(i)
    endif
  end do
end subroutine
end module m
</pre><p class="p"> Compile the example above for Unified Memory Mode: </p><pre xml:space="preserve"> nvfortran -acc -gpu=mem:unified example.f90 </pre><p class="p"> The source does not need any OpenACC directives to access module variable <samp class="ph codeph">globmin</samp>, to either read or update its value, in the routine invoked from CPU and GPU. Moreover, any access to <samp class="ph codeph">globmin</samp> will be made to the exact same instance of the variable from CPU and GPU; its value is synchronized automatically. In Separate or Managed Memory Modes, such behavior can only be achieved with a combination of OpenACC <samp class="ph codeph">declare</samp> and <samp class="ph codeph">update</samp> directives in the source code. </p> <p class="p"> In most cases, migrating existing OpenACC applications written for Separate Memory Mode should be a seamless process requiring no source changes. Some data access patterns, however, may lead to different results produced during application execution in Unified Memory Mode. </p> <p class="p"> Applications which rely on having separate data copies in GPU memory to conduct temporary computations on the GPU -⁠-⁠ without maintaining data synchronization with the CPU -⁠-⁠ pose a challenge for migration to Unified Memory. </p> <p class="p"> For the following Fortran example, the value of variable <samp class="ph codeph">c</samp> after the last loop will differ depending on whether the example is compiled with or without <samp class="ph codeph">-⁠gpu=mem:unified</samp>. </p><pre xml:space="preserve"> b(:) = ... 
c = 0
!$acc kernels copyin(b) copyout(a)
!$acc loop
do i = 1, N
  b(i) = b(i) * i
end do
!$acc loop
do i = 1, N
  a(i) = b(i) + i
end do
!$acc end kernels
do i = 1, N
  c = c + a(i) + b(i)
end do
</pre><p class="p"> Without Unified Memory, array <samp class="ph codeph">b</samp> is copied into the GPU memory at the beginning of the OpenACC <samp class="ph codeph">kernels</samp> region. It is then updated in the GPU memory and used to compute elements of array <samp class="ph codeph">a</samp>. As instructed by the data clause <samp class="ph codeph">copyin(b)</samp>, <samp class="ph codeph">b</samp> is not copied back to the CPU memory at the end of the <samp class="ph codeph">kernels</samp> region and therefore its initial value is used in the computation of <samp class="ph codeph">c</samp>. With <samp class="ph codeph">-⁠acc -⁠gpu=mem:unified</samp>, the updated value of <samp class="ph codeph">b</samp> in the first loop is automatically visible in the last loop, leading to a different value of <samp class="ph codeph">c</samp> at its end. </p> </div> <div class="section"> <h3 class="title sectiontitle">Implications of Asynchronous Execution</h3> <p class="p"> Additional complexities can arise when dealing with asynchronous execution, particularly when CPU-GPU shared data is accessed within <samp class="ph codeph">async</samp> compute regions instead of using an independent data copy on GPU. The programmer should be especially careful about accessing local variables in asynchronous GPU code. Unless the GPU code execution is explicitly synchronized before the end of the scope in which local variables are defined, the GPU can access stale data, resulting in undefined behavior. Consider the following OpenACC C example, where a local array is used to hold temporary data on the GPU: </p><pre xml:space="preserve">
void bar() {
  int x[N];
#pragma acc enter data create(x[0:N]) async
#pragma acc parallel loop async
  for (int i = 0; i &lt; N; i++)
    x[i] = i;
  ...
#pragma acc exit data delete(x[0:N]) async
}
</pre><p class="p"> When compiled for Separate Memory Mode, the <samp class="ph codeph">bar()</samp> function creates a copy of the array <samp class="ph codeph">x</samp> in GPU memory and initializes it as written in the <samp class="ph codeph">loop</samp> construct. That copy is eventually deleted. In Unified Memory Mode, however, the compiler ignores the <samp class="ph codeph">acc enter data</samp> and <samp class="ph codeph">acc exit data</samp> directives, so the <samp class="ph codeph">loop</samp> construct executed on the GPU accesses the array <samp class="ph codeph">x</samp> in local CPU memory. Moreover, since all constructs in this example are made asynchronous, the access to <samp class="ph codeph">x</samp> on the GPU leads to undefined behavior of the program because the variable <samp class="ph codeph">x</samp> goes out of scope once the <samp class="ph codeph">bar()</samp> function finishes. </p> </div> <div class="section"> <h3 class="title sectiontitle">Performance Considerations</h3> <p class="p"> In Unified Memory Mode, the OpenACC runtime may leverage data action information such as <samp class="ph codeph">create</samp>/<samp class="ph codeph">delete</samp> or <samp class="ph codeph">copyin</samp>/<samp class="ph codeph">copyout</samp> to communicate preferable data placement to the CUDA runtime by means of memory hint APIs as elaborated in the following blog post on the NVIDIA website: <a class="xref" href="https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management" target="_blank" shape="rect">Simplifying GPU Application Development with Heterogeneous Memory Management</a>. Such actions originate either from explicit data clauses in the source code or via implicit data movement generated by the compiler. This approach can minimize the amount of automatic data migration and may let a developer fine-tune application performance. 
For the C example above, while adding the data clauses <samp class="ph codeph">create(ptr[0:dim*dim])</samp> and <samp class="ph codeph">copyout(ptr[0:dim*dim])</samp> becomes optional with <samp class="ph codeph">-⁠gpu=mem:unified</samp>, their use in the OpenACC <samp class="ph codeph">parallel loop</samp> directive may improve performance. </p> </div> </div> </div> <div class="topic concept nested1" id="openacc-error-handling"><a name="openacc-error-handling" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openacc-error-handling" name="openacc-error-handling" shape="rect">6.5. OpenACC Error Handling</a></h3> <div class="body conbody"> <p class="p"> The OpenACC specification provides a mechanism to allow you to intercept errors triggered during execution on a GPU and execute a specific routine in response before the program exits. For example, if an MPI process fails while allocating memory on the GPU, the application may want to call <samp class="ph codeph">MPI_Abort</samp> to shut down all the other processes before the program exits. This section explains how to take advantage of this feature. </p> <p class="p"> To intercept errors, the application must give a callback routine to the OpenACC runtime. To provide the callback, the application calls <samp class="ph codeph">acc_set_error_routine</samp> with a pointer to the callback routine. </p> <p class="p"> The interface is the following, where <samp class="ph codeph">err_msg</samp> contains a description of the error: </p><pre xml:space="preserve">typedef void (*exitroutinetype)(char *err_msg);
extern void acc_set_error_routine(exitroutinetype callback_routine);
</pre><p class="p"> When the OpenACC runtime detects a runtime error, it will invoke the <samp class="ph codeph">callback_routine</samp>. </p> <div class="note note"><span class="notetitle">Note:</span> This feature is not the same as error recovery. 
If the callback routine returns to the application, the behavior is undefined. </div> <p class="p"> Let's look at this feature in more depth using an example. </p> <p class="p"> Take the MPI program below and run it with two processes. Process 0 tries to allocate a large array on the GPU, then sends a message to the second process to acknowledge the success of the operation. Process 1 waits for the acknowledgment and terminates upon receiving it. </p><pre xml:space="preserve">
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include "mpi.h"

#define N 2147483648

int main(int argc, char **argv)
{
  int rank, size;

  MPI_Init(&amp;argc, &amp;argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
  MPI_Comm_size(MPI_COMM_WORLD, &amp;size);

  int ack;
  if(rank == 0) {
    float *a = (float*) malloc(sizeof(float) * N);

#pragma acc enter data create(a[0:N])
#pragma acc parallel loop independent
    /* long index: N does not fit in an int */
    for(long i = 0; i &lt; N; i++) {
      a[i] = i *0.5;
    }
#pragma acc exit data copyout(a[0:N])
    printf("I am process %d, I have initialized a vector of size %ld bytes on the GPU. Sending acknowledgment to process 1.", rank, N);
    ack = 1;
    MPI_Send(&amp;ack, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if(rank == 1) {
    MPI_Recv(&amp;ack, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("I am process %d, I have received the acknowledgment from process 0 that data in the GPU has been initialized.\n", rank);
    fflush(stdout);
  }

  // do some more work

  MPI_Finalize();

  return 0;
}
</pre><p class="p"> We compile the program with: </p><pre xml:space="preserve">$ mpicc -acc -o error_handling_mpi error_handling_mpi.c</pre><p class="p"> If we run this program with two MPI processes, the output will look like the following: </p><pre xml:space="preserve">
$ mpirun -n 2 ./error_handling_mpi
Out of memory allocating -8589934592 bytes of device memory
total/free CUDA memory: 11995578368/11919294464
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.7, threadid=1
...empty...
call to cuMemAlloc returned error 2: Out of memory
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.
</pre><p class="p"> Process 0 failed while allocating memory on the GPU and terminated unexpectedly with an error. In this case <samp class="ph codeph">mpirun</samp> was able to identify that one of the processes failed, so it shut down the remaining process and terminated the application. A simple two-process program like this is straightforward to debug. In a real-world application, though, with hundreds or thousands of processes, having a process exit prematurely may cause the application to hang indefinitely. Therefore it would be ideal to catch the failure of a process, control the termination of the other processes, and provide a useful error message. </p> <p class="p"> We can use the OpenACC error handling feature to improve the previous program and correctly terminate the application in case of failure of an MPI process. </p> <p class="p"> In the following sample code, we have added an error handling callback routine that will shut down the other processes if a process encounters an error while executing on the GPU. Process 0 tries to allocate a large array on the GPU and, if the operation is successful, process 0 will send an acknowledgment to process 1. Process 0 calls the OpenACC function <samp class="ph codeph">acc_set_error_routine</samp> to set the function <samp class="ph codeph">handle_gpu_errors</samp> as an error handling callback routine. This routine prints a message and calls <samp class="ph codeph">MPI_Abort</samp> to shut down all the MPI processes. 
If process 0 successfully allocates the array on the GPU, process 1 will receive the acknowledgment. Otherwise, if the allocation fails, the OpenACC runtime invokes <samp class="ph codeph">handle_gpu_errors</samp> on process 0. The code executed in the callback routine then terminates all processes, including process 1. </p><pre xml:space="preserve">
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include "mpi.h"

#define N 2147483648

<strong class="ph b">
typedef void (*exitroutinetype)(char *err_msg);
extern void acc_set_error_routine(exitroutinetype callback_routine);

void handle_gpu_errors(char *err_msg) {
  printf("GPU Error: %s", err_msg);
  printf("Exiting...\n\n");
  MPI_Abort(MPI_COMM_WORLD, 1);
  exit(-1);
}
</strong>

int main(int argc, char **argv)
{
  int rank, size;

  MPI_Init(&amp;argc, &amp;argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
  MPI_Comm_size(MPI_COMM_WORLD, &amp;size);

  int ack;
  if(rank == 0) {
    float *a = (float*) malloc(sizeof(float) * N);

<strong class="ph b">
    acc_set_error_routine(&amp;handle_gpu_errors);
</strong>

#pragma acc enter data create(a[0:N])
#pragma acc parallel loop independent
    /* long index: N does not fit in an int */
    for(long i = 0; i &lt; N; i++) {
      a[i] = i *0.5;
    }
#pragma acc exit data copyout(a[0:N])
    printf("I am process %d, I have initialized a vector of size %ld bytes on the GPU. Sending acknowledgment to process 1.", rank, N);
    fflush(stdout);
    ack = 1;
    MPI_Send(&amp;ack, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if(rank == 1) {
    MPI_Recv(&amp;ack, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("I am process %d, I have received the acknowledgment from process 0 that data in the GPU has been initialized.\n", rank);
    fflush(stdout);
  }

  // more work

  MPI_Finalize();

  return 0;
}
</pre><p class="p"> Again, we compile the program with: </p><pre xml:space="preserve">$ mpicc -acc -o error_handling_mpi error_handling_mpi.c</pre><p class="p"> We run the program with two MPI processes and obtain the output below: </p><pre xml:space="preserve">
$ mpirun -n 2 ./error_handling_mpi
Out of memory allocating -8589934592 bytes of device memory
total/free CUDA memory: 11995578368/11919294464
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.7, threadid=1
...empty...
GPU Error: call to cuMemAlloc returned error 2: Out of memory
Exiting...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.
</pre><p class="p"> This time the error on the GPU was intercepted by the application, which managed it with the error handling callback routine. In this case the routine printed some information about the problem and called <samp class="ph codeph">MPI_Abort</samp> to terminate the remaining processes and avoid any unexpected behavior from the application. </p> </div> </div> <div class="topic concept nested1" id="openacc-cuda-graphs"><a name="openacc-cuda-graphs" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openacc-cuda-graphs" name="openacc-cuda-graphs" shape="rect">6.6. OpenACC and CUDA Graphs</a></h3> <div class="body conbody"> <p class="p"> NVIDIA provides an optimized model for work submission onto GPUs called CUDA Graphs. 
A graph is a series of operations, such as kernel launches and other stream-oriented tasks, connected by their dependencies. A graph can be defined once, "captured", then launched repeatedly. This has potential benefits in reducing launch latencies and other overheads associated with kernel setup. </p> <p class="p"> A complete write-up explaining CUDA Graphs and the CUDA API for graph definition, instantiation, and execution can be found in Chapter 3 of the CUDA C Programming Guide. In OpenACC, we currently expose just the minimal set of operations to allow capture and replay of a graph containing OpenACC compute regions and data directives. The code executed between a "begin capture" call, <samp class="ph codeph">accx_async_begin_capture()</samp>, and the "end capture" call, <samp class="ph codeph">accx_async_end_capture()</samp>, is called the capture region. </p> <p class="p"> The CUDA graph API captures (or records) all the device work between <samp class="ph codeph">accx_async_begin_capture</samp> and <samp class="ph codeph">accx_async_end_capture</samp>. The host code in the capture region will be executed once normally, with the exception that no device work is actually executed on the device. Instead, a graph object is created that can be used to replay the captured work multiple times. </p> <div class="note note"><span class="notetitle">Note:</span> Graph capture is similar to the closure concept in many programming languages, such as lambda functions in C++. In lambda-function terms, CUDA graphs capture all the variables by value. That means that all FIRSTPRIVATE scalars, array shapes, and the addresses of derived types, arrays, and scalars resident on the GPU are baked into the graph object and cannot be altered. The device data behind the pointers, of course, can be updated by the graph execution normally, and updated by the host between replays. 
</div> <p class="p"> It is important to understand what can and what cannot be captured within a CUDA Graph capture region: </p> <ul class="ul"> <li class="li"> Asynchronous data clauses, including data create, can be captured. For variables which need allocation in data clauses, the OpenACC runtime will use the stream-ordered cudaMallocAsync() call in the capture region, an API call allowed in CUDA Graphs. </li> <li class="li"> Asynchronous compute regions, preferably ACC parallel regions, can be captured. For ACC kernels regions, verify that no work is performed on the host. Host compute sections cannot be captured. </li> <li class="li"> Asynchronous ACC update host (self) and update device directives can be captured. The host and device addresses which are captured must be valid during the graph replay/execution. </li> <li class="li"> Since only the device work is captured and replayed, any data dependencies between the host and device inside the capture region are erroneous. For example, downloading data from the device, processing it on host and uploading it back to the device within the capture region is invalid. </li> <li class="li"> Host code, even host code containing conditionals, can occur within a capture region. Note though that the path taken through the host code will be the path captured by the graph, i.e. the conditionals will most likely need to evaluate the same way during the replay for correct results. Host code which updates host variables, such as <samp class="ph codeph">i=i+1</samp>, will not be captured in the graph, which might affect proper indexing into device-side arrays or other kernel arguments. </li> <li class="li"> Similarly, device work initiated in host code loops can be captured in the CUDA Graph. The graph will not contain a notion of looping, just the sequence of device operations submitted to the device during the loop. 
</li> <li class="li"> Subroutine and function calls within a capture region, which contain further compute regions or other work which runs on the device, are captured. Care must be taken that the device data addresses passed to the kernels are valid throughout graph execution, and don't come and go based on stack addresses or something similar. </li> <li class="li"> Codes which double-buffer, or ping-pong between source and destination arrays that are input on odd iterations and output on even iterations, can be accommodated by capturing two graphs: one for even iterations and one for odd iterations. </li> <li class="li"> Many CUDA library calls, such as cuBLAS calls, can occur in a capture region. Setup for the library calls, such as creating handles, and computing and allocating workspace requirements, should be done before the capture region. </li> <li class="li"> Graph capturing is thread-safe with respect to each async queue. Host threads can independently capture graphs using different async queues. </li> </ul> <p class="p"> The OpenACC API follows the basic portion of the CUDA Graph API fairly closely. The major difference is that OpenACC includes the <samp class="ph codeph">cudaGraphInstantiate()</samp> call as part of the end capture function. </p> <p class="p">From Fortran, the graph type is defined in the OpenACC module:</p><pre xml:space="preserve">
type, bind(c) :: acc_graph_t
  type(c_ptr) :: graph
  type(c_ptr) :: graph_exec
end type acc_graph_t
</pre><p class="p"> These subroutines and functions are available in the OpenACC runtime. 
Here, pGraph is type(acc_graph_t) and async is just the asynchronous queue value: </p><pre xml:space="preserve"> subroutine accx_async_begin_capture( async ) subroutine accx_async_end_capture( async, pGraph ) subroutine accx_graph_launch( pGraph, async ) subroutine accx_graph_delete( pGraph ) type(c_ptr) function accx_get_graph( pGraph ) type(c_ptr) function accx_get_graph_exec( pGraph ) </pre><p class="p">From C, the graph type is defined in OpenACC.h:</p><pre xml:space="preserve"> typedef struct { void *graph; void *graph_exec; } acc_graph_t; </pre><p class="p"> These void functions are available in the OpenACC runtime: </p><pre xml:space="preserve"> extern void accx_async_begin_capture(long async); extern void accx_async_end_capture(long async, acc_graph_t *pgraph); extern void accx_graph_launch(acc_graph_t *pgraph, long async); extern void accx_graph_delete(acc_graph_t *pgraph); extern void *accx_get_graph(acc_graph_t *pgraph); extern void *accx_get_graph_exec(acc_graph_t *pgraph); </pre><p class="p"> We will use a simple Fortran example code which demonstrates some of the modifications needed to use CUDA Graphs from OpenACC. 
The original serial code for a conjugate gradient iterative solver: </p><pre xml:space="preserve"> subroutine RunCG(N, A, b, x, tol, max_iter) implicit none integer, intent(in) :: N, max_iter real(WP), intent(in) :: A(N, N), b(N), tol real(WP), intent(inout) :: x(N) real(WP) :: alpha, rr0, rr real(WP), allocatable :: Ax(:), r(:), p(:) integer :: it, i allocate(Ax(N), r(N), p(N)) call symmatvec(N, N, A, x, Ax) do i = 1, N r(i) = b(i) - Ax(i) p(i) = r(i) enddo rr0 = dot(N, r, r) do it = 1, max_iter call symmatvec(N, N, A, p, Ax) alpha = rr0 / dot(N, p, Ax) do i = 1, N x(i) = x(i) + alpha * p(i) r(i) = r(i) - alpha * Ax(i) enddo rr = dot(N, r, r) print*, "Iteration ", it, " residual: ", sqrt(rr) if (sqrt(rr) &lt;= tol) then deallocate(Ax, r, p) return endif do i = 1, N p(i) = r(i) + (rr / rr0) * p(i) enddo rr0 = rr enddo deallocate(Ax, r, p) end subroutine RunCG </pre><p class="p"> For this exercise we wish to put the <samp class="ph codeph">do it = 1,max_iter</samp> work for each iteration into a CUDA graph. Step one is to port the code to OpenACC, keeping in mind that we want to use asynchronous queues. We annotate the dot function with OpenACC directives like this: </p><pre xml:space="preserve"> function dot(N, x, y) result(r) integer, intent(in) :: N real(WP), intent(in) :: x(N), y(N) integer :: i real(WP) :: r r = 0.d0 !$acc parallel loop present(x, y) reduction(+:r) async(1) do i = 1, N r = r + x(i) * y(i) enddo !$acc wait(1) end function dot </pre><p class="p"> We write the symmetric matrix multiply like this: </p><pre xml:space="preserve"> subroutine symmatvec(M, N, AT, x, Ax) implicit none integer, intent(in) :: M, N real(WP), intent(in) :: AT(N, M), x(N) real(WP), intent(out) :: Ax(M) integer :: i, j real(WP) :: s ! Note: Since A is symmetric, we can use the "transpose" ! 
for better memory access here !$acc parallel loop gang present(AT, x, Ax) async(1) do i = 1, M s = 0.d0 !$acc loop vector reduction(+:s) do j = 1, N s = s + AT(j,i) * x(j) end do Ax(i) = s end do end subroutine </pre><p class="p"> And now our main loop of the conjugate gradient solver looks like this: </p><pre xml:space="preserve"> do it = 1, max_iter call symmatvec(N, N, A, p, Ax) alpha = rr0 / dot(N, p, Ax) !$acc parallel loop gang vector async(1) do i = 1, N x(i) = x(i) + alpha * p(i) r(i) = r(i) - alpha * Ax(i) enddo rr = dot(N, r, r) print*, "Iteration ", it, " residual: ", sqrt(rr) if (sqrt(rr) &lt;= tol) exit !$acc parallel loop gang vector async(1) do i = 1, N p(i) = r(i) + (rr / rr0) * p(i) enddo rr0 = rr enddo </pre><p class="p"> Step 2 is to prepare the code for running under CUDA Graphs. There is a lot of host code executing in the main loop. While the <samp class="ph codeph">dot()</samp> function runs on the GPU, the rest of the statement <samp class="ph codeph">alpha = rr0 / dot(...)</samp> runs on the host. Similarly, the 2nd <samp class="ph codeph">dot()</samp> call returns its value to the host. The print statement occurs on the host, as does the residual check. Finally, this iteration's value for rr is moved to rr0 in the last statement of the loop, on the host. </p> <p class="p"> The dot product is tricky. We wish to compute the dot product on the GPU, and leave the result on the GPU, so the reduction variable must be present on the GPU. 
Here, we change the function call to a subroutine, and remove the initialization which is outside of the parallel region: </p><pre xml:space="preserve"> subroutine dot(N, x, y, r) implicit none integer, intent(in) :: N real(WP), intent(in) :: x(N), y(N) integer :: i real(WP) :: r !$acc parallel loop present(x, y, r) reduction(+:r) async(1) do i = 1, N r = r + x(i) * y(i) enddo end subroutine dot </pre><p class="p"> We add one serial kernel to handle the swap between rr0 and rr and to zero out the scalars that will hold the dot product reductions. We also move the print and the residual check outside of the GPU capture region, replacing them with an update host operation. The finished loop, complete with graph control, looks like this: </p><pre xml:space="preserve"> do it = 1, max_iter if (it .eq. 1) then ! First time capture call accx_async_begin_capture(1) call symmatvec(N, N, A, p, Ax) call dot(N, p, Ax, rden) !$acc serial async(1) rr0 = rr alpha = rr0 / rden rden = 0.0d0 rr = 0.0d0 !$acc end serial !$acc parallel loop gang vector async(1) do i = 1, N x(i) = x(i) + alpha * p(i) r(i) = r(i) - alpha * Ax(i) enddo call dot(N, r, r, rr) !$acc update host(rr) async(1) !$acc parallel loop gang vector async(1) do i = 1, N p(i) = r(i) + (rr / rr0) * p(i) enddo call accx_async_end_capture(1, graph) endif ! Always launch, then wait call accx_graph_launch(graph, 1) !$acc wait(1) rra(it) = rr if (sqrt(rr) &lt;= tol) exit enddo </pre><p class="p"> Step 3 is to compile, run, and profile the result. No special compiler options are needed besides -acc=gpu. When running, you may be advised to set the <samp class="ph codeph"> NVCOMPILER_ACC_USE_GRAPH </samp> environment variable. This is currently necessary to properly set up the OpenACC runtime for graph capture. Failure to abide by the guidelines above may result in wrong answers, which can be hard to debug. See the following sections on how to use environment variables to help. 
Keep in mind that the pointers passed to the device kernels during graph playback will be the same every time. A common issue is code in which they change; make sure the pointers are the same between iterations in the code without graph capture. </p> <p class="p"> The Nsight Systems tool has very good support for profiling CUDA graphs. The timeline view will provide information on whether you have reduced the launch overhead gaps between the GPU kernels. <a class="xref" href="index.html#openacc-cuda-graphs__nsys-cg-report1-timeline-fig" shape="rect">Figure 1</a> shows a timeline of the iterations of the original OpenACC loop: </p> <div class="fig fignone" id="openacc-cuda-graphs__nsys-cg-report1-timeline-fig"><a name="openacc-cuda-graphs__nsys-cg-report1-timeline-fig" shape="rect"> <!-- --></a><span class="figcap">Figure 1. Nsight Systems Report1 Timeline</span><br clear="none" /><div class="imagecenter"><img class="image imagecenter" src="../hpc-compilers-user-guide/graphics/nsys-cg-report1-timeline.png" alt="png for PDF." /></div><br clear="none" /></div> <p class="p"><a class="xref" href="index.html#openacc-cuda-graphs__nsys-cg-report2-timeline-fig" shape="rect">Figure 2</a> shows a timeline of the iterations when using CUDA Graphs. When the size N is less than a few thousand, launch latency becomes a major contributor to the overall time, and here we can see about a 2x speedup: </p> <div class="fig fignone" id="openacc-cuda-graphs__nsys-cg-report2-timeline-fig"><a name="openacc-cuda-graphs__nsys-cg-report2-timeline-fig" shape="rect"> <!-- --></a><span class="figcap">Figure 2. Nsight Systems Report2 Timeline</span><br clear="none" /><div class="imagecenter"><img class="image imagecenter" src="../hpc-compilers-user-guide/graphics/nsys-cg-report2-timeline.png" alt="png for PDF." /></div><br clear="none" /></div> <p class="p"> You can see a more detailed trace of the CUDA Graph components by adding the <samp class="ph codeph">--cuda-graph-trace=node</samp> option to the nsys profile command. 
</p> <p class="p"> The above loop demonstrates several of the guidelines outlined at the top of this section, namely, capturing compute regions, whether at the top level or in subprogram units, capturing data movement, and restructuring code regions to minimize or eliminate the host code within a capture region. It also shows the minimal API needed to begin capture, end capture, and then launch the captured graph. </p> </div> </div> <div class="topic concept nested1" id="env-vars"><a name="env-vars" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars" name="env-vars" shape="rect">6.7. Environment Variables</a></h3> <div class="body conbody"> <p class="p">This section summarizes the environment variables that NVIDIA OpenACC supports. These environment variables are user-settable and control the behavior of accelerator-enabled programs at execution. They must comply with these rules: </p> <ul class="ul"> <li class="li">The names of the environment variables must be upper case. </li> <li class="li">The values of environment variables are case insensitive and may have leading and trailing white space.</li> <li class="li">The behavior is implementation-defined if the values of the environment variables change after the program has started, even if the program itself modifies the values. </li> </ul> <p class="p">The following table contains the environment variables that are currently supported and provides a brief description of each.</p> <div class="tablenoborder"><a name="env-vars__environment-variables-openacc" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="env-vars__environment-variables-openacc" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 16. 
Supported Environment Variables</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="33.33333333333333%" id="d54e9149" rowspan="1" colspan="1">Use this environment variable...</th> <th class="entry" valign="top" width="66.66666666666666%" id="d54e9152" rowspan="1" colspan="1">To do this...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_CUDA_PROFSTOP</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Set to 1 (or any positive value) to tell the runtime environment to insert an 'atexit(cuProfilerStop)' call upon exit. This behavior may be desired in the case where a profile is incomplete or where a message is issued to call cudaProfilerStop(). </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_DEVICE_NUM</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Sets the default device number to use when executing accelerator regions. The value of this environment variable must be a nonnegative integer between zero and the number of devices attached to the host. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">ACC_DEVICE_NUM</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Legacy name. Superseded by <span class="keyword parmname">NVCOMPILER_ACC_DEVICE_NUM</span>. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_DEVICE_TYPE</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Sets the default device type to use for OpenACC regions. Specifies which accelerator device to use when executing accelerator regions when the program has been compiled to use more than one type of device. The value of this environment variable is implementation-defined; in the NVIDIA OpenACC implementation it may be one of the strings NVIDIA, MULTICORE, or HOST. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">ACC_DEVICE_TYPE</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Legacy name. Superseded by <span class="keyword parmname">NVCOMPILER_ACC_DEVICE_TYPE</span>. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_GANGLIMIT</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">For NVIDIA CUDA devices, this defines the maximum number of gangs (CUDA thread blocks) that will be launched by a kernel.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_NOTIFY</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1"> With no argument, a debug message will be written to stderr for each kernel launch and/or data transfer. 
When set to an integer value, the value is used as a bit mask to print information about: <p class="p"> 1: kernel launches</p> <p class="p"> 2: data transfers</p> <p class="p"> 4: region entry/exit</p> <p class="p"> 8: wait operations or synchronizations with the device</p> <p class="p">16: device memory allocates and deallocates</p> </td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_PROFLIB</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Enables 3rd party tools interface using the new profiler dynamic library interface.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_SYNCHRONOUS</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Disables asynchronous launches and data movement.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d54e9149" rowspan="1" colspan="1">NVCOMPILER_ACC_TIME</td> <td class="entry" valign="top" width="66.66666666666666%" headers="d54e9152" rowspan="1" colspan="1">Enables a lightweight profiler to measure data movement and accelerator kernel execution time and print a summary at the end of program execution. </td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="acc-prof-kernels"><a name="acc-prof-kernels" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-prof-kernels" name="acc-prof-kernels" shape="rect">6.8. Profiling Accelerator Kernels</a></h3> <div class="body conbody"> <div class="section"> <h3 class="title sectiontitle">Support for Profiler/Trace Tool Interface</h3> <p class="p"> The NVIDIA HPC Compilers support the OpenACC Profiler/Trace Tools Interface. 
This is the interface used by the NVIDIA profilers to collect performance measurements of OpenACC programs. </p> </div> <div class="section"> <h3 class="title sectiontitle">Using NVCOMPILER_ACC_TIME</h3> <p class="p">Setting the environment variable NVCOMPILER_ACC_TIME to a nonzero value enables collection and printing of simple timing information about the accelerator regions and generated kernels. </p> <div class="note note"><span class="notetitle">Note:</span> Turn off all CUDA profilers (NVIDIA's Visual Profiler, NVPROF, CUDA_PROFILE, etc.) when enabling NVCOMPILER_ACC_TIME. They use the same library to gather performance data and cannot be used concurrently. </div> </div> <div class="section"> <h3 class="title sectiontitle">Accelerator Kernel Timing Data</h3><pre xml:space="preserve">bb04.f90 s1 15: region entered 1 times time(us): total=1490738 init=1489138 region=1600 kernels=155 data=1445 w/o init: total=1600 max=1600 min=1600 avg=1600 18: kernel launched 1 times time(us): total=155 max=155 min=155 avg=155</pre><p class="p">In this example, a number of things are occurring:</p> <ul class="ul"> <li class="li">For each accelerator region, the file name <span class="ph filepath">bb04.f90</span> and subroutine or function name <span class="ph filepath">s1</span> are printed, with the line number of the accelerator region, which in the example is <span class="ph filepath">15</span>. </li> <li class="li">The library counts how many times the region is entered (<span class="ph filepath">1</span> in the example) and the microseconds spent in the region (in this example <span class="ph filepath">1490738</span>), which is split into initialization time (in this example <span class="ph filepath">1489138</span>) and execution time (in this example <span class="ph filepath">1600</span>). 
</li> <li class="li">The execution time is then divided into kernel execution time and data transfer time between the host and GPU.</li> <li class="li">For each kernel, the line number is given, (<span class="ph filepath">18</span> in the example), along with a count of kernel launches, and the total, maximum, minimum, and average time spent in the kernel, all of which are 155 in this example. </li> </ul> </div> </div> </div> <div class="topic concept nested1" id="acc-openacc-runtm-lib"><a name="acc-openacc-runtm-lib" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-openacc-runtm-lib" name="acc-openacc-runtm-lib" shape="rect">6.9. OpenACC Runtime Libraries</a></h3> <div class="body conbody"> <p class="p">This section provides an overview of the user-callable functions and library routines that are available for use by programmers to query the accelerator features and to control behavior of accelerator-enabled programs at runtime. </p> <div class="note note"><span class="notetitle">Note:</span> In Fortran, none of the OpenACC runtime library routines may be called from a PURE or ELEMENTAL procedure. </div> </div> <div class="topic concept nested2" id="acc-openacc-lib-defn"><a name="acc-openacc-lib-defn" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-openacc-lib-defn" name="acc-openacc-lib-defn" shape="rect">6.9.1. Runtime Library Definitions</a></h3> <div class="body conbody"> <p class="p">There are separate runtime library files for Fortran, and for C++ and C.</p> <div class="section"> <h4 class="title sectiontitle">C++ and C Runtime Library Files</h4> <p class="p">In C++ and C, prototypes for the runtime library routines are available in a header file named <span class="ph filepath">accel.h</span>. All the library routines are <span class="ph filepath">extern</span> functions with ‘C’ linkage. 
This file defines: </p> <ul class="ul"> <li class="li">The prototypes of all routines in this section.</li> <li class="li">Any data types used in those prototypes, including an enumeration type to describe types of accelerators.</li> </ul> </div> <div class="section"> <h4 class="title sectiontitle">Fortran Runtime Library Files</h4> <p class="p">In Fortran, interface declarations are provided in a Fortran include file named <span class="ph filepath">accel_lib.h</span> and in a Fortran module named <span class="ph filepath">accel_lib</span>. These files define: </p> <ul class="ul"> <li class="li">Interfaces for all routines in this section.</li> <li class="li">Integer parameters to define integer kinds for arguments to those routines.</li> <li class="li">Integer parameters to describe types of accelerators.</li> </ul> </div> </div> </div> <div class="topic concept nested2" id="acc-openacc-lib-routines"><a name="acc-openacc-lib-routines" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-openacc-lib-routines" name="acc-openacc-lib-routines" shape="rect">6.9.2. Runtime Library Routines</a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="index.html#acc-openacc-lib-routines__acc-openacc-lib-routines-tbl" shape="rect">Table 17</a> lists and briefly describes the runtime library routines supported by the NVIDIA HPC Compilers in addition to the standard OpenACC runtime API routines. </p> <div class="tablenoborder"><a name="acc-openacc-lib-routines__acc-openacc-lib-routines-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="acc-openacc-lib-routines__acc-openacc-lib-routines-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 17. 
Accelerator Runtime Library Routines</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e9466" rowspan="1" colspan="1">This Runtime Library Routine...</th> <th class="entry" valign="top" width="NaN%" id="d54e9469" rowspan="1" colspan="1">Does this...</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_allocs</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of arrays allocated in data or compute regions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_bytesalloc</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the total bytes allocated by data or compute regions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_bytesin</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the total bytes copied in to the accelerator by data or compute regions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_bytesout</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the total bytes copied out from the accelerator by data or compute regions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_clear_freelists</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Clears lists of deallocated device memory chunks retained by the device memory manager for reuse. 
Please refer to the section <a class="xref" href="index.html#env-vars-memory" shape="rect">Environment Variables Controlling Device Memory Management</a> to learn more about device memory control. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_copyins</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of arrays copied in to the accelerator by data or compute regions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_copyouts</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of arrays copied out from the accelerator by data or compute regions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_disable_time</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Tells the runtime to stop profiling accelerator regions and kernels. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_enable_time</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Tells the runtime to start profiling accelerator regions and kernels, if it is not already doing so.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_exec_time</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of microseconds spent on the accelerator executing kernels.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_frees</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of arrays freed or deallocated in data or compute regions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_get_device</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the type of accelerator device used to run the next accelerator region, if one is selected.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_get_device_num</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of the device being used to execute an accelerator region.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_get_free_memory</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the total available free memory on the attached accelerator device.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_get_memory</td> <td 
class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the total memory on the attached accelerator device.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_get_num_devices</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of accelerator devices of the given type attached to the host.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_kernels</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of accelerator kernels launched since the start of the program.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_present_dump</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Summarizes all data present on the current device.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_present_dump_all</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Summarizes all data present on all devices.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_regions</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of accelerator regions entered since the start of the program.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9466" rowspan="1" colspan="1">acc_total_time</td> <td class="entry" valign="top" width="NaN%" headers="d54e9469" rowspan="1" colspan="1">Returns the number of microseconds spent in accelerator compute regions and in moving data for accelerator data regions.</td> </tr> </tbody> </table> </div> </div> </div> 
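<p class="p">As a minimal sketch of how a program might query the runtime before choosing where to run, the following C fragment counts the NVIDIA devices visible to OpenACC. It assumes the standard <span class="ph filepath">openacc.h</span> header and the standard <samp class="ph codeph">acc_get_num_devices()</samp> routine with the <samp class="ph codeph">acc_device_nvidia</samp> device type. The helper names <samp class="ph codeph">count_nvidia_devices</samp> and <samp class="ph codeph">describe_devices</samp> are hypothetical and exist only for this illustration. When the file is compiled without OpenACC enabled, the <samp class="ph codeph">_OPENACC</samp> macro is undefined and the helpers report zero devices.</p>

```c
#include <stdio.h>

#ifdef _OPENACC
#include <openacc.h>   /* standard OpenACC runtime API declarations */
#endif

/* Hypothetical helper: number of NVIDIA devices the OpenACC runtime
   can see, or 0 when this file is built without OpenACC support. */
int count_nvidia_devices(void)
{
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_nvidia);
#else
    return 0;
#endif
}

/* Hypothetical helper: writes a one-line summary into buf (at most
   len bytes, always NUL-terminated) and returns the device count. */
int describe_devices(char *buf, size_t len)
{
    int n = count_nvidia_devices();
    if (n > 0)
        snprintf(buf, len, "%d NVIDIA device(s) available", n);
    else
        snprintf(buf, len, "no NVIDIA devices available; running on the host");
    return n;
}
```

<p class="p">A caller would typically invoke <samp class="ph codeph">describe_devices()</samp> once at startup, print the message, and fall back to a host code path when the returned count is zero.</p>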
</div> <div class="topic concept nested1" id="acc-supt-intrin"><a name="acc-supt-intrin" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-supt-intrin" name="acc-supt-intrin" shape="rect">6.10. Supported Intrinsics</a></h3> <div class="body conbody"> <p class="p">An intrinsic is a function available in a given language whose implementation is handled specifically by the compiler. Typically, an intrinsic substitutes a sequence of automatically-generated instructions for the original function call. Because the compiler has intimate knowledge of the intrinsic function, it can better integrate and optimize it for the situation. </p> <p class="p">Intrinsics make the use of processor-specific enhancements easier because they provide a language interface to assembly instructions. In doing so, the compiler manages things that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data. </p> <p class="p">This section contains an overview of the Fortran <span class="ph">and C </span>intrinsics that the accelerator supports. </p> </div> <div class="topic concept nested2" id="acc-fort-intrin-sum"><a name="acc-fort-intrin-sum" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-fort-intrin-sum" name="acc-fort-intrin-sum" shape="rect">6.10.1. Supported Fortran Intrinsics Summary Table</a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="index.html#acc-fort-intrin-sum__acc-fort-intrin-sum-tbl" shape="rect">Table 18</a> is an alphabetical summary of the Fortran intrinsics that the accelerator supports. These functions are specific to Fortran 90/95 unless otherwise specified. </p> <p class="p"> In most cases support is provided for all the data types for which the intrinsic is valid. 
When support is available for only certain data types, the middle column of the table specifies which ones, using the following codes: </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="void" border="0" rules="none"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">I for integer</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">S for single precision real</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">C for single precision complex</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">D for double precision real</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Z for double precision complex</td> </tr> </tbody> </table> </div> <div class="tablenoborder"><a name="acc-fort-intrin-sum__acc-fort-intrin-sum-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="acc-fort-intrin-sum__acc-fort-intrin-sum-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 18. 
Supported Fortran Intrinsics</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e9769" rowspan="1" colspan="1">This intrinsic</th> <th class="entry" valign="top" width="NaN%" id="d54e9772" rowspan="1" colspan="1"></th> <th class="entry" valign="top" width="NaN%" id="d54e9775" rowspan="1" colspan="1">Return value</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">ABS</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> I,S,D</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">absolute value of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">ACOS</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">arccosine of the specified argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">AINT</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">truncation of the argument to a whole number.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">ANINT</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">nearest whole number of the real argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">ASIN</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" 
rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">arcsine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">ATAN</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">arctangent of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">ATAN2</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">angle in radians of the complex value second-argument + i*first-argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">COS</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> S,D,C,Z</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">cosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">COSH</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">hyperbolic cosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">DBLE</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> S,D</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">conversion of the argument to double precision real.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" 
colspan="1">DPROD</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">double precision product of two single precision arguments.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">EXP</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> S,D,C,Z</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">natural exponential value of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">IAND</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">result of logical AND of the two integer arguments.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">IEOR</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">result of the boolean exclusive OR of the two integer arguments.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">INT</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> I,S,D</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">conversion of the argument to integer type.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">IOR</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" 
colspan="1">result of the boolean inclusive OR of the two integer arguments.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">LOG</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> S,D,C,Z</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">base-e (natural logarithm) of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">LOG10</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">base-10 logarithm of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">MAX</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">maximum value of the arguments.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">MIN</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">minimum value of the arguments.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">MOD</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> I</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">remainder of the first argument divided by the second argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">NINT</td> <td class="entry" valign="top" width="NaN%" 
headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">nearest integer of the real argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">NOT</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">logical complement of the integer argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">REAL</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> I,S,D</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">conversion of the argument to real.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">SIGN</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">absolute value of first argument times the sign of second argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">SIN</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> S,D,C,Z</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">sine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">SINH</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">hyperbolic sine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" 
headers="d54e9769" rowspan="1" colspan="1">SQRT</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> S,D,C,Z</td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">square root of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">TAN</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">tangent of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e9769" rowspan="1" colspan="1">TANH</td> <td class="entry" valign="top" width="NaN%" headers="d54e9772" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e9775" rowspan="1" colspan="1">hyperbolic tangent of the argument.</td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested2" id="acc-c-intrin-sum"><a name="acc-c-intrin-sum" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#acc-c-intrin-sum" name="acc-c-intrin-sum" shape="rect">6.10.2. Supported C Intrinsics Summary Table</a></h3> <div class="body conbody"> <p class="p">This section contains two alphabetical summaries – one for double functions and a second for float functions. These lists contain only those C intrinsics that the accelerator supports. </p> <div class="tablenoborder"><a name="acc-c-intrin-sum__supt-c-intrin-dbl-func-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="acc-c-intrin-sum__supt-c-intrin-dbl-func-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 19. 
Supported C Intrinsic Double Functions</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e10157" rowspan="1" colspan="1">This intrinsic</th> <th class="entry" valign="top" width="NaN%" id="d54e10160" rowspan="1" colspan="1">Return value</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">acos</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">arccosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">asin</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">arcsine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">atan</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">arctangent of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">atan2</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">arctangent of y/x, where y is the first argument, x the second.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">cos</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">cosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">cosh</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">hyperbolic cosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">exp</td> <td class="entry" valign="top" width="NaN%" 
headers="d54e10160" rowspan="1" colspan="1">exponential value of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">fabs</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">absolute value of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">fmax</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">maximum value of the two arguments</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">fmin</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">minimum value of the two arguments</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">log</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">natural logarithm of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">log10</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">base-10 logarithm of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">pow</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">value of the first argument raised to the power of the second argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">sin</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">value of the sine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">sinh</td> <td class="entry" 
valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">hyperbolic sine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">sqrt</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">square root of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">tan</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">tangent of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10157" rowspan="1" colspan="1">tanh</td> <td class="entry" valign="top" width="NaN%" headers="d54e10160" rowspan="1" colspan="1">hyperbolic tangent of the argument.</td> </tr> </tbody> </table> </div> <div class="tablenoborder"><a name="acc-c-intrin-sum__supt-c-intrin-float-func-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="acc-c-intrin-sum__supt-c-intrin-float-func-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 20. 
Supported C Intrinsic Float Functions</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e10350" rowspan="1" colspan="1">This intrinsic</th> <th class="entry" valign="top" width="NaN%" id="d54e10353" rowspan="1" colspan="1">Return value</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">acosf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">arccosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">asinf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">arcsine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">atanf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">arctangent of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">atan2f</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">arctangent of y/x, where y is the first argument, x the second.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">cosf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">cosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">coshf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">hyperbolic cosine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">expf</td> <td class="entry" valign="top" width="NaN%" 
headers="d54e10353" rowspan="1" colspan="1">exponential value of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">fabsf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">absolute value of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">logf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">natural logarithm of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">log10f</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">base-10 logarithm of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">powf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">value of the first argument raised to the power of the second argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">sinf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">value of the sine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">sinhf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">hyperbolic sine of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">sqrtf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">square root of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">tanf</td> <td class="entry" 
valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">tangent of the argument.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e10350" rowspan="1" colspan="1">tanhf</td> <td class="entry" valign="top" width="NaN%" headers="d54e10353" rowspan="1" colspan="1">hyperbolic tangent of the argument.</td> </tr> </tbody> </table> </div> </div> </div> </div> </div> <div class="topic concept nested0" id="openmp-use"><a name="openmp-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#openmp-use" name="openmp-use" shape="rect">7. Using OpenMP</a></h2> <div class="body conbody"> <p class="p"> OpenMP is a specification for a set of compiler directives, an applications programming interface (API), and a set of environment variables that can be used to specify parallel execution in Fortran, C⁠+⁠+, and C programs. For general information about using OpenMP and to obtain a copy of the OpenMP specification, refer to the <a class="xref" href="https://www.openmp.org" target="_blank" shape="rect">OpenMP organization's website</a>. </p> <p class="p"> The NVFORTRAN, NVC⁠+⁠+, and NVC compilers support a subset of the OpenMP Application Program Interface for CPUs and GPUs. In defining this subset, we have focused on OpenMP 5.0 features that will enable CPU and GPU targeting for OpenMP applications with a goal of encouraging programming practices that are portable and scalable. For features that are to be avoided, wherever possible, the directives and API calls related to those features are parsed and ignored to maximize portability. Where ignoring such features is not possible, or could result in ambiguous or incorrect execution, the compilers emit appropriate error messages at compile- or run-time. 
</p> <p class="p"> OpenMP applications properly structured for GPUs, meaning they expose massive parallelism and have relatively little or no synchronization in GPU-side code segments, should compile and execute with performance on par with or close to equivalent OpenACC. Codes that are not well-structured for GPUs may perform poorly but should execute correctly. </p> <div class="p"> Use the <samp class="ph codeph">-⁠mp</samp> compiler switch to enable processing of OpenMP directives and pragmas. The most important sub-options to <samp class="ph codeph">-⁠mp</samp> are the following: <ul class="ul"> <li class="li"><samp class="ph codeph">gpu</samp>: OpenMP directives are compiled for GPU execution plus multicore CPU fallback; this feature is supported on NVIDIA V100 or later GPUs. </li> <li class="li"><samp class="ph codeph">multicore</samp>: OpenMP directives are compiled for multicore CPU execution only; this sub-option is the default. </li> </ul> </div> <div class="section" id="openmp-use__macros"><a name="openmp-use__macros" shape="rect"> <!-- --></a><h2 class="title sectiontitle">Predefined Macros</h2> <div class="p">The following macros corresponding to the offload target compiled for are added implicitly: <ul class="ul"> <li class="li"><samp class="ph codeph">__NVCOMPILER_OPENMP_GPU</samp> when OpenMP target directives are compiled for GPU. </li> <li class="li"><samp class="ph codeph">__NVCOMPILER_OPENMP_MULTICORE</samp> when OpenMP target directives are compiled for multicore CPU. </li> </ul> </div> </div> </div> <div class="topic concept nested1" id="openmp-env-vars"><a name="openmp-env-vars" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-env-vars" name="openmp-env-vars" shape="rect">7.1. Environment Variables</a></h3> <div class="body conbody"> <p class="p"> The OpenMP specification includes many environment variables related to program execution. 
</p> <div class="section" id="openmp-env-vars__thread-affinity"><a name="openmp-env-vars__thread-affinity" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Thread affinity</h3> <p class="p"> One important environment variable is <samp class="ph codeph">OMP_PROC_BIND</samp>. It controls the OpenMP CPU thread affinity policy. When thread affinity is disabled, the operating system is free to move threads between the available CPU cores. When thread affinity is enabled, each thread is bound to a subset of the available CPU cores. The environment variable <samp class="ph codeph">OMP_PLACES</samp> can be used to specify how a subset of the available CPU cores is determined for each thread. When set to a valid value, this environment variable will enable thread affinity and override the default thread affinity policy. </p> <p class="p"> Binding threads to certain CPU cores is often beneficial for application performance, because it can improve the CPU cache hit rate and limit memory transactions between different NUMA nodes. Therefore, it is important to consider enabling thread affinity for your application. </p> <p class="p"> The default value of <samp class="ph codeph">OMP_PROC_BIND</samp> is <samp class="ph codeph">false</samp>. Thus, thread affinity is disabled by default. This is a conservative setting that allows certain classes of applications (such as OpenMP + MPI) to create multiple processes without having to adjust the thread affinity policy to avoid binding threads in different processes to the same CPU cores. </p> <p class="p"> The following table explains the simplest possible values of <samp class="ph codeph">OMP_PROC_BIND</samp>. For a comprehensive explanation of <samp class="ph codeph">OMP_PROC_BIND</samp> and <samp class="ph codeph">OMP_PLACES</samp>, refer to the OpenMP specification. 
</p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="45.45454545454545%" id="d54e10653" rowspan="1" colspan="1">Value</th> <th class="entry" valign="top" width="54.54545454545454%" id="d54e10656" rowspan="1" colspan="1">Behavior</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10653" rowspan="1" colspan="1"><samp class="ph codeph">OMP_PROC_BIND=false</samp></td> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10656" rowspan="1" colspan="1"> Thread affinity is disabled unless <samp class="ph codeph">OMP_PLACES</samp> is set to a valid value. When thread affinity is disabled, the operating system is free to assign threads to any available CPU core at any time of the application execution. This is the default value. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10653" rowspan="1" colspan="1"><samp class="ph codeph">OMP_PROC_BIND=true</samp></td> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10656" rowspan="1" colspan="1"> Thread affinity is enabled. Unless <samp class="ph codeph">OMP_PLACES</samp> is set, the implementation attempts to assign threads optimally to CPU cores to maximize the cache hit rate and minimize the number of memory transactions between NUMA nodes. </td> </tr> </tbody> </table> </div> </div> <div class="section" id="openmp-env-vars__device-offload"><a name="openmp-env-vars__device-offload" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Device offload</h3> <p class="p"> Another important environment variable to understand is <samp class="ph codeph">OMP_TARGET_OFFLOAD</samp>. Use this environment variable to affect the behavior of execution on host and device including host fallback. 
The following table explains the behavior determined by each of the values to which you can set this environment variable. </p> <div class="tablenoborder"><a name="openmp-env-vars__omp-target-offload" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="openmp-env-vars__omp-target-offload" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="45.45454545454545%" id="d54e10720" rowspan="1" colspan="1">Value</th> <th class="entry" valign="top" width="54.54545454545454%" id="d54e10723" rowspan="1" colspan="1">Behavior</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10720" rowspan="1" colspan="1"><samp class="ph codeph">OMP_TARGET_OFFLOAD=DEFAULT</samp></td> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10723" rowspan="1" colspan="1"> Try to execute on a GPU; if a supported GPU is not available, fall back to the host. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10720" rowspan="1" colspan="1"><samp class="ph codeph">OMP_TARGET_OFFLOAD=DISABLED</samp></td> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10723" rowspan="1" colspan="1"> Do not execute on the GPU even if one is available; execute on the host. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10720" rowspan="1" colspan="1"><samp class="ph codeph">OMP_TARGET_OFFLOAD=MANDATORY</samp></td> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10723" rowspan="1" colspan="1"> Execute on a GPU or terminate the program. </td> </tr> </tbody> </table> </div> </div> <div class="section" id="openmp-env-vars__number-of-teams"><a name="openmp-env-vars__number-of-teams" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Number of teams on device</h3> <p 
class="p"> When an application offloads an <samp class="ph codeph">omp target teams</samp> construct to the GPU, the number of teams is calculated automatically unless the construct has a <samp class="ph codeph">num_teams</samp> clause. The automatically calculated number of teams can be capped at a maximum value provided by the <samp class="ph codeph">OMP_NUM_TEAMS</samp> environment variable. The same maximum value can also be set by the application at run time with the function <samp class="ph codeph">omp_set_num_teams</samp>. </p> <div class="tablenoborder"><a name="openmp-env-vars__omp-num-teams" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="openmp-env-vars__omp-num-teams" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="54.54545454545454%" id="d54e10802" rowspan="1" colspan="1">Value</th> <th class="entry" valign="top" width="45.45454545454545%" id="d54e10805" rowspan="1" colspan="1">Behavior</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10802" rowspan="1" colspan="1"><samp class="ph codeph">OMP_NUM_TEAMS=&lt;positive_integer&gt;</samp></td> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10805" rowspan="1" colspan="1"> Maximum number of teams on device </td> </tr> </tbody> </table> </div> <p class="p"> For a comprehensive explanation of <samp class="ph codeph">OMP_NUM_TEAMS</samp>, refer to the OpenMP specification. </p> </div> <div class="section" id="openmp-env-vars__number-of-threads-in-teams"><a name="openmp-env-vars__number-of-threads-in-teams" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Number of threads in teams</h3> <p class="p"> An <samp class="ph codeph">omp target teams</samp> construct offloaded to the GPU creates a league of teams, each consisting of a certain number of threads. 
The number of threads is the same for all teams in the league, and is calculated automatically unless the construct has a <samp class="ph codeph">thread_limit</samp> clause. </p> <p class="p"> The environment variable <samp class="ph codeph">OMP_TEAMS_THREAD_LIMIT</samp> can be used to limit the maximum number of threads in teams. The same maximum value can be set by the application with the runtime function <samp class="ph codeph">omp_set_teams_thread_limit</samp>. </p> <p class="p"> For NVIDIA GPUs, we recommend using values that are multiples of 32 (the size of a GPU thread warp). This applies equally to the <samp class="ph codeph">OMP_TEAMS_THREAD_LIMIT</samp> environment variable, the <samp class="ph codeph">omp_set_teams_thread_limit</samp> function, and the <samp class="ph codeph">thread_limit</samp> clause. For any other value, the actual limit on the number of threads per team will likely be rounded down to the nearest multiple of 32. The same guidance applies to the <samp class="ph codeph">num_threads</samp> clause as well. 
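</p> <p class="p"> The rounding described above can be illustrated with a minimal sketch. The helper <samp class="ph codeph">warp_aligned_limit</samp> and the function <samp class="ph codeph">scale</samp> are hypothetical names chosen for this example; they are not part of the OpenMP API or of the NVIDIA runtime. The helper only mirrors the recommendation by rounding a requested limit down to a multiple of 32 before it is passed to the <samp class="ph codeph">thread_limit</samp> clause. Raising requests smaller than one warp to 32 is an assumption of the sketch, not documented runtime behavior. </p>

```c
/* Threads per warp on NVIDIA GPUs, as stated in the text above. */
#define WARP_SIZE 32

/* Hypothetical helper: largest multiple of WARP_SIZE that does not exceed
 * `requested`. Requests below one warp are raised to WARP_SIZE so that the
 * result is never zero (an assumption of this sketch). */
int warp_aligned_limit(int requested)
{
    if (requested < WARP_SIZE)
        return WARP_SIZE;
    return (requested / WARP_SIZE) * WARP_SIZE;
}

/* Multiply each element of x by a, capping the threads per team at a
 * warp-aligned value. When OpenMP is not enabled, the pragma is ignored
 * and the loop runs serially on the host. */
void scale(double *x, int n, double a, int requested_limit)
{
    int limit = warp_aligned_limit(requested_limit);

    #pragma omp target teams distribute parallel for \
            thread_limit(limit) map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}
```

<p class="p"> With this helper, a requested limit of 100 reaches the clause as 96 and a request of 128 is passed through unchanged, so the value given to <samp class="ph codeph">thread_limit</samp> is always a multiple of 32. 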
</p> <div class="tablenoborder"><a name="openmp-env-vars__omp-teams-thread-limit" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="openmp-env-vars__omp-teams-thread-limit" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="54.54545454545454%" id="d54e10884" rowspan="1" colspan="1">Value</th> <th class="entry" valign="top" width="45.45454545454545%" id="d54e10887" rowspan="1" colspan="1">Behavior</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10884" rowspan="1" colspan="1"><samp class="ph codeph">OMP_TEAMS_THREAD_LIMIT=&lt;positive_integer&gt;</samp></td> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10887" rowspan="1" colspan="1"> Maximum number of threads in teams </td> </tr> </tbody> </table> </div> <p class="p"> For a comprehensive explanation of <samp class="ph codeph">OMP_TEAMS_THREAD_LIMIT</samp>, refer to the OpenMP specification. </p> </div> <div class="section" id="openmp-env-vars__openmp-cuda-grid"><a name="openmp-env-vars__openmp-cuda-grid" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Forcing the number of device teams and threads</h3> <p class="p"> In certain situations, such as debugging or performance tuning, it may be desirable to specify an exact number of teams and threads on the GPU. While OpenMP offers several ways to control these numbers, such as the <samp class="ph codeph">num_teams</samp> and <samp class="ph codeph">thread_limit</samp> clauses and the environment variables described above, none of them guarantees an exact configuration of teams and threads. </p> <p class="p"> The NVIDIA HPC OpenMP Runtime supports the <samp class="ph codeph">NVCOMPILER_OMP_CUDA_GRID</samp> environment variable. 
When set, it requests the runtime to use the exact number of teams and threads per team when running OpenMP compute constructs on the GPU. Essentially, its effect is to use a specific CUDA grid configuration for any kernel, bypassing runtime and compiler guidance. </p> <div class="tablenoborder"><a name="openmp-env-vars__nv-omp-cuda-grid" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="openmp-env-vars__nv-omp-cuda-grid" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="54.54545454545454%" id="d54e10950" rowspan="1" colspan="1">Value</th> <th class="entry" valign="top" width="45.45454545454545%" id="d54e10953" rowspan="1" colspan="1">Behavior</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="54.54545454545454%" headers="d54e10950" rowspan="1" colspan="1"><samp class="ph codeph">NVCOMPILER_OMP_CUDA_GRID=&lt;num_blocks&gt;,&lt;num_threads&gt;</samp></td> <td class="entry" valign="top" width="45.45454545454545%" headers="d54e10953" rowspan="1" colspan="1"> The &lt;num_blocks&gt; and &lt;num_threads&gt; must be positive integers. They are used to form a CUDA grid when running GPU kernels associated with <samp class="ph codeph">omp target</samp> compute constructs. </td> </tr> </tbody> </table> </div> <p class="p"> However, even with an exact CUDA grid specified, the runtime may still use a corrected configuration if that is necessary for a successful kernel launch. </p> <p class="p"> Please refer to the <a class="xref" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html" target="_blank" shape="rect">CUDA C++ Programming Guide</a> for the detailed explanation of how the CUDA kernel execution configurations work. 
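</p> <p class="p"> One way to check which configuration a kernel actually ran with is to query it from inside a target region. The C sketch below is a minimal example under stated assumptions: <samp class="ph codeph">query_launch_shape</samp> is a hypothetical helper name, and the sketch assumes the standard OpenMP routines <samp class="ph codeph">omp_get_num_teams</samp> and <samp class="ph codeph">omp_get_num_threads</samp> are callable in device code. </p>

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: record how many teams and threads per team a
 * target region actually ran with. In a build without OpenMP the
 * pragmas are ignored and both values stay at 1. */
static void query_launch_shape(int *num_teams, int *team_size)
{
    int teams = 1, threads = 1;

    assert(num_teams != 0 && team_size != 0);

    #pragma omp target teams map(tofrom: teams, threads)
    #pragma omp parallel
    {
#ifdef _OPENMP
        /* Only one thread of one team writes, to avoid a data race. */
        if (omp_get_team_num() == 0 && omp_get_thread_num() == 0) {
            teams = omp_get_num_teams();
            threads = omp_get_num_threads();
        }
#endif
    }

    *num_teams = teams;
    *team_size = threads;
}
```

<p class="p"> Running a program that calls this helper with, for example, <samp class="ph codeph">NVCOMPILER_OMP_CUDA_GRID=80,128</samp> in the environment (the values are illustrative) would be expected to report 80 teams of 128 threads, unless the runtime had to correct the configuration as described above. 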
</p> </div> </div> </div> <div class="topic concept nested1" id="openmp-fallback"><a name="openmp-fallback" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-fallback" name="openmp-fallback" shape="rect">7.2. Fallback Mode</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> The HPC compilers support host fallback of OpenMP <samp class="ph codeph">target</samp> regions when no GPU is present or <samp class="ph codeph">OMP_TARGET_OFFLOAD</samp> is set to <samp class="ph codeph">DISABLED</samp>. Execution should always be correct but the performance of the target region may not always be optimal when run on the host. OpenMP target regions prescriptively structured for optimal execution on GPUs may not perform well when run on the dissimilar architecture of the CPU. To provide performance portability between host and device, we recommend use of the <samp class="ph codeph">loop</samp> construct. </p> </div> <div class="section"> <h3 class="title sectiontitle">firstprivates with nowait not supported for host execution</h3> <p class="p"> There is currently a limitation on the use of the <samp class="ph codeph">nowait</samp> clause on target regions intended for execution on the host (-⁠mp or -⁠mp=gpu with <samp class="ph codeph">OMP_TARGET_OFFLOAD=DISABLED</samp>). If the target region references variables having the <samp class="ph codeph">firstprivate</samp> data-sharing attribute, their concurrent updates are not guaranteed to be safe. To work around this limitation, when running on the host, we recommend avoiding the <samp class="ph codeph">nowait</samp> clause on such target regions or equivalently using the <samp class="ph codeph">taskwait</samp> construct immediately following the region. </p> </div> </div> </div> <div class="topic concept nested1" id="openmp-loop"><a name="openmp-loop" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-loop" name="openmp-loop" shape="rect">7.3. 
Loop</a></h3> <div class="body conbody"> <p class="p"> The HPC compilers support the <samp class="ph codeph">loop</samp> construct with an extension to the default binding thread set mechanism specified by OpenMP in order to allow the compilers the freedom to analyze loops and dependencies to generate highly parallel code for CPU and GPU targets. In other words, the compilers map <samp class="ph codeph">loop</samp> to either teams or to threads, as the compiler chooses, unless the user explicitly specifies otherwise. The mapping selected is specific to each target architecture even within the same executable (i.e., GPU offload and host fallback) thereby facilitating performance portability. </p> <p class="p"> The shape of the parallelism offered by NVIDIA's GPUs, consisting of thread blocks and three dimensions of threads therein, differs from the multi-threaded vector parallelism of modern CPUs. The following table summarizes the OpenMP mapping to NVIDIA GPUs and multicore CPUs: </p> <div class="tablenoborder"><a name="openmp-loop__omp-loop-table-1" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="openmp-loop__omp-loop-table-1" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="20%" id="d54e11075" rowspan="1" colspan="1">Construct</th> <th class="entry" valign="top" width="40%" id="d54e11078" rowspan="1" colspan="1">CPU</th> <th class="entry" valign="top" width="40%" id="d54e11081" rowspan="1" colspan="1">GPU</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="20%" headers="d54e11075" rowspan="1" colspan="1"><samp class="ph codeph">!$omp target</samp></td> <td class="entry" valign="top" width="40%" headers="d54e11078" rowspan="1" colspan="1"></td> <td class="entry" valign="top" width="40%" headers="d54e11081" rowspan="1" colspan="1"> starts offload </td> </tr> <tr class="row"> <td class="entry" 
valign="top" width="20%" headers="d54e11075" rowspan="1" colspan="1"><samp class="ph codeph">!$omp teams</samp></td> <td class="entry" valign="top" width="40%" headers="d54e11078" rowspan="1" colspan="1"> single team </td> <td class="entry" valign="top" width="40%" headers="d54e11081" rowspan="1" colspan="1"> CUDA thread blocks in grid </td> </tr> <tr class="row"> <td class="entry" valign="top" width="20%" headers="d54e11075" rowspan="1" colspan="1"><samp class="ph codeph">!$omp parallel</samp></td> <td class="entry" valign="top" width="40%" headers="d54e11078" rowspan="1" colspan="1"> CPU threads </td> <td class="entry" valign="top" width="40%" headers="d54e11081" rowspan="1" colspan="1"> CUDA threads within thread block </td> </tr> <tr class="row"> <td class="entry" valign="top" width="20%" headers="d54e11075" rowspan="1" colspan="1"><samp class="ph codeph">!$omp simd</samp></td> <td class="entry" valign="top" width="40%" headers="d54e11078" rowspan="1" colspan="1"> hint for vector instructions </td> <td class="entry" valign="top" width="40%" headers="d54e11081" rowspan="1" colspan="1"> simdlen(1) </td> </tr> </tbody> </table> </div> <p class="p"> HPC programs need to leverage all available parallelism to achieve performance. The programmer can attempt to become an expert in the intricacies of each target architecture and use that knowledge to structure programs accordingly. This prescriptive model can be successful but tends to increase source code complexity and often requires restructuring for each new target architecture. 
Here's an example where a programmer explicitly requests the steps the compiler should take to map parallelism to two targets: </p><pre xml:space="preserve">
#ifdef TARGET_GPU
  #pragma omp target teams distribute reduction(max:error)
#else
  #pragma omp parallel for reduction(max:error)
#endif
  for( int j = 1; j &lt; n-1; j++) {
#ifdef TARGET_GPU
    #pragma omp parallel for reduction(max:error)
#endif
    for( int i = 1; i &lt; m-1; i++ ) {
      Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                           + A[j-1][i] + A[j+1][i]);
      error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
    }
  }
</pre><p class="p"> An alternative is for the programmer to focus on exposing parallelism in a program and allowing a compiler to do the mapping onto the target architectures. The HPC compilers' implementation of <samp class="ph codeph">loop</samp> supports this descriptive model. In this example, the programmer specifies the loop regions to be parallelized by the compiler and the compilers parallelize <samp class="ph codeph">loop</samp> across teams and threads: </p><pre xml:space="preserve">
  #pragma omp target teams loop reduction(max:error)
  for( int j = 1; j &lt; n-1; j++) {
    #pragma omp loop reduction(max:error)
    for( int i = 1; i &lt; m-1; i++ ) {
      Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                           + A[j-1][i] + A[j+1][i]);
      error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
    }
  }
</pre><p class="p"> The programmer's tuning tool with <samp class="ph codeph">loop</samp> is the <samp class="ph codeph">bind</samp> clause. 
The following table extends the previous mapping example: </p> <div class="tablenoborder"><a name="openmp-loop__omp-loop-table-2" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="openmp-loop__omp-loop-table-2" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="42.857142857142854%" id="d54e11193" rowspan="1" colspan="1">Construct</th> <th class="entry" valign="top" width="28.57142857142857%" id="d54e11196" rowspan="1" colspan="1">CPU</th> <th class="entry" valign="top" width="28.57142857142857%" id="d54e11199" rowspan="1" colspan="1">GPU</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e11193" rowspan="1" colspan="1"><samp class="ph codeph">!$omp loop bind(teams)</samp></td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e11196" rowspan="1" colspan="1"> threads </td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e11199" rowspan="1" colspan="1"> CUDA thread blocks and threads </td> </tr> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e11193" rowspan="1" colspan="1"><samp class="ph codeph">!$omp loop bind(parallel)</samp></td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e11196" rowspan="1" colspan="1"> threads </td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e11199" rowspan="1" colspan="1"> CUDA threads </td> </tr> <tr class="row"> <td class="entry" valign="top" width="42.857142857142854%" headers="d54e11193" rowspan="1" colspan="1"><samp class="ph codeph">!$omp loop bind(thread)</samp></td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e11196" rowspan="1" colspan="1"> single thread (useful for vector instructions) </td> <td class="entry" valign="top" width="28.57142857142857%" headers="d54e11199" 
rowspan="1" colspan="1"> single thread </td> </tr> </tbody> </table> </div> <p class="p"> Orphaned <samp class="ph codeph">loop</samp> constructs within a single file are supported; a binding region of either <samp class="ph codeph">parallel</samp> or <samp class="ph codeph">thread</samp> must be specified with such loops via the <samp class="ph codeph">bind</samp> clause. The compilers support <samp class="ph codeph">loop</samp> regions containing procedure calls as long as the callee does not contain OpenMP directives. </p> <p class="p"> Here are a few additional examples using <samp class="ph codeph">loop</samp>. We also show examples of the type of information the compiler would provide when using the <samp class="ph codeph">-⁠Minfo</samp> compiler option. </p> <p class="p"> Use of <samp class="ph codeph">loop</samp> in Fortran: </p><pre xml:space="preserve">
!$omp target teams loop
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown
    do ig_blk = 1, ig_blksize
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          !expensive computation codes
        enddo
      enddo
    enddo
  enddo
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp
42, !$omp target teams loop
    42, Generating "nvkernel_MAIN__F1L42_1" GPU kernel
        Generating Tesla code
      43, Loop parallelized across teams ! blockidx%x
      44, Loop run sequentially
      45, Loop run sequentially
      46, Loop run sequentially
      47, Loop parallelized across threads(128) ! threadidx%x
    42, Generating Multicore code
      43, Loop parallelized across threads
</pre><p class="p"> Use of <samp class="ph codeph">loop</samp>, <samp class="ph codeph">collapse</samp>, and <samp class="ph codeph">bind</samp>: </p><pre xml:space="preserve">
!$omp target teams loop collapse(3)
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown
    do ig_blk = 1, ig_blksize
      !$omp loop bind(parallel) collapse(2)
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          !expensive computation codes
        enddo
      enddo
    enddo
  enddo
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp
42, !$omp target teams loop
    42, Generating "nvkernel_MAIN__F1L42_1" GPU kernel
        Generating Tesla code
      43, Loop parallelized across teams collapse(3) ! blockidx%x
      44,   ! blockidx%x collapsed
      45,   ! blockidx%x collapsed
      47, Loop parallelized across threads(128) collapse(2) ! threadidx%x
      48,   ! threadidx%x collapsed
    42, Generating Multicore code
      43, Loop parallelized across threads
</pre><p class="p"> Use of <samp class="ph codeph">loop</samp>, <samp class="ph codeph">collapse</samp>, and <samp class="ph codeph">bind(thread)</samp>: </p><pre xml:space="preserve">
!$omp target teams loop collapse(3)
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown
    do ig_blk = 1, ig_blksize
      !$omp loop bind(thread) collapse(2)
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          ! expensive computation codes
        enddo
      enddo
    enddo
  enddo
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp
42, !$omp target teams loop
    42, Generating "nvkernel_MAIN__F1L42_1" GPU kernel
        Generating Tesla code
      43, Loop parallelized across teams, threads(128) collapse(3) ! blockidx%x threadidx%x
      44,   ! blockidx%x threadidx%x collapsed
      45,   ! blockidx%x threadidx%x collapsed
      47, Loop run sequentially
      48,   collapsed
    42, Generating Multicore code
      43, Loop parallelized across threads
</pre></div> </div> <div class="topic concept nested1" id="openmp-subset"><a name="openmp-subset" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-subset" name="openmp-subset" shape="rect">7.4. OpenMP Subset</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> This section lists the subset of OpenMP 5.0 features that the HPC compilers support. We have attempted to define this subset of features to be those that enable, where possible, OpenMP-for-GPU application performance that closely mirrors the success NVIDIA has seen with OpenACC. Almost every feature supported on NVIDIA GPUs is also supported on multicore CPUs, although the reverse is not true. Most constructs from OpenMP 3.1 and OpenMP 4.5 that apply to multicore CPUs are supported for CPU targets, and some features from OpenMP 5.0 are supported as well. </p> <p class="p"> OpenMP target offload to NVIDIA GPUs is supported on NVIDIA V100 or later GPUs. </p> </div> <div class="section"> <p class="p"> The section numbers below correspond to the section numbers in the OpenMP Application Programming Interface Version 5.0 November 2018 document. </p> <p class="p"><strong class="ph b">2. Directives</strong></p> <p class="p"><strong class="ph b">2.3 Variant Directives</strong></p> <p class="p"><strong class="ph b">2.3.4 Metadirectives</strong></p> <p class="p"> The <samp class="ph codeph">target_device</samp>/<samp class="ph codeph">device</samp> context selector is supported with the <samp class="ph codeph">kind</samp>(<samp class="ph codeph">host</samp>|<samp class="ph codeph">nohost</samp>|<samp class="ph codeph">cpu</samp>|<samp class="ph codeph">gpu</samp>) and <samp class="ph codeph">arch</samp>(<samp class="ph codeph">nvptx</samp>|<samp class="ph codeph">nvptx64</samp>) trait selectors. 
The <samp class="ph codeph">arch</samp> trait property <samp class="ph codeph">nvptx</samp> is an alias for <samp class="ph codeph">nvptx64</samp>; any other <samp class="ph codeph">arch</samp> trait properties are treated as not matching or are ignored. The <samp class="ph codeph">isa</samp> selector is treated as not matching or is ignored; no support is provided to select a context based on NVIDIA GPU compute capability. </p> <p class="p"> The <samp class="ph codeph">implementation</samp> context selector is supported with the <samp class="ph codeph">vendor(nvidia)</samp> trait selector. </p> <p class="p"> The <samp class="ph codeph">user</samp> context selector is supported with the <samp class="ph codeph">condition(expression)</samp> trait selector, including dynamic <samp class="ph codeph">user</samp> traits. </p> <p class="p"> The syntax <samp class="ph codeph">begin</samp>/<samp class="ph codeph">end metadirective</samp> is not supported. </p> <p class="p"><strong class="ph b">2.3.5 Declare Variant Directive</strong></p> <p class="p"> The <samp class="ph codeph">device</samp> context selector is supported with the <samp class="ph codeph">kind</samp>(<samp class="ph codeph">host</samp>|<samp class="ph codeph">nohost</samp>|<samp class="ph codeph">cpu</samp>|<samp class="ph codeph">gpu</samp>) and <samp class="ph codeph">arch</samp>(<samp class="ph codeph">nvptx</samp>|<samp class="ph codeph">nvptx64</samp>) trait selectors. The <samp class="ph codeph">arch</samp> trait property <samp class="ph codeph">nvptx</samp> is an alias for <samp class="ph codeph">nvptx64</samp>; any other <samp class="ph codeph">arch</samp> trait properties are treated as not matching or are ignored. The <samp class="ph codeph">isa</samp> selector is also treated as not matching or is ignored; no support is provided to select a context based on NVIDIA GPU compute capability. 
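</p> <p class="p"> As a minimal sketch of the <samp class="ph codeph">device</samp> selector with <samp class="ph codeph">kind(gpu)</samp>, the C example below declares a GPU variant of a function and calls the base function from a target region. The function names <samp class="ph codeph">code_path</samp>, <samp class="ph codeph">code_path_gpu</samp>, and <samp class="ph codeph">code_path_in_target</samp> are hypothetical, and the sketch assumes the functions are made available to device code implicitly, as OpenMP 5.0 allows for functions called from a target region. </p>

```c
#include <assert.h>

/* Hypothetical variant intended for NVIDIA GPU device code. */
int code_path_gpu(void)
{
    return 1;
}

/* Base function. Calls made in a context that matches the selector
 * (device code of kind gpu) may be replaced by code_path_gpu. */
#pragma omp declare variant(code_path_gpu) match(device={kind(gpu)})
int code_path(void)
{
    return 0;
}

/* Call the base function from inside a target region and return the
 * value that came back. Without OpenMP, or on host fallback, the base
 * version runs and the result is 0. */
int code_path_in_target(void)
{
    int path = -1;

    #pragma omp target map(from: path)
    path = code_path();

    assert(path == 0 || path == 1);
    return path;
}
```

<p class="p"> When the target region is offloaded to a GPU, one would expect the call to resolve to the variant and the function to return 1; this is the anticipated behavior of the selector, not a verified result. 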
</p> <p class="p"> The <samp class="ph codeph">implementation</samp> context selector is supported with the <samp class="ph codeph">vendor(nvidia)</samp> trait selector; all other implementation trait selectors are treated as not matching. </p> <p class="p"> The syntax <samp class="ph codeph">begin</samp>/<samp class="ph codeph">end declare variant</samp> is supported for C/C⁠+⁠+. </p> <p class="p"><strong class="ph b">2.4 Requires Directive</strong></p> <p class="p"> The <samp class="ph codeph">requires</samp> directive has limited support. The requirement clauses <samp class="ph codeph">unified_address</samp> and <samp class="ph codeph">unified_shared_memory</samp> are accepted but have no effect. To activate OpenMP unified shared memory programming a command-line option needs to be passed in (refer to <a class="xref" href="index.html#openmp-unified-mem" shape="rect">OpenMP with CUDA Unified Memory</a> for more details). </p> <p class="p"><strong class="ph b">2.5 Internal Control Variables</strong></p> <div class="p"> ICV support is as follows. 
<ul class="ul"> <li class="li"><samp class="ph codeph">dyn-var</samp>, <samp class="ph codeph">nthreads-var</samp>, <samp class="ph codeph">thread-limit-var</samp>, <samp class="ph codeph">max-active-levels-var</samp>, <samp class="ph codeph">active-levels-var</samp>, <samp class="ph codeph">levels-var</samp>, <samp class="ph codeph">run-sched-var</samp>, <samp class="ph codeph">dyn-sched-var</samp>, and <samp class="ph codeph">stacksize-var</samp> are supported </li> <li class="li"><samp class="ph codeph">place-partition-var</samp>, <samp class="ph codeph">bind-var</samp>, <samp class="ph codeph">wait-policy-var</samp>, <samp class="ph codeph">display-affinity-var</samp>, <samp class="ph codeph">default-device-var</samp>, and <samp class="ph codeph">target-offload-var</samp> are supported only on the CPU </li> <li class="li"><samp class="ph codeph">affinity-format-var</samp> is supported only on the CPU; its value is immutable </li> <li class="li"><samp class="ph codeph">max-task-priority-var</samp> and <samp class="ph codeph">def-allocator-var</samp> are not supported </li> <li class="li"><samp class="ph codeph">cancel-var</samp> is not supported; it always returns false </li> </ul> </div> <p class="p"><strong class="ph b">2.6 Parallel Construct</strong></p> <div class="p"> Support for <samp class="ph codeph">parallel</samp> construct clauses is as follows. 
<ul class="ul"> <li class="li">The <samp class="ph codeph">num_threads</samp>, <samp class="ph codeph">default</samp>, <samp class="ph codeph">private</samp>, <samp class="ph codeph">firstprivate</samp>, and <samp class="ph codeph">shared</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">reduction</samp> clause is supported as described in 2.19.5 </li> <li class="li">The <samp class="ph codeph">if</samp> and <samp class="ph codeph">copyin</samp> clauses are supported only for CPU targets; the compiler emits an error for GPU targets </li> <li class="li">The <samp class="ph codeph">proc_bind</samp> clause is supported only for CPU targets; it is ignored for GPU targets </li> <li class="li">The <samp class="ph codeph">allocate</samp> clause is ignored </li> </ul> </div> <p class="p"><strong class="ph b">2.7 Teams Construct</strong></p> <div class="p"> The <samp class="ph codeph">teams</samp> construct is supported only when nested within a <samp class="ph codeph">target</samp> construct that contains no statements, declarations, or directives outside the <samp class="ph codeph">teams</samp> construct, or as a combined <samp class="ph codeph">target</samp><samp class="ph codeph">teams</samp> construct. The <samp class="ph codeph">teams</samp> construct is supported for GPU targets. If the <samp class="ph codeph">target</samp> construct falls back to CPU mode, the number of teams is one. Support for <samp class="ph codeph">teams</samp> construct clauses is as follows. 
<ul class="ul"> <li class="li">The <samp class="ph codeph">num_teams</samp>, <samp class="ph codeph">thread_limit</samp>, <samp class="ph codeph">default</samp>, <samp class="ph codeph">private</samp>, and <samp class="ph codeph">firstprivate</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">reduction</samp> clause is supported as described in 2.19.5 </li> <li class="li">The <samp class="ph codeph">shared</samp> clause is supported for CPU targets and is supported for GPU targets in unified-memory mode </li> <li class="li">The <samp class="ph codeph">allocate</samp> clause is ignored </li> </ul> </div> <p class="p"><strong class="ph b">2.8 Worksharing Constructs</strong></p> <p class="p"><strong class="ph b">2.8.1 Sections Construct</strong></p> <div class="p"> The <samp class="ph codeph">sections</samp> construct is supported only for CPU targets; the compiler emits an error for GPU targets. Support for <samp class="ph codeph">sections</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">private</samp> and <samp class="ph codeph">firstprivate</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">reduction</samp> clause is supported as described in 2.19.5 </li> <li class="li">The <samp class="ph codeph">lastprivate</samp> clause is supported; the optional <samp class="ph codeph">lastprivate</samp> modifier is not supported </li> <li class="li">The <samp class="ph codeph">allocate</samp> clause is ignored </li> </ul> </div> <p class="p"><strong class="ph b">2.8.2 Single Construct</strong></p> <div class="p"> Support for <samp class="ph codeph">single</samp> construct clauses is as follows. 
<ul class="ul"> <li class="li">The <samp class="ph codeph">private</samp>, <samp class="ph codeph">firstprivate</samp>, and <samp class="ph codeph">nowait</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">copyprivate</samp> clause is supported only for CPU targets; the compiler emits an error for GPU targets </li> <li class="li">The <samp class="ph codeph">allocate</samp> clause is ignored </li> </ul> </div> <p class="p"><strong class="ph b">2.8.3 Workshare Construct</strong></p> <p class="p"> The <samp class="ph codeph">workshare</samp> construct is supported in Fortran only for CPU targets; the compiler emits an error for GPU targets. </p> <p class="p"><strong class="ph b">2.9 Loop-Related Constructs</strong></p> <p class="p"><strong class="ph b">2.9.2 Worksharing-Loop Construct (for/do)</strong></p> <div class="p"> Support for worksharing <samp class="ph codeph">for</samp> and <samp class="ph codeph">do</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">private</samp>, <samp class="ph codeph">firstprivate</samp>, and <samp class="ph codeph">collapse</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">reduction</samp> clause is supported as described in 2.19.5 </li> <li class="li">The <samp class="ph codeph">schedule</samp> clause is supported; the optional modifiers are not supported </li> <li class="li">The <samp class="ph codeph">lastprivate</samp> clause is supported; the optional <samp class="ph codeph">lastprivate</samp> modifier is not supported </li> <li class="li">The <samp class="ph codeph">ordered</samp> clause is supported only for CPU targets; <samp class="ph codeph">ordered(n)</samp> clause is not supported </li> <li class="li">The <samp class="ph codeph">linear</samp> clause is not supported </li> <li class="li">The <samp class="ph codeph">order(concurrent)</samp> clause is ignored </li> <li class="li">The <samp class="ph codeph">allocate</samp> 
clause is ignored </li> </ul> </div> <p class="p"><strong class="ph b">2.9.3 SIMD Directives</strong></p> <div class="p"> The <samp class="ph codeph">simd</samp> construct can be used to provide tuning hints for CPU targets; the <samp class="ph codeph">simd</samp> construct is ignored for GPU targets. Support for <samp class="ph codeph">simd</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">reduction</samp> clause is supported as described in 2.19.5 </li> <li class="li">The <samp class="ph codeph">lastprivate</samp> clause is supported; the optional <samp class="ph codeph">lastprivate</samp> modifier is not supported </li> <li class="li">The <samp class="ph codeph">if</samp>, <samp class="ph codeph">simdlen</samp>, and <samp class="ph codeph">linear</samp> clauses are not supported </li> <li class="li">The <samp class="ph codeph">safelen</samp>, <samp class="ph codeph">aligned</samp>, <samp class="ph codeph">nontemporal</samp>, and <samp class="ph codeph">order(concurrent)</samp> clauses are ignored </li> </ul> </div> <p class="p"> The composite <samp class="ph codeph">for</samp><samp class="ph codeph">simd</samp> and <samp class="ph codeph">do</samp><samp class="ph codeph">simd</samp> constructs are supported for CPU targets; they are treated as <samp class="ph codeph">for</samp> and <samp class="ph codeph">do</samp> directives for GPU targets. Supported <samp class="ph codeph">simd</samp> clauses are supported on the composite constructs for the CPU. Any <samp class="ph codeph">simd</samp> clauses are ignored for GPU targets. </p> <p class="p"> The <samp class="ph codeph">declare</samp><samp class="ph codeph">simd</samp> directive is ignored. </p> <p class="p"><strong class="ph b">2.9.4 Distribute Directives</strong></p> <div class="p"> The <samp class="ph codeph">distribute</samp> construct is supported within a <samp class="ph codeph">teams</samp> construct. 
Support for <samp class="ph codeph">distribute</samp> construct clauses is as follows: <ul class="ul"> <li class="li"> The <samp class="ph codeph">private</samp>, <samp class="ph codeph">firstprivate</samp>, <samp class="ph codeph">collapse</samp>, and <samp class="ph codeph">dist_schedule(static [ ,chunksize])</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">lastprivate</samp> clause is not supported </li> <li class="li">The <samp class="ph codeph">allocate</samp> clause is ignored </li> </ul> </div> <p class="p"> The <samp class="ph codeph">distribute</samp><samp class="ph codeph">simd</samp> construct is treated as a <samp class="ph codeph">distribute</samp> construct and is supported for GPU targets; valid supported <samp class="ph codeph">distribute</samp> clauses are accepted; <samp class="ph codeph">simd</samp> clauses are ignored. The <samp class="ph codeph">distribute</samp><samp class="ph codeph">simd</samp> construct is not supported for CPU targets. </p> <p class="p"> The <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">for</samp> or <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">do</samp> constructs are supported for GPU targets. Valid supported <samp class="ph codeph">distribute</samp> and <samp class="ph codeph">parallel</samp> and <samp class="ph codeph">for</samp> or <samp class="ph codeph">do</samp> clauses are accepted. The <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">for</samp> or <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">do</samp> constructs are not supported for CPU targets. 
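</p> <p class="p"> The combined construct described above can be sketched in C as follows. This is a minimal example for GPU offload; the function name <samp class="ph codeph">saxpy</samp> and its parameters are illustrative, and the clauses used are limited to ones listed as supported in this section. </p>

```c
#include <assert.h>

/* y = a * x + y, offloaded with the combined construct. The iterations
 * are split across teams in static chunks (dist_schedule) and then
 * across the threads of each team. Without OpenMP the pragma is
 * ignored and the loop runs serially. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);

    #pragma omp target teams distribute parallel for \
            dist_schedule(static) map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

<p class="p"> The explicit <samp class="ph codeph">map</samp> clauses state which arrays are only read on the device and which are copied back, so the data movement does not depend on default mapping rules. 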
</p> <p class="p"> The <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">for</samp><samp class="ph codeph">simd</samp> or <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">do</samp><samp class="ph codeph">simd</samp> constructs are treated as <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">for</samp> or <samp class="ph codeph">distribute</samp><samp class="ph codeph">parallel</samp><samp class="ph codeph">do</samp> constructs and are supported for GPU targets. These are not supported for CPU targets. </p> <p class="p"><strong class="ph b">2.9.5 Loop Construct</strong></p> <div class="p"> Support for <samp class="ph codeph">loop</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">private</samp>, <samp class="ph codeph">bind</samp>, and <samp class="ph codeph">collapse</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">reduction</samp> clause is supported as described in 2.19.5 </li> <li class="li">The <samp class="ph codeph">order(concurrent)</samp> clause is assumed </li> <li class="li">The <samp class="ph codeph">lastprivate</samp> clause is not supported </li> </ul> </div> <p class="p"><strong class="ph b">2.10 Tasking Constructs</strong></p> <p class="p"><strong class="ph b">2.10.1 Task Construct</strong></p> <div class="p"> The <samp class="ph codeph">task</samp> construct is supported for CPU targets. The compiler emits an error when it encounters <samp class="ph codeph">task</samp> within a <samp class="ph codeph">target</samp> construct. 
Support for <samp class="ph codeph">task</samp> construct clauses is as follows: <ul class="ul"> <li class="li">The <samp class="ph codeph">if</samp>, <samp class="ph codeph">final</samp>, <samp class="ph codeph">default</samp>, <samp class="ph codeph">private</samp>, <samp class="ph codeph">firstprivate</samp>, and <samp class="ph codeph">shared</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">depend([dependmodifier,] dependtype : list)</samp> clause is supported as described in 2.17.11 </li> </ul> </div> <p class="p"><strong class="ph b">2.10.4 Taskyield Construct</strong></p> <p class="p"> The <samp class="ph codeph">taskyield</samp> construct is supported for CPU targets; it is ignored for GPU targets. </p> <p class="p"><strong class="ph b">2.11 Memory Management Directives</strong></p> <p class="p"> The memory management allocators, memory management API routines, and memory management directives are not supported. </p> <p class="p"><strong class="ph b">2.12 Device Directives</strong></p> <p class="p"><strong class="ph b">2.12.1 Device Initialization</strong></p> <p class="p"> Depending on how the program is compiled and linked, device initialization may occur at the first <samp class="ph codeph">target</samp> construct or API routine call, or may occur implicitly at program startup. </p> <p class="p"><strong class="ph b">2.12.2 Target Data Construct</strong></p> <div class="p"> The <samp class="ph codeph">target</samp> <samp class="ph codeph">data</samp> construct is supported for GPU targets. Support for <samp class="ph codeph">target</samp> <samp class="ph codeph">data</samp> construct clauses is as follows. 
<ul class="ul"> <li class="li">The <samp class="ph codeph">if</samp>, <samp class="ph codeph">device</samp>, <samp class="ph codeph">use_device_ptr</samp>, and <samp class="ph codeph">use_device_addr</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">map</samp> clause is supported as described in 2.19.7 </li> </ul> </div> <p class="p"><strong class="ph b">2.12.3 Target Enter Data Construct</strong></p> <div class="p"> The <samp class="ph codeph">target enter data</samp> construct is supported for GPU targets. Support for <samp class="ph codeph">target enter data</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">if</samp>, <samp class="ph codeph">device</samp>, and <samp class="ph codeph">nowait</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">map</samp> clause is supported as described in 2.19.7. </li> <li class="li">The <samp class="ph codeph">depend([dependmodifier,] dependtype : list)</samp> clause is supported as described in 2.17.11 </li> </ul> </div> <p class="p"><strong class="ph b">2.12.4 Target Exit Data Construct</strong></p> <div class="p"> The <samp class="ph codeph">target exit data</samp> construct is supported for GPU targets. Support for <samp class="ph codeph">target exit data</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">if</samp>, <samp class="ph codeph">device</samp>, and <samp class="ph codeph">nowait</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">map</samp> clause is supported as described in 2.19.7. 
</li> <li class="li">The <samp class="ph codeph">depend([dependmodifier,] dependtype : list)</samp> clause is supported as described in 2.17.11 </li> </ul> </div> <p class="p"><strong class="ph b">2.12.5 Target Construct</strong></p> <div class="p"> The <samp class="ph codeph">target</samp> construct is supported for GPU targets. If there is no GPU or GPU offload is otherwise disabled, execution falls back to CPU mode. Support for <samp class="ph codeph">target</samp> construct clauses is as follows: <ul class="ul"> <li class="li">The <samp class="ph codeph">if</samp>, <samp class="ph codeph">private</samp>, <samp class="ph codeph">firstprivate</samp>, <samp class="ph codeph">is_device_ptr</samp>, and <samp class="ph codeph">nowait</samp> clauses are supported </li> <li class="li">The <samp class="ph codeph">device</samp> clause is supported without the device-modifier <samp class="ph codeph">ancestor</samp> keyword </li> <li class="li">The <samp class="ph codeph">map</samp> clause is supported as described in 2.19.7 </li> <li class="li">The <samp class="ph codeph">defaultmap</samp> clause is supported using OpenMP 5.0 semantics </li> <li class="li">The <samp class="ph codeph">depend([dependmodifier,] dependtype : list)</samp> clause is supported as described in 2.17.11 </li> <li class="li">The <samp class="ph codeph">allocate</samp> and <samp class="ph codeph">uses_allocators</samp> clauses are ignored </li> </ul> </div> <p class="p"><strong class="ph b">2.12.6 Target Update Construct</strong></p> <div class="p"> The <samp class="ph codeph">target update</samp> construct is supported for GPU targets. Support for <samp class="ph codeph">target update</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">if</samp>, <samp class="ph codeph">device</samp>, and <samp class="ph codeph">nowait</samp> clauses are supported. 
</li> <li class="li">The <samp class="ph codeph">to</samp> and <samp class="ph codeph">from</samp> clauses are supported without <samp class="ph codeph">mapper</samp> or <samp class="ph codeph">mapid</samp></li> <li class="li">The <samp class="ph codeph">depend([dependmodifier,] dependtype : list)</samp> clause is supported as described in 2.17.11 </li> </ul> </div> <p class="p"> Array sections are supported in <samp class="ph codeph">to</samp> and <samp class="ph codeph">from</samp> clauses, including noncontiguous array sections. Array section strides are not supported. If the array section is noncontiguous, the OpenMP runtime may have to use multiple host-to-device or device-to-host data transfer operations, which increases the overhead. If the host data is in host-pinned memory, then <samp class="ph codeph">update</samp> data transfers with the <samp class="ph codeph">nowait</samp> clause are asynchronous. This means the data transfer for a <samp class="ph codeph">target update to nowait</samp> may not occur immediately or synchronously with the program thread, and any changes to the data may affect the transfer, until a synchronizing operation is reached. Similarly, a <samp class="ph codeph">target update from nowait</samp> may not occur immediately or synchronously with the program thread, and the downloaded data may not be available until a synchronizing operation is reached. 
If the host data is not in host-pinned memory, then <samp class="ph codeph">update</samp> data transfers with the <samp class="ph codeph">nowait</samp> clause may require the data transfer operation to use an intermediate pinned buffer managed by the OpenMP runtime library. A memory copy on the host between the program memory and the pinned buffer may then be needed before starting or before finishing the transfer, which adds overhead and affects performance. To learn more about the pinned buffer, please refer to <a class="xref" href="index.html#acc-mem-pinned-buffer" shape="rect">Staging Memory Buffer</a>. </p> <p class="p"><strong class="ph b">2.12.7 Declare Target Construct</strong></p> <div class="p"> The <samp class="ph codeph">declare target</samp> construct is supported for GPU targets. <ul class="ul"> <li class="li"><samp class="ph codeph">declare target ... end declare target</samp> is supported </li> <li class="li"><samp class="ph codeph">declare target(list)</samp> is supported </li> <li class="li">The <samp class="ph codeph">to(list)</samp> clause is supported </li> <li class="li">The <samp class="ph codeph">device_type</samp> clause is supported for C/C⁠+⁠+ </li> </ul> A function or procedure that is referenced in a function or procedure that appears in a <samp class="ph codeph">declare target to</samp> clause (explicitly or implicitly) is treated as if its name had implicitly appeared in a <samp class="ph codeph">declare target to</samp> clause. </div> <p class="p"><strong class="ph b">2.13 Combined Constructs</strong></p> <p class="p"> Combined constructs are supported to the extent that the component constructs are themselves supported. 
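</p> <p class="p"> As a minimal sketch of how the <samp class="ph codeph">declare target</samp> rules in 2.12.7 fit together with a combined construct, the following C code marks a helper for device compilation and calls it from a <samp class="ph codeph">target teams distribute parallel for</samp> loop. The names <samp class="ph codeph">scale</samp> and <samp class="ph codeph">scale_all</samp> are hypothetical and chosen only for illustration. </p><pre xml:space="preserve">
/* Hypothetical helper; declare target makes it callable from device code. */
#pragma omp declare target
static float scale(float x, float factor) { return x * factor; }
#pragma omp end declare target

/* Hypothetical driver: multiplies each of the n elements of v by factor. */
void scale_all(float *v, int n, float factor)
{
  /* Combined construct built only from supported component constructs. */
  #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
  for (int i = 0; i &lt; n; i++) {
    v[i] = scale(v[i], factor);
  }
}
</pre><p class="p"> If no GPU is present or GPU offload is disabled, the loop falls back to CPU execution as described in 2.12.5, so the values written to <samp class="ph codeph">v</samp> should be the same in either case.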
</p> <p class="p"><strong class="ph b">2.14 Clauses on Combined and Composite Constructs</strong></p> <p class="p"> Clauses on combined constructs are supported to the extent that the clauses are supported on the component constructs. </p> <p class="p"><strong class="ph b">2.16 Master Construct</strong></p> <p class="p"> The <samp class="ph codeph">master</samp> construct is supported for CPU and GPU targets. </p> <p class="p"><strong class="ph b">2.17 Synchronization Constructs and Clauses</strong></p> <p class="p"><strong class="ph b">2.17.1 Critical Construct</strong></p> <p class="p"> The <samp class="ph codeph">critical</samp> construct is supported only for CPU targets; the compiler emits an error for GPU targets. </p> <p class="p"><strong class="ph b">2.17.2 Barrier Construct</strong></p> <p class="p"> The <samp class="ph codeph">barrier</samp> construct is supported. </p> <p class="p"><strong class="ph b">2.17.3 Implicit Barriers</strong></p> <p class="p"> Implicit barriers are implemented. </p> <p class="p"><strong class="ph b">2.17.4 Implementation-Specific Barriers</strong></p> <p class="p"> There may be implementation-specific barriers, and they may be different for CPU targets than for GPU targets. </p> <p class="p"><strong class="ph b">2.17.5 Taskwait Construct</strong></p> <div class="p"> The <samp class="ph codeph">taskwait</samp> construct is supported only for CPU targets; it is ignored for GPU targets. <ul class="ul"> <li class="li">The <samp class="ph codeph">depend([dependmodifier,] dependtype : list)</samp> clause is supported as described in 2.17.11 </li> </ul> </div> <p class="p"><strong class="ph b">2.17.6 Taskgroup Construct</strong></p> <p class="p"> The <samp class="ph codeph">taskgroup</samp> construct is supported only for CPU targets. It is ignored for GPU targets. 
</p> <p class="p"><strong class="ph b">2.17.7 Atomic Construct</strong></p> <div class="p"> Support for <samp class="ph codeph">atomic</samp> construct clauses is as follows. <ul class="ul"> <li class="li">The <samp class="ph codeph">read</samp>, <samp class="ph codeph">write</samp>, <samp class="ph codeph">update</samp>, and <samp class="ph codeph">capture</samp> clauses are supported. </li> <li class="li">The memory order clauses <samp class="ph codeph">seq_cst</samp>, <samp class="ph codeph">acq_rel</samp>, <samp class="ph codeph">release</samp>, <samp class="ph codeph">acquire</samp>, <samp class="ph codeph">relaxed</samp> are not supported </li> <li class="li">The <samp class="ph codeph">hint</samp> clause is ignored </li> </ul> </div> <p class="p"><strong class="ph b">2.17.8 Flush Construct</strong></p> <p class="p"> The <samp class="ph codeph">flush</samp> construct is supported only for CPU targets. </p> <p class="p"><strong class="ph b">2.17.9 Ordered Construct and Ordered Directive</strong></p> <p class="p"> The <samp class="ph codeph">ordered</samp> block construct is supported only for CPU targets. </p> <p class="p"><strong class="ph b">2.17.11 Depend Clause</strong></p> <p class="p"> The <samp class="ph codeph">depend</samp> clause is supported on CPU targets. It is not supported on GPU targets. The dependence types <samp class="ph codeph">in</samp>, <samp class="ph codeph">out</samp>, and <samp class="ph codeph">inout</samp> are supported. The dependence types <samp class="ph codeph">mutexinoutset</samp> and <samp class="ph codeph">depobj</samp>, dependence modifier <samp class="ph codeph">iterator(iters)</samp>, <samp class="ph codeph">depend(source)</samp>, and <samp class="ph codeph">depend(sink:vector)</samp> are not supported. 
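</p> <p class="p"> The supported dependence types can be sketched with a small CPU-only producer and consumer pair. This example is illustrative and not part of the NVIDIA documentation; the function name <samp class="ph codeph">produce_then_consume</samp> is hypothetical. </p><pre xml:space="preserve">
/* Hypothetical example: returns 2 * seed + 1, computed by two dependent tasks. */
int produce_then_consume(int seed)
{
  int x = 0;
  int y = 0;
  #pragma omp parallel
  #pragma omp single
  {
    /* Producer: writes x, so it declares an out dependence on x. */
    #pragma omp task depend(out: x) shared(x)
    x = 2 * seed;

    /* Consumer: reads x and writes y; it is ordered after the producer. */
    #pragma omp task depend(in: x) depend(out: y) shared(x, y)
    y = x + 1;

    /* Wait for both child tasks before leaving the single region. */
    #pragma omp taskwait
  }
  return y;
}
</pre><p class="p"> Only <samp class="ph codeph">in</samp> and <samp class="ph codeph">out</samp> are used here, which keeps the sketch inside the set of dependence types listed as supported.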
</p> <p class="p"><strong class="ph b">2.19 Data Environment</strong></p> <p class="p"><strong class="ph b">2.19.2 Threadprivate Directive</strong></p> <p class="p"> The <samp class="ph codeph">threadprivate</samp> directive is supported only for CPU targets. It is not supported for GPU targets; references to <samp class="ph codeph">threadprivate</samp> variables in device code are not supported. </p> <p class="p"><strong class="ph b">2.19.5 Reduction Clauses and Directives</strong></p> <p class="p"> The <samp class="ph codeph">reduction</samp> clause is supported. The optional modifier is not supported. </p> <p class="p"><strong class="ph b">2.19.6 Data Copying Clauses</strong></p> <p class="p"> The data copying <samp class="ph codeph">copyin</samp> and <samp class="ph codeph">copyprivate</samp> clauses are supported only for CPU targets; the compiler emits a compile-time error for GPU targets. </p> <p class="p"><strong class="ph b">2.19.7 Data Mapping Attribute Rules, Clauses, and Directives</strong></p> <div class="p"> <ul class="ul"> <li class="li">The <samp class="ph codeph">map([[mapmod[,]...] maptype:] datalist)</samp> clause is supported. Of the map-type-modifiers, <samp class="ph codeph">always</samp> is supported, <samp class="ph codeph">close</samp> is ignored, and <samp class="ph codeph">mapper(mapid)</samp> is not supported. </li> <li class="li">The <samp class="ph codeph">defaultmap</samp> clause is supported using OpenMP 5.0 semantics. </li> </ul> </div> <p class="p"><strong class="ph b">2.20 Nesting of Regions</strong></p> <p class="p"> For constructs supported in this subset, restrictions on nesting of regions are observed. Additionally, nested parallel regions on CPU are not supported, and nested teams or parallel regions in a target region are not supported. 
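</p> <p class="p"> The following sketch combines the <samp class="ph codeph">reduction</samp> and <samp class="ph codeph">map</samp> support from 2.19.5 and 2.19.7 in one combined construct, which also avoids any nested teams or parallel regions inside the target region. The function name <samp class="ph codeph">dot</samp> is hypothetical; the reduction is written without the optional modifier and the maps without a mapper, since those are listed as unsupported. </p><pre xml:space="preserve">
/* Hypothetical example: dot product of a and b over n elements. */
double dot(const double *a, const double *b, int n)
{
  double sum = 0.0;
  /* Single combined construct: one teams level and one parallel level. */
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n], b[0:n]) map(tofrom: sum) reduction(+: sum)
  for (int i = 0; i &lt; n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
</pre><p class="p"> Mapping <samp class="ph codeph">sum</samp> explicitly with <samp class="ph codeph">tofrom</samp> is a deliberate choice in this sketch so that the reduced value is copied back to the host regardless of the default mapping rules in effect.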
</p> <p class="p"><strong class="ph b">Runtime Library Routines</strong></p> <p class="p"><strong class="ph b">3.2 Execution Environment Routines</strong></p> <p class="p"> The following execution environment runtime API routines are supported. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_set_num_threads</samp>, <samp class="ph codeph">omp_get_num_threads</samp>, <samp class="ph codeph">omp_get_max_threads</samp>, <samp class="ph codeph">omp_get_thread_num</samp>, <samp class="ph codeph">omp_get_thread_limit</samp>, <samp class="ph codeph">omp_get_supported_active_levels</samp>, <samp class="ph codeph">omp_set_max_active_levels</samp>, <samp class="ph codeph">omp_get_max_active_levels</samp>, <samp class="ph codeph">omp_get_level</samp>, <samp class="ph codeph">omp_get_ancestor_thread_num</samp>, <samp class="ph codeph">omp_get_team_size</samp>, <samp class="ph codeph">omp_get_num_teams</samp>, <samp class="ph codeph">omp_get_team_num</samp>, <samp class="ph codeph">omp_is_initial_device</samp></li> </ul> </div> <p class="p"> The following execution environment runtime API routines are supported only on the CPU. 
</p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_get_num_procs</samp>, <samp class="ph codeph">omp_set_dynamic</samp>, <samp class="ph codeph">omp_get_dynamic</samp>, <samp class="ph codeph">omp_set_schedule</samp>, <samp class="ph codeph">omp_get_schedule</samp>, <samp class="ph codeph">omp_in_final</samp>, <samp class="ph codeph">omp_get_proc_bind</samp>, <samp class="ph codeph">omp_get_num_places</samp>, <samp class="ph codeph">omp_get_affinity_format</samp>, <samp class="ph codeph">omp_set_default_device</samp>, <samp class="ph codeph">omp_get_default_device</samp>, <samp class="ph codeph">omp_get_num_devices</samp>, <samp class="ph codeph">omp_get_device_num</samp>, <samp class="ph codeph">omp_get_initial_device</samp></li> </ul> </div> <p class="p"> The following execution environment runtime API routines have limited support. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_get_cancellation</samp>, <samp class="ph codeph">omp_get_nested</samp>; supported only on the CPU; the value returned is always false </li> <li class="li"><samp class="ph codeph">omp_display_affinity</samp>, <samp class="ph codeph">omp_capture_affinity</samp>; supported only on the CPU; the format specifier is ignored </li> <li class="li"><samp class="ph codeph">omp_set_nested</samp>; supported only on the CPU, the value is ignored </li> </ul> </div> <p class="p"> The following execution environment runtime API routines are not supported. 
</p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_get_place_num_procs</samp>, <samp class="ph codeph">omp_get_place_proc_ids</samp>, <samp class="ph codeph">omp_get_place_num</samp>, <samp class="ph codeph">omp_get_partition_num_places</samp>, <samp class="ph codeph">omp_get_partition_place_nums</samp>, <samp class="ph codeph">omp_set_affinity_format</samp>, <samp class="ph codeph">omp_get_max_task_priority</samp>, <samp class="ph codeph">omp_pause_resource</samp>, <samp class="ph codeph">omp_pause_resource_all</samp></li> </ul> </div> <p class="p"><strong class="ph b">3.3 Lock Routines</strong></p> <p class="p"> Lock runtime API routines are not supported on the GPU. The following lock runtime API routines are supported on the CPU. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_init_lock</samp>, <samp class="ph codeph">omp_init_nest_lock</samp>, <samp class="ph codeph">omp_destroy_lock</samp>, <samp class="ph codeph">omp_destroy_nest_lock</samp>, <samp class="ph codeph">omp_set_lock</samp>, <samp class="ph codeph">omp_set_nest_lock</samp>, <samp class="ph codeph">omp_unset_lock</samp>, <samp class="ph codeph">omp_unset_nest_lock</samp>, <samp class="ph codeph">omp_test_lock</samp>, <samp class="ph codeph">omp_test_nest_lock</samp></li> </ul> </div> <p class="p"> The following lock runtime API routines are not supported. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_init_lock_with_hint</samp>, <samp class="ph codeph">omp_init_nest_lock_with_hint</samp></li> </ul> </div> <p class="p"><strong class="ph b">3.4 Timing Routines</strong></p> <p class="p"> The following timing runtime API routines are supported. 
</p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_get_wtime</samp>, <samp class="ph codeph">omp_get_wtick</samp></li> </ul> </div> <p class="p"><strong class="ph b">3.6 Device Memory Routines</strong></p> <p class="p"> The following device memory routines are supported only on the CPU. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_target_is_present</samp>, <samp class="ph codeph">omp_target_associate_ptr</samp>, <samp class="ph codeph">omp_target_disassociate_ptr</samp></li> <li class="li"><samp class="ph codeph">omp_target_memcpy</samp> and <samp class="ph codeph">omp_target_memcpy_rect</samp> are only supported when copying to and from the same device. </li> </ul> </div> <p class="p"> The following device memory routines are supported on the CPU; we extend OpenMP to support these in target regions on a GPU, but only allocation and deallocation on the same device is supported. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_target_alloc</samp>, <samp class="ph codeph">omp_target_free</samp></li> </ul> </div> <p class="p"><strong class="ph b">3.7 Memory Management Routines</strong></p> <p class="p"> The following memory management routines are supported. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_alloc</samp>, <samp class="ph codeph">omp_free</samp></li> </ul> </div> <p class="p"> The following memory management routines are not supported. </p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">omp_init_allocator</samp>, <samp class="ph codeph">omp_destroy_allocator</samp>, <samp class="ph codeph">omp_set_default_allocator</samp>, <samp class="ph codeph">omp_get_default_allocator</samp></li> </ul> </div> <p class="p"><strong class="ph b">6 Environment Variables</strong></p> <p class="p"> The following environment variables have limited support. 
</p> <div class="p"> <ul class="ul"> <li class="li"><samp class="ph codeph">OMP_SCHEDULE</samp>, <samp class="ph codeph">OMP_NUM_THREADS</samp>, <samp class="ph codeph">OMP_NUM_TEAMS</samp>, <samp class="ph codeph">OMP_DYNAMIC</samp>, <samp class="ph codeph">OMP_PROC_BIND</samp>, <samp class="ph codeph">OMP_PLACES</samp>, <samp class="ph codeph">OMP_STACKSIZE</samp>, <samp class="ph codeph">OMP_WAIT_POLICY</samp>, <samp class="ph codeph">OMP_MAX_ACTIVE_LEVELS</samp>, <samp class="ph codeph">OMP_NESTED</samp>, <samp class="ph codeph">OMP_THREAD_LIMIT</samp>, <samp class="ph codeph">OMP_TEAMS_THREAD_LIMIT</samp>, <samp class="ph codeph">OMP_DISPLAY_ENV</samp>, <samp class="ph codeph">OMP_DISPLAY_AFFINITY</samp>, <samp class="ph codeph">OMP_DEFAULT_DEVICE</samp>, and <samp class="ph codeph">OMP_TARGET_OFFLOAD</samp> are supported on CPU. </li> <li class="li"><samp class="ph codeph">OMP_CANCELLATION</samp> and <samp class="ph codeph">OMP_MAX_TASK_PRIORITY</samp> are ignored. </li> <li class="li"><samp class="ph codeph">OMP_AFFINITY_FORMAT</samp>, <samp class="ph codeph">OMP_TOOL</samp>, <samp class="ph codeph">OMP_TOOL_LIBRARIES</samp>, <samp class="ph codeph">OMP_DEBUG</samp>, and <samp class="ph codeph">OMP_ALLOCATOR</samp> are not supported </li> </ul> </div> </div> </div> </div> <div class="topic concept nested1" id="openmp-metadirective"><a name="openmp-metadirective" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-metadirective" name="openmp-metadirective" shape="rect">7.5. Using metadirective</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> This section contains limitations affecting <samp class="ph codeph">metadirective</samp> along with a few guidelines for its use. </p> </div> <div class="section"> <p class="p"> The Fortran compiler does not support variants leading to an OpenMP directive for which a corresponding <samp class="ph codeph">end</samp> directive is required. 
</p> </div> <div class="section"> <p class="p"> Nesting <samp class="ph codeph">user</samp> conditions, while legal, may create situations that the HPC Compilers do not handle gracefully. To avoid potential problems, use <samp class="ph codeph">device</samp> traits inside <samp class="ph codeph">user</samp> conditions instead. The following example illustrates this best practice. </p> <p class="p"> Avoid nesting dynamic <samp class="ph codeph">user</samp> conditions like this: </p><pre xml:space="preserve">
#pragma omp metadirective \
  when( user={condition(use_offload)} : target teams distribute) \
  default( parallel for schedule(static) )
for (i = 0; i &lt; N; i++) {
  ...
  #pragma omp metadirective \
    when( user={condition(use_offload)} : parallel for)
  for (j = 0; j &lt; N; j++) {
    ...
  }
  ...
}
</pre><p class="p"> Instead, use <samp class="ph codeph">target_device</samp> and <samp class="ph codeph">device</samp> traits within dynamic <samp class="ph codeph">user</samp> conditions like this: </p><pre xml:space="preserve">
#pragma omp metadirective \
  when( target_device={kind(gpu)}, user={condition(use_offload)} : target teams distribute) \
  default( parallel for schedule(static) )
for (i = 0; i &lt; N; i++) {
  ...
  #pragma omp metadirective \
    when( device={kind(gpu)} : parallel for)
  for (j = 0; j &lt; N; j++) {
    ...
  }
  ...
}
</pre></div> <div class="section"> <p class="p"> The HPC compilers do not support nesting <samp class="ph codeph">metadirective</samp> inside a <samp class="ph codeph">target</samp> construct applying to a syntactic block leading to a <samp class="ph codeph">teams</samp> variant. 
Some examples: </p> <p class="p"> The compilers will emit an error given the following code: </p><pre xml:space="preserve">
#pragma omp target map(to:v1,v2) map(from:v3)
{
  #pragma omp metadirective \
    when( device={arch("nvptx")} : teams distribute parallel for) \
    default( parallel for)
  for (int i = 0; i &lt; N; i++) {
    v3[i] = v1[i] * v2[i];
  }
}
</pre><p class="p"> The compilers will always match <samp class="ph codeph">device={arch("nvptx")}</samp> given the following code: </p><pre xml:space="preserve">
#pragma omp target map(to:v1,v2) map(from:v3)
#pragma omp metadirective \
  when( device={arch("nvptx")} : teams distribute parallel for) \
  default( parallel for)
for (int i = 0; i &lt; N; i++) {
  v3[i] = v1[i] * v2[i];
}
</pre><p class="p"> The compilers match <samp class="ph codeph">device={arch("nvptx")}</samp> for GPU code, and <samp class="ph codeph">default</samp> for host fallback, given the following code: </p><pre xml:space="preserve">
#pragma omp target teams distribute map(to:v1,v2) map(from:v3)
for (...)
{
  #pragma omp metadirective \
    when( device={arch("nvptx")} : parallel for) \
    default( simd )
  for (int i = 0; i &lt; N; i++) {
    v3[i] = v1[i] * v2[i];
  }
}
</pre></div> </div> </div> <div class="topic concept nested1" id="openmp-cuda-streams"><a name="openmp-cuda-streams" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-cuda-streams" name="openmp-cuda-streams" shape="rect">7.6. Mapping target constructs to CUDA streams</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> An OpenMP target task generating construct is executed on the GPU in a CUDA stream. 
The following are target task generating constructs: </p> <ul class="ul"> <li class="li"><samp class="ph codeph">target enter data</samp></li> <li class="li"><samp class="ph codeph">target exit data</samp></li> <li class="li"><samp class="ph codeph">target update</samp></li> <li class="li"><samp class="ph codeph">target</samp></li> </ul> <p class="p"> This section explains how these target constructs are mapped to CUDA streams. The relationship with the OpenACC queues is also explained below. </p> <p class="p"> Keep in mind that the <samp class="ph codeph">target data</samp> construct does not generate a task and is not necessarily executed in a CUDA stream. It also cannot have the <samp class="ph codeph">depend</samp> and <samp class="ph codeph">nowait</samp> clauses, so its behavior cannot be directly controlled by the user application. The rest of this section does not cover the behavior of the <samp class="ph codeph">target data</samp> construct. </p> <p class="p"> Any task-generating target construct can have <samp class="ph codeph">depend</samp> and <samp class="ph codeph">nowait</samp> clauses. The NVIDIA OpenMP Runtime takes these clauses as guidance for how to map the construct to a specific CUDA stream. Below is a breakdown of how the clauses affect the mapping decisions. </p> </div> <div class="section"> <h3 class="title sectiontitle">'target' without 'depend', without 'nowait'</h3> <p class="p"> For these constructs, the per-thread default CUDA stream is normally used. The stream is unique for each host thread, so target regions created by different host threads will execute independently in different streams according to the CUDA rules described in <a class="xref" href="https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html" target="_blank" shape="rect">CUDA Runtime API</a>; see the rules in the "Per-thread default stream" section. 
</p> <p class="p"> The OpenACC queue <samp class="ph codeph">acc_async_sync</samp> is initially associated with the same per-thread default CUDA stream. The user can change the association by calling <samp class="ph codeph">acc_set_cuda_stream(acc_async_sync, stream)</samp>. This also changes the stream used for <samp class="ph codeph">target</samp> without <samp class="ph codeph">nowait</samp>. </p> <p class="p"> The CUDA stream handle can be directly obtained via the <samp class="ph codeph">ompx_get_cuda_stream(int device, int nowait)</samp> function, with the <samp class="ph codeph">nowait</samp> parameter set to 0. The per-thread default stream can be obtained with the CUDA handle <samp class="ph codeph">CU_STREAM_PER_THREAD</samp> or <samp class="ph codeph">cudaStreamPerThread</samp>. </p> <p class="p"> Here is an example of how a custom CUDA stream can be used to substitute the default stream: </p><pre xml:space="preserve">
extern __global__ void kernel(int *data);
CUstream stream;
cuStreamCreate(&amp;stream, CU_STREAM_DEFAULT);
acc_set_cuda_stream(acc_async_sync, stream);
#pragma omp target enter data map(to:data[:N])
#pragma omp target data use_device_ptr(data)
kernel&lt;&lt;&lt;N/32, 32, 0, stream&gt;&gt;&gt;(data);
#pragma omp target teams distribute parallel for
for (int i = 0; i &lt; N; i++) {
  data[i]++;
}
#pragma omp target exit data map(from:data[:N])
</pre><p class="p"> Note that there is no explicit stream synchronization after the CUDA <samp class="ph codeph">kernel</samp> is launched. The stream is synchronized automatically at the <samp class="ph codeph">target</samp> constructs that follow. </p> </div> <div class="section"> <h3 class="title sectiontitle">'target' with 'depend', without 'nowait'</h3> <p class="p"> For this construct, the runtime will block the current thread until all dependencies listed in the <samp class="ph codeph">depend</samp> clause are resolved. 
Then, the <samp class="ph codeph">target</samp> construct will be executed in the default per-thread CUDA stream as described in the previous section (that is, as if there were no <samp class="ph codeph">depend</samp> clause). </p> </div> <div class="section"> <h3 class="title sectiontitle">'target' with 'nowait', without 'depend'</h3> <p class="p"> By default, the runtime will select a CUDA stream for each new <samp class="ph codeph">target nowait</samp> construct. The selected stream may be the same one that was used for a prior <samp class="ph codeph">target nowait</samp> construct. That is, there is no guarantee of uniqueness of the selected stream. </p> <p class="p"> This is different from the OpenACC model, which uses the same CUDA stream associated with the <samp class="ph codeph">acc_async_noval</samp> queue for any asynchronous construct with the <samp class="ph codeph">async</samp> clause without an argument. To change this behavior, the user can call the <samp class="ph codeph">ompx_set_cuda_stream_auto(int enable)</samp> function with the <samp class="ph codeph">enable</samp> parameter set to 0. In this case, the CUDA stream associated with the <samp class="ph codeph">acc_async_noval</samp> OpenACC queue will be used for all OpenMP <samp class="ph codeph">target nowait</samp> constructs. Another way to enable this behavior is to set the environment variable <samp class="ph codeph">NVCOMPILER_OMP_AUTO_STREAMS</samp> to <samp class="ph codeph">FALSE</samp>. </p> <p class="p"> To access the stream used for the next <samp class="ph codeph">target nowait</samp> construct, the user can call the <samp class="ph codeph">ompx_get_cuda_stream(int device, int nowait)</samp> function, with the <samp class="ph codeph">nowait</samp> parameter set to 1. 
</p> </div> <div class="section"> <h3 class="title sectiontitle">'target' with both 'depend' and 'nowait'</h3> <p class="p"> The decision on which CUDA stream to use in this case relies on previously scheduled target and host tasks sharing a subset of the dependencies listed in the <samp class="ph codeph">depend</samp> clause: </p> <ul class="ul"> <li class="li"> If the target construct has only one dependency, which is of the type <samp class="ph codeph">inout</samp> or <samp class="ph codeph">out</samp>, and that dependency maps to a previously scheduled <samp class="ph codeph">target depend(...) nowait</samp> construct, and the same device is used for both target constructs, then the CUDA stream on which the previous target task was scheduled is used. </li> <li class="li"> Otherwise, a CUDA stream will be selected for this target construct according to the stream selection policy. </li> </ul> <p class="p"> Note that target constructs with a single <samp class="ph codeph">in</samp> dependency can be scheduled on a newly selected CUDA stream. This is to allow parallel execution of multiple <samp class="ph codeph">target nowait</samp> constructs that depend on data produced by another previously scheduled <samp class="ph codeph">target nowait</samp> construct. 
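</p> <p class="p"> The fan-out pattern described above can be sketched as one producer region followed by two consumer regions that each carry a single <samp class="ph codeph">in</samp> dependency. The function name <samp class="ph codeph">fan_out</samp> and its parameters are hypothetical, and whether the two consumers actually overlap depends on the streams the runtime selects. </p><pre xml:space="preserve">
/* Hypothetical example: one producer region, two independent consumers. */
void fan_out(int *src, int *dst1, int *dst2, int n)
{
  /* Producer: fills src; the out dependence orders later readers after it. */
  #pragma omp target teams distribute parallel for \
      map(from: src[0:n]) depend(out: src[0]) nowait
  for (int i = 0; i &lt; n; i++) {
    src[i] = i;
  }

  /* First consumer: single in dependency, so it may get its own stream. */
  #pragma omp target teams distribute parallel for \
      map(to: src[0:n]) map(from: dst1[0:n]) depend(in: src[0]) nowait
  for (int i = 0; i &lt; n; i++) {
    dst1[i] = src[i] + 1;
  }

  /* Second consumer: independent of the first, ordered only after the producer. */
  #pragma omp target teams distribute parallel for \
      map(to: src[0:n]) map(from: dst2[0:n]) depend(in: src[0]) nowait
  for (int i = 0; i &lt; n; i++) {
    dst2[i] = src[i] * 2;
  }

  /* Wait for all deferred target tasks before the host reads the results. */
  #pragma omp taskwait
}
</pre><p class="p"> The final <samp class="ph codeph">taskwait</samp> is the synchronizing operation in this sketch; without it the host could read <samp class="ph codeph">dst1</samp> and <samp class="ph codeph">dst2</samp> before the transfers complete.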
</p> <p class="p"> Here is a simplified example of how a <samp class="ph codeph">target</samp> construct, a CUDA library function and a CUDA kernel can be executed on the GPU in the same stream asynchronously with respect to the host thread: </p><pre xml:space="preserve">
extern __global__ void kernel(int *data);
cudaStream_t stream = (cudaStream_t)ompx_get_cuda_stream(omp_get_default_device(), 1);
cufftSetStream(cufft_plan, stream);
#pragma omp target enter data map(to:data[:N]) depend(inout:stream) nowait
#pragma omp target data use_device_ptr(data)
{
  kernel&lt;&lt;&lt;N/32, 32, 0, stream&gt;&gt;&gt;(data);
  cufftExecC2C(cufft_plan, data, data, CUFFT_FORWARD);
}
#pragma omp target teams distribute parallel for depend(inout:stream) nowait
for (int i = 0; i &lt; N; i++) {
  data[i]++;
}
#pragma omp target exit data map(from:data[:N]) depend(inout:stream) nowait
</pre><p class="p"> Note that the <samp class="ph codeph">stream</samp> variable holds the CUDA stream handle and also serves as the dependency for the <samp class="ph codeph">target</samp> constructs. This dependency enforces the order of execution and also guarantees the target constructs are on the same stream that was returned from the <samp class="ph codeph">ompx_get_cuda_stream</samp> function call. </p> </div> <div class="section"> <h3 class="title sectiontitle">NVIDIA OpenMP API to access and control CUDA streams</h3> <p class="p"> The NVIDIA OpenMP Runtime provides the following API to access CUDA streams and to control their use. 
</p><pre xml:space="preserve"> void *ompx_get_cuda_stream(int device, int nowait); </pre><p class="p"> This function returns the handle of the CUDA stream that will be used for the next <samp class="ph codeph">target</samp> construct: </p> <ul class="ul"> <li class="li"> If the <samp class="ph codeph">nowait</samp> parameter is set to 0, it returns the CUDA stream associated with the OpenACC queue <samp class="ph codeph">acc_async_sync</samp>, which is initially mapped to the default per-thread CUDA stream; </li> <li class="li"> Otherwise, it returns a CUDA stream which will be used for the next <samp class="ph codeph">target nowait</samp> construct that cannot be mapped to an existing stream according to the rules for the <samp class="ph codeph">depend</samp> clause. </li> </ul><pre xml:space="preserve"> void ompx_set_cuda_stream_auto(int enable); </pre><p class="p"> This function sets the policy for how CUDA streams are selected for <samp class="ph codeph">target nowait</samp> constructs: </p> <ul class="ul"> <li class="li"> If the <samp class="ph codeph">enable</samp> parameter is set to a non-zero value, an internally selected CUDA stream will be used for each <samp class="ph codeph">target nowait</samp> construct that follows. This is the default behavior; </li> <li class="li"> Otherwise, the CUDA stream associated with the OpenACC queue <samp class="ph codeph">acc_async_noval</samp> will be used for all <samp class="ph codeph">target nowait</samp> constructs that follow. This becomes the default behavior if the environment variable <samp class="ph codeph">NVCOMPILER_OMP_AUTO_STREAMS</samp> is set to <samp class="ph codeph">FALSE</samp>. </li> </ul> <p class="p"> The setting is done only for the host thread which calls this function. 
</p> </div> </div> </div> <div class="topic concept nested1" id="openmp-noncontiguous-arrays"><a name="openmp-noncontiguous-arrays" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-noncontiguous-arrays" name="openmp-noncontiguous-arrays" shape="rect">7.7. Noncontiguous Array Sections</a></h3> <div class="body conbody"> <p class="p"> Array sections can be used in <samp class="ph codeph">to</samp> and <samp class="ph codeph">from</samp> clauses, including noncontiguous array sections. The noncontiguous array section must be specified in a single <samp class="ph codeph">map</samp> clause; it cannot be split between multiple directives. Although this feature may become a part of a future OpenMP specification, at this time it is an NVIDIA HPC compilers extension. </p> </div> </div> <div class="topic concept nested1" id="openmp-unified-mem"><a name="openmp-unified-mem" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-unified-mem" name="openmp-unified-mem" shape="rect">7.8. OpenMP with CUDA Unified Memory</a></h3> <div class="body conbody"> <p class="p"> This section focuses on OpenMP unified shared memory programming and assumes familiarity with the Separate, Managed, and Unified Memory Modes explained in the <a class="xref" href="index.html#acc-mem-model" shape="rect">Memory Model</a> and <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a> sections. OpenMP unified shared memory corresponds to Unified Memory Mode in the NVHPC Compilers and can be enabled with the <samp class="ph codeph">-⁠gpu=mem:unified</samp> flag. Source code containing the <samp class="ph codeph">requires unified_shared_memory</samp> directive is accepted, but the <samp class="ph codeph">-⁠gpu=mem:unified</samp> flag is still required to activate Unified Memory Mode. </p> <p class="p"> In Unified Memory Mode, <samp class="ph codeph">map</samp> clauses on <samp class="ph codeph">target</samp> constructs are optional. 
Additionally, <samp class="ph codeph">declare target</samp> directives are optional for variables with static storage duration accessed inside functions to which such a directive is applied. OpenMP unified shared memory eases accelerator programming on GPUs by removing the need for explicit data management, so that only the parallelism in the compute regions has to be expressed. </p> <p class="p"> In Unified Memory Mode, all data is managed by the CUDA runtime. Explicit data <samp class="ph codeph">map</samp> clauses which manage the data movement across the host and devices become optional. All variables are accessible from the OpenMP offload compute regions executing on the GPU. A <samp class="ph codeph">map</samp> clause of type <samp class="ph codeph">alloc</samp>, <samp class="ph codeph">to</samp>, <samp class="ph codeph">from</samp>, or <samp class="ph codeph">tofrom</samp> does not result in any device allocation or data transfer. The OpenMP runtime, however, may leverage such clauses to communicate preferable data placement to the CUDA runtime by means of memory hint APIs as elaborated in the following blog post on the NVIDIA website: <a class="xref" href="https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management" target="_blank" shape="rect">Simplifying GPU Application Development with Heterogeneous Memory Management</a>. Device memory can be allocated or deallocated in OpenMP programs in Unified Memory Mode by using the <samp class="ph codeph">omp_target_alloc</samp> and <samp class="ph codeph">omp_target_free</samp> API calls. Note that memory allocated through <samp class="ph codeph">omp_target_alloc</samp> cannot be accessed by the host. 
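</p> <p class="p"> To illustrate that <samp class="ph codeph">map</samp> clauses become optional, here is a minimal C++ sketch of a target loop that updates heap data without any data clauses. The helper name <samp class="ph codeph">scale_in_place</samp> is an assumption made for this illustration, and the code relies on being built for Unified Memory Mode when it is offloaded to the GPU. </p>

```cpp
#include <vector>

// Hypothetical helper: scales every element of v in place inside a target region.
// No map clause is given. In Unified Memory Mode the heap storage owned by the
// vector is assumed to be directly accessible from the GPU.
void scale_in_place(std::vector<double> &v, double factor) {
    double *p = v.data();
    const long n = static_cast<long>(v.size());

    #pragma omp target teams distribute parallel for
    for (long i = 0; i < n; i++) {
        p[i] *= factor;
    }
}
```

<p class="p"> Such a file would be compiled with <samp class="ph codeph">nvc++ -mp=gpu -gpu=mem:unified</samp>. In Separate Memory Mode the same loop would need an explicit clause such as <samp class="ph codeph">map(tofrom: p[0:n])</samp> to be correct.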
</p> <div class="section"> <h3 class="title sectiontitle">Understanding Data Movement</h3> <p class="p"> When the compiler encounters a compute construct without visible <samp class="ph codeph">target data</samp> directives or <samp class="ph codeph">map</samp> clauses, it attempts to determine what data is required for correct execution of the region on the GPU. When the compiler is unable to determine the size and shape of data needing to be accessible on the device, it behaves as follows: </p> <ul class="ul"> <li class="li"> In Separate Memory Mode, the compiler may not be able to alert you to the need for an explicit data clause specifying size and/or shape of data being copied to/from the GPU. In this case, the default length of one may be used. This may cause illegal memory access errors at runtime on the GPU devices. </li> <li class="li"> In Managed Memory Mode (<samp class="ph codeph">-⁠gpu=mem:managed</samp>), the compiler assumes the data is allocated in managed memory and thus is accessible from the device; if this assumption is wrong, for example, if the data was defined globally or is located on the CPU stack, the program may fail at runtime. </li> <li class="li"> In Unified Memory Mode (<samp class="ph codeph">-⁠gpu=mem:unified</samp>), all data is accessible from the device making information about size and shape unnecessary. 
</li> </ul> <p class="p"> Take the following example in C: </p><pre xml:space="preserve">
#pragma omp declare target
void set(int* ptr, int i, int j, int dim){
  int idx = i * dim + j;
  ptr[idx] = someval(i, j);
}
#pragma omp end declare target

void fill2d(int* ptr, int dim){
#pragma omp target teams distribute parallel for
  for (int i = 0; i &lt; dim; i++)
    for (int j = 0; j &lt; dim; j++)
      set(ptr, i, j, dim);
}
</pre><p class="p"> In Separate Memory Mode, the only way to guarantee correctness for this example is to specify an array section in the <samp class="ph codeph">target</samp> construct as follows: </p><pre xml:space="preserve">
#pragma omp target teams distribute parallel for map(from: ptr[0:dim*dim])
</pre><p class="p"> This change tells the OpenMP implementation exactly which array section is accessed inside the target loop. </p> <p class="p"> In Unified Memory Mode, the <samp class="ph codeph">map</samp> clause is not required. </p> <p class="p"> The next example, in Fortran, illustrates how a global variable can be accessed in an OpenMP routine without requiring any explicit annotation. </p><pre xml:space="preserve">
module m
  integer :: globmin = 1234
contains
  subroutine findmin(a)
  !$omp declare target
    integer, intent(in) :: a(:)
    integer :: i
    do i = 1, size(a)
      if (a(i) .lt. globmin) then
        globmin = a(i)
      endif
    end do
  end subroutine
end module m
</pre><p class="p"> Compile the example above for Unified Memory Mode: </p><pre xml:space="preserve">
nvfortran -mp=gpu -gpu=mem:unified example.f90
</pre><p class="p"> The source does not need any OpenMP directives to access module variable <samp class="ph codeph">globmin</samp>, to either read or update its value, in the routine invoked from CPU and GPU. Moreover, any access to <samp class="ph codeph">globmin</samp> will be made to the same exact instance of the variable from CPU and GPU; its value is synchronized automatically. 
In Separate or Managed Memory Modes, such behavior can only be achieved with a combination of OpenMP <samp class="ph codeph">declare target</samp> and <samp class="ph codeph">target update</samp> directives in the source code. </p> <p class="p"> Migrating existing OpenMP applications written for Separate Memory Mode should, in most cases, be a seamless process requiring no source changes. Some data access patterns, however, may produce different results when the application runs in Unified Memory Mode. Applications which rely on having separate data copies in GPU memory to conduct temporary computations on the GPU (without maintaining data synchronization with the CPU) pose a challenge for migration to unified memory. For the following Fortran example, the value of variable <samp class="ph codeph">c</samp> after the last loop will differ depending on whether the example is compiled with or without <samp class="ph codeph">-⁠gpu=mem:unified</samp>. </p><pre xml:space="preserve">
b(:) = ...
c = 0
!$omp target data map(to: b) map(from: a)
!$omp target teams distribute parallel do
do i = 1, N
  b(i) = b(i) * i
end do
!$omp target teams distribute parallel do
do i = 1, N
  a(i) = b(i) + i
end do
!$omp end target data
do i = 1, N
  c = c + a(i) + b(i)
end do
</pre><p class="p"> Without Unified Memory, array <samp class="ph codeph">b</samp> is copied into the GPU memory at the beginning of the OpenMP <samp class="ph codeph">target data</samp> region. It is then updated in the GPU memory and used to compute elements of array <samp class="ph codeph">a</samp>. As instructed by the data clause <samp class="ph codeph">map(to:b)</samp>, <samp class="ph codeph">b</samp> is not copied back to the CPU memory at the end of the <samp class="ph codeph">target data</samp> region and therefore its initial value is used in the computation of <samp class="ph codeph">c</samp>. 
With <samp class="ph codeph">-⁠mp=gpu -⁠gpu=mem:unified</samp>, the updated value of <samp class="ph codeph">b</samp> in the first loop is automatically visible in the last loop leading to a different value of <samp class="ph codeph">c</samp> at its end. </p> <p class="p"> Additional complications may arise from the asynchronous execution as the use of unified shared memory may require extra synchronizations to avoid data races. </p> </div> </div> </div> <div class="topic concept nested1" id="openmp-multi-gpus"><a name="openmp-multi-gpus" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-multi-gpus" name="openmp-multi-gpus" shape="rect">7.9. Multiple Device Support</a></h3> <div class="body conbody"> <p class="p"> A program can use multiple devices on a single node. </p> <p class="p"> This functionality is supported using the <samp class="ph codeph">omp_set_default_device</samp> API call and the <samp class="ph codeph">device()</samp> clause on the <samp class="ph codeph">target</samp> constructs. Our experience is that most programs use MPI parallelism with each MPI rank selecting a single GPU to which to offload. Some programs assign multiple MPI ranks to each GPU, in order to keep the GPU fully occupied, though the fixed memory size of the GPU limits how effective this strategy can be. Similarly, other programs use OpenMP thread parallelism on the CPU, with each thread selecting a single GPU to which to offload. </p> </div> </div> <div class="topic concept nested1" id="openmp-interop-cuda"><a name="openmp-interop-cuda" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-interop-cuda" name="openmp-interop-cuda" shape="rect">7.10. Interoperability with CUDA</a></h3> <div class="body conbody"> <p class="p"> The HPC Compilers support interoperability of OpenMP and CUDA to the same extent they support CUDA interoperability with OpenACC. 
</p> <p class="p"> If OpenMP and CUDA code coexist in the same program, the OpenMP runtime and the CUDA runtime use the same CUDA context on each GPU. To enable this coexistence, use the compilation and linking option <samp class="ph codeph">-⁠cuda</samp>. CUDA-allocated data is available for use inside OpenMP target regions with the OpenMP analog <samp class="ph codeph">is_device_ptr</samp> to OpenACC's <samp class="ph codeph">deviceptr()</samp> clause. </p> <p class="p"> OpenMP-allocated data is available for use inside CUDA kernels directly if the data was allocated with the <samp class="ph codeph">omp_target_alloc()</samp> API call; if the OpenMP data was created with a <samp class="ph codeph">target data map</samp> clause, it can be made available for use inside CUDA kernels using the <samp class="ph codeph">target data use_device_addr()</samp> clause. Calling a CUDA device function inside an OpenMP target region is supported, as long as the CUDA function is a scalar function, that is, does not use CUDA shared memory or any inter-thread synchronization. Calling an OpenMP <samp class="ph codeph">declare target</samp> function inside a CUDA kernel is supported as long as the <samp class="ph codeph">declare target</samp> function has no OpenMP constructs or API calls. </p> </div> </div> <div class="topic concept nested1" id="openmp-interop-openmp"><a name="openmp-interop-openmp" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-interop-openmp" name="openmp-interop-openmp" shape="rect">7.11. Interoperability with Other OpenMP Compilers</a></h3> <div class="body conbody"> <p class="p"> OpenMP CPU-parallel object files compiled with NVIDIA's HPC compilers are interoperable with OpenMP CPU-parallel object files compiled by other compilers using the KMPC OpenMP runtime interface. Compilers supporting KMPC OpenMP include Intel and CLANG. 
The HPC compilers support a GNU OpenMP interface layer as well which provides OpenMP CPU-parallel interoperability with the GNU compilers. </p> <p class="p"> For OpenMP GPU computation, there is no similar formal or informal standard library interface for launching GPU compute constructs or managing GPU memory. There is also no standard way to manage the device context in such a way as to interoperate between multiple offload libraries. The HPC compilers therefore do not support interoperability of device compute offload operations and similar operations generated with another compiler. </p> </div> </div> <div class="topic concept nested1" id="openmp-gnu"><a name="openmp-gnu" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#openmp-gnu" name="openmp-gnu" shape="rect">7.12. GNU STL</a></h3> <div class="body conbody"> <p class="p"> When using nvc⁠+⁠+ on Linux, the GNU STL is thread-safe to the extent listed in the GNU documentation as required by the C⁠+⁠+11 standard. If an STL thread-safe issue is suspected, the suspect code can be run sequentially inside of an OpenMP region using <samp class="ph codeph">#pragma omp critical</samp> sections. </p> </div> </div> </div> <div class="topic concept nested0" id="stdpar-use"><a name="stdpar-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#stdpar-use" name="stdpar-use" shape="rect">8. Using Stdpar</a></h2> <div class="body conbody"> <p class="p"> This chapter describes the NVIDIA HPC Compiler support for standard language parallelism, also known as Stdpar: </p> <ul class="ul"> <li class="li"> ISO C⁠+⁠+ standard library parallel algorithms with <samp class="ph codeph">nvc⁠+⁠+</samp></li> <li class="li"> ISO Fortran <samp class="ph codeph">do concurrent</samp> loop construct with <samp class="ph codeph">nvfortran</samp></li> </ul> <p class="p"> Use the <samp class="ph codeph">-⁠stdpar</samp> compiler option to enable parallel execution with standard parallelism. 
The sub-options to <samp class="ph codeph">-⁠stdpar</samp> are the following: </p> <ul class="ul"> <li class="li"><samp class="ph codeph">gpu</samp>: compile for parallel execution on GPU; this sub-option is the default. This feature is supported on the NVIDIA Pascal architecture and newer. </li> <li class="li"><samp class="ph codeph">multicore</samp>: compile for multicore CPU execution. </li> <li class="li"><samp class="ph codeph">gpu,multicore</samp>: compile for parallel execution on GPU and CPU; if execution platform has any GPU, the code will be offloaded to run on GPU. Otherwise, fall back to multicore CPU execution. </li> </ul> <p class="p"> NVC++ supports the additional <samp class="ph codeph">-⁠stdpar</samp> sub-options detailed in <a class="xref" href="index.html#stdpar-cpp-stdpar-option" shape="rect">Enabling Parallel Algorithms with the -⁠stdpar Option</a>. </p> <p class="p"> By default, the compiler auto-detects and generates GPU code for the type of GPU that is installed on the system on which the compiler is running. To generate code for a specific GPU architecture, which may be necessary when the application is compiled and run on different systems, add the <samp class="ph codeph">-⁠gpu=ccXX</samp> command-line option. More details can be found in <a class="xref" href="index.html#compute-cap" shape="rect">Compute Capability</a>. </p> <div class="section" id="stdpar-use__macros"><a name="stdpar-use__macros" shape="rect"> <!-- --></a><h2 class="title sectiontitle">Predefined Macros</h2> <div class="p">The following macros corresponding to the parallel execution target compiled for are added implicitly: <ul class="ul"> <li class="li"><samp class="ph codeph">__NVCOMPILER_STDPAR_GPU</samp> for parallel execution on GPU. </li> <li class="li"><samp class="ph codeph">__NVCOMPILER_STDPAR_MULTICORE</samp> for parallel execution on multicore CPU. 
</li> </ul> </div> </div> </div> <div class="topic concept nested1" id="stdpar-mem-modes"><a name="stdpar-mem-modes" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-mem-modes" name="stdpar-mem-modes" shape="rect">8.1. GPU Memory Modes</a></h3> <div class="body conbody"> <p class="p"> When compiling for GPU execution, Stdpar utilizes <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a> for managing data accessed from the sequential code running on CPU and from the parallel code on GPU. </p> <p class="p"> The compiler detects the memory capability of the system on which the compiler is running and uses that information to enable the correct memory mode as follows: </p> <ul class="ul"> <li class="li"> When compiled on a platform with full CUDA Unified Memory capability, <samp class="ph codeph">-⁠stdpar</samp> implies <samp class="ph codeph">-⁠gpu=mem:unified</samp>. </li> <li class="li"> When compiled on a platform with CUDA Managed Memory capability only, <samp class="ph codeph">-⁠stdpar</samp> implies <samp class="ph codeph">-⁠gpu=mem:managed</samp>. </li> </ul> <p class="p"> To compile code for a specific Memory Mode regardless of the memory capability of the system on which you are compiling, add the desired <samp class="ph codeph">-⁠gpu=mem:unified</samp> or <samp class="ph codeph">-⁠gpu=mem:managed</samp> option. </p> <p class="p"> Stdpar with Separate Memory Mode can only be supported when the data are fully managed through features of other programming models, such as OpenACC. </p> <p class="p"> The restrictions that Managed Memory Mode places on variables used on the GPU in standard language parallel code do not apply in Unified Memory Mode. </p> <p class="p"> If the compiler utilizes CUDA Managed Memory automatically, the interception of deallocations is enabled implicitly at runtime. This prevents data from being deallocated with a mismatched API, which may lead to undefined behavior. 
The interception incurs some runtime overhead and may be unnecessary if allocations and deallocations for all data in the application are performed using the matching APIs. The interception can be disabled using dedicated command-line options detailed in <a class="xref" href="index.html#gpu-mem-intercept" shape="rect">Interception of Deallocations</a>. More details about the memory modes supported by the NVIDIA HPC Compilers and dedicated command-line options can be found in <a class="xref" href="index.html#acc-mem-model" shape="rect">Memory Model</a>. </p> </div> </div> <div class="topic concept nested1" id="stdpar-cpp"><a name="stdpar-cpp" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp" name="stdpar-cpp" shape="rect">8.2. Stdpar C++</a></h3> <div class="body conbody"> <div class="section"> <p class="p"> The NVIDIA HPC C⁠+⁠+ compiler, NVC⁠+⁠+, supports C⁠+⁠+ Standard Language Parallelism (Stdpar) for execution on NVIDIA GPUs and multicore CPUs. As mentioned previously, use the NVC⁠+⁠+ command-line option <samp class="ph codeph">-⁠stdpar</samp> to enable GPU accelerated C⁠+⁠+ Parallel Algorithms. The following sections go into more detail about the NVC⁠+⁠+ support for the ISO C⁠+⁠+ Standard Library Parallel Algorithms. </p> </div> </div> <div class="topic concept nested2" id="stdpar-cpp-intro"><a name="stdpar-cpp-intro" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-intro" name="stdpar-cpp-intro" shape="rect">8.2.1.  Introduction to Stdpar C++</a></h3> <div class="body conbody"> <p class="p"> The C⁠+⁠+17 Standard introduced higher-level parallelism features that allow users to request parallelization of Standard Library algorithms. </p> <p class="p"> This higher-level parallelism is expressed by adding an execution policy as the first parameter to any algorithm that supports execution policies. Most of the existing Standard C++ algorithms were enhanced to support execution policies. 
C⁠+⁠+17 defined several new parallel algorithms, including the useful <samp class="ph codeph"><a class="xref" href="https://en.cppreference.com/w/cpp/algorithm/reduce" target="_blank" shape="rect">std::reduce</a></samp> and <samp class="ph codeph"><a class="xref" href="https://en.cppreference.com/w/cpp/algorithm/transform_reduce" target="_blank" shape="rect">std::transform_reduce</a></samp>. </p> <p class="p"> C⁠+⁠+17 defines three <a class="xref" href="https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag" target="_blank" shape="rect">execution policies</a>: </p> <ul class="ul"> <li class="li"><samp class="ph codeph">std::execution::seq:</samp> Sequential execution. No parallelism is allowed. </li> <li class="li"><samp class="ph codeph">std::execution::par:</samp> Parallel execution on one or more threads. </li> <li class="li"><samp class="ph codeph">std::execution::par_unseq:</samp> Parallel execution on one or more threads, with each thread possibly vectorized. </li> </ul> <p class="p"> When you use an execution policy other than <samp class="ph codeph">std::execution::seq</samp>, you are communicating two important things to the compiler: </p> <ul class="ul"> <li class="li">You prefer but do not require that the algorithm be run in parallel. A conforming C⁠+⁠+17 implementation may ignore the hint and run the algorithm sequentially, but a performance-oriented implementation takes the hint and executes in parallel when possible and prudent. </li> <li class="li">The algorithm is safe to run in parallel. For the <samp class="ph codeph">std::execution::par</samp> and <samp class="ph codeph">std::execution::par_unseq</samp> policies, any user-provided code—such as iterators, lambdas, or function objects passed into the algorithm—must not introduce data races if run concurrently on separate threads. 
For the <samp class="ph codeph">std::execution::par_unseq</samp> policy, any user-provided code must not introduce data races or deadlocks if multiple calls are interleaved on the same thread, which is what happens when a loop is vectorized. For more information about potential deadlocks, see the <a class="xref" href="https://en.cppreference.com/w/cpp/language/memory_model#Progress_guarantee" target="_blank" shape="rect">forward progress guarantees</a> provided by the parallel policies or watch <a class="xref" href="https://www.youtube.com/watch?v=FJIn1YhPJJc" target="_blank" shape="rect">CppCon 2018: Bryce Adelstein Lelbach “The C++ Execution Model”</a>. </li> </ul> <p class="p"> The C⁠+⁠+ Standard grants compilers great freedom to choose if, when, and how to execute algorithms in parallel as long as the forward progress guarantees the user requests are honored. For example, <samp class="ph codeph">std::execution::par_unseq</samp> may be implemented with vectorization and <samp class="ph codeph">std::execution::par</samp> may be implemented with a CPU thread pool. It is also possible to execute parallel algorithms on a GPU, which is a good choice for invocations with sufficient parallelism to take advantage of the processing power and memory bandwidth of NVIDIA GPU processors. </p> </div> </div> <div class="topic concept nested2" id="stdpar-cpp-compiler-support"><a name="stdpar-cpp-compiler-support" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-compiler-support" name="stdpar-cpp-compiler-support" shape="rect">8.2.2.  NVC++ Compiler Parallel Algorithms Support </a></h3> <div class="body conbody"> <p class="p"> NVC⁠+⁠+ supports C⁠+⁠+ Standard Language Parallelism with the parallel execution policies <samp class="ph codeph">std::execution::par</samp> or <samp class="ph codeph">std::execution::par_unseq</samp> for execution on GPUs or multicore CPUs. 
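</p> <p class="p"> As a minimal sketch of an invocation that uses one of these policies, the following function applies <samp class="ph codeph">std::transform</samp> with <samp class="ph codeph">std::execution::par_unseq</samp> and a lambda. The name <samp class="ph codeph">saxpy</samp> and its parameters are assumptions made for this illustration, and the caller is assumed to pass vectors of equal length. </p>

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Hypothetical helper: computes y[i] = a * x[i] + y[i] for every element.
// Assumes y holds at least as many elements as x.
void saxpy(float a, const std::vector<float> &x, std::vector<float> &y) {
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(),  // first input range
                   y.begin(),           // start of the second input range
                   y.begin(),           // output, written in place over y
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

<p class="p"> The source is the same for both targets: building it with <samp class="ph codeph">nvc++ -stdpar=gpu</samp> makes the call eligible for GPU offload, while <samp class="ph codeph">nvc++ -stdpar=multicore</samp> targets CPU threads.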
</p> <p class="p"> Lambdas, including generic lambdas, are fully supported in parallel algorithm invocations. No language extensions or non-standard libraries are required to enable GPU acceleration. All data movement between host memory and GPU device memory is performed implicitly and automatically under the control of <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a>. </p> <p class="p"> It's straightforward to automatically GPU accelerate C⁠+⁠+ Parallel Algorithms with NVC⁠+⁠+. However, there are some restrictions and limitations you need to be aware of as explained below. </p> </div> <div class="topic concept nested3" id="stdpar-cpp-stdpar-option"><a name="stdpar-cpp-stdpar-option" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-stdpar-option" name="stdpar-cpp-stdpar-option" shape="rect">8.2.2.1.  Enabling Parallel Algorithms with the <span class="keyword option">-⁠stdpar</span> Option </a></h3> <div class="body conbody"> <p class="p"> GPU acceleration of C⁠+⁠+ Parallel Algorithms is enabled with the <samp class="ph codeph">-⁠stdpar=gpu</samp> command-line option to NVC++. If <samp class="ph codeph">-⁠stdpar=gpu</samp> is specified (or <samp class="ph codeph">-⁠stdpar</samp> without an argument), almost all algorithms that use a parallel execution policy are compiled for offloading to run in parallel on an NVIDIA GPU: </p><pre xml:space="preserve"> nvc++ -stdpar=gpu program.cpp -o program </pre><pre xml:space="preserve"> nvc++ -stdpar program.cpp -o program </pre><div class="p"> In addition, the GPU acceleration sub-option can be further specialized using <samp class="ph codeph">-⁠stdpar=gpu:acc</samp>. 
This option directs the compiler to use its OpenACC implementation to GPU-accelerate a subset of algorithms with a parallel execution policy: <pre xml:space="preserve"> nvc++ -stdpar=gpu:acc program.cpp -o program </pre> More details about the OpenACC support of Stdpar C++ are provided in <a class="xref" href="index.html#stdpar-cpp-impl-openacc" shape="rect">OpenACC Implementation of Parallel Algorithms</a>. </div> <p class="p"> Acceleration of C⁠+⁠+ Parallel Algorithms with multicore CPUs is enabled with the <span class="keyword option">-⁠stdpar=multicore</span> command-line option to NVC⁠+⁠+. If <span class="keyword option">-⁠stdpar=multicore</span> is specified, almost all algorithms that use a parallel execution policy are compiled to run on a multicore CPU: </p><pre xml:space="preserve"> nvc++ -stdpar=multicore program.cpp -o program </pre><p class="p"> When either the <samp class="ph codeph">-⁠stdpar=gpu,multicore</samp> or the <samp class="ph codeph">-⁠stdpar=gpu:acc,multicore</samp> command-line option is specified to NVC++, the parallel algorithms code is compiled for both GPU and multicore CPU. If the execution platform has a GPU, the binary executes on the GPU; otherwise it executes on the multicore CPU. </p><pre xml:space="preserve"> nvc++ -stdpar=gpu,multicore program.cpp -o program </pre><pre xml:space="preserve"> nvc++ -stdpar=gpu:acc,multicore program.cpp -o program </pre></div> </div> </div> <div class="topic concept nested2" id="stdpar-cpp-example-1"><a name="stdpar-cpp-example-1" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-example-1" name="stdpar-cpp-example-1" shape="rect">8.2.3.  Stdpar C++ Simple Example </a></h3> <div class="body conbody"> <p class="p"> Here are a few simple examples to get a feel for how the C++ Parallel Algorithms work. 
</p> <p class="p"> From the early days of C++, sorting items stored in an appropriate container has been relatively easy using a single call like the following: </p><pre xml:space="preserve">
std::sort(employees.begin(), employees.end(), CompareByLastName());
</pre><p class="p"> Assuming the comparison class <samp class="ph codeph">CompareByLastName</samp> is thread-safe, which is true for most comparison functions, parallelizing this sort is simple with C++ Parallel Algorithms. Include <samp class="ph codeph">&lt;execution&gt;</samp> and add an execution policy to the function call: </p><pre xml:space="preserve">
std::sort(std::execution::par, employees.begin(), employees.end(), CompareByLastName());
</pre><p class="p"> Calculating the sum of all the elements in a container is also simple with the <samp class="ph codeph">std::accumulate</samp> algorithm. Prior to C++17, transforming the data in some way while taking the sum was somewhat awkward. For example, to compute the average age of your employees, you might write the following code: </p><pre xml:space="preserve">
int ave_age =
  std::accumulate(employees.begin(), employees.end(), 0,
                  [](int sum, const Employee&amp; emp){
                    return sum + emp.age();
                  })
  / employees.size();
</pre><p class="p"> The <samp class="ph codeph">std::transform_reduce</samp> algorithm introduced in C++17 makes it simple to parallelize this code. 
It also results in cleaner code by separating the reduction operation, in this case <samp class="ph codeph">std::plus</samp>, from the transformation operation, in this case <samp class="ph codeph">emp.age()</samp>:</p><pre xml:space="preserve">
int ave_age =
  std::transform_reduce(std::execution::par_unseq,
                        employees.begin(), employees.end(),
                        0, std::plus&lt;int&gt;(),
                        [](const Employee&amp; emp){
                          return emp.age();
                        })
  / employees.size();
</pre></div> </div> <div class="topic concept nested2" id="stdpar-cpp-impl-openacc"><a name="stdpar-cpp-impl-openacc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-impl-openacc" name="stdpar-cpp-impl-openacc" shape="rect">8.2.4.  OpenACC Implementation of Parallel Algorithms </a></h3> <div class="body conbody"> <p class="p"> NVC++ has experimental GPU support for a subset of algorithms with the parallel execution policies <samp class="ph codeph">std::execution::par</samp> and <samp class="ph codeph">std::execution::par_unseq</samp>, accelerated through the OpenACC implementation. This feature, enabled with the <samp class="ph codeph">-stdpar=gpu:acc</samp> option, may result in better application performance on the GPU and faster compilation speed. 
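</p> <p class="p"> As a minimal sketch of code this option is aimed at, the two functions below use <samp class="ph codeph">std::for_each</samp> and a <samp class="ph codeph">std::transform_reduce</samp> over a scalar type with the <samp class="ph codeph">std::plus</samp> reduction, both of which appear in the lists of supported algorithms in this section. The function names <samp class="ph codeph">double_all</samp> and <samp class="ph codeph">sum_of_squares</samp> are assumptions made for this illustration. </p>

```cpp
#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: doubles every element in place with std::for_each.
void double_all(std::vector<int> &v) {
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](int &x) { x *= 2; });
}

// Hypothetical helper: sums the squares of all elements.
// The reduction is std::plus over a scalar type (long).
long sum_of_squares(const std::vector<int> &v) {
    return std::transform_reduce(std::execution::par_unseq,
                                 v.begin(), v.end(),
                                 0L, std::plus<long>(),
                                 [](int x) { return static_cast<long>(x) * x; });
}
```

<p class="p"> Such a file would be built with <samp class="ph codeph">nvc++ -stdpar=gpu:acc</samp>. Any other parallel algorithm in the same program would still be handled by the default GPU implementation.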
</p> <p class="p"> The following subset of algorithms has OpenACC implementation support: </p> <ul class="ul"> <li class="li"><samp class="ph codeph">std::for_each</samp></li> <li class="li"><samp class="ph codeph">std::for_each_n</samp></li> <li class="li"><samp class="ph codeph">std::transform</samp></li> </ul> <p class="p"> The following algorithms have OpenACC implementation support for scalar data types and the standard <samp class="ph codeph">std::plus</samp> reduction operation: </p> <ul class="ul"> <li class="li"><samp class="ph codeph">std::reduce</samp></li> <li class="li"><samp class="ph codeph">std::transform_reduce</samp></li> </ul> <p class="p"> The remaining parallel algorithms are parallelized using the default GPU implementation, as if <samp class="ph codeph">-stdpar=gpu</samp> were specified. </p> <p class="p">When the code is compiled for the GPU with OpenACC acceleration, the <samp class="ph codeph">__NVCOMPILER_STDPAR_OPENACC_GPU</samp> macro is defined implicitly. </p> </div> </div> <div class="topic concept nested2" id="stdpar-cpp-guidelines"><a name="stdpar-cpp-guidelines" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-guidelines" name="stdpar-cpp-guidelines" shape="rect">8.2.5.  Coding Guidelines for GPU-accelerating Parallel Algorithms </a></h3> <div class="body conbody"> <p class="p"> GPUs are not simply CPUs with more threads. To effectively take advantage of the massive parallelism and memory bandwidth available on GPUs, it is typical for GPU programming models to put some limitations on code executed on the GPU. The NVC⁠+⁠+ implementation of C⁠+⁠+ Parallel Algorithms is no exception in this regard. The sections that follow detail the limitations that apply in the current release. 
</p> </div> <div class="topic concept nested3" id="stdpar-cpp-device-functions"><a name="stdpar-cpp-device-functions" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-device-functions" name="stdpar-cpp-device-functions" shape="rect">8.2.5.1.  Parallel Algorithms and Device Function Annotations </a></h3> <div class="body conbody"> <p class="p"> Functions to be executed on the GPU within parallel algorithms do not need any <samp class="ph codeph">__device__</samp> annotations or other special markings to be compiled for GPU execution. The NVC++ compiler walks the call graph for each source file and automatically infers which functions must be compiled for GPU execution. </p> <p class="p"> However, this only works when the compiler can see the function definition in the same source file where the function is called. This is true for most inline functions and template functions but may fail when functions are defined in a different source file or linked in from an external library. Keep this in mind when writing parallel algorithm invocations that you expect to be offloaded and accelerated on NVIDIA GPUs. </p> <p class="p"> An externally defined function that is called from within a parallel algorithm region requires some form of device annotation from another GPU programming model, such as the OpenACC routine directive (refer to <a class="xref" href="index.html#stdpar-cpp-interop-openacc-routine" shape="rect">External Device Function Annotations</a> for more information). </p> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-unified-memory"><a name="stdpar-cpp-unified-memory" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-unified-memory" name="stdpar-cpp-unified-memory" shape="rect">8.2.5.2. 
Data Management in Parallel Algorithms</a></h3> <div class="body conbody"> <div class="p"> When offloading parallel algorithms to a GPU, it's essential to consider how data is accessed from the parallel region. Some GPUs may not access certain segments of the CPU's address space. Developers targeting platforms without unified shared memory or those seeking to optimize performance must be aware of these memory distinctions, as they may affect the following types of data accessed in parallel algorithm regions: <ul class="ul"> <li class="li"> Pointer data passed into lambda functions within the parallel algorithm. </li> <li class="li"> Data captured by reference in lambda functions or pointer data captured by value. </li> <li class="li"> Variables with static storage duration referenced inside the parallel algorithm. </li> </ul> To avoid memory access violations, developers must ensure that all of the above data is accessible to the GPU before the parallel algorithm is executed. </div> <p class="p"> Stdpar C⁠+⁠+ only supports <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a>, which allow data to be accessed from both the CPU and the GPU. Through support in both the CUDA device driver and the NVIDIA GPU hardware, the CUDA Unified Memory manager automatically moves some types of data based on usage. </p> <p class="p"> Stdpar with Separate Memory Mode can only be supported when the data is fully managed through the OpenACC data directives; refer to <a class="xref" href="index.html#stdpar-cpp-interop-openacc" shape="rect">Interoperability with OpenACC</a>. </p> <p class="p"> Since object-oriented design is fundamental to C++, special consideration must be given to composite data types with pointer or reference members. The data referenced or pointed to may not be stored contiguously within the composite data type. Moreover, such data might not even be allocated in the same memory segment as the composite type itself. 
As a result, when accessing both the composite data type and its referenced or pointed-to data from parallel algorithms, the developer must ensure that the member data is also made accessible to the GPU. These considerations also apply when standard library containers are used in parallel algorithms, as the containers frequently contain member pointers to their elements. </p> <p class="p"> The discussion in this section assumes familiarity with the Managed and Unified Memory Modes covered in <a class="xref" href="index.html#acc-mem-model" shape="rect">Memory Model</a> and <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a>. The code executing within the parallel algorithm is referred to as the accelerator subprogram, in contrast to the code executing outside of the parallel algorithm, which is referred to as the host subprogram. </p> <div class="section"> <h5 class="title sectiontitle">Managed Memory Mode</h5> <p class="p"> When Stdpar code is compiled with Managed Memory Mode (as the default mode or by passing <samp class="ph codeph">-⁠gpu=mem:managed</samp>), only data dynamically allocated on the heap in CPU code can be managed automatically. CPU and GPU automatic storage (stack memory) and static storage (global or static data) cannot be automatically managed. Likewise, data that is dynamically allocated in program units not compiled by <samp class="ph codeph">nvc⁠+⁠+</samp> with the <samp class="ph codeph">-⁠stdpar</samp> option is not automatically managed by CUDA Unified Memory even though it is on the CPU heap. The compiler utilizes CUDA Managed Memory for dynamic allocations to make data accessible from CPU and GPU. 
As managed memory allocation calls can incur higher runtime overhead than standard allocator calls, the implementation uses memory pools by default for performance reasons, as detailed in <a class="xref" href="index.html#gpu-mem-poolallocator" shape="rect">Memory Pool Allocator</a>. </p> <p class="p"> The Managed Memory Mode is intended for binaries run on targets with CUDA Managed Memory capability only. Any pointer that is dereferenced and any C⁠+⁠+ object that is referenced within a parallel algorithm invocation must refer to data on the CPU heap that is allocated in a program unit compiled by <samp class="ph codeph">nvc⁠+⁠+</samp> with <samp class="ph codeph">-⁠stdpar</samp>. Dereferencing a pointer to a CPU stack or a global object will result in a memory violation in GPU code. </p> </div> <div class="section"> <h5 class="title sectiontitle">Unified Memory Mode</h5> <p class="p"> When Unified Memory is the default memory mode or is selected explicitly on the command line by passing <samp class="ph codeph">-⁠gpu=mem:unified</samp>, there are no restrictions on variables accessed in the parallel algorithms. Therefore, all CPU data (whether it resides on the stack, on the heap, or in global storage) is accessible in the parallel algorithm functions. Note that memory dynamically allocated in GPU code is only visible from GPU code and can never be accessed by the CPU regardless of the CUDA Unified Memory capability. </p> <p class="p"> When compiling a binary for platforms with full CUDA Unified Memory capability, only those source files using features from the standard parallel algorithms library must be compiled by <samp class="ph codeph">nvc⁠+⁠+</samp> with the <samp class="ph codeph">-⁠stdpar</samp> option. There is no requirement that the code dynamically allocating memory accessed on GPU is also compiled in such a way. 
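</p> <p class="p"> As a minimal sketch of what Unified Memory Mode permits, the following hypothetical helper (the name <samp class="ph codeph">scaled_ramp</samp> is illustrative and not part of any NVIDIA API) iterates over a stack-resident <samp class="ph codeph">std::array</samp> and captures a stack variable by reference. The execution policy is a template parameter purely so that the same code can also be exercised sequentially on the CPU; this is an assumption of the sketch, not a requirement of the compiler: </p>

```cpp
#include <algorithm>
#include <array>
#include <execution>
#include <utility>

// Hypothetical helper (illustration only): scales a small array in place.
// Both `data` and `factor` have automatic storage, so when this is offloaded
// to a GPU it relies on the Unified Memory Mode behaviour described above.
template <class ExecutionPolicy>
std::array<int, 8> scaled_ramp(ExecutionPolicy&& policy, int factor) {
    std::array<int, 8> data{0, 1, 2, 3, 4, 5, 6, 7};  // lives on the CPU stack

    std::for_each(std::forward<ExecutionPolicy>(policy),
                  data.begin(), data.end(),
                  // `factor` is captured by reference: a pointer into the
                  // CPU stack is dereferenced inside the parallel region.
                  [&factor](int& x) { x *= factor; });
    return data;
}
```

<p class="p"> Calling <samp class="ph codeph">scaled_ramp(std::execution::par_unseq, 3)</samp> in a program built with <samp class="ph codeph">-stdpar=gpu -gpu=mem:unified</samp> would be expected to work under the rules stated above, whereas the same call on a target with CUDA Managed Memory capability only would be expected to fault because neither object is on the heap. 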
</p> <p class="p"> Unified Memory Mode may utilize CUDA Managed Memory for dynamic allocation, more details can be found in <a class="xref" href="index.html#acc-mem-unified" shape="rect">Managed and Unified Memory Modes</a>. </p> </div> <div class="section"> <h5 class="title sectiontitle">Summary</h5> <p class="p"> The following table provides a key summary of important command-line options selecting memory modes and the impact of memory modes on the Stdpar features. </p> <div class="tablenoborder"><a name="stdpar-cpp-unified-memory__mem-mode-stdpar" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="stdpar-cpp-unified-memory__mem-mode-stdpar" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 21. Stdpar C++ Feature Differences for Memory Modes</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="40%" id="d54e15004" rowspan="1" colspan="1"> Command-line options </th> <th class="entry" valign="top" width="20%" id="d54e15007" rowspan="1" colspan="1"> Dynamically allocated variables outside of parallel algorithm region </th> <th class="entry" valign="top" width="20%" id="d54e15010" rowspan="1" colspan="1"> Automatic or static storage variables outside of parallel algorithm region </th> <th class="entry" valign="top" width="20%" id="d54e15013" rowspan="1" colspan="1"> Dynamic allocator </th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e15004" rowspan="1" colspan="1"> No memory-specific flags passed, compiling on target with CUDA Managed Memory only </td> <td class="entry" valign="top" width="20%" headers="d54e15007" rowspan="1" colspan="1"> Can be accessed within parallel region code </td> <td class="entry" valign="top" width="20%" headers="d54e15010" rowspan="1" colspan="1"> Cannot be accessed within parallel algorithm code </td> <td class="entry" valign="top" width="20%" 
headers="d54e15013" rowspan="1" colspan="1"> cudaMallocManaged </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e15004" rowspan="1" colspan="1"> No memory-specific flags passed, compiling on target with full CUDA Unified Memory </td> <td class="entry" valign="top" width="20%" headers="d54e15007" rowspan="1" colspan="1"> Can be accessed within parallel region code </td> <td class="entry" valign="top" width="20%" headers="d54e15010" rowspan="1" colspan="1"> Can be accessed within parallel algorithm code </td> <td class="entry" valign="top" width="20%" headers="d54e15013" rowspan="1" colspan="1"> cudaMallocManaged or system allocators: new/malloc (compiler picks the most suitable allocator) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e15004" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">-⁠gpu=mem:managed</samp></p> </td> <td class="entry" valign="top" width="20%" headers="d54e15007" rowspan="1" colspan="1"> Can be accessed within parallel region code </td> <td class="entry" valign="top" width="20%" headers="d54e15010" rowspan="1" colspan="1"> Cannot be accessed within parallel algorithm code </td> <td class="entry" valign="top" width="20%" headers="d54e15013" rowspan="1" colspan="1"> cudaMallocManaged </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e15004" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">-⁠gpu=mem:unified</samp></p> </td> <td class="entry" valign="top" width="20%" headers="d54e15007" rowspan="1" colspan="1"> Can be accessed within parallel region code </td> <td class="entry" valign="top" width="20%" headers="d54e15010" rowspan="1" colspan="1"> Can be accessed within parallel algorithm code </td> <td class="entry" valign="top" width="20%" headers="d54e15013" rowspan="1" colspan="1"> cudaMallocManaged or system allocators: new/malloc (compiler picks the most suitable allocator) </td> </tr> <tr class="row"> <td 
class="entry" valign="top" width="40%" headers="d54e15004" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">-⁠gpu=mem:unified:managedalloc</samp></p> </td> <td class="entry" valign="top" width="20%" headers="d54e15007" rowspan="1" colspan="1"> Can be accessed within parallel region code </td> <td class="entry" valign="top" width="20%" headers="d54e15010" rowspan="1" colspan="1"> Can be accessed within parallel algorithm code </td> <td class="entry" valign="top" width="20%" headers="d54e15013" rowspan="1" colspan="1"> cudaMallocManaged </td> </tr> <tr class="row"> <td class="entry" valign="top" width="40%" headers="d54e15004" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">-⁠gpu=mem:unified:nomanagedalloc</samp></p> </td> <td class="entry" valign="top" width="20%" headers="d54e15007" rowspan="1" colspan="1"> Can be accessed within parallel region code </td> <td class="entry" valign="top" width="20%" headers="d54e15010" rowspan="1" colspan="1"> Can be accessed within parallel algorithm code </td> <td class="entry" valign="top" width="20%" headers="d54e15013" rowspan="1" colspan="1"> System allocators: new/malloc </td> </tr> </tbody> </table> </div> </div> <div class="section"> <h5 class="title sectiontitle">Examples</h5> <p class="p"> For example, <samp class="ph codeph">std::vector</samp> uses dynamically allocated memory, which is accessible from the GPU when using Stdpar. Iterating over the contents of a <samp class="ph codeph">std::vector</samp> in a parallel algorithm works as expected when compiling with either <samp class="ph codeph">-⁠gpu=mem:managed</samp> or <samp class="ph codeph">-⁠gpu=mem:unified</samp>: </p><pre xml:space="preserve"> std::vector&lt;int&gt; v = ...; std::sort(std::execution::par, v.begin(), v.end()); // Okay, accesses heap memory. </pre><p class="p"> On the other hand, <samp class="ph codeph">std::array</samp> performs no dynamic allocations. 
Its contents are stored within the <samp class="ph codeph">std::array</samp> object itself, which is often on a CPU stack. Iterating over the contents of a <samp class="ph codeph">std::array</samp> will not work on systems with only CUDA Managed Memory support unless the <samp class="ph codeph">std::array</samp> itself is allocated on the heap and the code is compiled with <samp class="ph codeph">-⁠gpu=mem:managed</samp>: </p><pre xml:space="preserve">
std::array&lt;int, 1024&gt; a = ...;
std::sort(std::execution::par, a.begin(), a.end()); // Fails on targets with CUDA Managed
                                                    // Memory capability only, array is on
                                                    // a CPU stack inaccessible from GPU.
                                                    // Works correctly on targets with full
                                                    // CUDA Unified Memory support.
</pre><p class="p"> The above example works as expected when run on a target supporting full CUDA Unified Memory capability. </p> <p class="p"> When executing on targets with CUDA Managed Memory capability only, pay particular attention to lambda captures, especially capturing data objects by reference, which may contain non-obvious pointer dereferences: </p><pre xml:space="preserve">
void saxpy(float* x, float* y, int N, float a) {
  std::transform(std::execution::par_unseq, x, x + N, y, y,
                 [<strong class="ph b">&amp;</strong>](float xi, float yi){ return a * xi + yi; });
}
</pre><p class="p"> In this example, the function parameter <samp class="ph codeph">a</samp> is captured by reference. The code within the body of the lambda, which is running on the GPU, tries to access <samp class="ph codeph">a</samp>, which is in the CPU stack memory. This attempt results in a memory violation and undefined behavior. 
In this case, the problem can easily be fixed by changing the lambda to capture by value: </p><pre xml:space="preserve">
void saxpy(float* x, float* y, int N, float a) {
  std::transform(std::execution::par_unseq, x, x + N, y, y,
                 [<strong class="ph b">=</strong>](float xi, float yi){ return a * xi + yi; });
}
</pre><p class="p"> With this one-character change, the lambda makes a copy of <samp class="ph codeph">a</samp>, which is then copied to the GPU, and there are no attempts to reference CPU stack memory from GPU code. Such code also runs correctly, without modification, on targets with full CUDA Unified Memory capability. </p> <div class="p"> If a <samp class="ph codeph">std::vector</samp> is accessed through its subscript operator in device code, the vector object itself must be accessible from the parallel code executing on the GPU. On systems with only CUDA Managed Memory support, this means that the <samp class="ph codeph">std::vector</samp> object must itself be allocated dynamically. <pre xml:space="preserve">
std::vector&lt;int&gt; v = ...;
std::for_each(std::execution::par, idx.begin(), idx.end(),
              [&amp;](auto i) {v[i] = 1;}); // Fails on targets with CUDA Managed
                                        // Memory capability only, vector object is on
                                        // a CPU stack inaccessible from GPU.
                                        // Works correctly on targets with full
                                        // CUDA Unified Memory support.
</pre> An alternative approach to managing the content of the <samp class="ph codeph">std::vector</samp> on systems with CUDA Managed Memory support only is to obtain a pointer to its element data region using the <samp class="ph codeph">data()</samp> member. 
<pre xml:space="preserve">
std::vector&lt;int&gt; v = ...;
int* vdataptr = v.data();
std::for_each(std::execution::par, idx.begin(), idx.end(),
              [=](auto i) {vdataptr[i] = 1;}); // Works, the pointer is captured by value
                                               // and the vector elements are in heap
                                               // memory
</pre></div> <p class="p"> Whether <samp class="ph codeph">-⁠gpu=mem:unified</samp> is enabled by default or passed explicitly on the command line, parallel algorithms can access global variables, and accesses to global variables from the CPU and the GPU are kept in sync. Extra care should be taken when accessing global variables within parallel algorithms, as simultaneous updates in different iterations running on the GPU can lead to data races. The following example illustrates a safe update of a global variable in a parallel algorithm, since the update only occurs in one iteration. </p><pre xml:space="preserve">
int globvar = 123;

void foo() {
  auto r = std::views::iota(0, N);
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [](auto i) {
                  if (i == N - 1)
                    globvar += 345;
                });
  // globvar is equal to 468.
}
</pre></div> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-function-pointers"><a name="stdpar-cpp-function-pointers" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-function-pointers" name="stdpar-cpp-function-pointers" shape="rect">8.2.5.3.  Parallel Algorithms and Function Pointers </a></h3> <div class="body conbody"> <p class="p"> Functions compiled to run on either the CPU or the GPU must be compiled into two different versions, one with the CPU machine instructions and one with the GPU machine instructions. </p> <p class="p"> In the current implementation, a function pointer points either to the CPU version or to the GPU version of a function. This causes problems if you attempt to pass function pointers between CPU and GPU code. You might inadvertently pass a pointer to the CPU version of the function to GPU code. 
In the future, it may be possible to automatically and seamlessly support the use of function pointers across CPU and GPU code boundaries, but it is not supported in the current implementation. </p> <p class="p"> Function pointers can’t be passed to Parallel Algorithms to be run on the GPU, and functions may not be called through a function pointer within GPU code. For example, the following code example won’t work correctly: </p><pre xml:space="preserve"> void square(int&amp; x) { x = x * x; } void square_all(std::vector&lt;int&gt;&amp; v) { std::for_each(std::execution::par_unseq, v.begin(), v.end(), &amp;square); }</pre><p class="p"> It passes a pointer to the CPU version of the function square to a parallel <samp class="ph codeph">for_each</samp> algorithm invocation. When the algorithm is parallelized and offloaded to the GPU, the program fails to resolve the function pointer to the GPU version of <samp class="ph codeph">square</samp>. </p> <p class="p"> You can often solve this issue by using a function object, which is an object with a function call operator. The function object's call operator is resolved at compile time to the GPU version of the function, instead of being resolved at run time to the incorrect CPU version of the function as in the previous example. 
For example, the following code works: </p><pre xml:space="preserve">
struct squared {
  void operator()(int&amp; x) const { x = x * x; }
};

void square_all(std::vector&lt;int&gt;&amp; v) {
  std::for_each(std::execution::par_unseq, v.begin(), v.end(), squared{});
}</pre><p class="p"> Another possible workaround is to change the function to a lambda, because a lambda is implemented as a nameless function object: </p><pre xml:space="preserve">
void square_all(std::vector&lt;int&gt;&amp; v) {
  std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                [](int&amp; x) { x = x * x; });
}</pre><p class="p"> If the function in question is too big to be converted to a function object or a lambda, then it should be possible to wrap the call to the function in a lambda: </p><pre xml:space="preserve">
void compute(int&amp; x) {
  // Assume lots and lots of code here.
}

void compute_all(std::vector&lt;int&gt;&amp; v) {
  std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                [](int&amp; x) { compute(x); });
}</pre><p class="p"> No function pointers are used in this example. </p> <p class="p"> The restriction on calling a function through a function pointer unfortunately means that passing polymorphic objects from CPU code to GPU-accelerated Parallel Algorithms is not currently supported, as virtual tables are implemented using function pointers. </p> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-iterators"><a name="stdpar-cpp-iterators" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-iterators" name="stdpar-cpp-iterators" shape="rect">8.2.5.4.  Random Access Iterators </a></h3> <div class="body conbody"> <p class="p"> The C++ Standard requires that the iterators passed to most C++ Parallel Algorithms be forward iterators. However, C++ Parallel Algorithms on GPUs only work with random access iterators. 
Passing a forward iterator or a bidirectional iterator to a GPU/CPU-accelerated Parallel Algorithm results in a compilation error. Passing raw pointers or Standard Library random access iterators to the algorithms has the best performance, but most other random-access iterators work correctly. </p> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-interop"><a name="stdpar-cpp-interop" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-interop" name="stdpar-cpp-interop" shape="rect">8.2.5.5.  Interoperability with the C++ Standard Library </a></h3> <div class="body conbody"> <p class="p"> Large parts of the C++ Standard Library can be used with stdpar on GPUs. </p> <ul class="ul"> <li class="li"><samp class="ph codeph">std::atomic&lt;T&gt;</samp> objects within GPU code work provided that <samp class="ph codeph">T</samp> is a four-byte or eight-byte integer type. </li> <li class="li">Math functions that operate on floating-point types (such as <samp class="ph codeph">sin</samp>, <samp class="ph codeph">cos</samp>, <samp class="ph codeph">log</samp>, and most of the other functions declared in <samp class="ph codeph">&lt;cmath&gt;</samp>) can be used in GPU code and resolve to the same implementations that are used in CUDA C++ programs. </li> <li class="li"><samp class="ph codeph">std::complex</samp>, <samp class="ph codeph">std::tuple</samp>, <samp class="ph codeph">std::pair</samp>, <samp class="ph codeph">std::optional</samp>, <samp class="ph codeph">std::variant</samp>, and <samp class="ph codeph">&lt;type_traits&gt;</samp> are supported and work as expected in GPU code. </li> </ul> <p class="p"> The parts of the C++ Standard Library that aren’t supported in GPU code include I/O functions and in general any function that accesses the CPU operating system. As a special case, basic <samp class="ph codeph">printf</samp> calls can be used within GPU code and leverage the same implementation that is used in NVIDIA CUDA C++. 
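</p> <p class="p"> As a minimal sketch of the supported features listed above, the following hypothetical helper (the name <samp class="ph codeph">count_roots_above</samp> and its parameters are illustrative, not part of any NVIDIA API) combines a four-byte <samp class="ph codeph">std::atomic&lt;int&gt;</samp> with <samp class="ph codeph">std::sqrt</samp> inside a parallel algorithm. It assumes that the counter should be heap-allocated and its address captured by value, in line with the data management guidelines in this chapter; the execution policy is a template parameter only so that the sketch can also be run sequentially: </p>

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <execution>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical helper (illustration only): counts the elements of `values`
// whose square root is greater than `limit`.
template <class ExecutionPolicy>
int count_roots_above(ExecutionPolicy&& policy,
                      const std::vector<double>& values, double limit) {
    // Heap-allocate the counter so that it is not a CPU stack object.
    auto counter = std::make_unique<std::atomic<int>>(0);
    std::atomic<int>* hits = counter.get();

    // The pointer and the limit are captured by value, so the lambda does
    // not dereference any reference into the CPU stack.
    std::for_each(std::forward<ExecutionPolicy>(policy),
                  values.begin(), values.end(),
                  [hits, limit](double v) {
                      if (std::sqrt(v) > limit) {
                          hits->fetch_add(1);  // atomic increment
                      }
                  });
    return hits->load();
}
```

<p class="p"> A call such as <samp class="ph codeph">count_roots_above(std::execution::par, values, 2.5)</samp> is the form intended for offloading. Note that the C++ Standard does not permit atomic operations in code run with <samp class="ph codeph">std::execution::par_unseq</samp>, so this sketch should be used with <samp class="ph codeph">par</samp> rather than <samp class="ph codeph">par_unseq</samp>. 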
</p> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-exceptions"><a name="stdpar-cpp-exceptions" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-exceptions" name="stdpar-cpp-exceptions" shape="rect">8.2.5.6.  No Exceptions in GPU Code </a></h3> <div class="body conbody"> <p class="p"> As with most other GPU programming models, throwing and catching C++ exceptions is not supported within Parallel Algorithm invocations that are offloaded to the GPU. </p> <p class="p"> Unlike some other GPU programming models where try/catch blocks and throw expressions are compilation errors, exception code does compile but with non-standard behavior. Catch clauses are ignored, and throw expressions abort the GPU kernel if actually executed. Exceptions in CPU code work without restrictions. </p> </div> </div> </div> <div class="topic concept nested2" id="stdpar-cpp-experimental"><a name="stdpar-cpp-experimental" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-experimental" name="stdpar-cpp-experimental" shape="rect">8.2.6.  NVC++ Experimental Features </a></h3> <div class="body conbody"> <p class="p"> nvc++ experimental features are enabled with the <span class="keyword option">--experimental-stdpar</span> compiler flag. Experimental feature headers are exposed via the <samp class="ph codeph">&lt;experimental/...&gt;</samp> namespaces and limited support for these features is available in older C++ versions. Table 22 lists all experimental features available and the minimum language version required to use them. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 22. 
Experimental features information</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e15431" rowspan="1" colspan="1">Feature</th> <th class="entry" valign="top" width="NaN%" id="d54e15434" rowspan="1" colspan="1">Recommended</th> <th class="entry" valign="top" width="NaN%" id="d54e15437" rowspan="1" colspan="1">Limited support</th> <th class="entry" valign="top" width="NaN%" id="d54e15440" rowspan="1" colspan="1">Standard proposal</th> <th class="entry" valign="top" width="NaN%" id="d54e15443" rowspan="1" colspan="1">Other notes</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15431" rowspan="1" colspan="1">Multi-dimensional spans (mdspan)</td> <td class="entry" valign="top" width="NaN%" headers="d54e15434" rowspan="1" colspan="1">C++23</td> <td class="entry" valign="top" width="NaN%" headers="d54e15437" rowspan="1" colspan="1">C++17</td> <td class="entry" valign="top" width="NaN%" headers="d54e15440" rowspan="1" colspan="1"><a class="xref" href="http://wg21.link/p0009" target="_blank" shape="rect">P0009</a></td> <td class="entry" valign="top" width="NaN%" headers="d54e15443" rowspan="1" colspan="1"><a class="xref" href="https://github.com/NVIDIA/libcudacxx" target="_blank" shape="rect">https://github.com/NVIDIA/libcudacxx</a></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15431" rowspan="1" colspan="1">Slices of multi-dimensional spans (submdspan)</td> <td class="entry" valign="top" width="NaN%" headers="d54e15434" rowspan="1" colspan="1">C++23</td> <td class="entry" valign="top" width="NaN%" headers="d54e15437" rowspan="1" colspan="1">C++17</td> <td class="entry" valign="top" width="NaN%" headers="d54e15440" rowspan="1" colspan="1"><a class="xref" href="http://wg21.link/p2630" target="_blank" shape="rect">P2630</a></td> <td class="entry" valign="top" width="NaN%" headers="d54e15443" 
rowspan="1" colspan="1"><a class="xref" href="https://github.com/NVIDIA/libcudacxx" target="_blank" shape="rect">https://github.com/NVIDIA/libcudacxx</a></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15431" rowspan="1" colspan="1">Multi-dimensional arrays (mdarray)</td> <td class="entry" valign="top" width="NaN%" headers="d54e15434" rowspan="1" colspan="1">C++23</td> <td class="entry" valign="top" width="NaN%" headers="d54e15437" rowspan="1" colspan="1">C++17</td> <td class="entry" valign="top" width="NaN%" headers="d54e15440" rowspan="1" colspan="1"><a class="xref" href="http://wg21.link/p1684" target="_blank" shape="rect">P1684</a></td> <td class="entry" valign="top" width="NaN%" headers="d54e15443" rowspan="1" colspan="1"><a class="xref" href="https://github.com/kokkos/mdspan" target="_blank" shape="rect">https://github.com/kokkos/mdspan</a></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15431" rowspan="1" colspan="1">Senders and receivers</td> <td class="entry" valign="top" width="NaN%" headers="d54e15434" rowspan="1" colspan="1">C++23</td> <td class="entry" valign="top" width="NaN%" headers="d54e15437" rowspan="1" colspan="1">C++20</td> <td class="entry" valign="top" width="NaN%" headers="d54e15440" rowspan="1" colspan="1"><a class="xref" href="http://wg21.link/p2300" target="_blank" shape="rect">P2300</a></td> <td class="entry" valign="top" width="NaN%" headers="d54e15443" rowspan="1" colspan="1"><a class="xref" href="https://github.com/NVIDIA/stdexec" target="_blank" shape="rect">https://github.com/NVIDIA/stdexec</a></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15431" rowspan="1" colspan="1">Linear algebra</td> <td class="entry" valign="top" width="NaN%" headers="d54e15434" rowspan="1" colspan="1">C++23</td> <td class="entry" valign="top" width="NaN%" headers="d54e15437" rowspan="1" colspan="1">C++17</td> <td class="entry" valign="top" 
width="NaN%" headers="d54e15440" rowspan="1" colspan="1"><a class="xref" href="http://wg21.link/p1673" target="_blank" shape="rect">P1673</a></td> <td class="entry" valign="top" width="NaN%" headers="d54e15443" rowspan="1" colspan="1"><a class="xref" href="https://github.com/kokkos/stdblas" target="_blank" shape="rect">https://github.com/kokkos/stdblas</a></td> </tr> </tbody> </table> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-multi-dimensional-spans"><a name="stdpar-cpp-multi-dimensional-spans" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-multi-dimensional-spans" name="stdpar-cpp-multi-dimensional-spans" shape="rect">8.2.6.1.  Multi-dimensional Spans </a></h3> <div class="body conbody"> <p class="p"> Multi-dimensional spans (<samp class="ph codeph">std::mdspan</samp>) enable customizable multi-dimensional access to data. This feature was added to C++23 (see <a class="xref" href="http://wg21.link/p0009" target="_blank" shape="rect">P0009</a> and follow-on papers). <a class="xref" href="https://github.com/kokkos/mdspan/wiki/A-Gentle-Introduction-to-mdspan" target="_blank" shape="rect">A Gentle Introduction to mdspan</a> gives a tutorial. The reference mdspan implementation <a class="xref" href="https://github.com/kokkos/mdspan" target="_blank" shape="rect">https://github.com/kokkos/mdspan</a> also has many useful examples. </p> <p class="p"> nvc++ provides an implementation available in the <samp class="ph codeph">&lt;experimental/mdspan&gt;</samp> namespace that works with C++17 or newer. It enables applications that are not targeting the C++23 version of the standard to use mdspan. 
</p> <p class="p"> nvc++ also provides the <a class="xref" href="http://wg21.link/p0009r17" target="_blank" shape="rect">P0009R17</a> version of submdspan, which only works for the mdspan layouts in C++23; that is, it does not implement C++26 submdspan (<a class="xref" href="http://wg21.link/p2630" target="_blank" shape="rect">P2630</a>) yet. </p> <p class="p"> C++23’s mdspan uses <samp class="ph codeph">operator[]</samp> for array access. For example, if <samp class="ph codeph">A</samp> is a rank-2 mdspan, and <samp class="ph codeph">i</samp> and <samp class="ph codeph">j</samp> are integers, then <samp class="ph codeph">A[i, j]</samp> accesses the element of <samp class="ph codeph">A</samp> at row <samp class="ph codeph">i</samp> and column <samp class="ph codeph">j</samp>. Before C++23, <samp class="ph codeph">operator[]</samp> was only allowed to take one argument. C++23 changed the language to permit any number of arguments (zero or more). nvc++ does not support this new language feature. As a result, the implementation of mdspan provided by nvc++ permits use of <samp class="ph codeph">operator()</samp> as a fall-back (e.g., <samp class="ph codeph">A(i, j)</samp> instead of <samp class="ph codeph">A[i, j]</samp>). Users may enable this fall-back manually, by defining the macro <samp class="ph codeph">MDSPAN_USE_PAREN_OPERATOR</samp> to <samp class="ph codeph">1</samp> before including any mdspan headers. 
</p> <p class="p"> The following example (<a class="xref" href="https://godbolt.org/z/6eG93o7YW" target="_blank" shape="rect">godbolt</a>): </p><pre xml:space="preserve">
#include &lt;experimental/mdspan&gt;
#include &lt;array&gt;
#include &lt;iostream&gt;

namespace stdex = std::experimental;

int main() {
  std::array d{
    0, 5, 1,
    3, 8, 4,
    2, 7, 6,
  };

  stdex::mdspan m{d.data(), stdex::extents{3, 3}};
  static_assert(m.rank()==2, "Rank is two");

  for (std::size_t i = 0; i &lt; m.extent(0); ++i)
    for (std::size_t j = 0; j &lt; m.extent(1); ++j)
      std::cout &lt;&lt; "m(" &lt;&lt; i &lt;&lt; ", " &lt;&lt; j &lt;&lt; ") == " &lt;&lt; m(i, j) &lt;&lt; "\n";

  return 0;
}
</pre><p class="p"> is compiled as follows: </p><pre xml:space="preserve">
nvc++ -std=c++17 -o example example.cpp
</pre><p class="p"> and outputs: </p><pre xml:space="preserve">
m(0, 0) == 0
m(0, 1) == 5
m(0, 2) == 1
m(1, 0) == 3
m(1, 1) == 8
m(1, 2) == 4
m(2, 0) == 2
m(2, 1) == 7
m(2, 2) == 6
</pre></div> </div> <div class="topic concept nested3" id="stdpar-cpp-senders-and-receivers"><a name="stdpar-cpp-senders-and-receivers" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-senders-and-receivers" name="stdpar-cpp-senders-and-receivers" shape="rect">8.2.6.2.  Senders and Receivers </a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="http://wg21.link/p2300" target="_blank" shape="rect">P2300 - std::execution</a> proposes a model of asynchronous programming for adoption into the C++26 Standard. For an introduction to this feature, see the <a class="xref" href="https://wg21.link/P2300#design-user" target="_blank" shape="rect">Design - user side</a> section of the proposal. 
The NVIDIA implementation of senders and receivers is <a class="xref" href="https://github.com/NVIDIA/stdexec" target="_blank" shape="rect">open source</a> and its repository contains many <a class="xref" href="https://github.com/NVIDIA/stdexec/tree/main/examples" target="_blank" shape="rect">useful examples.</a> nvc++ provides access to the NVIDIA implementation, which works in C++20 or newer. Since the proposal is still evolving, our implementation is not stable. It is experimental in nature and will change to follow the proposal closely without any warning. The NVIDIA implementation is structured as follows: </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e15699" rowspan="1" colspan="1">Includes</th> <th class="entry" valign="top" width="NaN%" id="d54e15702" rowspan="1" colspan="1">Namespace</th> <th class="entry" valign="top" width="NaN%" id="d54e15705" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15699" rowspan="1" colspan="1">&lt;stdexec/...&gt;</td> <td class="entry" valign="top" width="NaN%" headers="d54e15702" rowspan="1" colspan="1">::stdexec</td> <td class="entry" valign="top" width="NaN%" headers="d54e15705" rowspan="1" colspan="1">Approved for C++ standard</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15699" rowspan="1" colspan="1">&lt;exec/...&gt;</td> <td class="entry" valign="top" width="NaN%" headers="d54e15702" rowspan="1" colspan="1">::exec</td> <td class="entry" valign="top" width="NaN%" headers="d54e15705" rowspan="1" colspan="1">Generic additions and extensions</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15699" rowspan="1" colspan="1">&lt;nvexec/...&gt;</td> <td 
class="entry" valign="top" width="NaN%" headers="d54e15702" rowspan="1" colspan="1">::nvexec</td> <td class="entry" valign="top" width="NaN%" headers="d54e15705" rowspan="1" colspan="1">NVIDIA-specific extensions and customizations</td> </tr> </tbody> </table> </div> <p class="p"> The following example (<a class="xref" href="https://godbolt.org/z/axbhYs7vj" target="_blank" shape="rect">godbolt</a>) builds a task graph in which two different vectors, v0 and v1, are concurrently modified in bulk, using a CPU thread pool and a GPU stream context, respectively. This graph then transfers execution to the CPU thread pool, and adds both vectors into v2 on the CPU, returning the sum of all elements: </p><pre xml:space="preserve">
int main()
{
  // Declare a pool of 8 worker CPU threads:
  exec::static_thread_pool pool(8);

  // Declare a GPU stream context:
  nvexec::stream_context stream_ctx{};

  // Get handles to the CPU and GPU schedulers:
  auto cpu_sched = pool.get_scheduler();
  auto gpu_sched = stream_ctx.get_scheduler();

  // Declare three dynamic arrays with N elements each:
  std::size_t N = 5;
  std::vector&lt;int&gt; v0 {1, 1, 1, 1, 1};
  std::vector&lt;int&gt; v1 {2, 2, 2, 2, 2};
  std::vector&lt;int&gt; v2 {0, 0, 0, 0, 0};

  // Describe some work:
  auto work = stdexec::when_all(
    // Double v0 on the CPU
    stdexec::just()
      | exec::on(cpu_sched,
                 stdexec::bulk(N, [v0 = v0.data()](std::size_t i) {
                   v0[i] *= 2;
                 })),
    // Triple v1 on the GPU
    stdexec::just()
      | exec::on(gpu_sched,
                 stdexec::bulk(N, [v1 = v1.data()](std::size_t i) {
                   v1[i] *= 3;
                 }))
  )
  | stdexec::transfer(cpu_sched)
  // Add the two vectors into the output vector v2 = v0 + v1:
  | stdexec::bulk(N, [&amp;](std::size_t i) { v2[i] = v0[i] + v1[i]; })
  | stdexec::then([&amp;] {
      int r = 0;
      for (std::size_t i = 0; i &lt; N; ++i) r += v2[i];
      return r;
    });

  auto [sum] = stdexec::sync_wait(work).value();

  // Print the results:
  std::printf("sum = %d\n", sum);
  for (int i = 0; i &lt; N; ++i) {
    std::printf("v0[%d] = %d, v1[%d] = %d, v2[%d] = %d\n",
                i, v0[i], i, v1[i], i, v2[i]);
  }
  return 0;
}
</pre><p class="p"> is compiled as follows: </p><pre xml:space="preserve">
nvc++ --stdpar=gpu --experimental-stdpar -std=c++20 -o example example.cpp
</pre><p class="p"> and outputs: </p><pre xml:space="preserve">
sum = 40
v0[0] = 2, v1[0] = 6, v2[0] = 8
v0[1] = 2, v1[1] = 6, v2[1] = 8
v0[2] = 2, v1[2] = 6, v2[2] = 8
v0[3] = 2, v1[3] = 6, v2[3] = 8
v0[4] = 2, v1[4] = 6, v2[4] = 8
</pre></div> </div> <div class="topic concept nested3" id="stdpar-cpp-linear-algebra"><a name="stdpar-cpp-linear-algebra" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-linear-algebra" name="stdpar-cpp-linear-algebra" shape="rect">8.2.6.3.  Linear Algebra </a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="https://wg21.link/p1673" target="_blank" shape="rect">P1673 - A free function linear algebra interface based on the BLAS</a> proposes standardizing an idiomatic C++ interface based on std::mdspan for a subset of the Basic Linear Algebra Subroutines (BLAS) standard. For an introduction to this feature, see <a class="xref" href="https://youtu.be/n7mBGDqSzlQ" target="_blank" shape="rect">P1673 (C++ linear algebra library) background &amp; motivation</a>. There are many useful examples available in $HPCSDK_HOME/examples/stdpar/stdblas and in the repository of the <a class="xref" href="https://github.com/kokkos/stdBLAS/tree/main/examples" target="_blank" shape="rect">reference implementation</a>. Detailed documentation is available at $HPCSDK_HOME/compilers/include/experimental/__p1673_bits/README.md. nvc++ provides access to the NVIDIA implementation, which works in C++17 or newer. Since the proposal is still evolving, our implementation is not stable. It is experimental in nature and will change to follow the proposal closely without any warning. 
To use the linear algebra library facilities, a suitable linear algebra library must be linked: cuBLAS for GPU execution via the -cudalib=cublas flag, and a CPU BLAS library for CPU execution. The HPC SDK bundles OpenBLAS, which may be linked using the -lblas linker flag. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e15810" rowspan="1" colspan="1">Execution</th> <th class="entry" valign="top" width="NaN%" id="d54e15813" rowspan="1" colspan="1">BLAS library</th> <th class="entry" valign="top" width="NaN%" id="d54e15816" rowspan="1" colspan="1">Architectures</th> <th class="entry" valign="top" width="NaN%" id="d54e15819" rowspan="1" colspan="1">Compiler flags</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15810" rowspan="1" colspan="1">Multicore</td> <td class="entry" valign="top" width="NaN%" headers="d54e15813" rowspan="1" colspan="1">OpenBLAS</td> <td class="entry" valign="top" width="NaN%" headers="d54e15816" rowspan="1" colspan="1">x86_64, aarch64, ppc64le</td> <td class="entry" valign="top" width="NaN%" headers="d54e15819" rowspan="1" colspan="1"><samp class="ph codeph">-stdpar=multicore -lblas</samp></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e15810" rowspan="1" colspan="1">GPU</td> <td class="entry" valign="top" width="NaN%" headers="d54e15813" rowspan="1" colspan="1">cuBLAS</td> <td class="entry" valign="top" width="NaN%" headers="d54e15816" rowspan="1" colspan="1">All</td> <td class="entry" valign="top" width="NaN%" headers="d54e15819" rowspan="1" colspan="1"><samp class="ph codeph">-stdpar=gpu -cudalib=cublas</samp></td> </tr> </tbody> </table> </div> <p class="p"> The following example (<a class="xref" href="https://godbolt.org/z/8cT537dfo" target="_blank" 
shape="rect">godbolt</a>): </p><pre xml:space="preserve">
#include &lt;experimental/mdspan&gt;
#include &lt;experimental/linalg&gt;
#include &lt;array&gt;
#include &lt;cstdio&gt;
#include &lt;execution&gt;
#include &lt;vector&gt;

namespace stdex = std::experimental;

int main()
{
  constexpr size_t N = 4;
  constexpr size_t M = 2;

  std::vector&lt;double&gt; A_vec(N*M);
  std::vector&lt;double&gt; x_vec(M);
  std::array&lt;double, N&gt; y_vec{};  // std::array has no size constructor

  stdex::mdspan A(A_vec.data(), N, M);
  stdex::mdspan x(x_vec.data(), M);
  stdex::mdspan y(y_vec.data(), N);

  for(int i = 0; i &lt; A.extent(0); ++i)
    for(int j = 0; j &lt; A.extent(1); ++j)
      A(i,j) = 100.0 * i + j;

  for(int j = 0; j &lt; x.extent(0); ++j) x(j) = 1.0 * j;
  for(int i = 0; i &lt; y.extent(0); ++i) y(i) = -1.0 * i;

  stdex::linalg::matrix_vector_product(A, x, y); // y = A * x

  // y = 0.5 * y + 2 * A * x
  stdex::linalg::matrix_vector_product(std::execution::par,
    stdex::linalg::scaled(2.0, A), x,
    stdex::linalg::scaled(0.5, y), y);

  // Print the results:
  for (int i = 0; i &lt; N; ++i) std::printf("y[%d] = %f\n", i, y(i));
  return 0;
}
</pre><p class="p"> is compiled as follows for GPU execution: </p><pre xml:space="preserve">
nvc++ -std=c++17 -stdpar=gpu -cudalib=cublas -o example example.cpp
</pre><p class="p"> and as follows for CPU execution: </p><pre xml:space="preserve">
nvc++ -std=c++17 -stdpar=multicore -o example example.cpp -lblas
</pre><p class="p"> and produces the same output in both cases: </p><pre xml:space="preserve">
y[0] = 2.500000
y[1] = 252.500000
y[2] = 502.500000
y[3] = 752.500000
</pre></div> </div> </div> <div class="topic concept nested2" id="stdpar-cpp-example-2"><a name="stdpar-cpp-example-2" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-example-2" name="stdpar-cpp-example-2" shape="rect">8.2.7.  
Stdpar C++ Larger Example: LULESH </a></h3> <div class="body conbody"> <p class="p"> The <a class="xref" href="https://github.com/LLNL/LULESH" target="_blank" shape="rect">LULESH hydrodynamics mini-app</a> was developed at Lawrence Livermore National Laboratory to stress test compilers and model performance of hydrodynamics applications. It is about 9,000 lines of C++ code, of which 2,800 lines are the core computation that should be parallelized. </p> <p class="p"> We ported LULESH to C++ Parallel Algorithms and made the port available on <a class="xref" href="https://github.com/LLNL/LULESH/tree/2.0.2-dev/stdpar" target="_blank" shape="rect">LULESH's GitHub repository</a>. To compile it, install the <a class="xref" href="https://developer.nvidia.com/hpc-sdk" target="_blank" shape="rect">NVIDIA HPC SDK</a>, check out the 2.0.2-dev branch of the LULESH repository, go to the <samp class="ph codeph">stdpar/build</samp> directory, and run <samp class="ph codeph">make run</samp>. </p><pre class="pre screen" xml:space="preserve">
git clone --branch 2.0.2-dev https://github.com/LLNL/LULESH.git
cd LULESH/stdpar/build
make run
</pre><p class="p"> While LULESH is too large to show the entire source code here, there are some key code sequences that demonstrate the use of stdpar. </p> <p class="p"> The LULESH code has many loops with large bodies and no loop-carried dependencies, making them good candidates for parallelization. Most of these were easily converted into calls to <samp class="ph codeph">std::for_each_n</samp> with the <samp class="ph codeph">std::execution::par</samp> policy, where the body of the lambda passed to <samp class="ph codeph">std::for_each_n</samp> is identical to the original loop body. </p> <p class="p"> The function <samp class="ph codeph">CalcMonotonicQRegionForElems</samp> is an example of this. 
The loop header written for OpenMP looks as follows: </p><pre xml:space="preserve">
#pragma omp parallel for firstprivate(qlc_monoq, qqc_monoq, \
                                      monoq_limiter_mult, monoq_max_slope, ptiny)
for ( Index_t i = 0 ; i &lt; domain.regElemSize(r); ++i ) {</pre><p class="p"> This loop header in the C++ Parallel Algorithms version becomes <a class="xref" href="https://github.com/LLNL/LULESH/blob/2.0.2-dev/stdpar/src/lulesh.cc#L1555-L1756" target="_blank" shape="rect">the following</a>: </p><pre xml:space="preserve">
std::for_each_n(
  std::execution::par, counting_iterator(0), domain.regElemSize(r),
  [=, &amp;domain](Index_t i) {</pre><p class="p"> The loop body, which in this case is almost 200 lines long, becomes the body of the lambda but is otherwise unchanged from the OpenMP version. </p> <p class="p"> In a number of places, an explicit <samp class="ph codeph">for</samp> loop was changed to use C++ Parallel Algorithms that better express the intent of the code, such as in the function <samp class="ph codeph">CalcPressureForElems</samp>: </p><pre xml:space="preserve">
#pragma omp parallel for firstprivate(length)
for (Index_t i = 0; i &lt; length ; ++i) {
  Real_t c1s = Real_t(2.0)/Real_t(3.0) ;
  bvc[i] = c1s * (compression[i] + Real_t(1.));
  pbvc[i] = c1s;
}</pre><p class="p"> This function was rewritten <a class="xref" href="https://github.com/LLNL/LULESH/blob/2.0.2-dev/stdpar/src/lulesh.cc#L1825-L1830" target="_blank" shape="rect">as follows</a>: </p><pre xml:space="preserve">
constexpr Real_t cls = Real_t(2.0) / Real_t(3.0);
std::transform(std::execution::par,
               compression, compression + length, bvc,
               [=](Real_t compression_i) {
                 return cls * (compression_i + Real_t(1.0));
               });
std::fill(std::execution::par, pbvc, pbvc + length, cls);</pre></div> </div> <div class="topic concept nested2" id="stdpar-cpp-interop-openacc"><a name="stdpar-cpp-interop-openacc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-interop-openacc" 
name="stdpar-cpp-interop-openacc" shape="rect">8.2.8.  Interoperability with OpenACC </a></h3> <div class="body conbody"> <div class="p"> A subset of OpenACC features can be used when compiling Stdpar code for GPUs. Such a subset is documented in this section. To activate OpenACC directives recognition with Stdpar code add <span class="keyword option">-acc</span> command line flag to nvc++. <pre xml:space="preserve"> nvc++ -stdpar -acc example.cpp </pre> OpenACC functionality is detailed in the OpenACC specification and the NVHPC compiler specific differences are detailed in <a class="xref" href="index.html#acc-use" shape="rect">Using OpenACC</a> of this guide. </div> <p class="p"> Combining OpenACC features with Stdpar offers greater flexibility in how code is written. For instance, it allows external functions to be called from within parallel algorithms. Additionally, it provides opportunities for performance tuning, such as through explicit data management. </p> </div> <div class="topic concept nested3" id="stdpar-cpp-interop-openacc-dataman"><a name="stdpar-cpp-interop-openacc-dataman" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-interop-openacc-dataman" name="stdpar-cpp-interop-openacc-dataman" shape="rect">8.2.8.1. Data Management Directives</a></h3> <div class="body conbody"> <p class="p"> C⁠+⁠+ parallel algorithms can be offloaded to the GPU when the data accessed in such algorithms is managed through the OpenACC directives. With data fully managed through the OpenACC directives, Stdpar code can run with all GPU Memory Modes including Separate Memory Mode (compiled with <span class="keyword option">-gpu=mem:separate</span>). 
</p> <div class="p"> The following data directives are supported: <ul class="ul"> <li class="li"> OpenACC structured data construct directive </li> <li class="li"> OpenACC unstructured enter/exit data directives </li> <li class="li"> OpenACC host_data directive </li> <li class="li"> OpenACC update directive </li> </ul> </div> <p class="p"> Only the data that are captured by reference or pointer-like data captured by values as well as pointer-like data passed as arguments in the parallel algorithm lambdas can be managed through OpenACC. Any non-pointer variables that are captured by value in the parallel algorithm lambda or non-pointer data passed in as lambda arguments are managed by the C++ implementation. A copy of such data is automatically created in the memory accessible from the GPU. For additional details refer to <a class="xref" href="index.html#stdpar-cpp-unified-memory" shape="rect">Data Management in Parallel Algorithms</a>. </p> <div class="p"> OpenACC data management can serve two main purposes: <ul class="ul"> <li class="li"><strong class="ph b">Explicit Data Management:</strong> This is necessary for data that cannot be managed implicitly, such as on platforms without full CUDA Unified Memory support and when data is not allocated in the CUDA Managed Memory segment. </li> <li class="li"><strong class="ph b">Performance Tuning:</strong> Even when data is located in the GPU-accessible memory, performance can be optimized via OpenACC features. Many OpenACC data directives and clauses provide hints to the CUDA device driver, which can improve implicit data management. </li> </ul> Data management strategies may differ depending on the specific goals being pursued. These differences are outlined where applicable. 
</div> <div class="section"> <h5 class="title sectiontitle">General Rules</h5> <div class="p"> All directives, except <samp class="ph codeph">host_data</samp>, can be used for data management tasks such as allocating memory in the GPU and copying data between the CPU and the GPU. These directives can be used to ensure that the data is present on the device during the execution of parallel algorithms. The <samp class="ph codeph">host_data</samp> construct, on the other hand, is used for address translation between CPU and GPU address spaces when data is accessed in parallel algorithms. <pre xml:space="preserve">
int n = get_n();
T* in = new T[nelem];
T* out = new T[nelem];
// Data captured by the lambda are managed explicitly with OpenACC
#pragma acc enter data copyin(n, in[0:nelem]) create(out[0:nelem])
#pragma acc host_data use_device(n, in, out)
{
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&amp;,in,out](auto i) { out[i] = in[i] * n; });
}
#pragma acc exit data copyout(out[0:nelem])
</pre> In the above example, all data accessed from <samp class="ph codeph">std::for_each</samp> through the lambda capture are managed explicitly through the OpenACC data directives. Since the data used inside the parallel algorithm are captured either by reference or through a pointer, the application code must ensure that such data is accessible from the GPU. To make non-GPU resident data accessible in the parallel region, the region must be enclosed in a <samp class="ph codeph">host_data</samp> construct, with all variables that are managed explicitly via the OpenACC runtime listed in the <samp class="ph codeph">use_device</samp> clause. The data must be present (copied or created) when the <samp class="ph codeph">host_data</samp> directive is encountered at runtime, and must remain present for the duration of the parallel algorithm execution. 
As a consequence, lambdas accessing variables enclosed in <samp class="ph codeph">use_device</samp> regions cannot additionally be invoked from the host code (from outside the parallel region executing on the GPU) because the variable addresses from the GPU obtained through <samp class="ph codeph">host_data</samp> may not be accessible on the CPU. </div> <div class="note note"><span class="notetitle">Note:</span> If the iterator in the above example were a pointer type, it would require explicit data management in addition to the data captured by the lambda. </div> <div class="p"> If the example below is compiled for Separate Memory Mode (<span class="keyword option">-gpu=mem:separate</span>), calling <samp class="ph codeph">fn</samp> from within a parallel <samp class="ph codeph">std::for_each</samp> works fine, but calling it from outside of any parallel algorithm function does not, since the data resident on the GPU would need to be accessed from the CPU. <pre xml:space="preserve">
int n = get_n();
T* in = new T[nelem];
T* out = new T[nelem];
#pragma acc enter data copyin(n, in[0:nelem]) create(out[0:nelem])
#pragma acc host_data use_device(n, in, out)
{
  auto fn = [&amp;,in,out](auto i) { out[i] = in[i] * n;};
  std::for_each(std::execution::par_unseq, r.begin(), r.end(), fn);
  // The following line would not be legal, fn accesses variables in GPU memory
  //std::for_each(r.begin(), r.end(), fn);
}
#pragma acc exit data copyout(out[0:nelem])
</pre></div> <div class="note note"><span class="notetitle">Note:</span> When <samp class="ph codeph">use_device</samp> is used with a non-pointer data type, all occurrences of the non-pointer variable inside the <samp class="ph codeph">host_data</samp> region are converted to use the address of the variable in the GPU address space before accessing that variable. 
This is essentially equivalent to translating original occurrences of such variable <samp class="ph codeph">var</samp> into <samp class="ph codeph">dvar = *acc_device(&amp;var)</samp>. </div> </div> <div class="section"> <h5 class="title sectiontitle">Composite Data Types</h5> <p class="p"> Composite data types with pointer members can also be managed explicitly but require explicit deep copy to work correctly, including pointer attach/detach. </p> <div class="p"><pre xml:space="preserve">
struct S {
  float *ptr;
};
int idx[N] = {/*...*/};
float arr[N];
S s{arr};
// Deep copying ptr member with OpenACC
#pragma acc enter data copyin(s.ptr[0:N])
#pragma acc enter data copyin(s, idx)
#pragma acc data attach(s.ptr)
#pragma acc host_data use_device(s, idx)
{
  std::for_each_n(std::execution::par, idx, N,
                  [&amp;](int i) { s.ptr[i] += 5.0; });
}
#pragma acc exit data copyout(s.ptr[0:N])
#pragma acc exit data copyout(s)
</pre> When the variable <samp class="ph codeph">s</samp> of type <samp class="ph codeph">S</samp> in the above example is copied to the device, a deep copy is performed, with the content pointed to by <samp class="ph codeph">s.ptr</samp> copied separately. The pointer attachment is used to ensure the address of the pointer is changed to the device memory equivalent before it is accessed from the GPU. Depending on the order of the copies, the pointer <samp class="ph codeph">attach</samp> clause may not be required. </div> <div class="note note"><span class="notetitle">Note:</span> In the above example the pointer-like iterator <samp class="ph codeph">idx</samp> is managed through the OpenACC directives in addition to the data captured by the lambda. </div> </div> <div class="section"> <h5 class="title sectiontitle">Standard Containers</h5> <div class="p"> If the standard containers with non-contiguous storage must be used in host code with explicit data management to GPU memory, the only viable option is to access the raw data directly using the raw pointer to data (e.g. 
obtained via <samp class="ph codeph">data()</samp> member of <samp class="ph codeph">std::vector</samp>) unless the iterator over the data can be used. <pre xml:space="preserve">
std::vector&lt;T&gt; in(nelem);
std::vector&lt;T&gt; out(nelem);
T *inptr=in.data(),*outptr=out.data();
#pragma acc data copyin(inptr[0:nelem]) copyout(outptr[0:nelem])
#pragma acc host_data use_device(inptr,outptr)
{
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [=](auto i) { outptr[i] = inptr[i]; });
}
</pre> In the above example, vector elements are accessed through raw pointers obtained from the <samp class="ph codeph">vector::data()</samp> member function; they are explicitly managed through the OpenACC data clauses. </div> </div> <div class="section"> <h5 class="title sectiontitle">Static Storage Data</h5> <div class="p"> Global or static variables can be made accessible in the parallel algorithms using OpenACC data directives similarly to other variables. <pre xml:space="preserve">
int glob_arr[N] = {/*...*/};
void foo(){
  #pragma acc data copy(glob_arr)
  #pragma acc host_data use_device(glob_arr)
  {
    std::for_each_n(std::execution::par, glob_arr, N,
                    [](int &amp;e) { e += 1; });
  }
}
</pre> In the above example the global array <samp class="ph codeph">glob_arr</samp> is updated on the GPU with the help of OpenACC data directives. </div> </div> <div class="section"> <h5 class="title sectiontitle">Member Functions</h5> <div class="p"> When data members are managed inside member functions, the implicit object pointer <samp class="ph codeph">this</samp> needs to be explicitly managed for correctness, as members are always accessed by dereferencing the object itself. 
<pre xml:space="preserve">
struct S {
  float *ptr;
  void update_member() {
    #pragma acc data copy(ptr[0:N], this)
    #pragma acc host_data use_device(ptr, this)
    {
      std::for_each(std::execution::par, ptr, ptr + N,
                    [=](float &amp;e) { ptr[&amp;e - ptr] += 5.0; });
    }
  }
};
</pre></div> </div> <div class="section"> <h5 class="title sectiontitle">GPU Memory Mode Related Differences</h5> <p class="p"> In Separate Memory Mode all data must be managed explicitly via extra device allocations and <samp class="ph codeph">memcpy</samp> between the host and device and the address translations. This also applies to variables with automatic or static storage duration in Managed Memory Mode. </p> <p class="p"> In Unified Memory Mode all data is automatically managed by the CUDA device driver. Additionally, in Managed Memory Mode all dynamic allocations are managed by the CUDA device driver. For such data, data clauses and directives only propagate memory usage hints to the CUDA device driver, which uses them to improve data management performance. More details can be found in Memory Model and <a class="xref" href="index.html#acc-openacc-unified-mem" shape="rect">OpenACC with CUDA Unified Memory</a>. </p> <div class="p"> All the data managed by the CUDA device driver can benefit from the simplified uses of the OpenACC features, particularly: <ul class="ul"> <li class="li"> Use of <samp class="ph codeph">host_data</samp> directive is not required since the host and device address of data in unified shared memory is identical. </li> <li class="li"> Use of pointer attach or detach is not required since the host and device pointers in unified shared memory are identical. </li> </ul> The following example illustrates simplified data management in Unified Memory Mode, with only an OpenACC data construct enclosing the <samp class="ph codeph">std::for_each</samp>. 
<pre xml:space="preserve">
int n = get_n();
T* in = new T[nelem];
T* out = new T[nelem];
#pragma acc data copyin(in[0:nelem]) copyout(out[0:nelem])
{
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&amp;](auto i) { out[i] = in[i] * n; });
}
</pre> In the above example we leverage OpenACC explicit data management construct to indicate how data is used on GPU for the computation executed in <samp class="ph codeph">std::for_each</samp>: <ul class="ul"> <li class="li"><samp class="ph codeph">in</samp> is moved into the GPU memory; </li> <li class="li"><samp class="ph codeph">out</samp> is moved from the GPU memory. </li> </ul> Both <samp class="ph codeph">in</samp> and <samp class="ph codeph">out</samp> are captured by reference and therefore their host address is used in the lambda of <samp class="ph codeph">std::for_each</samp>. The scalar variable <samp class="ph codeph">n</samp> is not managed. The use of <samp class="ph codeph">host_data</samp> construct is not required. </div> <div class="p"> When standard containers are used in data directives and clauses, the underlying data collection can be managed too. For example, in order to indicate that elements of the <samp class="ph codeph">std::vector</samp> are accessed from the GPU the application code must first retrieve the pointer to the array elements using its <samp class="ph codeph">data()</samp> member. Then such pointers can be used in the regular data directives. 
<pre xml:space="preserve">
std::vector&lt;T&gt; in(nelem);
std::vector&lt;T&gt; out(nelem);
T *inptr=in.data(), *outptr=out.data();
#pragma acc data copyin(inptr[0:nelem]) copyout(outptr[0:nelem])
{
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&amp;](auto i) { out[i] = in[i]; });
}
</pre> The above example demonstrates the use of OpenACC data directives with a raw pointer to the elements of a <samp class="ph codeph">std::vector</samp>, which can improve memory performance for data in unified memory; a full deep copy of the vector content using attach/detach is not required. <pre xml:space="preserve">
int n = get_n();
T* in = new T[nelem];
T* out = new T[nelem];
#pragma acc enter data copyin(n)
#pragma acc host_data use_device(n)
{
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&amp;, in, out](auto i) { out[i] = in[i] * n; });
}
#pragma acc exit data delete(n)
</pre> In the above example, <samp class="ph codeph">in</samp> and <samp class="ph codeph">out</samp> are dynamically allocated and managed by the CUDA device driver in Managed Memory Mode, while <samp class="ph codeph">n</samp> is on the stack and therefore managed explicitly via OpenACC directives. </div> </div> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-interop-openacc-routine"><a name="stdpar-cpp-interop-openacc-routine" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-interop-openacc-routine" name="stdpar-cpp-interop-openacc-routine" shape="rect">8.2.8.2. External Device Function Annotations</a></h3> <div class="body conbody"> <div class="p"> Using OpenACC routine directive annotations allows calling external device functions. 
<pre xml:space="preserve">
// In file1.cpp
extern int foo();
void bar() {
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [=](auto i) { out[i] = foo(); });
}

// In file2.cpp
#pragma acc routine
int foo(){ return 4; }
</pre> The above code can be compiled and linked as follows: <pre xml:space="preserve">
nvc++ -stdpar -c file1.cpp
nvc++ -acc -c file2.cpp
nvc++ -stdpar -acc file1.o file2.o
</pre></div> </div> </div> </div> <div class="topic concept nested2" id="stdpar-cpp-gpu-getting-started"><a name="stdpar-cpp-gpu-getting-started" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-gpu-getting-started" name="stdpar-cpp-gpu-getting-started" shape="rect">8.2.9.  Getting Started with Parallel Algorithms for GPUs </a></h3> <div class="body conbody"> <p class="p"> To get started, download and install the <a class="xref" href="https://developer.nvidia.com/hpc-sdk" target="_blank" shape="rect">NVIDIA HPC SDK</a> on your x86-64, OpenPOWER, or Arm CPU-based system running a supported version of Linux. </p> <p class="p"> The NVIDIA HPC SDK is freely downloadable and includes a perpetual use license for all NVIDIA Registered Developers, including access to future release updates as they are issued. After you have the NVIDIA HPC SDK installed on your system, the nvc++ compiler is available under the <samp class="ph codeph">/opt/nvidia/hpc_sdk</samp> directory structure. </p> <ul class="ul"> <li class="li"> To use the compilers, including nvc++, on a Linux/x86-64 system, add the directory <samp class="ph codeph">/opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin</samp> to your path. </li> <li class="li"> On an OpenPOWER or Arm CPU-based system, replace <samp class="ph codeph">Linux_x86_64</samp> with <samp class="ph codeph">Linux_ppc64le</samp> or <samp class="ph codeph">Linux_aarch64</samp>, respectively. 
</li> </ul> </div> <div class="topic concept nested3" id="stdpar-cpp-supported-gpus"><a name="stdpar-cpp-supported-gpus" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-supported-gpus" name="stdpar-cpp-supported-gpus" shape="rect">8.2.9.1.  Supported NVIDIA GPUs </a></h3> <div class="body conbody"> <p class="p"> The NVC⁠+⁠+ compiler can automatically offload C⁠+⁠+ Parallel Algorithms to NVIDIA GPUs based on the Volta architecture or newer. These architectures include features -- such as independent thread scheduling and hardware optimizations for CUDA Unified Memory -- that were specifically designed to support high-performance, general-purpose parallel programming models like the C⁠+⁠+ Parallel Algorithms. </p> <p class="p"> The NVC⁠+⁠+ compiler provides limited support for C⁠+⁠+ Parallel Algorithms on the Pascal architecture, which does not have the <a class="xref" href="https://devblogs.nvidia.com/inside-volta/" target="_blank" shape="rect">independent thread scheduling</a> necessary to properly support the <samp class="ph codeph">std::execution::par</samp> policy. When compiling for the Pascal architecture (<span class="keyword option">-⁠gpu=cc60</span>), NVC⁠+⁠+ compiles algorithms with the <samp class="ph codeph">std::execution::par</samp> policy for serial execution on the CPU. Only algorithms with the <samp class="ph codeph">std::execution::par_unseq</samp> policy will be scheduled to run on Pascal GPUs. </p> </div> </div> <div class="topic concept nested3" id="stdpar-cpp-supported-cuda"><a name="stdpar-cpp-supported-cuda" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-cpp-supported-cuda" name="stdpar-cpp-supported-cuda" shape="rect">8.2.9.2.  Supported CUDA Versions </a></h3> <div class="body conbody"> <p class="p"> The NVC++ compiler is built on CUDA libraries and technologies and uses CUDA to accelerate C++ Parallel Algorithms on NVIDIA GPUs. 
A GPU-accelerated system on which NVC++-compiled applications are to be run must have a CUDA 11.2 or newer device driver installed. </p> <p class="p"> The NVIDIA HPC SDK compilers ship with an integrated CUDA toolchain, header files, and libraries to use during compilation, so it is not necessary to have a CUDA Toolkit installed on the system. </p> <p class="p"> When <span class="keyword option">-⁠stdpar</span> is specified, NVC++ compiles using the CUDA toolchain version that best matches the CUDA driver installed on the system on which compilation is performed. To compile using a different version of the CUDA toolchain, use the <span class="keyword option">-⁠gpu=cudaX.Y</span> option. For example, use the <span class="keyword option">-⁠gpu=cuda11.8</span> option to specify that your program should be compiled for a CUDA 11.8 system using the CUDA 11.8 toolchain. </p> </div> </div> </div> </div> <div class="topic concept nested1" id="stdpar-fortran"><a name="stdpar-fortran" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-fortran" name="stdpar-fortran" shape="rect">8.3. Stdpar Fortran</a></h3> <div class="body conbody"> <p class="p"> Fortran 2008 introduced the <samp class="ph codeph">do concurrent</samp> (DC) loop construct, which signals that loop iterations have no interdependencies. Such loop iterations are executed in parallel on the GPU when <samp class="ph codeph">-⁠stdpar</samp> (or <samp class="ph codeph">-⁠stdpar=gpu</samp>) is passed to <samp class="ph codeph">nvfortran</samp>, or using CPU threads when <samp class="ph codeph">-⁠stdpar=multicore</samp> is passed to <samp class="ph codeph">nvfortran</samp>. 
More details can be found in the following blog post on the NVIDIA website: <a class="xref" href="https://developer.nvidia.com/blog/accelerating-fortran-do-concurrent-with-gpus-and-the-nvidia-hpc-sdk" target="_blank" shape="rect">Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK</a>. </p> </div> <div class="topic concept nested2" id="stdpar-fortran-routine"><a name="stdpar-fortran-routine" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-fortran-routine" name="stdpar-fortran-routine" shape="rect">8.3.1. Calling Routines in DO CONCURRENT on the GPU</a></h3> <div class="body conbody"> <p class="p"> When compiling for the GPU, calling routines in the body of a <samp class="ph codeph">do concurrent</samp> loop is subject to some constraints. PURE routines can generally be called inside the <samp class="ph codeph">do concurrent</samp> loop body. The compiler detects that such routines are to be compiled for the GPU target. External routines, however, can't be called from within the DC loop unless they are explicitly annotated with the OpenACC routine directive (refer to <a class="xref" href="index.html#stdpar-fortran-interop-openacc" shape="rect">Interoperability with OpenACC</a>) or the CUDA device attribute (refer to <a class="xref" href="index.html#stdpar-fortran-interop-cuf" shape="rect">Interoperability with CUDA Fortran</a>). </p> <div class="p"> The following example will compile successfully. <pre xml:space="preserve">
module m
contains
   pure subroutine foo()
   return
   end subroutine
end module m

program dc
   use m
   implicit none
   integer :: i

   do concurrent (i=1:10)
      call foo()
   enddo
end program
</pre></div> <div class="p"> The following example, however, doesn't compile unless <samp class="ph codeph">foo</samp> is either <ul class="ul"> <li class="li"> annotated with <samp class="ph codeph">!$acc routine</samp>, </li> <li class="li"> or attributed with <samp class="ph codeph">attributes(device)</samp> and compiled as Stdpar and CUDA Fortran. 
</li> </ul><pre xml:space="preserve">
program dc
   implicit none
   interface
      pure subroutine foo()
      end subroutine foo
   end interface
   integer :: i

   do concurrent (i=1:10)
      call foo()
   enddo
end program
</pre></div> </div> </div> <div class="topic concept nested2" id="stdpar-fortran-um"><a name="stdpar-fortran-um" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-fortran-um" name="stdpar-fortran-um" shape="rect">8.3.2. GPU Data Management</a></h3> <div class="body conbody"> <p class="p"> If <samp class="ph codeph">-⁠gpu=mem:managed</samp> is enabled by default or is explicitly passed on the command line, some data accesses in <samp class="ph codeph">do concurrent</samp> loops are invalid. For example, when routines called from a <samp class="ph codeph">do concurrent</samp> loop access global variables, the expected value updates are not seen by the CPU code. </p> <p class="p"> Additionally, there are rare instances where the compiler cannot accurately determine variable sizes for implicit data movements between CPU and GPU. As demonstrated in the following example, <samp class="ph codeph">a</samp> is an assumed-size array, and its access region inside the DC construct cannot be determined at compile time because the element index positions are taken from another array <samp class="ph codeph">b</samp> initialized outside of the routine. Such code does not update <samp class="ph codeph">a</samp> as expected and may result in a memory violation and undefined behavior. </p><pre xml:space="preserve">
subroutine r(a, b)
   integer :: a(*)
   integer :: b(:)
   do concurrent (i = 1 : size(b))
      a(b(i)) = i
   enddo
end subroutine
</pre><p class="p"> The limitations described above on variables accessed in <samp class="ph codeph">do concurrent</samp> loops do not apply when the code is compiled with <samp class="ph codeph">-⁠gpu=mem:unified</samp>, whether this option is enabled by default or passed explicitly on the command line. 
</p> </div> </div> <div class="topic concept nested2" id="stdpar-fortran-interop-openacc"><a name="stdpar-fortran-interop-openacc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-fortran-interop-openacc" name="stdpar-fortran-interop-openacc" shape="rect">8.3.3. Interoperability with OpenACC</a></h3> <div class="body conbody"> <div class="p"> OpenACC features can be used when compiling Stdpar code for GPUs. To enable recognition of OpenACC directives in Stdpar code, add the <samp class="ph codeph">-acc</samp> command line flag to <samp class="ph codeph">nvfortran</samp>. <pre xml:space="preserve"> nvfortran -stdpar -acc example.f90 </pre> OpenACC functionality and its interoperability with DO CONCURRENT loops are detailed in the OpenACC specification; the differences specific to the NVIDIA HPC compilers are detailed in the <a class="xref" href="index.html#acc-use" shape="rect">Using OpenACC</a> section of this guide. </div> <div class="p"> OpenACC features can enhance the functionality of DC-loops, for example with the following: <ul class="ul"> <li class="li"> Explicit data management, to improve on the performance of implicit CPU-GPU data movements or to use separate GPU memory when compiling with <span class="keyword option">-gpu=mem:separate</span>. </li> <li class="li"> Tuning DC-loop execution on the GPU, e.g. the GPU kernel launch configuration. </li> <li class="li"> Executing DC-loops asynchronously. </li> <li class="li"> Calling external routines from within DC-loops. </li> <li class="li"> Atomic operations in DC-loops. </li> </ul> </div> <div class="section"> <h4 class="title sectiontitle">Examples</h4> <p class="p"> Some examples of using OpenACC directives with DC-loops are provided below. </p> <div class="p"> The following example demonstrates how the data accessed inside the DC-loop is fully managed by an OpenACC data construct. 
<pre xml:space="preserve">
!$acc data copyin(b) copyout(a)
do concurrent (j=1:N)
  do i=1,K
    a(j,i) = b(j,i)
  end do
end do
!$acc end data
</pre> While the above example uses the data construct for GPU data management, the same effect can be achieved with data clauses on the compute construct enclosing the DC-loop. </div> <div class="p"> The following example shows how the scheduling of a DC-loop on the GPU is controlled through the clauses on the compute construct. <pre xml:space="preserve">
!$acc parallel loop num_gangs(50000) vector_length(32)
do concurrent (i=1:K,j=1:N)
  a(j,i) = real(j)
end do
</pre></div> <div class="p"> The OpenACC async clause on a compute construct can be used to perform the computations in a DC-loop asynchronously. <pre xml:space="preserve">
!$acc parallel loop async
do concurrent (j=1:N)
  a(j) = j
end do
b = foo()
!$acc wait
c = sum(a) + b
</pre> In the previous example, array <samp class="ph codeph">a</samp> is filled with values asynchronously in the DC-loop while <samp class="ph codeph">foo</samp> runs on the host; the wait directive ensures the loop has completed before <samp class="ph codeph">a</samp> is summed. </div> </div> </div> </div> <div class="topic concept nested2" id="stdpar-fortran-interop-cuf"><a name="stdpar-fortran-interop-cuf" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#stdpar-fortran-interop-cuf" name="stdpar-fortran-interop-cuf" shape="rect">8.3.4. Interoperability with CUDA Fortran</a></h3> <div class="body conbody"> <p class="p"> CUDA Fortran features can also be used when compiling Stdpar code for GPUs. To recognize CUDA Fortran features in your source code, compile with the <samp class="ph codeph">-cuda</samp> command line flag using <samp class="ph codeph">nvfortran</samp>. 
</p><pre xml:space="preserve"> nvfortran -stdpar -cuda example.f90 </pre><div class="p"> CUDA Fortran extensions can enhance the functionality of a do concurrent (DC) loop and of an Stdpar program in several cases: <ul class="ul"> <li class="li"> Explicit data locality, accessing CUDA Fortran attributed arrays or other data with the device, managed, unified, or constant attributes from within DC-loops. </li> <li class="li"> Tuning DC-loop execution on the GPU, e.g. controlling the GPU kernel launch configuration. </li> <li class="li"> Executing DC-loops asynchronously using a specific CUDA stream. </li> <li class="li"> Calling external, user-defined CUDA device routines from within DC-loops. </li> <li class="li"> Using CUDA Atomic operations in DC-loops, or other CUDA-specific device-side runtime library calls. </li> <li class="li"> Inserting CUDA Runtime API calls for memory tuning hints outside of DC-loops. </li> </ul> </div> <div class="section"> <h4 class="title sectiontitle">Examples</h4> <p class="p"> Some examples of using CUDA Fortran features with DC-loops are provided below. The following example demonstrates how a DC-loop can access CUDA Fortran device data, run on a specific CUDA stream, call the CUDA Runtime API for creating a stream, and hide non-standard features behind the CUF sentinel for code portability. </p><pre xml:space="preserve">
!@cuf use cudafor
!@cuf integer(kind=cuda_stream_kind) :: istrm
real, allocatable :: a(:,:), b(:,:)
!@cuf attributes(device) :: a   ! A is device array only, not unified/managed
. . .
!@cuf istat = cudaStreamCreate(istrm)
. . .
a(:,:) = 0.0
. . .
!$cuf kernel do(1) &lt;&lt;&lt; *, *, stream=istrm&gt;&gt;&gt;
do concurrent (j=1:N)
  do i=1,K
    a(j,i) = a(j,i) + 2.0 * b(j,i)
  end do
end do
</pre><p class="p"> The following program demonstrates how to call low-level CUDA device functions from within a DC-loop. The function can be written in either CUDA Fortran or CUDA C++, depending on the interface. 
The CUDA C function must be compiled for relocatable device code. This technique can be used to access features in CUDA and NVIDIA GPUs that are not readily available in directive-based models or standard languages. </p><pre xml:space="preserve">
module mcuda
contains
  attributes(host,device) pure integer function std_dbg(itype)
    integer, value :: itype
    if (itype.eq.1) then
      std_dbg = threadIdx%x
    else if (itype.eq.2) then
      std_dbg = blockIdx%x
    else
      std_dbg = (blockIdx%x-1)*blockDim%x + threadIdx%x
    end if
  end function
end module

program test
  use mcuda
  integer, parameter :: N = 2000
  integer, allocatable :: a(:), b(:), c(:)
  allocate(a(N),b(N),c(N))
  do concurrent (j=1:N)
    a(j) = std_dbg(1)
    b(j) = std_dbg(2)
    c(j) = std_dbg(3)
  end do
  print *,a(1),a(N/2),a(N)
  print *,b(1),b(N/2),b(N)
  print *,c(1),c(N/2),c(N)
end
</pre><p class="p"> Many functions from the CUDA Fortran <samp class="ph codeph">cudadevice</samp> module are available within do concurrent loops, not just atomics. This code snippet shows two uses: </p><pre xml:space="preserve">
real :: tmp(4), x, y
...
block; use cudadevice
  do concurrent (i=1:K,j=1:N)
    x = real(j) + a(i,j)
    y = atomicAdd(b(1,j), x)
  end do

  do concurrent (j=1:N)
    x = real(j)
    tmp(1:4) = __ldca(a(1:4,j))
    tmp(1:4) = tmp(1:4) + x
    call __stwt(b(1:4,j), tmp)
  end do
end block
</pre></div> </div> </div> </div> </div> <div class="topic concept nested0" id="pcast"><a name="pcast" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#pcast" name="pcast" shape="rect">9. PCAST</a></h2> <div class="body conbody"> <p class="p"> Parallel Compiler Assisted Software Testing (PCAST) is a set of API calls and compiler directives useful in testing program correctness. Numerical results produced by a program can diverge when parts of the program are mapped onto a GPU, when new or additional compiler options are used, or when changes are made to the program itself. PCAST can help you determine where these divergences begin, and pinpoint the changes that cause them. 
It is useful in other situations as well, including when using new libraries, determining whether parallel execution is safe, or porting programs from one ISA or type of processor to another. </p> </div> <div class="topic concept nested1" id="pcast-overview"><a name="pcast-overview" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#pcast-overview" name="pcast-overview" shape="rect">9.1. Overview</a></h3> <div class="body conbody"> <p class="p"> PCAST Comparisons can be performed in two ways. The first saves the initial run's data into a file through the <samp class="ph codeph">pcast_compare</samp> call or directive. Add the calls or directives to your application where you want intermediate results to be compared. Then, execute the program to save the "golden" results where the values are known to be correct. During subsequent runs of the program, the same pcast_compare calls or directives will compare the computed intermediate results to the saved "golden" results and report the differences. </p> <p class="p"> The second approach works in conjunction with the NVIDIA OpenACC implementation to compare GPU computation against the same program running on a CPU. In this case, all compute constructs are performed redundantly, both on the CPU and GPU. GPU results are compared against the CPU results, and differences reported. This is essentially like the first case where the CPU-calculated values are treated as the "golden" results. GPU to CPU comparisons can be done implicitly at the end of data regions with the <samp class="ph codeph">autocompare</samp> flag or explicitly after kernels with the <samp class="ph codeph">acc_compare</samp> call or directive. </p> <p class="p"> With the autocompare flag, OpenACC regions will run redundantly on the CPU and GPU. On an OpenACC region exit where data is to be downloaded from device to host, PCAST will compare the values calculated on the CPU with those calculated in the GPU. 
Comparisons done with <samp class="ph codeph">autocompare</samp> or <samp class="ph codeph">acc_compare</samp> are handled in memory and do not write results to an intermediate file. </p> <p class="p"> The following table outlines the supported data types that can be used with PCAST. Short, integer, long, and half precision data types are not supported with the <samp class="ph codeph">ABS</samp>, <samp class="ph codeph">REL</samp>, <samp class="ph codeph">ULP</samp>, or <samp class="ph codeph">IEEE</samp> options; only a bit-for-bit comparison is supported. </p> <p class="p"> For floating-point types, PCAST can calculate absolute, relative, and unit-last-place differences. The absolute difference is the absolute value of the difference (subtraction) between two values, i.e. <em class="ph i">abs(A-B)</em>. The relative difference is the ratio between the difference of the values, <em class="ph i">A-B</em>, and the previous value <em class="ph i">A</em>, i.e. <em class="ph i">abs((A-B)/A)</em>. Unit-least precision (unit-last place, ULP) is a measure of the smallest distance between two values <em class="ph i">A</em> and <em class="ph i">B</em>. With the <samp class="ph codeph">ULP</samp> option set, PCAST reports when the calculated ULP between two numbers is greater than some threshold. </p> <div class="tablenoborder"><a name="pcast-overview__pcast-supported-types-table" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="pcast-overview__pcast-supported-types-table" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 23. 
Supported Types for Tolerance Measurements </span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e16832" rowspan="1" colspan="1">C/C++ Type</th> <th class="entry" valign="top" width="NaN%" id="d54e16835" rowspan="1" colspan="1">Fortran Type</th> <th class="entry" valign="top" width="NaN%" id="d54e16838" rowspan="1" colspan="1">ABS</th> <th class="entry" valign="top" width="NaN%" id="d54e16841" rowspan="1" colspan="1">REL</th> <th class="entry" valign="top" width="NaN%" id="d54e16844" rowspan="1" colspan="1">ULP</th> <th class="entry" valign="top" width="NaN%" id="d54e16848" rowspan="1" colspan="1">IEEE</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1">float</td> <td class="entry" valign="top" width="NaN%" headers="d54e16835" rowspan="1" colspan="1">real, real(4)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">Yes</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1">double</td> <td class="entry" valign="top" width="NaN%" headers="d54e16835" rowspan="1" colspan="1">double precision, real(8)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">Yes</td> </tr> <tr class="row"> <td 
class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1">float _Complex</td> <td class="entry" valign="top" width="NaN%" headers="d54e16835" rowspan="1" colspan="1">complex, complex(4)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">Yes</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1">double _Complex</td> <td class="entry" valign="top" width="NaN%" headers="d54e16835" rowspan="1" colspan="1">complex(8)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">Yes</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">Yes</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1"> - </td> <td class="entry" valign="top" width="NaN%" headers="d54e16835" rowspan="1" colspan="1">real(2)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">No</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">No</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">No</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">No</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1">(un)signed short</td> <td class="entry" valign="top" 
width="NaN%" headers="d54e16835" rowspan="1" colspan="1">integer(2)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">N/A</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1">(un)signed int</td> <td class="entry" valign="top" width="NaN%" headers="d54e16835" rowspan="1" colspan="1">integer, integer(4)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">N/A</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e16832" rowspan="1" colspan="1">(un)signed long</td> <td class="entry" valign="top" width="NaN%" headers="d54e16835" rowspan="1" colspan="1">integer(8)</td> <td class="entry" valign="top" width="NaN%" headers="d54e16838" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16841" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16844" rowspan="1" colspan="1">N/A</td> <td class="entry" valign="top" width="NaN%" headers="d54e16848" rowspan="1" colspan="1">N/A</td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="pcast-nv-compare"><a name="pcast-nv-compare" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#pcast-nv-compare" name="pcast-nv-compare" shape="rect">9.2. 
PCAST with a "Golden" File</a></h3> <div class="body conbody"> <p class="p"> The run-time call <samp class="ph codeph">pcast_compare</samp> highlights differences between successive program runs. It has two modes of operation, depending on the presence of a data file named <em class="ph i">pcast_compare.dat</em> by default. If the file does not exist, <samp class="ph codeph">pcast_compare</samp> assumes this is the first "golden" run. It will create the file and fill it with the computed data at each call to <samp class="ph codeph">pcast_compare</samp>. If the file exists, <samp class="ph codeph">pcast_compare</samp> assumes it is a test run. It will read the file and compare the computed data with the saved data from the file. The default behavior is to consider the first 50 differences to be a reportable error, no matter how small. </p> <p class="p"> By default, the <samp class="ph codeph">pcast_compare.dat</samp> file is in the same directory as the executable. The behavior of <samp class="ph codeph">pcast_compare</samp>, and other comparison parameters, can be changed at runtime with the PCAST_COMPARE environment variable discussed in the <a class="xref" href="index.html#pcast-env-vars" shape="rect">Environment Variables</a> section. </p> <p class="p"> The signature of <samp class="ph codeph">pcast_compare</samp> for C++ and C is: </p><pre xml:space="preserve"> void pcast_compare(void*, char*, size_t, char*, char*, char*, int); </pre><p class="p"> The signature of <samp class="ph codeph">pcast_compare</samp> for Fortran is: </p><pre xml:space="preserve">
subroutine pcast_compare(a, datatype, len, varname, filename, funcname, lineno)
   type(*), dimension(..) :: a
   character(*) :: datatype, varname, filename, funcname
   integer(8),value :: len
   integer(4),value :: lineno
</pre><p class="p"> The call takes seven arguments: </p> <ol class="ol"> <li class="li"> The address of the data to be saved or compared. </li> <li class="li"> A string containing the data type. 
</li> <li class="li"> The number of elements to compare. </li> <li class="li"> A string treated as the variable name. </li> <li class="li"> A string treated as the source file name. </li> <li class="li"> A string treated as the function name. </li> <li class="li"> An integer treated as a line number. </li> </ol> <p class="p"> For example, the <samp class="ph codeph">pcast_compare</samp> runtime call can be invoked like the following: </p><pre xml:space="preserve"> pcast_compare(a, "float", N, "a", "pcast_compare03.c", "main", 1); </pre><pre xml:space="preserve"> call pcast_compare(a, 'real', n, 'a', 'pcast_compare1.f90', 'program', 9) </pre><p class="p"> The caller should give meaningful names to the last four arguments. They can be anything, since they only serve to annotate the report. It is imperative that the identifiers are not modified between comparisons; comparisons must be called in the same order for each program run. If, for example, you are calling <samp class="ph codeph">pcast_compare</samp> inside a loop, it is reasonable to set the last argument to be the loop index. </p> <p class="p"> There also exists a directive form of the <samp class="ph codeph">pcast_compare</samp>, which is functionally the same as the runtime call. It can be used at any point in the program to compare the current value of data to that recorded in the golden file, same as the runtime call. There are two benefits to using the directive over the API call: </p> <ol class="ol"> <li class="li"> The directive syntax is much simpler than the API syntax. Most of what the compare call needs to output data to the user can be gleaned by the compiler at compile-time (The type, variable name, file name, function name, and line number). 
<pre xml:space="preserve"> #pragma nvidia compare(a[0:N]) </pre> as opposed to: <pre xml:space="preserve"> pcast_compare(a, "float", N, "a", "pcast_compare03.c", "main", 1); </pre></li> <li class="li"> The directive is only enabled when the -Mpcast flag is set, so the source need not be changed when testing is complete. Consider the following usage examples: <pre xml:space="preserve">
#pragma nvidia compare(a[0:N])    // C++ and C
!$nvf compare(a(1:N))             ! Fortran
</pre></li> </ol> <p class="p"> The directive interface is given below in C++ or C style, and in Fortran. Note that for Fortran, <samp class="ph codeph">var-list</samp> is a variable name, a subarray specification, an array element, or a composite variable member. </p><pre xml:space="preserve">
#pragma nvidia compare (var-list)    // C++ and C
!$nvf compare (var-list)             ! Fortran
</pre><p class="p"> Let's look at an example that uses the <samp class="ph codeph">pcast_compare</samp> runtime call: </p><pre xml:space="preserve">
#include &lt;stdlib.h&gt;
#include &lt;openacc.h&gt;

int main() {
    int size = 1000;
    int i, t;
    float *a1;
    float *a2;

    a1 = (float*)malloc(sizeof(float)*size);
    a2 = (float*)malloc(sizeof(float)*size);

    for (i = 0; i &lt; size; i++) {
        a1[i] = 1.0f;
        a2[i] = 2.0f;
    }

    for (t = 0; t &lt; 5; t++) {
        for(i = 0; i &lt; size; i++) {
            a2[i] += a1[i];
        }
        <strong class="ph b">pcast_compare(a2, "float", size, "a2", "example.c", "main", 23);</strong>
    }
    return 0;
}
</pre><p class="p"> Compile the example using these compiler options: </p><pre xml:space="preserve"> &gt; nvc -fast -o a.out example.c </pre><p class="p"> Compiling with the redundant or autocompare options is not required to use pcast_compare. 
Running the compiled executable twice with the options below results in the following output: </p><pre xml:space="preserve">
&gt; PCAST_COMPARE=summary,rel=1 ./a.out
datafile pcast_compare.dat created with 5 blocks, 5000 elements, 20000 bytes
&gt; PCAST_COMPARE=summary,rel=1 ./a.out
datafile pcast_compare.dat compared with 5 blocks, 5000 elements, 20000 bytes
no errors found
 relative tolerance = 0.100000, rel=1
</pre><p class="p"> When the program runs for the first time, the data file "pcast_compare.dat" is created. Subsequent runs compare calculated data against this file. Use the <samp class="ph codeph">PCAST_COMPARE</samp> environment variable to set the name of the file, or force the program to create a new file on the disk with <samp class="ph codeph">PCAST_COMPARE=create</samp>. </p> <p class="p"> The same example can be written with the compare directive. Notice how much more concise the directive is than the <samp class="ph codeph">pcast_compare</samp> call. </p><pre xml:space="preserve">
#include &lt;stdlib.h&gt;
#include &lt;openacc.h&gt;

int main() {
    int size = 1000;
    int i, t;
    float *a1;
    float *a2;

    a1 = (float*)malloc(sizeof(float)*size);
    a2 = (float*)malloc(sizeof(float)*size);

    for (i = 0; i &lt; size; i++) {
        a1[i] = 1.0f;
        a2[i] = 2.0f;
    }

    for (t = 0; t &lt; 5; t++) {
        for(i = 0; i &lt; size; i++) {
            a2[i] += a1[i];
        }
        <strong class="ph b">#pragma nvidia compare(a2[0:size])</strong>
    }
    return 0;
}
</pre><p class="p"> When using the directive, add "-Mpcast" to the compilation line to enable it. Other than that, the output from this program is identical to the runtime example above. </p> </div> </div> <div class="topic concept nested1" id="pcast-gpu"><a name="pcast-gpu" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#pcast-gpu" name="pcast-gpu" shape="rect">9.3. 
PCAST with OpenACC</a></h3> <div class="body conbody"> <p class="p"> PCAST can also be used with the NVIDIA OpenACC implementation to compare GPU computation against the same program running on a CPU. In this case, all compute constructs are performed redundantly on both the CPU and GPU. The CPU results are considered to be the "golden master" copy which GPU results are compared against. </p> <p class="p"> There are two ways to perform comparisons with GPU-calculated results. The first is with the explicit call or directive <samp class="ph codeph">acc_compare</samp>. To use <samp class="ph codeph">acc_compare</samp>, you must compile with <samp class="ph codeph">-acc -gpu=redundant</samp> to force the CPU and GPU to compute results redundantly. Then, insert calls to <samp class="ph codeph">acc_compare</samp> or put an <samp class="ph codeph">acc compare</samp> directive at points where you want to compare the GPU-computed values against those computed by the CPU. </p> <p class="p"> The second approach is to turn on autocompare mode by compiling with <samp class="ph codeph">-acc -gpu=autocompare</samp>. In autocompare mode, PCAST will automatically perform a comparison at each point where data is moved from the device to the host. It does not require the programmer to add any additional directives or runtime calls; it's a convenient way to do all comparisons at the end of a data region. If there are multiple compute kernels within a data region, and you're only interested in one specific kernel, you should use the previously-mentioned <samp class="ph codeph">acc_compare</samp> to target a specific kernel. Note that autocompare mode implies <samp class="ph codeph">-gpu=redundant</samp>. </p> <p class="p"> During redundant execution, the compiler will generate both CPU and GPU code for each compute construct. 
At runtime, both the CPU and GPU versions will execute redundantly, with the CPU code reading and modifying values in system memory and the GPU reading and modifying values in device memory. Insert calls to <samp class="ph codeph">acc_compare()</samp> (or the equivalent <samp class="ph codeph">acc compare</samp> directive) at points where you want to compare the GPU-computed values against CPU-computed values. PCAST treats the values generated by the CPU code as the "golden" values. It will compare those results against GPU values. Unlike <samp class="ph codeph">pcast_compare</samp>, <samp class="ph codeph">acc_compare</samp> does not write to an intermediary file; the comparisons are done in-memory. </p> <p class="p"><samp class="ph codeph">acc_compare</samp> has only two arguments: a pointer to the data to be compared, <em class="ph i">hostptr</em>, and the number of elements to compare, <em class="ph i">count</em>. The type can be inferred by the OpenACC runtime, so it doesn't need to be specified. The C++ and C interface is given below: </p><pre xml:space="preserve">
void acc_compare(void *, size_t);
</pre><p class="p"> And in Fortran: </p><pre xml:space="preserve">
subroutine acc_compare(a)
subroutine acc_compare(a, len)
  type(*), dimension(*) :: a
  integer(8), value :: len
</pre><p class="p"> You can call <samp class="ph codeph">acc_compare</samp> on any variable or array that is present in device memory. You can also call <samp class="ph codeph">acc_compare_all</samp> (no arguments) to compare all values that are present in device memory against the corresponding values in host memory. </p><pre xml:space="preserve">
void acc_compare_all()
</pre><pre xml:space="preserve">
subroutine acc_compare_all()
</pre><p class="p"> Directive forms of the <samp class="ph codeph">acc_compare</samp> calls exist. They work the same as the API calls and can be used in lieu of them. 
Similar to PCAST <samp class="ph codeph">compare</samp> directives, <samp class="ph codeph">acc compare</samp> directives are ignored when neither redundant nor autocompare mode is enabled on the compilation line. </p> <p class="p"> The <samp class="ph codeph">acc compare</samp> directive takes one or more arguments, or the 'all' clause (which corresponds to <samp class="ph codeph">acc_compare_all()</samp>). The interfaces are given below in C++ or C, and Fortran, respectively. The argument "var-list" can be a variable name, a sub-array specification, an array element, or a composite variable member. </p><pre xml:space="preserve">
#pragma acc compare [ (var-list) | all ]
</pre><pre xml:space="preserve">
!$acc compare [ (var-list) | all ]
</pre><p class="p"> For example: </p><pre xml:space="preserve">
#pragma acc compare(a[0:N])
#pragma acc compare all

!$acc compare(a, b)
!$acc compare(a(1:N))
!$acc compare all
</pre><p class="p"> Consider the following OpenACC program that uses the <samp class="ph codeph">acc_compare()</samp> API call and an <samp class="ph codeph">acc compare</samp> directive. This Fortran example uses real*4 and real*8 arrays. 
</p><pre xml:space="preserve"> program main use openacc implicit none parameter N = 1000 integer :: i real :: a(N) real*4 :: b(N) real(4) :: c(N) double precision :: d(N) real*8 :: e(N) real(8) :: f(N) d = 1.0d0 e = 0.1d0 !$acc data copyout(a, b, c, f) copyin(d, e) !$acc parallel loop do i = 1,N a(i) = 1.0 b(i) = 2.0 c(i) = 0.0 enddo !$acc end parallel !$acc compare(a(1:N), b(1:N), c(1:N)) !$acc parallel loop do i = 1,N f(i) = d(i) * e(i) enddo !$acc end parallel !$acc compare(f) !$acc parallel loop do i = 1,N a(i) = 1.0 b(i) = 1.0 c(i) = 1.0 enddo !$acc end parallel call acc_compare(a, N) call acc_compare(b, N) call acc_compare(c, N) !$acc parallel loop do i = 1,N f(i) = 1.0D0 enddo !$acc end parallel call acc_compare_all() !$acc parallel loop do i = 1,N a(i) = 3.14; b(i) = 3.14; c(i) = 3.14; f(i) = 3.14d0; enddo !$acc end parallel ! In redundant mode, no comparison is performed here. In ! autocompare mode, a comparison is made for a, b, c, and f (but ! not e and d), since they are copied out of the data region. !$acc end data call verify(N, a, b, c, f) end program subroutine verify(N, a, b, c, f) integer, intent(in) :: N real, intent(in) :: a(N) real*4, intent(in) :: b(N) real(4), intent(in) :: c(N) real(8), intent(in) :: f(N) integer :: i, errcnt errcnt = 0 do i=1,N if(abs(a(i) - 3.14e0) .gt. 1.0e-06) then errcnt = errcnt + 1 endif end do do i=1,N if(abs(b(i) - 3.14e0) .gt. 1.0e-06) then errcnt = errcnt + 1 endif end do do i=1,N if(abs(c(i) - 3.14e0) .gt. 1.0e-06) then errcnt = errcnt + 1 endif end do do i=1,N if(abs(f(i) - 3.14d0) .gt. 
1.0d-06) then errcnt = errcnt + 1 endif end do if(errcnt /= 0) then write (*, *) "FAILED" else write (*, *) "PASSED" endif end subroutine verify </pre><p class="p"> The program can be compiled with the following command: </p><pre xml:space="preserve"> &gt; nvfortran -fast -acc -gpu=redundant -Minfo=accel example.F90 main: 16, Generating copyout(a(:),b(:)) Generating copyin(e(:)) Generating copyout(f(:),c(:)) Generating copyin(d(:)) 18, Generating Tesla code 19, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 26, Generating acc compare(c(:),b(:),a(:)) 28, Generating Tesla code 29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 34, Generating acc compare(f(:)) 36, Generating Tesla code 37, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 48, Generating Tesla code 49, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 56, Generating Tesla code 57, !$acc loop gang, vector(128) ! blockidx%x threadidx%x </pre><p class="p"> Here, you can see where the acc compare directives are generated on lines 26 and 34. The program can be run with the following command: </p><pre xml:space="preserve"> &gt; ./a.out PASSED </pre><p class="p"> As you can see, no PCAST output is generated when the comparisons match. We can get more information with the summary option: </p><pre xml:space="preserve"> &gt; PCAST_COMPARE=summary ./a.out PASSED compared 13 blocks, 13000 elements, 68000 bytes no errors found absolute tolerance = 0.00000000000000000e+00, abs=0 </pre><p class="p"> There are 13 blocks compared. Let's count the blocks in the compare calls. </p><pre xml:space="preserve"> !$acc compare(a(1:N), b(1:N), c(1:N)) </pre><p class="p"> Compares three blocks, one each for a, b, and c. </p><pre xml:space="preserve"> !$acc compare(f) </pre><p class="p"> Compares one block for f. </p><pre xml:space="preserve"> call acc_compare(a, N) call acc_compare(b, N) call acc_compare(c, N) </pre><p class="p"> Each call compares one block for their respective array. 
</p><pre xml:space="preserve">
call acc_compare_all()
</pre><p class="p"> Compares one block for each array present on the device (a, b, c, d, e, and f) for a total of 6 blocks. </p> <p class="p"> If the same example is compiled with autocompare, we'll see four additional comparisons, since the four arrays that are copied out (with the copyout clause) are compared at the end of the data region. </p><pre xml:space="preserve">
&gt; nvfortran -fast -acc -gpu=autocompare example.F90
&gt; PCAST_COMPARE=summary ./a.out
PASSED
compared 17 blocks, 17000 elements, 88000 bytes
no errors found
 absolute tolerance = 0.00000000000000000e+00, abs=0
</pre></div> </div> <div class="topic concept nested1" id="pcast-limitations"><a name="pcast-limitations" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#pcast-limitations" name="pcast-limitations" shape="rect">9.4. Limitations</a></h3> <div class="body conbody"> <p class="p"> PCAST currently has a few limitations that are worth keeping in mind. </p> <ul class="ul"> <li class="li"> Comparisons are not thread-safe. If you are using PCAST with multiple threads, ensure that only one thread is doing the comparisons. This is especially true if you are using PCAST with MPI. If you use <samp class="ph codeph">pcast_compare</samp> with MPI, you must make sure that only one thread is writing to the comparison file. Alternatively, use a script to set PCAST_COMPARE so that the file name encodes the MPI rank. </li> <li class="li"> Comparisons must be done with like types; you cannot compare one type with another. It is not possible to, for example, check for differing results after changing from double precision to single. Comparisons are limited to the types listed in <a class="xref" href="index.html#pcast-overview__pcast-supported-types-table" shape="rect">Table 23</a>. Currently there is no support for structured or derived types. 
</li> <li class="li"> The <samp class="ph codeph">-gpu=mem:managed</samp> and <samp class="ph codeph">-gpu=mem:unified</samp> options are incompatible with autocompare and <samp class="ph codeph">acc_compare</samp>. Both the CPU and GPU need to calculate results separately, and to do so they must have their own working memory spaces. </li> <li class="li"> If you do any data movement on the device, you must account for it on the host. For example, if you are using CUDA-aware MPI or GPU-accelerated libraries that modify device data, then you must also make the host aware of the changes. In these cases it is helpful to use the <samp class="ph codeph">host_data</samp> construct, which allows you to use device addresses within host code. </li> </ul> </div> </div> <div class="topic concept nested1" id="pcast-env-vars"><a name="pcast-env-vars" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#pcast-env-vars" name="pcast-env-vars" shape="rect">9.5. Environment Variables</a></h3> <div class="body conbody"> <p class="p"> Behavior of PCAST/Autocompare is controlled through the <samp class="ph codeph">PCAST_COMPARE</samp> variable. Options can be specified in a comma-separated list: <samp class="ph codeph">PCAST_COMPARE=&lt;opt1&gt;,&lt;opt2&gt;,...</samp></p> <p class="p"> If no options are specified, the default is to perform comparisons with <em class="ph i">abs=0</em>. Comparison options are not mutually exclusive. For example, PCAST can compare absolute differences with one threshold (<em class="ph i">n=3</em>) and relative differences with another (<em class="ph i">n=5</em>): <em class="ph i">PCAST_COMPARE=abs=3,rel=5,...</em>. </p> <p class="p"> You can specify either an absolute or a relative path with the datafile option. The parent directory should be owned by the user executing the comparisons, and the datafile should have the appropriate read/write permissions set. 
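</p> <p class="p"> As a minimal sketch of how these options combine on the command line, the commands below assume a Bourne-style shell and an executable named <span class="ph filepath">a.out</span> that was built with PCAST comparisons, as in the earlier examples. The executable name and the particular combination of options are illustrative assumptions; only options listed in the table that follows are used. </p><pre xml:space="preserve">
# Hypothetical run lines: ./a.out stands for any executable instrumented for PCAST.

# Tolerate absolute differences up to 10^(-3) and relative differences
# up to 10^(-5), and print a summary at the end of the run.
PCAST_COMPARE=abs=3,rel=5,summary ./a.out

# pcast_compare only: force creation of a new reference file, then
# compare a later run against it.
PCAST_COMPARE=create ./a.out
PCAST_COMPARE=compare,summary ./a.out
</pre><p class="p"> Because the variable is set only on the command line of each run, different tolerances can be tried without recompiling the program. 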
</p> <div class="tablenoborder"><a name="pcast-env-vars__pcast-env-vars-table" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="pcast-env-vars__pcast-env-vars-table" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 24. PCAST_COMPARE Options </span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e17512" rowspan="1" colspan="1">Option</th> <th class="entry" valign="top" width="NaN%" id="d54e17515" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">abs=n</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1"> Compare absolute difference; tolerate differences up to 10^(-n), only applicable to floating point types. Default value is 0 </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">create</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1"> Specifies that this is the run that will produce the reference file (pcast_compare only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">compare</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1"> Specifies that the current run will be compared with a reference file (pcast_compare only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">datafile="name"</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1"> Name of the file that data will be saved to, or compared against. 
If empty, the default 'pcast_compare.dat' is used (pcast_compare only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">disable</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1"> Calls to pcast_compare, acc_compare, acc_compare_all, and directives (pcast compare, acc compare, and acc compare all) all immediately return from the runtime with no effect. Note that this doesn't disable redundant execution; that will require a recompile. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">ieee</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Compare IEEE NaN checks (only implemented for floats and doubles)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">outputfile="name"</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Save comparison output to a specific file. 
Default behavior is to output to stderr</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">patch</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Patch errors (outside tolerance) with correct values</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">patchall</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Patch all differences (inside and outside tolerance) with correct values</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">rel=n</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Compare relative difference; tolerate differences up to 10^(-n), only applicable to floating point types. Default value is 0. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">report=n</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Report up to n (default of 50) passes/fails</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">reportall</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Report all passes and fails (overrides limit set in report=n)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">reportpass</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Report passes; respects limit set with report=n</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">silent</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Suppress output - overrides 
all other output options, including summary and verbose</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">stop</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Stop at first differences</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">summary</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Print summary of comparisons at end of run</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">ulp=n</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Compare Unit of Least Precision difference (only for floats and doubles)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">verbose</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Outputs more details of comparison (including patches)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e17512" rowspan="1" colspan="1">verboseautocompare</td> <td class="entry" valign="top" width="NaN%" headers="d54e17515" rowspan="1" colspan="1">Outputs verbose reporting of what and where the host is comparing (autocompare only)</td> </tr> </tbody> </table> </div> </div> </div> </div> <div class="topic concept nested0" id="mpi-use"><a name="mpi-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#mpi-use" name="mpi-use" shape="rect">10. Using MPI</a></h2> <div class="body conbody"> <p class="p">MPI (the Message Passing Interface) is an industry-standard application programming interface designed for rapid data exchange between processors in a distributed-memory environment. 
MPI is computer software used in scalable computer systems that allows the processes of a parallel application to communicate with one another. </p> <p class="p">The NVIDIA HPC SDK includes a pre-compiled version of Open MPI. You can build using alternate versions of MPI with the <span class="ph filepath">-I</span>, <span class="ph filepath">-L</span>, and <span class="ph filepath">-l</span> options. </p> <p class="p">This section describes how to use Open MPI with the NVIDIA HPC Compilers.</p> </div> <div class="topic concept nested1" id="mpi-open-mpi-linux"><a name="mpi-open-mpi-linux" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#mpi-open-mpi-linux" name="mpi-open-mpi-linux" shape="rect">10.1. Using Open MPI on Linux</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA HPC Compilers for Linux ship with a pre-compiled version of Open MPI that includes everything required to compile, execute and debug MPI programs using Open MPI. </p> <p class="p"> To build an application using Open MPI, use the Open MPI compiler wrappers: <span class="keyword cmdname">mpicc</span>, <span class="keyword cmdname">mpic⁠+⁠+</span> and <span class="keyword cmdname">mpifort</span>. These wrappers automatically set up the compiler commands with the correct include file search paths, library directories, and link libraries. </p> <p class="p">The following MPI example program uses Open MPI.</p><pre xml:space="preserve"> $ cd my_example_dir $ cp -r /opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">2024</span>/examples/MPI/samples/mpihello . $ cd mpihello $ export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">24.11</span>/mpi/openmpi/bin:$PATH $ mpifort mpihello.f -o mpihello </pre><pre xml:space="preserve"> $ mpiexec mpihello Hello world! I'm node 0 </pre><pre xml:space="preserve"> $ mpiexec -np 4 mpihello Hello world! I'm node 0 Hello world! I'm node 2 Hello world! I'm node 1 Hello world! 
I'm node 3 </pre><p class="p"> To build an application using Open MPI for debugging, add <span class="keyword option">-⁠g</span> to the compiler wrapper command line arguments. </p> </div> </div> <div class="topic concept nested1" id="mpi-compiler-wrappers"><a name="mpi-compiler-wrappers" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#mpi-compiler-wrappers" name="mpi-compiler-wrappers" shape="rect">10.2. Using MPI Compiler Wrappers</a></h3> <div class="body conbody"> <p class="p">When you use MPI compiler wrappers to build with the <span class="ph filepath">-fpic</span> or <span class="ph filepath">-mcmodel=medium</span> options, then you must specify <span class="ph filepath">-fortranlibs</span> to link with the correct libraries. Here are a few examples: </p> <p class="p">For a static link to the MPI libraries, use this command:</p><pre xml:space="preserve">% mpifort hello.f</pre><p class="p">For a dynamic link to the MPI libraries, use this command:</p><pre xml:space="preserve">% mpifort hello.f -fortranlibs</pre><p class="p">To compile with <span class="ph filepath">-fpic</span>, which, by default, invokes dynamic linking, use this command: </p><pre xml:space="preserve">% mpifort -fpic -fortranlibs hello.f</pre><p class="p">To compile with <span class="ph filepath">-mcmodel=medium</span>, use this command: </p><pre xml:space="preserve">% mpifort -mcmodel=medium -fortranlibs hello.f</pre></div> </div> <div class="topic concept nested1" id="mpi-testing-benchmarks"><a name="mpi-testing-benchmarks" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#mpi-testing-benchmarks" name="mpi-testing-benchmarks" shape="rect">10.3. Testing and Benchmarking</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">/opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">2024</span>/examples/MPI</span> directory contains various benchmarks and tests. 
Copy this directory into a local working directory by issuing the following command: </p><pre xml:space="preserve">% cp -r /opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">2024</span>/examples/MPI .</pre><p class="p">There are several example programs available in this directory.</p> </div> </div> </div> <div class="topic concept nested0" id="lib-create-use"><a name="lib-create-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#lib-create-use" name="lib-create-use" shape="rect">11. Creating and Using Libraries</a></h2> <div class="body conbody"> <p class="p"> A library is a collection of functions or subprograms that are grouped for reference and ease of linking. This section discusses issues related to NVIDIA-supplied compiler libraries. Specifically, it addresses the <span class="ph">use of C++ and C builtin functions in place of the corresponding libc routines, </span>creation of dynamically linked libraries, known as shared objects or shared libraries, and math libraries. </p> <div class="note note"><span class="notetitle">Note:</span> This section does not duplicate material related to using libraries for inlining, which is described in <a class="xref" href="index.html#fn-inline-create-lib" shape="rect">Creating an Inline Library</a>. </div> <p class="p"> NVIDIA provides libraries that export C interfaces by using Fortran modules. </p> </div> <div class="topic concept nested1" id="lib-builtin-c-cpp"><a name="lib-builtin-c-cpp" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-builtin-c-cpp" name="lib-builtin-c-cpp" shape="rect">11.1. Using builtin Math Functions in C++ and C</a></h3> <div class="body conbody"> <p class="p"> The name of the math header file is <span class="ph filepath">math.h</span>. Include the math header file in all of your source files that use a math library routine, as in the following example, which calculates the inverse cosine of 0.5. 
</p><pre xml:space="preserve">#include &lt;math.h&gt; #include&lt;stdio.h&gt; #define PI 3.1415926535 void main() { double x, y; x = PI/3.0; y = acos(0.5); printf(‘%f %f\n’,x,y); }</pre><p class="p"> Including <span class="ph filepath">math.h</span> causes the NVIDIA C++ and C compilers to use builtin functions, which are much more efficient than library calls. In particular, if you include <span class="ph filepath">math.h</span>, the following intrinsics calls are processed using builtins: </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="void" border="0" rules="none"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">abs</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">acosf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">asinf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">atan</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">atan2</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">atan2f</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">atanf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">cos</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">cosf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">exp</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">expf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">fabs</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">fabsf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">fmax</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">fmaxf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">fmin</td> <td class="entry" valign="top" 
width="NaN%" rowspan="1" colspan="1">fminf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">log</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">log10</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">log10f</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">logf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">pow</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">powf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">sin</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">sinf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">sqrt</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">sqrtf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">tan</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">tanf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="lib-sys-routines"><a name="lib-sys-routines" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-sys-routines" name="lib-sys-routines" shape="rect">11.2. Using System Library Routines</a></h3> <div class="body conbody"> <p class="p"> Release <span class="keyword">24.11</span> of the NVIDIA HPC Compilers runtime libraries makes use of Linux system libraries to implement, for example, OpenMP and Fortran I/O. The NVIDIA HPC Compilers runtime libraries make use of several additional system library routines. 
</p> <p class="p"> On 64-bit Linux systems, the system library routines used include these: </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="void" border="0" rules="none"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">aio_error</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">aio_write</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">pthread_mutex_init</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> sleep</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">aio_read</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">calloc</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">pthread_mutex_lock</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">aio_return</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">getrlimit</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">pthread_mutex_unlock</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">aio_suspend</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">pthread_attr_init</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">setrlimit</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> </td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="lib-create-use-shared-obj-linux"><a name="lib-create-use-shared-obj-linux" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-create-use-shared-obj-linux" name="lib-create-use-shared-obj-linux" shape="rect">11.3. 
Creating and Using Shared Object Files on Linux</a></h3> <div class="body conbody"> <p class="p">All of the NVIDIA HPC Fortran, C++ and C compilers support creation of shared object files. Unlike statically-linked object and library files, shared object files link and resolve references with an executable at runtime via a dynamic linker supplied with your operating system. The NVIDIA HPC Compilers must generate position independent code to support creation of shared objects by the linker. However, this is not the default. You must explicitly create object files with position independent code and then build the shared object files that include them. </p> </div> <div class="topic task nested2" id="lib-create-use-shared-obj-linux-proc"><a name="lib-create-use-shared-obj-linux-proc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-create-use-shared-obj-linux-proc" name="lib-create-use-shared-obj-linux-proc" shape="rect">11.3.1. Procedure to create and use a shared object file</a></h3> <div class="body taskbody"> <div class="section context"> <p class="p">The following steps describe how to create and use a shared object file.</p> </div> <ol class="ol steps"> <li class="li step"><span class="ph cmd">Create an object file with position independent code.</span> To do this, compile your code with the appropriate NVIDIA HPC compiler using the <span class="ph filepath">-⁠fpic</span> option, or one of the equivalent options, such as <span class="ph filepath">-⁠fPIC</span>, <span class="ph filepath">-⁠Kpic</span>, and <span class="ph filepath">-⁠KPIC</span>, which are supported for compatibility with other systems. For example, use the following command to create an object file with position independent code using nvfortran: <pre xml:space="preserve">% nvfortran -c -fpic tobeshared.f</pre></li> <li class="li step"><span class="ph cmd">Produce a shared object file.</span> To do this, use the appropriate NVIDIA HPC compiler to invoke the linker supplied with your system. 
It is customary to name such files using a <span class="ph filepath">.so</span> filename extension. On Linux, you do this by passing the <span class="ph filepath">-⁠shared</span> option to the linker: <pre xml:space="preserve">% nvfortran -shared -o tobeshared.so tobeshared.o</pre><div class="note note"><span class="notetitle">Note:</span> Compilation and generation of the shared object can be performed in one step using both the <span class="ph filepath">-⁠fpic</span> option and the appropriate option for generation of a shared object file. </div> </li> <li class="li step"><span class="ph cmd">Use a shared object file. </span> To do this, use the appropriate NVIDIA HPC compiler to compile and link the program which will reference functions or subroutines in the shared object file, and list the shared object on the link line, as shown here: <pre xml:space="preserve">% nvfortran -o myprog myprog.f tobeshared.so</pre></li> <li class="li step"><span class="ph cmd">Make the executable available.</span><p class="p">You now have an executable <span class="ph filepath">myprog</span> which does not include any code from functions or subroutines in <span class="ph filepath">tobeshared.so</span>, but which can be executed and dynamically linked to that code. By default, when the program is linked to produce <span class="ph filepath">myprog</span>, no assumptions are made on the location of <span class="ph filepath">tobeshared.so</span>. Therefore, for <span class="ph filepath">myprog</span> to execute correctly, you must initialize the environment variable <span class="ph filepath">LD_LIBRARY_PATH</span> to include the directory containing <span class="ph filepath">tobeshared.so</span>. If <span class="ph filepath">LD_LIBRARY_PATH</span> is already initialized, it is important not to overwrite its contents. 
If you have placed <span class="ph filepath">tobeshared.so</span> in directory <span class="ph filepath">/home/myusername/bin</span>, you can initialize <span class="ph filepath">LD_LIBRARY_PATH</span> to include that directory and preserve its existing contents, as shown in the following: </p><pre xml:space="preserve">% setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":/home/myusername/bin</pre><p class="p">If you know that <span class="ph filepath">tobeshared.so</span> always resides in a specific directory, you can create the executable <span class="ph filepath">myprog</span> in a form that assumes this directory by using the <span class="ph filepath">-⁠R</span> link-time option. For example, you can link as follows: </p><pre xml:space="preserve">% nvfortran -o myprog myprog.f tobeshared.so -R/home/myusername/bin</pre><div class="note note"><span class="notetitle">Note:</span> As with the <span class="ph filepath">-⁠L</span> option, there is no space between <span class="ph filepath">-⁠R</span> and the directory name. If the <span class="ph filepath">-⁠R</span> option is used, it is not necessary to initialize <span class="ph filepath">LD_LIBRARY_PATH</span>. </div> <p class="p">In the previous example, the dynamic linker always looks in <span class="ph filepath">/home/myusername/bin</span> to resolve references to <span class="ph filepath">tobeshared.so</span>. By default, if the <span class="ph filepath">LD_LIBRARY_PATH</span> environment variable is not set, the linker only searches <span class="ph filepath">/usr/lib</span> and <span class="ph filepath">/lib</span> for shared objects. </p> </li> </ol> </div> </div> <div class="topic concept nested2" id="lib-create-use-shared-obj-linux-ldd-cmd"><a name="lib-create-use-shared-obj-linux-ldd-cmd" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-create-use-shared-obj-linux-ldd-cmd" name="lib-create-use-shared-obj-linux-ldd-cmd" shape="rect">11.3.2. 
ldd Command</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">ldd</span> command is a useful tool when working with shared object files and executables that reference them. When applied to an executable, as shown in the following example, <span class="ph filepath">ldd</span> lists all shared object files referenced in the executable along with the pathname of the directory from which they will be extracted. </p><pre xml:space="preserve">% ldd myprog </pre><p class="p">If the pathname is not hard-coded using the <span class="ph filepath">-⁠R</span> option, and if <span class="ph filepath">LD_LIBRARY_PATH</span> is not initialized, the pathname is listed as "not found". For more information on <span class="ph filepath">ldd</span>, its options and usage, see the online man page for <span class="ph filepath">ldd</span>. </p> </div> </div> </div> <div class="topic concept nested1" id="lib-use-lib3f"><a name="lib-use-lib3f" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-use-lib3f" name="lib-use-lib3f" shape="rect">11.4. Using LIB3F</a></h3> <div class="body conbody"> <p class="p">The NVFORTRAN compiler includes support for the de facto standard LIB3F library routines. See the Fortran Language Reference manual for a complete list of available routines in the NVIDIA implementation of LIB3F. </p> </div> </div> <div class="topic concept nested1" id="lib-use-lapack-blas-ffts"><a name="lib-use-lapack-blas-ffts" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-use-lapack-blas-ffts" name="lib-use-lapack-blas-ffts" shape="rect">11.5. LAPACK, BLAS and FFTs</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA HPC SDK includes a BLAS and LAPACK library based on the customized OpenBLAS project source and built with the NVIDIA HPC Compilers. The LAPACK library is called <span class="ph"><span class="ph filepath">liblapack.a</span></span>. 
The BLAS library is called <span class="ph"><span class="ph filepath">libblas.a</span></span>. </p> <p class="p"> To use these libraries, link them in using the <span class="ph filepath">-⁠l</span> option when linking your main program: </p><pre xml:space="preserve">% nvfortran myprog.f -llapack -lblas</pre></div> </div> <div class="topic concept nested1" id="lib-link-scalapack"><a name="lib-link-scalapack" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-link-scalapack" name="lib-link-scalapack" shape="rect">11.6. Linking with ScaLAPACK</a></h3> <div class="body conbody"> <p class="p">The ScaLAPACK libraries are automatically installed with each MPI library version which accompanies an NVIDIA HPC SDK installation. You can link with the ScaLAPACK libraries by specifying <span class="ph filepath">-⁠Mscalapack</span> on any of the MPI wrapper command lines. For example: </p><pre xml:space="preserve">% mpifort myprog.f -Mscalapack</pre><p class="p">A pre-built version of the BLAS library is automatically added when the <span class="ph filepath">-⁠Mscalapack</span> switch is specified. If you wish to use a different BLAS library, and still use the <span class="ph filepath">-⁠Mscalapack</span> switch, then you can list the set of libraries explicitly on your link line. </p> <p class="p">If the <span class="ph filepath">-⁠Mnvpl</span> switch is also specified in addition to <span class="ph filepath">-⁠Mscalapack</span>, then the NVPL ScaLAPACK library will be used. </p> </div> </div> <div class="topic concept nested1" id="lib-std-cpp-template"><a name="lib-std-cpp-template" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-std-cpp-template" name="lib-std-cpp-template" shape="rect">11.7. The C++ Standard Template Library</a></h3> <div class="body conbody"> <p class="p">On Linux, the GNU-compatible nvc++ compiler uses the GNU g++ header files and Standard Template Library (STL) directly. 
The versions used depend on the version of the GNU compilers installed on your system, or specified when makelocalrc was run during installation of the NVIDIA HPC Compilers. </p> </div> </div> <div class="topic concept nested1" id="lib-use-nvpl"><a name="lib-use-nvpl" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-use-nvpl" name="lib-use-nvpl" shape="rect">11.8. NVIDIA Performance Libraries (NVPL)</a></h3> <div class="body conbody"> <p class="p"> The NVIDIA Performance Libraries (NVPL) are a suite of high-performance mathematical libraries optimized for the NVIDIA Grace Arm architecture. These CPU-only libraries have no dependencies on CUDA or CTK, and are drop-in replacements for standard C and Fortran mathematical APIs, allowing HPC applications to achieve maximum performance on the Grace platform. They are available for Arm CPUs only. The NVPL includes the following math libraries: BLAS, FFT, LAPACK, RAND, ScaLAPACK, Sparse, and Tensor. Refer to the <a class="xref" href="https://docs.nvidia.com/nvpl" target="_blank" shape="rect">NVPL documentation</a> for more information about these math libraries. The following section explains how to use them with the NVHPC compilers. </p> <p class="p"> To use the NVPL libraries, use the <span class="ph filepath">-⁠Mnvpl</span> option when linking your main program: </p><pre xml:space="preserve">% nvfortran myprog.f -Mnvpl</pre><p class="p"> You can link only the NVPL libraries your application needs using the sub-options to <span class="ph filepath">-⁠Mnvpl</span>. For example, if you only want the BLAS and FFT libraries from the NVPL, link as follows: </p><pre xml:space="preserve">% nvfortran myprog.f -Mnvpl=blas,fft</pre><p class="p"> Refer to the NVIDIA HPC Compilers Reference Guide for a complete list of supported options for the <span class="ph filepath">-⁠Mnvpl</span> flag. 
</p> <div class="section"> <h3 class="title sectiontitle">ScaLAPACK</h3> <p class="p"> Similar to other ScaLAPACK libraries, the NVPL version is designed to be used with MPI. A straightforward way to access the NVPL ScaLAPACK library is to use an MPI wrapper (i.e., <span class="ph filepath">mpicc</span>, <span class="ph filepath">mpic++</span>, <span class="ph filepath">mpifort</span>) and link with both <span class="ph filepath">-⁠Mnvpl</span> and <span class="ph filepath">-⁠Mscalapack</span>. For example: </p><pre xml:space="preserve">% mpic++ myprog.cpp -Mscalapack -Mnvpl</pre><p class="p"> If you choose not to use an MPI wrapper, you can satisfy ScaLAPACK's dependency on libmpi.so by explicitly providing this library at link time. </p> <p class="p"> The NVPL ScaLAPACK interfaces are available for the following MPI variants: MPICH, Open MPI 3.x, Open MPI 4.x (including HPC-X), and Open MPI 5.x. The HPC SDK contains builds of Open MPI 3, Open MPI 4, and HPC-X; to take advantage of the NVPL's ScaLAPACK interfaces for MPICH or Open MPI 5.x, you must supply your own build of these MPI libraries. </p> </div> </div> </div> <div class="topic concept nested1" id="lib-link-nvmalloc"><a name="lib-link-nvmalloc" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#lib-link-nvmalloc" name="lib-link-nvmalloc" shape="rect">11.9. Linking with the nvmalloc Library</a></h3> <div class="body conbody"> <p class="p">The NVIDIA HPC SDK installation includes a custom host (system) memory allocation library based on the jemalloc memory allocator. This library replaces the system malloc(), free(), and other related functions used by the nvc, nvc++, and nvfortran runtime for dynamic heap allocations. You can link with this library by specifying <span class="ph filepath">-⁠nvmalloc</span> on any of the compiler command lines used for linking. 
For example: </p><pre xml:space="preserve">% nvc main.c -nvmalloc</pre></div> </div> </div> <div class="topic concept nested0" id="env-vars-use"><a name="env-vars-use" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#env-vars-use" name="env-vars-use" shape="rect">12. Environment Variables</a></h2> <div class="body conbody"> <p class="p"> Environment variables allow you to set and pass information that can alter the default behavior of the NVIDIA HPC compilers and the executables which they generate. This section includes explanations of the environment variables specific to the NVIDIA HPC Compilers. </p> <ul class="ul"> <li class="li"> Standard OpenMP environment variables are used to control the behavior of OpenMP programs; these environment variables are described in the OpenMP Specification available online. </li> <li class="li"> Several NVIDIA-specific environment variables can be used to control the behavior of OpenACC programs. OpenACC-related environment variables are described in the OpenACC section: <a class="xref" href="index.html#env-vars" shape="rect">Environment Variables</a> and the <a class="xref" href="../openacc-gs/index.htm" target="_blank" shape="rect">OpenACC Getting Started Guide</a>. </li> </ul> </div> <div class="topic concept nested1" id="env-vars-set"><a name="env-vars-set" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-set" name="env-vars-set" shape="rect">12.1. Setting Environment Variables</a></h3> <div class="body conbody"> <p class="p">Before we look at the environment variables that you might use with the HPC compilers and tools, let’s take a look at how to set environment variables. To illustrate how to set these variables in various environments, let’s look at how a user might initialize a Linux shell environment to enable use of the NVIDIA HPC Compilers. 
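</p> <p class="p">As a minimal sketch of the two common ways to set a variable in bash, sh, zsh, or ksh, the following contrasts an exported setting, which persists for the rest of the session, with a one-shot setting, which applies to a single command. This is generic shell behavior rather than anything NVIDIA-specific. The values <span class="ph filepath">132</span> and <span class="ph filepath">1</span> are illustrative assumptions, and <span class="ph filepath">sh -c</span> stands in for a compiled program: </p><pre xml:space="preserve"># Exported: visible to this shell and to every program started from it.
export FORT_FMT_RECL=132
sh -c 'echo "child sees FORT_FMT_RECL=$FORT_FMT_RECL"'

# One-shot: applies to a single command and leaves the shell unchanged.
NO_STOP_MESSAGE=1 sh -c 'echo "child sees NO_STOP_MESSAGE=$NO_STOP_MESSAGE"'
echo "parent sees NO_STOP_MESSAGE=${NO_STOP_MESSAGE:-unset}"</pre><p class="p">The exported form suits a shell startup file, where every run should see the setting; the one-shot form is convenient for trying a setting on a single run without changing the session. 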
</p> </div> <div class="topic concept nested2" id="env-vars-set-linux"><a name="env-vars-set-linux" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-set-linux" name="env-vars-set-linux" shape="rect">12.1.1. Setting Environment Variables on Linux</a></h3> <div class="body conbody"> <p class="p">Let's assume that you want access to the NVIDIA products when you log in, and that you installed the NVIDIA HPC SDK in <span class="ph filepath">/opt/nvidia/hpc_sdk</span>. For access at startup, you can add the following lines to your shell startup files on a Linux_x86_64 system. </p> <p class="p"><strong class="ph b">For csh, use these commands:</strong></p><pre xml:space="preserve">% setenv NVHPCSDK /opt/nvidia/hpc_sdk
% setenv MANPATH "$MANPATH":$NVHPCSDK/Linux_x86_64/<span class="keyword">24.11</span>/compilers/man
% set path = ($NVHPCSDK/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin $path)</pre><p class="p"><strong class="ph b">For bash, sh, zsh, or ksh, use these commands:</strong></p><pre xml:space="preserve">$ NVHPCSDK=/opt/nvidia/hpc_sdk; export NVHPCSDK
$ MANPATH=$MANPATH:$NVHPCSDK/Linux_x86_64/<span class="keyword">24.11</span>/compilers/man; export MANPATH
$ PATH=$NVHPCSDK/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin:$PATH; export PATH</pre><p class="p">On a Linux/OpenPOWER system replace <samp class="ph codeph">Linux_x86_64</samp> with <samp class="ph codeph">Linux_ppc64le</samp>, and on a Linux/Arm Server system replace it with <samp class="ph codeph">Linux_aarch64</samp>. </p> </div> </div> </div> <div class="topic concept nested1" id="env-vars-nv-related"><a name="env-vars-nv-related" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-related" name="env-vars-nv-related" shape="rect">12.2. 
HPC Compiler Related Environment Variables</a></h3> <div class="body conbody"> <p class="p"> The following table provides a listing of environment variables that affect the behavior of the NVIDIA HPC Compilers and the executables they generate. </p> <div class="tablenoborder"><a name="env-vars-nv-related__env-vars-nv-related-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="env-vars-nv-related__env-vars-nv-related-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 25. NVIDIA HPC Compilers Environment Variable Summary</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e18721" rowspan="1" colspan="1">Environment Variable </th> <th class="entry" valign="top" width="NaN%" id="d54e18724" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">FORTRANOPT</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Allows the user to specify that the NVIDIA Fortran compiler should use VAX I/O or other custom I/O conventions.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">FORT_FMT_RECL</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Allows the user to change the default Fortran stdout (unit 6) line length before a line break occurs. 
Default: 80 bytes.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">GMON_OUT_PREFIX</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Specifies the name of the output file for programs that are compiled and linked with the -pg option.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">LD_LIBRARY_PATH</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Specifies a colon-separated set of directories where libraries should first be searched, prior to searching the standard set of directories. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">MANPATH</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Sets the directories that are searched for manual pages associated with the command that the user types. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">NO_STOP_MESSAGE</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">If used, the execution of a plain STOP statement does not produce the message <span class="ph filepath">FORTRAN STOP</span>. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">PATH</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Determines which locations are searched for commands the user may type.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">NVCOMPILER_FPU_STATE</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Manages the initial state of the processor's floating point control and status register at program startup. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">NVCOMPILER_TERM</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Controls the stack traceback and just-in-time debugging functionality.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">NVCOMPILER_TERM_DEBUG</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Overrides the default behavior when <span class="ph filepath">NVCOMPILER_TERM</span> is set to <span class="ph filepath">debug</span>. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">PWD</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Allows you to display the current directory.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">STATIC_RANDOM_SEED</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Forces the seed returned by <span class="ph filepath">RANDOM_SEED</span> to be constant. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">TMP</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Sets the directory to use for temporary files created during execution of the HPC compilers and tools; interchangeable with <span class="ph filepath">TMPDIR</span>. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e18721" rowspan="1" colspan="1">TMPDIR</td> <td class="entry" valign="top" width="NaN%" headers="d54e18724" rowspan="1" colspan="1">Sets the directory to use for temporary files created during execution of the HPC compilers and tools. </td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="env-vars-nv"><a name="env-vars-nv" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv" name="env-vars-nv" shape="rect">12.3. HPC Compilers Environment Variables</a></h3> <div class="body conbody"> <p class="p">Use the environment variables listed in <a class="xref" href="index.html#env-vars-nv-related__env-vars-nv-related-tbl" shape="rect">Table 25</a> to alter the default behavior of the NVIDIA HPC Compilers and the executables which they generate. This section provides more detailed descriptions about the variables in this table. </p> </div> <div class="topic concept nested2" id="env-vars-nv-fortranopt"><a name="env-vars-nv-fortranopt" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-fortranopt" name="env-vars-nv-fortranopt" shape="rect">12.3.1. FORTRANOPT</a></h3> <div class="body conbody"> <p class="p"><span class="ph filepath">FORTRANOPT</span> allows the user to adjust the behavior of the NVIDIA Fortran compiler. 
</p> <ul class="ul"> <li class="li">If <span class="ph filepath">FORTRANOPT</span> exists and contains the value <span class="ph filepath">vaxio</span>, the record length in the open statement is in units of 4-byte words, and the $ edit descriptor only has an effect for lines beginning with a space or a plus sign (+). </li> <li class="li">If <span class="ph filepath">FORTRANOPT</span> exists and contains the value <span class="ph filepath">format_relaxed</span>, an I/O item corresponding to a numerical edit descriptor (such as F, E, I, and so on) is not required to be a type implied by the descriptor. </li> <li class="li">If <span class="ph filepath">FORTRANOPT</span> exists and contains the value <span class="ph filepath">no_minus_zero</span>, an I/O item corresponding to a numerical edit descriptor (such as F, E, I, and so on) equal to negative zero will be output as if it were positive zero. </li> <li class="li">If <span class="ph filepath">FORTRANOPT</span> exists and contains the value <span class="ph filepath">crif</span>, a sequential formatted or list-directed record is allowed to be terminated with the character sequence <span class="ph filepath">\r\n</span> (carriage return, newline). This approach is useful when reading records from a file produced on a Windows system. </li> </ul> <p class="p">The following example causes the NVIDIA Fortran compiler to use VAX I/O conventions:</p><pre xml:space="preserve">% setenv FORTRANOPT vaxio </pre></div> </div> <div class="topic concept nested2" id="env-vars-nv-fort_fmt_recl"><a name="env-vars-nv-fort_fmt_recl" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-fort_fmt_recl" name="env-vars-nv-fort_fmt_recl" shape="rect">12.3.2. FORT_FMT_RECL</a></h3> <div class="body conbody"> <p class="p"> The <span class="ph filepath">FORT_FMT_RECL</span> environment variable specifies the maximum line length in bytes for Fortran formatted output to standard out (unit 6) before a newline will be generated. 
</p> <p class="p">If the environment variable <span class="ph filepath">FORT_FMT_RECL</span> is present, the Fortran runtime library will use the value specified as the number of bytes to output before a newline is generated. </p> <p class="p">The default value of <span class="ph filepath">FORT_FMT_RECL</span> is 80. </p> <ul class="ul"> <li class="li">In csh: <pre xml:space="preserve">% setenv FORT_FMT_RECL length-in-bytes</pre></li> <li class="li">In bash, sh, zsh, or ksh: <pre xml:space="preserve">$ FORT_FMT_RECL=length-in-bytes
$ export FORT_FMT_RECL</pre></li> </ul> </div> </div> <div class="topic concept nested2" id="env-vars-nv-gmon-out-prefix"><a name="env-vars-nv-gmon-out-prefix" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-gmon-out-prefix" name="env-vars-nv-gmon-out-prefix" shape="rect">12.3.3. GMON_OUT_PREFIX</a></h3> <div class="body conbody"> <p class="p"><span class="ph filepath">GMON_OUT_PREFIX</span> specifies the name of the output file for programs that are compiled and linked with the <span class="ph filepath">-pg</span> option. The default name is <span class="ph filepath">gmon.out</span>. </p> <p class="p">If <span class="ph filepath">GMON_OUT_PREFIX</span> is set, the name of the output file has the value of <span class="ph filepath">GMON_OUT_PREFIX</span> as a prefix and the pid of the running process as a suffix. The prefix and suffix are separated by a dot. For example, if <span class="ph filepath">GMON_OUT_PREFIX</span> is set to <span class="ph filepath">mygmon</span>, then the full filename might look like this: <span class="ph filepath">mygmon.0012348567</span>. </p> <p class="p">The following example sets <span class="ph filepath">nvout</span> as the output file prefix for programs compiled and linked with the <span class="ph filepath">-pg</span> option. 
</p><pre xml:space="preserve">% setenv GMON_OUT_PREFIX nvout </pre></div> </div> <div class="topic concept nested2" id="env-vars-nv-ld-library-path"><a name="env-vars-nv-ld-library-path" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-ld-library-path" name="env-vars-nv-ld-library-path" shape="rect">12.3.4. LD_LIBRARY_PATH</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">LD_LIBRARY_PATH</span> variable is a colon-separated set of directories specifying where libraries should first be searched, prior to searching the standard set of directories. This variable is useful when debugging a new library or using a nonstandard library for special purposes. </p> <p class="p">The following csh example adds the current directory to your <span class="ph filepath">LD_LIBRARY_PATH</span> variable. </p><pre xml:space="preserve">% setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":"./"</pre></div> </div> <div class="topic concept nested2" id="env-vars-nv-manpath"><a name="env-vars-nv-manpath" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-manpath" name="env-vars-nv-manpath" shape="rect">12.3.5. MANPATH</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">MANPATH</span> variable sets the directories that are searched for manual pages associated with the commands that the user types. When using NVIDIA HPC Compilers, it is important that you set your <span class="ph filepath">PATH</span> to include the location of the compilers and then set the <span class="ph filepath">MANPATH</span> variable to include the man pages associated with the products. </p> <p class="p">The following csh example targets the Linux_x86_64 version of the compilers and enables access to the manual pages associated with them. 
The settings are similar for Linux_ppc64le or Linux_aarch64 targets: </p><pre xml:space="preserve">% set path = (/opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin $path)
% setenv MANPATH "$MANPATH":/opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">24.11</span>/compilers/man</pre></div> </div> <div class="topic concept nested2" id="env-vars-nv-no-stop-msg"><a name="env-vars-nv-no-stop-msg" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-no-stop-msg" name="env-vars-nv-no-stop-msg" shape="rect">12.3.6. NO_STOP_MESSAGE</a></h3> <div class="body conbody"> <p class="p">If the <span class="ph filepath">NO_STOP_MESSAGE</span> variable exists, the execution of a plain <span class="ph filepath">STOP</span> statement does not produce the message <span class="ph filepath">FORTRAN STOP</span>. The default behavior of the NVIDIA Fortran compiler is to issue this message. </p> </div> </div> <div class="topic concept nested2" id="env-vars-nv-path"><a name="env-vars-nv-path" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-path" name="env-vars-nv-path" shape="rect">12.3.7. PATH</a></h3> <div class="body conbody"> <p class="p"> The <span class="ph filepath">PATH</span> variable determines the directories that are searched for commands that the user types. When using the NVIDIA HPC compilers, it is important that you set your <span class="ph filepath">PATH</span> to include the location of the compilers. </p> <p class="p"> The following csh example initializes path settings to use the Linux_x86_64 versions of the NVIDIA HPC Compilers. 
Settings for Linux_ppc64le and Linux_aarch64 are done similarly: </p><pre xml:space="preserve">% set path = (/opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">24.11</span>/compilers/bin $path)</pre></div> </div> <div class="topic concept nested2" id="env-vars-nv-fpu-state"><a name="env-vars-nv-fpu-state" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-fpu-state" name="env-vars-nv-fpu-state" shape="rect">12.3.8. NVCOMPILER_FPU_STATE</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">NVCOMPILER_FPU_STATE</span> environment variable manages the initial state of the processor's floating point control and status register. <span class="ph filepath">NVCOMPILER_FPU_STATE</span> eliminates the need to compile the main entry point (C/C++/Fortran) of programs with <span class="ph filepath">-M[no]daz</span>, <span class="ph filepath">-M[no]flushz</span>, or <span class="ph filepath">-Ktrap=</span> command line options, as those options can now be specified at runtime. </p> <div class="note note"><span class="notetitle">Note:</span><p class="p">Linux only</p> <p class="p">If the environment variable <span class="ph filepath">NVCOMPILER_FPU_STATE</span> is present, all settings from the command line options <span class="ph filepath">-M[no]daz</span>, <span class="ph filepath">-M[no]flushz</span>, or <span class="ph filepath">-Ktrap=</span> are ignored and the FPU is initialized according to the options specified. <span class="ph filepath">NVCOMPILER_FPU_STATE</span> with no options resets the floating-point control and status register to the system defaults. </p> </div> <p class="p">The value of <span class="ph filepath">NVCOMPILER_FPU_STATE</span> is a comma-separated list of options. The commands for setting the environment variable follow. 
</p> <ul class="ul"> <li class="li">In csh: <pre xml:space="preserve">% setenv NVCOMPILER_FPU_STATE option[,option...]</pre></li> <li class="li">In bash, sh, zsh, or ksh: <pre xml:space="preserve">$ NVCOMPILER_FPU_STATE=option[,option...] 
$ export NVCOMPILER_FPU_STATE</pre></li> </ul> <p class="p"><a class="xref" href="index.html#env-vars-nv-fpu-state__env-vars-nv-fpu-state-vals" shape="rect">Table 26</a> lists the supported values for <span class="ph filepath">option</span>. </p> <p class="p">By default, these options are taken from the compiler command line options <span class="ph filepath">-M[no]daz</span>, <span class="ph filepath">-M[no]flushz</span>, and <span class="ph filepath">-Ktrap=</span>. </p> <div class="tablenoborder"><a name="env-vars-nv-fpu-state__env-vars-nv-fpu-state-vals" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="env-vars-nv-fpu-state__env-vars-nv-fpu-state-vals" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 26. Supported NVCOMPILER_FPU_STATE options</span></caption> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">fp</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Shorthand for inv,divz,ovf</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">inv</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Raise exception on floating-point invalid operation (infinity - infinity, infinity / infinity, 0 / 0, ...)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">invalid</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Alias for inv</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">denorm</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Raise exception with floating-point denormalized operands (x86_64 only)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">divz</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Raise exception on floating-point 
divide-by-zero</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">zero</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Alias for divz</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">ovf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Raise exception on floating-point overflow in result</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">overflow</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Alias for ovf</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">unf</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Raise exception on floating-point underflow in result</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">underflow</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Alias for unf</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">inexact</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Raise exception on floating-point inexact result</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">daz</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Convert denormal source operands to zero</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">nodaz</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Do not convert denormal source operands to zero</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">ftz</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Flush underflow results to zero </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" 
colspan="1">flushz</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Alias for ftz </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">noftz</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Do not flush underflow results to zero</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">noflushz</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Alias for noftz </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">print</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Print to stderr the state of floating point control and status register before and after processing of environment variable <span class="ph filepath">NVCOMPILER_FPU_STATE</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">debug</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Alias for print</td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested2" id="env-vars-nv-term"><a name="env-vars-nv-term" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-term" name="env-vars-nv-term" shape="rect">12.3.9. NVCOMPILER_TERM</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">NVCOMPILER_TERM</span> environment variable controls the stack traceback and just-in-time debugging functionality. The runtime libraries use the value of <span class="ph filepath">NVCOMPILER_TERM</span> to determine what action to take when a program abnormally terminates. </p> <p class="p">The value of <span class="ph filepath">NVCOMPILER_TERM</span> is a comma-separated list of options. The commands for setting the environment variable follow. 
</p> <ul class="ul"> <li class="li">In csh: <pre xml:space="preserve">% setenv NVCOMPILER_TERM option[,option...]</pre></li> <li class="li">In bash, sh, zsh, or ksh: <pre xml:space="preserve">$ NVCOMPILER_TERM=option[,option...]
$ export NVCOMPILER_TERM</pre></li> </ul> <p class="p"><a class="xref" href="index.html#env-vars-nv-term__env-vars-nv-term-vals" shape="rect">Table 27</a> lists the supported values for <span class="ph filepath">option</span>. Following the table is a complete description of each option that indicates specifically how you might apply the option. </p> <p class="p">By default, all of these options are disabled. </p> <div class="tablenoborder"><a name="env-vars-nv-term__env-vars-nv-term-vals" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="env-vars-nv-term__env-vars-nv-term-vals" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 27. Supported NVCOMPILER_TERM Values</span></caption> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">[no]debug</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Enables/disables just-in-time debugging (debugging invoked on error)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">[no]trace</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Enables/disables stack traceback on error</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">[no]trace-fp</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Enables/disables stack traceback and printing of SIMD registers (ymm/zmm) on error (Linux x86_64 only)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">[no]signal</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Enables/disables establishment of signal handlers for common
signals that cause program termination</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">[no]abort</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Enables/disables calling the system termination routine abort()</td> </tr> </tbody> </table> </div> <div class="section"> <h4 class="title sectiontitle">[no]debug</h4> <p class="p">This enables/disables just-in-time debugging. The default is <span class="ph filepath">nodebug</span>. </p> <p class="p">When <span class="ph filepath">NVCOMPILER_TERM</span> is set to <span class="ph filepath">debug</span>, the command to which <span class="ph filepath">NVCOMPILER_TERM_DEBUG</span> is set is invoked on error. </p> </div> <div class="section"> <h4 class="title sectiontitle">[no]trace</h4> <p class="p">This enables/disables stack traceback on error.</p> </div> <div class="section"> <h4 class="title sectiontitle">[no]trace-fp</h4> <p class="p">This enables/disables stack traceback and printing of SIMD registers (ymm/zmm) on error. (Linux x86_64 only)</p> </div> <div class="section"> <h4 class="title sectiontitle">[no]signal</h4> <p class="p">This enables/disables establishing signal handlers for the most common signals that cause program termination. The default is <span class="ph filepath">nosignal</span>. Setting <span class="ph filepath">trace</span> and <span class="ph filepath">debug</span> automatically enables <span class="ph filepath">signal</span>. Specifically setting <span class="ph filepath">nosignal</span> allows you to override this behavior. </p> </div> <div class="section"> <h4 class="title sectiontitle">[no]abort</h4> <p class="p">This enables/disables calling the system termination routine abort(). The default is <span class="ph filepath">noabort</span>. When <span class="ph filepath">noabort</span> is in effect the process terminates by calling <span class="ph filepath">_exit(127)</span>. 
</p> <p class="p">On Linux, when <span class="ph filepath">abort</span> is in effect, the abort routine creates a core file and exits with code 127. </p> <p class="p">A few runtime errors just print an error message and call <span class="ph filepath">exit(127)</span>, regardless of the status of <span class="ph filepath">NVCOMPILER_TERM</span>. These are mainly errors such as specifying an invalid environment variable value where a traceback would not be useful. </p> <p class="p">If it appears that abort() does not generate core files on a Linux system, make sure the core file size limit (coredumpsize) is set to unlimited. You can do this in either of these ways: </p> <ul class="ul"> <li class="li">Using csh: <pre xml:space="preserve">% limit coredumpsize unlimited
% setenv NVCOMPILER_TERM abort</pre></li> <li class="li">Using bash, sh, zsh, or ksh: <pre xml:space="preserve">$ ulimit -c unlimited
$ export NVCOMPILER_TERM=abort</pre></li> </ul> <p class="p">To debug a core file with gdb, invoke gdb with the --core option. For example, to view a core file named "core" for a program named "a.out": </p><pre xml:space="preserve">$ gdb --core=core a.out</pre><p class="p">For more information on why to use this variable, refer to <a class="xref" href="index.html#env-vars-stack-trace-jit-dbg" shape="rect">Stack Traceback and JIT Debugging</a>. </p> </div> </div> </div> <div class="topic concept nested2" id="env-vars-nv-term-debug"><a name="env-vars-nv-term-debug" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-term-debug" name="env-vars-nv-term-debug" shape="rect">12.3.10. NVCOMPILER_TERM_DEBUG</a></h3> <div class="body conbody"> <p class="p">The <span class="ph filepath">NVCOMPILER_TERM_DEBUG</span> variable may be set to override the default behavior when <span class="ph filepath">NVCOMPILER_TERM</span> is set to <span class="ph filepath">debug</span>.
</p> <p class="p">The value of <span class="ph filepath">NVCOMPILER_TERM_DEBUG</span> should be set to the command line used to invoke the debugger. For example: </p><pre xml:space="preserve">gdb --quiet --pid %d</pre><p class="p">The first occurrence of <span class="ph filepath">%d</span> in the <span class="ph filepath">NVCOMPILER_TERM_DEBUG</span> string is replaced by the process id. The program named in the <span class="ph filepath">NVCOMPILER_TERM_DEBUG</span> string must be found on the current <span class="ph filepath">PATH</span> or specified with a full path name. </p> </div> </div> <div class="topic concept nested2" id="env-vars-nv-pwd"><a name="env-vars-nv-pwd" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-pwd" name="env-vars-nv-pwd" shape="rect">12.3.11. PWD</a></h3> <div class="body conbody"> <p class="p">The PWD variable allows you to display the current directory. </p> </div> </div> <div class="topic concept nested2" id="env-vars-nv-static-random-seed"><a name="env-vars-nv-static-random-seed" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-static-random-seed" name="env-vars-nv-static-random-seed" shape="rect">12.3.12. STATIC_RANDOM_SEED</a></h3> <div class="body conbody"> <p class="p">You can use <span class="ph filepath">STATIC_RANDOM_SEED</span> to force the seed returned by the Fortran 90/95 <span class="ph filepath">RANDOM_SEED</span> intrinsic to be constant. The first call to <span class="ph filepath">RANDOM_SEED</span> without arguments resets the random seed to a default value, then advances the seed by a variable amount based on time. Subsequent calls to <span class="ph filepath">RANDOM_SEED</span> without arguments reset the random seed to the same initial value as the first call. Unless the time is exactly the same, each time a program is run a different random number sequence is generated.
Setting the environment variable <span class="ph filepath">STATIC_RANDOM_SEED</span> to <span class="ph filepath">YES</span> forces the seed returned by <span class="ph filepath">RANDOM_SEED</span> to be constant, thereby generating the same sequence of random numbers at each execution of the program. </p> </div> </div> <div class="topic concept nested2" id="env-vars-nv-tmp"><a name="env-vars-nv-tmp" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-tmp" name="env-vars-nv-tmp" shape="rect">12.3.13. TMP</a></h3> <div class="body conbody"> <p class="p">You can use <span class="ph filepath">TMP</span> to specify the directory to use for placement of any temporary files created during execution of the NVIDIA HPC Compilers. This variable is interchangeable with <span class="ph filepath">TMPDIR</span>. </p> </div> </div> <div class="topic concept nested2" id="env-vars-nv-tmpdir"><a name="env-vars-nv-tmpdir" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-nv-tmpdir" name="env-vars-nv-tmpdir" shape="rect">12.3.14. TMPDIR</a></h3> <div class="body conbody"> <p class="p">You can use <span class="ph filepath">TMPDIR</span> to specify the directory to use for placement of any temporary files created during execution of the NVIDIA HPC Compilers. </p> </div> </div> </div> <div class="topic concept nested1" id="env-vars-use-mod-linux"><a name="env-vars-use-mod-linux" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-use-mod-linux" name="env-vars-use-mod-linux" shape="rect">12.4. Using Environment Modules on Linux</a></h3> <div class="body conbody"> <p class="p"> On Linux, if you use the Environment Modules package, that is, the <span class="keyword cmdname">module load</span> command, the NVIDIA HPC Compilers include a script to set up the appropriate module files. The install script will generate environment module files for you as part of the set up process. 
</p> <p class="p"> Assuming your installation base directory is <span class="ph filepath">/opt/nvidia/hpc_sdk</span>, the environment modules will be installed under <span class="ph filepath">/opt/nvidia/hpc_sdk/modulefiles</span>. There will be three sets of module files: </p> <ol class="ol"> <li class="li"> nvhpc <p class="p"> Adds environment variable settings for the NVIDIA HPC Compilers, CUDA libraries, and additional libraries such as MPI, NCCL, and NVSHMEM. </p> </li> <li class="li"> nvhpc-nompi <p class="p"> Adds environment variable settings for the NVIDIA HPC Compilers, CUDA libraries, and additional libraries such as NCCL and NVSHMEM. This module file does not include MPI; use it if you want an alternate MPI implementation. </p> </li> <li class="li"> nvhpc-byo-compilers <p class="p"> Adds environment variable settings for the CUDA libraries and additional libraries such as NCCL and NVSHMEM. This module file includes neither the NVIDIA HPC Compilers nor MPI; use it if you want alternate compilers and MPI. </p> </li> </ol> <p class="p"> You can load the nvhpc environment module for the <span class="keyword">24.11</span> release as follows: </p><pre class="pre screen" xml:space="preserve"> % module load nvhpc/<span class="keyword">24.11</span> </pre><p class="p"> To see what versions of nvhpc are available on this system, use this command: </p><pre class="pre screen" xml:space="preserve"> % module avail nvhpc </pre><p class="p"> The <span class="keyword cmdname">module load</span> command sets or modifies the environment variables as indicated in the following table.
</p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e20001" rowspan="1" colspan="1">This Environment Variable...</th> <th class="entry" valign="top" width="NaN%" id="d54e20004" rowspan="1" colspan="1">Is set or modified by the module load command</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">CC</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1">Full path to <span class="keyword cmdname">nvc</span> (nvhpc and nvhpc-nompi only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">CPATH</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1"> Prepends the math libraries include directory, the MPI include directory (nvhpc only), and NCCL and NVSHMEM include directories </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">CPP</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1">C preprocessor, normally cpp (nvhpc and nvhpc-nompi only)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">CXX</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1">Path to <span class="keyword cmdname">nvc++</span> (nvhpc and nvhpc-nompi only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">FC</samp></td> <td class="entry" valign="top" width="NaN%" 
headers="d54e20004" rowspan="1" colspan="1">Full path to <span class="keyword cmdname">nvfortran</span> (nvhpc and nvhpc-nompi only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">F90</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1">Full path to <span class="keyword cmdname">nvfortran</span> (nvhpc and nvhpc-nompi only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">F77</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1">Full path to <span class="keyword cmdname">nvfortran</span> (nvhpc and nvhpc-nompi only) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">LD_LIBRARY_PATH</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1"> Prepends the CUDA library directory, the NVIDIA HPC Compilers library directory (nvhpc and nvhpc-nompi only), math libraries library directory, MPI library directory (nvhpc only), and NCCL and NVSHMEM library directories </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">MANPATH</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1">Prepends the NVIDIA HPC Compilers man page directory (nvhpc and nvhpc-nompi only)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">OPAL_PREFIX</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1"> Full path to the MPI directory (nvhpc only), e.g. 
/opt/nvidia/hpc_sdk/Linux_x86_64/<span class="keyword">24.11</span>/comm_libs/mpi </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20001" rowspan="1" colspan="1"><samp class="ph codeph">PATH</samp></td> <td class="entry" valign="top" width="NaN%" headers="d54e20004" rowspan="1" colspan="1"> Prepends the CUDA bin directory, the MPI bin directory (nvhpc only), and the NVIDIA HPC Compilers bin directory (nvhpc and nvhpc-nompi only) </td> </tr> </tbody> </table> </div> <div class="note note"><span class="notetitle">Note:</span> NVIDIA does not provide support for the Environment Modules package. For more information about the package, go to: <a class="xref" href="http://modules.sourceforge.net" target="_blank" shape="rect">http://modules.sourceforge.net</a>. </div> </div> </div> <div class="topic concept nested1" id="env-vars-stack-trace-jit-dbg"><a name="env-vars-stack-trace-jit-dbg" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#env-vars-stack-trace-jit-dbg" name="env-vars-stack-trace-jit-dbg" shape="rect">12.5. Stack Traceback and JIT Debugging</a></h3> <div class="body conbody"> <p class="p">When a programming error results in a runtime error message or an application exception, a program will usually exit, perhaps with an error message. The NVIDIA HPC Compilers runtime library includes a mechanism to override this default action and instead print a stack traceback<span class="ph">, start a debugger, or, on Linux, create a core file for post-mortem debugging.</span></p> <p class="p">The stack traceback and just-in-time debugging functionality is controlled by an environment variable, <span class="ph filepath">NVCOMPILER_TERM</span>, described in <a class="xref" href="index.html#env-vars-nv-term" shape="rect">NVCOMPILER_TERM</a>. The runtime libraries use the value of <span class="ph filepath">NVCOMPILER_TERM</span> to determine what action to take when a program abnormally terminates. 
</p> <p class="p">When the NVIDIA HPC Compilers runtime library detects an error or catches a signal, it calls the routine <span class="ph filepath">nvcompiler_stop_here()</span> prior to generating a stack traceback or starting the debugger. The <span class="ph filepath">nvcompiler_stop_here()</span> routine is a convenient spot to set a breakpoint when debugging a program. </p> </div> </div> </div> <div class="topic concept nested0" id="deploy-dist-files"><a name="deploy-dist-files" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#deploy-dist-files" name="deploy-dist-files" shape="rect">13. Distributing Files – Deployment</a></h2> <div class="body conbody"> <p class="p">Once you have successfully built, debugged and tuned your application, you may want to distribute it to users who need to run it on a variety of systems. This section addresses how to effectively distribute applications built using NVIDIA HPC Compilers. The application must be installed in such a way that it executes accurately on a system other than the one on which it was built, and which may be configured differently. </p> </div> <div class="topic concept nested1" id="deploy-app-linux"><a name="deploy-app-linux" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#deploy-app-linux" name="deploy-app-linux" shape="rect">13.1. Deploying Applications on Linux</a></h3> <div class="body conbody"> <p class="p">To successfully deploy your application on Linux, some of the issues to consider include:</p> <ul class="ul"> <li class="li">Runtime Libraries</li> <li class="li">64-bit Linux Systems</li> <li class="li">Redistribution of Files</li> </ul> </div> <div class="topic concept nested2" id="deploy-app-linux-runtm-lib"><a name="deploy-app-linux-runtm-lib" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#deploy-app-linux-runtm-lib" name="deploy-app-linux-runtm-lib" shape="rect">13.1.1. 
Runtime Library Considerations</a></h3> <div class="body conbody"> <p class="p">On Linux systems, the system runtime libraries can be linked to an application either statically or dynamically. For example, for the C runtime library, <span class="ph filepath">libc</span>, you can use either the static version <span class="ph filepath">libc.a</span> or the shared object version <span class="ph filepath">libc.so</span>. If the application is intended to run on Linux systems other than the one on which it was built, it is generally safer to use the shared object version of the library. This approach ensures that the application uses a version of the library that is compatible with the system on which the application is running. Further, it works best when the application is linked on a system that has the same or an earlier version of the system software than the system on which the application will be run. </p> <div class="note note"><span class="notetitle">Note:</span> Building on a newer system and running the application on an older system may not produce the desired output. </div> <p class="p">To use the shared object version of a library, the application must also link to shared object versions of the NVIDIA HPC Compilers runtime libraries. To execute an application built in such a way on a system on which NVIDIA HPC Compilers are <em class="ph i">not</em> installed, those shared objects must be available. To build using the shared object versions of the runtime libraries, use the <span class="ph filepath">-⁠Bdynamic</span> option, as shown here: </p><pre xml:space="preserve">$ nvfortran -Bdynamic myprog.f90</pre></div> </div> <div class="topic concept nested2" id="deploy-app-linux-64bit"><a name="deploy-app-linux-64bit" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#deploy-app-linux-64bit" name="deploy-app-linux-64bit" shape="rect">13.1.2.
64-bit Linux Considerations</a></h3> <div class="body conbody"> <p class="p">On 64-bit Linux systems, 64-bit applications that use the <span class="ph filepath">-⁠mcmodel=medium</span> option sometimes cannot be successfully linked statically. Therefore, users with executables built with the <span class="ph filepath">-⁠mcmodel=medium</span> option may need to use shared libraries, linking dynamically. Also, runtime libraries built using the <span class="ph filepath">-⁠fpic</span> option use 32-bit offsets, so they sometimes need to reside near other runtime <span class="ph filepath">libs</span> in a shared area of Linux program memory. </p> <div class="note note"><span class="notetitle">Note:</span> If your application is linked dynamically using shared objects, then the shared object versions of the NVIDIA HPC Compilers runtime are required. </div> </div> </div> <div class="topic concept nested2" id="deploy-app-linux-redist"><a name="deploy-app-linux-redist" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#deploy-app-linux-redist" name="deploy-app-linux-redist" shape="rect">13.1.3. Linux Redistributable Files</a></h3> <div class="body conbody"> <p class="p">The method for installing the shared object versions of the runtime libraries required for applications built with NVIDIA HPC Compilers is manual distribution. </p> <p class="p">When the NVIDIA HPC Compilers are installed, there are directories that have a name that begins with <span class="ph filepath">REDIST</span>; these directories contain the redistributed shared object libraries. These may be redistributed by licensed NVIDIA HPC Compilers users under the terms of the End-User License Agreement. </p> </div> </div> <div class="topic concept nested2" id="deploy-app-linux-port"><a name="deploy-app-linux-port" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#deploy-app-linux-port" name="deploy-app-linux-port" shape="rect">13.1.4. 
Restrictions on Linux Portability</a></h3> <div class="body conbody"> <p class="p">You cannot expect to be able to run an executable on any given Linux machine. Portability depends on the system you build on as well as how much your program uses system routines that may have changed from Linux release to Linux release. For example, an area of significant change between some versions of Linux is in <span class="ph filepath">libpthread.so</span> and <span class="ph filepath">libnuma.so</span>. NVIDIA HPC Compilers use these dynamically linked libraries for the options <span class="ph filepath">-acc</span> (OpenACC), <span class="ph filepath">-mp</span> (OpenMP) and <span class="ph filepath">-Mconcur</span> (multicore auto-parallel). Statically linking these libraries may not be possible, or may result in failure at execution. </p> <p class="p">Typically, portability is supported for forward execution, meaning running a program on the same or a later version of Linux, but not for backward compatibility, that is, running on a prior release. For example, a user who compiles and links a program under RHEL 7.2 should not expect the program to run without incident on a RHEL 5.2 system, an earlier Linux version. It <em class="ph i">may</em> run, but it is less likely. Developers might consider building applications on earlier Linux versions for wider usage. Dynamic linking of Linux and gcc system routines on the platform executing the program can also reduce problems. </p> </div> </div> <div class="topic concept nested2" id="deploy-app-linux-license"><a name="deploy-app-linux-license" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#deploy-app-linux-license" name="deploy-app-linux-license" shape="rect">13.1.5. Licensing for Redistributable (REDIST) Files</a></h3> <div class="body conbody"> <p class="p">The files in the REDIST directories may be redistributed under the terms of the End-User License Agreement for the product in which they were included.
</p> </div> </div> </div> </div> <div class="topic concept nested0" id="intr-lang-call"><a name="intr-lang-call" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#intr-lang-call" name="intr-lang-call" shape="rect">14. Inter-language Calling</a></h2> <div class="body conbody"> <p class="p"> This section describes inter-language calling conventions for C, C++, and Fortran programs using the HPC compilers. Fortran 2003 ISO_C_Binding provides a mechanism to support interoperability with C. This includes the <span class="ph filepath">iso_c_binding</span> intrinsic module, binding labels, and the BIND attribute. Additional interoperability with C is available with Fortran 2018 and the <span class="ph filepath">ISO_Fortran_binding.h</span> C header file. nvfortran supports both the <span class="ph filepath">iso_c_binding</span> intrinsic module and the <span class="ph filepath">ISO_Fortran_binding.h</span> header file. In the absence of these mechanisms, the following sections describe how to call a Fortran function or subroutine from a C or C++ program and how to call a C or C++ function from a Fortran program. </p> <p class="p"> This section provides examples that use the following options related to inter-language calling.
</p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="void" border="0" rules="none"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath"> </span></td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-c</span></td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mnomain</span></td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Miface</span></td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mupcase</span></td> </tr> </tbody> </table> </div> </div> <div class="topic concept nested1" id="intr-lang-call-conv-ovr"><a name="intr-lang-call-conv-ovr" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-call-conv-ovr" name="intr-lang-call-conv-ovr" shape="rect">14.1. Overview of Calling Conventions</a></h3> <div class="body conbody"> <p class="p">This section includes information on the following topics:</p> <ul class="ul"> <li class="li">Functions and subroutines in Fortran, C, and C++</li> <li class="li">Naming and case conversion conventions</li> <li class="li">Compatible data types</li> <li class="li">Argument passing and special return values</li> <li class="li">Arrays and indexes</li> </ul> <p class="p">The sections <a class="xref" href="index.html#intr-lang-call-consider" shape="rect">Inter-language Calling Considerations</a> through <a class="xref" href="index.html#intr-lang-exam-cpp-fort" shape="rect">Example – C++ Calling Fortran</a> describe how to perform inter-language calling<span class="ph"> using the <span class="ph">Linux or </span>Win64 convention</span>. 
</p> </div> </div> <div class="topic concept nested1" id="intr-lang-call-consider"><a name="intr-lang-call-consider" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-call-consider" name="intr-lang-call-consider" shape="rect">14.2. Inter-language Calling Considerations</a></h3> <div class="body conbody"> <p class="p"> In general, when argument data types and function return values agree, you can call a C or C+⁠+ function from Fortran as well as call a Fortran function from C or C++. When data types for arguments do not agree, you may need to develop custom mechanisms to handle them. For example, the Fortran <span class="ph filepath">COMPLEX</span> type has a matching type in C99 but does not have a matching type in C89; however, it is still possible to provide inter-language calls but there are no general calling conventions for such cases. </p> <div class="note note"><span class="notetitle">Note:</span><ul class="ul"> <li class="li"> If a C++ function contains objects with constructors and destructors, calling such a function from <span class="ph">either C or</span> Fortran is not possible unless the initialization in the main program is performed from a C⁠+⁠+ program in which constructors and destructors are properly initialized. </li> <li class="li"> In general, you can call a C or Fortran function from C++ without problems as long as you use the extern "C" keyword to declare the function in the C++ program. This declaration prevents name mangling for the C function name. If you want to call a C++ function from C or Fortran, you also have to use the extern "C" keyword to declare the C++ function. This keeps the C++ compiler from mangling the name of the function. </li> <li class="li"> You can use the __cplusplus macro to allow a program or header file to work for both C and C++. For example, the following defines in the header file stdio.h allow this file to work for both C and C++. 
<pre xml:space="preserve">#ifndef _STDIO_H
#define _STDIO_H
#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */
.
.
/* Functions and data types defined... */
.
#ifdef __cplusplus
}
#endif /* __cplusplus */
#endif</pre></li> <li class="li"> C++ member functions cannot be declared <span class="ph filepath">extern "C"</span>, since their names will always be mangled. Therefore, C++ member functions cannot be called from C or Fortran. </li> </ul> </div> </div> </div> <div class="topic concept nested1" id="intr-lang-funcs-subs"><a name="intr-lang-funcs-subs" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-funcs-subs" name="intr-lang-funcs-subs" shape="rect">14.3. Functions and Subroutines</a></h3> <div class="body conbody"> <p class="p">Fortran, C, and C++ define functions and subroutines differently.</p> <p class="p">For a Fortran program calling a C or C++ function, observe the following return value convention: </p> <ul class="ul"> <li class="li">When a C or C++ function returns a value, call it from Fortran as a function.</li> <li class="li">When a C or C++ function does not return a value, call it as a subroutine.</li> </ul> <p class="p">For a C/C++ program calling a Fortran function, declare the return value with a compatible type. <a class="xref" href="index.html#intr-lang-data-types__intr-lang-data-types-tbl" shape="rect">Table 28</a>, <a class="xref" href="index.html#intr-lang-data-types__intr-lang-data-types-tbl" shape="rect">Fortran and C/C++ Data Type Compatibility</a>, lists compatible types. If the call is to a Fortran subroutine<span class="ph">, </span>a Fortran <span class="ph filepath">CHARACTER</span> function<span class="ph">, or a Fortran <span class="ph filepath">COMPLEX</span> function</span>, call it from C/C++ as a function that returns void. 
The exception to this convention is when a Fortran subroutine has alternate returns; call such a subroutine from C/C++ as a function returning <span class="ph filepath">int</span> whose value is the value of the integer expression specified in the alternate <span class="ph filepath">RETURN</span> statement. </p> </div> </div> <div class="topic concept nested1" id="intr-lang-case-conv"><a name="intr-lang-case-conv" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-case-conv" name="intr-lang-case-conv" shape="rect">14.4. Upper and Lower Case Conventions, Underscores</a></h3> <div class="body conbody"> <p class="p">By default<span class="ph"> on Linux and Win64 systems</span>, all Fortran symbol names are converted to lower case. C and C++ are case sensitive, so upper-case function names stay upper-case. When you use inter-language calling, you can either name your C/C++ functions with lower-case names, or invoke the Fortran compiler command with the option <span class="ph filepath">-⁠Mupcase</span>, in which case it will not convert symbol names to lower-case. </p> <p class="p">When programs are compiled using one of the HPC Fortran compilers<span class="ph"> on Linux and Win64 systems</span>, an underscore is appended to Fortran global names (names of functions, subroutines and common blocks). This mechanism distinguishes Fortran name space from C/C++ name space. Use these naming conventions: </p> <ul class="ul"> <li class="li">If you call a C/C++ function from Fortran, you should rename the C/C++ function by appending an underscore or use <samp class="ph codeph">bind(c)</samp> in the Fortran program. 
</li> <li class="li">If you call a Fortran function from C/C++, you should append an underscore to the Fortran function name in the calling program.</li> </ul> </div> </div> <div class="topic concept nested1" id="intr-lang-data-types"><a name="intr-lang-data-types" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-data-types" name="intr-lang-data-types" shape="rect">14.5. Compatible Data Types</a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="index.html#intr-lang-data-types__intr-lang-data-types-tbl" shape="rect">Table 28</a> shows compatible data types between Fortran and C/C++. <a class="xref" href="index.html#intr-lang-data-types__intr-lang-data-types-repr-tbl" shape="rect">Table 29</a>, <a class="xref" href="index.html#intr-lang-data-types__intr-lang-data-types-repr-tbl" shape="rect"> Fortran and C/C++ Representation of the COMPLEX Type</a> shows how the Fortran <span class="ph filepath">COMPLEX</span> type may be represented in C/C++. </p> <div class="note tip"><span class="tiptitle">Tip:</span> If you can make your function/subroutine parameters as well as your return values match types, you should be able to use inter-language calling. </div> <div class="tablenoborder"><a name="intr-lang-data-types__intr-lang-data-types-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="intr-lang-data-types__intr-lang-data-types-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 28. 
Fortran and C/C++ Data Type Compatibility</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e20699" rowspan="1" colspan="1">Fortran Type (lower case)</th> <th class="entry" valign="top" width="NaN%" id="d54e20702" rowspan="1" colspan="1">C/C++ Type</th> <th class="entry" valign="top" width="NaN%" id="d54e20705" rowspan="1" colspan="1">Size (bytes)</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">character x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">char x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">1</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">character*n x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">char x[n]</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">n</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">real x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">float x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">4</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">real*4 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">float x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">4</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">real*8 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">double x</td> <td class="entry" 
valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">8</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">double precision</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">double x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">8</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">integer x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">int x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">4</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">integer*1 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">signed char x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">1</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">integer*2 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">short x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">2</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">integer*4 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">int x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">4</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">integer*8 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">long long x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" 
rowspan="1" colspan="1">8</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">logical x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">int x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">4</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">logical*1 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">char x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">1</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">logical*2 x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">short x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">2</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">logical*4</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">int x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">4</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e20699" rowspan="1" colspan="1">logical*8</td> <td class="entry" valign="top" width="NaN%" headers="d54e20702" rowspan="1" colspan="1">long x</td> <td class="entry" valign="top" width="NaN%" headers="d54e20705" rowspan="1" colspan="1">8</td> </tr> </tbody> </table> </div> <div class="tablenoborder"><a name="intr-lang-data-types__intr-lang-data-types-repr-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="intr-lang-data-types__intr-lang-data-types-repr-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 29. 
Fortran and C/C++ Representation of the <span class="ph filepath">COMPLEX</span> Type</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e20929" rowspan="1" colspan="1">Fortran Type (lower case)</th> <th class="entry" valign="top" width="NaN%" id="d54e20932" rowspan="1" colspan="1">C/C++ Type</th> <th class="entry" valign="top" width="NaN%" id="d54e20935" rowspan="1" colspan="1">Size (bytes)</th> </tr> </thead> <tbody class="tbody"> <tr class="row" valign="middle"> <td class="entry" valign="middle" width="NaN%" headers="d54e20929" rowspan="1" colspan="1">complex x</td> <td class="entry" valign="middle" width="NaN%" headers="d54e20932" rowspan="1" colspan="1"> <ul class="sl simple"> <li class="sli">struct {float r,i;} x;</li> <li class="sli">float complex x;</li> </ul> </td> <td class="entry" valign="middle" width="NaN%" headers="d54e20935" rowspan="1" colspan="1"> <ul class="sl simple"> <li class="sli">8</li> <li class="sli">8</li> </ul> </td> </tr> <tr class="row" valign="middle"> <td class="entry" valign="middle" width="NaN%" headers="d54e20929" rowspan="1" colspan="1">complex*8 x</td> <td class="entry" valign="middle" width="NaN%" headers="d54e20932" rowspan="1" colspan="1"> <ul class="sl simple"> <li class="sli">struct {float r,i;} x;</li> <li class="sli">float complex x;</li> </ul> </td> <td class="entry" valign="middle" width="NaN%" headers="d54e20935" rowspan="1" colspan="1"> <ul class="sl simple"> <li class="sli">8</li> <li class="sli">8</li> </ul> </td> </tr> <tr class="row" valign="middle"> <td class="entry" valign="middle" width="NaN%" headers="d54e20929" rowspan="1" colspan="1">double complex x</td> <td class="entry" valign="middle" width="NaN%" headers="d54e20932" rowspan="1" colspan="1"> <ul class="sl simple"> <li class="sli">struct {double dr,di;} x;</li> <li class="sli">double complex x;</li> </ul> </td> <td class="entry" valign="middle" width="NaN%" headers="d54e20935" rowspan="1" 
colspan="1"> <ul class="sl simple"> <li class="sli">16</li> <li class="sli">16</li> </ul> </td> </tr> <tr class="row" valign="middle"> <td class="entry" valign="middle" width="NaN%" headers="d54e20929" rowspan="1" colspan="1">complex *16 x</td> <td class="entry" valign="middle" width="NaN%" headers="d54e20932" rowspan="1" colspan="1"> <ul class="sl simple"> <li class="sli">struct {double dr,di;} x;</li> <li class="sli">double complex x;</li> </ul> </td> <td class="entry" valign="middle" width="NaN%" headers="d54e20935" rowspan="1" colspan="1"> <ul class="sl simple"> <li class="sli">16</li> <li class="sli">16</li> </ul> </td> </tr> </tbody> </table> </div> <div class="note note"><span class="notetitle">Note:</span> For C/C++, the <span class="ph filepath">complex</span> type implies C99 or later. </div> </div> <div class="topic concept nested2" id="intr-lang-data-fort-blks"><a name="intr-lang-data-fort-blks" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-data-fort-blks" name="intr-lang-data-fort-blks" shape="rect">14.5.1. Fortran Named Common Blocks</a></h3> <div class="body conbody"> <p class="p">A named Fortran common block can be represented in C/C++ by a structure whose members correspond to the members of the common block. The name of the structure in C/C++ must have the added underscore. 
For example, here is a Fortran common block: </p><pre xml:space="preserve">INTEGER I
COMPLEX C
DOUBLE COMPLEX CD
DOUBLE PRECISION D
COMMON /COM/ i, c, cd, d</pre><p class="p">This Fortran common block is represented in C with the following equivalent:</p><pre xml:space="preserve">extern struct {
    int i;
    struct {float real, imag;} c;
    struct {double real, imag;} cd;
    double d;
} com_;</pre><p class="p">The same Fortran common block is represented in C++ with the following equivalent:</p><pre xml:space="preserve">extern "C" struct {
    int i;
    struct {float real, imag;} c;
    struct {double real, imag;} cd;
    double d;
} com_;</pre><div class="note tip"><span class="tiptitle">Tip:</span> For global or external data sharing, <span class="ph filepath">extern "C"</span> is not required. </div> </div> </div> </div> <div class="topic concept nested1" id="intr-lang-arg-pass-rtn-val"><a name="intr-lang-arg-pass-rtn-val" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-arg-pass-rtn-val" name="intr-lang-arg-pass-rtn-val" shape="rect">14.6. Argument Passing and Return Values</a></h3> <div class="body conbody"> <p class="p">In Fortran, arguments are passed by reference; that is, the address of the argument is passed, rather than the argument itself. In C/C++, arguments are passed by value, except for strings and arrays, which are passed by reference. C/C++ is flexible enough that you can work around these differences. Resolving the parameter passing differences generally involves careful use of the <span class="ph filepath">&amp;</span> and <span class="ph filepath">*</span> operators in argument passing when C/C++ calls Fortran and in argument declarations when Fortran calls C/C++. </p> <p class="p">For strings declared in Fortran as type <span class="ph filepath">CHARACTER</span>, an argument representing the length of the string is also passed to the called function. 
</p> <p class="p"><span class="ph">On the following systems, the</span> compiler places the length argument(s) at the end of the parameter list, following the other formal arguments<span class="ph">:</span></p> <ul class="ul"> <li class="li">On Linux systems</li> <li class="li">On Win64 systems, except when using the option <span class="ph filepath">-Miface=cref</span></li> </ul> <p class="p">The length argument is passed by value, not by reference.</p> </div> <div class="topic concept nested2" id="intr-lang-arg-pass-by-val"><a name="intr-lang-arg-pass-by-val" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-arg-pass-by-val" name="intr-lang-arg-pass-by-val" shape="rect">14.6.1. Passing by Value (%VAL)</a></h3> <div class="body conbody"> <p class="p">When passing parameters from a Fortran subprogram to a C/C++ function, it is possible to pass by value using the <span class="ph filepath">%VAL</span> function. If you enclose a Fortran parameter with <span class="ph filepath">%VAL()</span>, the parameter is passed by value. For example, the following call passes the integer <span class="ph filepath">i</span> and the logical <span class="ph filepath">bvar</span> by value. </p><pre xml:space="preserve">integer*1 i
logical*1 bvar
call cvalue (%VAL(i), %VAL(bvar))</pre></div> </div> <div class="topic concept nested2" id="intr-lang-arg-char-rtn-val"><a name="intr-lang-arg-char-rtn-val" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-arg-char-rtn-val" name="intr-lang-arg-char-rtn-val" shape="rect">14.6.2. Character Return Values</a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="index.html#intr-lang-funcs-subs" shape="rect">Functions and Subroutines</a> describes the general rules for return values for C/C++ and Fortran inter-language calling. There is a special return value to consider. 
When a Fortran function returns a character value, two arguments need to be added at the beginning of the argument list in the C/C++ call: </p> <ul class="ul"> <li class="li">The address of the return character or characters</li> <li class="li">The length of the return character value</li> </ul> <p class="p">The following example illustrates the extra parameters, <span class="ph filepath">tmp</span> and <span class="ph filepath">10</span>, supplied by the caller: </p> <div class="example" id="intr-lang-arg-char-rtn-val__char-rtn-param-exam"><a name="intr-lang-arg-char-rtn-val__char-rtn-param-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Character Return Parameters</h4><pre xml:space="preserve">! Fortran function returns a character
CHARACTER*(*) FUNCTION CHF(C1,I)
CHARACTER*(*) C1
INTEGER I
END</pre><pre xml:space="preserve">/* C declaration of Fortran function */
extern void chf_();
char tmp[10];
char c1[9];
int i;
chf_(tmp, 10, c1, &amp;i, 9);</pre><p class="p">If the Fortran function is declared to return a character value of constant length, for example <span class="ph filepath">CHARACTER*4 FUNCTION CHF()</span>, the second extra parameter representing the length must still be supplied, but is not used. </p> <div class="note note"><span class="notetitle">Note:</span> The value of the character function is not automatically null-terminated. </div> </div> </div> </div> <div class="topic concept nested2" id="intr-lang-arg-cmplx-rtn-val"><a name="intr-lang-arg-cmplx-rtn-val" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-arg-cmplx-rtn-val" name="intr-lang-arg-cmplx-rtn-val" shape="rect">14.6.3. Complex Return Values</a></h3> <div class="body conbody"> <p class="p">When a Fortran function returns a complex value, an argument needs to be added at the beginning of the argument list in the C/C++ call; this argument is the address of the complex return value. 
<a class="xref" href="index.html#intr-lang-arg-cmplx-rtn-val__complx-rtn-val-exam" shape="rect">COMPLEX Return Values</a> illustrates the extra parameter, <span class="ph filepath">c1</span> (declared with the type <span class="ph filepath">cplx</span>), supplied by the caller. </p> <div class="example" id="intr-lang-arg-cmplx-rtn-val__complx-rtn-val-exam"><a name="intr-lang-arg-cmplx-rtn-val__complx-rtn-val-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">COMPLEX Return Values</h4><pre xml:space="preserve">COMPLEX FUNCTION CF(C, I)
INTEGER I
. . .
END</pre><pre xml:space="preserve">extern void cf_();
typedef struct {float real, imag;} cplx;
cplx c1;
int i;
cf_(&amp;c1, &amp;i);</pre></div> </div> </div> </div> <div class="topic concept nested1" id="intr-lang-array-idx"><a name="intr-lang-array-idx" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-array-idx" name="intr-lang-array-idx" shape="rect">14.7. Array Indices</a></h3> <div class="body conbody"> <p class="p">C/C++ arrays and Fortran arrays use different default initial array index values. By default, arrays in C/C++ start at 0 and arrays in Fortran start at 1. If you adjust your array comparisons so that the Fortran element at index 2 is compared to the C/C++ element at index 1, and adjust similarly for other elements, you should not have problems working with this difference. If this is not satisfactory, you can declare your Fortran arrays to start at zero. </p> <p class="p">Another difference between Fortran and C/C++ arrays is the storage method used. Fortran uses column-major order and C/C++ uses row-major order. For one-dimensional arrays, this poses no problems. For square two-dimensional arrays, which have an equal number of rows and columns, the row and column indexes can simply be reversed. For any other arrays, that is, anything other than one-dimensional arrays and square two-dimensional arrays, inter-language function mixing is not recommended. 
</p> </div> </div> <div class="topic concept nested1" id="intr-lang-examples"><a name="intr-lang-examples" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-examples" name="intr-lang-examples" shape="rect">14.8. Examples</a></h3> <div class="body conbody"> <p class="p">This section contains examples that illustrate inter-language calling.</p> </div> <div class="topic concept nested2" id="intr-lang-exam-fort-c"><a name="intr-lang-exam-fort-c" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-exam-fort-c" name="intr-lang-exam-fort-c" shape="rect">14.8.1. Example – Fortran Calling C</a></h3> <div class="body conbody"> <div class="note note"><span class="notetitle">Note:</span> There are other solutions to calling C from Fortran than the one presented in this section. For example, you can use the <span class="ph filepath">iso_c_binding</span> intrinsic module which NVIDIA does support. For more information on this module and for examples of how to use it, search the web using the keyword iso_c_binding. </div> <p class="p"><a class="xref" href="index.html#intr-lang-exam-fort-c__intr-lang-fort-c-exam-subr" shape="rect">C function f2c_func_</a> shows a C function that is called by the Fortran main program shown in <a class="xref" href="index.html#intr-lang-exam-fort-c__intr-lang-fort-c-exam" shape="rect">Fortran Main Program f2c_main.f</a>. Notice that each argument is defined as a pointer, since Fortran passes by reference. Also notice that the C function name uses all lower-case and a trailing "_". 
</p> <div class="example" id="intr-lang-exam-fort-c__intr-lang-fort-c-exam"><a name="intr-lang-exam-fort-c__intr-lang-fort-c-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Fortran Main Program f2c_main.f</h4><pre xml:space="preserve">      logical*1 bool1
      character letter1
      integer*4 numint1, numint2
      real numfloat1
      double precision numdoub1
      integer*2 numshor1
      external f2c_func
      call f2c_func(bool1, letter1, numint1, numint2,
     +              numfloat1, numdoub1, numshor1)
      write( *, "(L2, A2, I5, I5, F6.1, F6.1, I5)")
     + bool1, letter1, numint1, numint2, numfloat1,numdoub1, numshor1
      end</pre></div> <div class="example" id="intr-lang-exam-fort-c__intr-lang-fort-c-exam-subr"><a name="intr-lang-exam-fort-c__intr-lang-fort-c-exam-subr" shape="rect"> <!-- --></a><h4 class="title sectiontitle">C function f2c_func_</h4><pre xml:space="preserve">#define TRUE 0xff
#define FALSE 0

/* Every argument is a pointer because Fortran passes by reference.
   len_letter1 is the length of letter1, appended by the Fortran
   compiler and passed by value. */
void f2c_func_(char *bool1, char *letter1, int *numint1, int *numint2,
               float *numfloat1, double *numdoub1, short *numshor1,
               int len_letter1)
{
    *bool1 = TRUE;
    *letter1 = 'v';
    *numint1 = 11;
    *numint2 = -44;
    *numfloat1 = 39.6;
    *numdoub1 = 39.2;
    *numshor1 = 981;
}</pre><p class="p">Compile and execute the program <span class="ph filepath">f2c_main.f</span> with the call to <span class="ph filepath">f2c_func_</span> using the following command lines: </p><pre xml:space="preserve">$ nvc -c f2c_func.c
$ nvfortran f2c_func.o f2c_main.f</pre><p class="p">Executing the <span class="ph filepath">a.out</span> file should produce the following output: </p><pre xml:space="preserve"> T v   11  -44  39.6  39.2  981</pre></div> </div> </div> <div class="topic concept nested2" id="intr-lang-exam-c-fort"><a name="intr-lang-exam-c-fort" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-exam-c-fort" name="intr-lang-exam-c-fort" shape="rect">14.8.2. 
Example – C Calling Fortran</a></h3> <div class="body conbody"> <div class="note note"><span class="notetitle">Note:</span> There are other solutions to calling Fortran from C than the one presented in this section. For example, you can use the <span class="ph filepath">ISO_Fortran_binding.h</span> C header file, which the NVIDIA compilers support. For more information on this header file and for examples of how to use it, search the web using the keyword ISO_Fortran_binding. </div> <p class="p">The example <a class="xref" href="index.html#intr-lang-exam-c-fort__intr-lang-c-fort-exam" shape="rect">C Main Program c2f_main.c</a> shows a C main program that calls the Fortran subroutine shown in <a class="xref" href="index.html#intr-lang-exam-c-fort__intr-lang-c-fort-exam-sub" shape="rect">Fortran Subroutine c2f_sub.f</a>. </p> <ul class="ul"> <li class="li">Each argument is passed by reference using the &amp; operator; the trailing <span class="ph filepath">1</span> is the length of <span class="ph filepath">letter1</span>, which is passed by value. </li> <li class="li">The call to the Fortran subroutine uses all lower-case and a trailing "_".</li> </ul> <div class="example" id="intr-lang-exam-c-fort__intr-lang-c-fort-exam"><a name="intr-lang-exam-c-fort__intr-lang-c-fort-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">C Main Program c2f_main.c</h4><pre xml:space="preserve">#include &lt;stdio.h&gt;

int main(void)
{
    char bool1, letter1;
    int numint1, numint2;
    float numfloat1;
    double numdoub1;
    short numshor1;
    /* Fortran subroutine c2f_func: lower-case name plus trailing "_" */
    extern void c2f_func_();
    c2f_func_(&amp;bool1, &amp;letter1, &amp;numint1, &amp;numint2, &amp;numfloat1,
              &amp;numdoub1, &amp;numshor1, 1);
    printf(" %s %c %d %d %3.1f %.0f %d\n",
           bool1?"TRUE":"FALSE", letter1, numint1, numint2,
           numfloat1, numdoub1, numshor1);
    return 0;
}</pre></div> <div class="example" id="intr-lang-exam-c-fort__intr-lang-c-fort-exam-sub"><a name="intr-lang-exam-c-fort__intr-lang-c-fort-exam-sub" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Fortran Subroutine c2f_sub.f</h4><pre xml:space="preserve">      subroutine c2f_func ( bool1, letter1, numint1, numint2,
     + numfloat1, numdoub1, numshor1)
      logical*1 bool1
      character letter1
      integer numint1, numint2
      double precision numdoub1
      real numfloat1
      integer*2 numshor1
      bool1 = .true.
      letter1 = "v"
      numint1 = 11
      numint2 = -44
      numdoub1 = 902
      numfloat1 = 39.6
      numshor1 = 299
      return
      end</pre><p class="p">To compile this Fortran subroutine and C program, use the following commands:</p><pre xml:space="preserve">$ nvc -c c2f_main.c
$ nvfortran -Mnomain c2f_main.o c2f_sub.f</pre><p class="p">Executing the resulting <span class="ph filepath">a.out</span> file should produce the following output: </p><pre xml:space="preserve"> TRUE v 11 -44 39.6 902 299</pre></div> </div> </div> <div class="topic concept nested2" id="intr-lang-exam-cpp-c"><a name="intr-lang-exam-cpp-c" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-exam-cpp-c" name="intr-lang-exam-cpp-c" shape="rect">14.8.3. Example – C++ Calling C</a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="index.html#intr-lang-exam-cpp-c__intr-lang-cpp-c-exam" shape="rect">C++ Main Program cp2c_main.C Calling a C Function</a> shows a C++ main program that calls the C function shown in <a class="xref" href="index.html#intr-lang-exam-cpp-c__intr-lang-cpp-c-exam-subr" shape="rect">Simple C Function cp2c_func.c</a>. 
</p> <div class="example" id="intr-lang-exam-cpp-c__intr-lang-cpp-c-exam"><a name="intr-lang-exam-cpp-c__intr-lang-cpp-c-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">C++ Main Program cp2c_main.C Calling a C Function</h4><pre xml:space="preserve">extern "C" void cp2c_func(int n, int m, int *p);
#include &lt;iostream&gt;
using namespace std;

int main()
{
    int a,b,c;
    a=8;
    b=2;
    c=0;
    cout &lt;&lt; "main: a = "&lt;&lt;a&lt;&lt;" b = "&lt;&lt;b&lt;&lt;" ptr c = "&lt;&lt;hex&lt;&lt;&amp;c&lt;&lt; endl;
    cp2c_func(a,b,&amp;c);
    cout &lt;&lt; "main: res = "&lt;&lt;c&lt;&lt;endl;
}</pre></div> <div class="example" id="intr-lang-exam-cpp-c__intr-lang-cpp-c-exam-subr"><a name="intr-lang-exam-cpp-c__intr-lang-cpp-c-exam-subr" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Simple C Function cp2c_func.c</h4><pre xml:space="preserve">#include &lt;stdio.h&gt;

void cp2c_func(int num1, int num2, int *res)
{
    /* %p with a void * cast prints the pointer portably */
    printf("func: a = %d b = %d ptr c = %p\n",num1,num2,(void *)res);
    *res=num1/num2;
    printf("func: res = %d\n",*res);
}</pre><p class="p">To compile this C function and C++ main program, use the following commands:</p><pre xml:space="preserve">$ nvc -c cp2c_func.c
$ nvc++ cp2c_main.C cp2c_func.o</pre><p class="p">Executing the resulting a.out file should produce output similar to the following; the pointer value varies from run to run:</p><pre xml:space="preserve">main: a = 8 b = 2 ptr c = 0xbffffb94
func: a = 8 b = 2 ptr c = 0xbffffb94
func: res = 4
main: res = 4</pre></div> </div> </div> <div class="topic concept nested2" id="intr-lang-exam-c-cpp"><a name="intr-lang-exam-c-cpp" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-exam-c-cpp" name="intr-lang-exam-c-cpp" shape="rect">14.8.4. 
Example – C Calling C++</a></h3> <div class="body conbody"> <p class="p">The example in <a class="xref" href="index.html#intr-lang-exam-c-cpp__intr-lang-c-cpp-exam" shape="rect">C Main Program c2cp_main.c Calling a C++ Function</a> shows a C main program that calls the C++ function shown in <a class="xref" href="index.html#intr-lang-exam-c-cpp__intr-lang-c-cpp-extern-exam" shape="rect">Simple C++ Function c2cp_func.C with Extern C</a>. </p> <div class="example" id="intr-lang-exam-c-cpp__intr-lang-c-cpp-exam"><a name="intr-lang-exam-c-cpp__intr-lang-c-cpp-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">C Main Program c2cp_main.c Calling a C++ Function</h4><pre xml:space="preserve">extern void c2cp_func(int a, int b, int *c);
#include &lt;stdio.h&gt;

int main(void)
{
    int a,b,c;
    a=8;
    b=2;
    /* %p with a void * cast prints the pointer portably */
    printf("main: a = %d b = %d ptr c = %p\n",a,b,(void *)&amp;c);
    c2cp_func(a,b,&amp;c);
    printf("main: res = %d\n",c);
    return 0;
}</pre></div> <div class="example" id="intr-lang-exam-c-cpp__intr-lang-c-cpp-extern-exam"><a name="intr-lang-exam-c-cpp__intr-lang-c-cpp-extern-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Simple C++ Function c2cp_func.C with Extern C</h4><pre xml:space="preserve">#include &lt;iostream&gt;
using namespace std;

extern "C" void c2cp_func(int num1,int num2,int *res)
{
    cout &lt;&lt; "func: a = "&lt;&lt;num1&lt;&lt;" b = "&lt;&lt;num2&lt;&lt;" ptr c = "&lt;&lt;res&lt;&lt;endl;
    *res=num1/num2;
    // print the result itself, not the pointer
    cout &lt;&lt; "func: res = "&lt;&lt;*res&lt;&lt;endl;
}</pre><p class="p">To compile this C main program and C++ function, use the following commands:</p><pre xml:space="preserve">$ nvc -c c2cp_main.c
$ nvc++ c2cp_main.o c2cp_func.C</pre><p class="p">Executing the resulting a.out file should produce output similar to the following; the pointer value varies from run to run:</p><pre xml:space="preserve">main: a = 8 b = 2 ptr c = 0xbffffb94
func: a = 8 b = 2 ptr c = 0xbffffb94
func: res = 4
main: res = 4</pre><div class="note note"><span class="notetitle">Note:</span> You cannot use the extern "C" form of declaration for an object’s member 
functions. </div> </div> </div> </div> <div class="topic concept nested2" id="intr-lang-exam-fort-cpp"><a name="intr-lang-exam-fort-cpp" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-exam-fort-cpp" name="intr-lang-exam-fort-cpp" shape="rect">14.8.5. Example – Fortran Calling C++</a></h3> <div class="body conbody"> <p class="p">The Fortran main program shown in <a class="xref" href="index.html#intr-lang-exam-fort-cpp__lang-fort-cpp-exam" shape="rect">Fortran Main Program f2cp_main.f calling a C++ function</a> calls the C++ function shown in <a class="xref" href="index.html#intr-lang-exam-fort-cpp__lang-fort-cpp-exam-subr" shape="rect">C++ function f2cp_func.C</a>. </p> <p class="p">Notice:</p> <ul class="ul"> <li class="li">Each argument is defined as a pointer in the C++ function, since Fortran passes by reference. </li> <li class="li">The C++ function name uses all lower-case and a trailing "_":</li> </ul> <div class="example" id="intr-lang-exam-fort-cpp__lang-fort-cpp-exam"><a name="intr-lang-exam-fort-cpp__lang-fort-cpp-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Fortran Main Program f2cp_main.f calling a C++ function</h4><pre xml:space="preserve">
      logical*1 bool1
      character letter1
      integer*4 numint1, numint2
      real numfloat1
      double precision numdoub1
      integer*2 numshor1
      external f2cp_func
      call f2cp_func (bool1, letter1, numint1,
     +     numint2, numfloat1, numdoub1, numshor1)
      write( *, "(L2, A2, I5, I5, F6.1, F6.1, I5)")
     +     bool1, letter1, numint1, numint2, numfloat1,
     +     numdoub1, numshor1
      end</pre></div> <div class="example" id="intr-lang-exam-fort-cpp__lang-fort-cpp-exam-subr"><a name="intr-lang-exam-fort-cpp__lang-fort-cpp-exam-subr" shape="rect"> <!-- --></a><h4 class="title sectiontitle">C++ function f2cp_func.C</h4><pre xml:space="preserve">
#define TRUE 0xff
#define FALSE 0

extern "C" {
  /* Every Fortran argument arrives as a pointer; the final int receives
     the length of the character argument letter1. */
  extern void f2cp_func_(char *bool1, char *letter1, int *numint1,
                         int *numint2, float *numfloat1, double *numdoub1,
                         short *numshort1, int len_letter1)
  {
    *bool1 = TRUE;
    *letter1 = 'v';
    *numint1 = 11;
    *numint2 = -44;
    *numfloat1 = 39.6;
    *numdoub1 = 39.2;
    *numshort1 = 981;
  }
}</pre><p class="p">Assuming the Fortran program is in a file <span class="ph filepath">f2cp_main.f</span>, and the C++ function is in a file <span class="ph filepath">f2cp_func.C</span>, create an executable using the following command lines: </p><pre xml:space="preserve">
$ nvc++ -c f2cp_func.C
$ nvfortran f2cp_func.o f2cp_main.f -c++libs</pre><p class="p">Executing the <span class="ph filepath">a.out</span> file should produce the following output: </p><pre xml:space="preserve">T v 11 -44 39.6 39.2 981</pre></div> </div> </div> <div class="topic concept nested2" id="intr-lang-exam-cpp-fort"><a name="intr-lang-exam-cpp-fort" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#intr-lang-exam-cpp-fort" name="intr-lang-exam-cpp-fort" shape="rect">14.8.6. Example – C++ Calling Fortran</a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="index.html#intr-lang-exam-cpp-fort__intr-lang-cpp-fort-exam-subr" shape="rect">Fortran Subroutine cp2f_func.f</a> shows a Fortran subroutine called by the C++ main program shown in <a class="xref" href="index.html#intr-lang-exam-cpp-fort__intr-lang-cpp-fort-exam" shape="rect">C++ main program cp2f_main.C</a>. Notice that each argument in the call uses the <samp class="ph codeph">&amp;</samp> operator to pass by reference. 
Also notice that the call to the Fortran subroutine uses all lower-case and a trailing "<samp class="ph codeph">_</samp>": </p> <div class="example" id="intr-lang-exam-cpp-fort__intr-lang-cpp-fort-exam"><a name="intr-lang-exam-cpp-fort__intr-lang-cpp-fort-exam" shape="rect"> <!-- --></a><h4 class="title sectiontitle">C++ main program cp2f_main.C</h4><pre xml:space="preserve">
#include &lt;iostream&gt;
using namespace std;

extern "C" {
  extern void cp2f_func_(char *, char *, int *, int *,
                         float *, double *, short *);
}

int main()
{
  char bool1, letter1;
  int numint1, numint2;
  float numfloat1;
  double numdoub1;
  short numshor1;

  /* call the Fortran subroutine by its lower-case name with a trailing "_" */
  cp2f_func_(&amp;bool1, &amp;letter1, &amp;numint1, &amp;numint2, &amp;numfloat1,
             &amp;numdoub1, &amp;numshor1);
  cout &lt;&lt; " bool1 = ";
  bool1 ? cout &lt;&lt; "TRUE " : cout &lt;&lt; "FALSE ";
  cout &lt;&lt; endl;
  cout &lt;&lt; " letter1 = " &lt;&lt; letter1 &lt;&lt; endl;
  cout &lt;&lt; " numint1 = " &lt;&lt; numint1 &lt;&lt; endl;
  cout &lt;&lt; " numint2 = " &lt;&lt; numint2 &lt;&lt; endl;
  cout &lt;&lt; " numfloat1 = " &lt;&lt; numfloat1 &lt;&lt; endl;
  cout &lt;&lt; " numdoub1 = " &lt;&lt; numdoub1 &lt;&lt; endl;
  cout &lt;&lt; " numshor1 = " &lt;&lt; numshor1 &lt;&lt; endl;
  return 0;
}</pre></div> <div class="example" id="intr-lang-exam-cpp-fort__intr-lang-cpp-fort-exam-subr"><a name="intr-lang-exam-cpp-fort__intr-lang-cpp-fort-exam-subr" shape="rect"> <!-- --></a><h4 class="title sectiontitle">Fortran Subroutine cp2f_func.f</h4><pre xml:space="preserve">
      subroutine cp2f_func ( bool1, letter1, numint1,
     +     numint2, numfloat1, numdoub1, numshor1)
      logical*1 bool1
      character letter1
      integer numint1, numint2
      double precision numdoub1
      real numfloat1
      integer*2 numshor1
      bool1 = .true. ; letter1 = "v"
      numint1 = 11 ; numint2 = -44
      numdoub1 = 902 ; numfloat1 = 39.6 ; numshor1 = 299
      return
      end</pre><p class="p"> To compile this Fortran subroutine and C++ program, use the following command lines: </p><pre class="pre screen" xml:space="preserve">
$ nvfortran -c cp2f_func.f
$ nvc++ cp2f_func.o cp2f_main.C -fortranlibs</pre><p class="p">Executing this C++ main should produce the following output:</p><pre class="pre msgblock" xml:space="preserve">
 bool1 = TRUE
 letter1 = v
 numint1 = 11
 numint2 = -44
 numfloat1 = 39.6
 numdoub1 = 902
 numshor1 = 299</pre><div class="note note"><span class="notetitle">Note:</span> You must explicitly link in the NVFORTRAN runtime support libraries when linking nvfortran-compiled program units into C++ or C main programs. </div> </div> </div> </div> </div> </div> <div class="topic concept nested0" id="prog-64bits"><a name="prog-64bits" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#prog-64bits" name="prog-64bits" shape="rect">15. Programming Considerations for 64-Bit Environments</a></h2> <div class="body conbody"> <p class="p"><span class="ph"> NVIDIA provides 64-bit compilers for 64-bit Linux operating systems running on x86-64 (Linux_x86_64), OpenPOWER (Linux_ppc64le) and Arm Server (Linux_aarch64) architectures. You can use these compilers to create programs that use 64-bit memory addresses. </span><span class="ph"> The GNU toolchain on 64-bit Linux systems implements an option to control 32-bit vs 64-bit code generation, as described in <a class="xref" href="index.html#prog-64bits-static-data-linux" shape="rect">Large Static Data in Linux</a>. </span> This section describes the specifics of how to use the NVIDIA compilers to make use of 64-bit memory addressing. </p> <div class="note note"><span class="notetitle">Note:</span> The NVIDIA HPC compilers themselves are 64-bit applications that can run only on 64-bit CPUs running 64-bit operating systems. 
</div> <p class="p"> This section describes how to use the following options related to 64-bit programming. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="void" border="0" rules="none"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-fPIC</span></td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-mcmodel=medium</span></td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mlarge_arrays</span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-i8</span></td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"><span class="ph filepath">-Mlargeaddressaware</span></td> </tr> </tbody> </table> </div> </div> <div class="topic concept nested1" id="prog-64bits-data-types"><a name="prog-64bits-data-types" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-data-types" name="prog-64bits-data-types" shape="rect">15.1. Data Types in the 64-Bit Environment</a></h3> <div class="body conbody"> <p class="p"> The size of some data types can differ across 64-bit environments. This section describes the major differences. </p> </div> <div class="topic concept nested2" id="prog-64bits-data-types-c"><a name="prog-64bits-data-types-c" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-data-types-c" name="prog-64bits-data-types-c" shape="rect">15.1.1. C++ and C Data Types</a></h3> <div class="body conbody"> <p class="p">On 64-bit Linux operating systems, the size of an int is 4 bytes, a long is 8 bytes, a long long is 8 bytes, and a pointer is 8 bytes. 
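</p> <p class="p">As a quick check of these sizes, the following minimal sketch uses C11 static assertions, so a mismatch is reported at compile time rather than at run time. It assumes a C11-capable compiler and an LP64 target such as 64-bit Linux. The function name <samp class="ph codeph">pointer_bits</samp> is hypothetical and appears here only for illustration; it is not part of any NVIDIA library.</p>

```c
/* Compile-time checks of the data type sizes described above.
   Assumption: an LP64 target such as 64-bit Linux. On a platform with a
   different data model these assertions fail at compile time by design. */
_Static_assert(sizeof(int) == 4, "int is expected to be 4 bytes");
_Static_assert(sizeof(long) == 8, "long is expected to be 8 bytes");
_Static_assert(sizeof(long long) == 8, "long long is expected to be 8 bytes");
_Static_assert(sizeof(void *) == 8, "a pointer is expected to be 8 bytes");

/* Hypothetical helper (illustration only): reports the pointer width in
   bits so that a program can log it at run time. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * 8);
}
```

<p class="p">If the file compiles without errors, the sizes match those listed above. Code that stores a pointer in an integer should use a long or long long rather than an int, because an int is only half the width of a pointer in this environment. 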
</p> </div> </div> <div class="topic concept nested2" id="prog-64bits-data-types-fort"><a name="prog-64bits-data-types-fort" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-data-types-fort" name="prog-64bits-data-types-fort" shape="rect">15.1.2. Fortran Data Types</a></h3> <div class="body conbody"> <p class="p">In Fortran, the default size of the INTEGER type is 4 bytes. The <span class="ph filepath">-i8</span> compiler option may be used to make the default size of all INTEGER data in the program 8 bytes. </p> <p class="p">When using the <span class="ph filepath">-Mlarge_arrays</span> option, described in <a class="xref" href="index.html#prog-64bits-array-indexing" shape="rect">64-Bit Array Indexing</a>, any 4-byte INTEGER variables that are used to index arrays are silently promoted by the compiler to 8 bytes. This promotion can lead to unexpected consequences, so 8-byte INTEGER variables are recommended for array indexing when using the option <span class="ph filepath">-Mlarge_arrays</span>. </p> </div> </div> </div> <div class="topic concept nested1" id="prog-64bits-static-data-linux"><a name="prog-64bits-static-data-linux" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-static-data-linux" name="prog-64bits-static-data-linux" shape="rect">15.2. Large Static Data in Linux</a></h3> <div class="body conbody"> <p class="p">64-bit Linux operating systems support two different memory models. The default model used by the NVIDIA HPC compilers on Linux_x86_64 and Linux_aarch64 targets is the small memory model, which can be specified using -mcmodel=small. This is the 32-bit model, which limits the size of code plus statically allocated data, including system and user libraries, to 2GB. The medium memory model, specified by -mcmodel=medium, allows combined code and static data areas (.text and .bss sections) larger than 2GB and is the default on Linux_ppc64le targets. 
The <span class="ph filepath">-mcmodel=medium </span>option must be used on both the compile command and the link command in order to take effect. </p> <p class="p">There are implications to using <span class="ph filepath">-mcmodel=medium</span>. The generated code requires increased addressing overhead to support the large data range. This can affect performance, though the compilers seek to minimize the added overhead through careful instruction selection and optimization. </p> <p class="p">Linux_aarch64 does not support -mcmodel=medium. If the medium model is specified on the command-line, the compiler driver will automatically select the large model. </p> </div> </div> <div class="topic concept nested1" id="prog-64bits-dynamic-alloc-data"><a name="prog-64bits-dynamic-alloc-data" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-dynamic-alloc-data" name="prog-64bits-dynamic-alloc-data" shape="rect">15.3. Large Dynamically Allocated Data</a></h3> <div class="body conbody"> <p class="p">Dynamically allocated data objects in programs compiled by the NVIDIA HPC compilers can be larger than 2GB. No special compiler options are required to enable this functionality. The size of the allocation is only limited by the system. <span class="ph">However, to correctly access dynamically allocated arrays with more than 2G elements you should use the <span class="ph filepath">-Mlarge_arrays</span> option, described in the following section.</span></p> </div> </div> <div class="topic concept nested1" id="prog-64bits-array-indexing"><a name="prog-64bits-array-indexing" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-array-indexing" name="prog-64bits-array-indexing" shape="rect">15.4. 64-Bit Array Indexing</a></h3> <div class="body conbody"> <p class="p">The NVIDIA Fortran compilers provide an option, <span class="ph filepath">-Mlarge_arrays</span>, that enables 64-bit indexing of arrays. 
This means that, as necessary, 64-bit INTEGER constants and variables are used to index arrays. </p> <div class="note note"><span class="notetitle">Note:</span> In the presence of <span class="ph filepath">-Mlarge_arrays</span>, the compiler may silently promote 32-bit integers to 64 bits, which can have unexpected side effects. </div> <p class="p">On 64-bit Linux, the <span class="ph filepath">-Mlarge_arrays</span> option also enables single static data objects larger than 2 GB. This option is the default in the presence of <span class="ph filepath">-mcmodel=medium</span>. </p> </div> </div> <div class="topic concept nested1" id="prog-64bits-comp-opts"><a name="prog-64bits-comp-opts" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-comp-opts" name="prog-64bits-comp-opts" shape="rect">15.5. Compiler Options for 64-bit Programming</a></h3> <div class="body conbody"> <p class="p">The usual switches that apply to 64-bit programmers seeking to increase the data range of their applications are in the following table. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 30. 64-bit Compiler Options</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e21919" rowspan="1" colspan="1">Option</th> <th class="entry" valign="top" width="NaN%" id="d54e21922" rowspan="1" colspan="1">Purpose</th> <th class="entry" valign="top" width="NaN%" id="d54e21925" rowspan="1" colspan="1">Considerations</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e21919" rowspan="1" colspan="1">-mcmodel=medium</td> <td class="entry" valign="top" width="NaN%" headers="d54e21922" rowspan="1" colspan="1">Allow for data declarations larger than 2GB. 
Default on Linux_ppc64le.</td> <td class="entry" valign="top" width="NaN%" headers="d54e21925" rowspan="1" colspan="1">Linux_aarch64 does not support -mcmodel=medium. If the medium model is specified on the command-line, the compiler driver will automatically select the large model. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e21919" rowspan="1" colspan="1">-Mlarge_arrays</td> <td class="entry" valign="top" width="NaN%" headers="d54e21922" rowspan="1" colspan="1">Perform all array-location-to-address calculations using 64-bit integer arithmetic. </td> <td class="entry" valign="top" width="NaN%" headers="d54e21925" rowspan="1" colspan="1">Slightly slower execution. <span class="ph">Is implicit with -⁠mcmodel=medium. Can be used with option -⁠mcmodel=small. </span></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e21919" rowspan="1" colspan="1">-fpic</td> <td class="entry" valign="top" width="NaN%" headers="d54e21922" rowspan="1" colspan="1">Position independent code. Necessary for shared libraries.</td> <td class="entry" valign="top" width="NaN%" headers="d54e21925" rowspan="1" colspan="1">Dynamic linking restricted to a 32-bit offset. External symbol references should refer to other shared lib routines, rather than the program calling them. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e21919" rowspan="1" colspan="1">-i8</td> <td class="entry" valign="top" width="NaN%" headers="d54e21922" rowspan="1" colspan="1">All INTEGER functions, data, and constants not explicitly declared INTEGER*4 are assumed to be INTEGER*8. </td> <td class="entry" valign="top" width="NaN%" headers="d54e21925" rowspan="1" colspan="1">Users should take care to explicitly declare INTEGER functions as INTEGER*4.</td> </tr> </tbody> </table> </div> <p class="p">The following table summarizes the limits of these programming models under the specified conditions. 
The compiler options you use vary by processor. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 31. Effects of Options on Memory and Array Sizes</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" rowspan="2" colspan="2" valign="top" id="d54e22036">Condition</th> <th class="entry" colspan="2" valign="top" id="d54e22039" rowspan="1">Addr. Math</th> <th class="entry" colspan="3" valign="top" id="d54e22042" rowspan="1">Max Size Gbytes</th> </tr> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e22048" rowspan="1" colspan="1">A</th> <th class="entry" valign="top" width="NaN%" id="d54e22051" rowspan="1" colspan="1">I</th> <th class="entry" valign="top" width="NaN%" id="d54e22054" rowspan="1" colspan="1">AS</th> <th class="entry" valign="top" width="NaN%" id="d54e22057" rowspan="1" colspan="1">DS</th> <th class="entry" valign="top" width="NaN%" id="d54e22060" rowspan="1" colspan="1">TS</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" colspan="2" valign="top" headers="d54e22036 d54e22048 d54e22051" rowspan="1">64-bit addr limited by option <span class="ph filepath">-⁠mcmodel=small</span></td> <td class="entry" valign="top" width="NaN%" headers="d54e22039 d54e22054" rowspan="1" colspan="1">64</td> <td class="entry" valign="top" width="NaN%" headers="d54e22039 d54e22057" rowspan="1" colspan="1">32</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042 d54e22060" rowspan="1" colspan="1">2</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042" rowspan="1" colspan="1">2</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042" rowspan="1" colspan="1">2</td> </tr> <tr class="row"> <td class="entry" colspan="2" valign="top" headers="d54e22036 d54e22048 d54e22051" rowspan="1"><span class="ph filepath">-⁠fpic</span><em class="ph i"> 
incompatible with </em><span class="ph filepath">-mcmodel=medium</span></td> <td class="entry" valign="top" width="NaN%" headers="d54e22039 d54e22054" rowspan="1" colspan="1">64</td> <td class="entry" valign="top" width="NaN%" headers="d54e22039 d54e22057" rowspan="1" colspan="1">32</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042 d54e22060" rowspan="1" colspan="1">2</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042" rowspan="1" colspan="1">2</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042" rowspan="1" colspan="1">2</td> </tr> <tr class="row"> <td class="entry" colspan="2" valign="top" headers="d54e22036 d54e22048 d54e22051" rowspan="1">Enable full support for 64-bit data addressing</td> <td class="entry" valign="top" width="NaN%" headers="d54e22039 d54e22054" rowspan="1" colspan="1">64</td> <td class="entry" valign="top" width="NaN%" headers="d54e22039 d54e22057" rowspan="1" colspan="1">64</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042 d54e22060" rowspan="1" colspan="1">&gt;2</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042" rowspan="1" colspan="1">&gt;2</td> <td class="entry" valign="top" width="NaN%" headers="d54e22042" rowspan="1" colspan="1">&gt;2</td> </tr> </tbody> </table> </div> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">A</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Address Type – size in bits of data used for address calculations (64 bits). </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">I</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Index Arithmetic – size in bits of data used to index into arrays and other aggregate data structures. 
If 32-bit, total range of any single data object is limited to 2GB. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">AS</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Maximum Array Size - the maximum size in gigabytes of any single data object.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">DS</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1"> Maximum Data Size - max size in gigabytes combined of all data objects in .bss</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">TS</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Maximum Total Size - max size in gigabytes, in aggregate, of all executable code and data objects in a running program.</td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="prog-64bits-array-limit"><a name="prog-64bits-array-limit" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-array-limit" name="prog-64bits-array-limit" shape="rect">15.6. Practical Limitations of Large Array Programming</a></h3> <div class="body conbody"> <p class="p">The 64-bit addressing capability of 64-bit Linux environments can cause unexpected issues when data sizes are enlarged significantly. The following table describes the most common occurrences of practical limitations of large array programming. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 32. 
64-Bit Limitations</span></caption> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">array initialization</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Initializing a large array with a data statement may result in very large assembly and object files, where a line of assembler source is required for each element in the initialized array. Compilation and linking can be very time consuming as well. To avoid this issue, consider initializing large arrays in a loop at runtime rather than in a data statement. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">stack space</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Stack space can be a problem for data that is stack-based. On Linux, you increase the stack size in your shell environment. If setting stacksize to unlimited does not provide a large enough stack, try setting the size explicitly: <pre xml:space="preserve">limit stacksize new_size    # in csh</pre><pre xml:space="preserve">ulimit -s new_size          # in bash</pre></td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">page swapping</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">If your executable is much larger than the physical size of memory, page swapping can cause it to run dramatically slower; it may even fail. This is not a compiler problem. Try smaller data sets to determine whether or not a problem is due to page thrashing. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">configured space</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Be sure your Linux system is configured with swap space sufficiently large to support the data sets used in your application(s). If the combined memory and swap space is not sufficiently large, your application will likely encounter a segmentation fault at runtime. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">support for large address offsets in object file format</td> <td class="entry" valign="top" width="NaN%" rowspan="1" colspan="1">Arrays that are not dynamically allocated are limited by how the compiler can express the ‘distance’ between them when generating code. A field in the object file stores this ‘distance’ value, which is limited to 32 bits on Linux with -mcmodel=small. It is 64 bits on Linux with -mcmodel=medium. <div class="note note"><span class="notetitle">Note:</span> Without the 64-bit offset support in the object file format, large arrays cannot be declared statically, or locally on the stack. </div> </td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="prog-64bits-lrg-ary-med-mem-c"><a name="prog-64bits-lrg-ary-med-mem-c" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-lrg-ary-med-mem-c" name="prog-64bits-lrg-ary-med-mem-c" shape="rect">15.7. 
Medium Memory Model and Large Array in C</a></h3> <div class="body conbody"> <p class="p">Consider the following example, where the aggregate size of the arrays exceeds 2GB.</p> <div class="example" id="prog-64bits-lrg-ary-med-mem-c__med-mem-mod-lrg-ary-c-exam"><a name="prog-64bits-lrg-ary-med-mem-c__med-mem-mod-lrg-ary-c-exam" shape="rect"> <!-- --></a><h3 class="title sectiontitle">Medium Memory Model and Large Array in C</h3><pre xml:space="preserve">
% cat bigadd.c
#include &lt;stdio.h&gt;
#define SIZE 600000000 /* &gt; 2GB/4 */
static float a[SIZE], b[SIZE];
int main()
{
  long long i, n, m;
  float c[SIZE]; /* goes on stack */
  n = SIZE;
  m = 0;
  for (i = 0; i &lt; n; i += 10000) {
    a[i] = i + 1;
    b[i] = 2.0 * (i + 1);
    c[i] = a[i] + b[i];
    m = i;
  }
  printf("a[0]=%g b[0]=%g c[0]=%g\n", a[0], b[0], c[0]);
  printf("m=%lld a[%lld]=%g b[%lld]=%g c[%lld]=%g\n", m, m, a[m], m, b[m], m, c[m]);
  return 0;
}</pre><pre xml:space="preserve">% nvc -mcmodel=medium -o bigadd bigadd.c </pre><p class="p">When SIZE is greater than 2G/4, and the arrays are of type float with 4 bytes per element, the size of each array is greater than 2GB. With nvc, using the -mcmodel=medium switch, a static data object can be larger than 2GB. If you execute the program with the default stack size in your environment, you may see the following: </p><pre xml:space="preserve">
% bigadd
Segmentation fault</pre><p class="p">Execution fails because the stack is not large enough to hold the array c. You can most likely correct this error by using the <samp class="ph codeph">limit stacksize</samp> command to reset the stack size in your environment: </p><pre xml:space="preserve">% limit stacksize 3000M </pre><div class="note note"><span class="notetitle">Note:</span> The command <samp class="ph codeph">limit stacksize unlimited</samp> probably does not provide as large a stack as the one used in <a class="xref" href="index.html#prog-64bits-lrg-ary-med-mem-c__med-mem-mod-lrg-ary-c-exam" shape="rect">this example</a>. 
</div><pre xml:space="preserve">
% bigadd
a[0]=1 b[0]=2 c[0]=3
m=599990000 a[599990000]=5.9999e+08 b[599990000]=1.19998e+09 c[599990000]=1.79997e+09</pre></div> </div> </div> <div class="topic concept nested1" id="prog-64bits-lrg-ary-med-mem-fort"><a name="prog-64bits-lrg-ary-med-mem-fort" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-lrg-ary-med-mem-fort" name="prog-64bits-lrg-ary-med-mem-fort" shape="rect">15.8. Medium Memory Model and Large Array in Fortran</a></h3> <div class="body conbody"> <p class="p"> The following example works with the NVFORTRAN compiler. It uses 64-bit addresses and index arithmetic when the <span class="keyword option">-mcmodel=medium</span> option is used. </p> <p class="p">Consider the following example:</p> <div class="example"> <h3 class="title sectiontitle">Medium Memory Model and Large Array in Fortran</h3><pre class="pre screen" xml:space="preserve">
% cat mat.f
program mat
  integer i, j, k, size, l, m, n
  parameter (size=16000) ! &gt;2GB
  parameter (m=size,n=size)
  real*8 a(m,n),b(m,n),c(m,n),d

  do i = 1, m
    do j = 1, n
      a(i,j)=10000.0D0*dble(i)+dble(j)
      b(i,j)=20000.0D0*dble(i)+dble(j)
    enddo
  enddo

  ! d is a per-thread temporary, so it is made private to avoid a data race
!$omp parallel private(d)
!$omp do
  do i = 1, m
    do j = 1, n
      c(i,j) = a(i,j) + b(i,j)
    enddo
  enddo
!$omp do
  do i=1,m
    do j = 1, n
      d = 30000.0D0*dble(i)+dble(j)+dble(j)
      if (d .ne. c(i,j)) then
        print *,"err i=",i,"j=",j
        print *,"c(i,j)=",c(i,j)
        print *,"d=",d
        stop
      endif
    enddo
  enddo
!$omp end parallel

  print *, "M =",M,", N =",N
  print *, "c(M,N) = ", c(m,n)
end</pre><p class="p"> When compiled with the NVFORTRAN compiler using <span class="keyword option">-mcmodel=medium</span>: </p><pre class="pre screen" xml:space="preserve">
% nvfortran -Mfree -mp -o mat mat.f -i8 -mcmodel=medium
% setenv OMP_NUM_THREADS 2
% mat
M = 16000 , N = 16000
c(M,N) = 480032000.0000000</pre></div> </div> </div> <div class="topic concept nested1" id="prog-64bits-lrg-ary-sml-mem-fort"><a name="prog-64bits-lrg-ary-sml-mem-fort" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#prog-64bits-lrg-ary-sml-mem-fort" name="prog-64bits-lrg-ary-sml-mem-fort" shape="rect">15.9. Large Array and Small Memory Model in Fortran</a></h3> <div class="body conbody"> <p class="p">The following example uses large, dynamically allocated arrays. The code is divided into a main program and a subroutine so that you could put the subroutine into a shared library. Allocating large arrays dynamically reduces the size of the executable and saves the time otherwise spent initializing static data. 
</p> <div class="example"> <h3 class="title sectiontitle">Large Array and Small Memory Model in Fortran</h3><pre xml:space="preserve">% cat mat_allo.f90</pre><pre xml:space="preserve">program mat_allo integer i, j integer size, m, n parameter (size=16000) parameter (m=size,n=size) double precision, allocatable::a(:,:),b(:,:),c(:,:) allocate(a(m,n), b(m,n), c(m,n)) do i = 100, m, 1 do j = 100, n, 1 a(i,j) = 10000.0D0 * dble(i) + dble(j) b(i,j) = 20000.0D0 * dble(i) + dble(j) enddo enddo call mat_add(a,b,c,m,n) print *, "M =",m,",N =",n print *, "c(M,N) = ", c(m,n) end </pre><pre xml:space="preserve">subroutine mat_add(a,b,c,m,n) integer m, n, i, j double precision a(m,n),b(m,n),c(m,n) do i = 1, m do j = 1, n c(i,j) = a(i,j) + b(i,j) enddo enddo return end</pre><pre xml:space="preserve">% nvfortran -⁠o mat_allo mat_allo.f90 -⁠i8 -⁠Mlarge_arrays -mp -fast</pre></div> </div> </div> </div> <div class="topic concept nested0" id="inline-asm-intrin-c-cpp"><a name="inline-asm-intrin-c-cpp" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#inline-asm-intrin-c-cpp" name="inline-asm-intrin-c-cpp" shape="rect">16. C++ and C Inline Assembly<span class="ph"> and Intrinsics</span></a></h2> <div class="body conbody"> <p class="p">The examples in this section are shown using x86-64 assembly instructions. Inline assembly is supported on OpenPOWER and Arm Server platforms as well, but is not documented in detail in this section. </p> </div> <div class="topic concept nested1" id="inline-asm"><a name="inline-asm" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm" name="inline-asm" shape="rect">16.1. Inline Assembly</a></h3> <div class="body conbody"> <p class="p">Inline Assembly lets you specify machine instructions inside a "C" function. 
The format for an inline assembly instruction is this: </p><pre xml:space="preserve">{ asm | __asm__ } ("string");</pre><p class="p">The asm statement begins with the <em class="ph i">asm</em> or <em class="ph i">__asm__</em> keyword. The __asm__ keyword is typically used in header files that may be included in ISO "C" programs. </p> <p class="p"><em class="ph i">string</em> is one or more machine-specific instructions separated by a semicolon (<em class="ph i">;</em>) or newline (<em class="ph i">\n</em>) character. These instructions are inserted directly into the compiler’s assembly-language output for the enclosing function. </p> <p class="p">Some simple asm statements are:</p><pre xml:space="preserve">
asm ("cli");
asm ("sti");</pre><p class="p">These asm statements disable and enable system interrupts, respectively.</p> <p class="p">In the following example, the eax register is set to zero.</p><pre xml:space="preserve">
asm( "pushl %eax\n\t"
     "movl $0, %eax\n\t"
     "popl %eax");</pre><p class="p">Notice that eax is pushed on the stack so that it is not clobbered. When the statement is done with eax, it is restored with the popl instruction. </p> <p class="p">Typically a program uses macros that enclose asm statements. The following two examples use the interrupt constructs created previously in this section: </p><pre xml:space="preserve">
#define disableInt __asm__ ("cli");
#define enableInt __asm__ ("sti");</pre></div> </div> <div class="topic concept nested1" id="inline-asm-extended"><a name="inline-asm-extended" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-extended" name="inline-asm-extended" shape="rect">16.2. Extended Inline Assembly</a></h3> <div class="body conbody"> <p class="p"><a class="xref" href="index.html#inline-asm" shape="rect">Inline Assembly</a> explains how to use inline assembly to specify machine-specific instructions inside a "C" function. 
This approach works well for simple machine operations such as disabling and enabling system interrupts. However, inline assembly has three distinct limitations: </p> <ol class="ol"> <li class="li">The programmer must choose the registers required by the inline assembly.</li> <li class="li">To prevent register clobbering, the inline assembly must include push and pop code for registers that get modified by the inline assembly. </li> <li class="li">There is no easy way to access stack variables in an inline assembly statement.</li> </ol> <p class="p"><em class="ph i">Extended Inline Assembly</em> was created to address these limitations. The format for extended inline assembly, also known as <em class="ph i">extended asm</em>, is as follows: </p><pre xml:space="preserve">{ asm | __asm__ } [ volatile | __volatile__ ] ("string" [: [output operands]] [: [input operands]] [: [clobber list]]);</pre><ul class="ul"> <li class="li">Extended asm statements begin with the <em class="ph i">asm</em> or <em class="ph i">__asm__</em> keyword. Typically the <em class="ph i">__asm__</em> keyword is used in header files that may be included by ISO "C" programs. </li> <li class="li">An optional <em class="ph i">volatile</em> or <em class="ph i">__volatile__</em> keyword may appear after the <em class="ph i">asm</em> keyword. This keyword instructs the compiler not to delete the asm statement, move it significantly, or combine it with any other asm statement. Like __asm__, the __volatile__ keyword is typically used with header files that may be included by ISO "C" programs. </li> <li class="li">"<em class="ph i">string</em>" is one or more machine-specific instructions separated with a semicolon (<em class="ph i">;</em>) or newline (<em class="ph i">\n</em>) character. The string can also contain operands specified in the <em class="ph i">[output operands]</em>, <em class="ph i">[input operands]</em>, and <em class="ph i">[clobber list]</em>. 
The instructions are inserted directly into the compiler's assembly-language output for the enclosing function. </li> <li class="li">The <em class="ph i">[output operands]</em>, <em class="ph i">[input operands]</em>, and <em class="ph i">[clobber list]</em> items each describe the effect of the instruction for the compiler. For example: <pre xml:space="preserve">asm( "movl %1, %%eax\n"
     "movl %%eax, %0"
     : "=r" (x)
     : "r" (y)
     : "%eax" );</pre><p class="p">where</p> <ul class="sl simple"> <li class="sli">"=r" (x) is an output operand.</li> <li class="sli">"r" (y) is an input operand.</li> <li class="sli">"%eax" is the clobber list consisting of one register, "%eax".</li> </ul> <p class="p">The notation for the output and input operands is a constraint string surrounded by quotes, followed by an expression surrounded by parentheses. The constraint string describes how the input and output operands are used in the asm "string". For example, "r" tells the compiler that the operand is a register. The "=" tells the compiler that the operand is write-only, which means that a value is stored in an output operand's expression at the end of the asm statement. </p> <p class="p">Each operand is referenced in the asm "string" by a percent "%" and its number. The first operand is number 0, the second is number 1, the third is number 2, and so on. In the preceding example, "%0" references the output operand, and "%1" references the input operand. The asm "string" also contains "%%eax", which references machine register "%eax". Hard-coded registers like "%eax" should be specified in the clobber list to prevent conflicts with other instructions in the compiler's assembly-language output. <em class="ph i">[output operands]</em>, <em class="ph i">[input operands]</em>, and <em class="ph i">[clobber list]</em> items are described in more detail in the following sections. 
</p> </li> </ul> </div> <div class="topic concept nested2" id="inline-asm-output-operands"><a name="inline-asm-output-operands" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-output-operands" name="inline-asm-output-operands" shape="rect">16.2.1. Output Operands</a></h3> <div class="body conbody"> <p class="p">The <em class="ph i">[output operands]</em> are an optional list of output constraint and expression pairs that specify the result(s) of the asm statement. An output constraint is a string that specifies how a result is delivered to the expression. For example, "=r" (x) says the output operand is a write-only register that stores its value in the "C" variable x at the end of the asm statement. An example follows: </p><pre xml:space="preserve">int x;
void example()
{
    asm( "movl $0, %0" : "=r" (x) );
}</pre><p class="p">The previous example assigns 0 to the "C" variable x. For the function in this example, the compiler produces the following assembly. If you want to produce an assembly listing, compile the example with the nvc <span class="ph filepath">-S</span> compiler option: </p><pre xml:space="preserve">example:
..Dcfb0:
        pushq   %rbp
..Dcfi0:
        movq    %rsp, %rbp
..Dcfi1:
..EN1:
## lineno: 8
        movl    $0, %eax
        movl    %eax, x(%rip)
## lineno: 0
        popq    %rbp
        ret</pre><p class="p">In the generated assembly shown, notice that the compiler generated two instructions for the asm statement. The compiler generated "<em class="ph i">movl $0, %eax</em>" from the asm "<em class="ph i">string</em>". Also notice that <em class="ph i">%eax</em> appears in place of "<em class="ph i">%0</em>" because the compiler assigned the <em class="ph i">%eax</em> register to variable <em class="ph i">x</em>. Since item 0 is an output operand, the result must be stored in its expression (x), which is what the second instruction, "<em class="ph i">movl %eax, x(%rip)</em>", does. 
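</p> <p class="p">As a further sketch of a write-only output operand, the following function returns a value produced by an asm statement. The function name load_constant and the constant 42 are hypothetical and do not come from the examples in this guide. The preprocessor test and the plain "C" fallback are assumptions, added only so that the file also compiles on targets other than x86-64. </p><pre xml:space="preserve">/* Hypothetical helper, not one of the guide's numbered examples. */
int load_constant(void)
{
    int result;   /* written only by the asm statement */
#if defined(__x86_64__)
    /* "=r": the compiler picks a register for %0 and copies that
       register into result at the end of the asm statement. */
    __asm__ ( "movl $42, %0" : "=r" (result) );
#else
    result = 42;  /* assumed fallback for other targets */
#endif
    return result;
}</pre><p class="p">Because the operand is write-only, result does not need an initial value; the asm statement is expected to define it before the function returns. 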
</p> <p class="p">In addition to write-only output operands, there are read/write output operands designated with a "<strong class="ph b">+</strong>" instead of a "<strong class="ph b">=</strong>". For example, "<em class="ph i">+r</em>" (<em class="ph i">x</em>) tells the compiler to initialize the output operand with variable x at the beginning of the asm statement. </p> <p class="p">To illustrate this point, the following example increments variable x by 1:</p><pre xml:space="preserve">int x=1; void example2() { asm( "addl $1, %0" : "+r" (x) ); }</pre><p class="p">To perform the increment, the output operand must be initialized with variable x. The <em class="ph i">read/write</em> constraint modifier ("+") instructs the compiler to initialize the output operand with its expression. The compiler generates the following assembly code for the example2() function: </p><pre xml:space="preserve">example2: ..Dcfb0: pushq %rbp ..Dcfi0: movq %rsp, %rbp ..Dcfi1: ..EN1: ## lineno: 5 movl x(%rip), %eax addl $1, %eax movl %eax, x(%rip) ## lineno: 0 popq %rbp ret</pre><p class="p">From the example2() code, two extraneous moves are generated in the assembly: one movl for initializing the output register and a second movl to write it to variable x. To eliminate these moves, use a memory constraint type instead of a register constraint type, as shown in the following example: </p><pre xml:space="preserve">int x=1; void example2() { asm( "addl $1, %0" : "+m" (x) ); }</pre><p class="p">The compiler generates a memory reference in place of a memory constraint. This eliminates the two extraneous moves. Because the assembly uses a memory reference to variable x, it does not have to move x into a register prior to the asm statement; nor does it need to store the result after the asm statement. Additional constraint types are found in <a class="xref" href="index.html#inline-asm-addl-constraints" shape="rect">Additional Constraints</a>. 
</p><pre xml:space="preserve">example2: ..Dcfb0: pushq %rbp ..Dcfi0: movq %rsp, %rbp ..Dcfi1: ..EN1: ## lineno: 5 addl $1, x(%rip) ## lineno: 0 popq %rbp ret</pre><p class="p">The examples thus far have used only one output operand. Because extended asm accepts a list of output operands, asm statements can have more than one result, as shown in the following example: </p><pre xml:space="preserve">void example4() { int x=1; int y=2; asm( "addl $1, %1\n" "addl %1, %0": "+r" (x), "+m" (y) ); }</pre><p class="p">This example increments variable <em class="ph i">y</em> by <em class="ph i">1</em> then adds it to variable <em class="ph i">x</em>. Multiple output operands are separated with a comma. The first output operand is item 0 ("%0") and the second is item 1 ("%1") in the asm <em class="ph i">"string"</em>. The resulting values for <em class="ph i">x</em> and <em class="ph i">y</em> are <em class="ph i">4</em> and <em class="ph i">3</em> respectively. </p> </div> </div> <div class="topic concept nested2" id="inline-asm-input-operands"><a name="inline-asm-input-operands" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-input-operands" name="inline-asm-input-operands" shape="rect">16.2.2. Input Operands</a></h3> <div class="body conbody"> <p class="p">The <em class="ph i">[input operands]</em> are an optional list of input constraint and expression pairs that specify what "C" values are needed by the asm statement. The input constraints specify how the data is delivered to the asm statement. For example, <em class="ph i">"r" (x)</em> says that the input operand is a register that has a copy of the value stored in "C" variable <em class="ph i">x</em>. Another example is <em class="ph i">"m" (x)</em> which says that the input item is the <em class="ph i">memory</em> location associated with variable <em class="ph i">x</em>. 
Other constraint types are discussed in <a class="xref" href="index.html#inline-asm-addl-constraints" shape="rect">Additional Constraints</a>. An example follows: </p><pre xml:space="preserve">void example5() { int x=1; int y=2; int z=3; asm( "addl %2, %1\n" "addl %2, %0" : "+r" (x), "+m" (y) : "r" (z) ); }</pre><p class="p">The previous example adds variable z, item 2, to variable x and variable y. The resulting values for x and y are 4 and 5 respectively.</p> <p class="p">Another type of input constraint worth mentioning here is the <em class="ph i">matching constraint</em>. A matching constraint is used to specify an operand that fills both an input as well as an output role. An example follows: </p><pre xml:space="preserve">int x=1; void example6() { asm( "addl $1, %1" : "=r" (x) : "0" (x) ); }</pre><p class="p">The previous example is equivalent to the <em class="ph i">example2()</em> function shown in <a class="xref" href="index.html#inline-asm-output-operands" shape="rect">Output Operands</a>. The constraint/expression pair, <em class="ph i">"0" (x)</em>, tells the compiler to initialize output item <em class="ph i">0</em> with variable <em class="ph i">x</em> at the beginning of the <em class="ph i">asm</em> statement. The resulting value for <em class="ph i">x</em> is 2. Also note that "<em class="ph i">%1</em>" in the asm <em class="ph i">"string"</em> means the same thing as "<em class="ph i">%0</em>" in this case. That is because there is only one operand with both an input and an output role. </p> <p class="p">Matching constraints are very similar to the <em class="ph i">read/write</em> output operands mentioned in <a class="xref" href="index.html#inline-asm-output-operands" shape="rect">Output Operands</a>. However, there is one key difference between <em class="ph i">read/write</em> output operands and <em class="ph i">matching constraints</em>. 
The <em class="ph i">matching constraint</em> can have an <em class="ph i">input expression</em> that differs from its <em class="ph i">output expression</em>. </p> <p class="p">The following example uses different values for the input and output roles:</p><pre xml:space="preserve">int x; int y=2; void example7() { asm( "addl $1, %1" : "=r" (x) : "0" (y) ); }</pre><p class="p">The compiler generates the following assembly for example7():</p><pre xml:space="preserve">example7: ..Dcfb0: pushq %rbp ..Dcfi0: movq %rsp, %rbp ..Dcfi1: ..EN1: ## lineno: 8 movl y(%rip), %eax addl $1, %eax movl %eax, x(%rip) ## lineno: 0 popq %rbp ret</pre><p class="p">Variable <em class="ph i">x</em> gets initialized with the value stored in <em class="ph i">y</em>, which is <em class="ph i">2</em>. After adding <em class="ph i">1</em>, the resulting value for variable <em class="ph i">x</em> is <em class="ph i">3</em>. </p> <p class="p">Because <em class="ph i">matching constraints</em> perform an input role for an output operand, it does not make sense for the output operand to have the read/write ("<em class="ph i">+</em>") modifier. In fact, the compiler disallows <em class="ph i">matching constraints</em> with read/write output operands. The output operand must have a write only ("<em class="ph i">=</em>") modifier. </p> </div> </div> <div class="topic concept nested2" id="inline-asm-clobber-lst"><a name="inline-asm-clobber-lst" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-clobber-lst" name="inline-asm-clobber-lst" shape="rect">16.2.3. Clobber List</a></h3> <div class="body conbody"> <p class="p">The <em class="ph i">[clobber list]</em> is an optional list of strings that hold machine registers used in the asm "<em class="ph i">string</em>". Essentially, these strings tell the compiler which registers may be clobbered by the asm statement. 
By placing registers in this list, the programmer does not have to explicitly save and restore them as required in traditional inline assembly (described in <a class="xref" href="index.html#inline-asm" shape="rect">Inline Assembly</a>). The compiler takes care of any required saving and restoring of the registers in this list. </p> <p class="p">The [clobber list] is a comma-separated list of strings, each naming a machine register. The leading '%' is optional in the register name. For example, "%eax" is equivalent to "eax". When specifying the register inside the asm "string", you must include two leading '%' characters in front of the name (for example, "%%eax"). Otherwise, the compiler will behave as if a bad input/output operand was specified and generate an error message. An example follows: </p><pre xml:space="preserve">void example8()
{
    int x;
    int y=2;
    asm( "movl %1, %%eax\n"
         "movl %1, %%edx\n"
         "addl %%edx, %%eax\n"
         "addl %%eax, %0"
         : "=r" (x)
         : "0" (y)
         : "eax", "edx" );
}</pre><p class="p">This code uses two hard-coded registers, eax and edx. It performs the equivalent of 3*y and assigns it to x, producing a result of 6. </p> <p class="p">In addition to machine registers, the clobber list may contain the following special flags:</p> <dl class="dl"> <dt class="dt dlterm">"cc"</dt> <dd class="dd">The asm statement may alter the condition code register.</dd> <dt class="dt dlterm">"memory"</dt> <dd class="dd">The asm statement may modify memory in an unpredictable fashion.</dd> </dl> <p class="p">When the "memory" flag is present, the compiler does not keep memory values cached in registers across the asm statement and does not optimize stores or loads to that memory. For example: </p><pre xml:space="preserve">asm("call MyFunc":::"memory");</pre><p class="p">This asm statement contains a "memory" flag because it contains a call. The callee may modify memory, so without the "memory" flag the caller could continue to use values that it cached in registers before the call. 
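</p> <p class="p">The effect of the "memory" flag can also be sketched without a call instruction. In the following sketch, the hypothetical helper compiler_barrier() is an empty asm statement whose clobber list contains only "memory". The names compiler_barrier, shared_flag, and store_then_reload are illustrative and do not appear elsewhere in this guide, and the sketch assumes that the compiler accepts an empty asm "string". </p><pre xml:space="preserve">int shared_flag;   /* hypothetical global, for illustration only */

/* Hypothetical helper: emits no instructions, but the "memory" flag
   tells the compiler that memory may have changed. */
void compiler_barrier(void)
{
    __asm__ __volatile__ ( "" : : : "memory" );
}

int store_then_reload(int v)
{
    shared_flag = v;       /* store made before the asm statement */
    compiler_barrier();    /* no cached copy of shared_flag is kept across this */
    return shared_flag;    /* read again after the asm statement */
}</pre><p class="p">Under the rule described above, store_then_reload() reads shared_flag from memory after the asm statement instead of reusing a register copy made before it. 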
</p> <p class="p">The following function uses extended asm and the "cc" flag to compute the exponent of the largest power of 2 that is less than or equal to the input parameter n. </p><pre xml:space="preserve">#pragma noinline
int asmDivideConquer(int n)
{
    int ax = 0;
    int bx = 1;
    asm (
        "LogLoop:\n"
        "cmp %2, %1\n"
        "jnle Done\n"
        "inc %0\n"
        "add %1,%1\n"
        "jmp LogLoop\n"
        "Done:\n"
        "dec %0\n"
        : "+r" (ax), "+r" (bx)
        : "r" (n)
        : "cc");
    return ax;
}</pre><p class="p">The "cc" flag is used because the asm statement contains instructions (cmp, inc, add, and dec) that alter the condition code register. The #pragma noinline statement prevents the compiler from inlining the asmDivideConquer() function. If the compiler inlines asmDivideConquer(), then it may illegally duplicate the labels LogLoop and Done in the generated assembly. </p> </div> </div> <div class="topic concept nested2" id="inline-asm-addl-constraints"><a name="inline-asm-addl-constraints" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-addl-constraints" name="inline-asm-addl-constraints" shape="rect">16.2.4. Additional Constraints</a></h3> <div class="body conbody"> <p class="p">Operand constraints can be divided into four main categories:</p> <ul class="ul"> <li class="li">Simple Constraints</li> <li class="li">Machine Constraints</li> <li class="li">Multiple Alternative Constraints</li> <li class="li">Constraint Modifiers</li> </ul> </div> </div> <div class="topic concept nested2" id="inline-asm-simple-constraints"><a name="inline-asm-simple-constraints" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-simple-constraints" name="inline-asm-simple-constraints" shape="rect">16.2.5. Simple Constraints</a></h3> <div class="body conbody"> <p class="p">The simplest kind of constraint is a string of letters or characters, known as <em class="ph i">Simple Constraints</em>, such as the "r" and "m" constraints introduced in <a class="xref" href="index.html#inline-asm-output-operands" shape="rect">Output Operands</a>. 
<a class="xref" href="index.html#inline-asm-simple-constraints__inline-asm-simple-constraints-tbl" shape="rect">Table 33</a> describes these constraints. </p> <div class="tablenoborder"><a name="inline-asm-simple-constraints__inline-asm-simple-constraints-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="inline-asm-simple-constraints__inline-asm-simple-constraints-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 33. Simple Constraints</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e23064" rowspan="1" colspan="1">Constraint</th> <th class="entry" valign="top" width="NaN%" id="d54e23067" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">whitespace</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">Whitespace characters are ignored.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">E</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">An immediate floating point operand.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">F</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">Same as "E".</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">g</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">Any general purpose register, memory, or immediate integer operand is allowed.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">i</td> <td class="entry" valign="top" width="NaN%" 
headers="d54e23067" rowspan="1" colspan="1">An immediate integer operand. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">m</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">A memory operand. Any address supported by the machine is allowed.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">n</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">Same as "i".</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">o</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">Same as "m".</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">p</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">An operand that is a valid memory address. The expression associated with the constraint is expected to evaluate to an address (for example, "p" (&amp;x)). </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">r</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">A general purpose register operand.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">X</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">Same as "g".</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23064" rowspan="1" colspan="1">0, 1, 2, ... 9</td> <td class="entry" valign="top" width="NaN%" headers="d54e23067" rowspan="1" colspan="1">Matching Constraint. See <a class="xref" href="index.html#inline-asm-input-operands" shape="rect">Input Operands</a> for a description. 
</td> </tr> </tbody> </table> </div> <p class="p">The following example uses the general or "g" constraint, which allows the compiler to pick an appropriate constraint type for the operand; the compiler chooses from a general purpose register, memory, or immediate operand. Here, the compiler chooses the constraint type for "y". </p><pre xml:space="preserve">void example9()
{
    int x, y=2;
    asm( "movl %1, %0\n" : "=r" (x) : "g" (y) );
}</pre><p class="p">This technique can result in more efficient code. For example, when compiling example9() the compiler replaces the load and store of y with a constant 2. The compiler can then generate an immediate 2 for the y operand in the example. The assembly generated by nvc for this example is as follows: </p><pre xml:space="preserve">example9:
..Dcfb0:
        pushq   %rbp
..Dcfi0:
        movq    %rsp, %rbp
..Dcfi1:
..EN1:
## lineno: 3
        movl    $2, %eax
## lineno: 6
        popq    %rbp
        ret</pre><p class="p">In this example, notice the use of $2 for the "y" operand.</p> <p class="p">Of course, if y is always 2, then the immediate value may be used instead of the variable with the "i" constraint, as shown here: </p><pre xml:space="preserve">void example10()
{
    int x;
    asm( "movl %1, %0\n" : "=r" (x) : "i" (2) );
}</pre><p class="p">Compiling example10() with nvc produces assembly similar to that produced for example9().</p> </div> </div> <div class="topic concept nested2" id="inline-asm-machine-constraints"><a name="inline-asm-machine-constraints" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-machine-constraints" name="inline-asm-machine-constraints" shape="rect">16.2.6. Machine Constraints</a></h3> <div class="body conbody"> <p class="p">Another category of constraints is <em class="ph i">Machine Constraints</em>. The x86_64 architecture has several classes of registers. 
To choose a particular class of register, you can use the x86_64 machine constraints described in <a class="xref" href="index.html#inline-asm-machine-constraints__inline-asm-machine-constraints-tbl" shape="rect">Table 34</a>. </p> <div class="tablenoborder"><a name="inline-asm-machine-constraints__inline-asm-machine-constraints-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="inline-asm-machine-constraints__inline-asm-machine-constraints-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 34. x86_64 Machine Constraints</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e23255" rowspan="1" colspan="1">Constraint</th> <th class="entry" valign="top" width="NaN%" id="d54e23258" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">a</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">a register (e.g., %al, %ax, %eax, %rax)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">A</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Specifies a or d registers. The d register holds the most significant bits and the a register holds the least significant bits. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">b</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">b register (e.g., %bl, %bx, %ebx, %rbx)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">c</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">c register (e.g., %cl, %cx, %ecx, %rcx)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">C</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Not supported.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">d</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">d register (e.g., %dl, %dx, %edx, %rdx)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">D</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">di register (e.g., %dil, %di, %edi, %rdi)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">e</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0xffffffff to 0x7fffffff </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">f</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Not supported.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">G</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Floating point constant in range of 0.0 to 1.0.</td> </tr> <tr 
class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">I</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0 to 31 (e.g., for 32-bit shifts).</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">J</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0 to 63 (e.g., for 64-bit shifts).</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">K</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0 to 127.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">L</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0 to 65535.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">M</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0 to 3 (e.g., shifts for lea instruction).</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">N</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0 to 255 (e.g., for out instruction).</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">q</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Same as "r" simple constraint.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">Q</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" 
colspan="1">Same as "r" simple constraint.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">R</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Same as "r" simple constraint.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">S</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">si register (e.g., %sil, %si, %esi, %rsi)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">t</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Not supported.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">u</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Not supported.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">x</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">XMM SSE register</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">y</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Not supported. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23255" rowspan="1" colspan="1">Z</td> <td class="entry" valign="top" width="NaN%" headers="d54e23258" rowspan="1" colspan="1">Constant in range of 0 to 0x7fffffff.</td> </tr> </tbody> </table> </div> <p class="p">The following example uses the "x" or XMM register constraint to subtract c from b and store the result in a.</p><pre xml:space="preserve">double example11()
{
    double a;
    double b = 400.99;
    double c = 300.98;
    asm ( "subpd %2, %0;"
          : "=x" (a)
          : "0" (b), "x" (c) );
    return a;
}</pre><p class="p">The generated assembly for this example is as follows:</p><pre xml:space="preserve">example11:
..Dcfb0:
        pushq   %rbp
..Dcfi0:
        movq    %rsp, %rbp
..Dcfi1:
..EN1:
## lineno: 4
        movsd   .C00128(%rip), %xmm1
        movsd   .C00130(%rip), %xmm2
        movapd  %xmm1, %xmm0
        subpd   %xmm2, %xmm0;
## lineno: 10
## lineno: 11
        popq    %rbp
        ret</pre><p class="p">If a specified register is not available, the nvc and nvc++ compilers issue an error message.</p> </div> </div> <div class="topic concept nested2" id="inline-asm-mult-alt-constraints"><a name="inline-asm-mult-alt-constraints" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-mult-alt-constraints" name="inline-asm-mult-alt-constraints" shape="rect">16.2.7. Multiple Alternative Constraints</a></h3> <div class="body conbody"> <p class="p">Sometimes a single instruction can take a variety of operand types. For example, the x86-64 architecture permits register-to-memory and memory-to-register operations. To allow this flexibility in inline assembly, use <em class="ph i">multiple alternative constraints</em>. An <em class="ph i">alternative</em> is a series of constraints for each operand. </p> <p class="p">To specify multiple alternatives, separate each alternative with a comma.</p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 35. 
Multiple Alternative Constraints</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e23555" rowspan="1" colspan="1">Constraint</th> <th class="entry" valign="top" width="NaN%" id="d54e23558" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23555" rowspan="1" colspan="1">,</td> <td class="entry" valign="top" width="NaN%" headers="d54e23558" rowspan="1" colspan="1">Separates each alternative for a particular operand.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23555" rowspan="1" colspan="1">?</td> <td class="entry" valign="top" width="NaN%" headers="d54e23558" rowspan="1" colspan="1">Ignored</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23555" rowspan="1" colspan="1">!</td> <td class="entry" valign="top" width="NaN%" headers="d54e23558" rowspan="1" colspan="1">Ignored</td> </tr> </tbody> </table> </div> <p class="p">The following example uses multiple alternatives for an add operation.</p><pre xml:space="preserve">void example13() { int x=1; int y=1; asm( "addl %1, %0\n" : "+ab,cd" (x) : "db,cam" (y) ); }</pre><p class="p">The preceding <em class="ph i">example13()</em> has two alternatives for each operand: "ab,cd" for the output operand and "db,cam" for the input operand. Each operand must have the same number of alternatives; however, each alternative can have any number of constraints (for example, the output operand in <em class="ph i">example13()</em> has two constraints for its second alternative and the input operand has three for its second alternative). </p> <p class="p">The compiler first tries to satisfy the left-most alternative of the first operand (for example, the output operand in <em class="ph i">example13()</em>). When satisfying the operand, the compiler starts with the left-most constraint. 
If the compiler cannot satisfy an alternative with this constraint (for example, if the desired register is not available), it tries to use any subsequent constraints. If the compiler runs out of constraints, it moves on to the next alternative. If the compiler runs out of alternatives, it issues an error similar to the one mentioned in <em class="ph i">example12()</em>. If an alternative is found, the compiler uses the same alternative for subsequent operands. For example, if the compiler chooses the "c" register for the output operand in example13(), then it will use either the "a" or "m" constraint for the input operand. </p> </div> </div> <div class="topic concept nested2" id="inline-asm-constraint-modifiers"><a name="inline-asm-constraint-modifiers" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-constraint-modifiers" name="inline-asm-constraint-modifiers" shape="rect">16.2.8. Constraint Modifiers</a></h3> <div class="body conbody"> <p class="p">Characters that affect the compiler's interpretation of a constraint are known as <em class="ph i">Constraint Modifiers</em>. Two constraint modifiers, the "=" and the "+", were introduced in <a class="xref" href="index.html#inline-asm-output-operands" shape="rect">Output Operands</a>. The following table summarizes each constraint modifier. </p> <div class="tablenoborder"><a name="inline-asm-constraint-modifiers__constraint-mod-chars-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="inline-asm-constraint-modifiers__constraint-mod-chars-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 36. 
Constraint Modifier Characters</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e23657" rowspan="1" colspan="1">Constraint Modifier</th> <th class="entry" valign="top" width="NaN%" id="d54e23660" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23657" rowspan="1" colspan="1">=</td> <td class="entry" valign="top" width="NaN%" headers="d54e23660" rowspan="1" colspan="1">This operand is write-only. It is valid for output operands only. If specified, the "=" must appear as the first character of the constraint string. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23657" rowspan="1" colspan="1">+</td> <td class="entry" valign="top" width="NaN%" headers="d54e23660" rowspan="1" colspan="1">This operand is both read and written by the instruction. It is valid for output operands only. The output operand is initialized with its expression before the first instruction in the asm statement. If specified, the "+" must appear as the first character of the constraint string. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23657" rowspan="1" colspan="1">&amp;</td> <td class="entry" valign="top" width="NaN%" headers="d54e23660" rowspan="1" colspan="1">A constraint or an alternative constraint, as defined in <a class="xref" href="index.html#inline-asm-mult-alt-constraints" shape="rect">Multiple Alternative Constraints</a>, containing an "&amp;" indicates that the output operand is an <em class="ph i">early clobber operand</em>. This type of operand is an output operand that may be modified before the asm statement finishes using all of the input operands. The compiler will not place this operand in a register that may be used as an input operand or part of any memory address. 
</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23657" rowspan="1" colspan="1">%</td> <td class="entry" valign="top" width="NaN%" headers="d54e23660" rowspan="1" colspan="1">Ignored.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23657" rowspan="1" colspan="1">#</td> <td class="entry" valign="top" width="NaN%" headers="d54e23660" rowspan="1" colspan="1">Characters following a "#" up to the first comma (if present) are to be ignored in the constraint. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23657" rowspan="1" colspan="1">*</td> <td class="entry" valign="top" width="NaN%" headers="d54e23660" rowspan="1" colspan="1">The character that follows the "*" is to be ignored in the constraint.</td> </tr> </tbody> </table> </div> <p class="p">The "=" and "+" modifiers apply to the operand, regardless of the number of alternatives in the constraint string. For example, the "+" in the output operand of example13() appears once and applies to both alternatives in the constraint string. The "&amp;", "#", and "*" modifiers apply only to the alternative in which they appear. </p> <p class="p">Normally, the compiler assumes that input operands are used before assigning results to the output operands. This assumption lets the compiler reuse registers as needed inside the asm statement. However, if the asm statement does not follow this convention, the compiler may indiscriminately clobber a result register with an input operand. To prevent this behavior, apply the early clobber "&amp;" modifier. An example follows: </p><pre xml:space="preserve">void example15() { int w=1; int z; asm( "movl $1, %0\n" "addl %2, %0\n" "movl %2, %1" : "=a" (w), "=r" (z) : "r" (w) ); }</pre><p class="p">The previous code example presents an interesting ambiguity because "w" appears both as an output and as an input operand. 
So, the value of "z" can be either 1 or 2, depending on whether the compiler uses the same register for operand 0 and operand 2. The use of constraint "r" for operand 2 allows the compiler to pick any general purpose register, so it may (or may not) pick register "a" for operand 2. This ambiguity can be eliminated by changing the constraint for operand 2 from "r" to "a" so the value of "z" will be 2, or by adding an early clobber "&amp;" modifier so that "z" will be 1. The following example shows the same function with an early clobber "&amp;" modifier: </p><pre xml:space="preserve">void example16() { int w=1; int z; asm( "movl $1, %0\n" "addl %2, %0\n" "movl %2, %1" : "=&amp;a" (w), "=r" (z) : "r" (w) ); }</pre><p class="p">Adding the early clobber "&amp;" forces the compiler not to use the "a" register for anything other than operand 0. Operand 2 will therefore get its own register with its own copy of "w". The result for "z" in <em class="ph i">example16()</em> is 1. </p> </div> </div> </div> <div class="topic concept nested1" id="inline-asm-operand-aliases"><a name="inline-asm-operand-aliases" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-operand-aliases" name="inline-asm-operand-aliases" shape="rect">16.3. Operand Aliases</a></h3> <div class="body conbody"> <p class="p">Extended asm specifies operands in assembly strings with a percent '%' followed by the operand number. For example, "%0" references operand 0 or the output item "=&amp;a" (w) in function <em class="ph i">example16()</em> in the previous example. 
Extended asm also supports operand aliasing, which allows use of a symbolic name instead of a number for specifying operands, as illustrated in this example: </p><pre xml:space="preserve">void example17() { int w=1, z=0; asm( "movl $1, %[output1]\n" "addl %[input], %[output1]\n" "movl %[input], %[output2]" : [output1] "=&amp;a" (w), [output2] "=r" (z) : [input] "r" (w)); }</pre><p class="p">In <em class="ph i">example17()</em>, "%0" and "%[output1]" both represent the first output operand, w. </p> </div> </div> <div class="topic concept nested1" id="inline-asm-strg-modifiers"><a name="inline-asm-strg-modifiers" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-strg-modifiers" name="inline-asm-strg-modifiers" shape="rect">16.4. Assembly String Modifiers</a></h3> <div class="body conbody"> <p class="p">Special character sequences in the assembly string affect the way the assembly is generated by the compiler. For example, the "%" is an escape sequence for specifying an operand, "%%" produces a percent for hard coded registers, and "\n" specifies a new line. <a class="xref" href="index.html#inline-asm-strg-modifiers__inline-asm-strg-modifiers-tbl" shape="rect">Table 37</a> summarizes these modifiers, known as <em class="ph i">Assembly String Modifiers</em>. </p> <div class="tablenoborder"><a name="inline-asm-strg-modifiers__inline-asm-strg-modifiers-tbl" shape="rect"> <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="inline-asm-strg-modifiers__inline-asm-strg-modifiers-tbl" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 37. 
Assembly String Modifier Characters</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e23849" rowspan="1" colspan="1">Modifier</th> <th class="entry" valign="top" width="NaN%" id="d54e23852" rowspan="1" colspan="1">Description</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">\</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Same as \ in printf format strings. </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%*</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Adds a '*' in the assembly string.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%%</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Adds a '%' in the assembly string.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%A</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Adds a '*' in front of an operand in the assembly string. (For example, %A0 adds a '*' in front of operand 0 in the assembly output.) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%B</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces the byte op code suffix for this operand. (For example, %B0 produces 'b' on x86-64.)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%L</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces the word op code suffix for this operand. 
(For example, %L0 produces 'l' on x86-64.)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%P</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">If producing Position Independent Code (PIC), the compiler adds the PIC suffix for this operand. (For example, %P0 produces @PLT on x86-64.) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%Q</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces a quad word op code suffix for this operand if it is supported by the target. Otherwise, it produces a word op code suffix. (For example, %Q0 produces 'q' on x86-64.) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%S</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces 's' suffix for this operand. (For example, %S0 produces 's' on x86-64.)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%T</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces 't' suffix for this operand. (For example, %T0 produces 't' on x86-64.)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%W</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces the half word op code suffix for this operand. 
(For example, %W0 produces 'w' on x86-64.)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%a</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Adds open and close parentheses ( ) around the operand.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%b</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces the byte register name for an operand. (For example, if operand 0 is in register 'a', then %b0 will produce '%al'.)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%c</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Cuts the '$' character from an immediate operand.</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%k</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces the word register name for an operand. (For example, if operand 0 is in register 'a', then %k0 will produce '%eax'.)</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%q</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces the quad word register name for an operand if the target supports quad word. Otherwise, it produces a word register name. (For example, if operand 0 is in register 'a', then %q0 produces %rax on x86-64.) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%w</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces the half word register name for an operand. 
(For example, if operand 0 is in register 'a', then %w0 will produce '%ax'.) </td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e23849" rowspan="1" colspan="1">%z</td> <td class="entry" valign="top" width="NaN%" headers="d54e23852" rowspan="1" colspan="1">Produces an op code suffix based on the size of an operand. (For example, 'b' for byte, 'w' for half word, 'l' for word, and 'q' for quad word.) </td> </tr> <tr class="row"> <td class="entry" colspan="2" valign="top" headers="d54e23849 d54e23852" rowspan="1">%+ %C %D %F %O %X %f %h %l %n %s %y are not supported.</td> </tr> </tbody> </table> </div> <p class="p">These modifiers begin with either a backslash "\" or a percent "%".</p> <p class="p">The modifiers that begin with a backslash "\" (e.g., "\n") have the same effect as they do in a printf format string. The modifiers that are preceded by a "%" are used to modify a particular operand. </p> <p class="p">For example, "%b0" means "produce the byte (8-bit) version of operand 0". If operand 0 is in a register, the compiler emits a byte register name such as %al, %bl, %cl, and so on. </p> </div> </div> <div class="topic concept nested1" id="inline-asm-extended-macros"><a name="inline-asm-extended-macros" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-extended-macros" name="inline-asm-extended-macros" shape="rect">16.5. Extended Asm Macros</a></h3> <div class="body conbody"> <p class="p">As with traditional inline assembly, described in <a class="xref" href="index.html#inline-asm" shape="rect">Inline Assembly</a>, extended asm can be used in a macro. For example, you can use the following macro to access the runtime stack pointer. 
</p><pre xml:space="preserve">#define GET_SP(x) \
    asm("mov %%sp, %0": "=m" (x) :: "%sp" );

void example20() {
    void * stack_pointer;
    GET_SP(stack_pointer);
}</pre><p class="p">The GET_SP macro assigns the value of the stack pointer to the variable passed as its argument (for example, stack_pointer). Another C extension, known as <em class="ph i">statement expressions</em>, can be used to write the GET_SP macro another way: </p><pre xml:space="preserve">#define GET_SP2 ({ \
    void *my_stack_ptr; \
    asm("mov %%sp, %0": "=m" (my_stack_ptr) :: "%sp" ); \
    my_stack_ptr; \
})

void example21() {
    void * stack_pointer = GET_SP2;
}</pre><p class="p">The statement expression allows a body of code to evaluate to a single value. This value is specified by the last statement in the statement expression. In this case, the value is the result of the asm statement, my_stack_ptr. By writing an asm macro with a statement expression, the asm result may be assigned directly to another variable (for example, void * stack_pointer = GET_SP2) or included in a larger expression, such as: void * stack_pointer = GET_SP2 - sizeof(long). </p> <p class="p">Which style of macro to use depends on the application. If the asm statement needs to be a part of an expression, then a macro with a statement expression is a good approach. Otherwise, a traditional macro, like GET_SP(x), will probably suffice. </p> </div> </div> <div class="topic concept nested1" id="inline-asm-intrinsics"><a name="inline-asm-intrinsics" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#inline-asm-intrinsics" name="inline-asm-intrinsics" shape="rect">16.6. Intrinsics</a></h3> <div class="body conbody"> <p class="p">Inline intrinsic functions map to actual x86-64 machine instructions. Intrinsics are inserted inline to avoid the overhead of a function call. The compiler has special knowledge of intrinsics, so using them may produce better code than extended inline assembly. 
</p> <p class="p">The NVIDIA HPC Compilers intrinsics library implements MMX, SSE, SSE2, SSE3, SSSE3, SSE4a, ABM, and AVX instructions. The intrinsic functions are available to C and C++ programs. Unlike most functions, which reside in libraries, intrinsics are implemented internally by the compiler. A program can call the intrinsic functions from C/C++ source code after including the corresponding header file. </p> <p class="p">The intrinsics are divided into header files as follows:</p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 38. Intrinsic Header File Organization</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="NaN%" id="d54e24129" rowspan="1" colspan="1">Instructions</th> <th class="entry" valign="top" width="NaN%" id="d54e24132" rowspan="1" colspan="1">Header File</th> <th class="entry" valign="top" width="NaN%" id="d54e24135" rowspan="1" colspan="1"> </th> <th class="entry" valign="top" width="NaN%" id="d54e24137" rowspan="1" colspan="1">Instructions</th> <th class="entry" valign="top" width="NaN%" id="d54e24140" rowspan="1" colspan="1">Header File</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e24129" rowspan="1" colspan="1">ABM</td> <td class="entry" valign="top" width="NaN%" headers="d54e24132" rowspan="1" colspan="1">intrin.h</td> <td class="entry" valign="top" width="NaN%" headers="d54e24135" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e24137" rowspan="1" colspan="1">SSE2</td> <td class="entry" valign="top" width="NaN%" headers="d54e24140" rowspan="1" colspan="1">emmintrin.h</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e24129" rowspan="1" colspan="1">AVX</td> <td class="entry" valign="top" width="NaN%" headers="d54e24132" 
rowspan="1" colspan="1">immintrin.h</td> <td class="entry" valign="top" width="NaN%" headers="d54e24135" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e24137" rowspan="1" colspan="1">SSE3</td> <td class="entry" valign="top" width="NaN%" headers="d54e24140" rowspan="1" colspan="1">pmmintrin.h</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e24129" rowspan="1" colspan="1">MMX</td> <td class="entry" valign="top" width="NaN%" headers="d54e24132" rowspan="1" colspan="1">mmintrin.h</td> <td class="entry" valign="top" width="NaN%" headers="d54e24135" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e24137" rowspan="1" colspan="1">SSSE3</td> <td class="entry" valign="top" width="NaN%" headers="d54e24140" rowspan="1" colspan="1">tmmintrin.h</td> </tr> <tr class="row"> <td class="entry" valign="top" width="NaN%" headers="d54e24129" rowspan="1" colspan="1">SSE</td> <td class="entry" valign="top" width="NaN%" headers="d54e24132" rowspan="1" colspan="1">xmmintrin.h</td> <td class="entry" valign="top" width="NaN%" headers="d54e24135" rowspan="1" colspan="1"> </td> <td class="entry" valign="top" width="NaN%" headers="d54e24137" rowspan="1" colspan="1">SSE4a</td> <td class="entry" valign="top" width="NaN%" headers="d54e24140" rowspan="1" colspan="1">ammintrin.h</td> </tr> </tbody> </table> </div> <p class="p">The following is a simple example program that calls XMM intrinsics.</p><pre xml:space="preserve">#include &lt;xmmintrin.h&gt;

int main(){
    __m128 __A, __B, result;
    /* Load four single-precision values into each XMM vector. */
    __A = _mm_set_ps(23.3, 43.7, 234.234, 98.746);
    __B = _mm_set_ps(15.4, 34.3, 4.1, 8.6);
    /* Add the two vectors element by element. */
    result = _mm_add_ps(__A,__B);
    return 0;
}</pre></div> </div> </div> <div class="topic concept nested0" id="notices-header"><a name="notices-header" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#notices-header" name="notices-header" shape="rect">Notices</a></h2> <div class="topic reference nested1" 
id="notice"><a name="notice" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#notice" name="notice" shape="rect"></a></h3> <div class="body refbody"> <div class="section"> <h3 class="title sectiontitle">Notice</h3> <p class="p">ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. </p> <p class="p">Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation. </p> </div> </div> </div> <div class="topic reference nested1" id="trademarks"><a name="trademarks" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#trademarks" name="trademarks" shape="rect"></a></h3> <div class="body refbody"> <div class="section"> <h3 class="title sectiontitle">Trademarks</h3> <p class="p">NVIDIA, the NVIDIA logo, CUDA, CUDA-X, GPUDirect, HPC SDK, NGC, NVIDIA Volta, NVIDIA DGX, NVIDIA Nsight, NVLink, NVSwitch, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. 
Other company and product names may be trademarks of the respective companies with which they are associated. </p> </div> </div> </div> <div class="topic reference nested1" id="copyright"><a name="copyright" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#copyright" name="copyright" shape="rect"></a></h3> <div class="body refbody"> <div class="section"> <h3 class="title sectiontitle">Copyright</h3> <p class="p">© 2013–<span class="ph">2024</span> NVIDIA Corporation. All rights reserved. </p> </div> </div> </div> </div> </article> </div> </div> </div><script language="JavaScript" type="text/javascript" charset="utf-8" src="/hpc-sdk/common/formatting/common.min.js"></script></body> </html>
