<!DOCTYPE html> <html class="writer-html5" lang="en" > <head> <meta charset="utf-8" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" /> <meta content="The API reference guide for cuFFT, the CUDA Fast Fourier Transform library." name="description" /> <meta content="CUDA cuFFT, CUDA cuFFT features, CUDA cuFFT data, CUDA cuFFT layout, CUDA cuFFT performance, CUDA cuFFT API, CUDA cuFFT library, CUDA cuFFT example, CUDA cuFFT transforms, CUDA cuFFT multidimensional transforms" name="keywords" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>cuFFT</title> <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> <link rel="stylesheet" href="../_static/css/theme.css" type="text/css" /> <link rel="stylesheet" href="../_static/copybutton.css" type="text/css" /> <link rel="stylesheet" href="../_static/design-style.b7bb847fb20b106c3d81b95245e65545.min.css" type="text/css" /> <link rel="stylesheet" href="../_static/omni-style.css" type="text/css" /> <link rel="stylesheet" href="../_static/api-styles.css" type="text/css" /> <link rel="shortcut icon" href="../_static/favicon.ico"/>  <script data-url_root="./" id="documentation_options" src="../_static/documentation_options.js"></script> <script src="../_static/jquery.js"></script> <script src="../_static/underscore.js"></script> <script src="../_static/doctools.js"></script> <script src="../_static/mermaid-init.js"></script> <script src="../_static/clipboard.min.js"></script> <script src="../_static/copybutton.js"></script> <script src="../_static/design-tabs.js"></script> <script async="async" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> <script src="../_static/geoip/geoip.js"></script> <script src="../_static/js/theme.js"></script> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <link rel="prev" title="Contents" href="contents.html" /> <script 
src="//assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js"></script> </head> <body class="wy-body-for-nav"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side"> <div class="wy-side-scroll"> <div class="wy-side-nav-search" > <a href="contents.html"> <img src="../_static/Logo_and_CUDA.png" class="logo" alt="Logo"/> </a> <div role="search"> <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu"> <ul class="current"> <li class="toctree-l1 current"><a class="current reference internal" href="#">1. Introduction</a></li> <li class="toctree-l1"><a class="reference internal" href="#using-the-cufft-api">2. Using the cuFFT API</a><ul> <li class="toctree-l2"><a class="reference internal" href="#accessing-cufft">2.1. Accessing cuFFT</a></li> <li class="toctree-l2"><a class="reference internal" href="#fourier-transform-setup">2.2. Fourier Transform Setup</a><ul> <li class="toctree-l3"><a class="reference internal" href="#free-memory-requirement">2.2.1. Free Memory Requirement</a></li> <li class="toctree-l3"><a class="reference internal" href="#plan-initialization-time">2.2.2. Plan Initialization Time</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#fourier-transform-types">2.3. Fourier Transform Types</a><ul> <li class="toctree-l3"><a class="reference internal" href="#half-precision-cufft-transforms">2.3.1. Half-precision cuFFT Transforms</a></li> <li class="toctree-l3"><a class="reference internal" href="#bfloat16-precision-cufft-transforms">2.3.2. Bfloat16-precision cuFFT Transforms</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#data-layout">2.4. 
Data Layout</a></li> <li class="toctree-l2"><a class="reference internal" href="#multidimensional-transforms">2.5. Multidimensional Transforms</a></li> <li class="toctree-l2"><a class="reference internal" href="#advanced-data-layout">2.6. Advanced Data Layout</a></li> <li class="toctree-l2"><a class="reference internal" href="#streamed-cufft-transforms">2.7. Streamed cuFFT Transforms</a></li> <li class="toctree-l2"><a class="reference internal" href="#multiple-gpu-cufft-transforms">2.8. Multiple GPU cuFFT Transforms</a><ul> <li class="toctree-l3"><a class="reference internal" href="#plan-specification-and-work-areas">2.8.1. Plan Specification and Work Areas</a></li> <li class="toctree-l3"><a class="reference internal" href="#helper-functions">2.8.2. Helper Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="#multiple-gpu-2d-and-3d-transforms-on-permuted-input">2.8.3. Multiple GPU 2D and 3D Transforms on Permuted Input</a></li> <li class="toctree-l3"><a class="reference internal" href="#supported-functionality">2.8.4. Supported Functionality</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-callback-routines">2.9. cuFFT Callback Routines</a><ul> <li class="toctree-l3"><a class="reference internal" href="#overview-of-the-cufft-callback-routine-feature">2.9.1. Overview of the cuFFT Callback Routine Feature</a></li> <li class="toctree-l3"><a class="reference internal" href="#lto-load-and-store-callback-routines">2.9.2. LTO Load and Store Callback Routines</a><ul> <li class="toctree-l4"><a class="reference internal" href="#specifying-lto-load-and-store-callback-routines">2.9.2.1. Specifying LTO Load and Store Callback Routines</a></li> <li class="toctree-l4"><a class="reference internal" href="#lto-callback-routine-function-details">2.9.2.2. LTO Callback Routine Function Details</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#legacy-load-and-store-callback-routines">2.9.3. 
Legacy Load and Store Callback Routines</a><ul> <li class="toctree-l4"><a class="reference internal" href="#specifying-legacy-load-and-store-callback-routines">2.9.3.1. Specifying Legacy Load and Store Callback Routines</a></li> <li class="toctree-l4"><a class="reference internal" href="#legacy-callback-routine-function-details">2.9.3.2. Legacy Callback Routine Function Details</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#coding-considerations-for-the-cufft-callback-routine-feature">2.9.4. Coding Considerations for the cuFFT Callback Routine Feature</a><ul> <li class="toctree-l4"><a class="reference internal" href="#coding-considerations-for-lto-callback-routines">2.9.4.1. Coding Considerations for LTO Callback Routines</a></li> <li class="toctree-l4"><a class="reference internal" href="#coding-considerations-for-legacy-callback-routines">2.9.4.2. Coding Considerations for Legacy Callback Routines</a></li> </ul> </li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#thread-safety">2.10. Thread Safety</a></li> <li class="toctree-l2"><a class="reference internal" href="#cuda-graphs-support">2.11. CUDA Graphs Support</a></li> <li class="toctree-l2"><a class="reference internal" href="#static-library-and-callback-support">2.12. Static Library and Callback Support</a><ul> <li class="toctree-l3"><a class="reference internal" href="#static-library-without-legacy-callback-support">2.12.1. Static library without legacy callback support</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#accuracy-and-performance">2.13. Accuracy and Performance</a></li> <li class="toctree-l2"><a class="reference internal" href="#caller-allocated-work-area-support">2.14. Caller Allocated Work Area Support</a></li> <li class="toctree-l2"><a class="reference internal" href="#cufft-link-time-optimized-kernels">2.15. 
cuFFT Link-Time Optimized Kernels</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#cufft-api-reference">3. cuFFT API Reference</a><ul> <li class="toctree-l2"><a class="reference internal" href="#return-value-cufftresult">3.1. Return value cufftResult</a></li> <li class="toctree-l2"><a class="reference internal" href="#cufft-basic-plans">3.2. cuFFT Basic Plans</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftplan1d">3.2.1. cufftPlan1d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftplan2d">3.2.2. cufftPlan2d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftplan3d">3.2.3. cufftPlan3d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftplanmany">3.2.4. cufftPlanMany()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-extensible-plans">3.3. cuFFT Extensible Plans</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftcreate">3.3.1. cufftCreate()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftdestroy">3.3.2. cufftDestroy()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftmakeplan1d">3.3.3. cufftMakePlan1d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftmakeplan2d">3.3.4. cufftMakePlan2d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftmakeplan3d">3.3.5. cufftMakePlan3d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftmakeplanmany">3.3.6. cufftMakePlanMany()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftmakeplanmany64">3.3.7. cufftMakePlanMany64()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtmakeplanmany">3.3.8. cufftXtMakePlanMany()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-plan-properties">3.4. 
cuFFT Plan Properties</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftsetplanpropertyint64">3.4.1. cufftSetPlanPropertyInt64()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftgetplanpropertyint64">3.4.2. cufftGetPlanPropertyInt64()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftresetplanproperty">3.4.3. cufftResetPlanProperty()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-estimated-size-of-work-area">3.5. cuFFT Estimated Size of Work Area</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftestimate1d">3.5.1. cufftEstimate1d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftestimate2d">3.5.2. cufftEstimate2d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftestimate3d">3.5.3. cufftEstimate3d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftestimatemany">3.5.4. cufftEstimateMany()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-refined-estimated-size-of-work-area">3.6. cuFFT Refined Estimated Size of Work Area</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftgetsize1d">3.6.1. cufftGetSize1d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftgetsize2d">3.6.2. cufftGetSize2d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftgetsize3d">3.6.3. cufftGetSize3d()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftgetsizemany">3.6.4. cufftGetSizeMany()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftgetsizemany64">3.6.5. cufftGetSizeMany64()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtgetsizemany">3.6.6. cufftXtGetSizeMany()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufftgetsize">3.7. 
cufftGetSize()</a></li> <li class="toctree-l2"><a class="reference internal" href="#cufft-caller-allocated-work-area-support">3.8. cuFFT Caller Allocated Work Area Support</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftsetautoallocation">3.8.1. cufftSetAutoAllocation()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftsetworkarea">3.8.2. cufftSetWorkArea()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtsetworkareapolicy">3.8.3. cufftXtSetWorkAreaPolicy()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-execution">3.9. cuFFT Execution</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftexecc2c-and-cufftexecz2z">3.9.1. cufftExecC2C() and cufftExecZ2Z()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftexecr2c-and-cufftexecd2z">3.9.2. cufftExecR2C() and cufftExecD2Z()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftexecc2r-and-cufftexecz2d">3.9.3. cufftExecC2R() and cufftExecZ2D()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtexec">3.9.4. cufftXtExec()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtexecdescriptor">3.9.5. cufftXtExecDescriptor()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-and-multiple-gpus">3.10. cuFFT and Multiple GPUs</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftxtsetgpus">3.10.1. cufftXtSetGPUs()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtsetworkarea">3.10.2. cufftXtSetWorkArea()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufft-multiple-gpu-execution">3.10.3. cuFFT Multiple GPU Execution</a><ul> <li class="toctree-l4"><a class="reference internal" href="#cufftxtexecdescriptorc2c-and-cufftxtexecdescriptorz2z">3.10.3.1. 
cufftXtExecDescriptorC2C() and cufftXtExecDescriptorZ2Z()</a></li> <li class="toctree-l4"><a class="reference internal" href="#cufftxtexecdescriptorr2c-and-cufftxtexecdescriptord2z">3.10.3.2. cufftXtExecDescriptorR2C() and cufftXtExecDescriptorD2Z()</a></li> <li class="toctree-l4"><a class="reference internal" href="#cufftxtexecdescriptorc2r-and-cufftxtexecdescriptorz2d">3.10.3.3. cufftXtExecDescriptorC2R() and cufftXtExecDescriptorZ2D()</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#memory-allocation-and-data-movement-functions">3.10.4. Memory Allocation and Data Movement Functions</a><ul> <li class="toctree-l4"><a class="reference internal" href="#cufftxtmalloc">3.10.4.1. cufftXtMalloc()</a><ul> <li class="toctree-l5"><a class="reference internal" href="#parameter-cufftxtsubformat">3.10.4.1.1. Parameter cufftXtSubFormat</a></li> </ul> </li> <li class="toctree-l4"><a class="reference internal" href="#cufftxtfree">3.10.4.2. cufftXtFree()</a></li> <li class="toctree-l4"><a class="reference internal" href="#cufftxtmemcpy">3.10.4.3. cufftXtMemcpy()</a><ul> <li class="toctree-l5"><a class="reference internal" href="#parameter-cufftxtcopytype">3.10.4.3.1. Parameter cufftXtCopyType</a></li> </ul> </li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#general-multiple-gpu-descriptor-types">3.10.5. General Multiple GPU Descriptor Types</a><ul> <li class="toctree-l4"><a class="reference internal" href="#cudaxtdesc">3.10.5.1. cudaXtDesc</a></li> <li class="toctree-l4"><a class="reference internal" href="#cudalibxtdesc">3.10.5.2. cudaLibXtDesc</a></li> </ul> </li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufft-callbacks">3.11. cuFFT Callbacks</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cufftxtsetjitcallback">3.11.1. cufftXtSetJITCallback()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtsetcallback">3.11.2. 
cufftXtSetCallback()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtclearcallback">3.11.3. cufftXtClearCallback()</a></li> <li class="toctree-l3"><a class="reference internal" href="#cufftxtsetcallbacksharedsize">3.11.4. cufftXtSetCallbackSharedSize()</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#cufftsetstream">3.12. cufftSetStream()</a></li> <li class="toctree-l2"><a class="reference internal" href="#cufftgetversion">3.13. cufftGetVersion()</a></li> <li class="toctree-l2"><a class="reference internal" href="#cufftgetproperty">3.14. cufftGetProperty()</a></li> <li class="toctree-l2"><a class="reference internal" href="#cufft-types">3.15. cuFFT Types</a><ul> <li class="toctree-l3"><a class="reference internal" href="#parameter-cuffttype">3.15.1. Parameter cufftType</a></li> <li class="toctree-l3"><a class="reference internal" href="#parameters-for-transform-direction">3.15.2. Parameters for Transform Direction</a></li> <li class="toctree-l3"><a class="reference internal" href="#type-definitions-for-callbacks">3.15.3. Type definitions for callbacks</a><ul> <li class="toctree-l4"><a class="reference internal" href="#type-definitions-for-lto-callbacks">3.15.3.1. Type definitions for LTO callbacks</a></li> <li class="toctree-l4"><a class="reference internal" href="#type-definitions-for-legacy-callbacks">3.15.3.2. Type definitions for legacy callbacks</a></li> </ul> </li> <li class="toctree-l3"><a class="reference internal" href="#other-cufft-types">3.15.4. Other cuFFT Types</a><ul> <li class="toctree-l4"><a class="reference internal" href="#cuffthandle">3.15.4.1. cufftHandle</a></li> <li class="toctree-l4"><a class="reference internal" href="#cufftreal">3.15.4.2. cufftReal</a></li> <li class="toctree-l4"><a class="reference internal" href="#cufftdoublereal">3.15.4.3. cufftDoubleReal</a></li> <li class="toctree-l4"><a class="reference internal" href="#cufftcomplex">3.15.4.4. 
cufftComplex</a></li> <li class="toctree-l4"><a class="reference internal" href="#cufftdoublecomplex">3.15.4.5. cufftDoubleComplex</a></li> </ul> </li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="#common-types">3.16. Common types</a><ul> <li class="toctree-l3"><a class="reference internal" href="#cudadatatype">3.16.1. cudaDataType</a></li> <li class="toctree-l3"><a class="reference internal" href="#librarypropertytype">3.16.2. libraryPropertyType</a></li> </ul> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#multiple-gpu-data-organization">4. Multiple GPU Data Organization</a><ul> <li class="toctree-l2"><a class="reference internal" href="#multiple-gpu-data-organization-for-batched-transforms">4.1. Multiple GPU Data Organization for Batched Transforms</a></li> <li class="toctree-l2"><a class="reference internal" href="#multiple-gpu-data-organization-for-single-2d-and-3d-transforms">4.2. Multiple GPU Data Organization for Single 2D and 3D Transforms</a></li> <li class="toctree-l2"><a class="reference internal" href="#multiple-gpu-data-organization-for-single-1d-transforms">4.3. Multiple-GPU Data Organization for Single 1D Transforms</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="#fftw-conversion-guide">5. FFTW Conversion Guide</a></li> <li class="toctree-l1"><a class="reference internal" href="#fftw-interface-to-cufft">6. FFTW Interface to cuFFT</a></li> <li class="toctree-l1"><a class="reference internal" href="#deprecated-functionality">7. Deprecated Functionality</a></li> <li class="toctree-l1"><a class="reference internal" href="#notices">8. Notices</a><ul> <li class="toctree-l2"><a class="reference internal" href="#notice">8.1. Notice</a></li> <li class="toctree-l2"><a class="reference internal" href="#opencl">8.2. OpenCL</a></li> <li class="toctree-l2"><a class="reference internal" href="#trademarks">8.3. 
Trademarks</a></li> </ul> </li> </ul> </div> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" > <a href="contents.html">cuFFT</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="Page navigation"> <ul class="wy-breadcrumbs"> <li><a href="../index.html" class="icon icon-home"></a> »</li> <li>1. Introduction</li> <li class="wy-breadcrumbs-aside"> </li> <li class="wy-breadcrumbs-aside"> v12.6 | <a href="../pdf/CUFFT_Library.pdf" class="reference external">PDF</a> | <a href="https://developer.nvidia.com/cuda-toolkit-archive" class="reference external">Archive</a>   </li> </ul> <hr/> </div> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> cuFFT API Reference The API reference guide for cuFFT, the CUDA Fast Fourier Transform library. <section id="introduction"> <h1>1. Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline"></a></h1> cuFFT Release Notes: <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cufft-library">CUDA Toolkit Release Notes</a> cuFFT GitHub Samples: <a class="reference external" href="https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFT">CUDA Library Samples</a> Nvidia Developer Forum: <a class="reference external" href="https://forums.developer.nvidia.com/c/accelerated-computing/gpu-accelerated-libraries/">GPU-Accelerated Libraries</a> Provide Feedback: <a class="reference external" href="mailto:Math-Libs-Feedback%40nvidia.com?subject=cuFFT-Feedback">Math-Libs-Feedback@nvidia.com</a> Related FFT Libraries: <ul class="simple"> <li><a class="reference external" href="https://docs.nvidia.com/hpc-sdk/cufftmp/">cuFFTMP</a></li> <li><a class="reference external" href="https://docs.nvidia.com/cuda/cufftdx/index.html">cuFFTDx</a></li> <li><a 
class="reference external" href="https://docs.nvidia.com/cuda/cufft/ltoea/index.html">cuFFT LTO EA (DEPRECATED)</a></li> <li><a class="reference external" href="https://docs.nvidia.com/nvpl/_static/fft/index.html">NVPL FFT</a></li> </ul> Relevant cuFFT Blog Posts and GTC presentations: <ul class="simple"> <li><a class="reference external" href="https://developer.nvidia.com/blog/accelerating-gpu-applications-with-nvidia-math-libraries/">Accelerating GPU Applications with NVIDIA Math Libraries</a></li> <li><a class="reference external" href="https://developer.nvidia.com/blog/multinode-multi-gpu-using-nvidia-cufftmp-ffts-at-scale/">Multinode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale</a></li> <li><a class="reference external" href="https://developer.nvidia.com/blog/building-high-performance-applications-in-the-era-of-accelerated-computing/">New Asynchronous Programming Model Library Now Available with NVIDIA HPC SDK v22.11</a></li> <li><a class="reference external" href="https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31155/?playlistId=playList-ead11304-9931-4e91-9d5a-fb0e1ef27014">Just-In-Time Link-Time Optimization Adoption in cuSPARSE/cuFFT: Use Case Overview</a></li> </ul> This document describes cuFFT, the NVIDIA® CUDA® Fast Fourier Transform (FFT) product. It consists of two separate libraries: cuFFT and cuFFTW. The cuFFT library is designed to provide high performance on NVIDIA GPUs. The cuFFTW library is provided as a porting tool to enable users of FFTW to start using NVIDIA GPUs with a minimum amount of effort. The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets. It is one of the most important and widely used numerical algorithms in computational physics and general signal processing. 
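As a rough illustration of the divide-and-conquer idea described above, the following is a minimal sketch of a recursive radix-2 FFT in plain C++. It is not cuFFT code and does not reflect how cuFFT is implemented internally; the names <code>fft_radix2</code> and <code>cd</code> are hypothetical, and the sketch assumes a power-of-two input length and the forward sign convention (negative exponent).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;  // hypothetical alias used only in this sketch

// Recursive radix-2 decimation-in-time FFT (forward transform).
// Assumption: x.size() is zero, one, or a power of two.
std::vector<cd> fft_radix2(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    if (n <= 1) return x;        // a length-1 DFT is the identity
    assert((n & (n - 1)) == 0);  // power-of-two lengths only

    // Divide: split the input into even- and odd-indexed samples.
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }

    // Conquer: transform each half recursively.
    const std::vector<cd> E = fft_radix2(even);
    const std::vector<cd> O = fft_radix2(odd);

    // Combine: butterflies with twiddle factors exp(-2*pi*i*k/n).
    const double pi = std::acos(-1.0);
    std::vector<cd> X(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const cd t = std::polar(1.0, angle) * O[k];
        X[k]         = E[k] + t;
        X[k + n / 2] = E[k] - t;
    }
    return X;
}
```

Each recursion level performs a linear amount of work and there are log2(n) levels, which is where the O(n log n) cost comes from. For example, calling <code>fft_radix2</code> on the four samples 1, 2, 3, 4 gives 10, -2+2i, -2, -2-2i up to rounding.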
The cuFFT library provides a simple interface for computing FFTs on an NVIDIA GPU, which allows users to quickly leverage the floating-point power and parallelism of the GPU in a highly optimized and tested FFT library. The cuFFT product supports a wide range of FFT inputs and options efficiently on NVIDIA GPUs. This version of the cuFFT library supports the following features: <ul class="simple"> <li>Algorithms highly optimized for input sizes that can be written in the form \(2^{a} \times 3^{b} \times 5^{c} \times 7^{d}\). In general, the smaller the prime factor, the better the performance; powers of two are fastest.</li> <li>An \(O\left( n\log n \right)\) algorithm for every input data size</li> <li>Half-precision (16-bit floating point), single-precision (32-bit floating point) and double-precision (64-bit floating point) transforms. Transforms of lower precision have higher performance.</li> <li>Complex and real-valued input and output. Real-valued input or output requires fewer computations and less data than complex values, and often has a faster time to solution. Types supported are: <ul> <li>C2C - Complex input to complex output</li> <li>R2C - Real input to complex output</li> <li>C2R - Symmetric complex input to real output</li> </ul> </li> <li>1D, 2D and 3D transforms</li> <li>Execution of multiple 1D, 2D and 3D transforms simultaneously. These batched transforms have higher performance than single transforms.</li> <li>In-place and out-of-place transforms</li> <li>Arbitrary intra- and inter-dimension element strides (strided layout)</li> <li>FFTW compatible data layout</li> <li>Execution of transforms across multiple GPUs</li> <li>Streamed execution, enabling asynchronous computation and data movement</li> </ul> The cuFFTW library provides the FFTW3 API to facilitate porting of existing FFTW applications. Please note that starting from CUDA 11.0, the minimum supported GPU architecture is SM35. 
See <a class="reference external" href="index.html#deprecated-functionality">Deprecated Functionality</a>. </section> <section id="using-the-cufft-api"> <h1>2. Using the cuFFT API<a class="headerlink" href="#using-the-cufft-api" title="Permalink to this headline"></a></h1> This chapter provides a general overview of the cuFFT library API. For more complete information on specific functions, see <a class="reference external" href="index.html#cufft-api-reference">cuFFT API Reference</a>. Users are encouraged to read this chapter before continuing with more detailed descriptions. The discrete Fourier transform (DFT) maps a complex-valued vector \(x_{n}\) (time domain) into its frequency domain representation given by: <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 100%" /> </colgroup> <tbody> <tr class="row-odd"><td>\(X_{k} = \sum\limits_{n = 0}^{N - 1}x_{n}e^{-2\pi i\frac{kn}{N}}\)</td> </tr> </tbody> </table> where \(X_{k}\) is a complex-valued vector of the same size \(N\). This is known as a forward DFT. If the sign on the exponent of \(e\) is changed to be positive, the transform is an inverse transform. Depending on \(N\), different algorithms are deployed for the best performance. The cuFFT API is modeled after <a class="reference external" href="http://www.fftw.org">FFTW</a>, which is one of the most popular and efficient CPU-based FFT libraries. cuFFT provides a simple configuration mechanism called a plan that uses internal building blocks to optimize the transform for the given configuration and the particular GPU hardware selected. Then, when the execution function is called, the actual transform takes place following the plan of execution. The advantage of this approach is that once the user creates a plan, the library retains whatever state is needed to execute the plan multiple times without recalculation of the configuration. 
This model works well for cuFFT because different kinds of FFTs require different thread configurations and GPU resources, and the plan interface provides a simple way of reusing configurations. Computing a number <code class="docutils literal notranslate">BATCH</code> of one-dimensional DFTs of size <code class="docutils literal notranslate">NX</code> using cuFFT will typically look like this: <div class="highlight-c++ notranslate"><div class="highlight"><pre>#define NX 256
#define BATCH 10
#define RANK 1
...
{
    cufftHandle plan;
    cufftComplex *data;
    int n[RANK] = {NX};  // cufftPlanMany() takes the transform sizes as an int array
    ...
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);
    // iembed, istride, idist, oembed, ostride and odist describe the data layout;
    // their declarations are elided here
    cufftPlanMany(&plan, RANK, n, &iembed, istride, idist,
        &oembed, ostride, odist, CUFFT_C2C, BATCH);
    ...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward transform
    cudaDeviceSynchronize();
    ...
    cufftDestroy(plan);
    cudaFree(data);
}
</pre></div> </div> <section id="accessing-cufft"> <h2>2.1. Accessing cuFFT<a class="headerlink" href="#accessing-cufft" title="Permalink to this headline"></a></h2> The cuFFT and cuFFTW libraries are available as shared libraries. They consist of compiled programs ready for users to incorporate into applications with the compiler and linker. cuFFT can be downloaded from <a class="reference external" href="https://developer.nvidia.com/cufft">https://developer.nvidia.com/cufft</a>. By selecting Download CUDA Production Release, users can install the package containing the CUDA Toolkit, SDK code samples and development drivers. The CUDA Toolkit contains cuFFT and the samples include <code class="docutils literal notranslate">simplecuFFT</code>. The Linux release for <code class="docutils literal notranslate">simplecuFFT</code> assumes that the root install directory is <code class="docutils literal notranslate">/usr/local/cuda</code> and that the locations of the products are contained there as follows. Modify the Makefile as appropriate for your system. 
<table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 45%" /> <col style="width: 34%" /> <col style="width: 21%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">Product</th> <th class="head">Location and name</th> <th class="head">Include file</th> </tr> </thead> <tbody> <tr class="row-even"><td><code class="docutils literal notranslate">nvcc</code> compiler</td> <td><code class="docutils literal notranslate">/bin/nvcc</code></td> <td></td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate">cuFFT</code> library</td> <td><code class="docutils literal notranslate">{lib, lib64}/libcufft.so</code></td> <td><code class="docutils literal notranslate">inc/cufft.h</code></td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate">cuFFT</code> library with Xt functionality</td> <td><code class="docutils literal notranslate">{lib, lib64}/libcufft.so</code></td> <td><code class="docutils literal notranslate">inc/cufftXt.h</code></td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate">cuFFTW</code> library</td> <td><code class="docutils literal notranslate">{lib, lib64}/libcufftw.so</code></td> <td><code class="docutils literal notranslate">inc/cufftw.h</code></td> </tr> </tbody> </table> The most common case is for developers to modify an existing CUDA routine (for example, <code class="docutils literal notranslate">filename.cu</code>) to call cuFFT routines. In this case the include file <code class="docutils literal notranslate">cufft.h</code> or <code class="docutils literal notranslate">cufftXt.h</code> should be inserted into <code class="docutils literal notranslate">filename.cu</code> file and the library included in the link line. 
A single compile and link line might appear as: <ul class="simple"> <li><code class="docutils literal notranslate">/usr/local/cuda/bin/nvcc [options] filename.cu … -I/usr/local/cuda/inc -L/usr/local/cuda/lib -lcufft</code></li> </ul> In practice there will typically be many compile lines, and the compiler <code class="docutils literal notranslate">g++</code> may be used for linking as long as the library path is set correctly. Users of the FFTW interface (see <a class="reference external" href="index.html#fftw-supported-interface">FFTW Interface to cuFFT</a>) should include <code class="docutils literal notranslate">cufftw.h</code> and link with both the cuFFT and cuFFTW libraries. Functions in the cuFFT and cuFFTW libraries assume that the data is in GPU-visible memory. This means any memory allocated by <code class="docutils literal notranslate">cudaMalloc</code>, <code class="docutils literal notranslate">cudaMallocHost</code> or <code class="docutils literal notranslate">cudaMallocManaged</code>, or registered with <code class="docutils literal notranslate">cudaHostRegister</code>, can be used as input, output or plan work area with cuFFT and cuFFTW functions. For the best performance, input data, output data and plan work area should reside in device memory. The cuFFTW library also supports input and output data that are not GPU visible. </section> <section id="fourier-transform-setup"> <h2>2.2. 
Fourier Transform Setup<a class="headerlink" href="#fourier-transform-setup" title="Permalink to this headline"></a></h2> The first step in using the cuFFT Library is to create a plan using one of the following: <ul class="simple"> <li><code class="docutils literal notranslate">cufftPlan1d() / cufftPlan2d() / cufftPlan3d()</code> - Creates a simple plan for a 1D/2D/3D transform respectively.</li> <li><code class="docutils literal notranslate">cufftPlanMany()</code> - Creates a plan supporting batched input and strided data layouts.</li> <li><code class="docutils literal notranslate">cufftXtMakePlanMany()</code> - Creates a plan supporting batched input and strided data layouts for any supported precision.</li> </ul> Among the plan creation functions, <code class="docutils literal notranslate">cufftPlanMany()</code> allows use of more complicated data layouts and batched executions. Execution of a transform of a particular size and type may take several stages of processing. When a plan for the transform is generated, cuFFT derives the internal steps that need to be taken. These steps may include multiple kernel launches, memory copies, and so on. In addition, all the intermediate buffer allocations (on CPU/GPU memory) take place during planning. These buffers are released when the plan is destroyed. 
In the worst case, the cuFFT Library allocates space for <code class="docutils literal notranslate">8*batch*n[0]*..*n[rank-1] cufftComplex</code> or <code class="docutils literal notranslate">cufftDoubleComplex</code> elements (where <code class="docutils literal notranslate">batch</code> denotes the number of transforms that will be executed in parallel, <code class="docutils literal notranslate">rank</code> is the number of dimensions of the input data (see <a class="reference external" href="index.html#multi-dimensional">Multidimensional Transforms</a>) and <code class="docutils literal notranslate">n[]</code> is the array of transform dimensions) for single and double-precision transforms respectively. Depending on the configuration of the plan, less memory may be used. In some specific cases, the temporary space allocations can be as low as <code class="docutils literal notranslate">1*batch*n[0]*..*n[rank-1] cufftComplex</code> or <code class="docutils literal notranslate">cufftDoubleComplex</code> elements. This temporary space is allocated separately for each individual plan when it is created (i.e., temporary space is not shared between the plans). The next step in using the library is to call an execution function such as <code class="docutils literal notranslate">cufftExecC2C()</code> (see <a class="reference external" href="index.html#cufft-transform-types">Parameter cufftType</a>) which will perform the transform with the specifications defined at planning. One can create a cuFFT plan and perform multiple transforms on different data sets by providing different input and output pointers. Once the plan is no longer needed, the <code class="docutils literal notranslate">cufftDestroy()</code> function should be called to release the resources allocated for the plan. <section id="free-memory-requirement"> <h3>2.2.1. 
Free Memory Requirement<a class="headerlink" href="#free-memory-requirement" title="Permalink to this headline"></a></h3> The first program call to any cuFFT function causes the initialization of the cuFFT kernels. This can fail if there is not enough free memory on the GPU. It is advisable to initialize cuFFT first (for example, by creating a plan) and then allocate memory. </section> <section id="plan-initialization-time"> <h3>2.2.2. Plan Initialization Time<a class="headerlink" href="#plan-initialization-time" title="Permalink to this headline"></a></h3> During plan initialization, cuFFT conducts a series of steps, including heuristics to determine which kernels to use, as well as kernel module loads. Starting from CUDA 12.0, cuFFT delivers a larger portion of kernels using the CUDA Parallel Thread eXecution assembly form (PTX code), instead of the binary form (cubin object). The PTX code of cuFFT kernels is loaded and further compiled to binary code by the CUDA device driver at runtime when a cuFFT plan is initialized. This is called <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#just-in-time-compilation">just-in-time (JIT) compilation</a>. JIT compilation slightly increases cuFFT plan initialization time, depending on the transform size and the speed of the host CPU (see <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#module">Module load driver API</a>). However, the JIT overhead occurs only when binary code is generated for the first time during plan initialization using one of the <a class="reference external" href="https://docs.nvidia.com/cuda/cufft/index.html#cufft-setup">plan creation functions</a>. The device driver automatically caches a copy of the generated binary code to avoid repeating the compilation in subsequent invocations. 
If necessary, <code class="docutils literal notranslate">CUDA_CACHE_PATH</code> or <code class="docutils literal notranslate">CUDA_CACHE_MAXSIZE</code> can be customized to set the cache folder and maximum size (see details in <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars">CUDA Environmental Variables</a>), but the default settings are fine in general. </section> </section> <section id="fourier-transform-types"> <h2>2.3. Fourier Transform Types<a class="headerlink" href="#fourier-transform-types" title="Permalink to this headline"></a></h2> Apart from the general complex-to-complex (C2C) transform, cuFFT efficiently implements two other types: real-to-complex (R2C) and complex-to-real (C2R). In many practical applications the input vector is real-valued. It can be easily shown that in this case the output satisfies Hermitian symmetry (\(X_{k} = X_{N - k}^{\ast}\), where the star denotes complex conjugation). The converse is also true: for complex-Hermitian input the inverse transform will be purely real-valued. cuFFT takes advantage of this redundancy and works only on the first half of the Hermitian vector. Transform execution functions for single and double-precision are defined separately as: <ul class="simple"> <li><code class="docutils literal notranslate">cufftExecC2C() / cufftExecZ2Z()</code> - complex-to-complex transforms for single/double precision.</li> <li><code class="docutils literal notranslate">cufftExecR2C() / cufftExecD2Z()</code> - real-to-complex forward transform for single/double precision.</li> <li><code class="docutils literal notranslate">cufftExecC2R() / cufftExecZ2D()</code> - complex-to-real inverse transform for single/double precision.</li> </ul> Each of these functions requires a different input data layout (see <a class="reference external" href="index.html#data-layout">Data Layout</a> for details). 
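The Hermitian symmetry that the R2C and C2R transforms rely on can be checked numerically on the host. The following is a minimal sketch that uses a naive \(O(N^{2})\) DFT rather than cuFFT; the helper names <code class="docutils literal notranslate">naiveRealDft</code>, <code class="docutils literal notranslate">isHermitian</code> and <code class="docutils literal notranslate">nonRedundantCount</code> are hypothetical and are not part of the cuFFT API.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) forward DFT of a real signal:
//   X_k = sum_n x_n * exp(-2*pi*i*k*n/N)
// Illustration only (hypothetical helper); this is not how cuFFT computes transforms.
std::vector<std::complex<double>> naiveRealDft(const std::vector<double>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            sum += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = sum;
    }
    return X;
}

// Returns true if X_0 is real and X_k == conj(X_{N-k}) for k = 1..N-1, within tol.
bool isHermitian(const std::vector<std::complex<double>>& X, double tol) {
    const std::size_t N = X.size();
    if (N == 0) return true;
    if (std::abs(X[0].imag()) > tol) return false;
    for (std::size_t k = 1; k < N; ++k) {
        if (std::abs(X[k] - std::conj(X[N - k])) > tol) return false;
    }
    return true;
}

// Number of complex values in the non-redundant half: floor(N/2) + 1.
std::size_t nonRedundantCount(std::size_t N) {
    assert(N > 0);
    return N / 2 + 1;
}
```

For the real input (1, 2, 3, 4), the full spectrum is (10, -2+2i, -2, -2-2i); only the first <code class="docutils literal notranslate">nonRedundantCount(4)</code> = 3 values carry independent information, and the last one is the conjugate of the second.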
<div class="admonition note"> Note Complex-to-real (C2R) transforms accept complex-Hermitian input. For one-dimensional signals, this requires the 0th element (and the \(\frac{N}{2}\)th input if N is even) to be real-valued, i.e. its imaginary part should be zero. For d-dimension signals, this means \(x_{(n_{1},n_{2},\ldots,n_{d})} = x_{(N_{1} - n_{1},N_{2} - n_{2},\ldots,N_{d} - n_{d})}^{\ast}\). Otherwise, the behavior of the transform is undefined. Also see <a class="reference external" href="index.html#multidimensional-transforms">Multidimensional Transforms</a>. </div> Functions <code class="docutils literal notranslate">cufftXtExec()</code> and <code class="docutils literal notranslate">cufftXtExecDescriptor()</code> can perform transforms on any of the supported types. <section id="half-precision-cufft-transforms"> <h3>2.3.1. Half-precision cuFFT Transforms<a class="headerlink" href="#half-precision-cufft-transforms" title="Permalink to this headline"></a></h3> Half-precision transforms have the following limitations: <ul class="simple"> <li>Minimum GPU architecture is SM_53</li> <li>Sizes are restricted to powers of two only</li> <li>Strides on the real part of real-to-complex and complex-to-real transforms are not supported</li> <li>More than one GPU is not supported</li> <li>Transforms spanning more than 4 billion elements are not supported</li> </ul> Please refer to <code class="docutils literal notranslate">cufftXtMakePlanMany</code> function for plan creation details. The CUDA Toolkit provides the <code class="docutils literal notranslate">cuda_fp16.h</code> header with types and intrinsic functions for handling half-precision arithmetic. </section> <section id="bfloat16-precision-cufft-transforms"> <h3>2.3.2. 
Bfloat16-precision cuFFT Transforms<a class="headerlink" href="#bfloat16-precision-cufft-transforms" title="Permalink to this headline"></a></h3> cuFFT supports bfloat16 precision using the <code class="docutils literal notranslate">nv_bfloat16</code> data type. Please note that cuFFT utilizes a combination of single- and bfloat16-precision arithmetic operations when computing the FFT in bfloat16 precision. Bfloat16-precision transforms have similar limitations to half-precision transforms: <ul class="simple"> <li>Minimum GPU architecture is SM_80</li> <li>Sizes are restricted to powers of two only</li> <li>Strides on the real part of real-to-complex and complex-to-real transforms are not supported</li> <li>More than one GPU is not supported</li> <li>Transforms spanning more than 4 billion elements are not supported</li> </ul> Please refer to <code class="docutils literal notranslate">cufftXtMakePlanMany</code> function for plan creation details. The CUDA Toolkit provides the <code class="docutils literal notranslate">cuda_bf16.h</code> header with types and intrinsic functions for handling bfloat16-precision arithmetic. </section> </section> <section id="data-layout"> <h2>2.4. Data Layout<a class="headerlink" href="#data-layout" title="Permalink to this headline"></a></h2> In the cuFFT Library, data layout depends strictly on the configuration and the transform type. In the case of general complex-to-complex transform both the input and output data shall be a <code class="docutils literal notranslate">cufftComplex</code>/<code class="docutils literal notranslate">cufftDoubleComplex</code> array in single- and double-precision modes respectively. In C2R mode an input array \((x_{1},x_{2},\ldots,x_{\lfloor\frac{N}{2}\rfloor + 1})\) of only non-redundant complex elements is required. 
The output array \((X_{1},X_{2},\ldots,X_{N})\) consists of <code class="docutils literal notranslate">cufftReal</code>/<code class="docutils literal notranslate">cufftDoubleReal</code> elements in this mode. Finally, R2C demands an input array \((X_{1},X_{2},\ldots,X_{N})\) of real values and returns an array \((x_{1},x_{2},\ldots,x_{\lfloor\frac{N}{2}\rfloor + 1})\) of non-redundant complex elements. In real-to-complex and complex-to-real transforms the size of input data and the size of output data differ. For out-of-place transforms a separate array of appropriate size is created. For in-place transforms the user should use the <code class="docutils literal notranslate">padded</code> data layout. This layout is FFTW compatible. In the <code class="docutils literal notranslate">padded</code> layout, output signals begin at the same memory addresses as the input data. Therefore input data for real-to-complex and output data for complex-to-real must be padded. Expected sizes of input/output data for 1-d transforms are summarized in the table below: <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 7%" /> <col style="width: 47%" /> <col style="width: 47%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">FFT type</th> <th class="head">input data size</th> <th class="head">output data size</th> </tr> </thead> <tbody> <tr class="row-even"><td>C2C</td> <td>\(x\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(x\)<code class="docutils literal notranslate">cufftComplex</code></td> </tr> <tr class="row-odd"><td>C2R</td> <td>\(\left\lfloor \frac{x}{2} \right\rfloor + 1\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(x\)<code class="docutils literal notranslate">cufftReal</code></td> </tr> <tr class="row-even"><td>R2C*</td> <td>\(x\)<code class="docutils literal notranslate">cufftReal</code></td> <td>\(\left\lfloor \frac{x}{2} \right\rfloor + 1\)<code class="docutils literal
notranslate">cufftComplex</code></td> </tr> </tbody> </table> The real-to-complex transform is implicitly a forward transform. For an in-place real-to-complex transform where FFTW compatible output is desired, the input size must be padded to \(\left( {\lfloor\frac{N}{2}\rfloor + 1} \right)\) complex elements. For out-of-place transforms, input and output sizes match the logical transform size \(N\) and the non-redundant size \(\lfloor\frac{N}{2}\rfloor + 1\), respectively. The complex-to-real transform is implicitly inverse. For in-place complex-to-real FFTs where FFTW compatible output is selected (default padding mode), the input size is assumed to be \(\lfloor\frac{N}{2}\rfloor + 1\)<code class="docutils literal notranslate">cufftComplex</code> elements. Note that in-place complex-to-real FFTs may overwrite arbitrary imaginary input point values when non-unit input and output strides are chosen. Out-of-place complex-to-real FFT will always overwrite input buffer. For out-of-place transforms, input and output sizes match the logical transform non-redundant size \(\lfloor\frac{N}{2}\rfloor + 1\) and size \(N\), respectively. </section> <section id="multidimensional-transforms"> <h2>2.5. 
Multidimensional Transforms<a class="headerlink" href="#multidimensional-transforms" title="Permalink to this headline"></a></h2> A multidimensional DFT maps a \(d\)-dimensional array \(x_{\mathbf{n}}\), where \(\mathbf{n} = (n_{1},n_{2},\ldots,n_{d})\), into its frequency domain array given by: <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 100%" /> </colgroup> <tbody> <tr class="row-odd"><td>\(X_{\mathbf{k}} = \sum\limits_{n = 0}^{N - 1}x_{\mathbf{n}}e^{-2\pi i\frac{\mathbf{k}\mathbf{n}}{\mathbf{N}}}\)</td> </tr> </tbody> </table> where \(\frac{\mathbf{n}}{\mathbf{N}} = (\frac{n_{1}}{N_{1}},\frac{n_{2}}{N_{2}},\ldots,\frac{n_{d}}{N_{d}})\), and the summation denotes the set of nested summations <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 100%" /> </colgroup> <tbody> <tr class="row-odd"><td>\(\sum\limits_{n_{1} = 0}^{N_{1} - 1}\sum\limits_{n_{2} = 0}^{N_{2} - 1}\ldots\sum\limits_{n_{d} = 0}^{N_{d} - 1}\)</td> </tr> </tbody> </table> cuFFT supports one-dimensional, two-dimensional and three-dimensional transforms, which can all be called by the same <code class="docutils literal notranslate">cufftExec*</code> functions (see <a class="reference external" href="index.html#fft-types">Fourier Transform Types</a>). Similar to the one-dimensional case, the frequency domain representation of real-valued input data satisfies Hermitian symmetry, defined as: \(x_{(n_{1},n_{2},\ldots,n_{d})} = x_{(N_{1} - n_{1},N_{2} - n_{2},\ldots,N_{d} - n_{d})}^{\ast}\). C2R and R2C algorithms take advantage of this fact by operating only on half of the elements of the signal array, namely on: \(x_{\mathbf{n}}\) for \(\mathbf{n} \in \{ 1,\ldots,N_{1}\} \times \ldots \times \{ 1,\ldots,N_{d - 1}\} \times \{ 1,\ldots,\lfloor\frac{N_{d}}{2}\rfloor + 1\}\). The general rules of data alignment described in <a class="reference external" href="index.html#data-layout">Data Layout</a> apply to higher-dimensional transforms. 
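The index set above translates directly into element counts: only the last (innermost) dimension is halved. As a minimal host-side sketch, assuming dimensions are given outermost first, the counts can be computed as follows; <code class="docutils literal notranslate">realElementCount</code> and <code class="docutils literal notranslate">complexElementCount</code> are hypothetical helper names, not cuFFT functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of real elements in a rank-d signal: N_1 * N_2 * ... * N_d.
// Hypothetical helper for illustration; dimensions are listed outermost first.
std::size_t realElementCount(const std::vector<std::size_t>& n) {
    assert(!n.empty());
    std::size_t count = 1;
    for (std::size_t dim : n) count *= dim;
    return count;
}

// Number of non-redundant complex elements kept by R2C / consumed by C2R:
//   N_1 * ... * N_{d-1} * (floor(N_d / 2) + 1)
std::size_t complexElementCount(const std::vector<std::size_t>& n) {
    assert(!n.empty());
    std::size_t count = 1;
    for (std::size_t i = 0; i + 1 < n.size(); ++i) count *= n[i];
    return count * (n.back() / 2 + 1);
}
```

For a 2 x 3 x 5 real signal this gives 30 real elements and 2 * 3 * (5/2 + 1) = 18 complex elements.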
The following table summarizes input and output data sizes for multidimensional DFTs: <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 3%" /> <col style="width: 5%" /> <col style="width: 46%" /> <col style="width: 46%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">Dims</th> <th class="head">FFT type</th> <th class="head">Input data size</th> <th class="head">Output data size</th> </tr> </thead> <tbody> <tr class="row-even"><td>1D</td> <td>C2C</td> <td>\(\mathbf{N}_{1}\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(\mathbf{N}_{1}\)<code class="docutils literal notranslate">cufftComplex</code></td> </tr> <tr class="row-odd"><td>1D</td> <td>C2R</td> <td>\(\lfloor\frac{\mathbf{N}_{1}}{2}\rfloor + 1\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(\mathbf{N}_{1}\)<code class="docutils literal notranslate">cufftReal</code></td> </tr> <tr class="row-even"><td>1D</td> <td>R2C</td> <td>\(\mathbf{N}_{1}\)<code class="docutils literal notranslate">cufftReal</code></td> <td>\(\lfloor\frac{\mathbf{N}_{1}}{2}\rfloor + 1\)<code class="docutils literal notranslate">cufftComplex</code></td> </tr> <tr class="row-odd"><td>2D</td> <td>C2C</td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\)<code class="docutils literal notranslate">cufftComplex</code></td> </tr> <tr class="row-even"><td>2D</td> <td>C2R</td> <td>\(\mathbf{N}_{1}(\lfloor\frac{\mathbf{N}_{2}}{2}\rfloor + 1)\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\)<code class="docutils literal notranslate">cufftReal</code></td> </tr> <tr class="row-odd"><td>2D</td> <td>R2C</td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\)<code class="docutils literal notranslate">cufftReal</code></td> <td>\(\mathbf{N}_{1}(\lfloor\frac{\mathbf{N}_{2}}{2}\rfloor + 1)\)<code class="docutils literal 
notranslate">cufftComplex</code></td> </tr> <tr class="row-even"><td>3D</td> <td>C2C</td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\mathbf{N}_{3}\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\mathbf{N}_{3}\)<code class="docutils literal notranslate">cufftComplex</code></td> </tr> <tr class="row-odd"><td>3D</td> <td>C2R</td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}(\lfloor\frac{\mathbf{N}_{3}}{2}\rfloor + 1)\)<code class="docutils literal notranslate">cufftComplex</code></td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\mathbf{N}_{3}\)<code class="docutils literal notranslate">cufftReal</code></td> </tr> <tr class="row-even"><td>3D</td> <td>R2C</td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}\mathbf{N}_{3}\)<code class="docutils literal notranslate">cufftReal</code></td> <td>\(\mathbf{N}_{1}\mathbf{N}_{2}(\lfloor\frac{\mathbf{N}_{3}}{2}\rfloor + 1)\)<code class="docutils literal notranslate">cufftComplex</code></td> </tr> </tbody> </table> For example, static declaration of a three-dimensional array for the output of an out-of-place real-to-complex transform will look like this: <div class="highlight-c++ notranslate"><div class="highlight"><pre>cufftComplex odata[N1][N2][N3/2+1]; </pre></div> </div> </section> <section id="advanced-data-layout"> <h2>2.6. Advanced Data Layout<a class="headerlink" href="#advanced-data-layout" title="Permalink to this headline"></a></h2> The advanced data layout feature allows transforming only a subset of an input array, or outputting to only a portion of a larger data structure. 
It is configured by calling the function: <div class="highlight-c++ notranslate"><div class="highlight"><pre>cufftResult cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed, int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch); </pre></div> </div> Passing <code class="docutils literal notranslate">inembed</code> or <code class="docutils literal notranslate">onembed</code> set to <code class="docutils literal notranslate">NULL</code> is a special case and is equivalent to passing <code class="docutils literal notranslate">n</code> for each. This is the same as the basic data layout, and other advanced parameters such as <code class="docutils literal notranslate">istride</code> are ignored. If the advanced parameters are to be used, then all of the advanced interface parameters must be specified correctly. Advanced parameters are defined in units of the relevant data type (<code class="docutils literal notranslate">cufftReal</code>, <code class="docutils literal notranslate">cufftDoubleReal</code>, <code class="docutils literal notranslate">cufftComplex</code>, or <code class="docutils literal notranslate">cufftDoubleComplex</code>). Advanced layout can be perceived as an additional layer of abstraction above the access to input/output data arrays. 
An element of coordinates <code class="docutils literal notranslate">[z][y][x]</code> in signal number <code class="docutils literal notranslate">b</code> in the batch will be associated with the following addresses in the memory: <ul> <li>1D <code class="docutils literal notranslate">input[ b * idist + x * istride ]</code> <code class="docutils literal notranslate">output[ b * odist + x * ostride ]</code> </li> <li>2D <code class="docutils literal notranslate">input[ b * idist + (x * inembed[1] + y) * istride ]</code> <code class="docutils literal notranslate">output[ b * odist + (x * onembed[1] + y) * ostride ]</code> </li> <li>3D <code class="docutils literal notranslate">input[ b * idist + ((x * inembed[1] + y) * inembed[2] + z) * istride ]</code> <code class="docutils literal notranslate">output[ b * odist + ((x * onembed[1] + y) * onembed[2] + z) * ostride ]</code> </li> </ul> The <code class="docutils literal notranslate">istride</code> and <code class="docutils literal notranslate">ostride</code> parameters denote the distance between two successive input and output elements in the least significant (that is, the innermost) dimension respectively. In a single 1D transform, if every input element is to be used in the transform, <code class="docutils literal notranslate">istride</code> should be set to \(1\); if every other input element is to be used in the transform, then <code class="docutils literal notranslate">istride</code> should be set to \(2\). Similarly, in a single 1D transform, if it is desired to output final elements one after another compactly, <code class="docutils literal notranslate">ostride</code> should be set to \(1\); if spacing is desired between the least significant dimension output data, <code class="docutils literal notranslate">ostride</code> should be set to the distance between the elements. 
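The address formulas above can be written as small host-side helpers, which is a convenient way to sanity-check a set of advanced layout parameters before creating a plan. This is only a sketch: <code class="docutils literal notranslate">offset1D</code>, <code class="docutils literal notranslate">offset2D</code> and <code class="docutils literal notranslate">offset3D</code> are hypothetical names, not cuFFT functions, and they follow the formulas exactly as written, so <code class="docutils literal notranslate">x</code> is the outermost index in the 2D and 3D cases. The same helpers apply to the output side when called with <code class="docutils literal notranslate">onembed</code>, <code class="docutils literal notranslate">ostride</code> and <code class="docutils literal notranslate">odist</code>.

```cpp
#include <cassert>
#include <cstddef>

// Element offsets (in units of the transform's data type) for signal b of a batch.
// nembed is inembed or onembed, stride is istride or ostride, dist is idist or odist.
// Hypothetical helpers for illustration; they mirror the documented formulas.

// 1D: b * dist + x * stride
long long offset1D(int b, int x, int stride, int dist) {
    return static_cast<long long>(b) * dist + static_cast<long long>(x) * stride;
}

// 2D: b * dist + (x * nembed[1] + y) * stride
long long offset2D(int b, int x, int y, const int* nembed, int stride, int dist) {
    assert(nembed != nullptr);
    const long long inner = static_cast<long long>(x) * nembed[1] + y;
    return static_cast<long long>(b) * dist + inner * stride;
}

// 3D: b * dist + ((x * nembed[1] + y) * nembed[2] + z) * stride
long long offset3D(int b, int x, int y, int z, const int* nembed, int stride, int dist) {
    assert(nembed != nullptr);
    const long long inner =
        (static_cast<long long>(x) * nembed[1] + y) * nembed[2] + z;
    return static_cast<long long>(b) * dist + inner * stride;
}
```

For example, with <code class="docutils literal notranslate">istride = 2</code> and <code class="docutils literal notranslate">idist = 100</code>, element 3 of signal 2 in a 1D batch sits at offset 2 * 100 + 3 * 2 = 206. Note that <code class="docutils literal notranslate">nembed[0]</code> never appears in these formulas.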
The <code class="docutils literal notranslate">inembed</code> and <code class="docutils literal notranslate">onembed</code> parameters define the number of elements in each dimension in the input array and the output array respectively. The <code class="docutils literal notranslate">inembed[rank-1]</code> contains the number of elements in the least significant (innermost) dimension of the input data excluding the <code class="docutils literal notranslate">istride</code> elements; the number of total elements in the least significant dimension of the input array is then <code class="docutils literal notranslate">istride*inembed[rank-1]</code>. The <code class="docutils literal notranslate">inembed[0]</code> or <code class="docutils literal notranslate">onembed[0]</code> corresponds to the most significant (that is, the outermost) dimension and is effectively ignored since the <code class="docutils literal notranslate">idist</code> or <code class="docutils literal notranslate">odist</code> parameter provides this information instead. Note that the size of each dimension of the transform should be less than or equal to the <code class="docutils literal notranslate">inembed</code> and <code class="docutils literal notranslate">onembed</code> values for the corresponding dimension, that is <code class="docutils literal notranslate">n[i]</code> ≤ <code class="docutils literal notranslate">inembed[i]</code>, <code class="docutils literal notranslate">n[i]</code> ≤ <code class="docutils literal notranslate">onembed[i]</code>, where \(i \in \{ 0,\ldots,rank - 1\}\). The <code class="docutils literal notranslate">idist</code> and <code class="docutils literal notranslate">odist</code> parameters indicate the distance between the first element of two consecutive batches in the input and output data. </section> <section id="streamed-cufft-transforms"> <h2>2.7. 
Streamed cuFFT Transforms<a class="headerlink" href="#streamed-cufft-transforms" title="Permalink to this headline"></a></h2> Every cuFFT plan may be associated with a CUDA stream. Once so associated, all launches of the internal stages of that plan take place through the specified stream. Streaming of cuFFT execution allows for potential overlap between transforms and memory copies. (See the NVIDIA CUDA Programming Guide for more information on streams.) If no stream is associated with a plan, launches take place in <code class="docutils literal notranslate">stream(0)</code>, the default CUDA stream. Note that many plan executions require multiple kernel launches. cuFFT uses private streams internally to sort operations, including event synchronization. cuFFT does not guarantee ordering of internal operations, and the order is only preserved with respect to the streams set by the user. As of CUDA 11.2 (cuFFT 10.4.0), <code class="docutils literal notranslate">cufftSetStream()</code> is supported in multiple GPU cases. However, calls to <code class="docutils literal notranslate">cufftXtMemcpy()</code> are still synchronous across multiple GPUs when using streams. In previous versions of cuFFT, <code class="docutils literal notranslate">cufftSetStream()</code> returns an error in the multiple GPU case. Likewise, calling certain multi-GPU functions such as <code class="docutils literal notranslate">cufftXtSetCallback()</code> after setting a stream with <code class="docutils literal notranslate">cufftSetStream()</code> will result in an error (see API functions for more details). Note that to overlap executions that share a single plan handle, the user must manage the work area buffers: each concurrent plan execution needs its own exclusive work area. The work area can be set with the <code class="docutils literal notranslate">cufftSetWorkArea</code> function. </section> <section id="multiple-gpu-cufft-transforms"> <h2>2.8. 
Multiple GPU cuFFT Transforms<a class="headerlink" href="#multiple-gpu-cufft-transforms" title="Permalink to this headline"></a></h2> cuFFT supports using up to sixteen GPUs connected to a CPU to perform Fourier Transforms whose calculations are distributed across the GPUs. An API has been defined to allow users to write new code or modify existing code to use this functionality. Some existing functions such as the creation of a plan using <code class="docutils literal notranslate">cufftCreate()</code> also apply in the multiple GPU case. Multiple GPU routines contain <code class="docutils literal notranslate">Xt</code> in their name. The memory on the GPUs is managed by helper functions <code class="docutils literal notranslate">cufftXtMalloc()/cufftXtFree()</code> and <code class="docutils literal notranslate">cufftXtMemcpy()</code> using the <code class="docutils literal notranslate">cudaLibXtDesc</code> descriptor. Performance is a function of the bandwidth between the GPUs, the computational ability of the individual GPUs, and the type and number of FFT to be performed. The highest performance is obtained using NVLink interconnect (<a class="reference external" href="https://www.nvidia.com/object/nvlink.html">https://www.nvidia.com/object/nvlink.html</a>). The second best option is using PCI Express 3.0 between the GPUs and ensuring that both GPUs are on the same switch. Note that multiple GPU execution is not guaranteed to solve a given size problem in a shorter time than single GPU execution. The multiple GPU extensions to cuFFT are built on the extensible cuFFT API. 
The general steps in defining and executing a transform with this API are: <ul class="simple"> <li><code class="docutils literal notranslate">cufftCreate()</code> - create an empty plan, as in the single GPU case</li> <li><code class="docutils literal notranslate">cufftXtSetGPUs()</code> - define which GPUs are to be used</li> <li>Optional: <code class="docutils literal notranslate">cufftEstimate{1d,2d,3d,Many}()</code> - estimate the sizes of the work areas required. These are the same functions used in the single GPU case although the definition of the argument <code class="docutils literal notranslate">workSize</code> reflects the number of GPUs used.</li> <li><code class="docutils literal notranslate">cufftMakePlan{1d,2d,3d,Many}()</code> - create the plan. These are the same functions used in the single GPU case although the definition of the argument <code class="docutils literal notranslate">workSize</code> reflects the number of GPUs used.</li> <li>Optional: <code class="docutils literal notranslate">cufftGetSize{1d,2d,3d,Many}()</code> - refined estimate of the sizes of the work areas required. These are the same functions used in the single GPU case although the definition of the argument <code class="docutils literal notranslate">workSize</code> reflects the number of GPUs used.</li> <li>Optional: <code class="docutils literal notranslate">cufftGetSize()</code> - check workspace size. 
This is the same function used in the single GPU case although the definition of the argument <code class="docutils literal notranslate">workSize</code> reflects the number of GPUs used.</li> <li>Optional: <code class="docutils literal notranslate">cufftXtSetWorkArea()</code> - do your own workspace allocation.</li> <li><code class="docutils literal notranslate">cufftXtMalloc()</code> - allocate descriptor and data on the GPUs</li> <li><code class="docutils literal notranslate">cufftXtMemcpy()</code> - copy data to the GPUs</li> <li><code class="docutils literal notranslate">cufftXtExecDescriptorC2C()/cufftXtExecDescriptorZ2Z()</code> - execute the plan</li> <li><code class="docutils literal notranslate">cufftXtMemcpy()</code> - copy data from the GPUs</li> <li><code class="docutils literal notranslate">cufftXtFree()</code> - free any memory allocated with <code class="docutils literal notranslate">cufftXtMalloc()</code></li> <li><code class="docutils literal notranslate">cufftDestroy()</code> - free cuFFT plan resources</li> </ul> <section id="plan-specification-and-work-areas"> <h3>2.8.1. Plan Specification and Work Areas<a class="headerlink" href="#plan-specification-and-work-areas" title="Permalink to this headline"></a></h3> In the single GPU case a plan is created by a call to <code class="docutils literal notranslate">cufftCreate()</code> followed by a call to <code class="docutils literal notranslate">cufftMakePlan*()</code>. For multiple GPUs, the GPUs to use for execution are identified by a call to <code class="docutils literal notranslate">cufftXtSetGPUs()</code> and this must occur after the call to <code class="docutils literal notranslate">cufftCreate()</code> and prior to the call to <code class="docutils literal notranslate">cufftMakePlan*()</code>. Note that when <code class="docutils literal notranslate">cufftMakePlan*()</code> is called for a single GPU, the work area is on that GPU. 
In a multiple GPU plan, the returned work area size has multiple entries, one value per GPU; that is, <code class="docutils literal notranslate">workSize</code> points to a <code class="docutils literal notranslate">size_t</code> array with one entry per GPU. Also, the strides and batches apply to the entire plan across all GPUs associated with the plan. Once a plan is locked by a call to <code class="docutils literal notranslate">cufftMakePlan*()</code>, different descriptors may be specified in calls to <code class="docutils literal notranslate">cufftXtExecDescriptor*()</code> to execute the plan on different data sets, but the new descriptors must use the same GPUs in the same order. As in the single GPU case, <code class="docutils literal notranslate">cufftEstimate{Many,1d,2d,3d}()</code> and <code class="docutils literal notranslate">cufftGetSize{Many,1d,2d,3d}()</code> give estimates of the work area sizes required for a multiple GPU plan, and in this case <code class="docutils literal notranslate">workSize</code> points to a <code class="docutils literal notranslate">size_t</code> array, one entry per GPU. Similarly, the actual work size returned by <code class="docutils literal notranslate">cufftGetSize()</code> is a <code class="docutils literal notranslate">size_t</code> array, one entry per GPU in the multiple GPU case. </section> <section id="helper-functions"> <h3>2.8.2. Helper Functions<a class="headerlink" href="#helper-functions" title="Permalink to this headline"></a></h3> Multiple GPU cuFFT execution functions assume a certain data layout in terms of what input data has been copied to which GPUs prior to execution, and what output data resides in which GPUs post execution. cuFFT provides functions to assist users in manipulating data on multiple GPUs. These must be called after the call to <code class="docutils literal notranslate">cufftMakePlan*()</code>. 
On a single GPU users may call <code class="docutils literal notranslate">cudaMalloc()</code> and <code class="docutils literal notranslate">cudaFree()</code> to allocate and free GPU memory. To provide similar functionality in the multiple GPU case, cuFFT includes <code class="docutils literal notranslate">cufftXtMalloc()</code> and <code class="docutils literal notranslate">cufftXtFree()</code> functions. The function <code class="docutils literal notranslate">cufftXtMalloc()</code> returns a descriptor which specifies the location of these memories. On a single GPU users may call <code class="docutils literal notranslate">cudaMemcpy()</code> to transfer data between host and GPU memory. To provide similar functionality in the multiple GPU case, cuFFT includes <code class="docutils literal notranslate">cufftXtMemcpy()</code> which allows users to copy between host and multiple GPU memories or even between the GPU memories. All single GPU cuFFT FFTs return the output data in natural order; that is, the ordering of the result is the same as if a DFT had been performed on the data. Some Fast Fourier Transforms produce intermediate results where the data is left in a permutation of the natural output. When batch is one, a multiple GPU transform leaves the data in GPU memory in a permutation of the natural output. When <code class="docutils literal notranslate">cufftXtMemcpy()</code> is used to copy data from GPU memory back to host memory, the results are in natural order regardless of whether the data on the GPUs is in natural order or permuted. Using <code class="docutils literal notranslate">CUFFT_COPY_DEVICE_TO_DEVICE</code> allows users to copy data from the permuted data format produced after a single transform to the natural order on GPUs. </section> <section id="multiple-gpu-2d-and-3d-transforms-on-permuted-input"> <h3>2.8.3. 
Multiple GPU 2D and 3D Transforms on Permuted Input<a class="headerlink" href="#multiple-gpu-2d-and-3d-transforms-on-permuted-input" title="Permalink to this headline"></a></h3> For single 2D or 3D transforms on multiple GPUs, when <code class="docutils literal notranslate">cufftXtMemcpy()</code> distributes the data to the GPUs, the array is divided on the X axis. For example, with two GPUs, half of the X dimension points, for all Y (and Z) values, are copied to each of the GPUs. When the transform is computed, the data are permuted such that they are divided on the Y axis. That is, half of the Y dimension points, for all X (and Z) values, are on each of the GPUs. When cuFFT creates a 2D or 3D plan for a single transform on multiple GPUs, it actually creates two plans. One plan expects input to be divided on the X axis. The other plan expects data to be divided on the Y axis. This is done because many algorithms compute a forward FFT, then perform some point-wise operation on the result, and then compute the inverse FFT. A memory copy to restore the data to the original order would be expensive. To avoid this, <code class="docutils literal notranslate">cufftXtMemcpy()</code> and <code class="docutils literal notranslate">cufftXtExecDescriptor()</code> keep track of the data ordering so that the correct operation is used. The ability of cuFFT to process data in either order makes the following sequence possible. 
<ul class="simple"> <li><code class="docutils literal notranslate">cufftCreate()</code> - create an empty plan, as in the single GPU case</li> <li><code class="docutils literal notranslate">cufftXtSetGPUs()</code> - define which GPUs are to be used</li> <li><code class="docutils literal notranslate">cufftMakePlan{1d,2d,3d,Many}()</code> - create the plan.</li> <li><code class="docutils literal notranslate">cufftXtMalloc()</code> - allocate descriptor and data on the GPUs</li> <li><code class="docutils literal notranslate">cufftXtMemcpy()</code> - copy data to the GPUs</li> <li><code class="docutils literal notranslate">cufftXtExecDescriptorC2C()/cufftXtExecDescriptorZ2Z()</code> - compute the forward FFT</li> <li><code class="docutils literal notranslate">userFunction()</code> - modify the data in the frequency domain</li> <li><code class="docutils literal notranslate">cufftXtExecDescriptorC2C()/cufftXtExecDescriptorZ2Z()</code> - compute the inverse FFT</li> <li>Note that it was not necessary to copy/permute the data between execute calls</li> <li><code class="docutils literal notranslate">cufftXtMemcpy()</code> - copy data to the host</li> <li><code class="docutils literal notranslate">cufftXtFree()</code> - free any memory allocated with <code class="docutils literal notranslate">cufftXtMalloc()</code></li> <li><code class="docutils literal notranslate">cufftDestroy()</code> - free cuFFT plan resources</li> </ul> </section> <section id="supported-functionality"> <h3>2.8.4. Supported Functionality<a class="headerlink" href="#supported-functionality" title="Permalink to this headline"></a></h3> Starting with cuFFT version 7.0, a subset of single GPU functionality is supported for multiple GPU execution. 
Requirements and limitations: <ul class="simple"> <li>All GPUs must have the same CUDA architecture level and support Unified Virtual Address Space.</li> <li>On Windows, the GPU boards must be operating in Tesla Compute Cluster (TCC) mode.</li> <li>For an application that uses the CUDA Driver API, running cuFFT on multiple GPUs is only compatible with applications using the primary context on each GPU.</li> <li>Strided input and output are not supported.</li> <li>Running cuFFT on more than 8 GPUs (16 GPUs is max) is supported on machines with NVLink only.</li> </ul> While transforms with batch count greater than one do not impose additional constraints, those with a single batch have some restrictions. Single-batch FFTs support only in-place mode, and have additional constraints depending on the FFT type. This behavior is summarized in the following table: <table class="table-no-stripes longtable docutils align-default"> <colgroup> <col style="width: 11%" /> <col style="width: 25%" /> <col style="width: 53%" /> <col style="width: 11%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">batch=1</th> <th class="head">1D</th> <th class="head">2D</th> <th class="head">3D</th> </tr> </thead> <tbody> <tr class="row-even"><td><code class="docutils literal notranslate">C2C</code>/<code class="docutils literal notranslate">Z2Z</code></td> <td><ul class="simple"> <li>2,4,8,16 GPUs</li> <li>power of 2 sizes only</li> <li>Minimum size for 2-4 GPUs is 64</li> <li>Minimum size for 8 GPUs is 128</li> <li>Minimum size for 16 GPUs is 1024</li> </ul> </td> <td colspan="2"><ul class="simple"> <li>2-16 GPUs</li> <li>One of the following conditions is met for each dimension: <ul> <li>Dimension must factor into primes less than or equal to 127</li> <li>Maximum dimension size is 4096 for single precision</li> <li>Maximum dimension size is 2048 for double precision</li> </ul> </li> <li>Minimum size is 32</li> <li>No LTO callback support</li> </ul> </td> </tr> <tr 
class="row-odd"><td><code class="docutils literal notranslate">R2C</code>/<code class="docutils literal notranslate">D2Z</code></td> <td>not supported</td> <td colspan="2"><ul class="simple"> <li>2-16 GPUs</li> <li>One of the following conditions is met for each dimension: <ul> <li>Dimension must factor into primes less than or equal to 127</li> <li>Maximum dimension size is 4096 for single precision</li> <li>Maximum dimension size is 2048 for double precision</li> </ul> </li> <li>Minimum size is 32</li> <li>Fastest changing dimension size needs to be even</li> <li>Supports only <code class="docutils literal notranslate">CUFFT_XT_FORMAT_INPLACE</code> input descriptor format</li> <li>No legacy callback / LTO callback support</li> </ul> </td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate">C2R</code>/<code class="docutils literal notranslate">Z2D</code></td> <td>not supported</td> <td colspan="2"><ul class="simple"> <li>2-16 GPUs</li> <li>One of the following conditions is met for each dimension: <ul> <li>Dimension must factor into primes less than or equal to 127</li> <li>Maximum dimension size is 4096 for single precision</li> <li>Maximum dimension size is 2048 for double precision</li> </ul> </li> <li>Minimum size is 32</li> <li>Fastest changing dimension size needs to be even</li> <li>Supports only <code class="docutils literal notranslate">CUFFT_XT_FORMAT_INPLACE_SHUFFLED</code> input descriptor format</li> <li>No legacy callback / LTO callback support</li> </ul> </td> </tr> </tbody> </table> General guidelines are: <ul class="simple"> <li>Parameter <code class="docutils literal notranslate">whichGPUs</code> of <code class="docutils literal notranslate">cufftXtSetGPUs()</code> function determines ordering of the GPUs with respect to data decomposition (first data chunk is placed on GPU denoted by first element of <code class="docutils literal notranslate">whichGPUs</code>)</li> <li>The data for the entire transform must fit within the 
memory of the GPUs assigned to it.</li> <li>For batch size <code class="docutils literal notranslate">m</code> on <code class="docutils literal notranslate">n</code> GPUs : <ul> <li>The first <code class="docutils literal notranslate">m % n</code> GPUs execute \(\left\lfloor \frac{m}{n} \right\rfloor+\ 1\) transforms.</li> <li>The remaining GPUs execute \(\left\lfloor \frac{m}{n} \right\rfloor\) transforms.</li> </ul> </li> </ul> Batch size output differences: Single GPU cuFFT results are always returned in natural order. When multiple GPUs are used to perform more than one transform, the results are also returned in natural order. When multiple GPUs are used to perform a single transform, the results are returned in a permutation of the natural order to reduce communication time. This behavior is summarized in the following table: <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 23%" /> <col style="width: 43%" /> <col style="width: 33%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">Number of GPUs</th> <th class="head">Number of transforms</th> <th class="head">Output Order on GPUs</th> </tr> </thead> <tbody> <tr class="row-even"><td>One</td> <td>One or multiple transforms</td> <td>Natural order</td> </tr> <tr class="row-odd"><td>Multiple</td> <td>One</td> <td>Permuted results</td> </tr> <tr class="row-even"><td>Multiple</td> <td>Multiple</td> <td>Natural order</td> </tr> </tbody> </table> Producing natural order results in GPU memory for multi-GPU runs in the 1D single transform case requires calling <code class="docutils literal notranslate">cufftXtMemcpy()</code> with <code class="docutils literal notranslate">CUFFT_COPY_DEVICE_TO_DEVICE</code>. 2D and 3D multi-GPU transforms support execution of a transform given permuted order results as input. After execution in this case, the output will be in natural order. 
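The batch distribution rule in the general guidelines above can be sketched on the host as follows. This is a minimal illustration only: <code class="docutils literal notranslate">transformsPerGpu</code> is a hypothetical helper, not part of the cuFFT API, and it assumes the GPUs are indexed in the order given by <code class="docutils literal notranslate">whichGPUs</code>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cuFFT function): returns how many transforms
// each GPU executes for batch size m on n GPUs, following the rule above.
std::vector<std::size_t> transformsPerGpu(std::size_t m, std::size_t n) {
    assert(n > 0 && "at least one GPU is required");
    // Every GPU executes floor(m / n) transforms.
    std::vector<std::size_t> counts(n, m / n);
    // The first m % n GPUs execute one additional transform.
    for (std::size_t i = 0; i < m % n; ++i) {
        ++counts[i];
    }
    return counts;
}
```

For instance, with a batch of 10 on 4 GPUs the helper yields 3, 3, 2 and 2 transforms, which sums to the full batch.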
It is also possible to use <code class="docutils literal notranslate">cufftXtMemcpy()</code> with <code class="docutils literal notranslate">CUFFT_COPY_DEVICE_TO_DEVICE</code> to return 2D or 3D data to natural order. See the cuFFT Code Examples section for single GPU and multiple GPU examples. </section> </section> <section id="cufft-callback-routines"> <h2>2.9. cuFFT Callback Routines<a class="headerlink" href="#cufft-callback-routines" title="Permalink to this headline"></a></h2> Callback routines are user-supplied kernel routines that cuFFT will call when loading or storing data. They allow the user to do data pre- or post-processing without additional kernel calls. <div class="admonition note"> Note In CUDA 12.6 Update 2, we introduced support for Link-Time Optimized (LTO) callbacks as a replacement for the deprecated (legacy) callbacks. See more in <a class="reference external" href="index.html#lto-load-and-store-callback-routines">LTO Load and Store Callback Routines</a>. Starting from CUDA 11.4, support for callback functionality using separately compiled device code (i.e. legacy callbacks) is deprecated on all GPU architectures. Callback functionality will continue to be supported for all GPU architectures. </div> <section id="overview-of-the-cufft-callback-routine-feature"> <h3>2.9.1. Overview of the cuFFT Callback Routine Feature<a class="headerlink" href="#overview-of-the-cufft-callback-routine-feature" title="Permalink to this headline"></a></h3> cuFFT provides a set of APIs that allow the cuFFT user to provide CUDA functions that re-direct or manipulate the data as it is loaded prior to processing the FFT, or stored once the FFT has been done. For the load callback, cuFFT passes the callback routine the address of the input data and the offset to the value to be loaded from device memory, and the callback routine returns the value it wishes cuFFT to use instead. 
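As an illustration of the load callback contract just described, the following sketch models a real-valued load callback that multiplies each input element by a window coefficient. It is written so that it also compiles as plain host C++ for experimentation. The names <code class="docutils literal notranslate">CB_DEVICE</code>, <code class="docutils literal notranslate">Real</code>, <code class="docutils literal notranslate">WindowInfo</code> and <code class="docutils literal notranslate">windowedLoad</code> are assumptions made for this example and are not cuFFT identifiers; in real device code the return type would be <code class="docutils literal notranslate">cufftReal</code>, and LTO callbacks take an <code class="docutils literal notranslate">unsigned long long</code> offset rather than <code class="docutils literal notranslate">size_t</code>.

```cpp
#include <cassert>
#include <cstddef>

// Use the real __device__ qualifier under nvcc, nothing on a host compiler.
#ifdef __CUDACC__
#define CB_DEVICE __device__
#else
#define CB_DEVICE
#endif

using Real = float;  // stand-in for cufftReal (assumption for host testing)

// Hypothetical layout of the structure passed through callerInfo.
struct WindowInfo {
    const Real* window;  // window coefficients
    std::size_t length;  // number of coefficients (one transform length)
};

// Load callback sketch: returns the value to use in place of the element
// located `offset` elements (not bytes) from the start of the input.
CB_DEVICE Real windowedLoad(void* dataIn, std::size_t offset,
                            void* callerInfo, void* /*sharedPtr*/) {
    const Real* in = static_cast<const Real*>(dataIn);
    const WindowInfo* info = static_cast<const WindowInfo*>(callerInfo);
    assert(info->length > 0);
    // Assumes contiguous batches, so the window repeats every `length` elements.
    return in[offset] * info->window[offset % info->length];
}
```

The transform then operates on the windowed values without a separate windowing kernel or an extra pass over memory.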
For the store callback, cuFFT passes the callback routine the value it has computed, along with the address of the output data and the offset to the value to be written to device memory, and the callback routine modifies the value and stores the modified result. In order to provide a callback to cuFFT, a plan is created using the extensible plan APIs. After the call to <code class="docutils literal notranslate">cufftCreate</code>, the user may associate a load callback routine, or a store callback routine, or both, with the plan, by: <ul class="simple"> <li>Calling <code class="docutils literal notranslate">cufftXtSetJITCallback</code> before <code class="docutils literal notranslate">cufftMakePlan</code>, for LTO callbacks</li> <li>Calling <code class="docutils literal notranslate">cufftXtSetCallback</code> after <code class="docutils literal notranslate">cufftMakePlan</code>, for legacy callbacks</li> </ul> The caller also has the option to specify a device pointer to an opaque structure they wish to associate with the plan. This pointer will be passed to the callback routine by the cuFFT library. The caller may use this structure to remember plan dimensions and strides, or have a pointer to auxiliary data, etc. With some restrictions, the callback routine is allowed to request shared memory for its own use. If the requested amount of shared memory is available, cuFFT will pass a pointer to it when it calls the callback routine. cuFFT allows for 8 types of callback routines, one for each possible combination of: load or store, real or complex, single precision or double: <ul class="simple"> <li>For LTO callbacks, the user must provide an LTO routine that matches the function prototype for the type of routine specified. 
Otherwise, the planning function <code class="docutils literal notranslate">cufftMakePlan</code> will fail.</li> <li>For legacy callbacks, it is the caller’s responsibility to provide a routine that matches the function prototype for the type of routine specified.</li> </ul> If there is already a callback of the specified type associated with the plan handle, the set callback functions will replace it with the new one. The callback routine extensions to cuFFT are built on the extensible cuFFT API. The general steps in defining and executing a transform with callbacks are: <ul class="simple"> <li><code class="docutils literal notranslate">cufftCreate()</code> - create an empty plan, as in the single GPU case.</li> <li>(For LTO callbacks) <code class="docutils literal notranslate">cufftXtSetJITCallback()</code> - set a load and/or store LTO callback for this plan.</li> <li><code class="docutils literal notranslate">cufftMakePlan{1d,2d,3d,Many}()</code> - create the plan. These are the same functions used in the single GPU case.</li> <li>(For legacy callbacks) <code class="docutils literal notranslate">cufftXtSetCallback()</code> - set a load and/or store legacy callback for this plan.</li> <li><code class="docutils literal notranslate">cufftExecC2C() etc.</code> - execute the plan.</li> <li><code class="docutils literal notranslate">cufftDestroy()</code> - free cuFFT plan resources.</li> </ul> Callback functions are not supported on transforms with a dimension size that does not factor into primes smaller than 127. Callback functions on plans whose dimensions’ prime factors are limited to 2, 3, 5, and 7 can safely call <code class="docutils literal notranslate">__syncthreads()</code>. On other plans, results are not defined. <div class="admonition note"> Note The LTO callback API is available in the dynamic and static cuFFT libraries on 64 bit Windows and LINUX operating systems. 
The LTO callback API requires compatible nvJitLink and NVRTC libraries present in the dynamic library path. See more details in <a class="reference external" href="index.html#lto-load-and-store-callback-routines">LTO Load and Store Callback Routines</a>. The legacy callback API is available only in the static cuFFT library on 64 bit LINUX operating systems. </div> </section> <section id="lto-load-and-store-callback-routines"> <h3>2.9.2. LTO Load and Store Callback Routines<a class="headerlink" href="#lto-load-and-store-callback-routines" title="Permalink to this headline"></a></h3> LTO callbacks in cuFFT for a given toolkit version require using the <a class="reference external" href="https://docs.nvidia.com/cuda/nvjitlink/index.html">nvJitLink library</a> from the same toolkit or greater, but within the same toolkit major. Additionally, in order to specify custom names for the LTO callback routines, cuFFT requires using the <a class="reference external" href="https://docs.nvidia.com/cuda/nvrtc/index.html">NVRTC library</a>. cuFFT uses NVRTC to compile a minimal wrapper around the user callback with custom symbol name. The custom symbol name provided to the cuFFT API must be a valid, null-terminated C-string containing the unmangled name; currently, keywords that alter the scope of the symbol name (such as <code class="docutils literal notranslate">namespace</code>) or the mangling (such as <code class="docutils literal notranslate">extern "C"</code>) are not supported. The NVRTC library used must be from a toolkit that is either the same version or older than the nvJitLink library, and both must be from the same toolkit major. For example, in toolkit version 12.6 cuFFT requires nvJitLink to be from toolkit version 12.X, where <code class="docutils literal notranslate">X >= 6</code>, and NVRTC to be from toolkit version 12.Y, where <code class="docutils literal notranslate">0 <= Y <= X</code>. 
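The version rule above can be expressed as a small host-side check. This is only a sketch of the stated constraints: <code class="docutils literal notranslate">ToolkitVersion</code> and <code class="docutils literal notranslate">ltoLibrariesCompatible</code> are hypothetical names introduced for illustration, and cuFFT does not expose such a function.

```cpp
#include <cassert>

// Hypothetical value type for a CUDA toolkit version such as 12.6.
struct ToolkitVersion {
    int majorVer;
    int minorVer;
};

// Returns true when nvJitLink and NVRTC satisfy the rule described above
// for the toolkit version that cuFFT comes from:
//   - all three libraries share the same toolkit major,
//   - nvJitLink is the same version as cuFFT or newer,
//   - NVRTC is the same version as nvJitLink or older.
bool ltoLibrariesCompatible(ToolkitVersion cufft, ToolkitVersion nvJitLink,
                            ToolkitVersion nvrtc) {
    assert(cufft.minorVer >= 0 && nvJitLink.minorVer >= 0 && nvrtc.minorVer >= 0);
    const bool sameMajor = nvJitLink.majorVer == cufft.majorVer &&
                           nvrtc.majorVer == cufft.majorVer;
    return sameMajor &&
           nvJitLink.minorVer >= cufft.minorVer &&
           nvrtc.minorVer <= nvJitLink.minorVer;
}
```

With cuFFT from toolkit 12.6, the check accepts nvJitLink 12.8 together with NVRTC 12.4, and rejects nvJitLink 12.5 or any library from a different toolkit major.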
Both the nvJitLink and the NVRTC libraries are loaded dynamically, and should be present in the system’s dynamic linking path (e.g. <code class="docutils literal notranslate">LD_LIBRARY_PATH</code> on Unix systems, or <code class="docutils literal notranslate">PATH</code> on Windows systems). Code samples for LTO callbacks are available in the public <a class="reference external" href="https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFT">CUDA Library Samples github repository</a>. <section id="specifying-lto-load-and-store-callback-routines"> <h4>2.9.2.1. Specifying LTO Load and Store Callback Routines<a class="headerlink" href="#specifying-lto-load-and-store-callback-routines" title="Permalink to this headline"></a></h4> Usage of LTO callbacks in cuFFT is divided in two parts: <ul class="simple"> <li>Generating the LTO callback (i.e. compiling the callback routine to LTO-IR).</li> <li>Associating the LTO callback with the cuFFT plan.</li> </ul> To generate the LTO callback, users can compile the callback device function to LTO-IR using nvcc with any of the supported flags (such as <code class="docutils literal notranslate">-dlto</code> or <code class="docutils literal notranslate">-gencode=arch=compute_XX,code=lto_XX</code>, with <code class="docutils literal notranslate">XX</code> indicating the target GPU architecture); alternatively, users can generate the LTO callback using NVRTC to do runtime compilation via the <code class="docutils literal notranslate">-dlto</code> flag. Notice that PTX JIT is part of the JIT LTO kernel finalization trajectory, so architectures older than the current system architecture are supported; users can compile their callback function to LTO-IR for target arch <code class="docutils literal notranslate">XX</code> and execute plans which use the callback functions on GPUs with arch <code class="docutils literal notranslate">YY</code>, where <code class="docutils literal notranslate">XX <= YY</code>. 
Please see <a class="reference external" href="https://developer.nvidia.com/blog/cuda-12-0-compiler-support-for-runtime-lto-using-nvjitlink-library/">Compiler Support for Runtime LTO Using nvJitLink Library</a> and <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#just-in-time-compilation">Just-in-Time (JIT) Compilation</a> for more details. As an example, if a user wants to specify a load callback for an R2C transform, they could write the following code: <div class="highlight-c++ notranslate"><div class="highlight"><pre>__device__ cufftReal myOwnLTOCallback(void *dataIn,
                                      unsigned long long offset,
                                      void *callerInfo,
                                      void *sharedPtr) {
    // Placeholder default: pass the input element through unchanged
    cufftReal ret = static_cast&lt;cufftReal *&gt;(dataIn)[offset];
    // use offset, dataIn, and optionally callerInfo to
    // compute the return value
    return ret;
}
</pre></div> </div> To compile the callback to LTO-IR, the user could do: <div class="highlight-c++ notranslate"><div class="highlight"><pre># Compile the code to SM60 LTO-IR into a fatbin file
nvcc -gencode=arch=compute_60,code=lto_60 -dc -fatbin callback.cu -o callback.fatbin

# Turn the fatbin data into a C array inside a header, for easy inclusion in host code
bin2c --name my_lto_callback_fatbin --type longlong callback.fatbin &gt; callback_fatbin.h
</pre></div> </div> To associate the LTO callback with the cuFFT plan, users can leverage the new API call <code class="docutils literal notranslate">cufftXtSetJITCallback()</code>, which works similarly to <code class="docutils literal notranslate">cufftXtSetCallback()</code>, with a few caveats. First, <code class="docutils literal notranslate">cufftXtSetJITCallback()</code> must be called after plan creation with <code class="docutils literal notranslate">cufftCreate()</code>, and before calling the plan initialization function with <code class="docutils literal notranslate">cufftMakePlan*()</code> and similar routines. 
Second, removing the LTO callback from the plan (using <code class="docutils literal notranslate">cufftXtClearCallback()</code>) is currently not supported. A new plan must be created. <div class="highlight-c++ notranslate"><div class="highlight"><pre>#include &lt;cufftXt.h&gt;
#include "callback_fatbin.h"

int main() {
    cufftResult status;
    cufftHandle fft_plan;
    ...
    status = cufftCreate(&fft_plan);

    // NOTE: LTO callbacks must be set before plan initialization (cufftMakePlan*)
    // and cannot be unset (yet)
    size_t lto_callback_fatbin_size = sizeof(my_lto_callback_fatbin);
    status = cufftXtSetJITCallback(fft_plan,
                                   "myOwnLTOCallback",
                                   (void*)my_lto_callback_fatbin,
                                   lto_callback_fatbin_size,
                                   CUFFT_CB_LD_REAL,
                                   (void **)&device_params);

    // The real-valued load callback matches the real input of an R2C transform
    status = cufftMakePlan1d(fft_plan, signal_size, CUFFT_R2C, batches, &work_size);
    ...
}
</pre></div> </div> </section> <section id="lto-callback-routine-function-details"> <h4>2.9.2.2. LTO Callback Routine Function Details<a class="headerlink" href="#lto-callback-routine-function-details" title="Permalink to this headline"></a></h4> Below are the function prototypes for the user-supplied LTO callback routines that cuFFT calls to load data prior to the transform. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef cufftComplex (*cufftJITCallbackLoadC)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleComplex (*cufftJITCallbackLoadZ)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
typedef cufftReal (*cufftJITCallbackLoadR)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleReal (*cufftJITCallbackLoadD)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
</pre></div> </div> Parameters for all of the LTO load callbacks are defined as below: <ul class="simple"> <li><code class="docutils literal notranslate">offset</code>: offset of the input element from the start of input data. 
This is not a byte offset; rather, it is the number of elements from the start of the data.</li> <li><code class="docutils literal notranslate">dataIn</code>: device pointer to the start of the input array that was passed in the <code class="docutils literal notranslate">cufftExecute</code> call.</li> <li><code class="docutils literal notranslate">callerInfo</code>: device pointer to the optional caller specified data passed in the <code class="docutils literal notranslate">cufftXtSetJITCallback</code> call.</li> <li><code class="docutils literal notranslate">sharedPointer</code>: pointer to shared memory, valid only if the user has called <code class="docutils literal notranslate">cufftXtSetCallbackSharedSize()</code>.</li> </ul> Below are the function prototypes and typedefs for pointers to the user-supplied LTO callback routines that cuFFT calls to store data after completion of the transform. Note that the store callback functions do not return a value. This is because a store callback function is responsible not only for transforming the data as desired, but also for writing the data to the desired location. This allows the store callback to rearrange the data, for example to shift the zero frequency result to the center of the output. 
<div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef void (*cufftJITCallbackStoreC)(void *dataOut, unsigned long long offset, cufftComplex element, void *callerInfo, void *sharedPointer);
typedef void (*cufftJITCallbackStoreZ)(void *dataOut, unsigned long long offset, cufftDoubleComplex element, void *callerInfo, void *sharedPointer);
typedef void (*cufftJITCallbackStoreR)(void *dataOut, unsigned long long offset, cufftReal element, void *callerInfo, void *sharedPointer);
typedef void (*cufftJITCallbackStoreD)(void *dataOut, unsigned long long offset, cufftDoubleReal element, void *callerInfo, void *sharedPointer);
</pre></div> </div> Parameters for all of the LTO store callbacks are defined as below: <ul class="simple"> <li><code class="docutils literal notranslate">offset</code>: offset of the output element from the start of output data. This is not a byte offset; rather, it is the number of elements from the start of the data.</li> <li><code class="docutils literal notranslate">dataOut</code>: device pointer to the start of the output array that was passed in the <code class="docutils literal notranslate">cufftExecute</code> call.</li> <li><code class="docutils literal notranslate">element</code>: the real or complex result computed by cuFFT for the element specified by the offset argument.</li> <li><code class="docutils literal notranslate">callerInfo</code>: device pointer to the optional caller specified data passed in the <code class="docutils literal notranslate">cufftXtSetJITCallback</code> call.</li> <li><code class="docutils literal notranslate">sharedPointer</code>: pointer to shared memory, valid only if the user has called <code class="docutils literal notranslate">cufftXtSetCallbackSharedSize()</code>.</li> </ul> </section> </section> <section id="legacy-load-and-store-callback-routines"> <h3>2.9.3. 
Legacy Load and Store Callback Routines<a class="headerlink" href="#legacy-load-and-store-callback-routines" title="Permalink to this headline"></a></h3> <section id="specifying-legacy-load-and-store-callback-routines"> <h4>2.9.3.1. Specifying Legacy Load and Store Callback Routines<a class="headerlink" href="#specifying-legacy-load-and-store-callback-routines" title="Permalink to this headline"></a></h4> In order to associate a legacy callback routine with a plan, it is necessary to obtain a device pointer to the callback routine. As an example, if the user wants to specify a load callback for an R2C transform, they would write the device code for the callback function, and define a global device variable that contains a pointer to the function: <div class="highlight-c++ notranslate"><div class="highlight"><pre>__device__ cufftReal myOwnCallback(void *dataIn,
                                   size_t offset,
                                   void *callerInfo,
                                   void *sharedPtr) {
    // Placeholder default: pass the input element through unchanged
    cufftReal ret = static_cast&lt;cufftReal *&gt;(dataIn)[offset];
    // use offset, dataIn, and optionally callerInfo to
    // compute the return value
    return ret;
}
__device__ cufftCallbackLoadR myOwnCallbackPtr = myOwnCallback;
</pre></div> </div> From the host side, the user then has to get the address of the legacy callback routine, which is stored in <code class="docutils literal notranslate">myOwnCallbackPtr</code>. This is done with <code class="docutils literal notranslate">cudaMemcpyFromSymbol</code>, as follows: <div class="highlight-c++ notranslate"><div class="highlight"><pre>cufftCallbackLoadR hostCopyOfCallbackPtr;

cudaMemcpyFromSymbol(&hostCopyOfCallbackPtr,
                     myOwnCallbackPtr,
                     sizeof(hostCopyOfCallbackPtr));
</pre></div> </div> <code class="docutils literal notranslate">hostCopyOfCallbackPtr</code> then contains the device address of the callback routine, which should be passed to <code class="docutils literal notranslate">cufftXtSetCallback</code>. 
Note that, for multi-GPU transforms, <code class="docutils literal notranslate">hostCopyOfCallbackPtr</code> will need to be an array of pointers, and the <code class="docutils literal notranslate">cudaMemcpyFromSymbol</code> will have to be invoked for each GPU. Please note that <code class="docutils literal notranslate">__managed__</code> variables are not suitable to pass to <code class="docutils literal notranslate">cufftXtSetCallback</code> due to restrictions on variable usage (See the NVIDIA CUDA Programming Guide for more information about <code class="docutils literal notranslate">__managed__</code> variables). </section> <section id="legacy-callback-routine-function-details"> <h4>2.9.3.2. Legacy Callback Routine Function Details<a class="headerlink" href="#legacy-callback-routine-function-details" title="Permalink to this headline"></a></h4> Below are the function prototypes and typedefs for pointers to the user-supplied legacy callback routines that cuFFT calls to load data prior to the transform. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef cufftComplex (*cufftCallbackLoadC)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleComplex (*cufftCallbackLoadZ)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
typedef cufftReal (*cufftCallbackLoadR)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleReal (*cufftCallbackLoadD)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
</pre></div> </div> Parameters for all of the legacy load callbacks are defined as below: <ul class="simple"> <li><code class="docutils literal notranslate">offset</code>: offset of the input element from the start of input data. 
This is not a byte offset; rather, it is the number of elements from the start of the data.</li> <li><code class="docutils literal notranslate">dataIn</code>: device pointer to the start of the input array that was passed in the <code class="docutils literal notranslate">cufftExecute</code> call.</li> <li><code class="docutils literal notranslate">callerInfo</code>: device pointer to the optional caller specified data passed in the <code class="docutils literal notranslate">cufftXtSetCallback</code> call.</li> <li><code class="docutils literal notranslate">sharedPointer</code>: pointer to shared memory, valid only if the user has called <code class="docutils literal notranslate">cufftXtSetCallbackSharedSize()</code>.</li> </ul> Below are the function prototypes and typedefs for pointers to the user-supplied legacy callback routines that cuFFT calls to store data after completion of the transform. Note that the store callback functions do not return a value. This is because a store callback function is responsible not only for transforming the data as desired, but also for writing the data to the desired location. This allows the store callback to rearrange the data, for example to shift the zero frequency result to the center of the output. 
<div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef void (*cufftCallbackStoreC)(void *dataOut, size_t offset, cufftComplex element, void *callerInfo, void *sharedPointer); typedef void (*cufftCallbackStoreZ)(void *dataOut, size_t offset, cufftDoubleComplex element, void *callerInfo, void *sharedPointer); typedef void (*cufftCallbackStoreR)(void *dataOut, size_t offset, cufftReal element, void *callerInfo, void *sharedPointer); typedef void (*cufftCallbackStoreD)(void *dataOut, size_t offset, cufftDoubleReal element, void *callerInfo, void *sharedPointer); </pre></div> </div> Parameters for all of the legacy store callbacks are defined as below: <ul class="simple"> <li><code class="docutils literal notranslate">offset</code>: offset of the output element from the start of output data. This is not a byte offset, rather it is the number of elements from start of data.</li> <li><code class="docutils literal notranslate">dataOut</code>: device pointer to the start of the output array that was passed in the <code class="docutils literal notranslate">cufftExecute</code> call.</li> <li><code class="docutils literal notranslate">element</code>: the real or complex result computed by CUFFT for the element specified by the offset argument.</li> <li><code class="docutils literal notranslate">callerInfo</code>: device pointer to the optional caller specified data passed in the <code class="docutils literal notranslate">cufftXtSetCallback</code> call.</li> <li><code class="docutils literal notranslate">sharedPointer</code>: pointer to shared memory, valid only if the user has called <code class="docutils literal notranslate">cufftXtSetCallbackSharedSize()</code>.</li> </ul> </section> </section> <section id="coding-considerations-for-the-cufft-callback-routine-feature"> <h3>2.9.4. 
Coding Considerations for the cuFFT Callback Routine Feature<a class="headerlink" href="#coding-considerations-for-the-cufft-callback-routine-feature" title="Permalink to this headline"></a></h3> cuFFT supports callbacks for all transform types, regardless of dimension, batch size, or stride between elements. Callbacks are supported for single- and double-precision transforms. cuFFT supports a wide range of parameters and, based on those for a given plan, attempts to optimize performance. The number of kernels launched, and for each of those, the number of blocks launched and the number of threads per block, will vary depending on how cuFFT decomposes the transform. For some configurations, cuFFT will load or store (and process) multiple inputs or outputs per thread. For some configurations, threads may load or store inputs or outputs in any order, and cuFFT does not guarantee that the inputs or outputs handled by a given thread will be contiguous. These characteristics may vary with transform size, transform type (e.g. C2C vs C2R), number of dimensions, and GPU architecture. These variations may also change from one library version to the next. When more than one kernel is used to implement a transform, the thread and block structure of the first kernel (the one that does the load) is often different from the thread and block structure of the last kernel (the one that does the store). One common use of callbacks is to reduce the amount of data read or written to memory, either by selective filtering or via type conversions. When more than one kernel is used to implement a transform, cuFFT alternates using the workspace and the output buffer to write intermediate results. This means that the output buffer must always be large enough to accommodate the entire transform. 
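The type-conversion use case mentioned above can be illustrated with a host-side sketch. It mirrors the parameter list of the <code class="docutils literal notranslate">cufftCallbackLoadC</code> typedef but is plain C++ so that it builds without the CUDA headers; a real legacy callback would be a <code class="docutils literal notranslate">__device__</code> function returning <code class="docutils literal notranslate">cufftComplex</code>. The names <code class="docutils literal notranslate">Complex</code>, <code class="docutils literal notranslate">ScaleInfo</code>, and <code class="docutils literal notranslate">loadInt16AsComplex</code> are hypothetical and exist only for this illustration, as is the assumption that the input is stored as interleaved 16-bit integers.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for cufftComplex so the sketch compiles without cufft.h
// (assumption: two packed floats, real part first).
struct Complex {
    float x;  // real part
    float y;  // imaginary part
};

// Hypothetical caller data: a scale factor applied during conversion.
struct ScaleInfo {
    float scale;
};

// Same parameter list as the cufftCallbackLoadC typedef. In a real legacy
// callback this function would be declared __device__.
Complex loadInt16AsComplex(void *dataIn, std::size_t offset,
                           void *callerInfo, void * /*sharedPointer*/) {
    // offset counts complex elements, not bytes: element k occupies the
    // interleaved int16 pair at positions 2k (real) and 2k + 1 (imaginary).
    const std::int16_t *raw = static_cast<const std::int16_t *>(dataIn);
    const ScaleInfo *info = static_cast<const ScaleInfo *>(callerInfo);
    Complex value;
    value.x = static_cast<float>(raw[2 * offset]) * info->scale;
    value.y = static_cast<float>(raw[2 * offset + 1]) * info->scale;
    return value;
}
```

Storing the input as 16-bit integers halves the memory traffic of the load compared with single-precision complex data, which is the kind of saving the paragraph above refers to.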
For transforms whose dimensions can be factored into powers of 2, 3, 5, or 7, cuFFT guarantees that it will call the load and store callback routines from points in the kernel where it is safe to call the <code class="docutils literal notranslate">__syncthreads</code> function from within the callback routine. The caller is responsible for guaranteeing that the callback routine is at a point where the callback code has converged, to avoid deadlock. For plans whose dimensions are factored into higher primes, results of a callback routine calling <code class="docutils literal notranslate">__syncthreads</code> are not defined. Note that there are no guarantees on the relative order of execution of blocks within a grid. As such, callbacks should not rely on any particular ordering within a kernel. For instance, reordering data (such as an FFT-shift) could rely on the order of execution of the blocks. Results in this case would be undefined. <section id="coding-considerations-for-lto-callback-routines"> <h4>2.9.4.1. Coding Considerations for LTO Callback Routines<a class="headerlink" href="#coding-considerations-for-lto-callback-routines" title="Permalink to this headline"></a></h4> cuFFT will call the LTO load callback routine, for each point in the input, once and only once for real-to-complex (<code class="docutils literal notranslate">R2C</code>, <code class="docutils literal notranslate">D2Z</code>) and complex-to-complex (<code class="docutils literal notranslate">C2C</code>, <code class="docutils literal notranslate">Z2Z</code>) transforms. Unlike with legacy callbacks, LTO load callbacks may be called more than once per element for complex-to-real (<code class="docutils literal notranslate">C2R</code>, <code class="docutils literal notranslate">Z2D</code>) transforms. The input value will not be updated twice (i.e. 
the transformed value will be stored in a register and not in memory, even for in-place transforms), but users should not rely on the number of calls per element in their callback device functions. Similarly to legacy callbacks, LTO store callbacks will be called once and only once for each point in the output. If the transform is being done in-place (i.e. the input and output data are in the same memory location) the store callback for a given element cannot overwrite other elements. It can either overwrite the given element, or write in a completely distinct output buffer. cuFFT does not support LTO callbacks for multi-GPU transforms (yet). </section> <section id="coding-considerations-for-legacy-callback-routines"> <h4>2.9.4.2. Coding Considerations for Legacy Callback Routines<a class="headerlink" href="#coding-considerations-for-legacy-callback-routines" title="Permalink to this headline"></a></h4> cuFFT supports legacy callbacks on any number of GPUs. cuFFT will call the load callback routine, for each point in the input, once and only once. Similarly it will call the store callback routine, for each point in the output, once and only once. If the transform is being done in-place (i.e. the input and output data are in the same memory location) the store callback for a given element cannot overwrite other elements. It can either overwrite the given element, or write in a completely distinct output buffer. For multi-GPU transforms, the index passed to the callback routine is the element index from the start of data on that GPU, not from the start of the entire input or output data array. </section> </section> </section> <section id="thread-safety"> <h2>2.10. Thread Safety<a class="headerlink" href="#thread-safety" title="Permalink to this headline"></a></h2> cuFFT APIs are thread safe as long as different host threads execute FFTs using different plans and the output data are disjoint. </section> <section id="cuda-graphs-support"> <h2>2.11. 
CUDA Graphs Support<a class="headerlink" href="#cuda-graphs-support" title="Permalink to this headline"></a></h2> Using <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs">CUDA Graphs</a> with cuFFT is supported on single GPU plans. It is also supported on multiple GPU plans starting with cuFFT version 10.4.0. The stream associated with a cuFFT plan must meet the requirements stated in <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#creating-a-graph-using-stream-capture">Creating a Graph Using Stream Capture</a>. <div class="admonition note"> Note Starting from CUDA 11.8 (including CUDA 12.0 onward), CUDA Graphs are no longer supported for legacy callback routines that load data in out-of-place mode transforms. Starting from CUDA 12.6 Update 2, LTO callbacks can be used as a replacement for legacy callbacks without this limitation. cuFFT deprecated callback functionality based on separate compiled device code (legacy callbacks) in cuFFT 11.4. </div> </section> <section id="static-library-and-callback-support"> <h2>2.12. Static Library and Callback Support<a class="headerlink" href="#static-library-and-callback-support" title="Permalink to this headline"></a></h2> Starting with release 6.5, the cuFFT libraries are also delivered in a static form as libcufft_static.a and libcufftw_static.a on Linux and Mac. Static libraries are not supported on Windows. The static cufft and cufftw libraries depend on thread abstraction layer library <code class="docutils literal notranslate">libculibos.a</code>. 
For example, on Linux, to compile a small application using cuFFT against the dynamic library, the following command can be used: <div class="highlight-text notranslate"><div class="highlight"><pre>nvcc mCufftApp.c -lcufft -o myCufftApp </pre></div> </div> For cufftw on Linux, to compile a small application against the dynamic library, the following command can be used: <div class="highlight-text notranslate"><div class="highlight"><pre>nvcc mCufftwApp.c -lcufftw -lcufft -o myCufftwApp </pre></div> </div> Compiling against the static cuFFT library, by contrast, requires extra steps. The library needs to be device linked. This may happen during building and linking of a simple program, or as a separate step. The entire process is described in <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda">Using Separate Compilation in CUDA</a>. For cuFFT and cufftw in version 9.0 or later any supported architecture can be used to do the device linking: Static cuFFT compilation command: <div class="highlight-text notranslate"><div class="highlight"><pre>nvcc mCufftApp.c -lcufft_static -lculibos -o myCufftApp </pre></div> </div> Static cufftw compilation command: <div class="highlight-text notranslate"><div class="highlight"><pre>nvcc mCufftwApp.c -lcufftw_static -lcufft_static -lculibos -o myCufftwApp </pre></div> </div> Prior to version 9.0 proper linking required specifying a subset of supported architectures, as shown in the following commands: Static cuFFT compilation command: <div class="highlight-text notranslate"><div class="highlight"><pre>nvcc mCufftApp.c -lcufft_static -lculibos -o myCufftApp \
    -gencode arch=compute_20,\"code=sm_20\" \
    -gencode arch=compute_30,\"code=sm_30\" \
    -gencode arch=compute_35,\"code=sm_35\" \
    -gencode arch=compute_50,\"code=sm_50\" \
    -gencode arch=compute_60,\"code=sm_60\" \
    -gencode arch=compute_60,\"code=compute_60\"
</pre></div> </div> Static cufftw 
compilation command: <div class="highlight-text notranslate"><div class="highlight"><pre>nvcc mCufftwApp.c -lcufftw_static -lcufft_static -lculibos -o myCufftwApp \
    -gencode arch=compute_20,\"code=sm_20\" \
    -gencode arch=compute_30,\"code=sm_30\" \
    -gencode arch=compute_35,\"code=sm_35\" \
    -gencode arch=compute_50,\"code=sm_50\" \
    -gencode arch=compute_60,\"code=sm_60\" \
    -gencode arch=compute_60,\"code=compute_60\"
</pre></div> </div> Please note that the cuFFT library might not contain code for certain architectures as long as there is code for a lower architecture that is binary compatible (e.g. SM52, SM61). This is reflected in the link commands above and is significant when using versions prior to 9.0. To determine if a specific SM is included in the cuFFT library, one may use the <code class="docutils literal notranslate">cuobjdump</code> utility. For example, if you wish to know if SM_50 is included, the command to run is <code class="docutils literal notranslate">cuobjdump -arch sm_50 libcufft_static.a</code>. Some kernels are built only on select architectures (e.g. kernels with half-precision arithmetic are present only for SM53 and above). This can cause warnings at link time that architectures are missing from these kernels. These warnings can be safely ignored. It is also possible to use the native host C++ compiler and perform the device link as a separate step. Please consult the NVCC documentation for more details. Depending on the host operating system, some additional libraries like <code class="docutils literal notranslate">pthread</code> or <code class="docutils literal notranslate">dl</code> might be needed on the linking line. Note that in this case, the library <code class="docutils literal notranslate">cuda</code> is not needed. The CUDA Runtime will try to explicitly open the <code class="docutils literal notranslate">cuda</code> library if needed. 
In the case of a system which does not have the CUDA driver installed, this allows the application to gracefully manage this issue and potentially run if a CPU-only path is available. The cuFFT static library supports user-supplied legacy callback routines. The legacy callback routines are CUDA device code, and must be separately compiled with NVCC and linked with the cuFFT library. Please refer to the NVCC documentation regarding separate compilation for details. If you specify an SM when compiling your callback functions, you must specify one of the SMs that cuFFT includes. <section id="static-library-without-legacy-callback-support"> <h3>2.12.1. Static library without legacy callback support<a class="headerlink" href="#static-library-without-legacy-callback-support" title="Permalink to this headline"></a></h3> Starting with cuFFT version 9.2, a new variant of the cuFFT static library, <code class="docutils literal notranslate">libcufft_static_nocallback.a</code>, was added. This new version does not contain legacy callback functionality and can be linked using the host compiler only. </section> </section> <section id="accuracy-and-performance"> <h2>2.13. Accuracy and Performance<a class="headerlink" href="#accuracy-and-performance" title="Permalink to this headline"></a></h2> A DFT can be implemented as a matrix vector multiplication that requires \(O(N^{2})\) operations. However, the cuFFT Library employs the <a class="reference external" href="http://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm">Cooley-Tukey algorithm</a> to reduce the number of required operations and to optimize the performance of particular transform sizes. This algorithm expresses the DFT matrix as a product of sparse building block matrices. The cuFFT Library implements the following building blocks: radix-2, radix-3, radix-5, and radix-7. 
Hence the performance of any transform size that can be factored as \(2^{a} \times 3^{b} \times 5^{c} \times 7^{d}\) (where a, b, c, and d are non-negative integers) is optimized in the cuFFT library. There are also radix-m building blocks for other primes, m, whose value is less than 128. When the length cannot be decomposed as a product of powers of primes from 2 to 127, <a class="reference external" href="http://en.wikipedia.org/wiki/Bluestein's_FFT_algorithm">Bluestein’s algorithm</a> is used. Since the Bluestein implementation requires more computations per output point than the Cooley-Tukey implementation, the accuracy of the Cooley-Tukey algorithm is better. The pure Cooley-Tukey implementation has excellent accuracy, with the relative error growing proportionally to \(\log_{2}(N)\), where \(N\) is the transform size in points. For sizes handled by the Cooley-Tukey code path, the most efficient implementation is obtained by applying the following constraints (listed in order from the most generic to the most specialized constraint, with each subsequent constraint providing the potential of an additional performance improvement). Half precision transforms might not be suitable for all kinds of problems due to the limited range represented by half precision floating point arithmetic. Please note that the first element of the FFT result is the sum of all input elements and it is likely to overflow for certain inputs. Results produced by the cuFFT library are deterministic (i.e., bitwise reproducible) as long as the following are kept constant between runs: plan input parameters, cuFFT version, and GPU model. cuFFT batched plans require that input data includes valid signal for all batches. Performance optimizations in batched mode can combine signal from different batches for processing. Optimizations used in cuFFT can vary from version to version. 
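When choosing a transform size, it can help to test whether a candidate length factors as \(2^{a} \times 3^{b} \times 5^{c} \times 7^{d}\) and, if not, to pad up to the next length that does. The following is a minimal sketch in plain C++; the function names <code class="docutils literal notranslate">hasOnlySmallRadices</code> and <code class="docutils literal notranslate">nextSmallRadixSize</code> are hypothetical helpers and not part of the cuFFT API.

```cpp
#include <cstdint>

// Returns true if n can be written as 2^a * 3^b * 5^c * 7^d with
// non-negative integer exponents (n = 1 corresponds to a = b = c = d = 0).
bool hasOnlySmallRadices(std::uint64_t n) {
    if (n == 0) return false;
    const std::uint64_t radices[] = {2, 3, 5, 7};
    for (std::uint64_t r : radices) {
        while (n % r == 0) n /= r;  // strip every factor of r
    }
    return n == 1;  // anything left over is a larger prime factor
}

// Smallest size >= n whose only prime factors are 2, 3, 5, and 7.
// A linear search is adequate for a sketch; it assumes n is well below
// 2^63 so that the next power of two is reachable without overflow.
std::uint64_t nextSmallRadixSize(std::uint64_t n) {
    if (n == 0) n = 1;
    while (!hasOnlySmallRadices(n)) ++n;
    return n;
}
```

For example, 1001 factors as 7 × 11 × 13 and therefore does not qualify, while padding to 1008 (which is \(2^{4} \times 3^{2} \times 7\)) does. Whether padding actually pays off depends on the application, since the extra samples change the transform being computed.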
<table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 14%" /> <col style="width: 38%" /> <col style="width: 49%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">Applies to</th> <th class="head">Recommendation</th> <th class="head">Comment</th> </tr> </thead> <tbody> <tr class="row-even"><td>All</td> <td>Use single precision transforms.</td> <td>Single precision transforms require less bandwidth per computation than double precision transforms.</td> </tr> <tr class="row-odd"><td>All</td> <td>Restrict the size along all dimensions to be representable as \(2^{a} \times 3^{b} \times 5^{c} \times 7^{d}\).</td> <td>The cuFFT library has highly optimized kernels for transforms whose dimensions have these prime factors. In general the best performance occurs when using powers of 2, followed by powers of 3, then 5, 7.</td> </tr> <tr class="row-even"><td>All</td> <td>Restrict the size along each dimension to use fewer distinct prime factors.</td> <td>A transform of size \(2^{n}\) or \(3^{n}\) will usually be faster than one of size \(2^{i} \times 3^{j}\) even if the latter is slightly smaller, due to the composition of specialized paths.</td> </tr> <tr class="row-odd"><td>All</td> <td>Restrict the data to be contiguous in memory when performing a single transform. 
When performing multiple transforms make the individual datasets contiguous</td> <td>The cuFFT library has been optimized for this data layout.</td> </tr> <tr class="row-even"><td>All</td> <td>Perform multiple (i.e., batched) transforms.</td> <td>Additional optimizations are performed in batched mode.</td> </tr> <tr class="row-odd"><td>real-to-complex transforms or complex-to-real transforms</td> <td>Ensure problem size of x dimension is a multiple of 4.</td> <td>This scheme uses more efficient kernels to implement conjugate symmetry property.</td> </tr> <tr class="row-even"><td>real-to-complex transforms or complex-to-real transforms</td> <td>Use <code class="docutils literal notranslate">out-of-place</code> mode.</td> <td>This scheme uses more efficient kernels than <code class="docutils literal notranslate">in-place</code> mode.</td> </tr> <tr class="row-odd"><td>Multiple GPU transforms</td> <td>Use PCI Express 3.0 between GPUs and ensure the GPUs are on the same switch.</td> <td>The faster the interconnect between the GPUs, the faster the performance.</td> </tr> </tbody> </table> </section> <section id="caller-allocated-work-area-support"> <h2>2.14. Caller Allocated Work Area Support<a class="headerlink" href="#caller-allocated-work-area-support" title="Permalink to this headline"></a></h2> cuFFT plans may use additional memory to store intermediate results. 
The cuFFT library offers several functions to manage this temporary memory utilization behavior: <ul class="simple"> <li><code class="docutils literal notranslate">cufftSetAutoAllocation</code></li> <li><code class="docutils literal notranslate">cufftEstimate1d</code>, <code class="docutils literal notranslate">cufftEstimate2d</code>, <code class="docutils literal notranslate">cufftEstimate3d</code> and <code class="docutils literal notranslate">cufftEstimateMany</code></li> <li><code class="docutils literal notranslate">cufftGetSize</code></li> <li><code class="docutils literal notranslate">cufftXtSetWorkAreaPolicy</code></li> </ul> The first two functions manage allocation and ownership of temporary memory. By default cuFFT always allocates its own work area in GPU memory. Each cuFFT handle allocates data separately. If multiple cuFFT plans are to be launched sequentially it is possible to assign the same memory chunk as work area to all those plans and reduce memory overhead. The memory assigned as work area needs to be GPU visible. In addition to the regular memory acquired with <code class="docutils literal notranslate">cudaMalloc</code>, usage of CUDA Unified Virtual Addressing enables cuFFT to use the following types of memory as work area memory: pinned host memory, managed memory, memory on GPU other than the one performing the calculations. While this provides flexibility, it comes with a performance penalty whose magnitude depends on the available memory bandwidth. The <code class="docutils literal notranslate">cufftEstimateNd</code>, <code class="docutils literal notranslate">cufftEstimateMany</code>, and <code class="docutils literal notranslate">cufftGetSize</code> functions provide information about the required memory size for cases where the user is allocating the work space buffer. <div class="line-block"> <div class="line">In version 9.2 cuFFT also introduced the <code class="docutils literal notranslate">cufftXtSetWorkAreaPolicy</code> function. 
This function allows fine tuning of work area memory usage.</div> <div class="line">cuFFT version 9.2 supports only the <code class="docutils literal notranslate">CUFFT_WORKAREA_MINIMAL</code> policy, which instructs cuFFT to re-plan the existing plan without the need to use work area memory.</div> </div> Also as of cuFFT 9.2, supported FFT transforms that allow for <code class="docutils literal notranslate">CUFFT_WORKAREA_MINIMAL</code> policy are as follows: <ul class="simple"> <li>Transforms of type <code class="docutils literal notranslate">C2C</code> are supported with sizes up to 4096 in any dimension.</li> <li>Transforms of type <code class="docutils literal notranslate">Z2Z</code> are supported with sizes up to 2048 in any dimension.</li> <li>Only single GPU transforms are supported.</li> </ul> Depending on the FFT transform size, a different FFT algorithm may be used when the <code class="docutils literal notranslate">CUFFT_WORKAREA_MINIMAL</code> policy is set. </section> <section id="cufft-link-time-optimized-kernels"> <h2>2.15. cuFFT Link-Time Optimized Kernels<a class="headerlink" href="#cufft-link-time-optimized-kernels" title="Permalink to this headline"></a></h2> Starting from CUDA 12.4, cuFFT ships Link-Time Optimized (LTO) kernels. These kernels are linked and finalized at runtime as part of the cuFFT planning routines. This enables the cuFFT library to generate kernels optimized for the underlying architecture and the specific problem to solve. The current LTO kernel coverage includes: <ul class="simple"> <li>Kernels for 64-bit addressing (with FFTs spanning addresses greater than \(2^{32}-1\) elements).</li> <li>Some single- and double-precision R2C and C2R sizes.</li> </ul> The number and coverage of LTO kernels will grow with future releases of cuFFT. We encourage our users to test whether LTO kernels improve the performance for their use case. 
Users can opt in to LTO kernels by setting the <code class="docutils literal notranslate">NVFFT_PLAN_PROPERTY_INT64_PATIENT_JIT</code> plan property using the <code class="docutils literal notranslate">cufftSetPlanProperty</code> routine. In order to finalize LTO kernels, cuFFT relies on the nvJitLink library that ships as part of the CUDA Toolkit. Finalizing the kernels at runtime can cause an increase in planning time (which could be on the order of hundreds of milliseconds, depending on the cuFFT plan and hardware characteristics of the host system), in exchange for faster execution time of the optimized kernels. Note that nvJitLink caches kernels linked at runtime to speed up subsequent kernel finalizations in repeated planning routines. If for any reason the runtime linking of the kernel fails, cuFFT will fall back to offline-compiled kernels to compute the FFT. <div class="admonition note"> Note cuFFT LTO kernels for a given toolkit version require using the nvJitLink library from the same toolkit or greater, but within the same toolkit major. For example, cuFFT in 12.4 requires nvJitLink to be from a CUDA Toolkit 12.X, with <code class="docutils literal notranslate">X >= 4</code>. The nvJitLink library is loaded dynamically, and should be present in the system’s dynamic linking path (e.g. <code class="docutils literal notranslate">LD_LIBRARY_PATH</code> on Unix systems, or <code class="docutils literal notranslate">PATH</code> on Windows systems). </div> </section> </section> <section id="cufft-api-reference"> <h1>3. cuFFT API Reference<a class="headerlink" href="#cufft-api-reference" title="Permalink to this headline"></a></h1> This chapter specifies the behavior of the cuFFT library functions by describing their input/output parameters, data types, and error codes. The cuFFT library is initialized upon the first invocation of an API function, and cuFFT shuts down automatically when all user-created FFT plans are destroyed. 
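Because every function in this chapter reports failure through its return code, applications commonly wrap calls in a small checking helper. The sketch below shows one possible shape for such a helper in plain C++. To stay buildable without the CUDA headers it takes the code as an <code class="docutils literal notranslate">int</code> and covers only a subset of the values listed in Section 3.1; the function names <code class="docutils literal notranslate">resultName</code> and <code class="docutils literal notranslate">checkResult</code> are hypothetical and not part of cuFFT. In real code one would include <code class="docutils literal notranslate">cufft.h</code> and switch on <code class="docutils literal notranslate">cufftResult</code> directly.

```cpp
#include <stdexcept>
#include <string>

// Maps a numeric return code to its name. Only a subset of the codes from
// Section 3.1 is covered; the numeric values are those given in the enum.
const char *resultName(int code) {
    switch (code) {
        case 0:  return "CUFFT_SUCCESS";
        case 1:  return "CUFFT_INVALID_PLAN";
        case 2:  return "CUFFT_ALLOC_FAILED";
        case 4:  return "CUFFT_INVALID_VALUE";
        case 5:  return "CUFFT_INTERNAL_ERROR";
        case 6:  return "CUFFT_EXEC_FAILED";
        case 7:  return "CUFFT_SETUP_FAILED";
        case 8:  return "CUFFT_INVALID_SIZE";
        default: return "unknown cufftResult value";
    }
}

// Throws if the code is anything other than CUFFT_SUCCESS (0).
// 'what' names the call being checked, e.g. "cufftPlan1d".
void checkResult(int code, const char *what) {
    if (code != 0) {
        throw std::runtime_error(std::string(what) + " failed: " +
                                 resultName(code));
    }
}
```

A call site would then look like <code class="docutils literal notranslate">checkResult(cufftPlan1d(&amp;plan, nx, CUFFT_C2C, batch), "cufftPlan1d")</code>, relying on the implicit conversion of the enum value to <code class="docutils literal notranslate">int</code>.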
<section id="return-value-cufftresult"> <h2>3.1. Return value cufftResult<a class="headerlink" href="#return-value-cufftresult" title="Permalink to this headline"></a></h2> All cuFFT Library return values except for <code class="docutils literal notranslate">CUFFT_SUCCESS</code> indicate that the current API call failed and the user should reconfigure to correct the problem. The possible return values are defined as follows: <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef enum cufftResult_t {
    CUFFT_SUCCESS                   = 0,   // The cuFFT operation was successful
    CUFFT_INVALID_PLAN              = 1,   // cuFFT was passed an invalid plan handle
    CUFFT_ALLOC_FAILED              = 2,   // cuFFT failed to allocate GPU or CPU memory
    CUFFT_INVALID_TYPE              = 3,   // No longer used
    CUFFT_INVALID_VALUE             = 4,   // User specified an invalid pointer or parameter
    CUFFT_INTERNAL_ERROR            = 5,   // Driver or internal cuFFT library error
    CUFFT_EXEC_FAILED               = 6,   // Failed to execute an FFT on the GPU
    CUFFT_SETUP_FAILED              = 7,   // The cuFFT library failed to initialize
    CUFFT_INVALID_SIZE              = 8,   // User specified an invalid transform size
    CUFFT_UNALIGNED_DATA            = 9,   // No longer used
    CUFFT_INCOMPLETE_PARAMETER_LIST = 10,  // Missing parameters in call
    CUFFT_INVALID_DEVICE            = 11,  // Execution of a plan was on different GPU than plan creation
    CUFFT_PARSE_ERROR               = 12,  // Internal plan database error
    CUFFT_NO_WORKSPACE              = 13,  // No workspace has been provided prior to plan execution
    CUFFT_NOT_IMPLEMENTED           = 14,  // Function does not implement functionality for parameters given.
    CUFFT_LICENSE_ERROR             = 15,  // Used in previous versions.
    CUFFT_NOT_SUPPORTED             = 16   // Operation is not supported for parameters given.
} cufftResult;
</pre></div> </div> Users are encouraged to check return values from cuFFT functions for errors as shown in <a class="reference external" href="index.html#cufft-code-examples">cuFFT Code Examples</a>. </section> <section id="cufft-basic-plans"> <h2>3.2. 
cuFFT Basic Plans<a class="headerlink" href="#cufft-basic-plans" title="Permalink to this headline"></a></h2> These API routines take care of initializing the cufftHandle. Any already-initialized handle attributes passed to the planning functions will be ignored. <section id="cufftplan1d"> <h3>3.2.1. cufftPlan1d()<a class="headerlink" href="#cufftplan1d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftPlan1d"> cufftResult cufftPlan1d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> *plan, int nx, cufftType type, int batch);<a class="headerlink" href="#c.cufftPlan1d" title="Permalink to this definition"></a> </dt> <dd>Creates a 1D FFT plan configuration for a specified signal size and data type. The <code class="docutils literal notranslate">batch</code> input parameter tells cuFFT how many 1D transforms to configure. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – Pointer to an uninitialized <code class="docutils literal notranslate">cufftHandle</code> object.</li> <li>nx[In] – The transform size (e.g. 256 for a 256-point FFT).</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2C</code> for single precision complex to complex).</li> <li>batch[In] – Number of transforms of size <code class="docutils literal notranslate">nx</code>. 
Please consider using <code class="docutils literal notranslate">cufftPlanMany</code> for multiple transforms.</li> <li>plan[Out] – Contains a cuFFT 1D plan handle value.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. Handle is not valid when the plan is locked.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – The <code class="docutils literal notranslate">nx</code> or <code class="docutils literal notranslate">batch</code> parameter is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftplan2d"> <h3>3.2.2. cufftPlan2d()<a class="headerlink" href="#cufftplan2d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftPlan2d"> cufftResult cufftPlan2d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> *plan, int nx, int ny, cufftType type);<a class="headerlink" href="#c.cufftPlan2d" title="Permalink to this definition"></a> </dt> <dd>Creates a 2D FFT plan configuration according to specified signal sizes and data type. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. 
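Since <code class="docutils literal notranslate">nx</code> is the slowest-changing dimension and <code class="docutils literal notranslate">ny</code> the fastest-changing one, the default contiguous layout is row-major. The sketch below shows the resulting element index for a complex-to-complex transform; <code class="docutils literal notranslate">index2d</code> is a hypothetical helper, not a cuFFT function, and the formula does not cover the padded layouts used by real-to-complex and complex-to-real transforms.

```cpp
#include <cstddef>

// Linear index of element (x, y) in a contiguous nx-by-ny array in which
// y is the fastest-changing dimension (row-major order). The x extent nx
// is not needed to compute the index.
std::size_t index2d(std::size_t x, std::size_t y, std::size_t ny) {
    return x * ny + y;
}
```

With <code class="docutils literal notranslate">ny = 4</code>, for instance, element (2, 3) is found at linear index 11.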
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – Pointer to an uninitialized <code class="docutils literal notranslate">cufftHandle</code> object.</li> <li>nx[In] – The transform size in the x dimension. This is slowest changing dimension of a transform (strided in memory).</li> <li>ny[In] – The transform size in the y dimension. This is fastest changing dimension of a transform (contiguous in memory).</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2R</code> for single precision complex to real).</li> <li>plan[Out] – Contains a cuFFT 2D plan handle value.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. Handle is not valid when the plan is locked.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – Either or both of the <code class="docutils literal notranslate">nx</code> or <code class="docutils literal notranslate">ny</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftplan3d"> <h3>3.2.3. 
cufftPlan3d()<a class="headerlink" href="#cufftplan3d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftPlan3d"> cufftResult cufftPlan3d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> *plan, int nx, int ny, int nz, cufftType type);<a class="headerlink" href="#c.cufftPlan3d" title="Permalink to this definition"></a> </dt> <dd>Creates a 3D FFT plan configuration according to specified signal sizes and data type. This function is the same as <code class="docutils literal notranslate">cufftPlan2d()</code> except that it takes a third size parameter <code class="docutils literal notranslate">nz</code>. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – Pointer to an uninitialized <code class="docutils literal notranslate">cufftHandle</code> object.</li> <li>nx[In] – The transform size in the x dimension. This is slowest changing dimension of a transform (strided in memory).</li> <li>ny[In] – The transform size in the y dimension.</li> <li>nz[In] – The transform size in the z dimension. 
This is fastest changing dimension of a transform (contiguous in memory).</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>plan[Out] – Contains a cuFFT 3D plan handle value.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. Handle is not valid when the plan is locked.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the <code class="docutils literal notranslate">nx</code>, <code class="docutils literal notranslate">ny</code>, or <code class="docutils literal notranslate">nz</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftplanmany"> <h3>3.2.4. cufftPlanMany()<a class="headerlink" href="#cufftplanmany" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftPlanMany"> cufftResult cufftPlanMany(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> *plan, int rank, int *n, int *inembed, int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch);<a class="headerlink" href="#c.cufftPlanMany" title="Permalink to this definition"></a> </dt> <dd>Creates an FFT plan configuration of dimension <code class="docutils literal notranslate">rank</code>, with sizes specified in the array <code class="docutils literal notranslate">n</code>. 
The <code class="docutils literal notranslate">batch</code> input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created. The <code class="docutils literal notranslate">cufftPlanMany()</code> API supports more complicated input and output data layouts via the advanced data layout parameters: <code class="docutils literal notranslate">inembed</code>, <code class="docutils literal notranslate">istride</code>, <code class="docutils literal notranslate">idist</code>, <code class="docutils literal notranslate">onembed</code>, <code class="docutils literal notranslate">ostride</code>, and <code class="docutils literal notranslate">odist</code>. If <code class="docutils literal notranslate">inembed</code> and <code class="docutils literal notranslate">onembed</code> are set to <code class="docutils literal notranslate">NULL</code>, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays. All arrays are assumed to be in CPU memory. Note that when <code class="docutils literal notranslate">inembed</code> and <code class="docutils literal notranslate">onembed</code> are <code class="docutils literal notranslate">NULL</code>, <code class="docutils literal notranslate">cufftPlanMany</code> behaves differently from the corresponding FFTW library function, <code class="docutils literal notranslate">fftw_plan_many_dft</code>. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. 
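<p>To make the advanced data layout parameters more concrete, the C sketch below computes where element <code class="docutils literal notranslate">idx</code> of batch <code class="docutils literal notranslate">b</code> would sit in memory. It assumes the addressing scheme given in the advanced data layout section of the cuFFT documentation, in which the embedding sizes of all but the outermost dimension scale the index, the stride multiplies the result, and the batch distance offsets each signal. The function name <code class="docutils literal notranslate">advanced_layout_offset</code> is hypothetical and not part of the cuFFT API.</p>

```c
#include <assert.h>

/* Hypothetical helper (not a cuFFT function): element offset under the
 * advanced data layout, assuming the addressing scheme
 *   rank 1: b*dist + x*stride
 *   rank 2: b*dist + (x*nembed[1] + y)*stride
 *   rank 3: b*dist + ((x*nembed[1] + y)*nembed[2] + z)*stride
 * idx[0] is the outermost index, idx[rank-1] the innermost. */
long long advanced_layout_offset(int rank, const int *idx, const int *nembed,
                                 int stride, int dist, int b)
{
    assert(rank >= 1 && rank <= 3);

    long long linear = 0;
    for (int d = 0; d < rank; ++d) {
        if (d > 0) {
            /* nembed[0] never scales the index; only inner sizes do. */
            linear *= nembed[d];
        }
        linear += idx[d];
    }
    return (long long)b * dist + linear * stride;
}
```

<p>The same function serves for the input side (<code class="docutils literal notranslate">inembed</code>, <code class="docutils literal notranslate">istride</code>, <code class="docutils literal notranslate">idist</code>) and the output side (<code class="docutils literal notranslate">onembed</code>, <code class="docutils literal notranslate">ostride</code>, <code class="docutils literal notranslate">odist</code>). Offsets are counted in elements of the transform's data type, not in bytes.</p>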
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – Pointer to an uninitialized <code class="docutils literal notranslate">cufftHandle</code> object.</li> <li>rank[In] – Dimensionality of the transform (1, 2, or 3).</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension, <code class="docutils literal notranslate">n[0]</code> being the size of the outermost and <code class="docutils literal notranslate">n[rank-1]</code> innermost (contiguous) dimension of a transform.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory. 
If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>batch[In] – Batch size for this transform.</li> <li>plan[Out] – Contains a cuFFT plan handle.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. Handle is not valid when the plan is locked.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufft-extensible-plans"> <h2>3.3. cuFFT Extensible Plans<a class="headerlink" href="#cufft-extensible-plans" title="Permalink to this headline"></a></h2> These API routines separate handle creation from plan generation. This makes it possible to change plan settings, which may alter the outcome of the plan generation phase, before the plan is actually generated. <section id="cufftcreate"> <h3>3.3.1. 
cufftCreate()<a class="headerlink" href="#cufftcreate" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftCreate"> cufftResult cufftCreate(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> *plan)<a class="headerlink" href="#c.cufftCreate" title="Permalink to this definition"></a> </dt> <dd>Creates only an opaque handle, and allocates small data structures on the host. The <code class="docutils literal notranslate">cufftMakePlan*()</code> calls actually do the plan generation. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – Pointer to a <code class="docutils literal notranslate">cufftHandle</code> object.</li> <li>plan[Out] – Contains a cuFFT plan handle value.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_ALLOC_FAILED – The allocation of resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftdestroy"> <h3>3.3.2. cufftDestroy()<a class="headerlink" href="#cufftdestroy" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftDestroy"> cufftResult cufftDestroy(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan)<a class="headerlink" href="#c.cufftDestroy" title="Permalink to this definition"></a> </dt> <dd>Frees all GPU resources associated with a cuFFT plan and destroys the internal plan data structure. This function should be called once a plan is no longer needed, to avoid wasting GPU memory. 
In the case of multi-GPU plans, the plan created first should be destroyed last. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – The <code class="docutils literal notranslate">cufftHandle</code> object of the plan to be destroyed.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully destroyed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftmakeplan1d"> <h3>3.3.3. cufftMakePlan1d()<a class="headerlink" href="#cufftmakeplan1d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftMakePlan1d"> cufftResult cufftMakePlan1d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int nx, cufftType type, int batch, size_t *workSize);<a class="headerlink" href="#c.cufftMakePlan1d" title="Permalink to this definition"></a> </dt> <dd>Following a call to <code class="docutils literal notranslate">cufftCreate()</code> makes a 1D FFT plan configuration for a specified signal size and data type. The <code class="docutils literal notranslate">batch</code> input parameter tells cuFFT how many 1D transforms to configure. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. If <code class="docutils literal notranslate">cufftXtSetGPUs()</code> was called prior to this call with multiple GPUs, then <code class="docutils literal notranslate">workSize</code> will contain multiple sizes. 
See sections on multiple GPUs for more details. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>nx[In] – The transform size (e.g. 256 for a 256-point FFT). For multiple GPUs, this must be a power of 2.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2C</code> for single precision complex to complex). For multiple GPUs this must be a complex to complex transform.</li> <li>batch[In] – Number of transforms of size <code class="docutils literal notranslate">nx</code>. Please consider using <code class="docutils literal notranslate">cufftMakePlanMany</code> for multiple transforms.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size(s) of the work areas.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. 
Handle is not valid when the plan is locked or multi-GPU restrictions are not met.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – The <code class="docutils literal notranslate">nx</code> or <code class="docutils literal notranslate">batch</code> parameter is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftmakeplan2d"> <h3>3.3.4. cufftMakePlan2d()<a class="headerlink" href="#cufftmakeplan2d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftMakePlan2d"> cufftResult cufftMakePlan2d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int nx, int ny, cufftType type, size_t *workSize);<a class="headerlink" href="#c.cufftMakePlan2d" title="Permalink to this definition"></a> </dt> <dd>Following a call to <code class="docutils literal notranslate">cufftCreate()</code> makes a 2D FFT plan configuration according to specified signal sizes and data type. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. If <code class="docutils literal notranslate">cufftXtSetGPUs()</code> was called prior to this call with multiple GPUs, then <code class="docutils literal notranslate">workSize</code> will contain multiple sizes. See sections on multiple GPUs for more details. 
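<p>The multi-GPU size restrictions stated in the parameter lists (a power of 2 for 1D transforms, and sizes factorable into primes less than or equal to 127 for 2D and 3D transforms) can be checked on the host before a plan is made. The sketch below is one possible way to do so; the function names <code class="docutils literal notranslate">is_power_of_two</code> and <code class="docutils literal notranslate">factorable_into_primes_le_127</code> are hypothetical and not part of cuFFT, and passing these checks does not guarantee that plan creation succeeds.</p>

```c
#include <assert.h>

/* Hypothetical helper (not a cuFFT function): returns 1 if n is a positive
 * power of 2, 0 otherwise. */
int is_power_of_two(long long n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper (not a cuFFT function): returns 1 if n is positive and
 * every prime factor of n is less than or equal to 127, 0 otherwise. */
int factorable_into_primes_le_127(long long n)
{
    if (n < 1) {
        return 0;
    }
    /* Divide out every factor up to 127. Composite candidates never divide n
     * here because their prime factors were already removed. */
    for (long long p = 2; p <= 127; ++p) {
        while (n % p == 0) {
            n /= p;
        }
    }
    /* Anything left over is a product of primes larger than 127. */
    return n == 1;
}
```

<p>A caller could, for instance, test both <code class="docutils literal notranslate">nx</code> and <code class="docutils literal notranslate">ny</code> with <code class="docutils literal notranslate">factorable_into_primes_le_127</code> before making a multi-GPU 2D plan, and report the problem early rather than waiting for an error code from the plan call.</p>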
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>nx[In] – The transform size in the x dimension. This is slowest changing dimension of a transform (strided in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.</li> <li>ny[In] – The transform size in the y dimension. This is fastest changing dimension of a transform (contiguous in memory). For 2 GPUs, this must be factorable into primes less than or equal to 127.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2R</code> for single precision complex to real).</li> <li>workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size(s) of the work areas.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – Either or both of the <code class="docutils literal notranslate">nx</code> or <code class="docutils literal notranslate">ny</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftmakeplan3d"> <h3>3.3.5. 
cufftMakePlan3d()<a class="headerlink" href="#cufftmakeplan3d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftMakePlan3d"> cufftResult cufftMakePlan3d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int nx, int ny, int nz, cufftType type, size_t *workSize);<a class="headerlink" href="#c.cufftMakePlan3d" title="Permalink to this definition"></a> </dt> <dd>Following a call to <code class="docutils literal notranslate">cufftCreate()</code> makes a 3D FFT plan configuration according to specified signal sizes and data type. This function is the same as <code class="docutils literal notranslate">cufftMakePlan2d()</code> except that it takes a third size parameter <code class="docutils literal notranslate">nz</code>. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. If <code class="docutils literal notranslate">cufftXtSetGPUs()</code> was called prior to this call with multiple GPUs, then <code class="docutils literal notranslate">workSize</code> will contain multiple sizes. See sections on multiple GPUs for more details. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>nx[In] – The transform size in the x dimension. This is slowest changing dimension of a transform (strided in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.</li> <li>ny[In] – The transform size in the y dimension. 
For multiple GPUs, this must be factorable into primes less than or equal to 127.</li> <li>nz[In] – The transform size in the z dimension. This is fastest changing dimension of a transform (contiguous in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size(s) of the work area(s).</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the <code class="docutils literal notranslate">nx</code>, <code class="docutils literal notranslate">ny</code>, or <code class="docutils literal notranslate">nz</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftmakeplanmany"> <h3>3.3.6. 
cufftMakePlanMany()<a class="headerlink" href="#cufftmakeplanmany" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftMakePlanMany"> cufftResult cufftMakePlanMany(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int rank, int *n, int *inembed, int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch, size_t *workSize);<a class="headerlink" href="#c.cufftMakePlanMany" title="Permalink to this definition"></a> </dt> <dd>Following a call to <code class="docutils literal notranslate">cufftCreate()</code> makes an FFT plan configuration of dimension <code class="docutils literal notranslate">rank</code>, with sizes specified in the array <code class="docutils literal notranslate">n</code>. The <code class="docutils literal notranslate">batch</code> input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created. The <code class="docutils literal notranslate">cufftMakePlanMany()</code> API supports more complicated input and output data layouts via the advanced data layout parameters: <code class="docutils literal notranslate">inembed</code>, <code class="docutils literal notranslate">istride</code>, <code class="docutils literal notranslate">idist</code>, <code class="docutils literal notranslate">onembed</code>, <code class="docutils literal notranslate">ostride</code>, and <code class="docutils literal notranslate">odist</code>. If <code class="docutils literal notranslate">inembed</code> and <code class="docutils literal notranslate">onembed</code> are set to <code class="docutils literal notranslate">NULL</code>, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays. This call can only be used once for a given handle. 
It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. If <code class="docutils literal notranslate">cufftXtSetGPUs()</code> was called prior to this call with multiple GPUs, then <code class="docutils literal notranslate">workSize</code> will contain multiple sizes. See sections on multiple GPUs for more details. All arrays are assumed to be in CPU memory. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>rank[In] – Dimensionality of the transform (1, 2, or 3)</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension, <code class="docutils literal notranslate">n[0]</code> being the size of the outermost and <code class="docutils literal notranslate">n[rank-1]</code> innermost (contiguous) dimension of a transform. For multiple GPUs and rank equal to 1, the sizes must be a power of 2. For multiple GPUs and rank equal to 2 or 3, the sizes must be factorable into primes less than or equal to 127.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory, <code class="docutils literal notranslate">inembed[0]</code> being the storage dimension of the outermost dimension. 
If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory, <code class="docutils literal notranslate">onembed[0]</code> being the storage dimension of the outermost dimension. If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex). For 2 GPUs this must be a complex to complex transform.</li> <li>batch[In] – Batch size for this transform.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size(s) of the work areas.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. 
Handle is not valid when the plan is locked or multi-GPU restrictions are not met.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftmakeplanmany64"> <h3>3.3.7. cufftMakePlanMany64()<a class="headerlink" href="#cufftmakeplanmany64" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftMakePlanMany64"> cufftResult cufftMakePlanMany64(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int rank, long long int *n, long long int *inembed, long long int istride, long long int idist, long long int *onembed, long long int ostride, long long int odist, cufftType type, long long int batch, size_t *workSize);<a class="headerlink" href="#c.cufftMakePlanMany64" title="Permalink to this definition"></a> </dt> <dd>Following a call to <code class="docutils literal notranslate">cufftCreate()</code> makes an FFT plan configuration of dimension <code class="docutils literal notranslate">rank</code>, with sizes specified in the array <code class="docutils literal notranslate">n</code>. The <code class="docutils literal notranslate">batch</code> input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created. This API is identical to <code class="docutils literal notranslate">cufftMakePlanMany</code> except that the arguments specifying sizes and strides are 64 bit integers. This API makes very large transforms possible. cuFFT includes kernels that use 32 bit indexes, and kernels that use 64 bit indexes. 
cuFFT planning selects 32 bit kernels whenever possible to avoid any overhead due to 64 bit arithmetic. All sizes and types of transform are supported by this interface, with two exceptions. For transforms whose size exceeds 4G elements, the dimensions specified in the array <code class="docutils literal notranslate">n</code> must be factorable into primes that are less than or equal to 127. For real to complex and complex to real transforms whose size exceeds 4G elements, the fastest changing dimension must be even. The <code class="docutils literal notranslate">cufftMakePlanMany64()</code> API supports more complicated input and output data layouts via the advanced data layout parameters: <code class="docutils literal notranslate">inembed</code>, <code class="docutils literal notranslate">istride</code>, <code class="docutils literal notranslate">idist</code>, <code class="docutils literal notranslate">onembed</code>, <code class="docutils literal notranslate">ostride</code>, and <code class="docutils literal notranslate">odist</code>. If <code class="docutils literal notranslate">inembed</code> and <code class="docutils literal notranslate">onembed</code> are set to <code class="docutils literal notranslate">NULL</code>, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays. This call can only be used once for a given handle. It will fail and return <code class="docutils literal notranslate">CUFFT_INVALID_PLAN</code> if the plan is locked, i.e. the handle was previously used with a different <code class="docutils literal notranslate">cufftPlan</code> or <code class="docutils literal notranslate">cufftMakePlan</code> call. If <code class="docutils literal notranslate">cufftXtSetGPUs()</code> was called prior to this call with multiple GPUs, then <code class="docutils literal notranslate">workSize</code> will contain multiple sizes. See sections on multiple GPUs for more details. 
All arrays are assumed to be in CPU memory. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>rank[In] – Dimensionality of the transform (1, 2, or 3).</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension. For multiple GPUs and rank equal to 1, the sizes must be a power of 2. For multiple GPUs and rank equal to 2 or 3, the sizes must be factorable into primes less than or equal to 127.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex). 
For 2 GPUs this must be a complex to complex transform.</li> <li>batch[In] – Batch size for this transform.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size(s) of the work areas.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. Handle is not valid when the plan is locked or multi-GPU restrictions are not met.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtmakeplanmany"> <h3>3.3.8. 
cufftXtMakePlanMany()<a class="headerlink" href="#cufftxtmakeplanmany" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtMakePlanMany"> cufftResult cufftXtMakePlanMany(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int rank, long long int *n, long long int *inembed, long long int istride, long long int idist, cudaDataType inputtype, long long int *onembed, long long int ostride, long long int odist, cudaDataType outputtype, long long int batch, size_t *workSize, cudaDataType executiontype);<a class="headerlink" href="#c.cufftXtMakePlanMany" title="Permalink to this definition"></a> </dt> <dd>Following a call to <code class="docutils literal notranslate">cufftCreate()</code> makes an FFT plan configuration of dimension <code class="docutils literal notranslate">rank</code>, with sizes specified in the array <code class="docutils literal notranslate">n</code>. The <code class="docutils literal notranslate">batch</code> input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created. Type specifiers <code class="docutils literal notranslate">inputtype</code>, <code class="docutils literal notranslate">outputtype</code> and <code class="docutils literal notranslate">executiontype</code> dictate type and precision of transform to be performed. Not all combinations of parameters are supported. Currently all three parameters need to match precision. Parameters <code class="docutils literal notranslate">inputtype</code> and <code class="docutils literal notranslate">outputtype</code> need to match transform type complex-to-complex, real-to-complex or complex-to-real. Parameter <code class="docutils literal notranslate">executiontype</code> needs to match precision and be of a complex type. 
Example: for a half-precision real-to-complex transform, parameters <code class="docutils literal notranslate">inputtype</code>, <code class="docutils literal notranslate">outputtype</code> and <code class="docutils literal notranslate">executiontype</code> would have values of <code class="docutils literal notranslate">CUDA_R_16F</code>, <code class="docutils literal notranslate">CUDA_C_16F</code> and <code class="docutils literal notranslate">CUDA_C_16F</code> respectively. Similarly, a bfloat16 complex-to-real transform would use <code class="docutils literal notranslate">CUDA_C_16BF</code> for <code class="docutils literal notranslate">inputtype</code> and <code class="docutils literal notranslate">executiontype</code>, and <code class="docutils literal notranslate">CUDA_R_16BF</code> for <code class="docutils literal notranslate">outputtype</code>. The <code class="docutils literal notranslate">cufftXtMakePlanMany()</code> API supports more complicated input and output data layouts via the advanced data layout parameters: <code class="docutils literal notranslate">inembed</code>, <code class="docutils literal notranslate">istride</code>, <code class="docutils literal notranslate">idist</code>, <code class="docutils literal notranslate">onembed</code>, <code class="docutils literal notranslate">ostride</code>, and <code class="docutils literal notranslate">odist</code>. If <code class="docutils literal notranslate">inembed</code> and <code class="docutils literal notranslate">onembed</code> are set to <code class="docutils literal notranslate">NULL</code>, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays. If <code class="docutils literal notranslate">cufftXtSetGPUs()</code> was called prior to this call with multiple GPUs, then <code class="docutils literal notranslate">workSize</code> will contain multiple sizes. See sections on multiple GPUs for more details. 
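The advanced data layout parameters can be read as an addressing rule. The sketch below assumes the indexing convention that the cuFFT documentation gives for advanced data layout (for rank 2, element (x, y) of signal b lives at <code class="docutils literal notranslate">b * idist + (x * inembed[1] + y) * istride</code>); the helper name <code class="docutils literal notranslate">layout_offset</code> is hypothetical and is not a cuFFT function.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of logical element idx[0..rank-1] of signal b.
   Assumed addressing rule (advanced data layout):
     rank 1: b*dist + x*stride
     rank 2: b*dist + (x*nembed[1] + y)*stride
     rank 3: b*dist + ((x*nembed[1] + y)*nembed[2] + z)*stride
   Pass inembed/istride/idist for the input side, or
   onembed/ostride/odist for the output side. */
static long long layout_offset(int rank, const long long *idx,
                               const long long *nembed,
                               long long stride, long long dist,
                               long long b)
{
    assert(rank >= 1 && rank <= 3 && idx != NULL && nembed != NULL);

    long long linear = idx[0];
    for (int d = 1; d < rank; ++d) {
        /* nembed[0] never enters the address computation. */
        linear = linear * nembed[d] + idx[d];
    }
    return b * dist + linear * stride;
}
```

With a storage shape of 4 by 8, a stride of 2 and a distance of 100 between signals, element (2, 3) of signal 1 is at offset 100 + (2 * 8 + 3) * 2 = 138. For a contiguous complex-to-complex layout, the same rule applies with the storage dimensions equal to <code class="docutils literal notranslate">n</code>, a stride of 1 and a distance equal to the product of the sizes.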
All arrays are assumed to be in CPU memory. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>rank[In] – Dimensionality of the transform (1, 2, or 3).</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension, <code class="docutils literal notranslate">n[0]</code> being the size of the outermost and <code class="docutils literal notranslate">n[rank-1]</code> innermost (contiguous) dimension of a transform. For multiple GPUs and rank equal to 1, the sizes must be a power of 2. For multiple GPUs and rank equal to 2 or 3, the sizes must be factorable into primes less than or equal to 127.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory, <code class="docutils literal notranslate">inembed[0]</code> being the storage dimension of the outermost dimension. If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>inputtype[In] – Type of input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory, <code class="docutils literal notranslate">onembed[0]</code> being the storage dimension of the outermost dimension. 
If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>outputtype[In] – Type of output data.</li> <li>batch[In] – Batch size for this transform.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>executiontype[In] – Type of data to be used for computations.</li> <li>*workSize[Out] – Pointer to the size(s) of the work areas.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully created the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle. Handle is not valid when multi-GPU restrictions are not met.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufft-plan-properties"> <h2>3.4. cuFFT Plan Properties<a class="headerlink" href="#cufft-plan-properties" title="Permalink to this headline"></a></h2> Users can further customize cuFFT plans using plan properties. These properties can be set, queried and reset on a per-plan basis as needed, using the routines listed in this section. 
The currently supported properties are listed below: <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 17%" /> <col style="width: 12%" /> <col style="width: 55%" /> <col style="width: 17%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">Property</th> <th class="head">Underlying Type</th> <th class="head">Description</th> <th class="head">Behavior</th> </tr> </thead> <tbody> <tr class="row-even"><td><code class="docutils literal notranslate">NVFFT_PLAN_PROPERTY_INT64_PATIENT_JIT</code></td> <td>long long int</td> <td><ul class="simple"> <li>Runtime LTO kernels are enabled when set to a nonzero value. See <a class="reference external" href="index.html#cufft-link-time-optimized-kernels">Link-Time Optimized Kernels</a>.</li> <li>Runtime LTO kernels are disabled when set to zero (default).</li> </ul> </td> <td><ul class="simple"> <li>Can be set / reset before planning</li> <li>Cannot be set / reset after planning</li> </ul> </td> </tr> </tbody> </table> <section id="cufftsetplanpropertyint64"> <h3>3.4.1. cufftSetPlanPropertyInt64()<a class="headerlink" href="#cufftsetplanpropertyint64" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftSetPlanPropertyInt64"> cufftResult cufftSetPlanPropertyInt64(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftProperty property, const long long int propertyValueInt64);<a class="headerlink" href="#c.cufftSetPlanPropertyInt64" title="Permalink to this definition"></a> </dt> <dd>Associates a cuFFT plan with a property identified by the key <code class="docutils literal notranslate">property</code>. The value for the property is given by <code class="docutils literal notranslate">propertyValueInt64</code>, which is a signed long long integer. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>property[In] – The property identifier, of type <code class="docutils literal notranslate">cufftProperty</code>.</li> <li>propertyValueInt64[In] – Value to set for the property, a signed long long integer.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully set the property.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_NOT_SUPPORTED – The property is not supported, or it cannot be set at this time (e.g., some properties cannot be set after a planning routine has been called for the plan; see <a class="reference external" href="index.html#cufft-plan-properties">cuFFT Plan Properties</a>).</li> <li>CUFFT_INVALID_VALUE – The property, or the value to set it to, is invalid.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftgetplanpropertyint64"> <h3>3.4.2. cufftGetPlanPropertyInt64()<a class="headerlink" href="#cufftgetplanpropertyint64" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetPlanPropertyInt64"> cufftResult cufftGetPlanPropertyInt64(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftProperty property, long long int *propertyValueInt64);<a class="headerlink" href="#c.cufftGetPlanPropertyInt64" title="Permalink to this definition"></a> </dt> <dd>Retrieves the property value identified by the key <code class="docutils literal notranslate">property</code> associated with the cuFFT plan <code class="docutils literal notranslate">plan</code>. 
The value of the property, which is a signed long long integer, is written to the location pointed to by <code class="docutils literal notranslate">propertyValueInt64</code>. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>property[In] – The property identifier, of type <code class="docutils literal notranslate">cufftProperty</code>.</li> <li>propertyValueInt64[Out] – Pointer to the location that receives the value of the property.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully retrieved the property value.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_NOT_SUPPORTED – The property is not supported.</li> <li>CUFFT_INVALID_VALUE – The property is invalid, or the pointer <code class="docutils literal notranslate">propertyValueInt64</code> is null.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftresetplanproperty"> <h3>3.4.3. cufftResetPlanProperty()<a class="headerlink" href="#cufftresetplanproperty" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftResetPlanProperty"> cufftResult cufftResetPlanProperty(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftProperty property);<a class="headerlink" href="#c.cufftResetPlanProperty" title="Permalink to this definition"></a> </dt> <dd>Resets the value of the property identified by the key <code class="docutils literal notranslate">property</code>, associated with the cuFFT plan <code class="docutils literal notranslate">plan</code>, to its default value. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>property[In] – The property identifier, of type <code class="docutils literal notranslate">cufftProperty</code>.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully reset the property value.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_NOT_SUPPORTED – The property is not supported for <code class="docutils literal notranslate">plan</code>, or it cannot be reset at this time (see the Behavior column in <a class="reference external" href="index.html#cufft-plan-properties">cuFFT Plan Properties</a>).</li> <li>CUFFT_INVALID_VALUE – The property is invalid.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufft-estimated-size-of-work-area"> <h2>3.5. cuFFT Estimated Size of Work Area<a class="headerlink" href="#cufft-estimated-size-of-work-area" title="Permalink to this headline"></a></h2> During plan execution, cuFFT requires a work area for temporary storage of intermediate results. The <code class="docutils literal notranslate">cufftEstimate*()</code> calls return an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings. Some problem sizes require much more storage than others. In particular, powers of 2 are very efficient in terms of temporary storage. Large prime numbers, however, use different algorithms and may need up to eight times the storage of a similarly sized power of 2. 
These routines return estimated <code class="docutils literal notranslate">workSize</code> values which may still be smaller than the actual values needed, especially for values of <code class="docutils literal notranslate">n</code> that are not products of powers of 2, 3, 5 and 7. More refined values are given by the <code class="docutils literal notranslate">cufftGetSize*()</code> routines, but these values may still be conservative. <section id="cufftestimate1d"> <h3>3.5.1. cufftEstimate1d()<a class="headerlink" href="#cufftestimate1d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftEstimate1d"> cufftResult cufftEstimate1d(int nx, cufftType type, int batch, size_t *workSize);<a class="headerlink" href="#c.cufftEstimate1d" title="Permalink to this definition"></a> </dt> <dd>During plan execution, cuFFT requires a work area for temporary storage of intermediate results. This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>nx[In] – The transform size (e.g., 256 for a 256-point FFT).</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2C</code> for single precision complex to complex).</li> <li>batch[In] – Number of transforms of size <code class="docutils literal notranslate">nx</code>. 
Please consider using <code class="docutils literal notranslate">cufftEstimateMany</code> for multiple transforms.</li> <li>*workSize[In] – Pointer to the size, in bytes, of the work space.</li> <li>*workSize[Out] – Pointer to the size of the work space.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – The <code class="docutils literal notranslate">nx</code> parameter is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftestimate2d"> <h3>3.5.2. cufftEstimate2d()<a class="headerlink" href="#cufftestimate2d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftEstimate2d"> cufftResult cufftEstimate2d(int nx, int ny, cufftType type, size_t *workSize);<a class="headerlink" href="#c.cufftEstimate2d" title="Permalink to this definition"></a> </dt> <dd>During plan execution, cuFFT requires a work area for temporary storage of intermediate results. This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>nx[In] – The transform size in the x dimension (number of rows).</li> <li>ny[In] – The transform size in the y dimension (number of columns).</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2R</code> for single precision complex to real).</li> <li>*workSize[In] – Pointer to the size, in bytes, of the work space.</li> <li>*workSize[Out] – Pointer to the size of the work space.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – Either or both of the <code class="docutils literal notranslate">nx</code> or <code class="docutils literal notranslate">ny</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftestimate3d"> <h3>3.5.3. cufftEstimate3d()<a class="headerlink" href="#cufftestimate3d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftEstimate3d"> cufftResult cufftEstimate3d(int nx, int ny, int nz, cufftType type, size_t *workSize);<a class="headerlink" href="#c.cufftEstimate3d" title="Permalink to this definition"></a> </dt> <dd>During plan execution, cuFFT requires a work area for temporary storage of intermediate results. This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>nx[In] – The transform size in the x dimension.</li> <li>ny[In] – The transform size in the y dimension.</li> <li>nz[In] – The transform size in the z dimension.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>*workSize[In] – Pointer to the size, in bytes, of the work space.</li> <li>*workSize[Out] – Pointer to the size of the work space.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the <code class="docutils literal notranslate">nx</code>, <code class="docutils literal notranslate">ny</code>, or <code class="docutils literal notranslate">nz</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftestimatemany"> <h3>3.5.4. cufftEstimateMany()<a class="headerlink" href="#cufftestimatemany" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftEstimateMany"> cufftResult cufftEstimateMany(int rank, int *n, int *inembed, int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch, size_t *workSize);<a class="headerlink" href="#c.cufftEstimateMany" title="Permalink to this definition"></a> </dt> <dd>During plan execution, cuFFT requires a work area for temporary storage of intermediate results. 
This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings. The <code class="docutils literal notranslate">cufftEstimateMany()</code> API supports more complicated input and output data layouts via the advanced data layout parameters: <code class="docutils literal notranslate">inembed</code>, <code class="docutils literal notranslate">istride</code>, <code class="docutils literal notranslate">idist</code>, <code class="docutils literal notranslate">onembed</code>, <code class="docutils literal notranslate">ostride</code>, and <code class="docutils literal notranslate">odist</code>. All arrays are assumed to be in CPU memory. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>rank[In] – Dimensionality of the transform (1, 2, or 3).</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory. 
If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>batch[In] – Batch size for this transform.</li> <li>*workSize[In] – Pointer to the size, in bytes, of the work space.</li> <li>*workSize[Out] – Pointer to the size of the work space</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufft-refined-estimated-size-of-work-area"> <h2>3.6. cuFFT Refined Estimated Size of Work Area<a class="headerlink" href="#cufft-refined-estimated-size-of-work-area" title="Permalink to this headline"></a></h2> The <code class="docutils literal notranslate">cufftGetSize*()</code> routines give a more accurate estimate of the work area size required for a plan than the <code class="docutils literal notranslate">cufftEstimate*()</code> routines as they take into account any plan settings that may have been made. 
As discussed in the section <a class="reference external" href="index.html#work-estimate">cuFFT Estimated Size of Work Area</a>, the <code class="docutils literal notranslate">workSize</code> value(s) returned may be conservative, especially for values of <code class="docutils literal notranslate">n</code> that are not products of powers of 2, 3, 5 and 7. <section id="cufftgetsize1d"> <h3>3.6.1. cufftGetSize1d()<a class="headerlink" href="#cufftgetsize1d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetSize1d"> cufftResult cufftGetSize1d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int nx, cufftType type, int batch, size_t *workSize);<a class="headerlink" href="#c.cufftGetSize1d" title="Permalink to this definition"></a> </dt> <dd>This call gives a more accurate estimate of the work area size required for a plan than <code class="docutils literal notranslate">cufftEstimate1d()</code>, given the specified parameters, and taking into account any plan settings that may have been made. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>nx[In] – The transform size (e.g., 256 for a 256-point FFT).</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2C</code> for single precision complex to complex).</li> <li>batch[In] – Number of transforms of size <code class="docutils literal notranslate">nx</code>. Please consider using <code class="docutils literal notranslate">cufftGetSizeMany</code> for multiple transforms.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. 
For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size of the work space.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – The <code class="docutils literal notranslate">nx</code> parameter is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftgetsize2d"> <h3>3.6.2. cufftGetSize2d()<a class="headerlink" href="#cufftgetsize2d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetSize2d"> cufftResult cufftGetSize2d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int nx, int ny, cufftType type, size_t *workSize);<a class="headerlink" href="#c.cufftGetSize2d" title="Permalink to this definition"></a> </dt> <dd>This call gives a more accurate estimate of the work area size required for a plan than <code class="docutils literal notranslate">cufftEstimate2d()</code>, given the specified parameters, and taking into account any plan settings that may have been made. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>nx[In] – The transform size in the x dimension (number of rows).</li> <li>ny[In] – The transform size in the y dimension (number of columns).</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_C2R</code> for single precision complex to real).</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size of the work space.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – Either or both of the <code class="docutils literal notranslate">nx</code> or <code class="docutils literal notranslate">ny</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftgetsize3d"> <h3>3.6.3. 
cufftGetSize3d()<a class="headerlink" href="#cufftgetsize3d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetSize3d"> cufftResult cufftGetSize3d(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int nx, int ny, int nz, cufftType type, size_t *workSize);<a class="headerlink" href="#c.cufftGetSize3d" title="Permalink to this definition"></a> </dt> <dd>This call gives a more accurate estimate of the work area size required for a plan than <code class="docutils literal notranslate">cufftEstimate3d()</code>, given the specified parameters, and taking into account any plan settings that may have been made. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>nx[In] – The transform size in the x dimension.</li> <li>ny[In] – The transform size in the y dimension.</li> <li>nz[In] – The transform size in the z dimension.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. 
For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size of the work space.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the <code class="docutils literal notranslate">nx</code>, <code class="docutils literal notranslate">ny</code>, or <code class="docutils literal notranslate">nz</code> parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftgetsizemany"> <h3>3.6.4. cufftGetSizeMany()<a class="headerlink" href="#cufftgetsizemany" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetSizeMany"> cufftResult cufftGetSizeMany(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int rank, int *n, int *inembed, int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch, size_t *workSize);<a class="headerlink" href="#c.cufftGetSizeMany" title="Permalink to this definition"></a> </dt> <dd>This call gives a more accurate estimate of the work area size required for a plan than <code class="docutils literal notranslate">cufftEstimateSizeMany()</code>, given the specified parameters, and taking into account any plan settings that may have been made. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>rank[In] – Dimensionality of the transform (1, 2, or 3).</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>batch[In] – Batch size for this transform.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. 
For example, for two GPUs, workSize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size of the work area.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftgetsizemany64"> <h3>3.6.5. cufftGetSizeMany64()<a class="headerlink" href="#cufftgetsizemany64" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetSizeMany64"> cufftResult cufftGetSizeMany64(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int rank, long long int *n, long long int *inembed, long long int istride, long long int idist, long long int *onembed, long long int ostride, long long int odist, cufftType type, long long int batch, size_t *workSize);<a class="headerlink" href="#c.cufftGetSizeMany64" title="Permalink to this definition"></a> </dt> <dd>This call gives a more accurate estimate of the work area size required for a plan than <code class="docutils literal notranslate">cufftEstimateSizeMany()</code>, given the specified parameters, and taking into account any plan settings that may have been made. This API is identical to <code class="docutils literal notranslate">cufftGetSizeMany</code> except that the arguments specifying sizes and strides are 64-bit integers. 
This API makes very large transforms possible. cuFFT includes kernels that use 32 bit indexes, and kernels that use 64 bit indexes. cuFFT planning selects 32 bit kernels whenever possible to avoid any overhead due to 64 bit arithmetic. All sizes and types of transform are supported by this interface, with two exceptions. For transforms whose total size exceeds 4G elements, the dimensions specified in the array <code class="docutils literal notranslate">n</code> must be factorable into primes that are less than or equal to 127. For real to complex and complex to real transforms whose total size exceeds 4G elements, the fastest changing dimension must be even. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>rank[In] – Dimensionality of the transform (1, 2, or 3).</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory. 
If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>type[In] – The transform data type (e.g., <code class="docutils literal notranslate">CUFFT_R2C</code> for single precision real to complex).</li> <li>batch[In] – Batch size for this transform.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size of the work area.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtgetsizemany"> <h3>3.6.6. 
cufftXtGetSizeMany()<a class="headerlink" href="#cufftxtgetsizemany" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtGetSizeMany"> cufftResult cufftXtGetSizeMany(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int rank, long long int *n, long long int *inembed, long long int istride, long long int idist, cudaDataType inputtype, long long int *onembed, long long int ostride, long long int odist, cudaDataType outputtype, long long int batch, size_t *workSize, cudaDataType executiontype);<a class="headerlink" href="#c.cufftXtGetSizeMany" title="Permalink to this definition"></a> </dt> <dd>This call gives a more accurate estimate of the work area size required for a plan than <code class="docutils literal notranslate">cufftEstimateSizeMany()</code>, given the specified parameters, which match the signature of the <code class="docutils literal notranslate">cufftXtMakePlanMany</code> function, and taking into account any plan settings that may have been made. For valid combinations of the <code class="docutils literal notranslate">inputtype</code>, <code class="docutils literal notranslate">outputtype</code>, and <code class="docutils literal notranslate">executiontype</code> parameters, refer to the documentation of the <code class="docutils literal notranslate">cufftXtMakePlanMany</code> function. 
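<p>The constraints stated above for <code class="docutils literal notranslate">cufftGetSizeMany64()</code> on transforms whose total size exceeds 4G elements can be checked on the host before any size query is made. The following is a minimal sketch in plain C, not part of cuFFT: the helper names <code class="docutils literal notranslate">factors_into_small_primes</code> and <code class="docutils literal notranslate">large_transform_dims_ok</code> are hypothetical, and the sketch assumes that the fastest changing dimension is the last entry of <code class="docutils literal notranslate">n</code>.</p>

```c
/* Largest prime factor allowed for transforms above 4G elements,
   taken from the constraint stated in the text. */
#define MAX_PRIME_FACTOR 127

/* Returns 1 if n (n >= 1) factors entirely into primes <= 127, else 0.
   Hypothetical helper, not a cuFFT function. */
int factors_into_small_primes(long long n)
{
    if (n < 1) {
        return 0;
    }
    for (long long p = 2; p <= MAX_PRIME_FACTOR; ++p) {
        /* Strip every factor p. Composite p never divide here because
           their prime factors were already removed. */
        while (n % p == 0) {
            n /= p;
        }
    }
    return n == 1;
}

/* Returns 1 if every dimension passes the prime-factor check and, for
   real-to-complex or complex-to-real transforms, the fastest changing
   dimension (assumed to be n[rank - 1]) is even. Hypothetical helper. */
int large_transform_dims_ok(int rank, const long long *n, int is_real_transform)
{
    for (int i = 0; i < rank; ++i) {
        if (!factors_into_small_primes(n[i])) {
            return 0;
        }
    }
    if (is_real_transform && (n[rank - 1] % 2 != 0)) {
        return 0;
    }
    return 1;
}
```

<p>A caller would apply such a check only when the product of the dimensions and the batch exceeds 4G elements; smaller transforms are not subject to these two restrictions.</p>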
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>rank[In] – Dimensionality of the transform (1, 2, or 3).</li> <li>n[In] – Array of size <code class="docutils literal notranslate">rank</code>, describing the size of each dimension.</li> <li>inembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>istride[In] – Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension.</li> <li>idist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the input data.</li> <li>inputtype[In] (cudaDataType) – Type of input data.</li> <li>onembed[In] – Pointer of size <code class="docutils literal notranslate">rank</code> that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.</li> <li>ostride[In] – Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension.</li> <li>odist[In] – Indicates the distance between the first element of two consecutive signals in a batch of the output data.</li> <li>outputtype[In] (cudaDataType) – Type of output data.</li> <li>batch[In] – Batch size for this transform.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. 
For example for two GPUs worksize must be declared to have two elements.</li> <li>executiontype[In] (cudaDataType) – Type of data to be used for computations.</li> <li>*workSize[Out] – Pointer to the size of the work area.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_SIZE – One or more of the parameters is not a supported size.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufftgetsize"> <h2>3.7. cufftGetSize()<a class="headerlink" href="#cufftgetsize" title="Permalink to this headline"></a></h2> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetSize"> cufftResult cufftGetSize(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, size_t *workSize);<a class="headerlink" href="#c.cufftGetSize" title="Permalink to this definition"></a> </dt> <dd>Once plan generation has been done, either with the original API or the extensible API, this call returns the actual size of the work area required to support the plan. Callers who choose to manage work area allocation within their application must use this call after plan generation, and after any <code class="docutils literal notranslate">cufftSet*()</code> calls subsequent to plan generation, if those calls might alter the required work space size. 
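<p>The caller-managed sequence described here can be sketched as follows. This is an illustrative single-GPU example under stated assumptions, not a definitive implementation: the transform length <code class="docutils literal notranslate">nx</code> and the <code class="docutils literal notranslate">CHECK_CUFFT</code> macro are hypothetical, and error handling is reduced to exiting on failure.</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Hypothetical helper: exit with a message if a cuFFT call fails. */
#define CHECK_CUFFT(call)                                          \
    do {                                                           \
        cufftResult r_ = (call);                                   \
        if (r_ != CUFFT_SUCCESS) {                                 \
            fprintf(stderr, "%s failed: %d\n", #call, (int)r_);    \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

int main(void)
{
    const int nx = 1 << 20;   /* assumed transform length */
    cufftHandle plan;
    size_t workSize = 0;      /* one element: single-GPU plan */
    void *workArea = NULL;

    CHECK_CUFFT(cufftCreate(&plan));

    /* Turn off auto-allocation before plan generation. */
    CHECK_CUFFT(cufftSetAutoAllocation(plan, 0));
    CHECK_CUFFT(cufftMakePlan1d(plan, nx, CUFFT_C2C, 1, &workSize));

    /* Query the actual work area size after plan generation. */
    CHECK_CUFFT(cufftGetSize(plan, &workSize));
    printf("work area: %zu bytes\n", workSize);

    if (workSize > 0) {
        if (cudaMalloc(&workArea, workSize) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed\n");
            cufftDestroy(plan);
            return EXIT_FAILURE;
        }
        CHECK_CUFFT(cufftSetWorkArea(plan, workArea));
    }

    /* cufftExec*() calls on this plan would go here. */

    CHECK_CUFFT(cufftDestroy(plan));
    cudaFree(workArea);       /* the caller owns the work area */
    return EXIT_SUCCESS;
}
```

<p>Because the caller owns the buffer in this sketch, it is released with <code class="docutils literal notranslate">cudaFree</code> only after the plan has been destroyed.</p>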
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>*workSize[In] – Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.</li> <li>*workSize[Out] – Pointer to the size of the work area.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully returned the size of the work space.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufft-caller-allocated-work-area-support"> <h2>3.8. cuFFT Caller Allocated Work Area Support<a class="headerlink" href="#cufft-caller-allocated-work-area-support" title="Permalink to this headline"></a></h2> <section id="cufftsetautoallocation"> <h3>3.8.1. cufftSetAutoAllocation()<a class="headerlink" href="#cufftsetautoallocation" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftSetAutoAllocation"> cufftResult cufftSetAutoAllocation(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int autoAllocate);<a class="headerlink" href="#c.cufftSetAutoAllocation" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftSetAutoAllocation()</code> indicates that the caller intends to allocate and manage work areas for plans that have been generated. cuFFT default behavior is to allocate the work area at plan generation time. 
If <code class="docutils literal notranslate">cufftSetAutoAllocation()</code> has been called with autoAllocate set to 0 (“false”) prior to one of the <code class="docutils literal notranslate">cufftMakePlan*()</code> calls, cuFFT does not allocate the work area. This is the preferred sequence for callers wishing to manage work area allocation. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>autoAllocate[In] – Indicates whether cuFFT should allocate the work area (0 disables auto-allocation).</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully recorded the auto-allocation setting for the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftsetworkarea"> <h3>3.8.2. cufftSetWorkArea()<a class="headerlink" href="#cufftsetworkarea" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftSetWorkArea"> cufftResult cufftSetWorkArea(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, void *workArea);<a class="headerlink" href="#c.cufftSetWorkArea" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftSetWorkArea()</code> overrides the work area pointer associated with a plan. If the work area was auto-allocated, cuFFT frees the auto-allocated space. The <code class="docutils literal notranslate">cufftExecute*()</code> calls assume that the work area pointer is valid and that it points to a contiguous region in device memory that does not overlap with any other work area. 
If this is not the case, results are indeterminate. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>*workArea[In] – Pointer to <code class="docutils literal notranslate">workArea</code>. For multiple GPUs, multiple work area pointers must be given.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully overrode the work area pointer for the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtsetworkareapolicy"> <h3>3.8.3. cufftXtSetWorkAreaPolicy()<a class="headerlink" href="#cufftxtsetworkareapolicy" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtSetWorkAreaPolicy"> cufftResult cufftXtSetWorkAreaPolicy(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftXtWorkAreaPolicy policy, size_t *workSize);<a class="headerlink" href="#c.cufftXtSetWorkAreaPolicy" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtSetWorkAreaPolicy()</code> indicates that the caller intends to change work area size for a given plan handle. cuFFT’s default behavior is to allocate the work area at plan generation time with a default size that depends on the plan type and other parameters. 
If <code class="docutils literal notranslate">cufftXtSetWorkAreaPolicy()</code> has been called with the <code class="docutils literal notranslate">policy</code> parameter set to <code class="docutils literal notranslate">CUFFT_WORKAREA_MINIMAL</code>, cuFFT will attempt to re-plan the handle to use zero bytes of work area memory. If the <code class="docutils literal notranslate">cufftXtSetWorkAreaPolicy()</code> call is successful, the auto-allocated work area memory is released. Currently, the policies <code class="docutils literal notranslate">CUFFT_WORKAREA_PERFORMANCE</code> and <code class="docutils literal notranslate">CUFFT_WORKAREA_USER</code> and the <code class="docutils literal notranslate">workSize</code> parameter are not supported; they are reserved for use in future cuFFT releases. This function can be called only once per lifetime of a plan handle. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>policy[In] – Type of work area policy to apply.</li> <li>*workSize[In] – Reserved for future use.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully applied the work area policy to the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_SIZE – FFT size does not allow use of the selected policy.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufft-execution"> <h2>3.9. 
cuFFT Execution<a class="headerlink" href="#cufft-execution" title="Permalink to this headline"></a></h2> <section id="cufftexecc2c-and-cufftexecz2z"> <h3>3.9.1. cufftExecC2C() and cufftExecZ2Z()<a class="headerlink" href="#cufftexecc2c-and-cufftexecz2z" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftExecC2C"> cufftResult cufftExecC2C(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftComplex *idata, cufftComplex *odata, int direction);<a class="headerlink" href="#c.cufftExecC2C" title="Permalink to this definition"></a> </dt> <dd></dd></dl> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftExecZ2Z"> cufftResult cufftExecZ2Z(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftDoubleComplex *idata, cufftDoubleComplex *odata, int direction);<a class="headerlink" href="#c.cufftExecZ2Z" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftExecC2C()</code> (<code class="docutils literal notranslate">cufftExecZ2Z()</code>) executes a single-precision (double-precision) complex-to-complex transform plan in the transform direction as specified by <code class="docutils literal notranslate">direction</code> parameter. cuFFT uses the GPU memory pointed to by the <code class="docutils literal notranslate">idata</code> parameter as input data. This function stores the Fourier coefficients in the <code class="docutils literal notranslate">odata</code> array. If <code class="docutils literal notranslate">idata</code> and <code class="docutils literal notranslate">odata</code> are the same, this method does an in-place transform. 
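<p>As an illustration of the <code class="docutils literal notranslate">direction</code> parameter and of in-place execution, the following sketch runs a forward transform followed by an inverse transform on the same device buffer. It is a minimal example under stated assumptions: the length <code class="docutils literal notranslate">NX</code> and the ramp input are hypothetical, and the plan is created with <code class="docutils literal notranslate">cufftPlan1d</code> for brevity. Because cuFFT transforms are unnormalized, the round-trip result is divided by <code class="docutils literal notranslate">NX</code> on the host.</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX 8   /* assumed transform length */

int main(void)
{
    cufftComplex host[NX];
    cufftComplex *dev = NULL;
    cufftHandle plan;
    int status = EXIT_FAILURE;

    /* Hypothetical input: a real ramp stored as complex values. */
    for (int i = 0; i < NX; ++i) {
        host[i].x = (float)i;
        host[i].y = 0.0f;
    }

    if (cudaMalloc((void **)&dev, sizeof(host)) != cudaSuccess) {
        return status;
    }
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
        cudaFree(dev);
        return status;
    }

    /* idata == odata, so both transforms run in place. */
    if (cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice) == cudaSuccess &&
        cufftExecC2C(plan, dev, dev, CUFFT_FORWARD) == CUFFT_SUCCESS &&
        cufftExecC2C(plan, dev, dev, CUFFT_INVERSE) == CUFFT_SUCCESS &&
        cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost) == cudaSuccess) {
        /* Scale by 1/NX to undo the unnormalized forward/inverse pair. */
        for (int i = 0; i < NX; ++i) {
            printf("%d: %f\n", i, host[i].x / NX);
        }
        status = EXIT_SUCCESS;
    }

    cufftDestroy(plan);
    cudaFree(dev);
    return status;
}
```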
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>idata[In] – Pointer to the complex input data (in GPU memory) to transform.</li> <li>odata[In] – Pointer to the complex output data (in GPU memory).</li> <li>direction[In] – The transform direction: <code class="docutils literal notranslate">CUFFT_FORWARD</code> or <code class="docutils literal notranslate">CUFFT_INVERSE</code>.</li> <li>odata[Out] – Contains the complex Fourier coefficients.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">idata</code>, <code class="docutils literal notranslate">odata</code>, and <code class="docutils literal notranslate">direction</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftexecr2c-and-cufftexecd2z"> <h3>3.9.2. 
cufftExecR2C() and cufftExecD2Z()<a class="headerlink" href="#cufftexecr2c-and-cufftexecd2z" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftExecR2C"> cufftResult cufftExecR2C(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftReal *idata, cufftComplex *odata);<a class="headerlink" href="#c.cufftExecR2C" title="Permalink to this definition"></a> </dt> <dd></dd></dl> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftExecD2Z"> cufftResult cufftExecD2Z(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftDoubleReal *idata, cufftDoubleComplex *odata);<a class="headerlink" href="#c.cufftExecD2Z" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftExecR2C()</code> (<code class="docutils literal notranslate">cufftExecD2Z()</code>) executes a single-precision (double-precision) real-to-complex, implicitly forward, cuFFT transform plan. cuFFT uses as input data the GPU memory pointed to by the <code class="docutils literal notranslate">idata</code> parameter. This function stores the nonredundant Fourier coefficients in the <code class="docutils literal notranslate">odata</code> array. Pointers to <code class="docutils literal notranslate">idata</code> and <code class="docutils literal notranslate">odata</code> are both required to be aligned to <code class="docutils literal notranslate">cufftComplex</code> data type in single-precision transforms and <code class="docutils literal notranslate">cufftDoubleComplex</code> data type in double-precision transforms. If <code class="docutils literal notranslate">idata</code> and <code class="docutils literal notranslate">odata</code> are the same, this method does an in-place transform. 
Note the data layout differences between in-place and out-of-place transforms as described in <a class="reference external" href="index.html#cufft-transform-types">Parameter cufftType</a>. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>idata[In] – Pointer to the real input data (in GPU memory) to transform.</li> <li>odata[In] – Pointer to the complex output data (in GPU memory).</li> <li>odata[Out] – Contains the complex Fourier coefficients.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">idata</code> and <code class="docutils literal notranslate">odata</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftexecc2r-and-cufftexecz2d"> <h3>3.9.3. 
cufftExecC2R() and cufftExecZ2D()<a class="headerlink" href="#cufftexecc2r-and-cufftexecz2d" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftExecC2R"> cufftResult cufftExecC2R(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftComplex *idata, cufftReal *odata);<a class="headerlink" href="#c.cufftExecC2R" title="Permalink to this definition"></a> </dt> <dd></dd></dl> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftExecZ2D"> cufftResult cufftExecZ2D(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftDoubleComplex *idata, cufftDoubleReal *odata);<a class="headerlink" href="#c.cufftExecZ2D" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftExecC2R()</code> (<code class="docutils literal notranslate">cufftExecZ2D()</code>) executes a single-precision (double-precision) complex-to-real, implicitly inverse, cuFFT transform plan. cuFFT uses as input data the GPU memory pointed to by the <code class="docutils literal notranslate">idata</code> parameter. The input array holds only the nonredundant complex Fourier coefficients. This function stores the real output values in the <code class="docutils literal notranslate">odata</code> array. Pointers to <code class="docutils literal notranslate">idata</code> and <code class="docutils literal notranslate">odata</code> are both required to be aligned to <code class="docutils literal notranslate">cufftComplex</code> data type in single-precision transforms and <code class="docutils literal notranslate">cufftDoubleComplex</code> data type in double-precision transforms. If <code class="docutils literal notranslate">idata</code> and <code class="docutils literal notranslate">odata</code> are the same, this method does an in-place transform. 
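<p>To make "nonredundant" concrete for the one-dimensional case: a real signal of <code class="docutils literal notranslate">nx</code> points has a Hermitian-symmetric spectrum, so only <code class="docutils literal notranslate">nx/2 + 1</code> complex coefficients (integer division) need to be stored. The sketch below is plain host-side C with hypothetical helper names, not part of cuFFT. It computes that count and the number of real elements an in-place buffer must be able to hold, since the same memory has to fit the complex side of the transform.</p>

```c
/* Number of nonredundant complex coefficients for a 1D real
   transform of nx points. Hypothetical helper. */
long long nonredundant_complex_count(long long nx)
{
    return nx / 2 + 1;
}

/* Capacity, in real elements, that an in-place 1D buffer needs:
   two reals per stored complex coefficient. Hypothetical helper. */
long long inplace_real_capacity(long long nx)
{
    return 2 * nonredundant_complex_count(nx);
}
```

<p>For <code class="docutils literal notranslate">nx = 8</code> these helpers give 5 complex coefficients and a capacity of 10 real elements, so an in-place buffer carries two elements of padding beyond the 8 real samples.</p>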
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>idata[In] – Pointer to the complex input data (in GPU memory) to transform.</li> <li>odata[In] – Pointer to the real output data (in GPU memory).</li> <li>odata[Out] – Contains the real output data.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">idata</code> and <code class="docutils literal notranslate">odata</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtexec"> <h3>3.9.4. cufftXtExec()<a class="headerlink" href="#cufftxtexec" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExec"> cufftResult cufftXtExec(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, void *input, void *output, int direction);<a class="headerlink" href="#c.cufftXtExec" title="Permalink to this definition"></a> </dt> <dd>The <code class="docutils literal notranslate">cufftXtExec</code> function executes any cuFFT transform regardless of precision and type. For complex-to-real and real-to-complex transforms, the <code class="docutils literal notranslate">direction</code> parameter is ignored. 
cuFFT uses the GPU memory pointed to by the <code class="docutils literal notranslate">input</code> parameter as input data. This function stores the Fourier coefficients in the <code class="docutils literal notranslate">output</code> array. If <code class="docutils literal notranslate">input</code> and <code class="docutils literal notranslate">output</code> are the same, this method does an in-place transform. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>input[In] – Pointer to the input data (in GPU memory) to transform.</li> <li>output[In] – Pointer to the output data (in GPU memory).</li> <li>direction[In] – The transform direction: <code class="docutils literal notranslate">CUFFT_FORWARD</code> or <code class="docutils literal notranslate">CUFFT_INVERSE</code>. Ignored for complex-to-real and real-to-complex transforms.</li> <li>output[Out] – Contains the complex Fourier coefficients.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">input</code>, <code class="docutils literal notranslate">output</code>, and <code class="docutils literal notranslate">direction</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtexecdescriptor"> <h3>3.9.5.
cufftXtExecDescriptor()<a class="headerlink" href="#cufftxtexecdescriptor" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExecDescriptor"> cufftResult cufftXtExecDescriptor(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc *input, cudaLibXtDesc *output, int direction);<a class="headerlink" href="#c.cufftXtExecDescriptor" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtExecDescriptor()</code> executes any cuFFT transform regardless of precision and type. For complex-to-real and real-to-complex transforms, the <code class="docutils literal notranslate">direction</code> parameter is ignored. cuFFT uses the GPU memory pointed to by the <code class="docutils literal notranslate">cudaLibXtDesc *input</code> descriptor as input data and <code class="docutils literal notranslate">cudaLibXtDesc *output</code> as output data. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>input[In] – Pointer to the complex input data (in GPU memory) to transform.</li> <li>output[In] – Pointer to the complex output data (in GPU memory).</li> <li>direction[In] – The transform direction: <code class="docutils literal notranslate">CUFFT_FORWARD</code> or <code class="docutils literal notranslate">CUFFT_INVERSE</code>.
Ignored for complex-to-real and real-to-complex transforms.</li> <li>output[Out] – Contains the complex Fourier coefficients.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">input</code> and <code class="docutils literal notranslate">direction</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_DEVICE – An invalid GPU index was specified in a descriptor.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufft-and-multiple-gpus"> <h2>3.10. cuFFT and Multiple GPUs<a class="headerlink" href="#cufft-and-multiple-gpus" title="Permalink to this headline"></a></h2> <section id="cufftxtsetgpus"> <h3>3.10.1. cufftXtSetGPUs()<a class="headerlink" href="#cufftxtsetgpus" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtSetGPUs"> cufftResult cufftXtSetGPUs(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, int nGPUs, int *whichGPUs);<a class="headerlink" href="#c.cufftXtSetGPUs" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtSetGPUs()</code> identifies which GPUs are to be used with the plan. As in the single GPU case, <code class="docutils literal notranslate">cufftCreate()</code> creates a plan and <code class="docutils literal notranslate">cufftMakePlan*()</code> does the plan generation.
In cuFFT prior to 10.4.0, this call will return an error if a non-default stream has been associated with the plan. Note that the call to <code class="docutils literal notranslate">cufftXtSetGPUs()</code> must occur after the call to <code class="docutils literal notranslate">cufftCreate()</code> and prior to the call to <code class="docutils literal notranslate">cufftMakePlan*()</code>. The <code class="docutils literal notranslate">whichGPUs</code> parameter of <code class="docutils literal notranslate">cufftXtSetGPUs()</code> determines the ordering of the GPUs with respect to data decomposition (the first data chunk is placed on the GPU denoted by the first element of <code class="docutils literal notranslate">whichGPUs</code>). <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>nGPUs[In] – Number of GPUs to use.</li> <li>whichGPUs[In] – The GPUs to use.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully set the GPUs to use.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle, or a <a class="reference external" href="index.html#streamed-cufft-transforms">non-default stream has been associated with the plan in cuFFT prior to 10.4.0</a>.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the plan failed.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_VALUE – The requested number of GPUs was less than 2 or more than 8.</li> <li>CUFFT_INVALID_DEVICE – An invalid GPU index was specified.</li> <li>CUFFT_INVALID_SIZE – Transform size that <code class="docutils literal
notranslate">plan</code> was created for does not meet minimum size criteria.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtsetworkarea"> <h3>3.10.2. cufftXtSetWorkArea()<a class="headerlink" href="#cufftxtsetworkarea" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtSetWorkArea"> cufftResult cufftXtSetWorkArea(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, void **workArea);<a class="headerlink" href="#c.cufftXtSetWorkArea" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtSetWorkArea()</code> overrides the work areas associated with a plan. If the work area was auto-allocated, cuFFT frees the auto-allocated space. The <code class="docutils literal notranslate">cufftXtExec*()</code> calls assume that the work area is valid and that it points to a contiguous region in each device memory that does not overlap with any other work area. If this is not the case, results are indeterminate. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>workArea[In] – Pointer to the pointers to workArea.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully set the work areas for the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_DEVICE – A GPU associated with the plan could not be selected.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufft-multiple-gpu-execution"> <h3>3.10.3. cuFFT Multiple GPU Execution<a class="headerlink" href="#cufft-multiple-gpu-execution" title="Permalink to this headline"></a></h3> <section id="cufftxtexecdescriptorc2c-and-cufftxtexecdescriptorz2z"> <h4>3.10.3.1.
cufftXtExecDescriptorC2C() and cufftXtExecDescriptorZ2Z()<a class="headerlink" href="#cufftxtexecdescriptorc2c-and-cufftxtexecdescriptorz2z" title="Permalink to this headline"></a></h4> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExecDescriptorC2C"> cufftResult cufftXtExecDescriptorC2C(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc *input, cudaLibXtDesc *output, int direction);<a class="headerlink" href="#c.cufftXtExecDescriptorC2C" title="Permalink to this definition"></a> </dt> <dd></dd></dl> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExecDescriptorZ2Z"> cufftResult cufftXtExecDescriptorZ2Z(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc *input, cudaLibXtDesc *output, int direction);<a class="headerlink" href="#c.cufftXtExecDescriptorZ2Z" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtExecDescriptorC2C()</code> (<code class="docutils literal notranslate">cufftXtExecDescriptorZ2Z()</code>) executes a single-precision (double-precision) complex-to-complex transform plan in the transform direction specified by the <code class="docutils literal notranslate">direction</code> parameter. cuFFT uses the GPU memory pointed to by <code class="docutils literal notranslate">cudaLibXtDesc *input</code> as input data. Since only in-place multiple GPU functionality is supported, this function also stores the result in the <code class="docutils literal notranslate">cudaLibXtDesc *input</code> arrays.
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>*input[In] – Pointer to the complex input data (in GPU memory) to transform.</li> <li>*output[In] – Pointer to the complex output data (in GPU memory).</li> <li>direction[In] – The transform direction: <code class="docutils literal notranslate">CUFFT_FORWARD</code> or <code class="docutils literal notranslate">CUFFT_INVERSE</code>.</li> <li>input[Out] – Contains the complex Fourier coefficients.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">input</code> and <code class="docutils literal notranslate">direction</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_DEVICE – An invalid GPU index was specified in a descriptor.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtexecdescriptorr2c-and-cufftxtexecdescriptord2z"> <h4>3.10.3.2. 
cufftXtExecDescriptorR2C() and cufftXtExecDescriptorD2Z()<a class="headerlink" href="#cufftxtexecdescriptorr2c-and-cufftxtexecdescriptord2z" title="Permalink to this headline"></a></h4> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExecDescriptorR2C"> cufftResult cufftXtExecDescriptorR2C(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc *input, cudaLibXtDesc *output);<a class="headerlink" href="#c.cufftXtExecDescriptorR2C" title="Permalink to this definition"></a> </dt> <dd></dd></dl> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExecDescriptorD2Z"> cufftResult cufftXtExecDescriptorD2Z(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc *input, cudaLibXtDesc *output);<a class="headerlink" href="#c.cufftXtExecDescriptorD2Z" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtExecDescriptorR2C()</code> (<code class="docutils literal notranslate">cufftXtExecDescriptorD2Z()</code>) executes a single-precision (double-precision) real-to-complex transform plan. cuFFT uses the GPU memory pointed to by <code class="docutils literal notranslate">cudaLibXtDesc *input</code> as input data. Since only in-place multiple GPU functionality is supported, this function also stores the result in the <code class="docutils literal notranslate">cudaLibXtDesc *input</code> arrays. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>*input[In] – Pointer to the real input data (in GPU memory) to transform.</li> <li>*output[In] – Pointer to the complex output data (in GPU memory).</li> <li>input[Out] – Contains the complex Fourier coefficients.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">input</code> and <code class="docutils literal notranslate">output</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_DEVICE – An invalid GPU index was specified in a descriptor.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtexecdescriptorc2r-and-cufftxtexecdescriptorz2d"> <h4>3.10.3.3.
cufftXtExecDescriptorC2R() and cufftXtExecDescriptorZ2D()<a class="headerlink" href="#cufftxtexecdescriptorc2r-and-cufftxtexecdescriptorz2d" title="Permalink to this headline"></a></h4> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExecDescriptorC2R"> cufftResult cufftXtExecDescriptorC2R(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc *input, cudaLibXtDesc *output);<a class="headerlink" href="#c.cufftXtExecDescriptorC2R" title="Permalink to this definition"></a> </dt> <dd></dd></dl> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtExecDescriptorZ2D"> cufftResult cufftXtExecDescriptorZ2D(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc *input, cudaLibXtDesc *output);<a class="headerlink" href="#c.cufftXtExecDescriptorZ2D" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtExecDescriptorC2R()</code> (<code class="docutils literal notranslate">cufftXtExecDescriptorZ2D()</code>) executes a single-precision (double-precision) complex-to-real transform plan. These functions take no <code class="docutils literal notranslate">direction</code> parameter. cuFFT uses the GPU memory pointed to by <code class="docutils literal notranslate">cudaLibXtDesc *input</code> as input data. Since only in-place multiple GPU functionality is supported, this function also stores the result in the <code class="docutils literal notranslate">cudaLibXtDesc *input</code> arrays.
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>*input[In] – Pointer to the complex input data (in GPU memory) to transform.</li> <li>*output[In] – Pointer to the real output data (in GPU memory).</li> <li>input[Out] – Contains the real output data.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully executed the FFT plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – At least one of the parameters <code class="docutils literal notranslate">input</code> and <code class="docutils literal notranslate">output</code> is not valid.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_EXEC_FAILED – cuFFT failed to execute the transform on the GPU.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_DEVICE – An invalid GPU index was specified in a descriptor.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="memory-allocation-and-data-movement-functions"> <h3>3.10.4. Memory Allocation and Data Movement Functions<a class="headerlink" href="#memory-allocation-and-data-movement-functions" title="Permalink to this headline"></a></h3> Multiple GPU cuFFT execution functions assume a certain data layout in terms of what input data has been copied to which GPUs prior to execution, and what output data resides in which GPUs post execution. The following functions assist in allocation, setup and retrieval of the data. They must be called after the call to <code class="docutils literal notranslate">cufftMakePlan*()</code>.
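The call ordering described in this section can be sketched as follows. This is an illustrative outline rather than a reference implementation: it assumes a system with at least two GPUs whose device IDs are 0 and 1, a hypothetical 1D transform size NX, and a hypothetical CHECK macro for error handling. It has to be built against the CUDA toolkit and linked with the cuFFT library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cufft.h>
#include <cufftXt.h>

#define NX 1024 /* hypothetical transform size */

/* Hypothetical helper: abort on any cuFFT error. */
#define CHECK(call)                                                  \
    do {                                                             \
        cufftResult r_ = (call);                                     \
        if (r_ != CUFFT_SUCCESS) {                                   \
            fprintf(stderr, "%s failed with code %d\n", #call, (int)r_); \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main(void)
{
    cufftHandle plan;
    int gpus[2] = {0, 1};          /* assumed device IDs */
    size_t workSize[2];            /* one work-area size per GPU */
    cudaLibXtDesc *d_data = NULL;  /* multi-GPU descriptor */
    int i;

    /* Contiguous host buffer holding the input signal. */
    cufftComplex *h_data = (cufftComplex *)malloc(sizeof(cufftComplex) * NX);
    if (h_data == NULL) return EXIT_FAILURE;
    for (i = 0; i < NX; ++i) {
        h_data[i].x = (float)i;
        h_data[i].y = 0.0f;
    }

    /* cufftXtSetGPUs() goes after cufftCreate() and before cufftMakePlan*(). */
    CHECK(cufftCreate(&plan));
    CHECK(cufftXtSetGPUs(plan, 2, gpus));
    CHECK(cufftMakePlan1d(plan, NX, CUFFT_C2C, 1, workSize));

    /* Allocation and copies come after cufftMakePlan*(). */
    CHECK(cufftXtMalloc(plan, &d_data, CUFFT_XT_FORMAT_INPLACE));
    CHECK(cufftXtMemcpy(plan, d_data, h_data, CUFFT_COPY_HOST_TO_DEVICE));

    /* Multi-GPU execution is in-place: same descriptor for input and output. */
    CHECK(cufftXtExecDescriptorC2C(plan, d_data, d_data, CUFFT_FORWARD));

    /* Gather the result from the GPUs into the contiguous host buffer. */
    CHECK(cufftXtMemcpy(plan, h_data, d_data, CUFFT_COPY_DEVICE_TO_HOST));

    CHECK(cufftXtFree(d_data));
    CHECK(cufftDestroy(plan));
    free(h_data);
    return 0;
}
```

The same descriptor is passed as both input and output because only in-place multiple GPU execution is supported.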
<section id="cufftxtmalloc"> <h4>3.10.4.1. cufftXtMalloc()<a class="headerlink" href="#cufftxtmalloc" title="Permalink to this headline"></a></h4> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtMalloc"> cufftResult cufftXtMalloc(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaLibXtDesc **descriptor, cufftXtSubFormat format);<a class="headerlink" href="#c.cufftXtMalloc" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtMalloc()</code> allocates a descriptor, and all memory for data in GPUs associated with the plan, and returns a pointer to the descriptor. Note the descriptor contains an array of device pointers so that the application may preprocess or postprocess the data on the GPUs. The enumerated parameter <code class="docutils literal notranslate">cufftXtSubFormat_t</code> indicates if the buffer will be used for input or output. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>**descriptor[In] – Pointer to a pointer to a <code class="docutils literal notranslate">cudaLibXtDesc</code> object.</li> <li>format[In] – <code class="docutils literal notranslate">cufftXtSubFormat</code> value.</li> <li>**descriptor[Out] – Pointer to a pointer to a <code class="docutils literal notranslate">cudaLibXtDesc</code> object.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully allocated the descriptor and GPU memory.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle or it is not a multiple GPU <code class="docutils literal notranslate">plan</code>.</li> <li>CUFFT_ALLOC_FAILED – The allocation of GPU resources for the
plan failed.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_DEVICE – An invalid GPU index was specified in the descriptor.</li> </ul> </dd> </dl> </dd></dl> <section id="parameter-cufftxtsubformat"> <h5>3.10.4.1.1. Parameter cufftXtSubFormat<a class="headerlink" href="#parameter-cufftxtsubformat" title="Permalink to this headline"></a></h5> <code class="docutils literal notranslate">cufftXtSubFormat_t</code> is an enumerated type that indicates if the buffer will be used for input or output and the ordering of the data. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef enum cufftXtSubFormat_t {
    CUFFT_XT_FORMAT_INPUT,            //by default input is in linear order across GPUs
    CUFFT_XT_FORMAT_OUTPUT,           //by default output is in scrambled order depending on transform
    CUFFT_XT_FORMAT_INPLACE,          //by default inplace is input order, which is linear across GPUs
    CUFFT_XT_FORMAT_INPLACE_SHUFFLED, //shuffled output order after execution of the transform
    CUFFT_FORMAT_UNDEFINED
} cufftXtSubFormat;
</pre></div> </div> </section> </section> <section id="cufftxtfree"> <h4>3.10.4.2. cufftXtFree()<a class="headerlink" href="#cufftxtfree" title="Permalink to this headline"></a></h4> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtFree"> cufftResult cufftXtFree(cudaLibXtDesc *descriptor);<a class="headerlink" href="#c.cufftXtFree" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtFree()</code> frees the descriptor and all memory associated with it. The descriptor and memory must have been returned by a previous call to <code class="docutils literal notranslate">cufftXtMalloc()</code>.
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>*descriptor[In] – Pointer to a <code class="docutils literal notranslate">cudaLibXtDesc</code> object.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully freed the descriptor and associated GPU memory.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtmemcpy"> <h4>3.10.4.3. cufftXtMemcpy()<a class="headerlink" href="#cufftxtmemcpy" title="Permalink to this headline"></a></h4> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtMemcpy"> cufftResult cufftXtMemcpy(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, void *dstPointer, void *srcPointer, cufftXtCopyType type);<a class="headerlink" href="#c.cufftXtMemcpy" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtMemcpy()</code> copies data between buffers on the host and GPUs or between GPUs. The enumerated parameter <code class="docutils literal notranslate">cufftXtCopyType_t</code> indicates the type and direction of transfer. Calling <code class="docutils literal notranslate">cufftXtMemcpy()</code> with the <code class="docutils literal notranslate">CUFFT_COPY_DEVICE_TO_DEVICE</code> transfer type is not supported for multi-GPU batched FFT plans. Note that starting from CUDA 11.2 (cuFFT 10.4.0), <code class="docutils literal notranslate">cufftSetStream()</code> is supported on multi-GPU plans. When a stream is associated with a plan, <code class="docutils literal notranslate">cufftXtMemcpy()</code> remains synchronous across the multiple GPUs.
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>dstPointer[In] – Pointer to the destination address(es).</li> <li>srcPointer[In] – Pointer to the source address(es).</li> <li>type[In] – <code class="docutils literal notranslate">cufftXtCopyType</code> value.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully copied memory between host and GPUs or between GPUs.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle.</li> <li>CUFFT_INVALID_VALUE – One or more invalid parameters were passed to the API.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> <li>CUFFT_INVALID_DEVICE – An invalid GPU index was specified in a descriptor.</li> </ul> </dd> </dl> </dd></dl> <section id="parameter-cufftxtcopytype"> <h5>3.10.4.3.1. Parameter cufftXtCopyType<a class="headerlink" href="#parameter-cufftxtcopytype" title="Permalink to this headline"></a></h5> <code class="docutils literal notranslate">cufftXtCopyType_t</code> is an enumerated type for multiple GPU functions that specifies the type of copy for <code class="docutils literal notranslate">cufftXtMemcpy()</code>. <code class="docutils literal notranslate">CUFFT_COPY_HOST_TO_DEVICE</code> copies data from a contiguous host buffer to multiple device buffers, in the layout cuFFT requires for input data.
<code class="docutils literal notranslate">dstPointer</code> must point to a <code class="docutils literal notranslate">cudaLibXtDesc</code> structure, and <code class="docutils literal notranslate">srcPointer</code> must point to a host memory buffer. <code class="docutils literal notranslate">CUFFT_COPY_DEVICE_TO_HOST</code> copies data from multiple device buffers, in the layout cuFFT produces for output data, to a contiguous host buffer. <code class="docutils literal notranslate">dstPointer</code> must point to a host memory buffer, and <code class="docutils literal notranslate">srcPointer</code> must point to a <code class="docutils literal notranslate">cudaLibXtDesc</code> structure. <code class="docutils literal notranslate">CUFFT_COPY_DEVICE_TO_DEVICE</code> copies data from multiple device buffers, in the layout cuFFT produces for output data, to multiple device buffers, in the layout cuFFT requires for input data. <code class="docutils literal notranslate">dstPointer</code> and <code class="docutils literal notranslate">srcPointer</code> must point to different <code class="docutils literal notranslate">cudaLibXtDesc</code> structures (and therefore memory locations). That is, the copy cannot be in-place. Note that device_to_device <code class="docutils literal notranslate">cufftXtMemcpy()</code> for 2D and 3D data is not currently supported. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef enum cufftXtCopyType_t {
    CUFFT_COPY_HOST_TO_DEVICE,
    CUFFT_COPY_DEVICE_TO_HOST,
    CUFFT_COPY_DEVICE_TO_DEVICE
} cufftXtCopyType;
</pre></div> </div> </section> </section> </section> <section id="general-multiple-gpu-descriptor-types"> <h3>3.10.5. General Multiple GPU Descriptor Types<a class="headerlink" href="#general-multiple-gpu-descriptor-types" title="Permalink to this headline"></a></h3> <section id="cudaxtdesc"> <h4>3.10.5.1.
cudaXtDesc<a class="headerlink" href="#cudaxtdesc" title="Permalink to this headline"></a></h4> A descriptor type used in multiple GPU routines that contains information about the GPUs and their memory locations. <div class="highlight-c++ notranslate"><div class="highlight"><pre>struct cudaXtDesc_t{
    int version;                             //descriptor version
    int nGPUs;                               //number of GPUs
    int GPUs[MAX_CUDA_DESCRIPTOR_GPUS];      //array of device IDs
    void *data[MAX_CUDA_DESCRIPTOR_GPUS];    //array of pointers to data, one per GPU
    size_t size[MAX_CUDA_DESCRIPTOR_GPUS];   //array of data sizes, one per GPU
    void *cudaXtState;                       //opaque CUDA utility structure
};
typedef struct cudaXtDesc_t cudaXtDesc;
</pre></div> </div> </section> <section id="cudalibxtdesc"> <h4>3.10.5.2. cudaLibXtDesc<a class="headerlink" href="#cudalibxtdesc" title="Permalink to this headline"></a></h4> A descriptor type used in multiple GPU routines that contains information about the library used. <div class="highlight-c++ notranslate"><div class="highlight"><pre>struct cudaLibXtDesc_t{
    int version;                //descriptor version
    cudaXtDesc *descriptor;     //multi-GPU memory descriptor
    libFormat library;          //which library recognizes the format
    int subFormat;              //library specific enumerator of sub formats
    void *libDescriptor;        //library specific descriptor e.g. FFT transform plan object
};
typedef struct cudaLibXtDesc_t cudaLibXtDesc;
</pre></div> </div> </section> </section> </section> <section id="cufft-callbacks"> <h2>3.11. cuFFT Callbacks<a class="headerlink" href="#cufft-callbacks" title="Permalink to this headline"></a></h2> <section id="cufftxtsetjitcallback"> <h3>3.11.1.
cufftXtSetJITCallback()<a class="headerlink" href="#cufftxtsetjitcallback" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtSetJITCallback"> cufftResult cufftXtSetJITCallback(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, const char *callbackSymbolName, const void *callbackFatbin, size_t callbackFatbinSize, cufftXtCallbackType type, void **caller_info)<a class="headerlink" href="#c.cufftXtSetJITCallback" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtSetJITCallback()</code> specifies a load or store LTO callback to be used with the plan. This call is valid only after a call to <code class="docutils literal notranslate">cufftCreate()</code>, but before calling <code class="docutils literal notranslate">cufftMakePlan*()</code>, which does the plan generation. If there was already an LTO callback of this type associated with the plan, this new callback routine replaces it. If the new callback requires shared memory, you must call <code class="docutils literal notranslate">cufftXtSetCallbackSharedSize</code> with the amount of shared memory the callback function needs. cuFFT will not retain the amount of shared memory associated with the previous callback if the callback function is changed. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>callbackSymbolName[In] – null-terminated C string containing the (unmangled) callback symbol name (i.e. the name of the LTO callback routine). 
This symbol name will be runtime-compiled, and modifiers such as <code class="docutils literal notranslate">extern "C"</code> or <code class="docutils literal notranslate">namespace</code> are not supported.</li> <li>callbackFatbin[In] – Pointer to the location in host memory where the callback device function is located, after being compiled into LTO-IR with nvcc or NVRTC.</li> <li>callbackFatbinSize[In] – Size in bytes of the data pointed at by <code class="docutils literal notranslate">callbackFatbin</code>.</li> <li>type[In] – Type of callback routine.</li> <li>callerInfo[In] – Optional array of device pointers to caller specific information, one per GPU.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully associated the callback function with the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not valid (e.g. the handle was already used to make a plan).</li> <li>CUFFT_INVALID_TYPE – The callback type is not valid.</li> <li>CUFFT_INVALID_VALUE – The pointer to the callback device function is invalid or the size is <code class="docutils literal notranslate">0</code>.</li> <li>CUFFT_NOT_SUPPORTED – The functionality is not supported yet (e.g. multi-GPU with LTO callbacks).</li> <li>CUFFT_INTERNAL_ERROR – cuFFT encountered an unexpected error, likely in the runtime linking process; error codes will be expanded in a future release.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtsetcallback"> <h3>3.11.2. 
cufftXtSetCallback()<a class="headerlink" href="#cufftxtsetcallback" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtSetCallback"> cufftResult cufftXtSetCallback(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, void **callbackRoutine, cufftXtCallbackType type, void **callerInfo)<a class="headerlink" href="#c.cufftXtSetCallback" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtSetCallback()</code> specifies a load or store legacy callback to be used with the plan. This call is valid only after a call to <code class="docutils literal notranslate">cufftMakePlan*()</code>, which does the plan generation. If there was already a legacy callback of this type associated with the plan, this new callback routine replaces it. If the new callback requires shared memory, you must call <code class="docutils literal notranslate">cufftXtSetCallbackSharedSize</code> with the amount of shared memory it needs. cuFFT will not retain the amount of shared memory associated with the previous callback. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>callbackRoutine[In] – Array of callback routine pointers, one per GPU.</li> <li>type[In] – Type of callback routine.</li> <li>callerInfo[In] – Optional array of device pointers to caller specific information, one per GPU.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully associated the callback function with the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle, or a <a class="reference external" href="index.html#streamed-cufft-transforms">non-default stream has been associated with the plan in cuFFT prior to 10.4.0</a>.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_SETUP_FAILED – The cuFFT library failed to initialize.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtclearcallback"> <h3>3.11.3. cufftXtClearCallback()<a class="headerlink" href="#cufftxtclearcallback" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtClearCallback"> cufftResult cufftXtClearCallback(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftXtCallbackType type)<a class="headerlink" href="#c.cufftXtClearCallback" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtClearCallback()</code> instructs cuFFT to stop invoking the specified legacy callback type when executing the plan. Only the specified callback is cleared. If no callback of this type had been specified, the return code is <code class="docutils literal notranslate">CUFFT_SUCCESS</code>. 
Note that this method does not work with LTO callbacks. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>type[In] – Type of callback routine.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT successfully disassociated the callback function from the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle, or a <a class="reference external" href="index.html#streamed-cufft-transforms">non-default stream has been associated with the plan in cuFFT prior to 10.4.0</a>.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftxtsetcallbacksharedsize"> <h3>3.11.4. cufftXtSetCallbackSharedSize()<a class="headerlink" href="#cufftxtsetcallbacksharedsize" title="Permalink to this headline"></a></h3> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftXtSetCallbackSharedSize"> cufftResult cufftXtSetCallbackSharedSize(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cufftXtCallbackType type, size_t sharedSize)<a class="headerlink" href="#c.cufftXtSetCallbackSharedSize" title="Permalink to this definition"></a> </dt> <dd><code class="docutils literal notranslate">cufftXtSetCallbackSharedSize()</code> instructs cuFFT to dynamically allocate shared memory at launch time, for use by the callback. The maximum allowable amount of shared memory is 16K bytes. cuFFT passes a pointer to this shared memory to the callback routine at execution time. This shared memory is only valid for the life of the load or store callback operation. 
During execution, cuFFT may overwrite shared memory for its own purposes. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – <code class="docutils literal notranslate">cufftHandle</code> returned by <code class="docutils literal notranslate">cufftCreate</code>.</li> <li>type[In] – Type of callback routine.</li> <li>sharedSize[In] – Amount of shared memory requested.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – cuFFT will invoke the callback routine with a pointer to the requested amount of shared memory.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle, or a <a class="reference external" href="index.html#streamed-cufft-transforms">non-default stream has been associated with the plan in cuFFT prior to 10.4.0</a>.</li> <li>CUFFT_INTERNAL_ERROR – An internal driver error was detected.</li> <li>CUFFT_ALLOC_FAILED – cuFFT will not be able to allocate the requested amount of shared memory.</li> </ul> </dd> </dl> </dd></dl> </section> </section> <section id="cufftsetstream"> <h2>3.12. cufftSetStream()<a class="headerlink" href="#cufftsetstream" title="Permalink to this headline"></a></h2> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftSetStream"> cufftResult cufftSetStream(<a class="reference internal" href="#c.cufftHandle" title="cufftHandle">cufftHandle</a> plan, cudaStream_t stream);<a class="headerlink" href="#c.cufftSetStream" title="Permalink to this definition"></a> </dt> <dd>Associates a CUDA stream with a cuFFT plan. All kernel launches made during plan execution are now done through the associated stream, enabling overlap with activity in other streams (e.g. data copying). 
The association remains until the plan is destroyed or the stream is changed with another call to <code class="docutils literal notranslate">cufftSetStream()</code>. Note that starting from CUDA 11.2 (cuFFT 10.4.0), <code class="docutils literal notranslate">cufftSetStream()</code> is supported on multi-GPU plans. When associating a stream with a plan, <code class="docutils literal notranslate">cufftXtMemcpy()</code> remains synchronous across the multiple GPUs. For previous versions of cuFFT, <code class="docutils literal notranslate">cufftSetStream()</code> returns an error on multi-GPU plans. Note that starting from CUDA 12.2 (cuFFT 11.0.8), on multi-GPU plans, <code class="docutils literal notranslate">stream</code> can be associated with any context on any GPU. However, repeated calls to <code class="docutils literal notranslate">cufftSetStream()</code> with streams from different contexts incur a small time penalty. Optimal performance is obtained when repeated calls to <code class="docutils literal notranslate">cufftSetStream()</code> use streams from the same CUDA context. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>plan[In] – The <code class="docutils literal notranslate">cufftHandle</code> object to associate with the stream.</li> <li>stream[In] – A valid CUDA stream created with <code class="docutils literal notranslate">cudaStreamCreate()</code>; <code class="docutils literal notranslate">0</code> for the default stream.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – The stream was associated with the plan.</li> <li>CUFFT_INVALID_PLAN – The <code class="docutils literal notranslate">plan</code> parameter is not a valid handle, or the plan is multi-GPU in a cuFFT version prior to 10.4.0.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufftgetversion"> <h2>3.13. 
cufftGetVersion()<a class="headerlink" href="#cufftgetversion" title="Permalink to this headline"></a></h2> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetVersion"> cufftResult cufftGetVersion(int *version);<a class="headerlink" href="#c.cufftGetVersion" title="Permalink to this definition"></a> </dt> <dd>Returns the version number of cuFFT. <dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>*version[In] – Pointer to the version number.</li> <li>*version[Out] – Contains the version number.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even">CUFFT_SUCCESS – cuFFT successfully returned the version number. </dd> </dl> </dd></dl> </section> <section id="cufftgetproperty"> <h2>3.14. cufftGetProperty()<a class="headerlink" href="#cufftgetproperty" title="Permalink to this headline"></a></h2> <dl class="c function"> <dt class="sig sig-object c" id="c.cufftGetProperty"> cufftResult cufftGetProperty(libraryPropertyType type, int *value);<a class="headerlink" href="#c.cufftGetProperty" title="Permalink to this definition"></a> </dt> <dd>Returns in <code class="docutils literal notranslate">*value</code> the number for the property described by <code class="docutils literal notranslate">type</code> of the dynamically linked cuFFT library. 
<dl class="field-list simple"> <dt class="field-odd">Parameters</dt> <dd class="field-odd"><ul class="simple"> <li>type[In] – CUDA library property.</li> <li>value[Out] – Contains the integer value for the requested property.</li> </ul> </dd> <dt class="field-even">Return values</dt> <dd class="field-even"><ul class="simple"> <li>CUFFT_SUCCESS – The property value was successfully returned.</li> <li>CUFFT_INVALID_TYPE – The property type is not recognized.</li> <li>CUFFT_INVALID_VALUE – <code class="docutils literal notranslate">value</code> is <code class="docutils literal notranslate">NULL</code>.</li> </ul> </dd> </dl> </dd></dl> </section> <section id="cufft-types"> <h2>3.15. cuFFT Types<a class="headerlink" href="#cufft-types" title="Permalink to this headline"></a></h2> <section id="parameter-cuffttype"> <h3>3.15.1. Parameter cufftType<a class="headerlink" href="#parameter-cuffttype" title="Permalink to this headline"></a></h3> The cuFFT library supports complex- and real-data transforms. The <code class="docutils literal notranslate">cufftType</code> data type is an enumeration of the types of transform data supported by cuFFT. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef enum cufftType_t {
    CUFFT_R2C = 0x2a,  // Real to complex (interleaved)
    CUFFT_C2R = 0x2c,  // Complex (interleaved) to real
    CUFFT_C2C = 0x29,  // Complex to complex (interleaved)
    CUFFT_D2Z = 0x6a,  // Double to double-complex (interleaved)
    CUFFT_Z2D = 0x6c,  // Double-complex (interleaved) to double
    CUFFT_Z2Z = 0x69   // Double-complex to double-complex (interleaved)
} cufftType;
</pre></div> </div> </section> <section id="parameters-for-transform-direction"> <h3>3.15.2. Parameters for Transform Direction<a class="headerlink" href="#parameters-for-transform-direction" title="Permalink to this headline"></a></h3> The cuFFT library defines forward and inverse Fast Fourier Transforms according to the sign of the complex exponential term. 
<div class="highlight-c++ notranslate"><div class="highlight"><pre>#define CUFFT_FORWARD -1
#define CUFFT_INVERSE 1
</pre></div> </div> cuFFT performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input, scaled by the number of elements. Scaling either transform by the reciprocal of the size of the data set is left to the user. </section> <section id="type-definitions-for-callbacks"> <h3>3.15.3. Type definitions for callbacks<a class="headerlink" href="#type-definitions-for-callbacks" title="Permalink to this headline"></a></h3> The cuFFT library supports callback functions for all combinations of single or double precision, real or complex data, load or store. These are enumerated in the parameter <code class="docutils literal notranslate">cufftXtCallbackType</code>. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef enum cufftXtCallbackType_t {
    CUFFT_CB_LD_COMPLEX = 0x0,
    CUFFT_CB_LD_COMPLEX_DOUBLE = 0x1,
    CUFFT_CB_LD_REAL = 0x2,
    CUFFT_CB_LD_REAL_DOUBLE = 0x3,
    CUFFT_CB_ST_COMPLEX = 0x4,
    CUFFT_CB_ST_COMPLEX_DOUBLE = 0x5,
    CUFFT_CB_ST_REAL = 0x6,
    CUFFT_CB_ST_REAL_DOUBLE = 0x7,
    CUFFT_CB_UNDEFINED = 0x8
} cufftXtCallbackType;
</pre></div> </div> <section id="type-definitions-for-lto-callbacks"> <h4>3.15.3.1. 
Type definitions for LTO callbacks<a class="headerlink" href="#type-definitions-for-lto-callbacks" title="Permalink to this headline"></a></h4> The LTO callback function prototypes and pointer type definitions are as follows: <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef cufftComplex (*cufftJITCallbackLoadC)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleComplex (*cufftJITCallbackLoadZ)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
typedef cufftReal (*cufftJITCallbackLoadR)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleReal (*cufftJITCallbackLoadD)(void *dataIn, unsigned long long offset, void *callerInfo, void *sharedPointer);
typedef void (*cufftJITCallbackStoreC)(void *dataOut, unsigned long long offset, cufftComplex element, void *callerInfo, void *sharedPointer);
typedef void (*cufftJITCallbackStoreZ)(void *dataOut, unsigned long long offset, cufftDoubleComplex element, void *callerInfo, void *sharedPointer);
typedef void (*cufftJITCallbackStoreR)(void *dataOut, unsigned long long offset, cufftReal element, void *callerInfo, void *sharedPointer);
typedef void (*cufftJITCallbackStoreD)(void *dataOut, unsigned long long offset, cufftDoubleReal element, void *callerInfo, void *sharedPointer);
</pre></div> </div> Notice the difference in the type of the <code class="docutils literal notranslate">offset</code> parameter (<code class="docutils literal notranslate">unsigned long long</code>) vs. legacy callbacks (which use <code class="docutils literal notranslate">size_t</code>). </section> <section id="type-definitions-for-legacy-callbacks"> <h4>3.15.3.2. 
Type definitions for legacy callbacks<a class="headerlink" href="#type-definitions-for-legacy-callbacks" title="Permalink to this headline"></a></h4> The legacy callback function prototypes and pointer type definitions are as follows: <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef cufftComplex (*cufftCallbackLoadC)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleComplex (*cufftCallbackLoadZ)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
typedef cufftReal (*cufftCallbackLoadR)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
typedef cufftDoubleReal (*cufftCallbackLoadD)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
typedef void (*cufftCallbackStoreC)(void *dataOut, size_t offset, cufftComplex element, void *callerInfo, void *sharedPointer);
typedef void (*cufftCallbackStoreZ)(void *dataOut, size_t offset, cufftDoubleComplex element, void *callerInfo, void *sharedPointer);
typedef void (*cufftCallbackStoreR)(void *dataOut, size_t offset, cufftReal element, void *callerInfo, void *sharedPointer);
typedef void (*cufftCallbackStoreD)(void *dataOut, size_t offset, cufftDoubleReal element, void *callerInfo, void *sharedPointer);
</pre></div> </div> </section> </section> <section id="other-cufft-types"> <h3>3.15.4. Other cuFFT Types<a class="headerlink" href="#other-cufft-types" title="Permalink to this headline"></a></h3> <section id="cuffthandle"> <h4>3.15.4.1. cufftHandle<a class="headerlink" href="#cuffthandle" title="Permalink to this headline"></a></h4> <dl class="c type"> <dt class="sig sig-object c" id="c.cufftHandle"> type cufftHandle<a class="headerlink" href="#c.cufftHandle" title="Permalink to this definition"></a> </dt> <dd>A handle type used to store and access cuFFT plans. The user receives a handle after creating a cuFFT plan and uses this handle to execute the plan. 
</dd></dl> <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef unsigned int cufftHandle; </pre></div> </div> </section> <section id="cufftreal"> <h4>3.15.4.2. cufftReal<a class="headerlink" href="#cufftreal" title="Permalink to this headline"></a></h4> A single-precision, floating-point real data type. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef float cufftReal; </pre></div> </div> </section> <section id="cufftdoublereal"> <h4>3.15.4.3. cufftDoubleReal<a class="headerlink" href="#cufftdoublereal" title="Permalink to this headline"></a></h4> A double-precision, floating-point real data type. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef double cufftDoubleReal; </pre></div> </div> </section> <section id="cufftcomplex"> <h4>3.15.4.4. cufftComplex<a class="headerlink" href="#cufftcomplex" title="Permalink to this headline"></a></h4> A single-precision, floating-point complex data type that consists of interleaved real and imaginary components. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef cuComplex cufftComplex; </pre></div> </div> </section> <section id="cufftdoublecomplex"> <h4>3.15.4.5. cufftDoubleComplex<a class="headerlink" href="#cufftdoublecomplex" title="Permalink to this headline"></a></h4> A double-precision, floating-point complex data type that consists of interleaved real and imaginary components. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef cuDoubleComplex cufftDoubleComplex; </pre></div> </div> </section> </section> </section> <section id="common-types"> <h2>3.16. Common types<a class="headerlink" href="#common-types" title="Permalink to this headline"></a></h2> <section id="cudadatatype"> <h3>3.16.1. 
cudaDataType<a class="headerlink" href="#cudadatatype" title="Permalink to this headline"></a></h3> The <code class="docutils literal notranslate">cudaDataType</code> data type is an enumeration of the types supported by CUDA libraries. <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef enum cudaDataType_t {
    CUDA_R_16F= 2,  // 16 bit real
    CUDA_C_16F= 6,  // 16 bit complex
    CUDA_R_32F= 0,  // 32 bit real
    CUDA_C_32F= 4,  // 32 bit complex
    CUDA_R_64F= 1,  // 64 bit real
    CUDA_C_64F= 5,  // 64 bit complex
    CUDA_R_8I= 3,   // 8 bit real as a signed integer
    CUDA_C_8I= 7,   // 8 bit complex as a pair of signed integers
    CUDA_R_8U= 8,   // 8 bit real as an unsigned integer
    CUDA_C_8U= 9    // 8 bit complex as a pair of unsigned integers
} cudaDataType;
</pre></div> </div> </section> <section id="librarypropertytype"> <h3>3.16.2. libraryPropertyType<a class="headerlink" href="#librarypropertytype" title="Permalink to this headline"></a></h3> The <code class="docutils literal notranslate">libraryPropertyType</code> data type is an enumeration of library property types (i.e., CUDA version X.Y.Z would yield <code class="docutils literal notranslate">MAJOR_VERSION=X</code>, <code class="docutils literal notranslate">MINOR_VERSION=Y</code>, <code class="docutils literal notranslate">PATCH_LEVEL=Z</code>). <div class="highlight-c++ notranslate"><div class="highlight"><pre>typedef enum libraryPropertyType_t {
    MAJOR_VERSION,
    MINOR_VERSION,
    PATCH_LEVEL
} libraryPropertyType;
</pre></div> </div> </section> </section> </section> <section id="multiple-gpu-data-organization"> <h1>4. Multiple GPU Data Organization<a class="headerlink" href="#multiple-gpu-data-organization" title="Permalink to this headline"></a></h1> This chapter explains how data are distributed between the GPUs, before and after a multiple GPU transform. For simplicity, it is assumed in this chapter that the caller has specified GPU 0 and GPU 1 to perform the transform. 
<section id="multiple-gpu-data-organization-for-batched-transforms"> <h2>4.1. Multiple GPU Data Organization for Batched Transforms<a class="headerlink" href="#multiple-gpu-data-organization-for-batched-transforms" title="Permalink to this headline"></a></h2> For batches of transforms, each individual transform is executed on a single GPU. If possible the batches are evenly distributed among the GPUs. For a batch of size <code class="docutils literal notranslate">m</code> performed on <code class="docutils literal notranslate">n</code> GPUs, where <code class="docutils literal notranslate">m</code> is not divisible by <code class="docutils literal notranslate">n</code>, the first <code class="docutils literal notranslate">m % n</code> GPUs will perform \(\left\lfloor \frac{m}{n} \right\rfloor+\ 1\) transforms. The remaining GPUs will perform \(\left\lfloor \frac{m}{n} \right\rfloor\) transforms. For example, in a batch of 15 transforms performed on 4 GPUs, the first three GPUs would perform 4 transforms, and the last GPU would perform 3 transforms. This approach removes the need for data exchange between the GPUs, and results in nearly perfect scaling for cases where the batch size is divisible by the number of GPUs. </section> <section id="multiple-gpu-data-organization-for-single-2d-and-3d-transforms"> <h2>4.2. Multiple GPU Data Organization for Single 2D and 3D Transforms<a class="headerlink" href="#multiple-gpu-data-organization-for-single-2d-and-3d-transforms" title="Permalink to this headline"></a></h2> Single transforms performed on multiple GPUs require the data to be divided between the GPUs. Then execution takes place in phases. For example with 2 GPUs, for 2D and 3D transforms with even sized dimensions, each GPU does half of the transform in (rank - 1) dimensions. Then data are exchanged between the GPUs so that the final dimension can be processed. 
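The remainder rule described above for batches, where the first <code class="docutils literal notranslate">m % n</code> GPUs each take one extra item, can be sketched as a small host-side helper. This is only an illustration of the arithmetic: <code class="docutils literal notranslate">itemsOnGpu</code> is a hypothetical name, not a cuFFT function, and cuFFT performs the actual distribution internally. <div class="highlight-c++ notranslate"><div class="highlight"><pre>/*
 * Hypothetical helper (not part of cuFFT): number of items assigned to
 * GPU index gpu when m items are split across n GPUs, assuming the
 * first m % n GPUs each take one extra item.
 */
static int itemsOnGpu(int m, int n, int gpu)
{
    int base  = m / n;   /* floor(m / n) for non-negative m */
    int extra = m % n;   /* number of GPUs that take one more item */
    return base + (gpu < extra ? 1 : 0);
}
</pre></div> </div> For the batch of 15 transforms on 4 GPUs mentioned above, <code class="docutils literal notranslate">itemsOnGpu(15, 4, 0)</code> evaluates to 4 and <code class="docutils literal notranslate">itemsOnGpu(15, 4, 3)</code> evaluates to 3. 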
Since 2D and 3D transforms support sizes other than powers of 2, it is possible that the data cannot be evenly distributed among the GPUs. In general for the case of <code class="docutils literal notranslate">n</code> GPUs, a dimension of size <code class="docutils literal notranslate">m</code> that is not a multiple of <code class="docutils literal notranslate">n</code> would be distributed such that the first <code class="docutils literal notranslate">m % n</code> GPUs would get one extra row for 2D transforms, one extra plane for 3D transforms. Take, for example, a 2D transform on 4 GPUs, using an array declared in C as <code class="docutils literal notranslate">data[x][y]</code>, where <code class="docutils literal notranslate">x</code> is 65 and <code class="docutils literal notranslate">y</code> is 99. The surface is distributed prior to the transform such that GPU 0 receives a surface with dimensions <code class="docutils literal notranslate">[17][99]</code>, and GPUs 1…3 receive surfaces with dimensions <code class="docutils literal notranslate">[16][99]</code>. After the transform, each GPU again has a portion of the surface, but divided in the y dimension. GPUs 0…2 have surfaces with dimensions <code class="docutils literal notranslate">[65][25]</code>. GPU 3 has a surface with dimensions <code class="docutils literal notranslate">[65][24]</code>. For a 3D transform on 4 GPUs consider an array declared in C as <code class="docutils literal notranslate">data[x][y][z]</code>, where <code class="docutils literal notranslate">x</code> is 103, <code class="docutils literal notranslate">y</code> is 122, and <code class="docutils literal notranslate">z</code> is 64. The volume is distributed prior to the transform such that GPUs 0…2 each receive volumes with dimensions <code class="docutils literal notranslate">[26][122][64]</code>, and GPU 3 receives a volume with dimensions <code class="docutils literal notranslate">[25][122][64]</code>. 
After the transform, each GPU again has a portion of the volume, but divided in the y dimension. GPUs 0 and 1 have volumes with dimensions <code class="docutils literal notranslate">[103][31][64]</code>, and GPUs 2 and 3 have volumes with dimensions <code class="docutils literal notranslate">[103][30][64]</code>. </section> <section id="multiple-gpu-data-organization-for-single-1d-transforms"> <h2>4.3. Multiple-GPU Data Organization for Single 1D Transforms<a class="headerlink" href="#multiple-gpu-data-organization-for-single-1d-transforms" title="Permalink to this headline"></a></h2> By default for 1D transforms, the initial distribution of data to the GPUs is similar to the 2D and 3D cases. For a transform of dimension x on two GPUs, GPU 0 receives data ranging from 0…(x/2-1). GPU 1 receives data ranging from (x/2)…(x-1). Similarly, with 4 GPUs, the data are evenly distributed among all 4 GPUs. Before computation can begin, data are redistributed among the GPUs. It is possible to perform this redistribution in the copy from host memory, in cases where the application does not need to pre-process the data prior to the transform. To do this, the application can create the data descriptor with <code class="docutils literal notranslate">cufftXtMalloc</code> using the sub-format <code class="docutils literal notranslate">CUFFT_XT_FORMAT_1D_INPUT_SHUFFLED</code>. This can significantly reduce the time it takes to execute the transform. cuFFT performs multiple GPU 1D transforms by decomposing the transform size into factors <code class="docutils literal notranslate">Factor1</code> and <code class="docutils literal notranslate">Factor2</code>, and treating the data as a grid of size <code class="docutils literal notranslate">Factor1</code> x <code class="docutils literal notranslate">Factor2</code>. 
The four steps done to calculate the 1D FFT are: <code class="docutils literal notranslate">Factor1</code> transforms of size <code class="docutils literal notranslate">Factor2</code>, data exchange between the GPUs, a pointwise twiddle multiplication, and <code class="docutils literal notranslate">Factor2</code> transforms of size <code class="docutils literal notranslate">Factor1</code>. To gain efficiency by overlapping computation with data exchange, cuFFT breaks the whole transform into independent segments or strings, which can be processed while others are in flight. A side effect of this algorithm is that the output of the transform is not in linear order. The output in GPU memory is in strings, each of which is composed of <code class="docutils literal notranslate">Factor2</code> substrings of equal size. Each substring contains contiguous results starting <code class="docutils literal notranslate">Factor1</code> elements after the start of the previous substring. Each string starts substring size elements after the start of the previous string. The strings appear in order, the first half on GPU 0, and the second half on GPU 1. See the example below: <div class="highlight-c++ notranslate"><div class="highlight"><pre>transform size = 1024
number of strings = 8
Factor1 = 64
Factor2 = 16
substrings per string for output layout is Factor2 (16)
string size = 1024/8 = 128
substring size = 128/16 = 8
stride between substrings = 1024/16 = Factor1 (64)

On GPU 0:
string 0 has substrings with indices 0...7 64...71 128...135 ... 960...967
string 1 has substrings with indices 8...15 72...79 136...143 ... 968...975
...
On GPU 1:
string 4 has substrings with indices 32...39 96...103 160...167 ... 992...999
...
string 7 has substrings with indices 56...63 120...127 184...191 ... 
1016...1023 </pre></div> </div> The cufftXtQueryPlan API allows the caller to retrieve a structure containing the number of strings, the decomposition factors, and (in the case of power of 2 size) some useful mask and shift elements. The example below shows how cufftXtQueryPlan is invoked. It also shows how to translate from an index in the host input array to the corresponding index on the device, and vice versa. <div class="highlight-c++ notranslate"><div class="highlight"><pre>/*
 * These routines demonstrate the use of cufftXtQueryPlan to get the 1D
 * factorization and convert between permuted and linear indexes.
 */
#include &lt;stdio.h&gt;
#include &lt;cufftXt.h&gt;

/*
 * Set up a 1D plan that will execute on GPU 0 and GPU1, and query
 * the decomposition factors
 */
int main(int argc, char **argv){
    cufftHandle plan;
    cufftResult stat;
    int whichGPUs[2] = { 0, 1 };
    int size = 1024;      // transform size; example value matching the layout example above
    size_t workSizes[2];  // work area sizes filled in by cufftMakePlan1d, one entry per GPU
    cufftXt1dFactors factors;

    stat = cufftCreate( &plan );
    if (stat != CUFFT_SUCCESS) {
        printf("Create error %d\n",stat);
        return 1;
    }
    stat = cufftXtSetGPUs( plan, 2, whichGPUs );
    if (stat != CUFFT_SUCCESS) {
        printf("SetGPU error %d\n",stat);
        return 1;
    }
    stat = cufftMakePlan1d( plan, size, CUFFT_C2C, 1, workSizes );
    if (stat != CUFFT_SUCCESS) {
        printf("MakePlan error %d\n",stat);
        return 1;
    }
    stat = cufftXtQueryPlan( plan, (void *) &factors, CUFFT_QUERY_1D_FACTORS );
    if (stat != CUFFT_SUCCESS) {
        printf("QueryPlan error %d\n",stat);
        return 1;
    }
    // cast explicitly so the format specifier matches the argument type
    printf("Factor 1 %lld, Factor2 %lld\n",
           (long long)factors.factor1, (long long)factors.factor2);
    cufftDestroy(plan);
    return 0;
}
</pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre>/*
 * Given an index into a permuted array, and the GPU index return the
 * corresponding linear index from the beginning of the input buffer.
 *
 * Parameters:
 * factors     input:  pointer to cufftXt1dFactors as returned by
 *                     cufftXtQueryPlan
 * permutedIx  input:  index of the desired element in the device output
 *                     array
 * linearIx    output: index of the corresponding input element in the
 *                     host array
 * GPUix       input:  index of the GPU containing the desired element
 */
cufftResult permuted2Linear( cufftXt1dFactors * factors,
                             size_t permutedIx,
                             size_t *linearIx,
                             int GPUIx ) {
    size_t indexInSubstring;
    size_t whichString;
    size_t whichSubstring;

    // the low order bits of the permuted index match those of the linear index
    indexInSubstring = permutedIx & factors->substringMask;

    // the next higher bits are the substring index
    whichSubstring = (permutedIx >> factors->substringShift) &
                     factors->factor2Mask;

    // the next higher bits are the string index on this GPU
    whichString = (permutedIx >> factors->stringShift) & factors->stringMask;

    // now adjust the index for the second GPU
    if (GPUIx) {
        whichString += factors->stringCount/2;
    }

    // linear index low order bits are the same
    // next higher linear index bits are the string index
    *linearIx = indexInSubstring + ( whichString << factors->substringShift );

    // next higher bits of linear address are the substring index
    *linearIx += whichSubstring << factors->factor1Shift;

    return CUFFT_SUCCESS;
}
</pre></div> </div> <div class="highlight-c++ notranslate"><div class="highlight"><pre>/*
 * Given a linear index into a 1D array, return the GPU containing the permuted
 * result, and index from the start of the data buffer for that element.
 *
 * Parameters:
 *      factors     input:  pointer to cufftXt1dFactors as returned by
 *                          cufftXtQueryPlan
 *      linearIx    input:  index of the desired element in the host input
 *                          array
 *      permutedIx  output: index of the corresponding result in the device
 *                          output array
 *      GPUIx       output: index of the GPU containing the result
 */
cufftResult linear2Permuted( cufftXt1dFactors * factors,
                             size_t linearIx,
                             size_t *permutedIx,
                             int *GPUIx ) {
    size_t indexInSubstring;
    size_t whichString;
    size_t whichSubstring;
    size_t whichStringMask;
    int whichStringShift;

    if (linearIx >= factors->size) {
        return CUFFT_INVALID_VALUE;
    }

    // get a useful additional mask and shift count
    whichStringMask = factors->stringCount -1;
    whichStringShift = (factors->factor1Shift + factors->factor2Shift) - factors->stringShift ;

    // the low order bits identify the index within the substring
    indexInSubstring = linearIx & factors->substringMask;

    // first determine which string has our linear index.
    // the low order bits identify the index within the substring.
    // the next higher order bits identify which string.
    whichString = (linearIx >> factors->substringShift) & whichStringMask;

    // the first stringCount/2 strings are in the first GPU,
    // the rest are in the second.
    *GPUIx = whichString/(factors->stringCount/2);

    // next determine which substring within the string has our index
    // the substring index is in the next higher order bits of the index
    whichSubstring = (linearIx >>(factors->substringShift + whichStringShift)) & factors->factor2Mask;

    // now we can re-assemble the index
    *permutedIx = indexInSubstring;
    *permutedIx += whichSubstring << factors->substringShift;
    if ( !*GPUIx ) {
        *permutedIx += whichString << factors->stringShift;
    } else {
        *permutedIx += (whichString - (factors->stringCount/2) ) << factors->stringShift;
    }
    return CUFFT_SUCCESS;
}
</pre></div> </div> </section> </section> <section id="fftw-conversion-guide"> <h1>5. 
FFTW Conversion Guide<a class="headerlink" href="#fftw-conversion-guide" title="Permalink to this headline"></a></h1> cuFFT differs from FFTW in that FFTW has many plans and a single execute function while cuFFT has fewer plans, but multiple execute functions. The cuFFT execute functions determine the precision (single or double) and whether the input is complex or real valued. The following table shows the relationship between the two interfaces. <table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 44%" /> <col style="width: 56%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">FFTW function</th> <th class="head">cuFFT function</th> </tr> </thead> <tbody> <tr class="row-even"><td><code class="docutils literal notranslate">fftw_plan_dft_1d(), fftw_plan_dft_r2c_1d(), fftw_plan_dft_c2r_1d()</code></td> <td><code class="docutils literal notranslate">cufftPlan1d()</code></td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate">fftw_plan_dft_2d(), fftw_plan_dft_r2c_2d(), fftw_plan_dft_c2r_2d()</code></td> <td><code class="docutils literal notranslate">cufftPlan2d()</code></td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate">fftw_plan_dft_3d(), fftw_plan_dft_r2c_3d(), fftw_plan_dft_c2r_3d()</code></td> <td><code class="docutils literal notranslate">cufftPlan3d()</code></td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate">fftw_plan_dft(), fftw_plan_dft_r2c(), fftw_plan_dft_c2r()</code></td> <td><code class="docutils literal notranslate">cufftPlanMany()</code></td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate">fftw_plan_many_dft(), fftw_plan_many_dft_r2c(), fftw_plan_many_dft_c2r()</code></td> <td><code class="docutils literal notranslate">cufftPlanMany()</code></td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate">fftw_execute()</code></td> <td><code class="docutils literal notranslate">cufftExecC2C(), 
cufftExecZ2Z(), cufftExecR2C(), cufftExecD2Z(), cufftExecC2R(), cufftExecZ2D()</code></td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate">fftw_destroy_plan()</code></td> <td><code class="docutils literal notranslate">cufftDestroy()</code></td> </tr> </tbody> </table> </section> <section id="fftw-interface-to-cufft"> <h1>6. FFTW Interface to cuFFT<a class="headerlink" href="#fftw-interface-to-cufft" title="Permalink to this headline"></a></h1> NVIDIA provides FFTW3 interfaces to the cuFFT library. This allows applications using FFTW to use NVIDIA GPUs with minimal modifications to program source code. To use the interface, first complete the following steps: <ul class="simple"> <li>It is recommended that you replace the include file <code class="docutils literal notranslate">fftw3.h</code> with <code class="docutils literal notranslate">cufftw.h</code></li> <li>Instead of linking with the double/single precision libraries such as <code class="docutils literal notranslate">fftw3/fftw3f</code>, link with both the cuFFT and cuFFTW libraries</li> <li>Ensure the search path includes the directory containing <code class="docutils literal notranslate">cuda_runtime_api.h</code></li> </ul> After an application is working using the FFTW3 interface, users may want to modify their code to move data to and from the GPU and use the routines documented in the <a class="reference external" href="index.html#fftw-conversion-guide">FFTW Conversion Guide</a> for the best performance. The following tables show which components and functions of FFTW3 are supported in cuFFT. 
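As a brief aside before the support tables, the steps above can be sketched in code. The example below is a minimal sketch, not NVIDIA's reference code: <code class="docutils literal notranslate">USE_CUFFTW</code> and <code class="docutils literal notranslate">USE_FFTW3</code> are hypothetical build macros and <code class="docutils literal notranslate">forward_dft</code> is a hypothetical helper, all introduced only for illustration. With either macro defined, the same <code class="docutils literal notranslate">fftw_plan_dft_1d()</code>, <code class="docutils literal notranslate">fftw_execute()</code>, and <code class="docutils literal notranslate">fftw_destroy_plan()</code> calls are compiled against whichever header was selected. With neither defined, a naive reference DFT is used so that the sketch still runs without any FFT library.

```cpp
// Sketch: one source file that builds against cuFFTW, FFTW3, or neither.
// USE_CUFFTW and USE_FFTW3 are hypothetical macros chosen for this example.
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

#if defined(USE_CUFFTW)
#include <cufftw.h>   // FFTW3-style interface provided with cuFFT
#define HAVE_FFTW_API 1
#elif defined(USE_FFTW3)
#include <fftw3.h>    // original FFTW3 header
#define HAVE_FFTW_API 1
#endif

using cplx = std::complex<double>;

// Hypothetical helper: forward 1D complex DFT, out-of-place,
// using the interleaved (real, imaginary) layout.
std::vector<cplx> forward_dft(std::vector<cplx> in) {
    assert(!in.empty());
    const std::size_t n = in.size();
    std::vector<cplx> out(n);
#ifdef HAVE_FFTW_API
    // std::complex<double> has the same interleaved layout as fftw_complex,
    // so the buffers can be passed through reinterpret_cast.
    fftw_plan plan = fftw_plan_dft_1d(
        static_cast<int>(n),
        reinterpret_cast<fftw_complex *>(in.data()),
        reinterpret_cast<fftw_complex *>(out.data()),
        FFTW_FORWARD, FFTW_ESTIMATE);
    assert(plan != nullptr);
    fftw_execute(plan);
    fftw_destroy_plan(plan);
#else
    // Fallback so the sketch runs without an FFT library: a naive O(n^2)
    // DFT with the same sign convention as FFTW_FORWARD.
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = -2.0 * pi * double(k) * double(j) / double(n);
            sum += in[j] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
#endif
    return out;
}
```

Assuming a typical installation, building with <code class="docutils literal notranslate">-DUSE_CUFFTW</code> and linking against the cuFFTW and cuFFT libraries (for example <code class="docutils literal notranslate">-lcufftw -lcufft</code>) would route the calls to the GPU, while <code class="docutils literal notranslate">-DUSE_FFTW3 -lfftw3</code> would keep the original FFTW path. Exact include and library paths depend on the local setup. On any of these paths, a unit impulse of length 4 should transform to four values equal to 1.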
<table class="table-no-stripes docutils align-default"> <colgroup> <col style="width: 15%" /> <col style="width: 17%" /> <col style="width: 68%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">Section in FFTW manual</th> <th class="head">Supported</th> <th class="head">Unsupported</th> </tr> </thead> <tbody>
<tr class="row-even"><td>Complex numbers</td> <td><code class="docutils literal notranslate">fftw_complex, fftwf_complex</code> types</td> <td></td> </tr>
<tr class="row-odd"><td>Precision</td> <td>double <code class="docutils literal notranslate">fftw3</code>, single <code class="docutils literal notranslate">fftw3f</code></td> <td>long double <code class="docutils literal notranslate">fftw3l</code>, quad precision <code class="docutils literal notranslate">fftw3q</code> are not supported since CUDA functions operate on double and single precision floating-point quantities</td> </tr>
<tr class="row-even"><td>Memory Allocation</td> <td></td> <td><code class="docutils literal notranslate">fftw_malloc(), fftw_free(), fftw_alloc_real(), fftw_alloc_complex(), fftwf_alloc_real(), fftwf_alloc_complex()</code></td> </tr>
<tr class="row-odd"><td>Multi-threaded FFTW</td> <td></td> <td><code class="docutils literal notranslate">fftw3_threads, fftw3_omp</code> are not supported</td> </tr>
<tr class="row-even"><td>Distributed-memory FFTW with MPI</td> <td></td> <td><code class="docutils literal notranslate">fftw3_mpi, fftw3f_mpi</code> are not supported</td> </tr>
</tbody> </table> Note that for each of the double precision functions below there is a corresponding single precision version with the letters <code class="docutils literal notranslate">fftw</code> replaced by <code class="docutils literal notranslate">fftwf</code>. 
<table class="table-no-stripes longtable docutils align-default"> <colgroup> <col style="width: 9%" /> <col style="width: 49%" /> <col style="width: 42%" /> </colgroup> <thead> <tr class="row-odd"><th class="head">Section in FFTW manual</th> <th class="head">Supported</th> <th class="head">Unsupported</th> </tr> </thead> <tbody>
<tr class="row-even"><td>Using Plans</td> <td><code class="docutils literal notranslate">fftw_execute(), fftw_destroy_plan(), fftw_cleanup()</code></td> <td><code class="docutils literal notranslate">fftw_print_plan(), fftw_cost(), fftw_flops()</code> exist but are not functional</td> </tr>
<tr class="row-odd"><td>Basic Interface</td> <td></td> <td></td> </tr>
<tr class="row-even"><td>Complex DFTs</td> <td><code class="docutils literal notranslate">fftw_plan_dft_1d(), fftw_plan_dft_2d(), fftw_plan_dft_3d(), fftw_plan_dft()</code></td> <td></td> </tr>
<tr class="row-odd"><td>Planner Flags</td> <td></td> <td>Planner flags are ignored and the same plan is returned regardless</td> </tr>
<tr class="row-even"><td>Real-data DFTs</td> <td><code class="docutils literal notranslate">fftw_plan_dft_r2c_1d(), fftw_plan_dft_r2c_2d(), fftw_plan_dft_r2c_3d(), fftw_plan_dft_r2c(), fftw_plan_dft_c2r_1d(), fftw_plan_dft_c2r_2d(), fftw_plan_dft_c2r_3d(), fftw_plan_dft_c2r()</code></td> <td></td> </tr>
<tr class="row-odd"><td>Real-data DFT Array Format</td> <td></td> <td>Not supported</td> </tr>
<tr class="row-even"><td>Real-to-Real Transforms</td> <td></td> <td>Not supported</td> </tr>
<tr class="row-odd"><td>Real-to-Real Transform Kinds</td> <td></td> <td>Not supported</td> </tr>
<tr class="row-even"><td>Advanced Interface</td> <td></td> <td></td> </tr>
<tr class="row-odd"><td>Advanced Complex DFTs</td> <td><code class="docutils literal notranslate">fftw_plan_many_dft()</code> with multiple 1D, 2D, 3D transforms</td> <td><code class="docutils literal notranslate">fftw_plan_many_dft()</code> with 4D or higher transforms or a 2D or higher batch of embedded 
transforms</td> </tr> <tr class="row-even"><td>Advanced Real-data DFTs</td> <td><code class="docutils literal notranslate">fftw_plan_many_dft_r2c(), fftw_plan_many_dft_c2r()</code> with multiple 1D, 2D, 3D transforms</td> <td><code class="docutils literal notranslate">fftw_plan_many_dft_r2c(), fftw_plan_many_dft_c2r()</code> with 4D or higher transforms or a 2D or higher batch of embedded transforms</td> </tr> <tr class="row-odd"><td>Advanced Real-to-Real Transforms</td> <td></td> <td>Not supported</td> </tr> <tr class="row-even"><td>Guru Interface</td> <td></td> <td></td> </tr> <tr class="row-odd"><td>Interleaved and split arrays</td> <td>Interleaved format</td> <td>Split format</td> </tr> <tr class="row-even"><td>Guru vector and transform sizes</td> <td><code class="docutils literal notranslate">fftw_iodim</code> struct</td> <td></td> </tr> <tr class="row-odd"><td>Guru Complex DFTs</td> <td><code class="docutils literal notranslate">fftw_plan_guru_dft(), fftw_plan_guru_dft_r2c(), fftw_plan_guru_dft_c2r()</code> with multiple 1D, 2D, 3D transforms</td> <td><code class="docutils literal notranslate">fftw_plan_guru_dft(), fftw_plan_guru_dft_r2c(), fftw_plan_guru_dft_c2r()</code> with 4D or higher transforms or a 2D or higher batch of transforms</td> </tr> <tr class="row-even"><td>Guru Real-data DFTs</td> <td></td> <td>Not supported</td> </tr> <tr class="row-odd"><td>Guru Real-to-real Transforms</td> <td></td> <td>Not supported</td> </tr> <tr class="row-even"><td>64-bit Guru Interface</td> <td><code class="docutils literal notranslate">fftw_plan_guru64_dft(), fftw_plan_guru64_dft_r2c(), fftw_plan_guru64_dft_c2r()</code> with multiple 1D, 2D, 3D transforms</td> <td><code class="docutils literal notranslate">fftw_plan_guru64_dft(), fftw_plan_guru64_dft_r2c(), fftw_plan_guru64_dft_c2r()</code> with 4D or higher transforms or a 2D or higher batch of transforms</td> </tr> <tr class="row-odd"><td>New-array Execute Functions</td> <td><code class="docutils literal 
notranslate">fftw_execute_dft(), fftw_execute_dft_r2c(), fftw_execute_dft_c2r()</code> with interleaved format</td> <td>Split format and real-to-real functions</td> </tr> <tr class="row-even"><td>Wisdom</td> <td></td> <td><code class="docutils literal notranslate">fftw_export_wisdom_to_file(), fftw_import_wisdom_from_file()</code> exist but are not functional. Other wisdom functions do not have entry points in the library.</td> </tr> </tbody> </table> </section> <section id="deprecated-functionality"> <h1>7. Deprecated Functionality<a class="headerlink" href="#deprecated-functionality" title="Permalink to this headline"></a></h1> Starting from CUDA 12.0: <ul class="simple"> <li>GPU architectures SM35 and SM37 are no longer supported. The minimum required architecture is SM50.</li> </ul> Starting from CUDA 11.8: <ul class="simple"> <li>CUDA Graphs capture is no longer supported for legacy callback routines that load data in out-of-place mode transforms. Starting from CUDA 12.6 Update 2, LTO callbacks can be used as a replacement for legacy callbacks without this limitation.</li> </ul> Starting from CUDA 11.4: <ul class="simple"> <li>Support for callback functionality using separately compiled device code (legacy callbacks) is deprecated on all GPU architectures. Callback functionality will continue to be supported for all GPU architectures.</li> </ul> Starting from CUDA 11.0: <ul class="simple"> <li>GPU architecture SM30 is no longer supported. The minimum required architecture is SM35.</li> <li>Support for GPU architectures SM35, SM37 (Kepler), and SM50, SM52 (Maxwell) is deprecated.</li> </ul> Function <code class="docutils literal notranslate">cufftSetCompatibilityMode</code> was removed in version 9.1. </section> <section id="notices"> <h1>8. Notices<a class="headerlink" href="#notices" title="Permalink to this headline"></a></h1> <section id="notice"> <h2>8.1. 
Notice<a class="headerlink" href="#notice" title="Permalink to this headline"></a></h2> This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality. NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice. Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete. NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document. NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. 
NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk. NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs. No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA. 
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices. THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product. </section> <section id="opencl"> <h2>8.2. OpenCL<a class="headerlink" href="#opencl" title="Permalink to this headline"></a></h2> OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. </section> <section id="trademarks"> <h2>8.3. Trademarks<a class="headerlink" href="#trademarks" title="Permalink to this headline"></a></h2> NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. 
</section> </section> </div> </div> <footer> <hr/> <div role="contentinfo"> Copyright © 2007-2024, NVIDIA Corporation & affiliates. All rights reserved. Last updated on Oct 10, 2024. </div> </footer> </div> </div> </section> </div> <script> jQuery(function () { SphinxRtdTheme.Navigation.enable(false); }); </script> <script type="text/javascript">if (typeof _satellite !== "undefined"){_satellite.pageBottom();}</script> </body> </html>