<!DOCTYPE html> <html lang="en" data-content_root="" > <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" /> <title>Core Features — NVTabular</title> <script data-cfasync="false"> document.documentElement.dataset.mode = localStorage.getItem("mode") || ""; document.documentElement.dataset.theme = localStorage.getItem("theme") || "light"; </script> <!-- Loaded before other Sphinx assets --> <link href="_static/styles/theme.css?digest=8d27b9dea8ad943066ae" rel="stylesheet" /> <link href="_static/styles/bootstrap.css?digest=8d27b9dea8ad943066ae" rel="stylesheet" /> <link href="_static/styles/pydata-sphinx-theme.css?digest=8d27b9dea8ad943066ae" rel="stylesheet" /> <link href="_static/vendor/fontawesome/6.5.1/css/all.min.css?digest=8d27b9dea8ad943066ae" rel="stylesheet" /> <link rel="preload" as="font" type="font/woff2" crossorigin href="_static/vendor/fontawesome/6.5.1/webfonts/fa-solid-900.woff2" /> <link rel="preload" as="font" type="font/woff2" crossorigin href="_static/vendor/fontawesome/6.5.1/webfonts/fa-brands-400.woff2" /> <link rel="preload" as="font" type="font/woff2" crossorigin href="_static/vendor/fontawesome/6.5.1/webfonts/fa-regular-400.woff2" /> <link rel="stylesheet" type="text/css" href="_static/pygments.css" /> <link rel="stylesheet" href="_static/styles/sphinx-book-theme.css?digest=14f4ca6b54d191a8c7657f6c759bf11a5fb86285" type="text/css" /> <link rel="stylesheet" type="text/css" href="_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css" /> <link rel="stylesheet" type="text/css" href="_static/css/custom.css" /> <link rel="stylesheet" type="text/css" href="_static/css/versions.css" /> <link rel="stylesheet" type="text/css" href="_static/design-style.1e8bd061cd6da7fc9cf755528e8ffc24.min.css" /> <!-- Pre-loaded scripts that we'll load fully later --> <link rel="preload" as="script" 
href="_static/scripts/bootstrap.js?digest=8d27b9dea8ad943066ae" /> <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=8d27b9dea8ad943066ae" /> <script src="_static/vendor/fontawesome/6.5.1/js/all.min.js?digest=8d27b9dea8ad943066ae"></script> <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script> <script src="_static/jquery.js"></script> <script src="_static/underscore.js"></script> <script src="_static/_sphinx_javascript_frameworks_compat.js"></script> <script src="_static/doctools.js"></script> <script src="_static/sphinx_highlight.js"></script> <script src="_static/scripts/sphinx-book-theme.js?digest=5a5c038af52cf7bc1a1ec88eea08e6366ee68824"></script> <script src="_static/js/rtd-version-switcher.js"></script> <script src="_static/design-tabs.js"></script> <script>DOCUMENTATION_OPTIONS.pagename = 'core_features';</script> <link rel="canonical" href="https://nvidia-merlin.github.io/NVTabular/stable/core_features.html" /> <link rel="shortcut icon" href="_static/favicon.png"/> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <link rel="next" title="Accelerated Training" href="training/index.html" /> <link rel="prev" title="NVTabular" href="Introduction.html" /> <!-- Google Analytics --> <script async src="https://www.googletagmanager.com/gtag/js?id=G-NVJ1Y1YJHK"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'G-NVJ1Y1YJHK', { 'anonymize_ip': false, }); </script> <!-- Fonts --> <link rel="preconnect" href="https://fonts.googleapis.com"> <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> <link href="https://fonts.googleapis.com/css2?family=Roboto+Mono:ital,wght@0,400;0,700;1,300&display=swap" rel="stylesheet"> </head> <body data-bs-spy="scroll" data-bs-target=".bd-toc-nav" data-offset="180" 
data-bs-root-margin="0px 0px -60%" data-default-mode=""> <a id="pst-skip-link" class="skip-link" href="#main-content">Skip to main content</a> <div id="pst-scroll-pixel-helper"></div> <button type="button" class="btn rounded-pill" id="pst-back-to-top"> <i class="fa-solid fa-arrow-up"></i> Back to top </button> <input type="checkbox" class="sidebar-toggle" name="__primary" id="__primary"/> <label class="overlay overlay-primary" for="__primary"></label> <input type="checkbox" class="sidebar-toggle" name="__secondary" id="__secondary"/> <label class="overlay overlay-secondary" for="__secondary"></label> <div class="search-button__wrapper"> <div class="search-button__overlay"></div> <div class="search-button__search-container"> <form class="bd-search d-flex align-items-center" action="search.html" method="get"> <i class="fa-solid fa-magnifying-glass"></i> <input type="search" class="form-control" name="q" id="search-input" placeholder="Search..." aria-label="Search..." autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/> <span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd>K</kbd></span> </form></div> </div> <header class="bd-header navbar navbar-expand-lg bd-navbar"> </header> <div class="bd-container"> <div class="bd-container__inner bd-page-width"> <div class="bd-sidebar-primary bd-sidebar"> <div class="sidebar-header-items sidebar-primary__section"> </div> <div class="sidebar-primary-items__start sidebar-primary__section"> <div class="sidebar-primary-item"> <a class="navbar-brand logo" href="index.html"> <p class="title logo__title">NVIDIA Merlin NVTabular</p> </a></div> <div class="sidebar-primary-item"> <form class="bd-search d-flex align-items-center" action="search.html" method="get"> <i class="fa-solid fa-magnifying-glass"></i> <input type="search" class="form-control" name="q" id="search-input" placeholder="Search..." aria-label="Search..." 
autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/> <span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd>K</kbd></span> </form></div> <div class="sidebar-primary-item"><nav class="bd-links" id="bd-docs-nav" aria-label="Main"> <div class="bd-toc-item navbar-nav active"> <p aria-level="2" class="caption" role="heading"><span class="caption-text">Contents</span></p> <ul class="current nav bd-sidenav"> <li class="toctree-l1"><a class="reference internal" href="Introduction.html">Introduction</a></li> <li class="toctree-l1 current active"><a class="current reference internal" href="#">Core Features</a></li> <li class="toctree-l1 has-children"><a class="reference internal" href="training/index.html">Accelerated Training</a><input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-1"><i class="fa-solid fa-chevron-down"></i></label><ul> <li class="toctree-l2"><a class="reference internal" href="training/tensorflow.html">TensorFlow</a></li> <li class="toctree-l2"><a class="reference internal" href="training/pytorch.html">PyTorch</a></li> <li class="toctree-l2"><a class="reference internal" href="training/hugectr.html">HugeCTR</a></li> </ul> </li> <li class="toctree-l1 has-children"><a class="reference internal" href="examples/index.html">Example Notebooks</a><input class="toctree-checkbox" id="toctree-checkbox-2" name="toctree-checkbox-2" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-2"><i class="fa-solid fa-chevron-down"></i></label><ul> <li class="toctree-l2"><a class="reference internal" href="examples/01-Getting-started.html">Getting Started with NVTabular</a></li> <li class="toctree-l2"><a class="reference internal" href="examples/02-Advanced-NVTabular-workflow.html">Advanced NVTabular Workflow</a></li> <li class="toctree-l2"><a class="reference internal" 
href="examples/03-Running-on-multiple-GPUs-or-on-CPU.html">Run on multi-GPU or CPU-only</a></li> </ul> </li> <li class="toctree-l1 has-children"><a class="reference internal" href="api.html">API Documentation</a><input class="toctree-checkbox" id="toctree-checkbox-3" name="toctree-checkbox-3" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-3"><i class="fa-solid fa-chevron-down"></i></label><ul> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.workflow.workflow.Workflow.html">nvtabular.workflow.workflow.Workflow</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.workflow.workflow.WorkflowNode.html">nvtabular.workflow.workflow.WorkflowNode</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Bucketize.html">nvtabular.ops.Bucketize</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Categorify.html">nvtabular.ops.Categorify</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.DropLowCardinality.html">nvtabular.ops.DropLowCardinality</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.HashBucket.html">nvtabular.ops.HashBucket</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.HashedCross.html">nvtabular.ops.HashedCross</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.TargetEncoding.html">nvtabular.ops.TargetEncoding</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Clip.html">nvtabular.ops.Clip</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.LogOp.html">nvtabular.ops.LogOp</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Normalize.html">nvtabular.ops.Normalize</a></li> <li class="toctree-l2"><a class="reference 
internal" href="generated/nvtabular.ops.NormalizeMinMax.html">nvtabular.ops.NormalizeMinMax</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Dropna.html">nvtabular.ops.Dropna</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.FillMissing.html">nvtabular.ops.FillMissing</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.FillMedian.html">nvtabular.ops.FillMedian</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.DifferenceLag.html">nvtabular.ops.DifferenceLag</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Filter.html">nvtabular.ops.Filter</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Groupby.html">nvtabular.ops.Groupby</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.JoinExternal.html">nvtabular.ops.JoinExternal</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.JoinGroupby.html">nvtabular.ops.JoinGroupby</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.AddMetadata.html">nvtabular.ops.AddMetadata</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.AddProperties.html">nvtabular.ops.AddProperties</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.AddTags.html">nvtabular.ops.AddTags</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Rename.html">nvtabular.ops.Rename</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.ReduceDtypeSize.html">nvtabular.ops.ReduceDtypeSize</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.TagAsItemFeatures.html">nvtabular.ops.TagAsItemFeatures</a></li> <li 
class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.TagAsItemID.html">nvtabular.ops.TagAsItemID</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.TagAsUserFeatures.html">nvtabular.ops.TagAsUserFeatures</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.TagAsUserID.html">nvtabular.ops.TagAsUserID</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.ListSlice.html">nvtabular.ops.ListSlice</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.ValueCount.html">nvtabular.ops.ValueCount</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.ColumnSimilarity.html">nvtabular.ops.ColumnSimilarity</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.LambdaOp.html">nvtabular.ops.LambdaOp</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.Operator.html">nvtabular.ops.Operator</a></li> <li class="toctree-l2"><a class="reference internal" href="generated/nvtabular.ops.StatOperator.html">nvtabular.ops.StatOperator</a></li> </ul> </li> <li class="toctree-l1 has-children"><a class="reference internal" href="resources/index.html">Additional Resources</a><input class="toctree-checkbox" id="toctree-checkbox-4" name="toctree-checkbox-4" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-4"><i class="fa-solid fa-chevron-down"></i></label><ul> <li class="toctree-l2"><a class="reference internal" href="resources/architecture.html">Architecture</a></li> <li class="toctree-l2"><a class="reference internal" href="resources/cloud_integration.html">Cloud Integration</a></li> <li class="toctree-l2"><a class="reference internal" href="resources/troubleshooting.html">Troubleshooting</a></li> <li class="toctree-l2"><a class="reference external" 
href="https://developer.nvidia.com/nvidia-merlin/nvtabular">developer.nvidia.com page</a></li> <li class="toctree-l2"><a class="reference internal" href="resources/links.html">Presentations and Blog Posts</a></li> <li class="toctree-l2"><a class="reference external" href="https://github.com/NVIDIA/NVTabular">Github Repo</a></li> </ul> </li> </ul> </div> </nav></div> <div class="sidebar-primary-item"><nav class="bd-links" id="bd-merlin-ecosystem-nav" aria-label="Merlin Ecosystem Nav"> <div class="bd-toc-item navbar-nav"> <p aria-level="2" class="caption" role="heading"><span class="caption-text">Ecosystem</span></p> <ul class="nav bd-sidenav"> <li class="toctree-l1"><a class="reference external" href="/models/stable">Models</a></li> <li class="toctree-l1"><a class="reference external" href="/systems/stable">Systems</a></li> <li class="toctree-l1"><a class="reference external" href="/core/stable">Core</a></li> <li class="toctree-l1"><a class="reference external" href="/Transformers4Rec/stable">Transformers4Rec</a></li> <li class="toctree-l1"><a class="reference external" href="/dataloader/stable">Dataloader</a></li> <li class="toctree-l1"><a class="reference external" href="/Merlin/stable">Merlin</a></li> </ul> </div> </nav></div> <div class="sidebar-primary-item"> <div class="rst-versions" data-toggle="rst-versions" role="note" aria-label="versions"> <span class="rst-current-version" data-toggle="rst-current-version"> <span class="fa fa-book"></span> v: stable <span class="fa fa-caret-down"></span> </span> <div class="rst-other-versions"> <dl> <dt>Tags</dt> <dd><a href="../v1.8.1/core_features.html">v1.8.1</a></dd> <dd><a href="../v23.02.00/core_features.html">v23.02.00</a></dd> <dd><a href="../v23.04.00/core_features.html">v23.04.00</a></dd> <dd><a href="../v23.05.00/core_features.html">v23.05.00</a></dd> <dd><a href="../v23.06.00/core_features.html">v23.06.00</a></dd> <dd><a href="../v23.08.00/core_features.html">v23.08.00</a></dd> </dl> <dl> <dt>Branches</dt> 
<dd><a href="../main/core_features.html">main</a></dd> <dd><a href="core_features.html">stable</a></dd> </dl> </div> </div></div> </div> <div class="sidebar-primary-items__end sidebar-primary__section"> </div> <div id="rtd-footer-container"></div> </div> <main id="main-content" class="bd-main"> <div class="sbt-scroll-pixel-helper"></div> <div class="bd-content"> <div class="bd-article-container"> <div class="bd-header-article"> <div class="header-article-items header-article__inner"> <div class="header-article-items__start"> <div class="header-article-item"><label class="sidebar-toggle primary-toggle btn btn-sm" for="__primary" title="Toggle primary sidebar" data-bs-placement="bottom" data-bs-toggle="tooltip"> <span class="fa-solid fa-bars"></span> </label></div> </div> <div class="header-article-items__end"> <div class="header-article-item"> <div class="article-header-buttons"> <a href="https://github.com/NVIDIA-Merlin/NVTabular" target="_blank" class="btn btn-sm btn-source-repository-button" title="Source repository" data-bs-placement="bottom" data-bs-toggle="tooltip" > <span class="btn__icon-container"> <i class="fab fa-github"></i> </span> </a> <div class="dropdown dropdown-download-buttons"> <button class="btn dropdown-toggle" type="button" data-bs-toggle="dropdown" aria-expanded="false" aria-label="Download this page"> <i class="fas fa-download"></i> </button> <ul class="dropdown-menu"> <li><a href="_sources/core_features.md" target="_blank" class="btn btn-sm btn-download-source-button dropdown-item" title="Download source file" data-bs-placement="left" data-bs-toggle="tooltip" > <span class="btn__icon-container"> <i class="fas fa-file"></i> </span> <span class="btn__text-container">.md</span> </a> </li> <li> <button onclick="window.print()" class="btn btn-sm btn-download-pdf-button dropdown-item" title="Print to PDF" data-bs-placement="left" data-bs-toggle="tooltip" > <span class="btn__icon-container"> <i class="fas fa-file-pdf"></i> </span> <span 
class="btn__text-container">.pdf</span> </button> </li> </ul> </div> <button onclick="toggleFullScreen()" class="btn btn-sm btn-fullscreen-button" title="Fullscreen mode" data-bs-placement="bottom" data-bs-toggle="tooltip" > <span class="btn__icon-container"> <i class="fas fa-expand"></i> </span> </button> <script> document.write(` <button class="btn btn-sm navbar-btn theme-switch-button" title="light/dark" aria-label="light/dark" data-bs-placement="bottom" data-bs-toggle="tooltip"> <span class="theme-switch nav-link" data-mode="light"><i class="fa-solid fa-sun fa-lg"></i></span> <span class="theme-switch nav-link" data-mode="dark"><i class="fa-solid fa-moon fa-lg"></i></span> <span class="theme-switch nav-link" data-mode="auto"><i class="fa-solid fa-circle-half-stroke fa-lg"></i></span> </button> `); </script> <script> document.write(` <button class="btn btn-sm navbar-btn search-button search-button__button" title="Search" aria-label="Search" data-bs-placement="bottom" data-bs-toggle="tooltip"> <i class="fa-solid fa-magnifying-glass fa-lg"></i> </button> `); </script> <label class="sidebar-toggle secondary-toggle btn btn-sm" for="__secondary"title="Toggle secondary sidebar" data-bs-placement="bottom" data-bs-toggle="tooltip"> <span class="fa-solid fa-list"></span> </label> </div></div> </div> </div> </div> <div id="jb-print-docs-body" class="onlyprint"> <h1>Core Features</h1> <!-- Table of contents --> <div id="print-main-content"> <div id="jb-print-toc"> <div> <h2> Contents </h2> </div> <nav aria-label="Page"> <ul class="visible nav section-nav flex-column"> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#tensorflow-and-pytorch-interoperability">TensorFlow and PyTorch Interoperability</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#hugectr-interoperability">HugeCTR Interoperability</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" 
href="#multi-gpu-support">Multi-GPU Support</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#multi-node-support">Multi-Node Support</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#multi-hot-encoding-and-pre-existing-embeddings">Multi-Hot Encoding and Pre-Existing Embeddings</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#shuffling-datasets">Shuffling Datasets</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#cloud-integration">Cloud Integration</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#cpu-support">CPU Support</a></li> </ul> </nav> </div> </div> </div> <div id="searchbox"></div> <article class="bd-article"> <section id="core-features"> <h1>Core Features<a class="headerlink" href="#core-features" title="Permalink to this heading">#</a></h1> <p>NVTabular supports the following core features:</p> <ul class="simple"> <li><p><a class="reference internal" href="#tensorflow-and-pytorch-interoperability"><span class="std std-doc">TensorFlow and PyTorch Interoperability</span></a></p></li> <li><p><a class="reference internal" href="#hugectr-interoperability"><span class="std std-doc">HugeCTR Interoperability</span></a></p></li> <li><p><a class="reference internal" href="#multi-gpu-support"><span class="std std-doc">Multi-GPU Support</span></a></p></li> <li><p><a class="reference internal" href="#multi-node-support"><span class="std std-doc">Multi-Node Support</span></a></p></li> <li><p><a class="reference internal" href="#multi-hot-encoding-and-pre-existing-embeddings"><span class="std std-doc">Multi-Hot Encoding and Pre-Existing Embeddings</span></a></p></li> <li><p><a class="reference internal" href="#shuffling-datasets"><span class="std std-doc">Shuffling Datasets</span></a></p></li> <li><p><a class="reference internal" 
href="#cloud-integration"><span class="std std-doc">Cloud Integration</span></a></p></li> <li><p><a class="reference internal" href="#cpu-support"><span class="std std-doc">CPU Support</span></a></p></li> </ul> <section id="tensorflow-and-pytorch-interoperability"> <h2>TensorFlow and PyTorch Interoperability<a class="headerlink" href="#tensorflow-and-pytorch-interoperability" title="Permalink to this heading">#</a></h2> <p>In addition to providing mechanisms for transforming data to prepare it for deep learning models, we also provide framework-specific dataloaders that help optimize getting that data to the GPU. Under a traditional dataloading scheme, data is read item by item and collated into a batch. With PyTorch, multiple processes can create many batches at the same time. However, this still leads to many individual rows of tabular data being accessed independently, which hurts I/O performance, especially when the data is on disk rather than in CPU memory. TensorFlow loads and shuffles TFRecords by adopting a windowed buffering scheme that loads data sequentially into a buffer, from which it randomly samples batches and which it replenishes with the next sequential elements from disk. Larger buffer sizes ensure more randomness, but can quickly bottleneck performance as TensorFlow tries to keep the buffer saturated. With smaller buffer sizes, datasets that aren’t uniformly distributed on disk lead to biased sampling and potentially degraded convergence.</p> </section> <section id="hugectr-interoperability"> <h2>HugeCTR Interoperability<a class="headerlink" href="#hugectr-interoperability" title="Permalink to this heading">#</a></h2> <p>NVTabular is also capable of preprocessing datasets that can be passed to HugeCTR for training. 
For details about how this works, see the <a class="reference external" href="https://github.com/NVIDIA-Merlin/NVTabular/blob/stable/examples/scaling-criteo/03-Training-with-HugeCTR.ipynb">HugeCTR Example Notebook</a>.</p> </section> <section id="multi-gpu-support"> <h2>Multi-GPU Support<a class="headerlink" href="#multi-gpu-support" title="Permalink to this heading">#</a></h2> <p>NVTabular supports multi-GPU scaling with <a class="reference external" href="https://github.com/rapidsai/dask-cuda">Dask-CUDA</a> and <a class="reference external" href="https://distributed.dask.org/en/latest/">dask.distributed</a>. To enable distributed parallelism, the NVTabular <code class="docutils literal notranslate"><span class="pre">Workflow</span></code> must be initialized with a <code class="docutils literal notranslate"><span class="pre">dask.distributed.Client</span></code> object as follows:</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">nvtabular</span> <span class="k">as</span> <span class="nn">nvt</span>
<span class="kn">from</span> <span class="nn">dask.distributed</span> <span class="kn">import</span> <span class="n">Client</span>

<span class="c1"># Deploy a new cluster</span>
<span class="c1"># (or specify the address of an existing scheduler)</span>
<span class="n">cluster</span> <span class="o">=</span> <span class="s2">"tcp://MachineA:8786"</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="n">cluster</span><span class="p">)</span>
<span class="n">workflow</span> <span class="o">=</span> <span class="n">nvt</span><span class="o">.</span><span class="n">Workflow</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">client</span><span class="o">=</span><span class="n">client</span><span class="p">)</span>
<span class="o">...</span>
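<span class="c1"># Sketch, not part of the original example: on a single machine with several</span>
<span class="c1"># GPUs, a local cluster could be deployed instead of connecting to an existing</span>
<span class="c1"># scheduler (this assumes the dask_cuda package is installed):</span>
<span class="c1">#   from dask_cuda import LocalCUDACluster</span>
<span class="c1">#   client = Client(LocalCUDACluster())</span>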
</pre></div> </div> <p>There are many ways to deploy a “cluster” for Dask. This <a class="reference external" href="https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters">article</a> summarizes the practical options. For a single machine with multiple GPUs, the <code class="docutils literal notranslate"><span class="pre">dask_cuda.LocalCUDACluster</span></code> API is typically the most convenient option.</p> <p>Since NVTabular already uses <a class="reference external" href="https://docs.rapids.ai/api/cudf/stable/">Dask-CuDF</a> for internal data processing, there are no other requirements for multi-GPU scaling. That said, parallel performance can depend strongly on (1) the size of <code class="docutils literal notranslate"><span class="pre">Dataset</span></code> partitions, (2) the shuffling procedure used for data output, and (3) the specific arguments used for both global-statistics and transformation operations. For a simple step-by-step example, see <a class="reference external" href="https://github.com/NVIDIA/NVTabular/blob/stable/examples/multi-gpu-toy-example/multi-gpu_dask.ipynb">Multi-GPU</a>.</p> </section> <section id="multi-node-support"> <h2>Multi-Node Support<a class="headerlink" href="#multi-node-support" title="Permalink to this heading">#</a></h2> <p>NVTabular supports multi-node scaling with <a class="reference external" href="https://github.com/rapidsai/dask-cuda">Dask-CUDA</a> and <a class="reference external" href="https://distributed.dask.org/en/latest/">dask.distributed</a>. 
To enable distributed parallelism, start a cluster and connect to it to run the application as follows:</p> <ol class="arabic simple"> <li><p>Start the scheduler with <code class="docutils literal notranslate"><span class="pre">dask-scheduler</span></code>.</p></li> <li><p>Start the workers with <code class="docutils literal notranslate"><span class="pre">dask-cuda-worker</span> <span class="pre">schedulerIP:schedulerPort</span></code>.</p></li> <li><p>Run the NVTabular application, with the NVTabular <code class="docutils literal notranslate"><span class="pre">Workflow</span></code> initialized as described in the Multi-GPU Support section.</p></li> </ol> <p>For a detailed description of each available method for starting a cluster, see this <a class="reference external" href="https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters">article</a>.</p> </section> <section id="multi-hot-encoding-and-pre-existing-embeddings"> <h2>Multi-Hot Encoding and Pre-Existing Embeddings<a class="headerlink" href="#multi-hot-encoding-and-pre-existing-embeddings" title="Permalink to this heading">#</a></h2> <p>NVTabular supports:</p> <ul class="simple"> <li><p>Processing datasets with multi-hot categorical columns.</p></li> <li><p>Passing continuous vector features like pre-trained embeddings. This includes basic preprocessing and feature engineering, as well as full support in the dataloaders for training models with both TensorFlow and PyTorch.</p></li> </ul> <p>Multi-hot encoding lets you represent a set of categories as a single feature. For example, in a movie recommendation system, each movie might have a list of genres associated with it like comedy, drama, horror, or science fiction. Since movies can belong to more than one genre, we can’t use single-hot encoding as we do for scalar columns. 
Instead, we train models with multi-hot embeddings for these features: the deep learning model looks up an embedding for each category in the list and then averages the embeddings for each row. Both multi-hot categoricals and vector continuous features are represented using list columns in our datasets. cuDF has added support for list columns, and we’re leveraging that support in NVTabular to power this feature.</p> <p>Our Categorify and HashBucket operators can map list columns down to small contiguous integers, which are suitable for use in an embedding lookup table. For example, if the dataset contains two rows like <code class="docutils literal notranslate"><span class="pre">[['comedy',</span> <span class="pre">'horror'],</span> <span class="pre">['comedy',</span> <span class="pre">'sciencefiction']]</span></code>, NVTabular can transform the strings for each row into categorical IDs like <code class="docutils literal notranslate"><span class="pre">[[0,</span> <span class="pre">1],</span> <span class="pre">[0,</span> <span class="pre">2]]</span></code> to be used in our embedding layers.</p> <p>Our PyTorch and TensorFlow dataloaders have been extended to handle both categorical and continuous list columns. In TensorFlow, the KerasSequenceLoader class transforms each list column into two tensors: the values, and the offsets into those values for each batch. These tensors can be converted into RaggedTensors for multi-hot columns; for vector continuous columns, the offsets tensor can be safely ignored. We’ve provided a <code class="docutils literal notranslate"><span class="pre">nvtabular.framework_utils.tensorflow.layers.DenseFeatures</span></code> Keras layer that automatically handles these conversions for both continuous and categorical columns. 
For PyTorch, we’ve added support for multi-hot columns to our <code class="docutils literal notranslate"><span class="pre">nvtabular.framework_utils.torch.models.Model</span></code> class, which internally uses the PyTorch <a class="reference external" href="https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html">EmbeddingBag</a> layer to handle them.</p> </section> <section id="shuffling-datasets"> <h2>Shuffling Datasets<a class="headerlink" href="#shuffling-datasets" title="Permalink to this heading">#</a></h2> <p>NVTabular makes it possible to shuffle during dataset creation. This creates a uniformly shuffled dataset, so the dataloader can load large contiguous chunks of data that are already randomized across the entire dataset. NVTabular also makes it possible to control the number of chunks that are combined into a batch, providing flexibility when trading off between performance and true randomization. This mechanism is critical when a dataset exceeds CPU memory and per-epoch shuffling is desired during training, because a full shuffle of such a dataset can exceed the training time for the epoch by several orders of magnitude.</p> </section> <section id="cloud-integration"> <h2>Cloud Integration<a class="headerlink" href="#cloud-integration" title="Permalink to this heading">#</a></h2> <p>NVTabular offers cloud integration with Amazon Web Services (AWS) and Google Cloud Platform (GCP), giving you the ability to build, train, and deploy models on the cloud using datasets. 
For additional information, see <a class="reference internal" href="resources/cloud_integration.html#amazon-web-services"><span class="std std-doc">Amazon Web Services</span></a> and <a class="reference internal" href="resources/cloud_integration.html#google-cloud-platform"><span class="std std-doc">Google Cloud Platform</span></a>.</p> </section> <section id="cpu-support"> <h2>CPU Support<a class="headerlink" href="#cpu-support" title="Permalink to this heading">#</a></h2> <p>NVTabular supports CPU execution using <a class="reference external" href="https://pandas.pydata.org/">pandas</a>, <a class="reference external" href="https://arrow.apache.org/docs/python/">pyarrow</a>, and <a class="reference external" href="https://examples.dask.org/dataframe.html">dask dataframe</a>. To run on the CPU, initialize the Dataset class with the <code class="docutils literal notranslate"><span class="pre">cpu</span></code> parameter as follows:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">nvtabular</span> <span class="kn">import</span> <span class="n">Dataset</span>

<span class="n">dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">cpu</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div> </div> <p>Processing then takes place on the CPU for that dataset, including feature engineering and preprocessing as well as TensorFlow and PyTorch training using NVTabular’s dataloaders.</p> </section> </section> </article> <footer class="prev-next-footer"> <div class="prev-next-area"> <a class="left-prev" href="Introduction.html" title="previous page"> <i class="fa-solid fa-angle-left"></i> <div class="prev-next-info"> <p class="prev-next-subtitle">previous</p> <p class="prev-next-title">NVTabular</p> </div> </a> <a class="right-next" href="training/index.html" title="next page"> <div class="prev-next-info"> <p class="prev-next-subtitle">next</p> <p 
class="prev-next-title">Accelerated Training</p> </div> <i class="fa-solid fa-angle-right"></i> </a> </div> </footer> </div> <div class="bd-sidebar-secondary bd-toc"><div class="sidebar-secondary-items sidebar-secondary__inner"> <div class="sidebar-secondary-item"> <div class="page-toc tocsection onthispage"> <i class="fa-solid fa-list"></i> Contents </div> <nav class="bd-toc-nav page-toc"> <ul class="visible nav section-nav flex-column"> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#tensorflow-and-pytorch-interoperability">TensorFlow and PyTorch Interoperability</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#hugectr-interoperability">HugeCTR Interoperability</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#multi-gpu-support">Multi-GPU Support</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#multi-node-support">Multi-Node Support</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#multi-hot-encoding-and-pre-existing-embeddings">Multi-Hot Encoding and Pre-Existing Embeddings</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#shuffling-datasets">Shuffling Datasets</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#cloud-integration">Cloud Integration</a></li> <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#cpu-support">CPU Support</a></li> </ul> </nav></div> </div></div> </div> <footer class="bd-footer-content"> <div class="bd-footer-content__inner container"> <div class="footer-item"> <p class="copyright"> © Copyright 2021–2024, NVIDIA. 
<br/> </p> </div> </div> </footer> </main> </div> </div> <!-- Scripts loaded after <body> so the DOM is not blocked --> <script src="_static/scripts/bootstrap.js?digest=8d27b9dea8ad943066ae"></script> <script src="_static/scripts/pydata-sphinx-theme.js?digest=8d27b9dea8ad943066ae"></script> <footer class="bd-footer"> </footer> </body> </html>