Deep Learning Energy Measurement and Optimization | PyTorch
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta http-equiv="X-UA-Compatible" content="ie=edge"> <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico?"> <title> Deep Learning Energy Measurement and Optimization | PyTorch </title> <meta name="robots" content="index, follow" /> <meta name="description" content=" " /> <meta property="og:image" content="https://pytorch.org/assets/images/social-share.jpg" /> <meta name="twitter:image" content="https://pytorch.org/assets/images/social-share.jpg" /> <meta property="og:locale" content="en_US" /> <meta property="og:type" content="website" /> <meta property="og:title" content="Deep Learning Energy Measurement and Optimization" /> <meta property="og:description" content=" " /> <meta property="og:site_name" content="PyTorch" /> <meta name="twitter:card" content="summary_large_image" /> <meta name="twitter:title" content="Deep Learning Energy Measurement and Optimization" /> <meta name="twitter:description" content=" " /> <link rel="stylesheet" href="/assets/main.css"> <script src="/assets/vendor/jquery.min.js"></script> <script src="/assets/vendor/popper.min.js"></script> <script src="/assets/vendor/bootstrap.min.js"></script> <script src="/assets/vendor/anchor.min.js"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: { skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'], inlineMath: [['$','$']] } }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <script type="text/javascript" src="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.js"></script> <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.css" /> <link href="/feed.xml" type="application/atom+xml" rel="alternate" title="Pythorch Blog Posts" /> </head>
<body class="blog"> <div class="main-background blog-background blog-detail-background"></div> <div class="hello-bar"> <div class="container"> Join us at PyTorch Conference in San Francisco, October 22-23. CFP open now! <a target="_blank" href="https://events.linuxfoundation.org/pytorch-conference/">Learn more</a>. 
</div> </div> <div class="container-fluid header-holder blog-detail-header"> <div class="container"> <div class="header-container"> <a class="header-logo" href="https://pytorch.org" aria-label="PyTorch"></a> <div class="main-menu"> <ul> <li class="main-menu-item"> <div id="dropdownMenuButton" data-toggle="resources-dropdown" class="resources-dropdown"> <a class="with-down-arrow"> Learn </a> <div class="resources-dropdown-menu"> <a class="nav-dropdown-item" href="/get-started"> <span class=dropdown-title>Get Started</span> <p>Run PyTorch locally or get started quickly with one of the supported cloud platforms</p> </a> <a class="nav-dropdown-item" href="https://pytorch.org/tutorials/"> <span class="dropdown-title">Tutorials</span> <p>Whats new in PyTorch tutorials</p> </a> <a class="nav-dropdown-item" href="https://pytorch.org/tutorials/beginner/basics/intro.html"> <span class="dropdown-title">Learn the Basics</span> <p>Familiarize yourself with PyTorch concepts and modules</p> </a> <a class="nav-dropdown-item" href="https://pytorch.org/tutorials/recipes/recipes_index.html"> <span class="dropdown-title">PyTorch Recipes</span> <p>Bite-size, ready-to-deploy PyTorch code examples</p> </a> <a class="nav-dropdown-item" href="https://pytorch.org/tutorials/beginner/introyt.html"> <span class="dropdown-title">Intro to PyTorch - YouTube Series</span> <p>Master PyTorch basics with our engaging YouTube tutorial series</p> </a> <a class="nav-dropdown-item" href="/new"> <span class="dropdown-title">New to PyTorch Foundation</span> </a> </div> </div> </li> <li class="main-menu-item"> <div id="dropdownMenuButton" data-toggle="resources-dropdown" class="resources-dropdown"> <a class="with-down-arrow"> Ecosystem </a> <div class="resources-dropdown-menu"> <a class="nav-dropdown-item" href="https://landscape.pytorch.org/" target="_blank"> <span class="dropdown-title">Tools</span> <p>Learn about the tools and frameworks in the PyTorch Ecosystem</p> </a> <a class="nav-dropdown-item" 
href="https://github.com/pytorch-fdn/ecosystem" target="_blank"> <span class="dropdown-title">Join the Ecosystem</span> </a> <a class="nav-dropdown-item" href="/#community-module"> <span class=dropdown-title>Community</span> <p>Join the PyTorch developer community to contribute, learn, and get your questions answered.</p> </a> <a class="nav-dropdown-item" href="https://discuss.pytorch.org" target="_blank"> <span class=dropdown-title>Forums</span> <p>A place to discuss PyTorch code, issues, install, research</p> </a> <a class="nav-dropdown-item" href="/resources"> <span class=dropdown-title>Developer Resources</span> <p>Find resources and get questions answered</p> </a> <a class="nav-dropdown-item" href="/ecosystem/contributor-awards-2024"> <span class="dropdown-title">Contributor Awards - 2024</span> <p>Award winners announced at this year's PyTorch Conference</p> </a> </div> </div> </li> <li class="main-menu-item"> <div id="dropdownMenuButton" data-toggle="resources-dropdown" class="resources-dropdown"> <a class="with-down-arrow"> Edge </a> <div class="resources-dropdown-menu"> <a class="nav-dropdown-item" href="/edge"> <span class="dropdown-title">About PyTorch Edge</span> <p>Build innovative and privacy-aware AI experiences for edge devices</p> </a> <a class="nav-dropdown-item" href="/executorch-overview"> <span class="dropdown-title">ExecuTorch</span> <p>End-to-end solution for enabling on-device inference capabilities across mobile and edge devices</p> </a> <a class="nav-dropdown-item" target="_blank" href="https://pytorch.org/executorch/stable/index.html"> <span class="dropdown-title">ExecuTorch Documentation</span> </a> </div> </div> </li> <li class="main-menu-item"> <div id="docsDropdownButton" data-toggle="resources-dropdown" class="resources-dropdown"> <a class="with-down-arrow"> Docs </a> <div class="resources-dropdown-menu"> <a class="nav-dropdown-item" href="https://pytorch.org/docs"> <span class="dropdown-title">PyTorch</span> <p>Explore the 
documentation for comprehensive guidance on how to use PyTorch.</p> </a> <a class="nav-dropdown-item" href="/pytorch-domains"> <span class="dropdown-title">PyTorch Domains</span> <p> Read the PyTorch Domains documentation to learn more about domain-specific libraries.</p> </a> </div> </div> </li> <li class="main-menu-item"> <div id="dropdownMenuButton" data-toggle="resources-dropdown" class="resources-dropdown"> <a class="with-down-arrow"> Blog & News </a> <div class="resources-dropdown-menu"> <a class="nav-dropdown-item" href="/blog"> <span class="dropdown-title">PyTorch Blog</span> <p>Catch up on the latest technical news and happenings</p> </a> <a class="nav-dropdown-item" href="/community-blog"> <span class="dropdown-title">Community Blog</span> <p>Stories from the PyTorch ecosystem</p> </a> <a class="nav-dropdown-item" href="/videos"> <span class="dropdown-title">Videos</span> <p>Learn about the latest PyTorch tutorials, new, and more </p> </a> <a class="nav-dropdown-item" href="/community-stories"> <span class="dropdown-title">Community Stories</span> <p>Learn how our community solves real, everyday machine learning problems with PyTorch</p> </a> <a class="nav-dropdown-item" href="/events"> <span class=dropdown-title>Events</span> <p>Find events, webinars, and podcasts</p> </a> <a class="nav-dropdown-item" href="/newsletter"> <span class=dropdown-title>Newsletter</span> <p>Stay up-to-date with the latest updates</p> </a> </div> </div> </li> <li class="main-menu-item"> <div id="resourcesDropdownButton" data-toggle="resources-dropdown" class="resources-dropdown"> <a class="with-down-arrow"> About </a> <div class="resources-dropdown-menu"> <a class="nav-dropdown-item" href="/foundation"> <span class=dropdown-title>PyTorch Foundation</span> <p>Learn more about the PyTorch Foundation.</p> </a> <a class="nav-dropdown-item" href="/governing-board"> <span class=dropdown-title>Governing Board</span> </a> <a class="nav-dropdown-item" href="/credits"> <span 
class=dropdown-title>Cloud Credit Program</span> </a> <a class="nav-dropdown-item" href="/tac"> <span class=dropdown-title>Technical Advisory Council</span> </a> <a class="nav-dropdown-item" href="/staff"> <span class=dropdown-title>Staff</span> </a> <a class="nav-dropdown-item" href="/contact-us"> <span class=dropdown-title>Contact Us</span> </a> </div> </div> </li> <li class="main-menu-item"> <a href="/join" data-cta="join"> Become a Member </a> </li> <li class="main-menu-item" id="github-main-menu-link"> <a href="https://github.com/pytorch/pytorch" title="Go to PyTorch GitHub"> <div id="topnav-gh-icon"></div> </a> </li> <li class="navSearchWrapper reactNavSearchWrapper" key="search"> <div class="search-border"> <div id="search-icon"></div> <input id="search-input" type="text" title="Search" /> <div id="close-search">X</div> </div> </li> </ul> </div> <script src="/assets/main-menu-dropdown.js"></script> <a class="main-menu-open-button" href="#" data-behavior="open-mobile-menu"></a> </div> </div> </div> <div class="jumbotron jumbotron-fluid blog-detail-jumbotron"> <div class="container blog-detail-container"> <p class="featured-post">May 11, 2024</p> <h1> <a class="blog-title">Deep Learning Energy Measurement and Optimization</a> </h1> </div> </div> <div class="main-content-wrapper blog-detail-wrapper"> <div class="main-content blog-detail-content"> <div class="container"> <img src="/assets/images/logo-icon.svg" class="img-fluid author-icon"> <article class="pytorch-article"> <p class="author"> by Jae-Won Chung </p> <p><img src="/assets/images/zeus/fig1.png" alt="Zeus logo" style="width:100%;display: block; max-width: 400px; margin-right: auto; margin-left: auto" /></p> <p><em>This post is authored by <a href="https://jaewonchung.me/about">Jae-Won Chung</a>, a PhD student at the University of Michigan and the lead of the <a href="https://ml.energy">ML.ENERGY Initiative</a>.</em></p> <p>Deep learning consumes quite a bit of energy. 
For instance, training a single 200B LLM on AWS p4d instances consumed around 11.9 GWh (source: <a href="https://mvdirona.com/jrh/talksandpapers/JamesHamiltonCIDR2024.pdf">CIDR 2024 keynote</a>), which is enough to power more than a thousand <a href="https://www.eia.gov/tools/faqs/faq.php?id=97&t=3">average US households</a> for a year.</p>
<p><a href="https://github.com/ml-energy/zeus">Zeus</a> is an open-source toolbox for measuring and optimizing the energy consumption of deep learning workloads. Our goal is to make energy optimization based on accurate measurements as easy as possible for diverse deep learning workloads and setups by offering composable tools with minimal assumptions.</p>
<p>Zeus provides two main types of tools:</p>
<ol> <li>Programmatic and command line GPU energy <strong>measurement</strong> tools</li> <li>Several energy <strong>optimization</strong> tools that find the best ML and/or GPU configurations</li> </ol>
<p>Zeus can benefit those who would like to:</p>
<ul> <li>measure and optimize their electricity cost</li> <li>reduce heat dissipation from their GPUs (by lowering power draw)</li> <li>report energy usage from research and development</li> <li>reduce carbon footprint from electricity usage</li> </ul>
<h2 id="part-1-measuring-energy">Part 1: Measuring Energy</h2>
<p>As with performance optimization, accurate measurement is the basis of effective energy optimization. Popular proxies for estimating power consumption, such as the maximum power draw of the hardware, <a href="https://ml.energy/blog/energy/measurement/measuring-gpu-energy-best-practices/">can sometimes be vastly off</a> compared to actual measurements.</p>
<p>To make energy measurement as easy and transparent as possible, the core utility Zeus offers is the <code class="language-plaintext highlighter-rouge">ZeusMonitor</code> class. 
Let’s take a look at the actual snippet:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">zeus.monitor</span> <span class="kn">import</span> <span class="n">ZeusMonitor</span>

<span class="c1"># All four GPUs are measured simultaneously.</span>
<span class="n">monitor</span> <span class="o">=</span> <span class="n">ZeusMonitor</span><span class="p">(</span><span class="n">gpu_indices</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">])</span>

<span class="c1"># Measure total time and energy within the window.</span>
<span class="n">monitor</span><span class="p">.</span><span class="n">begin_window</span><span class="p">(</span><span class="s">"training"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>

    <span class="c1"># Measurement windows can arbitrarily be overlapped.</span>
    <span class="n">monitor</span><span class="p">.</span><span class="n">begin_window</span><span class="p">(</span><span class="s">"epoch"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">train_dataloader</span><span class="p">:</span>
        <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
        <span class="n">y_hat</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">y_hat</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    <span class="n">measurement</span> <span class="o">=</span> <span class="n">monitor</span><span class="p">.</span><span class="n">end_window</span><span class="p">(</span><span class="s">"epoch"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">measurement</span><span class="p">.</span><span class="n">time</span><span class="si">}</span><span class="s"> s, </span><span class="si">{</span><span class="n">measurement</span><span class="p">.</span><span class="n">total_energy</span><span class="si">}</span><span class="s"> J"</span><span class="p">)</span>

<span class="n">measurement</span> <span class="o">=</span> <span class="n">monitor</span><span class="p">.</span><span class="n">end_window</span><span class="p">(</span><span class="s">"training"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Entire training: </span><span class="si">{</span><span class="n">measurement</span><span class="p">.</span><span class="n">time</span><span class="si">}</span><span class="s"> s, </span><span class="si">{</span><span class="n">measurement</span><span class="p">.</span><span class="n">total_energy</span><span class="si">}</span><span class="s"> J"</span><span class="p">)</span>
</code></pre></div></div>
<p>What you see above is a typical PyTorch training loop that uses four GPUs for data parallel training. Inside, we created an instance of <code class="language-plaintext highlighter-rouge">ZeusMonitor</code> and passed in a list of GPU indices to monitor. Then, using the monitor, we can measure the time and energy consumption of arbitrary execution <em>windows</em> within the training script by pairing calls to <code class="language-plaintext highlighter-rouge">begin_window</code> and <code class="language-plaintext highlighter-rouge">end_window</code>. Multiple windows can overlap and nest in arbitrary ways without affecting the measurement of each, as long as their names are different.</p>
<p><code class="language-plaintext highlighter-rouge">ZeusMonitor</code> adds very little overhead – typically single-digit milliseconds – around the window. This allows <code class="language-plaintext highlighter-rouge">ZeusMonitor</code> to be used in various applications. 
For instance:</p>
<ul> <li><a href="https://ml.energy/leaderboard">The ML.ENERGY Leaderboard</a>: The first open-source benchmark on how much energy LLM text generation consumes.</li> <li><a href="https://ml.energy/leaderboard">The ML.ENERGY Colosseum</a>: An online service that lets users compare LLM responses side-by-side based on response quality <em>and</em> energy consumption.</li> </ul>
<p>See our <a href="https://ml.energy/blog/energy/measurement/measuring-gpu-energy-best-practices/">blog post</a> for a deeper technical dive into accurate GPU energy measurement.</p>
<h2 id="part-2-optimizing-energy">Part 2: Optimizing Energy</h2>
<p>Let me introduce you to two of the energy optimizers provided by Zeus.</p>
<h3 id="globalpowerlimitoptimizer">GlobalPowerLimitOptimizer</h3>
<p>GPUs allow users to configure their maximum power draw, called the <em>power limit</em>. Typically, as you lower the GPU’s power limit from the default maximum, computation may get slightly slower, but you’ll save disproportionately more energy. The <code class="language-plaintext highlighter-rouge">GlobalPowerLimitOptimizer</code> in Zeus automatically finds the optimal GPU power limit globally across all GPUs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">zeus.monitor</span> <span class="kn">import</span> <span class="n">ZeusMonitor</span>
<span class="kn">from</span> <span class="nn">zeus.optimizer.power_limit</span> <span class="kn">import</span> <span class="n">GlobalPowerLimitOptimizer</span>

<span class="c1"># The optimizer measures time and energy through the ZeusMonitor.</span>
<span class="n">monitor</span> <span class="o">=</span> <span class="n">ZeusMonitor</span><span class="p">(</span><span class="n">gpu_indices</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">])</span>
<span class="n">plo</span> <span class="o">=</span> <span class="n">GlobalPowerLimitOptimizer</span><span class="p">(</span><span class="n">monitor</span><span class="p">)</span>

<span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">plo</span><span class="p">.</span><span class="n">on_epoch_begin</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">train_dataloader</span><span class="p">:</span>
        <span class="n">plo</span><span class="p">.</span><span class="n">on_step_begin</span><span class="p">()</span>
        <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
        <span class="n">y_hat</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">y_hat</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
        <span class="n">plo</span><span class="p">.</span><span class="n">on_step_end</span><span class="p">()</span>
    <span class="n">plo</span><span class="p">.</span><span class="n">on_epoch_end</span><span class="p">()</span>
</code></pre></div></div>
<p>In our familiar PyTorch training loop, we have instantiated <code class="language-plaintext highlighter-rouge">GlobalPowerLimitOptimizer</code> and passed it an instance of the <code class="language-plaintext highlighter-rouge">ZeusMonitor</code>, through which the optimizer sees the GPUs. Then, we just need to let the optimizer know about training progress (step and epoch boundaries), and the optimizer will transparently do all the necessary profiling and converge to the optimal power limit.</p>
<p>If you’re using the Hugging Face <a href="https://huggingface.co/docs/transformers/main_classes/trainer">Trainer</a> or <a href="https://huggingface.co/docs/trl/main/en/sft_trainer">SFTTrainer</a>, integration is even easier:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">Trainer</span>
<span class="kn">from</span> <span class="nn">zeus.monitor</span> <span class="kn">import</span> <span class="n">ZeusMonitor</span>
<span class="kn">from</span> <span class="nn">zeus.optimizer.power_limit</span> <span class="kn">import</span> <span class="n">HFGlobalPowerLimitOptimizer</span>

<span class="c1"># ZeusMonitor actually auto-detects CUDA_VISIBLE_DEVICES.</span>
<span class="n">monitor</span> <span class="o">=</span> <span class="n">ZeusMonitor</span><span class="p">()</span>
<span class="n">pl_optimizer</span> <span class="o">=</span> <span class="n">HFGlobalPowerLimitOptimizer</span><span class="p">(</span><span class="n">monitor</span><span class="p">)</span>

<span class="c1"># Pass in the optimizer as a Trainer callback. Also works for SFTTrainer.</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
    <span class="n">train_dataset</span><span class="o">=</span><span class="n">train_dataset</span><span class="p">,</span>
    <span class="p">...,</span>
    <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">pl_optimizer</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">HFGlobalPowerLimitOptimizer</code> wraps <code class="language-plaintext highlighter-rouge">GlobalPowerLimitOptimizer</code> so that it automatically detects step and epoch boundaries. We have example integrations <a href="https://github.com/ml-energy/zeus/tree/master/examples/huggingface">here</a>, including running Gemma 7B supervised fine-tuning with QLoRA.</p>
<p>Now we know how to integrate the optimizer, but what is the <em>optimal</em> power limit? Different users can have different preferences for trading off time and energy, so we allow users to specify an <code class="language-plaintext highlighter-rouge">OptimumSelector</code> (basically the <a href="https://en.wikipedia.org/wiki/Strategy_pattern">Strategy Pattern</a>) to express their needs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Built-in strategies for selecting the optimal power limit.</span>
<span class="kn">from</span> <span class="nn">zeus.optimizer.power_limit</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">GlobalPowerLimitOptimizer</span><span class="p">,</span>
    <span class="n">Time</span><span class="p">,</span>
    <span class="n">Energy</span><span class="p">,</span>
    <span class="n">MaxSlowdownConstraint</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Minimize energy while tolerating at most 10% slowdown.</span>
<span class="n">plo</span> <span class="o">=</span> <span class="n">GlobalPowerLimitOptimizer</span><span class="p">(</span>
    <span class="n">monitor</span><span class="p">,</span>
    <span class="n">MaxSlowdownConstraint</span><span class="p">(</span><span class="n">factor</span><span class="o">=</span><span class="mf">1.1</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Some of the built-in strategies include “Minimize time” (<a href="https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Time">Time</a>; this may still lower the power limit from the default, since some workloads exhibit almost no slowdown even at lower power limits), “Minimize energy” (<a href="https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Energy">Energy</a>), “Somewhere in between” (<a href="https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.ZeusCost">ZeusCost</a>), and “Minimize energy given maximum slowdown” (<a href="https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.MaxSlowdownConstraint">MaxSlowdownConstraint</a>). 
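</p>
<p>To make the difference between these strategies concrete, here is a minimal sketch in plain Python that applies three of them to a handful of made-up profiling results. It does not use Zeus: the <code class="language-plaintext highlighter-rouge">Profile</code> record, the function names, and every number are hypothetical, and Zeus’s own selectors may be implemented differently.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One profiled power limit (hypothetical record, not a Zeus type)."""

    power_limit_w: int  # GPU power limit in watts
    time_s: float       # time to finish a fixed amount of work, in seconds
    energy_j: float     # energy consumed for the same work, in joules


# Made-up numbers for illustration only; real profiles come from measurement.
PROFILES = [
    Profile(300, 10.0, 2800.0),
    Profile(250, 10.2, 2500.0),
    Profile(200, 10.9, 2300.0),
    Profile(150, 13.0, 2400.0),
]


def min_time(profiles):
    # Fastest setting; ties are broken in favor of the lower power limit.
    return min(profiles, key=lambda p: (p.time_s, p.power_limit_w))


def min_energy(profiles):
    # Lowest-energy setting, regardless of how slow it is.
    return min(profiles, key=lambda p: p.energy_j)


def min_energy_with_max_slowdown(profiles, factor):
    # Keep settings that are at most `factor` times slower than the fastest
    # one, then pick the lowest-energy setting among those.
    fastest = min(p.time_s for p in profiles)
    allowed = [p for p in profiles if factor * fastest >= p.time_s]
    return min(allowed, key=lambda p: p.energy_j)


print("Minimize time:", min_time(PROFILES))
print("Minimize energy:", min_energy(PROFILES))
print("At most 10% slower:", min_energy_with_max_slowdown(PROFILES, 1.1))
print("At most 5% slower:", min_energy_with_max_slowdown(PROFILES, 1.05))
```
</code></pre></div></div>
<p>With these made-up numbers, tightening the slowdown budget from 10% to 5% moves the choice from the 200 W setting to the 250 W setting, which is the kind of preference a selector lets users express.</p>
<p>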
Users can also create their own optimum selectors as needed.</p>
<h3 id="pipelinefrequencyoptimizer">PipelineFrequencyOptimizer</h3>
<p>The pipeline frequency optimizer, based on our research paper <a href="https://ml.energy/zeus/research_overview/perseus">Perseus</a>, is our latest work on energy optimization for training large models such as GPT-3. Perseus can reduce the energy consumption of large model training with no or negligible degradation in training throughput. We’ll briefly talk about how.</p>
<p><img src="/assets/images/zeus/fig2.png" alt="one iteration of training with four stage pipeline parallelism" style="width:100%;" /></p>
<p>The above is a visualization of one iteration of training with four-stage <em>pipeline parallelism</em> running with the 1F1B schedule. Each box is either a forward or a backward computation, and is colored according to its power consumption.</p>
<p>The key observation here is that when models are partitioned into pipeline stages, it’s very difficult to slice them into perfectly equal sizes. This leads to forward/backward boxes of varying widths and therefore computation <em>idle time</em> between boxes. Notice that the narrower boxes could run slightly slower than they do now without changing the overall critical path (blue line) at all.</p>
<p><img src="/assets/images/zeus/fig3.png" alt="one iteration of training with four stage pipeline parallelism" style="width:100%;" /></p>
<p>That’s what Perseus automatically does. Based on profiling, it identifies computation boxes that are not on the critical path and figures out the precise amount of slowdown for each box that minimizes energy consumption. 
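</p>
<p>The idea of “boxes that are not on the critical path” can be sketched with the textbook critical path method. The example below is a toy model and not Perseus’s actual algorithm: the two-stage, two-microbatch dependency graph, the names such as <code class="language-plaintext highlighter-rouge">s0_F0</code> (stage 0, forward, microbatch 0), and all durations are made up for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
```python
from graphlib import TopologicalSorter

# Hypothetical durations in arbitrary time units. Stage 0 is heavier than
# stage 1, mimicking an imbalanced pipeline partition.
DURATION = {
    "s0_F0": 2, "s0_F1": 2, "s0_B0": 4, "s0_B1": 4,
    "s1_F0": 1, "s1_F1": 1, "s1_B0": 2, "s1_B1": 2,
}

# DEPS[x] lists the computations that must finish before x can start
# (data dependencies across stages plus execution order within a stage).
DEPS = {
    "s0_F1": ["s0_F0"],
    "s1_F0": ["s0_F0"],
    "s1_B0": ["s1_F0"],
    "s1_F1": ["s0_F1", "s1_B0"],
    "s1_B1": ["s1_F1"],
    "s0_B0": ["s1_B0", "s0_F1"],
    "s0_B1": ["s1_B1", "s0_B0"],
}


def slack_per_computation(duration, deps):
    """Return (iteration_time, slack), where slack[x] is how much x alone
    can be stretched without lengthening the iteration."""
    order = list(TopologicalSorter(deps).static_order())

    # Forward pass: earliest time each computation can finish.
    earliest_finish = {}
    for node in order:
        start = max((earliest_finish[d] for d in deps.get(node, ())), default=0)
        earliest_finish[node] = start + duration[node]
    iteration_time = max(earliest_finish.values())

    # Backward pass: latest time each computation may finish.
    latest_finish = {node: iteration_time for node in order}
    for node in reversed(order):
        latest_start = latest_finish[node] - duration[node]
        for d in deps.get(node, ()):
            latest_finish[d] = min(latest_finish[d], latest_start)

    slack = {n: latest_finish[n] - earliest_finish[n] for n in order}
    return iteration_time, slack


iteration_time, slack = slack_per_computation(DURATION, DEPS)
print("iteration time:", iteration_time)
for name in sorted(slack):
    label = "critical" if slack[name] == 0 else f"slack {slack[name]}"
    print(name, label)
```
</code></pre></div></div>
<p>Computations with zero slack form the critical path; the others are candidates for slowing down. Note that slack is shared along a chain of dependent computations, so deciding how much to slow each one is a joint problem, which this sketch does not attempt to solve.</p>
<p>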
When done correctly, the computations we slowed down will consume less power and energy, but the overall iteration time of the pipeline does not change.</p>
<p>See <a href="https://ml.energy/zeus/optimize/pipeline_frequency_optimizer/">our guide</a> to get started with Perseus!</p>
<h2 id="final-words">Final Words</h2>
<p>For users who run their own on-premise compute, energy consumption and the resulting electricity bill are not something that can be easily overlooked. On a larger scale, energy consumption is not just about electricity bills, but also about data center power delivery. With thousands of GPUs running in clusters, finding stable, affordable, and sustainable electricity sources to power data centers is becoming <a href="https://www.cbre.com/insights/reports/north-america-data-center-trends-h1-2023">increasingly challenging</a>. Finding ways to reduce energy disproportionately more than the slowdown incurred leads to lower average power consumption, which can help with the power delivery challenge.</p>
<p>With Zeus, we hope to take the first step towards deep learning energy measurement and optimization.</p>
<p>Wondering where to go from here? 
Here are a couple helpful links:</p> <ul> <li><a href="https://ml.energy/zeus">Zeus homepage/documentation</a></li> <li><a href="https://github.com/ml-energy/zeus">Zeus GitHub repository</a></li> <li><a href="https://github.com/ml-energy/zeus/tree/master/examples">Zeus usage and integration examples</a></li> <li><a href="https://ml.energy">ML.ENERGY Initiative</a> (i.e., the people building Zeus)</li> </ul> </article> </div> </div> </div> </body> </html>