Multi-GPU distributed training with JAX

<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="description" content="Keras documentation"> <meta name="author" content="Keras Team"> <link rel="shortcut icon" href="https://keras.io/img/favicon.ico"> <link rel="canonical" href="https://keras.io/guides/distributed_training_with_jax/" />  <meta property="og:title" content="Keras documentation: Multi-GPU distributed training with JAX"> <meta property="og:image" content="https://keras.io/img/logo-k-keras-wb.png"> <meta name="twitter:title" content="Keras documentation: Multi-GPU distributed training with JAX"> <meta name="twitter:image" content="https://keras.io/img/k-keras-social.png"> <meta name="twitter:card" content="summary"> <title>Multi-GPU distributed training with JAX</title>  <link href="/css/bootstrap.min.css" rel="stylesheet">  <link href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700;800&display=swap" rel="stylesheet">  <link href="/css/docs.css" rel="stylesheet"> <link href="/css/monokai.css" rel="stylesheet">  <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-5DNGF4N'); </script> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-175165319-128', 'auto'); ga('send', 'pageview'); </script>  <script async defer src="https://buttons.github.io/buttons.js"></script> </head> <body>  <noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-5DNGF4N" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>  <div class='k-page'> <div class="k-nav" id="nav-menu"> <a href='/'><img src='/img/logo-small.png' class='logo-small' /></a> <div class="nav flex-column nav-pills" role="tablist" aria-orientation="vertical"> <a class="nav-link" href="/about/" role="tab" aria-selected="">About Keras</a> <a class="nav-link" href="/getting_started/" role="tab" aria-selected="">Getting started</a> <a class="nav-link active" href="/guides/" role="tab" aria-selected="">Developer guides</a> <a class="nav-sublink" href="/guides/functional_api/">The Functional API</a> <a class="nav-sublink" href="/guides/sequential_model/">The Sequential model</a> <a class="nav-sublink" href="/guides/making_new_layers_and_models_via_subclassing/">Making new layers & models via subclassing</a> <a class="nav-sublink" href="/guides/training_with_built_in_methods/">Training & evaluation with the built-in methods</a> <a class="nav-sublink" href="/guides/custom_train_step_in_jax/">Customizing `fit()` with JAX</a> <a class="nav-sublink" href="/guides/custom_train_step_in_tensorflow/">Customizing `fit()` with TensorFlow</a> <a class="nav-sublink" href="/guides/custom_train_step_in_torch/">Customizing `fit()` with PyTorch</a> <a class="nav-sublink" href="/guides/writing_a_custom_training_loop_in_jax/">Writing a custom training loop in JAX</a> <a class="nav-sublink" href="/guides/writing_a_custom_training_loop_in_tensorflow/">Writing a custom training loop in TensorFlow</a> <a class="nav-sublink" href="/guides/writing_a_custom_training_loop_in_torch/">Writing a custom training loop in PyTorch</a> <a class="nav-sublink" href="/guides/serialization_and_saving/">Serialization & saving</a> <a class="nav-sublink" href="/guides/customizing_saving_and_serialization/">Customizing saving & serialization</a> <a class="nav-sublink" href="/guides/writing_your_own_callbacks/">Writing your own callbacks</a> <a class="nav-sublink" href="/guides/transfer_learning/">Transfer learning & fine-tuning</a> <a class="nav-sublink active" href="/guides/distributed_training_with_jax/">Distributed training with JAX</a> <a class="nav-sublink" href="/guides/distributed_training_with_tensorflow/">Distributed training with TensorFlow</a> <a class="nav-sublink" href="/guides/distributed_training_with_torch/">Distributed training with PyTorch</a> <a class="nav-sublink" href="/guides/distribution/">Distributed training with Keras 3</a> <a class="nav-sublink" href="/guides/migrating_to_keras_3/">Migrating Keras 2 code to Keras 3</a> <a class="nav-sublink" href="/guides/keras_tuner/">Hyperparameter Tuning</a> <a class="nav-sublink" href="/guides/keras_cv/">KerasCV</a> <a class="nav-sublink" href="/guides/keras_nlp/">KerasNLP</a> <a class="nav-sublink" href="/guides/keras_hub/">KerasHub</a> <a class="nav-link" href="/api/" role="tab" aria-selected="">Keras 3 API documentation</a> <a class="nav-link" href="/2.18/api/" role="tab" aria-selected="">Keras 2 API documentation</a> <a class="nav-link" href="/examples/" role="tab" aria-selected="">Code examples</a> <a class="nav-link" href="/keras_tuner/" role="tab" aria-selected="">KerasTuner: Hyperparameter Tuning</a> <a class="nav-link" href="/keras_hub/" role="tab" aria-selected="">KerasHub: Pretrained Models</a> <a class="nav-link" href="/keras_cv/" role="tab" aria-selected="">KerasCV: Computer Vision Workflows</a> <a class="nav-link" href="/keras_nlp/" role="tab" aria-selected="">KerasNLP: Natural Language Workflows</a> </div> </div> <div class='k-main'> <div class='k-main-top'> <script> function displayDropdownMenu() { e = document.getElementById("nav-menu"); if (e.style.display == "block") { e.style.display = "none"; } else { e.style.display = "block"; document.getElementById("dropdown-nav").style.display = "block"; } } function resetMobileUI() { if (window.innerWidth <= 840) { document.getElementById("nav-menu").style.display = "none"; document.getElementById("dropdown-nav").style.display = "block"; } else { document.getElementById("nav-menu").style.display = "block"; document.getElementById("dropdown-nav").style.display = "none"; } var navmenu = document.getElementById("nav-menu"); var menuheight = navmenu.clientHeight; var kmain = document.getElementById("k-main-id"); kmain.style.minHeight = (menuheight + 100) + 'px'; } window.onresize = resetMobileUI; window.addEventListener("load", (event) => { resetMobileUI() }); </script> <div id='dropdown-nav' onclick="displayDropdownMenu();"> <svg viewBox="-20 -20 120 120" width="60" height="60"> <rect width="100" height="20"></rect> <rect y="30" width="100" height="20"></rect> <rect y="60" width="100" height="20"></rect> </svg> </div> <form class="bd-search d-flex align-items-center k-search-form" id="search-form"> <input type="search" class="k-search-input" id="search-input" placeholder="Search Keras documentation..." aria-label="Search Keras documentation..." autocomplete="off"> <button class="k-search-btn"> <svg width="13" height="13" viewBox="0 0 13 13"><title>search</title><path d="m4.8495 7.8226c0.82666 0 1.5262-0.29146 2.0985-0.87438 0.57232-0.58292 0.86378-1.2877 0.87438-2.1144 0.010599-0.82666-0.28086-1.5262-0.87438-2.0985-0.59352-0.57232-1.293-0.86378-2.0985-0.87438-0.8055-0.010599-1.5103 0.28086-2.1144 0.87438-0.60414 0.59352-0.8956 1.293-0.87438 2.0985 0.021197 0.8055 0.31266 1.5103 0.87438 2.1144 0.56172 0.60414 1.2665 0.8956 2.1144 0.87438zm4.4695 0.2115 3.681 3.6819-1.259 1.284-3.6817-3.7 0.0019784-0.69479-0.090043-0.098846c-0.87973 0.76087-1.92 1.1413-3.1207 1.1413-1.3553 0-2.5025-0.46363-3.4417-1.3909s-1.4088-2.0686-1.4088-3.4239c0-1.3553 0.4696-2.4966 1.4088-3.4239 0.9392-0.92727 2.0864-1.3969 3.4417-1.4088 1.3553-0.011889 2.4906 0.45771 3.406 1.4088 0.9154 0.95107 1.379 2.0924 1.3909 3.4239 0 1.2126-0.38043 2.2588-1.1413 3.1385l0.098834 0.090049z"></path></svg> </button> </form> <script> var form = document.getElementById('search-form'); form.onsubmit = function(e) { e.preventDefault(); var query = document.getElementById('search-input').value; window.location.href = '/search.html?query=' + query; return False } </script> </div> <div class='k-main-inner' id='k-main-id'> <div class='k-location-slug'> ► <a href='/guides/'>Developer guides</a> / Multi-GPU distributed training with JAX </div> <div class='k-content'> <h1 id="multigpu-distributed-training-with-jax">Multi-GPU distributed training with JAX</h1> Author: <a href="https://twitter.com/fchollet">fchollet</a> Date created: 2023/07/11 Last modified: 2023/07/11 Description: Guide to multi-GPU/TPU training for Keras models with JAX. <img class="k-inline-icon" src="https://colab.research.google.com/img/colab_favicon.ico"/> <a href="https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/distributed_training_with_jax.ipynb">View in Colab</a> •<img class="k-inline-icon" src="https://github.com/favicon.ico"/> <a href="https://github.com/keras-team/keras-io/blob/master/guides/distributed_training_with_jax.py">GitHub source</a> <hr /> <h2 id="introduction">Introduction</h2> There are generally two ways to distribute computation across multiple devices: Data parallelism, where a single model gets replicated on multiple devices or multiple machines. Each of them processes different batches of data, then they merge their results. There exist many variants of this setup, that differ in how the different model replicas merge results, in whether they stay in sync at every batch or whether they are more loosely coupled, etc. Model parallelism, where different parts of a single model run on different devices, processing a single batch of data together. This works best with models that have a naturally-parallel architecture, such as models that feature multiple branches. This guide focuses on data parallelism, in particular synchronous data parallelism, where the different replicas of the model stay in sync after each batch they process. Synchronicity keeps the model convergence behavior identical to what you would see for single-device training. Specifically, this guide teaches you how to use <code>jax.sharding</code> APIs to train Keras models, with minimal changes to your code, on multiple GPUs or TPUS (typically 2 to 16) installed on a single machine (single host, multi-device training). This is the most common setup for researchers and small-scale industry workflows. <hr /> <h2 id="setup">Setup</h2> Let's start by defining the function that creates the model that we will train, and the function that creates the dataset we will train on (MNIST in this case). <div class="codehilite"><pre><code>import os os.environ["KERAS_BACKEND"] = "jax" import jax import numpy as np import tensorflow as tf import keras from jax.experimental import mesh_utils from jax.sharding import Mesh from jax.sharding import NamedSharding from jax.sharding import PartitionSpec as P def get_model(): # Make a simple convnet with batch normalization and dropout. inputs = keras.Input(shape=(28, 28, 1)) x = keras.layers.Rescaling(1.0 / 255.0)(inputs) x = keras.layers.Conv2D(filters=12, kernel_size=3, padding="same", use_bias=False)( x ) x = keras.layers.BatchNormalization(scale=False, center=True)(x) x = keras.layers.ReLU()(x) x = keras.layers.Conv2D( filters=24, kernel_size=6, use_bias=False, strides=2, )(x) x = keras.layers.BatchNormalization(scale=False, center=True)(x) x = keras.layers.ReLU()(x) x = keras.layers.Conv2D( filters=32, kernel_size=6, padding="same", strides=2, name="large_k", )(x) x = keras.layers.BatchNormalization(scale=False, center=True)(x) x = keras.layers.ReLU()(x) x = keras.layers.GlobalAveragePooling2D()(x) x = keras.layers.Dense(256, activation="relu")(x) x = keras.layers.Dropout(0.5)(x) outputs = keras.layers.Dense(10)(x) model = keras.Model(inputs, outputs) return model def get_datasets(): # Load the data and split it between train and test sets (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data() # Scale images to the [0, 1] range x_train = x_train.astype("float32") x_test = x_test.astype("float32") # Make sure images have shape (28, 28, 1) x_train = np.expand_dims(x_train, -1) x_test = np.expand_dims(x_test, -1) print("x_train shape:", x_train.shape) print(x_train.shape[0], "train samples") print(x_test.shape[0], "test samples") # Create TF Datasets train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train)) eval_data = tf.data.Dataset.from_tensor_slices((x_test, y_test)) return train_data, eval_data </code></pre></div> <hr /> <h2 id="singlehost-multidevice-synchronous-training">Single-host, multi-device synchronous training</h2> In this setup, you have one machine with several GPUs or TPUs on it (typically 2 to 16). Each device will run a copy of your model (called a replica). For simplicity, in what follows, we'll assume we're dealing with 8 GPUs, at no loss of generality. How it works At each step of training: <ul> <li>The current batch of data (called global batch) is split into 8 different sub-batches (called local batches). For instance, if the global batch has 512 samples, each of the 8 local batches will have 64 samples.</li> <li>Each of the 8 replicas independently processes a local batch: they run a forward pass, then a backward pass, outputting the gradient of the weights with respect to the loss of the model on the local batch.</li> <li>The weight updates originating from local gradients are efficiently merged across the 8 replicas. Because this is done at the end of every step, the replicas always stay in sync.</li> </ul> In practice, the process of synchronously updating the weights of the model replicas is handled at the level of each individual weight variable. This is done through a using a <code>jax.sharding.NamedSharding</code> that is configured to replicate the variables. How to use it To do single-host, multi-device synchronous training with a Keras model, you would use the <code>jax.sharding</code> features. Here's how it works: <ul> <li>We first create a device mesh using <code>mesh_utils.create_device_mesh</code>.</li> <li>We use <code>jax.sharding.Mesh</code>, <code>jax.sharding.NamedSharding</code> and <code>jax.sharding.PartitionSpec</code> to define how to partition JAX arrays. - We specify that we want to replicate the model and optimizer variables across all devices by using a spec with no axis. - We specify that we want to shard the data across devices by using a spec that splits along the batch dimension.</li> <li>We use <code>jax.device_put</code> to replicate the model and optimizer variables across devices. This happens once at the beginning.</li> <li>In the training loop, for each batch that we process, we use <code>jax.device_put</code> to split the batch across devices before invoking the train step.</li> </ul> Here's the flow, where each step is split into its own utility function: <div class="codehilite"><pre><code># Config num_epochs = 2 batch_size = 64 train_data, eval_data = get_datasets() train_data = train_data.batch(batch_size, drop_remainder=True) model = get_model() optimizer = keras.optimizers.Adam(1e-3) loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True) # Initialize all state with .build() (one_batch, one_batch_labels) = next(iter(train_data)) model.build(one_batch) optimizer.build(model.trainable_variables) # This is the loss function that will be differentiated. # Keras provides a pure functional forward pass: model.stateless_call def compute_loss(trainable_variables, non_trainable_variables, x, y): y_pred, updated_non_trainable_variables = model.stateless_call( trainable_variables, non_trainable_variables, x, training=True ) loss_value = loss(y, y_pred) return loss_value, updated_non_trainable_variables # Function to compute gradients compute_gradients = jax.value_and_grad(compute_loss, has_aux=True) # Training step, Keras provides a pure functional optimizer.stateless_apply @jax.jit def train_step(train_state, x, y): trainable_variables, non_trainable_variables, optimizer_variables = train_state (loss_value, non_trainable_variables), grads = compute_gradients( trainable_variables, non_trainable_variables, x, y ) trainable_variables, optimizer_variables = optimizer.stateless_apply( optimizer_variables, grads, trainable_variables ) return loss_value, ( trainable_variables, non_trainable_variables, optimizer_variables, ) # Replicate the model and optimizer variable on all devices def get_replicated_train_state(devices): # All variables will be replicated on all devices var_mesh = Mesh(devices, axis_names=("_")) # In NamedSharding, axes not mentioned are replicated (all axes here) var_replication = NamedSharding(var_mesh, P()) # Apply the distribution settings to the model variables trainable_variables = jax.device_put(model.trainable_variables, var_replication) non_trainable_variables = jax.device_put( model.non_trainable_variables, var_replication ) optimizer_variables = jax.device_put(optimizer.variables, var_replication) # Combine all state in a tuple return (trainable_variables, non_trainable_variables, optimizer_variables) num_devices = len(jax.local_devices()) print(f"Running on {num_devices} devices: {jax.local_devices()}") devices = mesh_utils.create_device_mesh((num_devices,)) # Data will be split along the batch axis data_mesh = Mesh(devices, axis_names=("batch",)) # naming axes of the mesh data_sharding = NamedSharding( data_mesh, P( "batch", ), ) # naming axes of the sharded partition # Display data sharding x, y = next(iter(train_data)) sharded_x = jax.device_put(x.numpy(), data_sharding) print("Data sharding") jax.debug.visualize_array_sharding(jax.numpy.reshape(sharded_x, [-1, 28 * 28])) train_state = get_replicated_train_state(devices) # Custom training loop for epoch in range(num_epochs): data_iter = iter(train_data) for data in data_iter: x, y = data sharded_x = jax.device_put(x.numpy(), data_sharding) loss_value, train_state = train_step(train_state, sharded_x, y.numpy()) print("Epoch", epoch, "loss:", loss_value) # Post-processing model state update to write them back into the model trainable_variables, non_trainable_variables, optimizer_variables = train_state for variable, value in zip(model.trainable_variables, trainable_variables): variable.assign(value) for variable, value in zip(model.non_trainable_variables, non_trainable_variables): variable.assign(value) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>x_train shape: (60000, 28, 28, 1) 60000 train samples 10000 test samples Running on 1 devices: [CpuDevice(id=0)] Data sharding </code></pre></div> </div> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"> CPU 0 </pre> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Epoch 0 loss: 0.28599858 Epoch 1 loss: 0.23666474 </code></pre></div> </div> That's it! </div> <div class='k-outline'> <div class='k-outline-depth-1'> <a href='#multigpu-distributed-training-with-jax'>Multi-GPU distributed training with JAX</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#introduction'>Introduction</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#setup'>Setup</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#singlehost-multidevice-synchronous-training'>Single-host, multi-device synchronous training</a> </div> </div> </div> </div> </div> </body> <footer style="float: left; width: 100%; padding: 1em; border-top: solid 1px #bbb;"> <a href="https://policies.google.com/terms">Terms</a> | <a href="https://policies.google.com/privacy">Privacy</a> </footer> </html>

CINXE.COM

Multi-GPU distributed training with JAX