<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="description" content="Keras documentation"> <meta name="author" content="Keras Team"> <link rel="shortcut icon" href="https://keras.io/img/favicon.ico"> <link rel="canonical" href="https://keras.io/examples/vision/mlp_image_classification/" /> <!-- Social --> <meta property="og:title" content="Keras documentation: Image classification with modern MLP models"> <meta property="og:image" content="https://keras.io/img/logo-k-keras-wb.png"> <meta name="twitter:title" content="Keras documentation: Image classification with modern MLP models"> <meta name="twitter:image" content="https://keras.io/img/k-keras-social.png"> <meta name="twitter:card" content="summary"> <title>Image classification with modern MLP models</title> <!-- Bootstrap core CSS --> <link href="/css/bootstrap.min.css" rel="stylesheet"> <!-- Custom fonts for this template --> <link href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700;800&display=swap" rel="stylesheet"> <!-- Custom styles for this template --> <link href="/css/docs.css" rel="stylesheet"> <link href="/css/monokai.css" rel="stylesheet"> <!-- Google Tag Manager --> <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-5DNGF4N'); </script> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-175165319-128', 'auto'); 
ga('send', 'pageview'); </script> <!-- End Google Tag Manager --> <script async defer src="https://buttons.github.io/buttons.js"></script> </head> <body> <!-- Google Tag Manager (noscript) --> <noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-5DNGF4N" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript> <!-- End Google Tag Manager (noscript) --> <div class='k-page'> <div class="k-nav" id="nav-menu"> <a href='/'><img src='/img/logo-small.png' class='logo-small' /></a> <div class="nav flex-column nav-pills" role="tablist" aria-orientation="vertical"> <a class="nav-link" href="/about/" role="tab" aria-selected="">About Keras</a> <a class="nav-link" href="/getting_started/" role="tab" aria-selected="">Getting started</a> <a class="nav-link" href="/guides/" role="tab" aria-selected="">Developer guides</a> <a class="nav-link active" href="/examples/" role="tab" aria-selected="">Code examples</a> <a class="nav-sublink active" href="/examples/vision/">Computer Vision</a> <a class="nav-sublink2" href="/examples/vision/image_classification_from_scratch/">Image classification from scratch</a> <a class="nav-sublink2" href="/examples/vision/mnist_convnet/">Simple MNIST convnet</a> <a class="nav-sublink2" href="/examples/vision/image_classification_efficientnet_fine_tuning/">Image classification via fine-tuning with EfficientNet</a> <a class="nav-sublink2" href="/examples/vision/image_classification_with_vision_transformer/">Image classification with Vision Transformer</a> <a class="nav-sublink2" href="/examples/vision/attention_mil_classification/">Classification using Attention-based Deep Multiple Instance Learning</a> <a class="nav-sublink2 active" href="/examples/vision/mlp_image_classification/">Image classification with modern MLP models</a> <a class="nav-sublink2" href="/examples/vision/mobilevit/">A mobile-friendly Transformer-based model for image classification</a> <a class="nav-sublink2" 
href="/examples/vision/xray_classification_with_tpus/">Pneumonia Classification on TPU</a> <a class="nav-sublink2" href="/examples/vision/cct/">Compact Convolutional Transformers</a> <a class="nav-sublink2" href="/examples/vision/convmixer/">Image classification with ConvMixer</a> <a class="nav-sublink2" href="/examples/vision/eanet/">Image classification with EANet (External Attention Transformer)</a> <a class="nav-sublink2" href="/examples/vision/involution/">Involutional neural networks</a> <a class="nav-sublink2" href="/examples/vision/perceiver_image_classification/">Image classification with Perceiver</a> <a class="nav-sublink2" href="/examples/vision/reptile/">Few-Shot learning with Reptile</a> <a class="nav-sublink2" href="/examples/vision/semisupervised_simclr/">Semi-supervised image classification using contrastive pretraining with SimCLR</a> <a class="nav-sublink2" href="/examples/vision/swin_transformers/">Image classification with Swin Transformers</a> <a class="nav-sublink2" href="/examples/vision/vit_small_ds/">Train a Vision Transformer on small datasets</a> <a class="nav-sublink2" href="/examples/vision/shiftvit/">A Vision Transformer without Attention</a> <a class="nav-sublink2" href="/examples/vision/image_classification_using_global_context_vision_transformer/">Image Classification using Global Context Vision Transformer</a> <a class="nav-sublink2" href="/examples/vision/temporal_latent_bottleneck/">When Recurrence meets Transformers</a> <a class="nav-sublink2" href="/examples/vision/oxford_pets_image_segmentation/">Image segmentation with a U-Net-like architecture</a> <a class="nav-sublink2" href="/examples/vision/deeplabv3_plus/">Multiclass semantic segmentation using DeepLabV3+</a> <a class="nav-sublink2" href="/examples/vision/basnet_segmentation/">Highly accurate boundaries segmentation using BASNet</a> <a class="nav-sublink2" href="/examples/vision/fully_convolutional_network/">Image Segmentation using Composable Fully-Convolutional 
Networks</a> <a class="nav-sublink2" href="/examples/vision/retinanet/">Object Detection with RetinaNet</a> <a class="nav-sublink2" href="/examples/vision/keypoint_detection/">Keypoint Detection with Transfer Learning</a> <a class="nav-sublink2" href="/examples/vision/object_detection_using_vision_transformer/">Object detection with Vision Transformers</a> <a class="nav-sublink2" href="/examples/vision/3D_image_classification/">3D image classification from CT scans</a> <a class="nav-sublink2" href="/examples/vision/depth_estimation/">Monocular depth estimation</a> <a class="nav-sublink2" href="/examples/vision/nerf/">3D volumetric rendering with NeRF</a> <a class="nav-sublink2" href="/examples/vision/pointnet_segmentation/">Point cloud segmentation with PointNet</a> <a class="nav-sublink2" href="/examples/vision/pointnet/">Point cloud classification</a> <a class="nav-sublink2" href="/examples/vision/captcha_ocr/">OCR model for reading Captchas</a> <a class="nav-sublink2" href="/examples/vision/handwriting_recognition/">Handwriting recognition</a> <a class="nav-sublink2" href="/examples/vision/autoencoder/">Convolutional autoencoder for image denoising</a> <a class="nav-sublink2" href="/examples/vision/mirnet/">Low-light image enhancement using MIRNet</a> <a class="nav-sublink2" href="/examples/vision/super_resolution_sub_pixel/">Image Super-Resolution using an Efficient Sub-Pixel CNN</a> <a class="nav-sublink2" href="/examples/vision/edsr/">Enhanced Deep Residual Networks for single-image super-resolution</a> <a class="nav-sublink2" href="/examples/vision/zero_dce/">Zero-DCE for low-light image enhancement</a> <a class="nav-sublink2" href="/examples/vision/cutmix/">CutMix data augmentation for image classification</a> <a class="nav-sublink2" href="/examples/vision/mixup/">MixUp augmentation for image classification</a> <a class="nav-sublink2" href="/examples/vision/randaugment/">RandAugment for Image Classification for Improved Robustness</a> <a 
class="nav-sublink2" href="/examples/vision/image_captioning/">Image captioning</a> <a class="nav-sublink2" href="/examples/vision/nl_image_search/">Natural language image search with a Dual Encoder</a> <a class="nav-sublink2" href="/examples/vision/visualizing_what_convnets_learn/">Visualizing what convnets learn</a> <a class="nav-sublink2" href="/examples/vision/integrated_gradients/">Model interpretability with Integrated Gradients</a> <a class="nav-sublink2" href="/examples/vision/probing_vits/">Investigating Vision Transformer representations</a> <a class="nav-sublink2" href="/examples/vision/grad_cam/">Grad-CAM class activation visualization</a> <a class="nav-sublink2" href="/examples/vision/near_dup_search/">Near-duplicate image search</a> <a class="nav-sublink2" href="/examples/vision/semantic_image_clustering/">Semantic Image Clustering</a> <a class="nav-sublink2" href="/examples/vision/siamese_contrastive/">Image similarity estimation using a Siamese Network with a contrastive loss</a> <a class="nav-sublink2" href="/examples/vision/siamese_network/">Image similarity estimation using a Siamese Network with a triplet loss</a> <a class="nav-sublink2" href="/examples/vision/metric_learning/">Metric learning for image similarity search</a> <a class="nav-sublink2" href="/examples/vision/metric_learning_tf_similarity/">Metric learning for image similarity search using TensorFlow Similarity</a> <a class="nav-sublink2" href="/examples/vision/nnclr/">Self-supervised contrastive learning with NNCLR</a> <a class="nav-sublink2" href="/examples/vision/video_classification/">Video Classification with a CNN-RNN Architecture</a> <a class="nav-sublink2" href="/examples/vision/conv_lstm/">Next-Frame Video Prediction with Convolutional LSTMs</a> <a class="nav-sublink2" href="/examples/vision/video_transformers/">Video Classification with Transformers</a> <a class="nav-sublink2" href="/examples/vision/vivit/">Video Vision Transformer</a> <a class="nav-sublink2" 
href="/examples/vision/bit/">Image Classification using BigTransfer (BiT)</a> <a class="nav-sublink2" href="/examples/vision/gradient_centralization/">Gradient Centralization for Better Training Performance</a> <a class="nav-sublink2" href="/examples/vision/token_learner/">Learning to tokenize in Vision Transformers</a> <a class="nav-sublink2" href="/examples/vision/knowledge_distillation/">Knowledge Distillation</a> <a class="nav-sublink2" href="/examples/vision/fixres/">FixRes: Fixing train-test resolution discrepancy</a> <a class="nav-sublink2" href="/examples/vision/cait/">Class Attention Image Transformers with LayerScale</a> <a class="nav-sublink2" href="/examples/vision/patch_convnet/">Augmenting convnets with aggregated attention</a> <a class="nav-sublink2" href="/examples/vision/learnable_resizer/">Learning to Resize</a> <a class="nav-sublink2" href="/examples/vision/adamatch/">Semi-supervision and domain adaptation with AdaMatch</a> <a class="nav-sublink2" href="/examples/vision/barlow_twins/">Barlow Twins for Contrastive SSL</a> <a class="nav-sublink2" href="/examples/vision/consistency_training/">Consistency training with supervision</a> <a class="nav-sublink2" href="/examples/vision/deit/">Distilling Vision Transformers</a> <a class="nav-sublink2" href="/examples/vision/focal_modulation_network/">Focal Modulation: A replacement for Self-Attention</a> <a class="nav-sublink2" href="/examples/vision/forwardforward/">Using the Forward-Forward Algorithm for Image Classification</a> <a class="nav-sublink2" href="/examples/vision/masked_image_modeling/">Masked image modeling with Autoencoders</a> <a class="nav-sublink2" href="/examples/vision/sam/">Segment Anything Model with 🤗Transformers</a> <a class="nav-sublink2" href="/examples/vision/segformer/">Semantic segmentation with SegFormer and Hugging Face Transformers</a> <a class="nav-sublink2" href="/examples/vision/simsiam/">Self-supervised contrastive learning with SimSiam</a> <a class="nav-sublink2" 
href="/examples/vision/supervised-contrastive-learning/">Supervised Contrastive Learning</a> <a class="nav-sublink2" href="/examples/vision/yolov8/">Efficient Object Detection with YOLOV8 and KerasCV</a> <a class="nav-sublink" href="/examples/nlp/">Natural Language Processing</a> <a class="nav-sublink" href="/examples/structured_data/">Structured Data</a> <a class="nav-sublink" href="/examples/timeseries/">Timeseries</a> <a class="nav-sublink" href="/examples/generative/">Generative Deep Learning</a> <a class="nav-sublink" href="/examples/audio/">Audio Data</a> <a class="nav-sublink" href="/examples/rl/">Reinforcement Learning</a> <a class="nav-sublink" href="/examples/graph/">Graph Data</a> <a class="nav-sublink" href="/examples/keras_recipes/">Quick Keras Recipes</a> <a class="nav-link" href="/api/" role="tab" aria-selected="">Keras 3 API documentation</a> <a class="nav-link" href="/2.18/api/" role="tab" aria-selected="">Keras 2 API documentation</a> <a class="nav-link" href="/keras_tuner/" role="tab" aria-selected="">KerasTuner: Hyperparam Tuning</a> <a class="nav-link" href="/keras_hub/" role="tab" aria-selected="">KerasHub: Pretrained Models</a> </div> </div> <div class='k-main'> <div class='k-main-top'> <script> function displayDropdownMenu() { e = document.getElementById("nav-menu"); if (e.style.display == "block") { e.style.display = "none"; } else { e.style.display = "block"; document.getElementById("dropdown-nav").style.display = "block"; } } function resetMobileUI() { if (window.innerWidth <= 840) { document.getElementById("nav-menu").style.display = "none"; document.getElementById("dropdown-nav").style.display = "block"; } else { document.getElementById("nav-menu").style.display = "block"; document.getElementById("dropdown-nav").style.display = "none"; } var navmenu = document.getElementById("nav-menu"); var menuheight = navmenu.clientHeight; var kmain = document.getElementById("k-main-id"); kmain.style.minHeight = (menuheight + 100) + 'px'; } 
window.onresize = resetMobileUI; window.addEventListener("load", (event) => { resetMobileUI() }); </script> <div id='dropdown-nav' onclick="displayDropdownMenu();"> <svg viewBox="-20 -20 120 120" width="60" height="60"> <rect width="100" height="20"></rect> <rect y="30" width="100" height="20"></rect> <rect y="60" width="100" height="20"></rect> </svg> </div> <form class="bd-search d-flex align-items-center k-search-form" id="search-form"> <input type="search" class="k-search-input" id="search-input" placeholder="Search Keras documentation..." aria-label="Search Keras documentation..." autocomplete="off"> <button class="k-search-btn"> <svg width="13" height="13" viewBox="0 0 13 13"><title>search</title><path d="m4.8495 7.8226c0.82666 0 1.5262-0.29146 2.0985-0.87438 0.57232-0.58292 0.86378-1.2877 0.87438-2.1144 0.010599-0.82666-0.28086-1.5262-0.87438-2.0985-0.59352-0.57232-1.293-0.86378-2.0985-0.87438-0.8055-0.010599-1.5103 0.28086-2.1144 0.87438-0.60414 0.59352-0.8956 1.293-0.87438 2.0985 0.021197 0.8055 0.31266 1.5103 0.87438 2.1144 0.56172 0.60414 1.2665 0.8956 2.1144 0.87438zm4.4695 0.2115 3.681 3.6819-1.259 1.284-3.6817-3.7 0.0019784-0.69479-0.090043-0.098846c-0.87973 0.76087-1.92 1.1413-3.1207 1.1413-1.3553 0-2.5025-0.46363-3.4417-1.3909s-1.4088-2.0686-1.4088-3.4239c0-1.3553 0.4696-2.4966 1.4088-3.4239 0.9392-0.92727 2.0864-1.3969 3.4417-1.4088 1.3553-0.011889 2.4906 0.45771 3.406 1.4088 0.9154 0.95107 1.379 2.0924 1.3909 3.4239 0 1.2126-0.38043 2.2588-1.1413 3.1385l0.098834 0.090049z"></path></svg> </button> </form> <script> var form = document.getElementById('search-form'); form.onsubmit = function(e) { e.preventDefault(); var query = document.getElementById('search-input').value; window.location.href = '/search.html?query=' + encodeURIComponent(query); return false; } </script> </div> <div class='k-main-inner' id='k-main-id'> <div class='k-location-slug'> <span class="k-location-slug-pointer">►</span> <a href='/examples/'>Code examples</a> / <a href='/examples/vision/'>Computer
Vision</a> / Image classification with modern MLP models </div> <div class='k-content'> <h1 id="image-classification-with-modern-mlp-models">Image classification with modern MLP models</h1> <p><strong>Author:</strong> <a href="https://www.linkedin.com/in/khalid-salama-24403144/">Khalid Salama</a><br> <strong>Date created:</strong> 2021/05/30<br> <strong>Last modified:</strong> 2023/08/03<br> <strong>Description:</strong> Implementing the MLP-Mixer, FNet, and gMLP models for CIFAR-100 image classification.</p> <div class='example_version_banner keras_3'>ⓘ This example uses Keras 3</div> <p><img class="k-inline-icon" src="https://colab.research.google.com/img/colab_favicon.ico"/> <a href="https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/mlp_image_classification.ipynb"><strong>View in Colab</strong></a> <span class="k-dot">•</span><img class="k-inline-icon" src="https://github.com/favicon.ico"/> <a href="https://github.com/keras-team/keras-io/blob/master/examples/vision/mlp_image_classification.py"><strong>GitHub source</strong></a></p> <hr /> <h2 id="introduction">Introduction</h2> <p>This example implements three modern attention-free, multi-layer perceptron (MLP)-based models for image classification, demonstrated on the CIFAR-100 dataset:</p> <ol> <li>The <a href="https://arxiv.org/abs/2105.01601">MLP-Mixer</a> model, by Ilya Tolstikhin et al., based on two types of MLPs.</li> <li>The <a href="https://arxiv.org/abs/2105.03824">FNet</a> model, by James Lee-Thorp et al., based on the unparameterized Fourier Transform.</li> <li>The <a href="https://arxiv.org/abs/2105.08050">gMLP</a> model, by Hanxiao Liu et al., based on MLPs with gating.</li> </ol> <p>The purpose of the example is not to compare these models, as they might perform differently on different datasets with well-tuned hyperparameters.
Rather, it is to show simple implementations of their main building blocks.</p> <hr /> <h2 id="setup">Setup</h2> <div class="codehilite"><pre><span></span><code><span class="kn">import</span><span class="w"> </span><span class="nn">numpy</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">np</span> <span class="kn">import</span><span class="w"> </span><span class="nn">keras</span> <span class="kn">from</span><span class="w"> </span><span class="nn">keras</span><span class="w"> </span><span class="kn">import</span> <span class="n">layers</span> </code></pre></div> <hr /> <h2 id="prepare-the-data">Prepare the data</h2> <div class="codehilite"><pre><span></span><code><span class="n">num_classes</span> <span class="o">=</span> <span class="mi">100</span> <span class="n">input_shape</span> <span class="o">=</span> <span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">cifar100</span><span class="o">.</span><span class="n">load_data</span><span class="p">()</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;x_train shape: </span><span class="si">{</span><span class="n">x_train</span><span class="o">.</span><span class="n">shape</span><span class="si">}</span><span class="s2"> - y_train shape: </span><span class="si">{</span><span class="n">y_train</span><span class="o">.</span><span class="n">shape</span><span class="si">}</span><span 
class="s2">&quot;</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;x_test shape: </span><span class="si">{</span><span class="n">x_test</span><span class="o">.</span><span class="n">shape</span><span class="si">}</span><span class="s2"> - y_test shape: </span><span class="si">{</span><span class="n">y_test</span><span class="o">.</span><span class="n">shape</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>x_train shape: (50000, 32, 32, 3) - y_train shape: (50000, 1) x_test shape: (10000, 32, 32, 3) - y_test shape: (10000, 1) </code></pre></div> </div> <hr /> <h2 id="configure-the-hyperparameters">Configure the hyperparameters</h2> <div class="codehilite"><pre><span></span><code><span class="n">weight_decay</span> <span class="o">=</span> <span class="mf">0.0001</span> <span class="n">batch_size</span> <span class="o">=</span> <span class="mi">128</span> <span class="n">num_epochs</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># Recommended num_epochs = 50</span> <span class="n">dropout_rate</span> <span class="o">=</span> <span class="mf">0.2</span> <span class="n">image_size</span> <span class="o">=</span> <span class="mi">64</span> <span class="c1"># We&#39;ll resize input images to this size.</span> <span class="n">patch_size</span> <span class="o">=</span> <span class="mi">8</span> <span class="c1"># Size of the patches to be extracted from the input images.</span> <span class="n">num_patches</span> <span class="o">=</span> <span class="p">(</span><span class="n">image_size</span> <span class="o">//</span> <span class="n">patch_size</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span> <span class="c1"># Size of the data array.</span> <span class="n">embedding_dim</span> <span 
class="o">=</span> <span class="mi">256</span> <span class="c1"># Number of hidden units.</span> <span class="n">num_blocks</span> <span class="o">=</span> <span class="mi">4</span> <span class="c1"># Number of blocks.</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Image size: </span><span class="si">{</span><span class="n">image_size</span><span class="si">}</span><span class="s2"> X </span><span class="si">{</span><span class="n">image_size</span><span class="si">}</span><span class="s2"> = </span><span class="si">{</span><span class="n">image_size</span><span class="w"> </span><span class="o">**</span><span class="w"> </span><span class="mi">2</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Patch size: </span><span class="si">{</span><span class="n">patch_size</span><span class="si">}</span><span class="s2"> X </span><span class="si">{</span><span class="n">patch_size</span><span class="si">}</span><span class="s2"> = </span><span class="si">{</span><span class="n">patch_size</span><span class="w"> </span><span class="o">**</span><span class="w"> </span><span class="mi">2</span><span class="si">}</span><span class="s2"> &quot;</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Patches per image: </span><span class="si">{</span><span class="n">num_patches</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Elements per patch (3 channels): </span><span class="si">{</span><span class="p">(</span><span class="n">patch_size</span><span class="w"> </span><span class="o">**</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span 
class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">3</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Image size: 64 X 64 = 4096 Patch size: 8 X 8 = 64 Patches per image: 64 Elements per patch (3 channels): 192 </code></pre></div> </div> <hr /> <h2 id="build-a-classification-model">Build a classification model</h2> <p>We implement a method that builds a classifier given the processing blocks.</p> <div class="codehilite"><pre><span></span><code><span class="k">def</span><span class="w"> </span><span class="nf">build_classifier</span><span class="p">(</span><span class="n">blocks</span><span class="p">,</span> <span class="n">positional_encoding</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span> <span class="n">inputs</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">input_shape</span><span class="p">)</span> <span class="c1"># Augment data.</span> <span class="n">augmented</span> <span class="o">=</span> <span class="n">data_augmentation</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="c1"># Create patches.</span> <span class="n">patches</span> <span class="o">=</span> <span class="n">Patches</span><span class="p">(</span><span class="n">patch_size</span><span class="p">)(</span><span class="n">augmented</span><span class="p">)</span> <span class="c1"># Encode patches to generate a [batch_size, num_patches, embedding_dim] tensor.</span> <span class="n">x</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span 
class="n">embedding_dim</span><span class="p">)(</span><span class="n">patches</span><span class="p">)</span> <span class="k">if</span> <span class="n">positional_encoding</span><span class="p">:</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">PositionEmbedding</span><span class="p">(</span><span class="n">sequence_length</span><span class="o">=</span><span class="n">num_patches</span><span class="p">)(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Process x using the module blocks.</span> <span class="n">x</span> <span class="o">=</span> <span class="n">blocks</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Apply global average pooling to generate a [batch_size, embedding_dim] representation tensor.</span> <span class="n">representation</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">GlobalAveragePooling1D</span><span class="p">()(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Apply dropout.</span> <span class="n">representation</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">)(</span><span class="n">representation</span><span class="p">)</span> <span class="c1"># Compute logits outputs.</span> <span class="n">logits</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">num_classes</span><span class="p">)(</span><span class="n">representation</span><span class="p">)</span> <span class="c1"># Create the Keras model.</span> <span class="k">return</span> <span class="n">keras</span><span class="o">.</span><span class="n">Model</span><span 
class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">logits</span><span class="p">)</span> </code></pre></div> <hr /> <h2 id="define-an-experiment">Define an experiment</h2> <p>We implement a utility function to compile, train, and evaluate a given model.</p> <div class="codehilite"><pre><span></span><code><span class="k">def</span><span class="w"> </span><span class="nf">run_experiment</span><span class="p">(</span><span class="n">model</span><span class="p">):</span> <span class="c1"># Create Adam optimizer with weight decay.</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">optimizers</span><span class="o">.</span><span class="n">AdamW</span><span class="p">(</span> <span class="n">learning_rate</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="n">weight_decay</span><span class="p">,</span> <span class="p">)</span> <span class="c1"># Compile the model.</span> <span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span> <span class="n">optimizer</span><span class="o">=</span><span class="n">optimizer</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="n">keras</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">SparseCategoricalCrossentropy</span><span class="p">(</span><span class="n">from_logits</span><span class="o">=</span><span class="kc">True</span><span class="p">),</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span> <span class="n">keras</span><span class="o">.</span><span class="n">metrics</span><span class="o">.</span><span 
class="n">SparseCategoricalAccuracy</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s2">&quot;acc&quot;</span><span class="p">),</span> <span class="n">keras</span><span class="o">.</span><span class="n">metrics</span><span class="o">.</span><span class="n">SparseTopKCategoricalAccuracy</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s2">&quot;top5-acc&quot;</span><span class="p">),</span> <span class="p">],</span> <span class="p">)</span> <span class="c1"># Create a learning rate scheduler callback.</span> <span class="n">reduce_lr</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">callbacks</span><span class="o">.</span><span class="n">ReduceLROnPlateau</span><span class="p">(</span> <span class="n">monitor</span><span class="o">=</span><span class="s2">&quot;val_loss&quot;</span><span class="p">,</span> <span class="n">factor</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">patience</span><span class="o">=</span><span class="mi">5</span> <span class="p">)</span> <span class="c1"># Create an early stopping callback.</span> <span class="n">early_stopping</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">callbacks</span><span class="o">.</span><span class="n">EarlyStopping</span><span class="p">(</span> <span class="n">monitor</span><span class="o">=</span><span class="s2">&quot;val_loss&quot;</span><span class="p">,</span> <span class="n">patience</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">restore_best_weights</span><span class="o">=</span><span class="kc">True</span> <span class="p">)</span> <span class="c1"># Fit the model.</span> <span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span 
class="o">.</span><span class="n">fit</span><span class="p">(</span> <span class="n">x</span><span class="o">=</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">num_epochs</span><span class="p">,</span> <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">early_stopping</span><span class="p">,</span> <span class="n">reduce_lr</span><span class="p">],</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="p">)</span> <span class="n">_</span><span class="p">,</span> <span class="n">accuracy</span><span class="p">,</span> <span class="n">top_5_accuracy</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Test accuracy: </span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">accuracy</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s2">%&quot;</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Test top 5 accuracy: </span><span class="si">{</span><span 
class="nb">round</span><span class="p">(</span><span class="n">top_5_accuracy</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s2">%&quot;</span><span class="p">)</span> <span class="c1"># Return history to plot learning curves.</span> <span class="k">return</span> <span class="n">history</span> </code></pre></div> <hr /> <h2 id="use-data-augmentation">Use data augmentation</h2> <div class="codehilite"><pre><span></span><code><span class="n">data_augmentation</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span> <span class="n">layers</span><span class="o">.</span><span class="n">Normalization</span><span class="p">(),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Resizing</span><span class="p">(</span><span class="n">image_size</span><span class="p">,</span> <span class="n">image_size</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">RandomFlip</span><span class="p">(</span><span class="s2">&quot;horizontal&quot;</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">RandomZoom</span><span class="p">(</span><span class="n">height_factor</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">width_factor</span><span class="o">=</span><span class="mf">0.2</span><span class="p">),</span> <span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s2">&quot;data_augmentation&quot;</span><span class="p">,</span> <span class="p">)</span> <span class="c1"># Compute the mean and the variance of the training data for normalization.</span> <span 
class="n">data_augmentation</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">adapt</span><span class="p">(</span><span class="n">x_train</span><span class="p">)</span> </code></pre></div> <hr /> <h2 id="implement-patch-extraction-as-a-layer">Implement patch extraction as a layer</h2> <div class="codehilite"><pre><span></span><code><span class="k">class</span><span class="w"> </span><span class="nc">Patches</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">Layer</span><span class="p">):</span> <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">patch_size</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">patch_size</span> <span class="o">=</span> <span class="n">patch_size</span> <span class="k">def</span><span class="w"> </span><span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="n">patches</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">image</span><span class="o">.</span><span class="n">extract_patches</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">patch_size</span><span class="p">)</span> <span class="n">batch_size</span> 
<span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">patches</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="n">num_patches</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">patches</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">patches</span><span class="p">)[</span><span class="mi">2</span><span class="p">]</span> <span class="n">patch_dim</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">patches</span><span class="p">)[</span><span class="mi">3</span><span class="p">]</span> <span class="n">out</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">patches</span><span class="p">,</span> <span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_patches</span><span class="p">,</span> <span class="n">patch_dim</span><span class="p">))</span> <span class="k">return</span> <span class="n">out</span> </code></pre></div> <hr /> <h2 id="implement-position-embedding-as-a-layer">Implement position embedding as a layer</h2> <div class="codehilite"><pre><span></span><code><span class="k">class</span><span class="w"> </span><span class="nc">PositionEmbedding</span><span 
class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Layer</span><span class="p">):</span> <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span> <span class="bp">self</span><span class="p">,</span> <span class="n">sequence_length</span><span class="p">,</span> <span class="n">initializer</span><span class="o">=</span><span class="s2">&quot;glorot_uniform&quot;</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">,</span> <span class="p">):</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="k">if</span> <span class="n">sequence_length</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span> <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&quot;`sequence_length` must be an Integer, received `None`.&quot;</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">sequence_length</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">sequence_length</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">initializer</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">initializers</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">initializer</span><span class="p">)</span> <span class="k">def</span><span class="w"> </span><span class="nf">get_config</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">config</span> <span class="o">=</span> <span 
class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">get_config</span><span class="p">()</span> <span class="n">config</span><span class="o">.</span><span class="n">update</span><span class="p">(</span> <span class="p">{</span> <span class="s2">&quot;sequence_length&quot;</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">sequence_length</span><span class="p">,</span> <span class="s2">&quot;initializer&quot;</span><span class="p">:</span> <span class="n">keras</span><span class="o">.</span><span class="n">initializers</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">initializer</span><span class="p">),</span> <span class="p">}</span> <span class="p">)</span> <span class="k">return</span> <span class="n">config</span> <span class="k">def</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_shape</span><span class="p">):</span> <span class="n">feature_size</span> <span class="o">=</span> <span class="n">input_shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="bp">self</span><span class="o">.</span><span class="n">position_embeddings</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">add_weight</span><span class="p">(</span> <span class="n">name</span><span class="o">=</span><span class="s2">&quot;embeddings&quot;</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">sequence_length</span><span class="p">,</span> <span class="n">feature_size</span><span class="p">],</span> <span class="n">initializer</span><span class="o">=</span><span 
class="bp">self</span><span class="o">.</span><span class="n">initializer</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="p">)</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="n">input_shape</span><span class="p">)</span> <span class="k">def</span><span class="w"> </span><span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">start_index</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span> <span class="n">shape</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="n">feature_length</span> <span class="o">=</span> <span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="n">sequence_length</span> <span class="o">=</span> <span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="c1"># trim to match the length of the input sequence, which might be less</span> <span class="c1"># than the sequence_length of the layer.</span> <span class="n">position_embeddings</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">convert_to_tensor</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">position_embeddings</span><span class="p">)</span> <span class="n">position_embeddings</span> <span class="o">=</span> <span 
class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span> <span class="n">position_embeddings</span><span class="p">,</span> <span class="p">(</span><span class="n">start_index</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="n">sequence_length</span><span class="p">,</span> <span class="n">feature_length</span><span class="p">),</span> <span class="p">)</span> <span class="k">return</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">broadcast_to</span><span class="p">(</span><span class="n">position_embeddings</span><span class="p">,</span> <span class="n">shape</span><span class="p">)</span> <span class="k">def</span><span class="w"> </span><span class="nf">compute_output_shape</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_shape</span><span class="p">):</span> <span class="k">return</span> <span class="n">input_shape</span> </code></pre></div> <hr /> <h2 id="the-mlpmixer-model">The MLP-Mixer model</h2> <p>The MLP-Mixer is an architecture based exclusively on multi-layer perceptrons (MLPs) that contains two types of MLP layers:</p> <ol> <li>One applied independently to image patches, which mixes the per-location features.</li> <li>The other applied across patches (independently per channel), which mixes spatial information.</li> </ol> <p>This is similar to a <a href="https://arxiv.org/abs/1610.02357">depthwise separable convolution-based model</a> such as the Xception model, but with two chained dense transforms, no max pooling, and layer normalization instead of batch normalization.</p> <h3 id="implement-the-mlpmixer-module">Implement the MLP-Mixer module</h3> <div class="codehilite"><pre><span></span><code><span class="k">class</span><span class="w"> </span><span
class="nc">MLPMixerLayer</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">Layer</span><span class="p">):</span> <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_patches</span><span class="p">,</span> <span class="n">hidden_units</span><span class="p">,</span> <span class="n">dropout_rate</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp1</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">num_patches</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s2">&quot;gelu&quot;</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">num_patches</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">),</span> <span class="p">]</span> <span 
class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp2</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">num_patches</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s2">&quot;gelu&quot;</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">hidden_units</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">),</span> <span class="p">]</span> <span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">LayerNormalization</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">)</span> <span class="k">def</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_shape</span><span class="p">):</span> <span class="k">return</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="n">input_shape</span><span class="p">)</span> <span class="k">def</span><span class="w"> </span><span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span 
class="p">,</span> <span class="n">inputs</span><span class="p">):</span> <span class="c1"># Apply layer normalization.</span> <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="c1"># Transpose inputs from [num_batches, num_patches, hidden_units] to [num_batches, hidden_units, num_patches].</span> <span class="n">x_channels</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axes</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="c1"># Apply mlp1 on each channel independently.</span> <span class="n">mlp1_outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp1</span><span class="p">(</span><span class="n">x_channels</span><span class="p">)</span> <span class="c1"># Transpose mlp1_outputs from [num_batches, hidden_dim, num_patches] to [num_batches, num_patches, hidden_units].</span> <span class="n">mlp1_outputs</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">mlp1_outputs</span><span class="p">,</span> <span class="n">axes</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="c1"># Add skip connection.</span> <span class="n">x</span> <span class="o">=</span> 
<span class="n">mlp1_outputs</span> <span class="o">+</span> <span class="n">inputs</span> <span class="c1"># Apply layer normalization.</span> <span class="n">x_patches</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Apply mlp2 on each patch independently.</span> <span class="n">mlp2_outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp2</span><span class="p">(</span><span class="n">x_patches</span><span class="p">)</span> <span class="c1"># Add skip connection.</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">mlp2_outputs</span> <span class="k">return</span> <span class="n">x</span> </code></pre></div> <h3 id="build-train-and-evaluate-the-mlpmixer-model">Build, train, and evaluate the MLP-Mixer model</h3> <p>Note that training the model with the current settings on a V100 GPU takes around 8 seconds per epoch.</p> <div class="codehilite"><pre><span></span><code><span class="n">mlpmixer_blocks</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span><span class="n">MLPMixerLayer</span><span class="p">(</span><span class="n">num_patches</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="n">dropout_rate</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_blocks</span><span class="p">)]</span> <span class="p">)</span> <span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.005</span> <span class="n">mlpmixer_classifier</span> <span class="o">=</span> <span
class="n">build_classifier</span><span class="p">(</span><span class="n">mlpmixer_blocks</span><span class="p">)</span> <span class="n">history</span> <span class="o">=</span> <span class="n">run_experiment</span><span class="p">(</span><span class="n">mlpmixer_classifier</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Test accuracy: 9.76% Test top 5 accuracy: 30.8% </code></pre></div> </div> <p>The MLP-Mixer model tends to have far fewer parameters than convolutional and transformer-based models, which reduces training and serving computational cost.</p> <p>As mentioned in the <a href="https://arxiv.org/abs/2105.01601">MLP-Mixer</a> paper, when pre-trained on large datasets, or with modern regularization schemes, the MLP-Mixer attains scores competitive with state-of-the-art models. You can obtain better results by increasing the embedding dimensions, increasing the number of mixer blocks, and training the model for longer. You may also try to increase the size of the input images and use different patch sizes.</p> <hr /> <h2 id="the-fnet-model">The FNet model</h2> <p>The FNet uses a block similar to the Transformer block.
However, FNet replaces the self-attention layer in the Transformer block with a parameter-free 2D Fourier transformation layer:</p> <ol> <li>One 1D Fourier Transform is applied along the patches.</li> <li>One 1D Fourier Transform is applied along the channels.</li> </ol> <h3 id="implement-the-fnet-module">Implement the FNet module</h3> <div class="codehilite"><pre><span></span><code><span class="k">class</span><span class="w"> </span><span class="nc">FNetLayer</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">Layer</span><span class="p">):</span> <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="n">dropout_rate</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">ffn</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">embedding_dim</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s2">&quot;gelu&quot;</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dropout</span><span 
class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">embedding_dim</span><span class="p">),</span> <span class="p">]</span> <span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize1</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">LayerNormalization</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize2</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">LayerNormalization</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">)</span> <span class="k">def</span><span class="w"> </span><span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span> <span class="c1"># Apply Fourier transformations.</span> <span class="n">real_part</span> <span class="o">=</span> <span class="n">inputs</span> <span class="n">im_part</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="n">x</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">fft2</span><span class="p">((</span><span class="n">real_part</span><span class="p">,</span> <span class="n">im_part</span><span
class="p">))[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># Add skip connection.</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">inputs</span> <span class="c1"># Apply layer normalization.</span> <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Apply feedforward network.</span> <span class="n">x_ffn</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ffn</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Add skip connection.</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">x_ffn</span> <span class="c1"># Apply layer normalization.</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> </code></pre></div> <h3 id="build-train-and-evaluate-the-fnet-model">Build, train, and evaluate the FNet model</h3> <p>Note that training the model with the current settings on a V100 GPU takes around 8 seconds per epoch.</p> <div class="codehilite"><pre><span></span><code><span class="n">fnet_blocks</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span><span class="n">FNetLayer</span><span class="p">(</span><span class="n">embedding_dim</span><span class="p">,</span> <span class="n">dropout_rate</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_blocks</span><span
class="p">)]</span> <span class="p">)</span> <span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.001</span> <span class="n">fnet_classifier</span> <span class="o">=</span> <span class="n">build_classifier</span><span class="p">(</span><span class="n">fnet_blocks</span><span class="p">,</span> <span class="n">positional_encoding</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="n">history</span> <span class="o">=</span> <span class="n">run_experiment</span><span class="p">(</span><span class="n">fnet_classifier</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Test accuracy: 13.82% Test top 5 accuracy: 36.15% </code></pre></div> </div> <p>As shown in the <a href="https://arxiv.org/abs/2105.03824">FNet</a> paper, better results can be achieved by increasing the embedding dimensions, increasing the number of FNet blocks, and training the model for longer. You may also try to increase the size of the input images and use different patch sizes. The FNet scales very efficiently to long inputs, runs much faster than attention-based Transformer models, and produces competitive accuracy results.</p> <hr /> <h2 id="the-gmlp-model">The gMLP model</h2> <p>The gMLP is an MLP architecture that features a Spatial Gating Unit (SGU).
The SGU enables cross-patch interactions across the spatial (patch) dimension by:</p> <ol> <li>Transforming the input spatially by applying a linear projection across patches (independently per channel).</li> <li>Applying element-wise multiplication of the input and its spatial transformation.</li> </ol> <h3 id="implement-the-gmlp-module">Implement the gMLP module</h3> <div class="codehilite"><pre><span></span><code><span class="k">class</span><span class="w"> </span><span class="nc">gMLPLayer</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">Layer</span><span class="p">):</span> <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_patches</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="n">dropout_rate</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">channel_projection1</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">embedding_dim</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span
class="s2">&quot;gelu&quot;</span><span class="p">),</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">),</span> <span class="p">]</span> <span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">channel_projection2</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">embedding_dim</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">spatial_projection</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span> <span class="n">units</span><span class="o">=</span><span class="n">num_patches</span><span class="p">,</span> <span class="n">bias_initializer</span><span class="o">=</span><span class="s2">&quot;Ones&quot;</span> <span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize1</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">LayerNormalization</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">)</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize2</span> <span class="o">=</span> <span class="n">layers</span><span class="o">.</span><span class="n">LayerNormalization</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">)</span> <span class="k">def</span><span class="w"> </span><span class="nf">spatial_gating_unit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span 
class="n">x</span><span class="p">):</span> <span class="c1"># Split x along the channel dimension.</span> <span class="c1"># Tensors u and v will be of shape [batch_size, num_patches, embedding_dim].</span> <span class="n">u</span><span class="p">,</span> <span class="n">v</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">indices_or_sections</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># Apply layer normalization.</span> <span class="n">v</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize2</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="c1"># Apply spatial projection.</span> <span class="n">v_channels</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">axes</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="n">v_projected</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">spatial_projection</span><span class="p">(</span><span class="n">v_channels</span><span class="p">)</span> <span class="n">v_projected</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">ops</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span 
class="n">v_projected</span><span class="p">,</span> <span class="n">axes</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="c1"># Apply element-wise multiplication.</span> <span class="k">return</span> <span class="n">u</span> <span class="o">*</span> <span class="n">v_projected</span> <span class="k">def</span><span class="w"> </span><span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span> <span class="c1"># Apply layer normalization.</span> <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">normalize1</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="c1"># Apply the first channel projection. x_projected shape: [batch_size, num_patches, embedding_dim * 2].</span> <span class="n">x_projected</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">channel_projection1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Apply the spatial gating unit. x_spatial shape: [batch_size, num_patches, embedding_dim].</span> <span class="n">x_spatial</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">spatial_gating_unit</span><span class="p">(</span><span class="n">x_projected</span><span class="p">)</span> <span class="c1"># Apply the second channel projection. 
x_projected shape: [batch_size, num_patches, embedding_dim].</span> <span class="n">x_projected</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">channel_projection2</span><span class="p">(</span><span class="n">x_spatial</span><span class="p">)</span> <span class="c1"># Add skip connection.</span> <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">x_projected</span> </code></pre></div> <h3 id="build-train-and-evaluate-the-gmlp-model">Build, train, and evaluate the gMLP model</h3> <p>Note that training the model with the current settings on a V100 GPU takes around 9 seconds per epoch.</p> <div class="codehilite"><pre><span></span><code><span class="n">gmlp_blocks</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span> <span class="p">[</span><span class="n">gMLPLayer</span><span class="p">(</span><span class="n">num_patches</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="n">dropout_rate</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_blocks</span><span class="p">)]</span> <span class="p">)</span> <span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.003</span> <span class="n">gmlp_classifier</span> <span class="o">=</span> <span class="n">build_classifier</span><span class="p">(</span><span class="n">gmlp_blocks</span><span class="p">)</span> <span class="n">history</span> <span class="o">=</span> <span class="n">run_experiment</span><span class="p">(</span><span class="n">gmlp_classifier</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Test accuracy: 17.05% Test top 5 accuracy: 42.57% 
</code></pre></div> </div> <p>As shown in the <a href="https://arxiv.org/abs/2105.08050">gMLP</a> paper, better results can be achieved by increasing the embedding dimensions, increasing the number of gMLP blocks, and training the model for longer. You may also try to increase the size of the input images and use different patch sizes. Note that the paper used advanced regularization strategies, such as MixUp and CutMix, as well as AutoAugment.</p> </div> <div class='k-outline'> <div class='k-outline-depth-1'> <a href='#image-classification-with-modern-mlp-models'>Image classification with modern MLP models</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#introduction'>Introduction</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#setup'>Setup</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#prepare-the-data'>Prepare the data</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#configure-the-hyperparameters'>Configure the hyperparameters</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#build-a-classification-model'>Build a classification model</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#define-an-experiment'>Define an experiment</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#use-data-augmentation'>Use data augmentation</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#implement-patch-extraction-as-a-layer'>Implement patch extraction as a layer</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#implement-position-embedding-as-a-layer'>Implement position embedding as a layer</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#the-mlpmixer-model'>The MLP-Mixer model</a> </div> <div class='k-outline-depth-3'> <a href='#implement-the-mlpmixer-module'>Implement the MLP-Mixer module</a> </div> <div class='k-outline-depth-3'> <a href='#build-train-and-evaluate-the-mlpmixer-model'>Build, train, and evaluate the MLP-Mixer model</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#the-fnet-model'>The FNet model</a> 
</div> <div class='k-outline-depth-3'> <a href='#implement-the-fnet-module'>Implement the FNet module</a> </div> <div class='k-outline-depth-3'> <a href='#build-train-and-evaluate-the-fnet-model'>Build, train, and evaluate the FNet model</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#the-gmlp-model'>The gMLP model</a> </div> <div class='k-outline-depth-3'> <a href='#implement-the-gmlp-module'>Implement the gMLP module</a> </div> <div class='k-outline-depth-3'> <a href='#build-train-and-evaluate-the-gmlp-model'>Build, train, and evaluate the gMLP model</a> </div> </div> </div> </div> </div> </body> <footer style="float: left; width: 100%; padding: 1em; border-top: solid 1px #bbb;"> <a href="https://policies.google.com/terms">Terms</a> | <a href="https://policies.google.com/privacy">Privacy</a> </footer> </html>
