CINXE.COM

<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="description" content="Keras documentation"> <meta name="author" content="Keras Team"> <link rel="shortcut icon" href="https://keras.io/img/favicon.ico"> <link rel="canonical" href="https://keras.io/examples/vision/simsiam/" />  <meta property="og:title" content="Keras documentation: Self-supervised contrastive learning with SimSiam"> <meta property="og:image" content="https://keras.io/img/logo-k-keras-wb.png"> <meta name="twitter:title" content="Keras documentation: Self-supervised contrastive learning with SimSiam"> <meta name="twitter:image" content="https://keras.io/img/k-keras-social.png"> <meta name="twitter:card" content="summary"> <title>Self-supervised contrastive learning with SimSiam</title>  <link href="/css/bootstrap.min.css" rel="stylesheet">  <link href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700;800&display=swap" rel="stylesheet">  <link href="/css/docs.css" rel="stylesheet"> <link href="/css/monokai.css" rel="stylesheet">  <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-5DNGF4N'); </script> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-175165319-128', 'auto'); ga('send', 'pageview'); </script>  <script async defer src="https://buttons.github.io/buttons.js"></script> </head> <body>  <noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-5DNGF4N" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>  <div class='k-page'> <div class="k-nav" id="nav-menu"> <a href='/'><img src='/img/logo-small.png' class='logo-small' /></a> <div class="nav flex-column nav-pills" role="tablist" aria-orientation="vertical"> <a class="nav-link" href="/about/" role="tab" aria-selected="">About Keras</a> <a class="nav-link" href="/getting_started/" role="tab" aria-selected="">Getting started</a> <a class="nav-link" href="/guides/" role="tab" aria-selected="">Developer guides</a> <a class="nav-link active" href="/examples/" role="tab" aria-selected="">Code examples</a> <a class="nav-sublink active" href="/examples/vision/">Computer Vision</a> <a class="nav-sublink2" href="/examples/vision/image_classification_from_scratch/">Image classification from scratch</a> <a class="nav-sublink2" href="/examples/vision/mnist_convnet/">Simple MNIST convnet</a> <a class="nav-sublink2" href="/examples/vision/image_classification_efficientnet_fine_tuning/">Image classification via fine-tuning with EfficientNet</a> <a class="nav-sublink2" href="/examples/vision/image_classification_with_vision_transformer/">Image classification with Vision Transformer</a> <a class="nav-sublink2" href="/examples/vision/attention_mil_classification/">Classification using Attention-based Deep Multiple Instance Learning</a> <a class="nav-sublink2" href="/examples/vision/mlp_image_classification/">Image classification with modern MLP models</a> <a class="nav-sublink2" href="/examples/vision/mobilevit/">A mobile-friendly Transformer-based model for image classification</a> <a class="nav-sublink2" href="/examples/vision/xray_classification_with_tpus/">Pneumonia Classification on TPU</a> <a class="nav-sublink2" href="/examples/vision/cct/">Compact Convolutional Transformers</a> <a class="nav-sublink2" href="/examples/vision/convmixer/">Image classification with ConvMixer</a> <a class="nav-sublink2" href="/examples/vision/eanet/">Image classification with EANet (External Attention Transformer)</a> <a class="nav-sublink2" href="/examples/vision/involution/">Involutional neural networks</a> <a class="nav-sublink2" href="/examples/vision/perceiver_image_classification/">Image classification with Perceiver</a> <a class="nav-sublink2" href="/examples/vision/reptile/">Few-Shot learning with Reptile</a> <a class="nav-sublink2" href="/examples/vision/semisupervised_simclr/">Semi-supervised image classification using contrastive pretraining with SimCLR</a> <a class="nav-sublink2" href="/examples/vision/swin_transformers/">Image classification with Swin Transformers</a> <a class="nav-sublink2" href="/examples/vision/vit_small_ds/">Train a Vision Transformer on small datasets</a> <a class="nav-sublink2" href="/examples/vision/shiftvit/">A Vision Transformer without Attention</a> <a class="nav-sublink2" href="/examples/vision/image_classification_using_global_context_vision_transformer/">Image Classification using Global Context Vision Transformer</a> <a class="nav-sublink2" href="/examples/vision/temporal_latent_bottleneck/">When Recurrence meets Transformers</a> <a class="nav-sublink2" href="/examples/vision/oxford_pets_image_segmentation/">Image segmentation with a U-Net-like architecture</a> <a class="nav-sublink2" href="/examples/vision/deeplabv3_plus/">Multiclass semantic segmentation using DeepLabV3+</a> <a class="nav-sublink2" href="/examples/vision/basnet_segmentation/">Highly accurate boundaries segmentation using BASNet</a> <a class="nav-sublink2" href="/examples/vision/fully_convolutional_network/">Image Segmentation using Composable Fully-Convolutional Networks</a> <a class="nav-sublink2" href="/examples/vision/retinanet/">Object Detection with RetinaNet</a> <a class="nav-sublink2" href="/examples/vision/keypoint_detection/">Keypoint Detection with Transfer Learning</a> <a class="nav-sublink2" href="/examples/vision/object_detection_using_vision_transformer/">Object detection with Vision Transformers</a> <a class="nav-sublink2" href="/examples/vision/3D_image_classification/">3D image classification from CT scans</a> <a class="nav-sublink2" href="/examples/vision/depth_estimation/">Monocular depth estimation</a> <a class="nav-sublink2" href="/examples/vision/nerf/">3D volumetric rendering with NeRF</a> <a class="nav-sublink2" href="/examples/vision/pointnet_segmentation/">Point cloud segmentation with PointNet</a> <a class="nav-sublink2" href="/examples/vision/pointnet/">Point cloud classification</a> <a class="nav-sublink2" href="/examples/vision/captcha_ocr/">OCR model for reading Captchas</a> <a class="nav-sublink2" href="/examples/vision/handwriting_recognition/">Handwriting recognition</a> <a class="nav-sublink2" href="/examples/vision/autoencoder/">Convolutional autoencoder for image denoising</a> <a class="nav-sublink2" href="/examples/vision/mirnet/">Low-light image enhancement using MIRNet</a> <a class="nav-sublink2" href="/examples/vision/super_resolution_sub_pixel/">Image Super-Resolution using an Efficient Sub-Pixel CNN</a> <a class="nav-sublink2" href="/examples/vision/edsr/">Enhanced Deep Residual Networks for single-image super-resolution</a> <a class="nav-sublink2" href="/examples/vision/zero_dce/">Zero-DCE for low-light image enhancement</a> <a class="nav-sublink2" href="/examples/vision/cutmix/">CutMix data augmentation for image classification</a> <a class="nav-sublink2" href="/examples/vision/mixup/">MixUp augmentation for image classification</a> <a class="nav-sublink2" href="/examples/vision/randaugment/">RandAugment for Image Classification for Improved Robustness</a> <a class="nav-sublink2" href="/examples/vision/image_captioning/">Image captioning</a> <a class="nav-sublink2" href="/examples/vision/nl_image_search/">Natural language image search with a Dual Encoder</a> <a class="nav-sublink2" href="/examples/vision/visualizing_what_convnets_learn/">Visualizing what convnets learn</a> <a class="nav-sublink2" href="/examples/vision/integrated_gradients/">Model interpretability with Integrated Gradients</a> <a class="nav-sublink2" href="/examples/vision/probing_vits/">Investigating Vision Transformer representations</a> <a class="nav-sublink2" href="/examples/vision/grad_cam/">Grad-CAM class activation visualization</a> <a class="nav-sublink2" href="/examples/vision/near_dup_search/">Near-duplicate image search</a> <a class="nav-sublink2" href="/examples/vision/semantic_image_clustering/">Semantic Image Clustering</a> <a class="nav-sublink2" href="/examples/vision/siamese_contrastive/">Image similarity estimation using a Siamese Network with a contrastive loss</a> <a class="nav-sublink2" href="/examples/vision/siamese_network/">Image similarity estimation using a Siamese Network with a triplet loss</a> <a class="nav-sublink2" href="/examples/vision/metric_learning/">Metric learning for image similarity search</a> <a class="nav-sublink2" href="/examples/vision/metric_learning_tf_similarity/">Metric learning for image similarity search using TensorFlow Similarity</a> <a class="nav-sublink2" href="/examples/vision/nnclr/">Self-supervised contrastive learning with NNCLR</a> <a class="nav-sublink2" href="/examples/vision/video_classification/">Video Classification with a CNN-RNN Architecture</a> <a class="nav-sublink2" href="/examples/vision/conv_lstm/">Next-Frame Video Prediction with Convolutional LSTMs</a> <a class="nav-sublink2" href="/examples/vision/video_transformers/">Video Classification with Transformers</a> <a class="nav-sublink2" href="/examples/vision/vivit/">Video Vision Transformer</a> <a class="nav-sublink2" href="/examples/vision/bit/">Image Classification using BigTransfer (BiT)</a> <a class="nav-sublink2" href="/examples/vision/gradient_centralization/">Gradient Centralization for Better Training Performance</a> <a class="nav-sublink2" href="/examples/vision/token_learner/">Learning to tokenize in Vision Transformers</a> <a class="nav-sublink2" href="/examples/vision/knowledge_distillation/">Knowledge Distillation</a> <a class="nav-sublink2" href="/examples/vision/fixres/">FixRes: Fixing train-test resolution discrepancy</a> <a class="nav-sublink2" href="/examples/vision/cait/">Class Attention Image Transformers with LayerScale</a> <a class="nav-sublink2" href="/examples/vision/patch_convnet/">Augmenting convnets with aggregated attention</a> <a class="nav-sublink2" href="/examples/vision/learnable_resizer/">Learning to Resize</a> <a class="nav-sublink2" href="/examples/vision/adamatch/">Semi-supervision and domain adaptation with AdaMatch</a> <a class="nav-sublink2" href="/examples/vision/barlow_twins/">Barlow Twins for Contrastive SSL</a> <a class="nav-sublink2" href="/examples/vision/consistency_training/">Consistency training with supervision</a> <a class="nav-sublink2" href="/examples/vision/deit/">Distilling Vision Transformers</a> <a class="nav-sublink2" href="/examples/vision/focal_modulation_network/">Focal Modulation: A replacement for Self-Attention</a> <a class="nav-sublink2" href="/examples/vision/forwardforward/">Using the Forward-Forward Algorithm for Image Classification</a> <a class="nav-sublink2" href="/examples/vision/masked_image_modeling/">Masked image modeling with Autoencoders</a> <a class="nav-sublink2" href="/examples/vision/sam/">Segment Anything Model with 🤗Transformers</a> <a class="nav-sublink2" href="/examples/vision/segformer/">Semantic segmentation with SegFormer and Hugging Face Transformers</a> <a class="nav-sublink2 active" href="/examples/vision/simsiam/">Self-supervised contrastive learning with SimSiam</a> <a class="nav-sublink2" href="/examples/vision/supervised-contrastive-learning/">Supervised Contrastive Learning</a> <a class="nav-sublink2" href="/examples/vision/yolov8/">Efficient Object Detection with YOLOV8 and KerasCV</a> <a class="nav-sublink" href="/examples/nlp/">Natural Language Processing</a> <a class="nav-sublink" href="/examples/structured_data/">Structured Data</a> <a class="nav-sublink" href="/examples/timeseries/">Timeseries</a> <a class="nav-sublink" href="/examples/generative/">Generative Deep Learning</a> <a class="nav-sublink" href="/examples/audio/">Audio Data</a> <a class="nav-sublink" href="/examples/rl/">Reinforcement Learning</a> <a class="nav-sublink" href="/examples/graph/">Graph Data</a> <a class="nav-sublink" href="/examples/keras_recipes/">Quick Keras Recipes</a> <a class="nav-link" href="/api/" role="tab" aria-selected="">Keras 3 API documentation</a> <a class="nav-link" href="/2.18/api/" role="tab" aria-selected="">Keras 2 API documentation</a> <a class="nav-link" href="/keras_tuner/" role="tab" aria-selected="">KerasTuner: Hyperparam Tuning</a> <a class="nav-link" href="/keras_hub/" role="tab" aria-selected="">KerasHub: Pretrained Models</a> </div> </div> <div class='k-main'> <div class='k-main-top'> <script> function displayDropdownMenu() { e = document.getElementById("nav-menu"); if (e.style.display == "block") { e.style.display = "none"; } else { e.style.display = "block"; document.getElementById("dropdown-nav").style.display = "block"; } } function resetMobileUI() { if (window.innerWidth <= 840) { document.getElementById("nav-menu").style.display = "none"; document.getElementById("dropdown-nav").style.display = "block"; } else { document.getElementById("nav-menu").style.display = "block"; document.getElementById("dropdown-nav").style.display = "none"; } var navmenu = document.getElementById("nav-menu"); var menuheight = navmenu.clientHeight; var kmain = document.getElementById("k-main-id"); kmain.style.minHeight = (menuheight + 100) + 'px'; } window.onresize = resetMobileUI; window.addEventListener("load", (event) => { resetMobileUI() }); </script> <div id='dropdown-nav' onclick="displayDropdownMenu();"> <svg viewBox="-20 -20 120 120" width="60" height="60"> <rect width="100" height="20"></rect> <rect y="30" width="100" height="20"></rect> <rect y="60" width="100" height="20"></rect> </svg> </div> <form class="bd-search d-flex align-items-center k-search-form" id="search-form"> <input type="search" class="k-search-input" id="search-input" placeholder="Search Keras documentation..." aria-label="Search Keras documentation..." autocomplete="off"> <button class="k-search-btn"> <svg width="13" height="13" viewBox="0 0 13 13"><title>search</title><path d="m4.8495 7.8226c0.82666 0 1.5262-0.29146 2.0985-0.87438 0.57232-0.58292 0.86378-1.2877 0.87438-2.1144 0.010599-0.82666-0.28086-1.5262-0.87438-2.0985-0.59352-0.57232-1.293-0.86378-2.0985-0.87438-0.8055-0.010599-1.5103 0.28086-2.1144 0.87438-0.60414 0.59352-0.8956 1.293-0.87438 2.0985 0.021197 0.8055 0.31266 1.5103 0.87438 2.1144 0.56172 0.60414 1.2665 0.8956 2.1144 0.87438zm4.4695 0.2115 3.681 3.6819-1.259 1.284-3.6817-3.7 0.0019784-0.69479-0.090043-0.098846c-0.87973 0.76087-1.92 1.1413-3.1207 1.1413-1.3553 0-2.5025-0.46363-3.4417-1.3909s-1.4088-2.0686-1.4088-3.4239c0-1.3553 0.4696-2.4966 1.4088-3.4239 0.9392-0.92727 2.0864-1.3969 3.4417-1.4088 1.3553-0.011889 2.4906 0.45771 3.406 1.4088 0.9154 0.95107 1.379 2.0924 1.3909 3.4239 0 1.2126-0.38043 2.2588-1.1413 3.1385l0.098834 0.090049z"></path></svg> </button> </form> <script> var form = document.getElementById('search-form'); form.onsubmit = function(e) { e.preventDefault(); var query = document.getElementById('search-input').value; window.location.href = '/search.html?query=' + query; return False } </script> </div> <div class='k-main-inner' id='k-main-id'> <div class='k-location-slug'> ► <a href='/examples/'>Code examples</a> / <a href='/examples/vision/'>Computer Vision</a> / Self-supervised contrastive learning with SimSiam </div> <div class='k-content'> <h1 id="selfsupervised-contrastive-learning-with-simsiam">Self-supervised contrastive learning with SimSiam</h1> Author: <a href="https://twitter.com/RisingSayak">Sayak Paul</a> Date created: 2021/03/19 Last modified: 2023/12/29 Description: Implementation of a self-supervised learning method for computer vision. <div class='example_version_banner keras_2'>ⓘ This example uses Keras 2</div> <img class="k-inline-icon" src="https://colab.research.google.com/img/colab_favicon.ico"/> <a href="https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/simsiam.ipynb">View in Colab</a> •<img class="k-inline-icon" src="https://github.com/favicon.ico"/> <a href="https://github.com/keras-team/keras-io/blob/master/examples/vision/simsiam.py">GitHub source</a> Self-supervised learning (SSL) is an interesting branch of study in the field of representation learning. SSL systems try to formulate a supervised signal from a corpus of unlabeled data points. An example is we train a deep neural network to predict the next word from a given set of words. In literature, these tasks are known as pretext tasks or auxiliary tasks. If we <a href="https://arxiv.org/abs/1801.06146">train such a network</a> on a huge dataset (such as the <a href="https://www.corpusdata.org/wikipedia.asp">Wikipedia text corpus</a>) it learns very effective representations that transfer well to downstream tasks. Language models like <a href="https://arxiv.org/abs/1810.04805">BERT</a>, <a href="https://arxiv.org/abs/2005.14165">GPT-3</a>, <a href="https://allennlp.org/elmo">ELMo</a> all benefit from this. Much like the language models we can train computer vision models using similar approaches. To make things work in computer vision, we need to formulate the learning tasks such that the underlying model (a deep neural network) is able to make sense of the semantic information present in vision data. One such task is to a model to contrast between two different versions of the same image. The hope is that in this way the model will have learn representations where the similar images are grouped as together possible while the dissimilar images are further away. In this example, we will be implementing one such system called SimSiam proposed in <a href="https://arxiv.org/abs/2011.10566">Exploring Simple Siamese Representation Learning</a>. It is implemented as the following: <ol> <li>We create two different versions of the same dataset with a stochastic data augmentation pipeline. Note that the random initialization seed needs to be the same during create these versions.</li> <li>We take a ResNet without any classification head (backbone) and we add a shallow fully-connected network (projection head) on top of it. Collectively, this is known as the encoder.</li> <li>We pass the output of the encoder through a predictor which is again a shallow fully-connected network having an <a href="https://en.wikipedia.org/wiki/Autoencoder">AutoEncoder</a> like structure.</li> <li>We then train our encoder to maximize the cosine similarity between the two different versions of our dataset.</li> </ol> <hr /> <h2 id="setup">Setup</h2> <div class="codehilite"><pre><code>import os os.environ["KERAS_BACKEND"] = "tensorflow" import keras import keras_cv from keras import ops import matplotlib.pyplot as plt import numpy as np </code></pre></div> <hr /> <h2 id="define-hyperparameters">Define hyperparameters</h2> <div class="codehilite"><pre><code>AUTO = tf.data.AUTOTUNE BATCH_SIZE = 128 EPOCHS = 5 CROP_TO = 32 SEED = 26 PROJECT_DIM = 2048 LATENT_DIM = 512 WEIGHT_DECAY = 0.0005 </code></pre></div> <hr /> <h2 id="load-the-cifar10-dataset">Load the CIFAR-10 dataset</h2> <div class="codehilite"><pre><code>(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data() print(f"Total training examples: {len(x_train)}") print(f"Total test examples: {len(x_test)}") </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Total training examples: 50000 Total test examples: 10000 </code></pre></div> </div> <hr /> <h2 id="defining-our-data-augmentation-pipeline">Defining our data augmentation pipeline</h2> As studied in <a href="https://arxiv.org/abs/2002.05709">SimCLR</a> having the right data augmentation pipeline is critical for SSL systems to work effectively in computer vision. Two particular augmentation transforms that seem to matter the most are: 1.) Random resized crops and 2.) Color distortions. Most of the other SSL systems for computer vision (such as <a href="https://arxiv.org/abs/2006.07733">BYOL</a>, <a href="https://arxiv.org/abs/2003.04297">MoCoV2</a>, <a href="https://arxiv.org/abs/2006.09882">SwAV</a>, etc.) include these in their training pipelines. <div class="codehilite"><pre><code>strength = [0.4, 0.4, 0.4, 0.1] random_flip = layers.RandomFlip(mode="horizontal_and_vertical") random_crop = layers.RandomCrop(CROP_TO, CROP_TO) random_brightness = layers.RandomBrightness(0.8 * strength[0]) random_contrast = layers.RandomContrast((1 - 0.8 * strength[1], 1 + 0.8 * strength[1])) random_saturation = keras_cv.layers.RandomSaturation( (0.5 - 0.8 * strength[2], 0.5 + 0.8 * strength[2]) ) random_hue = keras_cv.layers.RandomHue(0.2 * strength[3], [0,255]) grayscale = keras_cv.layers.Grayscale() def flip_random_crop(image): # With random crops we also apply horizontal flipping. image = random_flip(image) image = random_crop(image) return image def color_jitter(x, strength=[0.4, 0.4, 0.3, 0.1]): x = random_brightness(x) x = random_contrast(x) x = random_saturation(x) x = random_hue(x) # Affine transformations can disturb the natural range of # RGB images, hence this is needed. x = ops.clip(x, 0, 255) return x def color_drop(x): x = grayscale(x) x = ops.tile(x, [1, 1, 3]) return x def random_apply(func, x, p): if keras.random.uniform([], minval=0, maxval=1) < p: return func(x) else: return x def custom_augment(image): # As discussed in the SimCLR paper, the series of augmentation # transformations (except for random crops) need to be applied # randomly to impose translational invariance. image = flip_random_crop(image) image = random_apply(color_jitter, image, p=0.8) image = random_apply(color_drop, image, p=0.2) return image </code></pre></div> It should be noted that an augmentation pipeline is generally dependent on various properties of the dataset we are dealing with. For example, if images in the dataset are heavily object-centric then taking random crops with a very high probability may hurt the training performance. Let's now apply our augmentation pipeline to our dataset and visualize a few outputs. <hr /> <h2 id="convert-the-data-into-tensorflow-dataset-objects">Convert the data into TensorFlow <code>Dataset</code> objects</h2> Here we create two different versions of our dataset without any ground-truth labels. <div class="codehilite"><pre><code>ssl_ds_one = tf.data.Dataset.from_tensor_slices(x_train) ssl_ds_one = ( ssl_ds_one.shuffle(1024, seed=SEED) .map(custom_augment, num_parallel_calls=AUTO) .batch(BATCH_SIZE) .prefetch(AUTO) ) ssl_ds_two = tf.data.Dataset.from_tensor_slices(x_train) ssl_ds_two = ( ssl_ds_two.shuffle(1024, seed=SEED) .map(custom_augment, num_parallel_calls=AUTO) .batch(BATCH_SIZE) .prefetch(AUTO) ) # We then zip both of these datasets. ssl_ds = tf.data.Dataset.zip((ssl_ds_one, ssl_ds_two)) # Visualize a few augmented images. sample_images_one = next(iter(ssl_ds_one)) plt.figure(figsize=(10, 10)) for n in range(25): ax = plt.subplot(5, 5, n + 1) plt.imshow(sample_images_one[n].numpy().astype("int")) plt.axis("off") plt.show() # Ensure that the different versions of the dataset actually contain # identical images. sample_images_two = next(iter(ssl_ds_two)) plt.figure(figsize=(10, 10)) for n in range(25): ax = plt.subplot(5, 5, n + 1) plt.imshow(sample_images_two[n].numpy().astype("int")) plt.axis("off") plt.show() </code></pre></div> <img alt="png" src="/img/examples/vision/simsiam/simsiam_12_0.png" /> <img alt="png" src="/img/examples/vision/simsiam/simsiam_12_1.png" /> Notice that the images in <code>samples_images_one</code> and <code>sample_images_two</code> are essentially the same but are augmented differently. <hr /> <h2 id="defining-the-encoder-and-the-predictor">Defining the encoder and the predictor</h2> We use an implementation of ResNet20 that is specifically configured for the CIFAR10 dataset. The code is taken from the <a href="https://github.com/GoogleCloudPlatform/keras-idiomatic-programmer/blob/master/zoo/resnet/resnet_cifar10_v2.py">keras-idiomatic-programmer</a> repository. The hyperparameters of these architectures have been referred from Section 3 and Appendix A of <a href="https://arxiv.org/abs/2011.10566">the original paper</a>. <div class="codehilite"><pre><code>!wget -q https://git.io/JYx2x -O resnet_cifar10_v2.py </code></pre></div> <div class="codehilite"><pre><code>import resnet_cifar10_v2 N = 2 DEPTH = N * 9 + 2 NUM_BLOCKS = ((DEPTH - 2) // 9) - 1 def get_encoder(): # Input and backbone. inputs = layers.Input((CROP_TO, CROP_TO, 3)) x = layers.Rescaling(scale=1.0 / 127.5, offset=-1)( inputs ) x = resnet_cifar10_v2.stem(x) x = resnet_cifar10_v2.learner(x, NUM_BLOCKS) x = layers.GlobalAveragePooling2D(name="backbone_pool")(x) # Projection head. x = layers.Dense( PROJECT_DIM, use_bias=False, kernel_regularizer=regularizers.l2(WEIGHT_DECAY) )(x) x = layers.BatchNormalization()(x) x = layers.ReLU()(x) x = layers.Dense( PROJECT_DIM, use_bias=False, kernel_regularizer=regularizers.l2(WEIGHT_DECAY) )(x) outputs = layers.BatchNormalization()(x) return keras.Model(inputs, outputs, name="encoder") def get_predictor(): model = keras.Sequential( [ # Note the AutoEncoder-like structure. layers.Input((PROJECT_DIM,)), layers.Dense( LATENT_DIM, use_bias=False, kernel_regularizer=regularizers.l2(WEIGHT_DECAY), ), layers.ReLU(), layers.BatchNormalization(), layers.Dense(PROJECT_DIM), ], name="predictor", ) return model </code></pre></div> <hr /> <h2 id="defining-the-pretraining-loop">Defining the (pre-)training loop</h2> One of the main reasons behind training networks with these kinds of approaches is to utilize the learned representations for downstream tasks like classification. This is why this particular training phase is also referred to as pre-training. We start by defining the loss function. <div class="codehilite"><pre><code>def compute_loss(p, z): # The authors of SimSiam emphasize the impact of # the `stop_gradient` operator in the paper as it # has an important role in the overall optimization. z = ops.stop_gradient(z) p = keras.utils.normalize(p, axis=1, order=2) z = keras.utils.normalize(z, axis=1, order=2) # Negative cosine similarity (minimizing this is # equivalent to maximizing the similarity). return -ops.mean(ops.sum((p * z), axis=1)) </code></pre></div> We then define our training loop by overriding the <code>train_step()</code> function of the <a href="/api/models/model#model-class"><code>keras.Model</code></a> class. <div class="codehilite"><pre><code>class SimSiam(keras.Model): def __init__(self, encoder, predictor): super().__init__() self.encoder = encoder self.predictor = predictor self.loss_tracker = keras.metrics.Mean(name="loss") @property def metrics(self): return [self.loss_tracker] def train_step(self, data): # Unpack the data. ds_one, ds_two = data # Forward pass through the encoder and predictor. with tf.GradientTape() as tape: z1, z2 = self.encoder(ds_one), self.encoder(ds_two) p1, p2 = self.predictor(z1), self.predictor(z2) # Note that here we are enforcing the network to match # the representations of two differently augmented batches # of data. loss = compute_loss(p1, z2) / 2 + compute_loss(p2, z1) / 2 # Compute gradients and update the parameters. learnable_params = ( self.encoder.trainable_variables + self.predictor.trainable_variables ) gradients = tape.gradient(loss, learnable_params) self.optimizer.apply_gradients(zip(gradients, learnable_params)) # Monitor loss. self.loss_tracker.update_state(loss) return {"loss": self.loss_tracker.result()} </code></pre></div> <hr /> <h2 id="pretraining-our-networks">Pre-training our networks</h2> In the interest of this example, we will train the model for only 5 epochs. In reality, this should at least be 100 epochs. <div class="codehilite"><pre><code># Create a cosine decay learning scheduler. num_training_samples = len(x_train) steps = EPOCHS * (num_training_samples // BATCH_SIZE) lr_decayed_fn = keras.optimizers.schedules.CosineDecay( initial_learning_rate=0.03, decay_steps=steps ) # Create an early stopping callback. early_stopping = keras.callbacks.EarlyStopping( monitor="loss", patience=5, restore_best_weights=True ) # Compile model and start training. simsiam = SimSiam(get_encoder(), get_predictor()) simsiam.compile(optimizer=keras.optimizers.SGD(lr_decayed_fn, momentum=0.6)) history = simsiam.fit(ssl_ds, epochs=EPOCHS, callbacks=[early_stopping]) # Visualize the training progress of the model. plt.plot(history.history["loss"]) plt.grid() plt.title("Negative Cosine Similairty") plt.show() </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Epoch 1/5 391/391 [==============================] - 33s 42ms/step - loss: -0.8973 Epoch 2/5 391/391 [==============================] - 16s 40ms/step - loss: -0.9129 Epoch 3/5 391/391 [==============================] - 16s 40ms/step - loss: -0.9165 Epoch 4/5 391/391 [==============================] - 16s 40ms/step - loss: -0.9176 Epoch 5/5 391/391 [==============================] - 16s 40ms/step - loss: -0.9182 </code></pre></div> </div> <img alt="png" src="/img/examples/vision/simsiam/simsiam_22_1.png" /> If your solution gets very close to -1 (minimum value of our loss) very quickly with a different dataset and a different backbone architecture that is likely because of representation collapse. It is a phenomenon where the encoder yields similar output for all the images. In that case additional hyperparameter tuning is required especially in the following areas: <ul> <li>Strength of the color distortions and their probabilities.</li> <li>Learning rate and its schedule.</li> <li>Architecture of both the backbone and their projection head.</li> </ul> <hr /> <h2 id="evaluating-our-ssl-method">Evaluating our SSL method</h2> The most popularly used method to evaluate a SSL method in computer vision (or any other pre-training method as such) is to learn a linear classifier on the frozen features of the trained backbone model (in this case it is ResNet20) and evaluate the classifier on unseen images. Other methods include <a href="https://keras.io/guides/transfer_learning/">fine-tuning</a> on the source dataset or even a target dataset with 5% or 10% labels present. Practically, we can use the backbone model for any downstream task such as semantic segmentation, object detection, and so on where the backbone models are usually pre-trained with pure supervised learning. <div class="codehilite"><pre><code># We first create labeled `Dataset` objects. train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)) test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)) # Then we shuffle, batch, and prefetch this dataset for performance. We # also apply random resized crops as an augmentation but only to the # training set. train_ds = ( train_ds.shuffle(1024) .map(lambda x, y: (flip_random_crop(x), y), num_parallel_calls=AUTO) .batch(BATCH_SIZE) .prefetch(AUTO) ) test_ds = test_ds.batch(BATCH_SIZE).prefetch(AUTO) # Extract the backbone ResNet20. backbone = keras.Model( simsiam.encoder.input, simsiam.encoder.get_layer("backbone_pool").output ) # We then create our linear classifier and train it. backbone.trainable = False inputs = layers.Input((CROP_TO, CROP_TO, 3)) x = backbone(inputs, training=False) outputs = layers.Dense(10, activation="softmax")(x) linear_model = keras.Model(inputs, outputs, name="linear_model") # Compile model and start training. linear_model.compile( loss="sparse_categorical_crossentropy", metrics=["accuracy"], optimizer=keras.optimizers.SGD(lr_decayed_fn, momentum=0.9), ) history = linear_model.fit( train_ds, validation_data=test_ds, epochs=EPOCHS, callbacks=[early_stopping] ) _, test_acc = linear_model.evaluate(test_ds) print("Test accuracy: {:.2f}%".format(test_acc * 100)) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Epoch 1/5 391/391 [==============================] - 7s 11ms/step - loss: 3.8072 - accuracy: 0.1527 - val_loss: 3.7449 - val_accuracy: 0.2046 Epoch 2/5 391/391 [==============================] - 3s 8ms/step - loss: 3.7356 - accuracy: 0.2107 - val_loss: 3.7055 - val_accuracy: 0.2308 Epoch 3/5 391/391 [==============================] - 3s 8ms/step - loss: 3.7036 - accuracy: 0.2228 - val_loss: 3.6874 - val_accuracy: 0.2329 Epoch 4/5 391/391 [==============================] - 3s 8ms/step - loss: 3.6893 - accuracy: 0.2276 - val_loss: 3.6808 - val_accuracy: 0.2334 Epoch 5/5 391/391 [==============================] - 3s 9ms/step - loss: 3.6845 - accuracy: 0.2305 - val_loss: 3.6798 - val_accuracy: 0.2339 79/79 [==============================] - 1s 7ms/step - loss: 3.6798 - accuracy: 0.2339 Test accuracy: 23.39% </code></pre></div> </div> <hr /> <h2 id="notes">Notes</h2> <ul> <li>More data and longer pre-training schedule benefit SSL in general.</li> <li>SSL is particularly very helpful when you do not have access to very limited labeled training data but you can manage to build a large corpus of unlabeled data. Recently, using an SSL method called <a href="https://arxiv.org/abs/2006.09882">SwAV</a>, a group of researchers at Facebook trained a <a href="https://arxiv.org/abs/2006.09882">RegNet</a> on 2 Billion images. They were able to achieve downstream performance very close to those achieved by pure supervised pre-training. For some downstream tasks, their method even outperformed the supervised counterparts. You can check out <a href="https://arxiv.org/pdf/2103.01988.pdf">their paper</a> to know the details.</li> <li>If you are interested to understand why contrastive SSL helps networks learn meaningful representations, you can check out the following resources:<ul> <li><a href="https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/">Self-supervised learning: The dark matter of intelligence</a></li> <li><a href="https://sslneuips20.github.io/files/CameraReadys%203-77/64/CameraReady/Understanding_self_supervised_learning.pdf">Understanding self-supervised learning using controlled datasets with known structure</a></li> </ul> </li> </ul> </div> <div class='k-outline'> <div class='k-outline-depth-1'> <a href='#selfsupervised-contrastive-learning-with-simsiam'>Self-supervised contrastive learning with SimSiam</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#setup'>Setup</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#define-hyperparameters'>Define hyperparameters</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#load-the-cifar10-dataset'>Load the CIFAR-10 dataset</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#defining-our-data-augmentation-pipeline'>Defining our data augmentation pipeline</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#convert-the-data-into-tensorflow-dataset-objects'>Convert the data into TensorFlow <code>Dataset</code> objects</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#defining-the-encoder-and-the-predictor'>Defining the encoder and the predictor</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#defining-the-pretraining-loop'>Defining the (pre-)training loop</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#pretraining-our-networks'>Pre-training our networks</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#evaluating-our-ssl-method'>Evaluating our SSL method</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#notes'>Notes</a> </div> </div> </div> </div> </div> </body> <footer style="float: left; width: 100%; padding: 1em; border-top: solid 1px #bbb;"> <a href="https://policies.google.com/terms">Terms</a> | <a href="https://policies.google.com/privacy">Privacy</a> </footer> </html>