CINXE.COM

<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="description" content="Keras documentation"> <meta name="author" content="Keras Team"> <link rel="shortcut icon" href="https://keras.io/img/favicon.ico"> <link rel="canonical" href="https://keras.io/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/" />  <meta property="og:title" content="Keras documentation: Parameter-efficient fine-tuning of GPT-2 with LoRA"> <meta property="og:image" content="https://keras.io/img/logo-k-keras-wb.png"> <meta name="twitter:title" content="Keras documentation: Parameter-efficient fine-tuning of GPT-2 with LoRA"> <meta name="twitter:image" content="https://keras.io/img/k-keras-social.png"> <meta name="twitter:card" content="summary"> <title>Parameter-efficient fine-tuning of GPT-2 with LoRA</title>  <link href="/css/bootstrap.min.css" rel="stylesheet">  <link href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700;800&display=swap" rel="stylesheet">  <link href="/css/docs.css" rel="stylesheet"> <link href="/css/monokai.css" rel="stylesheet">  <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-5DNGF4N'); </script> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-175165319-128', 'auto'); ga('send', 'pageview'); </script>  <script async defer src="https://buttons.github.io/buttons.js"></script> </head> <body>  <noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-5DNGF4N" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>  <div class='k-page'> <div class="k-nav" id="nav-menu"> <a href='/'><img src='/img/logo-small.png' class='logo-small' /></a> <div class="nav flex-column nav-pills" role="tablist" aria-orientation="vertical"> <a class="nav-link" href="/about/" role="tab" aria-selected="">About Keras</a> <a class="nav-link" href="/getting_started/" role="tab" aria-selected="">Getting started</a> <a class="nav-link" href="/guides/" role="tab" aria-selected="">Developer guides</a> <a class="nav-link" href="/api/" role="tab" aria-selected="">Keras 3 API documentation</a> <a class="nav-link" href="/2.18/api/" role="tab" aria-selected="">Keras 2 API documentation</a> <a class="nav-link active" href="/examples/" role="tab" aria-selected="">Code examples</a> <a class="nav-sublink" href="/examples/vision/">Computer Vision</a> <a class="nav-sublink active" href="/examples/nlp/">Natural Language Processing</a> <a class="nav-sublink2" href="/examples/nlp/text_classification_from_scratch/">Text classification from scratch</a> <a class="nav-sublink2" href="/examples/nlp/active_learning_review_classification/">Review Classification using Active Learning</a> <a class="nav-sublink2" href="/examples/nlp/fnet_classification_with_keras_hub/">Text Classification using FNet</a> <a class="nav-sublink2" href="/examples/nlp/multi_label_classification/">Large-scale multi-label text classification</a> <a class="nav-sublink2" href="/examples/nlp/text_classification_with_transformer/">Text classification with Transformer</a> <a class="nav-sublink2" href="/examples/nlp/text_classification_with_switch_transformer/">Text classification with Switch Transformer</a> <a class="nav-sublink2" href="/examples/nlp/tweet-classification-using-tfdf/">Text classification using Decision Forests and pretrained embeddings</a> <a class="nav-sublink2" href="/examples/nlp/pretrained_word_embeddings/">Using pre-trained word embeddings</a> <a class="nav-sublink2" href="/examples/nlp/bidirectional_lstm_imdb/">Bidirectional LSTM on IMDB</a> <a class="nav-sublink2" href="/examples/nlp/data_parallel_training_with_keras_hub/">Data Parallel Training with KerasHub and tf.distribute</a> <a class="nav-sublink2" href="/examples/nlp/neural_machine_translation_with_keras_hub/">English-to-Spanish translation with KerasHub</a> <a class="nav-sublink2" href="/examples/nlp/neural_machine_translation_with_transformer/">English-to-Spanish translation with a sequence-to-sequence Transformer</a> <a class="nav-sublink2" href="/examples/nlp/lstm_seq2seq/">Character-level recurrent sequence-to-sequence model</a> <a class="nav-sublink2" href="/examples/nlp/multimodal_entailment/">Multimodal entailment</a> <a class="nav-sublink2" href="/examples/nlp/ner_transformers/">Named Entity Recognition using Transformers</a> <a class="nav-sublink2" href="/examples/nlp/text_extraction_with_bert/">Text Extraction with BERT</a> <a class="nav-sublink2" href="/examples/nlp/addition_rnn/">Sequence to sequence learning for performing number addition</a> <a class="nav-sublink2" href="/examples/nlp/semantic_similarity_with_keras_hub/">Semantic Similarity with KerasHub</a> <a class="nav-sublink2" href="/examples/nlp/semantic_similarity_with_bert/">Semantic Similarity with BERT</a> <a class="nav-sublink2" href="/examples/nlp/sentence_embeddings_with_sbert/">Sentence embeddings using Siamese RoBERTa-networks</a> <a class="nav-sublink2" href="/examples/nlp/masked_language_modeling/">End-to-end Masked Language Modeling with BERT</a> <a class="nav-sublink2" href="/examples/nlp/abstractive_summarization_with_bart/">Abstractive Text Summarization with BART</a> <a class="nav-sublink2" href="/examples/nlp/pretraining_BERT/">Pretraining BERT with Hugging Face Transformers</a> <a class="nav-sublink2 active" href="/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/">Parameter-efficient fine-tuning of GPT-2 with LoRA</a> <a class="nav-sublink2" href="/examples/nlp/mlm_training_tpus/">Training a language model from scratch with 🤗 Transformers and TPUs</a> <a class="nav-sublink2" href="/examples/nlp/multiple_choice_task_with_transfer_learning/">MultipleChoice Task with Transfer Learning</a> <a class="nav-sublink2" href="/examples/nlp/question_answering/">Question Answering with Hugging Face Transformers</a> <a class="nav-sublink2" href="/examples/nlp/t5_hf_summarization/">Abstractive Summarization with Hugging Face Transformers</a> <a class="nav-sublink" href="/examples/structured_data/">Structured Data</a> <a class="nav-sublink" href="/examples/timeseries/">Timeseries</a> <a class="nav-sublink" href="/examples/generative/">Generative Deep Learning</a> <a class="nav-sublink" href="/examples/audio/">Audio Data</a> <a class="nav-sublink" href="/examples/rl/">Reinforcement Learning</a> <a class="nav-sublink" href="/examples/graph/">Graph Data</a> <a class="nav-sublink" href="/examples/keras_recipes/">Quick Keras Recipes</a> <a class="nav-link" href="/keras_tuner/" role="tab" aria-selected="">KerasTuner: Hyperparameter Tuning</a> <a class="nav-link" href="/keras_hub/" role="tab" aria-selected="">KerasHub: Pretrained Models</a> <a class="nav-link" href="/keras_cv/" role="tab" aria-selected="">KerasCV: Computer Vision Workflows</a> <a class="nav-link" href="/keras_nlp/" role="tab" aria-selected="">KerasNLP: Natural Language Workflows</a> </div> </div> <div class='k-main'> <div class='k-main-top'> <script> function displayDropdownMenu() { e = document.getElementById("nav-menu"); if (e.style.display == "block") { e.style.display = "none"; } else { e.style.display = "block"; document.getElementById("dropdown-nav").style.display = "block"; } } function resetMobileUI() { if (window.innerWidth <= 840) { document.getElementById("nav-menu").style.display = "none"; document.getElementById("dropdown-nav").style.display = "block"; } else { document.getElementById("nav-menu").style.display = "block"; document.getElementById("dropdown-nav").style.display = "none"; } var navmenu = document.getElementById("nav-menu"); var menuheight = navmenu.clientHeight; var kmain = document.getElementById("k-main-id"); kmain.style.minHeight = (menuheight + 100) + 'px'; } window.onresize = resetMobileUI; window.addEventListener("load", (event) => { resetMobileUI() }); </script> <div id='dropdown-nav' onclick="displayDropdownMenu();"> <svg viewBox="-20 -20 120 120" width="60" height="60"> <rect width="100" height="20"></rect> <rect y="30" width="100" height="20"></rect> <rect y="60" width="100" height="20"></rect> </svg> </div> <form class="bd-search d-flex align-items-center k-search-form" id="search-form"> <input type="search" class="k-search-input" id="search-input" placeholder="Search Keras documentation..." aria-label="Search Keras documentation..." autocomplete="off"> <button class="k-search-btn"> <svg width="13" height="13" viewBox="0 0 13 13"><title>search</title><path d="m4.8495 7.8226c0.82666 0 1.5262-0.29146 2.0985-0.87438 0.57232-0.58292 0.86378-1.2877 0.87438-2.1144 0.010599-0.82666-0.28086-1.5262-0.87438-2.0985-0.59352-0.57232-1.293-0.86378-2.0985-0.87438-0.8055-0.010599-1.5103 0.28086-2.1144 0.87438-0.60414 0.59352-0.8956 1.293-0.87438 2.0985 0.021197 0.8055 0.31266 1.5103 0.87438 2.1144 0.56172 0.60414 1.2665 0.8956 2.1144 0.87438zm4.4695 0.2115 3.681 3.6819-1.259 1.284-3.6817-3.7 0.0019784-0.69479-0.090043-0.098846c-0.87973 0.76087-1.92 1.1413-3.1207 1.1413-1.3553 0-2.5025-0.46363-3.4417-1.3909s-1.4088-2.0686-1.4088-3.4239c0-1.3553 0.4696-2.4966 1.4088-3.4239 0.9392-0.92727 2.0864-1.3969 3.4417-1.4088 1.3553-0.011889 2.4906 0.45771 3.406 1.4088 0.9154 0.95107 1.379 2.0924 1.3909 3.4239 0 1.2126-0.38043 2.2588-1.1413 3.1385l0.098834 0.090049z"></path></svg> </button> </form> <script> var form = document.getElementById('search-form'); form.onsubmit = function(e) { e.preventDefault(); var query = document.getElementById('search-input').value; window.location.href = '/search.html?query=' + query; return False } </script> </div> <div class='k-main-inner' id='k-main-id'> <div class='k-location-slug'> ► <a href='/examples/'>Code examples</a> / <a href='/examples/nlp/'>Natural Language Processing</a> / Parameter-efficient fine-tuning of GPT-2 with LoRA </div> <div class='k-content'> <h1 id="parameterefficient-finetuning-of-gpt2-with-lora">Parameter-efficient fine-tuning of GPT-2 with LoRA</h1> Author: <a href="https://github.com/abheesht17/">Abheesht Sharma</a>, <a href="https://github.com/mattdangerw/">Matthew Watson</a> Date created: 2023/05/27 Last modified: 2023/05/27 Description: Use KerasHub to fine-tune a GPT-2 LLM with LoRA. <div class='example_version_banner keras_3'>ⓘ This example uses Keras 3</div> <img class="k-inline-icon" src="https://colab.research.google.com/img/colab_favicon.ico"/> <a href="https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/parameter_efficient_finetuning_of_gpt2_with_lora.ipynb">View in Colab</a> •<img class="k-inline-icon" src="https://github.com/favicon.ico"/> <a href="https://github.com/keras-team/keras-io/blob/master/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora.py">GitHub source</a> <hr /> <h2 id="introduction">Introduction</h2> Large Language Models (LLMs) have been shown to be effective at a variety of NLP tasks. An LLM is first pre-trained on a large corpus of text in a self-supervised fashion. Pre-training helps LLMs learn general-purpose knowledge, such as statistical relationships between words. An LLM can then be fine-tuned on a downstream task of interest (such as sentiment analysis). However, LLMs are extremely large in size, and we don't need to train all the parameters in the model while fine-tuning, especially because datasets on which the model is fine-tuned are relatively small. Another way of saying this is that LLMs are over-parametrized for fine-tuning. This is where <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (LoRA)</a> comes in; it significantly reduces the number of trainable parameters. This results in a decrease in training time and GPU memory usage, while maintaining the quality of the outputs. In this example, we will explain LoRA in technical terms, show how the technical explanation translates to code, hack KerasHub's <a href="https://keras.io/api/keras_hub/models/gpt2/">GPT-2 model</a> and fine-tune it on the next token prediction task using LoRA. We will compare LoRA GPT-2 with a fully fine-tuned GPT-2 in terms of the quality of the generated text, training time and GPU memory usage. Note: This example runs on the TensorFlow backend purely for the <a href="https://www.tensorflow.org/api_docs/python/tf/config/experimental/get_memory_info"><code>tf.config.experimental.get_memory_info</code></a> API to easily plot memory usage. Outside of the memory usage callback, this example will run on <code>jax</code> and <code>torch</code> backends. <hr /> <h2 id="setup">Setup</h2> Before we start implementing the pipeline, let's install and import all the libraries we need. We'll be using the KerasHub library. Secondly, let's enable mixed precision training. This will help us reduce the training time. <div class="codehilite"><pre><code>!pip install -q --upgrade keras-hub !pip install -q --upgrade keras # Upgrade to Keras 3. </code></pre></div> <div class="codehilite"><pre><code>import os os.environ["KERAS_BACKEND"] = "tensorflow" import keras_hub import keras import matplotlib.pyplot as plt import tensorflow as tf import tensorflow_datasets as tfds import time keras.mixed_precision.set_global_policy("mixed_float16") </code></pre></div> Let's also define our hyperparameters. <div class="codehilite"><pre><code># General hyperparameters BATCH_SIZE = 32 NUM_BATCHES = 500 EPOCHS = 1 # Can be set to a higher value for better results MAX_SEQUENCE_LENGTH = 128 MAX_GENERATION_LENGTH = 200 GPT2_PRESET = "gpt2_base_en" # LoRA-specific hyperparameters RANK = 4 ALPHA = 32.0 </code></pre></div> <hr /> <h2 id="dataset">Dataset</h2> Let's load a Reddit dataset. We will fine-tune both the GPT-2 model and the LoRA GPT-2 model on a subset of this dataset. The aim is to produce text similar in style to Reddit posts. <div class="codehilite"><pre><code>reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True) </code></pre></div> The dataset has two fields: <code>document</code> and <code>title</code>. <div class="codehilite"><pre><code>for document, title in reddit_ds: print(document.numpy()) print(title.numpy()) break </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>b"me and a friend decided to go to the beach last sunday. we loaded up and headed out. we were about half way there when i decided that i was not leaving till i had seafood. \n\nnow i'm not talking about red lobster. no friends i'm talking about a low country boil. i found the restaurant and got directions. i don't know if any of you have heard about the crab shack on tybee island but let me tell you it's worth it. \n\nwe arrived and was seated quickly. we decided to get a seafood sampler for two and split it. the waitress bought it out on separate platters for us. the amount of food was staggering. two types of crab, shrimp, mussels, crawfish, andouille sausage, red potatoes, and corn on the cob. i managed to finish it and some of my friends crawfish and mussels. it was a day to be a fat ass. we finished paid for our food and headed to the beach. \n\nfunny thing about seafood. it runs through me faster than a kenyan \n\nwe arrived and walked around a bit. it was about 45min since we arrived at the beach when i felt a rumble from the depths of my stomach. i ignored it i didn't want my stomach to ruin our fun. i pushed down the feeling and continued. about 15min later the feeling was back and stronger than before. again i ignored it and continued. 5min later it felt like a nuclear reactor had just exploded in my stomach. i started running. i yelled to my friend to hurry the fuck up. \n\nrunning in sand is extremely hard if you did not know this. we got in his car and i yelled at him to floor it. my stomach was screaming and if he didn't hurry i was gonna have this baby in his car and it wasn't gonna be pretty. after a few red lights and me screaming like a woman in labor we made it to the store. \n\ni practically tore his car door open and ran inside. i ran to the bathroom opened the door and barely got my pants down before the dam burst and a flood of shit poured from my ass. \n\ni finished up when i felt something wet on my ass. i rubbed it thinking it was back splash. no, mass was covered in the after math of me abusing the toilet. i grabbed all the paper towels i could and gave my self a whores bath right there. \n\ni sprayed the bathroom down with the air freshener and left. an elderly lady walked in quickly and closed the door. i was just about to walk away when i heard gag. instead of walking i ran. i got to the car and told him to get the hell out of there." b'liking seafood' </code></pre></div> </div> We'll now batch the dataset and retain only the <code>document</code> field because we are fine-tuning the model on the next word prediction task. Take a subset of the dataset for the purpose of this example. <div class="codehilite"><pre><code>train_ds = ( reddit_ds.map(lambda document, _: document) .batch(BATCH_SIZE) .cache() .prefetch(tf.data.AUTOTUNE) ) train_ds = train_ds.take(NUM_BATCHES) </code></pre></div> <hr /> <h2 id="helper-functions">Helper functions</h2> Before we begin fine-tuning the models, let's define a few helper functions and classes. <h3 id="callback-for-tracking-gpu-memory-usage">Callback for tracking GPU memory usage</h3> We'll define a custom callback function which tracks GPU memory usage. The callback function uses TensorFlow's <a href="https://www.tensorflow.org/api_docs/python/tf/config/experimental/get_memory_info"><code>tf.config.experimental.get_memory_info</code></a> API. Here, we assume that we are using a single GPU, <code>GPU:0</code>. <div class="codehilite"><pre><code>class GPUMemoryCallback(keras.callbacks.Callback): def __init__( self, target_batches, print_stats=False, **kwargs, ): super().__init__(**kwargs) self.target_batches = target_batches self.print_stats = print_stats self.memory_usage = [] self.labels = [] def _compute_memory_usage(self): memory_stats = tf.config.experimental.get_memory_info("GPU:0") # Convert bytes to GB and store in list. peak_usage = round(memory_stats["peak"] / (2**30), 3) self.memory_usage.append(peak_usage) def on_epoch_begin(self, epoch, logs=None): self._compute_memory_usage() self.labels.append(f"epoch {epoch} start") def on_train_batch_begin(self, batch, logs=None): if batch in self.target_batches: self._compute_memory_usage() self.labels.append(f"batch {batch}") def on_epoch_end(self, epoch, logs=None): self._compute_memory_usage() self.labels.append(f"epoch {epoch} end") </code></pre></div> <h3 id="function-for-text-generation">Function for text generation</h3> Here is a helper function to generate text. <div class="codehilite"><pre><code>def generate_text(model, input_text, max_length=200): start = time.time() output = model.generate(input_text, max_length=max_length) print("\nOutput:") print(output) end = time.time() print(f"Total Time Elapsed: {end - start:.2f}s") </code></pre></div> <h3 id="define-optimizer-and-loss">Define optimizer and loss</h3> We will use AdamW optimizer and cross-entropy loss for training both models. <div class="codehilite"><pre><code>def get_optimizer_and_loss(): optimizer = keras.optimizers.AdamW( learning_rate=5e-5, weight_decay=0.01, epsilon=1e-6, global_clipnorm=1.0, # Gradient clipping. ) # Exclude layernorm and bias terms from weight decay. optimizer.exclude_from_weight_decay(var_names=["bias"]) optimizer.exclude_from_weight_decay(var_names=["gamma"]) optimizer.exclude_from_weight_decay(var_names=["beta"]) loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True) return optimizer, loss </code></pre></div> <hr /> <h2 id="finetune-gpt2">Fine-tune GPT-2</h2> Let's load the model and preprocessor first. We use a sequence length of 128 instead of 1024 (which is the default sequence length). This will limit our ability to predict long sequences, but will allow us to run this example quickly on Colab. <div class="codehilite"><pre><code>preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset( "gpt2_base_en", sequence_length=MAX_SEQUENCE_LENGTH, ) gpt2_lm = keras_hub.models.GPT2CausalLM.from_preset( "gpt2_base_en", preprocessor=preprocessor ) gpt2_lm.summary() </code></pre></div> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">Preprocessor: "gpt2_causal_lm_preprocessor" </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Tokenizer (type) ┃ Vocab # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ gpt2_tokenizer (GPT2Tokenizer) │ 50,257 │ └────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┘ </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">Model: "gpt2_causal_lm" </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ Connected to ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ padding_mask (InputLayer) │ (None, None) │ 0 │ - │ ├───────────────────────────────┼───────────────────────────┼─────────────┼────────────────────────────────┤ │ token_ids (InputLayer) │ (None, None) │ 0 │ - │ ├───────────────────────────────┼───────────────────────────┼─────────────┼────────────────────────────────┤ │ gpt2_backbone (GPT2Backbone) │ (None, None, 768) │ 124,439,808 │ padding_mask[0][0], │ │ │ │ │ token_ids[0][0] │ ├───────────────────────────────┼───────────────────────────┼─────────────┼────────────────────────────────┤ │ token_embedding │ (None, None, 50257) │ 38,597,376 │ gpt2_backbone[0][0] │ │ (ReversibleEmbedding) │ │ │ │ └───────────────────────────────┴───────────────────────────┴─────────────┴────────────────────────────────┘ </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"> Total params: 124,439,808 (474.70 MB) </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"> Trainable params: 124,439,808 (474.70 MB) </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"> Non-trainable params: 0 (0.00 B) </pre> Initialize the GPU memory tracker callback object, and compile the model. We use the Adam optimizer with a linearly decaying learning rate. <div class="codehilite"><pre><code>gpu_memory_callback = GPUMemoryCallback( target_batches=[5, 10, 25, 50, 100, 150, 200, 300, 400, 500], print_stats=True, ) optimizer, loss = get_optimizer_and_loss() gpt2_lm.compile( optimizer=optimizer, loss=loss, weighted_metrics=["accuracy"], ) </code></pre></div> We are all set to train the model! <div class="codehilite"><pre><code>gpt2_lm.fit(train_ds, epochs=EPOCHS, callbacks=[gpu_memory_callback]) gpt2_lm_memory_usage = gpu_memory_callback.memory_usage </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1701128462.076856 38706 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process. W0000 00:00:1701128462.146837 38706 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update 500/500 ━━━━━━━━━━━━━━━━━━━━ 114s 128ms/step - accuracy: 0.3183 - loss: 3.3682 </code></pre></div> </div> As a final step, let's generate some text. We will harness the power of XLA. The first call to <code>generate()</code> will be slow because of XLA compilation, but subsequent calls will be super-fast. :) <div class="codehilite"><pre><code>generate_text(gpt2_lm, "I like basketball", max_length=MAX_GENERATION_LENGTH) generate_text(gpt2_lm, "That Italian restaurant is", max_length=MAX_GENERATION_LENGTH) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Output: I like basketball, but this one actually happened a few months ago. </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>i was on my way to a party in the city when i noticed a group of guys were playing basketball. one of my friends, a guy named "jenny," was playing. jenny's mom, a very nice girl, was sitting on her couch. </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>jenny and jenny were sitting in a circle around her, and i started to play some of my favorite basketball games. i got to the end of the circle and jenny started to run. i didn't know how jenny was doing. she ran, but it Total Time Elapsed: 6.66s </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Output: That Italian restaurant is a bit of a mystery, because the place is closed. so i was at my friends house and i went to grab some food, so i got the usual pizza and some chicken, but it wasn't really the pizza, so i just grabbed my friend's pizza. i had a lot of chicken, but i was hungry, so i decided to grab a few of the other pizza's that were already in there. </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>i was eating the pizza with some friends and i was eating the pizza and then i got a knock on the door. </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>the guy in front of me is Total Time Elapsed: 0.22s </code></pre></div> </div> <hr /> <h2 id="lora-gpt2">LoRA GPT-2</h2> In this section, we discuss the technical details of LoRA, build a LoRA GPT-2 model, fine-tune it and generate text. <h3 id="what-exactly-is-lora">What exactly is LoRA?</h3> LoRA is a parameter-efficient fine-tuning technique for LLMs. It freezes the weights of the LLM, and injects trainable rank-decomposition matrices. Let's understand this more clearly. Assume we have an <code>n x n</code> pre-trained dense layer (or weight matrix), <code>W0</code>. We initialize two dense layers, <code>A</code> and <code>B</code>, of shapes <code>n x rank</code>, and <code>rank x n</code>, respectively. <code>rank</code> is much smaller than <code>n</code>. In the paper, values between 1 and 4 are shown to work well. <h4 id="lora-equation">LoRA equation</h4> The original equation is <code>output = W0x + b0</code>, where <code>x</code> is the input, <code>W0</code> and <code>b0</code> are the weight matrix and bias terms of the original dense layer (frozen). The LoRA equation is: <code>output = W0x + b0 + BAx</code>, where <code>A</code> and <code>B</code> are the rank-decomposition matrices. LoRA is based on the idea that updates to the weights of the pre-trained language model have a low "intrinsic rank" since pre-trained language models are over-parametrized. Predictive performance of full fine-tuning can be replicated even by constraining <code>W0</code>'s updates to low-rank decomposition matrices. <img src="https://i.imgur.com/f4TFqMi.png" alt="lora_diagram" height="250"/> <h4 id="number-of-trainable-parameters">Number of trainable parameters</h4> Let's do some quick math. Suppose <code>n</code> is 768, and <code>rank</code> is 4. <code>W0</code> has <code>768 x 768 = 589,824</code> parameters, whereas the LoRA layers, <code>A</code> and <code>B</code> together have <code>768 x 4 + 4 x 768 = 6,144</code> parameters. So, for the dense layer, we go from <code>589,824</code> trainable parameters to <code>6,144</code> trainable parameters! <h4 id="why-does-lora-reduce-memory-footprint">Why does LoRA reduce memory footprint?</h4> Even though the total number of parameters increase (since we are adding LoRA layers), the memory footprint reduces, because the number of trainable parameters reduces. Let's dive deeper into this. The memory usage of a model can be split into four parts: <ul> <li>Model memory: This is the memory required to store the model weights. This will be slightly higher for LoRA than GPT-2.</li> <li>Forward pass memory: This mostly depends on batch size, sequence length, etc. We keep this constant for both models for a fair comparison.</li> <li>Backward pass memory: This is the memory required to store the gradients. Note that the gradients are computed only for the trainable parameters.</li> <li>Optimizer memory: This is the memory required to store the optimizer state. For example, the Adam optimizer stores the "1st moment vectors" and "2nd moment vectors" for the trainable parameters.</li> </ul> Since, with LoRA, there is a huge reduction in the number of trainable parameters, the optimizer memory and the memory required to store the gradients for LoRA is much less than GPT-2. This is where most of the memory savings happen. <h4 id="why-is-lora-so-popular">Why is LoRA so popular?</h4> <ul> <li>Reduces GPU memory usage;</li> <li>Faster training; and</li> <li>No additional inference latency.</li> </ul> <h3 id="create-lora-layer">Create LoRA layer</h3> According to the technical description above, let's create a LoRA layer. In a transformer model, the LoRA layer is created and injected for the query and value projection matrices. In <a href="/api/layers/attention_layers/multi_head_attention#multiheadattention-class"><code>keras.layers.MultiHeadAttention</code></a>, the query/value projection layers are <a href="/api/layers/core_layers/einsum_dense#einsumdense-class"><code>keras.layers.EinsumDense</code></a> layers. <div class="codehilite"><pre><code>import math class LoraLayer(keras.layers.Layer): def __init__( self, original_layer, rank=8, alpha=32, trainable=False, **kwargs, ): # We want to keep the name of this layer the same as the original # dense layer. original_layer_config = original_layer.get_config() name = original_layer_config["name"] kwargs.pop("name", None) super().__init__(name=name, trainable=trainable, **kwargs) self.rank = rank self.alpha = alpha self._scale = alpha / rank self._num_heads = original_layer_config["output_shape"][-2] self._hidden_dim = self._num_heads * original_layer_config["output_shape"][-1] # Layers. # Original dense layer. self.original_layer = original_layer # No matter whether we are training the model or are in inference mode, # this layer should be frozen. self.original_layer.trainable = False # LoRA dense layers. self.A = keras.layers.Dense( units=rank, use_bias=False, # Note: the original paper mentions that normal distribution was # used for initialization. However, the official LoRA implementation # uses "Kaiming/He Initialization". kernel_initializer=keras.initializers.VarianceScaling( scale=math.sqrt(5), mode="fan_in", distribution="uniform" ), trainable=trainable, name=f"lora_A", ) # B has the same `equation` and `output_shape` as the original layer. # `equation = abc,cde->abde`, where `a`: batch size, `b`: sequence # length, `c`: `hidden_dim`, `d`: `num_heads`, # `e`: `hidden_dim//num_heads`. The only difference is that in layer `B`, # `c` represents `rank`. self.B = keras.layers.EinsumDense( equation=original_layer_config["equation"], output_shape=original_layer_config["output_shape"], kernel_initializer="zeros", trainable=trainable, name=f"lora_B", ) def call(self, inputs): original_output = self.original_layer(inputs) if self.trainable: # If we are fine-tuning the model, we will add LoRA layers' output # to the original layer's output. lora_output = self.B(self.A(inputs)) * self._scale return original_output + lora_output # If we are in inference mode, we "merge" the LoRA layers' weights into # the original layer's weights - more on this in the text generation # section! return original_output </code></pre></div> <h3 id="inject-lora-layer-into-the-model">Inject LoRA layer into the model</h3> We will now hack the original GPT-2 model and inject LoRA layers into it. Let's do a couple of things before doing that: <ul> <li>Delete previous model;</li> <li>Reset "peak" GPU memory usage using <a href="https://www.tensorflow.org/api_docs/python/tf/config/experimental/reset_memory_stats"><code>tf.config.experimental.reset_memory_stats</code></a>;</li> <li>Load a new GPT-2 model.</li> </ul> <div class="codehilite"><pre><code>del gpt2_lm del optimizer del loss # This resets "peak" memory usage to "current" memory usage. tf.config.experimental.reset_memory_stats("GPU:0") # Load the original model. preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset( "gpt2_base_en", sequence_length=128, ) lora_model = keras_hub.models.GPT2CausalLM.from_preset( "gpt2_base_en", preprocessor=preprocessor, ) </code></pre></div> We will now override the original query/value projection matrices with our new LoRA layers. <div class="codehilite"><pre><code>for layer_idx in range(lora_model.backbone.num_layers): # Change query dense layer. decoder_layer = lora_model.backbone.get_layer(f"transformer_layer_{layer_idx}") self_attention_layer = decoder_layer._self_attention_layer # Allow mutation to Keras layer state. self_attention_layer._tracker.locked = False # Change query dense layer. self_attention_layer._query_dense = LoraLayer( self_attention_layer._query_dense, rank=RANK, alpha=ALPHA, trainable=True, ) # Change value dense layer. self_attention_layer._value_dense = LoraLayer( self_attention_layer._value_dense, rank=RANK, alpha=ALPHA, trainable=True, ) </code></pre></div> Let's now do a forward pass to make sure we still have a valid chain of computation. <div class="codehilite"><pre><code>lora_model(preprocessor(["LoRA is very useful for quick LLM finetuning"])[0]) pass </code></pre></div> Freeze the entire LLM, only the LoRA layers should be trainable. <div class="codehilite"><pre><code>for layer in lora_model._flatten_layers(): lst_of_sublayers = list(layer._flatten_layers()) if len(lst_of_sublayers) == 1: # "leaves of the model" if layer.name in ["lora_A", "lora_B"]: layer.trainable = True else: layer.trainable = False </code></pre></div> Print the model's summary and see if the number of non-trainable parameters and total parameters are correct. In a previous section, we had calculated the number of parameters associated with the LoRA layers to be 6,144. The total trainable parameters in the model should be <code>num_layers * (query, value) * 6,144 = 12 * 2 * 6,144 = 147,456</code>. The number of non-trainable parameters should be the same as the total number of parameters in the original GPT-2 model, which is <code>124,439,808</code>. <div class="codehilite"><pre><code>lora_model.summary() </code></pre></div> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">Preprocessor: "gpt2_causal_lm_preprocessor_1" </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Tokenizer (type) ┃ Vocab # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ gpt2_tokenizer_1 (GPT2Tokenizer) │ 50,257 │ └────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┘ </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">Model: "gpt2_causal_lm_1" </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ Connected to ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ padding_mask (InputLayer) │ (None, None) │ 0 │ - │ ├───────────────────────────────┼───────────────────────────┼─────────────┼────────────────────────────────┤ │ token_ids (InputLayer) │ (None, None) │ 0 │ - │ ├───────────────────────────────┼───────────────────────────┼─────────────┼────────────────────────────────┤ │ gpt2_backbone_1 │ (None, None, 768) │ 124,587,264 │ padding_mask[0][0], │ │ (GPT2Backbone) │ │ │ token_ids[0][0] │ ├───────────────────────────────┼───────────────────────────┼─────────────┼────────────────────────────────┤ │ token_embedding │ (None, None, 50257) │ 38,597,376 │ gpt2_backbone_1[0][0] │ │ (ReversibleEmbedding) │ │ │ │ └───────────────────────────────┴───────────────────────────┴─────────────┴────────────────────────────────┘ </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"> Total params: 124,587,264 (475.26 MB) </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"> Trainable params: 147,456 (576.00 KB) </pre> <pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"> Non-trainable params: 124,439,808 (474.70 MB) </pre> <h3 id="finetune-lora-gpt2">Fine-tune LoRA GPT-2</h3> Now that we have hacked and verified the LoRA GPT-2 model, let's train it! <div class="codehilite"><pre><code>gpu_memory_callback = GPUMemoryCallback( target_batches=[5, 10, 25, 50, 100, 150, 200, 300, 400, 500], print_stats=True, ) optimizer, loss = get_optimizer_and_loss() lora_model.compile( optimizer=optimizer, loss=loss, weighted_metrics=["accuracy"], ) lora_model.fit( train_ds, epochs=EPOCHS, callbacks=[gpu_memory_callback], ) lora_model_memory_usage = gpu_memory_callback.memory_usage </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code> 2/500 [37m━━━━━━━━━━━━━━━━━━━━ 41s 84ms/step - accuracy: 0.2828 - loss: 3.7188 W0000 00:00:1701128576.353742 38699 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update 500/500 ━━━━━━━━━━━━━━━━━━━━ 80s 81ms/step - accuracy: 0.2930 - loss: 3.6158 </code></pre></div> </div> And we are done fine-tuning the model! Before we generate text, let's compare the training time and memory usage of the two models. The training time of GPT-2 on a 16 GB Tesla T4 (Colab) is 7 minutes, and for LoRA, it is 5 minutes, a 30% decrease. The memory usage of LoRA GPT-2 is roughly 35% times less than GPT-2. <div class="codehilite"><pre><code>plt.bar( ["GPT-2", "LoRA GPT-2"], [max(gpt2_lm_memory_usage), max(lora_model_memory_usage)], color=["red", "blue"], ) plt.xlabel("Time") plt.ylabel("GPU Memory Usage (in GB)") plt.title("GPU Memory Usage Comparison") plt.legend() plt.show() </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument. </code></pre></div> </div> <img alt="png" src="/img/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/parameter_efficient_finetuning_of_gpt2_with_lora_43_1.png" /> <h3 id="merge-weights-and-generate-text">Merge weights and generate text!</h3> One of the biggest advantages of LoRA over other adapter methods is that it does not incur any additional inference latency. Let's understand why. Recall our LoRA equation: <code>output = W0x + b0 + BAx</code>. We can rewrite this as: <code>output = = Wx + b0 = (W0 + BA)x + b0</code>, where <code>W = W0 + BA</code>. This means that if we merge the weights of the original model and the adapter, we will be essentially doing the same computation as the original model! <div class="codehilite"><pre><code>for layer_idx in range(lora_model.backbone.num_layers): self_attention_layer = lora_model.backbone.get_layer( f"transformer_layer_{layer_idx}" )._self_attention_layer # Merge query dense layer. query_lora_layer = self_attention_layer._query_dense A_weights = query_lora_layer.A.kernel # (768, 1) (a, b) B_weights = query_lora_layer.B.kernel # (1, 12, 64) (b, c, d) increment_weights = tf.einsum("ab,bcd->acd", A_weights, B_weights) * (ALPHA / RANK) query_lora_layer.original_layer.kernel.assign_add(increment_weights) # Merge value dense layer. value_lora_layer = self_attention_layer._value_dense A_weights = value_lora_layer.A.kernel # (768, 1) (a, b) B_weights = value_lora_layer.B.kernel # (1, 12, 64) (b, c, d) increment_weights = tf.einsum("ab,bcd->acd", A_weights, B_weights) * (ALPHA / RANK) value_lora_layer.original_layer.kernel.assign_add(increment_weights) # Put back in place the original layers with updated weights self_attention_layer._query_dense = query_lora_layer.original_layer self_attention_layer._value_dense = value_lora_layer.original_layer </code></pre></div> We are now all set to generate text with our LoRA model :). <div class="codehilite"><pre><code># Freezing weights not necessary during generation since no weights are updated. generate_text(lora_model, "I like basketball", max_length=MAX_GENERATION_LENGTH) generate_text( lora_model, "That Italian restaurant is", max_length=MAX_GENERATION_LENGTH ) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Output: I like basketball. i've played this game for about a week and i'm pretty tired. today, i'm playing with my friend, who is a really good player. i'm a little older than the average player and i'm a bit too young. Total Time Elapsed: 6.81s </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Output: That Italian restaurant is in the city center and is located on a street that was recently renovated for the summer. </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>i was in a group of friends and had a great time. </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Total Time Elapsed: 0.32s </code></pre></div> </div> And we're all done! </div> <div class='k-outline'> <div class='k-outline-depth-1'> <a href='#parameterefficient-finetuning-of-gpt2-with-lora'>Parameter-efficient fine-tuning of GPT-2 with LoRA</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#introduction'>Introduction</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#setup'>Setup</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#dataset'>Dataset</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#helper-functions'>Helper functions</a> </div> <div class='k-outline-depth-3'> <a href='#callback-for-tracking-gpu-memory-usage'>Callback for tracking GPU memory usage</a> </div> <div class='k-outline-depth-3'> <a href='#function-for-text-generation'>Function for text generation</a> </div> <div class='k-outline-depth-3'> <a href='#define-optimizer-and-loss'>Define optimizer and loss</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#finetune-gpt2'>Fine-tune GPT-2</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#lora-gpt2'>LoRA GPT-2</a> </div> <div class='k-outline-depth-3'> <a href='#what-exactly-is-lora'>What exactly is LoRA?</a> </div> <div class='k-outline-depth-3'> <a href='#create-lora-layer'>Create LoRA layer</a> </div> <div class='k-outline-depth-3'> <a href='#inject-lora-layer-into-the-model'>Inject LoRA layer into the model</a> </div> <div class='k-outline-depth-3'> <a href='#finetune-lora-gpt2'>Fine-tune LoRA GPT-2</a> </div> <div class='k-outline-depth-3'> <a href='#merge-weights-and-generate-text'>Merge weights and generate text!</a> </div> </div> </div> </div> </div> </body> <footer style="float: left; width: 100%; padding: 1em; border-top: solid 1px #bbb;"> <a href="https://policies.google.com/terms">Terms</a> | <a href="https://policies.google.com/privacy">Privacy</a> </footer> </html>