# Training a language model from scratch with 🤗 Transformers and TPUs
autocomplete="off"> <button class="k-search-btn"> <svg width="13" height="13" viewBox="0 0 13 13"><title>search</title><path d="m4.8495 7.8226c0.82666 0 1.5262-0.29146 2.0985-0.87438 0.57232-0.58292 0.86378-1.2877 0.87438-2.1144 0.010599-0.82666-0.28086-1.5262-0.87438-2.0985-0.59352-0.57232-1.293-0.86378-2.0985-0.87438-0.8055-0.010599-1.5103 0.28086-2.1144 0.87438-0.60414 0.59352-0.8956 1.293-0.87438 2.0985 0.021197 0.8055 0.31266 1.5103 0.87438 2.1144 0.56172 0.60414 1.2665 0.8956 2.1144 0.87438zm4.4695 0.2115 3.681 3.6819-1.259 1.284-3.6817-3.7 0.0019784-0.69479-0.090043-0.098846c-0.87973 0.76087-1.92 1.1413-3.1207 1.1413-1.3553 0-2.5025-0.46363-3.4417-1.3909s-1.4088-2.0686-1.4088-3.4239c0-1.3553 0.4696-2.4966 1.4088-3.4239 0.9392-0.92727 2.0864-1.3969 3.4417-1.4088 1.3553-0.011889 2.4906 0.45771 3.406 1.4088 0.9154 0.95107 1.379 2.0924 1.3909 3.4239 0 1.2126-0.38043 2.2588-1.1413 3.1385l0.098834 0.090049z"></path></svg> </button> </form> <script> var form = document.getElementById('search-form'); form.onsubmit = function(e) { e.preventDefault(); var query = document.getElementById('search-input').value; window.location.href = '/search.html?query=' + query; return False } </script> </div> <div class='k-main-inner' id='k-main-id'> <div class='k-location-slug'> <span class="k-location-slug-pointer">►</span> <a href='/examples/'>Code examples</a> / <a href='/examples/nlp/'>Natural Language Processing</a> / Training a language model from scratch with 🤗 Transformers and TPUs </div> <div class='k-content'> <h1 id="training-a-language-model-from-scratch-with-🤗-transformers-and-tpus">Training a language model from scratch with 🤗 Transformers and TPUs</h1> <p><strong>Authors:</strong> <a href="https://twitter.com/carrigmat">Matthew Carrigan</a>, <a href="https://twitter.com/RisingSayak">Sayak Paul</a><br> <strong>Date created:</strong> 2023/05/21<br> <strong>Last modified:</strong> 2023/05/21<br> <strong>Description:</strong> Train a masked language model on TPUs using 🤗 Transformers.</p> <div class='example_version_banner keras_2'>ⓘ This example uses Keras 2</div> <p><img class="k-inline-icon" src="https://colab.research.google.com/img/colab_favicon.ico"/> <a href="https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/mlm_training_tpus.ipynb"><strong>View in Colab</strong></a> <span class="k-dot">•</span><img class="k-inline-icon" src="https://github.com/favicon.ico"/> <a href="https://github.com/keras-team/keras-io/blob/master/examples/nlp/mlm_training_tpus.py"><strong>GitHub source</strong></a></p> <hr /> <h2 id="introduction">Introduction</h2> <p>In this example, we cover how to train a masked language model using TensorFlow, <a href="https://huggingface.co/transformers/index">🤗 Transformers</a>, and TPUs.</p> <p>TPU training is a useful skill to have: TPU pods are high-performance and extremely scalable, making it easy to train models at any scale from a few tens of millions of parameters up to truly enormous sizes: Google's PaLM model (over 500 billion parameters!) was trained entirely on TPU pods.</p> <p>We've previously written a <a href="https://huggingface.co/docs/transformers/main/perf_train_tpu_tf"><strong>tutorial</strong></a> and a <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb"><strong>Colab example</strong></a> showing small-scale TPU training with TensorFlow and introducing the core concepts you need to understand to get your model working on TPU. 
However, our Colab example doesn't contain all the steps needed to train a language model from scratch, such as training the tokenizer. So, we wanted to provide a consolidated example that walks you through every critical step involved.

As in our Colab example, we're taking advantage of TensorFlow's very clean TPU support via XLA and `TPUStrategy`. We'll also be benefiting from the fact that the majority of the TensorFlow models in 🤗 Transformers are fully [XLA-compatible](https://huggingface.co/blog/tf-xla-generate), so surprisingly little work is needed to get them to run on TPU.

This example is designed to be **scalable** and much closer to a realistic training run: although we only use a BERT-sized model by default, the code could be expanded to a much larger model and a much more powerful TPU pod slice by changing a few configuration options.

The following diagram gives you a pictorial overview of the steps involved in training a language model with 🤗 Transformers using TensorFlow and TPUs:

![https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tf_tpu/tf_tpu_steps.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tf_tpu/tf_tpu_steps.png)

*(Contents of this example overlap with [this blog post](https://huggingface.co/blog/tf_tpu).)*

---

## Data

We use the [WikiText dataset (v1)](https://huggingface.co/datasets/wikitext). You can head over to the [dataset page on the Hugging Face Hub](https://huggingface.co/datasets/wikitext) to explore the dataset.

![data_preview_wikitext](https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/data_preview_wikitext.png)

Since the dataset is already available on the Hub in a compatible format, we can easily load and interact with it using [🤗 datasets](https://hf.co/docs/datasets). However, training a language model from scratch also requires a separate tokenizer training step. We skip that part in this example for brevity, but here's a gist of what we can do to train a tokenizer from scratch (a minimal sketch is shown at the end of this section):

- Load the `train` split of WikiText using 🤗 datasets.
- Leverage [🤗 tokenizers](https://huggingface.co/docs/tokenizers/index) to train a [**Unigram model**](https://huggingface.co/course/chapter6/7?fw=pt).
- Upload the trained tokenizer to the Hub.

You can find the tokenizer training code [**here**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling-tpu#training-a-tokenizer) and the tokenizer [**here**](https://huggingface.co/tf-tpu/unigram-tokenizer-wikitext). The script also allows you to run it with [**any compatible dataset**](https://huggingface.co/datasets?task_ids=task_ids:language-modeling) from the Hub.
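To make the three steps above concrete, here is a minimal sketch of the tokenizer-training flow, assuming the 🤗 `datasets` and `tokenizers` libraries. The vocabulary size, special tokens, dataset config name, and the Hub repo name are illustrative choices, not the exact settings of the linked script or the hosted tokenizer:

```python
# A minimal sketch of training a Unigram tokenizer on WikiText and pushing it
# to the Hub. Vocab size, special tokens, and the repo name are illustrative.
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Load the `train` split of WikiText.
raw_dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Train a Unigram model on batches of raw text.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
    vocab_size=25_000,
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "<pad>", "<unk>"],
    unk_token="<unk>",
)


def batch_iterator(batch_size=1000):
    for i in range(0, len(raw_dataset), batch_size):
        yield raw_dataset[i : i + batch_size]["text"]


tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

# Wrap it so it can be used with (and pushed to the Hub via) 🤗 Transformers.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
    pad_token="<pad>",
    unk_token="<unk>",
)
hf_tokenizer.push_to_hub("your-username/unigram-tokenizer-wikitext")  # hypothetical repo
```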
---

## Tokenizing the data and creating TFRecords

Once the tokenizer is trained, we can use it on all the dataset splits (`train`, `validation`, and `test` in this case) and create TFRecord shards out of them. Having the data splits spread across multiple TFRecord shards helps with massively parallel processing, as opposed to having each split in a single TFRecord file.

We tokenize the samples individually. We then take a batch of samples, concatenate them together, and split them into several chunks of a fixed size (128 in our case). We follow this strategy rather than tokenizing a batch of samples with a fixed length to avoid aggressively discarding text content (because of truncation).

We then take these tokenized samples in batches and serialize those batches as multiple TFRecord shards, where the total dataset length and individual shard size determine the number of shards. Finally, these shards are pushed to a [Google Cloud Storage (GCS) bucket](https://cloud.google.com/storage/docs/json_api/v1/buckets).

If you're using a TPU node for training, then the data needs to be streamed from a GCS bucket since the node host memory is very small. But for TPU VMs, we can use datasets locally or even attach persistent storage to those VMs. Since TPU nodes (which is what we have in a Colab) are still quite heavily used, we based our example on using a GCS bucket for data storage.

You can see all of this in code in [this script](https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py). For convenience, we have also hosted the resultant TFRecord shards in [this repository](https://huggingface.co/datasets/tf-tpu/wikitext-v1-tfrecords) on the Hub. A minimal sketch of this concatenate-chunk-serialize flow is shown below.
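The sketch below is not the linked `prepare_tfrecord_shards.py` script; it is a simplified illustration under the assumptions that `chunk_size` and `shard_size` are configurable, that `tokenizer` is the trained tokenizer from the previous section, and that special-token handling is omitted (the hosted shards also contain `[CLS]`/`[SEP]` separators between documents, as you can see in the decoded batch later on):

```python
import tensorflow as tf

chunk_size = 128  # fixed chunk size; must match the sequence length used when decoding
shard_size = 10_000  # illustrative number of examples per TFRecord shard


def tokenize_and_chunk(texts, tokenizer):
    # Tokenize each sample without truncation, then concatenate all token ids
    # and re-split them into fixed-size chunks so no text is discarded.
    all_ids = []
    for text in texts:
        all_ids.extend(tokenizer(text)["input_ids"])
    return [
        all_ids[i : i + chunk_size]
        for i in range(0, len(all_ids) - chunk_size + 1, chunk_size)
    ]


def to_example(input_ids):
    # Serialize one chunk as a tf.train.Example with the same features the
    # `decode_fn` used later expects: `input_ids` and `attention_mask`.
    features = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "attention_mask": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[1] * len(input_ids))
        ),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))


def write_shard(chunks, path):
    # One shard per call. In the real script the filename encodes the sample
    # count so that it can be recovered later (see `count_samples()` below).
    with tf.io.TFRecordWriter(path) as writer:
        for chunk in chunks:
            writer.write(to_example(chunk).SerializeToString())
```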
class="o">.</span><span class="n">distribute</span><span class="o">.</span><span class="n">TPUStrategy</span><span class="p">(</span><span class="n">tpu</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Available number of replicas: </span><span class="si">{</span><span class="n">strategy</span><span class="o">.</span><span class="n">num_replicas_in_sync</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Available number of replicas: 8 </code></pre></div> </div> <p>We then load the tokenizer. For more details on the tokenizer, check out <a href="https://huggingface.co/tf-tpu/unigram-tokenizer-wikitext">its repository</a>. For the model, we use RoBERTa (the base variant), introduced in <a href="https://arxiv.org/abs/1907.11692">this paper</a>.</p> <h3 id="initialize-the-tokenizer">Initialize the tokenizer</h3> <div class="codehilite"><pre><span></span><code><span class="n">tokenizer</span> <span class="o">=</span> <span class="s2">"tf-tpu/unigram-tokenizer-wikitext"</span> <span class="n">pretrained_model_config</span> <span class="o">=</span> <span class="s2">"roberta-base"</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">transformers</span><span class="o">.</span><span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">)</span> <span class="n">config</span> <span class="o">=</span> <span class="n">transformers</span><span class="o">.</span><span class="n">AutoConfig</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_model_config</span><span class="p">)</span> <span class="n">config</span><span class="o">.</span><span class="n">vocab_size</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">vocab_size</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Downloading (…)okenizer_config.json: 0%| | 0.00/483 [00:00<?, ?B/s] Downloading (…)/main/tokenizer.json: 0%| | 0.00/1.61M [00:00<?, ?B/s] Downloading (…)cial_tokens_map.json: 0%| | 0.00/286 [00:00<?, ?B/s] Downloading (…)lve/main/config.json: 0%| | 0.00/481 [00:00<?, ?B/s] </code></pre></div> </div> <h3 id="prepare-the-datasets">Prepare the datasets</h3> <p>We now load the TFRecord shards of the WikiText dataset (which the Hugging Face team prepared beforehand for this example):</p> <div class="codehilite"><pre><span></span><code><span class="n">train_dataset_path</span> <span class="o">=</span> <span class="s2">"gs://tf-tpu-training-resources/train"</span> <span class="n">eval_dataset_path</span> <span class="o">=</span> <span class="s2">"gs://tf-tpu-training-resources/validation"</span> <span class="n">training_records</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">gfile</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">train_dataset_path</span><span class="p">,</span> <span class="s2">"*.tfrecord"</span><span 
class="p">))</span> <span class="n">eval_records</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">gfile</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">eval_dataset_path</span><span class="p">,</span> <span class="s2">"*.tfrecord"</span><span class="p">))</span> </code></pre></div> <p>Now, we will write a utility to count the number of training samples we have. We need to know this value in order properly initialize our optimizer later:</p> <div class="codehilite"><pre><span></span><code><span class="k">def</span> <span class="nf">count_samples</span><span class="p">(</span><span class="n">file_list</span><span class="p">):</span> <span class="n">num_samples</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">file_list</span><span class="p">:</span> <span class="n">filename</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"/"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="n">sample_count</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">"-\d+-(\d+)\.tfrecord"</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="n">sample_count</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">sample_count</span><span class="p">)</span> <span class="n">num_samples</span> <span class="o">+=</span> <span class="n">sample_count</span> <span class="k">return</span> <span class="n">num_samples</span> <span class="n">num_train_samples</span> <span class="o">=</span> <span class="n">count_samples</span><span class="p">(</span><span class="n">training_records</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Number of total training samples: </span><span class="si">{</span><span class="n">num_train_samples</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Number of total training samples: 300917 </code></pre></div> </div> <p>Let's now prepare our datasets for training and evaluation. We start by writing our utilities. 
First, we need to be able to decode the TFRecords:

```python
max_sequence_length = 512


def decode_fn(example):
    features = {
        "input_ids": tf.io.FixedLenFeature(
            dtype=tf.int64, shape=(max_sequence_length,)
        ),
        "attention_mask": tf.io.FixedLenFeature(
            dtype=tf.int64, shape=(max_sequence_length,)
        ),
    }
    return tf.io.parse_single_example(example, features)
```

Here, `max_sequence_length` needs to be the same as the one used when preparing the TFRecord shards. Refer to [this script](https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py) for more details. (A quick way to check this value directly from the shards is shown below.)
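If you're ever unsure which sequence length a set of shards was written with, a small illustrative check (not part of the original example) is to parse a single record with a `VarLenFeature`, which doesn't assume a fixed shape:

```python
# Peek at one serialized example to check how many tokens each record holds,
# without assuming the length up front.
raw_record = next(iter(tf.data.TFRecordDataset(training_records[:1])))
parsed = tf.io.parse_single_example(
    raw_record, {"input_ids": tf.io.VarLenFeature(tf.int64)}
)
print("Tokens per example in the shards:", parsed["input_ids"].dense_shape.numpy())
```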
Next up, we have our masking utility that is responsible for masking parts of the inputs and preparing labels for the masked language model to learn from. We leverage the [`DataCollatorForLanguageModeling`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling) collator for this purpose.

```python
# We use a standard masking probability of 0.15. `mlm_probability` denotes the
# probability with which we mask the input tokens in a sequence.
mlm_probability = 0.15
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=mlm_probability, mlm=True, return_tensors="tf"
)


def mask_with_collator(batch):
    special_tokens_mask = (
        ~tf.cast(batch["attention_mask"], tf.bool)
        | (batch["input_ids"] == tokenizer.cls_token_id)
        | (batch["input_ids"] == tokenizer.sep_token_id)
    )
    batch["input_ids"], batch["labels"] = data_collator.tf_mask_tokens(
        batch["input_ids"],
        vocab_size=len(tokenizer),
        mask_token_id=tokenizer.mask_token_id,
        special_tokens_mask=special_tokens_mask,
    )
    return batch
```

And now is the time to write the final data preparation utility to put it all together in a [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object:
class="o">.</span><span class="n">AUTOTUNE</span> <span class="n">shuffle_buffer_size</span> <span class="o">=</span> <span class="mi">2</span><span class="o">**</span><span class="mi">18</span> <span class="k">def</span> <span class="nf">prepare_dataset</span><span class="p">(</span> <span class="n">records</span><span class="p">,</span> <span class="n">decode_fn</span><span class="p">,</span> <span class="n">mask_fn</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="p">,</span> <span class="n">shuffle_buffer_size</span><span class="o">=</span><span class="kc">None</span> <span class="p">):</span> <span class="n">num_samples</span> <span class="o">=</span> <span class="n">count_samples</span><span class="p">(</span><span class="n">records</span><span class="p">)</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">from_tensor_slices</span><span class="p">(</span><span class="n">records</span><span class="p">)</span> <span class="k">if</span> <span class="n">shuffle</span><span class="p">:</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">))</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">TFRecordDataset</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">num_parallel_reads</span><span class="o">=</span><span class="n">auto</span><span class="p">)</span> <span class="c1"># TF can't infer the total sample count because it doesn't read</span> <span class="c1"># all the records yet, so we assert it here.</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">experimental</span><span class="o">.</span><span class="n">assert_cardinality</span><span class="p">(</span><span class="n">num_samples</span><span class="p">))</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">decode_fn</span><span class="p">,</span> <span class="n">num_parallel_calls</span><span class="o">=</span><span class="n">auto</span><span class="p">)</span> <span class="k">if</span> <span class="n">shuffle</span><span class="p">:</span> <span class="k">assert</span> <span class="n">shuffle_buffer_size</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">shuffle_buffer_size</span><span class="p">)</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span 
class="p">,</span> <span class="n">drop_remainder</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">mask_fn</span><span class="p">,</span> <span class="n">num_parallel_calls</span><span class="o">=</span><span class="n">auto</span><span class="p">)</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">prefetch</span><span class="p">(</span><span class="n">auto</span><span class="p">)</span> <span class="k">return</span> <span class="n">dataset</span> </code></pre></div> <p>Let's prepare our datasets with these utilities:</p> <div class="codehilite"><pre><span></span><code><span class="n">per_replica_batch_size</span> <span class="o">=</span> <span class="mi">16</span> <span class="c1"># Change as needed.</span> <span class="n">batch_size</span> <span class="o">=</span> <span class="n">per_replica_batch_size</span> <span class="o">*</span> <span class="n">strategy</span><span class="o">.</span><span class="n">num_replicas_in_sync</span> <span class="n">shuffle_buffer_size</span> <span class="o">=</span> <span class="mi">2</span><span class="o">**</span><span class="mi">18</span> <span class="c1"># Default corresponds to a 1GB buffer for seq_len 512</span> <span class="n">train_dataset</span> <span class="o">=</span> <span class="n">prepare_dataset</span><span class="p">(</span> <span class="n">training_records</span><span class="p">,</span> <span class="n">decode_fn</span><span class="o">=</span><span class="n">decode_fn</span><span class="p">,</span> <span class="n">mask_fn</span><span class="o">=</span><span class="n">mask_with_collator</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">shuffle_buffer_size</span><span class="o">=</span><span class="n">shuffle_buffer_size</span><span class="p">,</span> <span class="p">)</span> <span class="n">eval_dataset</span> <span class="o">=</span> <span class="n">prepare_dataset</span><span class="p">(</span> <span class="n">eval_records</span><span class="p">,</span> <span class="n">decode_fn</span><span class="o">=</span><span class="n">decode_fn</span><span class="p">,</span> <span class="n">mask_fn</span><span class="o">=</span><span class="n">mask_with_collator</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> <p>Let's now investigate how a single batch of dataset looks like.</p> <div class="codehilite"><pre><span></span><code><span class="n">single_batch</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">))</span> <span class="nb">print</span><span class="p">(</span><span class="n">single_batch</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span> </code></pre></div> <div class="k-default-codeblock"> <div 
class="codehilite"><pre><span></span><code>dict_keys(['attention_mask', 'input_ids', 'labels']) </code></pre></div> </div> <ul> <li><code>input_ids</code> denotes the tokenized versions of the input samples containing the mask tokens as well.</li> <li><code>attention_mask</code> denotes the mask to be used when performing attention operations.</li> <li><code>labels</code> denotes the actual values of masked tokens the model is supposed to learn from.</li> </ul> <div class="codehilite"><pre><span></span><code><span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">single_batch</span><span class="p">:</span> <span class="k">if</span> <span class="n">k</span> <span class="o">==</span> <span class="s2">"input_ids"</span><span class="p">:</span> <span class="n">input_ids</span> <span class="o">=</span> <span class="n">single_batch</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Input shape: </span><span class="si">{</span><span class="n">input_ids</span><span class="o">.</span><span class="n">shape</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> <span class="k">if</span> <span class="n">k</span> <span class="o">==</span> <span class="s2">"labels"</span><span class="p">:</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">single_batch</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Label shape: </span><span class="si">{</span><span class="n">labels</span><span class="o">.</span><span class="n">shape</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Input shape: (128, 512) Label shape: (128, 512) </code></pre></div> </div> <p>Now, we can leverage our <code>tokenizer</code> to investigate the values of the tokens. Let's start with <code>input_ids</code>:</p> <div class="codehilite"><pre><span></span><code><span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span> <span class="nb">print</span><span class="p">(</span><span class="s2">"Taking the first sample:</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">input_ids</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="o">.</span><span class="n">numpy</span><span class="p">()))</span> </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>Taking the first sample: </code></pre></div> </div> <div class="k-default-codeblock"> <div class="codehilite"><pre><span></span><code>they called the character of Tsugum[MASK] one of the[MASK] tragic heroines[MASK] had encountered in a game. 
```
Taking the first sample:

they called the character of Tsugum[MASK] one of the[MASK] tragic heroines[MASK] had encountered in a game. Chandran ranked the game as the third best role @[MASK][MASK] playing game from the sixth generation of video[MASK] consoles, saying that it was his favorite in the[MASK]Infinity[MASK], and one his favorite[MASK] games overall[MASK].[MASK] [SEP][CLS][SEP][CLS][SEP][CLS] =[MASK] Sea party 1914[MASK]– 16 = [SEP][CLS][SEP][CLS] The Ross Sea party was a component of Sir[MASK] Shackleton's Imperial Trans @-@ Antarctic Expedition 1914 garde 17.[MASK] task was to lay a series of supply depots across the Great Ice Barrier from the Ross Sea to the Beardmore Glacier, along the[MASK] route established by earlier Antarctic expeditions[MASK]. The expedition's main party, under[MASK], was to land[MASK]on the opposite, Weddell Sea coast of Antarctica [MASK] and to march across the continent via the South[MASK] to the Ross Sea. As the main party would be un[MASK] to carry[MASK] fuel and supplies for the whole distance[MASK], their survival depended on the Ross Sea party's depots[MASK][MASK][MASK] would cover the[MASK] quarter of their journey. [SEP][CLS][MASK] set sail from London on[MASK] ship Endurance, bound[MASK] the Weddell Sea in August 1914. Meanwhile, the Ross Sea party[MASK] gathered in Australia, prior[MASK] Probabl for the Ross Sea in[MASK] second expedition ship, SY Aurora. Organisational and financial problems[MASK]ed their[MASK] until December 1914, which shortened their first depot @-@[MASK] season.[MASK][MASK] arrival the inexperienced party struggle[MASK] to master the art of Antarctic travel, in the[MASK] losing most of their sledge dogs [MASK]อ greater misfortune[MASK]ed when, at the onset of the southern winter, Aurora[MASK] torn from its [MASK]ings during [MASK] severe storm and was un[MASK] to return, leaving the shore party stranded. [SEP][CLS] Crossroadspite[MASK] setbacks, the Ross Sea party survived inter @-@ personnel disputes, extreme weather[MASK], illness, and Pay deaths of three of its members to carry[MASK] its[MASK] in full during its[MASK] Antarctic season. This success proved ultimate[MASK] without purpose, because Shackleton's Grimaldi expedition was un
```

As expected, the decoded tokens contain the special tokens, including the mask tokens. Let's now investigate the mask tokens:

```python
# Taking the first 30 tokens of the first sequence.
print(labels[0].numpy()[:30])
```

```
[-100 -100 -100 -100 -100 -100 -100 -100 -100   43 -100 -100 -100 -100
  351 -100 -100 -100   99 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100]
```

Here, `-100` means that the corresponding tokens in `input_ids` are NOT masked, and the non-`-100` values denote the actual values of the masked tokens. A small sanity check of this masking output is sketched below.
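As an illustrative check (not part of the original example), you can confirm that the fraction of positions selected for masking is close to `mlm_probability`, and decode the target tokens for those positions. Note that, by default, the collator replaces about 80% of the selected positions with `[MASK]`, 10% with random tokens, and leaves 10% unchanged, so the number of literal `[MASK]` tokens is a bit lower than the number of labeled positions:

```python
import numpy as np

labels_np = labels.numpy()
input_ids_np = input_ids.numpy()

# Positions where labels != -100 are the ones the model is trained to predict.
masked_positions = labels_np != -100
print(f"Fraction of tokens selected for masking: {masked_positions.mean():.3f}")

# Decode the original (target) tokens for the masked positions of the first sequence.
first_targets = labels_np[0][masked_positions[0]]
print("Target tokens for the first sequence:", tokenizer.decode(first_targets))

# Count how many positions actually received the literal [MASK] token.
num_mask_tokens = int(np.sum(input_ids_np == tokenizer.mask_token_id))
print(f"Number of [MASK] tokens in the batch: {num_mask_tokens}")
```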
---

## Initialize the model and the optimizer

With the datasets prepared, we now initialize and compile our model and optimizer within the `strategy.scope()`:

```python
# For this example, we keep this value to 10. But for a realistic run, start with 500.
num_epochs = 10
steps_per_epoch = num_train_samples // (
    per_replica_batch_size * strategy.num_replicas_in_sync
)
total_train_steps = steps_per_epoch * num_epochs
learning_rate = 0.0001
weight_decay_rate = 1e-3

with strategy.scope():
    model = transformers.TFAutoModelForMaskedLM.from_config(config)
    model(
        model.dummy_inputs
    )  # Pass some dummy inputs through the model to ensure all the weights are built
    optimizer, schedule = transformers.create_optimizer(
        num_train_steps=total_train_steps,
        num_warmup_steps=total_train_steps // 20,
        init_lr=learning_rate,
        weight_decay_rate=weight_decay_rate,
    )
    model.compile(optimizer=optimizer, metrics=["accuracy"])
```

```
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
```

A couple of things to note here:

- The [`create_optimizer()`](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.create_optimizer) function creates an Adam optimizer with a learning rate schedule that uses a warmup phase followed by a linear decay. Since we're using weight decay here, under the hood `create_optimizer()` instantiates [the right variant of Adam](https://github.com/huggingface/transformers/blob/118e9810687dd713b6be07af79e80eeb1d916908/src/transformers/optimization_tf.py#L172) to enable weight decay.
- While compiling the model, we're NOT using any `loss` argument. This is because the TensorFlow models internally compute the loss when expected labels are provided. Based on the model type and the labels being used, `transformers` will automatically infer the loss to use.

A quick way to inspect the warmup-then-decay shape of the learning rate schedule is sketched below.
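Since `create_optimizer()` also returns the `schedule` object (a Keras `LearningRateSchedule`), you can call it on a step index to see the learning rate at that point. The snippet below is purely illustrative and only uses the variables defined above:

```python
# Sample the schedule at a few steps: warmup over the first 5% of training,
# then a linear decay back toward zero.
probe_steps = [0, total_train_steps // 20, total_train_steps // 2, total_train_steps - 1]
for step in probe_steps:
    lr = schedule(step).numpy()
    print(f"step {step:>6}: learning rate = {lr:.2e}")
```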
### Start training!

Next, we set up a handy callback to push the intermediate training checkpoints to the Hugging Face Hub. To be able to use this callback, we need to log in to our Hugging Face account (if you don't have one, you can create one [here](https://huggingface.co/join) for free). Execute the code below to log in:

```python
from huggingface_hub import notebook_login

notebook_login()
```

Let's now define the [`PushToHubCallback`](https://huggingface.co/docs/transformers/main_classes/keras_callbacks#transformers.PushToHubCallback):

```python
hub_model_id = output_dir = "masked-lm-tpu"

callbacks = []
callbacks.append(
    transformers.PushToHubCallback(
        output_dir=output_dir, hub_model_id=hub_model_id, tokenizer=tokenizer
    )
)
```

```
Cloning https://huggingface.co/sayakpaul/masked-lm-tpu into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/sayakpaul/masked-lm-tpu into local empty directory.
```
And now, we're ready to chug the TPUs:

```python
# In the interest of the runtime of this example,
# we limit the number of batches to just 2.
model.fit(
    train_dataset.take(2),
    validation_data=eval_dataset.take(2),
    epochs=num_epochs,
    callbacks=callbacks,
)

# After training we also serialize the final model.
model.save_pretrained(output_dir)
```

```
Epoch 1/10
2/2 [==============================] - 96s 35s/step - loss: 10.2116 - accuracy: 0.0000e+00 - val_loss: 10.1957 - val_accuracy: 2.2888e-05
Epoch 2/10
2/2 [==============================] - 9s 2s/step - loss: 10.2017 - accuracy: 0.0000e+00 - val_loss: 10.1798 - val_accuracy: 0.0000e+00
Epoch 3/10
2/2 [==============================] - ETA: 0s - loss: 10.1890 - accuracy: 7.6294e-06
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_end` time: 9.1679s). Check your callbacks.
2/2 [==============================] - 35s 27s/step - loss: 10.1890 - accuracy: 7.6294e-06 - val_loss: 10.1604 - val_accuracy: 1.5259e-05
Epoch 4/10
2/2 [==============================] - 8s 2s/step - loss: 10.1733 - accuracy: 1.5259e-05 - val_loss: 10.1145 - val_accuracy: 7.6294e-06
Epoch 5/10
2/2 [==============================] - 34s 26s/step - loss: 10.1336 - accuracy: 1.5259e-05 - val_loss: 10.0666 - val_accuracy: 7.6294e-06
Epoch 6/10
2/2 [==============================] - 10s 2s/step - loss: 10.0906 - accuracy: 6.1035e-05 - val_loss: 10.0200 - val_accuracy: 5.4169e-04
Epoch 7/10
2/2 [==============================] - 33s 25s/step - loss: 10.0360 - accuracy: 6.1035e-04 - val_loss: 9.9646 - val_accuracy: 0.0049
Epoch 8/10
2/2 [==============================] - 8s 2s/step - loss: 9.9830 - accuracy: 0.0038 - val_loss: 9.8938 - val_accuracy: 0.0155
Epoch 9/10
2/2 [==============================] - 33s 26s/step - loss: 9.9067 - accuracy: 0.0116 - val_loss: 9.8225 - val_accuracy: 0.0198
Epoch 10/10
2/2 [==============================] - 8s 2s/step - loss: 9.8302 - accuracy: 0.0196 - val_loss: 9.7454 - val_accuracy: 0.0215
```

Once your training is complete, you can easily perform inference like so:

```python
from transformers import pipeline

# Replace your `model_id` here.
# Here, we're using a model that the Hugging Face team trained for longer.
model_id = "tf-tpu/roberta-base-epochs-500-no-wd"
unmasker = pipeline("fill-mask", model=model_id, framework="tf")
print(unmasker("Goal of my life is to [MASK]."))
```
```
All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at tf-tpu/roberta-base-epochs-500-no-wd.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.

[{'score': 0.10031876713037491, 'token': 52, 'token_str': 'be', 'sequence': 'Goal of my life is to be.'},
 {'score': 0.032648470252752304, 'token': 5, 'token_str': '', 'sequence': 'Goal of my life is to .'},
 {'score': 0.02152678370475769, 'token': 138, 'token_str': 'work', 'sequence': 'Goal of my life is to work.'},
 {'score': 0.019547568634152412, 'token': 984, 'token_str': 'act', 'sequence': 'Goal of my life is to act.'},
 {'score': 0.01939115859568119, 'token': 73, 'token_str': 'have', 'sequence': 'Goal of my life is to have.'}]
```

And that's it!

If you enjoyed this example, we encourage you to check out the full codebase [here](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling-tpu) and the accompanying blog post [here](https://huggingface.co/blog/tf_tpu).