Pretraining BERT with Hugging Face Transformers

<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="description" content="Keras documentation"> <meta name="author" content="Keras Team"> <link rel="shortcut icon" href="https://keras.io/img/favicon.ico"> <link rel="canonical" href="https://keras.io/examples/nlp/pretraining_BERT/" />  <meta property="og:title" content="Keras documentation: Pretraining BERT with Hugging Face Transformers"> <meta property="og:image" content="https://keras.io/img/logo-k-keras-wb.png"> <meta name="twitter:title" content="Keras documentation: Pretraining BERT with Hugging Face Transformers"> <meta name="twitter:image" content="https://keras.io/img/k-keras-social.png"> <meta name="twitter:card" content="summary"> <title>Pretraining BERT with Hugging Face Transformers</title>  <link href="/css/bootstrap.min.css" rel="stylesheet">  <link href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700;800&display=swap" rel="stylesheet">  <link href="/css/docs.css" rel="stylesheet"> <link href="/css/monokai.css" rel="stylesheet">  <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-5DNGF4N'); </script> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-175165319-128', 'auto'); ga('send', 'pageview'); </script>  <script async defer src="https://buttons.github.io/buttons.js"></script> </head> <body>  <noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-5DNGF4N" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>  <div class='k-page'> <div class="k-nav" id="nav-menu"> <a href='/'><img src='/img/logo-small.png' class='logo-small' /></a> <div class="nav flex-column nav-pills" role="tablist" aria-orientation="vertical"> <a class="nav-link" href="/about/" role="tab" aria-selected="">About Keras</a> <a class="nav-link" href="/getting_started/" role="tab" aria-selected="">Getting started</a> <a class="nav-link" href="/guides/" role="tab" aria-selected="">Developer guides</a> <a class="nav-link active" href="/examples/" role="tab" aria-selected="">Code examples</a> <a class="nav-sublink" href="/examples/vision/">Computer Vision</a> <a class="nav-sublink active" href="/examples/nlp/">Natural Language Processing</a> <a class="nav-sublink2" href="/examples/nlp/text_classification_from_scratch/">Text classification from scratch</a> <a class="nav-sublink2" href="/examples/nlp/active_learning_review_classification/">Review Classification using Active Learning</a> <a class="nav-sublink2" href="/examples/nlp/fnet_classification_with_keras_hub/">Text Classification using FNet</a> <a class="nav-sublink2" href="/examples/nlp/multi_label_classification/">Large-scale multi-label text classification</a> <a class="nav-sublink2" href="/examples/nlp/text_classification_with_transformer/">Text classification with Transformer</a> <a class="nav-sublink2" href="/examples/nlp/text_classification_with_switch_transformer/">Text classification with Switch Transformer</a> <a class="nav-sublink2" href="/examples/nlp/tweet-classification-using-tfdf/">Text classification using Decision Forests and pretrained embeddings</a> <a class="nav-sublink2" href="/examples/nlp/pretrained_word_embeddings/">Using pre-trained word embeddings</a> <a class="nav-sublink2" href="/examples/nlp/bidirectional_lstm_imdb/">Bidirectional LSTM on IMDB</a> <a class="nav-sublink2" href="/examples/nlp/data_parallel_training_with_keras_hub/">Data Parallel Training with KerasHub and tf.distribute</a> <a class="nav-sublink2" href="/examples/nlp/neural_machine_translation_with_keras_hub/">English-to-Spanish translation with KerasHub</a> <a class="nav-sublink2" href="/examples/nlp/neural_machine_translation_with_transformer/">English-to-Spanish translation with a sequence-to-sequence Transformer</a> <a class="nav-sublink2" href="/examples/nlp/lstm_seq2seq/">Character-level recurrent sequence-to-sequence model</a> <a class="nav-sublink2" href="/examples/nlp/multimodal_entailment/">Multimodal entailment</a> <a class="nav-sublink2" href="/examples/nlp/ner_transformers/">Named Entity Recognition using Transformers</a> <a class="nav-sublink2" href="/examples/nlp/text_extraction_with_bert/">Text Extraction with BERT</a> <a class="nav-sublink2" href="/examples/nlp/addition_rnn/">Sequence to sequence learning for performing number addition</a> <a class="nav-sublink2" href="/examples/nlp/semantic_similarity_with_keras_hub/">Semantic Similarity with KerasHub</a> <a class="nav-sublink2" href="/examples/nlp/semantic_similarity_with_bert/">Semantic Similarity with BERT</a> <a class="nav-sublink2" href="/examples/nlp/sentence_embeddings_with_sbert/">Sentence embeddings using Siamese RoBERTa-networks</a> <a class="nav-sublink2" href="/examples/nlp/masked_language_modeling/">End-to-end Masked Language Modeling with BERT</a> <a class="nav-sublink2" href="/examples/nlp/abstractive_summarization_with_bart/">Abstractive Text Summarization with BART</a> <a class="nav-sublink2 active" href="/examples/nlp/pretraining_BERT/">Pretraining BERT with Hugging Face Transformers</a> <a class="nav-sublink2" href="/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/">Parameter-efficient fine-tuning of GPT-2 with LoRA</a> <a class="nav-sublink2" href="/examples/nlp/multiple_choice_task_with_transfer_learning/">MultipleChoice Task with Transfer Learning</a> <a class="nav-sublink2" href="/examples/nlp/question_answering/">Question Answering with Hugging Face Transformers</a> <a class="nav-sublink2" href="/examples/nlp/t5_hf_summarization/">Abstractive Summarization with Hugging Face Transformers</a> <a class="nav-sublink" href="/examples/structured_data/">Structured Data</a> <a class="nav-sublink" href="/examples/timeseries/">Timeseries</a> <a class="nav-sublink" href="/examples/generative/">Generative Deep Learning</a> <a class="nav-sublink" href="/examples/audio/">Audio Data</a> <a class="nav-sublink" href="/examples/rl/">Reinforcement Learning</a> <a class="nav-sublink" href="/examples/graph/">Graph Data</a> <a class="nav-sublink" href="/examples/keras_recipes/">Quick Keras Recipes</a> <a class="nav-link" href="/api/" role="tab" aria-selected="">Keras 3 API documentation</a> <a class="nav-link" href="/2.18/api/" role="tab" aria-selected="">Keras 2 API documentation</a> <a class="nav-link" href="/keras_tuner/" role="tab" aria-selected="">KerasTuner: Hyperparam Tuning</a> <a class="nav-link" href="/keras_hub/" role="tab" aria-selected="">KerasHub: Pretrained Models</a> </div> </div> <div class='k-main'> <div class='k-main-top'> <script> function displayDropdownMenu() { e = document.getElementById("nav-menu"); if (e.style.display == "block") { e.style.display = "none"; } else { e.style.display = "block"; document.getElementById("dropdown-nav").style.display = "block"; } } function resetMobileUI() { if (window.innerWidth <= 840) { document.getElementById("nav-menu").style.display = "none"; document.getElementById("dropdown-nav").style.display = "block"; } else { document.getElementById("nav-menu").style.display = "block"; document.getElementById("dropdown-nav").style.display = "none"; } var navmenu = document.getElementById("nav-menu"); var menuheight = navmenu.clientHeight; var kmain = document.getElementById("k-main-id"); kmain.style.minHeight = (menuheight + 100) + 'px'; } window.onresize = resetMobileUI; window.addEventListener("load", (event) => { resetMobileUI() }); </script> <div id='dropdown-nav' onclick="displayDropdownMenu();"> <svg viewBox="-20 -20 120 120" width="60" height="60"> <rect width="100" height="20"></rect> <rect y="30" width="100" height="20"></rect> <rect y="60" width="100" height="20"></rect> </svg> </div> <form class="bd-search d-flex align-items-center k-search-form" id="search-form"> <input type="search" class="k-search-input" id="search-input" placeholder="Search Keras documentation..." aria-label="Search Keras documentation..." autocomplete="off"> <button class="k-search-btn"> <svg width="13" height="13" viewBox="0 0 13 13"><title>search</title><path d="m4.8495 7.8226c0.82666 0 1.5262-0.29146 2.0985-0.87438 0.57232-0.58292 0.86378-1.2877 0.87438-2.1144 0.010599-0.82666-0.28086-1.5262-0.87438-2.0985-0.59352-0.57232-1.293-0.86378-2.0985-0.87438-0.8055-0.010599-1.5103 0.28086-2.1144 0.87438-0.60414 0.59352-0.8956 1.293-0.87438 2.0985 0.021197 0.8055 0.31266 1.5103 0.87438 2.1144 0.56172 0.60414 1.2665 0.8956 2.1144 0.87438zm4.4695 0.2115 3.681 3.6819-1.259 1.284-3.6817-3.7 0.0019784-0.69479-0.090043-0.098846c-0.87973 0.76087-1.92 1.1413-3.1207 1.1413-1.3553 0-2.5025-0.46363-3.4417-1.3909s-1.4088-2.0686-1.4088-3.4239c0-1.3553 0.4696-2.4966 1.4088-3.4239 0.9392-0.92727 2.0864-1.3969 3.4417-1.4088 1.3553-0.011889 2.4906 0.45771 3.406 1.4088 0.9154 0.95107 1.379 2.0924 1.3909 3.4239 0 1.2126-0.38043 2.2588-1.1413 3.1385l0.098834 0.090049z"></path></svg> </button> </form> <script> var form = document.getElementById('search-form'); form.onsubmit = function(e) { e.preventDefault(); var query = document.getElementById('search-input').value; window.location.href = '/search.html?query=' + query; return False } </script> </div> <div class='k-main-inner' id='k-main-id'> <div class='k-location-slug'> ► <a href='/examples/'>Code examples</a> / <a href='/examples/nlp/'>Natural Language Processing</a> / Pretraining BERT with Hugging Face Transformers </div> <div class='k-content'> <h1 id="pretraining-bert-with-hugging-face-transformers">Pretraining BERT with Hugging Face Transformers</h1> Author: Sreyan Ghosh Date created: 2022/07/01 Last modified: 2022/08/27 Description: Pretraining BERT using Hugging Face Transformers on NSP and MLM. <div class='example_version_banner keras_2'>ⓘ This example uses Keras 2</div> <img class="k-inline-icon" src="https://colab.research.google.com/img/colab_favicon.ico"/> <a href="https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/pretraining/ipynb/pretraining_BERT.ipynb">View in Colab</a> •<img class="k-inline-icon" src="https://github.com/favicon.ico"/> <a href="https://github.com/keras-team/keras-io/blob/master/examples/pretraining/pretraining_BERT.py">GitHub source</a> <hr /> <h2 id="introduction">Introduction</h2> <h3 id="bert-bidirectional-encoder-representations-from-transformers">BERT (Bidirectional Encoder Representations from Transformers)</h3> In the field of computer vision, researchers have repeatedly shown the value of transfer learning — pretraining a neural network model on a known task/dataset, for instance ImageNet classification, and then performing fine-tuning — using the trained neural network as the basis of a new specific-purpose model. In recent years, researchers have shown that a similar technique can be useful in many natural language tasks. BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or subwords) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word). When training language models, a challenge is defining a prediction goal. Many models predict the next word in a sequence (e.g. <code>"The child came home from _"</code>), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies: <h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3> Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a <code>[MASK]</code> token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. <h3 id="next-sentence-prediction-nsp">Next Sentence Prediction (NSP)</h3> In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will represent a disconnect from the first sentence. Though Google provides a pretrained BERT checkpoint for English, you may often need to either pretrain the model from scratch for a different language, or do a continued-pretraining to fit the model to a new domain. In this notebook, we pretrain BERT from scratch optimizing both MLM and NSP objectves using 🤗 Transformers on the <code>WikiText</code> English dataset loaded from 🤗 Datasets. <hr /> <h2 id="setup">Setup</h2> <h3 id="installing-the-requirements">Installing the requirements</h3> <div class="codehilite"><pre><code>pip install git+https://github.com/huggingface/transformers.git pip install datasets pip install huggingface-hub pip install nltk </code></pre></div> <h3 id="importing-the-necessary-libraries">Importing the necessary libraries</h3> <div class="codehilite"><pre><code>import nltk import random import logging import tensorflow as tf from tensorflow import keras nltk.download("punkt") # Only log error messages tf.get_logger().setLevel(logging.ERROR) # Set random seed tf.keras.utils.set_random_seed(42) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>[nltk_data] Downloading package punkt to /speech/sreyan/nltk_data... [nltk_data] Package punkt is already up-to-date! </code></pre></div> </div> <h3 id="define-certain-variables">Define certain variables</h3> <div class="codehilite"><pre><code>TOKENIZER_BATCH_SIZE = 256 # Batch-size to train the tokenizer on TOKENIZER_VOCABULARY = 25000 # Total number of unique subwords the tokenizer can have BLOCK_SIZE = 128 # Maximum number of tokens in an input sample NSP_PROB = 0.50 # Probability that the next sentence is the actual next sentence in NSP SHORT_SEQ_PROB = 0.1 # Probability of generating shorter sequences to minimize the mismatch between pretraining and fine-tuning. MAX_LENGTH = 512 # Maximum number of tokens in an input sample after padding MLM_PROB = 0.2 # Probability with which tokens are masked in MLM TRAIN_BATCH_SIZE = 2 # Batch-size for pretraining the model on MAX_EPOCHS = 1 # Maximum number of epochs to train the model for LEARNING_RATE = 1e-4 # Learning rate for training the model MODEL_CHECKPOINT = "bert-base-cased" # Name of pretrained model from 🤗 Model Hub </code></pre></div> <hr /> <h2 id="load-the-wikitext-dataset">Load the WikiText dataset</h2> We now download the <code>WikiText</code> language modeling dataset. It is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia. We load the dataset from <a href="https://github.com/huggingface/datasets">🤗 Datasets</a>. For the purpose of demonstration in this notebook, we work with only the <code>train</code> split of the dataset. This can be easily done with the <code>load_dataset</code> function. <div class="codehilite"><pre><code>from datasets import load_dataset dataset = load_dataset("wikitext", "wikitext-2-raw-v1") </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.90 MiB, post-processed: Unknown size, total: 17.40 MiB) to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126... Downloading data: 0%| | 0.00/4.72M [00:00<?, ?B/s] Generating test split: 0%| | 0/4358 [00:00<?, ? examples/s] Generating train split: 0%| | 0/36718 [00:00<?, ? examples/s] Generating validation split: 0%| | 0/3760 [00:00<?, ? examples/s] Dataset wikitext downloaded and prepared to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data. 0%| | 0/3 [00:00<?, ?it/s] </code></pre></div> </div> The dataset just has one column which is the raw text, and this is all we need for pretraining BERT! <div class="codehilite"><pre><code>print(dataset) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>DatasetDict({ test: Dataset({ features: ['text'], num_rows: 4358 }) train: Dataset({ features: ['text'], num_rows: 36718 }) validation: Dataset({ features: ['text'], num_rows: 3760 }) }) </code></pre></div> </div> <hr /> <h2 id="training-a-new-tokenizer">Training a new Tokenizer</h2> First we train our own tokenizer from scratch on our corpus, so that can we can use it to train our language model from scratch. But why would you need to train a tokenizer? That's because Transformer models very often use subword tokenization algorithms, and they need to be trained to identify the parts of words that are often present in the corpus you are using. The 🤗 Transformers <code>Tokenizer</code> (as the name indicates) will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires. First we make a list of all the raw documents from the <code>WikiText</code> corpus: <div class="codehilite"><pre><code>all_texts = [ doc for doc in dataset["train"]["text"] if len(doc) > 0 and not doc.startswith(" =") ] </code></pre></div> Next we make a <code>batch_iterator</code> function that will aid us to train our tokenizer. <div class="codehilite"><pre><code>def batch_iterator(): for i in range(0, len(all_texts), TOKENIZER_BATCH_SIZE): yield all_texts[i : i + TOKENIZER_BATCH_SIZE] </code></pre></div> In this notebook, we train a tokenizer with the exact same algorithms and parameters as an existing one. For instance, we train a new version of the <code>BERT-CASED</code> tokenzier on <code>Wikitext-2</code> using the same tokenization algorithm. First we need to load the tokenizer we want to use as a model: <div class="codehilite"><pre><code>from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`. Moving 52 files to the new cache system 0%| | 0/52 [00:00<?, ?it/s] vocab_file vocab.txt tokenizer_file tokenizer.json added_tokens_file added_tokens.json special_tokens_map_file special_tokens_map.json tokenizer_config_file tokenizer_config.json </code></pre></div> </div> Now we train our tokenizer using the entire <code>train</code> split of the <code>Wikitext-2</code> dataset. <div class="codehilite"><pre><code>tokenizer = tokenizer.train_new_from_iterator( batch_iterator(), vocab_size=TOKENIZER_VOCABULARY ) </code></pre></div> So now we our done training our new tokenizer! Next we move on to the data pre-processing steps. <hr /> <h2 id="data-preprocessing">Data Pre-processing</h2> For the sake of demonstrating the workflow, in this notebook we only take small subsets of the entire WikiText <code>train</code> and <code>test</code> splits. <div class="codehilite"><pre><code>dataset["train"] = dataset["train"].select([i for i in range(1000)]) dataset["validation"] = dataset["validation"].select([i for i in range(1000)]) </code></pre></div> Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. As mentioned earlier, the BERT pretraining task includes two tasks in total, the <code>NSP</code> task and the <code>MLM</code> task. 🤗 Transformers have an easy to implement <code>collator</code> called the <code>DataCollatorForLanguageModeling</code>. However, we need to get the data ready for <code>NSP</code> manually. Next we write a simple function called the <code>prepare_train_features</code> that helps us in the pre-processing and is compatible with 🤗 Datasets. To summarize, our pre-processing function should: <ul> <li>Get the dataset ready for the NSP task by creating pairs of sentences (A,B), where B either actually follows A, or B is randomly sampled from somewhere else in the corpus. It should also generate a corresponding label for each pair, which is 1 if B actually follows A and 0 if not.</li> <li>Tokenize the text dataset into it's corresponding token ids that will be used for embedding look-up in BERT</li> <li>Create additional inputs for the model like <code>token_type_ids</code>, <code>attention_mask</code>, etc.</li> </ul> <div class="codehilite"><pre><code># We define the maximum number of tokens after tokenization that each training sample # will have max_num_tokens = BLOCK_SIZE - tokenizer.num_special_tokens_to_add(pair=True) def prepare_train_features(examples): """Function to prepare features for NSP task Arguments: examples: A dictionary with 1 key ("text") text: List of raw documents (str) Returns: examples: A dictionary with 4 keys input_ids: List of tokenized, concatnated, and batched sentences from the individual raw documents (int) token_type_ids: List of integers (0 or 1) corresponding to: 0 for senetence no. 1 and padding, 1 for sentence no. 2 attention_mask: List of integers (0 or 1) corresponding to: 1 for non-padded tokens, 0 for padded next_sentence_label: List of integers (0 or 1) corresponding to: 1 if the second sentence actually follows the first, 0 if the senetence is sampled from somewhere else in the corpus """ # Remove un-wanted samples from the training set examples["document"] = [ d.strip() for d in examples["text"] if len(d) > 0 and not d.startswith(" =") ] # Split the documents from the dataset into it's individual sentences examples["sentences"] = [ nltk.tokenize.sent_tokenize(document) for document in examples["document"] ] # Convert the tokens into ids using the trained tokenizer examples["tokenized_sentences"] = [ [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent)) for sent in doc] for doc in examples["sentences"] ] # Define the outputs examples["input_ids"] = [] examples["token_type_ids"] = [] examples["attention_mask"] = [] examples["next_sentence_label"] = [] for doc_index, document in enumerate(examples["tokenized_sentences"]): current_chunk = [] # a buffer stored current working segments current_length = 0 i = 0 # We *usually* want to fill up the entire sequence since we are padding # to `block_size` anyways, so short sequences are generally wasted # computation. However, we *sometimes* # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter # sequences to minimize the mismatch between pretraining and fine-tuning. # The `target_seq_length` is just a rough target however, whereas # `block_size` is a hard limit. target_seq_length = max_num_tokens if random.random() < SHORT_SEQ_PROB: target_seq_length = random.randint(2, max_num_tokens) while i < len(document): segment = document[i] current_chunk.append(segment) current_length += len(segment) if i == len(document) - 1 or current_length >= target_seq_length: if current_chunk: # `a_end` is how many segments from `current_chunk` go into the `A` # (first) sentence. a_end = 1 if len(current_chunk) >= 2: a_end = random.randint(1, len(current_chunk) - 1) tokens_a = [] for j in range(a_end): tokens_a.extend(current_chunk[j]) tokens_b = [] if len(current_chunk) == 1 or random.random() < NSP_PROB: is_random_next = True target_b_length = target_seq_length - len(tokens_a) # This should rarely go for more than one iteration for large # corpora. However, just to be careful, we try to make sure that # the random document is not the same as the document # we're processing. for _ in range(10): random_document_index = random.randint( 0, len(examples["tokenized_sentences"]) - 1 ) if random_document_index != doc_index: break random_document = examples["tokenized_sentences"][ random_document_index ] random_start = random.randint(0, len(random_document) - 1) for j in range(random_start, len(random_document)): tokens_b.extend(random_document[j]) if len(tokens_b) >= target_b_length: break # We didn't actually use these segments so we "put them back" so # they don't go to waste. num_unused_segments = len(current_chunk) - a_end i -= num_unused_segments else: is_random_next = False for j in range(a_end, len(current_chunk)): tokens_b.extend(current_chunk[j]) input_ids = tokenizer.build_inputs_with_special_tokens( tokens_a, tokens_b ) # add token type ids, 0 for sentence a, 1 for sentence b token_type_ids = tokenizer.create_token_type_ids_from_sequences( tokens_a, tokens_b ) padded = tokenizer.pad( {"input_ids": input_ids, "token_type_ids": token_type_ids}, padding="max_length", max_length=MAX_LENGTH, ) examples["input_ids"].append(padded["input_ids"]) examples["token_type_ids"].append(padded["token_type_ids"]) examples["attention_mask"].append(padded["attention_mask"]) examples["next_sentence_label"].append(1 if is_random_next else 0) current_chunk = [] current_length = 0 i += 1 # We delete all the un-necessary columns from our dataset del examples["document"] del examples["sentences"] del examples["text"] del examples["tokenized_sentences"] return examples tokenized_dataset = dataset.map( prepare_train_features, batched=True, remove_columns=["text"], num_proc=1, ) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>Parameter 'function'=<function prepare_train_features at 0x7fd4a214cb90> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 0%| | 0/5 [00:00<?, ?ba/s] 0%| | 0/1 [00:00<?, ?ba/s] 0%| | 0/1 [00:00<?, ?ba/s] </code></pre></div> </div> For MLM we are going to use the same preprocessing as before for our dataset with one additional step: we randomly mask some tokens (by replacing them by [MASK]) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the [MASK] token is among the special tokens you passed during training! To get the data ready for MLM, we simply use the <code>collator</code> called the <code>DataCollatorForLanguageModeling</code> provided by the 🤗 Transformers library on our dataset that is already ready for the NSP task. The <code>collator</code> expects certain parameters. We use the default ones from the original BERT paper in this notebook. The <code>return_tensors='tf'</code> ensures that we get <a href="https://www.tensorflow.org/api_docs/python/tf/Tensor"><code>tf.Tensor</code></a> objects back. <div class="codehilite"><pre><code>from transformers import DataCollatorForLanguageModeling collater = DataCollatorForLanguageModeling( tokenizer=tokenizer, mlm=True, mlm_probability=MLM_PROB, return_tensors="tf" ) </code></pre></div> Next we define our training set with which we train our model. Again, 🤗 Datasets provides us with the <code>to_tf_dataset</code> method which will help us integrate our dataset with the <code>collator</code> defined above. The method expects certain parameters: <ul> <li>columns: the columns which will serve as our independent variables</li> <li>label_cols: the columns which will serve as our labels or dependant variables</li> <li>batch_size: our batch size for training</li> <li>shuffle: whether we want to shuffle our training dataset</li> <li>collate_fn: our collator function</li> </ul> <div class="codehilite"><pre><code>train = tokenized_dataset["train"].to_tf_dataset( columns=["input_ids", "token_type_ids", "attention_mask"], label_cols=["labels", "next_sentence_label"], batch_size=TRAIN_BATCH_SIZE, shuffle=True, collate_fn=collater, ) validation = tokenized_dataset["validation"].to_tf_dataset( columns=["input_ids", "token_type_ids", "attention_mask"], label_cols=["labels", "next_sentence_label"], batch_size=TRAIN_BATCH_SIZE, shuffle=True, collate_fn=collater, ) </code></pre></div> <hr /> <h2 id="defining-the-model">Defining the model</h2> To define our model, first we need to define a config which will help us define certain parameters of our model architecture. This includes parameters like number of transformer layers, number of attention heads, hidden dimension, etc. For this notebook, we try to define the exact config defined in the original BERT paper. We can easily achieve this using the <code>BertConfig</code> class from the 🤗 Transformers library. The <code>from_pretrained()</code> method expects the name of a model. Here we define the simplest model with which we also trained our model, i.e., <code>bert-base-cased</code>. <div class="codehilite"><pre><code>from transformers import BertConfig config = BertConfig.from_pretrained(MODEL_CHECKPOINT) </code></pre></div> For defining our model we use the <code>TFBertForPreTraining</code> class from the 🤗 Transformers library. This class internally handles everything starting from defining our model, to unpacking our inputs and calculating the loss. So we need not do anything ourselves except defining the model with the correct <code>config</code> we want! <div class="codehilite"><pre><code>from transformers import TFBertForPreTraining model = TFBertForPreTraining(config) </code></pre></div> Now we define our optimizer and compile the model. The loss calculation is handled internally and so we need not worry about that! <div class="codehilite"><pre><code>optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE) model.compile(optimizer=optimizer) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss. </code></pre></div> </div> Finally all steps are done and now we can start training our model! <div class="codehilite"><pre><code>model.fit(train, validation_data=validation, epochs=MAX_EPOCHS) </code></pre></div> <div class="k-default-codeblock"> <div class="codehilite"><pre><code>483/483 [==============================] - 96s 141ms/step - loss: 8.3765 - val_loss: 8.5572 <keras.callbacks.History at 0x7fd27c219790> </code></pre></div> </div> Our model has now been trained! We suggest to please train the model on the complete dataset for atleast 50 epochs for decent performance. The pretrained model now acts as a language model and is meant to be fine-tuned on a downstream task. Thus it can now be fine-tuned on any downstream task like Question Answering, Text Classification etc.! Now you can push this model to 🤗 Model Hub and also share it with with all your friends, family, favorite pets: they can all load it with the identifier <code>"your-username/the-name-you-picked"</code> so for instance: <div class="codehilite"><pre><code>model.push_to_hub("pretrained-bert", organization="keras-io") tokenizer.push_to_hub("pretrained-bert", organization="keras-io") </code></pre></div> And after you push your model this is how you can load it in the future! <div class="codehilite"><pre><code>from transformers import TFBertForPreTraining model = TFBertForPreTraining.from_pretrained("your-username/my-awesome-model") </code></pre></div> or, since it's a pretrained model and you would generally use it for fine-tuning on a downstream task, you can also load it for some other task like: <div class="codehilite"><pre><code>from transformers import TFBertForSequenceClassification model = TFBertForSequenceClassification.from_pretrained("your-username/my-awesome-model") </code></pre></div> In this case, the pretraining head will be dropped and the model will just be initialized with the transformer layers. A new task-specific head will be added with random weights. </div> <div class='k-outline'> <div class='k-outline-depth-1'> <a href='#pretraining-bert-with-hugging-face-transformers'>Pretraining BERT with Hugging Face Transformers</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#introduction'>Introduction</a> </div> <div class='k-outline-depth-3'> <a href='#bert-bidirectional-encoder-representations-from-transformers'>BERT (Bidirectional Encoder Representations from Transformers)</a> </div> <div class='k-outline-depth-3'> <a href='#masked-language-modeling-mlm'>Masked Language Modeling (MLM)</a> </div> <div class='k-outline-depth-3'> <a href='#next-sentence-prediction-nsp'>Next Sentence Prediction (NSP)</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#setup'>Setup</a> </div> <div class='k-outline-depth-3'> <a href='#installing-the-requirements'>Installing the requirements</a> </div> <div class='k-outline-depth-3'> <a href='#importing-the-necessary-libraries'>Importing the necessary libraries</a> </div> <div class='k-outline-depth-3'> <a href='#define-certain-variables'>Define certain variables</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#load-the-wikitext-dataset'>Load the WikiText dataset</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#training-a-new-tokenizer'>Training a new Tokenizer</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#data-preprocessing'>Data Pre-processing</a> </div> <div class='k-outline-depth-2'> ◆ <a href='#defining-the-model'>Defining the model</a> </div> </div> </div> </div> </div> </body> <footer style="float: left; width: 100%; padding: 1em; border-top: solid 1px #bbb;"> <a href="https://policies.google.com/terms">Terms</a> | <a href="https://policies.google.com/privacy">Privacy</a> </footer> </html>

CINXE.COM

Pretraining BERT with Hugging Face Transformers