
Fine-Tuning BERT Model – Part 1

In this segment, we will use one of the variants of the Transformer model, BERT, and fine-tune it to perform sentence-pair classification. This task is part of the semantic textual similarity problem, wherein you are provided with a pair of questions and are required to model the textual relationship between them.

You can download the notebook used in this segment from here.

In the next video, Ankush will explain the problem statement and the data used.

Here is the problem statement: Predict whether any given pair of sentences (questions) are semantically similar to each other. We will use the Quora Question Pairs (QQP) data set, which is part of the GLUE benchmark, and two evaluation metrics: F1 score and accuracy.
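For reference, here is a minimal sketch of how these two metrics could be computed from model predictions using scikit-learn (assumed to be available in the environment); the y_true and y_pred values below are purely illustrative.

Python
# Illustrative sketch: computing accuracy and F1 on made-up predictions.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]   # gold labels (1 = duplicate pair, 0 = not a duplicate)
y_pred = [1, 0, 0, 1, 0]   # hypothetical model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.8
print("F1 score:", f1_score(y_true, y_pred))        # 0.8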

By the end of this case study, you should be able to:

Work with Hugging Face data sets.

Load, train and save BERT-based models (BERT and ALBERT, among others).

Perform end-to-end implementation (training, validation, prediction, and evaluation).


You can download the data set from this link.
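If you are running the notebook yourself, the Hugging Face libraries used in this segment need to be installed first. Here is a minimal setup sketch (package versions are not pinned here; pin them if you need reproducibility):

Python
# Install the libraries used in this case study (run once, e.g., in Colab).
!pip install transformers datasets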

Example

Python
import pandas as pd

train = pd.read_csv('/content/drive/MyDrive/sentence_pair_classification_data/train.csv')
train.sample(5)

Output

There are 363,846 entries of data with the following four columns: question1, question2, label, and idx. However, to use this data with the Transformers API, we need to load it using the load_dataset() function from the Hugging Face datasets library, which automatically converts the given files into a DatasetDict (a dictionary of splits).

Example

Python
from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': '/content/drive/MyDrive/sentence_pair_classification_data/train.csv',
                                          'valid': '/content/drive/MyDrive/sentence_pair_classification_data/val.csv',
                                          'test':  '/content/drive/MyDrive/sentence_pair_classification_data/test.csv'})

Output

PowerShell
DatasetDict({
    train: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 363846
    })
    valid: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 40430
    })
    test: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 390965
    })
})
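Each split of the resulting DatasetDict can be indexed like a list of dictionaries. As a quick sanity check (the exact values printed will depend on the data), you can inspect a single record as follows:

Python
# Indexing a split returns one record as a plain Python dictionary.
sample = dataset['train'][0]
print(sample.keys())        # dict_keys(['question1', 'question2', 'label', 'idx'])
print(sample['question1'])  # raw text of the first question
print(sample['label'])      # 1 if the pair is a duplicate, else 0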

In order to pre-process the inputs, we need to initialise the model name/checkpoint and load the tokenizer with AutoTokenizer, which automatically selects the correct tokenizer for that checkpoint.

Example

Python
model_checkpoint = "bert-base-cased"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


After the tokenizer is loaded, we can apply it to a sample input, as given below.

Python
tokenizer(train.question1[0], train.question2[0],
          padding='max_length',        # Pad to max_length
          truncation=True,             # Truncate to max_length
          max_length=100,
          return_tensors='tf',
          return_token_type_ids=True)

The output is a dictionary consisting of the following three keys:

‘input_ids’, ‘token_type_ids’ and ‘attention_mask’.

We need to handle the two sequences as a pair and pre-process them together. For this, both questions are combined into a single sequence of input_ids with the help of special tokens: the classifier token ([CLS]) at the start and separator tokens ([SEP]) after each question. This matches the format the BERT model expects, in which the two sentences are separated by a [SEP] token.

However, since padding is maintained at the max_length of 100, the remaining positions in input_ids are filled with 0s (the id of the [PAD] token).

Generally, the special tokens are enough for the model to detect the presence of two sequences. However, BERT models also take token_type_ids as input. These ids form a binary mask that identifies the two sequences: all the tokens of the first sequence are marked with 0s, and all the tokens of the second sequence are marked with 1s.

For example, if the first question contains 10 tokens and the second question contains 8 tokens, you will observe the following output:

PowerShell
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

To return these ids, you need to set the argument return_token_type_ids = True.
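To see all of this concretely, here is a small sketch that assigns the tokenizer output to a variable (the name encoding is our own; the earlier snippet did not assign the result) and inspects its contents:

Python
# Re-run the sample encoding and store it so that it can be inspected.
encoding = tokenizer(train.question1[0], train.question2[0],
                     padding='max_length', truncation=True, max_length=100,
                     return_tensors='tf', return_token_type_ids=True)

print(encoding.keys())                     # input_ids, token_type_ids, attention_mask
print(encoding['input_ids'].shape)         # (1, 100) because of max_length padding
print(encoding['token_type_ids'][0][:20])  # 0s for the first question's tokens, 1s for the second's
print(tokenizer.decode(encoding['input_ids'][0]))  # [CLS] question1 [SEP] question2 [SEP] [PAD] ...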

Now that you have seen how the tokenizer works for a sample input, let’s apply the tokenizer to pre-process the entire data set.

To apply the tokenizer to the entire data set, we need to first create a function and apply it using the map() function.

Example

Python
def preprocess_function(records):
    return tokenizer(records['question1'], records['question2'], truncation=True,
                     return_token_type_ids=True, max_length=75)

encoded_dataset = dataset.map(preprocess_function, batched=True)

Output

PowerShell
DatasetDict({
    train: Dataset({
        features: ['question1', 'question2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 363846
    })
    valid: Dataset({
        features: ['question1', 'question2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 40430
    })
    test: Dataset({
        features: ['question1', 'question2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 390965
    })
})

However, the transformed data set still contains the original features alongside the encoded ones. Before feeding the data to the model, we identify the columns added by the tokenizer; only these columns (together with the label) will be passed on.

Example

Python
pre_tokenizer_columns = set(dataset["train"].features)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)

Output

PowerShell
Columns added by tokenizer: ['token_type_ids', 'attention_mask', 'input_ids']
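If you prefer to drop the raw text columns explicitly rather than relying on column selection later, the Datasets library also provides a remove_columns() method. This step is optional here because to_tf_dataset() below is given only the tokenizer columns anyway; the encoded_dataset_clean name is our own.

Python
# Optional: explicitly drop the raw columns ('label' is kept as the target).
encoded_dataset_clean = encoded_dataset.remove_columns(['question1', 'question2', 'idx'])
print(encoded_dataset_clean['train'].features.keys())
# dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])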

Now that you have encoded and processed the input, the next step is to convert the data set into a format compatible with the chosen TensorFlow framework using the to_tf_dataset() function. You also need to import a data collator from the Transformers library to pad the varying sequence lengths within each batch to a common length. Let’s find out how to do this in the next video.

Here is the code to create the train and validation data sets:

Python
from transformers import DataCollatorWithPadding

# Dynamically pad each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

batch_size = 32  # assumed value; adjust to whatever your GPU memory allows

tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
tf_validation_dataset = encoded_dataset["valid"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

Note

Here, shuffling is applied to the training data set only; the validation data set is kept in its original order.

Since the transformed tf.data.Dataset is iterable, we can pull one batch from it using iter() and next() to observe how the processing is executed.

Example

Python
z = next(iter(tf_train_dataset))
tokenizer.decode(z[0]['input_ids'][0])

Output

PowerShell
[CLS] How should I prepare for CA final law? [SEP] How should I prepare for CA final law? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

You may have noticed the presence of the special tokens (which separate the two question inputs) and the [PAD] tokens. Also, while training a model, we need to provide the number of labels we want the model to predict. This can be extracted using train.label.nunique().
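As a quick sketch, the label count for this binary task can be obtained as follows and then passed to the classification model when it is configured in the next segment:

Python
# Number of distinct classes the model must predict (2 for QQP: duplicate / not a duplicate).
num_labels = train.label.nunique()
print(num_labels)   # 2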

Now that you know how to load a data set and pre-process it, in the next segment, you will learn how to configure your model and train it.
