
Exploring the Tokenization Pipeline

As you learned earlier, the first component inside the pipeline() function is pre-processing, which is performed by the tokenizers in the Transformers library. A tokenizer takes the raw input text and transforms it into the format the model expects. Since a model cannot consume raw text directly, the first job of a tokenizer is to convert text inputs into numerical data.

In this segment, Ankush will explain what exactly happens in the tokenization pipeline.

Let’s understand how input text is processed.

Example

Python
import pprint as pp  # pp.pprint is used throughout for readable printing

tokenized_text = "Learning NLP is so much rewarding".split()
pp.pprint(tokenized_text)

Output

PowerShell
['Learning', 'NLP', 'is', 'so', 'much', 'rewarding']

Here, the input is split into individual tokens using the split() function. Transformer tokenizers, however, go a step further and convert the input text into a numerical representation of each token.

Example

Python
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer("Learning NLP is so much rewarding")

Output

PowerShell
{'input_ids': [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Note:

Here, we have downloaded the tokenizer used by the BERT model (bert-base-cased).

You may observe that the input text is transformed into a dictionary consisting of the following three keys: ‘input_ids’, ‘token_type_ids’ and ‘attention_mask’.
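
The returned object behaves like a Python dictionary, so each of these values can be inspected individually. The snippet below is a minimal sketch; the variable name encoding is only illustrative.

Python
# A minimal sketch: the tokenizer output can be indexed like a dictionary.
encoding = tokenizer("Learning NLP is so much rewarding")
print(encoding["input_ids"])       # token ids, including the special tokens
print(encoding["token_type_ids"])  # segment ids (all 0 for a single sentence)
print(encoding["attention_mask"])  # 1 for every real token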

Let’s first see how each word is tokenized.

Example

Python
tokens = tokenizer.tokenize("Learning NLP is so much rewarding")
pp.pprint(tokens)

Output

PowerShell
['Learning', 'NL', '##P', 'is', 'so', 'much', 'reward', '##ing']

Here, the tokenizer has applied a subword tokenization technique (BERT uses WordPiece). This type of tokenization algorithm relies on the principle that frequently used words should not be split into smaller subwords, whereas rare words should be decomposed into meaningful subwords. This keeps the vocabulary small and thus ensures faster processing.
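
You can see this principle by tokenizing different words with the same tokenizer. The sketch below is only illustrative; the exact splits depend on the bert-base-cased vocabulary.

Python
# Frequent words usually stay whole, while rarer words are split into subwords.
# The exact splits depend on the bert-base-cased vocabulary.
for word in ["learning", "tokenization"]:
    print(word, "->", tokenizer.tokenize(word))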

After these tokens are generated, they can be converted into ‘input_ids’ using the convert_tokens_to_ids method.

Example

Python
ids = tokenizer.convert_tokens_to_ids(tokens)
pp.pprint(ids)

Output

PowerShell
[9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158]

However, the input_ids generated earlier consist of more values, as shown below.

PowerShell
 'input_ids': [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102]

On comparing the two, you will notice that the special tokens, the start id (101) and the end id (102), are missing from the output returned above. They can be included by passing the add_special_tokens argument.

Example

Python
tokens = tokenizer.tokenize("Learning NLP is so much rewarding", add_special_tokens=True)
pp.pprint(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
pp.pprint(ids)

Output

PowerShell
['[CLS]', 'Learning', 'NL', '##P', 'is', 'so', 'much', 'reward', '##ing', '[SEP]']
[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102]

While decoding, you may also get the special tokens, which are not required in production.
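
For instance, decoding the ids obtained above with the decode method (a minimal sketch) returns the special tokens along with the text.

Example

Python
tokenizer.decode(ids)

Output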

PowerShell
[CLS] Learning NLP is so much rewarding [SEP]

You may prevent these special tokens from appearing in the decoded text by using the following argument: skip_special_tokens=True.

Example

Python
tokenizer.decode(ids, skip_special_tokens=True)

Output

PowerShell
Learning NLP is so much rewarding

Great! Now you know how a sentence is converted into tokens and then into ids. But what if we have multiple sentences? In the upcoming video, you will learn how to deal with them.

Let’s take a look at another example.

Example

Python
tokenized_output = tokenizer(["Learning NLP is so much rewarding", "Another test sentence"])
tokenized_output['input_ids']

Output

PowerShell
[[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
 [101, 2543, 2774, 5650, 102]]

Here, the tokenized outputs are not of the same length. The tokenizer function allows us to control the output using the following arguments: padding and truncation.

Example

Python
sequences = ["Learning NLP is so much rewarding","Another test sentence"]
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
pp.pprint(model_inputs)

Since the two input sequences are of different lengths, padding is applied based on the longest one. Here, the first sequence is tokenized into 10 tokens (8 subword tokens plus the two special tokens).

The second sequence has only 5 tokens and needs another 5 to fill this gap, so the tokenizer pads it with 0s at the tail. This is reflected in attention_mask and input_ids.

Output

PowerShell
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]
'input_ids': [[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
               [101, 2543, 2774, 5650, 102, 0, 0, 0, 0, 0]],

The 0s in the attention mask signal the model to ignore those positions, as they are padding entries, and to consider only the positions marked 1.
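
As a rough sketch of how these inputs are eventually consumed (this assumes the BertModel class is used and PyTorch is installed; converting the output to tensors with return_tensors is touched on later), the padded batch and its attention mask are passed to the model together:

Python
from transformers import BertModel

# A rough sketch: the attention_mask tells the model which positions are
# real tokens (1) and which are padding (0), so the padded positions are ignored.
model = BertModel.from_pretrained("bert-base-cased")
batch = tokenizer(sequences, padding="longest", return_tensors="pt")
outputs = model(**batch)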

If we apply padding="max_length" instead, 0s are added up to the maximum length the model can accept, which is 512 in this case, unless a smaller max_length is passed explicitly, as shown below.

Example

Python
model_inputs = tokenizer(sequences, padding="max_length", max_length=6)
pp.pprint(model_inputs)

However, this is an inefficient method if the input sequences are short. This is because the gap between the text length and max_length is filled with 0s, which ultimately increases the processing time.

Instead of applying padding, you may truncate the values.

Example

Python
model_inputs = tokenizer(sequences, max_length=6, truncation=True)
pp.pprint(model_inputs)

Output

PowerShell
{'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]],
 'input_ids': [[101, 9681, 21239, 2101, 1110, 102], [101, 2543, 2774, 5650, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]}

Here, any sequence longer than 6 tokens is truncated to a length of 6, so no 0s need to be added in attention_mask. However, max_length should be applied carefully because truncation removes tokens and, with them, information.

You can also have the tokenizer return tensors for a different framework, such as PyTorch, TensorFlow or NumPy, using the return_tensors argument. Let’s see how to do this in the next video.
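
As a quick preview (a minimal sketch that assumes PyTorch is installed), passing return_tensors="pt" makes the tokenizer return PyTorch tensors instead of plain Python lists:

Python
# A minimal sketch: "pt" returns PyTorch tensors; "tf" and "np" would return
# TensorFlow tensors and NumPy arrays, respectively (the framework must be installed).
model_inputs = tokenizer(sequences, padding="longest", return_tensors="pt")
print(type(model_inputs["input_ids"]))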

This brings us to the end of the session on the tokenization pipeline. However, we have not looked at the utility of ‘token_type_ids’. In the next segment, you will learn about ‘token_type_ids’ with the help of a case study, in which two different sequences need to be joined in a single ‘input_ids’ entry.
