
Understanding Tokenization in NLP

A hands-on tutorial with code! 🚀


Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. These tokens can then be numerically encoded and fed as input to a model.

The performance of your model depends largely on how well you tokenize the inputs, which makes tokenization an essential preprocessing step for any NLP task.

To better understand tokenization, let's talk about the different ways a raw piece of text can be converted into tokens and weigh the advantages and disadvantages of each method.

  • Character Tokenization

  • Word Tokenization

  • Subword Tokenization

  • Entire Dataset Tokenization

This will be a hands-on exercise, so get your Jupyter Notebook running!! 📒 🚀

Character Tokenization:

The simplest tokenization strategy is to break the text down into individual character tokens and feed them to the model. And implementing this in Python couldn't be easier.

Check this out 👇

raw_text = "We love NLP!"
tokens = list(raw_text)
print(tokens)

👉 ['W', 'e', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']

Now, our model needs each token to be converted into an integer.

Here's how we do it 👇

# Numerical encoding of individual characters

token2idx = {char: idx for idx, char in enumerate(sorted(set(tokens)))}
print(token2idx)

👉 {' ': 0, '!': 1, 'L': 2, 'N': 3, 'P': 4, 'W': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}

# Using token2idx to map our tokenized text to integers

integer_tokens = [token2idx[token] for token in tokens]
print(integer_tokens)

👉 [5, 6, 0, 7, 8, 9, 6, 0, 3, 2, 4, 1]

We now have each of our tokens mapped to a unique integer. The next step is to one-hot encode each of these integers.

Check this out 👇

# One-hot encoding the numbers

import torch
import torch.nn.functional as F

integer_tokens = torch.tensor(integer_tokens)
one_hot_encode_tokens = F.one_hot(integer_tokens, num_classes=len(token2idx))

print(f"Token = {tokens[0]}")
print(f"Integer Encoded Token = {integer_tokens[0]}")
print(f"One hot encoded Token = {one_hot_encode_tokens[0]}")

👉 Token = W
👉 Integer Encoded Token = 5
👉 One hot encoded Token = tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0])

Although character tokenization proves effective for dealing with misspellings and rare words, a plain stream of characters lacks the semantics and much of the structure of how words are formed in a language. Hence it's rarely used in practice.

With this we wrap up character tokenization and move on to word tokenization.

Word Tokenization:

As the name suggests, here we split the text into individual word tokens and map each of them to an integer. Now the model doesn't have the overhead of learning words from characters, which reduces the complexity of learning.

The most intuitive and simplest approach to word tokenization is to split the raw text on whitespace. Again, this is easily done in Python.

Check this out 👇

# Splitting raw text based on whitespaces

word_tokens = raw_text.split()
print(word_tokens)

👉 ['We', 'love', 'NLP!']

But there's a problem with this ✋.

We did not account for punctuation; for instance, "NLP!" is treated as a single token.
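The punctuation part is easy to patch. Here's a minimal sketch using a regex-based split, so that words and punctuation marks become separate tokens:

# Splitting words and punctuation into separate tokens
import re

word_tokens = re.findall(r"\w+|[^\w\s]", raw_text)
print(word_tokens)

👉 ['We', 'love', 'NLP', '!']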

There's a bigger problem, though: if we also assign a unique token to every conjugation, declension, and misspelling, the vocabulary can grow into the millions. And a vocabulary that large poses a problem because it would require the neural network to have a very large number of parameters.

A remedy for this is to discard rare words and keep only the most common ones, say the top 100,000 words; all remaining words are classified as unknown and share a common UNK token.
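Here's a minimal sketch of this idea on a toy corpus (the corpus and vocabulary size below are made up purely for illustration):

# Capping the vocabulary: rare words collapse into UNK
from collections import Counter

corpus_tokens = "we love nlp and we love transformers".split()
vocab_size = 3  # in practice this would be e.g. 100,000

vocab = {word for word, _ in Counter(corpus_tokens).most_common(vocab_size)}
capped_tokens = [word if word in vocab else "UNK" for word in corpus_tokens]
print(capped_tokens)

👉 ['we', 'love', 'nlp', 'UNK', 'we', 'love', 'UNK']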

With all the learnings so far, we continue our quest for a better tokenization approach and discuss subword tokenization in the next section.

Subword Tokenization:

Having assessed the pros and cons of character and word tokenization in the last two sections, you could say subword tokenization tries to combine the best of both worlds.

It does two things:

  • It splits rare words into smaller units, to deal with misspellings and complex words.

  • It keeps common words as is and assigns each of them a unique token.

We will discuss WordPiece (a subword tokenization algorithm developed by Google to pretrain BERT), and the best way to understand it is to see it in action. 🚀

HuggingFace 🤗 provides a class called AutoTokenizer that lets you load the tokenizer associated with a pretrained model.

Check this out 👇

# Install the HuggingFace 🤗 transformers library
!pip install transformers

from transformers import AutoTokenizer

model_ckpt = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Equivalently, you can load the model-specific tokenizer class directly
from transformers import DistilBertTokenizer
distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)

# Let's see the tokenizer in action now

encoded_text = tokenizer(raw_text)
print(encoded_text)

👉 {'input_ids': [101, 2057, 2293, 17953, 2361, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

👉 ['[CLS]', 'we', 'love', 'nl', '##p', '!', '[SEP]']

Some key observations 👇

  • Special tokens [CLS] and [SEP] have been added at the start and end of the sequence to mark its boundaries. The names of these tokens may differ from model to model, but their purpose remains the same (we inspect them directly in the snippet after this list).

  • Each token has been lowercased.

  • The exclamation mark (!) is treated as a separate token, which makes sense.

  • NLP has been split into two tokens ('nl' & '##p'), since it's not a common word.
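Since the special tokens and size limits vary from model to model, it's handy to inspect them on the tokenizer itself. Here's a quick look (the values shown are what distilbert-base-uncased reports):

# Inspecting the tokenizer's special tokens and limits
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)
print(tokenizer.vocab_size)
print(tokenizer.model_max_length)

👉 [CLS] [SEP] [PAD]
👉 30522
👉 512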

Now you might be wondering what the prefix ## in ##p means. It signifies that the preceding string is not whitespace: a token beginning with ## should be merged with the token before it when converting tokens back to a string.
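The tokenizer can do this merging for you via convert_tokens_to_string:

# Merging subword tokens back into a string
print(tokenizer.convert_tokens_to_string(tokens))

👉 [CLS] we love nlp ! [SEP]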

Now we have an idea of how tokenization works for a single string, but how about tokenizing a whole dataset? That's what we cover in our next section. 🚀

Tokenizing the Entire Dataset:

To understand this, we will pick a real example and see things in action. We will be using the tweet emotion dataset available on the 🤗 Hub.

Check this out 👇

# install datasets library
!pip install datasets

from datasets import load_dataset

tweet_emotions = load_dataset("emotion")
print(tweet_emotions)
👉 DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

The tweet emotions dataset contains 20,000 (tweet, label) pairs, split into 16,000 training, 2,000 validation, and 2,000 test examples.
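Before tokenizing, it's worth peeking at the first training example and at the label names (each tweet is labeled with one of six emotions):

# Peek at the first training example and the label names
print(tweet_emotions["train"][0])
print(tweet_emotions["train"].features["label"].names)

👉 {'text': 'i didnt feel humiliated', 'label': 0}
👉 ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']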

Let's define a tokenization function and check the first two examples 👇

# function for tokenization

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

print(tokenize(tweet_emotions["train"][:2]))

👉 {'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

The tokenizer takes two important parameters:

  • padding: Pads shorter samples (here with the [PAD] token, id 0) to match the length of the longest sample in the batch.

  • truncation: Truncates samples to the model's maximum allowed input size (512 tokens for DistilBERT).

We also observe that, along with the encoded tweets, the tokenizer returns an attention_mask, which lets the model ignore the padded parts of the input and attend only to the actual tokens.
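To see the mask lined up with the tokens, here's a quick sketch that reuses our tokenize function on the same two tweets and prints each token of the first (padded) tweet next to its mask value:

# Pair each token of the first tweet with its attention-mask value
batch = tokenize(tweet_emotions["train"][:2])
first_tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0])
for token, mask in zip(first_tokens, batch["attention_mask"][0]):
    print(token, mask)

The trailing [PAD] tokens all come out with a mask of 0; everything else gets a 1.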

Here's a visualization to help you understand truncation and padding better! 👇

[Image: A visual explanation of truncation & padding]

The DatasetDict object has a map() method, which we can use to apply tokenization across all splits of the dataset in a single line of code.

Check this out 👇

# Applying tokenization across the entire dataset

tweet_emotions_encoded = tweet_emotions.map(tokenize, batched=True, batch_size=None)

print(tweet_emotions_encoded['test'].column_names)

👉 ['text', 'label', 'input_ids', 'attention_mask']

The parameter batched=True ensures the tweets are encoded in batches, and batch_size=None applies the tokenize function to each full split as a single batch, so every sample in a split is padded to the same length. Once the data is encoded, we can see the input_ids and attention_mask columns alongside the original text and label.

Now, we have tokenized data ready to be fed into a model 🔥
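As one possible next step (not part of this tutorial), you can ask the dataset to hand back PyTorch tensors directly via set_format, which is convenient when wiring it up to a model:

# Return the encoded columns as PyTorch tensors
tweet_emotions_encoded.set_format(
    "torch", columns=["input_ids", "attention_mask", "label"]
)
print(type(tweet_emotions_encoded["train"][0]["input_ids"]))

👉 <class 'torch.Tensor'>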

You can find all the code on my GitHub 👍

That's a wrap!

If you're interested in:

  • Python

  • Data Science

  • Machine Learning

  • Maths for ML

  • MLOps

  • CV/NLP

  • LLMs

Every day, I share content on the above topics on Twitter, and I will do a weekly deep dive on ML Spring!

Subscribe if you don't want to miss out on the amazing content I have planned for you!

Subscribe to ML Spring!

Follow me on Twitter 👇

Thanks for reading! 🙏
