Understanding Tokenization in NLP
A hands-on tutorial with code!

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. These tokens can then be numerically encoded and given as input to a model.
The performance of your model largely depends on how well you tokenize the inputs, which makes tokenization an essential preprocessing step when working on NLP tasks.
To better understand tokenization, let's look at the different ways a raw piece of text can be converted into tokens and weigh the advantages and disadvantages of each method.
Character Tokenization
Word Tokenization
Subword Tokenization
Entire Dataset Tokenization
This will be a hands-on exercise, so get your Jupyter Notebook running!
Character Tokenization:
The simplest tokenization strategy is to break the text down into individual character tokens and feed those to the model. Implementing this in Python couldn't be easier.
Check this out 👇
raw_text = "We love NLP!"
tokens = list(raw_text)
print(tokens)
👉 ['W', 'e', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']
Now, our model needs each token to be converted into an integer.
Here's how we do it 👇
# Numerical encoding: map each unique character to an integer index
token2idx = {char: idx for idx, char in enumerate(sorted(set(tokens)))}
print(token2idx)
👉 {' ': 0, '!': 1, 'L': 2, 'N': 3, 'P': 4, 'W': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}
# Using token2idx to map our tokenized text to integers
integer_tokens = [token2idx[token] for token in tokens]
print(integer_tokens)
👉 [5, 6, 0, 7, 8, 9, 6, 0, 3, 2, 4, 1]
We now have each of our tokens mapped to a unique integer. The next step is to one-hot encode each of these integers.
Check this out 👇
# One-hot encoding the numbers
import torch
import torch.nn.functional as F
integer_tokens = torch.tensor(integer_tokens)
one_hot_encode_tokens = F.one_hot(integer_tokens, num_classes=len(token2idx))
print(f"Token = {tokens[0]}")
print(f"Integer Encoded Token = {integer_tokens[0]}")
print(f"One hot encoded Token = {one_hot_encode_tokens[0]}")
👉 Token = W
👉 Integer Encoded Token = 5
👉 One hot encoded Token = tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
Although character tokenization is effective at handling misspellings and rare words, treating text as a plain stream of characters loses the semantics and much of the structure of how words are formed in a language. Hence it's rarely used.
With this we wrap up character tokenization and move on to word tokenization.
Word Tokenization:
As the name suggests, here we split the text into individual word tokens and map each of them to an integer. The model no longer has the overhead of learning words from characters, which reduces the complexity of learning.
The most intuitive and simplest way to do word tokenization is to split the raw text on whitespace. Again, this is easily done in Python.
Check this out 👇
# Splitting raw text based on whitespaces
word_tokens = raw_text.split()
print(word_tokens)
👉 ['We', 'love', 'NLP!']
But there's a problem with this approach.
We did not account for punctuation; for instance, 'NLP!' is treated as a single token. And if we also consider conjugations, declensions, and misspellings, each getting its own unique token, the size of the vocabulary can grow into the millions.
This poses a problem because it would require the neural network to have a very large number of parameters.
A remedy is to discard rare words and keep only, say, the 100,000 most common ones; all the remaining words are classified as unknown and share a common UNK token.
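To make this concrete, here's a minimal sketch of a capped word-level vocabulary with an UNK token (the toy corpus, the cap of 5 words, and the token name are illustrative assumptions, not something from a specific library):
# A toy corpus and vocabulary cap, purely for illustration
from collections import Counter

corpus = ["We love NLP!", "We love Python", "NLP is fun"]
max_vocab_size = 5  # in practice this could be ~100,000

# Count word frequencies across the corpus
word_counts = Counter(word for text in corpus for word in text.split())

# Keep only the most common words; everything else maps to UNK
vocab = {"UNK": 0}
for word, _ in word_counts.most_common(max_vocab_size):
    vocab[word] = len(vocab)

def encode(text):
    return [vocab.get(word, vocab["UNK"]) for word in text.split()]

print(encode("We love pizza"))  # 'pizza' is out of vocabulary, so it maps to the UNK id 0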
With these learnings, we continue our quest for a better tokenization approach and discuss subword tokenization in the next section.
Subword Tokenization:
Having assessed the pros and cons of character and word tokenization in the last two sections, we can see that subword tokenization tries to combine the best of both worlds.
It does two things:
Splits rare words into smaller units to deal with misspellings and complex words.
Keeps common words as they are and assigns each a unique token.
We will discuss WordPiece (a subword tokenization algorithm developed by Google to pretrain BERT), and the best way to understand it is to see it in action.
HuggingFace 🤗 provides a class called AutoTokenizer that lets you access the tokenizer associated with a pretrained model.
Check this out 👇
# Install the HuggingFace 🤗 transformers library
!pip install transformers
from transformers import AutoTokenizer
model_ckpt = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# Alternatively, you can load the model-specific tokenizer class directly
from transformers import DistilBertTokenizer
distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)
# Let's see the tokenizer in action now
encoded_text = tokenizer(raw_text)
print(encoded_text)
👉 {'input_ids': [101, 2057, 2293, 17953, 2361, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
👉 ['[CLS]', 'we', 'love', 'nl', '##p', '!', '[SEP]']
Some key observations 👇
Special tokens [CLS] and [SEP] are added at the start and end of the sequence. The names of these tokens may differ from model to model, but their purpose remains the same.
Each token has been lowercased.
The exclamation mark (!) is treated as a separate token, which makes sense.
NLP has been split into two tokens ('nl' and '##p'), since it's not a common word.
Now you might be wondering what the ## prefix in ##p means. It signifies that the token is not preceded by whitespace: a token beginning with ## must be merged with the token before it.
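As a quick sanity check, the tokenizer can merge those ## pieces back into readable text, producing something like "[CLS] we love nlp ! [SEP]" (this reuses the tokenizer and tokens objects from above):
# Merge the WordPiece tokens back into a string; '##' pieces are joined onto the preceding token
print(tokenizer.convert_tokens_to_string(tokens))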
Now that we have an idea of how tokenization works for a single string, how about tokenizing a whole dataset? That's what we'll see in the next section. 👇
Tokenizing the Entire Dataset:
To understand this, we'll pick a real example and see things in action. We will be using the emotion dataset of tweets available on the 🤗 Hub.
Check this out 👇
# install datasets library
!pip install datasets
from datasets import load_dataset
tweet_emotions = load_dataset("emotion")
print(tweet_emotions)
👉 DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
The emotion dataset contains 20,000 (tweet, label) pairs, split into 16,000 training, 2,000 validation, and 2,000 test examples.
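Before tokenizing, it's worth peeking at a single raw training example (the exact tweet you see will depend on the dataset version):
# Inspect one raw example: a dict with a 'text' string and an integer 'label'
print(tweet_emotions["train"][0])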
Let's define a tokenization function and check the first two examples 👇
# Function for tokenization
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
print(tokenize(tweet_emotions["train"][:2]))
👉 {'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The tokenizer takes two important parameters here:
padding=True: zero-pads each sample to match the length of the longest sample in the batch.
truncation=True: truncates samples to the model's maximum allowed input size.
We also observe that, along with the encoded tweets, the tokenizer returns an attention_mask, which tells the model to ignore the padded parts of the input and attend only to the actual tokens.
Here's a visualization to help you understand truncation and padding better! 👇

A visual explanation of truncation & padding.
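To see this on real data, we can turn the first (shorter, padded) encoded tweet from the batch above back into tokens, which makes the [PAD] tokens and the matching zeros in the attention mask visible:
# Re-tokenize the first two training tweets and inspect the shorter, padded one
batch = tokenize(tweet_emotions["train"][:2])
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
print(batch["attention_mask"][0])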
The DatasetDict object has a map() method, which we can use to apply tokenization across all the splits of the dataset in a single line of code.
Check this out 👇
# Applying tokenization across the entire dataset
tweet_emotions_encoded = tweet_emotions.map(tokenize, batched=True, batch_size=None)
print(tweet_emotions_encoded['test'].column_names)
👉 ['text', 'label', 'input_ids', 'attention_mask']
The parameter batched=True ensures the tweets are encoded in batches (and batch_size=None processes each split as a single batch, so every sample in a split is padded to the same length). Once the data is encoded, we can see the encoded input_ids and attention_mask columns alongside the original text and label columns.
Now we have tokenized data ready to be fed into a model 🔥
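As an optional extra step (this goes slightly beyond the walkthrough above, and assumes you'll train with PyTorch), you can ask the 🤗 datasets library to return these columns as tensors:
# Optionally return PyTorch tensors for the columns a model will consume
tweet_emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
print(tweet_emotions_encoded["train"][0]["input_ids"][:10])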
You can find all the code on my GitHub.
That's a wrap!
If you're interested in:
Python
Data Science
Machine Learning
Maths for ML
MLOps
CV/NLP
LLMs
Every day, I share content on the above topics on Twitter, and I do a weekly deep dive here on ML Spring!
Subscribe if you don't want to miss out on the amazing content I have planned for you!
Subscribe to ML Spring!
Follow me on Twitter!
Thanks for reading!