Understanding Tokenization in NLP.
A hands-on tutorial with code!
Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. These tokens can then be numerically encoded and given as input to a model.
The performance of your model largely depends on how well you tokenize the inputs, which makes tokenization an essential preprocessing step when working on NLP tasks.
To better understand tokenization, let's talk about the different ways a raw piece of text can be converted into tokens and weigh the advantages and disadvantages of each method:
Character Tokenization
Word Tokenization
Subword Tokenization
Entire Dataset Tokenization
This will be a hands-on exercise, so get your Jupyter Notebook running!
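Before we dive in, here's a quick taste of the word-level approach, a minimal sketch using nothing but Python's built-in str.split() (real word tokenizers handle punctuation and casing far more carefully):

raw_text = "We love NLP!"

# Naive word tokenization: split on whitespace
word_tokens = raw_text.split()
print(word_tokens)

Output: ['We', 'love', 'NLP!']

Notice how the punctuation stays glued to 'NLP!'; that is exactly the kind of trade-off we'll weigh as we compare methods. Now, on to the simplest strategy.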
Character Tokenization:
The simplest tokenization strategy is to break the text down into individual character tokens and feed them to the model. Implementing this in Python couldn't be easier.
Check this out:
raw_text = "We love NLP!"
tokens = list(raw_text)
print(tokens)
Output: ['W', 'e', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']
Now, our model needs each token to be converted into an integer.
Here's how we do it:
# Map each unique character to an integer index
token2idx = {char: idx for idx, char in enumerate(sorted(set(tokens)))}
print(token2idx)
Output: {' ': 0, '!': 1, 'L': 2, 'N': 3, 'P': 4, 'W': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}
# Using token2idx to map our tokenized text to integers
integer_tokens = [token2idx[token] for token in tokens]
print(integer_tokens)
Output: [5, 6, 0, 7, 8, 9, 6, 0, 3, 2, 4, 1]
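As a quick sanity check, we can invert the mapping and decode the integers back into the original text (optional, but it confirms the encoding is lossless):

# Invert the mapping: integer index -> character
idx2token = {idx: char for char, idx in token2idx.items()}

# Decode the integer tokens back into the original string
print("".join(idx2token[i] for i in integer_tokens))

Output: We love NLP!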
We now have each of our tokens mapped to a unique integer. The next step would be to one-hot encode each of these integers.
Check this out:
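Here's a minimal sketch using a plain list comprehension (in practice you might reach for a library helper such as PyTorch's torch.nn.functional.one_hot, but the hand-rolled version makes the idea explicit):

# One-hot encoding: each integer becomes a vector of length vocab_size
# with a single 1 at the position given by that integer
vocab_size = len(token2idx)
one_hot_vectors = [
    [1 if position == token else 0 for position in range(vocab_size)]
    for token in integer_tokens
]
print(one_hot_vectors[0])  # the vector for 'W', which has index 5

Output: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

Each token is now a vector the size of our vocabulary, with a single 1 marking its integer index (just 10 distinct characters in our tiny example).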