
Understanding Tokenization in NLP.

A hands-on tutorial with code! πŸš€


Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. These tokens can then be numerically encoded and fed as input to a model.

The performance of your model depends largely on how well you tokenize the inputs, which makes tokenization an essential preprocessing step in any NLP task.

To better understand tokenization, let’s walk through the different ways a raw piece of text can be converted into tokens and weigh the advantages and disadvantages of each method (a quick preview follows the list):

  • Character Tokenization

  • Word Tokenization

  • Subword Tokenization

  • Entire Dataset Tokenization
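As a quick preview of how these strategies differ, here is the same sentence split at the word level using plain Python. This is a naive whitespace split for illustration only; real word tokenizers also handle punctuation.

# Word tokenization preview: a naive whitespace split
raw_text = "We love NLP!"
word_tokens = raw_text.split()

πŸ‘‰ ['We', 'love', 'NLP!']

Notice how the '!' stays glued to 'NLP!', which is exactly why word tokenization takes more care than a plain split. Character tokenization gets the full walkthrough below.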

This will be a hands-on exercise, so get your Jupyter Notebook running!! πŸ“’ πŸš€

Character Tokenization:

The simplest tokenization strategy is to break the text down into individual character tokens and feed those to the model. Implementing this in Python couldn’t be easier.

Check this out πŸ‘‡

raw_text = "We love NLP!"
tokens = list(raw_text)

πŸ‘‰ ['W', 'e', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']

Now, our model needs each token to be converted into an integer.

Here’s how we do it πŸ‘‡

# Numerical encoding of individual characters
token2idx = {char: idx for idx, char in enumerate(sorted(set(tokens)))}

πŸ‘‰ {' ': 0, '!': 1, 'L': 2, 'N': 3, 'P': 4, 'W': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}

# Using token2idx to map our tokenized text to integers
integer_tokens = [token2idx[token] for token in tokens]

πŸ‘‰ [5, 6, 0, 7, 8, 9, 6, 0, 3, 2, 4, 1]
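As a quick sanity check, you can invert the mapping and decode the integers back into the original text. (The idx2token name below is just an illustrative choice for the inverse dictionary.)

# Reverse mapping: decode integers back into characters
idx2token = {idx: char for char, idx in token2idx.items()}
decoded = "".join(idx2token[i] for i in integer_tokens)

πŸ‘‰ 'We love NLP!'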

We now have each of our tokens mapped to a unique integer. The next step is to one-hot encode each of these integers.

Check this out πŸ‘‡
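Here’s a minimal sketch of one way to do it, using PyTorch’s one_hot utility. The framework choice here is an assumption; NumPy’s np.eye would work just as well.

# One-hot encode the integer tokens (sketch assumes PyTorch is installed)
# num_classes is the vocabulary size, i.e. the number of unique characters
import torch
import torch.nn.functional as F

one_hot_encodings = F.one_hot(torch.tensor(integer_tokens), num_classes=len(token2idx))
one_hot_encodings.shape

πŸ‘‰ torch.Size([12, 10])

one_hot_encodings[0]  # the vector for 'W' (index 5)

πŸ‘‰ tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0])

Each row is a vector of length 10 (the vocabulary size) with a single 1 at the token’s index.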
