Tokenisation

The simplest tokeniser:

  1. Splits text into chunks in some rule-based way; e.g. a word-based tokeniser would split on spaces and punctuation.
text = "Hello, world!"
tokens = text.split()
print(tokens)
['Hello,', 'world!']

Python's split() method only splits on spaces by default. We can use regular expressions to cover spaces and punctuation that also delimit words.

How does the re.findall(r'\w+|[^\w\s]', text) call work?

It has two parts separated by | (which means "or"):

  • \w+ matches one or more word characters (letters, digits, underscores).
  • [^\w\s] matches anything that's not a word character and not whitespace. So it matches punctuation.

The ^ inside the brackets [^...] means "not", so [^\w\s] literally means "not a word character, not whitespace".


Both split() and findall() break text into pieces, but they work differently:

  • split(): You tell it what to remove (the separators), and it gives you what's left.
  • findall(): You tell it what to keep (the pattern), and it gives you all matches.
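The difference is easy to see side by side (using a simple whitespace pattern for split()):

```python
import re

text = "Hello, world!"
# split(): specify what to remove (here, runs of whitespace); keep the rest
print(re.split(r"\s+", text))            # ['Hello,', 'world!']
# findall(): specify what to keep; punctuation becomes its own token
print(re.findall(r"\w+|[^\w\s]", text))  # ['Hello', ',', 'world', '!']
```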
import re
tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)
['Hello', ',', 'world', '!']
  2. Maps those chunks to numeric values:
vocab = {token: idx for idx, token in enumerate(set(tokens))}
print(vocab)
{'Hello': 0, ',': 1, '!': 2, 'world': 3}
token_ids = [vocab[token] for token in tokens]
print(token_ids)
[0, 1, 3, 2]

(Set iteration order is arbitrary in Python, so the exact indices will differ between runs; sort the tokens first if you want a deterministic vocabulary.)

If we try to tokenise text with a word not in our vocabulary, we'd get a KeyError. So we need our first example of a special token, <UNK>.

vocab['<UNK>'] = len(vocab)

Now we can safely encode any text by using vocab.get(token, vocab['<UNK>']) instead of vocab[token].

new_text = "Goodbye, world!"
new_tokens = re.findall(r'\w+|[^\w\s]', new_text)
new_ids = [vocab.get(token, vocab['<UNK>']) for token in new_tokens]
print(new_tokens)
print(new_ids)
['Goodbye', ',', 'world', '!']
[4, 1, 3, 2]
The main limitation of word-level tokenisation is vocabulary size: every distinct word (and every spelling variant of it) needs its own entry, so the vocabulary grows without bound. This motivates subword methods.

Byte pair encoding (BPE)

Instead of having separate tokens for every word, BPE constructs an efficient set of subword units.

  1. Start with individual characters as tokens.
  2. Find the most frequent pair of adjacent tokens.
  3. Merge that pair into a new token.
  4. Repeat until we reach the desired size of our vocabulary.
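The four steps can be sketched on a toy string (a hypothetical example, separate from the full implementation in this note):

```python
from collections import Counter

def bpe_step(tokens: list[str]) -> list[str]:
    # count adjacent pairs and merge every occurrence of the most frequent one
    pairs = Counter(zip(tokens, tokens[1:]))
    best = max(pairs, key=pairs.get)
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")
print(bpe_step(tokens))  # the most frequent pair is ('a', 'a'), so every 'aa' fuses
```

Repeating bpe_step until the vocabulary reaches the target size is the whole training loop.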
  • Word-level BPE splits on whitespace first, then applies BPE within each word. This is what GPT-2 does.

  • Pure character-level BPE treats the entire text as one sequence and applies merges across everything, including spaces.

  • Special space markers: the space is folded into the start of the token rather than discarded (GPT-2 displays it as Ġ, SentencePiece as ▁).

  • Byte-level encoding: operate on raw UTF-8 bytes instead of Unicode characters, so any input is representable from a base vocabulary of just 256 entries (GPT-2 does this too).

Implementation from scratch at this link

import regex as re  # stdlib `re` doesn't support the \p{...} classes used in the split pattern

class BPE:
    def __init__(self, vocab_size: int):
        self.vocab: dict[int, bytes] = {}
        self.merges: dict[tuple[int, int], int] = {}
        self.vocab_size = vocab_size
        self.split_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
        self.special_tokens = ["<eot>", "<eod>", "<eos>", "<pad>", "<unk>", "<mask>", "<cls>", "<sep>"]
 
    def encode(self, text: str) -> list[list[int]]:
        text_chunks = re.findall(self.split_pattern, text)
        encoded_tokens = []
        for tc in text_chunks:
            encoded_tokens.append(self._encode_single(tc))
        return encoded_tokens
 
    def decode(self, tokens: list[list[int]]) -> str:
        decoded_tokens = [self._decode_single(tc) for tc in tokens]
        decoded_text = "".join(decoded_tokens)
        return decoded_text
 
    def train(self, text: str):
        self.vocab = {i: bytes([i]) for i in range(256)}
        text_chunks = re.findall(self.split_pattern, text)
        tokenised_chunks = [self.get_tokens(ch) for ch in text_chunks]
        for i in range(256, self.vocab_size - len(self.special_tokens)):
            pair_counts = {}
            for tc in tokenised_chunks:
                for pair, count in self.get_pair_counts(tc).items():
                    pair_counts[pair] = pair_counts.get(pair, 0) + count
            most_common_pair = max(pair_counts, key=lambda p: pair_counts[p])
            tokenised_chunks = [self.merge_pair(tc, most_common_pair, i) for tc in tokenised_chunks]
            self.merges[most_common_pair] = i
            self.vocab[i] = self.vocab[most_common_pair[0]] + self.vocab[most_common_pair[1]]
        for i, st in enumerate(self.special_tokens):
            self.vocab[self.vocab_size - len(self.special_tokens) + i] = st.encode("utf-8")
 
    def _encode_single(self, text: str) -> list[int]:
        tokens = self.get_tokens(text)
        for pair, idx in self.merges.items():
            tokens = self.merge_pair(tokens, pair, idx)
        return tokens
 
    def _decode_single(self, tokens: list[int]) -> str:
        token_bytes: bytes = b"".join(self.vocab[token] for token in tokens)
        return token_bytes.decode("utf-8", errors="replace")
 
    def get_tokens(self, text: str) -> list[int]:
        tokens: bytes = text.encode(encoding="utf-8")
        return list(map(int, tokens))
 
    def get_pair_counts(self, tokens: list[int]) -> dict[tuple[int, int], int]:
        counts: dict[tuple[int, int], int] = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        return counts
 
    def merge_pair(self, tokens: list[int], pair: tuple[int, int], idx: int) -> list[int]:
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                new_tokens.append(idx)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        return new_tokens
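The byte-level starting point is worth seeing concretely. get_tokens just takes the UTF-8 bytes of the text, which is why the base vocabulary has exactly 256 entries and why decoding is lossless even for non-ASCII input:

```python
text = "héllo"
tokens = list(text.encode("utf-8"))
print(tokens)  # [104, 195, 169, 108, 108, 111] — 'é' takes two bytes in UTF-8
print(bytes(tokens).decode("utf-8"))  # héllo — a lossless round trip
```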

Avoiding merges across word boundaries

Paraphrasing the GPT-2 paper here.

Consider the word "dog". In a large corpus it appears in many contexts:

  • dog
  • dog.
  • dog!
  • dog?
  • dog,

If we trained a BPE tokeniser like the one above, it might decide to merge d+o+g+. into a single token dog., and separately d+o+g+! into dog!, and so on. Each variant would eat up a slot in our fixed vocabulary budget. We'd be wasting indices on what is essentially the same concept, and the model would have to learn that all these variants refer to almost the same thing.

If we "pre-tokenise" the text using a regex that splits on character categories (e.g. word characters, whitespace, punctuation), and don't allow the tokeniser to merge across these category boundaries, we can avoid this problem. Something like dog+. can never be fused into a single token.
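A simplified, ASCII-only pre-tokenisation pattern (the real GPT-2 pattern is far more elaborate) shows the effect: word characters and punctuation land in separate chunks, so no merge can ever span them.

```python
import re

# simplified pre-tokeniser: an optional leading space plus letters,
# or a run of digits, or a single non-alphanumeric character
pattern = r" ?[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]"
print(re.findall(pattern, "The dog. The dog!"))
# → ['The', ' dog', '.', ' The', ' dog', '!']
```

Note that "dog" and "." never share a chunk, while the leading space stays attached to the following word.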

We make an exception for the space character, which is allowed to merge with the word characters that follow it. Spaces are so common that the compression we gain outweighs some word fragmentation.

SentencePiece

(Implementation here)

Dummy whitespace tokens

In my byte pair encoding from scratch notebook, I used a regex to split text into chunks. Merges can never take place across those hard word boundaries.

SentencePiece runs on the entire input as a single character stream and lets merges happen freely across any adjacent characters.

It replaces every space character with the special \u2581 (▁) character (which resembles an underscore but is much rarer, so it won't collide with real underscores).

Because ▁ can merge with any other character, word-boundary information is carried inside the token itself. We end up with merges like ▁ + h -> ▁h, and later ▁h + e -> ▁he.

We prepend ▁ to the start of an input as well. This helps the model avoid encoding a word differently just because it happened to appear at the start of the input.

It's pretty easy to implement this in Python. We can take the BPE implementation above and just tweak the get_tokens and decode methods, after defining self.space_token = "\u2581" in __init__.

We add these two lines to the start of the get_tokens method:

text = f" {text}"
text = text.replace(" ", self.space_token)

And this line to the end of the decode method:

text = text.replace(self.space_token, " ").strip()
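Pulled out of the class, the two tweaks amount to a pre-processing and a post-processing function (a minimal sketch, assuming ▁ as the space token):

```python
SPACE = "\u2581"  # ▁, U+2581 LOWER ONE EIGHTH BLOCK

def sp_preprocess(text: str) -> str:
    # prepend the dummy space, then mark every space with ▁
    return f" {text}".replace(" ", SPACE)

def sp_postprocess(text: str) -> str:
    # invert the marking and drop the leading dummy space
    return text.replace(SPACE, " ").strip()

marked = sp_preprocess("hello world")
print(marked)                  # ▁hello▁world
print(sp_postprocess(marked))  # hello world
```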

Byte Fallback

Rather than emitting <unk> for out-of-vocabulary characters, unknown characters are encoded as their individual UTF-8 bytes, represented as 256 special tokens <0x00> through <0xFF>. This makes the encoding lossless.
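A sketch of the idea, with token names following SentencePiece's <0xNN> convention:

```python
ch = "🦀"  # pretend this character is out of vocabulary
fallback = [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
print(fallback)  # ['<0xF0>', '<0x9F>', '<0xA6>', '<0x80>'] — four byte tokens
# decoding collects the bytes back, losslessly
decoded = bytes(int(tok[1:-1], 16) for tok in fallback).decode("utf-8")
print(decoded == ch)  # True
```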

Character coverage

This parameter (SentencePiece defaults to 0.9995) sets the minimum fraction of characters in the training corpus that must be representable. Rare characters beyond the threshold are either dropped or handled via byte fallback. This prevents rare Unicode characters from wasting vocabulary slots.

Unicode Normalisation

Applying NFKC normalisation to the input before training and encoding means that visually equivalent characters are treated as identical.
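Python's unicodedata module shows what NFKC does:

```python
import unicodedata

# NFKC folds compatibility characters to a canonical form
print(unicodedata.normalize("NFKC", "ﬁle"))         # ligature ﬁ -> "fi", giving "file"
print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ"))   # full-width letters -> ASCII "Hello"
```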

Subword regularisation / stochastic segmentation

At training time, instead of always using the single best tokenisation, we can sample from multiple valid segmentations according to their probabilities. This acts as data augmentation and makes the model more robust to different tokenisations of the same text.
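A toy sketch of what "multiple valid segmentations" means. The vocabulary here is made up, and real implementations sample in proportion to each segmentation's probability rather than uniformly:

```python
import random

vocab = {"h", "e", "l", "o", "he", "hel", "lo", "llo"}

def segmentations(word: str) -> list[list[str]]:
    # enumerate every way to split `word` into in-vocabulary pieces
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        if word[:i] in vocab:
            out += [[word[:i]] + rest for rest in segmentations(word[i:])]
    return out

segs = segmentations("hello")
print(segs)
# training-time augmentation: sample a segmentation
# instead of always taking the single best one
print(random.choice(segs))
```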

Unigram language model

In addition to BPE, SentencePiece also supports a unigram language model. This starts with a large candidate vocabulary and prunes it with an EM procedure that maximises the likelihood of the training data under a unigram language model. It is more probabilistically principled than BPE's greedy merging. T5 and XLNet use it.

Tags: AI