The simplest tokeniser:
```python
text = "Hello, world!"
tokens = text.split()
print(tokens)
# ['Hello,', 'world!']
```

Python's `split()` method splits on whitespace by default. We can use a regular expression to also treat the punctuation that delimits words as a boundary.
How does the regex `r'\w+|[^\w\s]'` work? It has two parts separated by `|` (which means "or"):

- `\w+` matches one or more word characters (letters, digits, underscores).
- `[^\w\s]` matches anything that's not a word character and not whitespace, so it matches punctuation. The `^` inside the brackets `[^...]` means "not", so `[^\w\s]` literally means "not a word character, not whitespace".
Both split() and findall() break text into pieces, but they work differently:
- `split()`: you tell it what to remove (the separators), and it gives you what's left.
- `findall()`: you tell it what to keep (the pattern), and it gives you all matches.

```python
import re

tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)
# ['Hello', ',', 'world', '!']

vocab = {token: idx for idx, token in enumerate(set(tokens))}
print(vocab)
# {'Hello': 0, ',': 1, '!': 2, 'world': 3}

token_ids = [vocab[token] for token in tokens]
print(token_ids)
# [0, 1, 3, 2]
```

If we try to tokenise text with a word not in our vocabulary, we'd get a KeyError. So we need our first example of a special token, `<UNK>`.

```python
vocab['<UNK>'] = len(vocab)
```

Now we can safely encode any text by using `vocab.get(token, vocab['<UNK>'])` instead of `vocab[token]`.
```python
new_text = "Goodbye, world!"
new_tokens = re.findall(r'\w+|[^\w\s]', new_text)
new_ids = [vocab.get(token, vocab['<UNK>']) for token in new_tokens]
print(new_tokens)
print(new_ids)
# ['Goodbye', ',', 'world', '!']
# [5, 1, 3, 2]
```

Instead of having a separate token for every word, BPE constructs an efficient set of subword units.
- Word-level BPE: splits on whitespace first, then applies BPE within each word. This is what GPT-2 does.
- Pure character-level BPE: treats the entire text as one sequence and applies merges across everything, including spaces.
- Special space markers (like GPT-2)
- Byte-level encoding
Implementation from scratch at this link
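Before the full class, here's a minimal standalone sketch of the core BPE step: count adjacent token pairs, then replace every occurrence of the most frequent pair with a new token id.

```python
def pair_counts(tokens: list[int]) -> dict[tuple[int, int], int]:
    # count every adjacent pair of token ids
    counts: dict[tuple[int, int], int] = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    # replace every occurrence of `pair` with the single id `new_id`
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("abababcd".encode())  # [97, 98, 97, 98, 97, 98, 99, 100]
counts = pair_counts(tokens)
best = max(counts, key=counts.get)  # (97, 98), i.e. "ab", appears 3 times
tokens = merge(tokens, best, 256)
print(tokens)                       # [256, 256, 256, 99, 100]
```

Repeating this loop until the vocabulary is full is all BPE training does; the class below adds pre-tokenisation, special tokens, and encode/decode on top.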
```python
import regex as re  # the third-party `regex` module, needed for \p{...} classes


class BPE:
    def __init__(self, vocab_size: int):
        self.vocab: dict[int, bytes] = {}
        self.merges: dict[tuple[int, int], int] = {}
        self.vocab_size = vocab_size
        # GPT-4-style pre-tokenisation pattern: contractions, words, short
        # digit runs, punctuation and whitespace end up in separate chunks
        self.split_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
        self.special_tokens = ["<eot>", "<eod>", "<eos>", "<pad>", "<unk>", "<mask>", "<cls>", "<sep>"]

    def encode(self, text: str) -> list[list[int]]:
        text_chunks = re.findall(self.split_pattern, text)
        encoded_tokens = []
        for tc in text_chunks:
            encoded_tokens.append(self._encode_single(tc))
        return encoded_tokens

    def decode(self, tokens: list[list[int]]) -> str:
        decoded_tokens = [self._decode_single(tc) for tc in tokens]
        decoded_text = "".join(decoded_tokens)
        return decoded_text

    def train(self, text: str):
        # start from the 256 raw byte values
        self.vocab = {i: bytes([i]) for i in range(256)}
        text_chunks = re.findall(self.split_pattern, text)
        tokenised_chunks = [self.get_tokens(ch) for ch in text_chunks]
        for i in range(256, self.vocab_size - len(self.special_tokens)):
            # count adjacent pairs across all chunks, never across chunk boundaries
            pair_counts: dict[tuple[int, int], int] = {}
            for tc in tokenised_chunks:
                for pair, count in self.get_pair_counts(tc).items():
                    pair_counts[pair] = pair_counts.get(pair, 0) + count
            most_common_pair = max(pair_counts, key=lambda p: pair_counts[p])
            tokenised_chunks = [self.merge_pair(tc, most_common_pair, i) for tc in tokenised_chunks]
            self.merges[most_common_pair] = i
            self.vocab[i] = self.vocab[most_common_pair[0]] + self.vocab[most_common_pair[1]]
        # reserve the top of the vocabulary for special tokens
        for i, st in enumerate(self.special_tokens):
            self.vocab[self.vocab_size - len(self.special_tokens) + i] = st.encode("utf-8")

    def _encode_single(self, text: str) -> list[int]:
        tokens = self.get_tokens(text)
        # apply merges in the order they were learned
        for pair, idx in self.merges.items():
            tokens = self.merge_pair(tokens, pair, idx)
        return tokens

    def _decode_single(self, tokens: list[int]) -> str:
        token_bytes: bytes = b"".join(self.vocab[token] for token in tokens)
        return token_bytes.decode("utf-8", errors="replace")

    def get_tokens(self, text: str) -> list[int]:
        tokens: bytes = text.encode(encoding="utf-8")
        return list(map(int, tokens))

    def get_pair_counts(self, tokens: list[int]) -> dict[tuple[int, int], int]:
        counts: dict[tuple[int, int], int] = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    def merge_pair(self, tokens: list[int], pair: tuple[int, int], idx: int) -> list[int]:
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                new_tokens.append(idx)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        return new_tokens
```

Paraphrasing the GPT-2 paper here.
Consider the word "dog". In a large corpus it appears in many contexts: `dog`, `dog.`, `dog!`, `dog?`, `dog,`, and `dog` with a leading space.

If we trained a BPE tokeniser like the one above, it might decide to merge `d`+`o`+`g`+`.` into a single token `dog.`, and separately `d`+`o`+`g`+`!` into `dog!`, and so on. Each variant would eat up a slot in our fixed vocabulary budget. We'd be wasting indices on what is essentially the same concept, and the model would have to learn that all these variants refer to almost the same thing.
If we "pre-tokenise" the text using a regex that splits on character categories (e.g. word characters, whitespace, punctuation), and don't allow the tokeniser to merge across these category boundaries, we can avoid this problem. Something like dog+. can never be fused into a single token.
We make an exception for the space character, which is allowed to merge with the word characters that follow it. Spaces are so common that the compression we gain outweighs some word fragmentation.
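To make this concrete, here's a sketch using a simplified stand-in for the GPT-2 split pattern (the real pattern also handles contractions, digit runs and Unicode categories):

```python
import re

# simplified stand-in for the GPT-2 pre-tokenisation regex:
# an optional leading space glued to a word or a punctuation run
pattern = r" ?\w+| ?[^\w\s]+|\s+"

chunks = re.findall(pattern, "The dog. The dog!")
print(chunks)
# ['The', ' dog', '.', ' The', ' dog', '!']
```

Merges then happen only within each chunk, so `dog` and `.` can never fuse into one token, while the leading space can merge into ` dog`.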
(Implementation here)
In my byte pair encoding from scratch notebook, I used a regex to split text into chunks. Merges can never take place across those hard word boundaries.
SentencePiece runs on the entire input as a single character stream and lets merges happen freely across any adjacent characters.
It replaces every space character with the special \u2581 (▁) character (which resembles an underscore but is much rarer, so it won't collide with real underscores).
Because ▁ can merge with any other character, word boundary information is carried inside the token itself. We end up with merges like ▁ + h -> ▁h, then ▁h + e -> ▁he.
We prepend ▁ to the start of an input as well. This helps the model avoid encoding a word differently just because it happened to appear at the start of the input.
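A minimal sketch of the transform (using ▁, U+2581, as the marker):

```python
SPACE_MARKER = "\u2581"  # ▁ (LOWER ONE EIGHTH BLOCK)

def add_markers(text: str) -> str:
    # prepend a space, then make every space an explicit marker character
    return f" {text}".replace(" ", SPACE_MARKER)

def remove_markers(text: str) -> str:
    return text.replace(SPACE_MARKER, " ").strip()

marked = add_markers("hello world")
print(marked)                  # ▁hello▁world
print(remove_markers(marked))  # hello world
```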
It's pretty easy to implement this in Python. We can take our NaiveBPE implementation and just tweak the get_tokens and decode methods.
We add these two lines to the start of the get_tokens method:
```python
text = f" {text}"
text = text.replace(" ", self.space_token)
```

And this line to the end of the decode method:

```python
text = text.replace(self.space_token, " ").strip()
```

Rather than emitting `<unk>` for out-of-vocabulary characters, unknown characters are encoded as their individual UTF-8 bytes, represented as special tokens `<0x00>` through `<0xFF>`. This makes the encoding lossless.
The character_coverage parameter (SentencePiece defaults to 0.9995) sets the minimum fraction of characters in the training corpus that must be representable. Rare characters beyond the threshold are either dropped or handled via byte fallback. This prevents rare Unicode characters from wasting vocabulary slots.
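As an illustration of what the threshold means (not SentencePiece's actual algorithm), we can keep the most frequent characters until they account for the required fraction of all character occurrences:

```python
from collections import Counter

def covered_chars(corpus: str, coverage: float) -> set[str]:
    counts = Counter(corpus)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, c in counts.most_common():
        if covered / total >= coverage:
            break  # everything rarer falls back to bytes or <unk>
        kept.add(ch)
        covered += c
    return kept

# with 90% coverage, the single rare 'é' is excluded from the character set
chars = covered_chars("hello hello héllo", 0.9)
```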
Applying NFKC normalisation to the input before training and encoding means that visually equivalent characters are treated as identical.
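For example, NFKC folds ligatures and full-width characters into their ordinary forms:

```python
import unicodedata

print(unicodedata.normalize("NFKC", "ﬁle"))       # file  (ﬁ ligature -> f + i)
print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ"))  # Hello (full-width -> ASCII)
```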
At training time, instead of always using the single best tokenisation, we can sample from multiple valid segmentations according to their probabilities. This acts as data augmentation and makes the model more robust to different tokenisations of the same text.
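A toy version of the idea, with made-up probabilities for two valid segmentations of the same text:

```python
import random

# hypothetical probabilities of two segmentations of "▁hello"
candidates = {
    ("\u2581hello",): 0.8,
    ("\u2581he", "llo"): 0.2,
}

segmentations = list(candidates)
weights = [candidates[s] for s in segmentations]

# at training time, sample a segmentation instead of always taking the best
sampled = random.choices(segmentations, weights=weights)[0]
```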
In addition to BPE, SentencePiece also supports a Unigram language model. This starts with a large candidate vocabulary and prunes it using an EM procedure that maximises the likelihood of the training data under a unigram language model. It's more principled probabilistically. T5 and XLNet use it.
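To make "likelihood under a unigram language model" concrete, here's a toy score for two candidate segmentations (the piece probabilities are invented for illustration):

```python
import math

# hypothetical unigram probability for each vocabulary piece
probs = {"\u2581hello": 0.010, "\u2581he": 0.030, "llo": 0.005}

def log_likelihood(seg: list[str]) -> float:
    # unigram model: pieces are independent, so log-probs just add
    return sum(math.log(probs[p]) for p in seg)

print(log_likelihood(["\u2581hello"]))      # ~ -4.61
print(log_likelihood(["\u2581he", "llo"]))  # ~ -8.81
```

Encoding picks the highest-likelihood segmentation; training prunes the pieces whose removal hurts total corpus likelihood the least.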
Tags: AI