Tokenisation

The simplest tokeniser:

  1. Splits text into chunks in some rule-based way; e.g. a word-based tokeniser would split on spaces and punctuation.
text = "Hello, world!"
tokens = text.split()
print(tokens)
['Hello,', 'world!']

Python's split() method only splits on spaces by default. We can use regular expressions to cover spaces and punctuation that also delimit words.

How does the re.findall(r'\w+|[^\w\s]', text) call work?

It has two parts separated by | (which means "or"):

  • \w+ matches one or more word characters (letters, digits, underscores).
  • [^\w\s] matches anything that's not a word character and not whitespace. So it matches punctuation.

The ^ inside the brackets [^...] means "not", so [^\w\s] literally means "not a word character, not whitespace".


Both split() and findall() break text into pieces, but they work differently:

  • split(): You tell it what to remove (the separators), and it gives you what's left.
  • findall(): You tell it what to keep (the pattern), and it gives you all matches.
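The difference is easy to see side by side (using a simple whitespace pattern for split()):

```python
import re

text = "Hello, world!"
# split(): specify what to remove (here, runs of whitespace); keep the rest
print(re.split(r"\s+", text))            # ['Hello,', 'world!']
# findall(): specify what to keep; punctuation becomes its own token
print(re.findall(r"\w+|[^\w\s]", text))  # ['Hello', ',', 'world', '!']
```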
import re
tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)
['Hello', ',', 'world', '!']
  2. Maps those chunks to numeric values:
vocab = {token: idx for idx, token in enumerate(set(tokens))}
print(vocab)
{'Hello': 0, ',': 1, '!': 2, 'world': 3}
token_ids = [vocab[token] for token in tokens]
print(token_ids)
[0, 1, 3, 2]

(Set iteration order is arbitrary in Python, so the exact indices will differ between runs; sort the tokens first if you want a deterministic vocabulary.)

If we try to tokenise text with a word not in our vocabulary, we'd get a KeyError. So we need our first example of a special token, <UNK>.

vocab['<UNK>'] = len(vocab)

Now we can safely encode any text by using vocab.get(token, vocab['<UNK>']) instead of vocab[token].

new_text = "Goodbye, world!"
new_tokens = re.findall(r'\w+|[^\w\s]', new_text)
new_ids = [vocab.get(token, vocab['<UNK>']) for token in new_tokens]
print(new_tokens)
print(new_ids)
['Goodbye', ',', 'world', '!']
[4, 1, 3, 2]
The main limitation of word-level tokenisation is vocabulary size: every distinct word (and every spelling variant of it) needs its own entry, so the vocabulary grows without bound. This motivates subword methods.

Byte pair encoding (BPE)

Instead of having separate tokens for every word, BPE constructs an efficient set of subword units.

  1. Start with individual characters as tokens.
  2. Find the most frequent pair of adjacent tokens.
  3. Merge that pair into a new token.
  4. Repeat until we reach the desired size of our vocabulary.
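The four steps can be sketched on a toy string (a hypothetical example, separate from the full implementation in this note):

```python
from collections import Counter

def bpe_step(tokens: list[str]) -> list[str]:
    # count adjacent pairs and merge every occurrence of the most frequent one
    pairs = Counter(zip(tokens, tokens[1:]))
    best = max(pairs, key=pairs.get)
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")
print(bpe_step(tokens))  # the most frequent pair is ('a', 'a'), so every 'aa' fuses
```

Repeating bpe_step until the vocabulary reaches the target size is the whole training loop.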
  • Word-level BPE splits on whitespace first, then applies BPE within each word. This is what GPT-2 does.

  • Pure character-level BPE treats the entire text as one sequence and applies merges across everything, including spaces.

  • Special space markers: the space is folded into the start of the token rather than discarded (GPT-2 displays it as Ġ, SentencePiece as ▁).

  • Byte-level encoding: operate on raw UTF-8 bytes instead of Unicode characters, so any input is representable from a base vocabulary of just 256 entries (GPT-2 does this too).

Implementation from scratch at this link

import regex as re  # stdlib `re` doesn't support the \p{...} classes used in the split pattern

class BPE:
    def __init__(self, vocab_size: int):
        self.vocab: dict[int, bytes] = {}
        self.merges: dict[tuple[int, int], int] = {}
        self.vocab_size = vocab_size
        self.split_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
        self.special_tokens = ["<eot>", "<eod>", "<eos>", "<pad>", "<unk>", "<mask>", "<cls>", "<sep>"]
 
    def encode(self, text: str) -> list[list[int]]:
        text_chunks = re.findall(self.split_pattern, text)
        encoded_tokens = []
        for tc in text_chunks:
            encoded_tokens.append(self._encode_single(tc))
        return encoded_tokens
 
    def decode(self, tokens: list[list[int]]) -> str:
        decoded_tokens = [self._decode_single(tc) for tc in tokens]
        decoded_text = "".join(decoded_tokens)
        return decoded_text
 
    def train(self, text: str):
        self.vocab = {i: bytes([i]) for i in range(256)}
        text_chunks = re.findall(self.split_pattern, text)
        tokenised_chunks = [self.get_tokens(ch) for ch in text_chunks]
        for i in range(256, self.vocab_size - len(self.special_tokens)):
            pair_counts = {}
            for tc in tokenised_chunks:
                for pair, count in self.get_pair_counts(tc).items():
                    pair_counts[pair] = pair_counts.get(pair, 0) + count
            most_common_pair = max(pair_counts, key=lambda p: pair_counts[p])
            tokenised_chunks = [self.merge_pair(tc, most_common_pair, i) for tc in tokenised_chunks]
            self.merges[most_common_pair] = i
            self.vocab[i] = self.vocab[most_common_pair[0]] + self.vocab[most_common_pair[1]]
        for i, st in enumerate(self.special_tokens):
            self.vocab[self.vocab_size - len(self.special_tokens) + i] = st.encode("utf-8")
 
    def _encode_single(self, text: str) -> list[int]:
        tokens = self.get_tokens(text)
        for pair, idx in self.merges.items():
            tokens = self.merge_pair(tokens, pair, idx)
        return tokens
 
    def _decode_single(self, tokens: list[int]) -> str:
        token_bytes: bytes = b"".join(self.vocab[token] for token in tokens)
        return token_bytes.decode("utf-8", errors="replace")
 
    def get_tokens(self, text: str) -> list[int]:
        tokens: bytes = text.encode(encoding="utf-8")
        return list(map(int, tokens))
 
    def get_pair_counts(self, tokens: list[int]) -> dict[tuple[int, int], int]:
        counts: dict[tuple[int, int], int] = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        return counts
 
    def merge_pair(self, tokens: list[int], pair: tuple[int, int], idx: int) -> list[int]:
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                new_tokens.append(idx)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        return new_tokens
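The byte-level starting point is worth seeing concretely. get_tokens just takes the UTF-8 bytes of the text, which is why the base vocabulary has exactly 256 entries and why decoding is lossless even for non-ASCII input:

```python
text = "héllo"
tokens = list(text.encode("utf-8"))
print(tokens)  # [104, 195, 169, 108, 108, 111] — 'é' takes two bytes in UTF-8
print(bytes(tokens).decode("utf-8"))  # héllo — a lossless round trip
```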

Avoiding merges across word boundaries

Paraphrasing the GPT-2 paper here.

Consider the word "dog". In a large corpus it appears in many contexts:

  • dog
  • dog.
  • dog!
  • dog?
  • dog,

If we trained a BPE tokeniser like the one above, it might decide to merge d+o+g+. into a single token dog., and separately d+o+g+! into dog!, and so on. Each variant would eat up a slot in our fixed vocabulary budget. We'd be wasting indices on what is essentially the same concept, and the model would have to learn that all these variants refer to almost the same thing.

If we "pre-tokenise" the text using a regex that splits on character categories (e.g. word characters, whitespace, punctuation), and don't allow the tokeniser to merge across these category boundaries, we can avoid this problem. Something like dog+. can never be fused into a single token.
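A simplified, ASCII-only pre-tokenisation pattern (the real GPT-2 pattern is far more elaborate) shows the effect: word characters and punctuation land in separate chunks, so no merge can ever span them.

```python
import re

# simplified pre-tokeniser: an optional leading space plus letters,
# or a run of digits, or a single non-alphanumeric character
pattern = r" ?[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]"
print(re.findall(pattern, "The dog. The dog!"))
# → ['The', ' dog', '.', ' The', ' dog', '!']
```

Note that "dog" and "." never share a chunk, while the leading space stays attached to the following word.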

We make an exception for the space character, which is allowed to merge with the word characters that follow it. Spaces are so common that the compression we gain outweighs some word fragmentation.

SentencePiece

(Implementation here)

Dummy whitespace tokens

In my byte pair encoding from scratch notebook, I used a regex to split text into chunks. Merges can never take place across those hard word boundaries.

SentencePiece runs on the entire input as a single character stream and lets merges happen freely across any adjacent characters.

It replaces every space character with the special \u2581 (▁) character (which resembles an underscore but is much rarer, so it won't collide with real underscores).

Because ▁ can merge with any other character, word-boundary information is carried inside the token itself. We end up with merges like ▁ + h -> ▁h, and later ▁h + e -> ▁he.

We prepend ▁ to the start of an input as well. This helps the model avoid encoding a word differently just because it happened to appear at the start of the input.

It's pretty easy to implement this in Python. We can take the BPE implementation above and just tweak the get_tokens and decode methods, after defining self.space_token = "\u2581" in __init__.

We add these two lines to the start of the get_tokens method:

text = f" {text}"
text = text.replace(" ", self.space_token)

And this line to the end of the decode method:

text = text.replace(self.space_token, " ").strip()
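Pulled out of the class, the two tweaks amount to a pre-processing and a post-processing function (a minimal sketch, assuming ▁ as the space token):

```python
SPACE = "\u2581"  # ▁, U+2581 LOWER ONE EIGHTH BLOCK

def sp_preprocess(text: str) -> str:
    # prepend the dummy space, then mark every space with ▁
    return f" {text}".replace(" ", SPACE)

def sp_postprocess(text: str) -> str:
    # invert the marking and drop the leading dummy space
    return text.replace(SPACE, " ").strip()

marked = sp_preprocess("hello world")
print(marked)                  # ▁hello▁world
print(sp_postprocess(marked))  # hello world
```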

Byte Fallback

Rather than emitting <unk> for out-of-vocabulary characters, unknown characters are encoded as their individual UTF-8 bytes, represented as 256 special tokens <0x00> through <0xFF>. This makes the encoding lossless.
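A sketch of the idea, with token names following SentencePiece's <0xNN> convention:

```python
ch = "🦀"  # pretend this character is out of vocabulary
fallback = [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
print(fallback)  # ['<0xF0>', '<0x9F>', '<0xA6>', '<0x80>'] — four byte tokens
# decoding collects the bytes back, losslessly
decoded = bytes(int(tok[1:-1], 16) for tok in fallback).decode("utf-8")
print(decoded == ch)  # True
```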

Character coverage

This parameter (SentencePiece defaults to 0.9995) sets the minimum fraction of characters in the training corpus that must be representable. Rare characters beyond the threshold are either dropped or handled via byte fallback. This prevents rare Unicode characters from wasting vocabulary slots.

Unicode Normalisation

Applying NFKC normalisation to the input before training and encoding means that visually equivalent characters are treated as identical.
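Python's unicodedata module shows what NFKC does:

```python
import unicodedata

# NFKC folds compatibility characters to a canonical form
print(unicodedata.normalize("NFKC", "ﬁle"))         # ligature ﬁ -> "fi", giving "file"
print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ"))   # full-width letters -> ASCII "Hello"
```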

Subword regularisation / stochastic segmentation

At training time, instead of always using the single best tokenisation, we can sample from multiple valid segmentations according to their probabilities. This acts as data augmentation and makes the model more robust to different tokenisations of the same text.
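A toy sketch of what "multiple valid segmentations" means. The vocabulary here is made up, and real implementations sample in proportion to each segmentation's probability rather than uniformly:

```python
import random

vocab = {"h", "e", "l", "o", "he", "hel", "lo", "llo"}

def segmentations(word: str) -> list[list[str]]:
    # enumerate every way to split `word` into in-vocabulary pieces
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        if word[:i] in vocab:
            out += [[word[:i]] + rest for rest in segmentations(word[i:])]
    return out

segs = segmentations("hello")
print(segs)
# training-time augmentation: sample a segmentation
# instead of always taking the single best one
print(random.choice(segs))
```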

Unigram language model

In addition to BPE, SentencePiece also supports a unigram language model. This starts with a large candidate vocabulary and prunes it with an EM procedure that maximises the likelihood of the training data under a unigram language model. It is more probabilistically principled than BPE's greedy merging. T5 and XLNet use it.

Tags: AI