The simplest tokeniser:
```python
text = "Hello, world!"
tokens = text.split()
print(tokens)
# ['Hello,', 'world!']
```

Python's split() method splits on runs of whitespace by default, so punctuation stays attached to the neighbouring word. We can use a regular expression that also treats punctuation marks as tokens in their own right.
How does the `r'\w+|[^\w\s]'` regex work? It has two parts separated by `|` (which means "or"):

- `\w+` matches one or more word characters (letters, digits, underscores).
- `[^\w\s]` matches anything that is not a word character and not whitespace, which in practice means punctuation. The `^` inside the brackets `[^...]` means "not", so `[^\w\s]` literally reads "not a word character, not whitespace".
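As a quick sanity check, each branch of the pattern can be run on its own:

```python
import re

text = "Hello, world!"
print(re.findall(r'\w+', text))      # ['Hello', 'world']  (word runs)
print(re.findall(r'[^\w\s]', text))  # [',', '!']          (punctuation)
```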
Both split() and findall() break text into pieces, but they work differently:
- `split()`: you tell it what to *remove* (the separators), and it gives you what's left.
- `findall()`: you tell it what to *keep* (the pattern), and it gives you all matches.

```python
import re

tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)
# ['Hello', ',', 'world', '!']

vocab = {token: idx for idx, token in enumerate(set(tokens))}
print(vocab)
# {'Hello': 0, ',': 1, '!': 2, 'world': 3}
# (set order is arbitrary, so your indices may differ)

token_ids = [vocab[token] for token in tokens]
print(token_ids)
# [0, 1, 3, 2]
```

If we try to tokenise text with a word not in our vocabulary, we get a KeyError. So we need our first example of a special token, `<UNK>`.
```python
vocab['<UNK>'] = len(vocab)
```

Now we can safely encode any text by using `vocab.get(token, vocab['<UNK>'])` instead of `vocab[token]`.
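Putting it together, a minimal encode helper might look like this, continuing from the snippet above (the function name `encode` is my own; the post only describes the `vocab.get` pattern):

```python
def encode(text, vocab):
    """Tokenise text and map each token to its ID, falling back to <UNK>."""
    tokens = re.findall(r'\w+|[^\w\s]', text)
    return [vocab.get(token, vocab['<UNK>']) for token in tokens]

print(encode("Hello, there!", vocab))
# 'there' is unseen, so it maps to the <UNK> ID
```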
An embedding layer functions as a lookup operation: for each token ID, it retrieves the corresponding vector (a row of its weight matrix).
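A minimal sketch of this lookup, assuming PyTorch (the post doesn't fix a framework, and the sizes here are toy values):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 5, 4                  # toy sizes, including <UNK>
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([0, 1, 3, 2])        # IDs from the tokeniser above
vectors = embedding(token_ids)                # one row looked up per ID
print(vectors.shape)                          # torch.Size([4, 4])
```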
There are several ways to give the model position information:

- Absolute positional embeddings
- Relative positional embeddings
- Rotary positional embeddings
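For the absolute case, a minimal sketch (again assuming PyTorch and toy sizes): each position gets its own learned vector, which is added to the token embedding.

```python
import torch
import torch.nn as nn

vocab_size, context_len, embed_dim = 5, 8, 4    # toy sizes
tok_emb = nn.Embedding(vocab_size, embed_dim)
pos_emb = nn.Embedding(context_len, embed_dim)  # one learned vector per position

token_ids = torch.tensor([0, 1, 3, 2])
positions = torch.arange(len(token_ids))        # tensor([0, 1, 2, 3])
x = tok_emb(token_ids) + pos_emb(positions)     # token identity + position
print(x.shape)                                  # torch.Size([4, 4])
```

Relative and rotary variants instead encode position inside the attention computation rather than at the input.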