Tokenize and Build a Vocabulary
Map text to integer ids.
Networks Only Eat Numbers
A neural net cannot read raw letters. Before any learning happens, you must turn text into numbers the model can crunch. 🔢
First Step: Tokenize
To tokenize means to split text into small pieces called tokens. The simplest choice is to split a sentence on spaces into words.
text = "I love deep learning"
tokens = text.split()
# ["I", "love", "deep", "learning"]All lessons in this course
- Tokenize and Build a Vocabulary
- nn.Embedding: Learnable Word Vectors
- Why Embeddings Capture Meaning
- Train a Text Classifier