Machine Learning Academy · Lesson

Hugging Face Tokenizers: Encoding Text for BERT

Learners will load BertTokenizer, tokenise a batch of sentences, inspect input_ids and attention_mask tensors, and handle truncation and padding.

Why Tokenisation Matters for BERT

Before BERT can process text, the raw string must be converted into the integer IDs the model expects. Tokenisation splits text into sub-word units (tokens; BERT uses the WordPiece scheme) and maps each one to an integer ID from a fixed vocabulary. Getting this right is critical: a wrong padding strategy, a missing attention mask, or careless truncation can silently corrupt the model's input and hurt accuracy.
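
To make the idea concrete, here is a toy sketch of sub-word splitting, ID lookup, truncation, and padding in plain Python. It is not the library's implementation: TOY_VOCAB, its IDs, and the helper names wordpiece and encode are hypothetical, and real BERT tokenisation also handles punctuation, casing rules, and a vocabulary of about 30,000 entries.

```python
# Hypothetical toy vocabulary; these are NOT BERT's real tokens or IDs.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "token": 4, "##ization": 5, "matters": 6, "play": 7, "##ing": 8,
}


def wordpiece(word, vocab):
    """Split one word greedily, always taking the longest matching piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length):
    """Return (tokens, input_ids, attention_mask) padded to max_length."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    # Truncate, leaving one slot for the closing [SEP] token.
    tokens = tokens[: max_length - 1] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    attention_mask += [0] * pad            # 0 = padding, to be ignored
    return tokens, input_ids, attention_mask


tokens, input_ids, attention_mask = encode("Tokenization matters", TOY_VOCAB, max_length=6)
print(tokens)          # ['[CLS]', 'token', '##ization', 'matters', '[SEP]']
print(input_ids)       # [2, 4, 5, 6, 3, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0]
```

The final 0 in attention_mask lines up with the [PAD] ID in input_ids. That alignment is what lets the model ignore padding, and it is the first thing to check when a batch looks wrong.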

Installing Hugging Face Transformers

The Hugging Face Transformers library provides pre-trained models and tokenizers for hundreds of architectures. Install it along with datasets for data loading and torch as the backend. The library follows a consistent API: instantiate a tokenizer with from_pretrained, pass it text, and receive the encoded inputs. These come back as plain Python lists by default, or as ready-to-use PyTorch tensors when you pass return_tensors='pt'.

# Install dependencies
# pip install transformers datasets torch

from transformers import BertTokenizer
import torch

# Load the pre-trained BERT tokenizer (downloads ~200 KB vocab file)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print('Vocabulary size:', tokenizer.vocab_size)  # 30522
print('Max length:', tokenizer.model_max_length)  # 512
