Splitting on Whitespace and Its Limits
Where naive splitting breaks down.
The Simplest Tokenizer
The easiest way to tokenize is to split on whitespace. Python's built-in str.split() method does exactly that, with no imports or setup. ✂️
text = "I love cats"
print(text.split())  # ['I', 'love', 'cats']
How split() Works
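Called with no arguments, split() breaks the string on any run of consecutive whitespace (spaces, tabs, or newlines) and treats each run as a single separator. Leading and trailing whitespace is ignored, so the result never contains empty strings.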
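To make that behavior concrete, here is a small sketch comparing split() with no arguments against split(" ") with an explicit single-space separator. The sample strings and variable names are made up for illustration. The last call hints at where whitespace splitting starts to fall short: punctuation stays glued to the neighboring word.

```python
# Hypothetical sample text with irregular spacing, a tab, and a newline.
messy = "  I   love\tcats\n"

# No arguments: each run of whitespace is one separator,
# and leading/trailing whitespace is dropped.
no_args = messy.split()
print(no_args)       # ['I', 'love', 'cats']

# Explicit " " separator: every single space is a boundary,
# so repeated spaces yield empty strings, and tabs/newlines are not split.
single_space = messy.split(" ")
print(single_space)  # ['', '', 'I', '', '', 'love\tcats\n']

# A limit of whitespace splitting: punctuation stays attached to words.
with_punct = "Wait, cats don't bark!".split()
print(with_punct)    # ['Wait,', 'cats', "don't", 'bark!']
```

Notice that 'Wait,' and 'bark!' come back as single tokens. Whether that matters depends on the task, but it is usually the first sign that a tokenizer needs more than whitespace rules.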