Tokenization is the foundational preprocessing step that determines how your NLP model "sees" text—and getting it wrong can tank performance before training even begins. You're being tested on understanding why different tokenization strategies exist, when to apply each approach, and how the choice of tokenization affects downstream tasks like classification, generation, and translation. The tradeoffs between vocabulary size, sequence length, and out-of-vocabulary handling appear constantly in system design questions.
Don't just memorize what each method does—know what problem each tokenization approach solves. Whether you're explaining why modern transformers use subword methods instead of simple word splitting, or analyzing why character-level models struggle with long sequences, the underlying principles of granularity, vocabulary coverage, and computational efficiency will guide your answers. Master the "why" behind each method, and you'll handle any tokenization question thrown at you.
Simple boundary-splitting approaches like word and whitespace tokenization divide text at explicit delimiters such as spaces or punctuation. They're fast and interpretable but make assumptions about language structure that don't hold universally.
Compare: Word Tokenization vs. Whitespace Tokenization—both use simple splitting rules, but word tokenization separates punctuation into its own tokens while whitespace splitting leaves it attached to neighboring words. If asked about preprocessing tradeoffs, whitespace is faster but requires cleanup; word tokenization is cleaner but slower.
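To see the difference in code, here's a minimal Python sketch using only the standard library: plain `str.split()` for whitespace tokenization next to a small regex that stands in for a word tokenizer (the pattern is an illustrative choice, not any particular library's implementation).

```python
import re

text = "Don't you love N.Y.? It's great!"

# Whitespace tokenization: split on runs of whitespace only.
# Punctuation stays glued to neighboring words ("N.Y.?" is one token).
whitespace_tokens = text.split()
# ["Don't", 'you', 'love', 'N.Y.?', "It's", 'great!']

# Simple word tokenization: keep word characters (with internal
# apostrophes) together and emit each punctuation mark as its own token.
word_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
# ["Don't", 'you', 'love', 'N', '.', 'Y', '.', '?', "It's", 'great', '!']
```

The whitespace output still needs cleanup before its tokens are usable as vocabulary entries, which is exactly the tradeoff described above.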
Regular expression and n-gram tokenization use explicit rules or statistical patterns to define token boundaries. They offer control and context-awareness at the cost of complexity or data sparsity.
Regex patterns like `\b\w+\b` or `(?<!\d)\.(?!\d)` have steep learning curves.

Compare: Regular Expression Tokenization vs. N-gram Tokenization—regex controls how you split, while n-grams control what size chunks you keep. Regex is preprocessing; n-grams are feature engineering. Use regex when you need custom boundaries, n-grams when you need contextual features.
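As a rough illustration of that split, the sketch below uses an assumed regex pattern that keeps decimal numbers intact (custom boundaries) and a small `ngrams` helper that turns the resulting tokens into bigram features; neither the pattern nor the helper is a standard API.

```python
import re

text = "Prices rose 3.5 percent in Q2, analysts said."

# Regular expression tokenization: the pattern defines the boundaries.
# The first alternative keeps decimals like "3.5" as single tokens.
tokens = re.findall(r"\d+\.\d+|\w+|[^\w\s]", text)
# ['Prices', 'rose', '3.5', 'percent', 'in', 'Q2', ',', 'analysts', 'said', '.']

# N-gram tokenization: slide a fixed-size window over the tokens to
# build contextual features (here, bigrams).
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(tokens, 2)
# [('Prices', 'rose'), ('rose', '3.5'), ('3.5', 'percent'), ...]
```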
Subword tokenization is modern NLP's dominant paradigm: these algorithms learn to split words into meaningful pieces, balancing vocabulary size against sequence length. They solve the out-of-vocabulary problem while preserving morphological information.
Compare: BPE vs. WordPiece—both are subword methods, but BPE merges by frequency while WordPiece merges by likelihood gain. In practice, they produce similar results; WordPiece tends toward slightly smaller vocabularies. Know that BERT uses WordPiece and GPT uses BPE—this distinction appears in model architecture questions.
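To ground the frequency-based merge rule, here's a toy BPE training loop on a tiny made-up corpus. Real trainers (for example, the Hugging Face `tokenizers` library) add byte-level handling, special tokens, and far better performance, so treat this as a sketch of the idea only.

```python
from collections import Counter

# Toy corpus: each word is pre-split into characters plus an
# end-of-word marker so merges never cross word boundaries.
corpus = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq for word, freq in vocab.items()}

for step in range(5):
    best = most_frequent_pair(corpus)   # BPE: highest raw frequency wins
    corpus = apply_merge(best, corpus)
    print(step, best)
# First merges here: ('e', 's'), then ('es', 't'), then ('est', '</w>').
# WordPiece differs only in the selection rule: it merges the pair that
# most increases training-data likelihood, not the most frequent one.
```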
Character tokenization is the finest granularity possible: every character becomes a token. This eliminates vocabulary problems entirely but creates extremely long sequences.
Compare: Character Tokenization vs. Subword Tokenization—characters guarantee zero OOV but create 5-10x longer sequences; subwords balance both concerns. For FRQ questions about model efficiency vs. coverage tradeoffs, this comparison is your go-to example.
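A quick sketch of that length tradeoff, using an arbitrary example sentence: character tokens versus whitespace tokens for the same input.

```python
text = "Tokenization determines what the model sees."

# Character tokenization: every character, including spaces, is a token.
char_tokens = list(text)

# Whitespace tokens, for comparison.
word_tokens = text.split()

print(len(char_tokens), len(word_tokens))  # 44 6
# Zero out-of-vocabulary risk, but the sequence is roughly 7x longer here.
```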
| Concept | Best Examples |
|---|---|
| Simple boundary splitting | Word Tokenization, Whitespace Tokenization |
| Sentence-level segmentation | Sentence Tokenization |
| Custom pattern matching | Regular Expression Tokenization |
| Context-preserving features | N-gram Tokenization |
| Learned subword units | BPE, WordPiece, SentencePiece |
| Finest granularity | Character Tokenization |
| Handling OOV words | Subword methods (BPE, WordPiece, SentencePiece), Character Tokenization |
| Multilingual applications | SentencePiece, Character Tokenization |
1. Which two tokenization methods both use iterative merging but differ in their merge selection criteria? What's the key difference?
2. If you're building a model that must handle user-generated text with frequent misspellings and slang, which tokenization approaches would minimize out-of-vocabulary issues, and what tradeoff would each introduce?
3. Compare and contrast Whitespace Tokenization and Word Tokenization—when would you choose the simpler method despite its limitations?
4. A multilingual translation model needs to handle Japanese (no spaces), German (long compounds), and English. Which tokenization method is best suited, and why does it outperform word-level approaches here?
5. Explain why modern transformer models like BERT and GPT use subword tokenization instead of word or character tokenization. What specific problems does this solve?