
Natural Language Processing

Tokenization Methods


Why This Matters

Tokenization is the foundational preprocessing step that determines how your NLP model "sees" text—and getting it wrong can tank performance before training even begins. You're being tested on understanding why different tokenization strategies exist, when to apply each approach, and how the choice of tokenization affects downstream tasks like classification, generation, and translation. The tradeoffs between vocabulary size, sequence length, and out-of-vocabulary handling appear constantly in system design questions.

Don't just memorize what each method does—know what problem each tokenization approach solves. Whether you're explaining why modern transformers use subword methods instead of simple word splitting, or analyzing why character-level models struggle with long sequences, the underlying principles of granularity, vocabulary coverage, and computational efficiency will guide your answers. Master the "why" behind each method, and you'll handle any tokenization question thrown at you.


Simple Boundary-Based Methods

These approaches split text using explicit delimiters like spaces or punctuation. They're fast and interpretable but make assumptions about language structure that don't hold universally.

Word Tokenization

  • Splits text at spaces and punctuation—the most intuitive approach that aligns with human understanding of "words"
  • Struggles with contractions and compounds—"don't" might become ["do", "n't"] or ["don't"], creating inconsistency across implementations
  • Creates open vocabulary problems since any new word becomes an unknown token, limiting generalization to unseen text

Whitespace Tokenization

  • Splits exclusively on whitespace characters—the fastest, simplest tokenization with zero linguistic assumptions
  • Ignores punctuation entirely—"Hello, world!" becomes ["Hello,", "world!"] with punctuation attached
  • Best for preprocessing pipelines where downstream steps handle punctuation separately or when speed matters most

Sentence Tokenization

  • Segments text into sentence-level units—uses periods, question marks, and exclamation points as boundaries
  • Critical for context-dependent tasks like document summarization, machine translation, and discourse analysis
  • Requires edge case handling—abbreviations ("Dr.", "U.S."), decimal numbers, and quotations create ambiguity (see the sketch below)
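
A minimal Python sketch of the edge-case problem, using a naive boundary regex plus a hand-picked abbreviation list (both are illustrative assumptions; production splitters such as NLTK's Punkt learn abbreviation behavior from data):

  import re

  ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "U.S.", "e.g.", "i.e."}

  def naive_sentence_tokenize(text):
      # First pass: split after ., !, or ? when followed by whitespace and a capital letter.
      parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
      # Second pass: re-join fragments that end in a known abbreviation.
      sentences = []
      for part in parts:
          if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
              sentences[-1] = sentences[-1] + " " + part
          else:
              sentences.append(part)
      return sentences

  print(naive_sentence_tokenize("Dr. Smith paid $3.50 for coffee. Was it worth it? Ask Mr. Jones!"))
  # ['Dr. Smith paid $3.50 for coffee.', 'Was it worth it?', 'Ask Mr. Jones!']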

Compare: Word Tokenization vs. Whitespace Tokenization—both use simple splitting rules, but word tokenization handles punctuation while whitespace ignores it entirely. If asked about preprocessing tradeoffs, whitespace is faster but requires cleanup; word tokenization is cleaner but slower.
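
To make that tradeoff concrete, here is a minimal sketch of both approaches; the word-level regex is one reasonable pattern among many, not a standard:

  import re

  text = "Hello, world! Don't panic."

  # Whitespace tokenization: split on runs of whitespace; punctuation stays attached.
  print(text.split())
  # ['Hello,', 'world!', "Don't", 'panic.']

  # Word tokenization: keep word characters (and internal apostrophes) together,
  # and emit each remaining punctuation mark as its own token.
  print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))
  # ['Hello', ',', 'world', '!', "Don't", 'panic', '.']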


Pattern-Based Methods

These methods use explicit rules or statistical patterns to define token boundaries. They offer control and context-awareness at the cost of complexity or data sparsity.

Regular Expression Tokenization

  • Uses regex patterns for custom tokenization rules—enables precise control over what constitutes a token boundary
  • Highly flexible for domain-specific needs—can handle URLs, hashtags, email addresses, or any custom format (sketched below)
  • Requires regex expertise—patterns like \b\w+\b or (?<!\d)\.(?!\d) have steep learning curves
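
As a hedged illustration of that flexibility, here is a sketch of a social-media-style tokenizer; the pattern and its ordering are assumptions chosen for this example, not a canonical recipe:

  import re

  # Alternatives are tried left to right, so the most specific patterns come first.
  TOKEN_PATTERN = re.compile(
      r"""
      https?://\S+        # URLs
      | \#\w+             # hashtags ('#' must be escaped in verbose mode)
      | @\w+              # mentions
      | \w+(?:'\w+)?      # words, keeping internal apostrophes
      | [^\w\s]           # any other non-space character (punctuation, symbols)
      """,
      re.VERBOSE,
  )

  def regex_tokenize(text):
      return TOKEN_PATTERN.findall(text)

  print(regex_tokenize("Loving #NLP! Ping @dana or see https://example.com"))
  # ['Loving', '#NLP', '!', 'Ping', '@dana', 'or', 'see', 'https://example.com']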

N-gram Tokenization

  • Generates sequences of n contiguous tokens—bigrams capture "New York," trigrams capture "New York City"
  • Preserves local context that single-token approaches miss, improving language modeling and text prediction
  • Causes combinatorial explosion—vocabulary grows exponentially with n, creating severe data sparsity in high-n models

Compare: Regular Expression Tokenization vs. N-gram Tokenization—regex controls how you split, while n-grams control what size chunks you keep. Regex is preprocessing; n-grams are feature engineering. Use regex when you need custom boundaries, n-grams when you need contextual features.
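
Since n-grams are built on top of whatever tokens you already have, the sketch below layers bigram and trigram extraction over a plain whitespace split:

  def ngrams(tokens, n):
      """Return all n-grams: tuples of n consecutive tokens."""
      return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  tokens = "New York City is in New York State".split()

  print(ngrams(tokens, 2))   # bigrams, e.g. ('New', 'York'), ('York', 'City'), ...
  print(ngrams(tokens, 3))   # trigrams, e.g. ('New', 'York', 'City'), ...

  # Sparsity in one line: with V distinct word types there are up to V**n possible
  # n-grams, so the feature space explodes as n grows while most entries never occur.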


Subword Methods

Modern NLP's dominant paradigm—these algorithms learn to split words into meaningful pieces, balancing vocabulary size against sequence length. They solve the out-of-vocabulary problem while preserving morphological information.

Byte-Pair Encoding (BPE)

  • Iteratively merges the most frequent adjacent pair—starts from single characters, then learns merges such as "t" + "h" → "th" and "th" + "e" → "the"
  • Reduces vocabulary while handling rare words—unseen words decompose into known subword units
  • Powers GPT models—the tokenizer behind GPT-2, GPT-3, and GPT-4 uses BPE variants for efficient text representation

WordPiece Tokenization

  • Merges based on likelihood, not just frequency—chooses merges that maximize training data probability
  • Uses "##" prefix for continuation tokens—"playing" becomes ["play", "##ing"], preserving word boundaries (see the sketch below)
  • Core to BERT architecture—handles BERT's 30,000-token vocabulary while maintaining rare word coverage
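
At encoding time, WordPiece segments each word by greedy longest-match-first lookup against its learned vocabulary. A minimal sketch, using a tiny hand-written vocabulary as a stand-in for BERT's real 30,000-entry one:

  def wordpiece_encode(word, vocab, unk="[UNK]"):
      """Greedy longest-match-first segmentation with '##' continuation pieces."""
      pieces, start = [], 0
      while start < len(word):
          end, match = len(word), None
          # Try the longest remaining substring first, shrinking until a vocab hit.
          while end > start:
              piece = word[start:end] if start == 0 else "##" + word[start:end]
              if piece in vocab:
                  match = piece
                  break
              end -= 1
          if match is None:
              return [unk]           # nothing matched: the whole word is unknown
          pieces.append(match)
          start = end
      return pieces

  toy_vocab = {"play", "##ing", "un", "##play", "##able"}
  print(wordpiece_encode("playing", toy_vocab))     # ['play', '##ing']
  print(wordpiece_encode("unplayable", toy_vocab))  # ['un', '##play', '##able']
  print(wordpiece_encode("zzz", toy_vocab))         # ['[UNK]']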

SentencePiece Tokenization

  • Language-agnostic, no pre-tokenization required—treats raw text as character sequences, learns boundaries from data
  • Handles whitespace as a regular character—uses "▁" to mark word boundaries, enabling perfect text reconstruction (see the sketch below)
  • Ideal for multilingual models—works equally well on Chinese (no spaces) and English (space-delimited)
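
The "▁" convention is easy to demonstrate without the library itself; this sketch only shows the reversible whitespace marking, not the learned segmentation (the split into pieces below is a made-up example):

  WORD_BOUNDARY = "\u2581"   # '▁', the marker SentencePiece uses in place of spaces

  def mark_boundaries(text):
      # Treat the space as an ordinary symbol so it survives tokenization.
      return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

  def restore(pieces):
      # Concatenate pieces and turn markers back into spaces: a lossless round trip.
      return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

  text = "New York"
  print(mark_boundaries(text))              # ▁New▁York
  pieces = ["▁New", "▁Yo", "rk"]            # one possible segmentation of the marked text
  assert restore(pieces) == text            # reconstruction is exact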

Compare: BPE vs. WordPiece—both are subword methods, but BPE merges by frequency while WordPiece merges by likelihood gain. In practice, they produce similar results; WordPiece tends toward slightly smaller vocabularies. Know that BERT uses WordPiece and GPT uses BPE—this distinction appears in model architecture questions.
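
A from-scratch toy of the BPE merge loop on the classic low/lower/newest/widest word-frequency example; the corpus, end-of-word marker, and merge count here are illustrative, and production BPE tokenizers (such as GPT-2's byte-level variant) add details this sketch omits:

  from collections import Counter

  def get_pair_counts(vocab):
      """Count adjacent symbol pairs, weighted by word frequency."""
      pairs = Counter()
      for symbols, freq in vocab.items():
          for a, b in zip(symbols, symbols[1:]):
              pairs[(a, b)] += freq
      return pairs

  def merge_pair(pair, vocab):
      """Replace every occurrence of the chosen pair with one merged symbol."""
      merged = {}
      for symbols, freq in vocab.items():
          out, i = [], 0
          while i < len(symbols):
              if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                  out.append(symbols[i] + symbols[i + 1])
                  i += 2
              else:
                  out.append(symbols[i])
                  i += 1
          merged[tuple(out)] = freq
      return merged

  # Word-frequency table, with each word pre-split into characters plus an end marker.
  corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
  vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

  for step in range(8):                      # the number of merges is a hyperparameter
      pair = get_pair_counts(vocab).most_common(1)[0][0]
      vocab = merge_pair(pair, vocab)
      print(f"merge {step + 1}: {pair[0]} + {pair[1]} -> {pair[0] + pair[1]}")

Swapping the frequency-based most_common selection for a likelihood-gain score is, at a high level, the change WordPiece makes to this loop.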


Character-Level Methods

The finest granularity possible—every character becomes a token. This eliminates vocabulary problems entirely but creates extremely long sequences.

Character Tokenization

  • Each character is a separate token—"hello" becomes ["h", "e", "l", "l", "o"], vocabulary size equals character set
  • Handles any input without OOV issues—misspellings, neologisms, and code-switching pose no vocabulary problems
  • Dramatically increases sequence length—a 100-word sentence becomes 500+ tokens, straining attention mechanisms and memory

Subword Tokenization

  • Bridges word and character levels—learns units larger than characters but smaller than words
  • Captures morphological structure—prefixes ("un-"), suffixes ("-ing"), and stems become distinct tokens
  • Standard in modern transformers—virtually all state-of-the-art models use some subword variant

Compare: Character Tokenization vs. Subword Tokenization—characters guarantee zero OOV but create 5-10x longer sequences; subwords balance both concerns. For FRQ questions about model efficiency vs. coverage tradeoffs, this comparison is your go-to example.
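
A quick way to see the sequence-length cost is to count character tokens against a whitespace split of the same sentence; the whitespace count is only a rough stand-in for word-level tokenization, and subword tokenizers typically land somewhat above it for English text:

  text = "Tokenization choices change sequence length dramatically for the same input."

  char_tokens = list(text)      # every character, including spaces and punctuation
  word_tokens = text.split()    # rough word-level baseline

  print(len(word_tokens), "word-level tokens")
  print(len(char_tokens), "character-level tokens")
  print(f"blow-up factor: {len(char_tokens) / len(word_tokens):.1f}x")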


Quick Reference Table

Concept | Best Examples
Simple boundary splitting | Word Tokenization, Whitespace Tokenization
Sentence-level segmentation | Sentence Tokenization
Custom pattern matching | Regular Expression Tokenization
Context-preserving features | N-gram Tokenization
Learned subword units | BPE, WordPiece, SentencePiece
Finest granularity | Character Tokenization
Handling OOV words | Subword methods (BPE, WordPiece, SentencePiece), Character Tokenization
Multilingual applications | SentencePiece, Character Tokenization

Self-Check Questions

  1. Which two tokenization methods both use iterative merging but differ in their merge selection criteria? What's the key difference?

  2. If you're building a model that must handle user-generated text with frequent misspellings and slang, which tokenization approaches would minimize out-of-vocabulary issues, and what tradeoff would each introduce?

  3. Compare and contrast Whitespace Tokenization and Word Tokenization—when would you choose the simpler method despite its limitations?

  4. A multilingual translation model needs to handle Japanese (no spaces), German (long compounds), and English. Which tokenization method is best suited, and why does it outperform word-level approaches here?

  5. Explain why modern transformer models like BERT and GPT use subword tokenization instead of word or character tokenization. What specific problems does this solve?