
Natural Language Processing

Tokenization Methods


Why This Matters

Tokenization is the foundational preprocessing step that determines how your NLP model "sees" text—and getting it wrong can tank performance before training even begins. You're being tested on understanding why different tokenization strategies exist, when to apply each approach, and how the choice of tokenization affects downstream tasks like classification, generation, and translation. The tradeoffs between vocabulary size, sequence length, and out-of-vocabulary handling appear constantly in system design questions.

Don't just memorize what each method does—know what problem each tokenization approach solves. Whether you're explaining why modern transformers use subword methods instead of simple word splitting, or analyzing why character-level models struggle with long sequences, the underlying principles of granularity, vocabulary coverage, and computational efficiency will guide your answers. Master the "why" behind each method, and you'll handle any tokenization question thrown at you.


Simple Boundary-Based Methods

These approaches split text using explicit delimiters like spaces or punctuation. They're fast and interpretable but make assumptions about language structure that don't hold universally.

Word Tokenization

  • Splits text at spaces and punctuation—the most intuitive approach that aligns with human understanding of "words"
  • Struggles with contractions and compounds—"don't" might become ["do", "n't"] or ["don't"], creating inconsistency across implementations
  • Creates open vocabulary problems since any new word becomes an unknown token, limiting generalization to unseen text

Whitespace Tokenization

  • Splits exclusively on whitespace characters—the fastest, simplest tokenization with zero linguistic assumptions
  • Ignores punctuation entirely—"Hello, world!" becomes ["Hello,", "world!"] with punctuation attached
  • Best for preprocessing pipelines where downstream steps handle punctuation separately or when speed matters most

Sentence Tokenization

  • Segments text into sentence-level units—uses periods, question marks, and exclamation points as boundaries
  • Critical for context-dependent tasks like document summarization, machine translation, and discourse analysis
  • Requires edge case handling—abbreviations ("Dr.", "U.S."), decimal numbers, and quotations create ambiguity (see the sketch below)
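
A minimal Python sketch of the edge-case problem, using a naive boundary regex plus a hand-picked abbreviation list (both are illustrative assumptions; production splitters such as NLTK's Punkt learn abbreviation behavior from data):

  import re

  ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "U.S.", "e.g.", "i.e."}

  def naive_sentence_tokenize(text):
      # First pass: split after ., !, or ? when followed by whitespace and a capital letter.
      parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
      # Second pass: re-join fragments that end in a known abbreviation.
      sentences = []
      for part in parts:
          if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
              sentences[-1] = sentences[-1] + " " + part
          else:
              sentences.append(part)
      return sentences

  print(naive_sentence_tokenize("Dr. Smith paid $3.50 for coffee. Was it worth it? Ask Mr. Jones!"))
  # ['Dr. Smith paid $3.50 for coffee.', 'Was it worth it?', 'Ask Mr. Jones!']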

Compare: Word Tokenization vs. Whitespace Tokenization—both use simple splitting rules, but word tokenization handles punctuation while whitespace ignores it entirely. If asked about preprocessing tradeoffs, whitespace is faster but requires cleanup; word tokenization is cleaner but slower.
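
To make that tradeoff concrete, here is a minimal sketch of both approaches; the word-level regex is one reasonable pattern among many, not a standard:

  import re

  text = "Hello, world! Don't panic."

  # Whitespace tokenization: split on runs of whitespace; punctuation stays attached.
  print(text.split())
  # ['Hello,', 'world!', "Don't", 'panic.']

  # Word tokenization: keep word characters (and internal apostrophes) together,
  # and emit each remaining punctuation mark as its own token.
  print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))
  # ['Hello', ',', 'world', '!', "Don't", 'panic', '.']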


Pattern-Based Methods

These methods use explicit rules or statistical patterns to define token boundaries. They offer control and context-awareness at the cost of complexity or data sparsity.

Regular Expression Tokenization

  • Uses regex patterns for custom tokenization rules—enables precise control over what constitutes a token boundary
  • Highly flexible for domain-specific needs—can handle URLs, hashtags, email addresses, or any custom format (sketched below)
  • Requires regex expertise—patterns like \b\w+\b or (?<!\d)\.(?!\d) have steep learning curves
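
As a hedged illustration of that flexibility, here is a sketch of a social-media-style tokenizer; the pattern and its ordering are assumptions chosen for this example, not a canonical recipe:

  import re

  # Alternatives are tried left to right, so the most specific patterns come first.
  TOKEN_PATTERN = re.compile(
      r"""
      https?://\S+        # URLs
      | \#\w+             # hashtags ('#' must be escaped in verbose mode)
      | @\w+              # mentions
      | \w+(?:'\w+)?      # words, keeping internal apostrophes
      | [^\w\s]           # any other non-space character (punctuation, symbols)
      """,
      re.VERBOSE,
  )

  def regex_tokenize(text):
      return TOKEN_PATTERN.findall(text)

  print(regex_tokenize("Loving #NLP! Ping @dana or see https://example.com"))
  # ['Loving', '#NLP', '!', 'Ping', '@dana', 'or', 'see', 'https://example.com']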

N-gram Tokenization

  • Generates sequences of n contiguous tokens—bigrams capture "New York," trigrams capture "New York City"
  • Preserves local context that single-token approaches miss, improving language modeling and text prediction
  • Causes combinatorial explosion—vocabulary grows exponentially with n, creating severe data sparsity in high-n models

Compare: Regular Expression Tokenization vs. N-gram Tokenization—regex controls how you split, while n-grams control what size chunks you keep. Regex is preprocessing; n-grams are feature engineering. Use regex when you need custom boundaries, n-grams when you need contextual features.
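
Since n-grams are built on top of whatever tokens you already have, the sketch below layers bigram and trigram extraction over a plain whitespace split:

  def ngrams(tokens, n):
      """Return all n-grams: tuples of n consecutive tokens."""
      return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  tokens = "New York City is in New York State".split()

  print(ngrams(tokens, 2))   # bigrams, e.g. ('New', 'York'), ('York', 'City'), ...
  print(ngrams(tokens, 3))   # trigrams, e.g. ('New', 'York', 'City'), ...

  # Sparsity in one line: with V distinct word types there are up to V**n possible
  # n-grams, so the feature space explodes as n grows while most entries never occur.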


Subword Methods

Modern NLP's dominant paradigm—these algorithms learn to split words into meaningful pieces, balancing vocabulary size against sequence length. They solve the out-of-vocabulary problem while preserving morphological information.

Byte-Pair Encoding (BPE)

  • Iteratively merges the most frequent adjacent pair—starts from single characters, then learns merges such as "t" + "h" → "th" and "th" + "e" → "the"
  • Reduces vocabulary while handling rare words—unseen words decompose into known subword units
  • Powers GPT models—the tokenizer behind GPT-2, GPT-3, and GPT-4 uses BPE variants for efficient text representation

WordPiece Tokenization

  • Merges based on likelihood, not just frequency—chooses merges that maximize training data probability
  • Uses "##" prefix for continuation tokens—"playing" becomes ["play", "##ing"], preserving word boundaries (see the sketch below)
  • Core to BERT architecture—handles BERT's 30,000-token vocabulary while maintaining rare word coverage
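
At encoding time, WordPiece segments each word by greedy longest-match-first lookup against its learned vocabulary. A minimal sketch, using a tiny hand-written vocabulary as a stand-in for BERT's real 30,000-entry one:

  def wordpiece_encode(word, vocab, unk="[UNK]"):
      """Greedy longest-match-first segmentation with '##' continuation pieces."""
      pieces, start = [], 0
      while start < len(word):
          end, match = len(word), None
          # Try the longest remaining substring first, shrinking until a vocab hit.
          while end > start:
              piece = word[start:end] if start == 0 else "##" + word[start:end]
              if piece in vocab:
                  match = piece
                  break
              end -= 1
          if match is None:
              return [unk]           # nothing matched: the whole word is unknown
          pieces.append(match)
          start = end
      return pieces

  toy_vocab = {"play", "##ing", "un", "##play", "##able"}
  print(wordpiece_encode("playing", toy_vocab))     # ['play', '##ing']
  print(wordpiece_encode("unplayable", toy_vocab))  # ['un', '##play', '##able']
  print(wordpiece_encode("zzz", toy_vocab))         # ['[UNK]']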

SentencePiece Tokenization

  • Language-agnostic, no pre-tokenization required—treats raw text as character sequences, learns boundaries from data
  • Handles whitespace as a regular character—uses "▁" to mark word boundaries, enabling perfect text reconstruction (see the sketch below)
  • Ideal for multilingual models—works equally well on Chinese (no spaces) and English (space-delimited)
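
The "▁" convention is easy to demonstrate without the library itself; this sketch only shows the reversible whitespace marking, not the learned segmentation (the split into pieces below is a made-up example):

  WORD_BOUNDARY = "\u2581"   # '▁', the marker SentencePiece uses in place of spaces

  def mark_boundaries(text):
      # Treat the space as an ordinary symbol so it survives tokenization.
      return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

  def restore(pieces):
      # Concatenate pieces and turn markers back into spaces: a lossless round trip.
      return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

  text = "New York"
  print(mark_boundaries(text))              # ▁New▁York
  pieces = ["▁New", "▁Yo", "rk"]            # one possible segmentation of the marked text
  assert restore(pieces) == text            # reconstruction is exact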

Compare: BPE vs. WordPiece—both are subword methods, but BPE merges by frequency while WordPiece merges by likelihood gain. In practice, they produce similar results; WordPiece tends toward slightly smaller vocabularies. Know that BERT uses WordPiece and GPT uses BPE—this distinction appears in model architecture questions.
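
A from-scratch toy of the BPE merge loop on the classic low/lower/newest/widest word-frequency example; the corpus, end-of-word marker, and merge count here are illustrative, and production BPE tokenizers (such as GPT-2's byte-level variant) add details this sketch omits:

  from collections import Counter

  def get_pair_counts(vocab):
      """Count adjacent symbol pairs, weighted by word frequency."""
      pairs = Counter()
      for symbols, freq in vocab.items():
          for a, b in zip(symbols, symbols[1:]):
              pairs[(a, b)] += freq
      return pairs

  def merge_pair(pair, vocab):
      """Replace every occurrence of the chosen pair with one merged symbol."""
      merged = {}
      for symbols, freq in vocab.items():
          out, i = [], 0
          while i < len(symbols):
              if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                  out.append(symbols[i] + symbols[i + 1])
                  i += 2
              else:
                  out.append(symbols[i])
                  i += 1
          merged[tuple(out)] = freq
      return merged

  # Word-frequency table, with each word pre-split into characters plus an end marker.
  corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
  vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

  for step in range(8):                      # the number of merges is a hyperparameter
      pair = get_pair_counts(vocab).most_common(1)[0][0]
      vocab = merge_pair(pair, vocab)
      print(f"merge {step + 1}: {pair[0]} + {pair[1]} -> {pair[0] + pair[1]}")

Swapping the frequency-based most_common selection for a likelihood-gain score is, at a high level, the change WordPiece makes to this loop.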


Character-Level Methods

The finest granularity possible—every character becomes a token. This eliminates vocabulary problems entirely but creates extremely long sequences.

Character Tokenization

  • Each character is a separate token—"hello" becomes ["h", "e", "l", "l", "o"], vocabulary size equals character set
  • Handles any input without OOV issues—misspellings, neologisms, and code-switching pose no vocabulary problems
  • Dramatically increases sequence length—a 100-word sentence becomes 500+ tokens, straining attention mechanisms and memory

Subword Tokenization

  • Bridges word and character levels—learns units larger than characters but smaller than words
  • Captures morphological structure—prefixes ("un-"), suffixes ("-ing"), and stems become distinct tokens
  • Standard in modern transformers—virtually all state-of-the-art models use some subword variant

Compare: Character Tokenization vs. Subword Tokenization—characters guarantee zero OOV but create 5-10x longer sequences; subwords balance both concerns. For FRQ questions about model efficiency vs. coverage tradeoffs, this comparison is your go-to example.
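
A quick way to see the sequence-length cost is to count character tokens against a whitespace split of the same sentence; the whitespace count is only a rough stand-in for word-level tokenization, and subword tokenizers typically land somewhat above it for English text:

  text = "Tokenization choices change sequence length dramatically for the same input."

  char_tokens = list(text)      # every character, including spaces and punctuation
  word_tokens = text.split()    # rough word-level baseline

  print(len(word_tokens), "word-level tokens")
  print(len(char_tokens), "character-level tokens")
  print(f"blow-up factor: {len(char_tokens) / len(word_tokens):.1f}x")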


Quick Reference Table

Concept | Best Examples
Simple boundary splitting | Word Tokenization, Whitespace Tokenization
Sentence-level segmentation | Sentence Tokenization
Custom pattern matching | Regular Expression Tokenization
Context-preserving features | N-gram Tokenization
Learned subword units | BPE, WordPiece, SentencePiece
Finest granularity | Character Tokenization
Handling OOV words | Subword methods (BPE, WordPiece, SentencePiece), Character Tokenization
Multilingual applications | SentencePiece, Character Tokenization

Self-Check Questions

  1. Which two tokenization methods both use iterative merging but differ in their merge selection criteria? What's the key difference?

  2. If you're building a model that must handle user-generated text with frequent misspellings and slang, which tokenization approaches would minimize out-of-vocabulary issues, and what tradeoff would each introduce?

  3. Compare and contrast Whitespace Tokenization and Word Tokenization—when would you choose the simpler method despite its limitations?

  4. A multilingual translation model needs to handle Japanese (no spaces), German (long compounds), and English. Which tokenization method is best suited, and why does it outperform word-level approaches here?

  5. Explain why modern transformer models like BERT and GPT use subword tokenization instead of word or character tokenization. What specific problems does this solve?