Natural Language Processing sits at the intersection of linguistics, computer science, and machine learning—and understanding its core algorithms is essential for grasping how modern AI systems understand, generate, and manipulate human language. You're being tested not just on what these algorithms do, but on why certain architectures emerged to solve specific problems: sequential dependencies, contextual meaning, semantic representation, and structural analysis.
The algorithms in this guide build on each other conceptually. Tokenization feeds into POS tagging, which enables dependency parsing. Word embeddings revolutionized how machines represent meaning, while transformers solved the parallelization problems that plagued RNNs. Don't just memorize definitions—know which problem each algorithm solves and how it connects to the broader NLP pipeline.
Text Preprocessing and Structural Analysis
Before any sophisticated analysis can happen, raw text must be broken into meaningful units and tagged with linguistic information. These foundational algorithms convert unstructured text into structured data that downstream models can process.
Tokenization
Splits text into discrete units (tokens)—words, subwords, or characters depending on the tokenization strategy
Preprocessing foundation that determines how all subsequent NLP operations interpret text boundaries
Methods range from simple to complex: whitespace splitting, byte-pair encoding (BPE), and SentencePiece, which learns subwords from raw text without language-specific pre-tokenization
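The subword idea behind BPE can be sketched in a few lines: start from characters and repeatedly merge the most frequent adjacent symbol pair. The toy corpus and the number of merges below are illustrative assumptions, not a production tokenizer.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the vocabulary.
    vocab maps a space-separated symbol sequence to its corpus frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge: fuse the chosen pair everywhere it occurs."""
    merged = " ".join(pair)
    replacement = "".join(pair)
    return {word.replace(merged, replacement): freq for word, freq in vocab.items()}

# Toy corpus: word frequencies, with words pre-split into characters.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):  # perform 3 merge operations
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
print(vocab)  # frequent suffixes like "est" emerge as single subword symbols
```

Frequent character sequences ("es", then "est") fuse into reusable subword units, which is how BPE handles rare and unseen words without an unbounded vocabulary.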
Part-of-Speech (POS) Tagging
Assigns grammatical labels (noun, verb, adjective, etc.) to each token, revealing syntactic roles
Enables syntactic understanding by identifying how words function within sentence structure
Implementation approaches include Hidden Markov Models, Conditional Random Fields, and modern neural taggers
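Below the HMM/CRF methods mentioned above sits a classic baseline: tag each word with its most frequent training tag. The miniature tagged corpus here is a made-up illustration.

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs -- a hypothetical training set.
train = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
         ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
         ("dog", "NOUN"), ("runs", "VERB")]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for word, tag_ in train:
    counts[word][tag_] += 1

def tag(sentence):
    """Most-frequent-tag baseline; unknown words get a crude NOUN fallback."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else "NOUN")
            for w in sentence]

print(tag(["the", "dog", "sleeps"]))
# [('the', 'DET'), ('dog', 'NOUN'), ('sleeps', 'VERB')]
```

HMMs and neural taggers improve on this baseline precisely where it fails: ambiguous words ("run" as noun vs. verb) whose correct tag depends on context.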
Dependency Parsing
Maps grammatical relationships between words, creating a tree structure showing which words modify others
Reveals sentence hierarchy—distinguishing subjects from objects, modifiers from heads
Parsing strategies include transition-based (fast, greedy) and graph-based (globally optimal) algorithms
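A dependency tree is commonly stored as one head index per token plus a relation label. The parse and labels below are a hypothetical analysis of a toy sentence, just to show the data structure a parser outputs.

```python
# Hypothetical parse of "The cat chased a mouse": each token stores the
# index of its head (-1 marks the root) and its relation label.
tokens = ["The", "cat", "chased", "a", "mouse"]
heads  = [1, 2, -1, 4, 2]   # The->cat, cat->chased, chased=root, a->mouse, mouse->chased
labels = ["det", "nsubj", "root", "det", "obj"]

def arcs(tokens, heads, labels):
    """List (head_word, relation, dependent_word) triples from the parse."""
    return [("ROOT" if h == -1 else tokens[h], rel, tok)
            for tok, h, rel in zip(tokens, heads, labels)]

for head, rel, dep in arcs(tokens, heads, labels):
    print(f"{rel}({head}, {dep})")   # e.g. nsubj(chased, cat)
```

Reading the arcs immediately answers questions POS tags alone cannot: "cat" is the subject of "chased" and "mouse" is its object.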
Compare: POS Tagging vs. Dependency Parsing—both analyze grammatical structure, but POS tagging labels individual words while dependency parsing maps relationships between them. If asked about understanding sentence meaning, dependency parsing provides richer structural information.
Information Extraction and Classification
These algorithms identify what text is about and how it should be categorized. They transform unstructured text into structured knowledge by recognizing entities, detecting sentiment, and assigning labels.
Named Entity Recognition (NER)
Locates and classifies entities—people, organizations, locations, dates, monetary values—within text
Critical for knowledge extraction and populating databases from unstructured documents
Trained on annotated corpora using sequence labeling models like BiLSTM-CRF or transformer-based taggers
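Sequence labelers like BiLSTM-CRF emit per-token BIO tags (B-egin, I-nside, O-utside); decoding those tags back into entity spans is a standard post-processing step, sketched here on a toy sentence.

```python
def bio_to_spans(tokens, tags):
    """Convert BIO tags into (entity_type, text) spans."""
    spans, current, ctype = [], [], None
    for tok, bio in zip(tokens, tags):
        if bio.startswith("B-"):            # a new entity begins
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], bio[2:]
        elif bio.startswith("I-") and current and bio[2:] == ctype:
            current.append(tok)             # continue the current entity
        else:                               # O tag or inconsistent I- tag
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Maria", "Lopez", "joined", "Acme", "Corp", "in", "Paris"]
tags   = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Maria Lopez'), ('ORG', 'Acme Corp'), ('LOC', 'Paris')]
```

The B-/I- distinction is what lets the scheme separate two adjacent entities of the same type.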
Sentiment Analysis
Determines emotional polarity—positive, negative, or neutral—expressed in text
Business applications include brand monitoring, customer feedback analysis, and market sentiment tracking
Approaches span rule-based lexicons to fine-tuned transformer models for nuanced detection
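A minimal rule-based scorer shows the lexicon end of that spectrum. The lexicon and the crude negation flip below are purely illustrative; real lexicons (and transformer models) handle far more nuance.

```python
# Toy polarity lexicon and negation list (illustrative assumptions).
LEXICON = {"great": 1, "love": 1, "good": 1, "bad": -1, "terrible": -1, "hate": -1}
NEGATORS = {"not", "never", "no"}

def polarity(text):
    score, negate = 0, False
    for word in text.lower().split():
        if word in NEGATORS:
            negate = True                    # flip the next sentiment word
        elif word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
            negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("not a bad movie"))   # negation flips "bad" -> positive
```

Cases this breaks on (sarcasm, "not entirely terrible", domain-specific words) are exactly why fine-tuned transformers dominate production sentiment systems.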
Text Classification
Assigns predefined categories to documents based on content analysis
Ubiquitous applications: spam filtering, news categorization, intent detection in chatbots
Pipeline involves feature extraction (TF-IDF, embeddings) followed by classifiers (SVM, neural networks)
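The features-then-classifier pipeline can be sketched with raw term counts and a nearest-centroid rule. The labeled examples are made up, and real systems would use TF-IDF or embeddings with a stronger classifier.

```python
import math
from collections import Counter

# Hypothetical labeled training texts.
train = [("win money now", "spam"), ("free prize claim", "spam"),
         ("meeting agenda attached", "ham"), ("lunch tomorrow", "ham")]

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Feature extraction + "training": one summed-count centroid per class.
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(text.split())

def classify(text):
    v = Counter(text.split())
    return max(centroids, key=lambda label: cosine(v, centroids[label]))

print(classify("claim your free money"))   # shares terms with the spam centroid
```

The two stages mirror the pipeline above: `Counter` is the feature extractor, the centroid comparison is the classifier.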
Compare: NER vs. Text Classification—NER identifies specific spans within text and labels them, while text classification assigns one label to an entire document. NER is token-level; classification is document-level.
Semantic Representation
How do machines understand that "king" relates to "queen" the way "man" relates to "woman"? Word embeddings capture meaning as mathematical relationships in vector space, enabling semantic reasoning.
Word Embeddings (Word2Vec, GloVe)
Represent words as dense vectors where similar meanings cluster together in high-dimensional space
Capture semantic relationships through vector arithmetic: king − man + woman ≈ queen
Enable transfer learning—pre-trained embeddings bootstrap performance on downstream tasks with limited data
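The vector-arithmetic property can be demonstrated with hand-crafted 2-D vectors. These are not real Word2Vec outputs; the two axes are a contrived "royalty" dimension and a "gender" dimension, chosen so the analogy works by construction.

```python
import math

# Hand-crafted toy vectors: [royalty, gender] (illustrative, not trained).
vecs = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.05, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.dist(a, [0, 0]) * math.dist(b, [0, 0]))

# king - man + woman should land nearest to queen.
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)   # -> queen
```

Trained embeddings exhibit the same geometry in hundreds of dimensions, with directions for gender, tense, pluralization, and more emerging from co-occurrence statistics alone.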
Topic Modeling
Discovers latent themes across document collections without predefined categories
Algorithms like LDA (Latent Dirichlet Allocation) model documents as mixtures of topics, topics as mixtures of words
Applications include content organization, trend detection, and exploratory text analysis
Compare: Word Embeddings vs. Topic Modeling—embeddings represent individual word meanings, while topic modeling identifies document-level themes. Word2Vec gives you word similarity; LDA gives you thematic structure across a corpus.
Sequential Neural Architectures
Language unfolds over time—each word depends on what came before. Recurrent architectures were designed to maintain memory across sequences, though they struggle with long-range dependencies.
Recurrent Neural Networks (RNNs)
Process sequences step-by-step, maintaining a hidden state that carries information forward
Natural fit for language where word meaning depends on preceding context
Suffer from vanishing gradients—information from early tokens fades during backpropagation through long sequences
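A vanilla RNN step is just a tanh over the current input and the previous hidden state. The weights below are arbitrary toy values for a 2-D hidden state.

```python
import math

def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla-RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)."""
    return [math.tanh(sum(wx * xi for wx, xi in zip(row_x, x)) +
                      sum(wh * hi for wh, hi in zip(row_h, h)) + bi)
            for row_x, row_h, bi in zip(W_xh, W_hh, b)]

# Arbitrary toy weights; real models learn these via backpropagation.
W_xh = [[0.5, -0.3], [0.8, 0.2]]
W_hh = [[0.1, 0.4], [-0.2, 0.3]]
b = [0.0, 0.0]

h = [0.0, 0.0]                                   # initial hidden state
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:   # a 3-step input sequence
    h = rnn_step(x, h, W_xh, W_hh, b)
print(h)   # final hidden state summarizes the whole sequence
```

Because each step reuses the same `W_hh`, gradients flowing back through many steps get multiplied by it repeatedly, which is exactly where vanishing (or exploding) gradients come from.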
Long Short-Term Memory (LSTM) Networks
Solve the vanishing gradient problem using gated memory cells that control information flow
Three gates regulate memory: input gate (what to add), forget gate (what to discard), output gate (what to expose)
Dominated sequence tasks like machine translation and speech recognition before transformers emerged
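The three gates can be written out directly. This 1-D toy cell (scalar states, arbitrary shared weights) mirrors the standard LSTM update equations without the matrix bookkeeping.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One LSTM step with scalar state for clarity.
    p maps each gate to its (input weight, hidden weight, bias)."""
    i = sigmoid(p["i"][0] * x + p["i"][1] * h + p["i"][2])    # input gate: what to add
    f = sigmoid(p["f"][0] * x + p["f"][1] * h + p["f"][2])    # forget gate: what to discard
    o = sigmoid(p["o"][0] * x + p["o"][1] * h + p["o"][2])    # output gate: what to expose
    g = math.tanh(p["g"][0] * x + p["g"][1] * h + p["g"][2])  # candidate memory
    c = f * c + i * g           # gated cell-state update (additive path)
    h = o * math.tanh(c)        # exposed hidden state
    return h, c

# Arbitrary toy parameters; a real cell uses vectors and weight matrices.
params = {k: (0.5, 0.5, 0.0) for k in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The additive update `c = f * c + i * g` is the key: when the forget gate stays near 1, the cell state passes gradients back through time largely undiminished.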
Compare: RNN vs. LSTM—both process sequences recurrently, but LSTMs add explicit memory mechanisms that preserve long-range dependencies. If asked why LSTMs replaced vanilla RNNs, the answer is gradient flow and memory retention.
Attention-Based Architectures
The transformer revolution eliminated recurrence entirely. Self-attention allows models to directly connect any two positions in a sequence, enabling parallelization and capturing long-range dependencies without gradient degradation.
Transformer Architecture
Relies on self-attention to weigh the relevance of every token to every other token simultaneously
Enables parallel processing—unlike RNNs, all positions compute at once, dramatically speeding training
Foundation for modern NLP: BERT, GPT, T5, and virtually all state-of-the-art models build on this architecture
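Scaled dot-product attention, the core computation, fits in a few lines. This sketch is a single head with no learned projections, applied to toy 2-D vectors.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:                       # every query attends to every key at once
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)     # relevance of each token to this query
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three-token toy sequence with 2-D vectors (self-attention: Q = K = V).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(Q, K, V))
```

Note there is no loop over time steps, only over queries, and each query's row is independent of the others, which is what makes the whole computation trivially parallelizable.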
BERT (Bidirectional Encoder Representations from Transformers)
Captures bidirectional context—considers both left and right surrounding words simultaneously during pre-training
Pre-training objectives include masked language modeling (predicting hidden words) and next sentence prediction
Fine-tuning paradigm: pre-train once on massive data, then adapt to specific tasks with minimal labeled examples
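Masked language modeling starts from a data-preparation step like this simplified sketch. Real BERT selects 15% of tokens but replaces only 80% of those with [MASK], leaving the rest as random or unchanged tokens; that refinement is omitted here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Simplified BERT-style masking: hide ~15% of tokens and record the
    originals as prediction targets (no random/unchanged replacements)."""
    rng = random.Random(seed)    # fixed seed so the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok     # the model must predict this from both sides
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
```

Predicting each hidden token forces the model to use context on both sides of the gap, which is exactly the bidirectionality that separates BERT from left-to-right language models.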
Compare: LSTM vs. Transformer—LSTMs process sequences sequentially (slow, struggles with long sequences), while transformers process in parallel via attention (fast, handles long-range dependencies elegantly). Transformers also scale better with compute.
Complex NLP Tasks
These represent end-to-end applications that combine multiple algorithms and architectures. They demonstrate how foundational techniques compose into systems that perform sophisticated language understanding and generation.
Machine Translation
Converts text between languages while preserving meaning, grammar, and style
Evolution from rule-based to neural: modern NMT uses encoder-decoder transformers trained on parallel corpora
Challenges include idioms, low-resource languages, and maintaining document-level coherence
Text Summarization
Condenses documents while retaining key information and main ideas
Two paradigms: extractive (selects important sentences verbatim) vs. abstractive (generates new summary text)
Abstractive methods leverage sequence-to-sequence models with attention for fluent, novel summaries
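The extractive paradigm can be illustrated with a frequency-based scorer: rank each sentence by the summed corpus frequency of its words and keep the top ones verbatim. The sentence splitting and scoring here are naive and purely illustrative.

```python
import re
from collections import Counter

def extractive_summary(text, n=1):
    """Frequency-based extractive summarizer: keep the n highest-scoring
    sentences, returned in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in
                                       re.findall(r"\w+", sentences[i].lower())))
    keep = sorted(ranked[:n])              # restore document order
    return " ".join(sentences[i] for i in keep)

text = ("Transformers changed NLP. Transformers use attention to process "
        "sequences in parallel. Cats are nice.")
print(extractive_summary(text, n=1))
```

Because the output is copied verbatim, nothing can be hallucinated, which is the faithfulness half of the extractive/abstractive tradeoff discussed below.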
Coreference Resolution
Links mentions to entities—determining that "she," "the CEO," and "Maria" all refer to the same person
Essential for coherent understanding across sentences and paragraphs
Requires reasoning about gender, number, semantic compatibility, and discourse structure
Compare: Extractive vs. Abstractive Summarization—extractive methods are safer (no hallucination risk) but less flexible, while abstractive methods generate fluent summaries but may introduce errors. Know the tradeoff between faithfulness and fluency.
Quick Reference Table
Concept | Best Examples
Text Preprocessing | Tokenization, POS Tagging
Structural Analysis | Dependency Parsing, POS Tagging
Information Extraction | NER, Coreference Resolution
Classification Tasks | Sentiment Analysis, Text Classification
Semantic Representation | Word Embeddings (Word2Vec, GloVe), Topic Modeling
Sequential Processing | RNN, LSTM
Attention Mechanisms | Transformer, BERT
End-to-End Applications | Machine Translation, Text Summarization
Self-Check Questions
Both POS Tagging and Dependency Parsing analyze grammatical structure. What specific information does dependency parsing provide that POS tagging alone cannot?
Compare RNNs and Transformers: which architecture handles long sequences more effectively, and what mechanism enables this advantage?
You need to build a system that identifies all company names in news articles and tracks how they're mentioned throughout each article. Which two algorithms would you combine, and why?
Explain the key difference between extractive and abstractive text summarization. In what scenario might you prefer the extractive approach despite its limitations?
BERT and Word2Vec both create vector representations of language. How does BERT's approach to capturing word meaning differ fundamentally from Word2Vec's static embeddings?