
๐ŸคŸ๐ŸผNatural Language Processing

Types of Language Models


Why This Matters

Language models are the foundation of virtually every modern NLP application you'll encounter on exams and in practice. Understanding how these models work isn't just about memorizing architectures; you're being tested on the evolution of how machines capture context, sequential dependencies, and semantic meaning in text. From the simplest probability-based approaches to attention-powered transformers, each model type represents a different solution to the same fundamental challenge: how do we teach machines to predict and generate human language?

The key concepts running through this topic include statistical probability estimation, sequential information flow, distributed representations, and attention mechanisms. When you see exam questions about language models, they're typically probing whether you understand the tradeoffs: Why do some models struggle with long-range dependencies? How does parallelization improve training? What's the difference between learning local patterns versus global semantic relationships? Don't just memorize model names; know what problem each architecture solves and where it falls short.


Statistical Foundation Models

These models represent the earliest approaches to language modeling, relying on probability distributions over word sequences rather than learned representations. They're computationally simple but limited in their ability to capture meaning.

N-gram Language Models

  • Predicts the next word using only the previous n-1 words: a bigram uses one word of context, a trigram uses two, and so on (a minimal bigram sketch follows this list)
  • Data sparsity becomes severe as n increases; most possible word combinations never appear in training data
  • Smoothing techniques like Laplace or Kneser-Ney are essential to handle unseen sequences in real applications
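
A minimal sketch of these ideas, assuming a tiny toy corpus; the sentences, counts, and add-one constant below are purely illustrative.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training corpus.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
tokens = [sentence.split() for sentence in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def bigram_prob(prev_word, word, alpha=1.0):
    """P(word | prev_word) with Laplace (add-alpha) smoothing."""
    return (bigrams[(prev_word, word)] + alpha) / (unigrams[prev_word] + alpha * vocab_size)

print(bigram_prob("the", "cat"))  # seen bigram: relatively high probability
print(bigram_prob("cat", "dog"))  # unseen bigram: small but nonzero thanks to smoothing
```

Without the smoothing term, the second probability would be exactly zero, which is the data-sparsity problem the bullets above describe.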

Hidden Markov Models (HMM)

  • Models observable outputs as generated by hidden states: the "hidden" states represent underlying linguistic categories like parts of speech
  • Markov assumption means each state depends only on the previous state, not the full history
  • Viterbi algorithm efficiently finds the most likely sequence of hidden states, making HMMs practical for tagging tasks

Compare: N-grams vs. HMMs. Both rely on the Markov assumption and use probability tables, but N-grams model surface word sequences while HMMs model latent structure beneath observable text. If an FRQ asks about sequence labeling (POS tagging, named entity recognition), HMMs are your go-to statistical approach. A minimal Viterbi sketch follows below.
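
To make the Viterbi bullet concrete, here is a minimal tagging sketch with a hand-specified two-tag HMM; the states, words, and probabilities are hypothetical, and real taggers estimate them from annotated corpora.

```python
import numpy as np

# Hypothetical two-tag HMM (DET, NOUN); all probabilities below are illustrative.
states = ["DET", "NOUN"]
start_p = np.array([0.7, 0.3])            # P(first tag)
trans_p = np.array([[0.1, 0.9],           # P(next tag | DET)
                    [0.4, 0.6]])          # P(next tag | NOUN)
emit_p = {"the":   np.array([0.90, 0.05]),   # P(word | tag), one entry per tag
          "dog":   np.array([0.05, 0.60]),
          "barks": np.array([0.05, 0.35])}

def viterbi(words):
    """Most likely hidden tag sequence (real systems use log probabilities to avoid underflow)."""
    n, k = len(words), len(states)
    score = np.zeros((n, k))              # best score of any path ending in tag j at step t
    back = np.zeros((n, k), dtype=int)    # which previous tag achieved that score
    score[0] = start_p * emit_p[words[0]]
    for t in range(1, n):
        for j in range(k):
            cand = score[t - 1] * trans_p[:, j] * emit_p[words[t]][j]
            back[t, j] = cand.argmax()
            score[t, j] = cand.max()
    path = [int(score[-1].argmax())]      # trace back from the best final tag
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi(["the", "dog", "barks"]))   # ['DET', 'NOUN', 'NOUN'] with these numbers
```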


Embedding-Based Models

These models focus on learning dense vector representations of words that capture semantic relationships. They don't predict sequences directly but provide the foundation for downstream tasks.

Word2Vec

  • Learns word embeddings through local context windows: either predicting a word from its neighbors (CBOW) or neighbors from a word (Skip-gram)
  • Semantic arithmetic emerges naturally: \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} (see the analogy lookup sketch after this list)
  • Shallow architecture with just one hidden layer makes training efficient on large corpora
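
A sketch of the analogy lookup itself, assuming you already have trained embeddings; the three-dimensional toy vectors below are stand-ins, not real Word2Vec output.

```python
import numpy as np

# Stand-in embeddings; in practice these come from a trained Word2Vec model.
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.6, 0.9]),
    "apple": np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Word closest (by cosine similarity) to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman"))  # 'queen' with these toy vectors
```

With vectors loaded from a real trained Skip-gram or CBOW model, the same nearest-neighbor search is what produces the famous king/queen result.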

GloVe (Global Vectors for Word Representation)

  • Combines global co-occurrence statistics with local context: builds a word-word co-occurrence matrix from the entire corpus
  • Objective function directly encodes that word vector dot products should equal the log of co-occurrence probability (written out after this list)
  • Captures both syntactic and semantic patterns by leveraging corpus-wide statistics rather than just local windows
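
The objective from the second bullet, written out; this is the published GloVe loss, where w_i and w̃_j are the two word vectors, b_i and b̃_j are bias terms, X_ij is the co-occurrence count, and f is a weighting function that downweights rare pairs and caps very frequent ones.

```latex
J \;=\; \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}
```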

Compare: Word2Vec vs. GloVe. Both produce dense word vectors, but Word2Vec learns from local context windows while GloVe learns from global co-occurrence counts. Word2Vec is predictive (neural), GloVe is count-based (matrix factorization). Both produce static embeddings: one vector per word regardless of context.


Recurrent Architectures

These models process sequences one token at a time, maintaining hidden states that carry information forward. They introduced the ability to handle variable-length inputs but struggle with efficiency and long-range dependencies.

Recurrent Neural Networks (RNN)

  • Hidden state h_t updates at each timestep, combining the current input with the previous hidden state to maintain sequence memory (see the loop sketch after this list)
  • Vanishing gradient problem makes it difficult to learn dependencies spanning more than ~10-20 tokens
  • Sequential processing means tokens must be processed in order, preventing parallelization during training
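
A minimal numpy sketch of the recurrence; the shapes, random weights, and tanh nonlinearity follow the standard Elman-style RNN, and the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, seq_len = 4, 8, 5

# Randomly initialized parameters, just to show the shapes involved.
W_xh = rng.normal(size=(d_hid, d_in))
W_hh = rng.normal(size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

x = rng.normal(size=(seq_len, d_in))   # one input sequence of length 5
h = np.zeros(d_hid)                    # initial hidden state

# The loop is inherently sequential: each h depends on the previous h.
for t in range(seq_len):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print(h.shape)  # (8,) -- final hidden state summarizing the whole sequence
```

Because each hidden state depends on the previous one, the loop cannot be parallelized across timesteps, which is exactly the efficiency limitation the third bullet points out.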

Long Short-Term Memory (LSTM) Networks

  • Gating mechanisms control information flow: the forget gate, input gate, and output gate regulate what the cell remembers or discards (written out in equations after this list)
  • Cell state provides a highway for gradients to flow across many timesteps, mitigating vanishing gradients
  • Bidirectional LSTMs process sequences in both directions, capturing both past and future context for each position
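
The gates from the first bullet, written out in the standard LSTM formulation; sigma is the logistic sigmoid, ⊙ is elementwise multiplication, and [h_{t-1}, x_t] is the previous hidden state concatenated with the current input.

```latex
f_t = \sigma\bigl(W_f\,[h_{t-1}, x_t] + b_f\bigr) \qquad \text{(forget gate)} \\
i_t = \sigma\bigl(W_i\,[h_{t-1}, x_t] + b_i\bigr) \qquad \text{(input gate)} \\
o_t = \sigma\bigl(W_o\,[h_{t-1}, x_t] + b_o\bigr) \qquad \text{(output gate)} \\
\tilde{c}_t = \tanh\bigl(W_c\,[h_{t-1}, x_t] + b_c\bigr) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{(cell state: the gradient highway)} \\
h_t = o_t \odot \tanh(c_t)
```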

Compare: RNN vs. LSTM. Both process sequences recurrently, but LSTMs add explicit memory cells and gates to preserve information over longer spans. Standard RNNs fail on dependencies beyond ~20 tokens; LSTMs can handle hundreds. If asked about the vanishing gradient problem, LSTMs are the architectural solution within the recurrent paradigm.


Attention-Based Architectures

The transformer architecture revolutionized NLP by replacing recurrence with self-attention, allowing models to directly connect any two positions in a sequence. This enables both better long-range modeling and massive parallelization.

Transformer Models

  • Self-attention computes relevance scores between all token pairs: each position attends to every other position with learned weights (a single-head sketch follows this list)
  • Positional encodings inject sequence order information since attention itself is permutation invariant
  • Parallel processing of all positions simultaneously enables training on unprecedented data scales
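
A single-head, scaled dot-product self-attention sketch in numpy; the dimensions and random weights are illustrative, and a real transformer adds multiple heads, output projections, residual connections, and layer normalization.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance between every pair of positions
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # each output mixes information from all positions

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))        # token representations (plus positional encodings)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (6, 16)
```

Note that all pairwise scores are computed in a single matrix product, which is where the parallelism in the third bullet comes from.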

BERT (Bidirectional Encoder Representations from Transformers)

  • Masked language modeling (MLM) randomly hides 15% of tokens and trains the model to predict them from bidirectional context (a toy masking sketch follows this list)
  • Encoder-only architecture produces contextualized embeddings ideal for classification, extraction, and understanding tasks
  • Fine-tuning paradigm allows a single pre-trained model to adapt to dozens of downstream tasks with minimal task-specific data
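
A toy sketch of the masking step; the real BERT pipeline operates on subword token IDs and uses a slightly more involved replacement scheme, but the core idea of hiding tokens and keeping the originals as prediction targets is the same.

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a fraction of tokens; the model learns to predict the originals."""
    tokens = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]
            tokens[i] = mask_token  # real BERT also sometimes uses a random or unchanged token
    return tokens, targets

random.seed(1)
masked, targets = mask_for_mlm("the cat sat on the mat".split())
print(masked)   # some tokens replaced with '[MASK]' (which ones varies with the seed)
print(targets)  # positions of the masked tokens mapped to their original words
```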

GPT (Generative Pre-trained Transformer)

  • Autoregressive (left-to-right) generation predicts each token based only on preceding tokens, enabling coherent text generation
  • Decoder-only architecture with causal masking prevents the model from "seeing the future" during training (see the mask sketch after this list)
  • Scaling laws show consistent performance improvements as model size, data, and compute increase together
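
A sketch of the causal mask from the second bullet: it is added to the raw attention scores so that, after the softmax, every future position receives zero weight. The score values here are random and purely illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """Mask that lets position t attend only to positions <= t."""
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))          # raw attention scores for a 4-token sequence
masked = scores + causal_mask(4)          # future positions become -inf
print(np.round(masked, 2))
# After the softmax, the -inf entries get exactly zero weight,
# so token t never "sees" tokens that come after it.
```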

Compare: BERT vs. GPT. Both use transformer attention, but BERT sees context in both directions (bidirectional) while GPT sees only left context (unidirectional). BERT excels at understanding tasks (classification, QA); GPT excels at generation tasks (completion, dialogue). This is the most common comparison question on transformer architectures.


Neural Language Models (General Category)

Neural Language Models

  • Learn distributed representations where words become dense vectors rather than sparse one-hot encodings
  • Continuous space allows generalization to unseen word combinations through vector similarity
  • Feed-forward variants (like Bengio's 2003 model) paved the way for modern deep learning approaches to NLP (written out after this list)
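
The feed-forward variant from the last bullet, written out following the general shape of Bengio et al.'s 2003 model: concatenate the embeddings C(w) of the previous n-1 words, pass them through a tanh hidden layer, and apply a softmax over the vocabulary.

```latex
x = \bigl[\,C(w_{t-1});\; C(w_{t-2});\; \dots;\; C(w_{t-n+1})\,\bigr] \\
P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \operatorname{softmax}\bigl(b + Wx + U\tanh(d + Hx)\bigr)
```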

Compare: Statistical models (N-grams, HMMs) vs. Neural models. Statistical models use discrete probability tables and suffer from sparsity; neural models learn continuous representations that generalize better. The shift from statistical to neural represents the most significant paradigm change in NLP history.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Statistical/Probabilistic | N-gram, HMM |
| Static Word Embeddings | Word2Vec, GloVe |
| Sequential Processing | RNN, LSTM |
| Attention Mechanisms | Transformer, BERT, GPT |
| Bidirectional Context | BERT, Bidirectional LSTM |
| Autoregressive Generation | GPT, RNN, LSTM |
| Solving Vanishing Gradients | LSTM (gates), Transformer (attention) |
| Pre-training + Fine-tuning | BERT, GPT |

Self-Check Questions

  1. Both LSTMs and Transformers address the vanishing gradient problem. What architectural mechanism does each use, and why do Transformers scale better for training?

  2. Word2Vec and GloVe both produce word embeddings. What is the fundamental difference in how they learn these representations, and what limitation do they share that BERT overcomes?

  3. Compare BERT and GPT: which direction(s) of context does each model use, and how does this design choice determine what tasks each excels at?

  4. If you needed to build a part-of-speech tagger using only statistical methods (no neural networks), which model would you choose and why? What assumption does it make about sequences?

  5. An FRQ asks you to explain why modern language models like GPT can generate more coherent long-form text than RNN-based models. What two key advantages of the transformer architecture would you cite in your response?