
๐ŸคŸ๐ŸผNatural Language Processing

Types of Language Models


Why This Matters

Language models are the foundation of virtually every modern NLP application you'll encounter on exams and in practice. Understanding how these models work isn't just about memorizing architectures; you're being tested on the evolution of how machines capture context, sequential dependencies, and semantic meaning in text. From the simplest probability-based approaches to attention-powered transformers, each model type represents a different solution to the same fundamental challenge: how do we teach machines to predict and generate human language?

The key concepts running through this topic include statistical probability estimation, sequential information flow, distributed representations, and attention mechanisms. When you see exam questions about language models, they're typically probing whether you understand the tradeoffs: Why do some models struggle with long-range dependencies? How does parallelization improve training? What's the difference between learning local patterns versus global semantic relationships? Don't just memorize model names; know what problem each architecture solves and where it falls short.


Statistical Foundation Models

These models represent the earliest approaches to language modeling, relying on probability distributions over word sequences rather than learned representations. They're computationally simple but limited in their ability to capture meaning.

N-gram Language Models

  • Predicts the next word using only the previous n-1 words: a bigram uses one word of context, a trigram uses two, and so on (a minimal bigram sketch follows this list)
  • Data sparsity becomes severe as n increases; most possible word combinations never appear in training data
  • Smoothing techniques like Laplace or Kneser-Ney are essential to handle unseen sequences in real applications
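
A minimal sketch of these ideas, assuming a tiny toy corpus; the sentences, counts, and add-one constant below are purely illustrative.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training corpus.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
tokens = [sentence.split() for sentence in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def bigram_prob(prev_word, word, alpha=1.0):
    """P(word | prev_word) with Laplace (add-alpha) smoothing."""
    return (bigrams[(prev_word, word)] + alpha) / (unigrams[prev_word] + alpha * vocab_size)

print(bigram_prob("the", "cat"))  # seen bigram: relatively high probability
print(bigram_prob("cat", "dog"))  # unseen bigram: small but nonzero thanks to smoothing
```

Without the smoothing term, the second probability would be exactly zero, which is the data-sparsity problem the bullets above describe.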

Hidden Markov Models (HMM)

  • Models observable outputs as generated by hidden states: the "hidden" states represent underlying linguistic categories like parts of speech
  • Markov assumption means each state depends only on the previous state, not the full history
  • Viterbi algorithm efficiently finds the most likely sequence of hidden states, making HMMs practical for tagging tasks

Compare: N-grams vs. HMMs. Both rely on the Markov assumption and use probability tables, but N-grams model surface word sequences while HMMs model latent structure beneath observable text. If an FRQ asks about sequence labeling (POS tagging, named entity recognition), HMMs are your go-to statistical approach. A minimal Viterbi sketch follows below.
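
To make the Viterbi bullet concrete, here is a minimal tagging sketch with a hand-specified two-tag HMM; the states, words, and probabilities are hypothetical, and real taggers estimate them from annotated corpora.

```python
import numpy as np

# Hypothetical two-tag HMM (DET, NOUN); all probabilities below are illustrative.
states = ["DET", "NOUN"]
start_p = np.array([0.7, 0.3])            # P(first tag)
trans_p = np.array([[0.1, 0.9],           # P(next tag | DET)
                    [0.4, 0.6]])          # P(next tag | NOUN)
emit_p = {"the":   np.array([0.90, 0.05]),   # P(word | tag), one entry per tag
          "dog":   np.array([0.05, 0.60]),
          "barks": np.array([0.05, 0.35])}

def viterbi(words):
    """Most likely hidden tag sequence (real systems use log probabilities to avoid underflow)."""
    n, k = len(words), len(states)
    score = np.zeros((n, k))              # best score of any path ending in tag j at step t
    back = np.zeros((n, k), dtype=int)    # which previous tag achieved that score
    score[0] = start_p * emit_p[words[0]]
    for t in range(1, n):
        for j in range(k):
            cand = score[t - 1] * trans_p[:, j] * emit_p[words[t]][j]
            back[t, j] = cand.argmax()
            score[t, j] = cand.max()
    path = [int(score[-1].argmax())]      # trace back from the best final tag
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi(["the", "dog", "barks"]))   # ['DET', 'NOUN', 'NOUN'] with these numbers
```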


Embedding-Based Models

These models focus on learning dense vector representations of words that capture semantic relationships. They don't predict sequences directly but provide the foundation for downstream tasks.

Word2Vec

  • Learns word embeddings through local context windows: either predicting a word from its neighbors (CBOW) or neighbors from a word (Skip-gram)
  • Semantic arithmetic emerges naturally: \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} (see the analogy lookup sketch after this list)
  • Shallow architecture with just one hidden layer makes training efficient on large corpora
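
A sketch of the analogy lookup itself, assuming you already have trained embeddings; the three-dimensional toy vectors below are stand-ins, not real Word2Vec output.

```python
import numpy as np

# Stand-in embeddings; in practice these come from a trained Word2Vec model.
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.6, 0.9]),
    "apple": np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Word closest (by cosine similarity) to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman"))  # 'queen' with these toy vectors
```

With vectors loaded from a real trained Skip-gram or CBOW model, the same nearest-neighbor search is what produces the famous king/queen result.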

GloVe (Global Vectors for Word Representation)

  • Combines global co-occurrence statistics with local context: builds a word-word co-occurrence matrix from the entire corpus
  • Objective function directly encodes that word vector dot products should equal the log of co-occurrence probability (written out after this list)
  • Captures both syntactic and semantic patterns by leveraging corpus-wide statistics rather than just local windows
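
The objective from the second bullet, written out; this is the published GloVe loss, where w_i and w̃_j are the two word vectors, b_i and b̃_j are bias terms, X_ij is the co-occurrence count, and f is a weighting function that downweights rare pairs and caps very frequent ones.

```latex
J \;=\; \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}
```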

Compare: Word2Vec vs. GloVe. Both produce dense word vectors, but Word2Vec learns from local context windows while GloVe learns from global co-occurrence counts. Word2Vec is predictive (neural), GloVe is count-based (matrix factorization). Both produce static embeddings: one vector per word regardless of context.


Recurrent Architectures

These models process sequences one token at a time, maintaining hidden states that carry information forward. They introduced the ability to handle variable-length inputs but struggle with efficiency and long-range dependencies.

Recurrent Neural Networks (RNN)

  • Hidden state h_t updates at each timestep, combining the current input with the previous hidden state to maintain sequence memory (see the loop sketch after this list)
  • Vanishing gradient problem makes it difficult to learn dependencies spanning more than ~10-20 tokens
  • Sequential processing means tokens must be processed in order, preventing parallelization during training
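
A minimal numpy sketch of the recurrence; the shapes, random weights, and tanh nonlinearity follow the standard Elman-style RNN, and the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, seq_len = 4, 8, 5

# Randomly initialized parameters, just to show the shapes involved.
W_xh = rng.normal(size=(d_hid, d_in))
W_hh = rng.normal(size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

x = rng.normal(size=(seq_len, d_in))   # one input sequence of length 5
h = np.zeros(d_hid)                    # initial hidden state

# The loop is inherently sequential: each h depends on the previous h.
for t in range(seq_len):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print(h.shape)  # (8,) -- final hidden state summarizing the whole sequence
```

Because each hidden state depends on the previous one, the loop cannot be parallelized across timesteps, which is exactly the efficiency limitation the third bullet points out.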

Long Short-Term Memory (LSTM) Networks

  • Gating mechanisms control information flow: the forget gate, input gate, and output gate regulate what the cell remembers or discards (written out in equations after this list)
  • Cell state provides a highway for gradients to flow across many timesteps, mitigating vanishing gradients
  • Bidirectional LSTMs process sequences in both directions, capturing both past and future context for each position
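
The gates from the first bullet, written out in the standard LSTM formulation; sigma is the logistic sigmoid, ⊙ is elementwise multiplication, and [h_{t-1}, x_t] is the previous hidden state concatenated with the current input.

```latex
f_t = \sigma\bigl(W_f\,[h_{t-1}, x_t] + b_f\bigr) \qquad \text{(forget gate)} \\
i_t = \sigma\bigl(W_i\,[h_{t-1}, x_t] + b_i\bigr) \qquad \text{(input gate)} \\
o_t = \sigma\bigl(W_o\,[h_{t-1}, x_t] + b_o\bigr) \qquad \text{(output gate)} \\
\tilde{c}_t = \tanh\bigl(W_c\,[h_{t-1}, x_t] + b_c\bigr) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{(cell state: the gradient highway)} \\
h_t = o_t \odot \tanh(c_t)
```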

Compare: RNN vs. LSTM. Both process sequences recurrently, but LSTMs add explicit memory cells and gates to preserve information over longer spans. Standard RNNs fail on dependencies beyond ~20 tokens; LSTMs can handle hundreds. If asked about the vanishing gradient problem, LSTMs are the architectural solution within the recurrent paradigm.


Attention-Based Architectures

The transformer architecture revolutionized NLP by replacing recurrence with self-attention, allowing models to directly connect any two positions in a sequence. This enables both better long-range modeling and massive parallelization.

Transformer Models

  • Self-attention computes relevance scores between all token pairs: each position attends to every other position with learned weights (a single-head sketch follows this list)
  • Positional encodings inject sequence order information since attention itself is permutation invariant
  • Parallel processing of all positions simultaneously enables training on unprecedented data scales
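
A single-head, scaled dot-product self-attention sketch in numpy; the dimensions and random weights are illustrative, and a real transformer adds multiple heads, output projections, residual connections, and layer normalization.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance between every pair of positions
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # each output mixes information from all positions

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))        # token representations (plus positional encodings)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (6, 16)
```

Note that all pairwise scores are computed in a single matrix product, which is where the parallelism in the third bullet comes from.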

BERT (Bidirectional Encoder Representations from Transformers)

  • Masked language modeling (MLM) randomly hides 15% of tokens and trains the model to predict them from bidirectional context (a toy masking sketch follows this list)
  • Encoder-only architecture produces contextualized embeddings ideal for classification, extraction, and understanding tasks
  • Fine-tuning paradigm allows a single pre-trained model to adapt to dozens of downstream tasks with minimal task-specific data
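
A toy sketch of the masking step; the real BERT pipeline operates on subword token IDs and uses a slightly more involved replacement scheme, but the core idea of hiding tokens and keeping the originals as prediction targets is the same.

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a fraction of tokens; the model learns to predict the originals."""
    tokens = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]
            tokens[i] = mask_token  # real BERT also sometimes uses a random or unchanged token
    return tokens, targets

random.seed(1)
masked, targets = mask_for_mlm("the cat sat on the mat".split())
print(masked)   # some tokens replaced with '[MASK]' (which ones varies with the seed)
print(targets)  # positions of the masked tokens mapped to their original words
```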

GPT (Generative Pre-trained Transformer)

  • Autoregressive (left-to-right) generation predicts each token based only on preceding tokens, enabling coherent text generation
  • Decoder-only architecture with causal masking prevents the model from "seeing the future" during training (see the mask sketch after this list)
  • Scaling laws show consistent performance improvements as model size, data, and compute increase together
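
A sketch of the causal mask from the second bullet: it is added to the raw attention scores so that, after the softmax, every future position receives zero weight. The score values here are random and purely illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """Mask that lets position t attend only to positions <= t."""
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))          # raw attention scores for a 4-token sequence
masked = scores + causal_mask(4)          # future positions become -inf
print(np.round(masked, 2))
# After the softmax, the -inf entries get exactly zero weight,
# so token t never "sees" tokens that come after it.
```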

Compare: BERT vs. GPT. Both use transformer attention, but BERT sees context in both directions (bidirectional) while GPT sees only left context (unidirectional). BERT excels at understanding tasks (classification, QA); GPT excels at generation tasks (completion, dialogue). This is the most common comparison question on transformer architectures.


Neural Language Models (General Category)

Neural Language Models

  • Learn distributed representations where words become dense vectors rather than sparse one-hot encodings
  • Continuous space allows generalization to unseen word combinations through vector similarity
  • Feed-forward variants (like Bengio's 2003 model) paved the way for modern deep learning approaches to NLP (written out after this list)
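
The feed-forward variant from the last bullet, written out following the general shape of Bengio et al.'s 2003 model: concatenate the embeddings C(w) of the previous n-1 words, pass them through a tanh hidden layer, and apply a softmax over the vocabulary.

```latex
x = \bigl[\,C(w_{t-1});\; C(w_{t-2});\; \dots;\; C(w_{t-n+1})\,\bigr] \\
P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \operatorname{softmax}\bigl(b + Wx + U\tanh(d + Hx)\bigr)
```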

Compare: Statistical models (N-grams, HMMs) vs. Neural models. Statistical models use discrete probability tables and suffer from sparsity; neural models learn continuous representations that generalize better. The shift from statistical to neural represents the most significant paradigm change in NLP history.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Statistical/Probabilistic | N-gram, HMM |
| Static Word Embeddings | Word2Vec, GloVe |
| Sequential Processing | RNN, LSTM |
| Attention Mechanisms | Transformer, BERT, GPT |
| Bidirectional Context | BERT, Bidirectional LSTM |
| Autoregressive Generation | GPT, RNN, LSTM |
| Solving Vanishing Gradients | LSTM (gates), Transformer (attention) |
| Pre-training + Fine-tuning | BERT, GPT |

Self-Check Questions

  1. Both LSTMs and Transformers address the vanishing gradient problem. What architectural mechanism does each use, and why do Transformers scale better for training?

  2. Word2Vec and GloVe both produce word embeddings. What is the fundamental difference in how they learn these representations, and what limitation do they share that BERT overcomes?

  3. Compare BERT and GPT: which direction(s) of context does each model use, and how does this design choice determine what tasks each excels at?

  4. If you needed to build a part-of-speech tagger using only statistical methods (no neural networks), which model would you choose and why? What assumption does it make about sequences?

  5. An FRQ asks you to explain why modern language models like GPT can generate more coherent long-form text than RNN-based models. What two key advantages of the transformer architecture would you cite in your response?