Language models are the foundation of virtually every modern NLP application you'll encounter on exams and in practice. Understanding how these models work isn't just about memorizing architectures; you're being tested on the evolution of how machines capture context, sequential dependencies, and semantic meaning in text. From the simplest probability-based approaches to attention-powered transformers, each model type represents a different solution to the same fundamental challenge: how do we teach machines to predict and generate human language?
The key concepts running through this topic include statistical probability estimation, sequential information flow, distributed representations, and attention mechanisms. When you see exam questions about language models, they're typically probing whether you understand the tradeoffs: Why do some models struggle with long-range dependencies? How does parallelization improve training? What's the difference between learning local patterns versus global semantic relationships? Don't just memorize model names; know what problem each architecture solves and where it falls short.
Statistical and probabilistic models (N-grams, Hidden Markov Models) represent the earliest approaches to language modeling, relying on probability distributions over word sequences rather than learned representations. They're computationally simple but limited in their ability to capture meaning.
Compare: N-grams vs. HMMs. Both rely on the Markov assumption and use probability tables, but N-grams model surface word sequences while HMMs model latent structure beneath observable text. If an FRQ asks about sequence labeling (POS tagging, named entity recognition), HMMs are your go-to statistical approach.
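To make the probability-table idea concrete, here is a minimal sketch of a bigram (N = 2) model estimated from raw counts. The tiny corpus and function names are illustrative only; real models train on far larger corpora and add smoothing for unseen pairs.

```python
from collections import defaultdict

# Toy corpus; in practice the counts come from a large training set.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)

for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[(prev, curr)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 0.25: "the" precedes "cat" once out of 4 occurrences
print(bigram_prob("sat", "on"))   # 1.0: "sat" is always followed by "on" in this corpus
```

The Markov assumption is visible in the function signature: only the immediately preceding word conditions the prediction. An HMM reuses the same counting machinery but places it over hidden states (such as POS tags), with separate transition and emission probabilities.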
Word embedding models (Word2Vec, GloVe) focus on learning dense vector representations of words that capture semantic relationships. They don't predict sequences directly but provide the foundation for downstream tasks.
Compare: Word2Vec vs. GloVe. Both produce dense word vectors, but Word2Vec learns from local context windows while GloVe learns from global co-occurrence counts. Word2Vec is predictive (neural), GloVe is count-based (matrix factorization). Both produce static embeddings: one vector per word regardless of context.
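As a sketch of the static-embedding workflow, the snippet below trains a skip-gram Word2Vec model with gensim (assumed installed, version 4.x parameter names). The toy corpus is far too small to yield meaningful similarities, but the API pattern is the same at scale.

```python
from gensim.models import Word2Vec

# Tokenized sentences; a real run would use millions of sentences.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "cats and dogs are animals".split(),
]

# Skip-gram (sg=1) predicts context words from the center word within a local window.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vec = model.wv["cat"]                     # one fixed vector per word (static embedding)
print(vec.shape)                          # (50,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity between the two static vectors
```

GloVe vectors are usually loaded pre-trained rather than trained locally, since they come from factorizing a global co-occurrence matrix; either way, each word maps to exactly one vector regardless of the sentence it appears in.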
Recurrent models (RNNs, LSTMs) process sequences one token at a time, maintaining hidden states that carry information forward. They introduced the ability to handle variable-length inputs but struggle with efficiency and long-range dependencies.
Compare: RNN vs. LSTM. Both process sequences recurrently, but LSTMs add explicit memory cells and gates to preserve information over longer spans. Standard RNNs fail on dependencies beyond ~20 tokens; LSTMs can handle hundreds. If asked about the vanishing gradient problem, LSTMs are the architectural solution within the recurrent paradigm.
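A minimal PyTorch sketch (assuming torch is installed) of the recurrent interface: the LSTM steps through the sequence in order and returns both per-step outputs and the final hidden and cell states that carry information forward.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 2, 30, 16, 32

# The LSTM adds a gated cell state on top of the plain RNN hidden state.
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)

x = torch.randn(batch_size, seq_len, input_size)  # any seq_len works at run time
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (2, 30, 32): hidden state at every time step
print(h_n.shape)     # (1, 2, 32): final hidden state, passed forward step by step
print(c_n.shape)     # (1, 2, 32): final cell state, the gated long-term memory
```

Because each step depends on the previous hidden state, the time steps cannot be processed in parallel; this sequential bottleneck is exactly what the transformer removes.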
The transformer architecture revolutionized NLP by replacing recurrence with self-attention, allowing models to directly connect any two positions in a sequence. This enables both better long-range modeling and massive parallelization.
Compare: BERT vs. GPT. Both use transformer attention, but BERT sees context in both directions (bidirectional) while GPT sees only left context (unidirectional). BERT excels at understanding tasks (classification, QA); GPT excels at generation tasks (completion, dialogue). This is the most common comparison question on transformer architectures.
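To see how attention connects any two positions directly, here is a minimal single-head scaled dot-product self-attention sketch in NumPy. For brevity it sets Q = K = V = X; real transformers add learned projections, multiple heads, and stacked layers. The `causal` flag is an illustrative stand-in for GPT-style masking versus BERT-style bidirectional attention.

```python
import numpy as np

def self_attention(X, causal=False):
    """Single-head scaled dot-product self-attention over a (seq_len, d) matrix X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len): every position scores every other
    if causal:
        # GPT-style masking: position i may only attend to positions <= i.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                              # weighted mix of all positions

X = np.random.randn(5, 8)                    # 5 tokens, 8-dim representations
print(self_attention(X).shape)               # (5, 8): bidirectional (BERT-style) attention
print(self_attention(X, causal=True).shape)  # (5, 8): left-to-right (GPT-style) attention
```

Every entry in the seq_len x seq_len score matrix is computed independently, which is why attention parallelizes across positions where recurrence cannot.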
Compare: Statistical models (N-grams, HMMs) vs. Neural models. Statistical models use discrete probability tables and suffer from sparsity; neural models learn continuous representations that generalize better. The shift from statistical to neural represents the most significant paradigm change in NLP history.
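A small illustration of the sparsity problem, using hypothetical counts: any word pair absent from training data gets exactly zero probability under a maximum-likelihood count table, however plausible it is.

```python
from collections import defaultdict

# Hypothetical counts from a training corpus.
bigram_counts = defaultdict(int, {("the", "cat"): 12, ("the", "dog"): 9})
unigram_counts = defaultdict(int, {"the": 40})

def mle_prob(prev, curr):
    return bigram_counts[(prev, curr)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(mle_prob("the", "cat"))      # 0.3: seen in training
print(mle_prob("the", "hamster"))  # 0.0: unseen pair gets zero probability, however plausible

# Neural models avoid this cliff: similar words get nearby continuous vectors,
# so probability mass generalizes to unseen but semantically similar combinations.
```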
| Concept | Best Examples |
|---|---|
| Statistical/Probabilistic | N-gram, HMM |
| Static Word Embeddings | Word2Vec, GloVe |
| Sequential Processing | RNN, LSTM |
| Attention Mechanisms | Transformer, BERT, GPT |
| Bidirectional Context | BERT, Bidirectional LSTM |
| Autoregressive Generation | GPT, RNN, LSTM |
| Solving Vanishing Gradients | LSTM (gates), Transformer (attention) |
| Pre-training + Fine-tuning | BERT, GPT |
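For the pre-training + fine-tuning row, here is a brief sketch of the usual workflow with the Hugging Face transformers library (assumed installed): load a pre-trained BERT checkpoint, attach a fresh classification head, and run a fine-tuning step on labeled data. The checkpoint name, label count, and learning rate are illustrative choices, not prescriptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained weights come from self-supervised training on raw text;
# the sequence-classification head on top is randomly initialized.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("a surprisingly good movie", return_tensors="pt")
labels = torch.tensor([1])  # hypothetical positive-sentiment label

# One fine-tuning step: the loss updates both the new head and the pre-trained layers.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
```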
Both LSTMs and Transformers address the vanishing gradient problem: what architectural mechanism does each use, and why do Transformers scale better for training?
Word2Vec and GloVe both produce word embeddings. What is the fundamental difference in how they learn these representations, and what limitation do they share that BERT overcomes?
Compare BERT and GPT: which direction(s) of context does each model use, and how does this design choice determine what tasks each excels at?
If you needed to build a part-of-speech tagger using only statistical methods (no neural networks), which model would you choose and why? What assumption does it make about sequences?
An FRQ asks you to explain why modern language models like GPT can generate more coherent long-form text than RNN-based models. What two key advantages of the transformer architecture would you cite in your response?