Deep Learning Systems

Common Deep Learning Architectures


Why This Matters

Understanding deep learning architectures isn't about memorizing layer configurations—it's about recognizing which architectural innovations solve which fundamental problems. You're being tested on your ability to match tasks to architectures, explain why certain designs work for specific data types, and articulate the tradeoffs between approaches. The core principles here—how networks handle spatial relationships, sequential dependencies, long-range context, and generative modeling—form the foundation for nearly every modern AI system.

Each architecture in this guide emerged to address a specific limitation of its predecessors. CNNs conquered spatial feature extraction; RNNs tackled sequential memory; LSTMs solved vanishing gradients; Transformers unlocked parallelization and global attention. Don't just memorize what each architecture does—know what problem it was designed to solve and how its mechanism achieves that goal.


Spatial Feature Extraction: Convolutional Architectures

These architectures exploit the spatial structure of data, using local connectivity and weight sharing to efficiently learn hierarchical features from images and other grid-like inputs.

Convolutional Neural Networks (CNNs)

  • Convolutional layers apply learnable filters across spatial dimensions—this parameter sharing dramatically reduces model size compared to fully connected networks
  • Pooling operations provide translation invariance and dimensionality reduction, allowing the network to focus on what features exist rather than exactly where
  • Hierarchical feature learning means early layers detect edges and textures while deeper layers compose these into complex objects—essential for image classification, detection, and segmentation
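To make the parameter-sharing and pooling ideas concrete, here is a minimal sketch assuming PyTorch; the channel counts, kernel sizes, and 32x32 input are illustrative choices, not taken from any particular published model.

```python
# Minimal CNN sketch (assumption: PyTorch; sizes are illustrative).
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges and textures
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: translation invariance
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: composed patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes a 32x32 input image

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(4, 3, 32, 32))  # batch of four 32x32 RGB images
print(logits.shape)  # torch.Size([4, 10])
```

Each filter is slid across every spatial position, so the convolutional layers hold far fewer parameters than a fully connected network over the same input would.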

Residual Networks (ResNets)

  • Skip connections add the input directly to the output of each residual block, enabling gradient flow through networks with hundreds of layers via the identity mapping $y = F(x) + x$ (see the sketch after this list)
  • Degradation problem solved—before ResNets, simply adding layers caused worse training performance; skip connections allow networks to learn residual functions instead
  • State-of-the-art baselines for image classification; ResNet-50 and ResNet-101 remain standard backbones for transfer learning across computer vision tasks
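A minimal residual block, assuming PyTorch; this is the "basic block" pattern without downsampling, and the channel count is arbitrary.

```python
# Residual block sketch (assumption: PyTorch; no downsampling, illustrative width).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(                        # F(x): two conv layers
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)                # y = F(x) + x: identity skip connection

out = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```

Because the skip path is the identity, gradients can flow straight through the addition even when F(x) contributes little, which is what makes very deep stacks trainable.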

Inception Networks

  • Multi-scale feature extraction through parallel branches applying $1 \times 1$, $3 \times 3$, and $5 \times 5$ convolutions simultaneously—captures both fine and coarse patterns
  • Computational efficiency achieved via $1 \times 1$ convolutions that reduce channel dimensions before expensive operations, factorizing large convolutions
  • GoogLeNet/Inception-v3 demonstrated that architectural innovation could outperform simply adding more layers, influencing modern efficient network design
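An Inception-style block sketch, assuming PyTorch; the branch widths here are made up for illustration and do not match GoogLeNet's published configuration.

```python
# Inception-style block sketch (assumption: PyTorch; branch widths are illustrative).
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, 1)                        # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),         # 1x1 reduces channels first
                                nn.Conv2d(16, 24, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 8, 1),
                                nn.Conv2d(8, 16, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 8, 1))

    def forward(self, x):
        # Run all branches in parallel and concatenate along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

out = InceptionBlock(32)(torch.randn(1, 32, 28, 28))
print(out.shape)  # torch.Size([1, 64, 28, 28]): 16 + 24 + 16 + 8 channels
```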

U-Net

  • Encoder-decoder symmetry compresses spatial information then reconstructs it, with the encoder capturing the "what" and the decoder recovering the "where"
  • Skip connections between corresponding layers concatenate high-resolution features from encoder to decoder, preserving fine-grained spatial details critical for segmentation
  • Biomedical imaging standard for tasks requiring pixel-level precision—works exceptionally well with limited training data due to aggressive data augmentation
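A toy U-Net-style sketch, assuming PyTorch, with a single downsampling level rather than the several used in practice; the point is the concatenation of high-resolution encoder features into the decoder.

```python
# Toy U-Net-style encoder-decoder sketch (assumption: PyTorch; one down/up level only).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 1, out_ch: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, out_ch, 1)        # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)                             # high-resolution encoder features
        m = self.mid(self.down(e))                  # compressed "what" representation
        d = self.up(m)                              # recover spatial resolution
        d = self.dec(torch.cat([d, e], dim=1))      # skip connection preserves the "where"
        return self.head(d)

masks = TinyUNet()(torch.randn(1, 1, 64, 64))
print(masks.shape)  # torch.Size([1, 2, 64, 64]): one logit map per class, full resolution
```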

Compare: ResNets vs. U-Net—both use skip connections, but for different purposes. ResNets enable deeper networks by improving gradient flow; U-Net enables precise localization by preserving spatial information. If asked about skip connections, clarify which problem you're solving.


Sequential and Temporal Modeling

These architectures process data where order matters, maintaining state or context across time steps. The key challenge is learning dependencies across varying sequence lengths.

Recurrent Neural Networks (RNNs)

  • Hidden state $h_t = f(W_h h_{t-1} + W_x x_t)$ carries information forward through the sequence—this memory enables context-aware processing (the recurrence is implemented in the sketch after this list)
  • Parameter sharing across time means the same weights process each step, making RNNs efficient for variable-length sequences in time series, text, and speech
  • Vanishing gradient problem occurs because gradients multiply through many time steps—values less than 1 shrink exponentially, preventing learning of long-range dependencies
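The recurrence fits in a few lines; this sketch assumes PyTorch tensors, uses tanh for $f$, and picks made-up dimensions.

```python
# Hand-rolled RNN step sketch (assumption: PyTorch; tanh nonlinearity, illustrative sizes).
import torch

torch.manual_seed(0)
d_in, d_hidden, seq_len = 8, 16, 5
W_h = torch.randn(d_hidden, d_hidden) * 0.1   # shared across every time step
W_x = torch.randn(d_hidden, d_in) * 0.1
xs = torch.randn(seq_len, d_in)               # one input vector per time step

h = torch.zeros(d_hidden)
for x_t in xs:
    # h_t = f(W_h h_{t-1} + W_x x_t): the same weights process every step.
    h = torch.tanh(W_h @ h + W_x @ x_t)

print(h.shape)  # torch.Size([16]): final hidden state summarizes the whole sequence
```

Backpropagating through this loop multiplies gradients by W_h (and the tanh derivative) once per step, which is exactly where the vanishing-gradient problem comes from.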

Long Short-Term Memory (LSTM) Networks

  • Gating mechanisms—forget gate, input gate, and output gate—control information flow, allowing the network to selectively remember or discard information
  • Cell state acts as a conveyor belt carrying information across many time steps with minimal transformation, solving the vanishing gradient problem through additive rather than multiplicative updates
  • Standard for sequential tasks before Transformers dominated; still preferred when data is truly sequential and attention overhead is prohibitive (speech recognition, some time series)
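A usage sketch with PyTorch's nn.LSTM (the sizes are illustrative); the separately returned cell state is the "conveyor belt" described above.

```python
# LSTM usage sketch (assumption: PyTorch's nn.LSTM; illustrative sizes).
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 100, 8)         # batch of 4 sequences, 100 time steps each

out, (h_n, c_n) = lstm(x)
print(out.shape)   # torch.Size([4, 100, 16]): hidden state at every time step
print(h_n.shape)   # torch.Size([1, 4, 16]): final hidden state
print(c_n.shape)   # torch.Size([1, 4, 16]): final cell state, updated additively via the gates
```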

Compare: RNNs vs. LSTMs—both process sequences step-by-step, but LSTMs add gating to control memory. RNNs fail on long sequences due to vanishing gradients; LSTMs handle hundreds of time steps. Know this distinction cold—it's a classic exam question.


Attention-Based Architectures

The attention mechanism computes dynamic, input-dependent weights over all positions, enabling models to capture global dependencies without sequential processing constraints.

Transformer Architecture

  • Self-attention computes relevance scores between all token pairs via $\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, weighting each position by its relevance to the query (see the single-head sketch after this list)
  • Parallelization advantage—unlike RNNs, all positions process simultaneously, enabling massive speedups on modern GPUs and training on unprecedented data scales
  • Foundation for modern NLP and beyond—encoder-decoder structure underlies translation models; decoder-only (GPT) and encoder-only (BERT) variants dominate their respective tasks
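A single-head, unmasked sketch of the attention formula above, assuming PyTorch; production Transformers add learned Q/K/V projections, multiple heads, positional encodings, and masking.

```python
# Scaled dot-product attention sketch (assumption: PyTorch; single head, no masking).
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # relevance of every position to every query
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted mix of the value vectors

seq_len, d_k = 6, 32
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))
print(attention(Q, K, V).shape)  # torch.Size([6, 32]); all positions computed in parallel
```

Note that the whole computation is a handful of matrix multiplications over the full sequence at once, which is the source of the parallelization advantage over step-by-step recurrence.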

BERT (Bidirectional Encoder Representations from Transformers)

  • Bidirectional context via masked language modeling—BERT sees tokens on both sides during pre-training, unlike left-to-right language models
  • Pre-training then fine-tuning paradigm enables transfer learning; massive unsupervised pre-training creates rich representations that adapt to downstream tasks with minimal labeled data
  • Contextual embeddings mean the same word gets different representations based on surrounding context—"bank" in "river bank" vs. "bank account" produces distinct vectors

Compare: Transformers vs. LSTMs—both handle sequences, but Transformers use attention (parallel, global context) while LSTMs use recurrence (sequential, local-to-global). Transformers dominate when compute is available; LSTMs remain relevant for streaming/real-time applications.


Generative and Representation Learning

These architectures learn to generate new data or compress information into useful representations, often through unsupervised or self-supervised objectives.

Generative Adversarial Networks (GANs)

  • Adversarial training pits generator $G$ against discriminator $D$ in a minimax game: $\min_G \max_D \, \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1-D(G(z)))]$ (one training step is sketched after this list)
  • Implicit density modeling—GANs learn to sample from the data distribution without explicitly modeling it, enabling high-fidelity image synthesis
  • Applications span creative AI—image generation, style transfer, super-resolution, and data augmentation; training instability and mode collapse remain active research challenges
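One adversarial training step sketched in PyTorch on toy 2-D data; the tiny MLPs and the non-saturating generator loss (maximizing log D(G(z)) instead of minimizing log(1 - D(G(z)))) are common practical choices, not the only option.

```python
# GAN training-step sketch (assumption: PyTorch; toy 2-D data, non-saturating G loss).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))   # z -> fake sample
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))   # sample -> realness logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 2) + 3.0        # stand-in for a batch of real data
z = torch.randn(64, 4)                 # latent noise

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to fool D by pushing D(G(z)) toward 1.
loss_g = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(float(loss_d), float(loss_g))
```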

Autoencoders

  • Encoder-decoder bottleneck forces the network to learn compressed representations; encoder maps input to latent code $z$, decoder reconstructs from $z$
  • Reconstruction loss (typically MSE or cross-entropy) trains the network to preserve essential information while discarding noise
  • Variants for different goals—variational autoencoders (VAEs) learn probabilistic latent spaces for generation; denoising autoencoders improve robustness; sparse autoencoders encourage interpretable features
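A minimal dense autoencoder sketch, assuming PyTorch and an MSE reconstruction loss; the 784-dimensional input is meant to suggest flattened 28x28 images, and the 32-dimensional bottleneck is arbitrary.

```python
# Autoencoder sketch (assumption: PyTorch; dense bottleneck, MSE reconstruction loss).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in: int = 784, d_latent: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_latent))   # compress to latent code z
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))       # reconstruct from z

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.rand(16, 784)                        # e.g. a batch of flattened 28x28 images
loss = nn.functional.mse_loss(model(x), x)     # reconstruction loss
loss.backward()
print(float(loss))
```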

Compare: GANs vs. Autoencoders—both generate data, but GANs use adversarial training (sharper outputs, harder to train) while autoencoders use reconstruction loss (more stable, often blurrier). VAEs bridge the gap with probabilistic latent spaces.


Quick Reference Table

Concept                     | Best Examples
Spatial feature extraction  | CNNs, ResNets, Inception
Precise segmentation        | U-Net, encoder-decoder CNNs
Sequential processing       | RNNs, LSTMs
Long-range dependencies     | LSTMs, Transformers
Parallel sequence modeling  | Transformers, BERT
Pre-trained language models | BERT, GPT (Transformer-based)
Image generation            | GANs, VAEs
Representation learning     | Autoencoders, VAEs, BERT

Self-Check Questions

  1. Both ResNets and U-Net use skip connections—explain the different problems each architecture solves with this mechanism.

  2. Why do standard RNNs struggle with long sequences, and what specific architectural innovation in LSTMs addresses this limitation?

  3. Compare the training objectives of GANs and autoencoders. When would you choose one over the other for a generation task?

  4. A Transformer processes a 512-token sequence while an LSTM processes the same sequence. Describe the key computational and representational differences between these approaches.

  5. You need to segment tumors in medical CT scans with limited labeled data. Which architecture would you choose, and what specific features make it suitable for this task?