👁️Computer Vision and Image Processing Unit 7 Review

7.1 Artificial neural networks

Written by the Fiveable Content Team • Last updated August 2025

Neural networks are the foundation of modern computer vision. They process visual data through layers of interconnected artificial neurons, enabling machines to recognize patterns, classify images, and make decisions about what they "see."

This guide covers how neural networks work, how they're trained, the major architectures used in computer vision, and the practical challenges you'll encounter when building and deploying them.

Fundamentals of neural networks

Neural networks are loosely inspired by the brain, but in practice they're mathematical systems that learn patterns from data. Each network is built from simple computational units (artificial neurons) connected together in layers. The magic comes from how these connections are tuned during training.

Biological inspiration

Neural networks borrow a few key ideas from biology:

  • They consist of interconnected nodes (artificial neurons) that process and transmit information
  • They learn and adapt through exposure to training data, somewhat like how the brain strengthens certain neural pathways through experience
  • They use parallel processing to handle complex tasks efficiently

The analogy to the brain is useful but loose. Artificial neurons are far simpler than biological ones, and the learning algorithms don't closely match how the brain actually works. Think of it as "inspired by" rather than "modeled after."

Artificial neurons

An artificial neuron is the basic computational unit. It takes in multiple inputs, combines them, and produces a single output. Here's what happens inside one:

  1. Each input x_i gets multiplied by a corresponding weight w_i
  2. All weighted inputs are summed together, plus a bias term b
  3. The sum passes through an activation function f to produce the output

The full equation:

y = f\left(\sum_{i=1}^n w_i x_i + b\right)

The weights control how much influence each input has. The bias shifts the activation threshold. During training, the network adjusts these weights and biases to improve its predictions.
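The equation above can be sketched in a few lines of NumPy. All the values here (inputs, weights, bias) are illustrative, and ReLU stands in for the activation function:

```python
import numpy as np

def neuron(x, w, b, f):
    """Single artificial neuron: weighted sum of inputs plus bias,
    passed through an activation function f."""
    return f(np.dot(w, x) + b)

# Illustrative values: three inputs with a ReLU activation.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, 0.1])
b = 0.1
relu = lambda z: max(0.0, z)

y = neuron(x, w, b, relu)  # 0.4*0.5 + 0.3*(-1.0) + 0.1*2.0 + 0.1 = 0.2
```

Changing the weights changes which input patterns make the neuron fire, which is exactly what training adjusts.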

Network architectures

Architecture refers to how neurons are arranged and connected. The three main components are:

  • Input layer: receives raw data (e.g., pixel values of an image)
  • Hidden layers: process and extract increasingly abstract features
  • Output layer: produces the final prediction or classification

Common architectures, in order of complexity:

  • Single-layer perceptron: one layer of weights connecting inputs to outputs. Can only solve linearly separable problems.
  • Multi-layer perceptron (MLP): adds one or more hidden layers, enabling it to learn non-linear relationships.
  • Deep neural networks: MLPs with many hidden layers, capable of learning complex hierarchical features from data.

Activation functions

Without activation functions, a neural network would just compute linear transformations no matter how many layers it has. Activation functions introduce non-linearity, which is what allows networks to learn complex patterns.

The most common ones:

  • Sigmoid: f(x) = \frac{1}{1 + e^{-x}}. Outputs values between 0 and 1. Useful for probabilities, but suffers from vanishing gradients in deep networks.

  • ReLU (Rectified Linear Unit): f(x) = \max(0, x). The default choice for most hidden layers. Simple, fast, and avoids vanishing gradients for positive values. Can cause "dead neurons" if inputs are consistently negative.

  • Tanh: f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}. Outputs between -1 and 1. Zero-centered, which can help training, but still prone to vanishing gradients.

Your choice of activation function affects how quickly the network trains and how well it performs. ReLU and its variants (Leaky ReLU, GELU) dominate in modern computer vision.
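These functions are one-liners in NumPy. A minimal sketch, including Leaky ReLU as one of the variants mentioned above (the 0.01 slope is a common but arbitrary default):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive ones.
    return np.maximum(0.0, x)

def tanh(x):
    # Zero-centered squashing into (-1, 1).
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    # Small slope for negative inputs helps avoid "dead neurons".
    return np.where(x > 0, x, alpha * x)
```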

Training neural networks

Training is the process of adjusting a network's weights and biases so it produces accurate outputs. You feed it labeled examples, measure how wrong its predictions are, and then nudge the parameters to reduce that error. Repeat thousands (or millions) of times.

Backpropagation algorithm

Backpropagation is how the network figures out which parameters to adjust and by how much. It works by computing gradients of the loss function with respect to every weight in the network, using the chain rule from calculus.

The process in four steps:

  1. Forward pass: Input data flows through the network, layer by layer, producing a prediction
  2. Compute loss: Compare the prediction to the true label using a loss function
  3. Backward pass: Starting from the output, compute the gradient of the loss with respect to each weight by applying the chain rule through each layer
  4. Update parameters: Adjust weights in the direction that reduces the loss

The key insight is that backpropagation makes gradient computation efficient even in deep networks. Without it, you'd have to compute each gradient independently, which would be prohibitively slow.
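The four steps above can be traced by hand for a single sigmoid neuron with a squared-error loss. This is a minimal sketch with illustrative values; the chain rule factors are written out explicitly so each step is visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup (illustrative values): one neuron, one training example.
x, y_true = 1.5, 1.0
w, b = 0.8, -0.2

# 1. Forward pass
z = w * x + b
a = sigmoid(z)

# 2. Compute loss (squared error)
loss = (a - y_true) ** 2

# 3. Backward pass: apply the chain rule factor by factor
dL_da = 2 * (a - y_true)      # loss w.r.t. the neuron's output
da_dz = a * (1 - a)           # derivative of sigmoid
dL_dw = dL_da * da_dz * x     # dz/dw = x
dL_db = dL_da * da_dz         # dz/db = 1

# 4. Update parameters (one gradient descent step, learning rate 0.1)
lr = 0.1
w -= lr * dL_dw
b -= lr * dL_db
```

Re-running the forward pass with the updated w and b gives a slightly smaller loss, which is the whole point of the loop.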

Gradient descent optimization

Once backpropagation gives you the gradients, gradient descent uses them to update the parameters. The learning rate controls how big each update step is: too large and training becomes unstable, too small and it takes forever to converge.

Three main variants:

  • Batch gradient descent: computes gradients over the entire training set before updating. Stable but slow for large datasets.
  • Stochastic gradient descent (SGD): updates after each single training example. Noisy but fast.
  • Mini-batch gradient descent: updates after a small batch of examples (e.g., 32 or 64). The standard approach, balancing speed and stability.

Modern optimizers like Adam and RMSprop go further by adapting the learning rate for each parameter individually, which typically leads to faster and more reliable convergence.
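A minimal mini-batch gradient descent loop looks like this. The sketch below fits a toy linear regression; the data, batch size of 32, and learning rate are all illustrative choices, not prescriptions:

```python
import numpy as np

# Synthetic regression data (illustrative): 256 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # MSE gradient computed on this mini-batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
        w -= lr * grad                         # one update per mini-batch
```

Setting batch_size to 1 turns this into SGD; setting it to len(X) turns it into batch gradient descent.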

Loss functions

The loss function quantifies how far off the network's predictions are from the ground truth. It's what the optimizer is trying to minimize.

Common choices in computer vision:

  • Mean Squared Error (MSE): used for regression tasks (e.g., predicting bounding box coordinates). Penalizes large errors heavily.
  • Cross-entropy loss: the standard for classification tasks. Measures the difference between predicted class probabilities and the true labels.
  • Focal loss: a modified cross-entropy that down-weights easy examples and focuses on hard ones. Particularly useful in object detection where background examples vastly outnumber objects.

Picking the right loss function matters. It shapes what the network optimizes for, so it should align with what you actually care about in your task.
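Cross-entropy and focal loss for a single example can be sketched directly from their definitions (the 3-class probabilities and gamma value below are illustrative):

```python
import numpy as np

def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy for one example: -log of the probability assigned
    to the true class. eps guards against log(0)."""
    return -np.log(probs[label] + eps)

def focal_loss(probs, label, gamma=2.0, eps=1e-12):
    """Focal loss: cross-entropy scaled by (1 - p)^gamma, which
    down-weights examples the model already gets right."""
    p = probs[label]
    return -((1 - p) ** gamma) * np.log(p + eps)

# Illustrative 3-class prediction.
probs = np.array([0.7, 0.2, 0.1])
confident_loss = cross_entropy(probs, label=0)  # true class got 0.7
wrong_loss = cross_entropy(probs, label=2)      # true class got only 0.1
focal_easy = focal_loss(probs, label=0)         # much smaller than CE
```

Note how the loss grows as the probability assigned to the true class shrinks, and how focal loss shrinks the penalty on an already-confident prediction.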

Overfitting vs underfitting

These are the two main failure modes during training:

Overfitting means the model memorized the training data, including its noise, instead of learning general patterns. You'll see high accuracy on training data but poor performance on new data. Common fixes:

  • Regularization: L1/L2 penalties on weight magnitudes discourage overly complex models
  • Dropout: randomly deactivates neurons during training, forcing the network to not rely on any single feature
  • Early stopping: halt training when validation performance stops improving
  • Data augmentation: artificially expand the training set with transformed versions of existing images
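Of these fixes, dropout is simple enough to sketch. Below is the common "inverted dropout" formulation (shapes and the 0.5 rate are illustrative): surviving activations are scaled up so the expected value is unchanged, which lets inference skip dropout entirely:

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: during training, zero a random fraction of
    activations and scale the survivors by 1/(1 - rate)."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10_000)
out = dropout(a, rate=0.5, rng=rng)
# Roughly half the units are zeroed; the rest become 2.0,
# so the mean stays close to 1.0.
```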

Underfitting means the model is too simple to capture the patterns in the data. Performance is poor on both training and test data. Fixes include increasing model capacity (more layers or neurons), training longer, or reducing regularization.

Use a validation set (data the model doesn't train on) to monitor which problem you're facing. If training accuracy is high but validation accuracy is low, you're overfitting. If both are low, you're underfitting.

Types of neural networks

Different architectures are designed for different types of data and tasks. Choosing the right one for your problem is a critical design decision.

Feedforward networks

The simplest architecture. Data flows in one direction from input to output with no loops or feedback connections. Multi-layer perceptrons are feedforward networks.

These work for basic classification tasks, but they treat each input as a flat vector. For images, that means they ignore spatial structure entirely: a pixel in the top-left corner is treated the same as one in the center. This makes them a poor choice for most computer vision tasks, which is why CNNs were developed.

Convolutional neural networks

CNNs are purpose-built for grid-structured data like images. Instead of connecting every neuron to every input, they use small learnable filters that slide across the image, detecting local patterns.

Key components:

  • Convolutional layers: apply filters to extract local features (edges, textures, shapes). Each filter produces a feature map.
  • Pooling layers: reduce spatial dimensions (e.g., max pooling takes the largest value in each small region), making the network more efficient and somewhat invariant to small translations.
  • Fully connected layers: flatten the feature maps and produce the final classification output.

CNNs learn a spatial hierarchy: early layers detect simple features like edges, middle layers combine those into textures and parts, and deeper layers recognize whole objects. This is what makes them so effective for vision tasks.
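The convolution and pooling operations can be written naively in a few lines. The sketch below uses a Sobel-like vertical-edge kernel on a toy image (dark left half, bright right half); real frameworks use heavily optimized versions of the same idea:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution (technically cross-correlation, as in
    most deep learning frameworks): slide the kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value per patch."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])   # responds to vertical edges
fmap = conv2d(image, kernel)         # strong response along the boundary
pooled = max_pool(fmap)              # halved spatial dimensions
```

The feature map peaks exactly where the dark-to-bright boundary sits, which is what "detecting a local pattern" means in practice.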

Popular architectures include AlexNet (2012, proved deep CNNs work), VGGNet (very deep with small 3×3 filters), and ResNet (introduced skip connections to train networks with 100+ layers).

Recurrent neural networks

RNNs process sequential data by maintaining an internal state (memory) that gets updated at each time step. In computer vision, they're used when temporal order matters:

  • Video analysis: understanding actions across frames
  • Image captioning: generating text descriptions of images word by word
  • Action recognition: classifying activities in video sequences

Standard RNNs struggle with long sequences due to vanishing gradients. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) variants use gating mechanisms to selectively remember and forget information, making them much more effective for longer sequences.
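The core recurrent update is compact. A minimal vanilla RNN cell sketch (the dimensions, random weights, and sequence are all illustrative; the inputs could be per-frame feature vectors from a video):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Illustrative random weights; real ones are learned.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One recurrent update: mix the current input with the previous
    hidden state, then squash with tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a short sequence; h carries memory across time steps.
sequence = rng.normal(size=(5, input_dim))
h = np.zeros(hidden_dim)
for x_t in sequence:
    h = rnn_step(x_t, h)
```

LSTMs and GRUs replace this single update with gated ones, but the "hidden state carried forward" structure is the same.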


Generative adversarial networks

GANs consist of two networks trained in opposition:

  • The generator tries to create realistic synthetic images
  • The discriminator tries to distinguish real images from generated ones

As training progresses, the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes. The result is a generator that can produce highly realistic images.

Applications in computer vision:

  • Image synthesis: generating photorealistic faces, scenes, or objects
  • Style transfer: applying the artistic style of one image to the content of another
  • Data augmentation: creating synthetic training data when real data is scarce

GANs are notoriously tricky to train. Mode collapse (the generator produces only a narrow range of outputs) and training instability are common challenges.

Deep learning

Deep learning refers to neural networks with many layers. The "deep" part is what enables these networks to automatically learn useful features from raw data, rather than requiring hand-engineered features like traditional computer vision methods.

Deep vs shallow networks

A shallow network might have one or two hidden layers. A deep network has many, sometimes hundreds. The practical differences:

  • Deep networks can learn hierarchical representations, where each layer builds on the previous one. This lets them capture increasingly abstract patterns.
  • Shallow networks can theoretically approximate any function, but they may need an impractically large number of neurons to do so. Deep networks achieve the same representational power more efficiently.

The tradeoff: deep networks require more data, more compute, and more careful training (proper initialization, batch normalization, residual connections) to avoid issues like vanishing gradients.

Feature hierarchy

One of the most important concepts in deep learning for vision. Each layer in a deep network learns features at a different level of abstraction:

  • Early layers: edges, corners, color gradients
  • Middle layers: textures, simple shapes, parts of objects (e.g., eyes, wheels)
  • Deep layers: whole objects, scenes, semantic concepts

You can actually visualize what each layer has learned using techniques like feature map visualization (showing what activates a particular filter) and activation maximization (generating an input that maximally activates a specific neuron). These tools confirm that deep networks really do build up from simple to complex features.

Transfer learning

Training a deep network from scratch requires massive datasets and compute time. Transfer learning sidesteps this by reusing a model that was already trained on a large dataset (typically ImageNet, which has 1.2 million images across 1,000 categories).

Why it works: the low-level and mid-level features learned on one image dataset (edges, textures, shapes) are broadly useful for many vision tasks. Only the high-level, task-specific features need to be learned from scratch.

Two common approaches:

  • Feature extraction: use the pre-trained network as a fixed feature extractor. Remove the final classification layer, feed your images through, and train a new classifier on the extracted features.
  • Fine-tuning: start with the pre-trained weights and continue training on your new dataset, allowing some or all layers to update.

Fine-tuning pre-trained models

Fine-tuning requires some care to avoid destroying the useful features the model already learned. A typical process:

  1. Replace the final layer(s) with new layers matching your task (e.g., a new classification head with the right number of output classes)
  2. Freeze the early layers so their weights don't change during initial training
  3. Train the new layers using a relatively low learning rate
  4. Gradually unfreeze deeper layers and continue training with an even lower learning rate

The idea is to preserve the general features in early layers while adapting the higher layers to your specific task. If your new dataset is very small, freeze more layers. If it's larger or very different from ImageNet, unfreeze more.

Applications in computer vision

Image classification

The task of assigning a label to an entire image. A network takes in an image and outputs a probability distribution over predefined categories (e.g., "cat: 0.92, dog: 0.07, bird: 0.01").

Real-world uses:

  • Medical imaging: classifying X-rays or pathology slides as normal vs. abnormal
  • Facial recognition: identifying or verifying individuals
  • Content-based image retrieval: finding visually similar images in a database

Key challenges include handling thousands of classes, distinguishing fine-grained categories (e.g., different bird species), and dealing with class imbalance where some categories have far fewer training examples than others.

Object detection

Goes beyond classification by identifying where objects are in an image, not just what the image contains. The output is typically a set of bounding boxes, each with a class label and confidence score.

The pipeline generally involves:

  • Region proposals: identifying candidate regions that might contain objects
  • Bounding box regression: refining the coordinates of each proposed box
  • Classification: determining what object (if any) is in each box

Popular architectures:

  • YOLO (You Only Look Once): processes the entire image in a single pass, making it very fast. Good for real-time applications.
  • SSD (Single Shot Detector): similar single-pass approach with multi-scale detection.
  • Faster R-CNN: uses a region proposal network for higher accuracy, but slower than single-shot methods.

Applications include pedestrian and traffic sign detection in autonomous vehicles, surveillance systems, and retail inventory tracking.

Semantic segmentation

Assigns a class label to every single pixel in an image. Instead of a bounding box around a car, you get the exact pixel-level outline of the car.

This provides much more detailed scene understanding than object detection. Uses include:

  • Medical imaging: precisely delineating organ boundaries or tumor regions
  • Autonomous driving: identifying road surfaces, lane markings, sidewalks, and obstacles at the pixel level
  • Satellite imagery: mapping land use, vegetation, and water bodies

Key architectures include U-Net (originally designed for biomedical image segmentation, uses an encoder-decoder structure with skip connections) and DeepLab (uses atrous/dilated convolutions to capture multi-scale context).

Image generation

Creating new images that don't exist in the training set. The network learns the underlying distribution of the training data and samples from it.

Main techniques:

  • VAEs (Variational Autoencoders): learn a compressed latent representation and generate images by sampling from it. Tend to produce slightly blurry results.
  • GANs: produce sharper, more realistic images through adversarial training (see above).
  • Diffusion models: iteratively denoise random noise into coherent images. Currently produce some of the highest-quality generated images.

Applications range from art creation and style transfer to data augmentation (generating synthetic training examples) and image inpainting (filling in missing or damaged regions of an image).

Neural network implementations

The two dominant frameworks for building neural networks:

  • TensorFlow (Google): mature ecosystem supporting both research and production. Its high-level API, Keras, makes it straightforward to build and train models quickly. Also offers low-level control when you need it.
  • PyTorch (Meta): uses dynamic computation graphs, meaning you can change the network structure on the fly. This makes debugging easier and is why it's the most popular choice in research.

Other frameworks like Caffe, MXNet, and ONNX (an interoperability format) exist but are less widely used. When choosing, consider the community size, available tutorials, and whether you need production deployment tools or research flexibility.


Hardware acceleration

Neural network training involves massive amounts of matrix multiplication, which benefits enormously from specialized hardware:

  • GPUs: their massively parallel architecture (thousands of cores) makes them ideal for the matrix operations in neural networks. NVIDIA GPUs with CUDA support are the standard.
  • TPUs: Google's custom chips designed specifically for machine learning. Optimized for TensorFlow workloads and available through Google Cloud.
  • FPGAs: programmable chips that offer a middle ground between flexibility and efficiency. Used in some edge deployment scenarios.

Your hardware choice depends on model size, whether you're training or running inference, and your budget. Training a large model might require multiple GPUs, while inference on a mobile device might use a small, optimized chip.

Distributed training

When a model or dataset is too large for a single machine, you distribute the work across multiple devices:

  • Data parallelism: each device gets a copy of the full model but processes a different subset of the data. Gradients are averaged across devices before updating weights.
  • Model parallelism: different parts of the model run on different devices. Necessary when the model itself doesn't fit in one device's memory.
  • Pipeline parallelism: combines both approaches, splitting the model into stages and processing different mini-batches through the pipeline simultaneously.
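Data parallelism in miniature: each "device" computes gradients on its own shard of the batch, the gradients are averaged (the all-reduce step), and every copy applies the same update. The sketch below runs the shards sequentially for clarity; everything (data, shard count, learning rate) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_w
w = np.zeros(4)                      # the shared model parameters

def mse_grad(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

n_devices = 4
shards = np.array_split(np.arange(len(X)), n_devices)

for step in range(100):
    # Each "device" works on its own shard (in parallel in practice).
    grads = [mse_grad(X[s], y[s], w) for s in shards]
    avg_grad = np.mean(grads, axis=0)   # the all-reduce / averaging step
    w -= 0.1 * avg_grad                 # identical update on every copy
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, which is why the model copies stay synchronized.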

Challenges include communication overhead between devices, keeping model copies synchronized, and diminishing returns as you add more devices. PyTorch's DistributedDataParallel and Horovod are common tools for this.

Model deployment

Getting a trained model into production where it actually processes real data:

  • Cloud deployment: services like AWS SageMaker or Google Cloud AI Platform handle scaling and infrastructure
  • Edge deployment: running models on smartphones, cameras, or IoT devices, which requires aggressive model optimization
  • On-premise servers: for situations with strict data privacy requirements

To make models fast and small enough for deployment, you'll often use:

  • Quantization: reducing weight precision from 32-bit floats to 8-bit integers
  • Pruning: removing weights or neurons that contribute little to the output
  • Knowledge distillation: training a smaller "student" model to mimic a larger "teacher" model

Tools like TensorFlow Serving, TensorFlow Lite (for mobile), and ONNX Runtime handle the serving infrastructure.
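Quantization is the most mechanical of the three optimizations above. A sketch of symmetric per-tensor int8 quantization (the weight distribution is illustrative; real toolchains also quantize activations and calibrate scales per channel):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric 8-bit quantization: map float32 weights onto the
    int8 range [-127, 127] using a single per-tensor scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Illustrative weight tensor.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=1000).astype(np.float32)

q, scale = quantize_int8(w)
w_restored = dequantize(q, scale)

# int8 storage is 4x smaller than float32, and the round-trip error
# is bounded by half a quantization step.
max_err = np.abs(w - w_restored).max()
```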

Challenges and limitations

Interpretability issues

Neural networks are often called "black boxes" because it's difficult to understand why they made a particular prediction. For image classification, you might know the network says "cat" with 95% confidence, but not what visual features drove that decision.

This is a serious problem in high-stakes domains like medical diagnosis or autonomous driving, where you need to trust and verify the model's reasoning.

Techniques to improve interpretability:

  • Saliency maps: highlight which pixels most influenced the output
  • Attention maps: show where the network "looked" when making its decision
  • LIME: generates local explanations by testing how small changes to the input affect the output

There's generally a tradeoff: simpler models are more interpretable but less accurate, while complex deep networks perform better but are harder to explain.

Adversarial attacks

Small, carefully crafted perturbations to an input image can cause a network to misclassify it with high confidence. These perturbations are often imperceptible to humans. For example, adding a tiny amount of noise to an image of a panda can make a network classify it as a gibbon.

Types of attacks:

  • White-box: the attacker knows the model architecture and weights
  • Black-box: the attacker can only query the model and observe outputs
  • Targeted: forces misclassification into a specific wrong class
  • Untargeted: causes any misclassification

Defenses include adversarial training (including adversarial examples in the training set), input preprocessing to remove perturbations, and ensemble methods that are harder to fool simultaneously.
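The core attack idea, the fast gradient sign method (FGSM), fits in a few lines. The sketch below attacks a toy logistic-regression "model" rather than a deep network, and all weights and inputs are illustrative; the key point is that the gradient is taken with respect to the input, not the weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy logistic-regression model (illustrative weights).
w, b = np.array([1.0, -2.0, 0.5]), 0.1
x = np.array([0.9, -0.8, 0.4])
y_true = 1.0

p = sigmoid(w @ x + b)            # confident, correct prediction

# Gradient of the cross-entropy loss w.r.t. the INPUT x:
# for logistic regression this is (p - y) * w.
grad_x = (p - y_true) * w

eps = 0.4                          # perturbation budget
x_adv = x + eps * np.sign(grad_x)  # FGSM: step in the sign of the gradient

p_adv = sigmoid(w @ x_adv + b)     # confidence on the true class drops
```

With a larger eps the prediction flips outright; on deep networks the same construction works with eps values small enough to be invisible to humans.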

Ethical considerations

Neural networks in computer vision raise several ethical concerns:

  • Bias: if training data overrepresents certain demographics, the model will perform worse on underrepresented groups. Facial recognition systems have shown significantly higher error rates for certain racial groups and genders.
  • Privacy: facial recognition and surveillance systems can be used to track individuals without consent.
  • Deepfakes: GANs and diffusion models can generate convincing fake images and videos, enabling misinformation.

Addressing these issues requires diverse training datasets, fairness-aware evaluation metrics, transparent development practices, and clear regulatory frameworks.

Computational requirements

Training state-of-the-art vision models demands significant resources. Training GPT-scale or large vision models can cost millions of dollars in compute and consume enormous amounts of energy.

Approaches to reduce computational costs:

  • Efficient architectures: MobileNet and EfficientNet are designed to achieve good accuracy with far fewer parameters and operations
  • Model compression: pruning and quantization (described above) reduce model size and inference cost
  • Hardware-aware neural architecture search (NAS): automatically designs architectures optimized for specific hardware constraints

These techniques are especially important for deploying vision models on resource-constrained devices like smartphones and embedded systems.

Future directions

Neuromorphic computing

Neuromorphic chips (like Intel's Loihi and IBM's TrueNorth) are designed to process information more like biological neurons, using spikes rather than continuous values. Potential advantages for computer vision include dramatically lower power consumption and real-time processing for event-based cameras, which only record changes in a scene rather than full frames.

Quantum neural networks

Quantum computing could theoretically provide exponential speedups for certain neural network operations and handle high-dimensional data more efficiently. However, current quantum hardware is limited in scale, noisy, and requires error correction. Practical quantum advantages for computer vision remain largely theoretical at this stage.

Explainable AI

A growing research area focused on making neural networks more transparent:

  • Attention mechanisms that highlight which image regions the model focuses on
  • Concept-based explanations that link internal activations to human-understandable concepts (e.g., "this neuron activates for striped textures")
  • Counterfactual explanations that show what would need to change in an input to produce a different output

These techniques are particularly valuable in medical imaging, autonomous driving, and any application where decisions need to be auditable.

Energy-efficient architectures

As AI systems scale up, energy consumption becomes a growing concern. Research directions include:

  • Sparse networks that activate only a fraction of their parameters for each input
  • Mixed-precision training that uses lower-precision arithmetic where full precision isn't needed
  • Hardware-software co-design that jointly optimizes the model architecture and the chip it runs on

These advances are critical for bringing advanced vision capabilities to mobile devices, IoT sensors, and other environments where power is limited.