Fiveable

👁️Computer Vision and Image Processing Unit 1 Review

1.2 Camera models and image formation

Written by the Fiveable Content Team • Last updated August 2025

Camera models and image formation are foundational concepts in computer vision. They explain how 3D scenes get captured as 2D images, covering everything from basic pinhole cameras to complex lens systems and digital sensors.

Understanding these models is crucial for tasks like 3D reconstruction and camera calibration. This guide covers perspective projection, lens distortion, the image formation pipeline, and the key parameters that define how cameras turn light into usable image data.

Pinhole camera model

The pinhole camera is the simplest model of image formation, and it serves as the foundation for nearly every other camera model in computer vision. Even when you're working with complex lens systems, you'll often start from the pinhole model and add corrections on top of it.

Geometry of pinhole cameras

A pinhole camera is just a light-tight box with a tiny hole (the pinhole) on one side. Light rays from the scene pass through this hole and hit the opposite wall, forming an inverted image.

  • The distance between the pinhole and the image plane is the focal length.
  • A smaller pinhole produces a sharper image but lets in less light.
  • The pinhole gives you infinite depth of field since every point in the scene projects through the same tiny aperture.
  • At extremely small aperture sizes, diffraction starts to blur the image, so there's a practical limit to how small you can make the hole.

Image formation process

Each point in the 3D scene sends light rays in all directions, but only one ray from each point passes through the pinhole. That ray hits a unique location on the image plane, which is why the pinhole model produces a clean one-to-one mapping from scene points to image points.

The result is a perspective projection: objects farther from the camera appear smaller, and parallel lines in the scene converge toward vanishing points in the image.

Perspective projection equations

The perspective projection equations map 3D world coordinates $(X, Y, Z)$ to 2D image coordinates $(x, y)$:

$$x = f \frac{X}{Z}, \quad y = f \frac{Y}{Z}$$

where $f$ is the focal length and $Z$ is the depth of the point along the optical axis.

  • Using homogeneous coordinates, the projection can be written as a single matrix multiplication.
  • In practice, you also need to account for the principal point offset and pixel scaling (covered in the intrinsic parameters section below).
  • The nonlinear division by $Z$ is what gives perspective images their characteristic look: nearby objects loom large, distant objects shrink.
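The projection equations are a one-liner in code. Here's a minimal sketch (the function name and numeric values are ours, chosen for illustration):

```python
def project_point(X, Y, Z, f):
    """Perspective projection of a 3D point onto the image plane.

    f is the focal length; Z must be positive (point in front
    of the camera). Illustrative helper, not a library function.
    """
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return f * X / Z, f * Y / Z

# A point twice as far away projects to half the image coordinates:
x1, y1 = project_point(1.0, 2.0, 4.0, 800.0)   # (200.0, 400.0)
x2, y2 = project_point(1.0, 2.0, 8.0, 800.0)   # (100.0, 200.0)
```

The halving of image coordinates with doubled depth is exactly the "distant objects shrink" effect described above.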

Lens-based camera models

Real cameras use lenses instead of pinholes because lenses gather far more light while still forming a focused image. The tradeoff is that lenses introduce optical effects like limited depth of field and distortion that the pinhole model doesn't capture.

Thin lens approximation

The thin lens model simplifies a real multi-element lens into a single ideal optical element. It assumes all light rays pass through a single plane at the optical center.

The thin lens equation relates focal length, object distance, and image distance:

$$\frac{1}{f} = \frac{1}{u} + \frac{1}{v}$$

where $f$ is the focal length, $u$ is the distance from the object to the lens, and $v$ is the distance from the lens to the focused image. This model is accurate enough for most computer vision work and much simpler than tracing rays through every lens element.
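The thin lens equation is easy to check numerically by solving for the image distance (units and values below are illustrative):

```python
def image_distance(f, u):
    """Solve the thin lens equation 1/f = 1/u + 1/v for v,
    given focal length f and object distance u (same units)."""
    return 1.0 / (1.0 / f - 1.0 / u)

# A 50 mm lens focused on an object 1 m (1000 mm) away:
v = image_distance(50.0, 1000.0)   # ≈ 52.63 mm
```

Note that $v$ is slightly larger than $f$: the image plane only sits exactly at the focal length when the object is at infinity.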

Focal length and field of view

Focal length controls how much of the scene the camera can see:

  • Shorter focal lengths give a wider field of view (wide-angle lenses).
  • Longer focal lengths give a narrower field of view (telephoto lenses).

The field of view is calculated as:

$$\mathrm{FOV} = 2 \arctan\left(\frac{d}{2f}\right)$$

where $d$ is the sensor dimension (width or height) and $f$ is the focal length. For example, a 50mm lens on a 36mm-wide full-frame sensor gives a horizontal FOV of about 39.6 degrees. Focal length also affects perspective distortion: wide-angle lenses exaggerate the size of nearby objects relative to distant ones.
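The 50mm full-frame example can be reproduced in a few lines (the function name is ours):

```python
import math

def field_of_view_deg(sensor_dim_mm, focal_length_mm):
    """Angular field of view from sensor dimension and focal length."""
    return math.degrees(2.0 * math.atan(sensor_dim_mm / (2.0 * focal_length_mm)))

fov = field_of_view_deg(36.0, 50.0)   # ≈ 39.6 degrees, matching the example
```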

Depth of field vs aperture

Depth of field (DOF) is the range of distances over which objects appear acceptably sharp.

  • Larger apertures (smaller f-numbers like f/1.8) produce shallow DOF, blurring the background.
  • Smaller apertures (larger f-numbers like f/16) produce deep DOF, keeping more of the scene in focus.
  • DOF also depends on focal length and subject distance.

The boundary of "acceptable sharpness" is defined by the circle of confusion, which is the maximum blur spot size that still looks sharp to the viewer. In computer vision, you generally want deep DOF so that features across the whole scene are in focus.

Camera intrinsic parameters

Intrinsic parameters describe the internal properties of a camera that affect how 3D points map to pixel coordinates. These stay constant for a given camera setup unless you physically change the lens or sensor.

Focal length and principal point

In the digital context, focal length is typically expressed in pixels rather than millimeters, because what matters for computation is how many pixels correspond to a given angular field of view.

  • $f_x$ and $f_y$ are the focal lengths in the x and y pixel directions.
  • The principal point $(c_x, c_y)$ is where the optical axis intersects the image plane. Ideally it's at the image center, but manufacturing tolerances often shift it slightly.

These parameters are collected into the camera intrinsic matrix:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
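Projecting a camera-space point through $K$ is one matrix multiply followed by a divide by the third coordinate. The intrinsic values below are made up for illustration:

```python
import numpy as np

# Illustrative intrinsics for a 640x480 image (values are made up).
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Project a 3D point in camera coordinates to pixel coordinates.
P = np.array([0.5, -0.25, 2.0])    # (X, Y, Z), with Z > 0
p_hom = K @ P                       # homogeneous image point
u, v = p_hom[0] / p_hom[2], p_hom[1] / p_hom[2]
# u = 800*0.5/2 + 320 = 520,  v = 800*(-0.25)/2 + 240 = 140
```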

Pixel aspect ratio

Pixel aspect ratio is the ratio of pixel width to pixel height. Most modern sensors have square pixels (1:1 ratio), which means $f_x = f_y$. When pixels aren't square, $f_x$ and $f_y$ differ, and the intrinsic matrix automatically accounts for this. You'll mostly encounter non-square pixels in older or specialized imaging hardware.

Skew coefficient

The skew coefficient $s$ accounts for any non-orthogonality between the sensor's x and y axes. With the skew parameter included, the intrinsic matrix becomes:

$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

For virtually all modern cameras, $s = 0$. It only becomes relevant with certain manufacturing defects or unusual sensor geometries. Still, calibration routines often estimate it just to verify it's negligible.

Camera extrinsic parameters

Extrinsic parameters define where the camera is and which direction it's pointing in the world. Unlike intrinsics, these change every time the camera moves.

Rotation and translation matrices

  • The rotation matrix $R$ (3×3) describes the camera's orientation relative to the world coordinate system.
  • The translation vector $t$ (3×1) describes the camera's position.
  • Together they form the 3×4 extrinsic matrix $[R \mid t]$.

Rotation can be parameterized in several ways: Euler angles (intuitive but suffer from gimbal lock), quaternions (compact and numerically stable), or rotation vectors (axis-angle representation). The choice depends on your application.


World vs camera coordinates

World coordinates describe points in the 3D scene using a fixed reference frame. Camera coordinates describe those same points relative to the camera's own position and orientation.

The transformation from world to camera coordinates is:

$$P_{\text{camera}} = R\,(P_{\text{world}} - C)$$

where $C$ is the camera center in world coordinates. This transformation is essential whenever you need to combine information from multiple camera views, as in stereo vision or structure from motion.
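A small numeric sketch of the world-to-camera transform (the rotation and camera center are arbitrary example values):

```python
import numpy as np

# Camera at C, rotated 90 degrees about the z axis (example values).
R = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
C = np.array([1.0, 0.0, 0.0])        # camera center in world coordinates

P_world = np.array([1.0, 1.0, 0.0])
P_camera = R @ (P_world - C)          # world -> camera: rotate the offset
# The point sits one unit along the camera's own x axis: [1, 0, 0]
```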

Homogeneous transformations

Homogeneous transformations combine rotation and translation into a single 4×4 matrix:

$$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$

The advantage is that you can chain multiple transformations by multiplying their matrices together. This is especially useful in robotics and multi-camera setups where you need to convert between several coordinate systems in sequence. The extra dimension in homogeneous coordinates is what makes this clean matrix multiplication possible.
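Chaining transforms really is plain matrix multiplication. This sketch packs an example rotation and translation into 4×4 matrices and composes them:

```python
import numpy as np

def make_T(R, t):
    """Pack a 3x3 rotation and a translation vector into a 4x4
    homogeneous transformation matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: rotate 90 degrees about z, then translate along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T1 = make_T(Rz, np.zeros(3))
T2 = make_T(np.eye(3), np.array([1.0, 0.0, 0.0]))

P = np.array([1.0, 0.0, 0.0, 1.0])    # homogeneous point
result = T2 @ T1 @ P                  # rotate first, then translate
# (1,0,0) rotates to (0,1,0), then shifts to (1,1,0)
```

Order matters: `T1 @ T2 @ P` would translate first and give a different result, which is why keeping transformations as explicit matrices is so useful in multi-camera chains.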

Distortion models

Real lenses don't project light exactly the way the pinhole model predicts. Distortion models capture these deviations so you can correct for them.

Radial distortion

Radial distortion happens because light rays bend differently depending on how far they are from the optical center. There are two common types:

  • Barrel distortion: straight lines bow outward (common in wide-angle lenses).
  • Pincushion distortion: straight lines bow inward (common in telephoto lenses).

The standard model uses a polynomial in the radial distance $r$ from the image center, with the same scale factor applied to both coordinates:

$$x_{\text{distorted}} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \quad y_{\text{distorted}} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

The coefficients $k_1$, $k_2$, and $k_3$ are estimated during calibration. Often just $k_1$ and $k_2$ are enough; $k_3$ is only needed for lenses with severe distortion.
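The radial model is straightforward to apply to normalized image coordinates; the coefficients below are illustrative:

```python
def apply_radial_distortion(x, y, k1, k2, k3=0.0):
    """Apply the polynomial radial distortion model to normalized
    image coordinates (x, y). Coefficients are illustrative."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2 + k3 * r2 ** 3
    return x * factor, y * factor

# Barrel distortion (k1 < 0) pulls points toward the image center:
xd, yd = apply_radial_distortion(0.5, 0.0, k1=-0.2, k2=0.0)
# r^2 = 0.25, factor = 1 - 0.05 = 0.95, so xd = 0.475
```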

Tangential distortion

Tangential distortion arises when lens elements aren't perfectly aligned with the sensor plane. It produces an asymmetric shift in pixel positions:

$$x_{\text{distorted}} = x + \left[2 p_1 x y + p_2 (r^2 + 2x^2)\right]$$
$$y_{\text{distorted}} = y + \left[p_1 (r^2 + 2y^2) + 2 p_2 x y\right]$$

The parameters $p_1$ and $p_2$ are typically much smaller than the radial coefficients in modern cameras, but ignoring them can still introduce measurable error in precision applications.

Lens distortion correction

Distortion correction (often called undistortion) removes lens distortion to produce an image that matches the ideal pinhole projection.

  1. Calibrate the camera to estimate the distortion coefficients ($k_1, k_2, k_3, p_1, p_2$).
  2. For each pixel in the output (undistorted) image, apply the inverse distortion model to find the corresponding location in the distorted input image.
  3. Interpolate pixel values from the input image to fill in the output.

This can be done as a preprocessing step or folded into the camera model itself. Either way, correcting distortion significantly improves the accuracy of feature detection, matching, and 3D reconstruction.
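Step 2 has no closed form for the polynomial model, so a common approach is fixed-point iteration. This is a minimal sketch that ignores the tangential terms (production code, such as OpenCV's remap-based undistortion, handles them too):

```python
def undistort_point(xd, yd, k1, k2, iterations=10):
    """Invert the radial distortion polynomial by fixed-point iteration.
    Minimal sketch; assumes normalized coordinates and mild distortion."""
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / factor, yd / factor
    return x, y

# Round trip: distort a point, then recover it (coefficients are made up).
k1, k2 = -0.2, 0.05
x, y = 0.5, 0.25
r2 = x * x + y * y
factor = 1.0 + k1 * r2 + k2 * r2 * r2
xd, yd = x * factor, y * factor
xu, yu = undistort_point(xd, yd, k1, k2)   # ≈ (0.5, 0.25)
```

The iteration converges quickly for mild distortion because the correction factor changes slowly with position; strongly distorted lenses (e.g. fisheyes) need the specialized models covered later.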

Camera calibration techniques

Camera calibration is the process of estimating a camera's intrinsic and extrinsic parameters from images of known objects. Without calibration, you can't reliably convert between pixel coordinates and real-world measurements.

Checkerboard pattern method

The checkerboard is the most common calibration target because its corners are easy to detect automatically and their 3D positions are known exactly.

  1. Print a checkerboard with known square dimensions (e.g., 25mm squares).
  2. Capture 10-20 images of the checkerboard at various angles and positions.
  3. Detect the inner corner points in each image using a corner detector.
  4. Establish correspondences between the known 3D corner positions and their detected 2D image locations.
  5. Solve for the camera parameters using least-squares optimization.

Varying the checkerboard's orientation across images is important because it provides the geometric diversity needed to constrain all the parameters.

Zhang's calibration algorithm

Zhang's method is the standard algorithm behind most checkerboard calibration tools (including OpenCV's). It works with a planar calibration pattern and requires at least three views.

  1. For each image, compute a homography mapping the pattern plane to the image plane.
  2. Extract constraints on the intrinsic parameters from these homographies (each homography provides two constraints).
  3. Solve a closed-form system for an initial estimate of the intrinsics.
  4. Refine all parameters (intrinsics, extrinsics, and distortion coefficients) simultaneously using nonlinear optimization (Levenberg-Marquardt).

This method is popular because it's accurate, doesn't require expensive 3D calibration rigs, and works well with a simple printed pattern.
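Step 1, the per-image homography, can be sketched with the direct linear transform (DLT). The numbers in the sanity check below are made up:

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform (DLT) for a plane-to-image homography,
    as in step 1 of Zhang's method. src/dst are Nx2 arrays of
    corresponding points, N >= 4 in general position."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)   # nullspace vector = homography entries
    return H / H[2, 2]          # fix the arbitrary scale

# Sanity check against a known homography (values are made up):
H_true = np.array([[1.2, 0.1, 5.0],
                   [0.05, 0.9, -3.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
mapped = np.hstack([src, np.ones((5, 1))]) @ H_true.T
dst = mapped[:, :2] / mapped[:, 2:3]
H = estimate_homography(src, dst)   # recovers H_true for noise-free points
```

With real corner detections the points are noisy, which is why Zhang's method follows this linear estimate with nonlinear refinement.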

Intrinsic vs extrinsic calibration

  • Intrinsic calibration determines the internal parameters: focal length, principal point, and distortion coefficients. These are fixed for a given camera/lens combination.
  • Extrinsic calibration estimates the camera's pose (position and orientation) relative to a world coordinate frame. These change whenever the camera moves.

Most calibration procedures estimate both simultaneously, but they can be separated. For instance, you might calibrate intrinsics once in the lab and then estimate extrinsics on the fly during deployment.

Stereo camera systems

Stereo systems use two (or more) cameras with a known spatial relationship to recover depth information, much like human binocular vision.

Epipolar geometry

When two cameras observe the same scene, the geometry of their arrangement constrains where a point seen in one image can appear in the other. This constraint is called epipolar geometry.

  • For a point $x$ in the first image, its corresponding point $x'$ in the second image must lie on a line called the epipolar line.
  • The fundamental matrix $F$ encodes this relationship: $x'^T F x = 0$.
  • The essential matrix $E$ is the equivalent for calibrated cameras (normalized coordinates): $E = K'^T F K$.

Epipolar geometry reduces the stereo correspondence search from a 2D problem to a 1D problem along epipolar lines, which is a huge computational savings.
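The epipolar constraint is easy to verify numerically for a simple two-camera setup (a pure translation, chosen here for illustration):

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Two calibrated cameras: the second is translated along x by the baseline.
R = np.eye(3)                      # no rotation between the cameras
t = np.array([1.0, 0.0, 0.0])      # baseline along x
E = skew(t) @ R                    # essential matrix E = [t]_x R

X = np.array([0.0, 0.0, 5.0])      # a 3D point in camera-1 coordinates
x1 = X / X[2]                      # normalized image point in camera 1
x2 = R @ X + t
x2 = x2 / x2[2]                    # normalized image point in camera 2

residual = x2 @ E @ x1             # epipolar constraint: should be ~0
```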

Stereo rectification

Stereo rectification transforms both images so that epipolar lines become horizontal and aligned with image rows. After rectification:

  • Corresponding points in the left and right images sit on the same row.
  • The search for matches becomes a simple 1D scan along each row.

Rectification involves computing a pair of homographies that rotate each camera's image plane into a common fronto-parallel configuration. This step is standard practice before running any stereo matching algorithm.


Disparity and depth estimation

Disparity is the horizontal pixel difference between corresponding points in the left and right images. It's inversely proportional to depth:

$$Z = \frac{f \cdot B}{d}$$

where $Z$ is depth, $f$ is focal length (in pixels), $B$ is the baseline (distance between cameras), and $d$ is disparity.

  • Points closer to the cameras have larger disparity; distant points have smaller disparity.
  • Dense disparity maps are computed using stereo matching algorithms like block matching or semi-global matching (SGM).
  • These depth maps are the basis for 3D reconstruction, obstacle detection, and navigation in robotics and autonomous vehicles.
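The depth formula in code, with illustrative values:

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Triangulate depth from stereo disparity.
    f in pixels, baseline in meters, disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return f_px * baseline_m / disparity_px

# f = 800 px, baseline = 0.1 m, disparity = 20 px:
Z = depth_from_disparity(800.0, 0.1, 20.0)   # 4.0 m
```

Halving the disparity doubles the depth, which is why depth resolution degrades quickly for distant objects: a one-pixel disparity error matters far more at small disparities.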

Advanced camera models

Some applications require camera models that go beyond the standard pinhole-plus-lens framework. These specialized models handle extreme fields of view or capture richer information about the scene.

Fisheye lenses

Fisheye lenses achieve fields of view up to 180 degrees or more by intentionally introducing extreme radial distortion. The standard radial distortion polynomial doesn't work well here; instead, specialized projection models are used:

  • Equidistant: $r = f\,\theta$
  • Equisolid angle: $r = 2f \sin(\theta/2)$
  • Orthographic: $r = f \sin(\theta)$

where $\theta$ is the angle between the incoming ray and the optical axis. Fisheye lenses are common in surveillance, automotive surround-view systems, and robotics where wide coverage matters more than rectilinear projection.

Omnidirectional cameras

Omnidirectional cameras capture a full 360-degree field of view. There are two main approaches:

  • Catadioptric systems combine a curved mirror with a conventional camera. The mirror shape (parabolic, hyperbolic, etc.) determines the projection geometry.
  • Multi-camera arrays stitch together images from several cameras pointing in different directions.

These systems use specialized projection models like the unified sphere model to handle the non-standard geometry. They're used in virtual tours, mobile robotics, and autonomous navigation where full situational awareness is needed.

Light field cameras

A conventional camera records only the intensity of light at each pixel. A light field camera also records the direction of incoming light rays, capturing the full 4D light field (2 spatial dimensions + 2 angular dimensions).

  • Typically implemented using a microlens array placed in front of the sensor, where each microlens captures a small range of ray directions.
  • This extra angular information enables post-capture refocusing and depth estimation without stereo.
  • Applications include computational photography, virtual reality content creation, and 3D displays.
  • The tradeoff is reduced spatial resolution, since sensor pixels are shared between spatial and angular sampling.

Image formation pipeline

The image formation pipeline describes everything that happens between photons hitting the sensor and the final digital image file. Understanding this pipeline helps you reason about image quality and artifacts.

Color filter array

Most digital cameras use a single sensor with a color filter array (CFA) on top, so each pixel records only one color channel.

  • The Bayer pattern (RGGB) is by far the most common. It has twice as many green pixels as red or blue, matching human visual sensitivity.
  • Alternatives include Fujifilm's X-Trans pattern (less prone to moiré) and RGBW patterns (better low-light sensitivity).
  • Because each pixel sees only one color, the full RGB image must be reconstructed through demosaicing.
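The RGGB layout can be sketched as boolean sampling masks (a simplification; real sensors fix the pattern phase in hardware):

```python
import numpy as np

def bayer_masks(h, w):
    """Boolean masks for an RGGB Bayer pattern of size h x w."""
    r = np.zeros((h, w), dtype=bool)
    g = np.zeros((h, w), dtype=bool)
    b = np.zeros((h, w), dtype=bool)
    r[0::2, 0::2] = True      # red: even rows, even columns
    g[0::2, 1::2] = True      # green shares every row with red or blue
    g[1::2, 0::2] = True
    b[1::2, 1::2] = True      # blue: odd rows, odd columns
    return r, g, b

r, g, b = bayer_masks(4, 6)
# Every pixel records exactly one channel, and half of them are green:
# g.mean() == 0.5, r.mean() == b.mean() == 0.25
```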

Demosaicing algorithms

Demosaicing reconstructs a full-color image from the single-channel CFA data by interpolating the missing color values at each pixel.

  • Simple methods like bilinear interpolation are fast but produce artifacts at edges (false colors, zipper effects).
  • Advanced methods use edge-aware or adaptive interpolation to preserve sharp boundaries.
  • The quality of demosaicing directly affects downstream tasks like feature detection and color-based segmentation.

White balance and color correction

The color of light in a scene varies with the light source (sunlight is bluish, incandescent light is yellowish). White balance adjusts the image so that neutral colors appear neutral regardless of the illuminant.

Color correction goes further, compensating for differences between the sensor's spectral response and the standard color space (e.g., sRGB). Both can be applied automatically in-camera or manually during post-processing. Accurate color is important in computer vision whenever algorithms rely on color information for detection, tracking, or classification.

Digital image sensors

The sensor is where photons become electrical signals. Its characteristics determine much of the camera's performance in terms of sensitivity, noise, and dynamic range.

CCD vs CMOS sensors

  • CCD (Charge-Coupled Device) sensors transfer accumulated charge across the chip row by row and read it out at a single amplifier. This produces uniform, low-noise output but is slower and more power-hungry.
  • CMOS (Complementary Metal-Oxide-Semiconductor) sensors have an amplifier and readout circuit at every pixel, enabling faster readout, lower power consumption, and easier integration with on-chip processing.

CMOS technology has improved dramatically and now dominates nearly all consumer, industrial, and scientific imaging. CCD sensors are still found in some niche scientific applications where their noise characteristics are preferred.

Quantum efficiency

Quantum efficiency (QE) measures what fraction of incoming photons actually get converted into signal electrons. A sensor with 80% QE converts 80 out of every 100 photons.

  • QE varies with wavelength, so a sensor might be more sensitive to green light than to blue or red.
  • Higher QE means better low-light performance and a higher signal-to-noise ratio.
  • Back-illuminated (BSI) sensor designs improve QE by moving the wiring layer behind the photodiode, letting more light reach the active area.

Noise sources in digital imaging

Every digital image contains noise from multiple sources:

  • Shot noise: Caused by the random arrival of photons. Follows a Poisson distribution, so it's more noticeable in darker regions. This is a fundamental physical limit.
  • Read noise: Introduced during the conversion of charge to voltage and digitization. Reduced by better electronics design.
  • Dark current noise: Electrons generated thermally even without light. Increases with exposure time and sensor temperature.
  • Fixed pattern noise: Pixel-to-pixel variations in sensitivity or dark current. Can be corrected with calibration frames.

In low-light or long-exposure situations, noise becomes a significant concern and drives the need for noise reduction in the image processing pipeline.
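The square-root behavior of shot noise is easy to demonstrate by sampling a Poisson distribution (photon count and sample size below are arbitrary):

```python
import numpy as np

# For Poisson-distributed photon counts, the standard deviation equals
# sqrt(mean), so SNR = mean / std grows as sqrt(signal).
rng = np.random.default_rng(0)
mean_photons = 1000
samples = rng.poisson(mean_photons, size=100_000)

snr = samples.mean() / samples.std()   # ≈ sqrt(1000) ≈ 31.6
```

This is why brighter exposures look cleaner: collecting 4x more photons only doubles the noise, doubling the SNR.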