Refinement in Crystal Structure Determination
Purpose and Process of Refinement
Once you have an initial structural model from phasing, refinement is the process of iteratively adjusting that model until it best reproduces the experimental diffraction data. The goal is to minimize the disagreement between what you actually measured and what your model predicts.
During refinement, the algorithm adjusts several types of parameters:
- Atomic positions (x, y, z coordinates for each atom)
- Atomic displacement parameters (B-factors, describing how much each atom vibrates or is disordered)
- Occupancy factors (the fraction of unit cells where a particular atom site is actually occupied)
Each refinement cycle recalculates structure factors from the current model, compares them to the observed data, and shifts parameters to reduce the mismatch. This repeats until changes become negligibly small (convergence).
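The cycle above can be sketched in a few lines of NumPy. This is a deliberately tiny illustration, not a real refinement engine: a one-dimensional "crystal" with one atom fixed at the origin and one at fractional coordinate x, a short made-up reflection list, and a numerical derivative in place of analytic ones. Each pass recalculates amplitudes, compares them to the "observed" data, and applies a parameter shift until the shifts become negligible.

```python
import numpy as np

# Toy 1-D "crystal": two atoms, one fixed at the origin and one at fractional
# coordinate x, so |F(h)| = |1 + exp(2*pi*i*h*x)|.
h = np.arange(1, 5)                          # illustrative Miller indices

def f_calc(x):
    return np.abs(1.0 + np.exp(2j * np.pi * h * x))

f_obs = f_calc(0.30)                         # synthetic "observed" amplitudes
x = 0.26                                     # starting model from "phasing"

for cycle in range(200):
    resid = f_obs - f_calc(x)                # compare observed vs calculated
    eps = 1e-6                               # numerical derivative d|Fc|/dx
    a = (f_calc(x + eps) - f_calc(x - eps)) / (2 * eps)
    shift = (a @ resid) / (a @ a)            # one Gauss-Newton parameter shift
    x += shift
    if abs(shift) < 1e-10:                   # convergence: negligible shifts
        break
```

Starting from x = 0.26, the loop converges on the value 0.30 used to generate the synthetic data; a poorer starting point could instead land in a false minimum, which is exactly the failure mode discussed under least squares below.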
Constraints vs. restraints play a critical role in keeping the model chemically sensible. A constraint fixes a parameter exactly (e.g., forcing two atoms to share the same position in a disordered site). A restraint allows a parameter to vary but penalizes deviations from an expected value (e.g., keeping a C–C bond length near 1.54 Å). Without these, refinement can drift toward mathematically optimal but physically meaningless solutions, especially when the data-to-parameter ratio is low.
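In practice a restraint is just an extra penalty term added to the quantity being minimized. A minimal sketch, with illustrative numbers and weights (the function name and the simple harmonic form are assumptions for this example; real programs use large restraint dictionaries):

```python
import numpy as np

def restrained_objective(f_obs, f_calc, w, bond_lengths, targets, w_restraint):
    """Least-squares data term plus harmonic bond-length restraints.

    Diffraction term:  sum w * (|Fo| - |Fc|)^2
    Restraint term:    sum w_r * (d - d_target)^2, which penalizes (but does
    not forbid) deviations from expected geometry, e.g. C-C near 1.54 A.
    """
    data_term = np.sum(w * (f_obs - f_calc) ** 2)
    restraint_term = np.sum(w_restraint * (bond_lengths - targets) ** 2)
    return data_term + restraint_term
```

A constraint, by contrast, would remove the parameter from refinement entirely rather than penalize it. The restraint weight controls how strongly geometry is enforced relative to the data, which matters most when the data-to-parameter ratio is low.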
Why Refinement Matters
Refinement is what turns a rough structural model into a publishable, trustworthy result. Specifically, it:
- Provides accurate atomic coordinates for calculating bond lengths, angles, and torsion angles
- Reveals subtle structural features like hydrogen bonding networks and ligand binding modes
- Enables meaningful comparison between related structures (e.g., wild-type vs. mutant proteins, or a series of related small molecules)
- Validates the proposed structure by demonstrating quantitative agreement with experimental data
Least Squares Refinement
Principles
Least squares (LS) refinement is the classical approach for small-molecule crystallography and remains widely used. The core idea: find the set of model parameters that minimizes the sum of squared residuals between observed and calculated data.
Because the relationship between atomic parameters and structure factor amplitudes is nonlinear, you can't solve this in one step. Instead, the method linearizes the problem around the current parameter values and solves iteratively. Each cycle computes a set of parameter shifts, applies them, and repeats.
Key assumptions of least squares refinement:
- Errors in the data follow a Gaussian (normal) distribution
- The current model is already reasonably close to the true structure (the linearization only works well near the correct answer)
- Each observation can be assigned a meaningful weight based on its estimated precision
Weighted least squares is standard practice. Reflections measured with higher precision get larger weights, so they contribute more to driving the refinement. Poorly measured reflections have less influence.
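The effect of weighting is easy to see on the simplest refinable parameter, a single scale factor k minimizing the weighted sum of squares, which has the closed form k = Σ(w·Fo·Fc) / Σ(w·Fc²). The numbers below are illustrative:

```python
import numpy as np

# Four reflections; the model is correct up to a scale of 1.5, but one
# reflection is badly measured (large sigma -> tiny weight w = 1/sigma^2).
f_calc = np.array([10.0, 20.0, 30.0, 40.0])
f_obs  = 1.5 * f_calc
f_obs[3] = 200.0                              # outlier measurement
sigma  = np.array([1.0, 1.0, 1.0, 50.0])      # its large sigma down-weights it
w = 1.0 / sigma**2

k_weighted   = np.sum(w * f_obs * f_calc) / np.sum(w * f_calc**2)
k_unweighted = np.sum(f_obs * f_calc) / np.sum(f_calc**2)
```

The weighted estimate stays near the correct scale of 1.5, while the unweighted one is dragged far off by the single poorly measured reflection.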

Mathematical Framework
The objective function to minimize is:

$$M = \sum_{hkl} w_{hkl} \left( |F_o| - |F_c| \right)^2$$

To find the minimum, you set the partial derivative with respect to each refinable parameter $p_j$ equal to zero:

$$\frac{\partial M}{\partial p_j} = -2 \sum_{hkl} w_{hkl} \left( |F_o| - |F_c| \right) \frac{\partial |F_c|}{\partial p_j} = 0$$

This produces a system of normal equations. The parameter shifts are solved as:

$$\Delta \mathbf{p} = \left( A^{\mathsf{T}} W A \right)^{-1} A^{\mathsf{T}} W \, \Delta \mathbf{F}$$
where:
- $|F_o|$ = observed structure factor amplitudes
- $|F_c|$ = calculated structure factor amplitudes from the current model
- $w_{hkl}$ = weight for each reflection (typically $w = 1/\sigma^2$)
- $A$ = design matrix containing the partial derivatives $\partial |F_c| / \partial p_j$
- $W$ = diagonal weight matrix
- $\Delta \mathbf{F}$ = vector of differences $|F_o| - |F_c|$
The matrix $A^{\mathsf{T}} W A$ is sometimes called the normal matrix. Its inverse provides estimates of parameter uncertainties (standard deviations on atomic coordinates, for example). This is one practical advantage of least squares: you get error estimates directly from the refinement.
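One cycle of this linear algebra fits in a few NumPy lines: build the normal matrix, solve for the parameter shifts, and read standard uncertainties off the diagonal of its inverse. The simulated design matrix, sigmas, and "true" shifts below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_par = 50, 3
A = rng.normal(size=(n_obs, n_par))           # partial derivatives d|Fc|/dp_j
sigma = rng.uniform(0.5, 1.5, size=n_obs)     # estimated precision per reflection
W = np.diag(1.0 / sigma**2)                   # diagonal weight matrix, w = 1/sigma^2

p_true = np.array([0.10, -0.20, 0.05])        # "true" shifts used to simulate data
dF = A @ p_true + rng.normal(scale=sigma)     # |Fo| - |Fc| with measurement noise

N = A.T @ W @ A                               # the normal matrix
shifts = np.linalg.solve(N, A.T @ W @ dF)     # this cycle's parameter shifts
esd = np.sqrt(np.diag(np.linalg.inv(N)))      # standard uncertainties of the shifts
```

Note that the error estimates come out of the same matrix used to solve for the shifts, which is the "directly from the refinement" point made above.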
Limitations
Least squares works extremely well when the model is nearly correct and the data are well-measured. It struggles when:
- The starting model is far from the true structure (risk of converging to a false minimum)
- Data quality is poor or resolution is low
- Error distributions are non-Gaussian (e.g., outlier reflections from ice rings or detector artifacts)
Maximum Likelihood Refinement
Principles and Advantages
Maximum likelihood (ML) refinement takes a fundamentally different statistical approach. Instead of minimizing residuals, it asks: given my current model (including its known imperfections), what is the probability of observing the data I actually measured? The method then adjusts parameters to maximize that probability.
This distinction matters because ML explicitly accounts for model incompleteness. The method treats the difference between the current model and the true structure as a source of error, estimating an overall coordinate error parameter ($\sigma_A$) during refinement. This makes ML particularly powerful in several situations:
- Early-stage refinement, when the model is still incomplete or partially wrong
- Low-resolution data, where the data-to-parameter ratio is unfavorable
- Macromolecular crystallography, where models are almost always incomplete (missing solvent, disordered loops, unmodeled ligands)
ML refinement also handles outliers more gracefully than least squares, because it doesn't assume strictly Gaussian errors. It converges faster and is less likely to get trapped in false minima.

Mathematical Framework
The likelihood function is the product of probabilities for all observed reflections:

$$L = \prod_{hkl} P\left( |F_o| \,;\, |F_c|, \sigma_A \right)$$

In practice, you maximize the log-likelihood (since products become sums, which are computationally easier):

$$\log L = \sum_{hkl} \log P\left( |F_o| \,;\, |F_c|, \sigma_A \right)$$
The probability distributions used depend on the reflection type:
- Acentric reflections follow a Rice distribution
- Centric reflections follow a folded normal distribution
These distributions naturally incorporate the expected error in the model, which is why ML doesn't require the model to be nearly perfect before it works well.
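For the acentric case, the Rice log-likelihood can be evaluated directly. The sketch below lumps measurement and model error into a single variance parameter for simplicity; real programs estimate it (via $\sigma_A$) per resolution shell, so treat the functional form shown as a simplified illustration:

```python
import numpy as np

def rice_loglik(f_obs, f_calc, sigma2):
    """Sum of log P(|Fo|) over acentric reflections, Rice distribution:

    P(|Fo|) = (2|Fo|/s2) * exp(-(|Fo|^2 + |Fc|^2)/s2) * I0(2|Fo||Fc|/s2)

    sigma2 is a single illustrative error variance (measurement + model).
    """
    z = 2.0 * f_obs * f_calc / sigma2
    return np.sum(np.log(2.0 * f_obs / sigma2)
                  - (f_obs**2 + f_calc**2) / sigma2
                  + np.log(np.i0(z)))        # I0: modified Bessel function

f_obs = np.array([3.0, 4.0, 5.0])
close_model = rice_loglik(f_obs, 0.98 * f_obs, 1.0)   # model near the data
poor_model  = rice_loglik(f_obs, 0.50 * f_obs, 1.0)   # model far from the data
```

As expected, the model whose calculated amplitudes sit close to the observations scores a higher log-likelihood, and refinement would shift parameters in that direction.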
ML refinement is implemented in major macromolecular programs like REFMAC5 and phenix.refine. These programs also fold in geometric restraints as prior probability distributions (a Bayesian approach), and can incorporate experimental phase information through multivariate likelihood functions.
Least Squares vs. Maximum Likelihood
| Feature | Least Squares | Maximum Likelihood |
|---|---|---|
| What it optimizes | Minimizes squared residuals | Maximizes probability of observed data |
| Error assumption | Gaussian errors | Flexible; accounts for model error |
| Model completeness | Assumes model is nearly correct | Handles incomplete models well |
| Typical use | Small-molecule structures | Macromolecular structures |
| Convergence | Can be slow if model is poor | Generally faster, more robust |
| Error estimates | From normal matrix inversion | From likelihood surface curvature |
For well-determined small-molecule structures with high-resolution data, least squares and ML give essentially identical results. The advantage of ML becomes clear with challenging data or incomplete models.
Structure Quality Evaluation
Statistical Indicators
After refinement, you need to assess how well the model actually fits the data and whether it makes chemical sense. Several metrics are standard:
R-factors are the most commonly reported statistics. The conventional R-work measures overall agreement:

$$R_{\mathrm{work}} = \frac{\sum_{hkl} \big|\, |F_o| - |F_c| \,\big|}{\sum_{hkl} |F_o|}$$
The R-free is calculated the same way but uses a small subset of reflections (typically 5%) that were excluded from refinement entirely. R-free detects overfitting: if R-work drops but R-free doesn't, you're fitting noise rather than real structural features. A large gap between R-work and R-free (more than ~5% for macromolecules) is a warning sign.
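Computing both statistics is straightforward once the free set is flagged. The simulated amplitudes and the 20% model error below are illustrative assumptions:

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum | |Fo| - |Fc| | / sum |Fo|."""
    return np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs)

rng = np.random.default_rng(1)
f_obs = rng.uniform(10.0, 100.0, size=1000)
f_calc = f_obs * (1 + rng.normal(scale=0.2, size=1000))   # imperfect model

free = rng.random(1000) < 0.05           # ~5% test set, excluded from refinement
r_work = r_factor(f_obs[~free], f_calc[~free])
r_free = r_factor(f_obs[free], f_calc[free])
```

The key discipline is that the free reflections never enter the minimization, so R-free remains an unbiased check: parameters that only chase noise improve R-work while leaving R-free behind.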
Difference Fourier maps ($F_o - F_c$) highlight regions where the model disagrees with the data. Positive peaks suggest unmodeled atoms (missing water, ligand, or alternate conformer). Negative peaks indicate atoms placed where no electron density exists.
B-factor analysis flags atoms with unusually high or low displacement parameters. Very high B-factors may indicate disorder or a misplaced atom. Abrupt jumps in B-factor along a chain suggest modeling errors.
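Flagging abrupt B-factor jumps along a chain is a one-liner over consecutive differences. The 20 Å² threshold here is an illustrative choice, not a community standard:

```python
import numpy as np

def flag_b_jumps(b_factors, max_jump=20.0):
    """Return indices i where |B[i+1] - B[i]| exceeds max_jump (illustrative
    threshold; sensible values depend on resolution and overall B)."""
    jumps = np.abs(np.diff(np.asarray(b_factors, dtype=float)))
    return np.where(jumps > max_jump)[0]

# A chain with one suspicious residue: B jumps from ~21 to 65 and back.
suspect = flag_b_jumps([20.0, 22.0, 21.0, 65.0, 23.0])
```

Both bonds flanking the outlier residue are flagged, pointing straight at the atom whose placement (or disorder modeling) deserves a second look.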
Geometric and Conformational Validation
Beyond data-fit statistics, the model must be chemically reasonable:
- Bond lengths and angles are checked against libraries of expected values (e.g., Engh & Huber restraint dictionaries for proteins)
- Ramachandran plots evaluate protein backbone conformations. Most residues should fall in favored regions; outliers need careful inspection
- Real-space correlation coefficients (RSCC) measure how well each residue's model matches the local electron density
- Non-crystallographic symmetry (NCS) consistency checks ensure that molecules related by NCS have similar conformations where expected
Validation Tools
Several widely used programs automate structure validation:
- MolProbity performs comprehensive geometric, conformational, and clash analysis
- PROCHECK assesses stereochemical quality of protein structures
- WHAT_CHECK provides detailed validation reports covering many structural criteria
- PDB-REDO automatically re-refines and re-validates deposited structures, often catching errors in published models
- Electron Density Server (EDS) generates maps for deposited structures so anyone can inspect the experimental evidence
For structures containing ligands or metal ions, specialized tools apply:
- Anomalous difference maps confirm the identity and position of metal ions and heavy atoms
- Mogul (from the CSD) validates small-molecule geometry against the Cambridge Structural Database
- Ensemble refinement models conformational heterogeneity by generating multiple structures that collectively fit the data, providing a more honest picture of molecular flexibility