Genome-Wide Association Studies (GWAS) are a method for finding genetic variants linked to complex traits and diseases. Rather than studying one gene at a time, GWAS scan the entire genome across thousands of people, looking for small DNA differences that show up more often in individuals with a particular trait or condition. This approach has transformed how geneticists study polygenic traits, where many variants each contribute a small effect.

Purpose and Methodology of GWAS

The goal of GWAS is to identify specific genetic variants, usually single nucleotide polymorphisms (SNPs), that are statistically associated with a complex trait (like height) or disease (like type 2 diabetes). These studies don't start with a hypothesis about which gene matters. Instead, they take an unbiased look across the whole genome.

The methodology follows a clear sequence:

1. Recruit a large study population, typically thousands to hundreds of thousands of individuals. In a case-control design, some are cases (have the disease or trait) and others are controls (don't have it). For quantitative traits, you simply measure the trait value in everyone.
2. Genotype each individual for a dense set of SNPs across the genome using DNA microarrays or sequencing.
3. Collect phenotypic data for every participant (disease status, trait measurements, etc.).
4. Run a statistical test at each SNP position. For case-control studies, this is typically a chi-square test or logistic regression comparing allele frequencies between cases and controls; for quantitative traits, it is typically a linear regression of the trait value on genotype.
5. Correct for multiple testing using methods like the Bonferroni correction, since you're testing hundreds of thousands or millions of SNPs simultaneously and need to avoid false positives.
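
The testing and correction steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the SNP names and allele counts are hypothetical, and it uses a simple allelic chi-square test on a 2x2 table (real analyses usually use regression models with covariates).

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value). Assumes every row and column total is nonzero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical allele counts per SNP: (case_alt, case_ref, ctrl_alt, ctrl_ref).
snp_counts = {
    "snp_1": (1300, 2700, 1000, 3000),
    "snp_2": (1010, 2990, 1000, 3000),
    "snp_3": (2050, 1950, 2000, 2000),
}

# Bonferroni: divide the desired family-wise error rate by the number of tests.
alpha = 0.05
bonferroni_threshold = alpha / len(snp_counts)

results = {}
for snp, counts in snp_counts.items():
    chi2, p = allelic_chi_square(*counts)
    results[snp] = (chi2, p, p < bonferroni_threshold)
    print(f"{snp}: chi2={chi2:.2f}, p={p:.3g}, significant={p < bonferroni_threshold}")
```

With only three SNPs the corrected threshold is 0.05 / 3; in a real GWAS the same division is applied over hundreds of thousands or millions of tests, which is why the threshold becomes so small.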

SNPs as Genetic Markers

A single nucleotide polymorphism (SNP) is a position in the genome where a single base differs between individuals. For example, at a given position, some people might carry an A while others carry a G. SNPs are the most common type of genetic variation in the human genome, with millions catalogued across human populations.

SNPs make ideal markers for GWAS for three reasons:

Abundance: They occur roughly every 300 base pairs on average, providing dense coverage of the genome.
Stability: SNPs have low mutation rates, so they're reliably inherited across generations.
Easy genotyping: Modern microarray chips can genotype over a million SNPs per individual in a single experiment.

In a typical GWAS, researchers select a set of SNPs spaced to cover the entire genome at a certain density (for example, one SNP roughly every 5,000 base pairs). These SNPs don't need to be the actual causal variants. Because of linkage disequilibrium (LD), nearby SNPs tend to be inherited together. So a genotyped SNP can serve as a proxy for other variants in its neighborhood, effectively tagging a region of the genome without needing to sequence every base.
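
How well one SNP tags another is commonly summarized by the squared correlation $r^2$ between their alleles. The sketch below computes $r^2$ from phased haplotypes coded 0/1; the haplotype data and variable names are hypothetical and chosen only to illustrate the idea.

```python
def ld_r_squared(hap_a, hap_b):
    """r^2 between two biallelic SNPs from phased haplotypes coded 0/1."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and of equal length")
    p_a = sum(hap_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n  # frequency of allele 1 at SNP B
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom

# Hypothetical phased haplotypes (one entry per chromosome copy).
tag_snp     = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
nearby_snp  = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
distant_snp = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

r2_nearby = ld_r_squared(tag_snp, nearby_snp)
r2_distant = ld_r_squared(tag_snp, distant_snp)
print(f"r^2 with nearby SNP:  {r2_nearby:.3f}")
print(f"r^2 with distant SNP: {r2_distant:.3f}")
```

An $r^2$ near 1 means the genotyped SNP is a good proxy for the other variant; an $r^2$ near 0 means it carries almost no information about it.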

Interpretation of GWAS Results

GWAS results are typically displayed as a Manhattan plot, where each dot represents a SNP, the x-axis shows chromosomal position, and the y-axis shows $-\log_{10}$ of the p-value, so smaller p-values plot higher. SNPs that rise above the significance threshold form tall "skyscrapers" on the plot.
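
As a rough sketch of how the plot coordinates are derived, the code below converts summary statistics into (x, y) pairs without drawing anything. The summary statistics are hypothetical, and using the largest observed position as a stand-in for chromosome length is a simplifying assumption.

```python
import math

# Hypothetical summary statistics: (chromosome, base-pair position, p-value).
summary_stats = [
    (1, 1_000_000, 0.20),
    (1, 5_000_000, 3e-9),
    (2, 2_000_000, 0.01),
    (2, 8_000_000, 4e-5),
]

def manhattan_coordinates(stats):
    """Return (x, y) pairs: cumulative genomic position and -log10(p)."""
    # Use the largest observed position per chromosome as a stand-in for its length.
    chrom_max = {}
    for chrom, pos, _ in stats:
        chrom_max[chrom] = max(chrom_max.get(chrom, 0), pos)
    # Offset each chromosome by the total span of the chromosomes before it,
    # so chromosomes are laid end to end along the x-axis.
    offsets, running = {}, 0
    for chrom in sorted(chrom_max):
        offsets[chrom] = running
        running += chrom_max[chrom]
    return [(offsets[c] + pos, -math.log10(p)) for c, pos, p in stats]

coords = manhattan_coordinates(summary_stats)
threshold_y = -math.log10(5e-8)  # height of the genome-wide significance line
hits = [xy for xy in coords if xy[1] > threshold_y]
print(f"significance line at y = {threshold_y:.2f}")
print("SNPs above the line:", hits)
```

The resulting pairs can be passed to any plotting library; the significance line sits at a height of about 7.3 on this scale.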

Significance threshold: A SNP is considered significantly associated with the trait if its p-value falls below $5 \times 10^{-8}$. This threshold is deliberately strict because of the massive number of tests being performed. If you test 1 million SNPs at $p < 0.05$, you'd expect about 50,000 false positives by chance alone, even if no SNP were truly associated. The genome-wide threshold accounts for this.
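
The arithmetic behind these numbers can be written out explicitly. The lines below assume $m = 10^6$ effectively independent tests, a conventional approximation for common variants, and an error rate of $\alpha = 0.05$; the symbols $m$ and $\alpha$ are introduced here only for illustration. The first expression is the expected number of false positives when every SNP is tested at $\alpha$ and none is truly associated; the second is the Bonferroni-corrected per-test threshold.

$$
\mathbb{E}[\text{false positives}] = m \cdot \alpha = 10^{6} \times 0.05 = 50{,}000
\qquad
\alpha_{\text{Bonferroni}} = \frac{\alpha}{m} = \frac{0.05}{10^{6}} = 5 \times 10^{-8}
$$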

What the p-value means: It represents the probability of seeing an association this strong (or stronger) purely by chance, assuming there's actually no real association. A smaller p-value means stronger statistical evidence that the SNP is truly linked to the trait.

One critical point: a significant GWAS hit tells you a region of the genome is associated with the trait. It doesn't automatically tell you which gene is responsible or how the variant affects biology. Most significant SNPs fall in non-coding regions, so connecting a hit to a specific gene or mechanism requires follow-up work.

Challenges and Limitations of GWAS

Population stratification is one of the biggest confounders. If your study population contains subgroups with different genetic ancestries (for example, European and African descent), and those subgroups also differ in disease prevalence for non-genetic reasons (diet, environment), you can get false associations. A SNP might appear linked to the disease simply because it's more common in one ancestry group, not because it actually influences the trait. Researchers address this by including principal components of genetic ancestry as covariates in their statistical models, or by using family-based study designs.
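
A toy example shows how stratification produces a false association. The allele counts below are hypothetical and constructed so that the SNP has no association within either ancestry group. Note that this sketch simply compares pooled and within-group tests; it is not the principal-components adjustment described above.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 allele-count table; returns (chi2, p)."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
# Group A: allele frequency 0.2 in cases AND controls, but mostly cases sampled.
# Group B: allele frequency 0.6 in cases AND controls, but mostly controls sampled.
group_a = (320, 1280, 80, 320)
group_b = (240, 160, 960, 640)

# Pooling ignores ancestry and simply adds the counts together.
pooled = tuple(a + b for a, b in zip(group_a, group_b))

p_group_a = allelic_chi_square(*group_a)[1]
p_group_b = allelic_chi_square(*group_b)[1]
p_pooled = allelic_chi_square(*pooled)[1]

print(f"within group A: p = {p_group_a:.3g}")
print(f"within group B: p = {p_group_b:.3g}")
print(f"pooled:         p = {p_pooled:.3g}")
```

Within each group the allele frequency is identical in cases and controls, yet the pooled test is extremely significant, purely because the two groups differ in both allele frequency and case proportion.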

Large sample sizes are essential. Complex traits are influenced by many variants, each with a small effect. Detecting these tiny effects requires enormous statistical power, which means enrolling thousands to hundreds of thousands of participants. Underpowered studies miss real associations (false negatives) and produce unreliable effect size estimates.
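
To get a feel for why sample size matters so much, here is a rough power calculation for a single SNP. It is a sketch under stated assumptions: an additive model, Hardy-Weinberg equilibrium, a trait standardized to unit variance, and a normal approximation to the test statistic. The function name, allele frequency, and effect size are illustrative choices, not values from any particular study.

```python
from statistics import NormalDist

def approx_power(n, maf, beta, alpha=5e-8):
    """Approximate power of a two-sided 1-df test for one SNP.

    n:    sample size
    maf:  minor allele frequency
    beta: per-allele effect in trait standard deviations
    """
    std_normal = NormalDist()
    # Fraction of trait variance explained by the SNP under an additive model.
    variance_explained = 2 * maf * (1 - maf) * beta ** 2
    # Non-centrality parameter of the association test.
    ncp = n * variance_explained / (1 - variance_explained)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = ncp ** 0.5
    # Probability that the test statistic lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (10_000, 50_000, 100_000, 200_000):
    power = approx_power(n, maf=0.3, beta=0.03)
    print(f"N = {n:>7,}: power = {power:.4f}")
```

Under these assumptions, a variant with an effect of 0.03 standard deviations is almost never detected with 10,000 participants, is detected most of the time with 100,000, and is detected nearly always with 200,000.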

Missing heritability is another ongoing challenge. Even the largest GWAS typically explain only a fraction of a trait's estimated heritability. The remaining "missing" heritability may come from rare variants that SNP arrays don't capture well, gene-gene interactions, gene-environment interactions, or structural variants like copy number changes.

Applications of GWAS Findings

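Identifying candidate genes: Significant SNPs are mapped to nearby genes based on genomic position. For example, GWAS for breast cancer risk identified common variants near genes such as FGFR2 and TOX3, loci that had not previously been suspected. (The well-known high-risk genes BRCA1 and BRCA2 were found earlier through family-based linkage studies, because their pathogenic mutations are rare and have large effects, which is not the kind of variant GWAS is designed to detect.) These candidate genes then become targets for functional studies in the lab.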
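
Positional mapping can be sketched as a nearest-gene lookup. The gene names, coordinates, and the 250,000 bp distance cutoff below are all hypothetical, and the approach is a simplification: the nearest gene is not necessarily the gene through which the variant acts.

```python
# Hypothetical gene annotations on one chromosome: (name, start, end) in base pairs.
genes = [
    ("GENE_A", 100_000, 150_000),
    ("GENE_B", 400_000, 480_000),
    ("GENE_C", 900_000, 950_000),
]

def distance_to_gene(position, start, end):
    """Return 0 if the SNP lies inside the gene, else the gap to its nearest edge."""
    if start <= position <= end:
        return 0
    return min(abs(position - start), abs(position - end))

def nearest_gene(position, gene_list, max_distance=250_000):
    """Return (gene_name, distance), or (None, None) if no gene is within max_distance."""
    name, start, end = min(
        gene_list, key=lambda g: distance_to_gene(position, g[1], g[2])
    )
    dist = distance_to_gene(position, start, end)
    if dist > max_distance:
        return None, None
    return name, dist

for snp_position in (420_000, 200_000, 1_500_000):
    print(snp_position, "->", nearest_gene(snp_position, genes))
```

A SNP inside a gene gets distance 0, an intergenic SNP is assigned to the closest gene within the cutoff, and a SNP far from every gene is left unassigned.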

Revealing biological pathways: When you collect all the candidate genes from a GWAS, you can run pathway enrichment analyses to see whether certain biological processes are overrepresented. A GWAS for type 2 diabetes, for instance, highlighted genes involved in insulin signaling and pancreatic beta-cell function. This kind of analysis moves you from a list of statistical hits to a biological story about how the trait works at a molecular level.
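
One common way to test for overrepresentation is a hypergeometric test, sketched below. The gene counts are hypothetical, and dedicated enrichment tools apply further corrections (for example, for testing many pathways at once) that this sketch omits.

```python
from math import comb

def hypergeom_enrichment_p(total_genes, pathway_genes, candidate_genes, overlap):
    """Probability of seeing at least `overlap` pathway genes among the candidates
    if the candidates were drawn at random, without replacement, from all genes."""
    max_overlap = min(pathway_genes, candidate_genes)
    denom = comb(total_genes, candidate_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, candidate_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denom

# Hypothetical numbers: 20,000 genes in the genome, 100 in the pathway,
# 50 GWAS candidate genes, 5 of which belong to the pathway.
p_enrichment = hypergeom_enrichment_p(
    total_genes=20_000, pathway_genes=100, candidate_genes=50, overlap=5
)
expected_overlap = 50 * 100 / 20_000
print(f"expected overlap by chance: {expected_overlap}")
print(f"enrichment p-value: {p_enrichment:.3g}")
```

Here only a quarter of a gene would be expected to overlap by chance, so observing five pathway genes among the candidates yields a very small p-value.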

Polygenic risk scores (PRS): GWAS results can be combined across many SNPs to calculate a single score that estimates an individual's genetic predisposition to a trait or disease. While no single SNP has a large effect for most complex traits, the cumulative effect of thousands of small-effect variants can be informative for risk prediction. PRS are still limited in clinical utility, however, and tend to perform best in populations whose ancestry matches that of the original GWAS participants.
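
In its simplest form, a PRS is a weighted sum of effect-allele counts, with GWAS effect sizes as the weights. The sketch below uses hypothetical SNP IDs, effect sizes, and genotypes; real scores involve thousands of SNPs and additional steps, such as accounting for linkage disequilibrium, that are omitted here.

```python
# Hypothetical GWAS effect sizes (per-allele log odds ratios), keyed by SNP ID.
effect_sizes = {"snp_1": 0.12, "snp_2": -0.05, "snp_3": 0.08}

# Hypothetical genotypes: number of effect alleles (0, 1, or 2) carried at each SNP.
genotypes = {
    "person_a": {"snp_1": 2, "snp_2": 0, "snp_3": 1},
    "person_b": {"snp_1": 0, "snp_2": 2, "snp_3": 0},
}

def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele counts; SNPs missing from dosages are skipped."""
    return sum(weights[snp] * dosages[snp] for snp in weights if snp in dosages)

scores = {
    person: polygenic_risk_score(dosages, effect_sizes)
    for person, dosages in genotypes.items()
}
for person, score in scores.items():
    print(f"{person}: PRS = {score:.2f}")
```

A raw score like this is only meaningful relative to a reference distribution, so in practice it is usually standardized against scores from a comparable population.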
