Computational Genomics Unit 8 – Population Genomics and GWAS

Population genomics examines genetic variation within and between populations to understand evolutionary processes and population structure. It explores concepts like genetic diversity, drift, selection, and gene flow, using tools such as linkage disequilibrium and Hardy-Weinberg equilibrium. Genome-wide association studies (GWAS) identify genetic variants linked to traits or diseases in populations. GWAS uses case-control designs, genotyping arrays, and statistical methods to uncover associations, considering factors like population structure and multiple testing correction.

Key Concepts in Population Genomics

  • Population genomics studies genetic variation within and between populations to understand evolutionary processes and population structure
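  • Genetic diversity is the amount of heritable variation within a population or species, commonly summarized by measures such as heterozygosity or nucleotide diversity
  • Genetic drift is the change in allele frequencies from one generation to the next caused by random sampling of which individuals reproduce; its effect is strongest in small populations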
  • Natural selection is the process whereby organisms better adapted to their environment tend to survive and produce more offspring
  • Gene flow is the transfer of genetic variation from one population to another through migration or admixture
  • Linkage disequilibrium (LD) is the non-random association of alleles at different loci in a given population
    • LD can be influenced by factors such as population structure, selection, and recombination rates
  • Hardy-Weinberg equilibrium (HWE) is a state in which allele and genotype frequencies remain constant from generation to generation in the absence of evolutionary influences
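
The HWE and LD definitions above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a biallelic locus with allele frequencies p and q = 1 - p, and (for LD) known haplotype frequencies. The function names `hwe_expected` and `ld_stats` are hypothetical and chosen only for illustration.

```python
def hwe_expected(p):
    """Expected genotype frequencies (AA, Aa, aa) under HWE,
    where p is the frequency of allele A and q = 1 - p."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def ld_stats(p_ab, p_a, p_b):
    """LD between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    Returns (D, r^2). Assumes both loci are polymorphic (0 < p < 1).
    """
    d = p_ab - p_a * p_b  # deviation from independence
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, r2


if __name__ == "__main__":
    print(hwe_expected(0.7))        # genotype frequencies sum to 1
    print(ld_stats(0.5, 0.5, 0.5))  # alleles always co-occur: complete LD
    print(ld_stats(0.25, 0.5, 0.5)) # independent loci: D = 0
```

When D is zero the alleles at the two loci are statistically independent; r² close to 1 means one locus is a near-perfect proxy for the other, which is the property that imputation and tag-SNP genotyping rely on.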

Genetic Variation and Population Structure

  • Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation used in population genomics studies
  • Copy number variations (CNVs) and insertions/deletions (indels) also contribute to genetic variation within populations
  • Population structure refers to the presence of genetically distinct subgroups within a population
  • Principal component analysis (PCA) is a statistical method used to visualize and assess population structure
    • PCA reduces high-dimensional genotype data to a small number of principal components that capture the major axes of genetic variation, which often reflect ancestry
  • Admixture analysis estimates the proportions of an individual's genome that originate from different ancestral populations
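  • The fixation index (Fst), one of Wright's F-statistics, measures the degree of genetic differentiation between populations
    • Fst values range from 0 (no differentiation) to 1 (complete differentiation, with populations fixed for different alleles)
  • Isolation by distance is a pattern where genetic differences between populations increase with geographic distance due to limited gene flow (it is sometimes abbreviated IBD, an abbreviation also used for identity by descent)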
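
To illustrate how Fst behaves at its two extremes, here is a minimal sketch for a single biallelic SNP. It assumes equally weighted populations and uses the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T. It is not one of the sample-size-corrected estimators (such as Weir and Cockerham's) that real analyses typically use. The function name `fst_biallelic` is hypothetical.

```python
def fst_biallelic(freqs):
    """Fst for one biallelic SNP from per-population allele frequencies.

    freqs: list of allele frequencies, one per population (equal weights).
    H_S is the mean expected heterozygosity within populations;
    H_T is the expected heterozygosity of the pooled population.
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        return 0.0  # monomorphic everywhere: no differentiation to measure
    h_s = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    print(fst_biallelic([0.5, 0.5]))  # identical frequencies
    print(fst_biallelic([0.2, 0.8]))  # moderate differentiation
    print(fst_biallelic([0.0, 1.0]))  # fixed for different alleles
```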

GWAS Fundamentals

  • Genome-wide association studies (GWAS) aim to identify genetic variants associated with traits or diseases in a population
  • For binary traits, GWAS typically use a case-control design, comparing allele frequencies between individuals with (cases) and without (controls) a specific phenotype; quantitative traits are analyzed across a cohort instead
  • The common disease-common variant (CDCV) hypothesis suggests that common diseases are influenced by common genetic variants with small effect sizes
  • Genotyping arrays are used to simultaneously genotype hundreds of thousands to millions of SNPs across the genome
  • Imputation is the process of inferring unobserved genotypes based on reference panels and linkage disequilibrium patterns
  • Multiple testing correction is essential in GWAS to control for false positives due to the large number of statistical tests performed
    • Bonferroni correction, which controls the family-wise error rate, and false discovery rate (FDR) control are commonly used methods for multiple testing correction
  • Manhattan plots visualize GWAS results, with the negative base-10 logarithm of each SNP's p-value on the y-axis plotted against its genomic position on the x-axis
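
The multiple-testing ideas above can be sketched in a few lines. The example below assumes a plain list of p-values and shows a Bonferroni threshold, the Benjamini-Hochberg step-up procedure (one common way to control the FDR, swapped in here as an illustration since the text does not name a specific FDR method), and the transformation used for the Manhattan plot y-axis. All function names are hypothetical.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans (same order as pvals), True where the
    hypothesis is rejected at FDR level q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


def manhattan_y(p):
    """y-axis value for a Manhattan plot: -log10(p)."""
    return -math.log10(p)


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(bonferroni_threshold(0.05, len(pvals)))
    print(benjamini_hochberg(pvals))
    print(manhattan_y(5e-8))
```

Bonferroni is the more conservative of the two: it guards against even a single false positive, whereas FDR control tolerates a bounded fraction of false positives among the reported hits.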

Data Collection and Quality Control

  • Study design considerations for GWAS include sample size, case-control ratio, and population stratification
  • Genotyping quality control (QC) steps are crucial to ensure the accuracy and reliability of GWAS results
  • SNP QC measures include call rate, minor allele frequency (MAF), and Hardy-Weinberg equilibrium (HWE) testing
    • SNPs with low call rates, low MAF, or deviations from HWE are often excluded from analysis
  • Sample QC measures include individual call rate, heterozygosity, and relatedness checks
    • Samples with low call rates, extreme heterozygosity, or cryptic relatedness may be removed
  • Population stratification can lead to spurious associations and is often addressed using principal component analysis (PCA) or mixed models
  • Batch effects can arise from technical factors (genotyping platform, lab, or processing date) and should be identified and corrected
  • Phenotype data quality is equally important, with considerations for phenotype definition, measurement, and harmonization across studies
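
The SNP-level QC measures above can be sketched for a single variant as follows. This assumes genotypes are coded as 0, 1, or 2 copies of the counted allele, with `None` for a missing call. The thresholds (call rate 0.98, MAF 0.01, HWE p-value 1e-6) are illustrative assumptions, not values taken from this guide, and the HWE check uses a 1-degree-of-freedom chi-square approximation where production pipelines often use an exact test. The function name `snp_qc` is hypothetical.

```python
import math


def snp_qc(genotypes, min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-6):
    """Compute call rate, MAF, and an HWE chi-square p-value for one SNP.

    genotypes: list with values 0, 1, 2 (allele counts) or None (missing).
    Thresholds are illustrative defaults, not recommendations.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "passes": False}

    n_hom_ref, n_het, n_hom_alt = called.count(0), called.count(1), called.count(2)
    p = (2 * n_hom_alt + n_het) / (2 * n)  # frequency of the counted allele
    maf = min(p, 1.0 - p)

    # Chi-square goodness of fit against HWE expectations (1 degree of freedom).
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df

    passes = call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passes": passes}


if __name__ == "__main__":
    good = [0] * 25 + [1] * 50 + [2] * 25   # matches HWE proportions
    no_hets = [0] * 50 + [2] * 50           # heterozygote deficit
    print(snp_qc(good))
    print(snp_qc(no_hets))
```

A complete absence of heterozygotes at a common SNP, as in the second call, is the kind of pattern that usually points to a genotyping artifact rather than biology, which is why HWE testing is used as a QC filter.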

Statistical Methods in GWAS

  • Single-SNP association tests, such as the chi-square test or logistic regression, are used to assess the association between each SNP and the phenotype of interest
  • Linear regression is used for quantitative traits, while logistic regression is used for binary traits (case-control studies)
  • Covariates, such as age, sex, and principal components, can be included in the regression models to adjust for potential confounding factors
  • Mixed linear models (MLMs) are used to account for population structure and cryptic relatedness by incorporating a kinship matrix
  • Meta-analysis combines GWAS results from multiple studies to increase statistical power and identify robust associations
    • Fixed-effect and random-effect models are used depending on the heterogeneity of effect sizes across studies
  • Bayesian methods, such as Bayesian variable selection regression (BVSR), can be used to prioritize SNPs and estimate their effect sizes
  • Polygenic risk scores (PRS) aggregate the effects of multiple SNPs to predict an individual's risk for a specific trait or disease
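
As an illustration of the single-SNP tests described at the top of this list, here is a minimal sketch of an allelic chi-square test on a 2×2 table of allele counts in cases and controls. It assumes all four cells are non-zero and applies no continuity correction or covariate adjustment; in practice, logistic regression with covariates is the more common choice. The function name `allelic_chi2` is hypothetical.

```python
import math


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Allelic association test for one SNP.

    Arguments are allele counts (not genotype counts) in cases and controls.
    Returns (chi-square statistic, p-value with 1 df, odds ratio).
    Assumes every cell count is greater than zero.
    """
    observed = ((case_alt, case_ref), (ctrl_alt, ctrl_ref))
    row_totals = (case_alt + case_ref, ctrl_alt + ctrl_ref)
    col_totals = (case_alt + ctrl_alt, case_ref + ctrl_ref)
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


if __name__ == "__main__":
    # Alternate allele is more frequent in cases (60%) than controls (40%).
    chi2, p, oratio = allelic_chi2(60, 40, 40, 60)
    print(chi2, p, oratio)
```

In a real GWAS this test is repeated for every SNP, which is why the resulting p-values must then be corrected for multiple testing.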

Interpreting GWAS Results

  • The genome-wide significance threshold is typically set at $p < 5 \times 10^{-8}$ to account for multiple testing in GWAS
  • Regional association plots (such as those produced by LocusZoom) visualize the association signals and linkage disequilibrium patterns in a specific genomic region
  • Functional annotation of GWAS hits involves integrating information from various sources (e.g., gene expression, epigenetics, and biological pathways) to understand their potential functional impact
  • Heritability is the proportion of phenotypic variance explained by genetic factors; the portion attributable to common SNPs (SNP heritability) can be estimated from GWAS summary statistics
  • Genetic correlation analysis assesses the shared genetic basis between different traits or diseases using GWAS summary statistics
  • Mendelian randomization uses genetic variants as instrumental variables to infer causal relationships between exposures and outcomes
  • Replication of GWAS findings in independent cohorts is essential to validate the associations and assess their generalizability
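
The Mendelian randomization idea above can be illustrated with its simplest estimator, the Wald ratio for a single variant: the variant's effect on the outcome divided by its effect on the exposure. The sketch below assumes the variant is a valid instrument (associated with the exposure, and affecting the outcome only through it) and uses the common first-order approximation for the standard error that ignores uncertainty in the variant-exposure effect. The function name `wald_ratio` is hypothetical.

```python
def wald_ratio(beta_gx, beta_gy, se_gy):
    """Single-variant Mendelian randomization estimate.

    beta_gx: effect of the variant on the exposure
    beta_gy: effect of the variant on the outcome
    se_gy:   standard error of beta_gy
    Returns (causal effect estimate, approximate standard error).
    """
    if beta_gx == 0:
        raise ValueError("variant has no effect on the exposure; not a usable instrument")
    estimate = beta_gy / beta_gx
    # First-order approximation: treats beta_gx as known without error.
    se = se_gy / abs(beta_gx)
    return estimate, se


if __name__ == "__main__":
    est, se = wald_ratio(beta_gx=0.2, beta_gy=0.05, se_gy=0.01)
    print(est, se)
```

Analyses with many variants typically combine such per-variant ratios, and the estimate is only as trustworthy as the instrument assumptions behind it.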

Challenges and Limitations

  • Missing heritability refers to the gap between the heritability estimates from family studies and the variance explained by GWAS-identified variants
  • Rare variants (MAF < 1%) are not well captured by standard GWAS genotyping arrays and may require sequencing-based approaches
  • Gene-environment interactions can modulate the effect of genetic variants on the phenotype but are often not accounted for in GWAS
  • Phenotypic heterogeneity, where different genetic variants contribute to different subtypes of a disease, can reduce the power of GWAS
  • Population-specific genetic effects may limit the transferability of GWAS findings across diverse populations
  • Biological interpretation of GWAS results can be challenging, as associated variants may not be the causal variants and may affect genes or regulatory elements distant from the SNP
  • Ethical considerations, such as informed consent, data privacy, and the potential for genetic discrimination, must be addressed in GWAS

Applications and Future Directions

  • Drug target discovery and repositioning: GWAS can identify novel therapeutic targets and suggest potential drug repurposing opportunities
  • Precision medicine: GWAS findings can inform personalized risk prediction, diagnosis, and treatment strategies
  • Integration of multi-omics data (transcriptomics, epigenomics, and proteomics) can provide a more comprehensive understanding of the biological mechanisms underlying GWAS associations
  • Fine-mapping and functional validation studies are necessary to pinpoint the causal variants and elucidate their functional consequences
  • Trans-ancestry (also called transethnic) GWAS and meta-analyses can improve the power to detect associations and assess the generalizability of findings across diverse populations
  • Polygenic risk scores (PRS) have the potential to improve risk stratification and targeted interventions, but their clinical utility and ethical implications need to be carefully considered
  • Machine learning and artificial intelligence approaches can be applied to GWAS data to improve risk prediction, identify novel associations, and uncover complex genetic architectures
  • Collaboration and data sharing among researchers, institutions, and countries are crucial to accelerate progress in GWAS and translate the findings into tangible benefits for human health
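
In its simplest form, the polygenic risk score discussed in this unit is a weighted sum of an individual's allele counts, with per-SNP weights taken from GWAS effect size estimates. The following minimal sketch assumes the SNPs have already been selected and that dosages and weights refer to the same effect allele. The function name `polygenic_score` and the example values are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: effect-allele counts per SNP (0, 1, 2, or fractional if imputed)
    weights: per-SNP effect sizes, e.g. GWAS betas or log odds ratios
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


if __name__ == "__main__":
    # Three SNPs with illustrative (made-up) effect sizes.
    print(polygenic_score([0, 1, 2], [0.1, -0.2, 0.3]))
```

Because the weights come from a particular discovery population, a score built this way may transfer poorly to populations with different allele frequencies or LD patterns, which is one reason its clinical use needs care.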


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.