Copy number variations (CNVs) are structural changes in DNA that impact gene dosage and expression. They can range from small deletions to large duplications, affecting phenotypes and disease risk. Understanding CNVs is crucial for genomic analysis.

CNV detection methods include array-based and sequencing-based approaches. Each has pros and cons in resolution, cost, and ability to detect different CNV types. Computational challenges in CNV analysis involve alignment issues, artifact distinction, and handling repetitive regions.

Types of CNVs

  • Copy number variations (CNVs) are structural variations in the genome that involve changes in the number of copies of specific DNA segments, ranging from a few hundred base pairs to several megabases in size
  • CNVs can encompass genes or regulatory elements, leading to dosage imbalances and potentially impacting gene expression and phenotypic traits

Deletions vs duplications

  • Deletions involve the loss of one or more copies of a DNA segment, resulting in a reduced copy number compared to the reference genome
    • Can lead to haploinsufficiency or complete loss of gene function (null alleles)
    • Examples: 22q11.2 deletion syndrome, 16p11.2 deletion associated with autism spectrum disorder
  • Duplications involve the gain of one or more copies of a DNA segment, resulting in an increased copy number
    • Can lead to increased gene dosage and potentially enhanced or altered gene function
    • Examples: Charcot-Marie-Tooth disease type 1A (PMP22 duplication), 15q11-q13 duplication associated with autism spectrum disorder

Insertions vs translocations

  • Insertions involve the addition of a DNA segment at a new location in the genome, without replacing existing sequences
    • Can disrupt genes or regulatory elements at the insertion site, or create fusion genes
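    • Related examples of added sequence: Huntington's disease (CAG repeat expansion), Fragile X syndrome (CGG repeat expansion); both are tandem repeat expansions, which lengthen an existing repeat tract rather than placing a segment at a new locus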
  • Translocations involve the exchange of DNA segments between non-homologous chromosomes or different locations on the same chromosome
    • Can create fusion genes or disrupt gene function at the breakpoints
    • Examples: Philadelphia chromosome (BCR-ABL fusion) in chronic myeloid leukemia, t(11;22) translocation in Ewing's sarcoma

Simple vs complex CNVs

  • Simple CNVs involve a single contiguous DNA segment with a uniform copy number change
    • Easier to detect and interpret using current computational methods
    • Examples: Single-gene deletions or duplications, recurrent microdeletion or microduplication syndromes
  • Complex CNVs involve multiple non-contiguous DNA segments, often with varying copy numbers and breakpoints
    • More challenging to detect and interpret, requiring advanced computational approaches
    • Examples: Chromothripsis (chromosome shattering and reassembly), complex rearrangements in cancer genomes

Detection methods for CNVs

  • Various experimental and computational methods have been developed to detect and characterize CNVs in individual genomes or populations, each with its own strengths and limitations
  • The choice of detection method depends on factors such as the desired resolution, coverage, cost, and availability of samples and resources

Array-based approaches

  • Microarray platforms (SNP arrays, CGH arrays) hybridize genomic DNA to probes targeting specific loci or regions of the genome
    • Intensity signals are compared between test and reference samples to infer copy number changes
    • Provide genome-wide coverage at moderate resolution (tens to hundreds of kilobases)
    • Examples: Affymetrix SNP arrays, Agilent CGH arrays
  • Advantages: Cost-effective, high-throughput, and well-established data analysis pipelines
  • Limitations: Limited resolution, difficulty detecting balanced rearrangements or novel insertions, prone to batch effects and noise
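
The intensity comparison described above can be sketched as a per-probe log2 ratio between test and reference signals. This is a minimal illustration, not a platform pipeline: the function names, the example intensities, and the ±0.3 thresholds are assumptions chosen for the example, and real array analysis also involves normalization and segmentation across neighboring probes.

```python
import math

def log2_ratios(test_intensities, reference_intensities):
    """Per-probe log2(test / reference) intensity ratio."""
    return [
        math.log2(t / r)
        for t, r in zip(test_intensities, reference_intensities)
    ]

def classify_probe(ratio, gain_threshold=0.3, loss_threshold=-0.3):
    """Label one probe. The thresholds are illustrative assumptions."""
    if ratio >= gain_threshold:
        return "gain"
    if ratio <= loss_threshold:
        return "loss"
    return "neutral"

# Hypothetical normalized intensities for four probes.
test_intensities = [1000.0, 500.0, 1500.0, 980.0]
reference_intensities = [1000.0, 1000.0, 1000.0, 1000.0]

ratios = log2_ratios(test_intensities, reference_intensities)
calls = [classify_probe(r) for r in ratios]
print(calls)  # ['neutral', 'loss', 'gain', 'neutral']
```

In an idealized diploid region, a one-copy loss gives log2(1/2) = -1 and a one-copy gain gives log2(3/2), roughly 0.58; measured ratios are usually compressed toward zero by noise and sample heterogeneity.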

Sequencing-based approaches

  • Next-generation sequencing (NGS) technologies generate millions of short reads that are mapped to a reference genome to identify CNVs
    • Read depth analysis compares the number of mapped reads in a region to the expected coverage based on a reference sample or population
    • Split-read and paired-end mapping detect CNVs by identifying discordant or unmapped reads spanning breakpoints
    • Provide higher resolution (single-base to kilobase level) and the ability to detect novel or complex rearrangements
    • Examples: Whole-genome sequencing (WGS), whole-exome sequencing (WES), targeted sequencing panels
  • Advantages: High resolution, ability to detect balanced and complex rearrangements, and the potential for single-base precision
  • Limitations: Higher cost, more complex data analysis, and the need for sufficient sequencing depth and coverage

Comparison of detection methods

  • Array-based methods are generally more cost-effective and have shorter turnaround times, making them suitable for large-scale studies or clinical diagnostics
    • However, they have limited resolution and may miss some types of CNVs
  • Sequencing-based methods provide higher resolution and the ability to detect a wider range of CNVs, including complex and novel rearrangements
    • However, they are more expensive and computationally intensive, requiring advanced bioinformatics pipelines and expertise
  • Integration of multiple detection methods can improve the accuracy and comprehensiveness of CNV detection
    • For example, using array-based methods for initial screening and sequencing-based methods for validation and fine-mapping of candidate regions

Biological impact of CNVs

  • CNVs can have diverse functional consequences on genes and regulatory elements, leading to phenotypic variation and disease susceptibility
  • Understanding the biological impact of CNVs is crucial for interpreting their clinical significance and elucidating their roles in human health and evolution

Effects on gene expression

  • CNVs can alter gene dosage by changing the number of copies of a gene or its regulatory elements
    • Deletions can lead to reduced gene expression or complete loss of function
    • Duplications can lead to increased gene expression or gain of function
    • Examples: Salivary amylase gene (AMY1) influencing starch digestion, SRGAP2 gene duplications in human brain evolution
  • CNVs can disrupt gene regulatory landscapes by altering the position or copy number of enhancers, silencers, or insulators
    • Can lead to ectopic gene expression, silencing, or altered temporal or spatial patterns of expression
    • Examples: SOX9 enhancer duplications in XX male disorders of sex development, deletions of the H19/IGF2 imprinting control region in Beckwith-Wiedemann syndrome

Association with diseases

  • CNVs have been implicated in a wide range of human diseases, including developmental disorders, neuropsychiatric conditions, autoimmune diseases, and cancers
    • Can act as highly penetrant causal variants or modify disease risk in combination with other genetic and environmental factors
    • Examples: 22q11.2 deletion syndrome (DiGeorge syndrome), 16p11.2 deletion/duplication in autism spectrum disorder and schizophrenia, BRCA1/BRCA2 deletions in hereditary breast and ovarian cancer
  • CNVs can contribute to disease pathogenesis through various mechanisms
    • Gene dosage imbalances affecting key developmental pathways or cellular processes
    • Disruption of gene regulatory networks leading to altered gene expression patterns
    • Unmasking of recessive alleles or functional SNPs on the remaining copy of a deleted region
    • Gain of novel or altered gene functions through duplications or fusions

Role in evolutionary adaptation

  • CNVs can provide a substrate for evolutionary innovation and adaptation by altering gene dosage, creating novel gene combinations, or modifying gene regulation
    • Positive selection can favor CNVs that confer beneficial traits or increase fitness in specific environments
    • Examples: AMY1 gene copy number expansion in populations with high-starch diets, SRGAP2 gene duplications in human brain evolution and cognitive development
  • CNVs can contribute to intra- and inter-species phenotypic variation and divergence
    • Differences in CNV profiles between populations or species can reflect distinct evolutionary histories and adaptations
    • Examples: Copy number variation of the GSTM1 gene influencing detoxification capacity in human populations, CNVs in the PRDM9 gene affecting recombination hotspot usage in primates

Computational challenges in CNV analysis

  • The accurate detection, genotyping, and interpretation of CNVs from genomic data pose significant computational challenges due to their complexity, diversity, and the limitations of current sequencing technologies and algorithms
  • Addressing these challenges requires the development of robust bioinformatics tools, statistical methods, and data integration strategies

Alignment and mapping issues

  • CNVs can confound the alignment and mapping of sequencing reads to a reference genome, leading to false-positive or false-negative CNV calls
    • Reads originating from CNV regions may have reduced mapping quality, map to multiple locations, or fail to map altogether
    • Alignment algorithms may misinterpret CNV-related discordances as sequencing errors or structural variations of different types
  • Strategies to mitigate alignment issues include:
    • Using alignment tools that allow for split-read mapping and soft-clipping of reads spanning CNV breakpoints (e.g., BWA-MEM, HISAT2)
    • Applying local realignment around potential CNV regions to improve mapping accuracy
    • Incorporating population-specific or pan-genome reference sequences to capture CNV diversity
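
Soft-clipped alignments are one concrete signal that a read may span a breakpoint. The sketch below parses SAM-style CIGAR strings with the standard library and reports positions where a long soft clip begins or ends. The function names, the `(position, cigar)` input format, and the 20-base minimum clip length are assumptions for illustration; a real workflow would read alignments from a BAM file and cluster the positions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def soft_clip_lengths(cigar):
    """Return (left_clip, right_clip) soft-clipped lengths, ignoring hard clips."""
    ops = [x for x in parse_cigar(cigar) if x[1] != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right

def candidate_breakpoints(reads, min_clip=20):
    """reads: iterable of (pos, cigar), pos = 1-based leftmost aligned base."""
    positions = []
    for pos, cigar in reads:
        left, right = soft_clip_lengths(cigar)
        ref_len = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
        if left >= min_clip:
            positions.append(pos)                # first aligned base
        if right >= min_clip:
            positions.append(pos + ref_len - 1)  # last aligned base
    return positions

# Hypothetical alignments: one right-clipped, one clean, one left-clipped.
reads = [(100, "60M40S"), (200, "100M"), (300, "35S65M")]
print(candidate_breakpoints(reads))  # [159, 300]
```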

Distinguishing true CNVs from artifacts

  • CNV detection methods can be confounded by various sources of technical and biological noise, leading to spurious CNV calls
    • PCR amplification biases, sequencing errors, and batch effects can introduce systematic coverage variations mimicking CNVs
    • Biological factors such as GC content, mappability, and repetitive elements can affect read depth and mapping in a region-specific manner
  • Strategies to distinguish true CNVs from artifacts include:
    • Applying statistical correction methods to normalize read depth data for GC content, mappability, and other biases
    • Using multiple CNV detection algorithms based on different principles (e.g., read depth, split-reads, assembly) and comparing their results
    • Validating candidate CNVs using orthogonal methods such as PCR, qPCR, or targeted sequencing
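
Comparing the output of several callers is often done with a reciprocal overlap criterion: two calls agree only if the shared interval covers a minimum fraction of both. The sketch below is one simple way to do this. The tuple layout `(chrom, start, end, type)`, the half-open coordinates, and the 50% threshold are assumptions for the example, not a standard imposed by any particular tool.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for half-open intervals a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def concordant_calls(calls_a, calls_b, min_ro=0.5):
    """Keep calls from calls_a supported by a same-type call in calls_b."""
    kept = []
    for call in calls_a:
        chrom, start, end, cnv_type = call
        for other in calls_b:
            same_locus = other[0] == chrom and other[3] == cnv_type
            if same_locus and reciprocal_overlap((start, end), other[1:3]) >= min_ro:
                kept.append(call)
                break
    return kept

# Hypothetical calls from a read-depth caller and a split-read caller.
depth_calls = [
    ("chr1", 1000, 5000, "DEL"),
    ("chr1", 20000, 30000, "DUP"),
    ("chr2", 100, 900, "DEL"),
]
split_calls = [
    ("chr1", 1200, 5100, "DEL"),
    ("chr1", 28000, 40000, "DUP"),
]
print(concordant_calls(depth_calls, split_calls))
# [('chr1', 1000, 5000, 'DEL')]
```

Requiring the overlap to be reciprocal prevents a very large call from "confirming" many small ones that merely fall inside it.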

Handling repetitive regions

  • CNVs often occur in repetitive regions of the genome, such as segmental duplications, tandem repeats, and transposable elements
    • These regions are difficult to analyze due to their high sequence similarity, leading to ambiguous mapping and reduced detection sensitivity
    • CNVs within repetitive regions may be missed or miscalled, especially if they involve novel or complex rearrangements
  • Strategies to handle repetitive regions in CNV analysis include:
    • Using specialized alignment tools that can handle multi-mapping reads and distinguish between paralogous sequences (e.g., mrFAST, LAST)
    • Applying targeted assembly or de novo assembly approaches to resolve the structure and copy number of repetitive CNVs
    • Leveraging long-read sequencing technologies (e.g., PacBio, Oxford Nanopore) to span repetitive elements and provide contiguous CNV resolution
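
Why repeats cause ambiguous mapping can be illustrated with k-mer uniqueness: if every k-mer in a region also occurs elsewhere, a read of that length cannot be placed unambiguously. The sketch below computes the fraction of k-mers that occur exactly once within a single sequence. It is a toy proxy under stated assumptions (one short sequence, exact matches only, a hypothetical function name), not a genome-wide mappability track.

```python
from collections import Counter

def unique_kmer_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0  # sequence shorter than k
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)

# Hypothetical sequences: a tandem repeat and a non-repetitive stretch.
tandem_repeat = "ACGT" * 5
non_repetitive = "ACGTTGCAAGCTTAGGCTAA"

print(unique_kmer_fraction(tandem_repeat, 4))   # 0.0
print(unique_kmer_fraction(non_repetitive, 4))  # 1.0
```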

Algorithms for CNV detection

  • A variety of computational algorithms have been developed to detect CNVs from different types of genomic data, each with its own strengths, limitations, and application domains
  • The choice of algorithm depends on factors such as the data type (e.g., array, sequencing), desired resolution and sensitivity, and the availability of computational resources and expertise

Read depth-based methods

  • Read depth-based methods identify CNVs by comparing the observed read coverage in a region to the expected coverage based on a reference sample or population
    • Regions with significantly higher or lower coverage than expected are called as duplications or deletions, respectively
    • Examples: CNVnator, ReadDepth, Control-FREEC
  • Advantages: Simple principle, can detect large CNVs, and work well for high-coverage data
  • Limitations: Limited resolution (kilobase level), sensitive to coverage biases and sequencing artifacts, and difficulty detecting copy-neutral events (e.g., balanced translocations)
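
The read-depth principle can be sketched in a few lines: bin the genome into windows, compare each window's depth to an expected value, and merge adjacent windows that share a call. In this sketch the expected depth is simply the median across windows, and the ratio cut-offs (0.75 and 1.25) and the 1 kb window size are illustrative assumptions. It does not reproduce the statistical models used by the tools listed above.

```python
from statistics import median

def call_windows(depths, del_max=0.75, dup_min=1.25):
    """Label each window by its depth ratio to the median window depth."""
    expected = median(depths)
    calls = []
    for depth in depths:
        ratio = depth / expected
        if ratio <= del_max:
            calls.append("DEL")
        elif ratio >= dup_min:
            calls.append("DUP")
        else:
            calls.append("NEUTRAL")
    return calls

def merge_calls(calls, window_size):
    """Merge runs of identical DEL/DUP windows into (start, end, type)."""
    segments = []
    run_start, current = 0, None
    # A trailing sentinel flushes a run that reaches the last window.
    for i, call in enumerate(calls + ["NEUTRAL"]):
        if call != current:
            if current in ("DEL", "DUP"):
                segments.append((run_start * window_size, i * window_size, current))
            run_start, current = i, call
    return segments

# Hypothetical mean depths for ten consecutive 1 kb windows.
depths = [30, 31, 29, 15, 14, 16, 30, 45, 46, 30]
segments = merge_calls(call_windows(depths), window_size=1000)
print(segments)  # [(3000, 6000, 'DEL'), (7000, 9000, 'DUP')]
```

For a diploid region, an estimated copy number follows from the same ratio as roughly 2 × ratio, so the windows near depth 15 suggest one copy and those near 45 suggest three.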

Split-read and paired-end mapping

  • Split-read methods identify CNVs by detecting reads that span breakpoints, with one portion mapping to one genomic location and the other portion mapping to a different location
    • The alignment patterns of split reads can reveal the precise breakpoints and architecture of CNVs
    • Examples: Pindel, LUMPY, DELLY
  • Paired-end mapping methods identify CNVs by analyzing the distance and orientation of paired-end reads mapping to the reference genome
    • Discordant pairs with abnormal insert sizes or orientations can indicate the presence of deletions, duplications, or inversions
    • Examples: BreakDancer, PEMer, SVDetect
  • Advantages: Can detect smaller CNVs, provide breakpoint information (split reads can resolve breakpoints to single-base resolution), and remain usable at modest coverage as long as enough reads or pairs span each breakpoint
  • Limitations: Sensitive to read length and library insert size, may miss CNVs in repetitive or low-complexity regions, and require sufficient sequencing coverage at breakpoints
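
The paired-end signatures described above can be sketched as a simple classifier over insert size and pair orientation. The sketch assumes a standard forward-reverse ("FR") short-read library, a known library mean and standard deviation, and a three-standard-deviation cut-off; the labels and function name are hypothetical, and each label is only a candidate signature that a real caller would confirm by clustering many supporting pairs.

```python
def classify_pair(insert_size, orientation, lib_mean, lib_sd, n_sd=3):
    """Return a candidate SV signature for one read pair.

    orientation: "FR" (expected), "RF" (everted), or "FF"/"RR" (same strand).
    """
    if orientation in ("FF", "RR"):
        return "inversion-like"
    if orientation == "RF":
        return "tandem-duplication-like"
    if insert_size > lib_mean + n_sd * lib_sd:
        return "deletion-like"   # mates map farther apart than expected
    if insert_size < lib_mean - n_sd * lib_sd:
        return "insertion-like"  # mates map closer together than expected
    return "concordant"

# Hypothetical pairs from a library with mean insert 400 bp and SD 40 bp.
pairs = [(400, "FR"), (5200, "FR"), (410, "RF"), (390, "FF"), (100, "FR")]
labels = [classify_pair(size, orient, 400, 40) for size, orient in pairs]
print(labels)
# ['concordant', 'deletion-like', 'tandem-duplication-like',
#  'inversion-like', 'insertion-like']
```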

Assembly-based approaches

  • Assembly-based approaches aim to reconstruct the full sequence of CNV regions by assembling reads into contigs and comparing them to the reference genome
    • Can detect novel or complex CNVs that are difficult to identify using alignment-based methods
    • Examples: TIGRA-SV, SvABA, novoBreak
  • Local assembly methods focus on assembling reads from specific regions of interest, such as those with abnormal read depth or discordant pair mappings
    • More computationally efficient than global assembly, but may miss some CNVs or produce fragmented assemblies
    • Examples: DISCOVAR, GRIDSS, SvABA
  • Global assembly methods attempt to assemble the entire genome de novo and compare the resulting contigs to the reference to identify CNVs
    • Can provide a more comprehensive view of CNV structure and content, but are computationally intensive and may require high-coverage data
    • Examples: Cortex, FALCON, Canu
  • Advantages: Can detect novel or complex CNVs, provide full CNV sequence information, and potentially resolve CNVs in repetitive regions
  • Limitations: Computationally intensive, may require high-coverage data, and sensitive to assembly errors and chimeric contigs

Statistical considerations for CNVs

  • Accurate CNV detection and genotyping require robust statistical methods to control for false discoveries, correct for technical biases, and account for the unique properties of CNV data
  • Developing and applying appropriate statistical frameworks is essential for ensuring the reliability and reproducibility of CNV analyses

False discovery rate control

  • CNV detection methods often generate a large number of candidate CNV calls, some of which may be false positives due to technical artifacts or biological noise
    • False discovery rate (FDR) control methods aim to estimate and limit the proportion of false positives among the detected CNVs
    • Examples: Benjamini-Hochberg procedure, Storey's q-value method, permutation-based FDR estimation
  • Strategies for FDR control in CNV analysis include:
    • Applying stringent filtering criteria based on CNV size, read depth, number of supporting reads, or other quality metrics
    • Using multiple CNV detection methods and considering only the concordant calls across methods
    • Estimating FDR based on simulated or permuted data that mimics the properties of the real data
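
As a worked example of FDR control, the Benjamini-Hochberg procedure named above can be written directly from its definition: sort the p-values, scale each by m / rank, and enforce monotonicity from the largest rank downward. The function name and the example p-values are assumptions for illustration; how a per-call p-value is obtained depends on the caller.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so each adjusted value
    # is never larger than the one ranked above it.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical p-values for five candidate CNV calls.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
qvalues = benjamini_hochberg(pvalues)
accepted = [q <= 0.05 for q in qvalues]
print(accepted)  # [True, True, False, False, False]
```

Here the third and fourth calls have raw p-values below 0.05 but adjusted values of about 0.051, so they are not accepted at a 5% FDR.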

Normalization and bias correction

  • CNV detection methods can be affected by various sources of technical bias, such as GC content, mappability, and batch effects, which can lead to spurious or missed CNV calls
    • Normalization and bias correction methods aim to remove or mitigate these systematic biases and improve the accuracy of CNV detection
    • Examples: GC content correction, mappability correction, principal component analysis (PCA)-based correction
  • Strategies for normalization and bias correction in CNV analysis include:
    • Applying regression-based methods to model and remove the relationship between read depth and known biases (e.g., GC content, mappability)
    • Using control samples or reference populations to estimate and subtract systematic biases
    • Employing unsupervised methods (e.g., PCA, SVD) to identify and remove batch effects or other unknown biases
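
One simple form of GC correction rescales each window's depth by the ratio of the overall median depth to the median depth of windows with similar GC content. The sketch below implements that median-ratio idea; the 5-percentage-point bin width, the function name, and the example values are assumptions, and regression-based or PCA-based corrections would replace the per-bin median with a fitted model.

```python
from collections import defaultdict
from statistics import median

def gc_correct(depths, gc_fractions, bin_width_pct=5):
    """Rescale depths so every GC bin has the same median depth."""
    def gc_bin(gc):
        # Work in integer percent to avoid floating-point binning artifacts.
        return int(round(gc * 100)) // bin_width_pct

    overall = median(depths)
    by_bin = defaultdict(list)
    for depth, gc in zip(depths, gc_fractions):
        by_bin[gc_bin(gc)].append(depth)
    bin_median = {b: median(values) for b, values in by_bin.items()}

    corrected = []
    for depth, gc in zip(depths, gc_fractions):
        m = bin_median[gc_bin(gc)]
        # Leave the value unchanged if the bin median is zero.
        corrected.append(depth * overall / m if m > 0 else depth)
    return corrected

# Hypothetical windows in which depth rises with GC content.
depths = [20, 22, 30, 31, 40, 42]
gc_fractions = [0.30, 0.31, 0.45, 0.46, 0.60, 0.61]
corrected = gc_correct(depths, gc_fractions)
print([round(value, 1) for value in corrected])
```

After correction the GC-driven trend is removed while the small within-bin differences, which could reflect real copy number signal, are preserved.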

Genotype calling and phasing

  • CNV genotyping involves determining the copy number state of a CNV locus in an individual genome, which can be challenging due to the continuous nature of read depth signals and the presence of technical noise
    • Genotype calling methods aim to assign discrete copy number states (e.g., deletion, normal, duplication) to each CNV locus based on the observed read depth or other supporting evidence
    • Examples: Gaussian mixture models, hidden Markov models, support vector machines
  • CNV phasing involves determining the allelic configuration of CNVs on maternal and paternal chromosomes, which can provide insights into their origin, inheritance, and functional impact
    • Phasing methods leverage linkage disequilibrium, family information, or long-range sequencing data to assign CNVs to specific haplotypes
    • Examples: HapCUT, HapCompass, WhatsHap
  • Strategies for genotype calling and phasing in CNV analysis include:
    • Applying probabilistic models (e.g., Gaussian mixture models) to cluster read depth signals into discrete copy number states
    • Using family-based designs or population-based imputation to improve the accuracy of genotype calling and phasing
    • Integrating multiple data types (e.g., read depth, allele-specific read counts, linkage disequilibrium) to enhance the power and resolution of genotype calling and phasing
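
Assigning discrete copy number states to a continuous depth signal can be sketched with a fixed-parameter Gaussian model. This is a simplified stand-in for a fitted Gaussian mixture model: the component means are fixed at the expected diploid depth ratios (copy number / 2), all components share an assumed standard deviation of 0.1, and priors are equal. A fitted model would estimate these from the data.

```python
import math

def genotype_copy_number(depth_ratio, sd=0.1, max_copy_number=4, ploidy=2):
    """Return (most likely copy number, its posterior probability)."""
    # Log-likelihood up to a constant shared by all states (equal sd, equal priors).
    log_likes = [
        -0.5 * ((depth_ratio - cn / ploidy) / sd) ** 2
        for cn in range(max_copy_number + 1)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_likes)
    weights = [math.exp(ll - top) for ll in log_likes]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors[best]

# Hypothetical normalized depth ratios at five CNV loci.
for ratio in [0.02, 0.48, 1.03, 1.55, 0.74]:
    copy_number, posterior = genotype_copy_number(ratio)
    print(ratio, copy_number, round(posterior, 3))
```

The last locus (ratio 0.74) falls almost midway between the one-copy and two-copy expectations, so its posterior is close to 0.6; such ambiguous genotypes are the ones that benefit most from family data or additional evidence types.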

Databases and resources for CNVs

  • Several public databases and resources have been developed to facilitate the annotation, interpretation, and sharing of CNV data across studies and populations
  • These resources provide valuable information on the frequency, distribution, and functional impact of CNVs, and enable the integration of CNV data with other types of genomic and phenotypic information

Public CNV databases

  • Database of Genomic Variants (DGV): A comprehensive catalog of CNVs and other structural variations identified in healthy individuals from various populations
    • Includes data from multiple studies and platforms, with detailed information on CN

Key Terms to Review (19)

Adaptive evolution: Adaptive evolution is the process through which organisms become better suited to their environment through genetic changes that enhance survival and reproduction. This process is driven by natural selection, where advantageous traits become more common in a population over generations, resulting in improved fitness. It plays a significant role in the diversification of species and their ability to thrive in varying ecological contexts.
Array comparative genomic hybridization: Array comparative genomic hybridization (aCGH) is a genomic technology used to detect and quantify structural variations in DNA, particularly copy number variations (CNVs). This method involves comparing the test DNA to a reference DNA on a microarray platform, allowing researchers to identify gains or losses of genetic material across the genome. aCGH plays a critical role in understanding the genetic basis of diseases, particularly in cancer and developmental disorders, by revealing alterations that contribute to disease phenotypes.
Association testing: Association testing is a statistical method used to determine whether a specific genetic variant is associated with a trait or disease within a population. This approach is fundamental in genomics, particularly for identifying the relationship between genetic markers and phenotypic outcomes, such as copy number variations (CNVs). By examining these associations, researchers can gain insights into the genetic architecture of diseases and the potential functional implications of CNVs.
Autism spectrum disorder: Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by challenges in social interaction, communication, and restricted or repetitive behaviors. It encompasses a wide range of symptoms and levels of disability, often described as a 'spectrum' because individuals can experience varying degrees of impairment and strengths.
Copy Number Variation: Copy number variation (CNV) refers to a type of structural variation in the genome where segments of DNA are duplicated or deleted, leading to differences in the number of copies of particular genes or genomic regions among individuals. This genetic variability can play a significant role in various biological processes, including gene expression and susceptibility to diseases.
Database of genomic variants: A database of genomic variants is a structured collection of information about genetic variations found within and across different genomes, including single nucleotide polymorphisms (SNPs), insertions, deletions, and copy number variations. These databases serve as crucial resources for researchers and clinicians to understand the association of these variants with diseases, phenotypes, and traits, facilitating advancements in personalized medicine and genomics.
Deletion: A deletion is a type of genetic mutation where a segment of DNA is removed or lost from the chromosome. This alteration can impact gene function, leading to various effects on an organism's phenotype, and plays a significant role in copy number variations, as it results in a reduction of genomic material which can affect gene dosage and expression levels.
Disease association: Disease association refers to the correlation between specific genetic variations and the presence or risk of particular diseases. These associations can provide insight into how genetic factors contribute to disease susceptibility and help identify potential targets for therapy or prevention strategies. By understanding these connections, researchers can better comprehend the biological mechanisms underlying diseases.
Duplication: Duplication refers to a genetic event where a segment of DNA is copied, resulting in multiple copies of that segment within the genome. This can lead to variations in gene dosage and function, influencing phenotypic traits and potentially contributing to evolution. Duplications can occur at different scales, affecting single genes or larger genomic regions, and they play a crucial role in generating copy number variations (CNVs) that can impact health and disease.
Gains: In the context of copy number variations (CNVs), gains refer to the increase in the number of copies of a specific genomic region. This increase can lead to an overexpression of genes located within the gained region, which may impact cellular functions and contribute to various diseases, including cancer. Understanding gains is crucial for deciphering the genetic basis of disorders and how these alterations can influence phenotype and pathology.
Genome aggregation database: A genome aggregation database is a comprehensive collection of genetic data from multiple individuals, designed to provide insights into the variation within human genomes. These databases aggregate sequencing information, allowing researchers to identify and analyze genetic variants, including rare mutations and common polymorphisms, across diverse populations. The main goal is to facilitate the understanding of genetic diversity and its implications for human health and disease.
Genomic architecture: Genomic architecture refers to the structural organization of an organism's genome, encompassing the arrangement and interaction of genes, regulatory elements, and other functional genomic features. This organization is crucial for understanding how genetic information is expressed and regulated, influencing phenotypic traits and disease susceptibility. The interplay between genomic architecture and genetic variations, particularly copy number variations (CNVs), highlights how structural changes in the genome can affect biological functions.
Genomic instability: Genomic instability refers to an increased tendency of the genome to acquire mutations, leading to alterations in DNA sequences, chromosome structure, or number. This phenomenon is significant in understanding how diseases, particularly cancer, arise as it can result in structural variations and copy number variations, both of which are critical for genomic diversity and can contribute to the progression of various diseases.
Losses: In the context of copy number variations (CNVs), losses refer to the deletion or reduction of segments of DNA, resulting in fewer copies of a specific genomic region. This can lead to gene dosage effects, where the decrease in the number of gene copies impacts the expression levels and functions of genes within that region. Losses are significant as they can disrupt normal biological processes, potentially contributing to various diseases and phenotypic variations.
Multi-Dimensional Scaling: Multi-Dimensional Scaling (MDS) is a statistical technique used for visualizing the level of similarity or dissimilarity of data points in a multi-dimensional space. It transforms data into a lower-dimensional representation while preserving the pairwise distances between points as much as possible. This method is especially useful in genomics for analyzing complex datasets, such as those involving Copy Number Variations (CNVs), by allowing researchers to identify patterns and relationships among samples.
Natural selection: Natural selection is the process through which certain traits become more or less common in a population based on their impact on the survival and reproduction of individuals. This mechanism plays a critical role in evolution, as individuals with advantageous traits are more likely to survive and pass those traits to future generations, shaping genetic diversity over time. It helps explain variations in traits such as copy number variations and the distribution of alleles within populations.
Next-generation sequencing: Next-generation sequencing (NGS) refers to a set of advanced DNA sequencing technologies that allow for the rapid and cost-effective sequencing of large amounts of genetic material. This technology has revolutionized genomics by enabling whole-genome sequencing, exome sequencing, and targeted sequencing, allowing researchers to analyze complex genomes and understand genetic variations more thoroughly.
Phenotypic Variation: Phenotypic variation refers to the observable differences in physical and biological traits among individuals within a population, which arise from genetic, environmental, and developmental factors. These variations can manifest in various forms, such as morphology, physiology, and behavior, and are crucial for understanding evolution and adaptation in different environments. The study of phenotypic variation is essential for examining how genetic changes, like copy number variations and insertions/deletions, influence traits and contribute to diversity in organisms.
Schizophrenia: Schizophrenia is a chronic and severe mental disorder that affects how a person thinks, feels, and behaves, often characterized by hallucinations, delusions, and cognitive impairments. Research has shown that genetic factors, including copy number variations (CNVs), play a significant role in the etiology of schizophrenia, suggesting that structural changes in the genome can contribute to the development of this complex disorder.
© 2024 Fiveable Inc. All rights reserved.