Genome scaffolding and gap filling are crucial steps in reconstructing complete genome sequences. These processes involve ordering and orienting contigs into larger structures called scaffolds, and then attempting to close the gaps between them.

These techniques help overcome challenges like repetitive sequences, sequencing errors, and incomplete coverage. By using various data sources and algorithms, researchers can improve the accuracy and completeness of genome assemblies, enabling better downstream analyses and insights into genetic information.

Genome assembly challenges

  • Genome assembly is the process of reconstructing a complete genome sequence from shorter DNA fragments, which presents several challenges due to the complexity and size of genomes
  • These challenges can lead to fragmented assemblies, misassemblies, and gaps in the reconstructed genome sequence, affecting the accuracy and completeness of the assembly

Repetitive sequences

  • Repetitive sequences, such as transposable elements, satellite DNA, and segmental duplications, are abundant in many genomes and can span hundreds to thousands of base pairs
  • These sequences can cause ambiguities during assembly, as the shorter sequencing reads may not be long enough to span the entire repetitive region, making it difficult to determine their correct placement
  • Repetitive sequences can lead to collapsed repeats in the assembly, where multiple copies of the repeat are represented as a single copy, or expanded repeats, where the number of copies is overestimated

Sequencing errors

  • Sequencing technologies are not perfect and can introduce errors in the DNA sequence, such as substitutions, insertions, or deletions of nucleotides
  • These errors can hide true overlaps between sequencing reads or, less often, create false ones, leading to fragmented contigs or misassemblies
  • Sequencing errors can also introduce false variations in the assembled genome, which can be mistaken for true biological variations, such as single nucleotide polymorphisms (SNPs) or structural variations

Incomplete coverage

  • Sequencing coverage refers to the average number of times each base in the genome is represented in the sequencing reads
  • Incomplete coverage can occur due to biases in the sequencing process, such as GC content bias or uneven amplification, leading to regions of the genome with low or no coverage
  • Insufficient coverage can result in gaps in the assembled genome sequence, as there may not be enough overlapping reads to bridge certain regions, especially those with repetitive or complex sequences
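The coverage definition above can be made concrete with a small calculation. The sketch below computes average coverage as (number of reads × read length) / genome size and, assuming reads are sampled uniformly at random (the Lander-Waterman model, which real libraries with GC or amplification bias only approximate), estimates the fraction of bases expected to receive no reads. The function names and the example numbers are illustrative assumptions.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth C = N * L / G."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Under uniform random sampling, P(a base is covered by zero reads) is about exp(-C)."""
    return math.exp(-coverage)


# Illustrative numbers: one million 150 bp reads from a 5 Mb genome.
c = mean_coverage(num_reads=1_000_000, read_length=150, genome_size=5_000_000)
print(f"coverage = {c:.1f}x")  # 30.0x
print(f"expected uncovered fraction = {expected_uncovered_fraction(c):.2e}")
```

Because real coverage is uneven, the observed number of gaps is usually higher than this idealized estimate suggests.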

Scaffolding

  • Scaffolding is the process of ordering and orienting contigs (contiguous sequences) into larger structures called scaffolds, which represent the relative positions and orientations of the contigs in the genome
  • Scaffolding aims to bridge gaps between contigs and provide a more complete and accurate representation of the genome structure

Scaffold graph construction

  • Scaffold graph construction involves creating a graph representation of the contigs and their connections based on additional information, such as mate-pair or paired-end reads
  • In a scaffold graph, nodes represent contigs, and edges represent the connections between contigs based on the linking information
  • The graph structure allows for the identification of the most likely order and orientation of contigs, considering the constraints imposed by the linking data
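As a minimal sketch of the graph described above, the following code counts read pairs that connect different contigs and keeps only edges with enough support. The input format (one tuple per read pair with contig names and strands), the `min_support` threshold, and the function name are assumptions made for illustration; real scaffolders also track distances and handle orientation more carefully.

```python
from collections import defaultdict

# Hypothetical input: one tuple per read pair whose reads map to different contigs,
# given as (contig_a, strand_a, contig_b, strand_b).
pair_links = [
    ("c1", "+", "c2", "-"),
    ("c1", "+", "c2", "-"),
    ("c1", "+", "c2", "-"),
    ("c2", "+", "c3", "-"),
    ("c2", "+", "c3", "-"),
    ("c1", "+", "c3", "+"),  # a single link, likely noise
]


def build_scaffold_graph(links, min_support=2):
    """Return {(contig_a, contig_b): {"orientation": ..., "support": ...}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, strand_a, b, strand_b in links:
        # Store each contig pair under one canonical key so (c1, c2) and (c2, c1) merge.
        if a > b:
            a, strand_a, b, strand_b = b, strand_b, a, strand_a
        counts[(a, b)][(strand_a, strand_b)] += 1

    graph = {}
    for edge, orientations in counts.items():
        # Keep the most frequently observed strand combination for this edge.
        orientation, support = max(orientations.items(), key=lambda kv: kv[1])
        if support >= min_support:
            graph[edge] = {"orientation": orientation, "support": support}
    return graph


graph = build_scaffold_graph(pair_links)
for edge, info in sorted(graph.items()):
    print(edge, info)
```

Filtering by support is a simple way to discard spurious links, for example those caused by chimeric pairs or repeats, before any ordering step.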

Mate-pair reads

  • Mate-pair reads are sequencing reads generated from DNA fragments with a larger insert size (2-5 kb or more) compared to standard paired-end reads
  • These reads span larger distances in the genome and can provide long-range connectivity information for scaffolding
  • Mate-pair reads can help bridge gaps between contigs and resolve repetitive regions, as they can span these problematic areas and provide linking information

Paired-end reads

  • Paired-end reads are sequencing reads generated from both ends of DNA fragments with a known orientation and approximate distance between them
  • The distance between paired-end reads is typically shorter than mate-pair reads (200-500 bp) but can still provide valuable linking information for scaffolding
  • Paired-end reads can help orient contigs and estimate the size of gaps between them based on the expected insert size of the library
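The gap size estimate mentioned above can be sketched as simple arithmetic. Assume the forward read of each pair maps to contig A starting at offset `start_a`, the reverse read maps to contig B ending at offset `end_b` (measured from the start of B), and the library has a known mean insert size. Each pair then implies a gap of insert size minus the bases accounted for on A and on B. The coordinate convention and function name are assumptions for illustration.

```python
from statistics import median


def estimate_gap(pairs, len_a, mean_insert):
    """Estimate the gap between contig A and contig B from spanning read pairs.

    pairs: list of (start_a, end_b) offsets as described above.
    A negative result suggests the two contigs overlap rather than being separated.
    """
    estimates = [mean_insert - (len_a - start_a) - end_b for start_a, end_b in pairs]
    # The median is less sensitive than the mean to a few mis-mapped pairs.
    return median(estimates)


# Illustrative numbers: contig A is 10 kb long, library insert size about 3 kb.
spanning_pairs = [(9000, 1500), (8800, 1300), (9500, 1900)]
print(estimate_gap(spanning_pairs, len_a=10_000, mean_insert=3000))  # 500
```

In practice the insert size varies around its mean, so the estimate becomes more reliable as more pairs span the same gap.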

Long-read technologies

  • Long-read technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), generate reads that can span several kilobases to hundreds of kilobases in length
  • These longer reads can directly span repetitive regions and provide continuous sequence information, facilitating the assembly of complex genomes
  • Long reads can be used for scaffolding by directly connecting contigs or by providing a backbone for hybrid assembly approaches that combine long and short reads

Gap filling

  • Gap filling is the process of attempting to close the gaps between contigs in a scaffold by generating or identifying sequences that can fill these gaps
  • Gap filling aims to improve the continuity and completeness of the genome assembly by reducing the number and size of gaps

Gap identification

  • Gap identification involves locating the gaps between contigs in a scaffold based on the available linking information and the expected distance between contigs
  • Gaps can be identified by analyzing the scaffold graph for neighboring contigs that are joined by linking reads but have no assembled sequence between them; the implied separation comes from comparing the positions of the linked reads with the expected insert size of the library
  • The size and location of gaps can be estimated based on the linking information and the size of the contigs flanking the gap
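Once scaffolds are written out as sequences, gaps are commonly represented as runs of `N` characters. Assuming that convention (it is not stated in the text above), a minimal sketch for locating gaps in a scaffold sequence looks like this; the function name is illustrative.

```python
import re


def find_gaps(scaffold_seq):
    """Return (start, end, length) for every run of N/n, using 0-based half-open coordinates."""
    return [
        (m.start(), m.end(), m.end() - m.start())
        for m in re.finditer(r"[Nn]+", scaffold_seq)
    ]


seq = "ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TT"
print(find_gaps(seq))  # [(4, 14, 10), (18, 21, 3)]
```

The length of each run typically encodes the estimated gap size, although some assemblers use a fixed placeholder length when the size is unknown.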

Local assembly methods

  • Local assembly methods aim to fill gaps by performing a targeted assembly of the reads that map to the regions flanking the gap
  • These methods extract the reads that align to the contigs on either side of the gap and attempt to assemble them into a contiguous sequence that spans the gap
  • Local assembly can be performed using short reads, mate-pair reads, or a combination of both, depending on the available data and the size of the gap
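To illustrate the idea of local assembly, the sketch below extends the left flank of a gap one base at a time using k-mers from the reads recruited to that gap, stopping when it reaches the start of the right flank. This is a toy greedy extension under simplifying assumptions (error-free reads on the same strand, a small fixed k), not the algorithm of any particular tool; all names are hypothetical.

```python
from collections import Counter, defaultdict


def build_successors(reads, k):
    """For every k-mer seen in the reads, count which base follows it."""
    succ = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            succ[read[i:i + k]][read[i + k]] += 1
    return succ


def fill_gap(left_flank, right_flank, reads, k=5, max_len=100):
    """Return the sequence between the flanks, or None if extension fails."""
    succ = build_successors(reads, k)
    target = right_flank[:k]
    seq = left_flank
    for _ in range(max_len):
        options = succ.get(seq[-k:])
        if not options:
            return None  # dead end: no read continues this k-mer
        ranked = options.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            return None  # tied branch (for example a repeat): stop rather than guess
        seq += ranked[0][0]
        if seq.endswith(target):
            # Drop the left flank and the k bases that belong to the right flank.
            return seq[len(left_flank):len(seq) - k]
    return None


left = "ATGGCGTACC"
right = "GCTAAGTCCA"
reads = ["CGTACCTTAGACGAGC", "TTAGACGAGCTAAGTC"]  # reads recruited to this gap
print(fill_gap(left, right, reads))  # TTAGACGA
```

Returning `None` on a tie reflects a conservative design choice: leaving a gap open is usually preferable to inserting a guessed sequence.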

Reference-based approaches

  • Reference-based gap filling approaches utilize a closely related reference genome to guide the gap filling process
  • These methods align the contigs flanking the gap to the reference genome and attempt to identify the corresponding sequence in the reference that can fill the gap
  • Reference-based approaches can be effective when a high-quality reference genome is available for a closely related species, but they may not capture species-specific variations or novel sequences

De novo gap filling

  • De novo gap filling methods attempt to fill gaps without relying on a reference genome, using only the sequencing reads and the assembly information
  • These approaches often involve iterative rounds of local assembly, where the reads mapping to the gap region are assembled, and the resulting contigs are incorporated back into the assembly
  • De novo gap filling can be computationally intensive and may require multiple iterations to close larger gaps or resolve complex regions

Scaffolding algorithms

  • Scaffolding algorithms are computational methods designed to order and orient contigs into scaffolds based on linking information from paired-end reads, mate-pair reads, or other sources of long-range connectivity
  • These algorithms aim to find the most likely arrangement of contigs that satisfies the constraints imposed by the linking data while minimizing conflicts and inconsistencies

Greedy algorithms

  • Greedy scaffolding algorithms make locally optimal decisions at each step, iteratively joining contigs into scaffolds based on the strongest linking evidence
  • These algorithms typically start with the longest contigs and progressively add shorter contigs to the scaffolds, prioritizing the links with the highest support or consistency
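  • SSPACE is a widely used example of a greedy scaffolder; SOPRA and GRASS are often listed alongside it but rely on statistical optimization and mixed-integer programming, respectively, rather than a purely greedy strategy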
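A minimal sketch of the greedy idea: process links from strongest to weakest and accept a link only if both contigs still have a free end and joining them would not close a cycle. This simplified version ignores contig orientation, gap sizes, and contig length (it ranks by link support only), so it illustrates the strategy rather than any specific tool. The `min_support` parameter and all names are assumptions.

```python
def greedy_scaffold(contigs, links, min_support=2):
    """contigs: list of names; links: list of (contig_a, contig_b, support)."""
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving, used to detect cycles.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    neighbours = {c: [] for c in contigs}
    for a, b, support in sorted(links, key=lambda link: -link[2]):
        if support < min_support:
            continue
        if len(neighbours[a]) >= 2 or len(neighbours[b]) >= 2:
            continue  # a contig in a linear scaffold has at most two neighbours
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already in the same scaffold; joining would form a cycle
        parent[root_a] = root_b
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Walk each scaffold from one of its ends.
    scaffolds, seen = [], set()
    for c in contigs:
        if c in seen or len(neighbours[c]) == 2:
            continue
        path, prev, cur = [], None, c
        while cur is not None:
            path.append(cur)
            seen.add(cur)
            nxt = [n for n in neighbours[cur] if n != prev]
            prev, cur = cur, (nxt[0] if nxt else None)
        scaffolds.append(path)
    return scaffolds


links = [("c1", "c2", 10), ("c2", "c3", 8), ("c1", "c3", 2), ("c3", "c4", 1)]
print(greedy_scaffold(["c1", "c2", "c3", "c4"], links))
# [['c1', 'c2', 'c3'], ['c4']]
```

Because each decision is final, an early wrong join cannot be undone later, which is the main weakness of greedy strategies on noisy linking data.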

Graph-based algorithms

  • Graph-based scaffolding algorithms represent the contigs and their connections as a graph, where nodes represent contigs and edges represent the linking information
  • These algorithms often use advanced graph theory concepts, such as minimum spanning trees, maximum likelihood paths, or network flow optimization, to find the most likely arrangement of contigs
  • Graph-based algorithms can handle complex linking patterns and can be more robust to errors and inconsistencies in the data compared to greedy approaches
  • Examples of graph-based scaffolding algorithms include BESST, ScaffMatch, and Opera

Hybrid approaches

  • Hybrid scaffolding approaches combine multiple sources of information or different algorithmic strategies to improve the accuracy and completeness of the scaffolding process
  • These approaches may integrate data from different sequencing technologies (e.g., short reads and long reads), use a combination of greedy and graph-based methods, or incorporate additional information such as physical maps or genetic linkage data
  • Hybrid approaches aim to leverage the strengths of different data types and algorithms to overcome the limitations of individual methods and produce high-quality scaffolds
  • Examples of tools that bring additional data types into scaffolding include LINKS and RAILS, which use long reads, and SALSA, which uses Hi-C contact data

Gap filling algorithms

  • Gap filling algorithms are computational methods designed to close the gaps between contigs in a scaffold by generating or identifying sequences that can span these gaps
  • These algorithms utilize various strategies, such as local assembly, reference-based approaches, or de novo methods, to generate gap-filling sequences

GapCloser

  • GapCloser is a gap filling tool that is part of the SOAPdenovo package, which is commonly used for de novo genome assembly
  • It uses a local assembly approach to fill gaps by extracting reads that map to the contigs flanking the gap and performing a targeted assembly of these reads
  • GapCloser iteratively extends the contigs into the gap region using a k-mer-based approach, attempting to find overlaps between the reads and the contig ends

GapFiller

  • GapFiller is a standalone gap filling tool that closes gaps de novo from paired reads; it does not require a reference genome
  • It collects read pairs in which one read aligns near the edge of a gap and its mate is expected to fall inside the gap
  • The collected reads are used to extend the flanking sequence into the gap in a seed-and-extend fashion, repeated over several iterations until the gap closes or no further extension is supported

GMcloser

  • GMcloser is a gap filling tool that closes gaps in a draft genome assembly using a second set of preassembled contigs or long reads
  • It aligns these sequences to the scaffold regions flanking each gap and uses a likelihood-based criterion, supported by paired-end read alignments, to select the alignments that correctly span the gap
  • The selected contig or long-read sequence is then inserted to fill the gap, an approach intended to limit the misassemblies introduced during gap closing

TGS-GapCloser

  • TGS-GapCloser is a gap filling tool designed to close gaps using long reads generated by third-generation sequencing (TGS) technologies, such as PacBio or Oxford Nanopore
  • It utilizes the long reads to directly span the gaps between contigs, providing a continuous sequence that can close the gap
  • TGS-GapCloser can handle errors and variations in the long reads by performing a local alignment and consensus calling step to refine the gap-filling sequence

Quality assessment

  • Quality assessment is the process of evaluating the accuracy, completeness, and contiguity of a genome assembly, including the scaffolds and gap-filled sequences
  • Various metrics and approaches are used to assess the quality of an assembly and identify potential issues or areas for improvement

Scaffold N50 metric

  • The scaffold N50 is a commonly used metric to assess the contiguity of a genome assembly at the scaffold level
  • It represents the length of the scaffold at which 50% of the total assembly length is contained in scaffolds of that size or larger
  • A higher scaffold N50 value indicates a more contiguous assembly, with fewer and larger scaffolds
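The definition above translates directly into a few lines of code: sort scaffold lengths from longest to shortest and accumulate them until at least half of the total assembly length is reached. The function name and example lengths are illustrative.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L contain at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300
print(n50(scaffold_lengths))  # 70, because 80 + 70 = 150 reaches half of 300
```

N50 depends on the total assembly length, so it is only comparable between assemblies of the same genome, and it says nothing about whether the joins are correct.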

Gap statistics

  • Gap statistics provide information about the number, size, and distribution of gaps in the scaffolded assembly
  • These statistics can include the total number of gaps, the average and median gap size, and the gap size distribution
  • Lower gap statistics (fewer and smaller gaps) generally indicate a more complete and contiguous assembly
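The statistics listed above are straightforward to compute once the gap lengths are known. The sketch below assumes a list of gap lengths in base pairs and the total assembly length; the dictionary keys and the `gap_fraction` field are illustrative choices rather than a standard output format.

```python
from statistics import mean, median


def gap_statistics(gap_lengths, assembly_length):
    """Summarize the gaps in a scaffolded assembly."""
    if not gap_lengths:
        return {"count": 0, "total_bp": 0, "mean": 0, "median": 0, "max": 0, "gap_fraction": 0.0}
    total = sum(gap_lengths)
    return {
        "count": len(gap_lengths),
        "total_bp": total,
        "mean": mean(gap_lengths),
        "median": median(gap_lengths),
        "max": max(gap_lengths),
        "gap_fraction": total / assembly_length,  # share of the assembly that is gap
    }


stats = gap_statistics([10, 3, 100, 7], assembly_length=1200)
print(stats)
```

Reporting the median alongside the mean is useful because a few very large gaps can dominate the average.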

Misassembly detection

  • Misassembly detection involves identifying regions in the assembly where the order or orientation of contigs is incorrect or where there are chimeric joins between unrelated sequences
  • Misassemblies can be detected using various approaches, such as comparing the assembly to a reference genome, analyzing read coverage and consistency, or using long-range information from mate-pair or long reads
  • Tools like QUAST, REAPR, and FRCbam can be used to detect and quantify misassemblies in an assembly
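One simple read-consistency check can be sketched as follows: read pairs mapped back to the assembly should be separated by roughly the library insert size, so windows containing several pairs with a strongly deviating separation are candidate misassemblies. This is a toy heuristic with assumed thresholds and input format, not a description of how QUAST, REAPR, or FRCbam work internally.

```python
from collections import Counter


def discordant_pairs(pairs, mean_insert, sd_insert, z=3.0):
    """Keep pairs whose mapped span deviates from the mean by more than z standard deviations."""
    return [
        (start, end)
        for start, end in pairs
        if abs((end - start) - mean_insert) > z * sd_insert
    ]


def flag_windows(pairs, mean_insert, sd_insert, window=1000, min_discordant=2):
    """Return start coordinates of windows holding at least min_discordant discordant pairs."""
    counts = Counter(
        start // window for start, _ in discordant_pairs(pairs, mean_insert, sd_insert)
    )
    return sorted(w * window for w, n in counts.items() if n >= min_discordant)


# Hypothetical mapped pairs as (leftmost start, rightmost end) on one scaffold.
mapped = [(100, 600), (1200, 1720), (2100, 4100), (2300, 4350), (2500, 3010), (5000, 5900)]
print(flag_windows(mapped, mean_insert=500, sd_insert=50))  # [2000]
```

Requiring more than one discordant pair per window reduces false alarms from isolated mapping errors.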

Completeness evaluation

  • Completeness evaluation assesses the extent to which the assembly captures the full content of the genome, including genes, regulatory elements, and other biologically relevant features
  • Completeness can be evaluated using benchmarking sets of conserved genes, such as single-copy orthologs, or by comparing the assembly to a closely related reference genome
  • Tools like BUSCO and CEGMA can be used to assess the completeness of an assembly based on the presence and completeness of conserved gene sets

Challenges and limitations

  • Despite advances in scaffolding and gap filling methods, there are still several challenges and limitations that can affect the quality and completeness of genome assemblies

Chimeric sequences

  • Chimeric sequences are artificial joins between unrelated sequences that can occur during the assembly process, particularly in regions with repetitive or complex sequences
  • Chimeric sequences can lead to misassemblies and incorrect representations of the genome structure
  • Identifying and resolving chimeric sequences can be challenging, as they may not be easily distinguishable from true biological variations or rearrangements

Misassemblies

  • Misassemblies are regions in the assembly where the order or orientation of contigs is incorrect, leading to a misrepresentation of the true genome structure
  • Misassemblies can arise from various sources, such as chimeric sequences, incorrect linking information, or errors in the assembly algorithms
  • Detecting and correcting misassemblies can be difficult, particularly in the absence of a high-quality reference genome or long-range connectivity information

Unresolved gaps

  • Despite the application of gap filling methods, some gaps in the assembly may remain unresolved due to various factors, such as repetitive sequences, low-coverage regions, or limitations of the available data and algorithms
  • Unresolved gaps can affect the continuity and completeness of the assembly and may hinder downstream analyses, such as gene annotation or comparative genomics
  • Closing all gaps in a genome assembly may not always be possible, particularly for large and complex genomes with extensive repetitive content

Computational complexity

  • Scaffolding and gap filling algorithms can be computationally intensive, particularly for large and complex genomes with high levels of repetitive sequences or heterozygosity
  • The computational complexity of these methods can increase with the size of the genome, the amount of sequencing data, and the complexity of the linking information
  • Scaling these algorithms to handle large datasets or multiple genomes can be challenging and may require significant computational resources and optimization efforts

Advances in scaffolding and gap filling

  • Recent advances in sequencing technologies, computational methods, and data integration approaches have led to improvements in scaffolding and gap filling strategies, enabling the generation of more complete and accurate genome assemblies

Optical mapping

  • Optical mapping is a technique that generates high-resolution physical maps of genomes by imaging and analyzing long, fluorescently labeled DNA molecules
  • These physical maps provide long-range connectivity information that can be used for scaffolding and validating the assembly
  • Optical mapping data can help resolve complex regions, identify misassemblies, and anchor scaffolds to chromosomes

Chromosome conformation capture

  • Chromosome conformation capture (3C) techniques, such as Hi-C, capture the spatial proximity of DNA sequences in the nucleus, providing long-range connectivity information for scaffolding; Dovetail Genomics is a commercial provider of Hi-C and related proximity-ligation libraries
  • These methods generate contact frequency maps that reflect the three-dimensional organization of the genome, allowing for the ordering and orientation of scaffolds into chromosome-scale assemblies
  • Hi-C and related proximity-ligation data can help resolve complex regions, identify misassemblies, and provide a framework for genome-wide scaffolding

Single-molecule sequencing

  • Single-molecule sequencing technologies, such as PacBio and Oxford Nanopore, generate long reads that can span tens to hundreds of kilobases, providing continuous sequence information for scaffolding and gap filling
  • These long reads can directly resolve repetitive regions and complex structures, reducing the need for complex computational methods to infer the genome structure
  • Single-molecule sequencing data can be used for hybrid assembly approaches, where long reads are combined with short reads to generate high-quality, contiguous assemblies

Artificial intelligence approaches

  • Artificial intelligence (AI) and machine learning (ML) approaches are being increasingly applied to genome assembly, scaffolding, and gap filling problems
  • These methods can learn patterns and features from large datasets, such as sequencing reads or assembly graphs, to make predictions and guide the assembly process
  • AI and ML approaches can be used for tasks such as error correction, repeat resolution, scaffolding, and gap filling, potentially improving the accuracy and efficiency of these processes
  • An example of a deep-learning tool used in assembly workflows is DeepVariant, which uses a neural network for variant calling and can support assembly polishing; LINKS, by contrast, scaffolds with long-read information using k-mer pairs rather than machine learning

Key Terms to Review (28)

Completeness evaluation: Completeness evaluation refers to the process of assessing how complete a genome assembly is, determining whether all regions of the genome are represented and identifying gaps in the assembly. This evaluation is crucial in genome scaffolding and gap filling as it helps researchers understand the accuracy and comprehensiveness of the assembled genomic data, guiding further efforts to improve the quality of the genome assembly.
Contig Length: Contig length refers to the total length of a contiguous sequence of DNA that has been assembled from overlapping fragments during the genome assembly process. This measure is crucial in evaluating the quality and completeness of the assembled genome, as longer contigs often indicate better assembly accuracy and provide more useful information for downstream analyses like genome annotation and comparative genomics.
De novo assembly: De novo assembly is the process of constructing a genome from scratch without the aid of a reference genome, utilizing sequences obtained from high-throughput sequencing technologies. This method is essential for analyzing species with no prior genomic information and is heavily reliant on the accuracy and efficiency of next-generation sequencing techniques.
DNA sequencing: DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. This technique enables researchers to read the genetic code, which can reveal important information about genes, genetic variations, and evolutionary relationships. Understanding DNA sequences is crucial for genome scaffolding and gap filling, as it allows for the assembly and validation of genomic data by providing insights into where gaps exist and how to bridge them.
Dynamic Programming: Dynamic programming is a method used to solve complex problems by breaking them down into simpler subproblems and storing the results of these subproblems to avoid redundant calculations. This technique is particularly useful in optimization problems, where it helps to efficiently find the best solution among many possible solutions. It is widely applied in bioinformatics for tasks such as aligning sequences, assembling genomes, filling gaps in genome scaffolding, and predicting gene structures.
Functional annotation: Functional annotation refers to the process of identifying the biological function of genes, proteins, and other genomic elements. This process is crucial for understanding how different components of an organism's genome contribute to its phenotype and biological processes, linking sequence data with functional insights across various research areas.
Gap filler: A gap filler is a bioinformatics tool or algorithm used to fill in the gaps in a genome assembly, which arise from incomplete sequencing data or unresolvable regions. This process improves the continuity and accuracy of the assembled genome, making it more useful for further analysis and interpretation. By using various methods, such as leveraging paired-end reads or additional sequencing data, gap fillers enhance the overall quality of genome assemblies.
Gap Statistics: Gap statistics summarize the gaps remaining in a scaffolded assembly, such as the total number of gaps, their average and median size, and their size distribution. Fewer and smaller gaps generally indicate a more complete and contiguous assembly, so these statistics are used to track progress during genome scaffolding and gap filling.
GapCloser: GapCloser is a gap filling tool distributed with the SOAPdenovo package that fills the gaps between contigs in scaffolds. It enhances the quality of assembled genomes by reducing the number of unsequenced regions, which can lead to a more complete and accurate representation of the organism's genetic material. GapCloser uses paired-end or mate-pair reads to infer the missing sequences and bridge the gaps in the assembly.
GapFiller: GapFiller is a computational tool used in genome assembly to close gaps within scaffolds by inferring the missing sequences from sequencing reads. It works from paired reads that map at or near gap edges and extends the flanking sequence into the gap, improving the continuity and quality of the assembled genome.
Gene mapping: Gene mapping is the process of determining the specific locations of genes on a chromosome, helping to understand the genetic architecture of organisms. This process is essential in identifying the relationships between genes, their functions, and their interactions, contributing to our knowledge of traits and diseases. Gene mapping aids in genome scaffolding by aligning sequences and filling gaps to create a more complete picture of genetic information.
Gene Prediction: Gene prediction refers to the computational process of identifying the locations and structures of genes within a DNA sequence. This process plays a critical role in genomics, as it helps in annotating genomes and understanding gene functions, which is essential for further biological analysis and research. Accurate gene prediction is crucial for the development of biological databases and tools, aiding in tasks such as genome scaffolding and understanding microbial communities.
GMcloser: GMcloser is a gap filling tool that closes gaps in scaffolds using preassembled contigs or long reads. It selects among candidate alignments with a likelihood-based approach informed by paired-end reads, improving the continuity of the assembled genome while limiting misassemblies.
Greedy Algorithm: A greedy algorithm is an approach to solving optimization problems by making a series of choices, each of which looks best at the moment. This method builds up a solution piece by piece, always choosing the next piece that offers the most immediate benefit. Greedy algorithms are particularly useful in fields like sequence assembly and genome scaffolding, where locally optimal choices often yield a good solution quickly, although they do not guarantee a globally optimal one.
Hybrid approaches: Hybrid approaches refer to methodologies that integrate multiple techniques or tools to solve complex problems or improve accuracy in various scientific fields, particularly in computational genomics. By combining the strengths of different methods, these approaches enhance the overall effectiveness in tasks such as reconstructing genome sequences and addressing variations like insertions and deletions. This adaptability is crucial for efficiently bridging gaps in data and creating a more comprehensive understanding of genomic structures.
Long-read sequencing: Long-read sequencing is a genomic sequencing method that produces longer contiguous reads of DNA, typically over 10,000 base pairs, allowing for a more comprehensive understanding of complex genomic regions. This technique enhances the assembly of genomes by spanning repetitive sequences and structural variations, making it invaluable for accurate genome scaffolding, detecting structural variations, and advancing metagenomics studies.
Low Coverage: Low coverage refers to a situation in genomic sequencing where only a small portion of the genome is represented by overlapping reads, leading to gaps in the data. This can pose challenges in accurately reconstructing the genome or identifying variations, particularly when trying to create a comprehensive assembly from short reads. In this context, understanding low coverage is crucial for effectively addressing issues during the assembly and scaffolding processes.
Mate-pair sequencing: Mate-pair sequencing is a next-generation sequencing technique that involves the generation of DNA fragments with known distance between them, allowing for more accurate reconstruction of genomes. This method enhances genome assembly by connecting distant sequences that are physically linked in the DNA, which helps in resolving complex regions and filling gaps in the genome.
Misassembly Detection: Misassembly detection is the process of identifying incorrect or erroneous arrangements of DNA sequences in genomic data, which can arise during genome assembly. Accurate detection of misassemblies is crucial for ensuring that assembled genomes accurately represent the underlying biological information. It involves analyzing discrepancies in sequence alignments, coverage, and structural variations, which can inform researchers about potential errors that need correction.
N50: N50 is a statistical measure used to assess the contiguity of genome assemblies. With contigs or scaffolds sorted from longest to shortest, it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. This metric provides insight into the continuity of assembled sequences, serving as a benchmark for comparing different assembly methods and strategies.
Reference-guided assembly: Reference-guided assembly is a bioinformatics technique used to reconstruct genomes by aligning short DNA reads to a known reference genome, which helps in accurately assembling sequences while leveraging the existing genomic context. This method aids in resolving complex regions of the genome and enhances the quality of assembled sequences by reducing errors often found in de novo assembly. By comparing new data against a reference, it allows for better identification of structural variations and facilitates gap filling within scaffolded sequences.
Repetitive sequences: Repetitive sequences are segments of DNA that are repeated multiple times within a genome. These sequences can vary in length and complexity, and they can be classified into different categories such as microsatellites, minisatellites, and transposable elements. The presence of repetitive sequences can complicate genome assembly and scaffolding processes due to their tendency to cause ambiguity in sequence alignment, which is critical for accurate genomic analysis.
RNA Sequencing: RNA sequencing is a powerful technique used to analyze the quantity and sequences of RNA in a biological sample, allowing researchers to understand gene expression and regulation. By converting RNA into complementary DNA (cDNA) and sequencing it, this method provides insights into the transcriptome, revealing which genes are active under specific conditions. This data can be crucial for genome scaffolding and gap filling as it helps identify missing regions and annotate genes accurately.
Scaffold N50 metric: The scaffold N50 metric is a statistical measure used to assess the contiguity of genome assemblies at the scaffold level. It is the length of the shortest scaffold among the longest scaffolds that together account for at least half of the total assembly length. It provides a way to evaluate how well the genomic data has been organized into longer sequences, which is essential for accurate genome analysis and comparison.
SOAPdenovo: SOAPdenovo is a de novo genome assembly tool that builds a de Bruijn graph from short reads to reconstruct a genomic sequence without a reference genome. It is designed for short-read sequencing data, and its package includes the GapCloser module for filling gaps in the resulting scaffolds.
SPAdes: SPAdes is a genome assembly software tool designed for reconstructing genomes from next-generation sequencing (NGS) data. It utilizes various algorithms to produce high-quality assemblies, making it particularly useful for de novo assembly and improving the scaffolding and gap-filling processes. SPAdes is popular due to its ability to handle a wide range of sequencing technologies and its flexibility in adapting to different types of genomic data.
Structural Variation: Structural variation refers to large-scale alterations in the structure of chromosomes, which can include deletions, duplications, inversions, or translocations of genomic segments. These variations can significantly impact genome architecture and function, playing a crucial role in evolution, genetic diversity, and disease susceptibility. Understanding structural variation is essential for assembling genomes accurately and filling gaps during the genome scaffolding process.
TGS-GapCloser: TGS-GapCloser is a software tool designed to fill gaps in genome assemblies using long reads from third-generation sequencing technologies. It enhances the completeness of genomic data by leveraging long reads to connect contigs and close gaps, which is crucial for producing high-quality reference genomes.
© 2024 Fiveable Inc. All rights reserved.