Genome scaffolding and gap filling are crucial steps in reconstructing complete genome sequences. These processes involve ordering and orienting contigs into larger structures called scaffolds, and then attempting to close the gaps between them.
These techniques help overcome challenges like repetitive sequences, sequencing errors, and incomplete coverage. By using various data sources and algorithms, researchers can improve the accuracy and completeness of genome assemblies, enabling better downstream analyses and insights into genetic information.
Genome assembly challenges
Genome assembly is the process of reconstructing a complete genome sequence from shorter DNA fragments, which presents several challenges due to the complexity and size of genomes
These challenges can lead to fragmented assemblies, misassemblies, and gaps in the reconstructed genome sequence, affecting the accuracy and completeness of the assembly
Repetitive sequences
Repetitive sequences, such as transposable elements, satellite DNA, and segmental duplications, are abundant in many genomes and can span hundreds to thousands of base pairs
These sequences can cause ambiguities during assembly, as the shorter sequencing reads may not be long enough to span the entire repetitive region, making it difficult to determine their correct placement
Repetitive sequences can lead to collapsed repeats in the assembly, where multiple copies of the repeat are represented as a single copy, or expanded repeats, where the number of copies is overestimated
Sequencing errors
Sequencing technologies are not perfect and can introduce errors in the DNA sequence, such as substitutions, insertions, or deletions of nucleotides
These errors can create false overlaps between sequencing reads, leading to misassemblies or fragmented contigs
Sequencing errors can also introduce false variations in the assembled genome, which can be mistaken for true biological variations, such as single nucleotide polymorphisms (SNPs) or structural variations
Incomplete coverage
Sequencing coverage refers to the average number of times each base in the genome is represented in the sequencing reads
Incomplete coverage can occur due to biases in the sequencing process, such as GC content bias or uneven amplification, leading to regions of the genome with low or no coverage
Insufficient coverage can result in gaps in the assembled genome sequence, as there may not be enough overlapping reads to bridge certain regions, especially those with repetitive or complex sequences
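The relationship between sequencing depth and gaps can be sketched with the classic Poisson approximation. This is an idealized illustration that assumes reads are sampled uniformly at random, which the biases described above violate in practice; the function names and example numbers are hypothetical.

```python
import math

def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth C = N * L / G."""
    return num_reads * read_length / genome_size

def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson approximation: probability that a base is hit by zero reads."""
    return math.exp(-coverage)

# Hypothetical run: one million 150 bp reads from a 5 Mb genome.
c = mean_coverage(num_reads=1_000_000, read_length=150, genome_size=5_000_000)
print(f"coverage = {c:.1f}x")
print(f"expected uncovered fraction = {expected_uncovered_fraction(c):.2e}")
```

Real assemblies usually show more uncovered sequence than this model predicts, because coverage is not uniform.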
Scaffolding
Scaffolding is the process of ordering and orienting contigs (contiguous sequences) into larger structures called scaffolds, which represent the relative positions and orientations of the contigs in the genome
Scaffolding aims to bridge gaps between contigs and provide a more complete and accurate representation of the genome structure
Scaffold graph construction
Scaffold graph construction involves creating a graph representation of the contigs and their connections based on additional information, such as mate-pair or paired-end reads
In a scaffold graph, nodes represent contigs, and edges represent the connections between contigs based on the linking information
The graph structure allows for the identification of the most likely order and orientation of contigs, considering the constraints imposed by the linking data
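A minimal sketch of scaffold graph construction, assuming the linking data has already been reduced to one (contig, contig) tuple per read pair whose two reads map to different contigs. The function name, the `min_support` threshold, and the contig names are illustrative, and the sketch deliberately ignores orientation and distance, which real scaffolders also record on each edge.

```python
from collections import defaultdict

def build_scaffold_graph(links, min_support=2):
    """Count read-pair links between contigs and keep well-supported edges.

    links: iterable of (contig_a, contig_b), one entry per linking read pair.
    Returns {(a, b): support} with a < b, for edges seen >= min_support times.
    """
    support = defaultdict(int)
    for a, b in links:
        if a == b:
            continue  # both reads in the same contig: no scaffolding information
        edge = tuple(sorted((a, b)))  # undirected edge, canonical order
        support[edge] += 1
    return {edge: n for edge, n in support.items() if n >= min_support}

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c1", "c2"), ("c3", "c3")]
print(build_scaffold_graph(links))  # {('c1', 'c2'): 3}
```

Requiring a minimum number of supporting pairs is a common way to discard spurious links caused by chimeric reads or repeats.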
Mate-pair reads
Mate-pair reads are sequencing reads generated from DNA fragments with a larger insert size (2-5 kb or more) compared to standard paired-end reads
These reads span larger distances in the genome and can provide long-range connectivity information for scaffolding
Mate-pair reads can help bridge gaps between contigs and resolve repetitive regions, as they can span these problematic areas and provide linking information
Paired-end reads
Paired-end reads are sequencing reads generated from both ends of DNA fragments with a known orientation and approximate distance between them
The insert size of paired-end libraries (typically 200-500 bp) is much shorter than that of mate-pair libraries, but paired-end reads can still provide valuable linking information for scaffolding
Paired-end reads can help orient contigs and estimate the size of gaps between them based on the expected insert size of the library
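Gap size estimation from insert size comes down to simple arithmetic: the part of the fragment not accounted for by the two contigs must lie in the gap. The sketch below assumes contig A lies to the left of contig B, both in forward orientation, and that read positions come from an aligner; the function name and input layout are hypothetical.

```python
from statistics import mean

def estimate_gap(pairs, len_a, insert_size):
    """Estimate the gap between contig A (left) and contig B (right).

    pairs: list of (start_in_a, end_in_b); start_in_a is the 0-based start of
    the forward read on contig A, end_in_b is the exclusive end of its mate on
    contig B. insert_size is the expected fragment length of the library.
    """
    # fragment = (bases on A) + gap + (bases on B), so solve for the gap
    estimates = [insert_size - (len_a - start_a) - end_b
                 for start_a, end_b in pairs]
    return mean(estimates)

pairs = [(800, 100), (850, 160), (900, 190)]
print(estimate_gap(pairs, len_a=1000, insert_size=500))  # 200
```

A negative estimate suggests that the two contigs overlap rather than being separated by a gap.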
Long-read technologies
Long-read technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), generate reads that can span several kilobases to hundreds of kilobases in length
These longer reads can directly span repetitive regions and provide continuous sequence information, facilitating the assembly of complex genomes
Long reads can be used for scaffolding by directly connecting contigs or by providing a backbone for hybrid assembly approaches that combine long and short reads
Gap filling
Gap filling is the process of attempting to close the gaps between contigs in a scaffold by generating or identifying sequences that can fill these gaps
Gap filling aims to improve the continuity and completeness of the genome assembly by reducing the number and size of gaps
Gap identification
Gap identification involves locating the gaps between contigs in a scaffold based on the available linking information and the expected distance between contigs
Gaps can be identified by analyzing the scaffold graph and identifying regions where there are no connecting edges between contigs or where the distance between contigs exceeds the expected insert size of the linking reads
The size and location of gaps can be estimated based on the linking information and the size of the contigs flanking the gap
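Once scaffolds have been written out as sequences, gaps conventionally appear as runs of `N` characters, so they can also be located directly in the scaffold sequence. A minimal sketch, assuming that convention; the function name and the `min_len` parameter are illustrative.

```python
import re

def find_gaps(scaffold_seq, min_len=1):
    """Return half-open (start, end) coordinates of each run of N/n."""
    return [(m.start(), m.end())
            for m in re.finditer(r"[Nn]+", scaffold_seq)
            if m.end() - m.start() >= min_len]

seq = "ACGTNNNNNACGTTnnACG"
print(find_gaps(seq))             # [(4, 9), (14, 16)]
print(find_gaps(seq, min_len=3))  # [(4, 9)]
```

Note that the length of an `N` run is only an estimate of the true gap size, since it is derived from the linking information.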
Local assembly methods
Local assembly methods aim to fill gaps by performing a targeted assembly of the reads that map to the regions flanking the gap
These methods extract the reads that align to the contigs on either side of the gap and attempt to assemble them into a contiguous sequence that spans the gap
Local assembly can be performed using short reads, mate-pair reads, or a combination of both, depending on the available data and the size of the gap
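The idea of local assembly can be illustrated with a toy seed-and-extend routine: start from the last k bases of the left flank and extend one base at a time using k-mers observed in the reads, until the start of the right flank is reached. This is a sketch under strong assumptions (error-free reads, forward strand only, no repeats in the gap), not the method of any particular tool; all names are hypothetical.

```python
from collections import defaultdict

def fill_gap(left_flank, right_flank, reads, k=4, max_len=1000):
    """Return the sequence between the flanks, or None if it cannot be resolved."""
    # For every k-mer in the reads, record which bases follow it.
    followers = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            followers[read[i:i + k]].add(read[i + k])

    target = right_flank[:k]   # stop once the right flank starts
    seq = left_flank[-k:]      # seed: end of the left flank
    while len(seq) - k <= max_len + k:
        options = followers.get(seq[-k:], set())
        if len(options) != 1:
            return None        # dead end (no reads) or ambiguous branch
        seq += next(iter(options))
        if len(seq) >= 2 * k and seq.endswith(target):
            return seq[k:-k]   # strip the seed and the target k-mer
    return None                # gave up: extension exceeded max_len

reads = ["TACAGGCTT", "GGCTTCAAG", "TCAAGTCC"]
print(fill_gap("GATTACA", "AGTCCGA", reads))  # GGCTTCA
```

Returning `None` on ambiguity reflects a conservative design choice: leaving a gap open is usually preferable to closing it with the wrong sequence.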
Reference-based approaches
Reference-based gap filling approaches utilize a closely related reference genome to guide the gap filling process
These methods align the contigs flanking the gap to the reference genome and attempt to identify the corresponding sequence in the reference that can fill the gap
Reference-based approaches can be effective when a high-quality reference genome is available for a closely related species, but they may not capture species-specific variations or novel sequences
De novo gap filling
De novo gap filling methods attempt to fill gaps without relying on a reference genome, using only the sequencing reads and the assembly information
These approaches often involve iterative rounds of local assembly, where the reads mapping to the gap region are assembled, and the resulting contigs are incorporated back into the assembly
De novo gap filling can be computationally intensive and may require multiple iterations to close larger gaps or resolve complex regions
Scaffolding algorithms
Scaffolding algorithms are computational methods designed to order and orient contigs into scaffolds based on linking information from paired-end reads, mate-pair reads, or other sources of long-range connectivity
These algorithms aim to find the most likely arrangement of contigs that satisfies the constraints imposed by the linking data while minimizing conflicts and inconsistencies
Greedy algorithms
Greedy scaffolding algorithms make locally optimal decisions at each step, iteratively joining contigs into scaffolds based on the strongest linking evidence
These algorithms typically start with the longest contigs and progressively add shorter contigs to the scaffolds, prioritizing the links with the highest support or consistency
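SSPACE is a widely used example of a greedy scaffolder; SOPRA and GRASS are often listed alongside it, although both rely more heavily on optimization-based formulations than on a purely greedy strategy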
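The greedy strategy can be sketched as follows: sort links by support, then accept each link unless it would give a contig a third neighbour or close a cycle. This toy version ignores orientation and gap distances and is not the algorithm of any specific tool; the function name and input format are assumptions.

```python
def greedy_scaffold(links):
    """Greedily pick links so that contigs form simple paths.

    links: list of (contig_a, contig_b, support).
    Returns the accepted links in the order they were chosen.
    """
    parent, degree = {}, {}

    def find(x):
        # Union-find lookup with path halving, used to detect cycles.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = []
    for a, b, support in sorted(links, key=lambda link: -link[2]):
        if degree.get(a, 0) >= 2 or degree.get(b, 0) >= 2:
            continue  # contig already has a neighbour on both sides
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # joining would close a cycle
        parent[root_a] = root_b
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        accepted.append((a, b, support))
    return accepted

links = [("c1", "c2", 10), ("c2", "c3", 8), ("c1", "c3", 3),
         ("c2", "c4", 5), ("c3", "c4", 2)]
print(greedy_scaffold(links))
# [('c1', 'c2', 10), ('c2', 'c3', 8), ('c3', 'c4', 2)]
```

In this example the link c2-c4 (support 5) is rejected in favour of the weaker c3-c4 link, which shows how early greedy choices constrain later ones.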
Graph-based algorithms
Graph-based scaffolding algorithms represent the contigs and their connections as a graph, where nodes represent contigs and edges represent the linking information
These algorithms often use advanced graph theory concepts, such as minimum spanning trees, maximum likelihood paths, or network flow optimization, to find the most likely arrangement of contigs
Graph-based algorithms can handle complex linking patterns and can be more robust to errors and inconsistencies in the data compared to greedy approaches
Examples of graph-based scaffolding algorithms include BESST, ScaffMatch, and Opera
Hybrid approaches
Hybrid scaffolding approaches combine multiple sources of information or different algorithmic strategies to improve the accuracy and completeness of the scaffolding process
These approaches may integrate data from different sequencing technologies (e.g., short reads and long reads), use a combination of greedy and graph-based methods, or incorporate additional information such as physical maps or genetic linkage data
Hybrid approaches aim to leverage the strengths of different data types and algorithms to overcome the limitations of individual methods and produce high-quality scaffolds
Examples of hybrid scaffolding approaches include RAILS, LINKS, and SALSA
Gap filling algorithms
Gap filling algorithms are computational methods designed to close the gaps between contigs in a scaffold by generating or identifying sequences that can span these gaps
These algorithms utilize various strategies, such as local assembly, reference-based approaches, or de novo methods, to generate gap-filling sequences
GapCloser
GapCloser is a gap filling tool that is part of the SOAPdenovo package, which is commonly used for de novo genome assembly
It uses a local assembly approach to fill gaps by extracting reads that map to the contigs flanking the gap and performing a targeted assembly of these reads
GapCloser iteratively extends the contigs into the gap region using a k-mer-based approach, attempting to find overlaps between the reads and the contig ends
GapFiller
GapFiller is a standalone gap filling tool that uses a local assembly approach driven by paired reads and does not require a reference genome
It looks for read pairs in which one read aligns to the scaffold sequence flanking a gap while its mate falls partly or wholly inside the gap
The reads falling in the gap are then used to extend the gap edges through k-mer overlap in a seed-and-extend fashion, iterating until the gap is closed or no further extension is possible
GMcloser
GMcloser is a gap filling tool that closes gaps in a draft genome assembly using a preassembled contig set or long reads
It aligns these contigs or long reads to the scaffold regions flanking each gap to find candidate sequences that span the gap
GMcloser uses likelihood-based selection, informed by alignment statistics and paired-end reads, to choose among candidate alignments and reduce misassembled gap closures
TGS-GapCloser
TGS-GapCloser is a gap filling tool designed to close gaps using long reads generated by third-generation sequencing (TGS) technologies, such as PacBio or Oxford Nanopore
It utilizes the long reads to directly span the gaps between contigs, providing a continuous sequence that can close the gap
TGS-GapCloser can handle errors and variations in the long reads by performing a local alignment and consensus calling step to refine the gap-filling sequence
Quality assessment
Quality assessment is the process of evaluating the accuracy, completeness, and contiguity of a genome assembly, including the scaffolds and gap-filled sequences
Various metrics and approaches are used to assess the quality of an assembly and identify potential issues or areas for improvement
Scaffold N50 metric
The scaffold N50 is a commonly used metric to assess the contiguity of a genome assembly at the scaffold level
It is the scaffold length L such that scaffolds of length L or longer together contain at least 50% of the total assembly length
A higher scaffold N50 value indicates a more contiguous assembly, with fewer and larger scaffolds
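The N50 definition translates directly into a few lines of code: sort the scaffold lengths from longest to shortest and accumulate them until half of the total length is reached. A minimal sketch; the function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the assembly
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # 80
```

Here the total is 300; the longest scaffold covers 100, and adding the second brings the running sum to 180, which passes the halfway mark of 150, so the N50 is 80.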
Gap statistics
Gap statistics provide information about the number, size, and distribution of gaps in the scaffolded assembly
These statistics can include the total number of gaps, the average and median gap size, and the gap size distribution
Fewer and smaller gaps generally indicate a more complete and contiguous assembly
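These summary values are straightforward to compute once the gap lengths are known. A minimal sketch, assuming the gap lengths (in bp) have already been extracted from the scaffolds; the function name and the returned keys are illustrative.

```python
from statistics import mean, median

def gap_summary(gap_lengths):
    """Summarize a list of gap lengths (bp) from a scaffolded assembly."""
    if not gap_lengths:
        return {"count": 0, "total": 0, "mean": 0, "median": 0, "max": 0}
    return {
        "count": len(gap_lengths),
        "total": sum(gap_lengths),
        "mean": mean(gap_lengths),
        "median": median(gap_lengths),
        "max": max(gap_lengths),
    }

print(gap_summary([100, 500, 50, 1350]))
```

Reporting the median alongside the mean is useful because a few very large gaps can dominate the average.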
Misassembly detection
Misassembly detection involves identifying regions in the assembly where the order or orientation of contigs is incorrect or where there are chimeric joins between unrelated sequences
Misassemblies can be detected using various approaches, such as comparing the assembly to a reference genome, analyzing read coverage and consistency, or using long-range information from mate-pair or long reads
Tools like QUAST, REAPR, and FRCbam can be used to detect and quantify misassemblies in an assembly
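One simple read-consistency check flags read pairs whose mapped distance is far from the library's expected insert size, since clusters of such pairs can indicate a misjoin. The sketch below is a simplified heuristic and not the method used by QUAST, REAPR, or FRCbam; the function name, the input format, and the three-standard-deviation cutoff are assumptions.

```python
def discordant_pairs(pairs, mean_insert, sd_insert, n_sd=3.0):
    """Return read pairs whose span deviates from the expected insert size.

    pairs: list of (left_pos, right_pos) for pairs mapped to the same scaffold,
    where right_pos - left_pos is the observed fragment span.
    """
    low = mean_insert - n_sd * sd_insert
    high = mean_insert + n_sd * sd_insert
    return [(left, right) for left, right in pairs
            if not low <= right - left <= high]

pairs = [(100, 610), (2000, 2495), (3000, 5200), (4100, 4300)]
print(discordant_pairs(pairs, mean_insert=500, sd_insert=30))
# [(3000, 5200), (4100, 4300)]
```

A single discordant pair is weak evidence; in practice one would look for many such pairs piling up around the same position before calling a misassembly.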
Completeness evaluation
Completeness evaluation assesses the extent to which the assembly captures the full content of the genome, including genes, regulatory elements, and other biologically relevant features
Completeness can be evaluated using benchmarking sets of conserved genes, such as single-copy orthologs, or by comparing the assembly to a closely related reference genome
Tools like BUSCO and CEGMA can be used to assess the completeness of an assembly based on the presence and completeness of conserved gene sets
Challenges and limitations
Despite advances in scaffolding and gap filling methods, there are still several challenges and limitations that can affect the quality and completeness of genome assemblies
Chimeric sequences
Chimeric sequences are artificial joins between unrelated sequences that can occur during the assembly process, particularly in regions with repetitive or complex sequences
Chimeric sequences can lead to misassemblies and incorrect representations of the genome structure
Identifying and resolving chimeric sequences can be challenging, as they may not be easily distinguishable from true biological variations or rearrangements
Misassemblies
Misassemblies are regions in the assembly where the order or orientation of contigs is incorrect, leading to a misrepresentation of the true genome structure
Misassemblies can arise from various sources, such as chimeric sequences, incorrect linking information, or errors in the assembly algorithms
Detecting and correcting misassemblies can be difficult, particularly in the absence of a high-quality reference genome or long-range connectivity information
Unresolved gaps
Despite the application of gap filling methods, some gaps in the assembly may remain unresolved due to various factors, such as repetitive sequences, low-coverage regions, or limitations of the available data and algorithms
Unresolved gaps can affect the continuity and completeness of the assembly and may hinder downstream analyses, such as gene annotation or comparative genomics
Closing all gaps in a genome assembly may not always be possible, particularly for large and complex genomes with extensive repetitive content
Computational complexity
Scaffolding and gap filling algorithms can be computationally intensive, particularly for large and complex genomes with high levels of repetitive sequences or heterozygosity
The computational complexity of these methods can increase with the size of the genome, the amount of sequencing data, and the complexity of the linking information
Scaling these algorithms to handle large datasets or multiple genomes can be challenging and may require significant computational resources and optimization efforts
Advances in scaffolding and gap filling
Recent advances in sequencing technologies, computational methods, and data integration approaches have led to improvements in scaffolding and gap filling strategies, enabling the generation of more complete and accurate genome assemblies
Optical mapping
Optical mapping is a technique that generates high-resolution physical maps of genomes by imaging and analyzing long, fluorescently labeled DNA molecules
These physical maps provide long-range connectivity information that can be used for scaffolding and validating the assembly
Optical mapping data can help resolve complex regions, identify misassemblies, and anchor scaffolds to chromosomes
Chromosome conformation capture
Chromosome conformation capture (3C) techniques, such as Hi-C and Dovetail, capture the spatial proximity of DNA sequences in the nucleus, providing long-range connectivity information for scaffolding
These methods generate contact frequency maps that reflect the three-dimensional organization of the genome, allowing for the ordering and orientation of scaffolds into chromosome-scale assemblies
Hi-C and Dovetail data can help resolve complex regions, identify misassemblies, and provide a framework for genome-wide scaffolding
Single-molecule sequencing
Single-molecule sequencing technologies, such as PacBio and Oxford Nanopore, generate long reads that can span tens to hundreds of kilobases, providing continuous sequence information for scaffolding and gap filling
These long reads can directly resolve repetitive regions and complex structures, reducing the need for complex computational methods to infer the genome structure
Single-molecule sequencing data can be used for hybrid assembly approaches, where long reads are combined with short reads to generate high-quality, contiguous assemblies
Artificial intelligence approaches
Artificial intelligence (AI) and machine learning (ML) approaches are being increasingly applied to genome assembly, scaffolding, and gap filling problems
These methods can learn patterns and features from large datasets, such as sequencing reads or assembly graphs, to make predictions and guide the assembly process
AI and ML approaches can be used for tasks such as error correction, repeat resolution, scaffolding, and gap filling, potentially improving the accuracy and efficiency of these processes
DeepVariant, which uses deep learning for variant calling and can support assembly polishing, is one example of an AI-based tool applied in this area; LINKS, although sometimes listed alongside it, scaffolds with long-read k-mer pairs rather than machine learning
Key Terms to Review (28)
Completeness evaluation: Completeness evaluation refers to the process of assessing how complete a genome assembly is, determining whether all regions of the genome are represented and identifying gaps in the assembly. This evaluation is crucial in genome scaffolding and gap filling as it helps researchers understand the accuracy and comprehensiveness of the assembled genomic data, guiding further efforts to improve the quality of the genome assembly.
Contig Length: Contig length refers to the total length of a contiguous sequence of DNA that has been assembled from overlapping fragments during the genome assembly process. This measure is crucial in evaluating the quality and completeness of the assembled genome, as longer contigs often indicate better assembly accuracy and provide more useful information for downstream analyses like genome annotation and comparative genomics.
De novo assembly: De novo assembly is the process of constructing a genome from scratch without the aid of a reference genome, utilizing sequences obtained from high-throughput sequencing technologies. This method is essential for analyzing species with no prior genomic information and is heavily reliant on the accuracy and efficiency of next-generation sequencing techniques.
DNA sequencing: DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. This technique enables researchers to read the genetic code, which can reveal important information about genes, genetic variations, and evolutionary relationships. Understanding DNA sequences is crucial for genome scaffolding and gap filling, as it allows for the assembly and validation of genomic data by providing insights into where gaps exist and how to bridge them.
Dynamic Programming: Dynamic programming is a method used to solve complex problems by breaking them down into simpler subproblems and storing the results of these subproblems to avoid redundant calculations. This technique is particularly useful in optimization problems, where it helps to efficiently find the best solution among many possible solutions. It is widely applied in bioinformatics for tasks such as aligning sequences, assembling genomes, filling gaps in genome scaffolding, and predicting gene structures.
Functional annotation: Functional annotation refers to the process of identifying the biological function of genes, proteins, and other genomic elements. This process is crucial for understanding how different components of an organism's genome contribute to its phenotype and biological processes, linking sequence data with functional insights across various research areas.
Gap filler: A gap filler is a bioinformatics tool or algorithm used to fill in the gaps in a genome assembly, which arise from incomplete sequencing data or unresolvable regions. This process improves the continuity and accuracy of the assembled genome, making it more useful for further analysis and interpretation. By using various methods, such as leveraging paired-end reads or additional sequencing data, gap fillers enhance the overall quality of genome assemblies.
Gap Statistics: In genome assembly, gap statistics are summary measures of the gaps remaining in a scaffolded assembly, such as the total number of gaps, their average and median size, and their size distribution. They should not be confused with the gap statistic used in cluster analysis to choose the number of clusters. In genome scaffolding and gap filling, these measures help track how complete and contiguous an assembly is.
GapCloser: GapCloser is a computational tool, distributed with SOAPdenovo, that is used in genomics to fill in the gaps between contigs during the genome assembly process. This process enhances the quality of assembled genomes by reducing the number of unsequenced regions, which can lead to a more complete and accurate representation of the organism's genetic material. GapCloser works by using paired-end or mate-pair reads to infer the missing sequences and bridge the gaps in the assembly.
GapFiller: GapFiller is a computational tool used in genome assembly to close gaps within scaffolds by inferring the missing sequences from paired reads. This process is essential in creating a more complete and accurate representation of the genome, enhancing the continuity of sequence data. GapFiller aligns reads to the sequence flanking each gap and extends into the gap iteratively, ultimately improving the quality of the assembled genome.
Gene mapping: Gene mapping is the process of determining the specific locations of genes on a chromosome, helping to understand the genetic architecture of organisms. This process is essential in identifying the relationships between genes, their functions, and their interactions, contributing to our knowledge of traits and diseases. Gene mapping aids in genome scaffolding by aligning sequences and filling gaps to create a more complete picture of genetic information.
Gene Prediction: Gene prediction refers to the computational process of identifying the locations and structures of genes within a DNA sequence. This process plays a critical role in genomics, as it helps in annotating genomes and understanding gene functions, which is essential for further biological analysis and research. Accurate gene prediction is crucial for the development of biological databases and tools, aiding in tasks such as genome scaffolding and understanding microbial communities.
GMcloser: GMcloser is a tool used in genome assembly that helps to create more complete genome scaffolds by filling gaps between contigs. This tool employs algorithms to integrate information from paired-end reads and other genomic data to bridge the gaps, improving the overall continuity of assembled genomes. It enhances the quality of genomic data by providing a clearer picture of the structure and organization of the genome.
Greedy Algorithm: A greedy algorithm is an approach to solving optimization problems by making a series of choices, each of which looks best at the moment. This method builds up a solution piece by piece, always choosing the next piece that offers the most immediate benefit. Greedy algorithms are particularly useful in fields like sequence assembly and genome scaffolding, where making locally optimal choices is fast and often produces a good solution, although it does not guarantee a globally optimal one.
Hybrid approaches: Hybrid approaches refer to methodologies that integrate multiple techniques or tools to solve complex problems or improve accuracy in various scientific fields, particularly in computational genomics. By combining the strengths of different methods, these approaches enhance the overall effectiveness in tasks such as reconstructing genome sequences and addressing variations like insertions and deletions. This adaptability is crucial for efficiently bridging gaps in data and creating a more comprehensive understanding of genomic structures.
Long-read sequencing: Long-read sequencing is a genomic sequencing method that produces longer contiguous reads of DNA, typically over 10,000 base pairs, allowing for a more comprehensive understanding of complex genomic regions. This technique enhances the assembly of genomes by spanning repetitive sequences and structural variations, making it invaluable for accurate genome scaffolding, detecting structural variations, and advancing metagenomics studies.
Low Coverage: Low coverage refers to a situation in genomic sequencing where only a small portion of the genome is represented by overlapping reads, leading to gaps in the data. This can pose challenges in accurately reconstructing the genome or identifying variations, particularly when trying to create a comprehensive assembly from short reads. In this context, understanding low coverage is crucial for effectively addressing issues during the assembly and scaffolding processes.
Mate-pair sequencing: Mate-pair sequencing is a next-generation sequencing technique that involves the generation of DNA fragments with known distance between them, allowing for more accurate reconstruction of genomes. This method enhances genome assembly by connecting distant sequences that are physically linked in the DNA, which helps in resolving complex regions and filling gaps in the genome.
Misassembly Detection: Misassembly detection is the process of identifying incorrect or erroneous arrangements of DNA sequences in genomic data, which can arise during genome assembly. Accurate detection of misassemblies is crucial for ensuring that assembled genomes accurately represent the underlying biological information. It involves analyzing discrepancies in sequence alignments, coverage, and structural variations, which can inform researchers about potential errors that need correction.
N50: N50 is a statistical measure used to assess the contiguity of genome assemblies. When contigs or scaffolds are sorted from longest to shortest, N50 is the length of the shortest sequence in the smallest set that covers at least half of the total assembly length. This metric provides insight into the continuity of assembled sequences, serving as a benchmark for comparing different assembly methods and strategies.
Reference-guided assembly: Reference-guided assembly is a bioinformatics technique used to reconstruct genomes by aligning short DNA reads to a known reference genome, which helps in accurately assembling sequences while leveraging the existing genomic context. This method aids in resolving complex regions of the genome and enhances the quality of assembled sequences by reducing errors often found in de novo assembly. By comparing new data against a reference, it allows for better identification of structural variations and facilitates gap filling within scaffolded sequences.
Repetitive sequences: Repetitive sequences are segments of DNA that are repeated multiple times within a genome. These sequences can vary in length and complexity, and they can be classified into different categories such as microsatellites, minisatellites, and transposable elements. The presence of repetitive sequences can complicate genome assembly and scaffolding processes due to their tendency to cause ambiguity in sequence alignment, which is critical for accurate genomic analysis.
RNA Sequencing: RNA sequencing is a powerful technique used to analyze the quantity and sequences of RNA in a biological sample, allowing researchers to understand gene expression and regulation. By converting RNA into complementary DNA (cDNA) and sequencing it, this method provides insights into the transcriptome, revealing which genes are active under specific conditions. This data can be crucial for genome scaffolding and gap filling as it helps identify missing regions and annotate genes accurately.
Scaffold N50 metric: The scaffold N50 metric is a statistical measure used to assess the contiguity of genome assemblies: it is the length of the shortest scaffold among the longest scaffolds that together account for at least half of the total assembly length. It provides a way to evaluate how well the genomic data has been organized into longer contiguous sequences, which is essential for accurate genome analysis and comparison.
SOAPdenovo: SOAPdenovo is a de novo genome assembly tool that builds a de Bruijn graph from sequencing reads to reconstruct a genomic sequence without a reference genome. This method is particularly effective for assembling genomes from short-read sequencing technologies, allowing researchers to reconstruct genomes from scratch, and its package includes GapCloser for filling gaps in the resulting scaffolds.
SPAdes: SPAdes is a genome assembly software tool designed for reconstructing genomes from next-generation sequencing (NGS) data. It utilizes various algorithms to produce high-quality assemblies, making it particularly useful for de novo assembly and improving the scaffolding and gap-filling processes. SPAdes is popular due to its ability to handle a wide range of sequencing technologies and its flexibility in adapting to different types of genomic data.
Structural Variation: Structural variation refers to large-scale alterations in the structure of chromosomes, which can include deletions, duplications, inversions, or translocations of genomic segments. These variations can significantly impact genome architecture and function, playing a crucial role in evolution, genetic diversity, and disease susceptibility. Understanding structural variation is essential for assembling genomes accurately and filling gaps during the genome scaffolding process.
TGS-GapCloser: TGS-GapCloser is a software tool designed to fill gaps in genome assemblies using long reads from third-generation sequencing technologies. It enhances the completeness of genomic data by leveraging long reads to connect contigs and close gaps, which is crucial for producing high-quality reference genomes.