Fiveable

🧬Genomics Unit 2 Review

2.4 Genome assembly strategies and algorithms

Written by the Fiveable Content Team • Last updated August 2025

Genome assembly is the computational step that follows DNA sequencing: it pieces sequencing reads back together to reconstruct the genome they came from. It's like solving a massive jigsaw puzzle with millions to billions of pieces and no picture on the box. The main challenges are repetitive sequences and sequencing errors.

Two main approaches are used: de novo assembly, which builds the genome from scratch, and reference-guided assembly, which uses a similar genome as a template. Algorithms like overlap-layout-consensus and de Bruijn graphs help tackle this complex task, while quality metrics help assess whether the final assembly is accurate and complete.

Genome assembly challenges and goals

Challenges in genome assembly

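  • Repetitive sequences complicate assembly: reads drawn from different copies of a repeat look identical, so the assembler cannot tell which copy they belong to or in what order the flanking regions should be joined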
  • Sequencing errors can lead to incorrect base calls and misassemblies
  • Uneven coverage across the genome can result in gaps or poorly assembled regions
  • Computational complexity of assembling large genomes requires significant resources and efficient algorithms
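
The ambiguity caused by repeats can be made concrete with a small sketch. The two toy sequences below are made up for illustration (they are not real genomes), and the helper name `kmer_counts` is hypothetical. Both sequences contain the repeat ATG three times with the segments between the copies swapped, so they break into exactly the same set of 3-base fragments even though the sequences themselves differ.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Two different toy "genomes": the segments between the ATG repeats are swapped.
genome_a = "TAATGCCATGGGATGTT"
genome_b = "TAATGGGATGCCATGTT"

same_fragments = kmer_counts(genome_a, 3) == kmer_counts(genome_b, 3)
print(genome_a != genome_b, same_fragments)
```

With fragments this short, no algorithm could tell which of the two sequences was the original; longer reads that span the repeat are what resolve the ambiguity.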

Goals of genome assembly

  • Generate accurate representations of the original DNA sequence with minimal errors
  • Produce contiguous sequences (contigs) that cover as much of the genome as possible
  • Minimize gaps and misassemblies in the assembled sequence
  • Create complete genome assemblies that facilitate downstream analyses (gene annotation, comparative genomics, structural variation identification)

De novo vs reference-guided assembly

De novo assembly

  • Reconstructs the genome sequence without using a pre-existing reference genome
  • Relies solely on the information present in the sequencing reads
  • Necessary when no suitable reference genome is available (novel species, highly divergent strains)
  • Can capture unique features and variations specific to the target genome

Reference-guided assembly

  • Utilizes a closely related reference genome to guide the assembly process
  • Aligns sequencing reads to the reference genome to aid in the reconstruction
  • Advantageous when a high-quality reference genome is available
  • Can help resolve complex regions and improve overall assembly quality
  • May miss or misassemble regions absent or highly divergent from the reference
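
As a rough illustration of the reference-guided idea, the sketch below places each read at the reference offset with the fewest mismatches and then takes a majority vote per position. It is a deliberate simplification and not how production aligners work: it assumes ungapped placement, reads no longer than the reference, and toy sequences invented for this example. The function names `best_position` and `reference_guided_consensus` are hypothetical.

```python
from collections import Counter


def best_position(read, reference):
    """Return the reference offset where the read has the fewest mismatches."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def reference_guided_consensus(reads, reference):
    """Pile reads onto the reference and call the most common base per column."""
    columns = [Counter() for _ in reference]
    for read in reads:
        pos = best_position(read, reference)
        for offset, base in enumerate(read):
            columns[pos + offset][base] += 1
    # Positions that no read covers are reported as N (a gap in the assembly).
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


reference = "GATTACAGGCTTAACG"
# Reads from a sample that differs from the reference at one position (G -> T).
reads = ["GATTACAT", "TTACATGC", "ACATGCTT", "GCTTAACG"]
print(reference_guided_consensus(reads, reference))
```

The consensus follows the reads rather than the reference, so the single-base difference is recovered. A region missing from the reference would have nowhere to be placed, which is the limitation noted in the last bullet above.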

Overlap-layout-consensus and de Bruijn graph algorithms

Overlap-layout-consensus (OLC) algorithm

  • Graph-based assembly approach involving three main steps: overlap, layout, and consensus
  • Overlap step: identifies significant overlaps between sequencing reads through pairwise comparison, typically sped up with an index (suffix trees, hash tables) so that not every pair has to be compared directly
  • Layout step: uses the overlaps to build a graph (reads as nodes, overlaps as edges) and finds paths through it that order the reads into contigs
  • Consensus step: aligns the reads along each path and calls the most likely base at every position, resolving conflicts and ambiguities caused by sequencing errors
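
The three steps can be sketched with a greedy merge, shown below. This is a simplification and not a full OLC assembler: it repeatedly merges the pair of sequences with the longest suffix-prefix overlap instead of building an explicit layout graph, and it assumes error-free reads on one strand, so the consensus step reduces to plain concatenation. The names `overlap_length`, `greedy_assemble`, and the `min_len` threshold are hypothetical choices for this example.

```python
from itertools import permutations


def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len or more remains."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_pair = 0, None
        for a, b in permutations(contigs, 2):  # overlap step: all ordered pairs
            length = overlap_length(a, b, min_len)
            if length > best_len:
                best_len, best_pair = length, (a, b)
        if best_pair is None:
            break  # nothing left to join; remaining sequences stay separate contigs
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_len:])  # layout + trivial consensus
    return contigs


print(greedy_assemble(["TGCAATC", "ATGGCGT", "GCGTGCA"]))
```

The all-pairs comparison is quadratic in the number of reads, which is why real overlap steps rely on indexes, and greedy merging can join the wrong reads across repeats.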

De Bruijn graph algorithm

  • Breaks reads into shorter, fixed-length subsequences called k-mers
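  • Constructs a graph where nodes represent (k-1)-mers and each k-mer is a directed edge from its first k-1 bases to its last k-1 bases
  • In the idealized case, reconstructs the genome sequence by finding an Eulerian path that visits each edge in the graph exactly once; with real data, assemblers first simplify the graph and report unambiguous paths as contigs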
  • More computationally efficient than OLC for large genomes and high-coverage datasets
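  • May be more sensitive to sequencing errors and repeats than OLC, since each error creates spurious k-mers and repeats longer than k collapse into shared paths in the graph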
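
A minimal sketch of the de Bruijn approach follows. It uses the common formulation in which each distinct k-mer is an edge joining its (k-1)-base prefix to its (k-1)-base suffix, and it finds an Eulerian path with Hierholzer's algorithm. It assumes error-free reads from one strand, that no k-mer occurs twice in the genome, and that an Eulerian path exists. The function names and the toy reads are invented for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the nodes it points to, one edge per distinct k-mer."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once (path assumed to exist)."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    nodes = set(graph) | set(in_degree)
    # Start where out-degree exceeds in-degree by one, else anywhere (a cycle).
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_degree[n] == 1),
        min(nodes),
    )
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Turn a node path back into a sequence: first node plus last base of the rest."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
print(spell(eulerian_path(build_de_bruijn(reads, 4))))
```

No read-against-read comparison is needed here, which is the source of the efficiency gain. The cost is that a single wrong base would add up to k spurious edges to the graph.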

Genome assembly quality assessment

Metrics for evaluating assembly quality

  • N50: length of the shortest contig or scaffold such that 50% of the total assembly length is contained in contigs or scaffolds of that length or longer
  • L50: smallest number of contigs or scaffolds, counted from longest to shortest, whose combined length reaches 50% of the total assembly length
  • Genome coverage: percentage of the reference genome covered by the assembled sequences
  • Number of misassemblies and gaps: indicators of assembly accuracy and completeness
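
N50 and L50 follow directly from the definitions above and can be computed in one pass over the sorted contig lengths. The sketch below is one straightforward way to do it; the function name `n50_l50` and the contig lengths are made up for illustration.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, count
    raise ValueError("need at least one contig")


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # total assembly length: 300
print(n50_l50(contig_lengths))
```

Here the two longest contigs (80 + 70 = 150) reach half of the 300-base total, so N50 is 70 and L50 is 2. A higher N50 together with a lower L50 indicates a more contiguous assembly, but neither number says anything about correctness.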

Tools for assessing assembly quality

  • BUSCO (Benchmarking Universal Single-Copy Orthologs): assesses completeness by searching for conserved orthologous genes expected in a specific lineage
  • Alignment to a reference genome (if available): evaluates accuracy, misassemblies, and gaps
  • Read depth distribution analysis: identifies potential misassemblies or collapsed repeats based on uneven coverage
  • Interactive visualization tools (IGV, Tablet): allow visual inspection of the assembly and of the sequencing reads aligned to it, to identify errors or inconsistencies
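
Read depth analysis can be sketched as a comparison of each position (or window) against the typical depth. The example below assumes per-position depths have already been computed from a read alignment; the depth values and the `high` and `low` thresholds are arbitrary choices for illustration, not recommended settings, and `flag_depth_outliers` is a hypothetical name.

```python
from statistics import median


def flag_depth_outliers(depths, high=1.8, low=0.3):
    """Flag positions whose depth is far above or below the median depth."""
    typical = median(depths)
    flags = []
    for position, depth in enumerate(depths):
        if depth > high * typical:
            flags.append((position, depth, "high"))  # possible collapsed repeat
        elif depth < low * typical:
            flags.append((position, depth, "low"))  # possible misjoin or weak support
    return flags


depths = [30, 31, 29, 30, 62, 60, 30, 28, 5, 30]
print(flag_depth_outliers(depths))
```

Roughly doubled depth suggests that two copies of a repeat were collapsed into one, while a sharp drop can mark a junction that few reads actually support. Flagged regions are candidates for closer inspection, not confirmed errors.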