Genome assembly is the computational backbone of modern genomics. Without it, raw sequencing reads are just billions of meaningless fragments. Understanding these tools means knowing how different assembly algorithms work, why certain tools fit specific sequencing technologies, and what trade-offs exist between accuracy, speed, and computational cost.
The key concepts revolve around graph-based assembly strategies, read length and error profiles, and hybrid approaches that combine multiple data types. Don't just memorize tool names. Know which assembly paradigm each tool uses (de Bruijn graph vs. overlap-layout-consensus), what sequencing platform it's optimized for, and when you'd pick one over another. If an exam question asks you to design an assembly pipeline, you need to match the tool to the data type and project scale.
De Bruijn graph assemblers work by breaking reads into overlapping subsequences of length k (called k-mers), then building a directed graph where each k-mer is a node (or edge, depending on the formulation) and paths through the graph represent candidate genomic sequences. This approach trades exact read-to-read overlap information for computational efficiency, which is critical when dealing with the hundreds of millions of reads that Illumina platforms produce.
Compare: Velvet vs. ABySS: both use de Bruijn graphs for short reads, but ABySS distributes the graph across a compute cluster (via MPI) while Velvet runs on a single machine. Choose ABySS when the assembly graph won't fit in one node's RAM; choose Velvet for smaller genomes where a single-machine assembly is simpler and faster.
These tools build on the basic de Bruijn graph framework with more sophisticated algorithms to handle challenging scenarios like uneven coverage, single-cell data, and complex microbial communities.
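One technique both SPAdes and IDBA-UD rely on is iterating over multiple k-mer sizes: small k keeps thinly covered regions connected, while large k resolves repeats. The toy sketch below (hypothetical reads, simplified undirected connectivity check) shows two reads whose only overlap is a thin 3 bp junction, which stays connected at k=3 but fragments at k=5.

```python
from collections import defaultdict

def kmer_edges(reads, k):
    """All (prefix, suffix) de Bruijn edges over (k-1)-mer nodes."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges

def components(edges):
    """Count connected components, treating edges as undirected."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, count = set(), 0
    for node in adj:
        if node not in seen:
            count += 1
            stack = [node]
            while stack:
                n = stack.pop()
                if n not in seen:
                    seen.add(n)
                    stack.extend(adj[n] - seen)
    return count

# Two toy reads sharing only a 3 bp overlap -- like a low-coverage junction
reads = ["ATGGCG", "GCGTTA"]
comps = {k: components(kmer_edges(reads, k)) for k in (3, 5)}
```

At k=3 the shared "GCG" k-mer joins the reads into one component; at k=5 no k-mers are shared and the graph splits in two, which is why these assemblers merge results across several k values.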
Compare: SPAdes vs. IDBA-UD: both handle uneven coverage, but SPAdes excels with single-cell amplified DNA while IDBA-UD is optimized for naturally uneven metagenomic samples. If a question asks about assembling a rare bacterial species from an environmental sample, either could work depending on the sequencing approach.
Long-read technologies (PacBio and Oxford Nanopore) produce reads spanning tens of thousands of bases, but with higher per-base error rates (roughly 5-15% depending on the platform and chemistry). These assemblers must correct those errors while leveraging the reads' ability to span repetitive regions that fragment short-read assemblies.
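The core of long-read error correction is consensus: because errors at any given position are mostly random, stacking several reads covering the same locus and taking a majority vote recovers the true base. The sketch below is a deliberately simplified illustration, assuming the reads are already aligned, equal length, and contain only substitution errors (real correctors must also handle the insertions and deletions that dominate long-read error profiles).

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per column of a stack of pre-aligned,
    equal-length reads with substitution errors only."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )

# Five noisy toy copies of the same region, one substitution each
true_seq = "ATGCGTACGA"
noisy = [
    "ATGCGTACGA",
    "ATGGGTACGA",  # error at position 3
    "ATGCGTACCA",  # error at position 8
    "TTGCGTACGA",  # error at position 0
    "ATGCGAACGA",  # error at position 5
]
corrected = consensus(noisy)
```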
Canu follows a three-stage hierarchical pipeline:
1. Correction: overlap the raw reads against each other and use consensus to correct errors.
2. Trimming: trim the corrected reads down to their high-quality regions and discard chimeric sequence.
3. Assembly: compute overlaps among the corrected, trimmed reads and build contigs.
Canu is platform-agnostic, working with both PacBio and Oxford Nanopore data. This flexibility makes it a common default choice for long-read projects.
Compare: Canu vs. Falcon: both assemble long reads with built-in error correction, but Falcon is PacBio-specific with diploid phasing capabilities, while Canu handles multiple long-read platforms. For a heterozygous diploid genome sequenced on PacBio, Falcon offers better haplotype resolution.
Before de Bruijn graphs dominated, OLC assemblers were the standard approach. The strategy has three steps:
1. Overlap: compare every read against every other read to find pairwise overlaps.
2. Layout: order and orient the reads into an approximate arrangement based on those overlaps.
3. Consensus: call the most likely base at each position from the stacked reads.
The pairwise overlap step scales roughly quadratically, O(n²), with the number of reads n, which is why OLC became impractical for the massive read counts from Illumina. However, OLC remains valuable for longer reads, where the total read count is lower and preserving full overlap information improves accuracy.
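To make the quadratic cost concrete, here is a naive overlap step in Python: every ordered pair of reads is compared, so n reads require n(n-1) comparisons. The reads are hypothetical toy data, and real OLC assemblers use indexed seed matches rather than this brute-force suffix-prefix scan.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b
    (at least min_len), or 0 if there is none."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def all_overlaps(reads, min_len=3):
    """Brute-force overlap stage: n*(n-1) ordered pairs -- O(n^2)."""
    olaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                o = overlap(a, b, min_len)
                if o:
                    olaps[(i, j)] = o
    return olaps

# Three toy reads that tile a region left to right
reads = ["ATGGCGT", "GCGTGCA", "TGCAATT"]
olaps = all_overlaps(reads)
```

The resulting overlap graph (read 0 → read 1 → read 2) is exactly what the layout step would then order into a contig.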
Compare: Newbler vs. modern long-read assemblers: Newbler used OLC for 454's "long" reads (hundreds of bases), while Canu and Falcon handle truly long reads (tens of thousands of bases). This illustrates how assembly strategies evolve alongside sequencing technology.
Hybrid assemblers combine short and long reads to get the best of both worlds: the high per-base accuracy of Illumina data and the structural contiguity of long-read data. These approaches are increasingly common as multi-platform sequencing becomes standard.
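A common hybrid pattern is polishing: a draft contig built from long reads supplies the structure, and accurate short reads vote on each base to fix residual errors. The sketch below is a toy illustration, assuming each short read's placement on the draft is already known and errors are substitutions only; real polishers (and hybrid assemblers generally) must first align the short reads and handle indels.

```python
from collections import Counter

def polish(draft, short_reads):
    """short_reads: list of (offset, sequence) pairs assumed to be
    correctly placed on the draft. Where short-read coverage exists,
    the majority base overrides the draft; elsewhere the draft stands."""
    pileup = [[] for _ in draft]
    for offset, seq in short_reads:
        for i, base in enumerate(seq):
            pileup[offset + i].append(base)
    return "".join(
        Counter(column).most_common(1)[0][0] if column else draft_base
        for draft_base, column in zip(draft, pileup)
    )

# Toy long-read draft with one error (position 5 should be T, not A),
# covered by accurate short reads at known offsets
draft = "ATGCGAACGA"
short = [(0, "ATGCG"), (3, "CGTAC"), (5, "TACGA")]
polished = polish(draft, short)
```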
Compare: MaSuRCA vs. ALLPATHS-LG: both produce high-quality hybrid assemblies, but ALLPATHS-LG demands specific library preparations while MaSuRCA is more flexible with input data types. MaSuRCA works well for opportunistic hybrid assembly when you have mixed data on hand; ALLPATHS-LG is better for planned, well-funded projects where you can design the sequencing libraries to match its requirements.
| Concept | Best Examples |
|---|---|
| De Bruijn graph (short reads) | Velvet, SOAPdenovo, ABySS |
| Uneven coverage handling | SPAdes, IDBA-UD |
| Long-read assembly | Canu, Falcon |
| Distributed/parallel computing | ABySS |
| Hybrid assembly (short + long) | MaSuRCA, ALLPATHS-LG |
| Single-cell genomics | SPAdes |
| Metagenomic assembly | SPAdes (metaSPAdes), IDBA-UD |
| Diploid haplotype phasing | Falcon (FALCON-Unzip) |
| Legacy/454 sequencing | Newbler |
Which two assemblers both use de Bruijn graphs but differ in their computational architecture, one requiring a single machine and one distributing across a cluster?
You have PacBio long-read data from a diploid organism and need to separate maternal and paternal haplotypes. Which assembler offers this capability, and what feature enables it?
Compare and contrast SPAdes and IDBA-UD: what assembly challenge do both address, and what specific use case distinguishes each tool?
A collaborator has both Illumina short reads and Oxford Nanopore long reads for a complex plant genome. Which assembler(s) would you recommend, and why might hybrid assembly outperform using either data type alone?
Why would Newbler be a poor choice for a modern sequencing project, and what fundamental shift in sequencing technology made de Bruijn graph assemblers dominant over OLC approaches?