Genome assembly is the computational backbone of modern genomics—without it, raw sequencing reads are just billions of meaningless fragments. You're being tested on understanding how different assembly algorithms work, why certain tools are chosen for specific sequencing technologies, and what trade-offs exist between accuracy, speed, and computational resources. These tools represent the bridge between raw data and biological insight, whether you're reconstructing a bacterial genome or tackling a complex eukaryotic assembly.
The key concepts here revolve around graph-based assembly strategies, read length and error profiles, and hybrid approaches that combine multiple data types. Don't just memorize tool names—know which assembly paradigm each tool uses (de Bruijn graph vs. overlap-layout-consensus), what sequencing platform it's optimized for, and when you'd choose one over another. If an exam question asks you to design an assembly pipeline, you need to match the tool to the data type and project scale.
De Bruijn graph assemblers break reads into k-mers (overlapping substrings of a fixed length k) and construct a graph in which paths represent candidate genomic sequences. The approach trades exact read-to-read overlap information for computational efficiency, making it well suited to the massive short-read datasets produced by Illumina sequencing.
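To make the k-mer idea concrete, here is a minimal sketch of de Bruijn graph construction in Python. It is a toy illustration, not how Velvet, ABySS, or any production assembler is implemented: the function names (`kmers`, `build_de_bruijn`, `walk_unbranched`) and the three example reads are hypothetical, and the reads are assumed to be error-free and from a single strand.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every overlapping substring of length k from a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to.

    Every k-mer contributes one edge: its first k-1 bases -> its last k-1 bases.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Spell out a contig by following edges while the path does not branch."""
    path = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        path += nxt[-1]  # each step adds exactly one new base
        seen.add(nxt)
        node = nxt
    return path


# Hypothetical error-free reads that tile a 10-base sequence.
reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)  # ATGGCGTACC
```

Notice that the graph never stores which read overlapped which: once reads are chopped into k-mers, only k-mer adjacency survives. That is the trade-off described above, and it is why repeats longer than k create branches that a simple walk like this cannot resolve.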
Compare: Velvet vs. ABySS. Both use de Bruijn graphs for short reads, but ABySS distributes computation across machines while Velvet runs on a single node. Choose ABySS when the assembly graph for your genome will not fit in a single machine's RAM; choose Velvet for smaller genomes that assemble quickly on one node.
A second group of assemblers extends the basic de Bruijn approach with algorithms designed for challenging inputs such as uneven coverage, single-cell data, and complex microbial communities.
Compare: SPAdes vs. IDBA-UD—both handle uneven coverage, but SPAdes excels with single-cell amplified DNA while IDBA-UD is optimized for naturally uneven metagenomic samples. If an FRQ asks about assembling a rare bacterial species from an environmental sample, either could work depending on the sequencing approach.
Long-read technologies (PacBio, Oxford Nanopore) produce reads spanning thousands of bases but with higher error rates than short-read platforms. Long-read assemblers must correct these errors while exploiting the reads' ability to span repetitive regions that fragment short-read assemblies.
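One intuition for why error correction works at all: sequencing errors are mostly random, so when several reads cover the same position, the correct base usually wins a majority vote. The sketch below illustrates only that intuition. It assumes the reads are already aligned to the same coordinates, have equal length, and contain substitution errors only; real long-read assemblers must first compute the alignments and also handle insertions and deletions. The function name `column_consensus` and the example reads are hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Return the majority base at each column of pre-aligned, equal-length reads."""
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("reads must be pre-aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical noisy copies of the same 10-base region.
noisy = [
    "ATGGCGTACC",
    "ATGACGTACC",  # substitution at index 3
    "ATGGCGTTCC",  # substitution at index 7
    "ATGGCGTACC",
    "CTGGCGTACC",  # substitution at index 0
]
corrected = column_consensus(noisy)
print(corrected)  # ATGGCGTACC
```

Each error appears in only one of the five reads, so every column still has a clear majority. This is also why coverage depth matters so much for noisy long reads: with too few reads over a position, a single error can no longer be outvoted.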
Compare: Canu vs. Falcon—both assemble long reads with error correction, but Falcon is PacBio-specific with diploid phasing capabilities, while Canu handles multiple long-read platforms. For a heterozygous diploid genome on PacBio, Falcon offers better haplotype resolution.
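Before de Bruijn graphs dominated, overlap-layout-consensus (OLC) assemblers found overlaps between entire reads, ordered the reads into a layout based on those overlaps, and derived a consensus sequence. This approach remains valuable for longer reads, where preserving full overlap information improves accuracy.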
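The overlap step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm used by Newbler or any other real assembler: it only finds exact suffix-prefix matches, whereas real OLC tools use approximate alignment that tolerates sequencing errors. The function names, the `min_length` threshold, and the example reads are all hypothetical.

```python
from itertools import permutations


def overlap_length(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no overlap of at least min_length bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_length=3):
    """Return {(a, b): overlap} for every ordered pair of reads that overlaps."""
    edges = {}
    for a, b in permutations(reads, 2):
        olen = overlap_length(a, b, min_length)
        if olen:
            edges[(a, b)] = olen
    return edges


def merge(a, b, olen):
    """Join two reads, writing the shared overlap only once."""
    return a + b[olen:]


# Hypothetical error-free reads; each overlaps the next by 5 bases.
reads = ["ATGGCGTA", "GCGTACCT", "TACCTGAA"]
edges = overlap_graph(reads, min_length=4)

# Layout: follow the only chain of overlaps, read 0 -> read 1 -> read 2.
first = merge(reads[0], reads[1], edges[(reads[0], reads[1])])
contig = merge(first, reads[2], edges[(reads[1], reads[2])])
print(contig)  # ATGGCGTACCTGAA
```

Unlike a de Bruijn graph, the nodes here are whole reads, so the full overlap information is preserved. The cost is the all-against-all comparison in `overlap_graph`, which grows quadratically with the number of reads.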
Compare: Newbler vs. modern long-read assemblers—Newbler used OLC for 454's "long" reads (hundreds of bases), while Canu/Falcon handle truly long reads (thousands of bases). This illustrates how assembly strategies evolve with sequencing technology.
Hybrid assemblers combine short and long reads, pairing the accuracy of Illumina data with the contiguity of long-read data. Hybrid approaches are increasingly common as multi-platform sequencing becomes standard practice.
Compare: MaSuRCA vs. ALLPATHS-LG—both produce high-quality hybrid assemblies, but ALLPATHS-LG demands specific library preparations while MaSuRCA is more flexible with input data types. MaSuRCA is better for opportunistic hybrid assembly; ALLPATHS-LG for planned, well-funded projects.
| Concept | Best Examples |
|---|---|
| De Bruijn graph (short reads) | Velvet, SOAPdenovo, ABySS |
| Uneven coverage handling | SPAdes, IDBA-UD |
| Long-read assembly | Canu, Falcon |
| Distributed/parallel computing | ABySS |
| Hybrid assembly (short + long) | MaSuRCA, ALLPATHS-LG |
| Single-cell genomics | SPAdes |
| Metagenomic assembly | SPAdes, IDBA-UD |
| Legacy/454 sequencing | Newbler |
Which two assemblers both use de Bruijn graphs but differ in their computational architecture—one requiring a single machine and one distributing across a cluster?
You have PacBio long-read data from a diploid organism and need to separate maternal and paternal haplotypes. Which assembler offers this capability, and what feature enables it?
Compare and contrast SPAdes and IDBA-UD: what assembly challenge do both address, and what specific use case distinguishes each tool?
A collaborator has both Illumina short reads and Oxford Nanopore long reads for a complex plant genome. Which assembler(s) would you recommend, and why might hybrid assembly outperform using either data type alone?
Why would Newbler be a poor choice for a modern sequencing project, and what fundamental shift in sequencing technology made de Bruijn graph assemblers dominant over overlap-layout-consensus approaches?