🧬Bioinformatics Unit 4 Review

4.1 Genome sequencing technologies

Written by the Fiveable Content Team • Last updated August 2025

History of genome sequencing

Genome sequencing gives scientists the ability to read and analyze entire genetic codes, fundamentally changing how we study biology. As sequencing technologies have advanced, they've become faster and cheaper, opening the door to large-scale genomic studies that were once unthinkable. Knowing this history helps you understand why current bioinformatics tools and pipelines exist the way they do.

Early sequencing methods

In 1977, Frederick Sanger introduced the chain-termination method, the first practical way to sequence DNA. That same year, Maxam and Gilbert published a competing approach based on chemical degradation. Both methods could only handle short DNA fragments, but they proved the concept was viable.

  • Automated sequencing machines arrived in the 1980s, boosting throughput and accuracy
  • The invention of the polymerase chain reaction (PCR) by Kary Mullis in 1983 made it possible to amplify tiny DNA samples into quantities large enough for sequencing

Human Genome Project impact

Launched in 1990, the Human Genome Project (HGP) set out to sequence the entire 3.2 billion base-pair human genome. The public project used a hierarchical (clone-by-clone) shotgun approach: DNA was broken into mapped clones and then into smaller overlapping fragments, each fragment was sequenced, and the pieces were computationally reassembled.

  • Completed in 2003, two years ahead of schedule
  • Total cost: roughly $2.7 billion
  • Demonstrated the need for faster, cheaper sequencing, directly spurring development of next-generation sequencing (NGS)
  • Laid the groundwork for personalized medicine and comparative genomics

DNA sequencing principles

At its core, DNA sequencing determines the precise order of nucleotides (A, T, C, G) in a DNA molecule. Different technologies use different biochemical reactions and detection methods to accomplish this, but they all share that same goal. Understanding these principles is essential for interpreting sequencing data correctly.

Sanger sequencing basics

Sanger sequencing works by synthesizing a complementary DNA strand and randomly terminating the growing chain at each base position. Here's the process:

  1. A DNA template, primer, DNA polymerase, normal dNTPs, and fluorescently labeled dideoxynucleotides (ddNTPs) are combined in a reaction
  2. When a ddNTP gets incorporated instead of a normal dNTP, the chain terminates because ddNTPs lack the 3'-OH group needed for the next bond
  3. This produces fragments of every possible length, each ending with a fluorescent ddNTP
  4. Capillary electrophoresis separates these fragments by size
  5. A laser reads the fluorescent labels as fragments pass through a detector, revealing the sequence

Sanger sequencing produces high-quality reads up to about 900 base pairs long, with accuracy around 99.99%.
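
If it helps to see the logic in code, here's a minimal Python sketch of the idea, purely a toy rather than a real basecaller: every simulated strand stops at a random position with a labeled ddNTP, and reading the terminal labels in order of fragment length recovers the sequence.

```python
import random

def sanger_fragments(strand: str, n_molecules: int = 10_000) -> set[tuple[int, str]]:
    """Simulate chain termination along the strand being synthesized: each molecule
    stops at a random position, and the terminating ddNTP carries a fluorescent label."""
    fragments = set()
    for _ in range(n_molecules):
        stop = random.randrange(len(strand))        # random ddNTP incorporation
        fragments.add((stop + 1, strand[stop]))     # (fragment length, labeled terminal base)
    return fragments

def read_gel(fragments: set[tuple[int, str]]) -> str:
    """Electrophoresis separates fragments by size; reading the label of each successive
    length, shortest to longest, spells out the sequence 5'->3'."""
    return "".join(base for _, base in sorted(fragments))

print(read_gel(sanger_fragments("ATGCGTACGTTAGC")))  # with enough molecules, matches the input
```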

Next-generation sequencing overview

NGS technologies sequence millions of DNA fragments simultaneously (massively parallel sequencing), which is the key difference from Sanger's one-fragment-at-a-time approach.

  • Includes platforms like Illumina, Ion Torrent, and the now-discontinued 454
  • Generally produces shorter reads (50–300 bp) but at vastly higher throughput
  • Uses either sequencing-by-synthesis or sequencing-by-ligation chemistry
  • Enables whole-genome sequencing at a fraction of the cost and time of Sanger methods

First-generation sequencing

First-generation sequencing refers to the methods developed in the 1970s that made DNA sequencing possible for the first time. These approaches were primarily used for sequencing individual genes or small genomic regions and laid the foundation for everything that followed.

Sanger method details

The original Sanger method used four separate reactions, each containing one type of ddNTP (ddATP, ddCTP, ddGTP, or ddTTP). Fragments were separated on a polyacrylamide gel, producing a ladder-like pattern that was initially read by hand.

Later improvements automated the process:

  • All four ddNTPs were labeled with different fluorescent dyes, allowing a single reaction
  • Capillary electrophoresis replaced slab gels for faster, more consistent separation
  • Capable of sequencing up to ~1000 base pairs with 99.99% accuracy
  • Remained the gold standard for targeted sequencing and validation for decades

Maxam-Gilbert method

Developed by Allan Maxam and Walter Gilbert, this method takes a completely different approach. Instead of synthesizing new DNA, it chemically breaks existing DNA at specific bases.

  • Four separate chemical reactions target different nucleotides: G, A+G, C, and C+T
  • Radioactive labeling (not fluorescence) is used to visualize fragments on a gel
  • The pattern of fragment sizes reveals the sequence

This method fell out of favor because it uses hazardous chemicals (like hydrazine and dimethyl sulfate) and is technically demanding. However, it still has niche advantages for sequencing DNA with high GC content or problematic secondary structures.

Second-generation sequencing

Second-generation sequencing, commonly called Next-Generation Sequencing (NGS), represented a massive leap in throughput and cost-efficiency. These platforms enabled whole-genome sequencing of complex organisms and powered large-scale projects that would have been impossible with Sanger alone.

Illumina sequencing technology

Illumina dominates the NGS market and is the platform you'll encounter most often. The process works as follows:

  1. DNA is fragmented and adapters are ligated to both ends
  2. Fragments bind to a flow cell surface and undergo bridge amplification, creating dense clusters of identical copies
  3. Sequencing-by-synthesis begins: fluorescently labeled reversible terminator nucleotides are added one at a time
  4. After each incorporation, a camera captures an image of the entire flow cell
  5. The fluorescent label and terminator group are chemically removed, allowing the next cycle

This produces billions of reads in parallel, typically 75–300 bp long, with accuracy above 99%. The cost per base is very low, which is why Illumina is the workhorse of most genomics labs.
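
A toy sketch of the per-cycle base-calling idea, assuming one fluorescent dye per base and simply picking the brightest channel each cycle (real Illumina instruments use more complex dye chemistries and signal models):

```python
import numpy as np

rng = np.random.default_rng(0)
CHANNELS = "ACGT"  # assume one dye per base, imaged as four intensity channels

def cycle_intensities(true_base: str, noise: float = 0.1) -> np.ndarray:
    """One sequencing-by-synthesis cycle for one cluster: the channel matching the
    incorporated base lights up; the others show only background noise."""
    signal = rng.normal(0.0, noise, size=4)
    signal[CHANNELS.index(true_base)] += 1.0
    return signal

def call_base(intensities: np.ndarray) -> str:
    """Base calling: take the brightest channel for this cycle."""
    return CHANNELS[int(np.argmax(intensities))]

read = "ACGTTGCA"
print("".join(call_base(cycle_intensities(b)) for b in read))  # usually matches the read
```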

Ion Torrent sequencing

Ion Torrent takes a different detection approach: instead of optics, it uses semiconductor technology to detect pH changes.

  • DNA fragments are attached to beads and loaded into microwells on a semiconductor chip
  • Individual nucleotides (A, T, C, G) are flooded across the chip one at a time
  • When a nucleotide is incorporated, a hydrogen ion is released, causing a measurable voltage change
  • No fluorescent labels or cameras needed, which makes the instrument simpler and faster

The main weakness is homopolymer errors: when multiple identical bases occur in a row (e.g., AAAA), the signal should scale with the number of bases incorporated, but the noise grows with run length, so distinguishing, say, 7 identical bases from 8 becomes unreliable. This makes Ion Torrent best suited for smaller genomes and targeted sequencing.
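
You can see why in a quick simulation. The sketch below assumes the flow signal is proportional to run length with noise that also grows with run length (toy numbers, not real instrument parameters); rounding the noisy signal is the "call":

```python
import numpy as np

rng = np.random.default_rng(1)

def miscall_rate(run_length: int, trials: int = 100_000, cv: float = 0.05) -> float:
    """Toy model: the pH signal from one nucleotide flow grows with the number of
    identical bases incorporated, but so does its noise. Round the noisy signal to
    the nearest integer and count how often that call is wrong."""
    signals = rng.normal(loc=run_length, scale=cv * run_length, size=trials)
    return float(np.mean(np.rint(signals) != run_length))

for n in (1, 3, 5, 8):
    print(f"homopolymer length {n}: miscall rate ~{miscall_rate(n):.3f}")
# Near zero for short runs, but climbing steeply by length 7-8 -- the classic
# Ion Torrent homopolymer failure mode described above.
```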

454 pyrosequencing

454 Life Sciences introduced the first commercially successful NGS platform in 2005. It uses emulsion PCR to amplify fragments on beads, then detects nucleotide incorporation through a light-producing reaction.

  • When a nucleotide is incorporated, pyrophosphate is released
  • ATP sulfurylase converts the pyrophosphate into ATP, which luciferase then uses to produce a detectable light signal (hence "pyrosequencing")
  • Produced relatively long reads for NGS (up to ~1000 bp)
  • Discontinued in 2016 because it couldn't compete with Illumina on cost or throughput

Third-generation sequencing

Third-generation platforms sequence single molecules directly, without requiring PCR amplification. This eliminates amplification bias and produces much longer reads, which is critical for resolving complex genomic regions.

Pacific Biosciences SMRT

PacBio's Single Molecule Real-Time (SMRT) sequencing watches a single DNA polymerase molecule at work in real time.

  • The polymerase sits at the bottom of a tiny well called a zero-mode waveguide (ZMW), only ~70 nm wide
  • Fluorescently labeled nucleotides are incorporated, and each base emits a characteristic flash of light
  • Average read lengths of 10–30 kb, with some reads exceeding 100 kb
  • Circular consensus sequencing (CCS) mode: the polymerase reads a circularized template multiple times, then the reads are combined to produce "HiFi" reads with accuracy above 99.9%
  • Can detect DNA modifications like methylation directly during sequencing, without extra sample prep

PacBio is especially useful for de novo genome assembly and resolving repetitive regions that short reads can't span.
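
The payoff of circular consensus is easy to show with a toy majority-vote model. This sketch assumes substitution-only errors and perfectly aligned passes, neither of which real CCS gets for free, but the error-versus-passes trend is the point:

```python
import random
from collections import Counter

random.seed(0)
BASES = "ACGT"

def noisy_pass(true_seq: str, error_rate: float = 0.10) -> str:
    """One polymerase pass over the circularized molecule: ~10% of positions are
    misread (substitutions only, for simplicity)."""
    return "".join(
        random.choice([b for b in BASES if b != base]) if random.random() < error_rate else base
        for base in true_seq
    )

def ccs_consensus(true_seq: str, n_passes: int) -> str:
    """HiFi-style consensus: per-position majority vote across all passes."""
    passes = [noisy_pass(true_seq) for _ in range(n_passes)]
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

true_seq = "".join(random.choice(BASES) for _ in range(2000))
for n in (1, 3, 5, 10):
    errors = sum(a != b for a, b in zip(ccs_consensus(true_seq, n), true_seq))
    print(f"{n:2d} passes: {errors} errors in {len(true_seq)} bases")
```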

Oxford Nanopore technologies

Oxford Nanopore uses an entirely different principle: a protein nanopore embedded in a membrane measures changes in electrical current as a DNA strand passes through it.

  1. A motor protein feeds single-stranded DNA through the nanopore one base at a time
  2. Each base (or combination of bases in the pore at once) disrupts the ionic current in a characteristic way
  3. Base calling algorithms translate these current signals into sequence
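
A cartoon version of that signal-to-base step, assuming a single fixed current level per base and nearest-level decoding (real pores respond to roughly 5–6 bases at a time, and production basecallers are neural networks trained on raw signal):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up per-base current levels in picoamps -- purely illustrative values.
LEVELS = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}

def simulate_squiggle(seq: str, noise: float = 6.0) -> np.ndarray:
    """One noisy current measurement per base as the strand ratchets through the pore."""
    return np.array([rng.normal(LEVELS[b], noise) for b in seq])

def call_bases(squiggle: np.ndarray) -> str:
    """Nearest-level decoding: assign each measurement to the closest expected current."""
    bases, levels = zip(*LEVELS.items())
    levels = np.array(levels)
    return "".join(bases[int(np.argmin(np.abs(levels - x)))] for x in squiggle)

print(call_bases(simulate_squiggle("GATTACAGATTACA")))  # mostly correct; errors grow with noise
```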

Key features:

  • Ultra-long reads exceeding 2 Mb have been reported, with no theoretical upper limit on read length
  • The MinION is a portable, USB-powered device roughly the size of a stapler, enabling sequencing in the field
  • Real-time data streaming means you can start analyzing before the run finishes
  • Error rates are higher than short-read platforms (roughly 5–15% for raw reads, though improvements continue)

Comparison to short-read methods

Choosing between long-read and short-read sequencing depends on your research question:

| Feature | Short-read (e.g., Illumina) | Long-read (e.g., PacBio, Nanopore) |
|---|---|---|
| Read length | 75–300 bp | 10 kb to >1 Mb |
| Per-base accuracy | >99.9% | ~90–99.9% (technology-dependent) |
| Throughput | Very high | Moderate |
| Cost per base | Low | Higher |
| Best for | Variant calling, RNA-seq, large cohorts | De novo assembly, structural variants, repetitive regions |

Hybrid approaches that combine long and short reads are increasingly common. Long reads provide the scaffold, and short reads polish the consensus to high accuracy.

Sequencing applications

Different research and clinical goals call for different sequencing strategies. The choice of approach affects cost, depth of coverage, and the types of variants you can detect.

Whole genome sequencing

Whole genome sequencing (WGS) determines the complete DNA sequence of an organism. It captures coding regions, regulatory elements, intergenic sequences, and structural features.

  • Enables comprehensive analysis of SNPs, indels, structural rearrangements, and novel genes
  • Used in evolutionary biology, population genetics, cancer genomics, and agricultural biotechnology
  • Generates large datasets (a single human genome at 30x coverage produces ~90 GB of raw data), requiring significant computational resources
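
That data-volume figure falls out of simple coverage arithmetic (coverage = reads × read length ÷ genome size). A quick back-of-the-envelope check, with typical values as assumptions:

```python
# Back-of-the-envelope coverage arithmetic (Lander-Waterman style)
genome_size = 3.2e9      # human genome, base pairs
read_length = 150        # a typical Illumina read length
target_coverage = 30     # desired average depth

n_reads = target_coverage * genome_size / read_length
total_bases = n_reads * read_length

print(f"reads needed: {n_reads:.2e}")         # ~6.4e8 reads
print(f"bases sequenced: {total_bases:.2e}")  # ~9.6e10 bases, i.e. ~96 gigabases
# On-disk size then depends on format and compression (FASTQ stores a quality
# character per base), which is where raw-data figures like ~90 GB come from.
```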

Exome sequencing

Exome sequencing targets only the protein-coding regions (exons), which make up roughly 1–2% of the human genome but contain an estimated ~85% of known disease-causing variants.

  • Much more cost-effective than WGS when you're primarily interested in coding mutations
  • Widely used in clinical diagnostics for rare Mendelian disorders
  • Requires a capture step before sequencing: biotinylated probes hybridize to exonic regions, and non-target DNA is washed away

The tradeoff is that you miss non-coding regulatory variants, structural variants, and intronic mutations.

Targeted sequencing approaches

Targeted sequencing focuses on specific genomic regions of interest, giving you very deep coverage at low cost.

  • Amplicon sequencing: PCR primers amplify specific regions before sequencing
  • Capture-based methods: similar to exome capture but with custom probe panels
  • CRISPR-Cas9 enrichment: uses Cas9 to selectively cut and enrich target regions without PCR

Applications include cancer mutation profiling (e.g., sequencing a panel of 50 known oncogenes), pharmacogenomics, and pathogen identification.

Data analysis challenges

Sequencing instruments produce raw data, but turning that data into biological insight requires multiple computational steps. Each step introduces potential errors that bioinformaticians need to manage.

Read quality assessment

Before any analysis, you need to verify that your sequencing data is reliable. Quality assessment examines:

  • Base quality scores (Phred scores): a Phred score of 30 means a 1 in 1,000 chance of an incorrect base call
  • GC content distribution (skewed distributions may indicate contamination or bias)
  • Sequence duplication levels and adapter contamination

Tools like FastQC and MultiQC generate visual reports of these metrics. Based on the results, you'll perform quality control steps: trimming adapters, correcting errors, and filtering out low-quality reads.
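
The Phred scale is just Q = -10 · log10(P_error), and FASTQ files store one score per base as an ASCII character (Phred+33 in modern files). A few lines of Python make the conversions concrete:

```python
def phred_to_error_prob(q: int) -> float:
    """Phred quality: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def fastq_quality_to_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string: each character's ASCII code minus the offset."""
    return [ord(ch) - offset for ch in quality_line]

print(phred_to_error_prob(30))               # 0.001 -> 1 in 1,000 chance of a wrong call
print(phred_to_error_prob(20))               # 0.01  -> 1 in 100
print(fastq_quality_to_scores("IIIIFF##"))   # [40, 40, 40, 40, 37, 37, 2, 2]
```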

Genome assembly methods

Genome assembly reconstructs the original genome from millions of short (or long) reads. There are two main strategies:

  • De novo assembly: used when no reference genome exists. Algorithms build the sequence from scratch using overlap information between reads.
  • Reference-guided assembly: aligns reads to an existing genome of the same or closely related species.

The two major algorithmic frameworks are:

  • Overlap-Layout-Consensus (OLC): works well with longer reads; finds overlaps between all read pairs
  • De Bruijn graphs: breaks reads into k-mers (subsequences of length k) and finds paths through the graph; efficient for short reads

Repetitive regions remain the biggest challenge. If a repeat is longer than your read length, the assembler can't tell how many copies exist or where they belong. This is where long-read technologies make the biggest difference.
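
Here's a minimal de Bruijn-style sketch: break reads into k-mers, draw an edge from each k-mer's prefix to its suffix, and walk the graph. The greedy walk is a toy stand-in for the Eulerian-path search, error correction, and repeat resolution a real assembler performs:

```python
from collections import defaultdict

def build_debruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def greedy_walk(graph: dict[str, list[str]], start: str) -> str:
    """Follow edges until a dead end, extending the contig one base per step."""
    edges = {node: list(targets) for node, targets in graph.items()}
    contig, node = start, start
    while edges.get(node):
        node = edges[node].pop()
        contig += node[-1]
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT", "GCAATCC"]
print(greedy_walk(build_debruijn(reads, k=4), "ATG"))  # ATGGCGTGCAATCC, stitched from overlaps
```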

Variant calling algorithms

Variant calling identifies positions where a sequenced genome differs from a reference. The main types of variants are SNPs (single nucleotide polymorphisms), indels (insertions/deletions), and structural variants (large rearrangements).

  • Algorithms consider read depth, mapping quality, base quality, and allele frequency
  • Popular tools: GATK (Broad Institute's standard pipeline), FreeBayes (Bayesian approach), and DeepVariant (deep learning-based)
  • A major challenge is distinguishing true variants from sequencing errors or alignment artifacts, especially in low-complexity regions
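
To make those criteria concrete, here's a deliberately naive pileup-based SNP caller with made-up depth and allele-fraction cutoffs. Everything it ignores (base and mapping quality, strand bias, genotype likelihoods, realignment) is precisely what tools like GATK and FreeBayes add:

```python
from collections import Counter

def call_snps(reference: str, pileups: dict[int, list[str]],
              min_depth: int = 10, min_alt_frac: float = 0.25):
    """Report (position, ref, alt, allele fraction) wherever coverage is adequate
    and a non-reference allele clears the fraction threshold."""
    calls = []
    for pos, bases in pileups.items():
        depth = len(bases)
        if depth < min_depth:
            continue
        non_ref = Counter(b for b in bases if b != reference[pos])
        if not non_ref:
            continue
        alt, count = non_ref.most_common(1)[0]
        frac = count / depth
        if frac >= min_alt_frac:
            calls.append((pos, reference[pos], alt, round(frac, 2)))
    return calls

reference = "ACGTACGTACGT"
pileups = {3: ["T"] * 8 + ["C"] * 7,    # ~47% C at a T position: looks like a heterozygous SNP
           7: ["T"] * 14 + ["G"] * 1}   # a single mismatch: more likely a sequencing error
print(call_snps(reference, pileups))    # [(3, 'T', 'C', 0.47)]
```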

Emerging technologies

Sequencing technology continues to evolve rapidly. These newer approaches address limitations of current methods and open up entirely new types of experiments.

Single-cell sequencing

Standard sequencing averages signals across millions of cells, masking the differences between individual cells. Single-cell sequencing isolates and sequences DNA or RNA from individual cells, revealing heterogeneity within tissues.

  • Techniques exist for single-cell whole genome, transcriptome (scRNA-seq), and epigenome analysis
  • A key challenge is amplification bias: since you start with a tiny amount of material from one cell, PCR amplification can distort the true abundances
  • Applications include mapping cell types in tumors, tracing developmental lineages, and characterizing immune cell diversity
  • Requires specialized bioinformatics tools for normalization, clustering, and trajectory analysis (e.g., Seurat, Scanpy)
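
As a taste of what those tools do first, here's a minimal sketch of library-size normalization plus a log transform, the kind of step scanpy.pp.normalize_total followed by log1p performs, applied to a toy count matrix:

```python
import numpy as np

def normalize_counts(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell (row) to the same total count, then take log(1 + x), so that
    differences in sequencing depth don't masquerade as biological differences."""
    per_cell_total = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / per_cell_total * target_sum)

# Toy matrix: 3 cells x 4 genes; the first two cells share a profile at very different depths.
counts = np.array([[ 10,  0,  5,  85],
                   [100,  0, 50, 850],
                   [  5, 20,  5,  70]], dtype=float)
print(normalize_counts(counts).round(2))  # the first two rows become identical after normalization
```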

Long-read sequencing advancements

Long-read platforms are improving rapidly in both accuracy and throughput.

  • PacBio HiFi reads now combine 10–20 kb read lengths with >99.9% accuracy, approaching short-read quality
  • Oxford Nanopore's ultra-long reads have enabled the first telomere-to-telomere (T2T) human genome assemblies, filling in gaps that persisted for 20 years after the Human Genome Project
  • On the short-read side, newer platforms (Singular Genomics, Element Biosciences) are entering the market with competitive accuracy
  • Algorithm development for long reads remains an active area, particularly for handling the distinct error profiles of each platform

In situ sequencing

In situ sequencing performs sequencing directly within intact tissue sections, preserving spatial context that's lost when you extract DNA or RNA from homogenized samples.

  • Techniques include FISSEQ (fluorescence in situ sequencing) and spatially-resolved transcriptomics methods like MERFISH and Slide-seq
  • You can see not just which genes are expressed, but where in the tissue they're expressed
  • Applications in neuroscience (mapping gene expression across brain regions), developmental biology, and tumor microenvironment studies
  • Requires specialized image analysis pipelines and methods for integrating spatial data with expression data

Ethical considerations

As genome sequencing becomes cheaper and more widespread, the ethical questions surrounding it grow more pressing. Bioinformaticians handle sensitive genetic data regularly and need to understand these issues.

Privacy concerns

Your genome is uniquely identifying. Even "anonymized" genetic datasets can potentially be re-identified by cross-referencing with public genealogy databases or other data sources.

  • Genomic data also reveals information about biological relatives who never consented to testing
  • Secure storage and controlled-access mechanisms (like dbGaP's tiered access system) are standard practice
  • Emerging techniques like homomorphic encryption and federated learning aim to enable analysis of genomic data without exposing raw sequences

Genetic discrimination issues

Genetic information could be misused in employment or insurance decisions. In the US, the Genetic Information Nondiscrimination Act (GINA) of 2008 prohibits discrimination in health insurance and employment based on genetic information, but it has gaps: it doesn't cover life insurance, disability insurance, or long-term care insurance.

  • Interpreting genetic risk is complex. Most conditions involve many genes and environmental factors, so a "predisposition" is not a diagnosis.
  • Public education about what genetic results actually mean remains an ongoing challenge
  • Ethical debates around prenatal genetic screening and selective reproduction continue to evolve

Informed consent challenges

Informed consent for genomic testing is more complicated than for most medical procedures because the implications extend far beyond the immediate test.

  • Participants need to understand that sequencing may reveal incidental findings (unexpected results unrelated to the original purpose, like a cancer predisposition gene found during a research study)
  • Consent for minors raises additional questions, since children can't meaningfully consent to having their genome on record
  • Dynamic consent models are emerging, where participants can update their preferences as new analyses become possible with their stored data

Future of genome sequencing

Sequencing technology shows no signs of slowing down. The trends point toward cheaper, faster, more portable, and more integrated approaches.

Decreasing sequencing costs

The cost of sequencing a human genome has dropped from ~$2.7 billion (Human Genome Project) to roughly $200 today. This decline has outpaced Moore's Law, largely due to NGS innovations.

  • The $100 genome is an industry target that would make WGS routine in clinical care
  • Costs are dropping not just for sequencing itself but also for library preparation and sample handling
  • The bottleneck is shifting from data generation to data storage, analysis, and interpretation

Portable sequencing devices

Miniaturized sequencing devices are making genomics accessible outside traditional labs.

  • Oxford Nanopore's MinION has been used to sequence pathogens during Ebola and Zika outbreaks in remote field settings, and even aboard the International Space Station
  • Real-time results enable rapid clinical decisions (e.g., identifying antibiotic resistance genes in a bacterial infection within hours)
  • Challenges include limited computational resources in the field, driving development of edge computing and cloud-based analysis pipelines

Integration with other omics

The future of genomics is multi-omics: combining genomic data with transcriptomics (RNA), proteomics (proteins), metabolomics (metabolites), and epigenomics (chemical modifications to DNA).

  • No single omics layer tells the full story. A mutation in a gene (genomics) only matters if the gene is expressed (transcriptomics) and the protein is functional (proteomics).
  • Machine learning and network analysis tools are being developed to integrate these diverse data types
  • Applications include systems biology, precision medicine, and drug discovery
  • The goal is predictive models that can assess disease risk and treatment response from a patient's combined omics profile