Fundamental Genomic Databases

Why This Matters

In bioinformatics, knowing which database to query is just as important as knowing how to analyze the data you retrieve. You're being tested on your ability to match research questions to appropriate data sources—understanding that sequence retrieval, structural analysis, and variation studies each require different tools. The databases you'll encounter aren't isolated silos; they form an interconnected ecosystem where primary sequence archives, curated reference collections, structural repositories, and functional annotation resources each serve distinct purposes.

Don't just memorize database names and URLs. Know what type of data each resource contains, how databases collaborate and cross-reference each other, and when you'd choose one over another for a specific analysis. Exam questions often present a research scenario and ask which database best addresses the need—or require you to trace data flow from raw sequence submission to functional interpretation.


Primary Sequence Archives: The Foundation Layer

These databases serve as the first stop for newly generated sequence data. They operate under the International Nucleotide Sequence Database Collaboration (INSDC), meaning sequences submitted to one are automatically shared with the others—a critical concept for understanding data redundancy and global accessibility.

GenBank

  • NCBI's primary nucleotide sequence repository and the U.S.-hosted member of the INSDC; a common starting point for retrieving DNA and RNA sequences
  • Accepts direct submissions from researchers worldwide, meaning data quality varies from raw experimental sequences to well-annotated entries
  • Cross-references protein translations automatically, linking nucleotide records to corresponding amino acid sequences via accession numbers
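As a minimal sketch of programmatic retrieval, the helper below builds a URL for the `efetch` endpoint of NCBI's E-utilities, which can return a nucleotide record as a GenBank flat file or as FASTA. The function name `genbank_efetch_url` and the example accession are illustrative assumptions, not part of this guide; the code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_efetch_url(accession, rettype="gb"):
    """Return an efetch URL for one nucleotide accession.

    rettype="gb" asks for the GenBank flat file, rettype="fasta" for FASTA.
    """
    params = {
        "db": "nuccore",      # nucleotide collection
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"
```

Calling `genbank_efetch_url("U49845", rettype="fasta")` gives a URL that can be opened in a browser or handed to any HTTP client.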

EMBL-EBI European Nucleotide Archive (ENA)

  • Europe's primary sequence archive—operates under EMBL-EBI and mirrors GenBank/DDBJ content through INSDC synchronization
  • Emphasizes functional annotations alongside raw sequences, integrating data with other EMBL-EBI resources like InterPro and Ensembl
  • Supports assembled genomes and raw reads, making it essential for accessing next-generation sequencing data in standardized formats

DDBJ (DNA Data Bank of Japan)

  • Asia-Pacific's INSDC partner—completes the global triad ensuring sequence data remains freely accessible regardless of submission location
  • Identical data content to GenBank and ENA due to daily synchronization, so accession numbers work across all three archives
  • Provides specialized submission tools optimized for high-throughput sequencing projects common in Asian research institutions

Compare: GenBank vs. ENA vs. DDBJ—all three contain identical sequence data through INSDC sharing, but differ in submission interfaces, integrated analysis tools, and regional support. For exams, remember: one submission populates all three, so choosing between them is about workflow preference, not data access.


Curated Reference Collections: Quality Over Quantity

While primary archives accept all submissions, these databases apply expert curation to create authoritative reference sets. Curation means human review and standardization—essential when you need reliable sequences for genome annotation or comparative analysis.

RefSeq

  • NCBI's gold-standard reference sequences—distinguished from GenBank by rigorous curation and non-redundant organization
  • Accession prefix system helps identify data type: NM_ for mRNA, NP_ for protein, NC_ for complete genomic molecules
  • Essential for genome annotation pipelines because curated transcripts and proteins provide reliable targets for BLAST searches and gene prediction
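The prefix rule above is easy to encode. This sketch maps only the three prefixes named in this guide; RefSeq defines others, which the hypothetical helper `refseq_molecule_type` simply reports as unknown.

```python
# Only the prefixes listed in this guide; other RefSeq prefixes exist.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "NC_": "complete genomic molecule",
}


def refseq_molecule_type(accession):
    """Classify an accession by its first three characters."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")
```

A check like this is a quick way to confirm that a list of accessions contains the data type a pipeline step expects before running a search.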

UniProt

  • The definitive protein sequence and function database—combines Swiss-Prot (manually curated) and TrEMBL (automatically annotated) entries
  • Functional annotations include enzyme activity, subcellular localization, post-translational modifications, and disease associations
  • Cross-references extensively to PDB for structures, GO for ontology terms, and pathway databases—making it a hub for protein-centric research
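In UniProt FASTA downloads, the header line marks which section an entry comes from: `sp` for Swiss-Prot and `tr` for TrEMBL. The sketch below reads that marker; the function name `parse_uniprot_header` and the returned dictionary keys are assumptions made for illustration.

```python
def parse_uniprot_header(header):
    """Split a UniProt FASTA header of the form '>db|ACCESSION|ENTRY_NAME description'."""
    fields = header.strip().lstrip(">").split("|", 2)
    if len(fields) != 3 or fields[0] not in ("sp", "tr"):
        raise ValueError(f"not a UniProt FASTA header: {header!r}")
    db, accession, rest = fields
    entry_name = rest.split(" ", 1)[0]  # text up to the first space
    return {
        "section": "Swiss-Prot" if db == "sp" else "TrEMBL",
        "accession": accession,
        "entry_name": entry_name,
    }
```

Filtering on the `section` value is one simple way to keep only manually curated entries when reliability matters more than coverage.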

Compare: RefSeq vs. UniProt. Both are curated, but RefSeq provides reference genomic, transcript, and protein sequences anchored in genomic context, while UniProt emphasizes protein function and biological annotation. If an FRQ asks about identifying a gene's genomic location, use RefSeq; for understanding what the protein does, use UniProt.


Genome Browsers: Visualization and Integration

These platforms don't just store data—they display genomic information in spatial context, allowing you to see how genes, regulatory elements, and variations relate to chromosomal position. The key skill is knowing which browser offers the annotations you need.

Ensembl

  • Vertebrate and model organism focus: provides comprehensive gene models, orthology predictions, and variation data across hundreds of species (the exact count changes with each release)
  • Integrates multiple data types including gene predictions, regulatory features, and comparative genomics alignments in a single interface
  • BioMart tool enables bulk data downloads and complex queries across datasets—critical for large-scale computational analyses
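BioMart is one route to bulk access; for single-gene queries, Ensembl also offers a REST service. The sketch below builds a gene-symbol lookup URL for that service. The helper name `ensembl_symbol_lookup_url` and its default species are assumptions for illustration, and the code makes no network request.

```python
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"


def ensembl_symbol_lookup_url(symbol, species="homo_sapiens"):
    """Return a REST URL that looks up a gene by symbol and asks for JSON."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )
```

The JSON returned for such a request typically includes the stable gene ID and chromosomal coordinates, which can then drive larger BioMart queries.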

UCSC Genome Browser

  • Track-based visualization system—excels at displaying diverse annotation types (genes, conservation, epigenomics) as layered horizontal tracks
  • Extensive human and mouse annotations including ENCODE regulatory data, making it preferred for human disease research
  • Custom track uploads allow researchers to overlay their own data onto reference annotations for direct comparison
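A custom track is usually a small text file. As a sketch, the function below writes intervals in BED format (0-based start, end-exclusive) under a `track` header line; the function name, track name, and interval values are hypothetical.

```python
def bed_custom_track(name, description, intervals):
    """Format (chrom, start, end, label) tuples as a BED custom track.

    BED coordinates are 0-based for the start and exclusive for the end.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in intervals:
        if start < 0 or end <= start:
            raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"
```

The returned text can be saved to a file and uploaded through the browser's custom track page to sit alongside the reference annotations.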

Compare: Ensembl vs. UCSC Genome Browser—both visualize genomes, but Ensembl offers stronger comparative genomics and orthology tools while UCSC provides more human regulatory annotations and flexible track customization. Know that both access similar underlying data but present it differently.


Variation Databases: Cataloging Genetic Diversity

Understanding genetic variation requires specialized repositories that document what varies, where it occurs, and how frequently. These databases support population genetics, GWAS, and clinical variant interpretation.

dbSNP

  • NCBI's catalog of short genetic variants—includes SNPs, small insertions/deletions, and microsatellites with unique rs# identifiers
  • Population frequency data from projects like 1000 Genomes helps distinguish rare variants from common polymorphisms
  • Clinical significance annotations link variants to phenotypes, though ClinVar provides more detailed pathogenicity assessments
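Because rs identifiers follow a fixed pattern ("rs" followed by a positive integer), they are easy to pull out of free text such as a methods section or a results table. A minimal sketch, with the helper name `find_rs_ids` chosen for illustration:

```python
import re

# "rs" followed by a positive integer, bounded on both sides.
_RS_PATTERN = re.compile(r"\brs[1-9]\d*\b")


def find_rs_ids(text):
    """Return the distinct rs identifiers in text, in order of first appearance."""
    found = []
    for rs_id in _RS_PATTERN.findall(text):
        if rs_id not in found:
            found.append(rs_id)
    return found
```

The resulting list can be used as query terms when looking up positions and population frequencies in dbSNP.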

Compare: dbSNP vs. UniProt variant annotations—dbSNP catalogs genomic position and population frequency of variants, while UniProt documents functional consequences at the protein level. Use dbSNP to find variants in a region; use UniProt to understand how a specific variant affects protein function.


Structural and Functional Data: Beyond Sequence

Sequence alone doesn't reveal how molecules function. These databases provide three-dimensional structures and experimental functional data that connect genotype to phenotype.

PDB (Protein Data Bank)

  • The sole archive for experimentally determined 3D structures—contains X-ray crystallography, NMR, and cryo-EM structures of proteins and nucleic acids
  • Each entry includes atomic coordinates, experimental methods, resolution metrics, and often bound ligands or interaction partners
  • Essential for structure-based drug design and understanding molecular mechanisms through visualization tools like PyMOL or Chimera
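The atomic coordinates in a legacy PDB-format file live in fixed-width `ATOM` and `HETATM` lines. This sketch reads one such line using the standard column positions; the `Atom` type and the function name `parse_atom_line` are assumptions for illustration, and real projects would normally rely on an established parser.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record (coordinates in angstroms)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        name=line[12:16].strip(),          # columns 13-16
        residue=line[17:20].strip(),       # columns 18-20
        chain=line[21],                    # column 22
        residue_number=int(line[22:26]),   # columns 23-26
        x=float(line[30:38]),              # columns 31-38
        y=float(line[38:46]),              # columns 39-46
        z=float(line[46:54]),              # columns 47-54
    )
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in this format can run together without any separating space.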

Gene Expression Omnibus (GEO)

  • NCBI's repository for functional genomics experiments—archives microarray, RNA-Seq, ChIP-Seq, and other high-throughput datasets
  • Raw data plus processed results allow both reanalysis of original experiments and meta-analyses across studies
  • GEO DataSets and Profiles tools enable quick exploration of gene expression patterns across conditions without downloading full datasets
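GEO accessions carry a prefix that tells you what kind of record you are looking at. The prefixes below (GSE, GSM, GPL, GDS) are not listed in this guide and are added here as background; the helper name `geo_record_type` is hypothetical.

```python
import re

GEO_RECORD_TYPES = {
    "GSE": "series",           # a whole study
    "GSM": "sample",           # one sample within a study
    "GPL": "platform",         # the array or sequencer used
    "GDS": "curated dataset",  # a GEO DataSets record
}

_GEO_PATTERN = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$")


def geo_record_type(accession):
    """Return the record type for a GEO accession, or None if it does not match."""
    match = _GEO_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]
```

Knowing the record type up front helps decide whether an accession points to a full experiment worth reanalyzing or to a single sample within one.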

Compare: PDB vs. GEO—both provide functional insights, but PDB reveals static molecular structure while GEO captures dynamic expression changes. A protein's structure (PDB) explains how it might work; expression data (GEO) shows when and where it's active.


Quick Reference Table

Concept | Best Examples
--- | ---
Primary sequence submission | GenBank, ENA, DDBJ
Curated reference sequences | RefSeq, UniProt (Swiss-Prot)
Genome visualization | Ensembl, UCSC Genome Browser
Genetic variation | dbSNP
Protein function annotation | UniProt
3D molecular structure | PDB
Gene expression experiments | GEO
INSDC collaboration | GenBank + ENA + DDBJ (synchronized)
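For scenario-style questions, it can help to treat the quick reference as a lookup from task to database. The sketch below encodes the rows of the table as a dictionary; the names `DATABASES_BY_TASK` and `databases_for` are illustrative only.

```python
# Rows of the quick reference table, keyed by lowercase task description.
DATABASES_BY_TASK = {
    "primary sequence submission": ["GenBank", "ENA", "DDBJ"],
    "curated reference sequences": ["RefSeq", "UniProt (Swiss-Prot)"],
    "genome visualization": ["Ensembl", "UCSC Genome Browser"],
    "genetic variation": ["dbSNP"],
    "protein function annotation": ["UniProt"],
    "3d molecular structure": ["PDB"],
    "gene expression experiments": ["GEO"],
}


def databases_for(task):
    """Return the databases suggested for a task, or an empty list if unknown."""
    return DATABASES_BY_TASK.get(task.strip().lower(), [])
```

For example, `databases_for("Genetic variation")` returns the single entry for dbSNP, mirroring the table row.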

Self-Check Questions

  1. You've discovered a novel gene and need to submit its sequence for public access. Which database(s) would accept your submission, and what happens to the data after submission through INSDC?

  2. Compare RefSeq and GenBank: both are maintained by NCBI and contain nucleotide sequences, so why would you choose RefSeq accessions over GenBank entries for a genome annotation pipeline?

  3. A colleague wants to identify all known SNPs within a 50 kb region surrounding a disease-associated gene and determine their frequencies in European populations. Which database should they query first, and what identifier system will they encounter?

  4. You're studying a protein's enzymatic mechanism and need both its 3D structure and information about known functional domains. Which two databases would you cross-reference, and what complementary information does each provide?

  5. An FRQ asks you to design a workflow for investigating whether a gene is differentially expressed in cancer versus normal tissue. Which database would you search for existing experimental data, and what data types might you find there?