🧬 Bioinformatics

Fundamental Genomic Databases

Why This Matters

In bioinformatics, knowing which database to query is just as important as knowing how to analyze the data you retrieve. You're being tested on your ability to match research questions to appropriate data sources. Sequence retrieval, structural analysis, and variation studies each require different tools. The databases you'll encounter aren't isolated silos; they form an interconnected ecosystem where primary sequence archives, curated reference collections, structural repositories, and functional annotation resources each serve distinct purposes.

Don't just memorize database names and URLs. Know what type of data each resource contains, how databases collaborate and cross-reference each other, and when you'd choose one over another for a specific analysis. Exam questions often present a research scenario and ask which database best addresses the need, or require you to trace data flow from raw sequence submission to functional interpretation.


Primary Sequence Archives: The Foundation Layer

These databases are the first stop for newly generated sequence data. They operate under the International Nucleotide Sequence Database Collaboration (INSDC), which means sequences submitted to any one member are automatically shared with the other two. This is a critical concept: it ensures data redundancy and global accessibility regardless of where a researcher submits.

GenBank

  • NCBI's primary nucleotide sequence repository and a default starting point for retrieving DNA and RNA sequences, particularly in U.S.-based research
  • Accepts direct submissions from researchers worldwide, so data quality varies widely, from raw experimental sequences to well-annotated entries
  • Cross-references protein translations automatically, linking nucleotide records to corresponding amino acid sequences via accession numbers

EMBL-EBI European Nucleotide Archive (ENA)

  • Europe's primary sequence archive, operated by EMBL-EBI, mirroring GenBank and DDBJ content through INSDC synchronization
  • Integrates data with other EMBL-EBI resources like InterPro and Ensembl, providing functional context alongside raw sequences
  • Supports both assembled genomes and raw sequencing reads, making it particularly useful for accessing next-generation sequencing data in standardized formats

DDBJ (DNA Data Bank of Japan)

  • Japan-based INSDC partner and a main submission point for researchers in Asia, completing the global triad that keeps sequence data freely accessible regardless of submission location
  • Contains identical data to GenBank and ENA due to daily synchronization, so accession numbers work across all three archives
  • Provides specialized submission tools optimized for high-throughput sequencing projects common in Asian research institutions

Compare: GenBank vs. ENA vs. DDBJ: all three contain identical sequence data through INSDC sharing, but differ in submission interfaces, integrated analysis tools, and regional support. One submission populates all three, so choosing between them is about workflow preference, not data access.
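Because accession numbers are shared across the INSDC archives, the same record can be requested from more than one member. The sketch below only builds the request URLs and makes no network calls. The function names are invented for illustration, and the endpoint patterns (NCBI E-utilities `efetch` and the ENA Browser API) are written as I understand them from public documentation, so check the current docs before relying on them. `U49845.1` is used purely as a sample accession.

```python
from urllib.parse import urlencode

# Endpoint patterns are assumptions based on public documentation; verify before use.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
ENA_BROWSER_FASTA = "https://www.ebi.ac.uk/ena/browser/api/fasta"


def genbank_fasta_url(accession: str) -> str:
    """Build an E-utilities efetch URL asking NCBI for a nucleotide record as FASTA."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def ena_fasta_url(accession: str) -> str:
    """Build an ENA Browser API URL asking EMBL-EBI for the same record as FASTA."""
    return f"{ENA_BROWSER_FASTA}/{accession}"


# One accession, two archives: only the URL changes, not the identifier.
for build_url in (genbank_fasta_url, ena_fasta_url):
    print(build_url("U49845.1"))
```

The point of the example is the shared identifier: the choice of archive changes the interface you talk to, not the accession you look up.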


Curated Reference Collections: Quality Over Quantity

Primary archives accept all submissions, but these databases apply expert curation to create authoritative reference sets. Curation means human review and standardization of records. You need curated data when reliable, non-redundant sequences matter, such as for genome annotation or comparative analysis.

RefSeq

  • NCBI's gold-standard reference sequences, distinguished from GenBank by rigorous curation and non-redundant organization
  • Uses an accession prefix system that tells you the data type at a glance: NM_ for mRNA, NP_ for protein, NC_ for complete genomic molecules, XM_/XP_ for computationally predicted transcripts/proteins
  • Essential for genome annotation pipelines because curated transcripts and proteins provide reliable targets for BLAST searches and gene prediction
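The prefix system above is easy to apply in code. Here is a minimal sketch that maps only the five prefixes listed in this guide to a molecule type and curation status. The function and dictionary names are hypothetical, and RefSeq defines more prefixes than the ones covered here, so anything else is reported as unrecognized rather than guessed.

```python
# Only the prefixes named in this guide; RefSeq defines others not covered here.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "NC_": ("complete genomic molecule", "curated"),
    "XM_": ("mRNA", "computationally predicted"),
    "XP_": ("protein", "computationally predicted"),
}


def describe_refseq(accession: str):
    """Return (molecule type, status) for a RefSeq accession, or None if unrecognized."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


for acc in ("NM_000546.6", "XP_123456.1", "AB000001.1"):
    print(acc, "->", describe_refseq(acc))
```

In an annotation pipeline, a check like this can be used to keep curated `NM_`/`NP_` records and set aside predicted `XM_`/`XP_` models for separate review.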

UniProt

UniProt is the definitive protein sequence and function database. It's split into two sections: Swiss-Prot (manually curated, smaller, high-confidence entries) and TrEMBL (automatically annotated, much larger, less verified). Knowing which section your protein comes from tells you how much you can trust the annotations.

  • Functional annotations cover enzyme activity, subcellular localization, post-translational modifications, and disease associations
  • Cross-references extensively to PDB for structures, Gene Ontology (GO) for standardized function terms, and pathway databases like Reactome and KEGG, making it a hub for protein-centric research
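One practical way to tell which section an entry comes from is the FASTA header: UniProt headers begin with `sp|` for Swiss-Prot and `tr|` for TrEMBL. The sketch below parses that header prefix. The function name and returned field names are my own, and the TrEMBL header in the example is made up for illustration.

```python
import re

# Matches the start of a UniProt FASTA header: >db|ACCESSION|ENTRY_NAME ...
UNIPROT_HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)")


def parse_uniprot_header(line: str):
    """Extract section, accession, and entry name from a UniProt FASTA header line."""
    match = UNIPROT_HEADER.match(line)
    if match is None:
        return None  # not a UniProt-style header
    db, accession, entry_name = match.groups()
    return {
        "section": "Swiss-Prot" if db == "sp" else "TrEMBL",
        "reviewed": db == "sp",  # Swiss-Prot entries are manually curated
        "accession": accession,
        "entry_name": entry_name,
    }


headers = [
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2",
    ">tr|A0A000EXAMPLE|A0A000EXAMPLE_HUMAN Hypothetical protein (made-up header)",
]
for header in headers:
    print(parse_uniprot_header(header))
```

Filtering on the `reviewed` flag is a quick way to restrict an analysis to manually curated annotations.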

Compare: RefSeq vs. UniProt: both are curated, but RefSeq focuses on nucleotide sequences and genomic context while UniProt emphasizes protein function and biological annotation. If a question asks about identifying a gene's genomic location, use RefSeq. For understanding what the protein does, use UniProt.


Genome Browsers: Visualization and Integration

These platforms don't just store data. They display genomic information in spatial context, letting you see how genes, regulatory elements, and variations relate to chromosomal position. The key skill is knowing which browser offers the annotations you need.

Ensembl

  • Vertebrate and model organism focus, providing comprehensive gene models, orthology predictions, and variation data for several hundred species
  • Integrates multiple data types including gene predictions, regulatory features, and comparative genomics alignments in a single interface
  • The BioMart tool enables bulk data downloads and complex queries across datasets, which is critical for large-scale computational analyses where you need to pull data programmatically rather than clicking through a browser

UCSC Genome Browser

  • Track-based visualization system that excels at displaying diverse annotation types (genes, conservation scores, epigenomic marks) as layered horizontal tracks
  • Offers extensive human and mouse annotations including ENCODE regulatory data, making it a go-to for human disease and regulatory genomics research
  • Supports custom track uploads, so researchers can overlay their own experimental data onto reference annotations for direct visual comparison

Compare: Ensembl vs. UCSC Genome Browser: both visualize genomes and access similar underlying data, but Ensembl offers stronger comparative genomics and orthology tools while UCSC provides more human regulatory annotations and flexible track customization. Pick based on what your analysis needs.
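Custom tracks for the UCSC Genome Browser are commonly supplied in BED format, which uses 0-based, half-open coordinates, whereas many annotation tables report 1-based, inclusive positions. The sketch below shows that conversion and wraps the features in a simple `track` header line. The function names, track name, and coordinates are all invented for illustration.

```python
def to_bed_line(chrom: str, start_1based: int, end_1based: int, name: str) -> str:
    """Convert a 1-based inclusive interval to a 4-column BED line (0-based, half-open)."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # BED start is 0-based, so subtract 1; BED end is exclusive, so it stays as-is.
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


def custom_track(track_name: str, description: str, features) -> str:
    """Assemble a track header plus BED lines into one text block for upload."""
    header = f'track name="{track_name}" description="{description}"'
    return "\n".join([header] + [to_bed_line(*feature) for feature in features])


# Hypothetical peaks from an experiment, given as 1-based inclusive intervals.
peaks = [("chr17", 1000, 2000, "peak1"), ("chr17", 5001, 5500, "peak2")]
print(custom_track("my_peaks", "Example peaks", peaks))
```

Off-by-one errors between coordinate systems are a common source of misplaced features, so it helps to do the conversion in one clearly named function.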


Variation Databases: Cataloging Genetic Diversity

Understanding genetic variation requires specialized repositories that document what varies, where it occurs, and how frequently. These databases support population genetics, genome-wide association studies (GWAS), and clinical variant interpretation.

dbSNP

dbSNP is NCBI's catalog of short genetic variants, including single nucleotide polymorphisms (SNPs), small insertions/deletions (indels), and microsatellites. Each variant gets a unique rs# identifier (e.g., rs1234567) that serves as a universal reference across studies and publications.

  • Includes population frequency data from projects like the 1000 Genomes Project and gnomAD, helping you distinguish rare variants from common polymorphisms
  • Contains clinical significance annotations linking variants to phenotypes, though ClinVar (a separate NCBI database) provides more detailed and regularly updated pathogenicity assessments for clinical use

Compare: dbSNP vs. UniProt variant annotations: dbSNP catalogs genomic position and population frequency of variants, while UniProt documents functional consequences at the protein level. Use dbSNP to find what variants exist in a genomic region; use UniProt to understand how a specific variant affects protein function.
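Two small checks come up often when working with dbSNP-style data: confirming that an identifier looks like an rs# and sorting variants into rare versus common by allele frequency. The sketch below does both. The function names are hypothetical, and the 1% cutoff is a widely used convention that I am assuming here, not a rule set by dbSNP.

```python
import re

# rs followed by a positive integer, e.g. rs1234567
RS_ID = re.compile(r"rs[1-9]\d*")


def is_rs_id(identifier: str) -> bool:
    """Return True if the string has the form of a dbSNP rs# identifier."""
    return RS_ID.fullmatch(identifier) is not None


def frequency_class(allele_frequency: float, rare_threshold: float = 0.01) -> str:
    """Label a variant 'rare' or 'common'; the 1% default is an assumed convention."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return "rare" if allele_frequency < rare_threshold else "common"


print(is_rs_id("rs1234567"))    # True
print(frequency_class(0.002))   # rare
print(frequency_class(0.25))    # common
```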


Structural and Functional Data: Beyond Sequence

Sequence alone doesn't reveal how molecules function. These databases provide three-dimensional structures and experimental functional data that connect genotype to phenotype.

PDB (Protein Data Bank)

The PDB is the single worldwide archive for experimentally determined 3D structures of biological macromolecules. Published experimental structures are almost always deposited here, so it is the first place to look.

  • Contains structures determined by X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy (cryo-EM) for proteins, nucleic acids, and their complexes
  • Each entry includes atomic coordinates, experimental method details, resolution metrics, and often bound ligands or interaction partners
  • Essential for structure-based drug design and understanding molecular mechanisms; structures are typically visualized with tools like PyMOL, UCSF Chimera, or Mol*
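The atomic coordinates mentioned above are stored, in the legacy PDB file format, as fixed-width `ATOM` records where each field sits in a defined column range. The sketch below reads one such line by slicing columns. The function name is my own, the column positions follow the PDB format as I understand it, and it ignores many fields (occupancy, B-factor, element), so treat it as a reading aid and use an established parser for real work.

```python
def parse_atom_record(line: str):
    """Parse selected fixed-width fields from a PDB-format ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        return None  # not a coordinate record
    return {
        "atom_name": line[12:16].strip(),   # columns 13-16
        "residue": line[17:20].strip(),     # columns 18-20
        "chain": line[21],                  # column 22
        "residue_number": int(line[22:26]), # columns 23-26
        "xyz": (
            float(line[30:38]),             # columns 31-38
            float(line[38:46]),             # columns 39-46
            float(line[46:54]),             # columns 47-54
        ),
    }


# Illustrative record; the spacing matters because fields are column-based.
sample = "ATOM      1  N   THR A   1      17.047  14.099   3.625  1.00 13.79           N"
print(parse_atom_record(sample))
```

Splitting on whitespace is tempting but unreliable for this format, because adjacent fields can run together without a separating space.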

Gene Expression Omnibus (GEO)

GEO is NCBI's repository for functional genomics experiments, archiving data from microarray, RNA-Seq, ChIP-Seq, ATAC-Seq, and other high-throughput assays. Think of it as the place to find out when, where, and how much genes are expressed under different conditions.

  • Stores both raw data and processed results, enabling reanalysis of original experiments and meta-analyses across multiple studies
  • GEO DataSets and GEO Profiles tools let you quickly explore gene expression patterns across conditions without downloading full datasets, which is useful for preliminary investigation

Compare: PDB vs. GEO: both provide functional insights, but PDB reveals static molecular structure while GEO captures dynamic expression changes. A protein's structure (PDB) explains how it might work mechanistically; expression data (GEO) shows when and where it's active.
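GEO accessions also signal their record type through a prefix: GSE for a series (a whole study), GSM for an individual sample, GPL for a platform, and GDS for a curated DataSet. A minimal sketch of that lookup follows. The function name is hypothetical and the accession numbers in the example are arbitrary illustrations, not references to specific studies.

```python
import re

GEO_RECORD_TYPES = {
    "GSE": "Series (a complete study)",
    "GSM": "Sample (one measured specimen)",
    "GPL": "Platform (array or sequencer definition)",
    "GDS": "DataSet (curated, comparable samples)",
}

GEO_ACCESSION = re.compile(r"(GSE|GSM|GPL|GDS)(\d+)")


def classify_geo_accession(accession: str):
    """Return the GEO record type for an accession, or None if it is not recognized."""
    match = GEO_ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]


for acc in ("GSE12345", "GSM678901", "GPL570", "XYZ1"):
    print(acc, "->", classify_geo_accession(acc))
```

When searching for cancer-versus-normal expression data, results are usually browsed at the series (GSE) level and then drilled down to individual samples (GSM).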


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Primary sequence submission | GenBank, ENA, DDBJ |
| Curated reference sequences | RefSeq, UniProt (Swiss-Prot) |
| Genome visualization | Ensembl, UCSC Genome Browser |
| Genetic variation | dbSNP, ClinVar |
| Protein function annotation | UniProt |
| 3D molecular structure | PDB |
| Gene expression experiments | GEO |
| INSDC collaboration | GenBank + ENA + DDBJ (synchronized) |

Self-Check Questions

  1. You've discovered a novel gene and need to submit its sequence for public access. Which database(s) would accept your submission, and what happens to the data after submission through INSDC?

  2. Compare RefSeq and GenBank: both are maintained by NCBI and contain nucleotide sequences, so why would you choose RefSeq accessions over GenBank entries for a genome annotation pipeline?

  3. A colleague wants to identify all known SNPs within a 50 kb region surrounding a disease-associated gene and determine their frequencies in European populations. Which database should they query first, and what identifier system will they encounter?

  4. You're studying a protein's enzymatic mechanism and need both its 3D structure and information about known functional domains. Which two databases would you cross-reference, and what complementary information does each provide?

  5. You need to investigate whether a gene is differentially expressed in cancer versus normal tissue. Which database would you search for existing experimental data, and what data types might you find there?