🧬Bioinformatics

Fundamental Genomic Databases

Why This Matters

In bioinformatics, knowing which database to query is just as important as knowing how to analyze the data you retrieve. You're being tested on your ability to match research questions to appropriate data sources. Sequence retrieval, structural analysis, and variation studies each require different tools. The databases you'll encounter aren't isolated silos; they form an interconnected ecosystem where primary sequence archives, curated reference collections, structural repositories, and functional annotation resources each serve distinct purposes.

Don't just memorize database names and URLs. Know what type of data each resource contains, how databases collaborate and cross-reference each other, and when you'd choose one over another for a specific analysis. Exam questions often present a research scenario and ask which database best addresses the need, or require you to trace data flow from raw sequence submission to functional interpretation.

Primary Sequence Archives: The Foundation Layer

These databases are the first stop for newly generated sequence data. They operate under the International Nucleotide Sequence Database Collaboration (INSDC), which means sequences submitted to any one member are automatically shared with the other two. This is a critical concept: it ensures data redundancy and global accessibility regardless of where a researcher submits.

GenBank

NCBI's primary nucleotide sequence repository and a common starting point for retrieving DNA and RNA sequences, especially in U.S.-based research
Accepts direct submissions from researchers worldwide, so data quality varies widely, from raw experimental sequences to well-annotated entries
Links annotated coding regions to their protein translations, so nucleotide records point to the corresponding amino acid sequences via protein accession numbers

EMBL-EBI European Nucleotide Archive (ENA)

Europe's primary sequence archive, operated by EMBL-EBI, mirroring GenBank and DDBJ content through INSDC synchronization
Integrates data with other EMBL-EBI resources like InterPro and Ensembl, providing functional context alongside raw sequences
Supports both assembled genomes and raw sequencing reads, making it particularly useful for accessing next-generation sequencing data in standardized formats

DDBJ (DNA Data Bank of Japan)

Asia-Pacific's INSDC partner, completing the global triad that ensures sequence data remains freely accessible regardless of submission location
Contains the same data as GenBank and ENA thanks to daily synchronization, so accession numbers work across all three archives
Provides specialized submission tools optimized for high-throughput sequencing projects common in Asian research institutions

Compare: GenBank vs. ENA vs. DDBJ: all three contain identical sequence data through INSDC sharing, but differ in submission interfaces, integrated analysis tools, and regional support. One submission populates all three, so choosing between them is about workflow preference, not data access.
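
Because accession numbers are shared across the three archives, a record can be retrieved programmatically once you know its accession. As a minimal sketch, the code below builds (but does not send) a request URL for NCBI's E-utilities efetch endpoint. The function name build_efetch_url is hypothetical, and U49845 is used only as an example accession.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Return an efetch URL for a nucleotide record.

    rettype "gb" asks for the GenBank flat file, "fasta" for sequence only.
    This helper is illustrative; it only builds the URL and sends nothing.
    """
    params = {
        "db": "nuccore",      # NCBI nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


print(build_efetch_url("U49845"))
```

Passing rettype="fasta" requests only the sequence, which is often all a downstream alignment step needs.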

Curated Reference Collections: Quality Over Quantity

Primary archives accept all submissions, but these databases apply expert curation to create authoritative reference sets. Curation means human review and standardization of records. You need curated data when reliable, non-redundant sequences matter, such as for genome annotation or comparative analysis.

RefSeq

NCBI's gold-standard reference sequences, distinguished from GenBank by rigorous curation and non-redundant organization
Uses an accession prefix system that tells you the data type at a glance: NM_ for mRNA, NP_ for protein, NC_ for complete genomic molecules, XM_/XP_ for computationally predicted transcripts/proteins
Essential for genome annotation pipelines because curated transcripts and proteins provide reliable targets for BLAST searches and gene prediction
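
The prefix system above lends itself to a simple lookup. The sketch below covers only the five prefixes named in this guide (RefSeq defines others); the function name classify_refseq and the example accession are illustrative.

```python
# Prefix-to-type mapping, limited to the prefixes listed in this guide.
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NP_": "curated protein",
    "NC_": "complete genomic molecule",
    "XM_": "predicted mRNA (computational)",
    "XP_": "predicted protein (computational)",
}


def classify_refseq(accession: str) -> str:
    """Describe a RefSeq accession based on its three-character prefix."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown or non-RefSeq prefix")


for acc in ("NM_000546.6", "XP_000001.1", "U49845"):
    print(acc, "->", classify_refseq(acc))
```

A check like this is a quick way to filter computationally predicted records (XM_/XP_) out of a hit list when only curated entries are wanted.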

UniProt

UniProt is the definitive protein sequence and function database. It's split into two sections: Swiss-Prot (manually curated, smaller, high-confidence entries) and TrEMBL (automatically annotated, much larger, less verified). Knowing which section your protein comes from tells you how much you can trust the annotations.

Functional annotations cover enzyme activity, subcellular localization, post-translational modifications, and disease associations
Cross-references extensively to PDB for structures, Gene Ontology (GO) for standardized function terms, and pathway databases like Reactome and KEGG, making it a hub for protein-centric research
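
One practical way to tell which section an entry comes from is the header of its UniProt FASTA record, which starts with sp for Swiss-Prot or tr for TrEMBL. The following is a minimal sketch of reading that header; parse_uniprot_header is a hypothetical helper, and it assumes the usual db|accession|entry-name layout.

```python
SECTIONS = {
    "sp": "Swiss-Prot (reviewed)",
    "tr": "TrEMBL (unreviewed)",
}


def parse_uniprot_header(header: str) -> dict:
    """Split a UniProt FASTA header into section, accession, and entry name.

    Raises ValueError if the header lacks the two '|' separators.
    """
    line = header.strip().lstrip(">")
    db, accession, rest = line.split("|", 2)
    entry_name = rest.split(" ", 1)[0]
    return {
        "section": SECTIONS.get(db, "unknown"),
        "accession": accession,
        "entry_name": entry_name,
    }


example = (">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
           "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2")
print(parse_uniprot_header(example))
```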

Compare: RefSeq vs. UniProt: both are curated, but RefSeq focuses on nucleotide sequences and genomic context while UniProt emphasizes protein function and biological annotation. If a question asks about identifying a gene's genomic location, use RefSeq. For understanding what the protein does, use UniProt.

Genome Browsers: Visualization and Integration

These platforms don't just store data. They display genomic information in spatial context, letting you see how genes, regulatory elements, and variations relate to chromosomal position. The key skill is knowing which browser offers the annotations you need.

Ensembl

Vertebrate and model organism focus, providing comprehensive gene models, orthology predictions, and variation data for hundreds of species
Integrates multiple data types including gene predictions, regulatory features, and comparative genomics alignments in a single interface
The BioMart tool enables bulk data downloads and complex queries across datasets, which is critical for large-scale computational analyses where you need to pull data programmatically rather than clicking through a browser

UCSC Genome Browser

Track-based visualization system that excels at displaying diverse annotation types (genes, conservation scores, epigenomic marks) as layered horizontal tracks
Offers extensive human and mouse annotations including ENCODE regulatory data, making it a go-to for human disease and regulatory genomics research
Supports custom track uploads, so researchers can overlay their own experimental data onto reference annotations for direct visual comparison
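
A custom track is typically a small text file: a track line followed by feature lines in a format such as BED, where start coordinates are 0-based and end coordinates are exclusive. The sketch below shows one way to generate such text; the helper name make_custom_track, the track name, and the coordinates are all made up for illustration.

```python
def make_custom_track(name, description, features):
    """Build BED-format custom track text.

    features is an iterable of (chrom, start, end, label) tuples using
    BED conventions: 0-based start, exclusive end.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical peaks, not real experimental data.
peaks = [("chr1", 1000, 1500, "peak_1"), ("chr1", 3000, 3200, "peak_2")]
print(make_custom_track("my_peaks", "Example ChIP peaks", peaks), end="")
```

The resulting text can be saved to a file and loaded through the browser's custom track page, where it appears alongside the reference annotations.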

Compare: Ensembl vs. UCSC Genome Browser: both visualize genomes and access similar underlying data, but Ensembl offers stronger comparative genomics and orthology tools while UCSC provides more human regulatory annotations and flexible track customization. Pick based on what your analysis needs.

Variation Databases: Cataloging Genetic Diversity

Understanding genetic variation requires specialized repositories that document what varies, where it occurs, and how frequently. These databases support population genetics, genome-wide association studies (GWAS), and clinical variant interpretation.

dbSNP

dbSNP is NCBI's catalog of short genetic variants, including single nucleotide polymorphisms (SNPs), small insertions/deletions (indels), and microsatellites. Each variant gets a unique rs# identifier (e.g., rs1234567) that serves as a universal reference across studies and publications.

Includes population frequency data from projects like the 1000 Genomes Project and gnomAD, helping you distinguish rare variants from common polymorphisms
Contains clinical significance annotations linking variants to phenotypes, though ClinVar (a separate NCBI database) provides more detailed and regularly updated pathogenicity assessments for clinical use
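
Because rs identifiers follow a fixed pattern (the letters rs followed by digits), they are easy to validate or pull out of free text before querying dbSNP. A minimal sketch, with hypothetical function names and an invented example sentence:

```python
import re

RS_EXACT = re.compile(r"^rs(\d+)$", re.IGNORECASE)
RS_IN_TEXT = re.compile(r"\brs\d+\b")


def parse_rsid(identifier: str) -> int:
    """Return the numeric part of an rs identifier, or raise ValueError."""
    match = RS_EXACT.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an rs identifier: {identifier!r}")
    return int(match.group(1))


def extract_rsids(text: str) -> list:
    """Find every lowercase rs identifier mentioned in a block of text."""
    return RS_IN_TEXT.findall(text)


sentence = "Variants rs1234567 and rs429358 were genotyped; rsX is not an ID."
print(parse_rsid("rs1234567"))
print(extract_rsids(sentence))
```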

Compare: dbSNP vs. UniProt variant annotations: dbSNP catalogs genomic position and population frequency of variants, while UniProt documents functional consequences at the protein level. Use dbSNP to find what variants exist in a genomic region; use UniProt to understand how a specific variant affects protein function.

Structural and Functional Data: Beyond Sequence

Sequence alone doesn't reveal how molecules function. These databases provide three-dimensional structures and experimental functional data that connect genotype to phenotype.

PDB (Protein Data Bank)

The PDB is the single global archive for experimentally determined 3D structures of biological macromolecules. When an experimentally solved structure is publicly released, this is where it is deposited.

Contains structures determined by X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy (cryo-EM) for proteins, nucleic acids, and their complexes
Each entry includes atomic coordinates, experimental method details, resolution metrics, and often bound ligands or interaction partners
Essential for structure-based drug design and understanding molecular mechanisms; structures are typically visualized with tools like PyMOL, UCSF Chimera, or Mol*
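
Before loading a structure into a viewer, you first need its coordinate file. The sketch below builds a download URL on the RCSB file server for a given entry. It assumes the classic four-character PDB ID format (a digit followed by three alphanumeric characters), and the helper name pdb_download_url is hypothetical.

```python
import re

# Classic four-character PDB ID: one digit, then three letters or digits.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def pdb_download_url(pdb_id: str, fmt: str = "cif") -> str:
    """Return a download URL for a PDB entry in mmCIF or legacy PDB format."""
    if not PDB_ID.match(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    if fmt not in ("cif", "pdb"):
        raise ValueError("fmt must be 'cif' or 'pdb'")
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.{fmt}"


print(pdb_download_url("4hhb"))
```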

Gene Expression Omnibus (GEO)

GEO is NCBI's repository for functional genomics experiments, archiving data from microarray, RNA-Seq, ChIP-Seq, ATAC-Seq, and other high-throughput assays. Think of it as the place to find out when, where, and how much genes are expressed under different conditions.

Stores both raw data and processed results, enabling reanalysis of original experiments and meta-analyses across multiple studies
GEO DataSets and GEO Profiles tools let you quickly explore gene expression patterns across conditions without downloading full datasets, which is useful for preliminary investigation
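
GEO records are identified by accessions whose prefix indicates the record type. These prefixes are not covered above but are standard in GEO: GSE for a series (a whole study), GSM for a single sample, GPL for a platform, and GDS for a curated DataSet. A minimal sketch of sorting accessions by type, with a hypothetical function name and made-up accession numbers:

```python
import re

GEO_TYPES = {
    "GSE": "series",
    "GSM": "sample",
    "GPL": "platform",
    "GDS": "curated DataSet",
}

GEO_ACCESSION = re.compile(r"(GSE|GSM|GPL|GDS)(\d+)")


def classify_geo(accession: str) -> tuple:
    """Return (record type, numeric ID) for a GEO accession."""
    match = GEO_ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    return GEO_TYPES[match.group(1)], int(match.group(2))


# Illustrative accessions only; they are not tied to specific studies here.
for acc in ("GSE12345", "GSM678901", "GPL570"):
    print(acc, "->", classify_geo(acc))
```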

Compare: PDB vs. GEO: both provide functional insights, but PDB reveals static molecular structure while GEO captures dynamic expression changes. A protein's structure (PDB) explains how it might work mechanistically; expression data (GEO) shows when and where it's active.

Quick Reference Table

Concept	Best Examples
Primary sequence submission	GenBank, ENA, DDBJ
Curated reference sequences	RefSeq, UniProt (Swiss-Prot)
Genome visualization	Ensembl, UCSC Genome Browser
Genetic variation	dbSNP, ClinVar
Protein function annotation	UniProt
3D molecular structure	PDB
Gene expression experiments	GEO
INSDC collaboration	GenBank + ENA + DDBJ (synchronized)

Self-Check Questions

You've discovered a novel gene and need to submit its sequence for public access. Which database(s) would accept your submission, and what happens to the data after submission through INSDC?
Compare RefSeq and GenBank: both are maintained by NCBI and contain nucleotide sequences, so why would you choose RefSeq accessions over GenBank entries for a genome annotation pipeline?
A colleague wants to identify all known SNPs within a 50 kb region surrounding a disease-associated gene and determine their frequencies in European populations. Which database should they query first, and what identifier system will they encounter?
You're studying a protein's enzymatic mechanism and need both its 3D structure and information about known functional domains. Which two databases would you cross-reference, and what complementary information does each provide?
You need to investigate whether a gene is differentially expressed in cancer versus normal tissue. Which database would you search for existing experimental data, and what data types might you find there?
