In bioinformatics, knowing which database to query is just as important as knowing how to analyze the data you retrieve. You're being tested on your ability to match research questions to appropriate data sources—understanding that sequence retrieval, structural analysis, and variation studies each require different tools. The databases you'll encounter aren't isolated silos; they form an interconnected ecosystem where primary sequence archives, curated reference collections, structural repositories, and functional annotation resources each serve distinct purposes.
Don't just memorize database names and URLs. Know what type of data each resource contains, how databases collaborate and cross-reference each other, and when you'd choose one over another for a specific analysis. Exam questions often present a research scenario and ask which database best addresses the need—or require you to trace data flow from raw sequence submission to functional interpretation.
These databases serve as the first stop for newly generated sequence data. They operate under the International Nucleotide Sequence Database Collaboration (INSDC), meaning sequences submitted to one are automatically shared with the others—a critical concept for understanding data redundancy and global accessibility.
Compare: GenBank vs. ENA vs. DDBJ—all three contain identical sequence data through INSDC sharing, but differ in submission interfaces, integrated analysis tools, and regional support. For exams, remember: one submission populates all three, so choosing between them is about workflow preference, not data access.
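To make the "one submission, several access points" idea concrete, here is a minimal sketch that builds FASTA download URLs for the same accession at NCBI (GenBank) and at ENA. The endpoint patterns are written from memory of the NCBI E-utilities `efetch` service and the ENA Browser API, so check them against current documentation before relying on them. The function names and the example accession string are illustrative assumptions, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint patterns (assumed current; verify against each provider's docs).
NCBI_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
ENA_FASTA = "https://www.ebi.ac.uk/ena/browser/api/fasta/"


def genbank_fasta_url(accession: str) -> str:
    """Build an E-utilities efetch URL for a nucleotide record in FASTA."""
    params = {
        "db": "nuccore",      # NCBI nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{NCBI_EFETCH}?{urlencode(params)}"


def ena_fasta_url(accession: str) -> str:
    """Build an ENA Browser API URL for the same record in FASTA."""
    return ENA_FASTA + accession


def insdc_fasta_urls(accession: str) -> dict:
    """Return one URL per archive for a single INSDC accession."""
    return {
        "GenBank": genbank_fasta_url(accession),
        "ENA": ena_fasta_url(accession),
    }


if __name__ == "__main__":
    # Hypothetical usage: the accession is used only to show the URL shape.
    for archive, url in insdc_fasta_urls("U49845.1").items():
        print(archive, url)
```

The same accession string appears in both URLs, which is the practical consequence of INSDC sharing: the identifier travels with the record, and only the access route changes.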
While primary archives accept all submissions, these databases apply expert curation to create authoritative reference sets. Curation means human review and standardization—essential when you need reliable sequences for genome annotation or comparative analysis.
Compare: RefSeq vs. UniProt. Both are curated, but RefSeq provides non-redundant reference sequences (genomic, transcript, and protein) organized around genomic context, while UniProt emphasizes protein function and biological annotation. If an FRQ asks about identifying a gene's genomic location, use RefSeq; for understanding what the protein does, use UniProt.
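RefSeq accessions carry a two-letter prefix followed by an underscore, and that prefix tells you what kind of molecule the record describes. The sketch below decodes a handful of common prefixes. The table is deliberately partial, the function name is an illustrative assumption, and the example accession strings are used only for their format.

```python
# Partial table of common RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC_": "genomic: complete genomic molecule (e.g. a chromosome)",
    "NG_": "genomic: gene region",
    "NM_": "transcript: mRNA",
    "NR_": "transcript: non-coding RNA",
    "NP_": "protein",
    "XM_": "transcript: predicted (model) mRNA",
    "XR_": "transcript: predicted (model) non-coding RNA",
    "XP_": "protein: predicted (model)",
}


def describe_refseq(accession: str):
    """Return a description for a RefSeq-style accession, or None.

    None means the prefix is not in this partial table, which includes
    every non-RefSeq identifier as well as less common RefSeq prefixes.
    """
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


if __name__ == "__main__":
    for acc in ["NM_000546.6", "NP_000537.3", "NC_000017.11", "U49845.1"]:
        print(acc, "->", describe_refseq(acc))
```

Prefixes beginning with `N` mark records from the curated set, while `X` prefixes mark computationally predicted models, a distinction worth checking before treating a record as a reviewed reference.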
These platforms don't just store data—they display genomic information in spatial context, allowing you to see how genes, regulatory elements, and variations relate to chromosomal position. The key skill is knowing which browser offers the annotations you need.
Compare: Ensembl vs. UCSC Genome Browser—both visualize genomes, but Ensembl offers stronger comparative genomics and orthology tools while UCSC provides more human regulatory annotations and flexible track customization. Know that both access similar underlying data but present it differently.
Understanding genetic variation requires specialized repositories that document what varies, where it occurs, and how frequently. These databases support population genetics, genome-wide association studies (GWAS), and clinical variant interpretation.
Compare: dbSNP vs. UniProt variant annotations—dbSNP catalogs genomic position and population frequency of variants, while UniProt documents functional consequences at the protein level. Use dbSNP to find variants in a region; use UniProt to understand how a specific variant affects protein function.
Sequence alone doesn't reveal how molecules function. These databases provide three-dimensional structures and experimental functional data that connect genotype to phenotype.
Compare: PDB vs. GEO—both provide functional insights, but PDB reveals static molecular structure while GEO captures dynamic expression changes. A protein's structure (PDB) explains how it might work; expression data (GEO) shows when and where it's active.
| Concept | Best Examples |
|---|---|
| Primary sequence submission | GenBank, ENA, DDBJ |
| Curated reference sequences | RefSeq, UniProt (Swiss-Prot) |
| Genome visualization | Ensembl, UCSC Genome Browser |
| Genetic variation | dbSNP |
| Protein function annotation | UniProt |
| 3D molecular structure | PDB |
| Gene expression experiments | GEO |
| INSDC collaboration | GenBank + ENA + DDBJ (synchronized) |
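
The mapping in the table can also be read in reverse: given an identifier, which database did it most likely come from? The following is a rough heuristic sketch, not an official validator. The function name and check order are assumptions, the patterns cover only the common identifier shapes (classic four-character PDB IDs, `rs` numbers, GEO series/sample/platform/dataset accessions, the UniProt accession pattern as UniProt documents it, and short INSDC nucleotide accessions), and some strings are genuinely ambiguous between UniProt and INSDC formats.

```python
import re

# Ordered (database, pattern) pairs; the first full match wins.
# Order matters: a string such as "P04637" fits both the UniProt and the
# short INSDC pattern, so UniProt is checked first in this sketch.
ID_PATTERNS = [
    ("RefSeq", re.compile(r"[A-Z]{2}_\d+(\.\d+)?")),
    ("dbSNP", re.compile(r"rs\d+")),
    ("GEO", re.compile(r"(GSE|GSM|GPL|GDS)\d+")),
    ("PDB", re.compile(r"[0-9][A-Za-z0-9]{3}")),
    ("UniProt", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
    )),
    ("GenBank/ENA/DDBJ (INSDC)", re.compile(r"[A-Z]{1,2}\d{5,8}(\.\d+)?")),
]


def guess_database(identifier: str):
    """Return the name of the first database whose pattern matches, or None."""
    for database, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return database
    return None


if __name__ == "__main__":
    for ident in ["NM_000546.6", "rs429358", "GSE2034", "1TUP", "P04637", "U49845"]:
        print(ident, "->", guess_database(ident))
```

A match only suggests where to look first; confirming the record still means querying the database itself.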
You've discovered a novel gene and need to submit its sequence for public access. Which database(s) would accept your submission, and what happens to the data after submission through INSDC?
Compare RefSeq and GenBank: both are maintained by NCBI and contain nucleotide sequences, so why would you choose RefSeq accessions over GenBank entries for a genome annotation pipeline?
A colleague wants to identify all known SNPs within a 50 kb region surrounding a disease-associated gene and determine their frequencies in European populations. Which database should they query first, and what identifier system will they encounter?
You're studying a protein's enzymatic mechanism and need both its 3D structure and information about known functional domains. Which two databases would you cross-reference, and what complementary information does each provide?
An FRQ asks you to design a workflow for investigating whether a gene is differentially expressed in cancer versus normal tissue. Which database would you search for existing experimental data, and what data types might you find there?