In bioinformatics, knowing which database to query is just as important as knowing how to analyze the data you retrieve. You're being tested on your ability to match research questions to appropriate data sources. Sequence retrieval, structural analysis, and variation studies each require different tools. The databases you'll encounter aren't isolated silos; they form an interconnected ecosystem where primary sequence archives, curated reference collections, structural repositories, and functional annotation resources each serve distinct purposes.
Don't just memorize database names and URLs. Know what type of data each resource contains, how databases collaborate and cross-reference each other, and when you'd choose one over another for a specific analysis. Exam questions often present a research scenario and ask which database best addresses the need, or require you to trace data flow from raw sequence submission to functional interpretation.
Primary sequence archives (GenBank, ENA, and DDBJ) are the first stop for newly generated sequence data. They operate under the International Nucleotide Sequence Database Collaboration (INSDC), whose members exchange data daily, so a sequence submitted to any one member is automatically shared with the other two. This is a critical concept: it ensures data redundancy and global accessibility regardless of where a researcher submits.
Compare: GenBank vs. ENA vs. DDBJ: all three contain identical sequence data through INSDC sharing, but differ in submission interfaces, integrated analysis tools, and regional support. One submission populates all three, so choosing between them is about workflow preference, not data access.
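Because INSDC members share records, the same accession should be retrievable from more than one archive. The sketch below only builds request URLs and does not contact any server. The NCBI address is the E-utilities `efetch` endpoint. The ENA path is an assumed pattern for the ENA Browser API and should be checked against current ENA documentation. The accession `U49845` and both function names are illustrative choices, not part of the original guide.

```python
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint (serves GenBank nucleotide records).
NCBI_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
# Assumption: ENA Browser API FASTA path; verify against ENA's documentation.
ENA_FASTA = "https://www.ebi.ac.uk/ena/browser/api/fasta"


def ncbi_fasta_url(accession: str) -> str:
    """Build an efetch URL requesting a nucleotide record as FASTA text."""
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{NCBI_EFETCH}?{query}"


def ena_fasta_url(accession: str) -> str:
    """Build the (assumed) ENA Browser API URL for the same accession."""
    return f"{ENA_FASTA}/{accession}"


if __name__ == "__main__":
    accession = "U49845"  # example accession, used only for illustration
    print(ncbi_fasta_url(accession))
    print(ena_fasta_url(accession))
```

The point of the sketch is that only the access route changes between archives; the accession itself stays the same.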
Primary archives accept all submissions, whereas curated reference databases such as RefSeq and UniProt apply expert curation to create authoritative reference sets. Curation means human review and standardization of records. You need curated data when reliable, non-redundant sequences matter, such as for genome annotation or comparative analysis.
UniProt is the definitive protein sequence and function database. It's split into two sections: Swiss-Prot (manually curated, smaller, high-confidence entries) and TrEMBL (automatically annotated, much larger, less verified). Knowing which section your protein comes from tells you how much you can trust the annotations.
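One practical way to tell the two sections apart is the FASTA header: UniProtKB headers begin with `sp|` for Swiss-Prot entries and `tr|` for TrEMBL entries. The following is a minimal sketch of that check. The function name `uniprot_section` is hypothetical, and the TrEMBL header in the usage lines is invented purely for illustration.

```python
def uniprot_section(fasta_header: str) -> str:
    """Return 'Swiss-Prot' or 'TrEMBL' from a UniProtKB FASTA header line."""
    header = fasta_header.strip().lstrip(">")
    # The database tag is everything before the first pipe character.
    prefix = header.split("|", 1)[0]
    sections = {"sp": "Swiss-Prot", "tr": "TrEMBL"}
    if prefix not in sections:
        raise ValueError(f"Not a UniProtKB-style header: {fasta_header!r}")
    return sections[prefix]


if __name__ == "__main__":
    reviewed = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"
    # Invented header, shown only to illustrate the 'tr' prefix.
    unreviewed = ">tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein OS=Homo sapiens"
    print(uniprot_section(reviewed))
    print(uniprot_section(unreviewed))
```

A filter like this lets a pipeline keep only manually reviewed entries when annotation confidence matters.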
Compare: RefSeq vs. UniProt: both are curated, but RefSeq provides reference genomic, transcript, and protein sequences tied to genomic context, while UniProt emphasizes protein function and biological annotation. If a question asks about identifying a gene's genomic location, use RefSeq. For understanding what the protein does, use UniProt.
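RefSeq accessions are easy to recognize because they carry a two-letter prefix and an underscore that encode the molecule type, which is also a quick way to distinguish them from GenBank accessions. Here is a small sketch of that lookup. The prefix table covers only the most common prefixes, and the function name and example accession strings are illustrative assumptions.

```python
# Common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule, e.g. a chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA model",
    "XR_": "predicted non-coding RNA model",
    "XP_": "predicted protein model",
}


def refseq_molecule_type(accession: str):
    """Return the molecule type for a RefSeq accession, or None if the
    accession does not start with one of the prefixes listed above."""
    return REFSEQ_PREFIXES.get(accession.strip()[:3].upper())


if __name__ == "__main__":
    # Example accession strings chosen only to illustrate the formats.
    for acc in ("NM_000546.6", "NP_000537.3", "U49845"):
        print(acc, "->", refseq_molecule_type(acc))
```

An accession without an underscore-style prefix, such as the last example, falls through to `None`, signaling that it is not a RefSeq record.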
Genome browsers don't just store data. They display genomic information in spatial context, letting you see how genes, regulatory elements, and variations relate to chromosomal position. The key skill is knowing which browser offers the annotations you need.
Compare: Ensembl vs. UCSC Genome Browser: both visualize genomes and access similar underlying data, but Ensembl offers stronger comparative genomics and orthology tools while UCSC provides more human regulatory annotations and flexible track customization. Pick based on what your analysis needs.
Understanding genetic variation requires specialized repositories that document what varies, where it occurs, and how frequently. These databases support population genetics, genome-wide association studies (GWAS), and clinical variant interpretation.
dbSNP is NCBI's catalog of short genetic variants, including single nucleotide polymorphisms (SNPs), small insertions/deletions (indels), and microsatellites. Each variant gets a unique rs# identifier (e.g., rs1234567) that serves as a universal reference across studies and publications.
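Since rs identifiers follow a fixed shape (the letters `rs` followed by a number), a short check can catch malformed IDs before they reach a query. This is a minimal sketch. The function name is hypothetical, and the pattern assumes that valid IDs have no leading zeros, which is a simplifying assumption rather than an official dbSNP rule.

```python
import re

# 'rs' followed by a positive integer; leading zeros are rejected by assumption.
RS_ID = re.compile(r"^rs([1-9]\d*)$", re.IGNORECASE)


def parse_rs_id(text: str):
    """Return the numeric part of an rs identifier, or None if malformed."""
    match = RS_ID.match(text.strip())
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for candidate in ("rs1234567", "RS42", "ss1234567", "rs"):
        print(candidate, "->", parse_rs_id(candidate))
```

Normalizing identifiers this way makes it simpler to join variant lists from different studies on the same key.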
Compare: dbSNP vs. UniProt variant annotations: dbSNP catalogs genomic position and population frequency of variants, while UniProt documents functional consequences at the protein level. Use dbSNP to find what variants exist in a genomic region; use UniProt to understand how a specific variant affects protein function.
Sequence alone doesn't reveal how molecules function. Structural and functional genomics databases provide three-dimensional structures and experimental functional data that connect genotype to phenotype.
The Protein Data Bank (PDB) is the single worldwide archive for experimentally determined 3D structures of biological macromolecules. Structures solved by methods such as X-ray crystallography, NMR, or cryo-EM are deposited here.
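Each PDB entry is addressed by a short identifier, classically four characters beginning with a digit (for example `4HHB`). The sketch below validates that classic format and builds a coordinate-file URL on the RCSB file server without downloading anything. The function name is hypothetical, and the pattern deliberately ignores any longer identifier formats, so treat it as an assumption-laden illustration.

```python
import re

# Classic PDB ID: one digit (1-9) followed by three letters or digits.
PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}$")


def pdb_download_url(pdb_id: str) -> str:
    """Return the RCSB download URL for a classic four-character PDB ID."""
    cleaned = pdb_id.strip()
    if not PDB_ID.match(cleaned):
        raise ValueError(f"Not a classic four-character PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{cleaned.upper()}.pdb"


if __name__ == "__main__":
    print(pdb_download_url("4hhb"))
```

Validating the identifier first avoids sending free-text names, such as a protein name, to an endpoint that expects an entry ID.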
The Gene Expression Omnibus (GEO) is NCBI's repository for functional genomics experiments, archiving data from microarray, RNA-Seq, ChIP-Seq, ATAC-Seq, and other high-throughput assays. Think of it as the place to find out when, where, and how much genes are expressed under different conditions.
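GEO records come in several types, and the accession prefix indicates which one you are looking at: `GSE` for a series (a whole study), `GSM` for an individual sample, `GPL` for a platform, and `GDS` for a curated dataset. A minimal sketch of that mapping follows. The function name and the example accession numbers are made up for illustration.

```python
import re

GEO_TYPES = {
    "GSE": "series (a complete study)",
    "GSM": "sample",
    "GPL": "platform",
    "GDS": "curated dataset",
}

# A known prefix followed by one or more digits.
GEO_ACCESSION = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$")


def geo_record_type(accession: str):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = GEO_ACCESSION.match(accession.strip().upper())
    return GEO_TYPES[match.group(1)] if match else None


if __name__ == "__main__":
    # Accession numbers below are placeholders showing the format only.
    for acc in ("GSE12345", "GSM678901", "GPL570", "rs1234567"):
        print(acc, "->", geo_record_type(acc))
```

For a question such as comparing tumor and normal tissue, the series-level record is usually the starting point, with its samples listed underneath it.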
Compare: PDB vs. GEO: both provide functional insights, but PDB reveals static molecular structure while GEO captures dynamic expression changes. A protein's structure (PDB) explains how it might work mechanistically; expression data (GEO) shows when and where it's active.
| Concept | Best Examples |
|---|---|
| Primary sequence submission | GenBank, ENA, DDBJ |
| Curated reference sequences | RefSeq, UniProt (Swiss-Prot) |
| Genome visualization | Ensembl, UCSC Genome Browser |
| Genetic variation | dbSNP, ClinVar |
| Protein function annotation | UniProt |
| 3D molecular structure | PDB |
| Gene expression experiments | GEO |
| INSDC collaboration | GenBank + ENA + DDBJ (synchronized) |
You've discovered a novel gene and need to submit its sequence for public access. Which database(s) would accept your submission, and what happens to the data after submission through INSDC?
Compare RefSeq and GenBank: both are maintained by NCBI and contain nucleotide sequences, so why would you choose RefSeq accessions over GenBank entries for a genome annotation pipeline?
A colleague wants to identify all known SNPs within a 50 kb region surrounding a disease-associated gene and determine their frequencies in European populations. Which database should they query first, and what identifier system will they encounter?
You're studying a protein's enzymatic mechanism and need both its 3D structure and information about known functional domains. Which two databases would you cross-reference, and what complementary information does each provide?
You need to investigate whether a gene is differentially expressed in cancer versus normal tissue. Which database would you search for existing experimental data, and what data types might you find there?