2.1 Introduction to biological databases (GenBank, UniProt, PDB, etc.)
8 min read • August 14, 2024
Biological databases are essential tools in computational biology, storing vast amounts of genetic and molecular data. They serve as central repositories for researchers worldwide, enabling efficient data storage, retrieval, and analysis.
GenBank, UniProt, and PDB are key databases for nucleotide sequences, protein information, and 3D structures, respectively. These resources facilitate data sharing, collaboration, and the development of computational tools, accelerating scientific discoveries in the field of biology.
Biological Databases and Their Functions
Major Biological Databases
GenBank: database of nucleotide sequences maintained by the National Center for Biotechnology Information (NCBI)
Stores DNA and RNA sequences from various organisms (human, mouse, bacteria)
Provides information on the source organism, coding regions, and associated publications for each sequence entry
UniProt (Universal Protein Resource): comprehensive database of protein sequences and functional information
Serves as a central resource for analyzing protein sequences and their annotations
Includes data on protein names, amino acid sequences, domain structure, post-translational modifications (phosphorylation, glycosylation), subcellular localization (nucleus, cytoplasm), and biological processes
Protein Data Bank (PDB): database that stores 3D structural data of large biological molecules
Contains experimentally determined structures of proteins and nucleic acids derived from X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy
Provides atomic coordinates, experimental method details, resolution, and other structural annotations for each entry
Ensembl: database that provides genome annotation and analysis for vertebrates and other eukaryotic species
Offers a wide range of genomic data, including gene sequences, regulatory regions (promoters, enhancers), and comparative genomics information
Covers various species (human, mouse, zebrafish) and allows for cross-species comparisons
UCSC Genome Browser: web-based tool for visualizing and exploring genomic data
Provides access to a wide range of annotated genomes (human, mouse, fruit fly)
Allows users to view and analyze various genomic features (genes, transcripts, regulatory elements)
Integration and Metadata in Databases
GenBank and UniProt store metadata associated with the sequences
Taxonomy information (species, lineage) helps organize sequences based on evolutionary relationships
Literature references provide context and support for the submitted sequences
Links to related databases (Ensembl, PDB) facilitate data integration and exploration
UniProt incorporates data from other databases to provide a comprehensive view of protein function and classification
InterPro database of protein families and domains helps classify proteins based on conserved regions
Gene Ontology (GO) terms describe the molecular functions, biological processes, and cellular components associated with proteins
Importance of Databases in Research
Data Storage and Retrieval
Biological databases serve as central repositories for storing, organizing, and retrieving vast amounts of biological data
Researchers worldwide generate data through experiments (sequencing, structural studies) and submit them to databases
Databases provide a structured and standardized format for data storage, making it easier to access and query the information
Centralized data storage ensures data integrity, consistency, and long-term preservation
Databases enable researchers to access and analyze data efficiently
Saves time and resources that would otherwise be spent on generating the data independently
Allows researchers to focus on data analysis and interpretation rather than data generation
Provides access to a wide range of data types (sequences, structures, annotations) through user-friendly interfaces and search tools
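To make "structured and standardized storage" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and the TOY accessions are assumptions made up for illustration; real databases use far richer schemas.

```python
import sqlite3

# Toy in-memory table. The rows are illustrative placeholders, not real records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    " accession TEXT PRIMARY KEY,"   # unique identifier for each entry
    " organism TEXT NOT NULL,"
    " molecule TEXT NOT NULL,"
    " length INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?)",
    [
        ("TOY00001", "Homo sapiens", "DNA", 1200),
        ("TOY00002", "Mus musculus", "RNA", 850),
        ("TOY00003", "Homo sapiens", "RNA", 430),
    ],
)


def find_by_organism(connection, organism):
    """Return (accession, length) rows for one organism, longest first."""
    cursor = connection.execute(
        "SELECT accession, length FROM entries"
        " WHERE organism = ? ORDER BY length DESC",
        (organism,),
    )
    return cursor.fetchall()


human_rows = find_by_organism(conn, "Homo sapiens")
print(human_rows)
```

Because the schema fixes the fields every entry must have, the same query works for any organism, which is the practical benefit of a standardized format.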
Collaboration and Data Sharing
Biological databases facilitate data sharing and collaboration among researchers
Promotes scientific progress by allowing researchers to build upon existing knowledge
Enables researchers to validate and reproduce findings from other studies
Encourages collaborative efforts to tackle complex biological questions
Data sharing through databases accelerates the pace of scientific discoveries
Researchers can quickly access and integrate data from multiple sources
Facilitates the identification of patterns, trends, and relationships across different datasets
Stimulates new hypotheses and research directions based on the available data
Data Analysis and Hypothesis Generation
The availability of well-curated and annotated data in biological databases allows researchers to perform comparative analyses
Comparing sequences across different species helps identify conserved regions and functional elements
Analyzing protein structures provides insights into their function, interactions, and evolutionary relationships
Integrating data from multiple sources (gene expression, protein interactions, pathways) enables a systems-level understanding of biological processes
Biological databases provide a foundation for generating hypotheses and guiding experimental validation
Researchers can identify potential targets for further study based on the information available in databases
Computational predictions (gene function, protein structure) can be experimentally tested and refined
Databases facilitate the design of targeted experiments by providing relevant background information and candidate molecules
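The idea of comparing sequences across species to find conserved regions can be sketched with a simple per-column score. The alignment below is invented, and the scoring rule (fraction of sequences sharing the most common residue) is one simple choice among many, not a standard from any particular tool.

```python
from collections import Counter

# Toy alignment: invented sequences of equal length, '-' marks a gap.
alignment = {
    "species_a": "MKT-LLVA",
    "species_b": "MKTALLVA",
    "species_c": "MRTALIVA",
}


def column_conservation(aligned_seqs):
    """Score each column as the fraction of sequences sharing the top residue."""
    seqs = list(aligned_seqs)
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    scores = []
    for column in zip(*seqs):
        residues = [r for r in column if r != "-"]  # gaps never count as matches
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


scores = column_conservation(alignment.values())
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)
```

Columns where every species carries the same residue are candidates for functionally important positions, which can then be checked experimentally.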
Computational Tool Development
Biological databases provide a rich resource for developing computational tools and algorithms
Sequence alignment algorithms (BLAST, HMMER) rely on the availability of comprehensive sequence databases
Protein structure prediction methods (homology modeling, threading) utilize structural data from PDB
Gene prediction and annotation tools (AUGUSTUS, MAKER) leverage information from databases to improve their accuracy
The integration of data from multiple biological databases enables the development of predictive models
Machine learning algorithms can be trained on large datasets to predict protein function, disease associations, or drug targets
Network analysis tools can uncover complex relationships and pathways by integrating data from various sources
Databases provide the necessary training data and benchmarks for developing and evaluating computational methods
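As a rough illustration of why search tools depend on a sequence database, the sketch below indexes short words (k-mers) from a toy database and counts how many a query shares with each entry. This is only the general seeding idea, not how BLAST or HMMER are actually implemented; the sequences and function names are invented.

```python
from collections import defaultdict

# Toy nucleotide "database"; sequences are invented for illustration.
database = {
    "toy_seq_1": "ATGGCGTACGTTAGC",
    "toy_seq_2": "TTGACGTAGGAACCA",
    "toy_seq_3": "GGGCCCAAATTTGGG",
}


def build_kmer_index(sequences, k):
    """Map every k-mer to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query, index, k):
    """Rank database entries by the number of distinct query k-mers they share."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    counts = defaultdict(int)
    for kmer in query_kmers:
        for name in index.get(kmer, ()):
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


index = build_kmer_index(database, k=4)
hits = candidate_hits("ACGTACG", index, k=4)
print(hits)
```

A real tool would follow each candidate with a proper alignment and a statistical score; the index only narrows down where to look.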
Primary vs Secondary Databases
Primary Databases
Primary databases store original, experimentally derived data that have not undergone significant processing or interpretation
Examples include GenBank (nucleotide sequences), UniProt (protein sequences), and PDB (macromolecular structures)
Data is directly submitted by researchers who generated the experimental results
Curation is usually lighter than in secondary databases, although the amount varies (UniProt, for example, also contains manually reviewed entries)
Primary databases are typically maintained by organizations that accept, archive, and distribute submitted data
The National Center for Biotechnology Information (NCBI) maintains GenBank
The European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR) collaborate on UniProt
The worldwide Protein Data Bank (wwPDB) manages PDB
Primary databases focus on storing raw data and ensuring its integrity and accessibility
Provide unique identifiers (accession numbers) for each data entry
Implement data validation and quality control measures to ensure data consistency
Offer search and retrieval tools to access the stored data
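Accession numbers follow recognizable layouts, so a script can often guess which database an identifier belongs to. The sketch below assumes the UniProtKB accession pattern as UniProt documents it, the common GenBank nucleotide layouts (1 letter + 5 digits, 2 letters + 6 digits, 2 letters + 8 digits), and classic four-character PDB IDs. The function name and labels are hypothetical. Because the formats overlap, the function returns every candidate rather than a single answer.

```python
import re

# (label, pattern) pairs; patterns are matched against the whole identifier.
PATTERNS = [
    ("UniProtKB", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("GenBank nucleotide", re.compile(
        r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}|[A-Z]{2}[0-9]{8}")),
    ("PDB", re.compile(r"[0-9][A-Za-z0-9]{3}")),
]


def classify_accession(identifier):
    """Return the labels of all databases whose accession layout matches."""
    # Drop an optional ".version" suffix such as the ".3" in "U00096.3".
    core = re.sub(r"\.\d+$", "", identifier.strip())
    return [label for label, pattern in PATTERNS if pattern.fullmatch(core)]


for example in ("4HHB", "U00096.3", "P69905", "not-an-id"):
    print(example, classify_accession(example))
```

An identifier such as P69905 fits both the UniProtKB and the one-letter GenBank layout, so format alone cannot always settle the source; the database being queried provides the missing context.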
Secondary Databases
Secondary databases, also known as derived databases, contain information that has been curated, annotated, or computationally analyzed based on data from primary databases
Examples include Pfam (protein families), GO (Gene Ontology), and KEGG (metabolic pathways)
Data is derived from the analysis and interpretation of primary data sources
Involves manual curation by experts or automated computational pipelines
Secondary databases integrate data from multiple primary sources to provide additional layers of annotation and interpretation
Pfam database groups proteins into families based on sequence similarity and conserved domains
GO database provides a structured vocabulary to describe gene and protein functions across different species
KEGG database maps genes and proteins to metabolic pathways and molecular interactions
Secondary databases aim to provide insights and knowledge derived from the analysis of primary data
Facilitate the functional annotation and classification of genes and proteins
Enable the identification of evolutionary relationships and conserved functional modules
Provide a higher-level understanding of biological processes and systems
Relationship between Primary and Secondary Databases
Secondary databases rely on the data stored in primary databases as their main source of information
Pfam and InterPro databases use protein sequences from UniProt to build families and identify conserved domains
GO annotations in UniProt are derived from manual curation and computational analysis of primary sequence and literature data
KEGG database integrates data from GenBank, UniProt, and PDB to construct metabolic pathways and molecular networks
The curation and annotation efforts in secondary databases add value to the primary data
Provide standardized terminology and classifications for describing biological entities and processes
Facilitate data integration and comparison across different studies and organisms
Enable researchers to make inferences and generate hypotheses based on the annotated data
Updates and changes in primary databases are propagated to secondary databases
Secondary databases regularly update their content to incorporate new data from primary sources
Ensure the consistency and reliability of the derived information
Provide versioning and tracking of changes to maintain data provenance
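One way to picture how updates propagate is to compare the version of each primary record with the version a derived entry was built from. The sketch assumes the "accession.version" convention used by GenBank-style identifiers; the accessions, family names, and function names are invented for illustration.

```python
def split_versioned(identifier):
    """Split 'ACCESSION.VERSION' into (accession, integer version)."""
    accession, _, version = identifier.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {identifier!r}")
    return accession, int(version)


def stale_entries(primary_ids, derived_from):
    """List derived entries built from an older version than the primary holds.

    primary_ids: versioned identifiers currently in the primary database.
    derived_from: maps a derived entry name to the versioned identifier it used.
    """
    current = dict(split_versioned(i) for i in primary_ids)
    stale = []
    for name, source in derived_from.items():
        accession, used_version = split_versioned(source)
        if current.get(accession, used_version) > used_version:
            stale.append(name)
    return sorted(stale)


# Toy data: invented identifiers, not real records.
primary = ["TOY000001.3", "TOY000002.1"]
derived = {"family_a": "TOY000001.2", "family_b": "TOY000002.1"}
needs_update = stale_entries(primary, derived)
print(needs_update)
```

Recording the source version alongside each derived entry is what makes this check possible, and it is the core of data provenance.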
Data Types in GenBank, UniProt, and PDB
GenBank Data Types
Nucleotide sequences: DNA and RNA sequences from various organisms
Includes genomic DNA, cDNA, and EST sequences
Represents the primary genetic information of organisms
Sequence annotations: Additional information associated with each sequence entry
Source organism and taxonomy
Coding regions and gene products (proteins)
Regulatory elements (promoters, enhancers)
Literature references and links to related databases
Sequence features: Specific regions or sites within the nucleotide sequence
Genes, exons, introns, and untranslated regions (UTRs)
Transcription start sites and polyadenylation signals
Mutations, polymorphisms, and sequence variations
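The sequence, annotation, and feature layers described above all live in one GenBank flat-file record. The minimal parser below reads a toy record written in that layout (the record is invented, and the parser handles only the handful of fields shown). For real work, a maintained library such as Biopython is the safer choice.

```python
# Toy record in GenBank flat-file layout; it is not a real GenBank entry.
GENBANK_TEXT = """\
LOCUS       TOY000001                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Illustrative toy record.
ACCESSION   TOY000001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
                     /product="toy peptide"
ORIGIN
        1 atggctaaag gtctgtttga ataa
//
"""


def parse_genbank(text):
    """Extract a few header fields, feature keys with locations, and the sequence."""
    record = {"features": [], "sequence": ""}
    section = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        keyword = line[:12].strip()
        if line[:1] != " " and keyword:    # top-level keywords start in column 1
            section = keyword.split()[0]
            if section in ("LOCUS", "DEFINITION", "ACCESSION"):
                record[section.lower()] = line[12:].strip()
        elif section == "SOURCE" and keyword == "ORGANISM":
            record["organism"] = line[12:].strip()
        elif section == "FEATURES" and line[5:6].strip():
            # Feature keys are indented 5 spaces; qualifier lines are indented further.
            key, location = line.split(None, 1)
            record["features"].append((key, location.strip()))
        elif section == "ORIGIN":
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
    return record


record = parse_genbank(GENBANK_TEXT)
print(record["accession"], record["organism"], record["features"])
print(record["sequence"])
```

The header fields correspond to the annotations, the FEATURES table to the sequence features, and the ORIGIN block to the nucleotide sequence itself.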
UniProt Data Types
Protein sequences: Amino acid sequences of proteins from various organisms
Includes both reviewed (manually annotated) and unreviewed (automatically annotated) entries
Represents the primary structure of proteins
Protein annotations: Additional information associated with each protein entry
Cross-references: Links to related databases and resources
Gene Ontology (GO) terms for functional annotation
Pfam and InterPro databases for domain and family information
PDB database for structural information
Literature references and external database identifiers
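UniProt distributes sequences in FASTA format with a structured header: the prefix "sp" marks a reviewed (Swiss-Prot) entry and "tr" an unreviewed (TrEMBL) one, followed by the accession, entry name, protein name, and tagged fields such as OS (organism) and OX (taxonomy ID). The sketch below parses that layout; the entry itself is invented, and the function and field names are hypothetical.

```python
import re

# Invented entry in UniProt FASTA header layout; accession and sequence are toys.
FASTA_TEXT = """\
>sp|P00000|TOY_HUMAN Toy protein OS=Homo sapiens OX=9606 GN=TOY1 PE=1 SV=1
MKTAYIAKQR
QISFVKSHFS
"""

HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_uniprot_fasta(text):
    """Return one dict per entry with header fields and the joined sequence."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError(f"unrecognized header: {line!r}")
            db, accession, entry_name, rest = match.groups()
            parts = FIELD_RE.split(rest)  # [protein name, tag, value, tag, value, ...]
            fields = dict(zip(parts[1::2], (v.strip() for v in parts[2::2])))
            entries.append({
                "reviewed": db == "sp",
                "accession": accession,
                "entry_name": entry_name,
                "protein_name": parts[0].strip(),
                "organism": fields.get("OS"),
                "taxon_id": fields.get("OX"),
                "gene": fields.get("GN"),
                "sequence": "",
            })
        elif entries:
            entries[-1]["sequence"] += line
    return entries


entries = parse_uniprot_fasta(FASTA_TEXT)
print(entries[0])
```

The `reviewed` flag is the quickest way to separate manually annotated entries from automatically annotated ones when filtering a download.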
PDB Data Types
3D structural data: Atomic coordinates and experimental details of biological macromolecules
Proteins, nucleic acids (DNA, RNA), and their complexes
Experimentally determined structures from X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy
Structural annotations: Additional information associated with each structure entry
Secondary structure elements (alpha helices, beta sheets)
Ligands, cofactors, and metal ions
Biological assembly and quaternary structure
Experimental conditions and quality metrics (resolution, R-factor)
Structural features: Specific regions or sites within the 3D structure
Active sites and binding pockets
Protein-protein interaction interfaces
Conformational changes and flexibility
Mutations and their structural impact
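Atomic coordinates in the legacy PDB file format are stored in fixed-width ATOM and HETATM lines, so they are read by column position rather than by splitting on whitespace. The sketch below parses a few illustrative lines written in that layout and measures one interatomic distance; the `Atom` class and function names are hypothetical, and newer entries are also distributed in mmCIF, which this does not handle.

```python
import math
from dataclasses import dataclass

# Illustrative lines in PDB ATOM layout; treat the values as toy data.
PDB_TEXT = """\
HEADER    TOY EXAMPLE
ATOM      1  N   THR A   1      17.047  14.099   3.625  1.00 13.79           N
ATOM      2  CA  THR A   1      16.967  12.784   4.338  1.00 10.80           C
END
"""


@dataclass
class Atom:
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float
    element: str


def parse_atoms(text):
    """Read ATOM/HETATM records using the fixed column ranges of the PDB format."""
    atoms = []
    for line in text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        line = line.ljust(80)  # pad so that column slices never run short
        atoms.append(Atom(
            serial=int(line[6:11]),
            name=line[12:16].strip(),
            residue=line[17:20].strip(),
            chain=line[21],
            residue_number=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
            element=line[76:78].strip(),
        ))
    return atoms


atoms = parse_atoms(PDB_TEXT)
n_atom, ca_atom = atoms
bond_length = math.dist((n_atom.x, n_atom.y, n_atom.z),
                        (ca_atom.x, ca_atom.y, ca_atom.z))
print(len(atoms), round(bond_length, 2))
```

Distances computed this way underlie many structural features, such as identifying residues that line a binding pocket or sit at an interaction interface.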
Data Integration and Cross-Referencing
GenBank, UniProt, and PDB databases are interconnected and cross-referenced
GenBank provides links to corresponding protein entries in UniProt
UniProt provides links to corresponding nucleotide entries in GenBank and structure entries in PDB
PDB provides links to corresponding protein entries in UniProt and genomic context in GenBank
Data integration across databases enables a more comprehensive understanding of biological entities
Combining sequence, structure, and functional information
Facilitating the mapping of genetic variations to protein structures and functions
Enabling the analysis of evolutionary relationships and conservation across species
Cross-referencing allows researchers to navigate seamlessly between different data types and resources
Accessing related information from multiple perspectives (sequence, structure, function)
Facilitating data mining and knowledge discovery
Providing a more complete picture of biological systems and processes
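Cross-references become practical once they are turned into retrieval links. The sketch below builds download URLs from a toy cross-reference record without making any network request. The URL patterns (NCBI E-utilities efetch, the UniProt REST interface, and RCSB file downloads) reflect the publicly documented endpoints as far as I know and should be checked against current documentation; the identifiers and function names are placeholders.

```python
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession):
    """URL for a GenBank flat-file record via NCBI E-utilities (assumed endpoint)."""
    query = urlencode({"db": "nuccore", "id": accession,
                       "rettype": "gb", "retmode": "text"})
    return f"{EFETCH_BASE}?{query}"


def uniprot_url(accession):
    """URL for a UniProtKB entry in FASTA format (assumed endpoint)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def pdb_url(pdb_id):
    """URL for a structure file from RCSB PDB (assumed endpoint)."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def links_for(xrefs):
    """Turn a cross-reference record into retrieval links grouped by database."""
    return {
        "uniprot": [uniprot_url(xrefs["uniprot"])],
        "genbank": [genbank_url(acc) for acc in xrefs.get("genbank", [])],
        "pdb": [pdb_url(pid) for pid in xrefs.get("pdb", [])],
    }


# Placeholder identifiers, used only to show the shape of the output.
toy_xrefs = {"uniprot": "P00000", "genbank": ["TOY000001"], "pdb": ["1abc"]}
links = links_for(toy_xrefs)
for database, urls in links.items():
    print(database, urls)
```

Keeping URL construction separate from downloading makes the mapping easy to test and lets the actual fetching respect each provider's usage guidelines.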
Key Terms to Review (27)
API Access: API access refers to the ability to interact with an application programming interface (API), which allows different software systems to communicate and share data. In the context of biological databases, API access provides researchers with a way to programmatically retrieve and manipulate data from resources like GenBank, UniProt, and PDB, facilitating the integration of various datasets into bioinformatics applications and analyses.
Augustus: AUGUSTUS is a gene prediction program that identifies protein-coding genes in eukaryotic genomic sequences using a generalized hidden Markov model. It can run ab initio or incorporate extrinsic evidence such as EST and protein alignments, and it is widely used in genome annotation pipelines.
Biopython: Biopython is a collection of Python tools and libraries designed for biological computation, providing an accessible way to handle and analyze biological data. It connects programming with biology by facilitating the parsing of various bioinformatics data formats, accessing biological databases, and implementing algorithms for analysis in a straightforward manner.
BLAST: BLAST (Basic Local Alignment Search Tool) is a powerful algorithm used for comparing biological sequences, such as DNA, RNA, or protein sequences, to identify regions of similarity. It helps researchers find homologous sequences in biological databases, enabling them to draw insights about gene function, evolutionary relationships, and more.
Data mining: Data mining is the process of discovering patterns, correlations, and useful information from large sets of data using various techniques from statistics, machine learning, and database systems. This process is crucial in modern biology as it helps in extracting meaningful insights from complex biological data, which is essential for advancements in research and healthcare.
Ensembl: Ensembl is a comprehensive genome browser and database that provides access to annotated genomic data for a wide range of species, primarily vertebrates. It integrates various biological information, including gene sequences, variations, and comparative genomics, allowing researchers to study gene function, evolution, and relationships across different organisms.
FASTA: FASTA is a text-based format for representing nucleotide or protein sequences, designed for easy sharing and parsing by computational tools. This format is widely used in bioinformatics, allowing researchers to efficiently store, access, and analyze biological sequence data from various databases and applications.
Gbk: Gbk, short for GenBank format, is a file format used to represent nucleotide sequences and their associated annotations in biological databases. This format plays a crucial role in the exchange of sequence data, allowing researchers to share information about genes, proteins, and other biological features across various platforms, including GenBank, a major public database of genetic sequences.
GenBank: GenBank is a comprehensive public database of nucleotide sequences and their protein translations, serving as a critical resource for researchers in the field of molecular biology. It supports various computational methods by providing essential sequence data that facilitate genome annotation, gene prediction, and comparative analyses among species.
Gene annotation: Gene annotation is the process of identifying and describing the functional elements of a gene within a genome, including its sequence, structure, and biological role. This process involves using various computational tools and databases to assign information to genes, which is crucial for understanding their functions and interactions. Effective gene annotation helps in deciphering genetic information and enables researchers to make sense of genomic data for applications in medicine, agriculture, and evolutionary biology.
Gene Ontology (GO): Gene Ontology (GO) is a standardized vocabulary that is used to describe the functions, biological processes, and cellular components of genes across different species. GO provides a framework for consistent annotation of genes and gene products, enabling researchers to share and analyze biological data more effectively. This structured information helps in understanding gene functions in a wider biological context, facilitating research in areas like genomics and proteomics.
Hmmer: HMMER is a software suite used for searching sequence databases for homologs of protein sequences and for making sequence alignments using profile hidden Markov models (HMMs). This method allows researchers to identify conserved regions and motifs within protein sequences, which can provide insights into their function and evolutionary relationships.
Homology modeling: Homology modeling is a computational technique used to predict the three-dimensional structure of a protein based on its similarity to known structures of homologous proteins. By aligning the amino acid sequence of the target protein with that of a related protein with a known structure, homology modeling can generate models that provide insights into the protein's function and interactions. This method is particularly valuable in the context of understanding protein functions and guiding drug design.
Human Genome Project: The Human Genome Project was a groundbreaking international scientific research initiative aimed at mapping and understanding all the genes of the human species, completed in 2003. This monumental effort not only identified the approximately 20,000-25,000 genes in human DNA but also provided a foundation for advancements in biological databases, enhancing data accessibility and analysis. Additionally, it has had profound implications for personalized medicine and the societal impact of genomics on healthcare practices.
InterPro: InterPro is a comprehensive database that provides functional analysis of proteins by classifying them into families and predicting the presence of domains and important sites. It integrates diverse information from various biological databases, creating a unified resource that helps researchers identify and understand protein function and relationships across different organisms. This interconnectedness is essential for tasks like multiple sequence alignment, as it aids in predicting how sequences relate to known structures and functions.
KEGG: KEGG, or the Kyoto Encyclopedia of Genes and Genomes, is a comprehensive database resource that integrates genomic, chemical, and systemic functional information. It plays a crucial role in understanding biological functions and systems by providing a framework for analyzing gene functions and metabolic pathways.
Maker: MAKER is a genome annotation pipeline that combines ab initio gene predictions with evidence from transcript and protein alignments to produce gene annotations for newly assembled genomes. By drawing on sequence data from public databases, it helps researchers annotate genomes consistently and store the results in standard formats.
Multiple Sequence Alignment: Multiple sequence alignment (MSA) is a bioinformatics method used to align three or more biological sequences (protein or nucleic acid) simultaneously, helping to identify conserved regions, mutations, and functional similarities among them. This technique builds on pairwise sequence alignment methods to provide a comprehensive view of evolutionary relationships and functional characteristics among multiple sequences, thereby enhancing the analysis of complex biological data.
Nucleotide sequences: Nucleotide sequences are the specific order of nucleotides in a strand of DNA or RNA, which determine the genetic information carried by the molecule. These sequences are fundamental to the functioning of all living organisms, as they encode the instructions for building proteins and maintaining cellular processes. Understanding nucleotide sequences is crucial for analyzing genetic variation, evolutionary relationships, and biological functions across different organisms.
Pdb format: The PDB format, or Protein Data Bank format, is a standardized file format used for representing three-dimensional structures of biological macromolecules, such as proteins and nucleic acids. This format allows researchers to share, visualize, and analyze molecular structures in a uniform way across various software and databases, making it an essential component in structural biology.
Pfam: Pfam is a comprehensive database that classifies protein families and domains based on sequence alignments and hidden Markov models. It provides researchers with valuable insights into the functional and evolutionary relationships of proteins, enabling the identification of conserved sequences and motifs across different organisms. Pfam is crucial for understanding protein function, structure, and interactions, and is widely used in bioinformatics tools and analyses.
Protein Data Bank (PDB): The Protein Data Bank (PDB) is a comprehensive repository for the three-dimensional structural data of biological macromolecules, primarily proteins and nucleic acids. It serves as a crucial resource for researchers in fields such as structural biology, bioinformatics, and computational biology, providing data that helps in understanding molecular functions, interactions, and mechanisms. The PDB is widely used in conjunction with other biological databases to support the study of protein structures and their implications in various biological processes.
Protein sequences: Protein sequences are linear chains of amino acids that make up proteins, determining their structure and function within biological systems. These sequences are crucial for understanding biological functions and interactions, as they dictate how proteins fold and how they interact with other molecules. Analyzing protein sequences is vital for various applications, including bioinformatics, evolutionary studies, and therapeutic development.
Sequence Alignment: Sequence alignment is a method used to arrange the sequences of DNA, RNA, or protein to identify regions of similarity that may indicate functional, structural, or evolutionary relationships. This technique is vital for comparing biological sequences and is closely linked to various formats and tools used for data analysis, programming languages for implementation, and biological research methodologies.
SQL Queries: SQL queries are structured commands used to communicate with a database, allowing users to access, manipulate, and retrieve data stored within it. These commands can be simple or complex, enabling a wide range of operations such as selecting specific data, filtering results, and joining multiple tables. SQL queries form the backbone of data interaction in various applications, particularly when dealing with biological databases.
Structural bioinformatics: Structural bioinformatics is a branch of bioinformatics that focuses on the analysis and prediction of the three-dimensional structures of biological macromolecules, primarily proteins and nucleic acids. It involves utilizing computational methods to model and visualize these structures, which can help in understanding their functions and interactions. The integration of structural data from biological databases enhances research in drug design, protein engineering, and the study of molecular mechanisms.
UniProt: UniProt is a comprehensive protein sequence and functional information database, providing detailed information on protein sequences, structures, and functions. It serves as a critical resource for researchers in various fields, enabling easy access to essential data about proteins, facilitating studies in areas such as genomics, proteomics, and molecular biology.