🧬 Mathematical and Computational Methods in Molecular Biology Unit 8 – Phylogenetic Trees & Evolution Models

Phylogenetic trees and evolution models are essential tools in molecular biology. They help us understand how organisms are related and how genes change over time. These methods use genetic data to build family trees of species and genes. Mathematical models describe how DNA and proteins evolve. Scientists use these models to create accurate trees and estimate when species diverged. Advanced techniques like machine learning are improving our ability to analyze complex evolutionary relationships.

Key Concepts in Phylogenetics

  • Phylogenetics studies the evolutionary relationships among organisms based on their genetic and morphological characteristics
  • Phylogenetic trees represent the inferred evolutionary history and relationships of species or genes
  • Homologous traits are shared characteristics inherited from a common ancestor (vertebral column in mammals)
  • Analogous traits are similar features that evolved independently in different lineages due to convergent evolution (wings in birds and bats)
  • Molecular clock hypothesis assumes that genetic changes accumulate at a constant rate over time, allowing the estimation of divergence times
    • Calibration points from fossil records or known evolutionary events are used to convert genetic distances into absolute time
  • Horizontal gene transfer occurs when genetic material is exchanged between organisms outside of vertical inheritance from parent to offspring (antibiotic resistance in bacteria)
  • Incomplete lineage sorting happens when ancestral polymorphisms persist through speciation events, leading to discordance between gene trees and species trees
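
The molecular clock calibration described above can be sketched numerically. This is a minimal illustration, assuming a strict clock and that a pairwise distance d accumulates along both diverging lineages, so d = 2rt. The function names and all numbers are hypothetical and chosen only for the example.

```python
def calibrate_rate(distance, divergence_time):
    """Per-lineage substitution rate r from one calibration point.

    Assumes d = 2 * r * t: both lineages accumulate changes since the split.
    """
    return distance / (2 * divergence_time)


def estimate_divergence_time(distance, rate):
    """Divergence time t = d / (2r) under the same strict-clock assumption."""
    return distance / (2 * rate)


# Hypothetical calibration: two taxa known to have split 10 Myr ago
# differ by 0.10 substitutions per site.
rate = calibrate_rate(0.10, 10.0)          # substitutions per site per Myr
# Hypothetical query pair with distance 0.04 substitutions per site.
print(estimate_divergence_time(0.04, rate))
```

If substitution rates vary among lineages, a single calibrated rate like this one gives biased dates, which is why the clock assumption should be tested before it is used.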

Evolutionary Models and Their Mathematics

  • Evolutionary models describe the process of genetic change over time and provide a mathematical framework for phylogenetic inference
  • Nucleotide substitution models capture the probabilities of different types of base changes in DNA sequences
    • Jukes-Cantor (JC) model assumes equal rates for all substitutions and equal base frequencies
    • Kimura 2-parameter (K2P) model allows different rates for transitions and transversions
    • General time-reversible (GTR) model incorporates unequal base frequencies and six substitution rates
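  • Amino acid substitution models describe the probabilities of amino acid replacements in protein sequences using empirical matrices (PAM/Dayhoff, JTT, WAG, LG); BLOSUM matrices are related empirical scoring matrices used mainly for alignment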
  • Rate heterogeneity models account for variation in substitution rates across sites using a gamma distribution or invariant sites
  • Codon-based models consider the selective pressures acting on protein-coding sequences by incorporating synonymous and non-synonymous substitution rates
  • Continuous-time Markov chains are used to model the substitution process, with transition probabilities calculated using the matrix exponential $P(t) = e^{Qt}$, where $Q$ is the instantaneous rate matrix
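
To make the matrix exponential concrete, the sketch below builds the Jukes-Cantor rate matrix and compares a truncated Taylor series for $e^{Qt}$ with the model's known closed form. It is an illustration only: the helper names, the rate `alpha = 0.1`, and the 30-term truncation are assumptions, and real software uses more robust methods such as eigendecomposition.

```python
import math


def jc69_rate_matrix(alpha):
    """JC69 rate matrix Q: every off-diagonal rate is alpha, rows sum to zero."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=30):
    """Approximate e^{Qt} by the truncated series sum_k (Qt)^k / k!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds (Qt)^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def jc69_closed_form(alpha, t):
    """Closed-form JC69 probabilities: (same base, one specific different base)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


P = mat_exp(jc69_rate_matrix(0.1), 1.0)
print(P[0][0], P[0][1])
print(jc69_closed_form(0.1, 1.0))
```

The two printed pairs should agree to many decimal places, and each row of `P` sums to 1 because it is a probability distribution over the four bases.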

Building Phylogenetic Trees

  • Phylogenetic trees are constructed based on the similarities and differences in genetic or morphological characters among taxa
  • Sequence alignment is a crucial step in molecular phylogenetics, where homologous positions in DNA or protein sequences are identified and arranged
    • Multiple sequence alignment algorithms (ClustalW, MUSCLE) heuristically optimize an alignment score that rewards matches and penalizes mismatches and gaps
  • Distance-based methods calculate pairwise evolutionary distances between sequences and use clustering algorithms to build the tree
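    • Neighbor-joining (NJ) algorithm iteratively joins the pair of taxa that minimizes a criterion correcting each pairwise distance for the pair's average divergence from all other taxa (not simply the closest pair), greedily approximating the tree with the smallest total branch length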
    • UPGMA (Unweighted Pair Group Method with Arithmetic Mean) assumes a constant rate of evolution and produces rooted ultrametric trees
  • Character-based methods evaluate the fit of alternative tree topologies to the observed character data
    • Maximum parsimony (MP) selects the tree that requires the fewest evolutionary changes to explain the data
    • Maximum likelihood (ML) estimates the probability of observing the data given a tree topology and an evolutionary model, choosing the tree with the highest likelihood
  • Bayesian inference incorporates prior knowledge and calculates the posterior probability of trees using Markov chain Monte Carlo (MCMC) sampling
  • Bootstrapping assesses the statistical support for each clade by resampling alignment columns (sites) with replacement, rebuilding the tree for each pseudo-replicate, and reporting the proportion of replicate trees that contain the clade
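
As an illustration of a distance-based method, the sketch below performs only the first joining step of neighbor-joining on a small distance matrix. The five-taxon matrix and the function name are hypothetical. The selection criterion is the standard NJ quantity Q(i, j) = (n − 2)·d(i, j) − Σ_k d(i, k) − Σ_k d(j, k), which is minimized over all pairs.

```python
def nj_first_join(names, dist):
    """Return the first pair NJ would join, its Q value, and the two branch lengths."""
    n = len(names)
    totals = [sum(row) for row in dist]        # sum of distances from each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from the two joined taxa to their new internal node.
    len_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return (names[i], names[j]), q, (len_i, len_j)


# Hypothetical pairwise distances among five taxa.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(names, dist))
```

In this matrix the smallest raw distance is between d and e (3), yet the criterion selects a and b, because both are far from every other taxon. This correction is what lets NJ cope with unequal rates among lineages, unlike UPGMA. A full implementation would replace the joined pair with the new node, update the distances, and repeat.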

Statistical Methods in Tree Construction

  • Statistical methods evaluate the reliability and robustness of phylogenetic inferences by quantifying the support for different tree topologies
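  • Likelihood ratio test (LRT) compares the goodness of fit of two nested evolutionary models; the test statistic $2\Delta\ln L$ (twice the difference in maximized log-likelihoods) asymptotically follows a chi-square distribution with degrees of freedom equal to the difference in the number of free parameters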
  • Akaike information criterion (AIC) and Bayesian information criterion (BIC) balance the likelihood and complexity of models, favoring simpler models that adequately explain the data
    • $AIC = -2\ln L + 2K$, where $L$ is the likelihood and $K$ is the number of parameters
    • $BIC = -2\ln L + K\ln n$, where $n$ is the sample size
  • Bayesian posterior probabilities quantify the support for each clade by integrating over tree topologies and model parameters using MCMC sampling
  • Nonparametric bootstrapping estimates the sampling variance and confidence intervals of phylogenetic estimates by resampling the original data
  • Approximate likelihood ratio test (aLRT) provides a fast alternative to bootstrapping by comparing the likelihoods of the best and second-best alternative configurations around each branch
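
The model-comparison formulas above are easy to evaluate by hand. The sketch below does so for two nested models, using hypothetical log-likelihoods, parameter counts, and an assumed sample size equal to the number of alignment sites. The p-value helper covers only one degree of freedom, where the chi-square tail probability reduces to `erfc(sqrt(x / 2))`.

```python
import math


def aic(log_l, k):
    """Akaike information criterion: -2 ln L + 2K."""
    return -2 * log_l + 2 * k


def bic(log_l, k, n):
    """Bayesian information criterion: -2 ln L + K ln n."""
    return -2 * log_l + k * math.log(n)


def lrt_statistic(log_l_simple, log_l_complex):
    """LRT statistic 2 * (ln L_complex - ln L_simple) for nested models."""
    return 2 * (log_l_complex - log_l_simple)


def chi2_pvalue_df1(x):
    """Upper-tail chi-square probability for exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Hypothetical fits to a 1000-site alignment of 5 taxa (7 branch lengths).
# JC: 7 free parameters; K2P adds the transition/transversion ratio: 8.
lnl_jc, k_jc = -2500.0, 7
lnl_k2p, k_k2p = -2490.0, 8
n_sites = 1000

print(aic(lnl_jc, k_jc), aic(lnl_k2p, k_k2p))
print(bic(lnl_jc, k_jc, n_sites), bic(lnl_k2p, k_k2p, n_sites))
stat = lrt_statistic(lnl_jc, lnl_k2p)
print(stat, chi2_pvalue_df1(stat))
```

Lower AIC or BIC values indicate the preferred model. With these made-up numbers all three criteria favor K2P, but BIC penalizes the extra parameter more heavily than AIC whenever ln n exceeds 2.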

Computational Algorithms for Tree Inference

  • Efficient algorithms are essential for inferring phylogenetic trees from large datasets and complex models
  • Heuristic search strategies explore the tree space by applying local rearrangements to find the optimal tree
    • Nearest-neighbor interchange (NNI) swaps the subtrees on either side of an internal branch
    • Subtree pruning and regrafting (SPR) removes a subtree and reattaches it to a different branch
    • Tree bisection and reconnection (TBR) splits the tree into two subtrees and reconnects them at all possible locations
  • Branch-and-bound algorithms guarantee finding the optimal tree by pruning the search space based on upper and lower bounds of the optimality criterion
  • Divide-and-conquer approaches break down the problem into smaller subproblems and combine their solutions (disk-covering methods)
  • Parallel computing techniques distribute the computational load across multiple processors or cores to speed up the analysis
  • Fast heuristic programs (FastTree, RAxML) do not guarantee the optimal tree but return good solutions efficiently on large datasets
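
Two small calculations show why heuristic search is needed and what a local rearrangement looks like. The number of unrooted binary trees on n labeled taxa is (2n − 5)!!, which grows too quickly for exhaustive search. An NNI move around an internal branch separating subtrees (A, B) from (C, D) has exactly two alternatives. The function names and the nested-tuple representation are illustrative assumptions.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled taxa: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


def nni_neighbors(a, b, c, d):
    """Given an internal branch separating (a, b) from (c, d),
    return the two alternative groupings reachable by one NNI move."""
    return [((a, c), (b, d)), ((a, d), (b, c))]


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))

print(nni_neighbors("A", "B", "C", "D"))
```

With only 20 taxa there are already more than 10^20 possible topologies, so search programs evaluate a starting tree's NNI, SPR, or TBR neighbors and move to better ones instead of enumerating every tree.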

Interpreting and Analyzing Phylogenetic Trees

  • Phylogenetic trees provide insights into the evolutionary history, relationships, and diversity of organisms
  • Monophyletic groups (clades) consist of an ancestor and all its descendants, forming a natural evolutionary unit
  • Paraphyletic groups include an ancestor and some, but not all, of its descendants, often due to incomplete sampling or exclusion of derived lineages (traditional "reptiles", which exclude birds)
  • Polyphyletic groups contain taxa drawn from separate lineages and exclude their most recent common ancestor, typically because they were grouped by convergent traits or classified incorrectly
  • Rooting a tree determines the direction of evolution and the common ancestor of all taxa
    • Outgroup rooting uses one or more taxa known to lie outside the ingroup (ideally close relatives rather than very distant ones) to root the tree, assuming they diverged before the ingroup taxa diversified
    • Midpoint rooting places the root at the midpoint of the longest path between any two taxa
  • Branch lengths represent the amount of evolutionary change or time elapsed along each lineage
  • Clade support values (bootstrap percentages, posterior probabilities) indicate the statistical confidence in each grouping
  • Reconciliation of gene trees and species trees can infer the history of gene duplications, losses, and horizontal transfers
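
Whether a set of taxa is monophyletic can be checked mechanically on a rooted tree. The sketch below assumes a tree stored as nested tuples with leaf names as strings; that representation, the function names, and the example tree are illustrative choices and not a standard library format.

```python
def leaves(node):
    """Set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, taxa):
    """True if some node's leaf set equals `taxa` exactly,
    i.e. the taxa form a clade in this rooted tree."""
    taxa = set(taxa)
    if leaves(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, taxa) for child in tree)


# Hypothetical rooted tree.
tree = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))

print(is_monophyletic(tree, {"human", "chimp", "gorilla"}))   # a clade
print(is_monophyletic(tree, {"human", "gorilla"}))            # leaves out chimp
```

The answer depends on where the tree is rooted, which is one reason rooting matters for interpretation.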

Applications in Molecular Biology

  • Phylogenetic analysis has diverse applications in molecular biology, from understanding evolutionary relationships to inferring functional properties of genes and proteins
  • Comparative genomics uses phylogenetic trees to study the evolution of genomes, identifying conserved regions, gene family expansions, and lineage-specific adaptations
  • Phylogenomics reconstructs the evolutionary history of species by analyzing large-scale genomic data, resolving deep branching patterns and detecting incomplete lineage sorting
  • Molecular epidemiology tracks the spread and evolution of pathogens (HIV, influenza) by constructing phylogenetic trees from viral sequences sampled over time and space
  • Protein function prediction relies on the principle of evolutionary conservation, inferring the function of uncharacterized proteins based on their phylogenetic relationship to annotated homologs
  • Drug target identification exploits the evolutionary differences between pathogen and host proteins to find specific inhibitors with minimal side effects
  • Ancestral sequence reconstruction infers the most likely sequences of extinct ancestors at the internal nodes of a phylogenetic tree, enabling the study of ancient adaptations and biomolecular resurrection
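
Ancestral sequence reconstruction is usually done with likelihood-based methods. As a simpler stand-in, the sketch below runs the bottom-up pass of Fitch's parsimony algorithm for a single alignment site, returning the set of candidate states at the root and the minimum number of changes. The nested-tuple tree, the function name, and the example data are hypothetical.

```python
def fitch(node, leaf_states):
    """Bottom-up Fitch pass on a rooted binary tree for one site.

    Returns (candidate state set at this node, minimum changes below it).
    """
    if isinstance(node, str):                  # leaf: its observed state
        return {leaf_states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:                                 # children agree: no extra change
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical tree and one alignment column.
tree = ((("A", "B"), "C"), ("D", "E"))
site = {"A": "G", "B": "G", "C": "T", "D": "T", "E": "T"}

print(fitch(tree, site))
```

Here a single G/T change explains the column and T is the parsimonious root state. A likelihood-based reconstruction would also use branch lengths and a substitution model, and would report a probability for each candidate state.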

Advanced Topics and Current Research

  • Phylogenetic networks extend the tree model to accommodate reticulate evolutionary events such as hybridization, recombination, and horizontal gene transfer
    • Split networks represent incompatible phylogenetic signals as parallel edges, visualizing the conflicting relationships
    • Ancestral recombination graphs (ARGs) capture the history of recombination events and the ancestry of different genomic regions
  • Coalescent theory models the genealogical history of alleles in a population, providing a framework for inferring population parameters and demographic events from genetic data
    • Multi-species coalescent models account for incomplete lineage sorting and discordance between gene trees and species trees
    • Bayesian skyline plots estimate changes in effective population size over time based on the coalescent patterns of sampled sequences
  • Phylogenetic comparative methods study the evolution of traits along a phylogeny, testing for correlations, adaptive radiations, and evolutionary convergence
    • Independent contrasts transform trait values to account for the non-independence of species due to shared ancestry
    • Ornstein-Uhlenbeck (OU) models incorporate stabilizing selection and attraction towards optimal trait values
  • Machine learning approaches, such as deep learning and graph convolutional networks, are being applied to phylogenetic inference, enabling the analysis of complex, high-dimensional data
  • Microbiome phylogenetics investigates the diversity and evolution of microbial communities, using metagenomics and amplicon sequencing to reconstruct the phylogenetic relationships of uncultured organisms
  • Viral quasispecies analysis studies the intra-host evolutionary dynamics of rapidly evolving viruses (HIV, hepatitis C) by constructing phylogenetic trees from single-molecule sequencing data
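
A basic result of coalescent theory can be computed directly. Under the standard (Kingman) coalescent with constant population size, the waiting time while k lineages remain has mean 1 / C(k, 2) = 2 / (k(k − 1)) in coalescent time units. Summing these gives the expected time to the most recent common ancestor, 2(1 − 1/n). The sketch below checks this identity with exact fractions. The function names are illustrative, and the constant-size, neutral, panmictic population is an assumption of the model.

```python
from fractions import Fraction


def expected_interval(k):
    """Expected time during which k lineages coexist: 2 / (k (k - 1))."""
    return Fraction(2, k * (k - 1))


def expected_tmrca(n):
    """Expected time to the most recent common ancestor of n sampled lineages."""
    return sum(expected_interval(k) for k in range(2, n + 1))


for n in (2, 5, 10, 100):
    print(n, expected_tmrca(n), 2 * (1 - Fraction(1, n)))
```

The expected value approaches 2 as the sample grows, and half of it is spent waiting for the final two lineages to coalesce. This is why adding more samples reveals little about the deepest part of a genealogy.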


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
