🧬 Mathematical and Computational Methods in Molecular Biology Unit 8 – Phylogenetic Trees & Evolution Models
Phylogenetic trees and evolution models are essential tools in molecular biology. They help us understand how organisms are related and how genes change over time. These methods use genetic data to build family trees of species and genes.
Mathematical models describe how DNA and proteins evolve. Scientists use these models to create accurate trees and estimate when species diverged. Advanced techniques like machine learning are improving our ability to analyze complex evolutionary relationships.
Key Concepts in Phylogenetics
Phylogenetics studies the evolutionary relationships among organisms based on their genetic and morphological characteristics
Phylogenetic trees represent the inferred evolutionary history and relationships of species or genes
Homologous traits are shared characteristics inherited from a common ancestor (vertebral column in mammals)
Analogous traits are similar features that evolved independently in different lineages due to convergent evolution (wings in birds and bats)
Molecular clock hypothesis assumes that genetic changes accumulate at a constant rate over time, allowing the estimation of divergence times
Calibration points from fossil records or known evolutionary events are used to convert genetic distances into absolute time
Horizontal gene transfer occurs when genetic material is exchanged between organisms outside of vertical inheritance from parent to offspring (antibiotic resistance in bacteria)
Incomplete lineage sorting happens when ancestral polymorphisms persist through speciation events, leading to discordance between gene trees and species trees
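The molecular clock and calibration ideas above reduce to simple arithmetic under a strict clock. The sketch below assumes the distance between two taxa is measured in substitutions per site and accumulates along both lineages since their split, so distance = 2 × rate × time. The function names and all numbers are hypothetical, chosen only for illustration.

```python
def calibrate_rate(distance, calibration_time):
    """Per-lineage substitution rate from one calibration point.

    Under a strict clock, the distance between two taxa accumulates along
    both lineages since their split: distance = 2 * rate * time.
    """
    if calibration_time <= 0:
        raise ValueError("calibration_time must be positive")
    return distance / (2.0 * calibration_time)


def estimate_divergence_time(distance, rate):
    """Time since two taxa split, given a calibrated per-lineage rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical calibration: a fossil dates one split to 50 million years ago,
# and the two descendant lineages differ by 0.10 substitutions per site.
rate = calibrate_rate(distance=0.10, calibration_time=50.0)

# Apply the same rate to a second pair of taxa that differ by 0.04.
estimated_age = estimate_divergence_time(distance=0.04, rate=rate)
print(f"rate = {rate:.4f} substitutions/site/Myr, age = {estimated_age:.1f} Myr")
```

If rates vary among lineages, a single calibrated rate like this can give misleading dates, which is the case relaxed-clock methods are designed for.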
Evolutionary Models and Their Mathematics
Evolutionary models describe the process of genetic change over time and provide a mathematical framework for phylogenetic inference
Nucleotide substitution models capture the probabilities of different types of base changes in DNA sequences
Jukes-Cantor (JC) model assumes equal rates for all substitutions and equal base frequencies
Kimura 2-parameter (K2P) model allows different rates for transitions and transversions
General time-reversible (GTR) model incorporates unequal base frequencies and six substitution rates
Amino acid substitution models describe the probabilities of amino acid replacements in protein sequences (PAM, BLOSUM matrices)
Rate heterogeneity models account for variation in substitution rates across sites using a gamma distribution of rates, a proportion of invariant sites, or both
Codon-based models consider the selective pressures acting on protein-coding sequences by incorporating synonymous and non-synonymous substitution rates
Continuous-time Markov chains are used to model the substitution process, with transition probabilities calculated using the matrix exponential: $P(t) = e^{Qt}$, where $Q$ is the instantaneous rate matrix
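To make the matrix exponential concrete, here is a minimal sketch for the Jukes-Cantor model. It builds a 4x4 rate matrix Q, approximates exp(Qt) with a Taylor series plus scaling and squaring, and compares the result with the known JC69 closed form. The parameter mu (expected substitutions per site per unit time) and the helper names are assumptions of this example. A production analysis would normally use a numerical library rather than this hand-rolled exponential.

```python
import math


def jc69_rate_matrix(mu=1.0):
    """JC69 rate matrix: every off-diagonal rate is mu/3, each row sums to 0."""
    off = mu / 3.0
    return [[-mu if i == j else off for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    small = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds (small ** k) / k! after this step
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jc69_closed_form(t, mu=1.0):
    """Closed-form JC69 probabilities: (same base, one specific different base)."""
    e = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


P = mat_exp(jc69_rate_matrix(), t=0.5)
p_same, p_diff = jc69_closed_form(0.5)
print(P[0][0], p_same)  # numerical and closed-form values should agree
print(P[0][1], p_diff)
```

Each row of P(t) sums to 1, and as t grows every entry approaches 0.25, the equal base frequencies assumed by JC69.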
Building Phylogenetic Trees
Phylogenetic trees are constructed based on the similarities and differences in genetic or morphological characters among taxa
Sequence alignment is a crucial step in molecular phylogenetics, where homologous positions in DNA or protein sequences are identified and arranged
Multiple sequence alignment algorithms (ClustalW, MUSCLE) heuristically optimize an alignment score that rewards matches and penalizes mismatches and gaps
Distance-based methods calculate pairwise evolutionary distances between sequences and use clustering algorithms to build the tree
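Neighbor-joining (NJ) algorithm iteratively joins the pair of taxa that minimizes a rate-corrected distance criterion, approximating the tree with the smallest total branch length (minimum evolution) without assuming a constant rate of evolution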
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) assumes a constant rate of evolution and produces rooted ultrametric trees
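The distance-based workflow above can be sketched in two steps: a corrected pairwise distance, then clustering. The example below uses the JC69 distance correction and a small UPGMA implementation. The tree representation (nested tuples), the dictionary-of-frozensets distance table, and the toy distances are all assumptions of this sketch, not a standard file format or library API.

```python
import math
from itertools import combinations


def jc69_distance(seq_a, seq_b):
    """JC69-corrected distance from the proportion of differing sites (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance
    Returns (tree, root_height); tree is a nested tuple of labels.
    """
    clusters = {name: (1, 0.0) for name in names}  # key -> (size, height)
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, _ = clusters.pop(a)
        size_b, _ = clusters.pop(b)
        merged = (a, b)
        height = d[frozenset((a, b))] / 2.0  # ultrametric: node sits halfway
        for c in clusters:
            # Distance to the new cluster is the size-weighted mean.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
    tree, (_, root_height) = next(iter(clusters.items()))
    return tree, root_height


# One differing site out of four, corrected for unseen multiple hits.
print(jc69_distance("AAAA", "AAAT"))

# Toy ultrametric distances (hypothetical values).
raw = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
       ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0}
dist = {frozenset(pair): value for pair, value in raw.items()}
tree, root_height = upgma(["A", "B", "C", "D"], dist)
print(tree, root_height)
```

Because the toy distances are ultrametric, UPGMA recovers the nesting exactly. With unequal rates across lineages it can group taxa incorrectly, which is the situation neighbor-joining is designed to handle.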
Character-based methods evaluate the fit of alternative tree topologies to the observed character data
Maximum parsimony (MP) selects the tree that requires the fewest evolutionary changes to explain the data
Maximum likelihood (ML) calculates the probability of observing the data given a tree topology, branch lengths, and an evolutionary model, choosing the tree with the highest likelihood
Bayesian inference incorporates prior knowledge and calculates the posterior probability of trees using Markov chain Monte Carlo (MCMC) sampling
Bootstrapping assesses the statistical support for each clade in the tree by resampling the original data with replacement and calculating the proportion of replicates containing the clade
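The bootstrap procedure described above can be sketched as resampling alignment columns with replacement and counting how often a clade reappears. In the sketch below, `infer_clades` is a hypothetical caller-supplied function standing in for any real tree-building method, and `closest_pair_clade` is a deliberately trivial toy version of it. Neither is part of any library.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths)
    Returns a pseudo-replicate alignment with the same dimensions.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def clade_support(alignment, clade, infer_clades, replicates=100, seed=0):
    """Fraction of bootstrap replicates whose inferred tree contains `clade`."""
    rng = random.Random(seed)
    target = frozenset(clade)
    hits = sum(
        target in infer_clades(bootstrap_alignment(alignment, rng))
        for _ in range(replicates)
    )
    return hits / replicates


def closest_pair_clade(alignment):
    """Toy stand-in for tree inference: report only the least-different pair."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]

    def mismatches(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))

    return {frozenset(min(pairs, key=mismatches))}


# Hypothetical toy alignment.
alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGT",
    "C": "TGCATGCA",
    "D": "ACGTTGCA",
}
support = clade_support(alignment, {"A", "B"}, closest_pair_clade)
print(f"bootstrap support for (A, B): {support:.2f}")
```

In practice the inference step inside the loop is a full NJ, parsimony, or likelihood search, which is why bootstrapping large datasets is computationally expensive.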
Statistical Methods in Tree Construction
Statistical methods evaluate the reliability and robustness of phylogenetic inferences by quantifying the support for different tree topologies
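Likelihood ratio test (LRT) compares the goodness of fit of two nested evolutionary models, with the test statistic $2\Delta \ln L$ asymptotically following a chi-square distribution whose degrees of freedom equal the difference in the number of free parameters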
Akaike information criterion (AIC) and Bayesian information criterion (BIC) balance the likelihood and complexity of models, favoring simpler models that adequately explain the data
$\mathrm{AIC} = -2\ln L + 2K$, where $L$ is the likelihood and $K$ is the number of parameters
$\mathrm{BIC} = -2\ln L + K\ln n$, where $n$ is the sample size
Bayesian posterior probabilities quantify the support for each clade by integrating over tree topologies and model parameters using MCMC sampling
Nonparametric bootstrapping estimates the sampling variance and confidence intervals of phylogenetic estimates by resampling the original data
Approximate likelihood ratio test (aLRT) provides a fast alternative to bootstrapping by comparing the likelihoods of the best and second-best alternative configurations around each branch
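The model-comparison formulas above are easy to evaluate once log-likelihoods are available. The sketch below compares two nested models with hypothetical log-likelihood values, as if JC69 and K2P had been fitted to the same 5-taxon alignment of 500 sites. The parameter counts assume 7 branch lengths plus 0 or 1 substitution parameters, and treating alignment length as the sample size n is a common convention but still an assumption. The p-value helper covers only the 1-degree-of-freedom case, where the chi-square tail has a closed form.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: -2 ln L + 2K."""
    return -2.0 * log_likelihood + 2.0 * num_params


def bic(log_likelihood, num_params, sample_size):
    """Bayesian information criterion: -2 ln L + K ln n."""
    return -2.0 * log_likelihood + num_params * math.log(sample_size)


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic 2 * (ln L_alt - ln L_null) for nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


# Hypothetical fitted values (not from any real dataset).
lnl_jc, k_jc = -1250.0, 7    # 7 branch lengths, no free substitution parameter
lnl_k2p, k_k2p = -1240.0, 8  # adds the transition/transversion ratio
n_sites = 500

print("AIC:", aic(lnl_jc, k_jc), aic(lnl_k2p, k_k2p))
print("BIC:", bic(lnl_jc, k_jc, n_sites), bic(lnl_k2p, k_k2p, n_sites))

stat = lrt_statistic(lnl_jc, lnl_k2p)
p_value = chi2_sf_df1(stat)  # df = 8 - 7 = 1
print("LRT statistic:", stat, "p-value:", p_value)
```

Lower AIC or BIC indicates the preferred model. With these made-up numbers all three criteria favor the richer model, but BIC penalizes the extra parameter more heavily than AIC whenever ln n exceeds 2.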
Computational Algorithms for Tree Inference
Efficient algorithms are essential for inferring phylogenetic trees from large datasets and complex models
Heuristic search strategies explore the tree space by applying local rearrangements to find a high-scoring tree, without guaranteeing the global optimum
Nearest-neighbor interchange (NNI) swaps the subtrees on either side of an internal branch
Subtree pruning and regrafting (SPR) removes a subtree and reattaches it to a different branch
Tree bisection and reconnection (TBR) splits the tree into two subtrees and reconnects them at all possible locations
Branch-and-bound algorithms guarantee finding the optimal tree by pruning the search space based on upper and lower bounds of the optimality criterion
Divide-and-conquer approaches break down the problem into smaller subproblems and combine their solutions (disk-covering methods)
Parallel computing techniques distribute the computational load across multiple processors or cores to speed up the analysis
Fast heuristic programs (FastTree, RAxML) trade a guarantee of finding the optimal tree for computational efficiency on large datasets
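A quick calculation shows why exhaustive search is infeasible and heuristics are needed. The number of distinct unrooted, fully bifurcating topologies for n taxa is the double factorial (2n - 5)!!, a standard combinatorial result. The function name below is hypothetical.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies: (2n - 5)!! = 1 * 3 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_trees(n))
```

Four taxa give 3 topologies and ten taxa already give 2,027,025. By a few dozen taxa the count is too large for any exhaustive or branch-and-bound search to enumerate, so searches rely on rearrangements such as NNI, SPR, and TBR.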
Interpreting and Analyzing Phylogenetic Trees
Phylogenetic trees provide insights into the evolutionary history, relationships, and diversity of organisms
Monophyletic groups (clades) consist of an ancestor and all its descendants, forming a natural evolutionary unit
Paraphyletic groups include an ancestor and some, but not all, of its descendants, often due to incomplete sampling or exclusion of derived lineages
Polyphyletic groups contain taxa whose most recent common ancestor is excluded from the group, typically because the members were united by convergent traits or incorrect classification
Rooting a tree determines the direction of evolution and the common ancestor of all taxa
Outgroup rooting uses a distantly related taxon to root the tree, assuming it diverged before the ingroup taxa
Midpoint rooting places the root at the midpoint of the longest path between any two taxa
Branch lengths represent the amount of evolutionary change or time elapsed along each lineage
Clade support values (bootstrap percentages, posterior probabilities) indicate the statistical confidence in each grouping
Reconciliation of gene trees and species trees can infer the history of gene duplications, losses, and horizontal transfers
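The definition of a monophyletic group above translates directly into a check on a rooted tree: a set of taxa is a clade only if it equals the full set of leaves under some node. The sketch below assumes a rooted tree written as nested tuples of taxon names. The tree itself is a heavily simplified illustration of tetrapod relationships, not an inferred result.

```python
def leaf_set(tree):
    """Set of taxon labels under a node of a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets of every node in the tree, i.e. all monophyletic groups."""
    found = {leaf_set(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clades(child)
    return found


def is_monophyletic(tree, taxa):
    """True if `taxa` is exactly the set of descendants of some node."""
    return frozenset(taxa) in clades(tree)


# Simplified, illustrative rooted tree (frog as the outgroup).
tree = ((("lizard", ("crocodile", "bird")), "mammal"), "frog")

print(is_monophyletic(tree, {"crocodile", "bird"}))            # a clade
print(is_monophyletic(tree, {"lizard", "crocodile"}))          # excludes bird
print(is_monophyletic(tree, {"lizard", "crocodile", "bird"}))  # a clade
```

The second query fails because the smallest clade containing lizard and crocodile also contains bird, which is the sense in which a "reptile" group that leaves out birds is paraphyletic.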
Applications in Molecular Biology
Phylogenetic analysis has diverse applications in molecular biology, from understanding evolutionary relationships to inferring functional properties of genes and proteins
Comparative genomics uses phylogenetic trees to study the evolution of genomes, identifying conserved regions, gene family expansions, and lineage-specific adaptations
Phylogenomics reconstructs the evolutionary history of species by analyzing large-scale genomic data, resolving deep branching patterns and detecting incomplete lineage sorting
Molecular epidemiology tracks the spread and evolution of pathogens (HIV, influenza) by constructing phylogenetic trees from viral sequences sampled over time and space
Protein function prediction relies on the principle of evolutionary conservation, inferring the function of uncharacterized proteins based on their phylogenetic relationship to annotated homologs
Drug target identification exploits the evolutionary differences between pathogen and host proteins to find specific inhibitors with minimal side effects
Ancestral sequence reconstruction infers the most likely sequences of extinct ancestors at the internal nodes of a phylogenetic tree, enabling the study of ancient adaptations and biomolecular resurrection
Advanced Topics and Current Research
Phylogenetic networks extend the tree model to accommodate reticulate evolutionary events such as hybridization, recombination, and horizontal gene transfer
Split networks represent incompatible phylogenetic signals as parallel edges, visualizing the conflicting relationships
Ancestral recombination graphs (ARGs) capture the history of recombination events and the ancestry of different genomic regions
Coalescent theory models the genealogical history of alleles in a population, providing a framework for inferring population parameters and demographic events from genetic data
Multi-species coalescent models account for incomplete lineage sorting and discordance between gene trees and species trees
Bayesian skyline plots estimate changes in effective population size over time based on the coalescent patterns of sampled sequences
Phylogenetic comparative methods study the evolution of traits along a phylogeny, testing for correlations, adaptive radiations, and evolutionary convergence
Independent contrasts transform trait values to account for the non-independence of species due to shared ancestry
Ornstein-Uhlenbeck (OU) models incorporate stabilizing selection and attraction towards optimal trait values
Machine learning approaches, such as deep learning and graph convolutional networks, are being applied to phylogenetic inference, enabling the analysis of complex, high-dimensional data
Microbiome phylogenetics investigates the diversity and evolution of microbial communities, using metagenomics and amplicon sequencing to reconstruct the phylogenetic relationships of uncultured organisms
Viral quasispecies analysis studies the intra-host evolutionary dynamics of rapidly evolving viruses (HIV, hepatitis C) by constructing phylogenetic trees from single-molecule sequencing data
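The coalescent framework mentioned above can be illustrated with its simplest prediction: the expected time back to the most recent common ancestor (TMRCA) of a sample. The sketch assumes the standard Kingman coalescent for a constant-size population, with time measured in coalescent units (2N generations for a diploid population of effective size N). While k lineages remain, the waiting time to the next coalescence is exponential with rate k(k - 1)/2. The function names are hypothetical.

```python
import random


def expected_tmrca(n_samples):
    """Expected TMRCA in coalescent units: sum of 2 / (k (k - 1)) for k = 2..n."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n_samples + 1))


def simulate_tmrca(n_samples, rng):
    """Draw one TMRCA by summing exponential waiting times between coalescences."""
    t = 0.0
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2.0  # number of lineage pairs that could coalesce
        t += rng.expovariate(rate)
    return t


rng = random.Random(42)
n = 10
draws = [simulate_tmrca(n, rng) for _ in range(20000)]
mean_simulated = sum(draws) / len(draws)
print("expected:", expected_tmrca(n))  # equals 2 * (1 - 1/n)
print("simulated mean:", mean_simulated)
```

The expectation simplifies to 2(1 - 1/n), so it never exceeds 2 coalescent units however many sequences are sampled. Methods such as Bayesian skyline plots relax the constant-size assumption by letting the coalescence rate change through time.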