💻Computational Biology Unit 7 Review

Phylogenetic tree construction is a crucial aspect of molecular evolution studies. It involves using various methods to infer evolutionary relationships between organisms based on genetic data. These methods can be broadly categorized into distance-based and character-based approaches, each with its own strengths and limitations.

Understanding the different tree construction methods is essential for interpreting evolutionary relationships accurately. From simple algorithms like UPGMA to more complex approaches like maximum likelihood and Bayesian inference, each method offers unique insights into the evolutionary history of organisms. Assessing tree reliability and interpreting results are key skills in molecular phylogenetics.

Phylogenetic Tree Construction Methods

Distance-Based and Character-Based Methods

Phylogenetic trees are constructed using either distance-based or character-based methods, each with its own advantages and limitations
Distance-based methods calculate pairwise distances between sequences to construct a tree (neighbor-joining, UPGMA)
- These methods are computationally efficient but may not always produce the most accurate tree topology
- Neighbor-joining is a bottom-up clustering algorithm that minimizes the total branch length at each stage of tree construction
- UPGMA assumes a constant rate of evolution and produces rooted trees
Character-based methods use discrete characters or character states to infer the most likely tree topology (maximum parsimony, maximum likelihood)
- These methods are more computationally intensive but can produce more accurate trees
- Maximum parsimony selects the tree that requires the fewest evolutionary changes to explain the observed data
- Maximum likelihood calculates the probability of observing the data given a specific tree topology and evolutionary model, and selects the tree that maximizes this likelihood
Bayesian inference incorporates prior probabilities and calculates the posterior probability of a tree given the data and the model
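
The parsimony criterion described above can be illustrated with a minimal sketch of Fitch's counting procedure for a single character on a fixed, rooted binary tree. The nested-tuple tree representation, the function name fitch_score, and the example alignment column are assumptions made for illustration, not part of any particular software package.

```python
def fitch_score(tree, states):
    """Return (state_set, changes) for one character on a rooted binary tree.

    tree   : a leaf name (str) or a 2-tuple (left, right) of subtrees
    states : dict mapping leaf name -> observed character state
    """
    if isinstance(tree, str):
        # Leaf: the state set is just the observed state, with no changes yet
        return {states[tree]}, 0
    left_set, left_cost = fitch_score(tree[0], states)
    right_set, right_cost = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:
        # The children share a state, so no extra change is needed here
        return common, left_cost + right_cost
    # The children disagree, so at least one change occurs below this node
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical alignment column for four taxa
column = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))  # groups the taxa that share a state
tree_2 = (("A", "C"), ("B", "D"))  # separates them

score_1 = fitch_score(tree_1, column)[1]
score_2 = fitch_score(tree_2, column)[1]
print(score_1, score_2)
```

For this one character, tree_1 needs fewer changes than tree_2, so parsimony prefers it. A full analysis would sum the score over every column of the alignment and compare many candidate topologies.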

Advantages and Limitations of Different Methods

Distance-based methods are computationally efficient and can handle large datasets, but they may lose information by reducing sequences to pairwise distances
- They are sensitive to the choice of evolutionary model and may not recover the correct tree topology if the model assumptions are violated
- Neighbor-joining does not assume a constant rate of evolution, making it more flexible than UPGMA
- UPGMA is sensitive to unequal rates of evolution and can produce incorrect trees if this assumption is violated
Character-based methods use more information from the sequences and can produce more accurate trees, but they are computationally intensive and may be sensitive to model choice
- Maximum parsimony does not explicitly model evolutionary processes and may be misled by homoplasy (convergent evolution, reversals, and parallel evolution)
- Maximum likelihood accounts for different rates of evolution and provides a statistical framework for model selection and hypothesis testing
- Bayesian inference incorporates prior knowledge and quantifies uncertainty in tree estimates, but it requires specifying prior distributions and can be computationally demanding
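
The constant-rate assumption behind UPGMA corresponds to the distance matrix being ultrametric: in every triple of taxa, the two largest pairwise distances are equal. The check below is a minimal sketch of that three-point condition; the function name, the tolerance, and both example matrices are hypothetical.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances; the two largest must match
        smallest, middle, largest = sorted(
            [dist[i][j], dist[i][k], dist[j][k]]
        )
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical distances consistent with a molecular clock
clock_like = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]

# Hypothetical distances where the second taxon has evolved faster
unequal_rates = [
    [0, 5, 6],
    [5, 0, 9],
    [6, 9, 0],
]

print(is_ultrametric(clock_like), is_ultrametric(unequal_rates))
```

A matrix that fails this check violates the assumption UPGMA relies on, which is one reason to prefer neighbor-joining or a character-based method for such data.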

Constructing Phylogenetic Trees

Input Data and Evolutionary Models

Distance-based methods require a distance matrix as input, which is calculated from pairwise sequence alignments using a specific evolutionary model (Jukes-Cantor, Kimura 2-parameter)
- The choice of evolutionary model affects the estimated distances and the resulting tree topology
- Models differ in their assumptions about nucleotide frequencies, substitution rates, and rate variation among sites
Character-based methods require a multiple sequence alignment as input, where each position in the alignment represents a character
- The quality of the alignment affects the accuracy of the resulting tree
- Evolutionary models are used to calculate the likelihood of the data given a tree topology and to estimate branch lengths
- Models for character-based methods include nucleotide substitution models (GTR, HKY), amino acid substitution models (WAG, LG), and codon models (Goldman-Yang)
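
As a sketch of how a model turns observed differences into a distance, the code below computes Jukes-Cantor and Kimura 2-parameter distances for one pair of aligned DNA sequences. The function names and the two example sequences are made up for illustration, and skipping gaps and ambiguous bases is an assumption about how such sites are handled.

```python
import math

PURINES = {"A", "G"}


def substitution_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitions and transversions."""
    transitions = transversions = sites = 0
    for x, y in zip(seq1, seq2):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: ignore gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return transitions / sites, transversions / sites


def jukes_cantor(seq1, seq2):
    """JC69 distance: d = -3/4 * ln(1 - 4p/3), p = proportion of differences."""
    P, Q = substitution_proportions(seq1, seq2)
    p = P + Q
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """K2P distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    P, Q = substitution_proportions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Hypothetical aligned sequences: one transition and one transversion in 20 sites
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGTGCGTACGTACTT"

jc = jukes_cantor(seq_a, seq_b)
k2p = kimura_2p(seq_a, seq_b)
print(round(jc, 4), round(k2p, 4))
```

Both corrected distances exceed the raw proportion of differing sites (0.1 here) because the models account for multiple substitutions at the same site.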

Tree Searching and Optimization Algorithms

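Neighbor-joining starts with a star-like tree and iteratively joins the pair of taxa that minimizes a rate-corrected criterion (the Q value, which adjusts each pairwise distance by the two taxa's total distances to all other taxa), then recomputes distances to the newly created node
- The algorithm is fast, but it is a greedy heuristic: it recovers the correct tree when the distances fit a tree exactly (are additive) and is not otherwise guaranteed to find the tree with the smallest total branch length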
UPGMA creates a rooted tree by successively clustering the least distant pairs of taxa, assuming a molecular clock (constant rate of evolution)
- The algorithm is simple and fast but may produce incorrect trees if the molecular clock assumption is violated
Maximum parsimony searches for the tree that minimizes the total number of character state changes (mutations) required to explain the observed data
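- Exact searches (exhaustive enumeration, branch-and-bound) guarantee the most parsimonious tree(s) but are computationally infeasible for large datasets, so heuristic searches based on branch-swapping moves (nearest neighbor interchange, subtree pruning and regrafting, tree bisection and reconnection) are used instead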
Maximum likelihood calculates the probability of observing the data given a tree topology and an evolutionary model, selecting the tree with the highest likelihood
- Branch lengths and model parameters are optimized with numerical methods (Newton-Raphson, expectation-maximization), while tree topologies are explored with heuristic rearrangements (nearest neighbor interchange, subtree pruning and regrafting)
Bayesian inference combines the likelihood of the data with prior probabilities to calculate the posterior probability of a tree, often using Markov Chain Monte Carlo (MCMC) sampling to explore the tree space
- MCMC algorithms (Metropolis-Hastings, Gibbs sampling) generate a sample of trees from the posterior distribution, which can be summarized to estimate tree topology, branch lengths, and support values
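
Returning to neighbor-joining, its pair-selection step can be sketched as follows. For n taxa, the standard criterion is Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k), and the pair with the smallest Q is joined. The function names and the five-taxon matrix are illustrative assumptions; the sketch covers only the first selection step, not the full algorithm.

```python
from itertools import combinations


def nj_pair_to_join(dist):
    """Return the pair (i, j) with the smallest Q value, and that value."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i, j in combinations(range(n), 2):
        q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
        if q < best_q:
            best_pair, best_q = (i, j), q
    return best_pair, best_q


def closest_pair(dist):
    """Return the pair with the smallest raw distance (what UPGMA joins first)."""
    pairs = combinations(range(len(dist)), 2)
    return min(pairs, key=lambda pair: dist[pair[0]][pair[1]])


# Hypothetical distance matrix for five taxa (indices 0 to 4)
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

nj_pair, nj_q = nj_pair_to_join(example)
upgma_pair = closest_pair(example)
print(nj_pair, nj_q, upgma_pair)
```

In this matrix the two selections differ: the raw distance is smallest between taxa 3 and 4, but the Q value is smallest for taxa 0 and 1. This is how neighbor-joining avoids depending on a constant rate of evolution.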

Phylogenetic Tree Reliability

Statistical Measures of Branch Support

Bootstrap analysis is a resampling technique used to estimate the reliability of tree branches by creating pseudoreplicates of the original dataset and calculating the proportion of times each branch is recovered
- Branches with high bootstrap support (>70%) are considered more reliable than those with low support
- Bootstrapping is computationally intensive and may not always provide an accurate measure of branch support, especially for small datasets or short branches
Jackknife analysis is similar to bootstrapping but builds each pseudoreplicate by deleting a fixed proportion of the characters (often 50%) without replacement, rather than resampling with replacement
- Each jackknife pseudoreplicate is smaller than the original dataset, which reduces the computation per replicate but can make support estimates less accurate
Decay index (Bremer support) measures the number of additional evolutionary steps required to collapse a branch in a maximum parsimony tree
- Higher decay indices indicate stronger support for a branch, as more evidence is needed to contradict it
- Decay indices are calculated by searching for the shortest trees that do not contain a particular branch and comparing their lengths to the most parsimonious tree
Posterior probabilities in Bayesian inference indicate the probability of a branch being true given the data and the model
- Posterior probabilities are not directly comparable to bootstrap values and tend to be higher than the bootstrap support for the same branch
- Posterior probabilities can be sensitive to model choice and prior specifications, so they should be interpreted cautiously
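
The resampling step of a bootstrap analysis can be sketched as below: alignment columns are drawn with replacement to build a pseudoreplicate of the same length, and support for a branch is the fraction of replicate trees that contain it. The function names, the toy alignment, and the list of replicate clades are hypothetical, and the tree-building step for each pseudoreplicate is deliberately left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment : dict mapping taxon name -> aligned sequence (equal lengths)
    rng       : a random.Random instance, so results can be reproduced
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def branch_support(replicate_clades, clade):
    """Fraction of replicate trees whose clade set contains the given clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)


# Hypothetical toy alignment
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
replicate = bootstrap_alignment(alignment, random.Random(42))

# Hypothetical clade sets recovered from four replicate trees
ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
replicate_clades = [{ab}, {ab}, {ac}, {ab}]
support_ab = branch_support(replicate_clades, ab)
print(replicate, support_ab)
```

A jackknife version would differ only in the sampling step, keeping a random subset of columns without replacement.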

Assessing Tree Fit and Model Selection

Consistency index (CI) and retention index (RI) are used to assess the fit of a maximum parsimony tree to the data
- CI measures the amount of homoplasy in the tree, with higher values indicating less homoplasy and a better fit
- RI measures the proportion of synapomorphy retained in the tree, with higher values indicating a better fit
- Both indices range from 0 to 1, with 1 indicating a perfect fit and no homoplasy
Likelihood ratio tests can be used to compare the fit of different evolutionary models to the data and select the best-fitting model for tree construction
- The likelihood ratio test compares the likelihoods of two nested models, with the more complex model having additional parameters
- If the likelihood improvement of the more complex model is statistically significant, it is preferred over the simpler model
- Model selection criteria (Akaike information criterion, Bayesian information criterion) balance the fit of the model with its complexity, favoring models that explain the data well without overfitting
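
The comparison of two nested models can be sketched with the standard formulas: the likelihood ratio statistic is 2 * (lnL_complex - lnL_simple), AIC = 2k - 2 lnL, and BIC = k ln(n) - 2 lnL. The log-likelihood values, parameter counts, and alignment length below are invented for illustration. The p-value helper assumes the models differ by exactly one parameter and uses the usual chi-square approximation.

```python
import math


def lrt_statistic(log_lik_simple, log_lik_complex):
    """Likelihood ratio statistic for two nested models."""
    return 2 * (log_lik_complex - log_lik_simple)


def chi2_sf_df1(x):
    """Upper tail of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


def aic(log_lik, k):
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion for n sites (lower is better)."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical fits on the same tree: JC69 (0 free substitution
# parameters) versus K2P (1 free parameter), for 500 alignment sites
lnl_jc, k_jc = -1250.0, 0
lnl_k2p, k_k2p = -1245.0, 1
n_sites = 500

stat = lrt_statistic(lnl_jc, lnl_k2p)
p_value = chi2_sf_df1(stat)
print(stat, p_value)
print(aic(lnl_jc, k_jc), aic(lnl_k2p, k_k2p))
print(bic(lnl_jc, k_jc, n_sites), bic(lnl_k2p, k_k2p, n_sites))
```

With these made-up numbers, all three approaches favor the more complex model. In practice the parameter count often includes branch lengths as well; they are omitted here because both models are fitted to the same tree.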

Interpreting Phylogenetic Trees

Evolutionary Relationships and Patterns

The branching pattern (topology) of a phylogenetic tree reflects the evolutionary relationships among taxa
- Taxa that share a more recent common ancestor are more closely related than those with a more distant common ancestor
- Monophyletic groups (clades) consist of an ancestor and all its descendants, and are supported by shared derived characters (synapomorphies)
- Paraphyletic groups include an ancestor but not all of its descendants, while polyphyletic groups bring together taxa from separate lineages and exclude their most recent common ancestor
Branch lengths in a phylogenetic tree represent the amount of evolutionary change (number of substitutions per site) between taxa
- Longer branches indicate more evolutionary change and a greater genetic distance between taxa
- Branch lengths can be used to estimate divergence times and rates of evolution, but they are affected by the choice of evolutionary model and the presence of rate variation among lineages
Rooted trees have a specific node designated as the root, representing the common ancestor of all taxa in the tree
- The root determines the direction of evolutionary change and the relative ages of lineages
- Unrooted trees do not specify the position of the root and only depict the relative relationships among taxa
Outgroup taxa are used to root a tree and determine the direction of character state changes
- An outgroup is a taxon that is known to be less closely related to the ingroup taxa than they are to each other
- The outgroup is used to polarize character states, with the state present in the outgroup considered the ancestral state
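
Whether a set of taxa forms a clade can be checked directly from a rooted tree. The sketch below represents a tree as nested tuples and tests whether some subtree has exactly the given taxa as its leaves. The representation, the function names, and the simplified four-taxon tree are assumptions for illustration.

```python
def leaf_set(tree):
    """Return the set of leaf names under a (sub)tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Return the leaf sets of every subtree of a rooted tree."""
    found = {leaf_set(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clades(child)
    return found


def is_monophyletic(tree, group):
    """True if the group is exactly the leaf set of some subtree."""
    return frozenset(group) in clades(tree)


# Simplified illustrative tree, rooted with "mammal" as the outgroup
tree = ((("crocodile", "bird"), "lizard"), "mammal")

print(is_monophyletic(tree, {"crocodile", "bird"}))
print(is_monophyletic(tree, {"crocodile", "lizard"}))
```

On this tree, crocodile and bird form a clade, while crocodile and lizard do not: grouping them leaves out a descendant (bird) of their common ancestor, which is the paraphyletic case.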

Inferring Evolutionary Events and Processes

Polytomies (multifurcations) in a tree indicate uncertainty in the branching order or rapid diversification events
- Hard polytomies represent simultaneous divergence of multiple lineages, while soft polytomies result from insufficient data to resolve the branching order
- Polytomies can be resolved by adding more data (characters or taxa) or using more sophisticated phylogenetic methods
Convergent evolution can be inferred when distantly related taxa share similar character states due to similar selective pressures
- Convergent characters (homoplasies) can mislead phylogenetic analyses and should be identified and accounted for
- Convergence can be detected by comparing the fit of alternative tree topologies or by examining character state distributions across the tree
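Horizontal gene transfer can be suspected when a well-supported gene tree topology differs significantly from the species tree topology, although incomplete lineage sorting, gene duplication and loss, and reconstruction error can produce similar conflicts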
- Horizontal gene transfer is common in prokaryotes and can result in discordance between gene trees and species trees
- Phylogenetic network methods can be used to visualize and quantify horizontal gene transfer events
- Reconciliation methods can be used to infer the history of gene duplications, losses, and transfers that explain the discordance between gene and species trees
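
A simple first step in comparing a gene tree with a species tree is to count the clades found in one tree but not the other, a rooted variant of the Robinson-Foulds distance. The sketch below assumes both trees are rooted, share the same taxa, and are written as nested tuples; the function names and both example trees are hypothetical.

```python
def collect_clades(tree, found):
    """Add the leaf set of every subtree to `found`; return this subtree's leaves."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(
            *(collect_clades(child, found) for child in tree)
        )
    found.add(leaves)
    return leaves


def rooted_rf(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    clades_a, clades_b = set(), set()
    collect_clades(tree_a, clades_a)
    collect_clades(tree_b, clades_b)
    return len(clades_a ^ clades_b)


# Hypothetical species tree and a conflicting gene tree
species_tree = ((("A", "B"), "C"), "D")
gene_tree = ((("A", "C"), "B"), "D")

distance = rooted_rf(species_tree, gene_tree)
print(distance)
```

A nonzero distance only shows that the topologies disagree. Deciding whether transfer, duplication and loss, or another process explains the conflict requires reconciliation or network methods and an assessment of branch support.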
