📉 Statistical Methods for Data Science Unit 2 – Probability Theory & Distributions
Probability theory and distributions form the foundation of statistical analysis in data science. These concepts provide a framework for understanding uncertainty and variability in data, enabling researchers to make informed decisions and predictions.
From basic probability rules to complex distributions, this unit covers essential tools for modeling real-world phenomena. Students learn to calculate probabilities, work with various distribution types, and apply these concepts to solve practical problems in data analysis and decision-making.
Probability measures the likelihood of an event occurring and ranges from 0 (impossible) to 1 (certain)
Sample space (S) represents the set of all possible outcomes of an experiment or random process
An event (E) is a subset of the sample space containing outcomes of interest
Mutually exclusive events cannot occur simultaneously (rolling both a 1 and a 2 on a single die roll)
Independent events do not influence each other's probability (flipping a coin multiple times)
The probability of independent events occurring together is the product of their individual probabilities: P(A ∩ B) = P(A) * P(B)
Conditional probability measures the likelihood of an event given that another event has occurred, denoted as P(A|B) = P(A ∩ B) / P(B)
Bayes' Theorem allows updating probabilities based on new information: P(A|B) = P(B|A) * P(A) / P(B)
The Law of Total Probability states that the probability of an event is the weighted sum of its conditional probabilities across a partition of the sample space: P(A) = P(A|B1) * P(B1) + ... + P(A|Bn) * P(Bn)
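As a concrete illustration of conditional probability, Bayes' Theorem, and the Law of Total Probability, here is a minimal Python sketch using a hypothetical disease-testing scenario (the prevalence and test-accuracy numbers below are made up for illustration):

```python
# Hypothetical disease-testing scenario; all numbers are illustrative.
p_disease = 0.01            # P(D): prevalence of the disease
p_pos_given_disease = 0.95  # P(+|D): test sensitivity
p_pos_given_healthy = 0.05  # P(+|not D): false-positive rate

# Law of Total Probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' Theorem: P(D|+) = P(+|D) * P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(+)   = {p_pos:.4f}")               # 0.0590
print(f"P(D|+) = {p_disease_given_pos:.4f}") # 0.1610
```

Even with a fairly accurate test, a positive result only raises the probability of disease to about 16% because the disease is rare, which is exactly the kind of update Bayes' Theorem captures.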
Probability Distributions Explained
A probability distribution assigns probabilities to each possible outcome of a random variable
Discrete probability distributions deal with countable outcomes (number of heads in 10 coin flips)
Continuous probability distributions deal with uncountable outcomes on a continuous scale (height, weight)
Probability density functions (PDFs) describe the relative likelihood of a continuous random variable taking values near a given point; the probability of any single exact value is 0
The area under the PDF curve between two points represents the probability of the variable falling within that range
Cumulative distribution functions (CDFs) give the probability that a random variable is less than or equal to a given value
Expected value (mean) is the average value of a random variable over many trials, weighted by probability
Variance measures the average squared deviation from the mean, indicating the spread of the distribution
Standard deviation is the square root of variance and has the same units as the random variable
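These quantities are easy to explore numerically. Below is a minimal sketch using scipy.stats (assuming SciPy is available), with a normal random variable chosen as the running example:

```python
from scipy import stats

# Freeze a normal random variable with mean 2 and standard deviation 3
X = stats.norm(loc=2, scale=3)

print(X.pdf(2))              # density at x = 2 (not a probability for continuous X)
print(X.cdf(2))              # CDF: P(X <= 2) = 0.5 by symmetry
print(X.cdf(5) - X.cdf(-1))  # area under the PDF on [-1, 5] (within 1σ, ≈ 0.6827)
print(X.mean(), X.var(), X.std())  # expected value 2, variance 9, std dev 3
```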
Types of Random Variables
Discrete random variables have countable outcomes (number of defective items in a batch)
Examples include Bernoulli, binomial, Poisson, and geometric distributions
Continuous random variables have uncountable outcomes on a continuous scale (time until failure)
Examples include uniform, normal (Gaussian), exponential, and beta distributions
Mixed random variables have both discrete and continuous components (amount of rainfall, which is exactly zero with positive probability or takes a continuous positive value)
Independent random variables do not influence each other's probabilities or outcomes
Identically distributed random variables follow the same probability distribution
Joint probability distributions describe the likelihood of multiple random variables taking on specific values simultaneously
Marginal probability distributions are derived from joint distributions by summing or integrating over the other variables
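For two discrete random variables, the joint distribution can be stored as a table, and the marginals fall out by summing over the other variable. A small sketch (the joint probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical joint distribution P(X = i, Y = j); rows index X, columns index Y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)  # marginal P(X): sum over Y -> [0.30, 0.70]
p_y = joint.sum(axis=0)  # marginal P(Y): sum over X -> [0.40, 0.60]

# X and Y are independent iff the joint equals the outer product of the marginals
print(np.allclose(joint, np.outer(p_x, p_y)))  # False: X and Y are dependent here
```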
Common Probability Distributions
Bernoulli distribution models a single binary outcome (success or failure) with probability p
Binomial distribution models the number of successes in n independent Bernoulli trials with probability p
Mean: np, Variance: np(1-p)
Poisson distribution models the number of rare events occurring in a fixed interval with average rate λ
Mean and variance: λ
Geometric distribution models the number of trials until the first success with probability p
Mean: 1/p, Variance: (1-p)/p^2
Uniform distribution has equal probability for all values between a minimum (a) and maximum (b)
Mean: (a+b)/2, Variance: (b-a)^2/12
Normal (Gaussian) distribution is symmetric and bell-shaped, characterized by mean μ and standard deviation σ
68-95-99.7 rule: 68% of data within 1σ, 95% within 2σ, 99.7% within 3σ
Exponential distribution models the time between events in a Poisson process with rate λ
Mean: 1/λ, Variance: 1/λ^2
Beta distribution is a flexible distribution on the interval [0, 1], commonly used to model probabilities, proportions, or percentages
Shape determined by two positive parameters, α and β
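The mean and variance formulas listed above can be checked directly against scipy.stats; a minimal sketch (the parameter values are arbitrary, and note that scipy's expon is parameterized by scale = 1/λ):

```python
from scipy import stats

n, p, lam, a, b = 10, 0.3, 4.0, 2.0, 8.0  # arbitrary example parameters

print(stats.binom(n, p).stats())                  # (np, np(1-p)) = (3.0, 2.1)
print(stats.poisson(lam).stats())                 # (λ, λ) = (4.0, 4.0)
print(stats.geom(p).stats())                      # (1/p, (1-p)/p^2) ≈ (3.33, 7.78)
print(stats.uniform(loc=a, scale=b - a).stats())  # ((a+b)/2, (b-a)^2/12) = (5.0, 3.0)
print(stats.expon(scale=1 / lam).stats())         # (1/λ, 1/λ^2) = (0.25, 0.0625)
```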
Properties of Distributions
Skewness measures the asymmetry of a distribution
Positive skew: tail on the right side, mean > median
Negative skew: tail on the left side, mean < median
Kurtosis measures the tailedness and peakedness of a distribution compared to a normal distribution
Leptokurtic: heavy tails and a sharp peak (positive excess kurtosis)
Platykurtic: light tails and a flatter peak (negative excess kurtosis)
Moments characterize various aspects of a distribution
First moment: mean (central tendency)
Second central moment: variance (spread)
Third standardized moment: skewness (asymmetry)
Fourth standardized moment: kurtosis (tailedness and peakedness)
Moment-generating functions (MGFs) uniquely characterize a distribution and facilitate calculation of moments
Probability-generating functions (PGFs) serve a similar purpose for discrete distributions
Transformations of random variables lead to new distributions with different properties
Linear transformations (Y = aX + b) shift the mean to aμ + b and scale the variance to a^2 * σ^2
Nonlinear transformations can change the shape of the distribution
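A quick numerical sketch of these properties, using sample skewness and excess kurtosis from scipy.stats on simulated exponential data (which is right-skewed); exact values vary with the random seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # right-skewed sample

print(stats.skew(x))      # ≈ 2 for an exponential distribution (positive skew)
print(stats.kurtosis(x))  # excess kurtosis ≈ 6 (leptokurtic)

# Linear transformation Y = aX + b: mean -> a*mean + b, variance -> a^2 * variance
a, b = 3.0, 5.0
y = a * x + b
print(a * x.mean() + b, y.mean())  # agree (up to floating-point rounding)
print(a**2 * x.var(), y.var())     # agree (up to floating-point rounding)
```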
Calculating Probabilities
For discrete distributions, probabilities are summed across the desired outcomes
P(X = x) for a specific value, P(a ≤ X ≤ b) for a range of values
For continuous distributions, probabilities are determined by integrating the PDF over the desired range
P(a ≤ X ≤ b) = ∫[a to b] f(x) dx, where f(x) is the PDF
Standardization (z-score) transforms a random variable to have mean 0 and standard deviation 1
z = (X - μ) / σ, where X is the original variable, μ is the mean, and σ is the standard deviation
Allows for comparison and calculation of probabilities across different normal distributions
Quantile functions (inverse CDFs) determine the value corresponding to a given cumulative probability
For example, the 0.95 quantile is the value below which 95% of the data falls
Monte Carlo simulation generates random samples from a distribution to estimate probabilities and quantities of interest
Central Limit Theorem (CLT) states that the sum or average of many independent, identically distributed random variables with finite variance is approximately normally distributed, regardless of the original distribution
Enables inference and hypothesis testing for large samples
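A short sketch tying these pieces together: a probability via standardization, a quantile via the inverse CDF, and a Monte Carlo illustration of the CLT using averages of uniform random variables:

```python
import numpy as np
from scipy import stats

# Standardization: P(X <= 160) for X ~ N(150, 20) via the z-score
z = (160 - 150) / 20
print(stats.norm.cdf(z))     # ≈ 0.6915

# Quantile function (inverse CDF): the 0.95 quantile of the standard normal
print(stats.norm.ppf(0.95))  # ≈ 1.6449

# CLT via Monte Carlo: means of 50 uniforms are approximately normal
rng = np.random.default_rng(42)
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)
print(sample_means.mean())   # ≈ 0.5, the mean of Uniform(0, 1)
print(sample_means.std())    # ≈ sqrt((1/12) / 50) ≈ 0.0408
```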
Applications in Data Science
Probability distributions model real-world phenomena and uncertainties
The time between customer arrivals at a store follows an exponential distribution with an average of 10 minutes. What is the probability that the next customer arrives within 5 minutes?
Exponential distribution with rate λ = 1/10 customers per minute
CDF: F(x) = 1 - e^(-λx)
P(X ≤ 5) = F(5) = 1 - e^(-(1/10) × 5) = 1 - e^(-0.5) ≈ 0.3935
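The same answer can be verified in code; a minimal sketch with scipy.stats.expon (parameterized by scale = 1/λ):

```python
from scipy import stats

lam = 1 / 10  # rate: one customer per 10 minutes on average
print(stats.expon(scale=1 / lam).cdf(5))  # P(X <= 5) ≈ 0.3935
```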
The weights of apples harvested from a tree follow a normal distribution with a mean of 150 grams and a standard deviation of 20 grams. What is the probability that a randomly selected apple weighs between 140 and 160 grams?
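Standardize both endpoints: z = (140 - 150)/20 = -0.5 and z = (160 - 150)/20 = 0.5
P(140 ≤ X ≤ 160) = Φ(0.5) - Φ(-0.5) ≈ 0.6915 - 0.3085 = 0.3829, where Φ is the standard normal CDF
The same calculation directly from the normal CDF, as a minimal scipy.stats sketch:

```python
from scipy import stats

X = stats.norm(loc=150, scale=20)  # apple weights in grams
print(X.cdf(160) - X.cdf(140))     # P(140 <= X <= 160) ≈ 0.3829
```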