📉 Statistical Methods for Data Science Unit 2 – Probability Theory & Distributions

Probability theory and distributions form the foundation of statistical analysis in data science. These concepts provide a framework for understanding uncertainty and variability in data, enabling researchers to make informed decisions and predictions. From basic probability rules to complex distributions, this unit covers essential tools for modeling real-world phenomena. Students learn to calculate probabilities, work with various distribution types, and apply these concepts to solve practical problems in data analysis and decision-making.

Key Concepts in Probability Theory

  • Probability measures the likelihood of an event occurring and ranges from 0 (impossible) to 1 (certain)
  • Sample space (S) represents the set of all possible outcomes of an experiment or random process
  • An event (E) is a subset of the sample space containing outcomes of interest
  • Mutually exclusive events cannot occur simultaneously (rolling a 1 and 2 on a single die roll)
  • Independent events do not influence each other's probability (flipping a coin multiple times)
    • The probability of independent events occurring together is the product of their individual probabilities
  • Conditional probability measures the likelihood of an event given that another event has occurred, denoted as P(A|B) = P(A ∩ B) / P(B)
  • Bayes' Theorem allows updating probabilities based on new information: P(A|B) = P(B|A) * P(A) / P(B) (see the sketch after this list)
  • The Law of Total Probability weights an event's conditional probabilities by the probabilities of the conditioning events: P(A) = P(A|B₁)P(B₁) + ... + P(A|Bₙ)P(Bₙ), where B₁, ..., Bₙ are mutually exclusive events that together cover the sample space
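
To make these rules concrete, here is a minimal Python sketch applying the Law of Total Probability and Bayes' Theorem to a hypothetical disease-screening example; the prior and test accuracy values are made-up numbers chosen only for illustration:

```python
# A minimal sketch of conditional probability and Bayes' Theorem.
# The disease-screening numbers below are illustrative assumptions, not data.

p_disease = 0.01              # prior: P(D)
p_pos_given_disease = 0.95    # sensitivity: P(+ | D)
p_pos_given_healthy = 0.05    # false-positive rate: P(+ | not D)

# Law of Total Probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Theorem: P(D|+) = P(+|D) * P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(+)   = {p_pos:.4f}")                 # ≈ 0.0590
print(f"P(D|+) = {p_disease_given_pos:.4f}")   # ≈ 0.1610
```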

Probability Distributions Explained

  • A probability distribution assigns probabilities to each possible outcome of a random variable
  • Discrete probability distributions deal with countable outcomes (number of heads in 10 coin flips)
  • Continuous probability distributions deal with uncountable outcomes on a continuous scale (height, weight)
  • Probability density functions (PDFs) describe the relative likelihood (density) of a continuous random variable near a given value, not the probability of that exact value (see the sketch after this list)
    • The area under the PDF curve between two points represents the probability of the variable falling within that range
  • Cumulative distribution functions (CDFs) give the probability that a random variable is less than or equal to a given value
  • Expected value (mean) is the average value of a random variable over many trials, weighted by probability
  • Variance measures the average squared deviation from the mean, indicating the spread of the distribution
  • Standard deviation is the square root of variance and has the same units as the random variable
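
As a quick illustration of PDFs, CDFs, expected value, and variance, here is a short sketch using scipy.stats (assumed to be installed); the normal distribution with mean 50 and standard deviation 5 is just an example choice:

```python
# A minimal sketch of a PDF, CDF, mean, variance, and standard deviation
# using a frozen normal distribution from scipy.stats.
from scipy import stats

X = stats.norm(loc=50, scale=5)      # example: mean 50, standard deviation 5

print(X.pdf(50))                     # density at x = 50 (not a probability by itself)
print(X.cdf(55) - X.cdf(45))         # P(45 <= X <= 55), area under the PDF, ≈ 0.6827
print(X.mean(), X.var(), X.std())    # 50.0, 25.0, 5.0
```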

Types of Random Variables

  • Discrete random variables have countable outcomes (number of defective items in a batch)
    • Examples include Bernoulli, binomial, Poisson, and geometric distributions
  • Continuous random variables have uncountable outcomes on a continuous scale (time until failure)
    • Examples include uniform, normal (Gaussian), exponential, and beta distributions
  • Mixed random variables have both discrete and continuous components (daily rainfall, which is exactly zero with positive probability and otherwise takes a continuous positive value)
  • Independent random variables do not influence each other's probabilities or outcomes
  • Identically distributed random variables follow the same probability distribution
  • Joint probability distributions describe the likelihood of multiple random variables taking on specific values simultaneously
  • Marginal probability distributions are derived from joint distributions by summing or integrating over the other variables (see the sketch after this list)
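
The sketch below shows how marginal and conditional distributions fall out of a joint distribution, using NumPy (assumed available) and a small made-up joint probability table:

```python
# A sketch of joint, marginal, and conditional distributions with NumPy.
# The joint probability table below is a made-up example (rows: X, columns: Y).
import numpy as np

joint = np.array([[0.10, 0.05, 0.15],    # X = 0
                  [0.20, 0.25, 0.25]])   # X = 1

marginal_x = joint.sum(axis=1)               # sum over Y -> P(X) = [0.3, 0.7]
marginal_y = joint.sum(axis=0)               # sum over X -> P(Y) = [0.3, 0.3, 0.4]
cond_y_given_x1 = joint[1] / marginal_x[1]   # P(Y | X = 1)

print(marginal_x, marginal_y, cond_y_given_x1)
```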

Common Probability Distributions

  • Bernoulli distribution models a single binary outcome (success or failure) with probability p
  • Binomial distribution models the number of successes in n independent Bernoulli trials with probability p
    • Mean: np, Variance: np(1-p) (these and a few of the other formulas below are checked numerically in the sketch after this list)
  • Poisson distribution models the number of rare events occurring in a fixed interval with average rate λ
    • Mean and variance: λ
  • Geometric distribution models the number of trials until the first success with probability p
    • Mean: 1/p, Variance: (1-p)/p^2
  • Uniform distribution assigns equal probability density to all values between a minimum (a) and maximum (b)
    • Mean: (a+b)/2, Variance: (b-a)^2/12
  • Normal (Gaussian) distribution is symmetric and bell-shaped, characterized by mean μ and standard deviation σ
    • 68-95-99.7 rule: 68% of data within 1σ, 95% within 2σ, 99.7% within 3σ
  • Exponential distribution models the time between events in a Poisson process with rate λ
    • Mean: 1/λ, Variance: 1/λ^2
  • Beta distribution is flexible and models probabilities, proportions, or percentages
    • Shape determined by two positive parameters, α and β
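
Here is a brief numerical check of a few of the mean and variance formulas above, using scipy.stats (assumed installed); the parameter values are arbitrary examples, and note that SciPy parameterizes the exponential distribution by scale = 1/λ:

```python
# A quick sketch checking listed mean/variance formulas against scipy.stats.
from scipy import stats

n, p = 20, 0.3          # hypothetical binomial parameters
lam = 4.0               # hypothetical Poisson / exponential rate

binom = stats.binom(n, p)
print(binom.mean(), n * p)             # both 6.0
print(binom.var(), n * p * (1 - p))    # both 4.2

poisson = stats.poisson(lam)
print(poisson.mean(), poisson.var())   # both 4.0 (mean = variance = λ)

expo = stats.expon(scale=1 / lam)      # SciPy uses scale = 1/λ
print(expo.mean(), expo.var())         # 0.25, 0.0625  (1/λ, 1/λ²)
```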

Properties of Distributions

  • Skewness measures the asymmetry of a distribution
    • Positive skew: tail on the right side, mean > median
    • Negative skew: tail on the left side, mean < median
  • Kurtosis measures the tailedness and peakedness of a distribution compared to a normal distribution
    • Leptokurtic: heavy tails and a sharp peak (positive excess kurtosis)
    • Platykurtic: light tails and a flatter peak (negative excess kurtosis)
  • Moments characterize various aspects of a distribution
    • First moment: mean (central tendency)
    • Second central moment: variance (spread)
    • Third standardized moment: skewness (asymmetry)
    • Fourth standardized moment: kurtosis (tailedness and peakedness)
  • Moment-generating functions (MGFs) uniquely characterize a distribution and facilitate calculation of moments
  • Probability-generating functions (PGFs) serve a similar purpose for discrete distributions
  • Transformations of random variables lead to new distributions with different properties
    • Linear transformations Y = aX + b shift the mean to aμ + b and scale the variance to a²σ² (see the sketch after this list)
    • Nonlinear transformations can change the shape of the distribution
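
The following sketch (assuming NumPy and SciPy are available) estimates skewness and excess kurtosis from a simulated right-skewed sample and checks how a linear transformation changes the mean and variance; the exponential sample and the constants a and b are made-up examples:

```python
# A minimal sketch of sample skewness, excess kurtosis, and a linear transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100_000)   # right-skewed example data

print(stats.skew(sample))        # ≈ 2 for an exponential (positive skew)
print(stats.kurtosis(sample))    # excess kurtosis, ≈ 6 (leptokurtic)

# Linear transformation Y = a*X + b: mean becomes a*mean + b, variance a² * variance
a, b = 3.0, 10.0
y = a * sample + b
print(y.mean(), a * sample.mean() + b)   # approximately equal
print(y.var(), a**2 * sample.var())      # approximately equal
```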

Calculating Probabilities

  • For discrete distributions, probabilities are summed across the desired outcomes
    • P(X = x) for a specific value, P(a ≤ X ≤ b) for a range of values
  • For continuous distributions, probabilities are determined by integrating the PDF over the desired range
    • P(a ≤ X ≤ b) = ∫[a to b] f(x) dx, where f(x) is the PDF
  • Standardization (z-score) transforms a random variable to have mean 0 and standard deviation 1
    • z = (X - μ) / σ, where X is the original variable, μ is the mean, and σ is the standard deviation
    • Allows for comparison and calculation of probabilities across different normal distributions
  • Quantile functions (inverse CDFs) determine the value corresponding to a given cumulative probability
    • For example, the 0.95 quantile is the value below which 95% of the data falls
  • Monte Carlo simulation generates random samples from a distribution to estimate probabilities and quantities of interest
  • Central Limit Theorem (CLT) states that the sum or average of many independent, identically distributed random variables with finite variance is approximately normally distributed, regardless of the original distribution (see the sketch after this list)
    • Enables inference and hypothesis testing for large samples
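
Here is a small sketch (assuming NumPy and SciPy are installed) that ties these ideas together: a z-based interval probability, a quantile lookup, and a Monte Carlo illustration of the CLT starting from a skewed exponential distribution:

```python
# A sketch of standardization, quantile functions, and a Monte Carlo CLT check.
import numpy as np
from scipy import stats

# Standardization: P(-1 <= Z <= 1) recovers the "68" of the 68-95-99.7 rule
print(stats.norm.cdf(1) - stats.norm.cdf(-1))    # ≈ 0.6827

# Quantile function (inverse CDF): the 0.95 quantile of the standard normal
print(stats.norm.ppf(0.95))                      # ≈ 1.645

# CLT via Monte Carlo: means of 50 draws from a skewed exponential
# distribution cluster approximately normally around the true mean of 1
rng = np.random.default_rng(1)
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)
print(sample_means.mean())                       # ≈ 1.0
print(sample_means.std())                        # ≈ 1 / sqrt(50) ≈ 0.141
```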

Applications in Data Science

  • Probability distributions model real-world phenomena and uncertainties
    • Examples: customer arrival times, product defects, stock price fluctuations
  • Bayesian inference updates probabilities based on observed data to make predictions and decisions
    • Posterior distribution ∝ Likelihood × Prior distribution
  • Hypothesis testing uses probability to determine the significance of findings and make data-driven decisions
    • Null hypothesis (H0) assumes no effect or difference
    • Alternative hypothesis (H1) represents the claim or effect of interest
    • p-value: probability of observing the data or more extreme results under the null hypothesis
  • Confidence intervals provide a range of plausible values for a population parameter with a specified level of confidence
  • A/B testing compares two versions of a product or service to determine which performs better based on a metric of interest (see the sketch after this list)
  • Anomaly detection identifies rare or unusual events that deviate from the expected distribution
  • Probabilistic graphical models (Bayesian networks, Markov random fields) represent dependencies among random variables for inference and prediction
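
As one concrete example, the sketch below runs a simple A/B comparison of two conversion rates using a chi-square test of independence from scipy.stats; the conversion counts are made-up numbers, and the chi-square test is just one of several reasonable choices for this kind of comparison:

```python
# A hedged sketch of a simple A/B test on made-up conversion counts.
from scipy import stats

# rows: variant A, variant B; columns: converted, did not convert
table = [[120, 880],    # A: 12.0% conversion out of 1000 users
         [150, 850]]    # B: 15.0% conversion out of 1000 users

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(p_value)   # ≈ 0.06 here; small p-values are evidence against H0 ("no difference")
```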

Practice Problems and Examples

  1. A fair die is rolled three times. What is the probability of getting a 6 on all three rolls?

    • Sample space: S = {(1,1,1), (1,1,2), ..., (6,6,6)}, |S| = 6^3 = 216
    • Event E = {(6,6,6)}, |E| = 1
    • P(E) = |E| / |S| = 1 / 216 = 0.00463
  2. The time between customer arrivals at a store follows an exponential distribution with an average of 10 minutes. What is the probability that the next customer arrives within 5 minutes?

    • Exponential distribution with rate λ = 1/10 customers per minute
    • CDF: F(x) = 1 - e^(-λx)
    • P(X ≤ 5) = F(5) = 1 - e^(-(1/10) × 5) = 1 - e^(-0.5) = 0.3935
  3. The weights of apples harvested from a tree follow a normal distribution with a mean of 150 grams and a standard deviation of 20 grams. What is the probability that a randomly selected apple weighs between 140 and 160 grams?

    • Standardize the range: z1 = (140 - 150) / 20 = -0.5, z2 = (160 - 150) / 20 = 0.5
    • Using a standard normal table or calculator: P(-0.5 ≤ Z ≤ 0.5) = P(Z ≤ 0.5) - P(Z ≤ -0.5) = 0.6915 - 0.3085 = 0.3829
  4. A machine produces defective items with a probability of 0.02. If 100 items are produced, what is the probability that exactly 3 items are defective?

    • Binomial distribution with n = 100, p = 0.02
    • P(X = 3) = C(100, 3) × (0.02)^3 × (0.98)^97 ≈ 0.1823 (problems 2–5 are re-checked numerically in the sketch after problem 5)
  5. The joint probability distribution of two discrete random variables X and Y is given by:

    X \ Y    Y = 0    Y = 1
    X = 0     0.1      0.2
    X = 1     0.3      0.4

    Calculate the marginal probability distribution of X and the conditional probability P(Y = 1 | X = 0).

    • Marginal probability of X: P(X = 0) = 0.1 + 0.2 = 0.3, P(X = 1) = 0.3 + 0.4 = 0.7
    • Conditional probability: P(Y = 1 | X = 0) = P(X = 0, Y = 1) / P(X = 0) = 0.2 / 0.3 = 0.6667
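
The practice problems above can be re-checked numerically; here is a short sketch (assuming NumPy and SciPy are installed) that reproduces each answer:

```python
# A sketch that re-checks the practice problems above numerically.
import numpy as np
from scipy import stats

# 1. Three sixes in three fair die rolls
print((1 / 6) ** 3)                               # ≈ 0.00463

# 2. Exponential inter-arrival times, mean 10 minutes (rate λ = 1/10)
print(stats.expon(scale=10).cdf(5))               # ≈ 0.3935

# 3. Normal weights, μ = 150, σ = 20: P(140 <= X <= 160)
X = stats.norm(loc=150, scale=20)
print(X.cdf(160) - X.cdf(140))                    # ≈ 0.3829

# 4. Binomial, n = 100, p = 0.02: P(exactly 3 defective)
print(stats.binom(100, 0.02).pmf(3))              # ≈ 0.1823

# 5. Joint table: marginal of X and P(Y = 1 | X = 0)
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])
print(joint.sum(axis=1))                          # P(X): [0.3, 0.7]
print(joint[0, 1] / joint[0].sum())               # P(Y = 1 | X = 0) ≈ 0.6667
```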

