📊Intro to Business Analytics Unit 3 – Probability Distributions & Sampling
Probability distributions and sampling techniques form the backbone of statistical analysis in business analytics. These concepts help analysts model random events, make predictions, and draw insights from data, enabling informed decision-making across various business domains.
From normal distributions in quality control to Poisson processes in customer arrivals, understanding these tools is crucial. Sound sampling methods help keep collected data representative of the population, while statistical tests and visualizations aid in interpreting results and communicating findings effectively to stakeholders.
Probability distributions describe the likelihood of different outcomes in a random experiment or process
Random variables can be discrete (distinct values) or continuous (any value within a range)
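Probability mass functions (PMFs) define discrete distributions and probability density functions (PDFs) define continuous ones; cumulative distribution functions (CDFs) give the probability that a random variable is less than or equal to a given value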
Expected value represents the average outcome of a random variable over many trials
Variance and standard deviation measure the spread or dispersion of a distribution
Higher variance indicates greater variability in the possible outcomes
Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the observations are independent and the population has a finite variance
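The Central Limit Theorem can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the sample sizes, the number of repetitions, and the function name sample_means are all assumptions chosen for the demonstration, not part of the unit material.

```python
import random
import statistics


def sample_means(sample_size, n_samples=2000, seed=42):
    """Draw repeated samples from a skewed population and return their means.

    The population is exponential with mean 1 (an assumed example of a
    strongly non-normal distribution).
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


# As the sample size grows, the means cluster more tightly around the
# population mean (1.0), with spread close to 1 / sqrt(sample_size).
for n in (2, 30, 200):
    means = sample_means(n)
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

The shrinking standard deviation of the sample means is the standard error; a histogram of each list would also look progressively more bell-shaped.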
Types of Probability Distributions
Normal (Gaussian) distribution is symmetric and bell-shaped, characterized by its mean and standard deviation
Useful for modeling many natural phenomena and averages of large samples
Binomial distribution describes the number of successes in a fixed number of independent trials, each with two possible outcomes (success or failure) and the same probability of success
Poisson distribution models the number of events occurring in a fixed interval of time or space, given a known average rate and events that occur independently of one another
Exponential distribution represents the time between events in a Poisson process
Uniform distribution has constant probability over a defined range and zero probability outside that range
Other notable distributions include Beta, Gamma, and Chi-square, each with specific applications
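The distributions listed above can be evaluated directly from their textbook formulas. The following is a minimal sketch using only the Python standard library; the helper names (binomial_pmf, poisson_pmf, exponential_cdf, uniform_cdf) and the example parameters are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(X = k) successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(X = k) events in an interval whose average event count is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def exponential_cdf(t, rate):
    """P(T <= t) for the waiting time between events arriving at `rate` per unit time."""
    return 1 - math.exp(-rate * t)


def uniform_cdf(x, low, high):
    """P(X <= x) for a uniform distribution on [low, high]."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


standard_normal = NormalDist(mu=0, sigma=1)

print(standard_normal.cdf(1.96))    # roughly 0.975
print(binomial_pmf(3, 10, 0.5))     # 120 / 1024 = 0.1171875
print(poisson_pmf(0, 2.0))          # e^-2, roughly 0.135
print(exponential_cdf(1.0, 2.0))    # 1 - e^-2, roughly 0.865
print(uniform_cdf(2.5, 0, 10))      # 0.25
```

Note that the probability of zero Poisson events in one interval and the exponential probability of waiting longer than that interval are the same quantity, which is the link between the two distributions.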
Measures of Central Tendency and Dispersion
Mean is the arithmetic average of a dataset, calculated by summing all values and dividing by the number of observations
Median is the middle value when a dataset is ordered from lowest to highest
Robust to outliers and useful for skewed distributions
Mode is the most frequently occurring value in a dataset
Range is the difference between the maximum and minimum values
Interquartile range (IQR) is the difference between the 75th and 25th percentiles, representing the middle 50% of the data
Variance measures the average squared deviation from the mean, so larger values indicate data spread farther from the average; it is expressed in squared units, and the sample variance divides by n − 1 rather than n
Standard deviation is the square root of the variance, expressed in the same units as the original data
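These measures can all be computed with Python's statistics module. The dataset below is hypothetical (daily order counts with one deliberate outlier), chosen to show how the mean and the range react to an extreme value while the median and IQR do not.

```python
import statistics

# Hypothetical daily order counts; 95 is a deliberate outlier.
data = [12, 15, 15, 18, 21, 24, 95]

mean = statistics.mean(data)        # pulled upward by the outlier
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value
data_range = max(data) - min(data)

# Quartiles; the "inclusive" method treats the data as the whole population.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

pop_variance = statistics.pvariance(data)     # divides by n
sample_variance = statistics.variance(data)   # divides by n - 1
pop_std_dev = statistics.pstdev(data)         # square root of pop_variance

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={data_range} IQR={iqr}")
print(f"variance={pop_variance:.2f} std dev={pop_std_dev:.2f}")
```

With this data the mean (about 28.6) sits well above the median (18), which is the usual signature of a right-skewed distribution or a high outlier.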
Sampling Techniques
Simple random sampling selects a subset of individuals from a population such that each individual has an equal probability of being chosen
Stratified sampling divides the population into subgroups (strata) based on a specific characteristic and then randomly samples from each stratum
Ensures representation of key subgroups in the sample
Cluster sampling divides the population into clusters, randomly selects a subset of clusters, and then samples all individuals within those clusters
Useful when a complete list of the population is not available or when clusters occur naturally (e.g., geographic regions)
Systematic sampling selects individuals at regular intervals from a population list
Convenience sampling selects individuals who are easily accessible or willing to participate, but may introduce bias
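The four probability-based techniques above can be sketched in a few lines. Everything specific here is an assumption for illustration: the 1,000-customer population, the region attribute used as stratum and cluster label, the proportional allocation in the stratified sampler, and the function names.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 customers, each tagged with a region.
population = [
    {"id": i, "region": rng.choice(["North", "South", "East", "West"])}
    for i in range(1000)
]


def simple_random_sample(pop, k):
    """Every individual has the same chance of selection."""
    return rng.sample(pop, k)


def stratified_sample(pop, key, fraction):
    """Sample the same fraction from every stratum (proportional allocation)."""
    strata = defaultdict(list)
    for unit in pop:
        strata[unit[key]].append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(pop, key, n_clusters):
    """Randomly pick whole clusters and keep everyone inside them."""
    clusters = defaultdict(list)
    for unit in pop:
        clusters[unit[key]].append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


def systematic_sample(pop, step):
    """Take every `step`-th individual after a random starting point."""
    start = rng.randrange(step)
    return pop[start::step]


print(len(simple_random_sample(population, 50)))
print(len(stratified_sample(population, "region", 0.1)))
print(len(cluster_sample(population, "region", 2)))
print(len(systematic_sample(population, 10)))
```

Convenience sampling is omitted because it has no random selection step to implement, which is precisely the source of its bias.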
Probability Distribution Applications
Quality control uses the normal distribution to set acceptable ranges for product specifications
Finance employs various distributions to model asset returns, portfolio risk, and option pricing
Marketing may use the binomial distribution to analyze the success of a campaign or product launch, for example the number of conversions among a fixed number of contacted customers
Operations management can apply the Poisson distribution to model the number of customer arrivals or machine failures in a given time period
Exponential distribution is often used to model waiting times, such as the time between customer arrivals or the time between equipment failures
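The connection between the Poisson and exponential applications can be seen by simulation: if the gaps between arrivals are exponential, the number of arrivals per period behaves like a Poisson count. The rate of 12 arrivals per hour, the number of simulated hours, and the function name simulate_arrivals are assumptions for this sketch.

```python
import random
import statistics


def simulate_arrivals(rate_per_hour, hours, rng):
    """Count arrivals within `hours` by accumulating exponential gaps between events."""
    elapsed = 0.0
    count = 0
    while True:
        elapsed += rng.expovariate(rate_per_hour)  # time until the next arrival
        if elapsed > hours:
            return count
        count += 1


rng = random.Random(7)
counts = [simulate_arrivals(rate_per_hour=12, hours=1, rng=rng) for _ in range(5000)]

# For a Poisson count, the mean and the variance both equal the rate.
print(round(statistics.fmean(counts), 2), round(statistics.pvariance(counts), 2))
```

Both printed values should land near 12, which is a quick sanity check an analyst can run before relying on a Poisson model for staffing or maintenance planning.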
Common Statistical Tests
Z-test compares a sample mean to a population mean when the population standard deviation is known
T-test compares means between two groups or against a hypothesized value when the population standard deviation is unknown
ANOVA (Analysis of Variance) tests for differences among three or more group means
Chi-square test assesses the association between two categorical variables
Regression analysis examines the relationship between a dependent variable and one or more independent variables
Linear regression assumes a linear relationship between variables
Logistic regression predicts binary outcomes based on predictor variables
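As one concrete case from the list above, a two-sided one-sample z-test can be written with the standard library alone. The measurements, the hypothesized mean of 10.0, the known standard deviation of 0.1, and the function name one_sample_z_test are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist, fmean


def one_sample_z_test(sample, pop_mean, pop_sd):
    """Two-sided z-test of a sample mean against a hypothesized population mean.

    Assumes the population standard deviation `pop_sd` is known.
    """
    n = len(sample)
    standard_error = pop_sd / math.sqrt(n)
    z = (fmean(sample) - pop_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical measurements (in mm) from a production run.
sample = [10.12, 10.05, 9.98, 10.20, 10.15, 10.08, 10.11, 10.03, 10.17, 10.09]

z, p_value = one_sample_z_test(sample, pop_mean=10.0, pop_sd=0.1)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Here the sample mean is 10.098, giving z of about 3.10 and a p-value below 0.05, so the hypothesis that the process mean is still 10.0 would be rejected at the 5% level. When the population standard deviation is unknown, the same structure applies with the sample standard deviation and the t distribution, which the standard library does not provide.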
Data Visualization for Distributions
Histogram displays the frequency distribution of a continuous variable using bins
Shape, center, and spread of the distribution can be easily observed
Box plot (box-and-whisker plot) displays the five-number summary (minimum, Q1, median, Q3, maximum) of a distribution
Useful for comparing distributions across groups
Probability plot (Q-Q plot) assesses if a dataset follows a specific theoretical distribution by plotting the quantiles of the data against the quantiles of the theoretical distribution
Points falling along a straight line indicate a good fit
Cumulative frequency plot shows the cumulative proportion or percentage of observations less than or equal to each value
Violin plot combines a box plot with a kernel density plot to display the distribution shape and summary statistics
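The numbers behind a box plot and a histogram can be computed before any chart is drawn. This is a rough sketch with no plotting library: the simulated data (500 normal values with mean 50 and standard deviation 10), the bin width, and the helper names five_number_summary and histogram_counts are assumptions, and the text bars stand in for a real chart.

```python
import math
import random
import statistics
from collections import Counter

rng = random.Random(1)
data = [rng.gauss(50, 10) for _ in range(500)]  # hypothetical measurements


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the values a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def histogram_counts(values, bin_width):
    """Count observations per bin, keyed by each bin's lower edge."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)


print([round(x, 1) for x in five_number_summary(data)])

# Crude text histogram: one '#' per 5 observations in the bin.
for edge, count in sorted(histogram_counts(data, 10).items()):
    print(f"{edge:>4} | {'#' * (count // 5)}")
```

A plotting library would render the same quantities graphically; computing them first makes it easier to verify that the chart shows what is intended.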
Real-world Business Examples
A manufacturing company monitors the diameters of ball bearings produced, expecting the diameters to follow a normal distribution with a mean of 10 mm and a standard deviation of 0.1 mm
Quality control limits are set at ±3 standard deviations from the mean, that is, from 9.7 mm to 10.3 mm, a range that covers about 99.7% of output when the process is in control
An e-commerce retailer analyzes the number of daily website visits, which follows a Poisson distribution with an average of 1,000 visits per day
This information helps plan server capacity and customer support staffing
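A financial institution assesses the risk of its loan portfolio by treating each loan as a trial that either defaults or does not, and modeling the number of defaults with a binomial distribution (which assumes independent loans with a common default probability)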
The institution can then determine the appropriate interest rates and reserve requirements
A market research firm conducts a survey using stratified sampling based on age groups to ensure adequate representation of each age category in the sample
Results are then weighted to reflect the age distribution of the target population
A hospital manages patient wait times, which are modeled using an exponential distribution with an average wait of 30 minutes
This information is used to optimize staffing levels and improve patient satisfaction
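Several of the figures in these examples can be turned into concrete quantities. The sketch below uses only the parameters stated above (10 mm and 0.1 mm, 1,000 visits per day, a 30-minute average wait); the 99% planning threshold, the 60-minute cutoff, and the normal approximation to the Poisson distribution are assumptions added for illustration.

```python
import math
from statistics import NormalDist

# Ball bearings: control limits at mean +/- 3 standard deviations.
bearings = NormalDist(mu=10.0, sigma=0.1)
lower_limit = bearings.mean - 3 * bearings.stdev   # 9.7 mm
upper_limit = bearings.mean + 3 * bearings.stdev   # 10.3 mm
share_within = bearings.cdf(upper_limit) - bearings.cdf(lower_limit)

# Website visits: a Poisson count with mean 1,000 has standard deviation
# sqrt(1000); for a mean this large a normal approximation is reasonable.
visits = NormalDist(mu=1000, sigma=math.sqrt(1000))
visits_99th = visits.inv_cdf(0.99)   # level exceeded on about 1% of days

# Hospital waits: exponential with mean 30 minutes, so
# P(wait > t) = exp(-t / 30).
p_wait_over_hour = math.exp(-60 / 30)

print(f"limits: {lower_limit:.1f} to {upper_limit:.1f} mm, "
      f"share within: {share_within:.4f}")
print(f"99th percentile of daily visits: about {visits_99th:.0f}")
print(f"P(wait > 60 min): {p_wait_over_hour:.3f}")
```

Under these assumptions, roughly 99.7% of bearings fall inside the limits, server capacity for about 1,074 daily visits would cover 99% of days, and about 13.5% of patients would wait longer than an hour.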