🎳 Intro to Econometrics Unit 1 – Probability & Statistics Fundamentals
Probability and statistics form the foundation of econometrics, providing tools to analyze and interpret data. This unit covers key concepts like probability distributions, random variables, and descriptive statistics, essential for understanding economic phenomena and making informed decisions.
Students will learn about hypothesis testing, confidence intervals, and regression analysis. These techniques allow economists to draw inferences from sample data, estimate relationships between variables, and test economic theories using empirical evidence.
Probability measures the likelihood of an event occurring, expressed as a number between 0 and 1
0 indicates an impossible event, while 1 represents a certain event
Random variables assign numerical values to outcomes of a random experiment
Discrete random variables have countable outcomes (e.g., the number of defective items in a batch)
Continuous random variables can take any value within a range (e.g., the height of students in a class)
Probability distributions describe the likelihood of different outcomes for a random variable
Probability mass functions (PMFs) define probability distributions for discrete random variables
Probability density functions (PDFs) define probability distributions for continuous random variables
Expected value represents the average outcome of a random variable over a large number of trials; for a discrete variable it is calculated as the sum of each outcome multiplied by its probability
Variance and standard deviation measure the spread or dispersion of a probability distribution
Variance is the average squared deviation from the mean, denoted as σ²
Standard deviation is the square root of variance, denoted as σ
Covariance and correlation measure the relationship between two random variables
Covariance indicates the direction of the linear relationship (positive, negative, or zero)
Correlation is a standardized measure of the linear relationship, ranging from -1 to 1
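As a quick illustration (a minimal sketch, not part of the unit; the die example and simulated data below are assumptions), these quantities can be computed in Python with NumPy:

```python
import numpy as np

# Discrete random variable: a fair six-sided die
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1 / 6)

expected_value = np.sum(outcomes * probs)            # E[X] = sum of each outcome times its probability
variance = np.sum((outcomes - expected_value) ** 2 * probs)
std_dev = np.sqrt(variance)
print(expected_value, variance, std_dev)             # 3.5, ~2.917, ~1.708

# Covariance and correlation from simulated sample data
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)                    # y depends positively on x
print(np.cov(x, y)[0, 1])                            # covariance: direction of the linear relationship
print(np.corrcoef(x, y)[0, 1])                       # correlation: standardized, between -1 and 1
```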
Probability Basics
The law of large numbers states that as the number of trials increases, the average of the results will converge to the expected value
Conditional probability is the probability of an event A occurring, given that event B has already occurred, denoted as P(A∣B)
The multiplication rule states that the probability of two events A and B occurring together is the product of the probability of A and the conditional probability of B given A, expressed as P(A∩B)=P(A)×P(B∣A)
Independent events have no influence on each other's probability
For independent events A and B, P(A∣B)=P(A) and P(B∣A)=P(B)
The probability of independent events occurring together is the product of their individual probabilities, P(A∩B)=P(A)×P(B)
Mutually exclusive events cannot occur at the same time
The probability that either of two mutually exclusive events A or B occurs is the sum of their individual probabilities, P(A∪B)=P(A)+P(B)
The complement of an event A is the event that A does not occur; its probability is P(A′)=1−P(A)
Bayes' theorem describes the probability of an event based on prior knowledge and new evidence, expressed as P(A∣B)=P(B∣A)×P(A)/P(B)
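A hedged numerical sketch of Bayes' theorem using a made-up disease-testing example (all probabilities below are illustrative assumptions):

```python
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.95  # P(B|A): probability of a positive test given disease
p_pos_given_healthy = 0.05  # false-positive rate

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior probability of disease given a positive test, P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161, much lower than the test's accuracy suggests
```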
Types of Distributions
Bernoulli distribution models a single trial with two possible outcomes (success or failure)
The probability of success is denoted as p, and the probability of failure is 1−p
Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials
Characterized by the number of trials n and the probability of success p
Poisson distribution models the number of rare events occurring in a fixed interval of time or space
Characterized by the average rate of occurrence λ
Normal (Gaussian) distribution is a continuous probability distribution with a bell-shaped curve
Characterized by its mean μ and standard deviation σ
The standard normal distribution has a mean of 0 and a standard deviation of 1
Uniform distribution has equal probability for all outcomes within a given range
Discrete uniform distribution has a fixed number of equally likely outcomes
Continuous uniform distribution assigns equal probability density to every value within a range
Exponential distribution models the time between events in a Poisson process
Characterized by the rate parameter λ, which is the inverse of the mean
Student's t-distribution is similar to the normal distribution but with heavier tails, used when the sample size is small or the population standard deviation is unknown
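A short sketch showing how several of these distributions can be evaluated with scipy.stats (the parameter values are illustrative assumptions):

```python
from scipy import stats

# Binomial: P(exactly 3 successes in 10 trials with p = 0.5)
print(stats.binom.pmf(3, n=10, p=0.5))

# Poisson: P(2 events in an interval with average rate lambda = 4)
print(stats.poisson.pmf(2, mu=4))

# Normal: density at the mean, and P(Z <= 1.96) for the standard normal
print(stats.norm.pdf(0, loc=0, scale=1))
print(stats.norm.cdf(1.96))            # ~0.975

# Exponential: mean waiting time is 1/lambda; SciPy parameterizes by scale = 1/lambda
print(stats.expon.mean(scale=1 / 4))   # 0.25

# Student's t: heavier tails than the normal when degrees of freedom are small
print(stats.t.ppf(0.975, df=5))        # critical value larger than 1.96
```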
Descriptive Statistics
Measures of central tendency describe the center or typical value of a dataset
Mean is the arithmetic average of all values in a dataset
Median is the middle value when the dataset is ordered from lowest to highest
Mode is the most frequently occurring value in a dataset
Measures of dispersion describe the spread or variability of a dataset
Range is the difference between the maximum and minimum values
Interquartile range (IQR) is the difference between the third and first quartiles (Q3 − Q1)
Variance and standard deviation measure the spread of data points around the mean (variance is the average squared deviation; standard deviation is its square root)
Skewness measures the asymmetry of a distribution
Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail
Kurtosis measures the heaviness of the tails and peakedness of a distribution compared to a normal distribution
Leptokurtic distributions have heavier tails and a higher peak than a normal distribution
Platykurtic distributions have lighter tails and a lower peak than a normal distribution
Percentiles and quartiles divide a dataset into equal parts
Percentiles divide a dataset into 100 equal parts
Quartiles divide a dataset into four equal parts (Q1, Q2 or median, Q3)
Boxplots visually represent the five-number summary of a dataset (minimum, Q1, median, Q3, maximum)
Outliers are data points that fall beyond the whiskers of a boxplot (commonly more than 1.5 × IQR below Q1 or above Q3)
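A minimal sketch computing these descriptive statistics with NumPy and SciPy on a small made-up dataset:

```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 4, 5, 5, 5, 6, 7, 30])     # 30 is a deliberate outlier

print(np.mean(data), np.median(data))                 # the mean is pulled up by the outlier
values, counts = np.unique(data, return_counts=True)
print(values[np.argmax(counts)])                      # mode: most frequent value (5)

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3, q3 - q1)                            # quartiles and interquartile range

print(np.ptp(data))                                   # range (max - min)
print(np.var(data, ddof=1), np.std(data, ddof=1))     # sample variance and standard deviation
print(stats.skew(data), stats.kurtosis(data))         # asymmetry and tail heaviness

# Five-number summary, as drawn in a boxplot
print(data.min(), q1, q2, q3, data.max())
```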
Inferential Statistics
Hypothesis testing is a statistical method for making decisions based on sample data
The null hypothesis (H0) represents the status quo or no effect
The alternative hypothesis (Ha or H1) represents the claim or effect being tested
Type I error (false positive) occurs when rejecting a true null hypothesis
The significance level α is the probability of making a Type I error
Type II error (false negative) occurs when failing to reject a false null hypothesis
The power of a test is the probability of correctly rejecting a false null hypothesis
Confidence intervals estimate the range of values that likely contain the true population parameter
The confidence level (e.g., 95%) is the proportion of such intervals that would contain the true value over repeated sampling
p-values measure the strength of evidence against the null hypothesis
A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis
t-tests compare means between two groups or a sample mean to a known population mean
Independent samples t-test compares means between two independent groups
Paired samples t-test compares means between two related groups or measurements
One-sample t-test compares a sample mean to a known population mean
Analysis of Variance (ANOVA) tests for differences in means among three or more groups
One-way ANOVA compares means across one categorical variable
Two-way ANOVA compares means across two categorical variables and their interaction
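A hedged sketch of these tests using scipy.stats on simulated data (group sizes, means, and effect sizes are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.0, size=40)
group_c = rng.normal(loc=6.0, scale=1.0, size=40)

# One-sample t-test: is the mean of group_a different from 5?
print(stats.ttest_1samp(group_a, popmean=5.0))

# Independent samples t-test: do group_a and group_b differ in mean?
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(t_stat, p_val)                       # reject H0 if p_val < alpha (e.g., 0.05)

# Paired samples t-test: before/after measurements on the same units
after = group_a + rng.normal(loc=0.3, scale=0.5, size=40)
print(stats.ttest_rel(group_a, after))

# One-way ANOVA across three groups
print(stats.f_oneway(group_a, group_b, group_c))

# 95% confidence interval for the mean of group_a
mean, se = np.mean(group_a), stats.sem(group_a)
print(stats.t.interval(0.95, df=len(group_a) - 1, loc=mean, scale=se))
```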
Data Visualization Techniques
Scatter plots display the relationship between two continuous variables
Each data point represents an observation with its x and y coordinates
Line plots connect data points in order, typically used for time series data
Multiple line plots can be used to compare trends across different categories
Bar plots compare values across different categories
Vertical bar plots (column charts) are used for categories with no natural ordering
Horizontal bar plots are useful when category labels are long or numerous
Histograms display the distribution of a continuous variable
The x-axis is divided into bins, and the y-axis shows the frequency or count of observations in each bin
Pie charts show the proportion or percentage of each category in a whole
Best used when the number of categories is small and the proportions are clearly different
Heatmaps display values using color intensity, useful for visualizing patterns in matrices or tables
Box plots summarize the distribution of a continuous variable across different categories
They display the five-number summary and any outliers
Violin plots combine a box plot and a kernel density plot to show the distribution shape
Faceting (small multiples) creates multiple subplots based on one or more categorical variables, allowing for comparisons across subgroups
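A minimal plotting sketch with Matplotlib illustrating several of these chart types (the simulated data and figure layout are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)
groups = [rng.normal(loc=m, size=100) for m in (0, 1, 2)]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].scatter(x, y, s=10)                        # scatter plot: relationship between x and y
axes[0, 0].set_title("Scatter plot")

axes[0, 1].hist(x, bins=20)                           # histogram: distribution of x
axes[0, 1].set_title("Histogram")

axes[1, 0].plot(np.cumsum(rng.normal(size=100)))      # line plot: a simulated time series
axes[1, 0].set_title("Line plot")

axes[1, 1].boxplot(groups)                            # box plots across three categories
axes[1, 1].set_title("Box plots")

fig.tight_layout()
plt.show()
```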
Applications in Econometrics
Regression analysis models the relationship between a dependent variable and one or more independent variables
Simple linear regression models the relationship between a dependent variable and a single independent variable
Multiple linear regression models the relationship between a dependent variable and multiple independent variables
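A hedged sketch of simple and multiple linear regression with statsmodels on simulated data (the variable names and coefficients are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
education = rng.normal(loc=12, scale=2, size=n)
experience = rng.normal(loc=10, scale=5, size=n)
wage = 2.0 + 0.8 * education + 0.3 * experience + rng.normal(scale=1.0, size=n)

# Simple linear regression: wage on education only
X_simple = sm.add_constant(education)
print(sm.OLS(wage, X_simple).fit().params)

# Multiple linear regression: wage on education and experience
X_multi = sm.add_constant(np.column_stack([education, experience]))
results = sm.OLS(wage, X_multi).fit()
print(results.summary())
```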
Time series analysis studies data collected over time to identify trends, seasonality, and other patterns
Autoregressive (AR) models use past values of the variable to predict future values
Moving average (MA) models use past forecast errors to predict future values
Autoregressive integrated moving average (ARIMA) models combine AR and MA components and account for non-stationarity
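A minimal time-series sketch fitting an ARIMA model with statsmodels to a simulated AR(1) series (the simulated process and the chosen order are assumptions):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(4)
n = 300
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()   # simulate an AR(1) process

# Fit an ARIMA(1, 0, 1) model: one AR lag, no differencing, one MA lag
model = ARIMA(y, order=(1, 0, 1)).fit()
print(model.params)
print(model.forecast(steps=5))             # forecast the next five periods
```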
Panel data analysis studies data collected over time for multiple individuals, firms, or other entities
Fixed effects models control for unobserved, time-invariant individual characteristics
Random effects models assume individual-specific effects are uncorrelated with the independent variables
Instrumental variables (IV) estimation addresses endogeneity issues when independent variables are correlated with the error term
Valid instruments are correlated with the endogenous variable but not with the error term
Difference-in-differences (DID) estimation compares the change in outcomes between a treatment and control group before and after an intervention
Parallel trends assumption: the treatment and control groups would have followed the same trend in the absence of the intervention
Propensity score matching (PSM) estimates the effect of a treatment by comparing treated and untreated observations with similar propensity scores
The propensity score is the probability of receiving the treatment based on observed characteristics
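A hedged sketch of a difference-in-differences estimate, run as an OLS regression with an interaction term on simulated data (the treatment effect and variable names are assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 400
treated = rng.integers(0, 2, size=n)         # 1 = treatment group, 0 = control group
post = rng.integers(0, 2, size=n)            # 1 = after the intervention
effect = 2.0                                 # true treatment effect

outcome = (
    1.0
    + 0.5 * treated                          # level difference between groups
    + 1.0 * post                             # common time trend
    + effect * treated * post                # the DID effect of interest
    + rng.normal(size=n)
)

# The coefficient on the interaction term estimates the treatment effect
X = sm.add_constant(np.column_stack([treated, post, treated * post]))
results = sm.OLS(outcome, X).fit()
print(results.params)                        # the last coefficient should be near 2.0
```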
Common Pitfalls and Misconceptions
Correlation does not imply causation: a strong correlation between two variables does not necessarily mean that one causes the other
Confounding variables or reverse causality may explain the observed relationship
Outliers can heavily influence statistical measures and model results
It is important to identify and appropriately handle outliers based on the research context
Overfitting occurs when a model is too complex and fits the noise in the data rather than the underlying pattern
Overfitted models may have poor performance on new, unseen data
Underfitting occurs when a model is too simple and fails to capture the true relationship between variables
Underfitted models may have high bias and low variance
Multicollinearity arises when independent variables in a regression model are highly correlated with each other
Multicollinearity can lead to unstable coefficient estimates and difficulty in interpreting individual variable effects
Heteroscedasticity refers to the situation where the variance of the error term is not constant across observations
Heteroscedasticity can lead to biased standard errors and invalid inference
Autocorrelation occurs when the error terms in a time series or panel data model are correlated with each other
Autocorrelation can lead to biased standard errors and inefficient coefficient estimates
Simpson's paradox occurs when a trend or relationship observed in aggregated data disappears or reverses when the data is disaggregated by a confounding variable
It highlights the importance of considering subgroup effects and controlling for relevant variables
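A small simulation (not from the unit) showing Simpson's paradox: the correlation between x and y is positive within each subgroup but turns negative when the groups are pooled:

```python
import numpy as np

rng = np.random.default_rng(6)

groups = []
for offset in (0, 4, 8):                     # three subgroups with shifted means
    x = rng.normal(loc=offset, scale=1.0, size=200)
    y = 0.8 * x - 2.0 * offset + rng.normal(scale=0.5, size=200)
    groups.append((x, y))
    print("within-group corr:", round(np.corrcoef(x, y)[0, 1], 2))   # positive in each group

x_all = np.concatenate([g[0] for g in groups])
y_all = np.concatenate([g[1] for g in groups])
print("pooled corr:", round(np.corrcoef(x_all, y_all)[0, 1], 2))     # negative once aggregated
```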