Intro to Econometrics Unit 2 – Linear Regression: Simple and Multiple

Linear regression is a powerful statistical tool used to model relationships between variables. It estimates how changes in independent variables affect a dependent variable, making it useful for prediction and, under stronger assumptions, for causal inference. Simple linear regression involves one independent variable, while multiple regression uses two or more. Both methods minimize the sum of squared residuals to find the best-fitting line, but multiple regression allows for controlling confounding factors and examining individual effects.

What's Linear Regression?

  • Statistical method used to model the linear relationship between a dependent variable and one or more independent variables
  • Estimates the effect of changes in the independent variable(s) on the dependent variable
  • Represented by the equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
    • $y$: dependent variable
    • $\beta_0$: y-intercept (the value of $y$ when all independent variables are zero)
    • $\beta_1, \beta_2, \ldots, \beta_k$: coefficients giving the change in $y$ for a one-unit change in the corresponding independent variable, holding the others constant
    • $x_1, x_2, \ldots, x_k$: independent variables
    • $\varepsilon$: error term (captures the variation in $y$ not explained by the independent variables)
  • Minimizes the sum of squared residuals (differences between observed and predicted values) to find the best-fitting line (see the fitting sketch after this list)
  • Can be used for prediction, causal inference, and understanding the relationship between variables
  • Widely applied in various fields (economics, finance, social sciences, and more)
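
As a concrete illustration, here is a minimal fitting sketch in Python using statsmodels (one common choice of library); the data are synthetic and the coefficient values are invented purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y = 2 + 0.5*x1 - 1.0*x2 + noise (true betas invented for illustration)
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Stack the regressors and add a column of ones for the intercept (beta_0)
X = sm.add_constant(np.column_stack([x1, x2]))

# OLS picks the betas that minimize the sum of squared residuals
results = sm.OLS(y, X).fit()
print(results.params)    # estimates of beta_0, beta_1, beta_2
```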

Simple vs Multiple: What's the Diff?

  • Simple linear regression involves one independent variable, while multiple linear regression involves two or more independent variables
  • Simple linear regression equation: $y = \beta_0 + \beta_1 x + \varepsilon$
    • Only one slope coefficient ($\beta_1$) representing the effect of the independent variable on the dependent variable
  • Multiple linear regression equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
    • Multiple coefficients ($\beta_1, \beta_2, \ldots, \beta_k$) representing the effect of each independent variable on the dependent variable, holding the other variables constant
  • Multiple linear regression allows for controlling confounding factors and examining the individual effects of each independent variable (see the omitted-variable sketch after this list)
  • Simple linear regression is a special case of multiple linear regression with only one independent variable
  • Choice between simple and multiple regression depends on the research question and the number of relevant variables
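
To see why controlling for confounders matters, here is a minimal sketch comparing the two on the same synthetic data: when x1 and x2 are correlated and both affect y, the simple regression's slope absorbs part of x2's effect (all numbers invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)   # x1 is correlated with the confounder x2
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)

simple = sm.OLS(y, sm.add_constant(x1)).fit()
multiple = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# The simple slope lands well above the true 0.5 because x2 is omitted;
# the multiple regression recovers both effects, holding the other constant.
print(simple.params[1])
print(multiple.params[1], multiple.params[2])
```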

Key Assumptions and When They Break

  • Linearity: The relationship between the dependent and independent variables is linear
    • Violation: Non-linear relationships (exponential, logarithmic, etc.) can lead to biased estimates
  • Independence: Observations are independent of each other
    • Violation: Autocorrelation (correlation between observations over time or space) can lead to underestimated standard errors and invalid inference
  • Homoscedasticity: The variance of the error term is constant across all levels of the independent variables
    • Violation: Heteroscedasticity (non-constant variance) can lead to inefficient estimates and invalid inference
  • Normality: The error term is normally distributed with a mean of zero
    • Violation: Non-normal errors can affect the validity of hypothesis tests and confidence intervals, especially in small samples
  • No perfect multicollinearity: No independent variable is an exact linear combination of the others (and ideally the regressors are not highly correlated)
    • Violation: High multicollinearity inflates standard errors, producing unstable coefficient estimates and making individual effects hard to disentangle
  • Exogeneity: The error term is uncorrelated with the independent variables
    • Violation: Endogeneity (correlation between the error term and independent variables) can lead to biased and inconsistent estimates (several of these assumption checks are sketched in code after this list)
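
A minimal sketch of standard diagnostics for several of these assumptions, using statsmodels on a small synthetic model (the tests named are standard; the data and model are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Refit a small synthetic model (as in the earlier sketches) so there is something to diagnose
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=200)
results = sm.OLS(y, X).fit()

# Homoscedasticity: Breusch-Pagan test (small p-value suggests heteroscedasticity)
_, bp_pvalue, _, _ = het_breuschpagan(results.resid, X)

# Independence: Durbin-Watson statistic (values near 2 suggest no first-order autocorrelation)
dw = durbin_watson(results.resid)

# Normality of errors: Jarque-Bera test on the residuals
_, jb_pvalue, _, _ = jarque_bera(results.resid)

# Multicollinearity: variance inflation factors (rule of thumb: VIF > 10 is a concern)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]

print(bp_pvalue, dw, jb_pvalue, vifs)
```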

Interpreting Coefficients: What Do They Mean?

  • Coefficients represent the change in the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant
  • For continuous variables, the coefficient is interpreted as the marginal effect
    • Example: If the coefficient for years of education is 0.05 in a wage regression, a one-year increase in education is associated with a 0.05 unit (e.g., dollar) increase in wages, ceteris paribus
  • For binary (dummy) variables, the coefficient represents the difference in the dependent variable between the two categories
    • Example: If the coefficient for a gender dummy (1 = female, 0 = male) is -0.2 in a wage regression, being female is associated with a 0.2 unit (e.g., dollar) lower wage compared to males, ceteris paribus
  • Coefficients are subject to hypothesis testing (t-tests) to determine statistical significance
    • A statistically significant coefficient indicates that the effect is unlikely to be due to chance alone
  • Confidence intervals provide a range of plausible values for the true coefficient
  • Interpreting coefficients requires considering the units of measurement and the context of the study (a worked wage-regression sketch follows this list)
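
Here is a worked sketch of a toy wage regression with a continuous regressor (years of education) and a dummy (female), reporting estimates, t-statistics, and confidence intervals; the variable names, coefficients, and data are all invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
educ = rng.integers(8, 21, size=n).astype(float)    # years of education
female = rng.integers(0, 2, size=n).astype(float)   # 1 = female, 0 = male
wage = 5.0 + 0.05 * educ - 0.2 * female + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([educ, female]))
results = sm.OLS(wage, X).fit()

# params[1]: change in wage per extra year of education, holding gender fixed
# params[2]: wage difference between females and males, holding education fixed
print(results.params)
print(results.tvalues)      # t-statistics for H0: beta = 0
print(results.conf_int())   # 95% confidence intervals by default
```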

Goodness of Fit: R-squared and Friends

  • R-squared (coefficient of determination) measures the proportion of variation in the dependent variable explained by the independent variables
    • Ranges from 0 to 1, with higher values indicating a better fit
    • Calculated as: $R^2 = 1 - \frac{SSR}{SST}$, where $SSR$ is the sum of squared residuals and $SST$ is the total sum of squares
  • Adjusted R-squared accounts for the number of independent variables and penalizes the addition of irrelevant variables
    • Useful for comparing models with different numbers of independent variables
    • Calculated as: $\bar{R}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$, where $n$ is the sample size and $k$ is the number of independent variables
  • F-test assesses the overall significance of the regression model
    • Tests the null hypothesis that all coefficients (except the intercept) are simultaneously equal to zero
    • A significant F-test indicates that the model as a whole is statistically significant
  • While R-squared and adjusted R-squared provide a measure of fit (both are computed by hand in the sketch after this list), they should not be the sole criterion for model selection
    • A high R-squared does not necessarily imply a well-specified or theoretically sound model
    • Other factors (theoretical justification, residual diagnostics, and practical significance) should also be considered
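
These fit statistics can be computed directly from the residuals; a minimal sketch on synthetic data, checked against statsmodels' built-in attributes:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=100)
results = sm.OLS(y, X).fit()

n, k = len(y), X.shape[1] - 1        # k = number of regressors, excluding the constant
ssr = np.sum(results.resid ** 2)     # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares

r2 = 1 - ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))   # F-test of H0: all slopes are zero

print(r2, results.rsquared)           # matches
print(adj_r2, results.rsquared_adj)   # matches
print(f_stat, results.fvalue)         # matches
```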

Potential Pitfalls and How to Spot Them

  • Omitted variable bias: Occurs when a relevant variable is excluded from the model
    • Can lead to biased and inconsistent estimates of the included variables
    • Hard to detect directly, since OLS residuals are uncorrelated with the included regressors by construction; instead, rely on theory about what drives the dependent variable and check whether coefficients shift when a candidate omitted variable is added
  • Measurement error: Occurs when variables are measured with error
    • Can lead to biased and inconsistent estimates, especially when the error is in the independent variables (attenuation bias)
    • Addressed by using instrumental variables or improved measurement techniques
  • Sample selection bias: Occurs when the sample is not representative of the population of interest
    • Can lead to biased and inconsistent estimates if the selection process is related to the dependent variable
    • Addressed by using appropriate sampling methods or correction techniques (Heckman selection model)
  • Outliers and influential observations: Extreme values that can substantially affect the regression results
    • Detected by examining residual plots, leverage values, and Cook's distance (see the sketch after this list)
    • Addressed by investigating the cause of the outliers and considering robust techniques (e.g., median or other quantile regressions)
  • Misspecification of functional form: Occurs when the assumed functional form (linear) does not match the true relationship
    • Can lead to biased and inconsistent estimates
    • Detected by examining residual plots for patterns and considering non-linear transformations or more flexible functional forms (polynomials, splines)
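
A minimal sketch for flagging influential observations with leverage and Cook's distance via statsmodels' influence tools (the 4/n cutoff is a common rule of thumb, not a hard rule; the planted outlier is for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(100, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)
y[0] += 10.0                              # plant one outlier for illustration

results = sm.OLS(y, X).fit()
influence = results.get_influence()

leverage = influence.hat_matrix_diag      # leverage (hat) values
cooks_d, _ = influence.cooks_distance     # Cook's distance per observation

# Flag observations exceeding the common 4/n rule of thumb
flagged = np.where(cooks_d > 4 / len(y))[0]
print(flagged, leverage[flagged])
```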

Real-World Applications

  • Labor economics: Estimating the returns to education, the gender wage gap, or the impact of job training programs on earnings
  • Health economics: Examining the determinants of healthcare utilization, the effect of health insurance on health outcomes, or the cost-effectiveness of medical interventions
  • Environmental economics: Assessing the impact of pollution on property values, the willingness to pay for environmental amenities, or the effectiveness of environmental regulations
  • Finance: Modeling the relationship between stock returns and financial ratios, the determinants of corporate bond yields, or the factors affecting exchange rates
  • Marketing: Analyzing the effect of advertising expenditure on sales, the impact of price promotions on consumer demand, or the drivers of customer satisfaction
  • Public policy: Evaluating the impact of government programs on various outcomes (poverty, crime, education), the determinants of voter turnout, or the factors influencing public opinion

Tips for Running Regressions

  • Clearly specify the research question and the theoretical framework guiding the analysis
  • Select the appropriate variables based on theory and data availability
    • Consider the level of measurement (continuous, categorical) and the expected relationships
  • Check for missing values and outliers, and decide on an appropriate treatment (deletion, imputation, robust methods)
  • Examine the descriptive statistics and correlations among variables to gain initial insights
  • Specify the regression model, including any necessary interactions or non-linear terms
  • Estimate the model and assess the overall fit (R-squared, F-test) and the significance of individual coefficients (t-tests)
  • Conduct diagnostic tests for the assumptions (linearity, independence, homoscedasticity, normality, multicollinearity, exogeneity) and address any violations
  • Interpret the results in the context of the research question and the theoretical framework
    • Consider the magnitude, sign, and statistical significance of the coefficients
    • Discuss the limitations and potential sources of bias
  • Perform robustness checks by considering alternative specifications, subsamples, or estimation methods (a minimal end-to-end sketch follows this list)
  • Present the results clearly and transparently, including the model specification, coefficient estimates, standard errors, and diagnostic tests
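
Putting the steps together, a minimal end-to-end sketch: estimate, run a diagnostic, and re-estimate with heteroscedasticity-robust standard errors as one robustness check (data synthetic, specification invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=(n, 2))
# Build in heteroscedasticity: the noise scale grows with |x1|
y = 1.0 + x @ np.array([0.5, -1.0]) + rng.normal(size=n) * (1 + np.abs(x[:, 0]))

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Diagnostic: Breusch-Pagan should flag the heteroscedasticity here
_, bp_pvalue, _, _ = het_breuschpagan(ols.resid, X)

# Robustness check: same point estimates, heteroscedasticity-robust (HC1) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")

print(bp_pvalue)
print(ols.bse)      # classical standard errors
print(robust.bse)   # robust standard errors
```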


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
