🎳Intro to Econometrics Unit 3 – Econometric Model Design

Econometric model design is a crucial skill for analyzing economic phenomena and testing hypotheses. It combines economic theory, mathematics, and statistical inference to create models that explain relationships between variables. This unit covers key concepts, types of models, and techniques for specification and evaluation.

Students learn about dependent and independent variables, error terms, and estimation methods like OLS. They explore various model types, including linear regression, time series, and panel data models. The unit also covers data collection, model specification techniques, and methods for evaluating model performance and addressing common pitfalls.

Key Concepts and Definitions

  • Econometrics combines economic theory, mathematics, and statistical inference to analyze economic phenomena and test hypotheses
  • Dependent variable (Y) represents the outcome or effect being studied and is influenced by independent variables (X)
  • Independent variables (X) are factors that explain or predict changes in the dependent variable
  • Stochastic error term (ε) captures the unexplained variation in the dependent variable not accounted for by the independent variables
  • Ordinary Least Squares (OLS) is a common estimation method that minimizes the sum of squared residuals to find the best-fit line
  • Hypothesis testing evaluates the statistical significance of estimated coefficients using t-tests or F-tests
    • Null hypothesis (H0) represents the default assumption that there is no significant relationship between variables
    • Alternative hypothesis (Ha) suggests a significant relationship exists
  • Multicollinearity occurs when independent variables are highly correlated with each other, inflating standard errors and making individual coefficient estimates unreliable
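The concepts above (OLS, the error term, and t-tests) can be sketched with simulated data. This is a minimal illustration, not a full econometrics workflow; the coefficient values and sample size are arbitrary choices for the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data: Y depends on X plus a stochastic error term (epsilon)
X = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
Y = 2.0 + 1.5 * X + eps

# OLS: minimize the sum of squared residuals (via the normal equations)
Z = np.column_stack([np.ones(n), X])          # prepend an intercept column
beta_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]

# t-statistics for H0: beta_j = 0 against Ha: beta_j != 0
resid = Y - Z @ beta_hat
sigma2 = resid @ resid / (n - Z.shape[1])     # unbiased residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
t_stats = beta_hat / se

print(beta_hat)   # estimates should be near the true values [2.0, 1.5]
print(t_stats)    # large |t| -> reject H0 of no relationship
```

With 200 observations and a modest error variance, both coefficients are estimated precisely and their t-statistics are far above conventional critical values, so H0 is rejected for each.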

Types of Econometric Models

  • Simple linear regression models the relationship between one dependent variable and one independent variable: Y = β0 + β1X + ε
  • Multiple linear regression extends simple regression to include multiple independent variables: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
  • Time series models analyze data collected over regular time intervals (daily stock prices, monthly unemployment rates)
    • Autoregressive (AR) models use lagged values of the dependent variable as independent variables
    • Moving Average (MA) models use lagged values of the error term as independent variables
  • Panel data models combine cross-sectional and time series data (household income across multiple years)
    • Fixed effects models control for unobserved, time-invariant factors within each cross-sectional unit
    • Random effects models assume unobserved factors are uncorrelated with the independent variables
  • Logistic regression models binary dependent variables using a logistic function to estimate probabilities (pass/fail, yes/no)
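As a small illustration of a time series model, an AR(1) process can be simulated and then estimated by regressing the series on its own lag. The persistence parameter (0.7 here) is an arbitrary choice for the simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
phi = 0.7

# Simulate an AR(1) process: y_t = phi * y_{t-1} + e_t
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# Estimate by OLS: regress y_t on its own lagged value
y_lag = y[:-1]
y_cur = y[1:]
Z = np.column_stack([np.ones(n - 1), y_lag])
b = np.linalg.lstsq(Z, y_cur, rcond=None)[0]
print(b[1])   # estimate of phi, close to the true 0.7
```

The same lag-and-regress idea underlies more elaborate AR(p) models, which simply include additional lagged columns in the design matrix.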

Data Collection and Preparation

  • Primary data is collected directly by the researcher through surveys, experiments, or observations
  • Secondary data is obtained from existing sources such as government databases, financial reports, or published research
  • Data cleaning involves identifying and correcting errors, inconsistencies, and missing values in the dataset
  • Outliers are extreme values that can significantly influence the results and should be carefully examined
  • Transforming variables (logarithmic, square root) can improve model fit and interpretation
    • Logarithmic transformations are useful for variables with skewed distributions or to interpret coefficients as elasticities
    • Interaction terms capture the combined effect of two independent variables on the dependent variable
  • Standardizing variables by subtracting the mean and dividing by the standard deviation allows for comparison of coefficients across different scales
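The log transformation and standardization steps above can be sketched in a few lines. The simulated income variable is illustrative, chosen only because log-normal data is visibly skewed:

```python
import numpy as np

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10, sigma=0.8, size=1000)  # right-skewed variable

# Logarithmic transformation reduces skew; in a regression, coefficients
# on log variables can be read as elasticities
log_income = np.log(income)

# Standardize: subtract the mean, divide by the standard deviation
z = (log_income - log_income.mean()) / log_income.std()
print(z.mean(), z.std())   # approximately 0 and 1 by construction
```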

Model Specification Techniques

  • Economic theory guides the selection of relevant variables and the expected relationships between them
  • Stepwise regression iteratively adds or removes variables based on statistical criteria (forward selection, backward elimination)
  • Best subset selection evaluates all possible combinations of independent variables to find the optimal model
  • Ramsey RESET test assesses whether the functional form of the model is correctly specified
    • A significant test result suggests the presence of omitted variables or incorrect functional form
  • Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) balance model fit and complexity to select the best model
    • Lower AIC or BIC values indicate a better trade-off between fit and parsimony
  • Dummy variables represent categorical or qualitative factors (gender, region) and take values of 0 or 1
  • Lagged variables account for the delayed impact of independent variables on the dependent variable (Yt = β0 + β1Xt-1 + εt)
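The AIC/BIC comparison described above can be computed by hand for two nested models, one of which includes an irrelevant regressor. This uses the Gaussian-likelihood form of the criteria up to an additive constant; the data-generating process is invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # x2 is irrelevant by construction

def ols_rss(Z, y):
    """Residual sum of squares from an OLS fit."""
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ b
    return r @ r

def aic_bic(rss, n, k):
    # Gaussian-likelihood AIC/BIC, omitting a constant common to all models
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

Z1 = np.column_stack([np.ones(n), x1])        # parsimonious model
Z2 = np.column_stack([np.ones(n), x1, x2])    # adds the irrelevant x2
print(aic_bic(ols_rss(Z1, y), n, 2))
print(aic_bic(ols_rss(Z2, y), n, 3))   # BIC typically worse: x2 adds little fit
```

Adding a regressor can never raise the RSS, so the criteria matter precisely because their complexity penalties (2k for AIC, k·ln(n) for BIC) offset mechanical improvements in fit.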

Estimation Methods

  • Ordinary Least Squares (OLS) is the most common estimation method for linear regression models
    • OLS assumes the error terms are independently and identically distributed with a mean of zero and constant variance
  • Maximum Likelihood Estimation (MLE) finds the parameter values that maximize the likelihood function of the observed data
    • MLE is often used for non-linear models (logistic regression) or when the error terms are not normally distributed
  • Instrumental Variables (IV) estimation addresses endogeneity issues when independent variables are correlated with the error term
    • Valid instruments are correlated with the endogenous variable but uncorrelated with the error term
  • Generalized Method of Moments (GMM) is a more flexible estimation approach that allows for heteroskedasticity and autocorrelation in the error terms
  • Two-Stage Least Squares (2SLS) is an IV estimation method that first regresses the endogenous variable on the instruments and then uses the predicted values in the main regression
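The 2SLS procedure above can be demonstrated with simulated data in which the regressor is endogenous by construction. All coefficient values and the instrument-strength choice are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Endogeneity: x is correlated with the error term u, so OLS is biased
z = rng.normal(size=n)                  # instrument: drives x, unrelated to u
u = rng.normal(size=n)
x = 0.8 * z + 0.5 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u                   # true coefficient on x is 2.0

def ols(Z, y):
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# Stage 1: regress the endogenous variable on the instrument
Z1 = np.column_stack([np.ones(n), z])
x_hat = Z1 @ ols(Z1, x)

# Stage 2: use the fitted values in place of x in the main regression
Z2 = np.column_stack([np.ones(n), x_hat])
beta_2sls = ols(Z2, y)

beta_ols = ols(np.column_stack([np.ones(n), x]), y)
print(beta_ols[1], beta_2sls[1])   # OLS biased upward; 2SLS typically near 2.0
```

Note that the standard errors from the manual second stage are not correct as printed by a plain OLS routine; dedicated IV software adjusts them, which is one reason to prefer a packaged 2SLS implementation in practice.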

Model Evaluation and Testing

  • R-squared measures the proportion of variation in the dependent variable explained by the independent variables
    • Adjusted R-squared penalizes the addition of irrelevant variables and is preferred for model comparison
  • F-test assesses the overall significance of the model by testing the joint hypothesis that all coefficients (except the intercept) are zero
  • t-tests evaluate the individual significance of each coefficient, testing the null hypothesis that the coefficient is zero
  • Durbin-Watson test detects the presence of first-order autocorrelation in the error terms
    • Test statistic values close to 2 indicate no autocorrelation, while values close to 0 or 4 suggest positive or negative autocorrelation
  • Breusch-Pagan test checks for heteroskedasticity, which occurs when the variance of the error terms is not constant across observations
  • Variance Inflation Factor (VIF) measures the degree of multicollinearity among independent variables
    • VIF values greater than 5 or 10 indicate severe multicollinearity and may require variable transformation or removal
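The VIF calculation follows directly from its definition: regress each independent variable on the others and compute 1/(1 − R²). The collinearity between x1 and x2 below is engineered for the illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # highly collinear with x1
x3 = rng.normal(size=n)                     # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ b
    tss = (y - y.mean()) @ (y - y.mean())
    r2 = 1 - resid @ resid / tss
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 large; x3 near 1
```

Here the VIFs for x1 and x2 far exceed the common thresholds of 5 or 10, flagging severe multicollinearity, while x3's VIF stays near 1.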

Practical Applications and Case Studies

  • Demand estimation models the relationship between price, income, and quantity demanded for a product (elasticities)
  • Wage determination analyzes factors influencing individual wages (education, experience, gender)
  • Economic growth regressions examine the determinants of cross-country differences in GDP growth rates (investment, human capital, institutions)
  • Environmental Kuznets Curve hypothesis tests the relationship between economic development and environmental degradation
  • Gravity models of international trade predict bilateral trade flows based on economic size and geographic distance
  • Hedonic pricing models estimate the value of individual attributes of a good (housing prices based on location, size, amenities)
  • Event studies assess the impact of specific events (mergers, policy changes) on stock prices or other financial variables
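The demand-estimation application can be sketched with a log-log regression, where the slope on log price is directly interpretable as the price elasticity of demand. The data-generating process and the elasticity of −1.2 are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400

# Hypothetical log-log demand curve: slope on log(price) is the elasticity
log_p = rng.uniform(0, 2, size=n)
log_q = 5.0 - 1.2 * log_p + rng.normal(scale=0.3, size=n)

Z = np.column_stack([np.ones(n), log_p])
b = np.linalg.lstsq(Z, log_q, rcond=None)[0]
print(b[1])   # estimated price elasticity, near the true -1.2
```

In real demand studies price is usually endogenous (set in response to demand), so this OLS sketch would be combined with the IV methods from the estimation section.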

Common Pitfalls and Limitations

  • Omitted variable bias occurs when a relevant variable is excluded from the model, leading to biased and inconsistent estimates
  • Reverse causality arises when the dependent variable influences one or more independent variables, violating the assumption of exogeneity
  • Sample selection bias results from non-random sampling or self-selection, making the sample unrepresentative of the population
  • Measurement errors in the dependent or independent variables can lead to biased and inconsistent estimates
  • Heteroskedasticity and autocorrelation in the error terms violate OLS assumptions and require robust standard errors or alternative estimation methods
  • Overfitting occurs when a model is too complex and fits the noise in the data rather than the underlying relationship
    • Overfitted models have poor out-of-sample predictive performance and may not generalize well to new data
  • External validity refers to the extent to which the results can be generalized to other contexts or populations beyond the study sample


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
