Fitting Linear Models to Data
Linear models let you describe the relationship between two variables using an equation. Once you have that equation, you can make predictions: given a new x-value, what y-value should you expect? This section covers how to build those models from real data, how to interpret them, and where they break down.
Scatter Plots for Variable Relationships
A scatter plot is the starting point for any data-fitting problem. It plots pairs of values so you can see whether a pattern exists before you try to model it.
- The independent variable (explanatory) goes on the x-axis. This is the variable you think might influence the other (e.g., hours spent studying).
- The dependent variable (response) goes on the y-axis. This is the outcome you're measuring (e.g., exam score).
Once the points are plotted, look for a correlation pattern:
- Positive correlation: As x increases, y tends to increase. Think height and weight: taller people generally weigh more.
- Negative correlation: As x increases, y tends to decrease. A classic example is price and quantity demanded: raise the price, and fewer people buy.
- No correlation: The points show no clear trend. Shoe size and IQ, for instance, have no meaningful relationship.
Also watch for outliers, which are data points that fall far from the overall pattern. A single outlier can distort your results, so it's worth investigating whether it reflects a data entry error or a genuinely unusual observation.
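As a quick sketch of checking for a correlation pattern numerically (the paired data here are invented for illustration), the Pearson correlation coefficient summarizes the direction and strength of the linear trend you'd see in the scatter plot:

```python
import numpy as np

# Hypothetical paired data: hours studied (x) and exam scores (y)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 55, 61, 60, 68, 70, 75, 79])

# Pearson correlation coefficient: sign gives the direction of the
# trend, magnitude (0 to 1) gives its strength
r = np.corrcoef(hours, scores)[0, 1]
print(f"r = {r:.2f}")  # a value close to +1 indicates strong positive correlation
```

A negative r would correspond to the price-and-demand pattern above, and an r near 0 to the shoe-size-and-IQ case. The number alone can't reveal outliers, though, which is why plotting comes first.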

Line of Best Fit Interpretation
The line of best fit (also called the regression line) is the straight line that best summarizes the trend in your scatter plot. "Best" here has a precise meaning: the line minimizes the sum of the squared vertical distances between each data point and the line itself. This method is called least-squares regression.
In practice, you'll use a graphing calculator or spreadsheet to compute it. The output is an equation of the form ŷ = mx + b, where ŷ is the predicted y-value.
Each part of that equation tells you something specific:
- Slope (m): The predicted change in y for every one-unit increase in x. If m = 3.2 in a model relating study hours to exam points, each additional hour of studying predicts about 3.2 more points on the exam.
- Y-intercept (b): The predicted y-value when x = 0. This is the "starting point" of the model. Sometimes it makes practical sense (a base exam score with zero study hours); sometimes it doesn't (a model predicting weight from height at height = 0 inches is meaningless).
- Correlation coefficient (r): A number between −1 and 1 that measures the strength and direction of the linear relationship. Values near 1 indicate a strong positive linear trend, values near −1 indicate a strong negative linear trend, and values near 0 suggest little to no linear relationship.
One more concept to know: residuals. A residual is the difference between an observed y-value and the y-value predicted by the line. In other words, residual = observed y − predicted y, or e = y − ŷ. Residuals tell you how far off the model is for each data point.
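The least-squares fit and its residuals can be sketched with numpy (the data values are made up, chosen so the fitted slope matches the 3.2-points-per-hour example above):

```python
import numpy as np

# Hypothetical data: study hours (x) vs. exam score (y)
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([50, 53, 57, 59, 63, 66])

# np.polyfit with degree 1 performs least-squares linear regression,
# returning the slope and intercept of the line of best fit
m, b = np.polyfit(x, y, 1)
predicted = m * x + b
residuals = y - predicted  # observed minus predicted, for each point

print(f"slope = {m:.2f}, intercept = {b:.2f}")
# A property of least-squares fits: the residuals sum to (numerically) zero
print(f"sum of residuals = {residuals.sum():.10f}")
```

Here the intercept of 50 is the kind of y-intercept that makes practical sense: a predicted base score with zero study hours.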

Linear vs. Nonlinear Relationships
Not every relationship between variables is linear. A linear relationship shows a roughly constant rate of change: the data points cluster around a straight line. A nonlinear relationship has a rate of change that varies across the range of x-values, so the data curves rather than following a line.
Common nonlinear patterns include:
- Exponential: Growth that accelerates over time (bacterial population doubling)
- Quadratic: Data that rises then falls, or vice versa (the arc of a thrown ball)
- Logarithmic: Rapid change at first that levels off (the pH scale)
You can often tell which type you're dealing with by inspecting the scatter plot. But a more reliable check is a residual plot: graph the residuals against the x-values after fitting a linear model.
- If the residuals scatter randomly above and below zero with no visible pattern, a linear model is a reasonable fit.
- If the residuals form a curve or systematic pattern (e.g., negative, then positive, then negative again), the relationship is likely nonlinear, and a straight line isn't capturing the true trend.
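The residual-plot check can be sketched as follows: fit a straight line to deliberately quadratic data, then look at the signs of the residuals (the data here are invented for illustration):

```python
import numpy as np

# Clearly nonlinear data: y = x^2 (a quadratic pattern)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2

# Fit a straight line anyway, then compute the residuals
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# A systematic sign pattern (+, -, -, -, +) shows the residuals form
# a curve: the straight line misses the true trend at both ends
print(np.sign(residuals))
```

If the relationship really were linear, these signs would alternate with no visible pattern rather than forming this U-shaped run.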
Regression Analysis for Predictive Modeling
Regression analysis is the formal process of fitting a mathematical equation to data so you can make predictions. In simple linear regression, you model the relationship between one independent variable and one dependent variable with a linear equation.
Here are the steps for building a predictive model:
- Collect and organize your data into paired values.
- Create a scatter plot to check whether a linear model is reasonable.
- Calculate the line of best fit using technology (calculator, Excel, Desmos, etc.).
- Assess goodness of fit using the correlation coefficient r and the coefficient of determination r². The r² value tells you the proportion of the variation in y that's explained by the model. For example, r² = 0.85 means 85% of the variation in y is accounted for by the linear relationship with x.
- Use the regression equation to predict y-values for given x-values.
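The steps above can be sketched end to end in Python (the data are hypothetical, and the prediction is deliberately kept inside the observed x-range):

```python
import numpy as np

# Step 1: paired data (hypothetical): hours studied vs. exam score
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([55, 60, 62, 68, 71, 74])

# Step 3: line of best fit (step 2, the scatter plot, is assumed done)
m, b = np.polyfit(x, y, 1)

# Step 4: correlation coefficient and coefficient of determination
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2  # proportion of variation in y explained by the model

# Step 5: predict a y-value for a new x inside the observed range
new_x = 4.5
prediction = m * new_x + b
print(f"r-squared = {r_squared:.3f}, predicted score at {new_x} h = {prediction:.1f}")
```

Note that new_x = 4.5 sits between the observed x-values of 1 and 6; predicting at, say, x = 50 would be exactly the extrapolation risk described below.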
Three important limitations to keep in mind:
- Extrapolation is risky. Your model is only reliable within the range of x-values you actually observed. Predicting far outside that range (e.g., using 10 years of sales data to forecast 50 years ahead) can produce wildly inaccurate results because the trend may not hold.
- Correlation does not imply causation. A strong correlation between two variables doesn't mean one causes the other. There may be confounding variables at play. Ice cream sales and drowning rates are correlated, but that's because both increase in hot weather, not because ice cream causes drowning.
- Outliers and influential points can pull the regression line toward them, distorting the slope and intercept. Always check your scatter plot and residuals for points that might be skewing the model.
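The influence of a single extreme point can be demonstrated directly (again with invented data): fit the same base data with and without one outlier and compare the slopes.

```python
import numpy as np

# Perfectly linear base data: slope exactly 2
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 12, 14, 16, 18])

# The same data plus one influential point far below the trend
x_out = np.append(x, 10)
y_out = np.append(y, 5)

slope_clean, _ = np.polyfit(x, y, 1)
slope_out, _ = np.polyfit(x_out, y_out, 1)

# One point is enough to flip the fitted slope from positive to negative
print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
```

This is why the checks above matter: a residual plot or scatter plot would immediately flag the sixth point as far outside the pattern, prompting you to ask whether it's a data entry error or a genuinely unusual observation.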