Linear Regression¶
What This Is¶
Linear regression fits a line (or a hyperplane) that predicts a continuous target from one or more features. The model has the form y = Xw + b, and the standard training choice is to minimize squared error.
The practical lesson is that linear regression is the first baseline you should beat before trusting anything fancier. Its coefficients are directly interpretable, its failure modes are easy to see in residual plots, and its assumptions — linearity, independent errors, constant variance — are the first hypotheses worth testing.
When You Use It¶
- predicting a continuous number where a linear relationship is plausible
- producing a coefficient-level story that a stakeholder can read directly
- setting a baseline before trying gradient boosting or a neural model
- diagnosing whether a problem is really nonlinear or just noisy
- giving yourself a reference for what "good enough" should look like
Do Not Use It When¶
- the target is a class label — use Logistic Regression instead
- the residual plot shows clear curvature or heteroscedasticity and you cannot fix it with transforms or features
- the feature matrix has severe collinearity — switch to Ridge or Lasso, or drop features
- the task really needs nonlinear interactions and you are not willing to engineer them by hand
Tooling¶
- `LinearRegression`, `Ridge`, `Lasso`, `ElasticNet`
- `PolynomialFeatures`, `StandardScaler`, `Pipeline`
- `r2_score`, `mean_absolute_error`, `mean_squared_error`
- `statsmodels.OLS` for p-values and diagnostics
- residual plots and influence diagnostics
Closed Form And Gradient View¶
The ordinary least squares (OLS) solution has a closed form (with the intercept b absorbed into w by adding a column of ones to X):
w = (X^T X)^(-1) X^T y
The inverse exists only when X has full column rank, so it breaks when columns are linearly dependent. Ridge regression instead solves w = (X^T X + λI)^(-1) X^T y. For any λ > 0 the matrix X^T X + λI is invertible, and that regularizer is exactly what keeps the problem well conditioned under collinearity.
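A minimal NumPy sketch of both closed forms, on synthetic data invented for illustration. The names `X_dup` and `lam` and the value `lam = 1.0` are assumptions, not part of any library API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=200)

# OLS via the normal equations; solve() avoids forming the inverse explicitly.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Append a copy of the first column so the columns are linearly dependent.
X_dup = np.column_stack([X, X[:, 0]])
rank = np.linalg.matrix_rank(X_dup)  # below 3, so X^T X is singular

# Ridge adds lam * I, which makes the linear system solvable again.
lam = 1.0
w_ridge = np.linalg.solve(X_dup.T @ X_dup + lam * np.eye(3), X_dup.T @ y)

print("OLS weights:  ", w_ols)
print("rank of X_dup:", rank)
print("ridge weights:", w_ridge)
```

With the duplicated column, ridge splits the weight evenly between the two identical columns instead of failing, which is the behaviour the penalty buys you.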
The gradient view is equally simple: the squared loss L = ||y - Xw||^2 / (2n) has gradient ∇L = -X^T(y - Xw) / n. Every gradient-descent implementation of linear regression steps against that gradient: w ← w - η∇L, with learning rate η.
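The gradient step can be sketched in a few lines of NumPy. The data, the step size `lr = 0.1`, and the iteration count are assumptions picked by hand for this example, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.5, -0.5, 2.0])    # hypothetical ground truth
y = X @ w_true + 0.1 * rng.normal(size=500)

n = len(y)
w = np.zeros(3)
lr = 0.1                               # step size, chosen by hand for this data

for _ in range(500):
    grad = -X.T @ (y - X @ w) / n      # gradient of ||y - Xw||^2 / (2n)
    w -= lr * grad                     # step against the gradient

# Compare with the least-squares solution computed directly.
w_closed = np.linalg.lstsq(X, y, rcond=None)[0]
print("max difference from closed form:", np.max(np.abs(w - w_closed)))
```

On a well-conditioned problem like this one the iterates converge to the closed-form answer. Badly scaled features slow that convergence, which is another reason to standardize.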
Regularization Cheat Sheet¶
| Variant | Penalty | Use when |
|---|---|---|
| `LinearRegression` | none | baseline and when features are already well conditioned |
| `Ridge` | λ \|\|w\|\|_2^2 | many correlated features, keep all coefficients small |
| `Lasso` | λ \|\|w\|\|_1 | you want automatic feature selection and sparse coefficients |
| `ElasticNet` | α·L1 + (1-α)·L2 | correlated features where you still want sparsity |
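To make the sparsity difference concrete, here is a small sketch on synthetic data where only three of ten features matter. The data and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Ten features, only the first three carry signal (hypothetical setup).
X = rng.normal(size=(300, 10))
w_true = np.zeros(10)
w_true[:3] = [4.0, -3.0, 2.0]
y = X @ w_true + 0.3 * rng.normal(size=300)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```

Ridge shrinks the irrelevant coefficients toward zero but leaves them nonzero, while Lasso sets them to exactly zero and slightly shrinks the ones it keeps.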
Minimal Example¶
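```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train and y_train are assumed to be an existing training split.
# Standardize the features, then fit ordinary least squares.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
```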
Scaling matters more than it looks. With unscaled features the coefficient magnitudes cannot be compared, and regularization does not penalize them fairly.
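A sketch of both effects on synthetic data with two features on very different scales. The data and `alpha=100.0` are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two features on very different scales (hypothetical data).
X = np.column_stack([rng.normal(size=200), 1000.0 * rng.normal(size=200)])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + 0.1 * rng.normal(size=200)

# Plain OLS: scaling changes the coefficients but not the predictions.
ols_raw = LinearRegression().fit(X, y)
ols_scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
ols_gap = np.max(np.abs(ols_raw.predict(X) - ols_scaled.predict(X)))

# Ridge: the same alpha penalizes the two features very differently
# depending on their units, so the predictions change.
ridge_raw = Ridge(alpha=100.0).fit(X, y)
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=100.0)).fit(X, y)
ridge_gap = np.max(np.abs(ridge_raw.predict(X) - ridge_scaled.predict(X)))

print("raw OLS coefficients:   ", ols_raw.coef_)
print("scaled OLS coefficients:", ols_scaled[-1].coef_)
print("OLS prediction gap:  ", ols_gap)
print("ridge prediction gap:", ridge_gap)
```

The raw coefficients make the second feature look negligible, yet after standardization it carries the larger effect. The OLS predictions agree up to rounding either way; the Ridge predictions do not.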
Worked Pattern¶
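```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# X_train and y_train are assumed to be an existing training split.
ridge = Ridge()
# Search alpha over a logarithmic grid with 5-fold cross-validation.
grid = GridSearchCV(ridge, {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
grid.fit(X_train, y_train)
```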
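The logarithmic grid for alpha is the right shape because the penalty's effect depends on alpha's size relative to the scale of X^T X, not on absolute differences: moving from 0.01 to 0.1 can matter as much as moving from 10 to 100.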
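One way to see why alpha is searched on a log scale: along each singular direction of X with singular value d, ridge multiplies the OLS coefficient by d^2 / (d^2 + alpha). The sketch below checks that identity numerically on synthetic data. The data and the grid `alphas` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# Singular values of X and the unregularized least-squares solution.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

alphas = np.logspace(-2, 4, 4)   # 0.01, 1, 100, 10000
mean_shrink = []
for alpha in alphas:
    w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ y)
    shrink = d**2 / (d**2 + alpha)
    # In the singular basis, ridge equals OLS times the shrink factors.
    assert np.allclose(Vt @ w_ridge, shrink * (Vt @ w_ols))
    mean_shrink.append(shrink.mean())
    print(f"alpha={alpha:g}  shrink factors={np.round(shrink, 3)}")
```

The shrink factors barely move until alpha approaches d^2 and then fall off quickly, so evenly spaced alphas would waste most of the grid on values that behave alike.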
Assumptions To Check¶
- linearity — plot each feature against the residuals; any trend means the model shape is wrong, not the data
- independent errors — in time series, neighboring residuals correlate; use a sequential split and see Sequential Splits and Lag Features
- constant variance — the residual cloud should not fan out; if it does, try log-transforming the target
- low collinearity — look at variance inflation factors (VIF); high VIF means unstable coefficients
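The collinearity check can be sketched without extra dependencies. The helper `vif` below is a hypothetical function written for this example: it uses the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the remaining columns plus an intercept.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out.append(ss_tot / ss_res)   # equals 1 / (1 - R^2)
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)                 # independent of x1
x3 = x1 + 0.1 * rng.normal(size=500)      # nearly a copy of x1
v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))
```

The independent column gets a VIF near 1, while the two near-duplicates get large values, flagging them as the source of unstable coefficients.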
What To Inspect¶
- `r2_score` on the validation split, alongside MAE for a unit-interpretable error
- the residual plot, which should look like noise, not a shape
- the largest residuals — one or two outliers can swing OLS
- the coefficient table — signs and magnitudes must survive a stakeholder read
- whether Ridge or Lasso changes the story materially
- whether a nonlinear feature (a log, a ratio) fixes the residual pattern before you jump to a new model
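Several of these checks fit in one short script. The sketch below uses synthetic data whose target is linear in log(x), and compares a raw feature against a log feature on the same validation split. The helper `validate` and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=400)
y = 3.0 * np.log(x) + 0.1 * rng.normal(size=400)   # linear in log(x), not in x

def validate(features):
    """Fit on a train split; return validation R^2, MAE, and residuals."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        features, y, test_size=0.25, random_state=0
    )
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_va)
    return r2_score(y_va, pred), mean_absolute_error(y_va, pred), y_va - pred

r2_raw, mae_raw, res_raw = validate(x.reshape(-1, 1))
r2_log, mae_log, res_log = validate(np.log(x).reshape(-1, 1))

# Positions (within the validation split) of the five largest residuals.
worst = np.argsort(np.abs(res_raw))[-5:]

print(f"raw x:  R2={r2_raw:.3f}  MAE={mae_raw:.3f}")
print(f"log x:  R2={r2_log:.3f}  MAE={mae_log:.3f}")
print("largest raw-feature residuals:", np.round(res_raw[worst], 2))
```

Plotting `res_raw` against `x` would show the curved pattern that the log feature removes; the metrics alone only hint at it.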
Failure Pattern¶
Treating r2_score as a single verdict. A 0.85 R² on a training set that looks curved in the residual plot is a worse result than a 0.78 R² with residuals that look like noise — because the second one is honest about what the model captured.
A second failure pattern is letting one outlier drive the coefficient. Squared loss weights large errors quadratically, so a single point in the corner can tilt the line.
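A small sketch of that tilt, using synthetic data and one deliberately corrupted point. The helper `slope` and the size of the corruption are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

def slope(x, y):
    # Degree-1 least-squares fit; polyfit returns [slope, intercept].
    return np.polyfit(x, y, 1)[0]

clean_slope = slope(x, y)

# Corrupt a single point at the far end of the x range (high leverage).
y_bad = y.copy()
y_bad[-1] -= 40.0
tilted_slope = slope(x, y_bad)

print(f"slope without outlier: {clean_slope:.3f}")
print(f"slope with one outlier: {tilted_slope:.3f}")
```

One corrupted point out of fifty moves the fitted slope noticeably, and the effect is largest when the point sits at the edge of the feature range.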
A third failure pattern is reading coefficients when features are highly correlated. With collinearity, the coefficient you see is a consequence of which features happen to enter the design matrix, not a stable statement about the world.
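The instability can be sketched by refitting on bootstrap resamples of synthetic data with two nearly identical features. The data and the number of resamples are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)      # almost a copy of x1
y = 3.0 * x1 + 0.5 * rng.normal(size=200)
X_both = np.column_stack([x1, x2])

# With x1 alone, the coefficient is close to the true value of 3.
alone = LinearRegression().fit(x1.reshape(-1, 1), y)

# Refit on bootstrap resamples to see how much each coefficient moves.
coefs = []
for _ in range(200):
    idx = rng.integers(0, 200, size=200)
    coefs.append(LinearRegression().fit(X_both[idx], y[idx]).coef_)
coefs = np.array(coefs)

spread = coefs.std(axis=0)             # spread of each individual coefficient
sum_spread = coefs.sum(axis=1).std()   # spread of their sum

print("coefficient with x1 alone:", alone.coef_)
print("bootstrap std of each coefficient:", np.round(spread, 2))
print("bootstrap std of their sum:", round(float(sum_spread), 3))
```

The data pin down only the sum of the two coefficients. How that sum is divided between the correlated features is close to arbitrary, so reading either one alone is unsafe.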
Quick Checks¶
- Is the residual plot a cloud or a shape?
- Is scaling in the pipeline?
- Do the coefficients change by an order of magnitude when you add or drop a correlated feature?
- Does Ridge improve validation error, or is the baseline fine?
- Are the largest residuals concentrated in one slice of the data?
Practice¶
- Fit a linear regression to a synthetic `y = 2x + noise` and read the coefficient.
- Compare `LinearRegression`, `Ridge(alpha=1)`, and `Lasso(alpha=0.1)` on the same split.
- Add a redundant feature (a copy of an existing column) and watch what happens to the OLS coefficient.
- Log-transform a right-skewed target and compare residual plots before and after.
- Explain one case where Lasso's zero coefficients are a feature and one case where they are a bug.
- Explain why OLS fails when columns are linearly dependent.
- Describe one residual pattern that signals the model shape is wrong.
- Explain when you prefer Ridge over Lasso.
- State one situation where MAE is a better error metric than MSE.
- Explain why scaling matters for regularized linear models but not for plain OLS.
Longer Connection¶
Linear regression sits next to:
- Regression Metrics and Diagnostics — how to read the error numbers honestly
- Feature Selection — Lasso as a selection lens
- Logistic Regression — the classification sibling with the same linear-decision-surface intuition
- Learning Curves and Bias-Variance — where underfitting shows itself for linear models
Linear regression is a diagnostic instrument first and a predictor second. Its real job is to tell you whether the problem is linear, noisy, or outright wrong before you spend compute on bigger models.