Linear Regression¶

What This Is¶

Linear regression fits a line (or a hyperplane) that predicts a continuous target from one or more features. The model has the form y = Xw + b, and the standard training choice is to minimize squared error.

The practical lesson is that linear regression is the first baseline you should beat before trusting anything fancier. Its coefficients are directly interpretable, its failure modes are easy to see in residual plots, and its assumptions — linearity, independent errors, constant variance — are the first hypotheses worth testing.

When You Use It¶

predicting a continuous number where a linear relationship is plausible
producing a coefficient-level story that a stakeholder can read directly
setting a baseline before trying gradient boosting or a neural model
diagnosing whether a problem is really nonlinear or just noisy
giving yourself a reference for what "good enough" should look like

Do Not Use It When¶

the target is a class label — use Logistic Regression instead
the residual plot shows clear curvature or heteroscedasticity and you cannot fix it with transforms or features
the feature matrix has severe collinearity — switch to Ridge or Lasso, or drop features
the task really needs nonlinear interactions and you are not willing to engineer them by hand

Tooling¶

LinearRegression
Ridge
Lasso
ElasticNet
PolynomialFeatures
StandardScaler
Pipeline
r2_score
mean_absolute_error
mean_squared_error
statsmodels.OLS for p-values and diagnostics
residual plots and influence diagnostics

Closed Form And Gradient View¶

The ordinary least squares (OLS) solution has a closed form (with the intercept b absorbed into X as a column of ones):

w = (X^T X)^(-1) X^T y

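The inverse exists only when X has full column rank. It fails outright when columns are linearly dependent and becomes numerically unstable when they are nearly so. Ridge regression replaces the solution with w = (X^T X + λI)^(-1) X^T y. For any λ > 0 the matrix X^T X + λI is positive definite and therefore invertible, which is exactly what keeps the problem well conditioned under collinearity.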
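The closed form and its failure under linear dependence can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the variable names, the noise level, and λ = 1 (`lam`) are illustrative assumptions, and the intercept is omitted because the synthetic data has mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.1, size=n)

# Full-rank design: the normal equations have a unique solution.
X = np.column_stack([x1, x2])
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Duplicate a column: X^T X is now 3x3 but only has rank 2.
X_dup = np.column_stack([x1, x2, x1])
rank = np.linalg.matrix_rank(X_dup.T @ X_dup)

# Ridge adds lam * I, which makes the matrix positive definite again.
lam = 1.0
A = X_dup.T @ X_dup + lam * np.eye(X_dup.shape[1])
w_ridge = np.linalg.solve(A, X_dup.T @ y)

print("OLS coefficients:     ", w_ols)
print("rank of X_dup^T X_dup:", rank)
print("ridge coefficients:   ", w_ridge)
```

With the duplicated column, ridge splits the weight evenly between the two copies, which is the smallest-norm way to express the same fit.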

The gradient view is equally simple: the squared loss L(w) = ||y - Xw||^2 / (2n) has gradient -X^T (y - Xw) / n. Every gradient-descent implementation of linear regression repeats the update w ← w - η · gradient, where η is the learning rate.
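As a sanity check on the gradient formula, the sketch below runs plain gradient descent on synthetic data and compares the result with the closed-form solution. The learning rate `eta = 0.1` and the 500 iterations are hand-picked assumptions that work because the synthetic features are already on a unit scale.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))
w_true = np.array([1.5, -0.5, 2.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

def gradient(w, X, y):
    """Gradient of L(w) = ||y - Xw||^2 / (2n)."""
    return -X.T @ (y - X @ w) / len(y)

w = np.zeros(X.shape[1])
eta = 0.1  # learning rate, chosen by hand for this well-scaled X
for _ in range(500):
    w = w - eta * gradient(w, X, y)

# Closed-form OLS solution for comparison.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print("gradient descent:", w)
print("closed form:     ", w_closed)
```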

Regularization Cheat Sheet¶

Variant	Penalty	Use when
`LinearRegression`	none	baseline and when features are already well conditioned
`Ridge`	`λ ||w||_2^2`	many correlated features, keep all coefficients small
`Lasso`	`λ ||w||_1`	you want automatic feature selection and sparse coefficients
`ElasticNet`	`λ (ρ ||w||_1 + (1-ρ) ||w||_2^2 / 2)`	correlated features where you still want sparsity

In scikit-learn, the penalty strength λ is the `alpha` argument and the mixing weight ρ is `l1_ratio`.
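The difference between the penalties is easiest to see in which coefficients end up exactly zero. Here is a small sketch on synthetic data where only three of ten features carry signal; the penalty strengths are arbitrary illustrative values, not tuned ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first three features carry signal; the rest are pure noise.
w_true = np.zeros(p)
w_true[:3] = [4.0, -3.0, 2.0]
y = X @ w_true + rng.normal(scale=0.5, size=n)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=10.0),
    "lasso": Lasso(alpha=0.2),
    "enet": ElasticNet(alpha=0.2, l1_ratio=0.5),
}
zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that the penalty drove to exactly zero.
    zero_counts[name] = int(np.sum(model.coef_ == 0.0))
    print(f"{name:6s} zeros={zero_counts[name]:2d} coef={np.round(model.coef_, 2)}")
```

Ridge shrinks every coefficient but leaves none at exactly zero, while the L1 term in Lasso and ElasticNet is what removes features outright.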

Minimal Example¶

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

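Scaling matters more than it looks. Plain OLS predictions do not change when features are rescaled, but with unscaled features the coefficient magnitudes cannot be compared with one another, and a Ridge or Lasso penalty falls unevenly: a feature with a small numeric range needs a large coefficient and is shrunk harder for it.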
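A quick way to see what scaling changes and what it does not is to fit the same estimator with and without `StandardScaler` and compare. The sketch below uses two synthetic features on deliberately mismatched scales; `max_prediction_gap` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
# Two features on very different scales (hypothetical units).
X = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(scale=0.5, size=n)

def max_prediction_gap(estimator_cls, **kwargs):
    """Largest difference between predictions fit on raw vs. scaled features."""
    raw = estimator_cls(**kwargs).fit(X, y)
    scaled = make_pipeline(StandardScaler(), estimator_cls(**kwargs)).fit(X, y)
    return float(np.max(np.abs(raw.predict(X) - scaled.predict(X))))

ols_gap = max_prediction_gap(LinearRegression)
ridge_gap = max_prediction_gap(Ridge, alpha=100.0)

raw_coef = LinearRegression().fit(X, y).coef_
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
std_coef = scaled_model.named_steps["linearregression"].coef_

print("OLS prediction gap:  ", ols_gap)
print("Ridge prediction gap:", ridge_gap)
print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", std_coef)
```

The OLS gap should be at floating-point noise level, while the Ridge gap is not, and only the standardized coefficients are on a scale where the two features can be compared.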

Worked Pattern¶

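from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside the pipeline so each CV fold fits its own scaler.
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(pipe, {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
grid.fit(X_train, y_train)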

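The logarithmic grid for alpha is the right shape because the effect of the penalty depends on alpha relative to the scale of X^T X: Ridge shrinks the fit along each singular direction of X by d^2 / (d^2 + alpha), where d is the singular value, so the interesting changes happen across orders of magnitude rather than across unit steps.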
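A small numerical sketch of why order-of-magnitude steps are the useful ones: for centered data, the ridge solution scales the component along each singular direction of X by d^2 / (d^2 + alpha), where d is the singular value. The code verifies that identity against a direct solve and prints the factors across a log-spaced grid. The data and grid bounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)          # center so no intercept is needed
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
y = y - y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)
alphas = np.logspace(-2, 2, 5)  # 0.01, 0.1, 1, 10, 100

shrink_table = []
for alpha in alphas:
    shrink = d**2 / (d**2 + alpha)            # per-direction shrinkage factor
    w_svd = Vt.T @ (shrink * (U.T @ y) / d)   # ridge solution via the SVD
    w_direct = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
    assert np.allclose(w_svd, w_direct)       # both routes give the same w
    shrink_table.append(shrink)
    print(f"alpha={alpha:7.2f}  shrinkage={np.round(shrink, 3)}")
```

Because the factor depends on the ratio of alpha to d^2, moving alpha from 0.01 to 0.1 barely changes the fit here, while moving from 10 to 100 changes it a lot.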

Assumptions To Check¶

linearity — plot each feature against the residuals; any trend means the model shape is wrong, not the data
independent errors — in time series, neighboring residuals correlate; use a sequential split and see Sequential Splits and Lag Features
constant variance — the residual cloud should not fan out; if it does, try log-transforming the target
low collinearity — look at variance inflation factors (VIF); high VIF means unstable coefficients
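The collinearity check can be sketched directly from the definition: the VIF of column j is 1 / (1 - R^2_j), where R^2_j comes from regressing column j on all the other columns. The `vif` helper below is a hypothetical NumPy implementation written for this illustration; statsmodels ships its own `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which is the more usual choice in practice.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2_j),
    where R^2_j comes from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

The independent column `b` should land near 1, while `a` and `c` inflate each other's variance by a large factor.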

What To Inspect¶

r2_score on the validation split, alongside MAE for a unit-interpretable error
the residual plot — should look like noise, not a shape
the largest residuals — one or two outliers can swing OLS
the coefficient table — signs and magnitudes must survive a stakeholder read
whether Ridge or Lasso changes the story materially
whether a nonlinear feature (a log, a ratio) fixes the residual pattern before you jump to a new model
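The first few items on this list can be pulled out of a fitted model in a handful of lines. This is a minimal sketch on synthetic data; the split size, the random seeds, and the choice to look at the five largest residuals are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(400, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_val)
residuals = y_val - pred

r2 = r2_score(y_val, pred)
mae = mean_absolute_error(y_val, pred)     # in the same units as y
worst = np.argsort(-np.abs(residuals))[:5]  # indices of the five largest residuals

print(f"R^2={r2:.3f}  MAE={mae:.3f}")
print("coefficients:", np.round(model.coef_, 2))
print("largest residuals:", np.round(residuals[worst], 2))
```

Looking up the rows behind `worst` in the validation data is usually the fastest way to tell whether the large errors share a slice of the data.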

Failure Pattern¶

Treating r2_score as a single verdict. An R² of 0.85 from a fit whose residual plot is visibly curved is a worse result than an R² of 0.78 with residuals that look like noise, because the second model is honest about what it captured.
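A high R² on a misspecified model is easy to reproduce. The sketch below fits a straight line to data that is quadratic by construction, then measures how strongly the residuals still track a centered squared term. The `curvature` statistic is an ad hoc check invented for this example, not a standard diagnostic.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)   # truly quadratic

# Fit a straight line anyway.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

# Residuals from a straight-line fit are uncorrelated with x by construction,
# so test for leftover curvature against a centered squared term instead.
curvature = np.corrcoef(resid, (x - x.mean())**2)[0, 1]

print(f"R^2 of the straight-line fit: {r2:.3f}")
print(f"correlation of residuals with (x - mean)^2: {curvature:.3f}")
```

The R² alone would pass most casual reviews, while the residuals make it plain that the model shape is wrong.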

A second failure pattern is letting one outlier drive the coefficient. Squared loss weights large errors quadratically, so a single point in the corner can tilt the line.
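The effect of a single point is simple to demonstrate. In this sketch one observation at the edge of the x range is shifted by a large, arbitrary amount (40 units, an assumption chosen to make the effect obvious) and the slope is refit.

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(0, 10, 30)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

slope_clean, _ = np.polyfit(x, y, 1)

# Corrupt a single point at the edge of the x range, where leverage is highest.
y_bad = y.copy()
y_bad[-1] += 40.0
slope_bad, _ = np.polyfit(x, y_bad, 1)

print(f"slope without outlier: {slope_clean:.3f}")
print(f"slope with outlier:    {slope_bad:.3f}")
```

Moving the same corrupted point to the middle of the x range would shift the intercept but barely change the slope, which is why the position of an outlier matters as much as its size.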

A third failure pattern is reading coefficients when features are highly correlated. With collinearity, the coefficient you see is a consequence of which features happen to enter the design matrix, not a stable statement about the world.
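Coefficient instability under collinearity shows up clearly when the same model is refit on bootstrap resamples. The sketch below is one way to illustrate it on synthetic data; `fit` and `bootstrap_coefs` are hypothetical helpers, and 200 resamples is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # almost a copy of x1
y = 2.0 * x1 + rng.normal(scale=1.0, size=n)

def fit(X, y):
    """OLS with an intercept; returns the feature coefficients only."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

def bootstrap_coefs(X, y, n_boot=200):
    """Refit on resampled rows and collect the coefficients."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        coefs.append(fit(X[idx], y[idx]))
    return np.array(coefs)

alone = bootstrap_coefs(x1[:, None], y)
both = bootstrap_coefs(np.column_stack([x1, x2]), y)

std_alone = alone[:, 0].std()      # spread of the x1 coefficient, x1 only
std_both = both[:, 0].std()        # spread of the x1 coefficient, x1 and x2
sum_std = both.sum(axis=1).std()   # spread of the two coefficients added up

print(f"std of x1 coefficient, x1 alone:   {std_alone:.3f}")
print(f"std of x1 coefficient, with x2:    {std_both:.3f}")
print(f"std of (x1 + x2) coefficient sum:  {sum_std:.3f}")
```

The individual coefficients swing widely once the near-duplicate enters, while their sum stays stable: the data pins down the combined effect, not how it is divided between the two columns.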

Quick Checks¶

Is the residual plot a cloud or a shape?
Is scaling in the pipeline?
Do the coefficients change by an order of magnitude when you add or drop a correlated feature?
Does Ridge improve validation error, or is the baseline fine?
Are the largest residuals concentrated in one slice of the data?

Practice¶

Fit a linear regression to a synthetic y = 2x + noise and read the coefficient.
Compare LinearRegression, Ridge(alpha=1), and Lasso(alpha=0.1) on the same split.
Add a redundant feature (a copy of an existing column) and watch what happens to the OLS coefficient.
Log-transform a right-skewed target and compare residual plots before and after.
Explain one case where Lasso's zero coefficients are a feature and one case where they are a bug.
Explain why OLS fails when columns are linearly dependent.
Describe one residual pattern that signals the model shape is wrong.
Explain when you prefer Ridge over Lasso.
State one situation where MAE is a better error metric than MSE.
Explain why scaling matters for regularized linear models but not for plain OLS.

Longer Connection¶

Linear regression sits next to:

Regression Metrics and Diagnostics — how to read the error numbers honestly
Feature Selection — Lasso as a selection lens
Logistic Regression — the classification sibling with the same linear-decision-surface intuition
Learning Curves and Bias-Variance — where underfitting shows itself for linear models

Linear regression is a diagnostic instrument first and a predictor second. Its real job is to tell you whether the problem is linear, noisy, or outright wrong before you spend compute on bigger models.