Hyperparameter Tuning¶

What This Is¶

Hyperparameter tuning is a controlled search over model settings, run so that no information leaks across the validation boundary.

The deeper point is that tuning is not about finding the most extreme settings. It is about finding the smallest change that gives a repeatable improvement.

When You Use It¶

comparing a few candidate settings honestly
tuning regularization, tree depth, or similar controls
improving a baseline without changing the whole workflow

Tooling¶

Pipeline
GridSearchCV
RandomizedSearchCV
validation_curve
ParameterGrid
HalvingGridSearchCV
HalvingRandomSearchCV
StandardScaler

Library Notes¶

Pipeline keeps preprocessing tied to the model so each fold stays honest.
GridSearchCV is best when the search space is small and you want to inspect every candidate.
RandomizedSearchCV is better when the space is larger or you want a fast first pass.
validation_curve is useful when you want to inspect one parameter at a time instead of tuning several knobs at once.
ParameterGrid helps you reason about the search space before the run starts.
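HalvingGridSearchCV and HalvingRandomSearchCV spend fewer resources on weak candidates and are useful when the full search would be too expensive. Both are still experimental in scikit-learn, so import enable_halving_search_cv from sklearn.experimental before importing them.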
StandardScaler should usually live inside the pipeline for linear and distance-based models.
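
As a rough illustration of the "fast first pass" idea, the sketch below samples C from a log-uniform distribution with RandomizedSearchCV instead of enumerating a grid. The synthetic dataset, the range of C, and the budget of 8 candidates are assumptions chosen only to keep the example small.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real training set (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Sample 8 values of C on a log scale instead of listing them by hand.
search = RandomizedSearchCV(
    pipeline,
    {"model__C": loguniform(1e-3, 1e2)},
    n_iter=8,
    cv=cv,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

The n_iter argument is the candidate budget: it caps the number of sampled settings no matter how wide the distribution is.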

What To Tune First¶

Start with the parameters that control capacity:

C for linear and margin-based models
max_depth, min_samples_leaf, or similar controls for tree models
regularization or shrinkage knobs before secondary preprocessing choices

If the first pass is inconclusive, add one interacting parameter only after you can explain why it belongs in the search.
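
A first pass over a single capacity control for a tree model might look like the sketch below. The dataset is synthetic and the candidate depths are illustrative assumptions, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real training set (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# First pass: one capacity control only. None means "grow until pure".
depths = [2, 4, 8, None]
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": depths},
    cv=cv,
    scoring="roc_auc",
)
tree_search.fit(X, y)
print(tree_search.best_params_, round(tree_search.best_score_, 3))
```

If this pass is inconclusive, min_samples_leaf is a natural second parameter because it also limits how finely the tree can split.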

Honest Tuning Protocol¶

Treat tuning as a controlled decision process:

lock the split or CV design first
choose one primary metric
define a budget for candidates, not an unlimited search
search only inside the training boundary
compare the tuned winner against the untuned baseline
evaluate once on the locked holdout after selection

If the tuned model cannot beat the untuned baseline honestly, the lesson is often about the representation or the split, not the grid size.
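
The protocol above can be sketched end to end as follows. This is a minimal illustration on synthetic data; the split sizes, the grid, and the variable names (X_holdout, baseline_cv, gain) are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_score, train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

# 1. Lock the holdout and the CV design first.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 2. Choose one primary metric.
metric = "average_precision"

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Untuned baseline, scored with the same folds and the same metric.
baseline_cv = cross_val_score(
    pipeline, X_train, y_train, cv=cv, scoring=metric
).mean()

# 3 and 4. A small, budgeted search that only ever sees the training split.
search = GridSearchCV(
    pipeline, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=cv, scoring=metric
)
search.fit(X_train, y_train)

# 5. Compare the tuned winner against the untuned baseline.
gain = search.best_score_ - baseline_cv

# 6. Evaluate once on the locked holdout, after selection is finished.
holdout_scores = search.best_estimator_.predict_proba(X_holdout)[:, 1]
holdout_ap = average_precision_score(y_holdout, holdout_scores)
print(f"baseline={baseline_cv:.3f} gain={gain:.3f} holdout={holdout_ap:.3f}")
```

Because the default C=1.0 is one of the candidates and the folds are fixed, the gain here cannot be negative; whether it is large enough to matter is the judgment call the protocol is meant to support.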

Minimal Example¶

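from sklearn.model_selection import GridSearchCV

# Assumes model is an estimator with a C parameter (for example LogisticRegression)
# and cv is the splitter you locked before tuning.
search = GridSearchCV(model, {"C": [0.1, 1.0, 10.0]}, cv=cv, scoring="roc_auc")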

Worked Pattern¶

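from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Assumes cv, X_train, and y_train come from the split you locked earlier.
search = GridSearchCV(
    pipeline,
    {"model__C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="average_precision",
    return_train_score=True,
)
search.fit(X_train, y_train)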

The important part is not the exact grid. It is that preprocessing stays inside the pipeline and the search happens only inside the training boundary.

What To Read After Fitting¶

Read these outputs before you celebrate a winner:

best_params_
best_score_
best_estimator_
cv_results_

cv_results_ matters because it shows the whole candidate table, not just the winner. That makes it easier to see whether the gain is broad or just a one-point spike.
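
One way to read the whole candidate table is to load cv_results_ into a pandas DataFrame, as in the sketch below. The dataset is synthetic, and the gap column is a hypothetical helper added for this example, not something scikit-learn produces.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    pipeline,
    {"model__C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="average_precision",
    return_train_score=True,  # needed for the mean_train_score column
)
search.fit(X, y)

columns = [
    "param_model__C", "mean_test_score", "std_test_score",
    "mean_train_score", "rank_test_score",
]
table = (
    pd.DataFrame(search.cv_results_)[columns]
    # gap is a hypothetical helper column: train score minus validation score.
    .assign(gap=lambda d: d["mean_train_score"] - d["mean_test_score"])
    .sort_values("rank_test_score")
)
print(table)
```

A broad gain shows up as several neighboring rows with similar mean_test_score; a one-point spike shows up as a single row standing apart, often with a larger gap.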

One-Parameter Check¶

from sklearn.model_selection import validation_curve

train_scores, valid_scores = validation_curve(
    pipeline,
    X_train,
    y_train,
    param_name="model__C",
    param_range=[0.01, 0.1, 1.0, 10.0],
    cv=cv,
    scoring="average_precision",
)

Use this when you want to answer one question first:

is the model under-regularized?
is the model over-regularized?
is the gain broad enough to matter?

If the curve is flat, more tuning may not be the right next move.
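
The arrays returned by validation_curve have one row per parameter value and one column per fold, so they are easiest to read after averaging across folds. The sketch below does that on synthetic data; the spread variable is a hypothetical summary for judging flatness, not a scikit-learn output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_range = [0.01, 0.1, 1.0, 10.0]

train_scores, valid_scores = validation_curve(
    pipeline, X, y,
    param_name="model__C",
    param_range=param_range,
    cv=cv,
    scoring="average_precision",
)

# Shape is (n_param_values, n_folds): average over folds for each value of C.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
valid_std = valid_scores.std(axis=1)

for C, tr, va, sd in zip(param_range, train_mean, valid_mean, valid_std):
    print(f"C={C:<6} train={tr:.3f} valid={va:.3f} +/- {sd:.3f}")

# Hypothetical flatness check: range of the validation means across C.
spread = float(valid_mean.max() - valid_mean.min())
print(f"validation spread across C: {spread:.3f}")
```

If the spread is smaller than the fold-to-fold standard deviation, the curve is effectively flat for this parameter.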

Search Helper¶

from sklearn.model_selection import ParameterGrid

list(ParameterGrid({"model__C": [0.1, 1.0], "model__penalty": ["l2"]}))

This is useful when you want to sanity-check the search size before spending time on the run.
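
A quick way to size a run before starting it is to count the candidates and multiply by the number of folds. The grid and fold count below are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

grid = {"model__C": [0.01, 0.1, 1.0, 10.0], "model__penalty": ["l2"]}
n_folds = 5  # assumed CV design

n_candidates = len(ParameterGrid(grid))
# Each candidate is fit once per fold; the final refit adds one more fit.
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # prints: 4 20
```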

Search Space Design Under Budget¶

A good search space is narrow enough to teach you something.

use log-scale sweeps for regularization and learning-rate style parameters
tune the one or two capacity controls most likely to matter before secondary knobs
use coarse-to-fine search instead of a huge first grid
keep a clear reason for every parameter in the search

Bad search spaces usually share one symptom: the candidate table is large, but none of the choices would be easy to defend to a teammate.
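
A log-scale, coarse-to-fine sweep can be sketched with NumPy alone. The value of best_C below is a placeholder standing in for whatever the coarse pass actually selects.

```python
import numpy as np

# Coarse pass: one value per decade from 1e-3 to 1e3.
coarse = np.logspace(-3, 3, num=7)

# Placeholder for the winner of the coarse pass (assumption).
best_C = 1.0

# Fine pass: five log-spaced values within one decade on either side.
center = np.log10(best_C)
fine = np.logspace(center - 1, center + 1, num=5)

print(coarse)
print(fine)
```

Two small passes like these cost 12 candidates in total, and every value in the second grid has an obvious reason to be there.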

One-Standard-Error Rule¶

If several candidates are close, prefer the simplest candidate whose score is within one standard error of the best mean.

Practical version:

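# Assumes table is a DataFrame with one row per candidate and two columns:
# mean_score (mean CV score) and sem_score (standard error of that mean).
best_mean = table["mean_score"].max()
best_sem = table.loc[table["mean_score"].idxmax(), "sem_score"]
safe = table[table["mean_score"] >= best_mean - best_sem]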

Then choose the simplest row inside safe, not automatically the row with the very top mean. This protects you from over-reading small tuning differences.
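
One way to build such a table from a fitted search is sketched below. cv_results_ does not report a standard error directly, so this example approximates it as std_test_score divided by the square root of the number of folds; that approximation, the synthetic data, and the choice of "smaller C is simpler" are assumptions of the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipeline, {"model__C": [0.01, 0.1, 1.0, 10.0]},
    cv=cv, scoring="average_precision",
)
search.fit(X, y)

results = search.cv_results_
table = pd.DataFrame({
    "C": [params["model__C"] for params in results["params"]],
    "mean_score": results["mean_test_score"],
    # Rough standard error of the mean across folds (approximation).
    "sem_score": results["std_test_score"] / np.sqrt(cv.get_n_splits()),
})

best_row = table.loc[table["mean_score"].idxmax()]
safe = table[table["mean_score"] >= best_row["mean_score"] - best_row["sem_score"]]

# For C, a smaller value means stronger regularization, so treat it as simpler.
chosen = safe.sort_values("C").iloc[0]
print(safe)
print("chosen C:", chosen["C"])
```

What counts as "simplest" depends on the parameter: for max_depth it would be the shallowest tree, for min_samples_leaf the largest leaf size.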

Split And Scoring Must Match The Task¶

Search quality depends on the scorer and the splitter:

imbalanced queue: optimize average_precision or a threshold-aware metric, not plain accuracy
grouped data: use GroupKFold or StratifiedGroupKFold
time-aware data: use an ordered splitter, not shuffled CV

If the search uses the wrong split or the wrong metric, the best parameters are only best for the wrong problem.
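
The sketch below checks the two properties that grouped and time-aware splitters are meant to guarantee. The toy arrays and group labels are hypothetical; with a real search, the same group labels would be passed as search.fit(X, y, groups=groups).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Toy data: 12 rows, 4 hypothetical groups of 3 rows each, in time order.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

# Grouped data: no group may appear on both sides of a split.
group_cv = GroupKFold(n_splits=4)
group_overlap = any(
    set(groups[train_idx]) & set(groups[valid_idx])
    for train_idx, valid_idx in group_cv.split(X, y, groups)
)

# Time-aware data: every training row must come before every validation row.
time_cv = TimeSeriesSplit(n_splits=3)
time_ordered = all(
    train_idx.max() < valid_idx.min()
    for train_idx, valid_idx in time_cv.split(X)
)

print("group overlap:", group_overlap)
print("time ordered:", time_ordered)
```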

What To Watch For¶

a grid that is so wide it becomes hard to interpret
a best setting that barely beats the default
a tuning run that changes the validation story only by chance
a pipeline that accidentally leaks preprocessing information
a large train-validation gap hidden behind one average score
a search that takes longer to explain than the gain is worth

The important signal is not "did the score move?" It is "did the score move in a way I can defend?"

When To Use Halving Search¶

Use halving search when:

the grid is large enough that full search is expensive
you want to eliminate weak candidates early
you can accept a more aggressive search strategy

Use it carefully:

keep the split fixed
compare it against a smaller ordinary search first
check whether the winner is stable enough to justify the shortcut
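
A minimal halving run might look like the sketch below. The estimator is still experimental in scikit-learn, so the enabling import has to come first. The dataset, the grid, and factor=3 are assumptions chosen to keep the example small.

```python
# The enabling import must run before HalvingGridSearchCV is imported.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
# Keep the split fixed so every round is judged on the same design.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = {"model__C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
halving = HalvingGridSearchCV(
    pipeline,
    grid,
    cv=cv,
    scoring="roc_auc",
    factor=3,  # keep roughly the top third of candidates each round
    random_state=0,
)
halving.fit(X, y)

# Candidates and samples used in each successive round.
print("candidates per round:", halving.n_candidates_)
print("samples per round:", halving.n_resources_)
print("winner:", halving.best_params_)
```

Early rounds score candidates on a subsample, so a setting that needs more data to look good can be eliminated too soon. Comparing the winner against a small ordinary GridSearchCV on the same folds is a cheap stability check.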

What To Try¶

tune C for logistic regression
tune max_depth or min_samples_leaf for a tree model
compare a small grid with a randomized search on the same metric
inspect one parameter with validation_curve before tuning two at once
use ParameterGrid to reason about the search space before the run
try halving search only when the full search would be too slow

Failure Pattern¶

Scaling or imputing on the full dataset before the search begins. Preprocessing must stay inside the pipeline so each fold is treated honestly.
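
The difference between the leaky and the honest setup is structural, and the sketch below shows both side by side on synthetic data. With plain scaling the two scores may be nearly identical; the size of the leak depends on the preprocessing step and the dataset, so treat the printed numbers as an illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the scaler sees every row, including rows later used for validation.
X_scaled_all = StandardScaler().fit_transform(X)
leaky = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled_all, y, cv=cv, scoring="roc_auc"
)

# Honest: the scaler is refit inside each training fold by the pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")

print(f"leaky  mean: {leaky.mean():.4f}")
print(f"honest mean: {honest.mean():.4f}")
```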

Another failure pattern is making the grid too wide. A search that is too big becomes a time sink and often rewards luck more than understanding.

Another failure pattern is tuning several knobs at once before you know which one matters. If you cannot explain why a parameter belongs in the search, it probably should not be there yet.

Another failure pattern is trusting the best score without checking the spread, the training score, and the candidate table.

Another common counterexample is a wide search where one extreme candidate wins by 0.002 on validation but does worse on the weakest data slice, inflates the training score, or beats a simpler candidate by less than one standard error. That is not a robust win.

Inspection Habits¶

compare the best score with the baseline score, not just the neighboring candidates
check whether the train score rises much faster than the validation score
inspect whether one parameter dominates the result
prefer the smallest setting that gives a repeatable gain
read the whole candidate table before announcing a winner

If a smaller setting is nearly as good as the best one, the smaller setting is often the more defensible choice.

Practice¶

Tune one hyperparameter grid for logistic regression.
Tune one small tree-based grid.
Explain why the search happens only inside the training boundary.
Name one setting you would not tune on the first pass.
Explain what a small but consistent gain means compared with a one-off large jump.
Describe how you would decide whether RandomizedSearchCV is enough.
State what you would lock before the second tuning pass.
Explain when a default setting is already good enough.
Use validation_curve to decide whether one parameter is worth tuning further.
Explain what best_estimator_ and cv_results_ each tell you after a search.

Runnable Example¶

Open the matching example in AI Academy and run it from the platform.

Inspect the best parameter choice and the validation metrics after the search finishes.

Longer Connection¶

Continue with scikit-learn Validation and Tuning for a fuller tuning and calibration workflow.