This post comes with a notebook to code along in - it works in your own environment.
Try it on your own data, run it on any Dask cluster!
You’ve selected the best modeling approach - now it’s time to get the high score. What’s the best set of hyperparameters to use? Parallel hyperparameter optimization on a cluster can save you time and give you answers faster.
Working with the Wine Quality dataset, we’ll optimize an XGBoost model using scikit-learn Grid Search and Optuna. All of this on a cluster, directly from your own environment!
Most machine learning algorithms come with a decent set of knobs and levers - hyperparameters - to adjust their training procedure. They come with defaults, and veteran data practitioners have their go-to settings.
Finding the best configuration holds the promise of significantly better performance, so how do we strike a balance between the resources spent on optimization and the performance we get back?
In this post we’ll examine increasingly sophisticated approaches to solving that conundrum: grid search on a laptop, the same grid search on a Dask cluster, and finally Bayesian optimization with Optuna.
Let’s find the best set of hyperparameters for an XGBoost model that predicts the quality of the wine based on 11 descriptive features [Cortez et al., 2009].
import pandas as pd
import numpy as np

url = '.../wine-quality/winequality-white.csv'
wine = pd.read_csv(url, delimiter=';')
# Separate the 11 descriptive features from the 'quality' target column
X = wine.drop(columns='quality')
y = wine['quality']
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import cross_val_score

estimator = XGBRegressor(objective='reg:squarederror')

# Negate the result: 'neg_mean_absolute_error' returns negative values
-cross_val_score(
    estimator, X, y,
    cv=5, scoring='neg_mean_absolute_error'
).mean()
# Out: 0.6171626137196118
Out of the box, the model learns to predict wine quality on a scale of 1 to 10 with a mean absolute error of 0.617. Can we do better than that? How many options do we need to try? When should we stop looking?
The appropriate time for hyperparameter optimization is after we have selected one or more high-conviction models that work well on a smaller data sample. Finding the optimal set of hyperparameters for a half-baked solution is not a useful thing to do.
Framing the problem correctly, discovering useful features, and getting promising results on a modeling approach are good boxes to check before moving on to optimization.
Two resources worth checking out before scaling are these excellent notes by Tom Augspurger, and `dabl`, a new Python library started by Andreas Mueller, which helps jumpstart work on ML problems!
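As a rough illustration of the kind of jumpstart `dabl` offers - this is only a sketch using the `wine` dataframe loaded above, and the exact API may differ from your `dabl` version:

import dabl

# Quick automated exploration of the dataframe (target column 'quality')
dabl.plot(wine, target_col='quality')

# Fast baseline sweep across several simple models
baseline = dabl.SimpleRegressor().fit(wine, target_col='quality')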
Problems, and datasets, are unique. Large ensemble models have so many settings that there is no way of knowing whether an initial configuration is anywhere near the optimum. Trying really is the only way.
Trying means evaluating a very large number of parameter combinations.
If we had enough computers, we could try all the interesting options at the same time!
This is a compute-bound problem: our data fits in RAM, but the processor keeps cranking through the various options. If you’ve ever done backtesting on financial time series data - yup, that’s the same type of situation.
Let’s launch this example grid search - starting with a list of parameter variants to combine and try:
params = {
    'max_depth': [3, 6, 8, 16],
    'min_child_weight': [3, 5, 10, 20, 30],
    'eta': [.3, .2, .1, .05, .01],
    'colsample_bytree': np.arange(.7, 1., .1),
    'sampling_method': ['uniform', 'gradient_based'],
    'booster': ['gbtree', 'dart'],
    'grow_policy': ['depthwise', 'lossguide']
}
And running the scikit-learn helper:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=params,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    cv=5
)
grid_search.fit(X, y)
This is a good time to go away and make dinner, as we’ll be trying…
5 folds for each of 3200 candidates, totaling 16000 fits.
That takes 35 minutes on my laptop (8-core CPU).
Not a great option, especially if I had more data, or a greater number of models to work on!
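By the way, the 3,200 candidates are simply the product of the number of options per hyperparameter, and 5-fold cross-validation multiplies the number of fits again - a quick check:

import math

# 4 * 5 * 5 * 4 * 2 * 2 * 2 = 3,200 candidate combinations
n_candidates = math.prod(len(options) for options in params.values())
print(n_candidates, n_candidates * 5)  # 3200 16000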
# Negate to recover the MAE of the best candidate
-grid_search.best_score_
# Out: 0.5683355409303652
Yay, better score!
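If you also want to see which combination won - not just the score - scikit-learn keeps it on the fitted search object (output omitted here):

# The winning hyperparameter combination found by the grid search
grid_search.best_params_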
Dask.distributed comes to my aid - the library integrates with scikit-learn through the joblib parallel backend and allows me to send my work to a cluster.
There are many ways to set up a Dask cluster, one of them is Coiled.
With Coiled, I can get a cluster of any scale in 2 minutes - at the moment, I’m going to take advantage of the free beta limit of 100 cores and order 24 workers with 4 cores each, directly from my Jupyter notebook.
import joblib
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=24, configuration='michal-mucha/xgb-action')
client = Client(cluster)
Next, I’ll use joblib - the scikit-learn parallel backend - to create a context that sends my data to all cluster workers and orders them to fit the variety of configurations:
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

# Fitting 5 folds for each of 3200 candidates, totalling 16000 fits
# [Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 96 concurrent workers.
Checking in on the cluster dashboard, I can see the resources are as busy as they should be.
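The dashboard URL, by the way, is available straight from the client object:

# URL of the Dask dashboard for the connected cluster
print(client.dashboard_link)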
Coiled meticulously calculates (and estimates) cost - this exercise took just under 6 minutes and cost a total of $1.50. That’s Coiled’s bill, though - for you, it’s free!
When to stop the search?
As mentioned before, the number of combinations to try explodes very quickly as we add parameters with many possible options. Various approaches to the problem of optimization prescribe specific constraints, which rein in this exuberance. Here are some of them:
Random search - for every hyperparameter, the data scientist provides either a distribution to sample random values from, or a list of values to sample from uniformly. What constrains the computation is an explicit budget of combinations to try - for example, we can order 100 random sets of values.
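As a sketch of what that could look like with scikit-learn - reusing the parameter lists and estimator defined above; the post itself doesn’t run this step:

from sklearn.model_selection import RandomizedSearchCV

# A budget of 100 randomly sampled combinations instead of the full 3,200-cell grid
random_search = RandomizedSearchCV(
    estimator=estimator,
    param_distributions=params,  # lists are sampled uniformly; scipy distributions also work
    n_iter=100,
    scoring='neg_mean_absolute_error',
    cv=5,
    n_jobs=-1,
)
random_search.fit(X, y)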
Hyperband - this novel algorithm represents the search as an “infinite-armed bandit”, referring to the casino analogy of “which is the best-paying slot machine to play”. It uses the same constraint as random search - a budget of iterations - but manages that budget much more consciously: winning combinations are explored further, while losing combinations are abandoned early.
However, Hyperband supports a limited set of models, since it requires the model to expose a `partial_fit` method. XGBoost does not, so we won’t use it in this example.
Bayesian Optimization - a sequential approach aimed at functions that are expensive to evaluate, which seeks to suggest the most valuable next set of trials.
Representing the output metric - for example, model accuracy - with a probabilistic function allows efficient search guided by reducing uncertainty.
Here, as in Hyperband and Random Search, we control how many iterations are computed.
Optuna, a popular Python library, implements a parameter sampler that leverages this technique and works out of the box.
For a deeper dive, check out this wonderful introduction by Crissman Loomis, AI Engineer at Preferred Networks, the team leading the work on Optuna.
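Under the hood, Optuna’s `create_study` defaults to the TPE (Tree-structured Parzen Estimator) sampler; making it explicit is optional, and the study we actually run below simply relies on the default. A minimal sketch:

import optuna

# TPE is the default sampler; passing it explicitly (e.g. with a seed
# for reproducibility) is optional
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="maximize", sampler=sampler)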
Let’s give it a spin - a LocalCluster is enough.
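If you’re coding along, spinning up a local Dask cluster and client could look like this - a minimal sketch; the companion notebook may set things up slightly differently:

from dask.distributed import Client, LocalCluster

# By default, LocalCluster uses the cores available on your machine (16 on my laptop)
cluster = LocalCluster()
client = Client(cluster)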
First, we need to define the objective function for Optuna to optimize. Within it, we will use the probabilistic sampler to get better suggestions for hyperparameter combinations:
def objective(trial):
    # Probabilistic sampler invocation on a parameter-by-parameter basis
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 16, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 3, 30, log=True),
        "eta": trial.suggest_float("eta", 1e-8, 1.0, log=True),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.7, 1.0, log=True),
        "sampling_method": trial.suggest_categorical(
            "sampling_method", ["uniform", "gradient_based"]
        ),
        "booster": trial.suggest_categorical("booster", ["gbtree", "dart"]),
        "grow_policy": trial.suggest_categorical(
            "grow_policy", ["depthwise", "lossguide"]
        ),
    }
    # Initializing the model with the sampled hyperparameters
    estimator = XGBRegressor(objective="reg:squarederror", **params)
    # Mean of negative MAE across 5 folds - higher (closer to zero) is better
    score = cross_val_score(
        estimator, X, y, cv=5, scoring="neg_mean_absolute_error"
    ).mean()
    return score
Then, all that is left to do is to launch an Optuna study, using Dask as a parallel backend:
import dask_optuna
import joblib
import optuna

storage = dask_optuna.DaskStorage()
study = optuna.create_study(direction="maximize", storage=storage)

with joblib.parallel_backend("dask"):
    study.optimize(objective, n_trials=100, n_jobs=-1)
This guided approach produces a result very near to the optimal outcome from Grid Search, within just 100 trials.
# Negate to recover the MAE of the best trial
-study.best_value
# Out: 0.569361444674037
Best trial marked in red.
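The winning hyperparameter combination itself is also available on the study object (values omitted here):

# Hyperparameter values of the best trial
study.best_params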
The execution took just 1 minute 13 seconds on a 16-core local cluster on my laptop, illustrating how a state-of-the-art approach can save enormous amounts of computation.
When dealing with bigger datasets, or more expensive objective functions, the best approach is to combine a great algorithm with scalable compute.
For code examples and a fully-configured environment, you can try large-scale, distributed XGBoost hyperparameter optimization with Optuna and Dask in this notebook.
For larger datasets, try the Higgs dataset, composed of 11 million simulated particle collisions, which you can access in our S3 bucket at s3://coiled-data/higgs/.
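Reading it with Dask could look something like this - a sketch only; the file layout and format under that prefix are assumptions, so adjust the pattern to what you find in the bucket:

import dask.dataframe as dd

# Hypothetical file pattern - check the actual layout under s3://coiled-data/higgs/
higgs = dd.read_csv(
    "s3://coiled-data/higgs/*.csv",
    storage_options={"anon": True},  # public read access assumed
)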
                  Best Score (MAE)   Number of trials
Starting value    0.61716            1
Grid Search       0.56833            16,000
Optuna            0.56936            100
A combination of Dask and Optuna makes hyperparameter tuning on big data much more approachable.
Going distributed is a great option to have when solving CPU-bound problems such as hyperparameter optimization - the other option is to wait longer!
Applying a smarter search algorithm is a good remedy, but when working with larger datasets and models, the ability to scale is irreplaceable.
What is really special about prototyping Dask code is that talking to a local cluster feels the same as talking to an enormous remote cluster. We can make sure all the code works before commissioning a fleet of machines.
With a Coiled cluster, you have the choice to burst to the cloud directly from your coding environment. The code you run is the same code that works on your local machine.
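In practice, the switch is essentially a one-line change - here’s a rough sketch, reusing the cluster configuration from earlier; everything downstream of the client stays the same:

from dask.distributed import Client, LocalCluster
import coiled

# Prototype against a local cluster...
client = Client(LocalCluster())

# ...then point the same code at a Coiled cluster when you need the power boost
client = Client(coiled.Cluster(n_workers=24, configuration='michal-mucha/xgb-action'))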
You can hire as many CPUs as you need and release them as soon as you’re done, keeping cost under control while enjoying a meaningful power boost.
Some exciting ways to take this work further: try these techniques on a larger dataset like Higgs, plug in your own models and objective functions, or scale the Optuna study out to a remote cluster.
Thanks for reading! If you’re interested in taking Coiled Cloud for a spin - hosted Dask clusters, Docker-less managed software, and zero-click deployments - you can do so for free today by clicking below.
Dataset citation:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.