Hugo Bowne-Anderson

March 11, 2021

This post comes with a notebook to code along in - it works in your own environment.

Try it on your own data, run it on any Dask cluster!

You’ve selected the best modeling approach - now it’s time to get the high score. What’s the best set of hyperparameters to use? Parallel hyperparameter optimization on a cluster can save you time and give you answers faster.

Working with the Wine Quality dataset, we’ll optimize an XGBoost model using scikit-learn Grid Seach and Optuna. All of this on a cluster directly from your own environment!

Most machine learning algorithms come with a decent set of knobs and levers - hyperparameters - to adjust their training procedure. They come with defaults, and veteran data practitioners have their go-to settings.

Finding the best configuration holds the promise of getting significantly better performance, so how do we strike the balance with resources spent on optimization and absolute performance?

In this post we’ll examine increasingly sophisticated approaches to solving that conundrum:

- Grid Search - to infinity and beyond! Trying every combination.
- Random Search - okay I don’t have all day, let’s buy a fixed amount of lottery tickets
- Hyperband - let’s invest more in lottery tickets similar to those that are winning
- Optuna - intelligent, algorithmic discovery of the hyperparameter space

Let’s find the best set of hyperparameters for an XGBoost model that predicts the quality of the wine based on 11 descriptive features [Cortez et al., 2009].

//import pandas as pd

import numpy as np

Url = '.../wine-quality/winequality-white.csv'

wine = pd.read_csv(url, delimiter=';')

//]]>//from xgboost.sklearn import XGBRegressor

from sklearn.model_selection import cross_val_score

estimator = XGBRegressor(objective='reg:squarederror')

-cross_val_score(

estimator, X, y,

cv=5, scoring='neg_mean_absolute_error'

).mean()

# Out: 0.6171626137196118

//]]>

Out of the box, the model learns to predict wine quality on a scale 1 to 10 with mean absolute error 0.617. Can we do better than that? How many options do we need to try? When should we stop looking?

The appropriate time for hyperparameter optimization is after we have selected one or more high-conviction models that work well on a smaller data sample. Finding the optimal set of hyperparameters for a half-baked solution is not a useful thing to do.

Framing the problem correctly, discovering useful features, and getting promising results on a modeling approach are good boxes to check before moving on to optimization.

Two resources worth checking out before scaling are these excellent notes by Tom Augspurger, and `dabl` a new Python library started by Andreas Mueller, which helps jumpstart work on ML problems!

Problems, and datasets, are unique. Large ensemble models have so many settings, there is no way of knowing if an initial setting is anywhere near optimum. Trying really is the only way.

Trying means evaluating a very large number of parameter combinations.

If we had enough computers, we could try all the interesting options at the same time!

This is a compute-bound problem, as our data fits in ram but the processor keeps cranking out the various options. If you’ve ever done backtesting on financial time series data, yup, that’s the same type of situation.

https://lh6.googleusercontent.com/4AIa8IYovao-TB-iruNcpa_ONOnhTes7qhPYeUS35zOhzhpvmd7deXtQC2SAQPg_jwo78z5L17T8gB58CaF_gPwwzuHBB8yd_taXw28bxn7PdJZ5RTJlKWBpRRbtRijvwJ2MSX6Z

Let’s launch this example grid search - starting with a list of parameter variants to combine and try:

//params = {

'max_depth': [3, 6, 8, 16],

'min_child_weight': [3, 5, 10, 20, 30],

'eta': [.3, .2, .1, .05, .01],

'colsample_bytree': np.arange(.7, 1., .1),

'sampling_method ': ['uniform', 'gradient_based'],

'booster': ['gbtree', 'dart'],

'grow_policy': ['depthwise', 'lossguide']

}//]]>

And running the scikit-learn helper.

//from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(

estimator=estimator,

param_grid=params,

scoring = 'neg_mean_absolute_error',

n_jobs = -1,

cv = 5

)

grid_search.fit(X, y)//]]>

This is a good time to go away and make dinner, as we’ll be trying..

**5 folds for each of 3200 candidates, totaling 16000 fits.**

That takes 35 minutes on my laptop (8-core CPU).

Not a great option, especially if I had more data, or a greater number of models to work on!

//-grid_search.best_score_

# Out: 0.5683355409303652//]]>

Yay, better score!

https://lh6.googleusercontent.com/iinfccDM6BVOKAz02HVph3qEd82TjPq-J3RkFccOFJBM63r7vAA21GX7fZjaDezEOFWpprklY2jdG8nq39bpd8xjH7itI75xWjInHn4UKEqUEwB6mpYgv-o0e6HgLFmyJdkiPzIB

Dask.distributed comes to my aid - the library directly integrates with sklearn through its parallel backend, and allows me to send my work to a cluster.

There are many ways to set up a Dask cluster, one of them is Coiled.

With Coiled, I can get a cluster of any scale in 2 minutes - at the moment I’m going to take advantage of the free beta limit of 100 cores and order 24 workers with 4 cores each. I’m going to do it directly from my Jupyter notebook.

//import joblib

import coiled

from dask.distributed import Client

cluster = coiled.Cluster(n_workers=24, configuration='michal-mucha/xgb-action')

client = Client(cluster)//]]>

Next, I’ll use joblib - the scikit-learn parallel backend - to create a context that sends my data to all cluster workers and orders them to fit the variety of configurations:

//with joblib.parallel_backend("dask", scatter=[X, y]):

grid_search.fit(X, y)

Fitting 5 folds for each of 3200 candidates, totalling 16000 fits

[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 96 concurrent workers.

//]]>

https://lh3.googleusercontent.com/GYL6ki6bCW_AAdEjV2ZZlXiDSJBlIFQu7KIj-GpyjW3CgkApUNXTR2PIbxGdPNQRRJFVkiN1SAT7e6g6Rl2yLW-fsTaCCfFyHxkxGmpyNVJtf-bQ0hObMDaRAL9Zs3vc65dl93eJ

Checking in on the cluster dashboard, I can see the resources are as busy as they should be.

https://lh4.googleusercontent.com/RbICE40YY5DD_7TGmqF3oyDrDvOApCWr-mwLwWPlH4qmcFFPclqpSXiZQTS5VtfxvtDMY8WNyF-WJ9uNeFrdp6oUXvruPeH4KsjX1YzRmwaQA4LV67QabhJZMl2ZDb6cn9D0Eg_7

Coiled meticulously calculates (and estimates) cost - this exercise **took just under 6 minutes** and cost a total of $1.50.

For Coiled. For you, it’s free!

https://lh3.googleusercontent.com/pIB8V09axAaBJcaqHonbgz--QajynpHCMKG_Ij7JSSqr9jvtdG_0It04WCj4i6UrdOHXmY13bZv8C4JlZtdDMV3H0HVI9y9LSc8aoyZ0M1k1rslDhdjJ-KDI9k6eKN92A_rzW804

*When to stop the search?*

As mentioned before, the number of combinations to try explodes very quickly as we add parameters with many possible options. Various approaches to the problem of optimization prescribe specific constraints, which rein in this exuberance. Here are some of them:

For every hyperparameter, the data scientist provides either a distribution to sample random values from, or a list of values to sample from uniformly. What constrains the computation is the explicit number of allowed combinations to try. For example, we can order 100 random sets of values.

This novel algorithm represents the search as an “infinite-armed bandit", referring to the casino analogy of “which is the best-paying slot machine to play”. It uses the same constraint as random search - a budget of iterations. The management of that budget is done much more consciously - winning combinations are tried more while losing combinations are abandoned early.

Hyperband supports a limited set of training models however, since it requires a model to expose the `partial_fit` method. Since XGBoost does not support it, we won’t use it in this example.

A sequential approach to optimization aimed at functions that are expensive to evaluate, Bayesian Optimization seeks to suggest the most valuable next set of trials.

Representing the output metric - for example, model accuracy - with a probabilistic function allows efficient search guided by reducing uncertainty.

Here, as in Hyperband and Random Search, we control how many iterations are computed.

Optuna, a popular Python library, implements a parameter sampler that leverages this technique and works out of the box.

For a deeper introduction, check out this wonderful introduction by Crissman Loomis, AI Engineer at Preferred Networks, the team leading the work on Optuna.

Let’s give it a spin - a LocalCluster is enough.

First, we need to define the objective function for Optuna to optimize. Within it, we will use the probabilistic sampler to get better suggestions for hyperparameter combinations:

//def objective(trial):

# Probabilistic sampler invocation on a parameter-by-parameter basis

params = {

"max_depth": trial.suggest_int("max_depth", 3, 16, log=True),

"min_child_weight": trial.suggest_int("min_child_weight", 3, 30, log=True),

"eta": trial.suggest_float("eta", 1e-8, 1.0, log=True),

"colsample_bytree": trial.suggest_float("colsample_bytree", 0.7, 1.0, log=True),

"sampling_method ": trial.suggest_categorical(

"sampling_method", ["uniform", "gradient_based"], log=True

),

"booster": trial.suggest_categorical("booster", ["gbtree", "dart"]),

"grow_policy": trial.suggest_categorical(

"grow_policy", ["depthwise", "lossguide"]

),

}

# Intializing the model with hyperparameters

estimator = XGBRegressor(objective="reg:squarederror", **params)

score = cross_val_score(

estimator, X, y, cv=5, scoring="neg_mean_absolute_error"

).mean()

return score

//]]>

Then, all that is left to do is to **launch an Optuna study**, using Dask as a parallel backend:

//import dask_optuna

import joblib

import optuna

storage = dask_optuna.DaskStorage()

study = optuna.create_study(direction="maximize", storage=storage)

with joblib.parallel_backend("dask"):

study.optimize(objective, n_trials=100, n_jobs=-1)

//]]>

This guided approach produces a result very near to the optimal outcome from Grid Search, within just 100 trials.

//-study.best_value

# Out: 0.569361444674037//]]>

https://lh4.googleusercontent.com/iAvy4D1MmMgx5shyn4KGRAU1-T1JiDLusSVkRiOAnmzwE6hN8sXEesP_95Y16hUuQtMTRGQ8J9ekNYu-smZ2hLbSfm4fZi0p1NQqvGj05sEdcIK-3_opxGrPMmXnN7icb99oPosy

*Best trial marked in red.*

The execution took just 1 minute 13 seconds on a 16-core local cluster on my laptop, illustrating how a state-of-the-art approach can save enormous amounts of computation.

When dealing with bigger datasets, or more expensive objective functions, the best approach is to combine a great algorithm with scalable compute.

For code examples and a fully-configured environment, you can try large-scale, distributed XGBoost hyperparameter optimization with Optuna and Dask in this notebook.

For larger datasets, try the Higgs dataset, composed of 11 million simulated particle collisions, which you can access in our S3 bucket at **s3://coiled-data/higgs/**.

Best Score (MAE)Number of trialsStarting value0.617161Grid Search**0.56833**16,000Optuna0.56936**100**

- I’ve not done any work on framing the problem or error analysis. The model was trained on a regression task - would making it a classification task allow the model to learn better?
- The grid search work can be called out as a bit unfair. By micromanaging the parameters and ranges to choose from, I could have gotten a pretty good result with fewer tries. I would have been doing Optuna’s work in that case!
- Testing the approaches on more models, especially ones that support Hyperband, would make for a more interesting comparison.
- After demonstrating how well Optuna handles hyperparameter optimization, a natural step would be to leverage a cluster to run the experiment on a much larger dataset.

A combination of Dask and Optuna makes hyperparameter training on big data much more approachable.

Going distributed is a great option to have when solving CPU-bound problems such as hyperparameter optimization - the other option is to wait longer!

Applying a smarter approach is a good remedy, but when working with larger datasets models, the ability to scale is irreplaceable.

What is really special about prototyping Dask code, is that talking to a local cluster feels the same as talking to an enormous remote cluster. We can make sure all the code works *before* commissioning a fleet of machines.

With a Coiled cluster, you have the choice to burst to the cloud directly from your coding environment. The code you run is the same code that works on your local machine.

You can hire as many CPUs as you need, and release them as soon as you’re done. Keeping cost under total control while using a meaningful power boost.

Some exciting ways to take this work further:

- Represent the task as a classification problem.
- Use a different machine learning model and compare Hyperband with Optuna.
- Take the winning approach from a Kaggle competition such as this churn prediction challenge, and try to improve the score by tuning hyperparameters further.

Thanks for reading! If you’re interested in taking Coiled Cloud for a spin, which provides hosted Dask clusters, docker-less managed software, and zero-click deployments, you can do so for free today when you click below.

*Dataset citation:*

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.