You can train a sklearn models in parallel using the sklearn joblib interface. This allows sklearn to take full advantage of the multiple cores in your machine and speed up training.
Using the Dask joblib backend you can maximize parallelism by scaling your sklearn models training out to a remote cluster for even greater performance gains on your data science workflows.
This works for all sklearn models that exposes the n_jobs keyword, like random forest, linear regression, and other machine learning algorithms. You can also use the joblib library for distributed hyperparameter tuning with grid search cross-validation.
When training machine learning models, you can run into 2 types of scalability issues: your model size may increase or your data size may start to cause issues (or even worse–both!). You can typically recognize a scalability problem related to your model size by long training times: the computation will complete (eventually) but it’s becoming a bottleneck in your data processing pipeline. This means you’re in the top-left corner of the diagram below:
Using the sklearn joblib integration, you can address this problem by processing your sklearn models in parallel, distributed over the multiple cores in your machine. This is possible for any sklearn algorithm that exposes the n_jobs keyword.
Note that this is only a good solution if your training data and test data fit in memory comfortably. If this is not the case (and you’re in the “I give up!” quadrant of large models and large datasets), see the “Dataset Too Large?” section below to learn how Dask-ML or XGBoost-on-Dask can help you out.
Let’s see this in action with an sklearn algorithm that is embarrassingly parallel: a random forest model. A random forest model fits a number of decision tree classifiers on sub-samples of the training data and combines the results from the various decision trees to make a prediction. Because each decision tree is an independent process, this model can easily be trained in parallel.
We’ll import the necessary libraries and create a synthetic classification dataset.
Then we’ll instantiate our Random Forest classifier
We can then use a joblib context manager to train this classifier in parallel, specifying the number of cores we want to use with the n_jobs keyword.
It gets even better. To save users from having to use context managers every time they want to train a model in parallel, sklearn developers exposed the n_jobs keyword as part of the instantiating call:
The n_jobs keyword communicates to the joblib backend and you can directly call clf.fit(X, y) without wrapping it in a context manager. This is the recommended approach for using joblib to train sklearn models in parallel locally.
Running this locally with n_job = -1 on a MacBook Pro with 8 cores and 16GB of RAM takes just over 2 minutes.
The context manager syntax can still come in handy when your model size increases beyond the capabilities of your local machine or if you want to train your model even faster. In those situations you may want to consider scaling out to remote clusters on the cloud. This can be especially useful if you’re running heavy grid search cross-validation or other forms of hyperparameter tuning.
You can use the Dask backend to joblib to delegate the distributed training of your model to a Dask cluster of virtual machines in the cloud.
We’ll first use Coiled to launch a Dask cluster of 20 machines in the cloud.
We’ll then go ahead and point Dask to our remote cluster. This will ensure that all future Dask computations are routed to the remote cluster instead of to our local cores.
Now use the joblib context manager to specify the Dask backend. This will delegate all model training to the workers in your remote cluster:
As you can see in this screen capture below, model training is now occurring in parallel on your remote Dask cluster.
This runs in about a minute on my machine, that’s a 2x speed-up. You could make this run even faster by adding more workers to your cluster.
You can also use your Dask cluster to run an extensive hyperparameter grid search that would be too heavy to run efficiently on your local machine:
Before we execute this compute-heavy Grid Search, let’s just scale up our cluster to 100 workers to speed up training and ensure we don’t run into memory overloads:
Now let’s execute the grid search in parallel:
If this is your first time working in a distributed computing system or with remote clusters, you may want to consider reading The Beginner’s Guide to Distributed Computing.
Is your dataset becoming too large, for example because you have acquired new data? In that case you may be experiencing two scalability issues at once: both your model and your dataset are becoming too large. You typically notice an issue with your dataset size by pandas throwing a MemoryError when trying to run a computation over the entire dataset.
Dask exists precisely to solve this problem. Using Dask, you can scale your existing Python data processing workflows to larger-than-memory datasets. You can use Dask DataFrames or Dask Arrays to process data that is too large for pandas or NumPy to handle. If you’re new to Dask, check out this Introduction to Dask.
Dask also scales machine learning for larger-than-memory datasets. Use dask-ml or the distributed Dask backend to XGBoost to train machine learning models on data that doesn't fit into memory. Dask-ML is a library that contains parallel implementation of many sklearn models.
The code snippet below demonstrates training a Logistic Regression model on larger-than-memory data in parallel.
Dask also integrates natively with XGBoost to train XGBoost models in parallel:
Our XGBoost-on-Dask tutorial walks you through a Python tutorial for training XGBoost on 100GB in less than 4 minutes.
In this blogpost we’ve seen that:
If you’d like to scale your Dask work to the cloud, check out Coiled — Coiled provides quick and on-demand Dask clusters along with tools to manage environments, teams, and costs. Click below to learn more!