How to Train your Machine Learning Model on Terabytes of Data in 1 Hour

Training machine learning models on very large datasets is time-intensive. Many data science teams choose to train their models on a subset of their data to speed up training. Coiled enables you to do your data science work without compromise: you can train machine learning models on terabytes of data in 1 hour or less.

[Image: Poll on r/dataengineering]

Training machine learning models on terabytes of data

Training machine learning models takes time. A lot of time. And while the size of the datasets available to companies keeps increasing, there are still only 24 hours in a day. This means that data science teams are always looking for ways to speed up model training so they can iterate quickly and incorporate the latest data into the final production model.

To work around this, data science teams with datasets in the terabyte range may choose to train their models on a subset of the data, because training on the entire dataset would take hours or even days. This shortens training time, but at the expense of model quality: machine learning models generally improve the more data you give them, so training on a subset degrades the final results.

Train using all your data with Coiled

Coiled allows data science teams with terabytes of data to do their work without compromise: you can train machine learning models on terabytes of data in 1 hour or less.

Coiled runs on Dask, which natively scales Python code to arbitrarily large datasets by distributing computations over multiple machines, drastically speeding up processing time. Coiled uses Dask to run your computations on a cluster of virtual machines in the cloud that you can spin up when you need them and shut down when you don’t. Read this blog post if you want to learn more about the basics of Dask.
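As a small illustration of the idea (independent of Coiled; any Dask cluster will do), Dask splits one large computation into many small tasks and runs them in parallel across workers. The array shape below is only an example, chosen to be far larger than a single machine’s memory:

import dask.array as da

# A roughly 800 GB random array, split into ~8 GB chunks
x = da.random.random((100_000, 1_000_000), chunks=(10_000, 100_000))

# Dask builds a task graph and executes it in parallel across the workers
print(x.mean().compute())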

How does it work?

Once you have signed up and installed Coiled, you can spin up on-demand compute resources with two lines of code:

import coiled

# Launch 100 workers, each with 4 CPU cores and 24 GB of memory
cluster = coiled.Cluster(
    n_workers=100,
    worker_memory="24GB",
    worker_cpu=4,
)

Resources are provisioned within your own VPC environment to guarantee data privacy and security.
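With the cluster running, connect a Dask client to it so that your computations are dispatched to its workers. This is the standard Dask pattern; Coiled clusters expose the usual Dask cluster interface:

from dask.distributed import Client

# Point the Dask client at the Coiled cluster; all subsequent Dask
# computations will run on the cluster's workers
client = Client(cluster)

When you are finished, cluster.close() shuts the workers down and releases the cloud resources.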

From there, you can write native Python code to run your computations. If you’re already working in Python with libraries like pandas, NumPy, and scikit-learn, the Dask API will feel very familiar.
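As a brief sketch, loading a large Parquet dataset into a Dask DataFrame mirrors the pandas API; the bucket path and column name below are placeholders:

import dask.dataframe as dd

# Lazily read a terabyte-scale Parquet dataset from cloud storage (hypothetical path)
data = dd.read_parquet("s3://my-bucket/training-data/")

# Familiar pandas-style operations, executed across the cluster
print(data["target"].mean().compute())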

For example, you can train an XGBoost model like this:

from dask_ml.model_selection import train_test_split

# Create the train-test split
X, y = data.iloc[:, :-1], data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=2
)

import xgboost as xgb

# Create the XGBoost DMatrices from the Dask collections
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
dtest = xgb.dask.DaskDMatrix(client, X_test, y_test)

# Example hyperparameters (illustrative only; tune for your own problem)
params = {
    "objective": "reg:squarederror",
    "max_depth": 6,
    "learning_rate": 0.1,
}

# Train the model on the cluster
output = xgb.dask.train(
    client, params, dtrain, num_boost_round=4,
    evals=[(dtrain, "train")],
)
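xgb.dask.train returns a dictionary containing the fitted booster and the evaluation history recorded during training. As a short follow-on sketch, you could then generate distributed predictions on the held-out test set:

# Extract the trained booster and the recorded evaluation history
booster = output["booster"]
print(output["history"])  # per-round metrics for the 'train' eval set

# Distributed prediction on the test set; the result is a lazy Dask collection
y_pred = xgb.dask.predict(client, booster, dtest)
print(y_pred.compute())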

More Resources

You may find the following resources helpful:

Level up your Dask using Coiled

Coiled makes it easy to scale Dask in the cloud