Deploying and Scaling Data Science Tools

Hugo Bowne-Anderson
August 18, 2020

Jacob Tomlinson, who works at NVIDIA maintaining libraries like RAPIDS, Dask, Dask-Kubernetes and Dask-Cloudprovider, joins Matt Rocklin and Hugo Bowne-Anderson to discuss deployment and scaling of data science tools on distributed systems.

Dask has many cluster manager utilities that help users set up distributed Dask clusters on a variety of infrastructures.

Dask’s distributed tooling lets users start a scheduler with one command and any number of workers with another. However, figuring out where to run them, how to requisition lots of infrastructure, how to get everything talking to each other, and how to access the resulting cluster can be a challenge.
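To make the scheduler-and-workers idea concrete, here is a minimal sketch using `LocalCluster` from `dask.distributed`, which starts a scheduler and a few workers on a single machine. It assumes the `dask` and `distributed` packages are installed. The function name `sum_squares_on_local_cluster` is made up for illustration, and a real deployment would likely swap `LocalCluster` for one of the cluster managers aimed at other infrastructure.

```python
from dask.distributed import Client, LocalCluster


def sum_squares_on_local_cluster(n_workers=2):
    """Start a scheduler plus workers locally, run a small job, then shut down.

    Hypothetical helper for illustration only.
    """
    # processes=False runs the workers as threads inside this process,
    # so the script needs no `if __name__ == "__main__":` guard.
    # dashboard_address=None turns off the diagnostic dashboard.
    with LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,
        processes=False,
        dashboard_address=None,
    ) as cluster:
        # The client connects to the scheduler, which hands tasks to workers.
        with Client(cluster) as client:
            futures = client.map(lambda x: x * x, range(10))
            # Futures nested in a list are resolved before `sum` runs.
            return client.submit(sum, futures).result()


total = sum_squares_on_local_cluster()
print(total)
```

The same `Client` code should carry over to other cluster managers largely unchanged; what differs is how the cluster object itself is created and where the scheduler and workers end up running.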

Jacob also occasionally live-streams open source development work on his Constrained Coding channel. In these streams he often picks a small GitHub issue and opens a pull request to resolve it while racing a 30-minute clock. We thought it might be fun to get multiple brains together on this stream and do one together.

After attending, you’ll know:

  • How distributed Dask clusters communicate
  • Different cluster types (static and ephemeral)
  • The variety of different platforms you can spin up your Dask cluster on
  • How open source contributions work!
