Coiled is built around Dask. We’re maintainers, contributors, and cheerleaders of Dask. Our goal is to make Dask more accessible to everyone. So, it doesn’t come as a surprise that the entire Coiled team was very excited for the Dask Distributed Summit this year. Coilies were involved in everything from planning and organizing the Summit to presenting and facilitating the sessions.
In this blog post, we’ll discuss some sessions led by the Coiled team and highlight key takeaways from each session. We will cover:
New improvements coming to Dask - Active memory management and an accelerated scheduler;
Deeper look at Dask's Internals and the Dask JupyterLab extension;
Integrating Dask with Snowflake and Dask Cloud Deployments.
Integrating Dask with Snowflake

Historically, Dask was not developed as a SQL project, but it was built with the capability to interoperate well with other systems. Matthew Rocklin, the CEO of Coiled, joined Miles Adkins, Partner Sales Engineer at Snowflake, for a presentation on the ongoing efforts to get the best of both worlds: SQL data warehouse with Snowflake and distributed data science with Dask.
Snowflake is excellent at centralized data storage and optimized SQL querying, but it doesn’t support advanced Python activities like machine learning.
Dask can work with a wide variety of data (both traditional SQL databases, and non-traditional forms of data) and handle advanced ML workloads, but Dask isn’t good at data storage.
Hence, it makes sense to think of Snowflake + Dask as a best-of-breed solution for scaling.
As of today, Snowflake and Dask can interoperate in the following ways:
Using dask.dataframe.read_sql_table() (and to_sql() for writes) - works on all major databases and can perform distributed read/write, but it’s slow because it’s backed by ODBC, and it doesn’t support complex queries.
Using Dask Delayed - can also perform distributed read/write, but it’s complicated to use and requires manual effort.
By exporting to Parquet - can perform fast and parallel read/write, but copies of data occupy unnecessary storage space.
Matthew and Miles demonstrated native Snowflake + Dask support that can quickly read/write data in parallel, take care of data partitioning, and manage cleanups automatically:
ddf = dask_snowflake.from_snowflake(
    "SELECT * FROM TableA JOIN TableB ON ..."
)
The Dask JupyterLab Extension - Access Dask Diagnostic Dashboards in JupyterLab
Dask’s diagnostics dashboards are an essential part of Dask. They allow us to visualize the state of our cluster in real-time, help diagnose any unexpected behavior, and analyze how the computation is performing. The Dask JupyterLab extension integrates Dask’s Diagnostic dashboards with the JupyterLab interface. Ian Rose, a Software Engineer at Coiled and the maintainer of the Dask JupyterLab extension, gave a very engaging talk at the Summit about this extension.
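If you want to try it yourself, the extension is published on PyPI as dask-labextension; on recent versions of JupyterLab, installing it into the same environment as JupyterLab is enough to enable the dashboard panels:

```shell
# Install the Dask JupyterLab extension alongside JupyterLab itself.
pip install dask-labextension
```
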
Accelerating the Dask Scheduler

In a collaborative effort from Dask contributors from NVIDIA, Capital One, Coiled, and the entire Dask community, there is ongoing work to improve the performance of the Dask scheduler. In this talk, Matthew Rocklin, James Bourbeau, and Gabe Joseph from Coiled joined Benjamin Zaitlen, Gil Forsyth, John Kirkham, Mads R. B. Kristensen, and Richard Zamora to discuss the details of this work.
Dask works great most of the time, but in some cases, it performs poorly:
When tasks are extremely short, Dask’s own per-task overhead starts to dominate; and
When there are many workers, the constant communication with them can become a burden on the scheduler.
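To make the first point concrete, here is a rough back-of-the-envelope calculation. The ~1 ms per-task figure below is an assumed ballpark for scheduler overhead, not a measured number:

```python
def overhead_fraction(task_duration, overhead_per_task=1e-3):
    """Fraction of wall time spent on scheduler overhead, assuming a
    rough ~1 ms of scheduler work per task (an illustrative ballpark)."""
    return overhead_per_task / (task_duration + overhead_per_task)

# Long tasks amortize the overhead; very short tasks are dominated by it.
for task_duration in (1.0, 1e-2, 1e-4):
    print(f"{task_duration:>8}s tasks -> {overhead_fraction(task_duration):.1%} overhead")
```

With one-second tasks the overhead is negligible; with 100-microsecond tasks, it eats most of the runtime.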
The task graph is currently generated on the client side. High-level graphs can be generated quickly, but low-level graphs can get very large and very expensive to create and serialize. Hence, the idea is to send the compact high-level graph to the scheduler and generate the low-level graph there, avoiding the cost of serializing and shipping a huge graph from the client.
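To see why this matters, recall that a materialized (low-level) Dask graph is just a dict mapping task keys to computations. In this toy sketch (add_one_chunk and the key names are made up for illustration), a single elementwise operation over a chunked array expands into one task per chunk, so the graph size scales with chunk count:

```python
def add_one_chunk(chunk):
    # Stand-in for the per-chunk work of an elementwise operation.
    return [x + 1 for x in chunk]

def lowlevel_graph(nchunks):
    """Build a dict-shaped graph: data keys plus one task per chunk."""
    graph = {("x", i): list(range(2 * i, 2 * i + 2)) for i in range(nchunks)}
    graph.update({("y", i): (add_one_chunk, ("x", i)) for i in range(nchunks)})
    return graph

print(len(lowlevel_graph(1000)))  # 2000 keys for one simple operation over 1000 chunks
```

A real workload chains many such operations over many more chunks, which is what makes client-side materialization expensive.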
Profiling is a big part of the work and helps us understand the flow of tasks. Many profiling techniques are being used:
Dask’s statistical profiler is a good starting point that provides a broad view,
cProfile captures every single function call, and
py-spy provides a good balance between the big-picture and granular viewpoints.
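As an illustration of the cProfile approach (the workload here is a stand-in for a many-small-tasks computation, not actual Dask code):

```python
import cProfile
import io
import pstats

def tiny_task(i):
    return i * i

def workload(n=50_000):
    # Many tiny function calls: the regime where per-call overhead shows up.
    return sum(tiny_task(i) for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Render the captured stats, sorted by total time spent in each function.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("tottime").print_stats()
report = buffer.getvalue()
```

Because cProfile records every single call, its own overhead can distort timings; that is why py-spy, which samples from outside the process, is a useful complement.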
Cython can perform as fast as C and have maintainable Python-like syntax, so some parts of Dask are being re-written in Cython.
Hacking Dask: Diving Into Dask’s Internals
James Bourbeau, Lead OSS Engineer at Coiled, and Julia Signell led a workshop on advanced operations on Dask collections and the details of the Dask scheduler.
Blocked algorithms operate on each block in isolation. Custom block computations allow us to perform such block-wise computations:
dask.array.map_blocks lets us map a custom function over the blocks of an array. It works with single or multiple arrays and provides useful special arguments like block_id and block_info.
dask.dataframe.map_partitions is similar to map_blocks, but operates on the partitions of a DataFrame.
dask.array.map_overlap is useful when we want to perform block-wise computation but require information from neighboring blocks, for example: applying a filter over an image.
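The block-wise idea is easy to sketch without Dask at all. map_blocks_sketch below is a toy stand-in, not the real dask.array.map_blocks API:

```python
def map_blocks_sketch(func, blocks):
    """Apply func to each block independently - the core idea behind
    blocked algorithms like dask.array.map_blocks."""
    return [func(block) for block in blocks]

blocks = [[1, 2], [3, 4], [5, 6]]  # a 1-D array stored as three chunks
doubled = map_blocks_sketch(lambda b: [x * 2 for x in b], blocks)
print(doubled)  # [[2, 4], [6, 8], [10, 12]]
```

The real version additionally handles chunk metadata, multiple input arrays, and changing block shapes, but the per-block independence is the same.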
For more fine-grained control over block-wise computation, each Dask collection has a reduction method. It is useful when output dimensions do not match the input.
dask.optimization contains various methods to transform task graphs. It is a useful toolkit for working with custom graphs:
cull removes tasks that are unnecessary for the requested outputs.
inline and inline_functions inline constants and cheap operations, respectively.
fuse merges linear chains of tasks into single tasks.
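Since Dask graphs are plain dicts of key -> task, culling is easy to sketch. cull_sketch is an illustrative toy, not the real dask.optimization.cull implementation:

```python
def cull_sketch(graph, outputs):
    """Keep only the tasks reachable from the requested output keys.
    Tasks are tuples of (func, *dependency_keys); anything else is data."""
    needed, stack = set(), list(outputs)
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed.add(key)
        task = graph[key]
        if isinstance(task, tuple):
            # Follow only arguments that are themselves keys in the graph.
            stack.extend(dep for dep in task[1:] if dep in graph)
    return {key: graph[key] for key in needed}

def inc(x):
    return x + 1

graph = {
    "a": 1,
    "b": (inc, "a"),       # depends on "a"
    "unused": (inc, "a"),  # never requested, so culled
}
print(sorted(cull_sketch(graph, ["b"])))  # ['a', 'b']
```

Dropping unreachable tasks up front keeps the scheduler from tracking work that can never affect the requested result.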
There are many ways to deploy Dask clusters: manual setup (running $ dask-scheduler and $ dask-worker from the command line), cluster managers from Dask projects like Dask-Kubernetes and Dask-Cloudprovider, and managed offerings from companies like Coiled.
Besides the Diagnostic Dashboard, Dask provides a number of objects and attributes that help us understand the state of the cluster.
The Dask Community in Australia
There is an active community of Dask users and contributors in Australia. Some of them came together to engage in this open panel discussion about everything Dask. Hugo Bowne-Anderson, Head of Data Science Evangelism at Coiled, participated in the session along with Genevieve Buckley, Ben Leighton, Draga Doncila Pop, and Tisham Dhar.
Dask has a wide range of use cases. The panelists have seen Dask used for geospatial imaging, biological imaging, large-scale text processing, climate analysis, workflow management, and financial data modeling, among many others.
Dask Dataframe is a common entry path into Dask, but Dask is much bigger than that. It’s a complete distributed framework that many other projects can build on. The Napari project is an excellent example of Dask running behind the scenes and allowing researchers to work with their data comfortably.
There is a perception that parallel computing is difficult. That is sometimes true, but Python, along with tools like Dask and xarray, is helping change that perception.
Dask is good at reducing step-cliffs. Dask makes it seamless to scale from a laptop to a cluster - the code remains similar and familiar.
Dask for Everyone with Coiled
Hugo also presented a talk about the IT challenges of scalable compute and how to make distributed computing in the cloud easier.
Starting with an example of pandas+XGBoost (and +Dask) on a laptop, we see that this workflow is accessible to anyone who owns a laptop and has the time to learn.
Looking at the same pandas+XGBoost+Dask workflow on a cluster, we see that it is not yet accessible to everyone.
A typical setup for data science on a cluster involves: learning about Dask on the cloud, setting up AWS (or any cloud platform), starting a Kubernetes cluster, installing kubectl, and then running the computation. This tends to throw multiple errors along the way, so follow-up tasks include pushing a Docker image and sharing AWS credentials across workers.
Then, IT challenges begin - large AWS bills due to idle systems, security issues, and more.
Coiled was built by Dask contributors who saw this play out multiple times. Coiled provides hosted Dask Clusters, manages software environments and costs, runs from anywhere, and even supports GPUs. Products like Coiled Cloud can help reduce the time spent on setting up clusters from weeks to minutes.
Get The Most Out of Your Data Science and ML
Thanks for reading! As always, feel free to let us know what you think about this post. Plus, we’re always looking for more feedback on Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and zero-click deployments. You can try it out for free by clicking below.