To celebrate Coiled’s first birthday, we hosted a Science Thursday livestream on Accelerating the Dask Scheduler with Matthew Rocklin, the co-creator of Dask and CEO of Coiled. Matthew joined Hugo Bowne-Anderson, the Head of Marketing and Evangelism at Coiled, to discuss some edge cases where the current Dask scheduler doesn’t perform optimally and talk about the ongoing efforts to make the scheduler faster. You can watch the livestream replay on YouTube:
In this post, we will cover:
Amish barn raising GIF
A scheduler is an essential component of parallel and distributed computing.
Amish barn raising provides a great metaphor for distributed computing. Just like the foreman who manages all the builders, the Dask scheduler manages all the Dask workers. In the same metaphor, the architect who designs the blueprints will be the Dask Client that constructs the task graph.
Imagine if there were hundreds or thousands of builders (workers), the foreman (scheduler) who assigns tasks can become a bottleneck. Now instead of raising an entire barn, consider a trivial project, like laying down one brick. It will be faster to do the task than to additionally communicate with the foreman. These are the two cases where Dask performs suboptimally -- if you have a lot of workers, or if you have very short tasks. As we’ll see in the post, there are ways to solve this problem.
Matthew says, a nice theme to have in mind throughout this discussion is:
“High performance isn’t about doing one thing well, it’s about doing nothing poorly.”
He also demonstrate these cases with some examples:
Note that the focus of this work isn’t on making each computation faster or each task faster, but on making the coordination faster.
A typical Dask computation goes through three phases:
The client is where a high-level graph is generated from your Dask code. In a high-level graph, each node corresponds to one operation. It is then translated into a low-level graph where each node represents the tasks performed by workers. It's worth noting that low-level graphs can get really complex for large datasets. The low-level graph is then optimized to be passed to the scheduler. The scheduler takes the graph and pushes it into an internal state machine. This state machine communicates with the workers and gives them tasks to do.
There are costs at each stage of this pipeline. The high-level graph generation is comparatively cheap, but the low-level graph generation, optimization, and serialization (to send the graph over a network to the scheduler) can be very expensive. The conversation between the scheduler and the workers aren’t expensive individually, but the constant need for communication that comes with a large number of workers can add up to be extremely expensive. Matthew demonstrates the costs at each stage with interesting examples, check it out in the livestream replay!
An effective solution to reduce these costs is to send a high-level graph directly to the scheduler, and implement the low-level graph generation and optimization steps at the scheduler stage.
This implementation is available today with the following configuration!
active: false //]]>
Getting this to work has been a joint effort by many people, but in particular Mads Kristensen (NVIDIA), Rick Zamora (NVIDIA), Gil Forsyth (Capital One), Ian Rose (Coiled), and James Bourbeau (Coiled).
We need to measure what is expensive before we can optimize it. Profiling is hard in any computing system and Python is no exception. The profiling options in Python include:
None of the above options are perfect in isolation, but each method provides different types of information, all of which are very useful.
Dask’s statistical profiler can be accessed in the “Profile” tab in Dask’s diagnostic dashboard. Every Dask worker has a statistical profiler running on it and Dask merges all the information and presents it in a flame graph. Dask also profiles the scheduler itself, this information can be accessed through a secret dashboard: <dashboard-url>/profile-server.
Python’s CProfile module can also help profile code. Benjamin Zaitlen, a Dask maintainer, runs nightly benchmarks on large datasets and shares the results in a GitHub repository: quasiben/dask-scheduler-perforance. The graphs show the aggregate time spent in each function, with links towards other functions that are called by that function.
VizTracer provides a very different view on the data and performance. It shows a hierarchy of function calls all the way down to individual Python functions. The previous two methods are broad compared to VizTracer, which shows information on a nanosecond-by-nanosecond basis.
Kudos to Maarten Breddels (vaex) for pointing the Dask team to VizTracer. Check out his recent blogpost using it to trace GIL access.
Python is relatively slow to be used in performant Python libraries like pandas, PyTorch, scikit-learn, and more. These libraries are written in a different language -- Cython. Cython compiles Python-like code to C, and produces Python modules that can be imported from regular Python. It’s a way to write C-speak-code using the familiar Python syntax.
Cython is critical to the Python ecosystem. It is great in many ways, but at the same time, it can be harder to debug, introduces a new compilation step into the development process, and is difficult to package. Hence, Dask has been written in pure Python so far. Dask maintainers are comfortable in Python and currently, Dask is good by being smart, not by being fast.
As of recently, you can write Cython code using pure Python type annotations! This makes adopting Cython very easy.
There is an ongoing effort to rewrite parts of Dask using Cython. Matthew discusses some recent PRs by John Kirkham (NVIDIA) for this rewrite that you can check out in the livestream recap!
If you enjoy working with Dask, you will find Coiled interesting. Coiled makes it very easy to deploy Dask to the Cloud by taking care of the DevOps elements for you.
Sign up today and get 100 free cores on us!