Using Dask for the first time can involve a steep learning curve. After years of building Dask and guiding people through their onboarding process, the Coiled team has recognised a number of common pitfalls.
This post presents the 5 most common mistakes we see people make when using Dask – and strategies for how you can avoid making them.
Let’s jump in.
The single most important thing to do before starting to build things with Dask is to take the time to understand the basic principles of distributed computing first.
The Dask API follows the pandas API as closely as possible, which means you can get started with Dask pretty quickly. But that “as possible” sounds deceptively simple. What that doesn’t tell you is that when you move from pandas to Dask you’re actually entering a whole different universe – with a different language, different basic laws of physics and a different concept of time.
To succeed in Dask you’ll need to know things like why Dask is “lazy”, which kinds of problems are “embarrassingly parallel” and what it means to “persist partitions to cluster memory”. If that all sounds like gibberish to you, read this introduction to basic distributed computing concepts. And don’t worry if this feels slightly overwhelming – take a deep breath and remember that I knew next to nothing about all of this a few months ago, either ;)
The good news is that you don’t need to master any of these new languages or laws or concepts – in most cases, you can navigate your way around with a basic understanding of fundamental concepts. It's a bit like going on a holiday to a place where you don’t speak the language. You don't need to be able to hold an entire conversation on the intricacies of the local political system to have a good time. But it'd be helpful if you were able to ask for a telephone or for the directions to your hotel if you needed to.
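One of those fundamental concepts – lazy evaluation – is easiest to see with a tiny example. Here’s a minimal sketch using `dask.delayed` (the function name `add` is just an illustration): defining the work only builds a task graph, and nothing actually runs until you ask for the result.

```python
import dask

# decorating a function makes calls to it lazy:
# Dask records the task instead of running it
@dask.delayed
def add(a, b):
    return a + b

total = add(1, 2)         # a lazy Delayed object; no work has happened yet
result = total.compute()  # .compute() triggers the actual execution
print(result)             # 3
```

This "describe first, execute later" model is what lets Dask plan and parallelise work across a cluster before running any of it.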
One of the most obvious differences between pandas and Dask is that when you call a pandas DataFrame you get, well, the DataFrame:
```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mercutio", "Tybalt", "Lady Montague"],
    "Age": [3, 2, 4],
    "Fur": ["Grey", "Grey", "White"],
})
df
```
…whereas when you call a Dask DataFrame you get the equivalent of the plastic wrapper but not the candy: you’ll see some descriptive information about what’s inside but not the actual contents of your DataFrame:
```python
import dask.dataframe as dd

dask_df = dd.from_pandas(df, npartitions=2)
dask_df
```
You’ll quickly discover that this is because of Dask’s “lazy evaluation” and that if you add .compute() to your df call, you will get the results the way you’re used to seeing them in pandas.
But you’ll want to be careful here. Dask evaluates lazily for a reason. Lazy evaluation allows Dask to postpone figuring out how to get you the result until the last moment – i.e. when it has as much knowledge as possible about the data and what you want from it. It can then calculate the most efficient way to get you your result. Calling .compute() too often or too early interrupts this optimisation procedure. It’s a bit like eating all your half-cooked ingredients midway through the recipe – it won’t taste as good and if you still want the end result, you’ll have to start all over again.
Rules of thumb here: use .compute() sparingly. Only use it to materialize results that you are sure: 1) will fit into your local memory, 2) are ‘fully cooked’ – in other words, don’t interrupt Dask’s optimization halfway through, and 3) ideally will be reused later. See this post for more details about compute.
Soon after discovering .compute(), you’ll learn about .persist().
It can be difficult at first to understand the difference between these two: both of these commands materialize results by triggering Dask computations. This definition will help you tell them apart: .persist() is basically the same as .compute() – except that it loads results into cluster memory instead of local memory. If you need a refresher on the difference between local and cluster memory, check out the “cluster” section of this article.
You might be wondering – well, in that case, shouldn’t I just .persist() everything? Especially if I’m working on a cloud-based cluster with virtually unlimited memory, won’t persisting just make everything run faster?
Yes…until it doesn’t. Just like .compute(), calling .persist() tells Dask to start computing the result (cooking the recipe). This means that these results will be ready for you to directly use the next time you need them. But you only have two hands, limited tools and are cooking against a deadline. Just filling all your pans with cooked ingredients you might (or might not) need later is not necessarily a good strategy for getting all your prepared dishes to all of your diners at the right time.
Remote clusters can be a great resource when processing large amounts of data. But even clusters have limits. And especially when running Dask in production environments – where clusters won’t be free but closely cost-monitored – you’ll start caring a lot more about using cluster memory as efficiently as possible.
So the advice here is similar to the one above: use .persist() sparingly. Only use it to materialize results that you are sure: 1) will fit into your cluster memory, 2) are ‘fully cooked’ – in other words, don’t interrupt Dask’s optimization halfway through, and 3) will be reused later. It’s also best practice to assign a .persist() call to a new variable, as calling .persist() returns new Dask objects, backed by in-memory data.
```python
# assign persisted objects to a new variable
import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/file.parquet")
df_p = df.persist()  # df_p is a new Dask DataFrame backed by in-memory data
```
See this post for more details on persist.
We all love our CSV files. They’ve served us well over the past decades and we are grateful to them for their service…but it’s time to let them go.
You’ve now made the daring move to the Universe of Distributed Computing, which opens up entirely new possibilities, like reading and writing files in parallel. CSV simply isn’t built for these advanced processing capabilities, and when working with Dask, it’s highly recommended to use the Apache Parquet file format instead. Parquet is a columnar format with strong compression that supports column pruning and predicate pushdown filtering, and lets you avoid expensive and error-prone schema inference.
Dask allows you to easily convert a CSV file to Parquet using:
```python
# convert a CSV file to Parquet using Dask
df = dd.read_csv("file.csv")
df.to_parquet("file.parquet")
```
Matthew Powers has written a great blog about the advantages of Parquet for Dask analyses.
If you’re coming from pandas, you’re probably used to having very little visibility into the code that’s running while you wait for a cell to complete. Patiently watching the asterisk next to your running Jupyter Notebook cell is a daily habit for you – perhaps supplemented with a progress bar or notification chime.
Dask changes the game entirely here. It gives you powerful, detailed visibility and diagnostics for the computations it runs for you: you can inspect everything from the number and type of running tasks to the data transfer between workers in your cluster and the amount of memory specific tasks are consuming.
Once you get a little more familiar with Dask, you’ll be using the Dashboard regularly to figure out ways to make your code run faster and more reliably. One thing to look out for, for example, is the amount of red space (data transfer) and white space (idle time) that show up – you’ll want to minimize both of those.
You can get a link to the Dask Dashboard with:
```python
# launch a local Dask cluster using all available cores
from dask.distributed import Client

client = Client()
print(client.dashboard_link)  # the URL of the Dask Dashboard
```
Or use the Dask Dashboard extension for JupyterLab, as described in this post.
Watch this video by Matt Rocklin, the original author of Dask, to learn more about using the Dask Dashboard to your advantage.
Here's a ‘bonus mistake’ for you that might actually be one of the most important keys to successfully learning how to use Dask.
It’s admirable (and good practice) to want to try things yourself first. But don’t be afraid to reach out and ask for help when you need it. The Dask Discourse is the place to be whenever you want to discuss anything in more detail and get input from the engineers who have been building and maintaining Dask for years.
You’re now well on your way to avoiding the most common mistakes people make when onboarding with Dask. Our public Dask Tutorial is a good next step for anyone serious about exploring the possibilities of distributed computing.
Once you’ve mastered the fundamentals, you’ll find resources below for using Dask for more advanced ETL, data analysis and machine learning tasks:
Good luck on your Dask journey!