Spark vs. Dask

June 27, 2023

If you want to run Python code at scale today you have several options. Arguably the two most popular are Apache Spark and Dask.

There are valid reasons to choose either of these tools. Historically, Spark performs better when you have workloads that consist of mostly SQL queries. Dask is implemented natively in Python and thus makes it trivial to use for Python developers. Dask offers significantly more flexibility than Spark, which makes it a great fit when you have to escape pre-defined SQL-like structures.

Let's look at why you would choose one over the other in more detail. We’ll cover a few dimensions you may consider:

  1. Culture and ecosystem
  2. SQL vs. custom parallelism
  3. Implementation language, setup, and debugging

First, let us acknowledge our bias

We think that both projects are great, but we’re heavily biased towards Dask. Dask made pragmatic design choices that bring benefit to Python users. That said, we’ll try to keep things on-the-level for most of the article. We’ll occasionally include subsections where we express our opinions a bit more freely.

For most applications, any choice is fine

First, most parallel work is simple embarrassingly parallel work like this:

for filename in filenames:
    process(filename)  # hypothetical per-file function; each file is handled independently

For this kind of work, both systems will be fine. This is like comparing brands of toothpaste or Coke vs Pepsi. There’s some nuance in the flavor, but they’re pretty much the same product.
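To make "embarrassingly parallel" concrete, here is a minimal sketch using only Python's standard library; both Spark and Dask generalize this same pattern across many machines. The `process` function and the filenames are placeholders invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def process(filename):
    # Placeholder for per-file work: read, transform, summarize, etc.
    return len(filename)

filenames = ["a.csv", "b.csv", "c.csv"]

# Each file's work is independent of every other file's work --
# no shared state, no ordering constraints. That independence is
# what makes the problem "embarrassingly parallel".
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process, filenames))

print(results)
```

Because no task depends on another, almost any parallel framework handles this well, which is why the choice of tool matters little here.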

There are no bad options.

Culture and Ecosystem

Given that in most cases anything is fine, folks often choose what they’re most familiar with. Often this is governed by social cues or familiarity more than architectural or technology decisions. If your colleagues mostly use Apache projects then you’ll probably default to Spark. If you are in the Python data science community then you’ll probably choose Dask.

Community often matters more than technology. It gets you started and helps you along the way.

The surrounding software ecosystem also matters. Spark will have better support among enterprise data storage products for example. Dask will generally have better coverage in the chaotic bazaar of PyData projects. There is value in following your local crowd.

SQL vs. Custom Parallelism

If you primarily run SQL queries, choose Apache Spark.

Spark SQL is better than Dask’s efforts here (despite fun and exciting developments in Dask to tackle this space). Spark is also more battle-tested and produces reliably decent results, especially if you’re building a system for analysts who write SQL but aren’t full-time programmers.

Conversely, if you want to run generic Python code, Dask is much more flexible.

Map-Shuffle-Reduce: Apache Spark always operates on lots of data in bulk using the same operations. It's like a regular army where every soldier fires the same weapon at the same time.

Spark is fundamentally a Map-Shuffle-Reduce paradigm. It’s much easier to use than Hadoop, but it is fundamentally the same programming abstraction under the hood. If you want to do something more complex, Spark cannot express those computations. Consider, for example, a credit risk model from a retail bank, where thousands of interdependent calculations form an irregular graph rather than clean, uniform stages.

Task scheduling: Dask uses dynamic task scheduling, which can be more flexible, and so handles weirder situations (which come up a lot in Python workloads). It's like an irregular army where every soldier can carry a different weapon and can operate independently. There's less organization, but more flexibility.

A workload like that credit risk model doesn’t fit the regular structure of Map-Shuffle-Reduce, so Spark couldn’t actually run it. You need something more flexible. Dask can do this stuff easily.
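To illustrate the distinction, here is a toy dynamic task scheduler in plain Python. This is not Dask's actual implementation, and the risk-model-shaped graph is invented for illustration; the point is that tasks form an arbitrary dependency graph rather than fixed map/shuffle/reduce stages.

```python
def schedule(graph):
    """Run tasks in dependency order; graph maps name -> (func, deps)."""
    results = {}

    def run(name):
        if name in results:
            return results[name]
        func, deps = graph[name]
        # A task runs as soon as its inputs exist -- no global stages.
        results[name] = func(*[run(d) for d in deps])
        return results[name]

    for name in graph:
        run(name)
    return results

# A small irregular graph, loosely shaped like a risk pipeline:
# two inputs feed a scoring step, which feeds two different summaries.
graph = {
    "loans":  (lambda: [100, 250, 75], []),
    "rate":   (lambda: 0.05, []),
    "scores": (lambda loans, r: [x * r for x in loans], ["loans", "rate"]),
    "total":  (lambda s: sum(s), ["scores"]),
    "worst":  (lambda s: max(s), ["scores"]),
}

results = schedule(graph)
print(results["total"], results["worst"])
```

A Map-Shuffle-Reduce engine has no natural way to express a graph like this; a dynamic scheduler simply walks the dependencies.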

Dask Bias

Databases are good at SQL. Dask and Python were designed to do stuff that databases can’t do. Our experience is that people choose Python because they need to break out of pre-defined abstractions like SQL or Spark. It’s been exciting seeing all of the new kinds of data and computations people do with Python and Dask, and the new science and technological benefits that arise.

Lightweight setup

Both systems can run at scale. Both systems can run on a laptop. Both systems have some configuration and setup pain.

If you’re in a corporate environment you should use a managed platform to handle this for you; compare Databricks/EMR/Synapse for Spark and Coiled/Saturn/Nebari for Dask.

If you’re on a personal computer they’re all still options, with varying degrees of automated setup.

Dask Bias

OK, we think that we’re just way easier to use on a single machine. Dask is super-lightweight.

In [1]: from dask.distributed import LocalCluster

In [2]: %%time
   ...: cluster = LocalCluster()  # automatically configure to your machine
   ...: client = cluster.get_client()
CPU times: user 308 ms, sys: 75.8 ms, total: 384 ms
Wall time: 816 ms

Spark requires more expertise to configure well and brings along the JVM. Dask is just pure Python. It experiences a mild performance hit as a result (which most users never notice) but it’s trivial to use locally and easier to debug, both of which help with onboarding and accessibility.

Implementation Language

In a perfect world this doesn’t matter. Unfortunately it sometimes does:

  • Apache Spark: Scala
  • Dask: Python

Both have excellent Python APIs and are primarily used by Python developers. Spark has to cross the JVM-Native barrier, which sometimes makes people sad. Dask is native (and so using native code is easier). Dask feels a little simpler to hack on for Python users.


Final Thoughts

In general, both projects can probably do what you want (unless you’re doing something very strange). They’ve each made design choices that have subtle impacts on different workflows. These design choices were determined by their origins:

  • Spark grew up in the web scale big data world to replace lots of Hadoop jobs.
  • Dask grew up catering to the cornucopia of PyData libraries, and also as part of serving Anaconda consulting customers.

Speaking with our Dask bias, we focused on Python user pain constantly. We’re pretty proud of all of the tradeoffs and decisions that have been made and will continue to be made.

Ten years ago “Parallel Python” was a fringe topic that no one took seriously. A lot has changed since then. It’s wonderful that today users can choose among several excellent options and that many well-funded projects compete for attention.
