FAQ

What is Dask?

Dask is a library for parallel computing. It can be used on its own, where it’s kinda like multiprocessing on steroids, or it can be used with other PyData libraries. Dask + pandas is big pandas, kinda like Spark. Dask + NumPy is big NumPy. And so on with PyTorch, XGBoost, Airflow, etc.

Do you support on-prem clusters?

You can deploy Dask anywhere with technologies like HPC job schedulers, Kubernetes, or cloud deployment APIs. Coiled manages Dask clusters in the cloud, deploying within your own account (we never see your data). If you are truly on-prem (like on OpenStack), then we can connect you to other companies and contractors that can set things up for you.

How do I set up Dask to run on a cluster?

Actually! That’s what we do! Dask is fully open source and you can deploy it yourself with any technology. Check out the Dask docs to get started.

What features does Coiled add to Dask?

Dask is great, but it’s kind of a pain to set up and use in a serious way. Coiled makes it easy to run Dask in the cloud. This includes things like managing clusters, tracking usage, limiting costs, etc.

Why is Dask better than Spark?

Different tools for different folks. Spark is better if you’re doing pure SQL.

Dask is better if:
You have folks that like Python.
Dask is more Python native. For example, the pandas API is the same, and integration with the rest of the Python libraries is more native. Debugging is easier.

You need more flexibility.
Many people using Python need to do things that are more complicated than SQL. This includes complex workflows, ML, etc. Python teams tend to do a lot of crazy unstructured stuff.

Why should I choose Dask over Ray?
I’ve been looking at both.
  • Dask is more integrated into the PyData community and is built by NumPy/pandas/scikit-learn maintainers.
  • Ray is better optimized for high performance ML. Dask is better optimized for data exploration, data engineering, etc. We tend to optimize around things we see in the wild, and less around hype.
  • Multiple companies (Anaconda, NVIDIA, Coiled, Saturn, Quansight) vs. one (Anyscale).
  • But really, for 90% of boring workloads anything is fine, and it’s more of a cultural thing. Dask is part of the pragmatic PyData culture.
  • If you want to get very technical though, Ray does distributed scheduling and makes lots of dumb+fast decisions, while Dask does centralized scheduling and makes smarter+slower decisions (slow here means 200 µs). Dask leverages hardware more intelligently, especially when computing on large datasets in a small amount of memory, which is where we observe the most user pain.

FAQ

What is Dask?

Dask is a Python library for parallel computing. It can be used on its own, where it’s kinda like multiprocessing on steroids, or it can be used with other PyData libraries. Dask + pandas is big pandas, like Spark. Dask + NumPy is big NumPy. And so on with PyTorch, XGBoost, Prefect, Airflow, etc.
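
To make "multiprocessing on steroids" a bit more concrete, here is a minimal sketch of using Dask on its own through dask.delayed. The function name square and the toy workload are made up for illustration; real workloads would have heavier tasks.

```python
import dask


@dask.delayed
def square(x):
    # Calling a delayed function records a lazy task; nothing runs yet.
    return x * x


# Build a small task graph: eight independent squares, then one sum over them.
squares = [square(i) for i in range(8)]
total = dask.delayed(sum)(squares)

# compute() hands the graph to a scheduler (a local thread pool by default),
# which is free to run the independent tasks in parallel.
result = total.compute()
print(result)  # 140
```

The same graph can run on a cluster without changing the task code, which is the part plain multiprocessing doesn't give you.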

Can you help me run on-prem?

You can deploy Dask anywhere with technologies like HPC job schedulers, Kubernetes, or cloud deployment APIs. See the Dask documentation. Coiled manages Dask clusters in the cloud, deploying within your own cloud provider account (we never see your data).

If you are truly on-prem (like on OpenStack) then we can connect you to other companies and contractors that can set things up for you. Send us a note and we’ll connect you to groups that we trust to do excellent work.

How do I set up Dask to run on a cluster?

Actually! That’s what we do! Dask is fully open source and you can deploy it yourself with any technology. Check out the Dask docs to get started.

This ends up being easy to start, but kinda hard to actually use seriously in a corporate setting. Coiled makes this super easy on the cloud. See our Build vs. Buy page for more details.

Does this run in your cloud or mine?

Coiled manages resources in your cloud account. Most of our customers don’t trust us with their sensitive data. We work hard to manage cloud resources for you without ever having direct access to sensitive data. For more information on our security posture, see our Coiled Security page.

What does Coiled add on top of Dask?

Coiled makes it easy to set up and use Dask in the cloud. This isn’t one big thing. It’s dozens of small things. For example Coiled does the following:

  • Replicates your local software environment to your workers
  • Forwards data access credentials to your workers
  • Starts up reliably in any region in a couple minutes
  • Turns itself off automatically and cleans up if clusters are left on
  • Tracks usage across a team
  • Gives cost saving measures like Spot, ARM, and others
  • Gives visibility to historical jobs and errors

What is Coiled?

Dask is an open source project, like pandas or Jupyter. Coiled is a for-profit company around Dask. We work on Dask and we also sell a cloud platform to make it easy to deploy Dask in the cloud. Do you know Databricks? We're like that. Request a demo to give it a try.

How much does this cost me?

Nothing until you use 10,000 CPU hours per month. After that we do usage-based pricing. If we’re managing $1M of spend on AWS/GCP then we expect to receive hundreds of thousands of dollars. If we’re managing hundreds of dollars then we don’t care and are happy to give away the product for free. In general we find that our cost-saving measures save customers more money than we charge. More details on our Pricing and Build vs. Buy pages.
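
As a rough sketch of how the free tier works out, the snippet below computes billable CPU hours for a month. The 10,000-hour threshold comes from the answer above; the flat rate_per_cpu_hour is a hypothetical placeholder, not a published price, and actual usage-based pricing may be structured differently (see the Pricing page).

```python
FREE_CPU_HOURS_PER_MONTH = 10_000  # free tier stated in this FAQ


def billable_cpu_hours(cpu_hours_used):
    """CPU hours in a month that fall beyond the free tier."""
    return max(0, cpu_hours_used - FREE_CPU_HOURS_PER_MONTH)


def estimated_charge(cpu_hours_used, rate_per_cpu_hour):
    """Estimated monthly charge, ASSUMING a flat per-hour rate.

    rate_per_cpu_hour is a hypothetical input used only for illustration.
    """
    return billable_cpu_hours(cpu_hours_used) * rate_per_cpu_hour


# A team under the threshold pays Coiled nothing, whatever the rate is.
print(billable_cpu_hours(8_000))   # 0
# A team at 12,500 CPU hours is billed for the 2,500 hours past the free tier.
print(billable_cpu_hours(12_500))  # 2500
```

Cloud provider costs are separate and are not modeled here.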

Can I get a free trial?

Yes! You don’t even have to ask:
pip install coiled
coiled setup

Coiled is free to use for the first 10,000 CPU hours per month (you still have to pay your cloud provider). You don’t need to give us a credit card or anything. See our documentation to get started or reach out to us; we'd love to chat and are happy to help.

Can I pay to get help with Dask?

If you’re using Coiled you’ll find that our engineers reach out frequently. We’re constantly tracking failures and engaging with users to see how we can make Dask and Coiled better. If you want to ask questions we’re happy to help.

If you’re not using Coiled these engagements are less efficient and so it’s harder to help. For a few large partner organizations with heavy use of Dask on-prem we do sell Enterprise Dask Support.

How does Dask compare to Apache Spark?

Spark is great. If Spark does what you want, use Spark.

Dask is lower level and lighter weight than Spark. It is more flexible and can do more things. If you’re doing primarily SQL queries or common dataframe operations, Spark will likely outperform Dask. People tend to choose Dask for the following two reasons:

They like Python.
Dask is more Python native. For example, the pandas API is the same, and integration with the rest of the Python libraries is more native. Debugging is easier.

They need more flexibility.
Many people using Python need to do things that are more complicated than SQL. This includes complex workflows, ML, etc. Python teams tend to do a lot of crazy unstructured stuff.

Check out our blog post Dask vs. Spark for more details.

Dask Tutorials

Hands-on training, totally free. Tutorials offered every month.

Frequently Asked Questions

What is Coiled?

Dask is an open source project, like pandas or Jupyter. Coiled is a for-profit company around Dask. We work on Dask and we also sell a cloud platform to make it easy to deploy Dask in the cloud. Do you know Databricks? We're like that.

Still have questions?

We'd love to chat more!