I ran across an interesting problem yesterday:
A company wanted to serve many Dask computations behind a web API endpoint. This is pretty common whenever people offer computation as a service or data as a service. Today the company uses the single-machine Dask scheduler inside of a web request, but they were curious about moving to serving larger requests on a scaled out cluster.
However, they were concerned about having many thousands of requests all funnel down onto a single Dask scheduler. This led to a conversation about the different ways of using Dask in production. I thought I’d enumerate what we’ve seen historically in a blogpost, and then finish up with a quick hack I put together that would probably better suit their particular needs.
When thinking about any production workload you need to consider a few axes: how many requests come in, how large each request is, and how quickly each one needs to return.
We’ll look at five different, highly reliable architectures, each of which serves a different region in this space.
For many institutions “production workloads” means “large jobs that need to get done every hour” or “a job that has to run whenever a large dataset comes in”. These jobs typically take several minutes. It’s important that they run, and that someone is alerted if they don’t.
In this situation people usually spin up a fresh Dask distributed cluster for each job, coupled with a resource manager like Kubernetes and a workflow manager like Airflow or Prefect.
Whenever a job comes in they create a new Dask cluster (this takes anywhere from a few seconds to a few minutes, depending on your infrastructure) and submit the job to it. The workflow manager adds logic for retries, alerting on failures, and so on.
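For concreteness, here’s a minimal sketch of that pattern using dask-kubernetes. The function name, dataset path, worker count, and cluster name are all hypothetical, and the exact `KubeCluster` constructor arguments depend on which version of dask-kubernetes you have installed.

```python
from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster
import dask.dataframe as dd


def run_hourly_job() -> float:
    # Spin up a fresh cluster just for this job; the context managers tear it
    # down again whether the job succeeds or fails.
    with KubeCluster(name="hourly-job", n_workers=20) as cluster:
        with Client(cluster) as client:
            df = dd.read_parquet("s3://my-bucket/events/")  # hypothetical dataset
            # The minutes-long distributed work happens here.
            return float(df.value.sum().compute())
```

A workflow manager like Airflow or Prefect would call a function like this on a schedule and wrap the retry and alerting logic around it.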
For others, such as the company described above, they currently use a single-machine Dask deployment within a web request to gather data, do a bit of computation, and then fire back the result to the user. This typically all happens in the sub-second times expected by web users.
In this situation they use the Dask local scheduler, which doesn’t require any complicated resources to spin up (it runs in the current Python process) and so lives comfortably within existing web frameworks. However, this limits them to the resources available on the machine hosting their web server.
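A rough sketch of that pattern, using FastAPI as a stand-in web framework (the endpoint, column names, and dataset path are invented for illustration):

```python
import dask.dataframe as dd
from fastapi import FastAPI

app = FastAPI()


@app.get("/total/{account}")
def total(account: str):
    # The default local scheduler runs in this process's own thread pool,
    # so no external Dask cluster needs to exist for this request to work.
    df = dd.read_parquet("s3://my-bucket/transactions/")  # hypothetical dataset
    value = df[df.account == account].amount.sum().compute(scheduler="threads")
    return {"account": account, "total": float(value)}
```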
Rarely we see companies that want to serve large distributed computations behind a publicly accessible web API. This requires some care. If your users don’t mind a few seconds of delay then it’s typical to spin up a new Dask cluster for every request, falling back to the “Few Large Requests” model above. Kubernetes backs us up here, creating large-but-ephemeral Dask clusters on demand.
Serving large distributed computation behind a publicly accessible web API is relatively niche, and users today don’t seem to expect much, so waiting a few seconds for Kubernetes to respond is fine in many cases. But can we do better?
Alternatively, we could put a single Dask scheduler in front of many web requests. This brings our latency overhead down to milliseconds (well within web user tolerances), but introduces some concern about scalability and multi-tenancy. With many thousands of concurrent requests, that single central scheduler can become a bottleneck: Dask’s architecture was originally designed to serve computations for a single user, not high-concurrency production workloads.
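The shared-scheduler version of the web handler above might look roughly like this, with one long-lived Client created at startup and reused by every request. The scheduler address and dataset path are placeholders for wherever your scheduler and data actually live.

```python
import dask.dataframe as dd
from dask.distributed import Client
from fastapi import FastAPI

app = FastAPI()
# One connection to the shared scheduler, created once and reused by all requests.
client = Client("tcp://dask-scheduler:8786")


@app.get("/total/{account}")
def total(account: str):
    df = dd.read_parquet("s3://my-bucket/transactions/")  # hypothetical dataset
    future = client.compute(df[df.account == account].amount.sum())
    return {"account": account, "total": float(future.result())}
```

Every request now funnels through that one scheduler, which is exactly where the bottleneck and multi-tenancy concerns come from.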
Fortunately Dask was also designed to be flexible and evolve. We can take the existing pieces of a Dask cluster, and rearrange them to suit this situation without too much pain.
If the scheduler is a bottleneck, then let’s add more schedulers. We’ll throw in a load balancer too for good measure.
Today each worker can only talk to a single scheduler. That turns out to be easy to change though, and is up in this experimental pull request.
Then we can use standard web-scaling techniques like load balancers and auto-scalers to build a robust system that handles many requests and delivers them all to the same set of Dask workers. This lets us handle a very large number of concurrent requests while still being efficient with our single pool of responsive workers.
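Under those assumptions (multiple schedulers behind a load balancer, all sharing one worker pool via the experimental change above), a request handler might look something like the sketch below. The load balancer’s DNS name is hypothetical; it would be whatever address your infrastructure exposes in front of the scheduler pool.

```python
import dask.array as da
from dask.distributed import Client
from fastapi import FastAPI

app = FastAPI()


@app.get("/mean")
async def mean():
    # Each request opens a short-lived connection through the load balancer,
    # which picks one of the schedulers; all schedulers drive the same workers.
    async with Client("tcp://dask-scheduler-lb:8786", asynchronous=True) as client:
        x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
        result = await client.compute(x.mean())
        return {"mean": float(result)}
```

From the handler’s point of view nothing changes: it still just connects to a scheduler address and submits work, which is why this fits so neatly into familiar web deployment patterns.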
Additionally, it’s comforting that this solution fits into familiar patterns of deployment today. Dask schedulers and workers are lightweight web servers that fit into our existing models of deployment.
However, this doesn’t solve everything: all of those schedulers still share a single pool of workers, so the multi-tenancy concerns mentioned above remain.
As we've been working on Coiled, it has been fun to hear from more companies and institutions stretching Dask beyond its original intent. If these kinds of problems sound relevant to you then consider getting in touch.