By far the most common issue users encounter when running Dask in a true distributed fashion in the cloud is mismatched software environments. Your computer has one version of a library like Dask or pandas and your distributed cluster has another version. Sometimes this is harmless, other times it results in difficult to understand bugs. It’s both common and painful.
Why do different versions matter? Well there’s a wonderful array of errors that can crop up:
I personally used to maintain several physical machines as an on-prem Dask cluster, and ran into this many times. While I don’t hand maintain physical machines like that anymore, except for the NAS in my closet, the same problems can still crop up. Experts and beginners alike will run into these problems during development.
So how does your environment get out of sync? Let’s run through some examples, ranging from the easy to the insidious.
For a straightforward one, you pip installed something and forgot to run create_software_environment with your new package, so now the library is entirely missing on the cluster. Doh!
In another scenario, you try to get more specific about how to replicate your environment with a conda environment.yml file, this could also be a requirements.txt file for pip:
On paper, this looks ok. After all, you've pinned the exact versions of the packages you need. But let's look at what this produces:
Over 80 packages are installed by conda, and only two of them are pinned, which means any of them could change at any time, oh dear. We forgot to include python too, so even the python version could change! We really only pinned the very tip of our environment iceberg.
If you installed this environment locally and created a Coiled software environment immediately you’d hope things would be in sync at least, but alas this is not the case. Different packages often have different requirements cross platform, so installing this on macOS will lead to a pretty different environment compared to the Linux environment on the cluster. Most of the time not different enough to cause major issues, but the few times it does are very painful.
Using some combination of pipenv, pdm, poetry, conda-lock and Docker is often how people try to solve this for production workflows, and often with high levels of success. If you run your client with the exact same Docker image as the cluster and install all your dependencies from a lock file, it works! In a perfect world everyone uses cutting edge best practices for replicating a python environment, everyone has exactly the same python environment everywhere, and everyone has the time to produce an always available environment artifact in the cloud for your clusters to use.
In reality, things are messy setting up a perfect system can take much longer than what you want to do!
We’ve been working hard on our new feature, package sync. Enabling it is simple:
Package sync will then scan your local python environment and replicate it on your cluster. Your slightly imperfect environment that works great locally is now in the cloud where it will continue to work great! You’re not creating your cluster environments based on the tip of the iceberg anymore, with package sync, Coiled is looking much deeper!
We even handle editable package installations, i.e. those you’ve installed with pip install -e <package>, and packages installed from git! You can read more of the details about how this works (and some of the caveats) in our docs. Give it a go and let us know what you think!
Working with Dask core contributors, we’ve gotten very positive feedback:
[I] couldn't imagine trying to develop in this repo without package_sync. Especially as a non-primary developer in this repo who just wants to come in, add some tests, and move on without too much fuss, it's invaluable.
… spent a fair amount of time the last few days looking at performance testing and looking at some specific performance regressions, and this [package_sync] has been invaluable…