Dask Heartbeat by Coiled: 2020-12-17

The Dask community is highly distributed with different teams working independently. This is powerful but sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a bi-weekly publication intended to centralize and broadcast Dask news over the previous two weeks.  

If you want something added to this list, either send an e-mail to info@coiled.io or tweet and tag @dask_dev, and we’ll try to include it.

Dask 2020.12.0 release

Dask and Distributed version 2020.12.0 were released last week. This release contains many updates (it’s the first release in two months). Some highlights include:

  • Switching to CalVer for the versioning scheme.  We plan to write more about the motivations for this next month. The previous release was version 2.30.1, while this version is 2020.12.0.
  • The scheduler can now receive Dask HighLevelGraphs instead of raw dictionary task graphs. This allows for much more efficient communication of task graphs from the client to the scheduler. This is currently off by default but is configurable for early adopters with the optimization.fuse.active config value.
  • Introduction of new HighLevelGraph layer objects including BasicLayer, Blockwise, BlockwiseIO, ShuffleLayer, and more.
  • Added support for applying custom Layer-level annotations like priority, retries, etc. with the new dask.annotate context manager.
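As a rough illustration of the dask.annotate context manager mentioned in the last highlight, the sketch below attaches a priority and a retry count to a small array and then reads the annotations back from its HighLevelGraph layer. The array shape, chunk sizes, and annotation values are arbitrary choices made for this example, not anything taken from the release notes.

```python
import dask
import dask.array as da

# Layers created inside this context carry the given annotations.
# The values 10 and 3 are illustrative only.
with dask.annotate(priority=10, retries=3):
    x = da.ones((8, 8), chunks=(4, 4))

# This reduction is built outside the context, so the new layers it
# adds are not annotated.
total = x.sum()

# Look up the layer that produced `x` and inspect its annotations.
layer = x.__dask_graph__().layers[x.name]
print(layer.annotations)
```

Annotations are hints attached to the graph. They are meant to be acted on by the distributed scheduler, so running the same graph on a local scheduler should compute the same result without honoring them.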

XGBoost 1.3.0 release

The newly released version 1.3.0 of XGBoost contains several updates that improve XGBoost + Dask integration. This is part of the larger effort to migrate the functionality of Dask-XGBoost into the mainline XGBoost codebase.

NVTabular is now built on Dask-CuDF

NVTabular, a library for processing tabular data needed to train and deploy recommender-systems models on GPUs, introduced a new Dask-CuDF backend to support scalable preprocessing. Rick Zamora (NVIDIA) outlines some recent NVTabular developments in this blogpost https://medium.com/rapids-ai/nvtabular-all-in-on-dask-6241b4e9ca19


Dask-SQL Updates

Nils Braun (Bosch) shares his blogpost on using SQL to drive Dask on Kubernetes.


Also, in other fun Dask/Pandas/SQL news, we discovered that the Dask-SQL project also magically works on Pandas.


This is one nice side effect of the close partnership between the two projects.  

Stumpy 1.6.0 release

The STUMPY library for time series analysis improves its Dask support in its recent release.


yt integration continues 

Maintainers of the popular yt framework for computation and visualization of volumetric data are busy implementing Dask support. A slide deck on their recent progress is below.


Quansight delivers Dask Webinar

Dhavide Aruliah (Quansight) delivered a Dask webinar: https://twitter.com/quansightai/status/1334161550504968192

CZI EOSS Grantee Program

Ben Zaitlen (NVIDIA) presented Dask to OSS maintainers and Life Science practitioners at the Chan Zuckerberg Initiative’s Essential Open Source Software for Science gathering.

Video available here (Dask was on Day Three at the end)

Slides available here

We’re also glad to announce that Genevieve Buckley will be joining full time in February as the Dask Life Science fellow (generously funded by the CZI EOSS program).  We’ll have a more detailed announcement next month, and are very excited.  Genevieve will be the first employee of Dask itself as an organization, rather than one of the supporting companies.

Deploying Jupyter for Dask on ARM on Kubernetes

Holden Karau walks through how to deploy Jupyter Lab/Notebook on ARM on Kubernetes with Dask support in this blogpost https://scalingpythonml.com/2020/12/12/deploying-jupyter-lab-notebook-for-dask-on-arm-on-k8s.html



Activity at annual AGU conference

The American Geophysical Union runs an annual conference. Dask took this community by storm a couple of years ago with the Pangeo project. This year is no different.


For reference, CMIP is the Coupled Model Intercomparison Project. It’s the standard multi-institutional framework for comparing climate models and one of the grander humanity-focused projects we see today.

There are many other happenings at this conference, including this announcement from the climpred project


2i2c is hiring

2i2c is looking to hire an open-source infrastructure engineer to work on cloud infrastructure for research and education using projects like JupyterHub and Dask. For more information, see their job posting at https://2i2c.org/job/osie-pangeo.

Micro-optimizing and refactoring the Distributed Scheduler

John Kirkham (NVIDIA) has continued making micro-optimizations to the scheduler as part of a larger effort to boost performance.

He has also recently begun to decouple the state machine and networking communication parts of the scheduler.

Dask-Jobqueue 0.7.2 release

See https://jobqueue.dask.org/en/latest/changelog.html for the full list of changes.

Wrapping Up

That’s it. Thanks for reading, all.
