The Dask community is highly distributed with different teams working independently. This is powerful but sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a monthly publication intended to centralize and broadcast Dask news over the previous month.
If you want something added to this list, either send an email at info@coiled.io, or tweet and tag @dask_dev, and we’ll try to include it. Keep reading for the latest updates.
Recently, the community noticed some challenges with processing large datasets stored in the Parquet file format on Amazon S3 buckets. These include issues around crossing the S3 request rate limits, difficulty parsing large metadata files, etc. James Bourbeau, Joris Van den Bossche, Martin Durant, and Richard (Rick) Zamora, are leading the effort to reproduce and fix these issues. You can follow the conversations here.
As a part of GSoC’21, Freyam Mehta improved task graph visualizations and extended the HTML representations for Dask high-level graphs. Task graphs now show the size of each layer and can be color-coded according to layer type. Dask also has new HTML representations in Jupyter Notebooks for the Process Interface and Security classes. Check out the project outcomes and more details in this blog post.
Aaron Meurer and Jim Crist-Harif are exploring how to optimize high-level queries and expressions. As Matthew Rocklin mentioned in the corresponding issue, the idea here is to understand user intent and rewrite their code to improve performance with Dask.
Richard (Rick) Zamora worked on refactoring and expanding the Dask DataFrame ORC API. Dask DataFrame now has better support for reading and writing to ORC files, including support for working with partitioned datasets.
In the ongoing effort to improve and add more HTML Representations to Dask, Jacob Tomlinson migrated HTML representations to use jinja2 templates. This migration makes it easier to create and maintain these representations. It also makes this tooling more accessible to projects that are downstream of Dask.
Over the month of August, both Dask and Distributed versions 2021.08.0 and 2021.08.1 were released.
Some highlights from the September Dask community meeting:
Full meeting notes are available here.
That’s it. Thanks for reading.
If you’re interested in taking Coiled Cloud for a spin, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today when you click below.