With Dask’s map_partitions(), you can apply a function to each partition of a Dask DataFrame (each partition is a plain pandas DataFrame), running custom workflows in parallel across partitions.
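A minimal sketch of the pattern (the column names and the `add_revenue` function are made up for illustration):

```python
import pandas as pd
import dask.dataframe as dd

# Build a small Dask DataFrame with two partitions.
pdf = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0], "qty": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

def add_revenue(part: pd.DataFrame) -> pd.DataFrame:
    # Each partition arrives as a plain pandas DataFrame, so any pandas code works here.
    part = part.copy()
    part["revenue"] = part["price"] * part["qty"]
    return part

# Dask calls add_revenue on every partition in parallel.
result = ddf.map_partitions(add_revenue)
print(result.compute())
```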
This analysis tracks the growth of Matplotlib on the preprint server arXiv, from 1% of all papers using it in 2002 to 17% in 2022...
In this article, we discuss a use case of Dask and Coiled: accelerating volumetric X-ray microstructural analytics using distributed, high-performance computing.
Unmanaged memory is RAM that the Dask scheduler is not directly aware of. Left unchecked, it can cause workers to run out of memory, making computations hang or crash.
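One mitigation described in the Dask documentation is to ask the allocator to release freed memory back to the operating system. A sketch for Linux/glibc workers, assuming a running distributed cluster:

```python
import ctypes

from dask.distributed import Client

def trim_memory() -> int:
    # Ask glibc's allocator to hand freed-but-held memory back to the OS.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

client = Client()  # connect to (or start) a cluster
# Run the trim on every worker; this often shrinks unmanaged memory on Linux.
client.run(trim_memory)
```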
This article discusses the problems faced by users looking for a Spark/Databricks replacement, the relative strengths of Dask and Coiled for large-scale ETL processing, and their current shortcomings.
Alex Egg, Senior Data Scientist at Grubhub, joins Matt Rocklin and Hugo Bowne-Anderson to talk and code about how Dask and distributed compute are used throughout the user intent classification pipeline at Grubhub!
You can use Coiled to convert large JSON data into a tabular DataFrame stored as Parquet in a cloud object store. Iterate locally first to build and test your pipeline, then transfer the same workflow to Coiled with only minimal code changes.
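A minimal sketch of that workflow (the S3 paths are placeholders, and the cluster size is an arbitrary example):

```python
import json

import coiled
import dask.bag as db
from dask.distributed import Client

# Point Dask at a Coiled cluster; for local iteration, use Client() instead.
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)

# Read newline-delimited JSON (hypothetical paths) and parse each record.
bag = db.read_text("s3://my-bucket/raw/*.json").map(json.loads)

# Flatten the records into a tabular Dask DataFrame and write Parquet to object storage.
ddf = bag.to_dataframe()
ddf.to_parquet("s3://my-bucket/clean/", write_index=False)
```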
This post explains how to create disk-partitioned Parquet lakes with Dask using partition_on. It also explains how to read disk-partitioned lakes with read_parquet and how this can improve query speeds.
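A minimal sketch of both halves (the paths and the `country` column are hypothetical):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame(
        {"country": ["US", "US", "DE", "DE"], "amount": [1.0, 2.0, 3.0, 4.0]}
    ),
    npartitions=2,
)

# Write a disk-partitioned lake: one directory per distinct `country` value,
# e.g. sales/country=US/part.0.parquet.
ddf.to_parquet("sales/", partition_on=["country"])

# Reading with a filter lets Dask skip non-matching directories entirely.
us_only = dd.read_parquet("sales/", filters=[("country", "==", "US")])
print(us_only.compute())
```

Because each distinct value gets its own directory, filtered reads only touch the files they need, which is where the query speedup comes from.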