With Dask’s map_partitions(), you can apply a function to each partition of a Dask DataFrame (each partition is a plain pandas DataFrame), running custom workflows in parallel across partitions.
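A minimal sketch of the pattern (the column names and the `add_revenue` function are made up for illustration):

```python
import pandas as pd
import dask.dataframe as dd

# Build a small Dask DataFrame with two partitions.
pdf = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0], "qty": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

def add_revenue(part: pd.DataFrame) -> pd.DataFrame:
    # Each partition arrives as a plain pandas DataFrame, so any pandas code works here.
    part = part.copy()
    part["revenue"] = part["price"] * part["qty"]
    return part

# Dask calls add_revenue on every partition in parallel.
result = ddf.map_partitions(add_revenue)
print(result.compute())
```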
This analysis tracks the growth of Matplotlib on the preprint server arXiv, from 1% of all papers using it in 2002 to 17% in 2022...
In this article, we discuss a use case of Dask and Coiled: accelerating volumetric X-ray microstructural analytics using distributed, high-performance computing.
Unmanaged memory is RAM that the Dask scheduler is not directly aware of. Left unchecked, it can cause workers to run out of memory, making computations hang or crash.
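One mitigation described in the Dask documentation is to ask the allocator to release freed memory back to the operating system. A sketch for Linux/glibc workers, assuming a running distributed cluster:

```python
import ctypes

from dask.distributed import Client

def trim_memory() -> int:
    # Ask glibc's allocator to hand freed-but-held memory back to the OS.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

client = Client()  # connect to (or start) a cluster
# Run the trim on every worker; this often shrinks unmanaged memory on Linux.
client.run(trim_memory)
```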
This article discusses the problems faced by users looking for a Spark/Databricks replacement, the relative strengths of Dask and Coiled for large-scale ETL processing, and their current shortcomings.
Alex Egg, Senior Data Scientist at Grubhub, joins Matt Rocklin and Hugo Bowne-Anderson to talk and code about how Dask and distributed compute are used throughout the user intent classification pipeline at Grubhub!
You can use Coiled to convert large JSON data into a tabular DataFrame stored as Parquet in a cloud object store. Iterate locally first to build and test your pipeline, then transfer the same workflow to Coiled with only minimal code changes.
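A minimal sketch of that workflow (the S3 paths are placeholders, and the cluster size is an arbitrary example):

```python
import json

import coiled
import dask.bag as db
from dask.distributed import Client

# Point Dask at a Coiled cluster; for local iteration, use Client() instead.
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)

# Read newline-delimited JSON (hypothetical paths) and parse each record.
bag = db.read_text("s3://my-bucket/raw/*.json").map(json.loads)

# Flatten the records into a tabular Dask DataFrame and write Parquet to object storage.
ddf = bag.to_dataframe()
ddf.to_parquet("s3://my-bucket/clean/", write_index=False)
```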
This post explains how to create disk-partitioned Parquet lakes with Dask using partition_on. It also explains how to read disk-partitioned lakes with read_parquet and how this can improve query speeds.
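A minimal sketch of both halves (the paths and the `country` column are hypothetical):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame(
        {"country": ["US", "US", "DE", "DE"], "amount": [1.0, 2.0, 3.0, 4.0]}
    ),
    npartitions=2,
)

# Write a disk-partitioned lake: one directory per distinct `country` value,
# e.g. sales/country=US/part.0.parquet.
ddf.to_parquet("sales/", partition_on=["country"])

# Reading with a filter lets Dask skip non-matching directories entirely.
us_only = dd.read_parquet("sales/", filters=[("country", "==", "US")])
print(us_only.compute())
```

Because each distinct value gets its own directory, filtered reads only touch the files they need, which is where the query speedup comes from.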