Coiled Blog

Dask

Converting a Dask DataFrame to a pandas DataFrame

October 1, 2021
This post explains how to convert from a Dask DataFrame to a pandas DataFrame and when it’s a good idea to perform this operation.

Parallelize pandas apply() and map() with Dask DataFrame

November 22, 2021
With Dask’s map_partitions(), you can work on each partition of your Dask DataFrame (each partition is a pandas DataFrame) while leveraging parallelism for various custom workflows.

Reading CSV files into Dask DataFrames with read_csv

February 9, 2022
This blog post explains how to read one or more CSV files into a Dask DataFrame with read_csv.

Speed up a pandas query 10x with these 6 Dask DataFrame tricks

February 14, 2022
This post demonstrates how to speed up a pandas query to run 10 times faster with Dask using six performance optimizations.

Search at Grubhub and User Intent

July 26, 2020
Alex Egg, Senior Data Scientist at Grubhub, joins Matt Rocklin and Hugo Bowne-Anderson to talk and code about how Dask and distributed compute are used throughout the user intent classification pipeline at Grubhub!

Accelerating Microstructural Analytics with Dask and Coiled

February 9, 2021
In this article, we discuss an interesting use case of Dask and Coiled: Accelerating Volumetric X-ray Microstructural Analytics using distributed and high-performance computing.

Tackling unmanaged memory with Dask

June 30, 2021
Unmanaged memory is RAM that the Dask scheduler is not directly aware of. It can cause workers to run out of memory and computations to hang or crash.

How to Convert a pandas Dataframe into a Dask Dataframe

July 27, 2021
Pandas is a very powerful Python library for manipulating and analyzing structured data, but it has some limitations. Dask can help.

Convert Large JSON to Parquet with Dask

September 15, 2021
You can use Coiled to convert large JSON data into a tabular DataFrame stored as Parquet in a cloud object store. Iterate locally first to build and test your pipeline, then transfer the same workflow to Coiled with only minimal code changes.

Filtering Dask DataFrames with loc

August 24, 2021
This post explains how to filter Dask DataFrames based on the DataFrame index and on column values using loc.

Dask vs Spark | Dask as a Spark Replacement

October 4, 2021
This article discusses the problems faced by users looking for a Spark/Databricks replacement, the relative strengths of Dask/Coiled for large-scale ETL processing, and the current shortcomings.

Creating Disk Partitioned Lakes with Dask using partition_on

October 20, 2021
This post explains how to create disk partitioned Parquet lakes with Dask using partition_on. It also explains how to read disk partitioned lakes with read_parquet and how this can improve query speeds.

Better Shuffling in Dask: a Proof-of-Concept

October 14, 2021
This post outlines the Coiled team's recent experimentation with a new approach to DataFrame shuffling in Dask.

Reduce memory usage with Dask dtypes

November 17, 2021
This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column types.

Introducing the Dask Active Memory Manager

December 6, 2021
Dask release 2021.10.0 introduces the first piece of a new modular system called Active Memory Manager, which aims to alleviate memory issues.

Setting a Dask DataFrame index

January 4, 2022
This post demonstrates how to change a DataFrame index with set_index and explains when you should perform this operation.

How to Merge Dask DataFrames

February 1, 2022
This post demonstrates how to merge Dask DataFrames and discusses important considerations when making large joins.

Writing Parquet Files with Dask using to_parquet

April 4, 2022
This blog post explains how to write Parquet files with Dask using the to_parquet method.

How Coiled sets memory limit for Dask workers

August 17, 2022
Having Dask workers die from memory overuse is common, so we thought that we’d investigate this further.

Snowflake and Dask: a Python Connector for Faster Data Transfer

March 8, 2022
This article discusses why and how to use Snowflake and Dask together, and dives into the challenges of bulk parallel reads and writes into data warehouses...

How Popular is Matplotlib?

November 4, 2022
This analysis tracks the growth of Matplotlib on the preprint server arXiv, from 1% in 2002 to 17% of all papers using Matplotlib in 2022...

Repartitioning Dask DataFrames

August 20, 2021

Dask Read Parquet Files into DataFrames with read_parquet

March 14, 2022
This blog post explains how to read Parquet files into Dask DataFrames with read_parquet...

Automate your ETL Jobs in the Cloud with GitHub Actions, S3 and Coiled

June 23, 2022
This post will demonstrate how running GitHub Actions on Coiled can be a useful way to schedule automated data-processing jobs...

Surprising Hidden Costs With DIY Dask

October 7, 2021
Coiled is solely focused on building a Dask service and has a dedicated engineering team where thousands of customers can share costs...

Reducing memory usage in Dask workloads by 80%

November 15, 2022
The latest version of Dask (2022.11.0) can significantly reduce your memory usage. Here's how we did it.

Prioritizing Pragmatic Performance for Dask

November 15, 2022
Dask developers care about performance; we’ve always taken a pragmatic rather than exciting approach to the problem...

High-Performance Data Visualization with Datashader and Dask

November 17, 2022
Address performance issues for large-scale data visualizations by making smart choices about cluster memory, data types and data partitioning.

Understanding Managed Dask (Dask as a Service)

Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for Dask users.