Convert Dask DataFrame to Dask Array

This post explains how to convert a Dask DataFrame to a Dask Array.  This isn’t an operation you should normally perform, so this post also explains the limitations and suggests alternative workflows.

Here’s how to create a Dask DataFrame with two partitions that can be converted to a Dask Array:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)

ddf = dd.from_pandas(pdf, npartitions=2)

Now convert the Dask DataFrame to a Dask Array:

my_arr = ddf.to_dask_array()
my_arr

You can run compute to see the values that are contained in the array:

my_arr.compute()

array([[ 1,  7],
       [ 2,  8],
       [ 3,  9],
       [ 4, 10]])

This isn’t ideal because Dask doesn’t know the shape of the chunks — the number of rows in each chunk shows up as nan.  That can cause later computations to error out.  Let’s see how to force Dask to compute the chunk sizes when converting from a DataFrame to an array.
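You can see the problem by inspecting the chunks of the array.  Here’s a small sketch of what to expect; the exact error message depends on your Dask version:

my_arr.chunks
# ((nan, nan), (2,))  <- the number of rows per chunk is unknown

# Operations that need to know chunk boundaries, like positional slicing,
# typically fail when chunk sizes are unknown
try:
    my_arr[:2].compute()
except ValueError as e:
    print(e)  # suggests computing the chunk sizes first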

Convert Dask DataFrame to Dask Array: set length

You can set the length property when converting from a Dask DataFrame to a Dask Array to have Dask immediately compute the length of each partition, so it knows the shape of each chunk.

my_arr = ddf.to_dask_array(lengths=True)
my_arr
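With lengths=True, Dask computes the number of rows in each partition up front, so the chunk sizes are known.  A quick sketch of how to check:

my_arr.chunks
# ((2, 2), (2,))  <- two row chunks of 2 rows each, one column chunk of 2 columns

my_arr.shape
# (4, 2)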

You can also pass an explicit list of partition lengths instead of True, but these values aren’t verified against the actual partitions, so incorrect lengths may cause downstream computations to error out.
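For example, our DataFrame has two partitions with two rows each, so a manually supplied lengths list looks like this (a sketch; Dask trusts these values without checking them):

my_arr = ddf.to_dask_array(lengths=[2, 2])

my_arr.chunks
# ((2, 2), (2,))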

Convert Dask DataFrame to Dask Array: heterogeneous data

Let’s create a Dask DataFrame with columns that have different data types and see what happens when we try to convert it to a Dask Array.  Start by creating the Dask DataFrame.

pdf = pd.DataFrame(
    {"num": [1, 2, 3, 4], "letter": ["a", "b", "c", "d"]},
)

ddf = dd.from_pandas(pdf, npartitions=2)

Now convert the Dask DataFrame to a Dask Array:

my_arr = ddf.to_dask_array(lengths=True)
my_arr

The Dask Array has the object dtype.  Unlike a DataFrame, an array has a single dtype for all of its values, so heterogeneous data must be stored as objects in a Dask Array.

You can also see this data type when running compute:

my_arr.compute()

array([[1, 'a'],
       [2, 'b'],
       [3, 'c'],
       [4, 'd']], dtype=object)
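Object arrays are slow and lose most of the benefits of NumPy, so if you only need the numeric data, one workaround is to select the numeric columns before converting.  A small sketch using the columns from this example:

# Select only the numeric column so the array keeps a numeric dtype
num_arr = ddf[["num"]].to_dask_array(lengths=True)

num_arr.dtype
# dtype('int64')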

Convert Large Dask DataFrame to Dask Array

Create a Dask cluster so we can read a large dataset into a Dask DataFrame and then convert it to a Dask Array.

import coiled
import dask
import dask.dataframe as dd

cluster = coiled.Cluster(name="powers-crt-003", software="crt-003", n_workers=5)

client = dask.distributed.Client(cluster)
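If you don’t have a Coiled account, you can run a scaled-down version of this example on a local cluster instead.  This is just a sketch of an alternative setup, not part of the original workflow:

from dask.distributed import Client

# Spins up a local cluster using the cores on your machine
client = Client()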

Read in a Parquet dataset into a Dask DataFrame and inspect the column data types.

ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet",
    storage_options={"anon": True, "use_ssl": True},
)

ddf.dtypes

id        int64
name     object
x       float64
y       float64
dtype: object

The x and y columns are both float64.  Let’s read in just those two columns and convert them to a Dask Array.

ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet",
    storage_options={"anon": True, "use_ssl": True},
    columns=["x", "y"],
)

some_arr = ddf.to_dask_array(lengths=True)

some_arr

The Dask Array has 662 million rows, but each chunk only has 604,800 rows.  The data is nicely spread out across multiple chunks.
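Once the conversion is done, the result behaves like any other Dask Array, so you can run array computations on it directly.  A quick sketch (the computation runs in parallel on the cluster):

# Column-wise mean of x and y across all 662 million rows
some_arr.mean(axis=0).compute()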

You shouldn’t normally convert Dask DataFrame to Dask Array

You shouldn’t normally be converting a Dask DataFrame to a Dask Array.  It’s usually more straightforward to read data directly into a Dask Array, or into another array data structure that’s powered by Dask, like xarray.

The example we just walked through shows a case where converting a Dask DataFrame to a Dask Array may be justified.  You cannot read Parquet files directly into Dask Arrays, so reading them into a Dask DataFrame first and then converting to a Dask Array may be your best option.

If you have to run this analysis in production, it’s probably best to first convert the data to a file format that’s well suited to arrays before performing the actual array analysis.  Here’s the full workflow, sketched in code after the list:

  • Read Parquet data into a Dask DataFrame
  • Convert the Dask DataFrame into a Dask Array
  • Write the Dask Array to a file format that’s optimized for array data, like Zarr
  • Read the Zarr files into Dask Array and perform your analysis
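Here’s a minimal sketch of that workflow.  The Zarr path "timeseries.zarr" is a hypothetical output location, and depending on how the partitions line up you may need to rechunk before writing:

import dask.array as da
import dask.dataframe as dd

# Read Parquet data into a Dask DataFrame and convert it to a Dask Array
ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet",
    storage_options={"anon": True, "use_ssl": True},
    columns=["x", "y"],
)
arr = ddf.to_dask_array(lengths=True)

# Write the Dask Array to Zarr, a file format optimized for array data
# If Dask complains about irregular chunks, rechunk first, e.g. arr = arr.rechunk((604_800, 2))
da.to_zarr(arr, "timeseries.zarr")

# Read the Zarr store back into a Dask Array and perform the analysis
zarr_arr = da.from_zarr("timeseries.zarr")
zarr_arr.mean(axis=0).compute()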

Conclusion

Dask makes it easy to convert Dask DataFrames to Dask Arrays, but that’s a workflow you should typically avoid.  Try to brainstorm other solutions if you find yourself converting a DataFrame to an array. If you can’t think of a way to avoid the conversion, then add an intermediate step that writes the data out to a file format optimized for arrays right after converting. This makes it more likely that your Dask Array analysis can efficiently read data from disk.

Level up your Dask using Coiled

Coiled makes it easy to scale Dask maturely in the cloud.