How to encode categorical variables with Dask

Sultan Orazbayev
May 27, 2022

This blog post explains how to encode categorical variables with Dask and how to prepare categorical variables for machine learning algorithms. Using categorical variables correctly can give you significant performance boosts and speed up your data analyses.

The main idea behind converting a variable to categorical dtype is to create a lookup table that associates each unique value of the original variable with a unique integer.

For example, if we have a column of 100 rows but only 2 unique names, “Alice” and “Bob”, then we can replace all occurrences of “Alice” with an integer 1 and replace all occurrences of “Bob” with an integer 2. This will reduce the memory needed to operate on the data and will give you performance improvements for filtering/querying the data.
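
Under the hood, pandas (which Dask dataframes build on) stores the category labels once and keeps a small, 0-based integer code for every row. A minimal pandas sketch of this lookup idea:

import pandas as pd

s = pd.Series(["Alice", "Bob", "Alice", "Alice", "Bob"], dtype="category")

print(s.cat.categories)
# Index(['Alice', 'Bob'], dtype='object')

print(s.cat.codes.tolist())
# [0, 1, 0, 0, 1]  -- each row stores only the integer code, not the string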

Converting categorical variables is also a necessary step during data preparation for many machine learning algorithms.

Dask Categorical: Minimal Example

Let’s get started with a minimal example using a synthetic time series dataset, which contains four columns indexed by a timestamp. The four columns are: id (integer), name (string), x and y (floats).

from dask.datasets import timeseries

ddf = timeseries(
   start="2000-01-01",
   end="2000-6-30",
   partition_freq="1M",
   seed=0,
   )

print(ddf.dtypes)
# id        int64
# name     object
# x       float64
# y       float64
# dtype: object

The dataframe is split into monthly partitions. Let’s inspect the memory footprint of each partition using `memory_usage_per_partition`:

print(
   "Memory usage per partition (MB):",
   ddf.memory_usage_per_partition(deep=True).compute() / (1024**2),
   sep="\n",
)
# Memory usage per partition (MB):
# 0    225.443604
# 1    240.991914
# 2    233.216937
# 3    240.989604
# 4    233.214606
# dtype: float64

To obtain accurate memory usage for the object columns (in this case only one column, “name”), we pass the kwarg `deep=True`. Read this blog for more on Dask memory usage.

Since there is a limited number of names in the dataset that are repeated multiple times, we can save memory by associating each distinct name with an integer and operating on the numeric values. Conversion of Dask dataframe columns to categorical dtype is easy using the `.categorize` method:

ddf_cat = ddf.categorize()

print(ddf_cat.dtypes)
# id         int64
# name    category
# x        float64
# y        float64
# dtype: object

print(
   "Memory usage per partition (MB):",
   ddf_cat.memory_usage_per_partition(deep=True).compute() / (1024**2),
   sep="\n",
)
# Memory usage per partition (MB):
# 0    78.856938
# 1    84.295171
# 2    81.576055
# 3    84.295171
# 4    81.576055
# dtype: float64

By converting the object column to categorical we saved about two-thirds of the memory requirements: not bad for a one-line operation! However, the benefits of operating with categorical dtype are not limited just to memory. Let’s explore these benefits in the next section.

Dask Categorical: Performance Benefits

Conversion to categorical dtype is efficient in terms of memory footprint, but does it come at a cost in terms of performance? Let’s create two versions of the same data using strings (object) and categorical columns for name:

from dask.datasets import timeseries

ddf_object = timeseries(
   start="2000-01-01",
   end="2000-6-30",
   partition_freq="1M",
   seed=0,
)

ddf_cat = timeseries(
   start="2000-01-01",
   end="2000-6-30",
   partition_freq="1M",
   seed=0,
   dtypes={"name": "category"},
)

How long would it take to find all rows corresponding to “Alice” if the raw data is stored as object (string) or as category?

%time _ = ddf_object.query('name=="Alice"').compute()
# CPU times: user 4.65 s, sys: 321 ms, total: 4.97 s
# Wall time: 3.29 s

%time _ = ddf_cat.query('name=="Alice"').compute()
# CPU times: user 526 ms, sys: 87.7 ms, total: 614 ms
# Wall time: 275 ms

Note the 10x improvement in query speed when the original variable is stored as a categorical variable. Similar improvements can be observed for other operations on the categorical variable, for example obtaining the value counts:

%time _ = ddf_object["name"].value_counts().compute()
# CPU times: user 3.58 s, sys: 303 ms, total: 3.89 s
# Wall time: 3.4 s

%time _ = ddf_cat["name"].value_counts().compute()
# CPU times: user 416 ms, sys: 33.2 ms, total: 449 ms
# Wall time: 188 ms

The implication from these examples is that paying the fixed cost of converting an object variable into a categorical variable can yield a significant reduction in costs for downstream operations.
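
If you want to gauge that fixed cost on your own data, here is a rough sketch along the same lines (reusing `ddf_object` from above; timings will vary by machine):

# One-off conversion cost: .categorize() performs a full scan of the data
%time ddf_converted = ddf_object.categorize(columns=["name"])

# Each subsequent query on the categorical column is then much cheaper
%time _ = ddf_converted.query('name=="Alice"').compute()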

When converting a variable to categorical dtype is a bad idea

Conversion to categorical dtype clearly yields improvements in memory usage and calculation speed. Does this mean you should always convert object columns to categorical type? 

The simple answer is no. The benefit of using categorical dtype depends on the number of unique values of a variable, also referred to as its cardinality. If a variable has high cardinality (many unique values), then conversion to categorical dtype can actually increase memory usage.

For example, imagine we were working with a dataframe of tweets that contained a column for tweet content. While we can expect some repetition of tweet content, this column is likely to have many, many unique values. If we were to convert the column to a categorical variable, we would increase our memory usage because now all the unique messages would need to be stored as category labels. 

Let’s explore this in an example:

ddf = timeseries(
   start="2000-01-01",
   end="2000-01-05",
   partition_freq="1d",
   seed=0,
)


def make_tweets(df):
   """generate random tweets"""
   from string import ascii_lowercase

   from numpy.random import choice

   return ["".join(c) for c in choice(list(ascii_lowercase), size=(len(df), 10))]


ddf["tweet_content"] = ddf.map_partitions(make_tweets)

print(
   "Memory usage per partition (MB):",
   ddf.memory_usage_per_partition(deep=True).compute() / (1024**2),
   sep="\n",
)
# Memory usage per partition (MB):
# 0    13.294794
# 1    13.294889
# 2    13.294802
# 3    13.294814
# dtype: float64

Each partition of this dataset occupies about 13 MB in memory. What will happen if we convert all columns of the Dask dataframe to categorical type?

ddf_cat = ddf.categorize()

print(
   "Memory usage per partition (MB):",
   ddf_cat.memory_usage_per_partition(deep=True).compute() / (1024**2),
   sep="\n",
)
# Memory usage per partition (MB):
# 0    33.196332
# 1    33.196332
# 2    33.196332
# 3    33.196332
# dtype: float64

Now each partition requires roughly 2.5 times as much memory! Why? The reason is that each partition now has to keep track of all the unique values of the “tweet_content” column, which has high cardinality.

print(len(ddf_cat["tweet_content"].cat.categories))
# 345600

In this case, the costs of converting to categorical dtype exceed the benefits, so we can instruct Dask to convert only the name column by passing the kwarg `columns=["name"]`:

ddf_cat = ddf.categorize(columns=["name"])

print(
   "Memory usage per partition (MB):",
   ddf_cat.memory_usage_per_partition(deep=True).compute() / (1024**2),
   sep="\n",
)
# Memory usage per partition (MB):
# 0    13.103765
# 1    13.103765
# 2    13.103765
# 3    13.103765
# dtype: float64

In general, the benefit of converting to categorical dtype is greater for variables that have few unique values, i.e. low cardinality. To decide whether the conversion is worthwhile, examine the cardinality of the variable in question, for example with the `.unique()` (or `.nunique()`) method on the Dask Series containing the variable. If the cardinality is too high to make conversion to categorical dtype practical, you might consider strategies for reducing it, e.g. by assigning the unique values to a smaller number of groups.
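
For example, here is a quick sketch of such a cardinality check, reusing the dataframe with the “tweet_content” column from the example above (`.nunique()` lazily counts the distinct values and only materializes a single number per column):

# Count distinct values before deciding which columns to categorize
n_names = ddf["name"].nunique().compute()
n_tweets = ddf["tweet_content"].nunique().compute()

print(n_names, n_tweets)
# a couple dozen names (low cardinality) vs. hundreds of thousands of tweets (high cardinality)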

Dask Categorical: Working with known and unknown categories

For Dask to assign a unique category code to each unique value of a variable, it needs to know all the unique values. When `.categorize()` is executed, Dask scans the full dataset to determine them. This is helpful because we can be sure that our categories are set correctly, but because all the data needs to be scanned, the operation is computationally intensive.

If you want to avoid a full data scan, you can use the `.astype()` method instead. Dask treats this as a lazy operation: the column is marked for conversion to categorical dtype, but no scan is performed to determine the categories.

from dask.datasets import timeseries

ddf = timeseries(
   start="2000-01-01",
   end="2000-01-03",
   partition_freq="1d",
   seed=0,
)

ddf_cat_known = ddf.categorize()
ddf_cat_unknown = ddf.astype({"name": "category"})

print(ddf_cat_known["name"].cat.known)  # True
print(ddf_cat_unknown["name"].cat.known)  # False

The drawback of this lazy approach is that Dask will not know the possible categories until much later. This can cause problems downstream if some operations require knowledge of all the categories (and their codes).

print(ddf_cat_known.name.cat.categories[:3])
# Index(['Xavier', 'Charlie', 'Yvonne'], dtype='object', name='name')

print(ddf_cat_unknown.name.cat.categories[:3])
# NotImplementedError: `df.column.cat.categories` with unknown categories is not supported.  Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories

Does that mean the cost of scanning the data always has to be incurred? No: if we already know all the unique values, we can avoid the full scan by telling Dask about them explicitly.

ddf_cat_unknown["name"] = ddf_cat_unknown["name"].cat.set_categories(
   ddf_cat_known.name.cat.categories
)

print(ddf_cat_unknown["name"].cat.known)  # True

In the example above, we borrowed the list of categories from the column with known categories, but we could also specify the list of unique values explicitly:

all_names = [
   "Xavier",
   "Charlie",
   "Yvonne",
   "Ray",
   "Frank",
   "Victor",
   "Michael",
   "Alice",
   "George",
   "Laura",
   "Ingrid",
   "Sarah",
   "Quinn",
   "Jerry",
   "Zelda",
   "Oliver",
   "Wendy",
   "Hannah",
   "Patricia",
   "Bob",
   "Dan",
   "Kevin",
   "Edith",
   "Tim",
   "Norbert",
   "Ursula",
]

ddf_cat_unknown["name"] = ddf_cat_unknown["name"].cat.set_categories(all_names)

print(ddf_cat_unknown["name"].cat.known)  # True

Preparing categorical variables for machine learning algorithms

Using categorical variables with machine learning algorithms requires transforming them into an appropriate format, a step known as encoding. The most popular encoder is the OneHotEncoder, which creates a binary column for each unique value of the categorical variable: each row gets a 1 in the column corresponding to its value and a 0 in all other columns.
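
To make the encoding concrete, here is a tiny pandas sketch using `pd.get_dummies`, which performs the same one-hot expansion on a small in-memory series:

import pandas as pd

s = pd.Series(["Alice", "Bob", "Alice"], dtype="category")

# One indicator column per category; each row is marked 1/True in the column
# matching its value and 0/False elsewhere
print(pd.get_dummies(s))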

Dask ML wraps Scikit-learn’s OneHotEncoder to work with Dask dataframes (and arrays). In the example below, we create a one-hot-encoded representation of the information in the name column:

from dask.datasets import timeseries
from dask_ml.preprocessing import OneHotEncoder

ddf = timeseries(
   start="2000-01-01",
   end="2000-01-03",
   partition_freq="1d",
   seed=0,
   dtypes={"name": "category"},
)

X = OneHotEncoder().fit_transform(ddf[["name"]])

If we were to attempt this operation without setting the column to categorical dtype, we would receive an error:

from dask.datasets import timeseries
from dask_ml.preprocessing import OneHotEncoder

ddf = timeseries(
   start="2000-01-01",
   end="2000-01-03",
   partition_freq="1d",
   seed=0,
)

X = OneHotEncoder().fit_transform(ddf[["name"]])
# ValueError: All columns must be Categorical dtype when 'categories="auto"'.

Note that we would also receive an error if we were to attempt this operation on a categorical variable with unknown categories:

X = OneHotEncoder().fit_transform(ddf[["name"]].astype({'name': 'category'}))
# NotImplementedError: `df.column.cat.categories` with unknown categories is not supported.  Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories

Using the `.categorize()` method on the Dask dataframe solves this, so the following snippet runs without errors:

ddf = ddf.categorize()
X = OneHotEncoder().fit_transform(ddf[["name"]]) # works fine

Dask Categorical: Summary

This blog post examined how to convert object columns of a Dask dataframe into categorical dtype, the benefits and costs of this operation, as well as how the converted column can be encoded for downstream machine learning algorithms. Correct encoding of the categorical variables is a key ingredient of efficient data science and machine learning pipelines, and Dask provides several tools to make this encoding easy and scalable.

Level up your Dask using Coiled

Coiled makes it easy to scale Dask maturely in the cloud