High-Performance Data Visualization with Datashader and Dask

November 17, 2022

Visualizing large datasets is hard. Plotting millions or billions of data points in a single image is a difficult technical challenge on its own, and rendering such images tends to take so long that it’s often not a fun experience.

Datashader solves the first problem: overplotting. This blog will show you how to tackle the second problem, rendering speed, by making smart choices about:

  • using cluster memory
  • choosing the right data types
  • balancing the partitions in your Dask DataFrame

These tips will help you achieve high-performance data visualizations that are both informative and pleasant to interact with. The same performance tips apply to any other Dask workflow.

We’ll start by rendering a 170-million-point dataset naively, which takes about a minute. By the end of the tutorial, you will be able to render a 1.5-billion-point dataset in about 1.5 seconds. You can code along in this notebook.

Large Scale GIS Visualization

For our application we'll visualize the taxi dropoff locations (and, eventually, the pickups too) in the classic NYC Taxi dataset. We’ll be running all the computations on a Dask cluster provisioned through Coiled. To learn more about how to create Dask clusters with Coiled, take a look at our documentation.
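If you want to follow along, a minimal sketch of spinning up a cluster looks something like this (the worker count is illustrative, not the configuration used in this post):

# A minimal sketch of provisioning a Coiled cluster and connecting a Dask client;
# choose a worker count that matches your workload
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=20)
client = Client(cluster)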

The dataset is available to us in Parquet format on S3. Let's take a brief look at the first few rows and see how many rows we have in total.

# Read in one year of NYC Taxi data
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009",
    storage_options={"anon": True},
)
df.head()

len(df)

170896055

This dataset contains 170 million rows. 

The Problem of Overplotting

If you’ve worked with pandas DataFrames before, chances are high you’ve used Matplotlib to visualize your datasets. Matplotlib is a popular library for visualizing data, but it is not designed to work well with large datasets.

When you’re working with million- or billion-point datasets, you quickly run into the problem of overplotting: data points start competing for the limited number of pixels on your screen, and you end up with big blobs of color that aren’t very meaningful.

You could downsample your data to only plot a fraction of your entire dataset, but this would mean losing out on a lot of valuable information. And often you still run into the problem of overplotting anyway.

# Plot a 0.1% sample of the data with pandas/Matplotlib
df.sample(frac=0.001).compute().plot(
    x="pickup_longitude",
    y="pickup_latitude",
    kind="scatter",
)

Even at 0.1% downsampling this is still just a big blob of blue. We can do better.

Visualizing Large Datasets with Datashader 

Datashader is a Python library designed for visualizing large datasets, and it happens to build on Dask. Instead of drawing every point, Datashader uses a histogram-style aggregation: it counts how many points fall into each pixel and then maps that point density onto a color gradient.

Let’s use Datashader to create a more informative visualization of our entire dataset:

import datashader
from datashader import transfer_functions as tf
from datashader.colors import Hot

def render(df, x_range=(-74.1, -73.7), y_range=(40.6, 40.9)):
    # Aggregate the points onto a canvas, counting points per pixel
    canvas = datashader.Canvas(
        x_range=x_range,
        y_range=y_range,
    )
    agg = canvas.points(
        source=df,
        x="dropoff_longitude",
        y="dropoff_latitude",
        agg=datashader.count("passenger_count"),
    )
    # Shade the counts with a histogram-equalized color map
    return tf.shade(agg, cmap=Hot, how="eq_hist")

%%time
render(df)

CPU times: user 14.3 s, sys: 146 ms, total: 14.4 s
Wall time: 38.5 s

This looks much better. Datashader is able to intelligently group all of our 170 million data points into a single image. 

But it’s taking a while. Our experience would be a lot more pleasant if we could get this to crunch the data faster so we wouldn’t have to wait a minute every time we wanted to zoom in or pan the image.

Speed up Visualization Performance

Let’s look at 3 things we can do to speed this up:

  • Use our cluster memory intelligently
  • Choose the right data types and columns for our problem
  • Balance out the partitions in our Dask DataFrame

The data is loaded into a Dask DataFrame. Dask uses lazy evaluation to delay computations until the moment they are needed. This allows Dask to make intelligent decisions about how to distribute a computation across the multiple workers in a cluster.
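For example, calling an aggregation on the DataFrame only builds a task graph; nothing runs until you explicitly ask for a result. A small illustration (not from the original post):

# Lazy evaluation in action: the mean below is only a task graph
# until we call .compute()
mean_passengers = df.passenger_count.mean()
print(mean_passengers)            # a lazy Dask Scalar; no data has been read yet
print(mean_passengers.compute())  # triggers the distributed computation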

In our case, this means that by default, the entire dataset is re-read from S3 every time we render the image. The first performance improvement we can make is to persist the dataset in cluster RAM so that it’s immediately available whenever we want to render a new image.

# load dataset into cluster memory
df = df.persist()

Rendering the image will now run much faster.

%%time
render(df)

CPU times: user 624 ms, sys: 259 ms, total: 883 ms
Wall time: 4.53 s

You’re now able to render a dataset with 170 million data points in under 5 seconds. 

Interactive

Now that we have this running at decent interactive speeds, let's make the visualization interactive with hvPlot, which renders through Datashader under the hood.

import hvplot.dask

def interact(df):
    # Interactive, Datashader-backed scatter plot rendered through hvPlot
    return df.hvplot.scatter(
        x="dropoff_longitude",
        y="dropoff_latitude",
        aggregator=datashader.count("passenger_count"),
        datashade=True, cnorm="eq_hist", cmap=Hot,
        width=600,
        height=400,
    )
 
interact(df)

 

This is working nicely.

More Data

But what if you want to render an even larger dataset?

Let’s load the data for the years 2009-2013 into a Dask DataFrame to see what happens. 

# Read in five years of NYC Taxi data (2009-2013)
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/",
    storage_options={"anon": True},
)

len(df)

868504150

We’ll skip trying to render this naively (spoiler: it’ll take a long time to render these 868 million rows). Let’s directly apply the performance tip we just learned: persisting the data into cluster memory.

df = df.persist()

Observing the Dask Dashboard, we see that our cluster is running out of memory and spilling to disk. This is bad because reading and writing data to disk is a performance bottleneck.
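To see how much memory the data actually needs, a rough sketch (the exact number will depend on your data) is to sum the per-column memory usage and compare it against your cluster's total RAM:

# Estimate the DataFrame's in-memory footprint
nbytes = df.memory_usage(deep=True).sum().compute()
print(f"approximate size in memory: {nbytes / 1e9:.0f} GB")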

We have two options:

  1. Get a bigger cluster
  2. Reduce the data

We could get a bigger cluster with cluster.scale(), but that would be wasteful. Instead, let's be efficient and reduce the size of our data. There are three things we can consider:

  1. Use better data types: PyArrow strings for object dtypes, more compact floats and ints, and categoricals
  2. Eliminate the columns we don't need
  3. Sample the data (but we want all of our data)


You can recast data types in Dask using the astype() method; from Dask 2023.7.1 onward, object columns are stored as PyArrow strings automatically.

# recast to more efficient dtypes
df = df.astype({"vendor_id": "string[pyarrow]", "trip_distance": "float32"})


We can also be more efficient by only using the columns we need for our visualization.

# subset data
df = df[["dropoff_longitude", "dropoff_latitude", "passenger_count"]]


Let’s now persist this slimmed-down dataset to cluster memory:

df = df.persist()
 
%%time
render(df)

CPU times: user 859 ms, sys: 330 ms, total: 1.19 s
Wall time: 8.1 s


Not bad: it’s taking us less than 10 seconds to render close to 1 billion data points. But this is still not quite the seamless interactivity we’re looking for. We can still do better than this.

Reduce number of partitions

Our data originally arrived in nicely sized partitions of 100-500 MiB, a good range to work with. By trimming columns and data types we have reduced the amount of data in our Dask DataFrame, but we still have the same number of partitions, so each partition now holds relatively little data. Dask's per-partition overhead makes this inefficient.
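You can see the imbalance for yourself by inspecting the partition count and the rows per partition (a quick sketch; the numbers you see will depend on your data):

# Inspect the current partitioning
print(df.npartitions)                     # how many partitions we have
rows_per_partition = df.map_partitions(len).compute()
print(rows_per_partition.describe())      # distribution of rows per partition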

We can reduce this overhead by consolidating our DataFrame into fewer, larger partitions, again aiming for roughly 100-500 MiB each.

# repartition dataframe
df = df.repartition(partition_size="256MiB").persist()
 
%%time
render(df)

CPU times: user 164 ms, sys: 39.5 ms, total: 204 ms
Wall time: 1.54 s

Even more data points

So far we’ve been looking at the dropoff points only. But that’s only half of the story. It would be informative if we could visualize both the pickup and dropoff points in a single image. 

That would look something like this:

This makes for a beautiful visualization that is not only pleasant to interact with but also highly informative. Notice, for example, the clear asymmetry between the pickup locations along the large avenues and the dropoff locations along the perpendicular streets. This makes sense: New Yorkers know that your chances of catching a cab are much higher along the higher-traffic avenues and so will be willing to walk there to get a faster pickup. But they will generally want to get dropped off close to their final destinations which are often along the side streets.
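The post doesn't show the code for this combined image, but one way to sketch it with plain Datashader is to aggregate pickups and dropoffs separately and stack the two shaded images. This assumes the DataFrame still carries both sets of coordinate columns, and the colormaps are illustrative rather than the ones used for the image above:

import datashader
from datashader import transfer_functions as tf

def render_both(df, x_range=(-74.1, -73.7), y_range=(40.6, 40.9)):
    # Aggregate pickups and dropoffs separately over the same canvas extent
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range)
    pickups = canvas.points(
        df, "pickup_longitude", "pickup_latitude",
        agg=datashader.count("passenger_count"),
    )
    dropoffs = canvas.points(
        df, "dropoff_longitude", "dropoff_latitude",
        agg=datashader.count("passenger_count"),
    )
    # Shade each aggregate with its own colormap and overlay the two images
    return tf.stack(
        tf.shade(pickups, cmap=["black", "cyan"], how="eq_hist"),
        tf.shade(dropoffs, cmap=["black", "red"], how="eq_hist"),
    )

render_both(df)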

Build your own high-performance visualizations

The code to build the interactive visualization of 1.5 billion data points is in this notebook. You can also watch the full video for step-by-step instructions to guide you through the process.
