Visualizing large datasets is hard. It’s a difficult technical challenge to plot millions or billions of data points in a single image. It also tends to take a long time to render these kinds of images: it’s often not a fun experience.
Datashader has solved the first problem of overplotting. This blog will show you how to address the second problem by making smart choices about:
These tips will help you achieve high-performance data visualizations that are both informative and pleasant to watch. You can also use these performance tips in any other Dask workflows.
We’ll start by rendering a 170-million-point dataset naively. This will take about a minute. By the end of the tutorial, you will be able to render a 1.5-billion-point dataset in 1.5 seconds. You can code along in this notebook.
For our application, we'll visualize the taxi pickup locations in the classic NYC Taxi dataset. We’ll be running all the computations on a Dask cluster provisioned through Coiled. To learn more about how to create Dask clusters with Coiled, take a look at our documentation.
The dataset is available to us in Parquet format on S3. Let's take a brief look at the first few rows and see how many rows we have in total.
This dataset contains 170 million rows.
If you’ve worked with pandas DataFrames before, chances are high you’ve used Matplotlib to visualize your datasets. Matplotlib is a very popular plotting library, but it is not designed to work well with large datasets.
When you’re working with million- or billion-point datasets, you quickly run into the problem of overplotting. Data points start competing with each other for the limited number of pixels on your screen, and you end up with big blobs of color that aren’t very meaningful.
You could downsample your data to only plot a fraction of your entire dataset, but this would mean losing out on a lot of valuable information. And often you still run into the problem of overplotting anyway.
Even at 0.1% downsampling this is still just a big blob of blue. We can do better.
Datashader is a Python library designed to visualize large datasets, and it works with Dask DataFrames. Instead of drawing every point individually, Datashader uses a histogram method: it counts how many points fall into each pixel and then uses a color gradient to represent this point density.
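To make the histogram idea concrete, here is a minimal NumPy sketch of the same principle. It is not Datashader's implementation: it uses `np.histogram2d` on synthetic points, and the log scaling is just one possible way to turn counts into color intensities.

```python
import numpy as np

# Synthetic points standing in for x/y coordinates.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)

# One histogram bin per pixel of a 300x200 image (rows = y, columns = x).
width, height = 300, 200
counts, _, _ = np.histogram2d(y, x, bins=(height, width))

# Compress the dynamic range so sparse pixels stay visible,
# then normalize to [0, 1] for use with a color gradient.
density = np.log1p(counts)
density /= density.max()
```

However many points there are, the result is always a fixed-size grid of counts, which is why this approach avoids overplotting.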
Let’s use Datashader to create a more informative visualization of our entire dataset:
This looks much better. Datashader is able to intelligently group all of our 170 million data points into a single image.
But it’s taking a while. Our experience would be a lot more pleasant if we could get this to crunch the data faster so we wouldn’t have to wait a minute every time we wanted to zoom in or pan the image.
Let’s look at 3 things we can do to speed this up:
The data is loaded into a Dask DataFrame. Dask uses lazy evaluation to delay computations until the moment they are needed. This allows Dask to make intelligent decisions about how to distribute a computation across the multiple workers in a cluster.
In our case, this means that by default, the entire dataset will have to be read into memory each time we render the image. The first performance improvement we can make here is to persist the dataset into the cluster RAM so that it’s directly available whenever we want to render a new image.
Rendering the image will now run much faster.
You’re now able to render a dataset with 170 million data points in under 5 seconds.
Now that we have this running at decent interactive speeds, let's switch Datashader to interactive mode.
This is working nicely.
But what if you want to render an even larger dataset?
Let’s load the data for the years 2009-2013 into a Dask DataFrame to see what happens.
We’ll skip trying to render this naively (spoiler: it’ll take a long time to render these 868 million rows). Let’s directly apply the performance tip we just learned: persisting the data into cluster memory.
Observing the Dask Dashboard, we see that our cluster is running out of memory and spilling to disk. This is bad because reading and writing data to disk is a performance bottleneck.
We have two options:
We could get a bigger cluster with the command cluster.scale() but that would be wasteful. Instead, let's be efficient and reduce the size of our data. There are three things that we can consider:
You can recast data types in Dask using the astype() method. For string columns, Dask >= 2023.7.1 does this automatically by converting them to the more memory-efficient PyArrow-backed string type.
We can also be more efficient by only using the columns we need for our visualization.
Let’s now persist this slimmed-down dataset to cluster memory:
Not bad: it’s taking us less than 10 seconds to render close to 1 billion data points. But this is still not quite the seamless interactivity we’re looking for. We can still do better than this.
Our data was originally in nicely sized partitions of 100-500 MiB, which is a good size to work with. Now that we have slimmed down the data in our Dask DataFrame, we still have the same number of partitions. This means that we now have lots of partitions with relatively little data in each of them. This is inefficient, because every partition adds scheduling overhead.
We can reduce overhead by consolidating our dataframe into fewer larger partitions, again aiming for that 100-500 MiB number.
So far we’ve been looking at the dropoff points only. But that’s only half of the story. It would be informative if we could visualize both the pickup and dropoff points in a single image.
That would look something like this:
This makes for a beautiful visualization that is not only pleasant to interact with but also highly informative. Notice, for example, the clear asymmetry between the pickup locations along the large avenues and the dropoff locations along the perpendicular streets. This makes sense: New Yorkers know that your chances of catching a cab are much higher along the higher-traffic avenues and so will be willing to walk there to get a faster pickup. But they will generally want to get dropped off close to their final destinations which are often along the side streets.
The code to build the interactive visualization of 1.5 billion data points is in this notebook. You can also watch the full video for step-by-step instructions to guide you through the process.