In this article, we discuss an interesting use case of Dask and Coiled: accelerating volumetric X-ray microstructural analytics with distributed and high-performance computing. This blog post is inspired by an article published on the Kitware blog and the corresponding research.
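We will explore what Dask is, the problem of analyzing large 3D X-ray datasets, the tools the team used, and how Dask and Coiled helped accelerate the analysis.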
Do you have any exciting Dask or Coiled showcases to share? Tweet them to us @CoiledHQ, we would love to hear about it!
Dask is a library for parallel and distributed computing in Python. It helps scale machine learning workflows and provides a powerful framework to build distributed applications. Its familiar API, flexibility, and seamless integration with the Scientific Python Ecosystem make it a popular choice among data scientists and researchers.
Dask is used by academics and industry professionals across a variety of domains from life sciences and geophysical sciences to finance and retail corporations. It is also used behind-the-scenes by other software libraries like RAPIDS, Apache Airflow, Prefect, and many more. Learn more about Dask in What is Dask? and Who uses Dask?
In this showcase, we see Dask being used on a personal laptop, a supercomputer, and on the cloud with Coiled for accelerating the analysis of microstructural data in materials sciences.
X-ray imaging is widely used to study new materials and to measure their properties. Advances in X-ray technology at synchrotron light sources have greatly increased the size of datasets, which has led to efforts to automate analysis. So far these efforts have been promising, but automated analysis of three-dimensional imagery is still in its infancy and lags behind human labeling. Human labeling (manual curation), in turn, is not feasible for large 3D datasets, especially when a single sample is often imaged many times and each volume can contain several gigabytes of data. Moreover, previous approaches have scalability issues. The approach described in the paper addresses all of these problems using Dask.
The data acquired in previous related work is publicly available as a collection totaling 6 terabytes, where each file is about 60 GB and holds more than 14 billion voxels. That’s a lot of data! This research uses one of the volumes in the collection alongside synthetic datasets of about 200³ voxels generated to suit the need. The team uses Dask on a local machine (Dell XPS 13 7390), on a supercomputer (NERSC Cori), and on the cloud (AWS) with Coiled to analyze this data.
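To put those sizes in perspective, here is a quick back-of-envelope calculation. It assumes decimal units (1 GB = 10^9 bytes), treats the quoted figures as exact even though they are approximate, and reads the synthetic dataset size as a cube of 200 voxels per side; all three are assumptions made for illustration.

```python
# Back-of-envelope figures derived from the sizes quoted in the text.
# Assumption: decimal units and exact values, although the text says "about".
TB = 10**12
GB = 10**9

collection_bytes = 6 * TB        # whole public data collection
file_bytes = 60 * GB             # one volume file
voxels_per_file = 14 * 10**9     # "more than 14 billion voxels"

files_in_collection = collection_bytes // file_bytes
bytes_per_voxel = file_bytes / voxels_per_file

# Assumption: the synthetic volumes are cubes with 200 voxels per side.
synthetic_side = 200
synthetic_voxels = synthetic_side ** 3
size_ratio = voxels_per_file / synthetic_voxels

print(files_in_collection)         # 100
print(round(bytes_per_voxel, 2))   # 4.29
print(synthetic_voxels)            # 8000000
print(size_ratio)                  # 1750.0
```

Under these assumptions, a single real volume is well over a thousand times larger than a synthetic one, which is why the small datasets are handy for quick iteration on a laptop.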
The dataset is stored in Zarr format using xarray. Zarr is a format for storing chunked n-dimensional arrays, and xarray is a library for labeled arrays and datasets in Python. This chunked and compressed multi-scale pyramid storage allows parallel analysis and visualization with Dask. The Insight Toolkit (ITK) is used for its rich assortment of 2D and 3D image analysis algorithms. itkwidgets provides interactive Jupyter widgets to visualize images, point sets, and meshes in 2D or 3D. Jupyter notebooks provide the interactive environment in which the Python code is written. Finally, Dask is used for parallel computing, as described in the following section.
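To make the idea of chunked, multi-scale storage more concrete, here is a minimal sketch that builds a small chunked Dask array and derives two coarser pyramid levels from it. This is not the team's pipeline: the random volume, the 64-voxel size, the chunk shape, and the `downsample` helper are all made up for illustration, and the Zarr and xarray I/O is left out (in practice the array would be opened from a Zarr store, for example with `xarray.open_zarr` or `dask.array.from_zarr`).

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for an image volume; real data would come from Zarr.
rng = np.random.default_rng(seed=0)
volume_np = rng.random((64, 64, 64), dtype=np.float32)

# Split the volume into 32x32x32 chunks so each chunk can be processed
# independently and in parallel.
volume = da.from_array(volume_np, chunks=(32, 32, 32))


def downsample(level):
    """Halve the resolution along every axis by averaging 2x2x2 blocks."""
    return da.coarsen(np.mean, level, {0: 2, 1: 2, 2: 2})


# Build a small multi-scale pyramid: full resolution plus two coarser levels.
pyramid = [volume]
for _ in range(2):
    pyramid.append(downsample(pyramid[-1]))

shapes = [level.shape for level in pyramid]
print(volume.numblocks)  # number of chunks along each axis
print(shapes)

# Nothing has been computed so far; compute() runs the task graph.
coarsest = pyramid[-1].compute()
```

A viewer can display the coarsest level first and load finer chunks only where needed, while analysis code can work through the full-resolution chunks in parallel.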
Dask is used to parallelize computation and accelerate the analysis. Schedulers are a core part of Dask: they take a task graph and execute it on parallel hardware. Task graphs are usually generated by Dask's collections, but they can also be specified directly as Python dictionaries. Dask ships several schedulers with different performance characteristics, ranging from a single-threaded scheduler on one machine to threaded, multiprocessing, and distributed schedulers.
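As a small illustration of a task graph written as a plain Python dictionary, the sketch below runs the same graph with the single-threaded scheduler and with the threaded scheduler. The graph itself (a sum of squares) is a made-up example and has nothing to do with the image analysis in the paper.

```python
from operator import add, mul

import dask
from dask.threaded import get as threaded_get

# A task graph is a dictionary mapping keys to values or tasks.
# A task is a tuple: (function, argument, argument, ...), where arguments
# that match other keys are replaced by the results of those keys.
graph = {
    "x": 3,
    "y": 4,
    "x_squared": (mul, "x", "x"),
    "y_squared": (mul, "y", "y"),
    "sum_of_squares": (add, "x_squared", "y_squared"),
}

# Single-threaded scheduler: simple and convenient for debugging.
serial_result = dask.get(graph, "sum_of_squares")

# Threaded scheduler: independent tasks (the two squares) may run in parallel.
threaded_result = threaded_get(graph, "sum_of_squares")

print(serial_result, threaded_result)  # 25 25
```

The graph describes what to compute; the scheduler decides how and where to run it, which is what lets the same graph move between different hardware.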
Dask schedulers proved very useful for this research:
“The Dask single-machine schedulers have logic to execute the graph in a way that minimizes memory footprint. As a result, our algorithms can be quickly extended on smaller datasets with a laptop or workstation. Then, the same code can scale to extremely large datasets by running Dask over a HPC cluster backed by MPI, a cluster with a scheduler like SLURM, or a dynamically-generated, cloud service running a Kubernetes cluster.”
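The following sketch shows what "the same code" can look like in practice: one analysis function, run unchanged under two single-machine schedulers. The phantom volume, the threshold, and the `bright_fraction` helper are hypothetical and only stand in for a real analysis step.

```python
import numpy as np
import dask
import dask.array as da


def bright_fraction(volume, threshold=0.5):
    """Lazily compute the fraction of voxels brighter than the threshold."""
    return (volume > threshold).mean()


# Hypothetical phantom: a bright 20x20x20 cube inside a dark 40x40x40 volume.
phantom = np.zeros((40, 40, 40), dtype=np.float32)
phantom[10:30, 10:30, 10:30] = 1.0
volume = da.from_array(phantom, chunks=(20, 20, 20))

# Building the computation is independent of where it will run.
task = bright_fraction(volume)

# Pick the scheduler at compute time...
serial = float(task.compute(scheduler="synchronous"))
threaded = float(task.compute(scheduler="threads"))

# ...or set it once through the configuration.
with dask.config.set(scheduler="threads"):
    configured = float(task.compute())

print(serial, threaded, configured)  # 0.125 0.125 0.125

# Not run here: with dask.distributed installed, creating a Client registers
# the distributed scheduler as the default, and task.compute() stays the same.
```

Only the scheduler choice changes between runs; the analysis function and the array code are untouched.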
Coiled Cloud makes it easy to deploy Dask on the cloud. It takes care of all the DevOps involved in cloud computing and provides a simple interface that can be run on any machine to leverage hundreds of cores on an AWS cluster. Support for Azure is also in the works! In this showcase, Coiled Cloud was used to run the same computing environment on the cloud to compare runtime benchmarks.
The computation took 13 minutes on a standard laptop and the same processing pipeline was executed on a Coiled cluster in 2 minutes. The paper notes how the same computing environment could run the JupyterLab server and client with custom JupyterLab extensions, the Jupyter Kernel process, as well as the Dask scheduler, workers, and dashboard.
The team also describes powerful similarities across computing environments:
“High-memory resources were not required – the limited memory capacities of the laptop and Coiled workers were sufficient for the large dataset. The user experience, i.e. interactive Jupyter workspaces accessed through the browser on the laptop, was exactly the same. And, most importantly, the same software was used without modification; to change the computational backend, the Dask Distributed backend was simply changed to a distributed cluster resource, which resulted in a 25X acceleration when processing the larger dataset.”
We always love seeing Dask and Coiled being used to make data science accessible to the scientific community. This use case for scaling computer vision and 3D data analysis is a perfect demonstration of Dask’s flexibility and power. You can check out the team's code repository on GitHub and try Coiled Cloud for free!