Dask is built and maintained by hundreds of people collaborating from around the world. In this series, we talk to some of these Dask contributors, discuss their journey into open source development, and hear their thoughts on the PyData ecosystem of tools. In this edition, we’re delighted to chat with Martin Durant, Senior Software and Data Engineer at Anaconda.
Martin has been a champion contributor in the PyData space. He is a core developer and maintainer of Dask and Zarr (a format for storing chunked n-dimensional arrays). He is the creator of some widely-used open source projects including Intake, fsspec, S3Fs, GCSFS, fastparquet, Streamz, and uAvro. Martin contributes to collaborations with Pangeo, Pangeo Forge, and Awkward Array. He also occasionally contributes to xarray, pandas, and CuPy.
Martin has a background in astronomy, and in his own words, here’s how he got interested in data science:
“I trained for and worked in astronomy for many years, and in some ways, I still think of myself as an astronomer. After realizing that the academic path was not for me after all, I evaluated what I had learned so far and realized that this new “data science” thing seemed to match me rather well. Along the way, I have lived in five countries, and am now permanently in Toronto. I’ve now been working remotely for Anaconda for six years, and contributed to OSS throughout.”
How did you get started with programming?
Although I was taught coding in school (Pascal!), I was already coding on my Atari ST much earlier, making simple games. I started to do serious programming in postgrad, as a way to go from raw data to reduced data to model fits and publication-quality plots. I was initially doing this in IDL, but it quickly proved insufficient, so I learned Python instead. I still remember running through the official tutorial end-to-end. I didn’t look back.
Why is open source important to you?
I have the rare privilege of having a job dedicated to making OSS. I rely on other contributors for upstream libraries, feature requests, issues, and testing. Part of my job is also to be a face of OSS and to try to push the whole community forward and towards more integration.
How did you first get introduced to Dask?
I did know about Dask before working for Anaconda. At my previous job, the only time I was officially a “data scientist”, I did some prototyping with Spark, but really wanted there to be a Python way to do those kinds of workflows. I didn’t really understand what Dask was about, though, until I started to work with the small team that Anaconda had at the time I joined. I started by writing tutorial material and example notebooks, which is an excellent way to get to know the capabilities of some software.
How did you start contributing to Dask?
I think the first major thing I did for Dask was hdfs3, the wrapper around the C++ libhdfs3 library, giving access to HDFS-stored data in Python, along with the parts of Dask that interface with it. Hadoop was a major storage target back then, so this was rather important for Dask to be considered enterprise-ready. As I recall, three of us tackled the wrapper simultaneously over that Christmas: one with a direct CPython extension, one with Cython, and one (mine) with ctypes. Mine turned out to be the easiest to run with, because it didn’t need its own compilation, and the target C++ library had a nice C API to hook into.
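To illustrate why the ctypes route avoids a compilation step, here is a minimal sketch of the pattern (wrapping the system C library’s `strlen`, not the actual hdfs3 code): ctypes loads a shared library at runtime and calls its C API directly, so no extension module ever needs to be built.

```python
import ctypes
import ctypes.util

# Locate and load the system C library at runtime -- no compiler involved.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare strlen's signature so ctypes marshals arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(s: str) -> int:
    """Call the C library's strlen on a Python string."""
    return libc.strlen(s.encode())

print(c_strlen("hdfs"))  # 4
```

The trade-off is that ctypes calls are slower than a compiled extension, but for an I/O-bound wrapper like an HDFS client that overhead rarely matters.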
What part of Dask do you mainly contribute to?
Filesystems and file formats have been my thing. Matt Rocklin genuinely called me “Baron of Bytes” at some point. Since Dask spawned fsspec, I’ve been trying to push more and more storage features that Dask can use, but that doesn’t live inside Dask anymore. My contributions to Dask have been pretty sporadic and scattered across the codebase.
For example, I’ve taken an interest in Actors, because of how important they are to Ray, and have a number of PRs there; but I also mentored a GSoC student on graph visualizations, and I’m trying to help with cythonizing the scheduler core and finding fixes for the SQL loader. I also make sure that other software I build, such as the Intake stack, integrates well with Dask, and I occasionally contribute something back to Dask in the process.
Why does being a contributor excite you?
Aside from the recognition and watching download numbers? :). If I tell people I work on Dask, there is a good chance that they know what that means, and most people think that Dask is great.
It’s about making people’s lives that little bit easier. Yes, I know that data processing can be used for good or not good, but having it be based on open code, developed in the public with full scrutiny, is an excellent example to set. Anaconda itself is playing this part, slowly changing the mindset about what it is to be a net- and code-enabled citizen.
Besides developing Dask, do you use it too? If so, how does one affect the other?
As I said before, I care a lot about making sure that other projects work seamlessly with Dask, so that is a manner of “using it”. The best example here is Intake and its drivers, which all support a .to_dask() method, returning the lazy dask container object appropriate for that particular data source. For large data processing or model fitting… not that much. It does come in very handy sometimes, though. One example: I have a new project called fsspec-reference-maker, which creates a kind of virtual file-system for fsspec, such that you can access the contents of binary files directly without the limitations of the package designed for that file type.
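The core idea behind such a virtual filesystem can be sketched in a few lines of plain Python (hypothetical names and a simplified format, not fsspec-reference-maker’s actual one): references map a logical key to a `(path, offset, length)` triple, so a reader can fetch exactly the byte range it needs from inside a larger file without understanding the file format at all.

```python
import os
import tempfile

def read_reference(refs, key):
    """Fetch one logical chunk via its (path, offset, length) reference."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Demo: one "container" file holding a header and two logical chunks.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HEADERchunk-0chunk-1")
    path = f.name

refs = {
    "data/0": (path, 6, 7),   # bytes 6..12  -> b"chunk-0"
    "data/1": (path, 13, 7),  # bytes 13..19 -> b"chunk-1"
}
print(read_reference(refs, "data/0"))  # b'chunk-0'
os.unlink(path)
```

In the real project the references point at remote URLs (e.g. objects in S3) and fsspec handles the range requests, but the mapping-of-byte-ranges idea is the same.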
One example I created with colleagues has 367,000 HDF5 files in an S3 bucket. To build the required references, you need to scan every one of those files to find where the byte chunks are. Now that’s a job for Dask! In fact, the metadata alone turned out to be so big that merging it in one kernel took too much time and memory, so we did a tree reduction to make partially-merged intermediate sets first. The whole thing took over two hours on 30 workers.
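The tree reduction described above can be sketched in plain Python (with Dask you would typically build the same shape from delayed tasks; the `merge` function here is a stand-in for combining metadata mappings): instead of folding every partial result into one accumulator, combine them pairwise in rounds, so no single step ever has to hold all the metadata at once.

```python
def merge(a, b):
    """Stand-in for merging two partial metadata mappings."""
    out = dict(a)
    out.update(b)
    return out

def tree_reduce(items, combine):
    """Pairwise-combine items in rounds: O(log n) levels of small merges
    rather than one giant sequential fold."""
    while len(items) > 1:
        items = [
            combine(items[i], items[i + 1]) if i + 1 < len(items) else items[i]
            for i in range(0, len(items), 2)
        ]
    return items[0]

partials = [{f"file-{i}": i} for i in range(8)]
merged = tree_reduce(partials, merge)
print(len(merged))  # 8
```

On a cluster the pairwise merges at each level can run in parallel on different workers, which is exactly why the shape helps when the flat merge blows up one machine’s memory.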
What is your favorite part of Dask?
Everyone says the dashboard here, but I suppose it’s true: there is nothing like setting a big job in motion and then seeing it play out graphically, or using the wealth of information there to diagnose problems and bottlenecks.
What are some things that you want to see improved in Dask?
There is a big push to optimize big graphs: reordering work, or choosing not to do unnecessary work at all. This will become increasingly important at scale. And, of course, the scheduler needs to be able to keep up. Probably the most critical thing Dask needs to get good at is active resource management: workers dying due to internal or external limits is probably the most consistently frustrating thing that can happen to Dask users these days.
What do you see in the future of Dask? And, scalable computing in general?
There is no doubt that Dask is still in the growth phase, along with Python, data science, and machine learning in general. Problems will get bigger, of course, but a number of other factors are on the move: the availability of massive nodes and increasing heterogeneity of hardware even within a single cluster. All of this is increasingly driving neural-net-based learning rather than classical statistical inference, so you also have increasingly heterogeneous programming styles at work simultaneously. This is all very complex to marshal, so Dask (or any work orchestrator) needs to get smart!
Thank you, Martin, for all your wonderful contributions and for making PyData what it is today.
Thanks for reading! And if you’re interested in trying out Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today when you click below.